Project description

I am a 25-year-old data journalist at The Economist. Along with our team of coders, writers and designers, I help to produce data-driven stories for the newspaper. These are usually published in our Graphic Detail section, which is dedicated to quantitatively ambitious and visually striking articles.

The stories that I have most enjoyed working on have been those for which I have constructed novel datasets, and then used statistical models to answer interesting questions. Over the last year, I’ve been able to tackle a wide range of political, economic, social and cultural questions, such as:

What makes a country good at football? Ahead of the World Cup, I built a model that predicted historical results between national teams, using their GDP, population size and level of grassroots participation. I also devised a variable to quantify a country’s interest in football relative to other sports, using search data from Google Trends.

Which countries are most likely to fight wars? For the centenary of the armistice, I combined several historical datasets of conflicts with GDP and polity variables. I found that countries of middling income and democratic freedom were most likely to be sucked into armed conflict, a conclusion that clashes slightly with the academic consensus. (I asked several political scientists for their input, and published my methodology on GitHub.)

What is driving the surge of populist parties in Europe? After combining ideological ratings from the Chapel Hill expert survey with voting data, I found no consistent link between gaining popularity and policy positions on immigration or the economy. There was, however, an association between gaining votes and criticising the EU or “elites”.

Do football managers matter? I used an unusual measure of player skill – ratings from the FIFA video game series – to project how well clubs ought to perform. Few managers were able to consistently overachieve relative to the skill of their squad, which suggests that their impact is smaller than is generally believed.

Is Google biased? After Donald Trump accused the platform of discriminating against conservative websites, I scraped search results from Google’s news tab. I found that left-leaning publications do show up more often, but that this gap can be explained by their higher accuracy ratings in public surveys. (A more detailed analysis on this subject is forthcoming.)

Why do some countries have more liberal abortion laws than others? Ahead of Ireland’s referendum, I built a model to predict a country’s legal position, as measured by an index from the Guttmacher Institute, using Pew’s polling data on religiosity and economic data on women’s participation in the workforce. My model suggested that Ireland’s laws were less liberal than expected.

How long is the perfect book? I scraped ratings data from Goodreads for classic books, which suggest that longer works are rated more highly than shorter ones.

What makes this project innovative?

I think the most innovative aspect of my data journalism for The Economist has been coming up with new ways of measuring interesting concepts, and then using those measures for statistical analysis.

For my two articles about football, I tried to measure two frequently discussed concepts (how interested is a country in football? how good are a manager’s players?) using sources that have largely been ignored. For the former, I used the volume of searches recorded on Google Trends; for the latter, I used player ratings from the FIFA video game series. Both variables fitted my models well.

For my stories about populism, warfare and abortion, I combined existing academic datasets in novel ways. I believe this is the first time anyone has regressed Chapel Hill ratings against changes in European vote shares, or used a regression model to predict the Guttmacher Institute’s index. There is a large political-science literature on warfare, but academics have generally considered civil and interstate conflicts separately, rather than combining them into a single analysis.

And for my stories about Google and book length, I used my scraping skills to produce datasets that had not previously been published. (Several of my other projects for Graphic Detail have involved large scraping tasks, usually performed in Python.)

What was the impact of your project? How did you measure it?

It is hard to measure the impact of our data journalism in print, but the new Graphic Detail section has had a very positive response from readers. Our data articles are frequently among the ten most-read pieces on our website each week, measured by total reading time. My article about the football World Cup was the second-most-read piece (by reading hours) on our website in June, and I was invited to discuss it on BBC Radio 4 and the BBC World Service. We have also started to publish the methodology and data for our analyses on GitHub whenever possible.

Source and methodology

I have described most of my sources and methodology above. Most of the regression analyses used ordinary least squares (OLS), though some used logistic models as well. (Evan Hensleigh, one of my colleagues, used a clustering algorithm to place European parties on an ideological spectrum for the populism article.)
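To give a flavour of the kind of model involved, here is a minimal sketch of an OLS fit in plain numpy. The data and variable names (log GDP, a football-interest score) are synthetic stand-ins of my own invention, not the datasets used in the articles:

```python
# Illustrative OLS regression in plain numpy on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors: log GDP per person and a "football interest" score.
log_gdp = rng.normal(9.0, 1.0, n)
interest = rng.uniform(0.0, 1.0, n)

# Synthetic outcome driven by both predictors plus noise.
goal_diff = 0.5 * log_gdp + 2.0 * interest + rng.normal(0.0, 0.5, n)

# OLS: stack an intercept column with the predictors, then solve by least squares.
X = np.column_stack([np.ones(n), log_gdp, interest])
beta, *_ = np.linalg.lstsq(X, goal_diff, rcond=None)

intercept, b_gdp, b_interest = beta
print(f"intercept={intercept:.2f}, gdp={b_gdp:.2f}, interest={b_interest:.2f}")
```

With enough data the fitted coefficients recover the values used to generate the outcome, which is the basic sanity check behind any regression of this sort.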

Technologies Used

I have primarily used Python for my data scraping and analysis, though I have also learned R over the last 12 months. Most of the scraping is done using Selenium, which has allowed me to collect data from interactive web pages (as well as static ones).
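In outline, a scrape of this kind separates the Selenium-driven page fetch from the plain-Python tallying. The sketch below is illustrative only, not the production code: the URL, selector and function names are my own assumptions, and the fetch requires a local browser driver:

```python
# Sketch: fetch result links with Selenium, then tally outlets in plain Python.
from collections import Counter
from urllib.parse import urlparse


def tally_outlets(result_urls):
    """Count how often each news domain appears in a list of result URLs."""
    return Counter(urlparse(u).netloc.removeprefix("www.") for u in result_urls)


def fetch_result_urls(query):
    """Load a news-search results page in a real browser and collect link hrefs.

    Requires Selenium and a browser driver; shown for illustration only.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(f"https://www.google.com/search?q={query}&tbm=nws")
        return [a.get_attribute("href")
                for a in driver.find_elements(By.CSS_SELECTOR, "a")]
    finally:
        driver.quit()
```

Driving a real browser is what makes it possible to scrape interactive pages whose content is rendered by JavaScript, which simple HTTP requests cannot see.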

Project members

Dan Rosenheck, Alex Selby-Boothroyd, Matt McLean, James Fransham, Elliott Morris, Graham Douglas, Evan Hensleigh, Martin Gonzalez


