Project description

Councils use 377,000 people’s data in effort to predict child abuse
We investigated the use of algorithms, developed in the private sector and trained on vast amounts of personal data without informed consent, to detect child abuse and intervene before it occurred in ‘high risk’ families.

Revealed: sick, tortured immigrants locked up for months in Britain
The UK is the only country in Europe where migrants can be locked up indefinitely, many without having committed crimes. In the absence of official data sources, we built our own dataset to shed light on the experience of detainees.

Children ‘bombarded’ with betting adverts during World Cup
ITV’s World Cup coverage, including pre-watershed games during which gambling advertisements are not usually allowed, included almost 90 minutes of betting ads during the tournament.

3,121 desperate journeys: exposing a week of chaos under Trump’s zero tolerance
We analysed more than 6,000 documents relating to immigrants caught crossing the US border, providing the most comprehensive picture yet of immigrants prosecuted during Donald Trump’s zero tolerance policy.

Flatshare bias: room-seekers with Muslim name get fewer replies
This investigation collected quantitative evidence that, when responding to flatshare advertisements, a person named “Muhammad” was significantly less likely to receive a positive or neutral response than one named “David”.

How Facebook is influencing the Irish abortion referendum
Guardian analysis of almost 800 Facebook ads identified by the Transparency Initiative Referendum revealed major differences in how the two sides of the Irish abortion referendum campaign sought to influence voters.

Rape prosecutions plummet despite rise in police reports
We revealed that the number of rape cases being charged by the Crown Prosecution Service had plummeted to the lowest level in a decade, despite an increase in the number of incidents reported to police. The data was released through Freedom of Information requests, and our reporting revealed prosecutors had been quietly urged to take a more risk-averse approach in rape cases to help stem widespread criticism of the service’s low conviction rates.

Gender pay gap: What did we learn this year?
We analysed gender pay gap figures to reveal widespread inequality across British businesses. The data shows almost eight out of 10 companies pay men more on average than women, and every industry pays men more on average than women.

Revealed: the rise and rise of populist rhetoric
We uncovered a two-decade rise in populist rhetoric. We worked with a network of political scientists using textual analysis of public addresses of leaders in 40 countries to show the number of populist leaders has more than doubled since the early 2000s.

Banking leak exposes Russian network with link to Prince Charles
We collaborated with OCCRP and partners to track the movement of billions of dollars through a network of Russian-operated offshore companies into Europe and the US. Our work revealed a charity run by Prince Charles received donations from an offshore company that was used to funnel cash from Russia.

What makes this project innovative?

In this, the tenth anniversary year of Guardian Data, our three-person team has delivered more than 100 pieces of journalism, including short-turnaround news stories, in-depth investigations, and inter-newsroom and cross-border collaborations. The team consists of Data Projects Editor Caelainn Barr and data journalists Pamela Duncan and Niamh McIntyre. In practice, however, ours is a far bigger team because of its emphasis on collaboration across the newsroom. Wherever possible we work alongside newsroom reporters, specialist correspondents and our visuals team, with their expertise strengthening our offering and vice versa.

In the past 10 years the data team has evolved to deal with the ever-growing amount of data that exists. As the amount of available data has proliferated, so too have the methods we use to analyse it. We use programming for complex data analysis, allowing us to analyse microdata or dig into leaks involving millions of records.

Our philosophy is that data is not just about numbers: each dataset has a human element and we endeavour to put this at the heart of each story. A database built from court documents, revealing the scale and impact of Trump’s zero tolerance policy, led to interviews with those the policy affected. Revelations of how few rape cases lead to charges being brought tell the story of the many women failed by prosecutors. An exclusive survey of almost 200 detainees held in seven deportation centres in England, which showed more than half were classed “at risk”, revealed how the system fails the people in its care. Through data we show the extent to which such policies affect real people and give voice to those affected.

What was the impact of your project? How did you measure it?

Combined, our stories have reached and engaged millions of readers and been shared hundreds of thousands of times on Twitter, Facebook and other social media platforms. However, another measure of success for the team is the impact stories have in changing policies and reframing the debate on certain issues.

Our reporting series on how rape cases are dealt with by the Crown Prosecution Service prompted a review of how complainants are treated and how their data is processed by police forces. The Guardian’s reporting prompted the launch of a comprehensive Home Office review of how rape cases are dealt with across the criminal justice system. The review will specifically be tasked with investigating “why there have been reductions in volumes of police referrals, CPS charges, prosecutions and convictions for rape and serious sexual assault cases”. Our work also prompted a review by the Information Commissioner’s Office into the disclosure and retention of data from rape complainants.

The revelation that football fans would have seen 90 minutes’ worth of betting ads (equivalent to a full game of soccer) was widely cited (although seldom attributed) after the Guardian broke the story days before the tournament final. Our work helped inform the debate which, within six months of the article being published, led gambling companies to agree to a voluntary ban on advertising during live sports.

Source and methodology

We publish our methodology wherever appropriate, either as contextual information within the body or at the end of a news piece or, in the case of more in-depth stories, as a standalone methodology.

Facebook adverts: The analysis was based on 788 Facebook ads (432 from the yes campaign, 356 from the no side) related to the Irish abortion referendum, as identified by the Transparency Initiative Referendum. Two types of analysis were carried out. In the first instance, the text of the ads was analysed to identify each word or term used. A count was then carried out to establish how many adverts included each word. Comparisons between the yes and no sides were weighted to reflect the differing number of adverts carried by each.

Zero tolerance: We used optical character recognition (OCR) technology to convert thousands of PDF documents from the US PACER (Public Access to Court Electronic Records) service into a machine-readable format. These documents were then parsed and analysed using regular expressions (regex) to build the dataset that underpinned the story.
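As an illustration of the regex parsing step described for the zero tolerance documents, the sketch below pulls a few fields out of OCR text with Python's standard `re` module. It is not the team's actual code: the sample text, the field names and every pattern are invented for illustration, and real court filings would need patterns tuned to their own layout and to OCR errors.

```python
import re

# Hypothetical OCR output. The wording and layout are invented for this sketch
# and do not come from any real court document.
SAMPLE_TEXT = """
UNITED STATES DISTRICT COURT
Case No. 2:18-mj-01234
Defendant: JOHN DOE
Date of arrest: 05/21/2018
Charge: 8 U.S.C. 1325(a)(1)
"""

# One pattern per field. These are assumptions about the document layout.
FIELD_PATTERNS = {
    "case_number": re.compile(r"Case No\.\s*(\S+)"),
    "defendant": re.compile(r"Defendant:\s*(.+)"),
    "arrest_date": re.compile(r"Date of arrest:\s*(\d{2}/\d{2}/\d{4})"),
    "charge": re.compile(r"Charge:\s*(.+)"),
}


def parse_document(text):
    """Return a dict of extracted fields; a field that is not found is None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1).strip() if match else None
    return record


record = parse_document(SAMPLE_TEXT)
```

Running `parse_document` over each converted document and collecting the resulting dicts gives one row per document, which can then be checked by hand against the source PDFs.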

Technologies Used

Data journalism takes many forms and no two stories are ever built the same way. In the past year the Guardian data team has used a variety of methods to produce a wide-ranging body of work. Coding forms part of our toolbox. Our investigation into the bias experienced by flat-seekers of a minority background would have been an onerous task without the ability to extract the data using a web scraper. The analysis of approximately 6,000 court documents relating to immigrants crossing into the US during the Trump administration’s zero tolerance policy would have been almost impossible without the use of Python and regex. Programming also allowed us to carry out text analysis of Facebook adverts carried on the platform in the run-up to the Irish abortion referendum. But while programming can lift the load in some circumstances, other datasets required a more hands-on approach: Pamela Duncan used a spreadsheet to manually record the duration of every advertisement shown during ITV’s World Cup coverage.
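The text analysis of Facebook adverts could be sketched along the following lines: count how many adverts on each side contain each word, then divide by the number of adverts on that side so that the two campaigns can be compared despite running different numbers of ads. This is a minimal sketch, not the team's code. The function names and the toy adverts are invented, and dividing by the advert count is only one assumed reading of how the comparison was weighted.

```python
import re
from collections import Counter


def document_frequency(ads):
    """Count how many adverts contain each word (an ad counts once per word)."""
    counts = Counter()
    for text in ads:
        words = set(re.findall(r"[a-z']+", text.lower()))
        counts.update(words)
    return counts


def weighted_comparison(yes_ads, no_ads):
    """Map each word to (share of yes adverts, share of no adverts) containing it."""
    yes_counts = document_frequency(yes_ads)
    no_counts = document_frequency(no_ads)
    result = {}
    for word in set(yes_counts) | set(no_counts):
        result[word] = (
            yes_counts[word] / len(yes_ads),
            no_counts[word] / len(no_ads),
        )
    return result


# Invented toy adverts, not taken from the real dataset.
yes_ads = ["Vote yes for care", "Care and compassion", "Yes for women", "Trust women"]
no_ads = ["Vote no", "Protect life, vote no"]

shares = weighted_comparison(yes_ads, no_ads)
```

Comparing shares rather than raw counts means a word used in 10 of 432 adverts on one side is not treated as more prominent than one used in 9 of 356 on the other.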

Project members

Caelainn Barr, Pamela Duncan and Niamh McIntyre

