Project description

We are the data unit of the Süddeutsche Zeitung, working on topics ranging from politics and popular-culture debates to corruption and climate change. The following projects were published between 26 March 2018 and 7 April 2019.

– We analysed more than 50,000 children’s and young-adult books spanning 70 years, as well as 42,000 book covers, and were able to reveal gender stereotypes in children’s books.
– We conducted a survey on rental prices in which 57,000 people participated. The project #MeineMiete shows how broken the rental market is in Germany and how housing is the new social question.
– We’ve scraped thousands of publications from the most popular predatory publishers and built a Linkurious database to help our investigative colleagues understand the problem through network analyses.
– Using open-source intelligence tools and network analysis, we helped reconstruct the murder of the Maltese journalist Daphne Caruana Galizia, and with the same approach reviewed a UNPOL investigation into the murder of two UN officials in the DRC.

– We automatically generated texts reporting the election results for all 91 constituencies in the Bavarian state elections.
– We analysed thousands of minutes of parliamentary sessions to find out how debates in the German federal parliament, the Bundestag, have changed since the right-wing AfD party entered it.
– We evaluated the coalition agreement at the beginning of the parliamentary term and distilled a kind of to-do list for the government from 500,000 characters of text. We found 139 self-imposed goals: 139 promises against which the government must be measured, and whose implementation we continuously document in our coalition tracker.

– Climate change is one of the major issues of our time. In 2018, we published three major data-based research reports on heat, drought and snow that show how climate change is already manifesting itself today.
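The automatically generated election texts mentioned above can be sketched as simple template filling. This is a minimal illustration, not the newsroom's actual pipeline; the constituency names, parties and figures are invented examples:

```python
# Minimal sketch of template-based text generation for election results.
# All names and numbers below are hypothetical, not real Bavarian results.

TEMPLATE = (
    "In the constituency of {name}, the {winner} won the direct mandate "
    "with {share:.1f} percent of the vote, a {change:+.1f} point change "
    "compared to the previous election."
)

def generate_text(result: dict) -> str:
    """Fill the sentence template with one constituency's results."""
    return TEMPLATE.format(**result)

results = [
    {"name": "Munich-Mitte", "winner": "CSU", "share": 31.4, "change": -9.2},
    {"name": "Nuremberg-North", "winner": "Greens", "share": 28.7, "change": 12.5},
]

for r in results:
    print(generate_text(r))
```

In practice one would add templates for different scenarios (close races, large swings, incumbents losing), so the generated texts vary rather than repeating one sentence 91 times.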

What makes this project innovative?

We do journalistic work in areas of life that are becoming more and more important but are hard to access with conventional journalistic methods and tools. This approach lets us create stories that are both informative and investigative, which constitutes the core of SZ journalism. We use tools from data science and computer science, mainly the programming languages R and Python, for our data work. The links provided show the different data-driven approaches we used in our reporting: from crowdsourced information and academic-style surveys to scraping thousands of fake publications and analysing data from the parliamentary proceedings of the German Bundestag.

What was the impact of your project? How did you measure it?

We regularly succeed in stimulating public debate with our data-journalistic research and surprising approaches. For example, 57,000 people took part in our survey on rental prices, and our data analysis was taken up and discussed by many politicians. The research on fake science contributed to the predatory publisher OMICS being ordered to pay a fine of 50 million dollars.

Source and methodology

In our work we focus mainly on sources that are not easily accessible, and each project requires its own methodological and data-journalistic approach.

For the project on gender stereotypes in children's books, the underlying database was the largest catalogue of children's literature in the German-speaking countries. For all these books, we evaluated information such as title, year of publication, author, publisher and, above all, the keywords assigned to each work. The connections between the keywords can be investigated using network-analysis methods: Which keywords often occur together? Which keywords are central? What are the direct and indirect relationships between them? How do the networks for boys and girls differ? In addition, the dominant colours of 42,000 book covers were determined with the help of Google Vision.

For the analysis of the Bundestag minutes we used structured texts, which the Bundestag makes available as XML files, and built our own database from them. The most complex step was the automated recognition of text patterns. For example: When is applause attributed to the SPD and when to the CDU/CSU? Who is the interjector? What did she say, and to whom, and where? For this purpose we set up rules which are applied to the plenary minutes with the help of regular expressions. Many such interlocking rules produce a table of 24,396 rows, from which we can draw conclusions: patterns emerge showing in which combinations parliamentary groups applaud each other, and how the AfD splits the Bundestag. In the case of the government declaration, for instance, we found that Angela Merkel received 117 rounds of applause, 55 from the Union and 49 from the SPD. The stenographers also recorded general cheerfulness once, laughter twice and 22 cheers.
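The keyword co-occurrence step described above can be sketched in a few lines. This is an illustrative reduction of the idea, not our production code; the keyword lists are invented examples standing in for catalogue entries:

```python
from collections import Counter
from itertools import combinations

# Sketch of keyword co-occurrence counting for a network analysis.
# The books and keywords are invented examples, not real catalogue data.
books = [
    ["adventure", "pirate", "courage"],
    ["princess", "horse", "friendship"],
    ["adventure", "courage", "friendship"],
]

# Count how often each pair of keywords is assigned to the same book.
# These counts become the weighted edges of the keyword network.
cooccurrence = Counter()
for keywords in books:
    for a, b in combinations(sorted(set(keywords)), 2):
        cooccurrence[(a, b)] += 1

# A simple centrality proxy: how many distinct co-occurrence edges
# touch each keyword (its degree in the network).
degree = Counter()
for (a, b), weight in cooccurrence.items():
    degree[a] += 1
    degree[b] += 1

print(cooccurrence.most_common(3))
print(degree.most_common(3))
```

On the real data, comparing these networks for books tagged for boys versus books tagged for girls is what surfaces the stereotypical clusters.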
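The regular-expression step for attributing applause can likewise be sketched. The pattern and party list below are simplified assumptions for illustration, not the actual rules applied to the minutes:

```python
import re

# Hypothetical sketch: attribute stage directions such as
# "(Beifall bei der SPD)" in plenary-minute text to parties.
# The pattern and party list are illustrative, not the real rule set.
PARTIES = r"(?:CDU/CSU|SPD|AfD|FDP|DIE LINKE)"
APPLAUSE = re.compile(r"\(Beifall bei der (" + PARTIES + r")\)")

def count_applause(minutes: str) -> dict:
    """Count applause stage directions per party in a transcript excerpt."""
    counts = {}
    for party in APPLAUSE.findall(minutes):
        counts[party] = counts.get(party, 0) + 1
    return counts

excerpt = """
(Beifall bei der SPD)
(Beifall bei der CDU/CSU)
(Beifall bei der SPD)
"""
print(count_applause(excerpt))
```

The real rule set has to handle many more variants (applause from several groups at once, interjections, laughter), which is why dozens of such interlocking rules are needed to fill the final table.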

Technologies Used

Mainly R and Python for data gathering and analysis, various open-source intelligence tools, and JavaScript for interactive publications.

Project members

Katharina Brunner, Felix Ebert, Sabrina Ebitsch, Christian Endt, Hannes Munzinger, Martina Schories, Benedict Witzenberger, Vanessa Wormer, Moritz Zajonz.


Additional links

