Project description

A bank fails, and politicians save it with taxpayers’ money. This story repeats itself all around the world, most recently in Russia and Italy. To prevent costly bailouts, banking regulation has been devised and implemented. In particular, the documents of the Basel Committee on Banking Supervision play a key role in shaping national rules.
Over the last forty years, banking regulation has grown extensively. The framework developed by the Basel Committee alone consists of two million words on thousands of pages. But what is actually stated in all these documents?
The Neue Zürcher Zeitung (NZZ) performed a text analysis on all the regulatory documents of the Basel Committee with final status as of 31 July 2017.
The most obvious development was the sheer growth of text over the years – and a striking impact of the financial crisis (2007/2008) on the publication activities of the Basel Committee.
When diving deeper into the texts, we discovered how much banks "should" and how little they "must". And how the focus on risk types and the instruments used to prevent those risks have shifted over time, resulting in a cat-and-mouse game between financial institutions and the regulator.
The goal of this piece was to show how strongly banking regulation has grown and how complex these texts are – and, at the same time, how vaguely many of its rules are formulated. We also wished to make the complex matter of banking regulation accessible to a wide audience. As such, the piece addressed financial experts as well as the general public.

What makes this project innovative?

We used unstructured data – text – and structured it by applying different methods of text analysis. Then, starting from some of the most frequent words and word combinations, we used expert knowledge from the banking sector as well as from linguistics in order to deepen our text analysis.
At the same time, we tried to make our results as accessible and transparent as possible: The story was written with readers in mind who have no particular knowledge of the world of finance. Most graphics were annotated in order to point out the most striking patterns to the reader. And the methods of our analysis were laid out in detail at the end of the article. The code used to generate our results was published and linked in the methods section in order to make the analysis reproducible.

What was the impact of your project? How did you measure it?

The article reached a wide audience on our news site, as measured by our analytics tools, as well as on social media. A particularly high number of spontaneous, personal messages from the general public as well as from the financial sector showed us that the article and the analysis presented in it appealed to both audiences.

Source and methodology

We scraped the regulatory documents (mostly .pdf, sometimes .html) from the website of the Basel Committee on Banking Supervision and transformed them into plain text. Some of the older documents had to be run through an OCR system (Optical Character Recognition) in order to be made machine-readable. The resulting text corpus was then processed in different ways for different steps of the analysis (e.g. lower-casing, stemming, removing stop words).

We counted the number of pages per document and aggregated these values by year in order to show the development of the publication activities of the Basel Committee over time. We decided to show the number of pages as a proxy for the growth of banking regulation (rather than documents, words or characters), mainly because this measure is readily graspable for most readers: We all roughly know what it means to read a 300-page book, but many readers would have a hard time putting "7000 words" or "500 characters" into context.

Text-analytical tasks were performed using bash code. In order to illustrate the complexity of the regulatory texts, we measured sentence length and compared it to the mean sentence length in the British National Corpus, a text corpus containing a broad spectrum of text types. We then counted the frequency of words and word combinations (n-grams) in the corpus. Starting from three of the most frequent concepts, we refined and adapted our methodology to elaborate on each of them.

(1) The high frequency of the word "should" led us to adopt a linguistic viewpoint for a more detailed analysis: we counted the frequency of occurrence of modal verbs and modal constructions over time, to discover how much banks "should" and how little they "must" – and how the financial crisis functioned as a kind of caesura: a higher proportion of more binding modals appears after 2007/2008.

(2) The most frequent content word after "bank/banks" was "risk".
We first looked at the most frequent collocations of "risk" in order to find the most frequent risk types, then automatically counted these in every document and showed the proportions of their occurrences over time.

(3) Another extremely frequent concept in the text corpus was the collocation "risk management": the Basel Committee has developed a number of instruments in order to prevent risks. We used expert knowledge from the banking sector to compile a list of these instruments, and searched for these terms and their morphological variants in the corpus.

(2) and (3) led to graphics that show how the Basel Committee, over the past 40 years, detected different types of risk and responded to those risks with particular instruments – instruments which financial institutions tried to evade, resulting in an ongoing cat-and-mouse game between such institutions and the regulator.
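The counting steps above all boil down to simple frequency queries over plain text. As a rough sketch only (not the published code – the file name, sample text and modal list below are invented stand-ins), the modal-verb census of step (1) might look like this in bash:

```shell
# Illustrative sketch: count occurrences of each modal verb in a
# plain-text corpus. "corpus.txt" and its contents are invented
# stand-ins, not the actual Basel corpus or the NZZ pipeline.
printf 'Banks should hold capital. Banks must report. A bank may comply. Banks should disclose.\n' > corpus.txt
for modal in must shall should may; do
  # -o prints each match on its own line, -w matches whole words only
  n=$(tr '[:upper:]' '[:lower:]' < corpus.txt | grep -ow "$modal" | wc -l)
  echo "$modal $((n))"
done
```

The real analysis ran over the full corpus and aggregated the counts by publication year rather than over a single file.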

Technologies Used

For the analysis, we mainly used custom code in R and bash. Visualizations were created in R and styled in Adobe Illustrator, and the latter was also used to show highlighted text extracts from regulatory documents.
In more detail: We scraped the PDF documents and transformed them to plain text using R. OCR was performed on the non-machine-readable PDFs using Adobe Reader Pro. Text processing (lower-casing, stemming, removing stop words etc.) was performed using the R library 'tm'. For more detailed text-analytical tasks we wrote bash code. Data were visualized using the R library 'ggplot2', and the resulting graphics were exported and styled using Adobe Illustrator.
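To give a flavour of the preprocessing, here is a toy shell equivalent of the 'tm' steps mentioned above (lowercasing, punctuation removal, tokenization, stop-word removal; stemming omitted). The input document and stop-word list are invented examples, not the actual pipeline:

```shell
# Toy stand-in for the preprocessing done with R's 'tm' library.
# Both input files below are made up for illustration.
printf 'The banks SHOULD manage the credit risk.\n' > doc.txt
printf 'the\na\nand\nof\n' > stopwords.txt
tr '[:upper:]' '[:lower:]' < doc.txt \
  | tr -cd '[:alnum:][:space:]' \
  | tr -s '[:space:]' '\n' \
  | grep -vxFf stopwords.txt
# lowercase, strip punctuation, one word per line, drop stop words
```

In the actual project these steps were run in R so the cleaned corpus could feed directly into the frequency counts and visualizations.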
The code (R and bash) can be accessed here:

Project members

Co-authors: Jürg Müller and Simon Wimmer. With help from David Bauer and Joana Kelén.


Additional links
