Since 2014, Brazilian politics has been enmeshed in the largest corruption scandal in the country's history. Carwash Operation (Operação Lava Jato), as the Brazilian Federal Police investigation came to be known, revealed a large corruption scheme involving private contractors and public entities that promoted illegal campaign financing and facilitated the embezzlement of billions.
The scale of the operation was unprecedented in Brazil. Following the scandal, there were more than 1,100 Carwash Operation cases under review in the federal, regional, and Supreme courts; meanwhile, the federal police had been forced to build new software to process data, as more than 1.2 million gigabytes had been collected during the investigations.
Brazilian investigators, the media, and society at large faced enormous difficulties in examining the information produced. At the same time, it became more necessary than ever to increase transparency over the judicial knowledge generated by Carwash, as most of the parties investigated continued to serve public and private counterparts in Brazil.
To address these issues, JOTA launched “Lava Jota”, a public database of judicial information related to Carwash. The project is a partnership with the lawtech Digesto. We put together all of Carwash’s legal actions on a single website and built an integrated search engine that allows you to search for keywords within each of the documents attached to the cases. All the data made available is public data from the Judiciary System, but it was not readily searchable by third parties.
The goal was to give journalists, researchers and lawyers access to the enormous volume of information produced within the scope of the Carwash operation, which is impossible to sift through manually.
But we are not done. JOTA built Lava Jota as a pro bono project. The financial resources received from sponsors and partners, such as law and technology firms and Transparency International, only paid for the production costs. JOTA invested a lot of personnel time in order to make this information available to society.
The magnitude of the Carwash Operation also required a tremendous operational effort. Just to give you an example: uploading videos of court sessions and testimonies to YouTube can take hours, and there are dozens of such videos here. Automatic video transcription can also take hours and is not perfect. A reporter must be assigned to read the transcript, identify key issues, and then choose the most important tags in each submission. We haven’t yet been able to do this for even half the material available. There’s still plenty to do.
What makes this project innovative?
Lava Jota has proven that it is possible to build ambitious transparency projects even with few resources, as long as you have good ideas, good people, and good partners. And, of course, there is a huge demand for data analysis and transparency, particularly in the fight against corruption.
In sum, Lava Jota uses big data to advance JOTA’s mission of producing information and analysis that foster institutional transparency and rule of law in Brazil. Ultimately, it is an innovative tool to increase transparency in public contracts and improve the freedom of information regarding those involved in illicit acts.
What was the impact of your project? How did you measure it?
We have audience data showing the reach and repercussion in other countries. As of today, Lava Jota has received over 310,000 pageviews since its launch, with 251,000 unique pageviews. People from over 30 countries have accessed our files, which have helped numerous investigations.
Source and methodology
We developed a "web scraping" script for downloading data from our Supreme Court and all of the Courts of Appeals. From the start, we faced some issues extracting these data. We were forced to devise a number of strategies to structure the data and bypass captchas and other barriers. Once this first phase was in place, we gathered the documents in a database management system sufficiently robust to handle the great quantity of files. Finally, we made them available with search functionality on our webpage.
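To illustrate one small step of such a scraper, here is a minimal sketch in Python that collects the PDF attachments linked from a case page. It is not the script we ran: the page markup, the `court.example` address and the helper names are hypothetical, and captcha handling and the actual download are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DocumentLinkParser(HTMLParser):
    """Collects absolute URLs of PDF attachments linked from a case page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore the query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def extract_document_links(html, base_url):
    """Return the PDF links found in `html`, de-duplicated, in page order."""
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


# Hypothetical case page; real court pages are structured differently.
SAMPLE_PAGE = """
<html><body>
  <a href="/docs/decision-001.pdf">Decision</a>
  <a href="/docs/decision-001.pdf">Decision (duplicate link)</a>
  <a href="petition.PDF?download=1">Petition</a>
  <a href="/about">About the court</a>
</body></html>
"""

links = extract_document_links(SAMPLE_PAGE, "https://court.example/cases/123/")
for link in links:
    print(link)
```

Resolving every link against the page address and dropping duplicates early keeps the later download step simple, since each document is then identified by a single absolute URL.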
It is important to clarify that those documents came in different formats. For example, most of them were PDF files that were not searchable. We transformed all documents into searchable text and put them in our database. We also made available videos containing interrogations of defendants in the lawsuits. We used technology to transcribe and classify them, and then had our reporters check the result. Finally, we structured part of the data and generated a list of the people involved.
For handling the documents, we used "Elasticsearch", a search engine well suited to handling large volumes of text. We employed Microsoft Video Indexer for indexing and transcribing videos.
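As a rough idea of how extracted text can be loaded into Elasticsearch and searched by keyword, the sketch below builds the request bodies in plain Python. The index name `lava_jota_docs`, the field names and the sample records are assumptions made for illustration, not our actual schema. The bodies follow the shape of the Elasticsearch bulk and search APIs, and sending them to a cluster is left out.

```python
import json


def build_bulk_payload(index_name, documents):
    """Serialize documents as the newline-delimited JSON used by the bulk API.

    Each document becomes two lines: an action line and the document itself.
    """
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        source = {"case_number": doc["case_number"], "text": doc["text"]}
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


def build_keyword_query(keyword, size=10):
    """Build a full-text search body that also asks for highlighted snippets."""
    return {
        "size": size,
        "query": {"match": {"text": keyword}},
        "highlight": {"fields": {"text": {}}},
    }


# Hypothetical records standing in for text extracted from court documents.
documents = [
    {"id": "doc-1", "case_number": "0000001", "text": "Primeira página.\nSegunda página."},
    {"id": "doc-2", "case_number": "0000002", "text": "Depoimento do réu."},
]

payload = build_bulk_payload("lava_jota_docs", documents)
query = build_keyword_query("depoimento", size=5)
print(payload)
print(json.dumps(query, indent=2))
```

Because `json.dumps` escapes line breaks inside the text, each document stays on a single line of the payload, which the newline-delimited format requires.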
Finally, we used d3.js to generate data visualizations that display some of our structured data.
Tomas Camargo - coordinator
Ricardo Cabral - data engineer
Thomaz Rezende - designer and developer
Pedro Leme - developer
Márcio Falcão - editor
Kalleo Coura - editor
Guilherme Duarte - data scientist
Luis Viviani - reporter
Alexandre Leoratti - reporter
Bárbara Mengardo - reporter
Lívia Scocuglia - reporter
Gustavo Gantois - reporter