Project description

Since 2014, Brazilian politics has been enmeshed in the largest corruption scandal in the country's history. Carwash Operation (Operação Lava Jato), as the Brazilian Federal Police investigation came to be known, revealed a large corruption scheme involving private contractors and public entities that promoted illegal campaign financing and facilitated the embezzlement of billions.

The scale of the operation was unprecedented in Brazil. Following the scandal, there were more than 1,100 Carwash Operation cases under review in federal, regional, and Supreme courts; meanwhile, the federal police had been forced to build new software to process data, as more than 1.2 million gigabytes had been collected during the investigations.

Brazilian investigators, the media, and society at large faced enormous difficulties in examining the information produced. At the same time, it became more necessary than ever to increase transparency over the judicial knowledge generated by Carwash, as most of the parties investigated continued to do business with public and private counterparts in Brazil.

To address these issues, JOTA launched “Lava Jota”, a public database of judicial information related to Carwash. The project is a partnership with the lawtech company Digesto. We put together all of Carwash’s legal actions on a single website and built an integrated search engine that allows you to search for keywords within each of the documents attached to the cases. All the data made available is public data from the Judiciary System, but it was not readily searchable by third parties.

The goal was to give journalists, researchers and lawyers a way to navigate the enormous volume of information produced within the scope of the Carwash operation, a task that is impossible to do manually.

But we are not done. JOTA built Lava Jota as a pro bono project. The financial resources received from sponsors and partners, such as law and technology firms and Transparency International, only paid for the production costs. JOTA invested a lot of personnel time in order to make this information available to society.

The magnitude of the Carwash Operation also required a tremendous operational effort. To give just one example: uploading videos of court sessions and testimonies to YouTube can take hours, and there are dozens of such videos here. Automatic video transcription can also take hours and is not perfect. A reporter must be assigned to read the transcript, identify key issues, and then choose the most important tags in each submission. We haven’t yet been able to do this for even half of the material available. There’s still plenty to do.

What makes this project innovative?

Not even the Brazilian Justice System or the Public Prosecutor’s office had been able to build such an engine, which is of obvious interest to the whole of society. In fact, the very same prosecutors and judges involved in the investigations started using Lava Jota to filter relevant data. And because it is possible to search for mentions of other countries affected by Carwash, Lava Jota became known throughout the world, particularly in Latin America.

Lava Jota has proven that it is possible to build ambitious transparency projects even with few resources, as long as you have good ideas, good people, and good partners. And, of course, there is a huge demand for data analysis and transparency, particularly in the fight against corruption.

In sum, Lava Jota uses big data to advance JOTA’s mission of producing information and analysis that foster institutional transparency and rule of law in Brazil. Ultimately, it is an innovative tool to increase transparency in public contracts and improve the freedom of information regarding those involved in illicit acts.

What was the impact of your project? How did you measure it?

The project became a reference among Brazilian journalists, especially fact checkers. We are constantly referenced on the web (in Portuguese, though). The main lawyers and prosecutors involved in Carwash Operation lawsuits turn regularly to Lava Jota to filter through case information and have personally talked about the project with JOTA. Lava Jota data was shared by members of the Carwash investigative task force itself, as well as on official government websites. There have also been academic projects launched based on the data Lava Jota made available.

We have audience data showing the reach and repercussion in other countries. As of today, Lava Jota has received over 310,000 pageviews since its launch, of which 251,000 were unique pageviews. People from over 30 countries have accessed our files, which have helped numerous investigations.

Source and methodology

As previously stated, the goal of the project was to collect data from the Brazilian Judiciary in order to make lawsuit files of the "Carwash" operation available to the public. Although they are supposed to be public information, these documents are not easy to find on the official websites, due to the lack of transparency of the Brazilian Judiciary in general. In fact, they are scattered across a number of different pages, making it very difficult to check the cases.

We developed a "web scraping" script for downloading data from our Supreme Court and all of the Courts of Appeals. From the start, we faced some issues extracting these data. We were forced to devise a number of strategies to structure the data and bypass captchas and other barriers. Once this first phase was in place, we gathered the documents in a database management system sufficiently robust to handle the large quantity of files. Finally, we made them available with search functionality on our webpage.
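As an illustration of the first step, a crawler of this kind might begin by collecting the links to attached files from a case page. The following is a minimal sketch in Python using only the standard library; it is not the script used in the project. The page layout, the example URLs, and the names DocumentLinkParser and extract_document_links are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DocumentLinkParser(HTMLParser):
    """Collects absolute URLs of PDF links found in an HTML page (hypothetical layout)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags can carry document links in this sketch.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep links that point to PDF files, regardless of letter case.
        if href and href.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def extract_document_links(html, base_url):
    """Return the PDF links in `html`, resolved against `base_url`."""
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Example with a made-up page fragment and a placeholder domain.
sample_page = (
    '<a href="/docs/a.pdf">Ruling</a>'
    '<a href="page.html">Next page</a>'
    '<a href="b.PDF">Testimony</a>'
)
sample_links = extract_document_links(sample_page, "https://example.org/case/1")
```

Resolving every link against the page URL keeps relative and absolute references in one consistent form, which makes it easier to avoid downloading the same file twice.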

It is important to clarify that those documents came in different formats. For example, most of them were PDF files that were not searchable. We transformed all documents into searchable text and stored them in our database. We also made available videos of the interrogations of defendants in the lawsuits. We used technology to transcribe and classify them, and then had our reporters check the result. Finally, we structured part of the data and generated a list of the people involved.
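One way to decide which files need text recognition is to check how much usable text each page already carries. The sketch below assumes the text of each page has already been extracted into a list of strings by some other tool; the function name needs_ocr and the two thresholds are hypothetical choices for illustration, not values taken from the project.

```python
def needs_ocr(page_texts, min_chars=50, min_ratio=0.5):
    """Return True when too few pages carry a usable text layer.

    page_texts: one string per page, as produced by a text extractor.
    min_chars:  assumed minimum number of letters for a page to count as usable.
    min_ratio:  assumed minimum share of usable pages for the whole file.
    """
    if not page_texts:
        # A file with no extractable pages is treated as a scanned document.
        return True
    usable_pages = sum(
        1
        for text in page_texts
        if sum(ch.isalpha() for ch in text) >= min_chars
    )
    return usable_pages / len(page_texts) < min_ratio


# Made-up examples: a scanned file with empty pages and a file with real text.
scanned_file = ["", "   ", "\n"]
text_file = ["a" * 100, "b" * 80]
```

Counting letters rather than all characters prevents pages that contain only whitespace or stray symbols from being mistaken for searchable text.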

Technologies Used

Lava Jota used a variety of tools to create our searchable database. First, to download documents from the Courts, we wrote the crawlers in Python. We also used the R language to analyze structured data. To handle the documents, we used Elasticsearch, a system well suited to handling a large quantity of text data. We used Microsoft Video Indexer to index and transcribe videos. Finally, we used d3.js to generate data visualizations that display some of our structured data.
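To give an idea of how a keyword search over the stored documents could be expressed, the sketch below builds a request body in the Elasticsearch query DSL, combining a full-text match with an optional filter and highlighted snippets. The field names "content" and "case_number", the sample case number, and the function name build_search_body are assumptions for illustration; the actual index layout of Lava Jota is not described here.

```python
import json


def build_search_body(keyword, case_number=None, size=10):
    """Build an Elasticsearch search body for a keyword query.

    "content" and "case_number" are hypothetical field names.
    """
    body = {
        "size": size,
        "query": {
            "bool": {
                # Full-text match on the document text.
                "must": [{"match": {"content": keyword}}],
            }
        },
        # Ask for snippets showing where the keyword appears.
        "highlight": {"fields": {"content": {}}},
    }
    if case_number is not None:
        # Exact-value filter that restricts results to a single case.
        body["query"]["bool"]["filter"] = [{"term": {"case_number": case_number}}]
    return body


# Example with a made-up keyword and case number.
example_body = build_search_body("propina", case_number="0000000-00.0000.0.00.0000")
example_json = json.dumps(example_body, indent=2)
```

Keeping the case restriction in the filter clause rather than in must means it narrows the results without affecting how the keyword matches are scored.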

Project members

Laura Diniz - coordinator
Tomas Camargo - coordinator
Ricardo Cabral - data engineer
Thomaz Rezende - designer and developer
Pedro Leme - developer
Márcio Falcão - editor
Kalleo Coura - editor
Guilherme Duarte - data scientist
Luis Viviani - reporter
Alexandre Leoratti - reporter
Bárbara Mengardo - reporter
Lívia Scocuglia - reporter
Gustavo Gantois - reporter

