Publique-se is a platform that scrapes lawsuits from the two highest courts in Brazil, downloads all the documents related to them (including evidence, invoices and other attached files) and runs a massive search across millions of pages for politicians mentioned in any of these documents. The result is the country's largest database of lawsuits related in some way to these politicians: 30 thousand cases mentioning 9 thousand politicians.
Our website provides a tool to search this dataset. You can look up every document related to a politician or run a free-text search (e.g., search for the name of a company to see whether it appears in lawsuits mentioning politicians). It also includes filters by subject and time period, the option to download every original document related to your search, and extensive metadata about the lawsuits, from which users can dig deeper in their investigations.
The project is coordinated by the Brazilian Association for Investigative Journalism (Abraji). Its main goal is to address a transparency problem in the country's courts and to give journalists a tool to better investigate politicians and hold them accountable. The first problem the project tackles is that, although the courts' websites let you type the name of a politician to look for lawsuits, they do not let you access the attached files (which are usually the most interesting part for investigative journalism) unless you have a lawyer's registration number. What some journalists in Brazil do is ask a lawyer for help (any lawyer, who does not need to be involved in the case) to download the files, and only then can they start working on them. This is, by the way, perfectly legal. By downloading everything and indexing the files, Publique-se makes things easier for every journalist who wants to start an investigation.
The second issue is that you cannot run a free-text search on the courts' websites. A politician can only be found if he or she is listed under “parties to the lawsuit” in the case's metadata. Politicians mentioned inside the documents are often missing from this metadata, which makes the courts' search tool less useful to journalists. Because Publique-se runs OCR on and indexes every single page of each document, it makes it possible to find even a single mention. On top of that, since we have gathered a huge database containing only lawsuits related to politicians, our platform lets users search for specific terms of interest (companies, well-known corrupt individuals, etc.) to see whether they appear in this kind of lawsuit.
Publique-se's search of Brazil's top two courts (the Supreme Court and the Superior Court of Justice) is especially useful to journalists because of a Brazilian rule known as “privileged forum”.
What makes this project innovative?
We are not aware of any other database that tracks politicians this broadly at a national level. Publique-se was inspired by OCCRP's Investigative Dashboard, but our goal is different: we focus on opening up, in great detail, what is being said about every Brazilian politician inside the justice system. We are trying to provide the transparency our country lacks and to help journalists hold politicians accountable. There is a similar (and very interesting) project in Brazil, Lava Jota, but it covers only lawsuits related to Operation Car Wash.
At the very beginning we weren't sure we could finish by our final deadline (before the 2018 general elections). Our team, which included a partnership with ABJ (Brazilian Association of Jurimetrics), had never done anything this broad. It took months to perfect strategies to scrape 550 thousand lawsuits, each containing many documents with many pages. After that, we had to OCR all of this material (which took more months) and develop algorithms to identify politicians' CPFs (roughly the Brazilian equivalent of a social security number) inside the pages of these tens of millions of files. It was the biggest challenge we have faced in any project conducted at Abraji.
What makes the project special is that it opens up data that was previously accessible only to lawyers in Brazil, in a way that makes it more useful to society as a whole and especially to journalists. This data makes it easier to hold politicians accountable and to investigate ties between them and persons of interest.
What was the impact of your project? How did you measure it?
It's tricky to measure success with page-view metrics because our focus is not on reaching a broad audience but on getting journalists to use the tool and investigate politicians with it. Major newspapers and political news websites in Brazil have published stories based on our data, covering issues that range from the candidates with the most mentions in lawsuits to candidates who had been sued for violence against their wives. As of now (December 2018), we are still trying to determine the total number of articles published. This is also hard, because most of them don't cite the project's name, since we provide a search tool for finding public data related to freely accessible lawsuits. It would be like someone citing Google for helping them find something. That said, our webpage had 15k views from journalists using the search tool, with 4k unique visitors.
Source and methodology
We gathered and cleaned data from Brazil's Superior Electoral Court (TSE) on everyone who has run for election since 2006. This resulted in a database of 1 million different politicians, with their names and their unique CPFs (the Brazilian equivalent of a social security number).
Before downloading, we identified all the electronic lawsuits (the ones with files available in PDF format) from the top two courts of Brazil (the Supreme Court and the Superior Court of Justice): more than 550k of them. Because no lawsuits dated before 2013 are digitized, these electronic lawsuits became our data universe. It took months to download all 3.5 million PDF documents related to these lawsuits, and another month to OCR all of them, storing a txt file in our database alongside each PDF. We also kept all the metadata (plaintiff, defendant, subject, judges, etc.).
After that, we applied an algorithm to flag every page of the documents that contained one of the 1 million CPFs belonging to politicians. We know that a politician could be mentioned in a document without a CPF number, but we decided to look only for CPFs to avoid confusing namesakes. This gave us a list of 30 thousand different lawsuits whose pages mention 9 thousand different politicians' CPFs. This became our database. Users can then search for a politician's name, and our search engine retrieves the pages flagged with that politician's CPF.
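To make the flagging step more concrete, here is a minimal sketch in Python. It is not the project's actual code (that algorithm was written in R); the function names, the page keys and the sample CPFs are all made up for illustration. It assumes each OCR'd page is available as plain text and that CPFs appear either as 000.000.000-00 or as 11 bare digits.

```python
import re

# A CPF has 11 digits, usually printed as 000.000.000-00.
# OCR output may drop the punctuation, so dots and dash are optional here.
# The lookarounds keep us from matching inside a longer run of digits.
CPF_PATTERN = re.compile(r"(?<!\d)\d{3}\.?\d{3}\.?\d{3}-?\d{2}(?!\d)")


def normalize_cpf(raw):
    """Strip everything but digits so formatted and bare CPFs compare equal."""
    return re.sub(r"\D", "", raw)


def flag_pages(pages, politician_cpfs):
    """Map each known politician CPF to the pages where it appears.

    pages: dict of (lawsuit_id, page_number) -> OCR text (hypothetical layout)
    politician_cpfs: set of 11-digit CPF strings from the electoral data
    """
    hits = {}
    for page_key, text in pages.items():
        for match in CPF_PATTERN.finditer(text):
            cpf = normalize_cpf(match.group())
            if cpf in politician_cpfs:
                hits.setdefault(cpf, set()).add(page_key)
    return {cpf: sorted(keys) for cpf, keys in hits.items()}


def search_by_name(name, name_to_cpf, hits):
    """Resolve a politician's name to a CPF, then return the flagged pages."""
    cpf = name_to_cpf.get(name.strip().upper())
    return hits.get(cpf, []) if cpf else []
```

Matching on the CPF rather than on the name is what avoids namesakes: two candidates may share a name, but the lookup from name to CPF to pages only returns pages where that exact number was found.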
We used scrapers written in Python and R to search for and download all the lawsuits. To OCR millions of pages we used Tesseract 4.0. We also coded an algorithm in R to identify politicians' CPFs. Our database runs on MySQL, and the website (built in PHP) retrieves data from it via an API.
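One check that can help when identifying CPFs in OCR output is the number's own check digits: the last two digits of a CPF are computed from the first nine, so a misread digit usually produces an invalid number. The sketch below shows that standard calculation in Python. Whether the project's R algorithm used such a filter is an assumption on our part here, not something stated above, and the function names and the sample number are illustrative only.

```python
def cpf_check_digits(first_nine):
    """Compute the two CPF check digits from the first nine digits."""
    digits = [int(c) for c in first_nine]
    for _ in range(2):
        # Weights run from len+1 down to 2: 10..2 for the first check
        # digit, then 11..2 once the first check digit has been appended.
        weights = range(len(digits) + 1, 1, -1)
        total = sum(d * w for d, w in zip(digits, weights))
        remainder = (total * 10) % 11
        digits.append(0 if remainder == 10 else remainder)
    return "".join(str(d) for d in digits[-2:])


def is_valid_cpf(cpf):
    """Return True if the string holds 11 digits with matching check digits."""
    digits = "".join(c for c in cpf if c.isdigit())
    # Sequences such as 111.111.111-11 pass the arithmetic but are
    # conventionally rejected as invalid.
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    return cpf_check_digits(digits[:9]) == digits[9:]


# Illustrative number commonly used in test suites, not a real person's CPF.
print(is_valid_cpf("529.982.247-25"))
# Same number with one digit changed, as an OCR misread might produce.
print(is_valid_cpf("529.982.247-35"))
```

A filter like this would only discard numbers that cannot be CPFs at all; a match against the list of politicians' CPFs is still what ties a page to a specific person.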
Tiago Mali, Daniel Bramatti, Thiago Herdy, Paulo Campos