To investigate cases of fraud, bribery, state capture and money laundering, the Organized Crime and Corruption Reporting Project (OCCRP) and its member centers need to combine leaked document sets, public records and scraped data into a coherent picture that lets us follow the money. Using Aleph, a custom-built open source software tool, we've brought together terabytes of data, over 200 scrapers and a unique collection of leaked material into a searchable topography of economic and political power. This store of evidence combines public-access data for the journalism community with protected spaces for our collaborative cross-border projects.
What makes this project innovative?
Rather than building one-off technology solutions for each investigation, OCCRP Data provides operational support for our projects while also building up a long-term memory of leaks, public records and other data. Over the last three years, we've accumulated 28 million documents and 550 million data records about companies, individuals, state procurement, sanctions, and land, ship and aircraft registries. In total, we're combining 21 TB of documents and 1180 database tables with a total size of 1.2 TB into a single resource. The platform supports multiple languages and alphabets, optical character recognition and named entity extraction. A particularly useful feature has been the ability to cross-reference a list of entities, such as a list of all the politicians in one country, against all the other databases and leaks in the corpus. Data collection and analysis at this scale and diversity is unprecedented in the investigative journalism community, and provides a unique source of insight to our reporters and the public.
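To illustrate the cross-referencing idea, here is a minimal sketch in Python. It is not Aleph's actual implementation: the function names, the record layout and the simple exact-match-after-normalization strategy are all assumptions made for illustration. It reduces names to a comparable key (ignoring accents, case, punctuation and word order) and then looks up each name on a watchlist across several datasets.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Reduce a name to a comparable key (hypothetical helper).

    Strips accents, case and punctuation, and sorts the tokens so that
    "PEREZ, Jose" and "José Pérez" produce the same key.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    tokens = re.findall(r"\w+", without_accents.casefold())
    return " ".join(sorted(tokens))


def cross_reference(watchlist, datasets):
    """Match each watchlist name against every record in every dataset.

    `datasets` maps a dataset label to a list of dicts that each carry a
    "name" field (an assumed layout). Returns a dict mapping each matched
    watchlist name to a list of (dataset label, record) pairs.
    """
    # Build one index over all datasets, keyed by normalized name.
    index = defaultdict(list)
    for label, records in datasets.items():
        for record in records:
            index[normalize_name(record["name"])].append((label, record))

    # Look up each watchlist entry; keep only names with at least one hit.
    matches = {}
    for name in watchlist:
        hits = index.get(normalize_name(name), [])
        if hits:
            matches[name] = hits
    return matches
```

Given a watchlist such as ["José Pérez"] and a company registry containing a record named "PEREZ, Jose", the sketch would report a match. Exact matching on a normalized key is only the simplest option; a production system would likely also need transliteration between alphabets and fuzzy matching, which this sketch leaves out.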
What was the impact of your project? How did you measure it?
We've used OCCRP Data as a leak analysis tool for a number of our recent investigations, including the Troika Laundromat collaboration; the Daphne Project; Golden Visas, a cross-border collaboration on the practice of investor visas; the ownership of high-end real estate by corrupt officials on Dubai's Palm Jumeirah and The World development projects; and an investigation into the sale of land in the Maldives. Outside of these larger collaborations, the tool is routinely used to background individuals and companies by people both within and outside the OCCRP network. Among those users we see a large number of returning and frequent visitors, which suggests that the site has become a routine part of their toolkit.
Source and methodology
We operate a complex infrastructure that includes tools for both repeatable and one-off data acquisition. Alongside the development of the open-source Aleph software, we've also built various auxiliary tools, including memorious, a scraping toolkit that collects data from 200+ sources of public and open data; the datavault, an archive of structured SQL data; and OpenSanctions, a project that collects and combines sanctions information from around the world.
Python, ElasticSearch, PostgreSQL, Tesseract, spaCy, RabbitMQ, Redis, React, PDF.js, Docker, Kubernetes, International Components for Unicode, Flanker E-Mail, LibreOffice.
Davit Khurshudyan, Jen, Tarashish Mishra, Emma Prest, Amy Guy, Iain Collins