How to handle massive datasets – and other lessons from OCCRP Data and Aleph

With its rapidly growing number of sources, OCCRP Data is becoming an essential tool for anyone investigating corruption.

The platform is developed by the Organized Crime and Corruption Reporting Project (OCCRP) in collaboration with other contributors, and its aim is to make it easy to investigate data from over 190 sources, including information from the Panama Papers, the Bahamas Leaks, the WikiLeaks State Department Cables, and OCCRP’s Laundromat investigations.

But OCCRP Data is only part of the story: the platform is powered by Aleph, software for searching, managing and analysing data. OCCRP is developing Aleph as an open-source tool that aims to help any journalistic team make sense of big datasets.

We caught up with Friedrich Lindenberg, the developer behind Aleph and OCCRP Data, to find out what led to the software’s creation, how it can help journalists with their investigations, and how he hopes Aleph will develop in the future.

Have a look at the related Matchmaking project on the Den: ‘Help us translate OCCRP Data!’

How would you describe OCCRP Data in a sentence?
Friedrich Lindenberg

OCCRP Data is a platform for investigative reporters, or anyone looking into corruption, to find leads to investigate. OCCRP as an organisation focuses on governments’ and companies’ activities, so most of the included data is about company ownership, company activities, government procurement, leaked data from within companies and banks, and so on.

And how would you describe Aleph, the tool behind OCCRP Data?

Aleph is a piece of software that indexes large amounts of data for easy browsing and searching. We’re developing it as an open source project, with contributions from a few other organisations. OCCRP Data is essentially the biggest installation of Aleph.

In short, if you’re a developer, you might be interested in using Aleph to create your own database. And if you’re a journalist, you would probably be more interested in using OCCRP Data as a source.

Let’s start with OCCRP Data, and we’ll come back to Aleph later. OCCRP Data currently has over 190 data sources, and the number is growing. Where is the data from?

OCCRP has an institutional focus on Central and Eastern Europe, and on offshore jurisdictions, so that’s where a lot of the data comes from. But it does also include databases from around the world.

Some of the data is from email leaks and other one-time releases of information, but we also have around 180 scrapers that search the web and download new data every day, so those datasets are always up to date.

If you sign in to OCCRP Data, you get access to even more data sources, because there’s some data we don’t want to make freely available on the internet as it might help identify sources. For example, you need to sign in to access the data used in OCCRP’s Russian Laundromat investigations, where we worked with the Guardian, Berlingske, and other European media.

How do you determine which new data sources you add to the platform?

Most commonly what happens is that we get an email from a reporter saying that they’re looking at a specific topic or country, and asking if we can get some new data, such as information about government contracts or company ownership. Then we go out and get whatever records we can find. We usually continue updating these datasets afterwards: Once a reporter has asked for it, we stay on the case.

We’re also increasingly contacted by people who have received a big leak and are struggling with how to deal with it. For example, last year a group in South Africa received a hard drive from a senior business person, and they didn’t know what to do with it. This data led to the Gupta Leaks investigation. We’re trying to get better at taking this kind of data, making it searchable, and helping find links with other investigations.

How much are journalists using OCCRP Data?

The site gets about 30,000 visits a month. At the moment a lot of the usage seems to relate to specific cases, for example when we do big collaborative projects, or when we include a new, leaked dataset in the database.

Ironically, some of our most active users aren’t journalists but people from law firms and due diligence companies. For them it’s a free tool, when normally they would have to pay hundreds of thousands of dollars to access commercial databases. It also seems that businesses may be more eager to try new tools than journalists are.

Can you give examples of how journalists have used OCCRP Data?

Sometimes we surface data that journalists don’t know is available. Recently I was contacted by a journalist from Lithuania, who found documents on OCCRP Data about their prime minister that they were able to use in their reporting. We had scraped these documents from various websites, and this journalist was not aware that the information was public.

We also do more systematic data mining, and are increasingly using the tool to cross-reference data within the platform. If, for example, we have a list of people who have been involved in a money-laundering scheme, you can use the tool to run these names against other databases. This may help you find out if the same names show up elsewhere, such as in the Panama Papers, or on the US sanctions list, which makes it easier to contextualise information. We did this extensively during the Laundromat investigations.
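To illustrate the idea of running a list of names against other databases, here is a minimal Python sketch. It is not Aleph's implementation: the function names `normalize_name` and `cross_reference`, the dataset labels, and the sample names are all made up for illustration, and the matching is a simple exact match on normalised names rather than anything more sophisticated.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase a name, strip accents and punctuation, and sort its parts.

    Sorting the parts means "MULLER, Jurgen" and "Jürgen Müller" compare equal.
    This is a deliberately crude assumption and can produce false positives.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = "".join(
        ch if ch.isalnum() else " " for ch in without_accents.lower()
    )
    return " ".join(sorted(cleaned.split()))


def cross_reference(names, datasets):
    """Return {name: [dataset labels]} for every name found in a dataset.

    `datasets` is a hypothetical mapping of dataset label -> list of names.
    """
    # Build one lookup table from normalised name to the datasets containing it.
    index = {}
    for label, entries in datasets.items():
        for entry in entries:
            index.setdefault(normalize_name(entry), set()).add(label)

    hits = {}
    for name in names:
        found = index.get(normalize_name(name))
        if found:
            hits[name] = sorted(found)
    return hits


if __name__ == "__main__":
    # Invented sample data, not taken from any real dataset.
    suspects = ["Jürgen Müller", "Ana Petrova", "Sam Example"]
    datasets = {
        "leak_a": ["MULLER, Jurgen"],
        "sanctions_list": ["Petrova Ana", "John Doe"],
    }
    print(cross_reference(suspects, datasets))
```

In practice, name matching across languages and alphabets needs more care than this (transliteration, fuzzy matching, manual review of hits), so a match like the ones above is a lead to check, not a confirmed identity.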

You’re seeking help to translate your tool to new languages – what languages are you focusing on?

We’re specifically looking for help with translating the software into Spanish, French and German, as they are the easiest wins: they all use the Latin alphabet, and adding these languages would make the software relevant in more countries.

In terms of showing and searching information, we can already do it in all European languages and in the Cyrillic alphabet. We’re also looking into Arabic now, both having the website in Arabic and being able to process Arabic texts.


Moving on to Aleph, the tool behind OCCRP Data – what was the starting point for the creation of the tool?

I started working on Aleph before I joined OCCRP. The first version was created in 2015 for a project run by the International Consortium of Investigative Journalists (ICIJ). I wrote a piece of software that allowed us to download stock exchange filings as quickly as possible, import them, and make them searchable for the investigation.

When I started working for OCCRP, one of our biggest challenges was just being able to search all the documents we had. There was a ton of information from previous investigations, but nobody had a good overview of the data or how it possibly related to newer investigations we were doing. That’s when Aleph was revived, and it has now been in active development for a bit more than a year.

Aleph was first developed for ICIJ’s Fatal Extraction investigation
Why did you need to create an entirely new tool for this?

There’s a weird duality with data that we’re trying to solve. If you’re a technologist, you know that there are two kinds of data, structured and unstructured, and that you need to approach these types differently. But if you’re a journalist, you could not care less about the different characteristics of data! A journalist wanting to find the name of a person or a company does not care whether it comes from an SQL database, an email, a government gazette, or whatever the source may be.

We’re basically trying to build a solution that makes all data, structured and unstructured, accessible uniformly, so that journalists don’t have to care what kind of data they’re dealing with. So Aleph supports everything from images to emails, and from PDF files to Word documents, and also structured information such as government contracts, information about individuals who own companies, information about those companies, about land ownership…

At the same time, we’re trying to build a system that’s also very good at importing and managing leaked information and large datasets, like companies’ databases or procurement data.
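The idea of making structured and unstructured data searchable through one interface can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not Aleph's actual data model: the `Record` class, the helper functions and the sample data are all hypothetical, and the search is a plain substring match rather than a real search index.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One searchable item, whatever kind of source it came from."""
    source: str  # hypothetical dataset label, e.g. "registry" or "emails"
    kind: str    # "structured" or "unstructured"
    text: str    # everything searchable, flattened to plain text


def from_row(source: str, row: dict) -> Record:
    """Flatten a structured row (e.g. from a database table) into a Record."""
    return Record(source, "structured", " ".join(str(v) for v in row.values()))


def from_document(source: str, body: str) -> Record:
    """Wrap the extracted text of a document (email, PDF, ...) in a Record."""
    return Record(source, "unstructured", body)


def search(records, query: str):
    """Return all records whose text contains the query, ignoring case."""
    needle = query.lower()
    return [r for r in records if needle in r.text.lower()]


if __name__ == "__main__":
    # Invented sample data for illustration only.
    records = [
        from_row("registry", {"company": "Acme Holdings", "director": "Jane Roe"}),
        from_document("emails", "Please wire the funds to Acme Holdings today."),
        from_document("gazette", "Notice of a tender for road repairs."),
    ]
    for hit in search(records, "acme holdings"):
        print(hit.source, hit.kind)
```

The point of the sketch is that the person searching sees a single search function and a single result type, while the difference between a database row and a document is handled once, at import time.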

For anyone interested in using Aleph to launch their own online database, rather than building on OCCRP Data, what’s that process like?

You don’t need that much technical skill to set it up on a server, and we’re trying to make it easier and easier. But you do need to be able to use a computer through a command line, and have a server rented somewhere, so it’s definitely not for everyone.

We’re working with Süddeutsche Zeitung in Germany, who have an internal deployment of the tool, and there’s a start-up in Berlin called OpenOil who run their own version of Aleph. We try to encourage people to set up their own thing, especially with leaked information; we don’t need to host every piece of data out there.

OpenOil’s version of Aleph includes over 2 million public domain documents, collected from financial regulators around the world.
On the other hand, if you have everything in one place, that could be useful too.

That’s kind of the irony of it. Everybody hates Twitter and Facebook, and in a certain way centralisation sucks, but it can also be just damn efficient thanks to the network effect.

It’s actually an interesting question: What does the network effect mean in journalism? How can we connect all this information in more meaningful ways? At the moment, for example, if we know that a company is registered at a particular location, we can show what other companies exist at that same address. Or if an email address comes up in one leak, we can already see if it’s part of other leaks. But I think that’s just scratching the surface, and we’re only starting to figure out what the network effect means for investigative reporting.
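The address example can be illustrated with a small Python sketch that groups companies by a normalised address and keeps only the addresses shared by more than one company. This is an assumption-laden illustration, not how Aleph does it: the function names and the sample companies are invented, and real address matching is considerably messier than lowercasing and stripping punctuation.

```python
from collections import defaultdict


def normalize_address(address: str) -> str:
    """Lowercase an address and drop commas, full stops and extra spaces."""
    simplified = address.lower().replace(",", " ").replace(".", " ")
    return " ".join(simplified.split())


def shared_addresses(companies):
    """Return {address: [company names]} for addresses used by 2+ companies.

    `companies` is an iterable of (company_name, address) pairs.
    """
    by_address = defaultdict(set)
    for name, address in companies:
        by_address[normalize_address(address)].add(name)
    return {
        address: sorted(names)
        for address, names in by_address.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    # Invented sample data for illustration only.
    companies = [
        ("Alpha Ltd", "12 Harbour St., Nassau"),
        ("Beta Corp", "12 harbour st nassau"),
        ("Gamma LLC", "5 Main Road, Riga"),
    ]
    print(shared_addresses(companies))
```

The same grouping approach would work for email addresses that appear in more than one leak: swap the address for the email and the company name for the leak label.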

How has the tool evolved since you first created it?

Aleph used to be really awkward to use: it had two search boxes on the home page, one for structured data and one for unstructured data. That was clear to every technical person but made no sense to any journalistic human! So that’s gone away, and we’ve completely redone the user interface to make it as simple as possible. Aleph also has nice preview windows now, which make it really quick to explore search results.

In terms of using the tool, we’re doing our best to make it usable for everyone, and I’m really keen to get more feedback. We’ve got user testing going on right now, so it’s only going to get better.

What next steps do you have for Aleph?

We want to focus on helping people manage their investigations inside the tool. We’ll add “case files”, which will make it easier for the user to keep all the information that’s relevant to their investigation together: they’ll be able to upload documents, which the system OCRs and makes searchable, and we’ll also make it possible for them to bookmark existing data in Aleph for their investigation.

We’re also trying to figure out if we can scale up, and if we can fundraise for the project. We hope that at some point we’re able to bring Aleph to an entirely new level, to have a larger, international software development team, and to be able to hire people who are experts in machine learning and in dealing with really large amounts of data. Ultimately, we’d like to build a kind of a start-up around investigative journalism and technology.

Check the Aleph Wiki for information on how to install the software.

Sounds like your role is right in the middle of those two worlds, tech and journalism?

Yeah, that’s why I love this particular job. In the morning you write some hard-core data mining code, and then in the afternoon you help investigate some huge political scandal. Bringing these two worlds together is super interesting.

Basically, I have two goals: One is helping our reporters to find the information they need using OCCRP Data. The other is to see if we can use Aleph to create a viable open source project in journalism that many organisations can contribute to and benefit from.

