Project description

In early 2019, IBM announced that it was releasing a dataset of photos of people’s faces in an attempt to “reduce bias” in facial recognition technology, which is notoriously bad at accurately identifying people of color, particularly women.

In the announcement, which received broadly positive press coverage, IBM revealed that it had taken pictures from photo-sharing site Flickr that had been published with a Creative Commons license. This meant that photos people uploaded ten years ago of their friends and family were now being used to train surveillance software that disproportionately targets minority communities.

NBC News set out to understand whether the photographers or subjects of those photos were happy about their pictures or faces being used in this way and discovered that many of them did not feel they had consented to this biometric data grab.

As the team investigated further, we realized the problem was far bigger than IBM. The entire facial recognition industry is built on a “dirty little secret”: millions of people’s photos taken from the web without their knowledge or consent.

What makes this project innovative?

We stored hundreds of thousands of user data files on the web, encoded with an algorithm in a way that made them prohibitively difficult for others to scrape. We wanted to protect the users whose photos were in the IBM dataset while still letting those users find out that they were included.
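The general idea can be illustrated with a minimal Python sketch. This is not the project's code (the published tool was built with node.js, and its exact encoding scheme is not described here). It assumes one plausible approach: each user's file is stored under a name derived from a deliberately slow hash of the username, so a reader who knows their own username can find their file, while enumerating every file is costly. The names `SALT`, `ITERATIONS`, `lookup_key` and `build_index` are hypothetical.

```python
import hashlib

# Hypothetical parameters for illustration; not taken from the project.
SALT = b"example-public-salt"
ITERATIONS = 100_000


def lookup_key(username: str) -> str:
    """Derive an opaque, hard-to-enumerate file name from a username."""
    normalized = username.strip().lower()
    digest = hashlib.pbkdf2_hmac(
        "sha256", normalized.encode("utf-8"), SALT, ITERATIONS
    )
    return digest.hex()


def build_index(records):
    """Group photo URLs under the derived key of the user who owns them.

    `records` is an iterable of (username, photo_url) pairs.
    """
    index = {}
    for username, photo_url in records:
        index.setdefault(lookup_key(username), []).append(photo_url)
    return index
```

In a design like this, the front end would compute the same key from the username a reader types in and request only that one file, so the full list of usernames never has to be published.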

What was the impact of your project? How did you measure it?

More than 350,000 people visited the article, and it was picked up by the BBC, Fortune, Mashable and The Verge, among others. A cofounder of Flickr used the user lookup tool to find out that photos of hers were in the IBM dataset, then let her Twitter followers know.

Source and methodology

We downloaded the 64-gigabyte Yahoo dataset of 100 million Flickr photos from the web. We obtained the IBM dataset, which has more than 900,000 records, from an unnamed source. Joining the two datasets allowed us to attach Flickr usernames to the IBM records, which in turn allowed us to reach out to Flickr users whose photos had been taken. In one case, a Flickr user in the IBM dataset had been told by IBM that they weren’t in it.
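The join described above can be sketched in Python as follows. This is only an illustration: the actual work was done with CSVKit and shell tools, and the join key is not stated in this write-up. The sketch assumes both datasets share a photo identifier. The column names `photo_id`, `username` and `url`, and the sample rows, are made up.

```python
import csv
import io


def attach_usernames(yahoo_rows, ibm_rows):
    """Inner-join IBM rows to Yahoo rows on a shared photo identifier."""
    # Map each photo ID in the larger dataset to the user who uploaded it.
    owner_by_photo = {row["photo_id"]: row["username"] for row in yahoo_rows}
    joined = []
    for row in ibm_rows:
        username = owner_by_photo.get(row["photo_id"])
        if username is not None:  # keep only photos found in both datasets
            joined.append({**row, "username": username})
    return joined


# Made-up sample data standing in for the two real files.
YAHOO_SAMPLE = "photo_id,username\n1,alice\n2,bob\n3,carol\n"
IBM_SAMPLE = (
    "photo_id,url\n"
    "2,https://example.org/2.jpg\n"
    "3,https://example.org/3.jpg\n"
    "9,https://example.org/9.jpg\n"
)

yahoo_rows = list(csv.DictReader(io.StringIO(YAHOO_SAMPLE)))
ibm_rows = list(csv.DictReader(io.StringIO(IBM_SAMPLE)))
joined = attach_usernames(yahoo_rows, ibm_rows)
```

With the sample rows, photo 9 has no match in the larger dataset and is dropped, which is the behavior of an inner join.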

Technologies Used

We used CSVKit and other shell (bash) programs for dataset operations. We used Node.js to encrypt the user data we published online and to upload the hundreds of thousands of user data files. We used JavaScript, HTML and CSS to build the front-end interface that let readers see whether their photos were included in the IBM dataset.

Project members

Lead: Olivia Solon, support: Jeremia Kimelman and Joe Murphy


