Project description

NBC News set out to recover, analyze and publish the verified Russian troll tweets from accounts that Twitter had identified and then deleted, erasing all online records of their activity. These were accounts identified as being run by the Internet Research Agency, a Kremlin-linked firm tied to 2016 election meddling. By deleting the tweets, Twitter made it impossible for citizens, journalists, researchers, lawmakers and the general readers in our broadcast and online audience to understand and discuss the true extent of the online disinformation operation from both a content and a network perspective. By open-sourcing the data and publishing the tweets we recovered, we sought to correct this gap in the public record.

What makes this project innovative?

Twitter deleted a crime scene. We recovered data that appeared to be lost and then made it available for anyone to examine or use. Other news organizations recovered anywhere from 36,000 to 220,000 tweets from the accounts, but we were the only ones to publish the underlying data itself. This has opened up new possibilities for other outlets (so far it has mostly been smaller regional outlets who have taken up the charge) to slice the data for their readers. The database is the definition of news you can use, and it will continue to be a resource for NBC News and the public. We partnered with Neo4j, an outside graph database company whose software we used for the social network analysis; their team deepened our data journalism "bench." We encourage other newsrooms to look for such mutually beneficial partnerships to add depth to their data journalism in an agile fashion.

What was the impact of your project? How did you measure it?

Nearly half a million citizens accessed the story on our site, allowing them to see the true scope of the operation and inoculate themselves against future propaganda. Dozens of publications ran pickups, citations and follow-on stories, diving into the data on their own and localizing it for their geographic audiences, including The Guardian, The Hill, Quartz, The Outline, The Denver Post, The Atlanta Journal-Constitution, La Stampa (Italy) and The Washington Times. Whenever anyone uses our data they give us credit and a link back, making the database a force multiplier for our reporting.

The story went viral on Twitter, where retweets from Katie Couric and Human Rights Watch amplified it to millions of followers, and it racked up 340,000 pageviews. We went on MSNBC to talk about it and did a Reddit AMA. The project has inspired academic researchers to begin writing papers based on our database; researchers from Stanford are feeding our data to AI systems to train them to spot future trolls, which could one day flash an alert in your browser when you are reading an account that is part of a political botnet. Two days after we published our database of trolls, Special Counsel Mueller filed his lengthy speaking indictment against the 13 Russian trolls. It may have been a coincidence, but it meant that everyone digging into the open-source reporting suddenly had a lot more material to work with.

Source and methodology

Twitter had deleted the data, and online archives were incomplete. There were no easily available copies because Twitter mandates that when it suspends accounts, anyone holding the data must delete it as well. So we identified likely sources who might have had access to the Twitter API and kept records, and simply asked for the data. To the sources who wanted to participate, we provided the list of 2,752 verified usernames that Twitter gave to Congress and that was published as evidence during the tech-giant hearings, and we asked them to cross-reference their records for tweets from those accounts. The three sources who were able to provide data requested anonymity, both to avoid being identified as having broken their developer agreement and to avoid getting caught up in any politicization of the data.

We had confidence in the data's authenticity because significant parts of the sources' records overlapped with one another. There were also numerous cases where we could “spot check” the data by finding a copy of a tweet in an online archive, such as archive.org or archive.is, or by finding a pre-existing reference to it. The sources were able to recover over 200,000 tweets from 454 accounts, some with creation dates going back to 2009.
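That verification step lends itself to a short sketch. The snippet below is a minimal illustration, not the project's actual scripts: it loads two sources' recovered tweets from hypothetical JSON-lines files, measures how far their records overlap by tweet ID, and spot-checks a tweet URL against the Internet Archive's Wayback availability API. The field name id_str follows the standard Twitter API payload; the file names are assumptions.

```python
import json
import requests

def load_tweet_ids(path):
    """Read one JSON object per line and return the set of tweet IDs."""
    ids = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            tweet = json.loads(line)
            ids.add(tweet["id_str"])
    return ids

# Hypothetical file names for two sources' recovered records.
source_a = load_tweet_ids("source_a_tweets.jsonl")
source_b = load_tweet_ids("source_b_tweets.jsonl")

overlap = source_a & source_b
print(f"{len(overlap)} tweets appear in both sources "
      f"({len(overlap) / len(source_a | source_b):.1%} of the combined set)")

def wayback_snapshot(screen_name, tweet_id):
    """Ask the Wayback Machine whether an archived copy of the tweet exists."""
    url = f"https://twitter.com/{screen_name}/status/{tweet_id}"
    resp = requests.get("https://archive.org/wayback/available", params={"url": url})
    return resp.json().get("archived_snapshots", {})
```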

For the analysis we started with a mixture of Twitter streaming API data (one JSON object per line) and CSV files, along with a few other miscellaneous artifacts such as the House Intelligence Committee account list as a CSV. All of that data was transformed and loaded via Python. Once in Neo4j, we ran a variety of graph analyses:

1. A graph “overlay” network that let us connect who was talking with whom more directly, abstracting away the details of the day-to-day tweets.
2. Community detection via label propagation, a graph method for detecting which users belong to which sub-communities.
3. PageRank, a method for determining relative influence in a social network.
4. Natural language processing: part-of-speech and concept tagging to identify the people, places, events, entities and adjectives used in the data.
5. Various statistical methods for building histograms of the most common hashtags, people, locations and so on, split out by time window.

We also created “topic matrices” with a hashtag or a person on the X axis and the week on the Y axis, so that we could track relative frequency as it changed over time.
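As a concrete illustration of the load step and the “overlay” idea, here is a minimal sketch assuming tweets arrive as one JSON object per line in the Twitter streaming API format and a local Neo4j instance is available. The graph model (:User nodes joined by weighted MENTIONS relationships), the file name and the connection details are illustrative choices, not the project's exact schema. It requires the official neo4j Python driver.

```python
import json
from neo4j import GraphDatabase

# Connection details are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

MERGE_MENTION = """
MERGE (a:User {screen_name: $src})
MERGE (b:User {screen_name: $dst})
MERGE (a)-[r:MENTIONS]->(b)
ON CREATE SET r.count = 1
ON MATCH  SET r.count = r.count + 1
"""

def mentions(tweet):
    """Yield (author, mentioned_user) pairs from one streaming-API tweet."""
    author = tweet["user"]["screen_name"]
    for m in tweet.get("entities", {}).get("user_mentions", []):
        yield author, m["screen_name"]

with driver.session() as session, open("troll_tweets.jsonl", encoding="utf-8") as fh:
    for line in fh:
        tweet = json.loads(line)
        for src, dst in mentions(tweet):
            # Collapsing individual tweets into weighted user-to-user edges
            # is the "overlay" network described above.
            session.run(MERGE_MENTION, src=src, dst=dst)

driver.close()
```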

Technologies Used

The technologies used were primarily Neo4j, along with Python. Neo4j handled all of the data analysis and querying; Python was used mainly for massaging the data and bulk loading it in the very earliest stages.
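To make the analysis stage concrete, the sketch below pulls the MENTIONS edge list built in the load step above out of Neo4j and computes PageRank and label-propagation communities. In the project itself these ran inside Neo4j's graph algorithm procedures; networkx stands in here so the example is self-contained, and the connection details and weight property are assumptions.

```python
import networkx as nx
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

EDGES = """
MATCH (a:User)-[r:MENTIONS]->(b:User)
RETURN a.screen_name AS src, b.screen_name AS dst, r.count AS weight
"""

# Rebuild the mention network in memory as a directed, weighted graph.
G = nx.DiGraph()
with driver.session() as session:
    for record in session.run(EDGES):
        G.add_edge(record["src"], record["dst"], weight=record["weight"])
driver.close()

# PageRank: relative influence of each account in the mention network.
rank = nx.pagerank(G, weight="weight")
for user, score in sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{user:25s} {score:.4f}")

# Label propagation: which accounts cluster into the same sub-community.
communities = nx.algorithms.community.label_propagation_communities(G.to_undirected())
for i, members in enumerate(communities):
    print(f"community {i}: {len(members)} accounts")
```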

Project members

EJ Fox
Maura Barrett
William Lyon (Neo4j)
David Allen (Neo4j)
