NBC News set out to recover, analyze and publish verified Russian troll tweets from accounts that Twitter had identified but whose activity it then deleted from all online records. These were accounts identified as being run by the Internet Research Agency, a Kremlin-linked firm tied to 2016 election meddling. By deleting the tweets, Twitter made it impossible for citizens, journalists, researchers, lawmakers and any general reader in our broadcast and online audience to understand and discuss the true extent of the online disinformation operation from both a content and a network perspective. By open-sourcing the data and publishing the tweets we recovered, we sought to correct this gap in the public record.
What makes this project innovative?
What was the impact of your project? How did you measure it?
Source and methodology
We had confidence in the data's authenticity because a significant portion of one source's records overlapped with another's. There were also numerous cases where we could “spot check” the data by looking for a copy of a tweet in an online archive, such as archive.org or archive.is, or by finding pre-existing references to it. The sources were able to recover more than 200,000 tweets from 454 accounts, some of which had creation dates going back to 2009.
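As an illustration of that kind of spot check, the sketch below queries the Internet Archive's Wayback Machine availability API for a given URL and returns the closest archived snapshot, if one exists. The tweet URL in the example is a placeholder, not one of the recovered tweets, and this is a minimal sketch rather than the exact tooling used in the project.

```python
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def find_snapshot(url, timestamp=None):
    """Ask the Wayback Machine whether a copy of `url` was archived.

    Returns the closest snapshot URL, or None if nothing was captured.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "20161101" for around Nov. 1, 2016
    resp = requests.get(WAYBACK_API, params=params, timeout=10)
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# Placeholder URL for illustration only.
print(find_snapshot("https://twitter.com/some_account/status/123456789", "20161101"))
```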
For the analysis we started with a mixture of Twitter streaming API data (one JSON object per line) and CSV files, along with a few other miscellaneous artifacts such as the House Intelligence Committee's list of accounts as CSV. All of that data was transformed and loaded into Neo4j via Python (a sketch of the load and graph steps follows below). Once in Neo4j, we ran a variety of graph analyses:
(1) a graph “overlay” network that allowed us to connect who was talking with whom more directly, abstracting away the details of the day-to-day tweets;
(2) community detection via label propagation, a graph method for detecting which users belong to which sub-communities;
(3) PageRank, a method of determining relative influence in a social network;
(4) natural language processing, using part-of-speech and concept tagging to identify the people, places, events, entities and adjectives used in the data; and
(5) various statistical methods to build histograms of the most common hashtags, people, locations, etc., split out by time window. We also created “topic matrices,” with a hashtag or a person on the X axis and the week on the Y axis, so that we could track relative frequency as it changed over time.
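The sketch below shows one way the pipeline described above could look: streaming-API tweets are read one JSON object per line, a mention graph is built in Neo4j with the official Python driver, and PageRank and label propagation are run over it. The connection details, the (User)-[:MENTIONS]->(User) schema and property names are assumptions for illustration, and the queries use the current Graph Data Science plugin's procedure names, which may differ from the algorithm library used at the time of the project.

```python
import json
from neo4j import GraphDatabase  # official Neo4j Python driver

# Connection details and graph model are illustrative assumptions.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_tweets(path):
    """Read one-JSON-object-per-line streaming-API tweets and build a mention graph."""
    with driver.session() as session, open(path) as f:
        for line in f:
            tweet = json.loads(line)
            author = tweet["user"]["screen_name"]
            for mention in tweet.get("entities", {}).get("user_mentions", []):
                session.run(
                    """
                    MERGE (a:User {screen_name: $author})
                    MERGE (b:User {screen_name: $mentioned})
                    MERGE (a)-[:MENTIONS]->(b)
                    """,
                    author=author, mentioned=mention["screen_name"],
                )

def run_graph_algorithms():
    """Project the mention graph, then stream PageRank scores and communities."""
    with driver.session() as session:
        session.run("CALL gds.graph.project('mentions', 'User', 'MENTIONS')")
        pagerank = session.run(
            "CALL gds.pageRank.stream('mentions') YIELD nodeId, score "
            "RETURN gds.util.asNode(nodeId).screen_name AS user, score "
            "ORDER BY score DESC LIMIT 10"
        )
        communities = session.run(
            "CALL gds.labelPropagation.stream('mentions') YIELD nodeId, communityId "
            "RETURN gds.util.asNode(nodeId).screen_name AS user, communityId"
        )
        return list(pagerank), list(communities)
```

The topic matrices can be sketched in a similar, hedged way: assuming a tidy DataFrame with one row per hashtag occurrence and illustrative column names, a week-by-hashtag frequency table falls out of a group-by and unstack.

```python
import pandas as pd

def topic_matrix(tweets_df):
    """Build a week-by-hashtag frequency matrix.

    Assumes columns 'created_at' (timestamp) and 'hashtag' (one row per
    hashtag occurrence); the column names are illustrative.
    """
    tweets_df = tweets_df.copy()
    tweets_df["week"] = pd.to_datetime(tweets_df["created_at"]).dt.to_period("W")
    return (
        tweets_df.groupby(["week", "hashtag"]).size()
        .unstack(fill_value=0)  # weeks on the rows (Y axis), hashtags on the columns (X axis)
    )
```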
William Lyon (Neo4j)
David Allen (Neo4j)