We scraped all the tweets posted by presidential hopefuls in the Brazilian 2018 election. Since the story was published at an early stage of the race, it covered 16 politicians and around 15,000 tweets. We applied statistical analysis techniques to this corpus to determine the most distinctive words used by each of them: that is, the words that one person used a lot and all the others rarely used.
The story helped reveal the particular political positions of the men and women who were willing to run for office in one of the most competitive elections in Brazilian history. The text analysis was an objective way of profiling the contenders: when a candidate emphasizes a word, it indicates that the topic is important to them and relevant to the campaign, since those were the themes they chose to bring to the online public debate.
Some interesting insights: President Bolsonaro’s most typical word was “scoundrel”, along with words such as “leftist” and “communist”. The feminist representative Manuela D’Ávila, who ended up running as Fernando Haddad’s vice-presidential candidate, talked a lot about “chauvinists”, “sexists” and “racists”.
What makes this project innovative?
I like how our data editor described the story to people from other desks: it’s the most objective profile of the presidential candidates we could possibly write. By using text analysis techniques, we were able to determine which topics each politician was choosing to bring to the public conversation. It’s a data-driven way of discovering, still at an early stage of the campaign, what things could look like if any of them ended up elected.
What was the impact of your project? How did you measure it?
The story was well received by readers and researchers alike, and was shared by editors of other major publications and scholars from federal universities. It was also the first story from our newsroom in many years to have its source code published online.
Source and methodology
Data was collected using Twitter’s public API. The full methodology was published along with the article. Here is an English translation: “The analysis was based on 15,654 posts published on Twitter between September 28, 2017 and April 23, 2018 by 14 politicians who present themselves as possible candidates for the Presidency of the Republic: Jair Bolsonaro (PSL), Marina Silva (REDE), Geraldo Alckmin (PSDB), Ciro Gomes (PDT), Manuela D'Ávila (PCdoB), Álvaro Dias (PODE), Lula (PT), Guilherme Boulos (PSOL), Henrique Meirelles (MDB), Michel Temer (MDB), Rodrigo Maia (DEM), Fernando Collor (PTC), João Amoedo (NOVO) and Joaquim Barbosa (PSB). To these were added the tweets by Fernando Haddad and Jaques Wagner, from PT, who are considered the party's "plan B" in case the Superior Electoral Court decides to bar the candidacy of former president Lula. No chart was drawn up for Joaquim Barbosa because he published only one tweet during the data collection interval. Two weeks later, he tweeted once again to announce his withdrawal from the race. To find out which words were statistically characteristic of each politician, we compared the terms used by a given pre-candidate with the terms used by all the others. For example, for every ten thousand words tweeted by Marina Silva, the word "sustainable" was used about 42 times. Among the other candidates, this rate was less than 1. This means, therefore, that the term is typical of her. The calculation ignores words that were used exclusively by one candidate. This prevents meaningless terms such as typos or very particular verbal tics from polluting the analysis.” The source code is available on our GitHub page (see additional links).
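The comparison described in the methodology can be illustrated with a minimal Python sketch. This is not the published source code: the function names, the whitespace tokenizer, the toy tweets and the choice to rank words by the simple ratio between the two rates are all assumptions made for illustration, since the methodology text does not name the exact statistic used.

```python
from collections import Counter


def rate_per_10k(counts, word):
    """Occurrences of `word` per ten thousand words in `counts`."""
    total = sum(counts.values())
    return 10_000 * counts[word] / total if total else 0.0


def distinctive_words(tweets_by_candidate, candidate, top_n=10):
    """Rank a candidate's words by how much more often they use them
    than all the other candidates combined (hypothetical helper)."""
    target, others = Counter(), Counter()
    for name, tweets in tweets_by_candidate.items():
        bucket = target if name == candidate else others
        for tweet in tweets:
            # Naive tokenizer (assumption): lowercase and split on whitespace.
            bucket.update(tweet.lower().split())

    scored = []
    for word in target:
        # Skip words used exclusively by this candidate, as the
        # methodology does, to keep typos and verbal tics out.
        if others[word] == 0:
            continue
        own = rate_per_10k(target, word)
        rest = rate_per_10k(others, word)
        scored.append((word, own, rest, own / rest))

    # Assumption: rank by the ratio between the two rates.
    scored.sort(key=lambda row: row[3], reverse=True)
    return scored[:top_n]


# Toy data, invented for illustration only (not real tweets).
toy_tweets = {
    "A": ["sustainable growth now", "sustainable future for all",
          "sustainable zzztypo growth"],
    "B": ["growth for all", "growth taxes now", "sustainable taxes future"],
}

for word, own, rest, ratio in distinctive_words(toy_tweets, "A", top_n=3):
    print(f"{word}: {own:.0f} vs {rest:.0f} per 10k words (ratio {ratio:.1f})")
```

In this toy corpus, "zzztypo" never appears in the ranking because only candidate A uses it, which mirrors the exclusion rule quoted above. A real analysis would also need proper tokenization and stop-word handling for Portuguese.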
Data collection and analysis were done in Python. The charts were drawn with d3.js, and finishing touches were added in Adobe Illustrator.
Rodrigo Menegat - data and story writing
Cecília do Lago - story writing
Bruno Ponceano - infographics
Vinícius Sueiro - design and web development
Carlos Marin - web development