Project description

Facebook, Twitter, and YouTube have been heavily used by candidates in presidential races, in addition to more traditional outlets such as TV and radio. How can voters keep up with all this content? For the 2018 Brazilian electoral campaign, Folha de S. Paulo launched “GPS Eleitoral” (Electoral GPS), a tool that collected all posts from candidates on Facebook and Twitter, as well as transcripts of TV spots and YouTube videos.

Using a statistical technique called topic modeling, Electoral GPS showed which subjects each candidate prioritized, in weekly editions and in special issues. Over the course of the race, the tool categorized more than one million words.

We learned, for instance, that the candidate Jair Bolsonaro prioritized criticizing his opponents. That was his main topic in 9 out of 10 weeks of the campaign. Fernando Haddad, Bolsonaro’s main opponent, focused on proposals and on messages in support of Luiz Inacio Lula da Silva, his political mentor, who had been arrested months earlier.

The race for governor of Sao Paulo was analysed in the same manner as the presidential campaign.

Electoral GPS provided straightforward texts and infographics summing up the race for those who did not have time to read and watch all the candidates’ content, and it also showed shifts in each candidate’s strategy.

What makes this project innovative?

Electoral GPS posed two big challenges. The first was collecting all the information that each campaign spread across multiple platforms. To do so, we wrote code to collect the content posted by each campaign on Facebook and Twitter. We also captured and stored TV spots and videos posted on YouTube, using Google’s Speech-to-text API to transcribe the audio of these videos to text. By the end of the first-round campaign, we had transcribed more than 95 hours of video. With this huge dataset in hand, the second challenge was to make sense of all this information. We decided to use topic modeling to help us reach a precise overview of the entire campaign. By solving these two challenges, we were able to categorize a whole presidential race, week by week, something close to unique in both journalism and political science.

What was the impact of your project? How did you measure it?

Electoral GPS had one of the highest audiences among all content produced by Folha de S. Paulo (one of the main media outlets in Brazil) regarding the presidential campaign, both in terms of page views and retention time. It also had one of the highest audiences for Folha’s data journalism desk in 2018.

The tool’s findings also supported electoral analyses by a range of sources, including other media outlets.

For instance, Marcelo Leite, one of the most respected science columnists in the country, used Electoral GPS findings to criticize the lack of environmental policy proposals in the race (even from Marina Silva, an activist in this field).

The tool was also mentioned by outlets other than Folha, such as Bandnews (a broadcast and online outlet) and the A Tribuna newspaper (a local media outlet in the Northern region).

Source and methodology

Starting in early August, we collected the content produced and released by each presidential campaign. Initially, the only content available was the speeches each candidate delivered at the party convention that confirmed their bid, along with the letters of proposal submitted by each campaign. On August 16th, the official start of the presidential race, we began collecting the content that each campaign produced and posted on Facebook, Twitter and YouTube. On Facebook, only text content was collected, since the Facebook/CrowdTangle API does not allow videos to be downloaded. We also recorded the content campaigns produced for TV, which, under Brazilian electoral law, has to be broadcast by all TV outlets twice a day from Monday to Saturday.

On a daily basis, a set of routines checked each platform for new posts and downloaded any new content found. For video content, the audio was extracted and converted to text through Google’s Speech-to-text API. All content was stored in a single relational database that recorded which campaign it came from, when it was posted and in which medium.

Every Monday, we queried the database for content produced in the previous week. We then cleaned the text, stripping punctuation marks and stopwords. The remaining words were stemmed, reducing each one to its radical (stem). To measure the prevalence of different subjects in each campaign’s output, we used topic modelling, which finds groups of radicals that frequently occur close to each other and estimates the proportion of each group in any text source. We treated each group found as a different subject, and we labelled the groups based on the radicals the model associated with each one and on text samples.

The model output was organized into infographics to make it easier for readers to understand, and we also wrote short texts highlighting the main findings. In each edition (published every Thursday from the beginning of the campaign until its last week), we showed our findings only for the top five candidates, according to the most recent poll available.
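The cleaning and stemming step described above can be illustrated with a minimal Python sketch. It is not the project's code (the analysis scripts were written in R): the stopword list, the suffix list and the function names are all hypothetical, and the suffix stripper is only a crude stand-in for a real Portuguese stemmer.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, heavily abbreviated Portuguese stopword list (illustration only).
STOPWORDS = {"a", "o", "as", "os", "de", "da", "do", "das", "dos",
             "e", "em", "para", "que", "um", "uma"}

# Hypothetical suffix list; a real stemmer applies far more rules than this.
SUFFIXES = ("ções", "ção", "mente", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and digits, remove stopwords, then stem."""
    text = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def stem_counts(texts):
    """Count how often each stem appears across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(preprocess(text))
    return counts
```

A topic model would then take stem counts like these, per document, as its input; the sketch stops before that step.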

Technologies Used

Data collection routines were written in Python 3.6. These routines ran as containerized microservices on Docker 18.03, as did the relational database, which ran on PostgreSQL 9.6. As mentioned before, we used the Google Speech-to-text API to convert the audio extracted from video content to text. We also used Google Cloud services to store the audio files generated in the process, along with the original video files. For data analysis, we wrote scripts in R 3.5.0. Topic modelling was done with the stm package for R.
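As an illustration of how a single relational table can record the campaign, date and medium of each item and then serve the weekly query, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table name, column names and function names are assumptions, not the project's actual schema. The weekly window (the seven days before a given Monday) is also an assumption about how "the past week" was defined.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical schema: one row per collected item.
SCHEMA = """
CREATE TABLE IF NOT EXISTS content (
    id        INTEGER PRIMARY KEY,
    campaign  TEXT NOT NULL,
    medium    TEXT NOT NULL,
    posted_on TEXT NOT NULL,
    body      TEXT NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_item(conn, campaign, medium, posted_on, body):
    """Store one item; posted_on is a datetime.date saved as an ISO string."""
    conn.execute(
        "INSERT INTO content (campaign, medium, posted_on, body) "
        "VALUES (?, ?, ?, ?)",
        (campaign, medium, posted_on.isoformat(), body),
    )
    conn.commit()


def previous_week(conn, monday):
    """Return items posted in the seven days before the given Monday."""
    start = monday - timedelta(days=7)
    rows = conn.execute(
        "SELECT campaign, medium, posted_on, body FROM content "
        "WHERE posted_on >= ? AND posted_on < ? ORDER BY posted_on, id",
        (start.isoformat(), monday.isoformat()),
    )
    return rows.fetchall()
```

Storing dates as ISO strings keeps the comparison in the query correct, because ISO dates sort in chronological order.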

Project members

Marina Merlo, Simon Ducroquet, Leonardo Diegues and Guilherme Garcia.

