Project description

These are five projects I led as a data journalist at CBC over the past year. Here are brief descriptions.

1. Looking through nearly 10 million troll tweets released by Twitter, we found about 22,000 that seemed to target Canadians specifically, on the topics of immigration and pipelines. Not a huge number, but it showed that trolls know exactly which issues are most divisive in Canada right now, and that they could ramp up their activity during an election year.

2. For the first time in 50 years, the Quebec provincial election was won by a party other than the two big parties that had kept taking turns at the helm. To better understand how the election was decided, we compared poll results in each electoral riding with several census measures, found a few that correlated, and presented these in a hexagonal cartogram.

3. To help Montrealers make sense of open crime data, we clustered crimes into a hexagonal grid and aggregated them by month to show trends over time. We did this with the guidance of a criminologist, who advised us on how to present the data in a way that doesn’t spark panic.

4. Quebec’s electoral map changed radically in the last provincial election, and we created a guided animated map that walked viewers through some of the biggest changes. Mapping results by polling sector, we were able to see how the winning party swept the suburbs and industrial regions, while an up-and-coming leftist party grew from humble origins in an artsy neighbourhood to dominating the centre of Montreal.

5. The 2018 provincial election in Quebec surprised a lot of people, but maybe it shouldn’t have. Using data from Vote Compass, a tool that compares your values to those of political parties, we found that on most issues, the average user response was most closely aligned with that of the winning party.

What makes this project innovative?

1. It was the first time in Canada that evidence was found of foreign influence campaigns on social media. The tools used were also sophisticated. The large size of the data required heavy-duty software: Python's pandas library and lots of regular expressions to find troll tweets that mention Canada-specific issues. The way we shared our findings and methodology was also new for Canadian media: the code and resulting datasets were published on GitHub in a highly annotated Jupyter notebook.

2. We used one of the basic social science methods to find correlations between vote results and census measures: the Pearson correlation. This helped us narrow down which census measures we would focus on in our storytelling. This was also the first time we used a cartogram to visualize election results.

3. This was the first time crime data was presented in this way. In fact, it was the only audience-facing news app that any news outlet built with this crime data. It was also the CBC's first fully automated news application: a Python script handles everything, from downloading the data and doing the geographical calculations inside the hexbins to exporting the map files and the charts.

4. This was the first time election results were presented in this kind of step-by-step map tour in Quebec. It made use of Mapbox's "fly to" feature, which sends the viewer on a pleasing animated pan to different parts of a map. We stepped away from the classic electoral map that throws everything at the reader at once, and instead walked them through the most interesting parts of it.

5. This was a novel, data-driven way to gauge public sentiment that wasn't your standard poll.
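To illustrate the keyword search mentioned in item 1, here is a minimal sketch using only Python's standard `re` module. The keyword list, the sample tweets and the helper name `mentions_canada` are invented for this example; the real list had about 70 terms and the real search ran over a pandas dataframe, where the same pattern could be applied column-wise.

```python
import re

# Hypothetical keywords for illustration; the real list had about 70 terms.
KEYWORDS = ["trudeau", "keystone xl", "trans mountain", "cdnpoli"]

# One case-insensitive pattern. re.escape guards against special characters,
# and \b stops a keyword from matching inside a longer word.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def mentions_canada(text):
    """Return True if the tweet text contains any Canada-specific keyword."""
    return PATTERN.search(text) is not None


# Invented sample tweets.
tweets = [
    "Why is #cdnpoli ignoring the Trans Mountain protests?",
    "Good morning from Ohio!",
    "TRUDEAU must answer for this https://example.com/x",
]

canadian = [t for t in tweets if mentions_canada(t)]
```

Compiling a single alternation pattern means each tweet is scanned once, rather than once per keyword, which matters when the input runs to millions of rows.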

What was the impact of your project? How did you measure it?

All the projects generated a lot of conversation on social media and in the CBC comments. The first project, on Twitter trolls, dominated the news cycle in Canada that day. Other news media reported on it and I was invited to several TV and radio shows to talk about my findings. The crime app directly fed new journalism. It allowed us to see crime hotspots and look into them, putting pressure on police to do something about them. We later learned that police started paying more attention to a car theft hotspot, and later releases of the data showed a significant decrease in these crimes. It also helped assuage people's fears by showing that Montreal is a safe city by the numbers, even if sensational crimes make it seem like the opposite. It also won an award from the Radio Television Digital News Association for best data storytelling.

Source and methodology

1. The source of the data was Twitter. After all the datasets were loaded into a single pandas dataframe, I used regular expressions to search them for Canada-specific tweets. For this, we came up with about 70 keywords that are unique to Canada. These include political figures, parties, hashtags used to talk about Canadian issues, media personalities, and controversial issues like specific pipelines and Indigenous matters. The tweets were then cleaned up (stripped of hashtags, @-mentions and URLs) to isolate their plain text.

2. The data sources were the vote results from the Quebec Chief Returning Officer (DGEQ) and census data from Statistics Canada, aggregated to provincial ridings by Elections Quebec. The cartogram was made by Marc Lajoie, a developer-designer at Radio-Canada. We chose about 30 census variables and loaded them with vote results by party into pandas, a data analysis library in Python. We ran a Pearson correlation across all numbers and isolated those with a coefficient of 0.6 or more, or -0.6 or below. This returned about 12 variables. We took a closer look and picked seven that seemed to tell the most interesting story. To visualize it, I wrote a Python script that assigned a colour to a value in the data, and styled the SVG file of the cartogram.

3. The crime data comes from the City of Montreal. A Python script downloads the data every three months from the Montreal open data portal. It uses geopandas to do a spatial join between the crime incident points and the hexbin polygons, effectively doing a point-in-polygon count. The resulting GeoJSON is uploaded to Mapbox, which is used to visualize the data on a map. Another series of functions aggregates the data by year and time of day, and returns JSON files which are used by the charts in the page. And another function compares the data to the same period the year before to signal areas that saw significant increases or decreases in crime. This is used internally to generate new stories.

4. The vote results and the geographic boundary files for the polling divisions came from Elections Quebec. The data came in a separate CSV for each electoral riding. A Python script was used to compile them into a single file and join the data to the boundary shapefiles. This was done for two election years, 2014 and 2018. Both shapefiles were uploaded to Mapbox, which served as the database and web map. The interactivity was built with HTML, CSS and JavaScript using the Mapbox API.

5. The data came from Vox Pop Labs, which created the Vote Compass. They provided 1,000 random responses to the Vote Compass for each of six issues in the election: health, immigration, education, environment, language, and economy. The data was plotted on a scatter plot using Python's Matplotlib library, exported to SVG, and sent to a designer. In addition to the scatter plot, I calculated the Euclidean distance between the average user result and the party result to show in a dot plot that the winning party was, geometrically, closer to the people's will on most issues.
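The correlation screening in item 2 can be sketched roughly as follows. This is an illustration only: it computes the Pearson coefficient in plain Python rather than with pandas (where `DataFrame.corr` uses Pearson by default), and the variable names, the `screen` helper and all the numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def screen(vote_share, census, threshold=0.6):
    """Keep census variables whose correlation with vote share is at or
    beyond the threshold in either direction (r >= 0.6 or r <= -0.6)."""
    results = {name: pearson(values, vote_share) for name, values in census.items()}
    return {name: r for name, r in results.items() if abs(r) >= threshold}


# Invented figures for five ridings, for illustration only.
vote_share = [0.20, 0.30, 0.40, 0.50, 0.60]
census = {
    "median_age": [35, 38, 41, 44, 47],      # rises with vote share
    "pct_renters": [60, 52, 45, 38, 30],     # falls with vote share
    "pct_left_handed": [10, 12, 9, 12, 10],  # no linear relationship
}

strong = screen(vote_share, census)
for name, r in strong.items():
    print(f"{name}: r = {r:+.2f}")
```

With these invented figures, the first two variables pass the threshold and the third is dropped. A screen like this only narrows the field; choosing which of the surviving variables to report remains an editorial decision.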

Technologies Used

1. Python (pandas, numpy, Seaborn), Jupyter notebook, Excel (for sharing the isolated data)

2. Python, Illustrator

3. Python (geopandas), Mapbox

4. Python, Mapbox API, HTML, JavaScript

5. Python, Illustrator, Datawrapper

Project members

Santiago Salcido, Jeff Yates, Andre Guimaraes, Jonathan Montpetit


