This is my individual portfolio of some of my best work this year.
R for Journalists is a blog that explains how the R programming language can be used for data journalism. It has two goals:
1. Be as accessible as possible to beginners with R
2. Use worked examples and focus on practical application rather than theory
The website is aimed at journalists who want to employ data journalism and specifically R more in their day-to-day work.
The goals of the weekly print pages are:
1. To provide interesting data-led content that is relevant to our weekly newspapers’ patches
2. To be as automated as possible
3. To be as simple as possible to accomplish
The remaining work forms part of my regular data journalism work. This always has the same goal: to provide interesting data-led content for our general interest readers.
Full details of each part of the portfolio can be found in the main link.
What makes this project innovative?
One of my most innovative uses of R is with our weekly print pages. I don’t believe anyone else in data journalism is regularly parsing millions of rows of data to create dozens of automated, individualised print pages about interesting datasets every week.
Prisons under Pressure uses an innovative design to show the situation in each prison. It's designed to evoke the bars of a cell, with padlocks that the user can 'unlock' with a click to present a chart showing how the prison ranks against other institutions around the country.
The summer-born children and High Speed Rail projects also shone new light on much-discussed issues. No one had previously traced the inequalities of our early years education system all the way through to higher education. Likewise our investigation into train speeds confirmed with data what many grumbling passengers had long suspected while stuck on trains: that in many parts of the country they run far slower than they should.
What was the impact of your project? How did you measure it?
R for Journalists has received almost 5,000 page views and its Twitter account has more than 200 followers. On the strength of the website I was invited to run an R workshop at a data journalism conference organised by Paul Bradshaw.
The timing of our High Speed Rail story coincided with the news that the Government was scaling back its promised electrification of certain rail lines outside London. This gave the story extra impact - it provided data-driven evidence to back up our readers’ and titles’ frustration at broken promises and continued under-investment in transport outside London. A Plymouth MP also picked up on our story about the poor state of his city's rail connections with the rest of Britain.
Source and methodology
For Prisons under Pressure I collated the data in Google Sheets and ranked each institution on how it compared with other prisons in England and Wales, and on whether conditions at that facility were worsening over time. I used both absolute numbers and rates per 100 prisoners to make fair comparisons.
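The ranking step can be sketched in a few lines of R; the prison names, figures and column names below are invented purely for illustration.

```r
# Toy stand-in for the collated prisons data (hypothetical figures)
prisons <- data.frame(
  prison     = c("HMP A", "HMP B", "HMP C"),
  assaults   = c(120, 45, 80),
  population = c(1000, 300, 900)
)

# A rate per 100 prisoners makes small and large jails comparable
prisons$rate_per_100 <- prisons$assaults / prisons$population * 100

# Rank 1 = highest rate, i.e. the worst-performing institution
prisons$rank <- rank(-prisons$rate_per_100)
```

Ranking on the rate rather than the raw count stops the largest prisons from automatically topping the table.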
To obtain the data for the summer-born children project I sent Freedom of Information requests to all universities in the Russell Group.
I collated the data and worked out the proportions of undergraduates born in January, February, March and so on.
As a baseline I took 20 years of birth data from the Office for National Statistics.
I found, for example, that 9.3 per cent of students at Oxford and Cambridge universities were born in September, compared with 8.6 per cent of all births. Although 8.7 per cent of babies are born in July, only 7.9 per cent of Oxbridge students were born in that summer month, suggesting the slight disadvantage for summer-born children that was the central thrust of my story.
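The comparison amounts to simple arithmetic on the two sets of proportions; the shares below echo the figures quoted in the story, and the month labels are just a convenience.

```r
# % of all births in each month (ONS baseline over 20 years)
ons_share     <- c(Sep = 8.6, Jul = 8.7)
# % of Oxbridge undergraduates born in each month (from the FOI returns)
student_share <- c(Sep = 9.3, Jul = 7.9)

# Positive gap = over-represented month; negative = under-represented
gap <- student_share - ons_share
```

September comes out 0.7 percentage points over-represented and July 0.8 points under-represented, which is the gap the story turns on.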
For the High Speed Rail project I scraped the data on train times from TheTrainLine using Outwit Hub.
I downloaded the coordinates for the railway stations from publicly available Government data.
I used R, adapting an Excel version of the Haversine formula, to calculate the straight-line distances between the railway stations in Britain’s largest cities.
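A great-circle Haversine implementation in R might look like the sketch below; the station coordinates in the example are approximate, and the exact spreadsheet formula the project adapted may differ in detail.

```r
# Great-circle distance in kilometres between two lat/lon points
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius
}

# London Euston to Birmingham New Street, as the crow flies (~160 km)
dist_km <- haversine_km(51.528, -0.134, 52.478, -1.899)
```

Dividing the scraped journey time by this distance gives an average speed for each route, which is what lets slow lines be compared fairly across the country.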
The data for the weeklies print pages comes from data.police.uk. I merged all the spreadsheets in R and assigned the relative local authority areas to our respective titles. I then filtered the data in R to isolate the crimes in each title's patch and ran some summary tables to work out the overall picture for each title. I also isolated each title's five worst streets, using Google's API to work out their postcodes.
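The filter-and-summarise step can be sketched in base R; the data frame below is an invented stand-in for the merged data.police.uk download, and the column names are assumptions rather than the real schema.

```r
# Invented stand-in for the merged monthly crime files (columns assumed)
crimes <- data.frame(
  local_authority = c("Bury", "Bury", "Rochdale", "Bury"),
  street          = c("High Street", "Mill Lane", "High Street", "High Street"),
  crime_type      = c("Burglary", "Burglary", "Violence", "Violence")
)

# Isolate one title's patch, then build its summary tables
patch         <- crimes[crimes$local_authority == "Bury", ]
crime_summary <- sort(table(patch$crime_type), decreasing = TRUE)

# The title's worst streets by recorded offences (top five)
worst_streets <- head(sort(table(patch$street), decreasing = TRUE), 5)
```

Running the same pipeline once per title is what makes the weekly pages close to fully automated: only the input files change from week to week.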
I exported CSV and TSV files for our developer Carlos Novoa, who ran the data through Photoshop to generate the print PDFs.