Project description

AP’s Data Team worked on an enormous breadth of projects in 2018, ranging from writing news stories about health, housing, the environment, state government policies and inequality to producing a number of news tools that our newsroom and others could benefit from.

Our submission for best data team includes only a small sample of what we worked on in 2018. Here’s a bit of what’s missing: open-sourcing tools like Datakit to help newsrooms with project management; creating the interactive that displays results of AP’s proprietary Vote Cast voter-polling system for news members; building SunHub, a tool for open-government advocates and journalists to track state legislation that limits public access to government data, documents and meetings; and writing and performing analysis on more than 50 breaking, enterprise and long-term investigative news stories, a number of which brought lasting policy changes.

Another part of our work is difficult to show through a handful of story submissions: the way our impact multiplies. For 16 of these stories, we shared our data, documentation, interactives and work with hundreds of AP member organizations to use in their own local reporting. Our Data Distribution program spurred hundreds of stories from other news groups, from tiny radio stations to large metro newspapers. We look for projects with enough geographic granularity to help news organizations tell our national stories at a local level. We host webinars to walk reporters through the data and share all our methodology and documentation. We also answer reporters’ questions and help guide them in using the data during the multi-week embargo period they have to work with it before the AP story publishes. The program was designed to help AP members, many of which have shrinking local newsrooms, tell important local stories that would be difficult to get without data analysis help. In 2018 it also became available, for a small fee, to non-AP member newsrooms.

The stories and projects included in this submission are a fraction of the work our 12-member team did in 2018, but they are also a strong representation of what we work toward and where we focus. In the case of the story about NRA grants to schools, the AP story and the local stories that grew out of a data distribution prompted almost a dozen school systems (including in Pennsylvania, Florida, Colorado and New Mexico) to stop accepting NRA money. Our analysis of migrant child shelters found that youth populations in the detention centers had skyrocketed and that the federal government was spending more than $1 billion a year to support them. It sparked Congressional hearings and changes in the way the federal Health and Human Services agency operated, ultimately leading to many of the children being released. Our team aims for analyses that other organizations aren’t able to do, tools that help newsrooms understand their communities and do work they wouldn’t be able to do otherwise, assistance to smaller newsrooms in using data more frequently, and interactives that clearly represent data to the largest audience possible.

What makes this project innovative?

AP's data team stands alone in the broad range of work we tackle -- from creating news interactives to writing stories to open-sourcing some of our work to creating newsgathering tools that the AP and other newsrooms can use. We share our work with members and the broader news community. We create the interactive displays of election results used by hundreds of AP customers on election night, but also do massive 10-month investigative projects. We use advanced statistical techniques, from machine learning to natural language processing, and put technology to work wherever it helps, whether that means building major mapping projects or scraping hard-to-get data.
Our Data Distributions program for AP members, other news organizations and university classrooms (for teaching purposes) takes all that innovation a step further: it shares AP's best practices and technological know-how with members who might not have a dedicated data analyst in the newsroom. It teaches the next generation of journalists how to work with data. And it builds on AP's business model as a news cooperative, in that it assists our members where they need it the most -- taking complex data and turning it into impactful local stories.
Our team is composed of software developers, front-end programmers, a DevOps technologist, interactive and graphic designers, and analysts, including two with advanced statistical experience outside journalism. But we are all, first and foremost, journalists. We seek out new ways to tell stories. When HUD proposed a change in the way it calculates tenant rent, for instance, one of our analysts realized the change could have a huge impact on roughly 4 million tenants. He sought out individual-level tenant income and rent data held under a research agreement with a think tank, designed a multi-pronged analysis that took into account the multiple ways the changes would affect rental calculations, and answered his own question: the rent hikes would be enormous, averaging nearly 30% for tenants in some cities. AP also shared the data so that members could use their local figures in stories about impacts closer to home. The day after his story ran, HUD rolled back its changes.
This is how we use our technological skills: to answer pressing journalism questions about our world, and to help other reporters explain their communities in a better way. Then we share our work with our members, performing the AP's historical role as a cooperative news agency, just through new and sophisticated means.

What was the impact of your project? How did you measure it?

The AP isn't a traditional newsroom when it comes to measuring impact, since our stories and work run in such a wide variety of places and formats. The AP Data Team measures its impact in multiple ways: the breadth of our story play; our effect on societal change or conversation; our members' use of our data to tell local stories; and our influence in setting the best practices of data journalism, as seen in the adoption of our tools, workflow and practices.
By all these measures, the Data Team had an amazing 2018. Each story in this nomination ran in hundreds of publications, our data was downloaded by member news organizations more than 1,400 times for use in hundreds of local stories, our stories spurred local and national investigations and reversed some government policies, and our methods and practices were widely lauded as industry-leading, sparking conversation at a number of conferences.
The life expectancy project alone shows the pattern. The AP-written piece received more than 140,000 views, at an average engagement time of more than a minute. The AP story also ran on the websites of most major news sources, including the New York Times, Washington Post, Yahoo News, NBC and many others. Meanwhile, newsrooms such as the Cleveland Plain Dealer, Philadelphia Inquirer, Detroit News, MassLive, KREM, Newsday, Houston Chronicle, Victoria Advocate, McAllen Monitor and many others used AP's data distribution program to create their own stories and graphics. The data has been an invaluable service for this story and many others, prompting testimonials from several editors about AP's help in telling local stories in a better and more interesting way.
George Rodrigue, Editor & President, Cleveland Plain Dealer, 2019: “AP’s data service is invaluable for spotting local stories we otherwise would have missed. Last year, one database showed a 20-year gap in life expectancy between an inner-city census tract adjoining our world-class hospital complexes and a rich tract in the suburbs. Because the AP had already cleaned and formatted the data, it took us only two days to research and write the article. The disparity became the talk of the region.”
Stan Wischnowski, Executive Editor, Philadelphia Inquirer: “We found the AP data very helpful in describing life expectancy by neighborhood, which is critical in a city like ours with such sharp income divides. We were able to show how economic inequality and inequality of opportunity can manifest itself in huge differences in life expectancy. Data like this not only elevates our reporting but also allows for citizens to become more informed about their specific neighborhoods. That hyper-local information is invaluable.”
We did 16 of these data distributions in 2018, and others changed laws and sparked policy debates. HUD dropped plans to change its method of calculating tenant rent the day after an AP story and data distribution ran showing that the average tenant would see a 20% increase in rent, and that residents of some cities would see far greater increases. Having the data in the hands of local reporters allowed the Detroit News to confront HUD Secretary Ben Carson with the numbers during his trip that day to Detroit, and the agency announced it was backing off its plans soon after. Meanwhile, a partnership with Reveal from the Center for Investigative Reporting, in which the AP shared data on mortgage redlining with members, sparked multiple investigations in Pennsylvania and other states, and a Congressional hearing.
Roughly three dozen news organizations and open government organizations used our tool SunHub to assess and track government legislation at the state level. SunHub generated several news stories on its own by allowing us to assess the types and quantity of legislation across all states in 2018.
Beyond the impact of our stories and newsgathering tools, we strive to set industry-standard best practices for data analysis and technology teams in newsrooms. As part of this goal, we released Datakit as an open-source project management platform to help other newsrooms better organize their project workflows. We described our agile practices, such as daily standups and iterative cycles of development, at conferences and saw them adopted by several newsrooms. Still other news organizations adopted our less-technical practices, such as our 15-minute rule, which requires team members to ask for help after working on a particular technical problem for more than 15 minutes.

Source and methodology

Life Expectancy story: AP obtained data from the United States Small-Area Life Expectancy Project, a partnership between The Robert Wood Johnson Foundation, The National Association for Public Health Statistics and Information Systems (NAPHSIS), and the National Center for Health Statistics (NCHS). We used American Community Survey figures at a tract level to perform a regression analysis, and then used clustering techniques to identify areas where tracts were significantly different from their neighboring areas. Tract-level data was shared with AP members.
NRA grant story: The AP pulled seven years' worth of data from Schedule I forms out of the IRS' public XML files, formatted and standardized the data across years, and then used machine learning to classify the types of grantees, identifying schools and other community organizations of interest. Grant data was shared with AP members.
HUD rent changes story: The nonpartisan Center on Budget and Policy Priorities (CBPP) conducted the analysis at The Associated Press’ request. CBPP has access to 2016 household-level renter data from HUD under a research agreement. The impact is calculated by directly applying the proposed rent formula to households and contrasting the result with current rent. CBPP then aggregated the household data at a state level and for the 100 largest metropolitan statistical regions. The data was shared with AP members.
Migrant children story: The shelter headcount and location data was hand-collected and obtained via a number of FOIAs through a partnership between the AP, Reveal from the Center for Investigative Reporting and the Texas Tribune. Data for federal funding comes from the Department of Health and Human Services' Tracking Accountability in Government Grants System. The specific funding covers grants awarded under the "Unaccompanied Alien Children Program". It was aggregated by the AP and matched to individual shelters. The data was shared with AP members and updated periodically through the rest of the year, with a major headcount update in December when the AP obtained leaked data on weekly headcount figures at each shelter. AP verified the headcounts against state-maintained counts in several states and with sources and Congressional leaders who had seen small portions of the data.
State of the Union: The AP used the texts of a number of public speeches by President Trump and other presidents to perform a text analysis.
Blacks left out of high-paying jobs story: The AP used Bureau of Labor Statistics and American Community Survey data to create job-specific ratios of black workers to their white counterparts, and then classified the jobs by type and by median salary.
Thirty years of global warming: The AP used data from the U.S. Climate Divisional Dataset provided by the NCDC/NOAA. Emissions data comes from the U.S. Energy Information Administration. In calculating the temperature difference between the present day and the first half of the 20th century, the AP adopted the methodology used by the U.S. Global Change Research Program in the 2017 Climate Science Special Report. For each geography, a base average was created using monthly temperature data from 1901 to 1960. An average was then created with monthly temperature data from 1988 to 2017. The difference between the present 30-year period and 1901-1960 is the temperature change.
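The temperature-change calculation described for the global warming story can be sketched in a few lines. This is a minimal illustration in Python, not the AP's own code (the team's analysis work is done in R). The function names, the dictionary keyed by (year, month), and the sample values are all hypothetical, and the sketch assumes a plain average over every monthly value in each period.

```python
from statistics import mean


def period_mean(monthly, start_year, end_year):
    """Average every monthly temperature whose year is in [start_year, end_year].

    `monthly` is a hypothetical layout: a dict mapping (year, month) -> temperature.
    """
    values = [
        temp
        for (year, _month), temp in monthly.items()
        if start_year <= year <= end_year
    ]
    if not values:
        raise ValueError(f"no data between {start_year} and {end_year}")
    return mean(values)


def temperature_change(monthly, base=(1901, 1960), recent=(1988, 2017)):
    """Recent-period average minus base-period average for one geography."""
    return period_mean(monthly, *recent) - period_mean(monthly, *base)


# Made-up placeholder values, only to show the mechanics (not real climate data):
# a flat 50.0 degrees through 1960 and a flat 51.5 degrees afterward.
synthetic = {}
for year in range(1901, 2018):
    for month in range(1, 13):
        synthetic[(year, month)] = 50.0 if year <= 1960 else 51.5

change = temperature_change(synthetic)
print(f"Temperature change: {change:+.1f} degrees")
```

In practice the same function would be run once per geography (for example, per climate division or state), each with its own monthly series.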

Technologies Used

For data analysis, our team works in R, with the help of a Python-built project management system that runs off the command line. The project management system, Datakit, creates a uniform project structure, links to GitLab and our Amazon S3 buckets for hosting static data sets, and also creates a pipeline to push data distributions directly to the third-party service that hosts them, a service we've worked closely with since its inception. There, members have access to our datasets, documentation, data dictionaries, a discussion platform and an in-browser SQL platform that allows us to pre-write queries to filter and join datasets. Our analysis team makes full use of R packages for mapping and spatial analysis, natural language processing, machine learning, cluster analysis and statistical tests such as t-tests and regressions. Our news applications developers work in a combination of Ruby and Python. We built some newsroom-wide and public-facing apps in Django. We handle OCR and document dumps in a combination of DocumentCloud, Open Semantic and ABBYY FineReader. Our election maps and election results -- which included for the first time an interactive that displayed results of our new polling system, Vote Cast -- and our interactives are built in JavaScript and D3.js. Our data lives in flat files on Amazon S3, and our code resides in repos on GitLab. We run team communications through Slack -- including several bots, one of which asks us to use emojis to assess team health, code quality, and external and internal communications each week -- and we abide by Agile processes in our daily workflow.
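As an illustration of the kind of pre-written query mentioned above, here is a minimal sketch that filters a national table down to one state and joins it to a lookup table. It uses Python's built-in sqlite3 module purely as a stand-in. The hosting platform's SQL dialect, the table and column names, and the rows are all assumptions made up for this example.

```python
import sqlite3

# Hypothetical pre-written query: join tract rows to their county and keep one state.
LOCAL_SLICE_QUERY = """
    SELECT t.tract_id, c.county_name, t.value
    FROM tracts AS t
    JOIN counties AS c ON c.county_id = t.county_id
    WHERE c.state = ?
    ORDER BY t.tract_id
"""


def local_slice(conn, state):
    """Return the rows of the national table that belong to one state."""
    return conn.execute(LOCAL_SLICE_QUERY, (state,)).fetchall()


# Build a tiny in-memory database with placeholder rows (not real data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE counties (county_id TEXT PRIMARY KEY, county_name TEXT, state TEXT);
    CREATE TABLE tracts (tract_id TEXT PRIMARY KEY, county_id TEXT, value REAL);
    INSERT INTO counties VALUES ('c1', 'Example County', 'OH');
    INSERT INTO counties VALUES ('c2', 'Sample County', 'PA');
    INSERT INTO tracts VALUES ('t1', 'c1', 1.0);
    INSERT INTO tracts VALUES ('t2', 'c1', 2.0);
    INSERT INTO tracts VALUES ('t3', 'c2', 3.0);
    """
)

rows = local_slice(conn, "OH")
for row in rows:
    print(row)
```

A reporter would only need to change the state parameter to pull the local slice of a national dataset, which is the point of shipping the query alongside the data.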

Project members

Troy Thibodeaux, data science and news applications editor; Meghan Hoyer, data editor; Serdar Tumgoren, news apps team leader; Justin Myers, news automation editor; Bob Weston, Dan Kempton and Seth Rasmussen - news applications developers; Michelle Minkoff, Angeliki Kastanis, Larry Fenn and Nicky Forster - data journalists; Maureen Linke, interactive graphics

