Project description

The Wall Street Journal is proud to nominate this project for Investigation of the Year. As part of a series called “Hidden Influence,” WSJ reporters revealed for the first time clandestine efforts to pile thousands of fake public comments onto federal dockets for new regulations and policy proposals. Statements of support or in opposition to a policy change or proposed rule can influence outcomes of regulatory decisions that affect millions of people.

Using modern scientific survey techniques, Journal reporters sent nearly 1 million email surveys and received thousands of responses identifying individuals who said their identities were used to file comments that they did not write or authorize — a federal crime.

The Journal identified thousands of individuals who said comments about a “net neutrality” rule were not theirs, and found that hundreds of thousands of templated, cut-and-paste comments were more than 80% fake, with a margin of error of less than 3 percentage points.

For example, one comment that began “The unprecedented regulatory power the Obama administration imposed on the internet is smothering innovation” has been posted on the FCC website more than 818,000 times. The Journal sent surveys to 531,000 email accounts associated with that comment. More than 7,000 bounced back, the accounts defunct. Of the 2,757 who responded, 1,994, or 72%, said the comment was falsely submitted. The survey’s margin of error was plus or minus 1.86 percentage points.
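For reference, the published margin of error is consistent with the standard formula for a 95% confidence interval under the conservative assumption p = 0.5, with a finite-population correction. The short Python sketch below illustrates the arithmetic; it is an illustration, not the survey vendor’s code:

    import math

    def margin_of_error(n, p=0.5, z=1.96, population=None):
        """Approximate 95% margin of error for a simple random sample.

        p = 0.5 is the conservative (maximum-variance) assumption; if a
        population size is given, a finite-population correction is applied.
        """
        moe = z * math.sqrt(p * (1 - p) / n)
        if population:
            moe *= math.sqrt((population - n) / (population - 1))
        return moe

    # 2,757 responses drawn from roughly 524,000 deliverable addresses
    # (531,000 surveyed minus the ~7,000 that bounced), as described above.
    print(round(100 * margin_of_error(2757, population=524_000), 2))  # ~1.86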

Comments opposing a rule designed to rein in abuses in payday lending, a rule fought by the short-term loan industry, also proved to be fake.

The Journal emailed about 13,000 surveys to those posting comments to the Consumer Financial Protection Bureau site, receiving back about 120 completed ones. Four out of 10 said they didn’t send the comment associated with them. These comments opposed the new regulations.

The Journal also exposed 200,000 “unique” comments the CFPB posted on its payday-lending proposal. They weren’t entirely unique. More than 100 sentences opposing the payday rule each appeared within more than 350 different comments.
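Counting that kind of sentence reuse is conceptually simple. The Python sketch below shows one illustrative way to do it; it is not the code the Journal or Quid actually used, and the comments list stands in for the real CFPB comment texts:

    import re
    from collections import Counter

    def repeated_sentences(comments, min_comments=350):
        """For each sentence, count how many distinct comments contain it."""
        counts = Counter()
        for text in comments:
            # Naive sentence split; a set so each comment counts a sentence once.
            sentences = {s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()}
            counts.update(sentences)
        return [(s, n) for s, n in counts.most_common() if n >= min_comments]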

What makes this project innovative?

The main innovation is this: we are unaware of a previous instance in which polling has been used in investigative reporting to uncover fraud, in this case the misappropriation of people’s identities to support causes with which they disagree.

We began with the challenge of confirming reports of fake posts among 23 million comments about the “net neutrality” repeal proposed by President Trump’s chairman of the Federal Communications Commission. First we had to find them. Efforts by other advocacy and journalism organizations were limited: one group invited people to report fake comments through its website and received only a dozen or so responses; others tried contacting commenters individually, with equally small results. We set out to find a way to do this on a larger scale.

We also believed that this was likely not the first time that fake comments had been posted on a government website.

The Journal decided to combine modern polling techniques with mass-emailing software while digging into dockets using investigative instincts. First, we gathered comments and email addresses from government websites. (We used Python to scrape regulatory dockets and to extract contents and metadata from PDFs.) Then we consulted with Mercury Analytics, which also is a frequent subcontractor for the WSJ/NBC News Poll. Mercury helped devise the survey questions and tabulate results. Then the Journal emailed more than 950,000 surveys to people whose names and emails were associated with comments on websites.
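As an illustration of the PDF step (not the Journal’s production scripts), the text and document metadata of a comment filed as a PDF can be pulled out with an open-source library such as pypdf:

    from pypdf import PdfReader  # open-source PDF library; one of several options

    def read_comment_pdf(path):
        """Return the text and document metadata of a comment filed as a PDF."""
        reader = PdfReader(path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        meta = reader.metadata  # author, creation date, producing software, etc.
        return text, dict(meta or {})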

In addition, the Journal partnered with a firm that analyzes massive data sets and uses natural language processing to help identify potential fakes that we could then survey. The firm, Quid, which has also worked with the Knight Foundation analyzing comments on regulatory dockets, turned to looking for anomalies in the payday-lending and “net neutrality” dockets. Quid helped us identify large sets of comments that were identical or so similar that, statistically, they had to be computer-generated; we surveyed those comment sets and found dozens of fakes.
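Quid’s software is proprietary, but the general idea can be sketched with standard open-source tools. The illustrative Python below flags near-duplicate comments using TF-IDF vectors and cosine similarity; at the scale of millions of comments, approximate techniques such as MinHash would be needed, but the principle is the same:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def flag_near_duplicates(comments, threshold=0.9):
        """Return index pairs of comments whose texts are nearly identical."""
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(comments)
        sims = cosine_similarity(tfidf)  # pairwise similarity matrix
        return [(i, j)
                for i in range(len(comments))
                for j in range(i + 1, len(comments))
                if sims[i, j] >= threshold]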

What was the impact of your project? How did you measure it?

The Wall Street Journal received dozens of responses from readers, attracted the attention of some in the administration and in Congress, and received prominent play in other media, including MSNBC’s Rachel Maddow show (http://www.msnbc.com/rachel-maddow/watch/american-identities-hijacked-to-fake-support-for-trump-policies-1128124483978) and NPR.
The day after publication, some lawmakers began calling on the Federal Communications Commission to add new safeguards against fraudulent public online comments.

Also in the wake of the report, the U.S. Government Accountability Office launched an investigation into fake comments and stolen identities used to comment on proposed federal regulations at various U.S. agencies.

In addition, the top Democrat on the U.S. House Energy and Commerce Committee, citing the Journal’s report, asked the Trump administration to investigate fake online comments.

Reporters also heard from activists in Ohio who said they had asked the Federal Energy Regulatory Commission to investigate the unauthorized use of names to support a gas pipeline, but had not heard from investigators.

The reporters also heard from Dr. John Woolley, a professor of political science at the University of California, Santa Barbara, and co-director of the American Presidency Project, who said he was immediately assigning the investigation to his graduate seminar. “Congratulations to both of you on a fantastically interesting article about online regulatory comments,” Dr. Woolley said in an email. “As a social scientist it makes me envious of the resources you were able to draw on and creativity of the research.”

Source and methodology

The Journal collected 10.1 million comments from several agencies. These usually included a commenter’s name and email address, the text of the comment and the date it was submitted. They also sometimes included a street address, time of submission and email-routing information.

For example, we filed a Freedom of Information Act request to the CFPB and received more than 1 million comments (334 gigabytes) on a hard drive paid for by the Journal and purchased by the government for data-security reasons.

We wrote web-scraping programs to collect the comments, which sometimes came as text and sometimes as PDFs. We also wrote programs to cull data and metadata from text and PDF collections, and to check commenters’ email addresses against a database of identities known to have been exposed in well-known commercial hacks.
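The breach check can be as simple as a set lookup. The sketch below is illustrative Python, and breached_emails.txt is a hypothetical local extract of such a database, not a file from the project:

    def load_breached(path="breached_emails.txt"):
        """Load a hypothetical local extract of addresses exposed in known hacks."""
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def flag_compromised(commenter_emails, breached):
        """Return the commenter addresses that appear in the breach list."""
        return [e for e in commenter_emails if e.strip().lower() in breached]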

Quid Inc., a San Francisco firm that specializes in analyzing large collections of texts, assisted the Journal by flagging batches of identical or strongly similar comments for analysis.

Working with Mercury Analytics, we set out to determine whether people credited with submitting comments had actually submitted them, and thus whether their identities may have been compromised. The Journal asked participants how they felt about net neutrality, how they felt about the comment, and whether they had submitted the comment.

The Journal was able to identify:
· The percentage who actually submitted the comments;
· The percentage who did not submit the comments but agreed with them;
· The percentage who did not submit the comments and did not agree with them.

Together, the last two groups made up those whose identities had been compromised.
We found that a significant share indicated that they had not submitted the comments and were upset by the misuse of their identities, regardless of whether they agreed with the comment (see the sketch below).
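Once the responses were in hand, the tabulation itself was straightforward. The sketch below is illustrative Python with hypothetical field names (“submitted” and “agrees”), not Mercury’s tabulation code:

    from collections import Counter

    def tabulate(responses):
        """Classify each survey response into one of the three groups above."""
        groups = Counter()
        for r in responses:
            if r["submitted"]:
                groups["submitted"] += 1
            elif r["agrees"]:
                groups["did_not_submit_but_agrees"] += 1
            else:
                groups["did_not_submit_and_disagrees"] += 1
        total = sum(groups.values())
        return {group: round(100 * n / total, 1) for group, n in groups.items()}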

Technologies Used

As mentioned above, we used Python to scrape dockets and to extract contents and metadata from PDFs. We also used C++ to extract email header fields from an Outlook file obtained in response to a public records request. We used Excel and SQL Server to store and analyze batches of comments.

Dow Jones’ chief software architect used the StrongView software stack to send the emails once a custom URL for each individual was added to a CSV file of email addresses. Mercury was then able to present each respondent with the exact comment attributed to that person.
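Preparing that mailing file is straightforward. The Python sketch below shows one way to pair each address with a unique survey URL; the base URL and token scheme are placeholders for illustration, not the ones actually used:

    import csv
    import uuid

    def build_mailing_file(emails, out_path="mailing.csv",
                           base_url="https://survey.example.com/r/"):
        """Write a CSV pairing each email address with a unique survey URL."""
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["email", "survey_url"])
            for email in emails:
                token = uuid.uuid4().hex  # unique token tying respondent to their comment
                writer.writerow([email, base_url + token])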

Quid used its proprietary software to analyze the text of millions of comments to find identical or similar comments and analyze their properties.

Mercury Analytics used its proprietary software to host the online survey and to collect and analyze the responses.

Project members

Paul Overberg
Shane Shifflett
Heather Seidel
