Project description

Shane Shifflett’s work through 2018 and early 2019 has combined data analysis with visual and narrative reporting across multiple coverage areas, leading to reforms and impact on several fronts. Shane both led projects and worked in teams focused on core Journal coverage areas to deliver insights that readers could not find in other publications.

At the start of 2018, Shane and another reporter documented widespread fundraising fraud of at least $1 billion by startups that promised unrealistic returns, sometimes using fake or stolen identities. Shane continued reporting on various cryptocurrency-related frauds and published a database of speculative fundraising offers to help Journal readers spot specific red flags associated with nearly 600 suspicious projects. Regulators have since taken action against some companies named in the reports.

As details of Jamal Khashoggi’s murder in a Saudi consulate building emerged, a team of reporters including Shane documented recurrent stock market manipulation by Saudi officials. Using data published on the Saudi stock market’s website, Shane showed a years-long pattern of Saudi officials spending billions of dollars to quietly prop up the country’s stock market and counter selloffs that followed repeated political crises. The findings are critical to Journal readers charged with investing billions in emerging markets like Saudi Arabia’s economy, where the government’s trading activity is largely undisclosed.

More recently, Shane and a team of reporters showed how the largest online marketplace for caregivers, Care.com, used public data to generate tens of thousands of unclaimed profiles for small and medium-sized businesses, some falsely claiming state licenses, and endangered children in the process. Their reports prompted immediate reform by the company.

What makes this project innovative?

Shane’s innovation is using general programming skills to supercharge and scale the traditional reporting process. Each story in his portfolio elevated anecdotal reporting to show systematic fraud or abuse using advanced analysis, and displayed those findings using a combination of narrative, graphics and databases to convey precise details for readers. His reporting process often combines bleeding-edge analysis using machine learning and natural language processing with traditional reporting to establish exhaustive facts. His work has also demanded industrial-scale persistence, including contacting hundreds of sources mentioned in documents he obtained and parsed for insights.

What was the impact of your project? How did you measure it?

Shane’s work has sparked widespread discussion on social media, follow-up coverage from other outlets and reform. For instance, his pieces covering cryptocurrency frequently climbed to the top of a popular tech news aggregator, were hotly debated among industry leaders on Twitter and were cited by researchers. The Securities and Exchange Commission fined one company detailed in his stories and is investigating another. A day after the Journal published an investigation into Care.com's platform, the company removed tens of thousands of unverified listings from the site. In addition, the company is implementing additional screening processes to help keep potential caregivers with criminal backgrounds from using the site. A month on, the piece has been covered by other outlets and at least one high-profile customer, Best Buy, cut ties with the company.

Source and methodology

For crypto-related stories:
- Shifflett acquired the data from crypto exchanges, crypto transaction records, sites that aggregate coin offerings, crypto pricing data and thousands of independent crypto websites advertising their coin offerings.
- Data was obtained at no cost but required more than 100 gigabytes of storage, which varied in monthly cost.
- Data was unorganized and each dataset required extensive sorting, parsing and analysis.

Each story presented unique challenges, including:
- Unique scrapers had to be built for each story. Rate-limiting meant that additional care had to be taken to resume downloads and distribute the data collection so that the millions of data points could be efficiently collected.
- To identify plagiarized documents, the Journal collected thousands of documents, converted them to plain text, tagged hundreds of thousands of sentences and loaded them into a searchable database.
- The Journal created a unique database that tracked market manipulation in real time by crypto trading groups. We had to comb through hundreds of chat histories for announcements of pump events, which were cataloged in a spreadsheet and combined with pricing data for the digital coins.
- The Journal created a database of illicit funds moving through exchanges. To do this, we collected millions of questionable transactions from security researchers, reports and online directories and matched them with accounts held at the exchanges.
- The Journal built databases identifying exchanges, services and token addresses for hundreds of entities transacting with bitcoin and ethereum cryptocurrencies. Reporters created a custom (and likely singular) database of coin offering deals that includes dates, fundraising, copies of their websites, images of their staff and text of documents that discuss mission statements, team biographies and the technical specifics of a project.
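
The plagiarism check described above (convert documents to plain text, tag sentences, make them searchable) can be illustrated with a minimal Python sketch. It is a stand-in built on assumptions, not the Journal's code: it uses a regular-expression sentence splitter rather than spaCy, an in-memory dictionary rather than a database, and exact matching of normalized sentences. The function names, the `min_words` cutoff and the sample texts are all invented for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", sentence.lower()).strip()


def build_index(docs, min_words=6):
    """Map each normalized sentence to the set of document ids containing it.

    Sentences shorter than `min_words` are skipped (an assumed cutoff) so
    that boilerplate such as "Contact us." does not link documents.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for sentence in split_sentences(text):
            key = normalize(sentence)
            if len(key.split()) >= min_words:
                index[key].add(doc_id)
    return index


def shared_sentence_counts(index):
    """Count how many indexed sentences each pair of documents shares."""
    counts = defaultdict(int)
    for doc_ids in index.values():
        for pair in combinations(sorted(doc_ids), 2):
            counts[pair] += 1
    return dict(counts)


# Invented sample documents, not taken from any real coin offering.
docs = {
    "a": "Our token will change global finance forever. "
         "The team has decades of combined experience in banking and software.",
    "b": "We are building a new exchange for everyone. "
         "The team has decades of combined experience in banking and software!",
    "c": "This project is unrelated to the others and shares nothing with them.",
}
counts = shared_sentence_counts(build_index(docs))
print(counts)
```

Pairs with high counts would be candidates for manual review. Exact matching misses lightly reworded copies, so a real pipeline would likely need fuzzier comparison.
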

For Care.com:
- The Journal traversed public directories of Care.com's website listing types of caregivers by state. The Journal generated hundreds of thousands of URLs to obtain listings of caregivers offering their services in every city supported by the company.
- The Journal examined the results to find the five states on Care.com offering the most "day care" providers: California, Texas, Pennsylvania, Florida and New York. The Journal examined profile pages for the 22,421 "day care" providers discovered by following links from Care.com's state and city listings. From each of these provider profile pages, the Journal collected information such as title, address, phone, credentials and reviews.
- The Journal found 7,256 "day care" providers indicating in their list of credentials that they are "state licensed." The Journal then requested databases of currently licensed child care providers from the five states and attempted to cross-reference Care.com profiles with government records using combinations of phone, address and name of the facility listed on each provider's Care.com profile.
- After the story ran, the Journal revisited the profiles it had originally collected and found nearly 45,000 that returned errors indicating they were no longer publicly accessible.

For Saudi market manipulation:
- The Journal downloaded weekly reports for the last three years from the Tadawul's website and used machine-learning techniques to identify and parse out tabular data. Tables were copied to CSV files to aggregate fields, allowing the Journal to identify how much stock the government was buying each week and how much was sold by other types of investors.
- The Journal also obtained intra-day trading data from FactSet and compiled a database of Saudi news events. The Journal created functions to identify anomalous changes in prices and trading volume around timeframes of critical news coverage.
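
The last step above, flagging anomalous trading around news events, could look something like the following Python sketch. The submission does not say which statistical test was used, so the trailing z-score here is an assumption, as are the function names, the `window`, `threshold` and `max_gap_days` parameters, and the sample figures, which are invented and not market data.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def trailing_zscores(values, window=5):
    """Z-score of each value against the `window` values just before it.

    Returns None for positions that do not yet have a full window of history.
    """
    scores = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < window:
            scores.append(None)
            continue
        spread = stdev(history)
        scores.append((value - mean(history)) / spread if spread > 0 else 0.0)
    return scores


def anomalies_near_events(days, values, events, window=5,
                          threshold=3.0, max_gap_days=2):
    """Return (day, z-score) for outliers within `max_gap_days` of an event."""
    flagged = []
    for day, z in zip(days, trailing_zscores(values, window)):
        if z is None or abs(z) < threshold:
            continue
        if any(abs((day - event).days) <= max_gap_days for event in events):
            flagged.append((day, round(z, 2)))
    return flagged


# Invented daily volumes with one spike on the seventh day.
days = [date(2020, 1, 1) + timedelta(days=i) for i in range(8)]
volumes = [100, 102, 98, 101, 99, 100, 250, 101]
news_events = [date(2020, 1, 8)]  # hypothetical news date

flagged = anomalies_near_events(days, volumes, news_events)
print(flagged)
```

The same function can be run on price changes instead of volumes. Restricting results to dates near known events keeps the output focused on the question the reporting asked, at the cost of ignoring unexplained spikes elsewhere.
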

Technologies Used

Python and JavaScript were used for most tasks, with the following additional libraries to accelerate development:
* Scrapy to scrape websites
* pandas for markets analysis
* MongoDB to store and aggregate data
* Camelot to identify tabular data in documents
* spaCy's natural language processing toolkit
* textacy to create models used in tagging and analyzing text documents
* Various blockchain interfaces to download records
* D3 for visualization

Project members

Shane Shifflett
