Project description

grafIA is a prototype that automatically generates statistical graphics while the journalist writes the caption. It uses Artificial Intelligence and Machine Reading Comprehension to analyse the text in real time as it is edited and to suggest a related graphic.

It automatically generates statistical graphics for three recurrent economic news subjects: gross domestic product (GDP), consumer price index and unemployment rate, drawing on one data source, the National Institute of Statistics (INE). It works as a web browser plugin and runs in its own text editor as well as in WordPress text editors.

The system monitors the text as it is written (headline, standfirst, first paragraph…) and associates words and expressions until it identifies, based on its experience and knowledge, the concept the journalist is writing about.

As soon as it has identified the concept, the system flashes a signal and offers a preview of a graphic. If the graphic fits the content the journalist is writing about, it can be added to the text with a single click on it. Once the graphic is accepted, the journalist can edit its colors to match the website where the article will be published.

How to use grafIA

Open Chrome and go to: https://googledni.igzdev.com/
Sign in with Google.
Go to this link and download the extension.
Unzip the compressed folder.
Paste chrome://extensions/ into the browser's address bar and press Enter.
Select the option to load an unpacked extension ("Load unpacked" in English-language Chrome, available once Developer mode is enabled).
Choose the unzipped folder. The extension will appear in the browser as deactivated.
To use the plugin, go to https://googledni.igzdev.com/ again and reload the page.

The prototype was tested with employment, GDP and Consumer Price Index news from the newspapers El País, ABC, El Mundo and La Vanguardia. You can type into the masthead and caption form fields (copy and paste also works) and, after a few lines, an icon will offer a graphic. Here is a set of news items for copy-and-paste testing.

What makes this project innovative?

The goal is to combine two lines of work: reaching users through simple information visualizations that help them quickly understand complicated information, and establishing AI as a necessary tool in any current newsroom. Through automated journalism, the aim is to deliver graphical information, generally for breaking news, in less than a second, and to make AI and statistical graphics everyday newsroom tools. This will be done by automatically generating simple, recurring graphics through Machine Reading Comprehension (MRC) and Natural Language Understanding (NLU): NLU analyses the unstructured text of economic news, and MRC then delivers a natural response in a visual language.

What was the impact of your project? How did you measure it?

We received very satisfactory feedback from the 25 beta testers who tried the tool, and potential clients reacted with interest to grafIA presentations, such as our public appearances at AI Event Thinking Party 18 (Fundación Telefónica) and at the AI and Journalism (Prodigioso Volcán) sessions. Specific feedback made us think about two challenges: 1) opening our media target to communication departments, not only newspapers, since the tool could be very useful for periodic (monthly, quarterly and annual) reports; 2) focusing on local newspapers with their own periodic data (waves, tides, snow, farming data...). This input encourages us to draw up a plan to scale the model in the Spanish market and also in a second-language market, making the model able to learn new concepts. The next steps are publicly presented at: https://www.facebook.com/prodigiosovolcan/videos/309857576330918/ (time: 53:00).

Source and methodology

Working with Terminus7 developers, we prepared a dataset of economic and employment news from several Spanish national online newspapers (EL PAÍS, ABC, EL MUNDO and La Vanguardia) to train the ML system, so that it could learn by itself what journalists usually write. The set focused on 400 news items about unemployment rates, GDP and the Consumer Price Index, but we also had to feed the Terminus7 ML system a similar dataset of news on other topics (such as sport and international affairs) so that it could learn to discriminate non-economic news. We also prepared a web editor very similar to a newspaper CMS, where journalists can write a masthead, caption, etc. The prototype runs as a Chrome plugin that checks the text edited by the journalist in this web editor. Once the system understands the topic of the text, after a few lines containing key words and expressions, the second stage of the POC runs: a direct link to the data source and the selection of the data visualization graph. We created a simple graphic library of bar and line charts so that the graphic is displayed to the user and can be embedded in the text field.

Technologies Used

Automatic generation of simple and recurring infographics through a Machine Reading Comprehension (MRC) system. The experience relies on Natural Language Understanding (NLU), which analyzes the unstructured text of economic news (such as unemployment or the growth of the economy) and gives a natural response (MRC) in a visual language such as graphic visualizations. The platform monitors the text written by the journalist, extracts key words through the ML platform, checks a core database of graphics according to the content, finds the associated source, extracts the new data available for the caption, generates a simple statistical graphic and offers it to the journalist for publishing. In detail:

TRAINING THE MODEL

A training set is manually tagged with news related to the topics selected for this prototype (unemployment, CPI, GDP). Several preprocessing and text mining algorithms are developed, based on common Python libraries for Natural Language Processing (scikit-learn, spaCy). The model in progress is able to swiftly process and classify any given text into different topics. The model is saved in HDF5 so it can be shared and deployed in a production environment.

CREATING THE CHART DATABASE

A set of robots continuously analyzes and extracts (with Scrapy) data from sources (e.g. INEbase) to retrieve the relevant data. The data are stored in a NoSQL database with links to the original data sources to ensure traceability. Relevant queries are predefined to produce the most common charts for different topics (evolution, comparison, correlation, distribution…). These news graphics are stored as Vega.js JSON objects.

RUNNING THE APP

As the journalist writes the article in the content management system (CMS) editor, a plugin (in this proof of concept, or POC, a Chrome extension) takes blocks of 50 words and sends them to our server for topic analysis. Using the previously trained model, the server determines the topic of the current 50-word block.
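The 50-word blocks mentioned above can be illustrated with a short sketch. It is written in Python for consistency with the rest of the stack, although the extension itself would run JavaScript in the browser. The function names, the "text" field of the payload, and the choice of non-overlapping blocks with a shorter final block are assumptions made for illustration, not details of the actual plugin.

```python
import json
from typing import Iterator


def word_blocks(text: str, size: int = 50) -> Iterator[str]:
    """Yield consecutive, non-overlapping blocks of at most `size` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()  # split on any whitespace
    for start in range(0, len(words), size):
        yield " ".join(words[start:start + size])


def build_payload(block: str) -> str:
    """Serialize one block as a JSON request body (field name is hypothetical)."""
    return json.dumps({"text": block})


if __name__ == "__main__":
    # Placeholder draft of 120 words, not real news text.
    draft = " ".join(f"word{i}" for i in range(120))
    for block in word_blocks(draft):
        payload = build_payload(block)
        print(len(block.split()), "words,", len(payload), "bytes")
```

In this sketch a 120-word draft produces three requests (50, 50 and 20 words). Whether the real extension sends the trailing partial block or waits for a full 50 words is not stated in this description.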
If the text matches one of the predefined topics, one or several of the previously stored charts are retrieved and returned in different formats (image and JSON object). The selected chart is then inserted in the CMS editor.

For the development of the grafIA project we have employed different open source technologies, carefully chosen according to the requirements of the different layers of the platform.

Natural Language Processing

Most of the text analysis needed for the analysis, modelling and classification of news will be developed using open source libraries available in the Python language.

NLTK (https://www.nltk.org/) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, and wrappers for industrial-strength NLP libraries.

spaCy (https://spacy.io/) is considered an "industrial-strength" Python NLP library. It doesn't offer as many options as NLTK does: its philosophy is to present only one algorithm, the best one, for each purpose. spaCy is built on Cython, so it is also fast.

Gensim (https://radimrehurek.com/gensim/) is another well-optimized library for topic modeling and document similarity analysis. Its topic modeling algorithms include an implementation of Latent Dirichlet Allocation (LDA), and it is designed to be robust, efficient and scalable.

Modelling and classification

Scikit-learn (http://scikit-learn.org) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting and k-means, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
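To make the classification step more concrete, here is a minimal sketch of a topic classifier that includes an "other" class for non-economic news, as described under Source and methodology. It is a dependency-free multinomial Naive Bayes written with the Python standard library only, a stand-in for illustration and not the scikit-learn and spaCy model the project trained. The class name, the labels and the training sentences are all invented placeholders.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocabulary:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder sentences, NOT the project's 400-item training set.
TRAINING = [
    ("the unemployment rate fell as more people found jobs", "unemployment"),
    ("jobless claims and unemployment rose this quarter", "unemployment"),
    ("consumer prices climbed and inflation pushed the cpi higher", "cpi"),
    ("the consumer price index shows inflation slowing", "cpi"),
    ("gdp growth accelerated as the economy expanded", "gdp"),
    ("the economy shrank and gdp contracted", "gdp"),
    ("the team won the match with a late goal", "other"),
    ("the president met foreign leaders at the summit", "other"),
]

if __name__ == "__main__":
    texts, labels = zip(*TRAINING)
    model = NaiveBayesTopicClassifier().fit(texts, labels)
    for draft in ["unemployment rose again", "the team scored a goal in the match"]:
        print(draft, "->", model.predict(draft))
```

The "other" label is what lets the server stay silent: if a block is classified as "other", no chart needs to be offered. With a real corpus, the same fit and predict roles would typically be filled by a scikit-learn pipeline.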
We will be using HDF5 (https://support.hdfgroup.org/HDF5/) for model persistence.

API server

Once the model has been trained, it has to be deployed on a server that will reply to the requests coming from the browser of the journalist writing the article. We have chosen Node.js, an asynchronous JavaScript runtime, which will serve as the API server: it will receive the text, run the model to identify the topic, search the charts database for the relevant visualization and send it back to the user.

Data visualization

Once we have identified the topic of the article, we need to generate the corresponding data visualizations to be added to the text. We have created several bots that continuously crawl the relevant information pages, such as those of the Instituto Nacional de Estadística (www.ine.es), and store the data in a document-based database (MongoDB) with links to the pages and files retrieved to ensure precise citation and traceability.

For creating the charts, we use Vega.js (https://vega.github.io/vega/). Vega is a visualization grammar, a declarative language for creating, saving and sharing interactive visualization designs. With Vega, you can describe the visual appearance and interactive behavior of a visualization in a JSON format, and generate web-based views using Canvas or SVG, or render them into PNG using WebKit.
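As an illustration of what a stored chart might look like, the sketch below builds a minimal Vega bar chart specification as a Python dict and keeps it, serialized as JSON, next to a link to its source. The in-memory dictionary is only a stand-in for the MongoDB collection, and the function names, record fields and data values are hypothetical placeholders (the numbers are not INE figures).

```python
import json


def bar_chart_spec(title, points, width=400, height=200):
    """Return a Vega bar chart specification for (label, value) pairs."""
    values = [{"label": label, "value": value} for label, value in points]
    return {
        "$schema": "https://vega.github.io/schema/vega/v5.json",
        "title": title,
        "width": width,
        "height": height,
        "padding": 5,
        "data": [{"name": "table", "values": values}],
        "scales": [
            {"name": "xscale", "type": "band",
             "domain": {"data": "table", "field": "label"},
             "range": "width", "padding": 0.05},
            {"name": "yscale", "type": "linear",
             "domain": {"data": "table", "field": "value"},
             "range": "height", "nice": True, "zero": True},
        ],
        "axes": [
            {"orient": "bottom", "scale": "xscale"},
            {"orient": "left", "scale": "yscale"},
        ],
        "marks": [{
            "type": "rect",
            "from": {"data": "table"},
            "encode": {"enter": {
                "x": {"scale": "xscale", "field": "label"},
                "width": {"scale": "xscale", "band": 1},
                "y": {"scale": "yscale", "field": "value"},
                "y2": {"scale": "yscale", "value": 0},
            }},
        }],
    }


# Hypothetical in-memory stand-in for the document database: topic -> records.
CHARTS_BY_TOPIC = {}


def store_chart(topic, spec, source_url):
    """Keep the chart as a JSON string together with a link to its source."""
    record = {"spec": json.dumps(spec), "source_url": source_url}
    CHARTS_BY_TOPIC.setdefault(topic, []).append(record)


def charts_for(topic):
    """Return the stored specs for a topic, or an empty list if there are none."""
    return [json.loads(r["spec"]) for r in CHARTS_BY_TOPIC.get(topic, [])]


if __name__ == "__main__":
    # Placeholder values only, chosen to be obviously artificial.
    spec = bar_chart_spec("Placeholder series", [("P1", 1.0), ("P2", 2.0), ("P3", 3.0)])
    store_chart("unemployment", spec, "https://www.ine.es")
    print(json.dumps(charts_for("unemployment")[0], indent=2))
```

Because the specification is plain JSON, the same object can be returned to the browser and rendered there, or rendered to an image on the server, which matches the two formats (image and JSON object) mentioned earlier.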

Project members

Prodigioso Volcán agency
Project Manager: Ana Ormaechea
News graphic design: Rafael Höhr
Consultant Development Manager: Alberto Labarga
Technological development: Intelygenz development company
