Project description

In recent years, news on migration has been part of our everyday lives. Our analysis covers 10,330 images extracted from 42,845 articles published in Hungarian online media between 27 September 2014 and 11 June 2016. Our project aims to automatically identify the most significant topics of these more than ten thousand images.

To build a topic model from the images, the visual information they carry first had to be converted into text. For this task we used the labelling service of Clarifai. The resulting labels were then treated as documents, from which topics were extracted with Latent Dirichlet Allocation (LDA). Based on the results of the topic model, the images were organized into seven groups. The first interactive visualization (Topics and their labels) shows how strongly the topics are related; in other words, it indicates how often the topics appeared among the top three topics of each image.
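The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the original code: the label strings are invented, the use of scikit-learn is an assumption (the project does not name its LDA implementation), and the corpus is reduced to four toy "documents".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical "label documents": the Clarifai labels of each image joined
# into one string (the real corpus has 10,330 such documents).
label_docs = [
    "refugee crowd tent camp",
    "politician podium press conference",
    "fence border police wire",
    "child family tent camp refugee",
]

# Bag-of-words counts over the labels.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(label_docs)

# Seven topics, matching the seven image groups described above.
lda = LatentDirichletAllocation(n_components=7, random_state=0)
doc_topics = lda.fit_transform(X)  # one topic distribution per image

# The "top 3" topics of an image are its three highest-probability topics.
top3 = doc_topics.argsort(axis=1)[:, -3:][:, ::-1]
```

Counting, for each image, which topics land in `top3` yields the co-occurrence figures behind the first visualization.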

We were interested not only in the main topics of the images, but also in the sex and age of the people they depict. Therefore, after evaluating many freely available tools, we decided to build our own sex and age classifier. The results of age and sex identification by topic are shown in interactive diagrams. Furthermore, the correlation between our findings and Eurostat demographic data is also visualized.

Finally, we wanted to identify which emotions of the refugees are documented in the images, so we trained our own emotion recognition algorithm. Two visualizations illustrate the most dominant emotions per topic. The project targets a wide audience, namely the general public, since migration is a serious contemporary issue affecting millions of people.

What makes this project innovative?

Our project is unique in two respects.

On the one hand, this is the only webpage that makes a significant body of online media articles and images on the Hungarian migration crisis accessible and searchable. It can be considered an open repository.

On the other hand, what makes our venture exceptional is that images relating to migration are not only gathered but also processed with a variety of machine learning techniques. The data is analyzed with several methods (topic modelling, automatic face detection, sex and age identification, automatic emotion detection), which give the reader a nuanced picture of the issue. Moreover, readers are not left alone with the outcomes of the analysis: the results are interpreted and contrasted with those of other studies. The figures (interactive dashboards, charts, static visualizations, collections of images) help readers comprehend the data with less effort and see the correlations. Consequently, the project encourages critical thinking about the migration crisis.

What was the impact of your project? How did you measure it?

Since we are not a media outlet, publicity is not measured directly. However, the Hungarian version of this project received considerable publicity: it was published in a Hungarian online journal, where the article was read by 24,000 people. Furthermore, the Hungarian version was selected among the top Hungarian data journalism and data visualization projects, and it was recommended for media studies classes by a Hungarian teachers' blog. The English version aims to summarize our results and give a wider audience the chance to learn how migration is represented in online media.

Source and methodology

The corpus was compiled from articles of the most significant Hungarian online news websites; the full list is available on the dashboard by clicking the domain filter. Using the search engines of these websites, we searched for the keywords of the migration crisis (e.g. bevándorló 'immigrant' and migráns 'migrant') and harvested the articles from the resulting hit lists. Raw data were clustered on the basis of similarity measures, and the clusters were then examined by our annotator team, who filtered out the majority of the non-relevant content and duplicates. The data gathering resulted in 42,845 articles, from which 42,311 images were extracted. A significant share of these images was not relevant. Most were filtered out by a simple heuristic: images below a certain size are usually logos, either of the web page displaying them or of other companies. The images were then processed the same way as the texts: they were clustered on the basis of similarity, supported by the simhash algorithm, and the clusters were checked by the annotator team to remove duplicates and non-relevant images. Finally, many images irrelevant to the research were filtered out with the help of topic models, introduced in a later section of the article. As a result, we obtained 10,330 unique images, which appeared 55,003 times in the articles.
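The simhash-based similarity clustering mentioned above can be illustrated with a minimal pure-Python fingerprint; this is a sketch of the general technique, not the project's actual implementation, and the token hashing via MD5 is an arbitrary choice for the example.

```python
import hashlib

def simhash(tokens, bits=64):
    """Compute a simple 64-bit simhash fingerprint from a token list."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Each token votes +1/-1 on every bit position.
            v[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps the sign of each bit's vote total.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Near-duplicate documents differ in few tokens, so their fingerprints
# typically lie within a small Hamming distance of each other.
d1 = "migrants arrive at the border fence".split()
d2 = "migrants arrive at the border fences".split()
dist = hamming(simhash(d1), simhash(d2))
```

Clustering then amounts to grouping items whose pairwise fingerprint distance falls below a chosen threshold; those clusters are what the annotator team reviewed.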

Technologies Used

We collected and processed our data with Python. The population pyramids and radar charts were made with D3, while the chart describing the relationship between topics and labels was created with pyLDAvis. We used a Bootstrap HTML page template for our visualization. Moreover, different technologies were used for the different project aims:

1) Topic model
To build a topic model from the images, the visual information they carry first had to be converted into text. For this task we used the labelling service of Clarifai. It assigns each image relevant labels based on its content and provides the level of relevance as a certainty value. Only labels whose certainty value was above 0.75 were considered, and labels attached to fewer than 25 images were excluded.
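The two filtering rules above (certainty threshold, minimum label frequency) can be sketched as plain Python; the Clarifai output shown here is invented for illustration, and on this toy corpus the minimum-image threshold is lowered to 2 so the effect is visible (the project used 25).

```python
from collections import Counter

def build_label_documents(clarifai_output, min_certainty=0.75, min_images=25):
    """Keep labels above the certainty threshold, then drop rare labels."""
    # Step 1: discard labels whose certainty falls below the threshold.
    filtered = {
        img: [label for label, cert in labels if cert >= min_certainty]
        for img, labels in clarifai_output.items()
    }
    # Step 2: discard labels attached to fewer than min_images images.
    counts = Counter(label for labels in filtered.values() for label in labels)
    return {
        img: [label for label in labels if counts[label] >= min_images]
        for img, labels in filtered.items()
    }

# Hypothetical Clarifai output: per image, a list of (label, certainty) pairs.
clarifai_output = {
    "img_001.jpg": [("people", 0.98), ("crowd", 0.91), ("tree", 0.40)],
    "img_002.jpg": [("people", 0.95), ("fence", 0.82)],
}

# min_images lowered to 2 for this toy corpus; the project used 25.
documents = build_label_documents(clarifai_output, min_images=2)
```

The surviving label lists are the "documents" fed to the LDA topic model.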

2) Automatic face detection
We tested many freely available face detection tools, but finally decided to use the pre-trained Haar cascade model of the OpenCV library. Our decision was based on the fact that most of our images capture spontaneous moments: people do not look into the camera, and their heads may be covered by a hat, hood, or headscarf. Compared with the other algorithms, it produced far fewer false positives, which convinced us it was the best choice.

3) Automatic sex and age classification
After studying many freely available tools, we decided to create our own sex and age classifier. Our training data was the IMDB-WIKI dataset of roughly 500,000 images, which contains pictures of actors from IMDb together with their sex and age. Using the Keras deep learning framework, a convolutional neural network was trained for sex and age identification.
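A convolutional network for this task can be sketched with the Keras functional API as follows. This is a minimal illustrative architecture under assumed conventions (64x64 RGB face crops, a sigmoid head for sex and a regression head for age); the network actually trained by the project may differ in depth, input size, and loss choices.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed input: 64x64 RGB face crops from the face detection step.
inputs = keras.Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)

# Two heads: binary sex classification and age regression.
sex_out = layers.Dense(1, activation="sigmoid", name="sex")(x)
age_out = layers.Dense(1, activation="relu", name="age")(x)

model = keras.Model(inputs, [sex_out, age_out])
model.compile(
    optimizer="adam",
    loss={"sex": "binary_crossentropy", "age": "mse"},
)
```

Training then calls `model.fit` with the face crops and the paired sex/age labels from the dataset.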

4) Automatic emotion detection
After getting familiar with the freely available emotion recognition tools, we decided to train our own algorithm, for which we used the FER2013 dataset. We trained a convolutional neural network for this task using the Keras deep learning framework. It achieves 70% precision, meaning that 70% of the images it labels with a given emotion are labelled correctly, and 58% recall, meaning that 58% of the images actually showing a given emotion are recognized.
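The precision and recall figures quoted above follow from standard confusion counts; the toy numbers below are made up for the example and merely reproduce the same ratios.

```python
# Toy confusion counts for one emotion class (illustrative numbers only).
true_positives = 70    # images correctly labelled with the emotion
false_positives = 30   # images labelled with the emotion, but wrongly
false_negatives = 50   # images showing the emotion that the model missed

# Precision: share of the model's emotion labels that are correct.
precision = true_positives / (true_positives + false_positives)  # 0.70

# Recall: share of images with the emotion that the model finds.
recall = true_positives / (true_positives + false_negatives)     # ~0.58
```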

Project members

Virág Ilyés, Eszter Katona, Orsolya Putz

