Project description

Although in Cuba we tend to assume that there are no inequalities among Cubans, or at least no glaring ones, it is evident that there are differences. We wanted to expose this data as part of an investigation we were carrying out, but we did not want to decide ourselves which groups emerged when we crossed the data on internal migration, skin color, age dependency, average salary and rural population.

Therefore, we chose two different unsupervised learning algorithms that divided all the municipalities of the country into groups, depending on their proximity to certain centroids. In this way, the groups are determined automatically and displayed on a map. Our team created this GeoJSON of Cuba's municipalities, since no usable map existed at that level of detail.

In addition to the map, other graphs below help to explain the phenomena and offer more information about the characteristics of the identified groups.

To make the narrative more interesting, we decided to create what we call audio-telling. Instead of text, our story is told in the voices of two of our journalists while animations related to the story play. At any time, the user can pause the audio, interact with the graphics and the map, and then resume the story.

What makes this project innovative?

Any identification of the differences in Cuba is tentative, and it is difficult to accomplish without bias. For this reason we decided to use Artificial Intelligence, specifically two unsupervised learning algorithms that, based only on the data we provided, could identify groups of municipalities with common characteristics and, in this way, show the differences across the geographical space of Cuba. The work was conceived as a data app where anyone can find the different groups based on the information they want to cross, so it is a tool for readers but also for researchers interested in these topics. The tool is not biased by the researchers' criteria: it uses clustering algorithms to calculate the groups and uses the centroids to characterize them. Crossing two of the five criteria we chose can produce groups that are very different from, or very similar to, each other, which suggests that these criteria influence people's lives, as does their geographical location.

We did not want to present the work only as a data app; we also wanted to share our insights about the interesting facts we found. So we tried to create a narrative that illustrates how to use the tool and also tells our story without being intrusive. Instead of using a traditional narrative, we did what we call audio-telling: we created a podcast in our own voices and, as our commentary is heard, the map shows what we are describing. While the story is being told, the reader can still interact with the tool. This is an interesting way to tell the data, as it helps readers focus visually on the map.

This is a powerful tool that allows other researchers and journalists to access official data, and also to cross and compare it, in order to identify social phenomena. In addition, the map can be consulted by decision-makers, which will help to clarify some common points between minorities and poverty.

What was the impact of your project? How did you measure it?

This is the first tool that allows the differences in Cuba to be analyzed at a local level in terms of both social and economic criteria. In that sense, it goes beyond journalism to become a useful tool for those working on minority issues. The data shows the gap that exists between people in the city and people in the countryside, or between people of different skin color. Within a few days the work received many visits. It was quickly shared on social networks and was recommended by groups that support data journalism, by academics working on hyperlocal issues, and by organizations that investigate issues of race, gender and youth. Many people and institutions were impressed by the interactive use of audio and animations, and by the application of clustering algorithms to detect socioeconomic differences without researcher bias. In just two days, this work outperformed on social networks every article we have published in the history of Postdata.club.

Source and methodology

All the data used come from documents of the National Bureau of Statistics and Information of Cuba (ONEI). The figures by province on skin color were obtained from the Skin Color report according to the Population and Housing Census of 2012. The data on the rural population were obtained from the 2017 Statistical Yearbook of Cuba, published in 2018. The figures by province on average salary, internal migration and age dependency ratio were obtained from the respective Provincial Yearbooks of 2016.

The age dependency ratio indicates, approximately, the burden or pressure on the labor resources of a given territory, and its trend is associated with the population aging process. For a region, it is calculated by dividing the number of inhabitants aged 0 to 14 plus those aged 60 or over by the number aged 15 to 59. We used this formula to calculate the age dependency values for Holguín and Granma, because they did not appear in their respective provincial yearbooks.

Whether or not the municipalities of Havana were included was added as a criterion when determining the groups. This was decided because of the obvious differences between the capital and the rest of the country. In this way, the analyses can be done with or without the capital's municipalities.

To determine the groups of municipalities according to the specified criteria, two unsupervised classification algorithms were used: MeanShift, to classify into groups automatically, and KMeans, to segment into three groups. The groups for each possible combination were precalculated and structured in a JSON file. Precalculation was done in Python with the implementations of these algorithms provided by Scikit-Learn. Since the indicators differed in scale, when groups were determined based on two of them it was necessary, in order to obtain better results, to bring them to the same scale. For this, the RobustScaler implementation, also present in Scikit-Learn, was used. Scikit-Learn was also used to calculate the centroids of the groups obtained in the different groupings. These centroids can be considered the central or characteristic value, or set of values, of each group.

In the implementation of the tool, Leaflet was used for the maps, C3JS for the graphs and Tablesorter for the interactive tables. This was programmed directly in JavaScript using jQuery. For the audio-telling, Soundcloud was used to play the audio, and its Widget API to drive the interactions with the tool based on the audio timing and content. All this was also programmed directly in JavaScript using jQuery.
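The age dependency ratio described in this section reduces to a single division. The following is a minimal sketch of that formula; the function name, argument names and population counts are illustrative assumptions, not ONEI figures or the team's actual code.

```python
def dependency_ratio(age_0_14, age_60_plus, age_15_59):
    """Dependents (ages 0-14 plus 60 and over) divided by the 15-59 population."""
    if age_15_59 <= 0:
        raise ValueError("the 15-59 population must be positive")
    return (age_0_14 + age_60_plus) / age_15_59


# Made-up counts for a hypothetical region, for illustration only.
ratio = dependency_ratio(age_0_14=1600, age_60_plus=2000, age_15_59=6000)
print(round(ratio, 3))  # 0.6
```

A higher value means more dependents per person aged 15 to 59.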

Technologies Used

We used Python for all the data analysis. For the unsupervised learning algorithms and the data scaling we used Scikit-Learn. For data representation we used the JSON format. We used Leaflet.js for the mapping and the GeoJSON format to create the municipalities map. For the other graphs we used C3JS and D3JS, and Tablesorter for the interactive tables. We used Soundcloud and its Widget API to create the audio-telling story. For all the programming we used JavaScript, jQuery, HTML5, CSS3 and Bootstrap. GitHub is the space where all the code of Postdata.club is published.
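To illustrate how the Python tools listed above might fit together, here is a minimal sketch of the precalculation step: every pair of indicators is scaled with RobustScaler, clustered with MeanShift and with KMeans (three groups), and the labels and centroids are serialized to JSON. The indicator names, the random input data, the Havana mask, the JSON key layout and the `n_init` and `random_state` settings are all assumptions made for the example; the project's real code and file structure may differ.

```python
import json
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans, MeanShift
from sklearn.preprocessing import RobustScaler

# Hypothetical indicator names; the real inputs are ONEI figures.
INDICATORS = ["migration", "skin_color", "dependency", "salary", "rural"]


def cluster_pair(values, model):
    """Scale a two-column array, cluster it, and return labels and centroids."""
    scaler = RobustScaler()
    scaled = scaler.fit_transform(values)
    labels = model.fit_predict(scaled)
    # Map the centroids back to the original units so they stay readable.
    centroids = scaler.inverse_transform(model.cluster_centers_)
    return labels.tolist(), centroids.tolist()


def precalculate(data, names, is_havana):
    """Cluster every indicator pair, with and without the Havana rows."""
    subsets = {"with_havana": data, "without_havana": data[~is_havana]}
    results = {}
    for scope, rows in subsets.items():
        for a, b in combinations(range(len(names)), 2):
            entry = {}
            for algo, model in (
                ("meanshift", MeanShift()),  # number of groups found automatically
                ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
            ):
                labels, centroids = cluster_pair(rows[:, [a, b]], model)
                entry[algo] = {"labels": labels, "centroids": centroids}
            results[f"{scope}|{names[a]}|{names[b]}"] = entry
    return results


# Made-up data: 30 fake municipalities, the first 5 flagged as Havana.
rng = np.random.default_rng(0)
data = rng.normal(size=(30, len(INDICATORS)))
is_havana = np.arange(30) < 5

groups = precalculate(data, INDICATORS, is_havana)
payload = json.dumps(groups)  # what a front end could load as a static file
```

Precomputing every combination keeps the browser side simple: the map only has to look up the entry for the selected pair of criteria instead of running any clustering itself.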

Project members

Saimi Reyes, Yudivián Almeida, Ernesto Guerra
