Project description

Being a so-called influencer is the dream job of the moment for many young people. Getting a wealth of free products, or even cash, in exchange for an Instagram post is enticing, and the advertising industry seems to have discovered a new, effective way of reaching target groups.

However, accusations began to surface that the followers of many influencers are, in fact, fake. Could these accusations be true? There had never been a systematic study on the subject: nobody, neither in nor outside of Switzerland, had ever tried to thoroughly quantify the fake follower problem on Instagram.

So that’s what we at SRF Data did. We trained a machine learning model to automatically classify 7 million Instagram accounts regarding their “fakeness”. By doing so, we found that roughly a third of these accounts, which follow 115 Swiss influencers, are indeed fake.

Some influencers had more than 50% fake followers, which raises questions about the integrity and authenticity of these follower bases. Consequently, the publication caused quite a stir in the influencer economy. In the end, we confronted each of the 115 Swiss influencers with our result and gave them the opportunity to respond. We published each influencer's fake follower ratio, together with his or her statement where one was given, in an interactive list in our article.

Link 1 points to the original article published on srf.ch (German), Link 2 to the full methodology (English), Link 3 to a making-of on datadrivenjournalism.net, and Link 4 to a making-of video adapted specifically for a young audience (with English subtitles).
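
The per-influencer fake follower ratio mentioned above is simple arithmetic once every follower has a fake/real label. The following is only a minimal sketch of that aggregation, not our published code: the function name, the (influencer, is_fake) data layout and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


def fake_follower_ratios(classified):
    """Return {influencer: share of followers labelled fake}.

    `classified` is an iterable of (influencer, is_fake) pairs, one per
    follower relationship. Both the layout and the names are assumptions.
    """
    totals = defaultdict(int)
    fakes = defaultdict(int)
    for influencer, is_fake in classified:
        totals[influencer] += 1
        if is_fake:
            fakes[influencer] += 1
    return {name: fakes[name] / totals[name] for name in totals}


# Made-up example data, not taken from the study.
sample = [
    ("influencer_a", True), ("influencer_a", False), ("influencer_a", False),
    ("influencer_b", True), ("influencer_b", True),
    ("influencer_b", False), ("influencer_b", False),
]

ratios = fake_follower_ratios(sample)
for name, ratio in sorted(ratios.items()):
    print(f"{name}: {ratio:.0%} fake")
```

In this toy data, influencer_a has one fake follower out of three and influencer_b two out of four, so the second would sit exactly at the 50% mark.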

What makes this project innovative?

First and foremost, we used a machine learning model, a so-called random forest, to automatically classify the 7 million followers of 115 Swiss influencers. Secondly, we transparently published the whole methodology behind this project, so that any journalist can learn from it and reproduce it.

What was the impact of your project? How did you measure it?

Our story made the front page of the most-read Swiss newspaper, "20 Minuten", the following day and caused quite a stir in the advertising scene. What matters more to us, though, is that we attracted a younger-than-usual audience for our flagship broadcast format, "10 vor 10", a daily TV magazine that features background stories and investigative reports on current topics. Amusingly, the influencer scene was quite overwhelmed by the publication on October 11th, and during that day many influencers alerted their followers to the broadcast airing that evening. We therefore assume that "10 vor 10" has never had a younger audience than on that evening.

Source and methodology

Our work process can be divided into three main steps: (1) identify a representative sample of Swiss influencers; (2) download key metrics of the audience of these influencers; (3) given these key metrics, classify the audience into fake or real.

The first step was rather straightforward. We asked Le Guide Noir for a list of the top 100 Swiss influencers, which they kindly provided to us. We scanned the list and removed celebrities (for example, Roger Federer) who were famous regardless of their Instagram profiles. In the end, we had a list of 115 Swiss influencers with a combined audience of seven million followers.

Because we couldn’t manually go through seven million Instagram profiles (that would have taken more than five years of our precious time…), we had a statistical learning model in mind, which would use artificial intelligence to do the classification work for us. To do so, the model needed to be given so-called ‘features’: attributes or key metrics of the profiles to be classified. Turning back to our list of seven million followers, we scraped features such as ‘number of people following’ or ‘number of posts’ from their Instagram pages. We then fed these features into our statistical model, a ‘random forest’. The model went through each of the seven million rows of our follower database and, taking only seconds, classified them as fake or not fake.

To assess the model’s accuracy, we held back 300 pre-labelled profiles. For these 300, we already knew the answer and compared it against the model’s prediction. To our surprise, we achieved an overall accuracy of almost 95%, meaning that almost 95% of the 300 test accounts were correctly classified.
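
The hold-out check described above can be sketched in a few lines. This is an illustrative sketch only, written in Python rather than the R we actually used: the `Profile` fields mirror the two features named in the text, while `holdout_accuracy`, `placeholder_rule` and all sample values are hypothetical. In particular, the placeholder rule is a deliberately crude stand-in and is not the random forest.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Profile:
    following: int  # number of accounts this profile follows
    posts: int      # number of posts on the profile


def holdout_accuracy(
    classify: Callable[[Profile], bool],
    labelled: Iterable[Tuple[Profile, bool]],
) -> float:
    """Share of held-back profiles whose predicted label matches the known one."""
    labelled = list(labelled)
    if not labelled:
        raise ValueError("need at least one labelled profile")
    correct = sum(1 for profile, is_fake in labelled
                  if classify(profile) == is_fake)
    return correct / len(labelled)


def placeholder_rule(profile: Profile) -> bool:
    # Hypothetical stand-in classifier with arbitrary thresholds,
    # used only so the evaluation has something to score.
    return profile.posts == 0 and profile.following > 1000


# Made-up hold-out set: (profile, known label "is fake").
holdout = [
    (Profile(following=5000, posts=0), True),
    (Profile(following=300, posts=120), False),
    (Profile(following=7500, posts=0), True),
    (Profile(following=150, posts=0), False),
    (Profile(following=2000, posts=3), True),   # the rule misses this one
]

accuracy = holdout_accuracy(placeholder_rule, holdout)
print(f"accuracy: {accuracy:.0%}")
```

Applied to 300 held-back profiles, an accuracy of almost 95% corresponds to roughly 285 correct classifications.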

Technologies Used

We used R and the package "caret" to build the statistical model, a "random forest". The details are in Link 2 and Link 3. For the frontend visualization (the interactive list), we used ReactJS.

Project members

Jennifer Victoria Scurrell, Julian Schmidli, Angelo Zehr

