Project description

In January 2019, an article titled “The Search Engine Baidu Is Dead” revealed Baidu’s intention to monopolize web traffic: a large share of its search results point to its own websites, especially Baijiahao, a content production platform.

Baidu’s Vice President responded to the scandal, insisting that less than 10% of results lead users to Baidu’s products. But this statement misses the point, because what users actually see is the first page of results, not the whole set of results.

We wondered how much of the first page of search results Baidu’s own websites account for, so we tested Baidu with 12,520 hot keywords and found that the ratio is higher than expected.

Baijiahao is a free content production platform with no strict identity verification rules. Any organization or individual can create an account on it, so fake news can easily appear on the platform. If Baidu recommends Baijiahao content to users at such a high rate, is it cautious enough when filtering that content? Does it ever consider vouching for the accuracy of the information? These questions remain open.

For your reference, below is a translation of our project:

A test of 12,520 hot keywords: Does Baijiahao really take over Baidu’s search results?

China’s top search engine, Baidu, changed its results page on Thursday amid accusations of a “web traffic monopoly”. It removed the source website address beside each search result, which means that unless people click through, they cannot tell where the information comes from.

On Tuesday, an article titled “The Search Engine Baidu Is Dead” accused Baidu of filling the first page of search results largely with content published by its own products, especially the content production platform Baijiahao.

The places that originally showed web addresses were replaced by media names and their logos. As shown below, if you searched for “Trump” before Thursday, the place marked in red showed the URL of the item, while now it shows only “Global Times” and its logo.

You think the item links to the official Global Times website? Wrong: it links only to the Global Times homepage on Baijiahao. And as there is no strict identity verification on Baijiahao, it is not out of the question that this is a fake account.

Baidu’s Vice President Dou Shen rejected the accusation, saying that less than 10% of results point to Baijiahao. But Wuhui Wei, a professor from the Faculty of Media and Communication, Shanghai Jiaotong University, disputed this statement.

“This response evades the key problem. The article criticized the first page of results, but Baidu responded by saying ‘search results’, which can be understood as ‘whole results’. The problem is, unless there is a special purpose, few users will be interested in the content after the third page,” he said.

A study of search engine user behavior shows that 92% of users click only the first 5 result items. The number of clicks drops sharply by the third page, and there are almost no clicks after the fifth page, which matches our own experience.

So what matters more than “proportion” is “sorting”. Displaying content on the top pages is quite different from displaying it on the last pages.

Therefore, we tested Baidu with 12,520 hot keywords to see how much of the first page of search results is taken up by Baijiahao and Baidu’s other content platforms, including Baidu Baike (similar to Wikipedia), Baidu Post Bar (social media) and Baidu Scholars. Is the ratio really as small as 10%?

Our results show that the first page of results for 89.8% of keywords contains Baidu’s own websites, and 84.5% of them contain Baijiahao.

For 50.3% of the keywords, Baidu’s products take up more than 50% of the first page, and 59.3% of those results point to Baijiahao. That is to say, if 10 results on the first page link to Baidu’s own websites, about 6 of them point to Baijiahao.
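The aggregate figures above can be derived from per-keyword link counts. The sketch below is only an illustration, not the project's actual code: the tuple layout, the function name `summarize` and the four sample rows are all hypothetical, and the last statistic follows one possible reading of the 59.3% figure (the share of Baidu-owned results that point to Baijiahao on pages where Baidu holds a majority).

```python
def summarize(rows):
    """rows: (keyword, total_links, baidu_links, baijiahao_links) per first page."""
    n = len(rows)
    # Keywords whose first page contains at least one Baidu / Baijiahao link.
    with_baidu = sum(1 for _, _, baidu, _ in rows if baidu > 0)
    with_baijiahao = sum(1 for _, _, _, bjh in rows if bjh > 0)
    # Keywords where Baidu-owned links exceed half of the first page.
    majority = [r for r in rows if r[1] > 0 and r[2] / r[1] > 0.5]
    baidu_in_majority = sum(r[2] for r in majority)
    bjh_in_majority = sum(r[3] for r in majority)
    return {
        "contains_baidu": with_baidu / n,
        "contains_baijiahao": with_baijiahao / n,
        "baidu_majority": len(majority) / n,
        # Assumed reading: Baijiahao share among Baidu-owned links on majority pages.
        "baijiahao_within_baidu": (
            bjh_in_majority / baidu_in_majority if baidu_in_majority else 0.0
        ),
    }


# Hypothetical toy data, NOT the project's measurements.
toy_rows = [
    ("k1", 10, 6, 4),
    ("k2", 10, 2, 0),
    ("k3", 10, 0, 0),
    ("k4", 10, 8, 5),
]
stats = summarize(toy_rows)
print(stats)
```

With real data, each row would come from one scraped first page, so the four ratios could be compared directly with the percentages reported in the article.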

Sometimes results from Baijiahao are not only scattered across the page but also gathered at the top of the page as “the most related information to the keyword”.

What kind of keywords are prone to return Baijiahao results? According to our test, the top 1% of keywords by proportion of Baijiahao results (more than 60%) are:

Keywords in the top 1% mainly consist of people’s names (especially celebrities) and words related to entertainment, healthcare, sports and everyday general knowledge. Searches for exact website names are unlikely to return Baijiahao results.

After Baidu changed its results page, we could no longer get the real links of search results even from the page source code. We had to click through to learn what the website actually was. So we modified our data-scraping code and re-ran the test, and found that the results were consistent with our previous test. In other words, although the page changed, Baidu’s sorting algorithm remained the same.

Search engines, to a large extent, decide what we see and how we think. “Information equality” seems to be within reach, but in fact the “gatekeeper” who filters information has always existed; it has merely changed from people to machines in the Internet era.

By what rules are the search results sorted? This is actually a mathematical problem.

In the 1990s, the primitive sorting principle was that the more keywords a page contained, the higher it ranked. As a result, the first few results were often web spam stuffed with duplicate keywords.

After Google became the search engine industry leader, a new sorting method called PageRank appeared. The more a page was linked to by other pages, the higher it ranked.
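The idea that inbound links raise a page's rank can be illustrated with a small power-iteration sketch. This is a simplified textbook version, not Google's or Baidu's production algorithm; the function name `pagerank`, the damping factor of 0.85, the iteration count and the four-page toy web are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: pages a, b and d all link to c; c links back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
top_page = max(scores, key=scores.get)
print(top_page, scores)
```

In this toy web, page c receives the most inbound links and ends up with the highest score, which is the behavior the paragraph above describes.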

Since then, the sorting algorithm has been continuously upgraded. Factors such as page hits, keyword prominence and the time users spend on pages were all later included in the measurement. But the main purpose of the algorithm had always been to find out what most people are reading. To this day, the sorting algorithm remains a “core secret” of Internet companies.

However, with the emergence of information flow, the sorting algorithm changed. Sorting was no longer decided by “what people are reading” but by “what people may like to read”. Based on user behavior, the algorithm can guess your preferences and recommend similar content that you are probably interested in. This form of content delivery is now adopted by almost all search engines, including Google and Baidu.

Baidu’s revenue for the 2018 financial year reached 100 billion yuan. In the second quarter, net revenue from mobile terminal services accounted for 77% of the total, rising from 5% year-on-year. For BaiduCore, a combination of Baidu’s search engine business and transaction business, nearly 20% of total revenue came from information flow services and artificial intelligence services, a year-on-year increase of 150%.

Such high profitability owes much to Baijiahao, which successfully keeps users on Baidu’s own websites throughout their searches.

Information flow recommendation controls the distribution channels, and more than 1.9 million Baijiahao accounts have built up a huge content pool. Just as the article said, Baidu is no longer a place to search for content on China’s internet, but rather an internal search for Baidu content.

According to search engine expert Rand Fishkin, on Google, 12.6% of user clicks go to the world’s top 100 websites. The other 87.4% point to ordinary websites, among which 11% point to Google’s own websites, including Google Maps, Gmail and Google Books. The ratio is apparently lower than Baidu’s.

In 2016, a university student from Shaanxi, China, died of cancer after receiving problematic medical treatment recommended by Baidu. The government later asked Baidu to overhaul its advertisement aggregation.

However, judging from the Baijiahao scandal, it seems Baidu has not fully recognized the social responsibility that a search engine company must bear.

What makes this project innovative?

Firstly, Baidu’s scandal was big news at the time, but few media outlets turned it into a data journalism story. People were discussing how Baidu tends to keep users on its own websites and monopolize web traffic, but this was an abstract concept. We used data to show to what extent Baidu’s products occupy the first page of search results, which is both convincing and intuitive. And as Baidu is the largest search engine in China and people use it to get information every day, this is data they really care about. Secondly, because of the scandal, Baidu changed its results page, which caused trouble for our reporting. When our data-mining process was half done, we discovered that we could no longer get the source link of each search result through the attribute in the page source code. So we had to rewrite our data-scraping code to continue the test.

What was the impact of your project? How did you measure it?

We published our project on Shanghai Observer, an app owned by Jiefang Daily, and on PaiKe, a data journalism content platform.

The project achieved 1 million views and received 709 comments on PaiKe. It was also the most-viewed story of that month.

Last but not least, the story achieved two “firsts”: it was the first data journalism story in China to apply algorithmic accountability, and the first to use computational methods to analyze the ranking methodology of China’s dominant search engine and confirm users’ suspicions.

Source and methodology

We used Python to fetch the source code of the first page of results, which contains the source links of all search results. If a link’s domain name is “Baidu”, we count it, and finally we calculate the proportion of such links among all links. The “Baijiahao proportion” is calculated the same way: if a link contains the string “Baijiahao”, we count it.
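The counting rule described above might look roughly like the following. This is a minimal sketch and not the project's actual scraper: the class and function names are hypothetical, the sample HTML is invented, the page is assumed to be already downloaded, and treating a "Baidu domain" as any host equal to or ending in `baidu.com` is an assumption about how the domain check was done.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in a page's source code."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def is_baidu_link(url):
    # Assumption: a link counts as Baidu's if its host is baidu.com or a subdomain.
    host = (urlparse(url).hostname or "").lower()
    return host == "baidu.com" or host.endswith(".baidu.com")


def is_baijiahao_link(url):
    # As described in the text: the link contains the string "baijiahao".
    return "baijiahao" in url.lower()


def first_page_proportions(html):
    parser = LinkCollector()
    parser.feed(html)
    total = len(parser.links)
    if total == 0:
        return {"total": 0, "baidu": 0.0, "baijiahao": 0.0}
    baidu = sum(is_baidu_link(url) for url in parser.links)
    baijiahao = sum(is_baijiahao_link(url) for url in parser.links)
    return {"total": total, "baidu": baidu / total, "baijiahao": baijiahao / total}


# Invented sample page source, for illustration only.
SAMPLE_HTML = """
<a href="https://baijiahao.baidu.com/s?id=1">A</a>
<a href="https://baike.baidu.com/item/x">B</a>
<a href="https://www.example.com/news">C</a>
<a href="https://tieba.baidu.com/p/1">D</a>
"""
result = first_page_proportions(SAMPLE_HTML)
print(result)
```

A real run would also need to download each results page and keep only the links that belong to search results, both of which depend on Baidu's page layout and are left out here.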

Technologies Used

We mainly used PyCharm to write the Python code that scrapes data from Baidu.

Project members

Shuyao Xiao, Yin Tuo


