GOOGLE
How Big Data Is At The Heart Of Google’s Business Model

Background

More than any other company, Google are probably responsible for introducing us to the benefits of analysing and interpreting Big Data in our day-to-day lives.

When we carry out a Google search, we are manipulating Big Data. The size of Google’s index – its archive of every Web page it can find, which is used to return search results – is estimated to stand at around 100 petabytes (or 100 million gigabytes!) – certainly Big Data, by anyone’s standards.1

But as we’ve seen over the past decade, bringing all the information on the Internet together to make it easier for us to find things was only the start of their plan. Google have gone on to launch Web browsers, email, mobile phone operating systems and the world’s biggest online advertising network – all firmly grounded in the Big Data technology with which they made themselves a household name.

What Problem Is Big Data Helping To Solve?

The Internet is a big place – since we moved online en masse in the 1990s, it’s been growing at a phenomenal rate and is showing no signs of slowing down. This size itself is a problem: when we have access to practically everything that anyone has ever known, how do we find what we need to help us solve our problems?

Not only is it big, the Internet is also very widespread. Information is uploaded to servers that may be located anywhere in the world, meaning anyone wanting to browse the data available to them is connecting to computers that can be thousands of miles apart. Getting an individual piece of data through to the user doesn’t take long – information travels along copper or fibre-optic cables in a matter of seconds. But that supposes the user knows where the data is located in the first place. Searching the entire Internet for even a very simple piece of information would take a very, very long time if you didn’t know the precise IP address of the computer on which it was stored – unless you had an index.

With billions of pages of information available online, though, building an index isn’t trivial. It would take an army of humans an eternity to come up with anything approaching a comprehensive database of the Internet’s contents. So it had to be done automatically – by computers. This raised another problem: how would computers know what was good information and what was pointless noise? By default, they can’t determine this on their own: computers have no concept of the difference between useful and useless unless we teach them – and, in any case, what is useless to one person may be critical to another in solving their problems.

How Is Big Data Used In Practice?

Google didn’t invent the concept of a search engine, or of a Web index. But very quickly after launching in 1997, Google established itself as the top dog – a title it has gone on to hold for almost 20 years.

The concept which established it as a household name in every corner of the world, while early competitors such as AltaVista and Ask Jeeves are barely remembered, is known as Google PageRank. (Google have a fondness for making new names by putting two words together but keeping both words capitalized, as if they were still separate!)

PageRank was developed by Google founders Larry Page and Sergey Brin before they formed the company, during research at Stanford University. The principle is that the more pages link to a particular page, the higher that particular page’s “authority” is – as those linking sites are likely to be citing it in some way. Google created their first search algorithms to assign every page in its index a rank based on how many other sites using similar keywords (and so likely to be on the same topic or subject) linked to it, and in turn how “authoritative” (highly linked-to) those linking pages were themselves. In other words, this is a process which involves turning unstructured data (the contents of Web pages) into the structured data needed to quantify that information, and rank it for usefulness.
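
To make the idea concrete, here is a minimal Python sketch of PageRank-style scoring over a tiny, made-up link graph. It illustrates the principle described above – rank flows from linking pages to linked-to pages – and is not Google’s production algorithm; the page names are invented.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}           # start every page with equal authority
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:                                # each link passes on an equal share of the page's rank
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:                                       # a page with no outgoing links shares its rank with everyone
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_web = {                                         # invented pages and links
        "home.html":  ["news.html", "about.html"],
        "news.html":  ["home.html"],
        "about.html": ["home.html", "news.html"],
        "blog.html":  ["news.html"],
    }
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(f"{page:12s} {score:.3f}")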

Google builds its index of the Web by sending out software robots – often called crawlers or spiders – which gather all of the text and other information, such as pictures or sounds, contained on a website and copy them to Google’s own vast archives – its data centres are said to account for 0.01% of all electricity used on the planet!
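
As a toy illustration of what such a crawler does – fetch a page, copy its text into an archive and follow the links it finds – the Python sketch below uses only the standard library. It bears no resemblance to Google’s production spiders, and the seed URL is just an example.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class PageParser(HTMLParser):
    """Collects the visible text and the outgoing links of one HTML page."""
    def __init__(self):
        super().__init__()
        self.text_chunks, self.links = [], []
    def handle_data(self, data):
        if data.strip():
            self.text_chunks.append(data.strip())
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=5):
    """Breadth-first crawl that copies each page's text into a local archive."""
    queue, seen, archive = [seed_url], set(), {}
    while queue and len(archive) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except OSError:
            continue                                    # skip pages that can't be fetched
        parser = PageParser()
        parser.feed(html)
        archive[url] = " ".join(parser.text_chunks)     # the copy kept in the archive
        queue.extend(urljoin(url, link) for link in parser.links)
    return archive

if __name__ == "__main__":
    for url, text in crawl("https://example.com").items():
        print(url, "->", text[:60])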

With the information now all stored in one place, it can be searched far more quickly – rather than trawl all around the world to find documents containing the information searchers are looking for, it’s all under one very large roof. Combined with PageRank and later developments such as Knowledge Graph (more on this below), it then does its best to match our query with information we will find useful.
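
The speed gain comes from turning that archive into an index keyed by the words people search for. The Python sketch below builds a simple inverted index over a few made-up documents – a drastic simplification of what Google actually does, but it shows why a lookup in a local index beats trawling the Web itself.

from collections import defaultdict

def build_index(archive):
    """archive maps URL -> page text; returns word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in archive.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages containing every word in the query (a simple AND search)."""
    results = None
    for word in query.lower().split():
        pages = index.get(word, set())
        results = pages if results is None else results & pages
    return results or set()

if __name__ == "__main__":
    archive = {                                         # invented documents
        "a.html": "big data at the heart of search",
        "b.html": "how search engines index the web",
        "c.html": "big data in practice",
    }
    index = build_index(archive)
    print(search(index, "big data"))                    # {'a.html', 'c.html'}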

What Were The Results?

At the time of writing, Google accounted for 89% of all Internet search use. Between them, closest competitors Yahoo, Bing and Baidu accounted for almost all of the remaining 11%, in that order.2

What Data Was Used?

Google uses the data from its Web index to initially match queries with potentially useful results. This is augmented with data from trusted sources and other sites that have been ranked for accuracy by machine-learning algorithms designed to assess the reliability of data.

Finally, Google also mixes in information it knows about the searcher – such as their past search history and any information they have entered into a Google Plus profile – to provide a personal touch to its results.

What Are The Technical Details?

Google is said to have around 100 million gigabytes of information in its Web index, covering an estimated 35 trillion Web pages. However, this is thought to account for only 4% of the information online, with much of the rest locked away on private networks where Google’s bots can’t see it.

Its servers process 20 petabytes of information every day as it responds to search requests and serves up advertising based on the profiles it builds up of us.

The systems such as search, maps and YouTube that put Google’s massive amounts of data at our fingertips are based around Google’s own database and analytics technologies, BigTable and BigQuery. More recently, the company have also made these technologies available as cloud-based services to other businesses, in line with competitors such as Amazon and IBM.
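
For a business using these services, querying BigQuery from Python looks roughly like the sketch below. It assumes the google-cloud-bigquery client library and valid Google Cloud credentials; the project, dataset and table names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="my-example-project")     # hypothetical project ID

query = """
    SELECT search_term, COUNT(*) AS searches
    FROM `my-example-project.analytics.search_logs`         -- hypothetical dataset and table
    GROUP BY search_term
    ORDER BY searches DESC
    LIMIT 10
"""

for row in client.query(query).result():                    # runs the query job and waits for the rows
    print(row.search_term, row.searches)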

Any Challenges That Had To Be Overcome?

Google and other search engines have traditionally been limited in how helpful they can be to humans by the language barrier between people and machines.

We’ve developed programming languages based around the concept of code, which we can write in an approximation of human language mixed with mathematics, and which a computer can translate (through a program called an interpreter or compiler) into the fundamental 1s and 0s of binary – the only language computers can truly “understand”.

This is all well and good if you’re a computer programmer, but Google’s aim from the start was to put the world’s information at the fingertips of everyone, not just the technical elite. To this end, they have moved on to developing “semantic search” technology – teaching computers to treat the words they are fed not just as individual objects but to examine and interpret the relationships between them.

Google does this by bringing a huge range of other information into consideration when it tries to work out what you want. Starting in 2007, the company introduced Universal Search. This meant that whenever a query was entered, the search algorithms didn’t just scour the Web index for keywords related to your search input; they also trawled vast databases of scientific, historical, weather and financial data – and so on – to find references to what they thought you were looking for. In 2012, this evolved into the Knowledge Graph, which allowed Google to build a database comprising not just facts but also the relationships between those facts.

In 2014, this was augmented by the Knowledge Vault, which took things a step further still by implementing machine-learning algorithms to establish the reliability of facts. It does this by working out how many resources, other than the one presenting a particular piece of data as a “fact”, are in agreement. It also examines how authoritative those sites that are “in agreement” are – by seeing how regularly other sites link to them. If lots of people trust a piece of information, and link to it, then it is more likely to be trustworthy, particularly if it is linked to by sites which are themselves “high authority”, for example academic or governmental domains.
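
The sketch below illustrates that last idea in Python: score a claimed “fact” by the authority-weighted share of sources that agree with it. It is purely illustrative – not the Knowledge Vault’s actual algorithm – and the sources, authority scores and claims are all invented.

def fact_confidence(claims, authority):
    """claims maps source -> True/False (does the source agree with the fact?);
    authority maps source -> a 0-to-1 authority score (e.g. PageRank-like)."""
    agreeing = sum(authority[source] for source, agrees in claims.items() if agrees)
    total = sum(authority[source] for source in claims)
    return agreeing / total if total else 0.0

if __name__ == "__main__":
    authority = {"gov.example": 0.9, "uni.example": 0.8, "blog.example": 0.2}   # invented scores
    claims = {"gov.example": True, "uni.example": True, "blog.example": False}  # invented claims
    print(f"confidence: {fact_confidence(claims, authority):.2f}")              # prints 0.89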

The ultimate aim appears to be to build an interface between computers and humans that acts in the same way as those we have been seeing in science fiction movies, allowing us to ask a question in natural, human language and be presented with exactly the answer we need.

What Are The Key Learning Points And Takeaways?

Google became undisputed king of search by working out more efficient ways than their competitors to connect us with the data we need.

They have held onto their title by constantly innovating. They monetized their search engine by working out how to capture the data it collects from us as we browse the Web, building up vast revenues by becoming the biggest sellers of online advertising in the world. Then they used the huge resources they were accumulating to expand rapidly, identifying growth areas such as mobile and the Internet of Things (see Chapter 18, on Nest) in which to apply their data-driven business model.

In recent years, competitors such as Microsoft’s Bing and Yahoo are said to be gaining some ground, although Google is still way out ahead as the world’s most popular search engine. But with further investments by Google into new and emerging areas of tech such as driverless cars and home automation, we can expect to see ongoing innovation and probably more surprises.

REFERENCES AND FURTHER READING

  1. Statistic Brain Research Institute (2016) Total number of pages indexed by Google, http://www.statisticbrain.com/total-number-of-pages-indexed-by-google/, accessed 5 January 2016.
  2. Statista (2015) Worldwide market share of leading search engines from January 2010 to October 2015, http://www.statista.com/statistics/216573/worldwide-market-share-of-search-engines/, accessed 5 January 2016.

For more about Google, visit:

  1. http://uk.complex.com/pop-culture/2013/02/50-things-you-didnt-know-about-google/lego-server-rack
  2. http://www.amazon.co.uk/In-The-Plex-Google-Thinks/dp/1416596585