Chapter 1


The story of big data

We’ve always struggled with storing data. Not long ago, our holidays were remembered at a cost of $1 per photo. We saved only the very best TV shows and music recitals, overwriting older recordings. Our computers always ran out of memory.

Newer, cheaper technologies turned up the tap on that data flow. We bought digital cameras, and we linked our computers to networks. We saved more data on less expensive computers, but we still sorted and discarded information continuously. We were frugal with storage, but the data we stored was small enough to manage.

Data started flowing thicker and faster. Technology made it progressively easier for anyone to create data. Roll film cameras gave way to digital video cameras, even on our smartphones. We recorded videos we never replayed.

High-resolution sensors spread through scientific and industrial equipment. More documents were saved in digital format. More significantly, the internet began linking global data silos, creating challenges and opportunities we were ill-equipped to handle. The coup de grâce came with the development of crowd-sourced digital publishing, such as YouTube and Facebook, which opened the portal for anyone with a connected digital device to make nearly unlimited contributions to the world’s data stores.

But storage was only part of the challenge. While we were rationing our storage, computer scientists were rationing computer processing power. They were writing computer programs to solve problems in science and industry: helping to understand chemical reactions, predict stock market movements and minimize the cost of complicated resource scheduling problems.

Their programs could take days or weeks to finish, and only the most well-endowed organizations could purchase the powerful computers needed to solve the harder problems.

In the 1960s and again in the 1980s, computer scientists were building high hopes for advancements in the field of machine learning (ML), a type of artificial intelligence (AI), but their efforts stalled each time, largely due to limitations in data and technology.

In summary, our ability to draw value from data was severely limited by the technologies of the twentieth century.

What changed towards the start of the twenty-first century?

There were several key developments towards the start of the twenty-first century. One of the most significant originated in Google. Created to navigate the overwhelming data on the newly minted world wide web, Google was all about big data. Its researchers soon developed ways to make normal computers work together like supercomputers, and in 2003 they published these results in a paper which formed the basis for a software framework known as Hadoop. Hadoop became the bedrock on which much of the world’s initial big data efforts would be built.

The concept of ‘big data’ incubated quietly in the technology sector for nearly a decade before becoming mainstream. The breakthrough into management circles seemed to happen around 2011, when McKinsey published their report, ‘Big data: The next frontier for innovation, competition, and productivity’.2 The first public talk I gave on big data was at a designated ‘big data’ conference in London the next year (2012), produced by a media company seizing the opportunity to leverage a newly trending topic.

But even before the McKinsey paper, large data-driven companies such as eBay were already developing internal solutions for fundamental big data challenges. By the time of McKinsey’s 2011 publication, Hadoop was already five years old and the University of California at Berkeley had open-sourced their Spark framework, the Hadoop successor that leveraged inexpensive RAM to process big data much more quickly than Hadoop.

Let’s look at why data has grown so rapidly over the past few years and why the topic ‘big data’ has become so prominent.

Why so much data?

The volume of data we are committing to digital memory is undergoing explosive growth for two reasons:

  1. The proliferation of devices that generate digital data: ubiquitous personal computers and mobile phones, scientific sensors, and the literally billions of sensors across the expanding Internet of Things (IoT) (see Figure 1.1).
  2. The rapidly plummeting cost of digital storage.

The proliferation of devices that generate digital data

Technology that creates and collects data has become cheap, and it is everywhere. These computers, smartphones, cameras, RFID (radio-frequency identification), movement sensors, etc., have found their way into the hands of the mass consumer market as well as those of scientists, industries and governments. Sometimes we intentionally create data, such as when we take videos or post to websites, and sometimes we create data unintentionally, leaving a digital footprint on a webpage that we browse, or carrying smartphones that send geospatial information to network providers. Sometimes the data doesn’t relate to us at all, but is a record of machine activity or scientific phenomena. Let’s look at some of the main sources and uses of the data modern technology is generating.

Figure 1.1 Number of IoT devices by category.3

Content generation and self-publishing

What does it take to get your writing published? A few years ago, it took a printing press and a network of booksellers. With the internet, you only needed the skills to create a web page. Today, anyone with a Facebook or Twitter account can instantly publish content with worldwide reach. A similar story has played out for films and videos. Modern technology, particularly the internet, has completely changed the nature of publishing and has facilitated a massive growth in human-generated content.

Self-publishing platforms for the masses, particularly Facebook, YouTube and Twitter, threw open the floodgates of mass-produced data. Anyone could easily post content online, and the proliferation of mobile devices, particularly those capable of recording and uploading video, further lowered the barriers. Since nearly everyone now has a personal device with a high-resolution video camera and continuous internet access, the data uploads are enormous. Even children can easily upload unlimited text or video for anyone in the world to see.

YouTube, one of the most successful self-publishing platforms, is possibly the single largest consumer of corporate data storage today. Based on previously published statistics, it is estimated that YouTube is adding approximately 100 petabytes (PB) of new data per year, generated from several hundred hours of video uploaded each minute. We are also watching a tremendous amount of video online, on YouTube, Netflix and similar streaming services. Cisco recently estimated that it would take more than 5 million years to watch the amount of video that will cross global IP (internet protocol) networks each month in 2020.

Consumer activity

When I visit a website, the owner of that site can see what information I request from the site (search words, filters selected, links clicked). The site can also use the JavaScript on my browser to record how I interact with the page: when I scroll down or hover my mouse over an item. Websites use these details to better understand visitors, and a site might record details for several hundred categories of online actions (searches, clicks, scrolls, hovers, etc.). Even if I never log in and the site doesn’t know who I am, the insights are valuable. The more information the site gathers about its visitor base, the better it can optimize marketing efforts, landing pages and product mix.

Mobile devices produce even heavier digital trails. An application installed on my smartphone may have access to the device sensors, including GPS (global positioning system). Since many people always keep their smartphones near them, the phones maintain very accurate data logs of the location and activity cycles of their owner. Since the phones are typically in constant communication with cell towers and Wi-Fi routers, third parties may also see the owners’ locations. Even companies with brick-and-mortar shops are increasingly using signals from smartphones to track the physical movement of customers within their stores.

Many companies put considerable effort into analysing these digital trails, particularly e-commerce companies wanting to better understand online visitors. In the past, these companies would discard most data, storing only the key events (e.g. completed sales), but many websites are now storing all data from each online visit, allowing them to look back and ask detailed questions. The scale of this customer journey data is typically several gigabytes (GB) per day for smaller websites and several terabytes (TB) per day for larger sites. We’ll return to the benefits of analysing customer journey data in later chapters.
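For readers who want to see what reconstructing a customer visit actually involves, here is a minimal sketch in Python. It assumes the raw click events have already been extracted into records with hypothetical field names (visitor_id, timestamp, action); real pipelines do this at terabyte scale on distributed platforms, but the grouping logic is essentially the same.

```python
# A minimal sketch of customer journey reconstruction. The field names
# (visitor_id, timestamp, action) are hypothetical, and the 30-minute
# session cut-off is a common but arbitrary choice.
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def build_sessions(events):
    """Group raw click events into per-visitor sessions, ordered by time."""
    by_visitor = defaultdict(list)
    for event in events:
        by_visitor[event["visitor_id"]].append(event)

    sessions = []
    for visitor, clicks in by_visitor.items():
        clicks.sort(key=lambda e: e["timestamp"])
        current = [clicks[0]]
        for prev, nxt in zip(clicks, clicks[1:]):
            if nxt["timestamp"] - prev["timestamp"] > SESSION_GAP:
                sessions.append((visitor, current))  # close the old session
                current = []
            current.append(nxt)
        sessions.append((visitor, current))
    return sessions

events = [
    {"visitor_id": "v1", "timestamp": datetime(2017, 5, 1, 9, 0), "action": "search"},
    {"visitor_id": "v1", "timestamp": datetime(2017, 5, 1, 9, 2), "action": "click"},
    {"visitor_id": "v1", "timestamp": datetime(2017, 5, 1, 14, 0), "action": "purchase"},
]
for visitor, session in build_sessions(events):
    print(visitor, [e["action"] for e in session])  # two sessions for v1
```

Grouping clicks into sessions like this is what later allows an analyst to ask detailed questions, for example how many visits included a search but never reached a purchase.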

We are generating data even when we are offline, through our phone conversations or when moving past video cameras in shops, city streets, airports or roadways. Security companies and intelligence agencies rely heavily on such data. In fact, the largest consumer of data storage today is quite likely the United States’ National Security Agency (NSA). In August 2014, the NSA completed construction of a massive data centre in Bluffdale, Utah, codenamed Bumblehive, at a cost somewhere between 1 and 2 billion dollars. Its actual storage capacity is classified, but the governor of Utah told reporters in 2012 that it would be ‘the first facility in the world expected to gather and house a yottabyte’.

Machine data and the Internet of Things (IoT)

Machines never tire of generating data, and the number of connected machines is growing at a rapid pace. One of the more mind-blowing things you can do in the next five minutes is to check out Cisco’s Visual Networking Index™, which recently estimated that global IP traffic will reach over two zettabytes per year by 2020.

We may hit a limit in the number of mobile phones and personal computers we use, but we’ll continue adding networked processors to devices around us. This huge network of connected sensors and processors is known as the Internet of Things (IoT). It includes the smart energy meters appearing in our homes, the sensors in our cars that help us drive and sometimes communicate with our insurance companies, the sensors deployed to monitor soil, water, fauna or atmospheric conditions, the digital control systems used to monitor and optimize factory equipment, etc. The number of such devices stood at approximately 5 billion in 2015 and has been estimated to reach between 20 and 50 billion by 2020.

Scientific research

Scientists have been pushing the boundaries of data transport and data processing technologies. I’ll start with an example from particle physics.

Case study – The Large Hadron Collider (particle physics)

One of the most important recent events in physics was witnessed on 4 July 2012: the discovery of the Higgs boson particle, also known as ‘the god particle’. After 40 years of searching, researchers finally identified the particle using the Large Hadron Collider (LHC), the world’s largest machine4 (see Figure 1.2). The massive LHC lies within a tunnel 17 miles (27 km) in circumference, stretching over the Swiss–French border. Its 150 million sensors deliver data from experiments 30 million times per second. This data is further filtered to a few hundred points of interest per second. The total annual data flow reaches 50 PB, roughly the equivalent of 500 years of full HD-quality movies. It is the poster child of big data research in physics.

Figure 1.2 The world’s largest machine.5

Case study – The Square Kilometre Array (astronomy)

On the other side of the world lies the Australian Square Kilometre Array Pathfinder (ASKAP), a radio telescope array of 36 parabolic antennas, each 12 metres in diameter,6 with a combined collecting area of roughly 4000 square metres. Twelve of the 36 antennas were activated in October 2016,7 and the full 36, when commissioned, are expected to produce data at a rate of over 7.5 TB per second8 (one month’s worth of HD movies per second). Scientists are planning a larger Square Kilometre Array (SKA), which will be spread over several continents and be 100 times larger than the ASKAP. This may be the largest single data collection device ever conceived.

All of this new data presents abundant opportunities, but let’s return now to our fundamental problem, the cost of processing and storing that data.

The plummeting cost of disk storage

There are two main types of computer storage: disk (e.g. hard drive) and random access memory (RAM). Disk storage is like a filing cabinet next to your desk. There may be a lot of space, but it takes time to store and retrieve the information. RAM is like the space on top of your desk. There is less space, but you can grab what’s there very quickly. Both types of storage are important for handling big data.

Disk storage has been cheaper, so we put most data there. The cost of disk storage has been the limiting factor for data archiving. With a gigabyte (GB) of hard drive storage costing $200,000 in 1980, it’s not hard to understand why we stored so little. By 1990, the cost had dropped to $9000 per GB, still expensive but falling fast. By the year 2000, it had fallen to an amazing $10 per GB. This was a tipping point, as we’ll see. By 2017, a GB of hard drive storage cost less than 3 cents (see Figure 1.3).
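To put that decline in perspective, here is a quick back-of-the-envelope calculation in Python based on the figures quoted above; published prices vary by source, so treat the result as indicative rather than exact.

```python
# Rough average annual decline in disk cost per GB, using the figures
# quoted in the text: about $200,000 per GB in 1980 and $0.03 in 2017.
start_price, end_price = 200_000.0, 0.03   # dollars per GB
years = 2017 - 1980

annual_change = (end_price / start_price) ** (1 / years) - 1
print(f"Price fell by a factor of {start_price / end_price:,.0f}")
print(f"Average annual change: {annual_change:.1%}")   # roughly -35% per year
```

A fall of roughly a third per year, sustained for nearly four decades, is what turned storage from a carefully rationed resource into an almost negligible cost.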

This drop in storage cost brought interesting consequences. It became cheaper to store useless data than to take the time to filter and discard it (think about all the duplicate photos you’ve never deleted). We exchanged the challenge of managing scarcity for the challenge of managing over-abundant data, a fundamentally different problem. This story repeats itself across business, science and nearly every sector that relies on digital data for decisions or operations.

Figure 1.3 Historic cost of disk storage per GB (log scale).9

Online companies had previously kept a fraction of web data and discarded the rest. Now these companies are keeping all data: every search, scroll and click, stored with time stamps to allow future reconstruction of each customer visit, just in case the data might prove useful later.

But exceptionally large hard drives were still exceptionally expensive, and many companies needed these. They could not simply buy additional smaller, inexpensive drives, as the data needed to be processed in a holistic manner (you can divide a load of bricks between several cars, but you need a truck to move a piano). For organizations to take full advantage of the drop in hard drive prices, they would need to find a way to make a small army of mid-sized hard drives operate together as if they were one very large hard drive.

Google’s researchers saw the challenge and the opportunity and set about developing the solution that would eventually become Hadoop. It was a way to link many inexpensive computers and make them function like a supercomputer. Their initial solution leveraged disk storage, but attention soon turned to RAM, the faster but more expensive storage medium.
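The principle is easier to see in miniature. The sketch below, written in plain Python rather than Hadoop, splits a word-counting job across four worker processes and then merges their partial results. It is only an illustration of the ‘divide the work, then combine the answers’ pattern that Hadoop’s MapReduce applies across many machines, not Hadoop itself.

```python
# Illustrative only: the map/reduce pattern on one machine, with worker
# processes standing in for the many computers in a cluster.
from collections import Counter
from multiprocessing import Pool

def map_count(chunk_of_lines):
    """Map step: each worker counts words in its own slice of the data."""
    counts = Counter()
    for line in chunk_of_lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partial_counts):
    """Reduce step: merge the per-worker results into one final answer."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

if __name__ == "__main__":
    lines = ["big data is big", "data about data"] * 1000
    chunks = [lines[i::4] for i in range(4)]       # split the work four ways
    with Pool(4) as pool:
        partials = pool.map(map_count, chunks)     # "map" in parallel
    print(reduce_counts(partials).most_common(3))  # "reduce" to one result
```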

The plummeting cost of RAM

Disk (hard drive) storage is great for archiving data, but it is slow, requiring time for computer processors to read and write the data as they process it. If you picture working at a very small desk next to an enormous filing cabinet, constantly retrieving and refiling papers to complete your work on this small desk, you’ll quickly realize the benefits of a larger desk. RAM storage is like that desk space. It’s much faster to work with, which is a significant benefit when processing the huge volumes of high-velocity data the world is now producing. But RAM is much more expensive than disk storage. Its price was also falling, but it had more distance to cover.

Figure 1.4 Historic cost of RAM per GB (log scale).

Source: http://www.statisticbrain.com/average-historic-price-of-ram/

How much more expensive is RAM storage? In 1980, when a GB of hard drive cost $200,000, a GB of RAM cost $6 million. By the year 2000, when hard drives were at roughly $10 per GB and could be used for scalable big data solutions, a GB of RAM was still well above $1000, prohibitively expensive for large-scale applications (see Figure 1.4).

By 2010, however, RAM had fallen to $12 per GB, roughly the price at which disk storage had hit its tipping point back in 2000. It was time for UC Berkeley’s AMPLab to release a new RAM-based big data framework. This computational framework, which they called Spark, used large amounts of RAM to process big data up to 100 times faster than Hadoop’s MapReduce processing model.
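As a rough illustration of why keeping data in RAM matters, here is a minimal PySpark sketch: the dataset is cached in memory after the first pass, so later operations avoid re-reading it from disk. It assumes a working Spark installation, and the file path and column name are placeholders.

```python
# Minimal sketch, assuming PySpark is installed and a Spark cluster (or
# local mode) is available. The path and the "action" column are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ram-example").getOrCreate()

logs = spark.read.json("/data/clickstream/2017-05-01/*.json")
logs.cache()                     # ask Spark to keep the dataset in RAM

print(logs.count())              # first action reads from disk and fills the cache
print(logs.filter(logs.action == "purchase").count())   # served from memory

spark.stop()
```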

The plummeting cost of processing power

The cost of computer processing has also plummeted, bringing new opportunities to solve really hard problems and to draw value from the massive amounts of new data that we have started collecting (see Figure 1.5).

Figure 1.5 Historic cost of processing power (log scale).10

Why did big data become such a hot topic?

Over the last 15 years, we’ve come to realize that big data is an opportunity rather than a problem. McKinsey’s 2011 report spoke directly to CEOs, elaborating on the value of big data for five applications (healthcare, retail, manufacturing, the public sector and personal location data). The report predicted that big data could raise key performance indicators, such as retailers’ operating margins, by as much as 60 per cent, and it estimated hundreds of billions of dollars of added value per sector. The term ‘big data’ became the buzzword heard around the world, drawn out of the corners of technology and cast into the executive spotlight.

With so many people talking so much about a topic they understood so little, many quickly grew jaded about the subject. But big data became such a foundational concept that Gartner, which had added big data to their Hype Cycle for Emerging Technologies in 2012, made the unusual decision to remove it from the Hype Cycle entirely in 2015, acknowledging that big data had become so pervasive as to warrant henceforth being referred to simply as ‘data’ (see Figure 1.6).

Figure 1.6 Gartner Hype Cycle for Emerging Technologies, 2014.

Organizations are now heavily dependent on big data. But why such widespread adoption?

  • Early adopters, such as Google and Yahoo, risked significant investments in hardware and software development. These companies paved the way for others, demonstrating commercial success and sharing computer code.
  • The second wave of adopters did much of the hardest work. They could benefit from the examples of the early adopters and leverage some shared code but still needed to make significant investments in hardware and develop substantial internal expertise.

Today, we have reached a point where we have the role models and the tools for nearly any organization to start leveraging big data.

Let’s start with looking at some role models who have inspired us in the journey.

Successful big data pioneers

Google’s first mission statement was ‘to organize the world’s information and make it universally accessible and useful.’ Its valuation of $23 billion only eight years later demonstrated to the world the value of mastering big data.

It was Google that released the 2003 paper that formed the basis of Hadoop. In January 2006, Yahoo made the decision to implement Hadoop in their systems.11 Yahoo was also doing quite well in those days, with a stock price that had slowly tripled over the previous five years.

Around the time that Yahoo was implementing Hadoop, eBay was working to rethink how it handled the volume and variety of its customer journey data. Since 2002, eBay had been using a massively parallel processing (MPP) Teradata database for reporting and analytics. The system worked very well, but storing its web logs in their entirety was prohibitively expensive on such a proprietary system.

eBay’s infrastructure team worked to develop a solution combining several technologies and capable of storing and analysing tens of petabytes of data. This gave eBay significantly more detailed customer insights and played an important role in their platform development, translating directly into revenue gains.

Open-source software has levelled the playing field for software developers

Computers had become cheaper, but they still needed to be programmed to operate in unison if they were to handle big data (such as coordinating several small cars to move a piano, instead of one truck). Code needed to be written for basic functionality, and additional code needed to be written for more specialized tasks. This was a substantial barrier to any big data project, and it is where open-source software played such an important role.

Open-source software is software which is made freely available for anyone to use and modify (subject to some restrictions). Because big data software such as Hadoop was open-sourced, developers everywhere could share expertise and build off each other’s code.

Hadoop is one of many big data tools that have been open-sourced. As of 2017, there are roughly 100 projects related to big data or Hadoop in the Apache Software Foundation alone (we’ll discuss the Apache foundation later). Each of these projects solves a new challenge or solves an old challenge in a new way. For example, Apache Hive allows companies to use Hadoop as a large database, and Apache Kafka provides messaging between machines. New projects are continually being released to Apache, each one addressing a specific need and further lowering the barrier for subsequent entrants into the big data ecosystem.
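To make ‘messaging between machines’ a little more concrete, here is a small sketch using the community kafka-python client. It assumes a Kafka broker is running on localhost:9092; the topic name and message fields are hypothetical.

```python
# A sketch of one machine publishing readings for others to consume via
# Apache Kafka. Assumes a broker at localhost:9092 and the kafka-python
# package; "sensor-readings" and the message fields are made up for
# illustration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("sensor-readings", {"device_id": "meter-42", "kwh": 1.7})
producer.flush()   # make sure the message has actually left the building
```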

Keep in mind

Most of the technology you’ll need for extracting value from big data is already readily available. If you’re just starting out with big data, leverage as much existing technology as possible.

Affordable hardware and open-sourced software were lowering the barrier for companies to start using big data. But the problem remained that buying and setting up computers for a big data system was an expensive, complicated and risky process, and companies were uncertain how much hardware to purchase. What they needed was access to computing resources without long-term commitment.

Cloud computing has made it easy to launch and scale initiatives

Cloud computing is essentially renting all or part of an offsite computer. Many companies are already using one or more public cloud services: AWS, Azure, Google Cloud, or a local provider. Some companies maintain private clouds, which are computing resources that are maintained centrally within the company and made available to business units on demand. Such private clouds allow efficient use of shared resources.

Cloud computing can provide hardware or software solutions. Salesforce began in 1999 as a Software as a Service (SaaS) provider, an early form of cloud computing. Amazon Web Services (AWS) launched its Infrastructure as a Service (IaaS) offering in 2006, first renting storage and, a few months later, entire servers. Microsoft launched its cloud computing platform, Azure, in 2010, and Google launched Google Cloud in 2011.

Cloud computing solved a pain point for companies uncertain of their computing and storage needs. It allowed companies to undertake big data initiatives without the need for large capital expenditures, and it allowed them to immediately scale existing initiatives up or down. In addition, companies could move the cost of big data infrastructure from CapEx to OpEx.
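As a rough sketch of what ‘renting a computer’ looks like in practice, the following Python snippet uses AWS’s boto3 library to start and then release a single small server. It assumes AWS credentials are already configured, and the machine image ID is a placeholder.

```python
# A sketch of on-demand infrastructure with boto3; credentials are assumed
# to be configured, and the ImageId below is a placeholder, not a real image.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Rent one small server for an experiment...
response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # placeholder machine image
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]

# ...and hand it back when the experiment is finished, so the meter stops.
ec2.terminate_instances(InstanceIds=[instance_id])
```

The point is the elasticity: the same few lines could just as easily request a hundred servers for an afternoon and release them before the next budget meeting.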

The costs of cloud computing are falling, and faster networks allow remote machines to integrate seamlessly. Overall, cloud computing has brought agility to big data, making it possible for companies to experiment and scale without the cost, commitment and wait-time of purchasing dedicated computers.

With scalable data storage and compute power in place, the stage was set for researchers to once again revisit a technology that had stalled in the 1960s and again in the 1980s: artificial intelligence.

Takeaways

  • Modern technology has given us tools to produce much more digital information than ever before.
  • The dramatic fall in the cost of digital storage allows us to keep virtually unlimited amounts of data.
  • Technology pioneers have developed and shared software that enables us to create substantial business value from today’s data.

Ask yourself

  • How are organizations in your sector already using big data technologies? Consider your competitors as well as companies in other sectors.
  • What data would be useful to you if you could store and analyse it as you’d like? Think, for example, of traffic to your website(s), audio and video recordings, or sensor readings.
  • What is the biggest barrier to your use of big data: technology, skill sets or use-cases?