Chapter 5


Understanding the big data ecosystem

What makes data ‘big’?

When we refer to data as ‘big data’, we generally expect it to exhibit one or more of ‘the three Vs’ first listed in 2001 by analyst Doug Laney, later of Gartner: volume, velocity and variety. You might also see creative references to additional Vs, such as veracity.39

  • Volume refers to the sheer quantity of data that you store. If you store the names and addresses of your immediate family, that is data. If you store the names and addresses of everyone in your country, that is a lot of data (you might need to use a different program on your computer). If everyone in your country sends you their autobiography, that is big data. You would need to rethink how you store such data.
    I described earlier how the NSA recently completed a data centre that may reach ‘one yottabyte’ of storage40 and how YouTube is perhaps the largest non-government consumer of data storage today. This is thanks to over one billion YouTube users,41 half of whom are watching from mobile devices, and who, all told, are uploading new video content at such a rapid rate that the content uploaded on 15 March alone could include high-definition video of every single second of the life of Julius Caesar. The world continues to change rapidly, and scientists predict that we will soon be storing newly sequenced genomic data at a rate even greater than that of YouTube uploads.42

Case study – Genomic data

Biologists may soon become the largest public consumers of data storage. With the cost of sequencing a human genome now under $1000, sequencing speeds at over 10,000 giga base pairs per week, and the creation of over 1000 genomic sequencing centres spread across 50 countries, we are now seeing a doubling of stored genomic data every 7 months.
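
To get a feel for what a seven-month doubling time implies, here is a minimal back-of-the-envelope sketch in Python. The 100-petabyte starting volume and the ten-year horizon are illustrative assumptions, not figures taken from the study.

  # Illustrative only: project storage growth under a fixed doubling time.
  # The 100 PB starting point and 10-year horizon are assumed figures.
  DOUBLING_TIME_MONTHS = 7
  START_PETABYTES = 100
  HORIZON_YEARS = 10

  months = HORIZON_YEARS * 12
  doublings = months / DOUBLING_TIME_MONTHS
  projected = START_PETABYTES * 2 ** doublings

  print(f"Doublings in {HORIZON_YEARS} years: {doublings:.1f}")
  print(f"Projected volume: {projected:,.0f} PB "
        f"(about {projected / 1_000_000:.1f} zettabytes)")

Roughly seventeen doublings in a decade turns 100 petabytes into over ten zettabytes, which is why a short doubling time matters far more than the starting volume.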

Researchers at the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory (CSHL) recently published a paper42 predicting that the field of genomics will soon become the world’s largest consumer of incremental storage. They predict that as many as 2 billion people will have their full genomes sequenced over the next ten years.

In addition, new genome sequencing technologies are revealing previously unimagined levels of genome variation, particularly in cancers, meaning that researchers may eventually sequence and store thousands of genomes per individual.

  • Velocity refers to how rapidly data accumulates. Processing 100,000 product search requests on your webshop over the course of an hour is very different from processing those requests in a fraction of a second.
    Earlier I introduced the Square Kilometre Array (SKA), a next-generation radio telescope designed to have 50 times the sensitivity and 10,000 times the survey speed of other imaging instruments.43 Once completed, it will acquire an amazing 750 terabytes of sample image data per second.44 That data flow would fill the storage of an average laptop 500 times in the time it takes to blink and would be enough to fill every laptop in Paris in the span of a Parisian lunch break. When eBay first purchased its gold-standard, massively parallel database from Teradata in 2002, its storage capacity at that time would have been filled by this SKA data in under two seconds. (A quick arithmetic sketch after this list shows how such numbers hang together.)
    Not every velocity challenge is a volume challenge. The SKA astronomers and the particle physicists at CERN discard most data after filtering it.
  • Variety refers to the type and nature of the data. Your traditional customer data has set fields such as Name, Address and Phone Number, but much of today’s data is free text, visual data, sensor readings or time-stamped events that together preserve a complex narrative. The systems you use to store and analyse such data need to be flexible enough to accommodate data whose exact form can’t be anticipated. We’ll talk about technologies that can handle such data in Chapter 8.
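
To sanity-check the velocity figures in the SKA example above, here is a quick arithmetic sketch. The blink duration and the ‘average laptop’ capacity are rough assumptions chosen to match the comparison in the text.

  # Rough arithmetic for the SKA example; blink time and laptop size are assumptions.
  SKA_TERABYTES_PER_SECOND = 750
  BLINK_SECONDS = 0.35          # a typical blink lasts roughly a third of a second
  LAPTOP_TERABYTES = 0.5        # an 'average' laptop with 500 GB of storage

  data_per_blink = SKA_TERABYTES_PER_SECOND * BLINK_SECONDS
  laptops_filled = data_per_blink / LAPTOP_TERABYTES

  print(f"Data produced per blink: {data_per_blink:.0f} TB")
  print(f"Laptops filled per blink: {laptops_filled:.0f}")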

The three Vs describe major challenges you’ll need to overcome, but they also open tremendous opportunities for you to benefit from data in ways previously not possible.

Keep in mind

As a rule of thumb, ‘big data’ refers to data challenges that could not be handled in an affordable, scalable way prior to recent developments in how we program ‘normal’ computers to work in unison.

Distributed data storage

There are three basic ways to deal with storage limitations:

  1. Buy a more expensive device with more storage, although twice the storage could mean five times the price. At some point, there is no bigger device available.
  2. Buy separate storage devices. In this case, you lose the functionality that comes from having your data in one place.
  3. Discard whatever data doesn’t fit in your system.

There was a fourth, more expensive option. Specialized vendors sold massively parallel processing (MPP) databases, consisting of purpose-built networked hardware working in unison. These could scale up by adding machines, but the cost quickly became prohibitive.

As we discussed in Chapter 1, two things changed the economics of data storage:

  1. The dramatic fall in the price of commodity computers (general purpose computers from HP, Dell, etc.), so that companies could afford to purchase small armies of them, even hundreds or thousands.
  2. The spread of open-source technologies for coordinating such computers, particularly the creation of the Hadoop software framework.45

Hadoop made it possible to scale storage costs linearly using the Hadoop Distributed File System (HDFS). You no longer needed to spend five times the money for a bigger machine with twice the storage, but could get twice the storage with two smaller machines or ten thousand times the storage with ten thousand smaller machines. As we’ll discuss later, there are now several alternatives to Hadoop’s HDFS for low-cost, scalable storage.
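
To make the idea of distributed, replicated storage concrete, here is a toy Python sketch in the spirit of HDFS: it splits a file into fixed-size blocks and assigns each block to several ‘nodes’. Real HDFS does far more (metadata services, rack awareness, failure recovery), so treat this purely as an illustration; the cluster names are invented.

  import math

  BLOCK_SIZE_MB = 128     # HDFS commonly uses 128 MB blocks
  REPLICATION = 3         # HDFS's default replication factor

  def place_blocks(file_size_mb, nodes):
      """Assign each block of a file to REPLICATION nodes, wrapping round the cluster."""
      num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
      placement = {}
      for block_id in range(num_blocks):
          placement[block_id] = [nodes[(block_id + r) % len(nodes)]
                                 for r in range(REPLICATION)]
      return placement

  cluster = [f"node-{i:02d}" for i in range(1, 11)]   # a hypothetical ten-node cluster
  for block, replicas in place_blocks(1000, cluster).items():
      print(f"block {block}: stored on {', '.join(replicas)}")

Doubling the number of nodes roughly doubles the capacity, which is the linear cost scaling described above; replication is what keeps the data safe when individual cheap machines fail.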

This shift in the economics of storage fundamentally changed our ability to work with data. Being able to ask new questions of old data that you would otherwise have discarded brings increased agility to your organization: you can analyse any historical event at any level of detail. Rather than solving a storage problem, you can focus on leveraging data for competitive advantage.

Consider that in a 2015 Dell survey,1 73 per cent of organizations reported having big data that could be analysed, while 44 per cent were still uncertain how to approach it. A similar study by Capgemini and EMC highlighted the disruptive nature of big data: 65 per cent of respondents believed they risk becoming irrelevant or uncompetitive if they do not embrace big data, 53 per cent expected increased competition from data-enabled start-ups, and 24 per cent had already seen competitors enter from adjacent sectors.32

Distributed computations

New big data technologies will help you do more than store data. They will help you compute solutions much more quickly. Consider the classic problem of searching for a needle in a haystack. If you split the haystack into 1000 small piles and put 1000 people on the project, your search will go much faster. The bigger the haystack, the more you’ll benefit from this approach. Many software applications work like this, and falling hardware prices have made it very attractive to purchase (or rent) additional computers to put to work on your most important problems.
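
As a rough illustration of the haystack idea, the Python sketch below splits a search across several worker processes on a single machine. It is only a toy; frameworks such as MapReduce and Spark apply the same divide-and-search pattern across many machines.

  from multiprocessing import Pool

  def search_pile(pile):
      """Search one pile of 'hay' and return any needles found."""
      return [item for item in pile if item == "needle"]

  if __name__ == "__main__":
      haystack = ["hay"] * 1_000_000 + ["needle"] + ["hay"] * 1_000_000
      num_piles = 8
      pile_size = len(haystack) // num_piles + 1
      piles = [haystack[i:i + pile_size]
               for i in range(0, len(haystack), pile_size)]

      with Pool(processes=num_piles) as pool:
          results = pool.map(search_pile, piles)   # each worker searches one pile

      needles = [n for pile_result in results for n in pile_result]
      print(f"Found {len(needles)} needle(s) across {num_piles} piles")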

The original Hadoop framework had two core components:

  1. HDFS, Hadoop’s distributed, scalable file system; and
  2. MapReduce, a programming model for running computations across multiple computers.

MapReduce provided a method to spread certain tasks over many machines, much like the haystack illustration. MapReduce did for computations what HDFS did for storage. Computing problems that previously took days could be run in hours or minutes using normal programming languages and hardware.
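
The canonical MapReduce example is counting words. The sketch below imitates the three phases — map, shuffle and reduce — in plain Python on one machine; a real MapReduce job expresses the same two functions and lets the framework run them across a cluster.

  from collections import defaultdict

  documents = ["big data is big", "data about data"]

  # Map phase: emit a (word, 1) pair for every word in every document.
  mapped = [(word, 1) for doc in documents for word in doc.split()]

  # Shuffle phase: group all pairs by key (the word).
  grouped = defaultdict(list)
  for word, count in mapped:
      grouped[word].append(count)

  # Reduce phase: sum the counts for each word.
  word_counts = {word: sum(counts) for word, counts in grouped.items()}
  print(word_counts)   # {'big': 2, 'data': 3, 'is': 1, 'about': 1}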

MapReduce is now being overtaken in many applications by a newer framework called Spark, developed at UC Berkeley’s AMPLab and made a top-level Apache project in 2014. Spark has several advantages over MapReduce, including keeping intermediate results in memory, which lets it run up to 100 times faster for some workloads.
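
For comparison, here is roughly the same word count written for Spark using its Python API (PySpark). It is a minimal sketch, assuming Spark is installed locally; the file path is a placeholder, not a real dataset.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
  sc = spark.sparkContext

  # 'data/sample.txt' is a placeholder path; point it at any text file you have.
  counts = (
      sc.textFile("data/sample.txt")
        .flatMap(lambda line: line.split())        # map: split lines into words
        .map(lambda word: (word, 1))               # emit (word, 1) pairs
        .reduceByKey(lambda a, b: a + b)           # reduce: sum counts per word
  )

  for word, count in counts.take(10):              # inspect the first ten results
      print(word, count)

  spark.stop()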

Fast/streaming data

‘Fast data’ is high-velocity data requiring immediate reaction. Many organizations consider leveraging fast data to be more important than leveraging big data.32 Much of today’s data is both fast and big, and fast data is often seen as a subset of big data.46

Consider the benefits of analysing and acting on your data in real time, while also storing it for later use. In the process, you’ll want to combine your new streaming data with data you’ve already stored when making real-time decisions. You’ll face special challenges implementing such real-time applications, and you’ll want to refer to developments related to the lambda architecture and, more recently, Apache Beam.
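
The core idea of the lambda architecture is to answer queries by combining a slowly refreshed batch view of historical data with a fast, incremental view of the most recent events. The Python sketch below is a deliberately simplified illustration of that merge step, using made-up data structures rather than any particular product’s API.

  # Toy illustration of the lambda-architecture idea: merge a batch view
  # (recomputed, say, nightly) with a speed layer of events seen since then.
  batch_view = {"customer-42": 17, "customer-99": 3}   # purchases up to the last batch run
  speed_layer = {"customer-42": 2}                     # purchases streamed in since then

  def purchases_so_far(customer_id):
      """Answer a real-time query by combining historical and fresh counts."""
      return batch_view.get(customer_id, 0) + speed_layer.get(customer_id, 0)

  print(purchases_so_far("customer-42"))   # 19: 17 historical + 2 just streamed in
  print(purchases_so_far("customer-99"))   # 3: nothing new since the last batch run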

Why is processing streaming big data so challenging? Not only do you need to consider additional requirements in speed, bandwidth, consistency and timing, but you also often need real-time analytics to drive real-time responses, such as:

  • fraud checks during credit card purchases (a toy sketch of such a check follows this list),
  • shutting down a malfunctioning machine,
  • rerouting data/traffic/power flow through a network, or
  • customizing webpages in real time to maximize the likelihood that a shopper makes a purchase, basing your customization on their last few seconds of activity.
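
As a toy example of the first item above, here is a minimal sliding-window check in Python that flags a card when too many purchases arrive within a short interval. Real fraud detection uses far richer models; the window length and threshold here are invented for illustration.

  from collections import defaultdict, deque

  WINDOW_SECONDS = 60          # look at the last minute of activity (assumed)
  MAX_PURCHASES_IN_WINDOW = 5  # flag anything busier than this (assumed)

  recent = defaultdict(deque)  # card number -> timestamps of recent purchases

  def check_purchase(card, timestamp):
      """Return True if this purchase looks suspicious, else False."""
      window = recent[card]
      window.append(timestamp)
      while window and timestamp - window[0] > WINDOW_SECONDS:
          window.popleft()                      # drop events outside the window
      return len(window) > MAX_PURCHASES_IN_WINDOW

  # Simulate seven purchases on the same card within a few seconds.
  for t in range(7):
      print(t, check_purchase("4111-xxxx", t))
  # The sixth and seventh purchases are flagged as suspicious.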

You’ll see more streaming data with IoT (Internet of Things) technology, such as from moving vehicles or manufacturing systems. Because you’ll have strict latency (time) and bandwidth (volume) restrictions in such applications, you’ll need to make stricter choices regarding what to process in real time or to store for later analysis. This brings us to the topic of fog computing.

Fog computing/edge computing

Fog computing, also called ‘edge computing’, is processing data at the edges of a sensor network (see Figure 5.1). Such an architecture alleviates problems related to bandwidth and reliability.

If your sensor network transfers all of the collected sensor data to a central computing hub and waits for the results to be sent back for execution, it will typically be limited by the transmission rates of technologies such as LoRaWAN (Long Range Wide Area Network), which is roughly 400 times slower than your phone’s 3G cellular connection.

Such data movement may be completely unnecessary, and it introduces an additional potential point of failure, hence the push to move computing closer to the edge of the network.
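
Here is a minimal sketch of the edge-computing idea in Python: instead of forwarding every raw reading over a slow link, the edge device summarizes each batch locally and transmits only the summary plus any readings that breach a threshold. The sensor values and the threshold are invented for illustration.

  import statistics

  ALERT_THRESHOLD = 90.0    # assumed temperature limit for immediate alerts

  def process_at_edge(readings):
      """Summarize a batch of sensor readings locally; send only what matters."""
      summary = {
          "count": len(readings),
          "mean": round(statistics.mean(readings), 2),
          "max": max(readings),
      }
      alerts = [r for r in readings if r > ALERT_THRESHOLD]
      return summary, alerts     # this is all that crosses the slow network link

  readings = [71.2, 70.9, 72.4, 95.3, 71.0, 70.7]   # one batch from a sensor
  summary, alerts = process_at_edge(readings)
  print("sent upstream:", summary, alerts)   # two small objects instead of six raw readings

Over a constrained link such as LoRaWAN, sending a short summary and the occasional alert rather than every raw reading is often the difference between a workable design and an unworkable one.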

Open-source software

Open-source software is what allowed big data technology to spread so rapidly. It is impossible to talk about big data without making frequent reference to this broad ecosystem of computer code that has been made freely available for use and modification.

Figure 5.1 Fog computing landscape, image from reference 46, p. 15

History of open-source

In the early days of computing, computer code could be considered an idea or method, preventing it from being protected by copyright. In 1980, copyright law was extended in the USA to include computer programs.47

In 1983, Richard Stallman of MIT responded by launching a movement aimed at promoting free and open collaboration in software development. He created a project (1983), a manifesto (1985) and a legal framework (1989) for producing software that anyone was free to run, copy, distribute or modify, subject to a few basic conditions (most importantly, that anyone redistributing the software must pass those same freedoms on). For reasons beyond comprehension, he called this the GNU project48 (a recursive acronym for ‘GNU’s Not Unix’), and the legal framework was the first of several versions of the General Public License (GPL).

One of the foundational pieces of software to emerge from this movement was the now ubiquitous Linux operating system kernel, created by Linus Torvalds and released under the GPL in 1992; combined with the GNU tools, it forms a complete operating system. I would be exaggerating only slightly if I said that Linux is now or has at one time been used by just about every software developer on this planet.

The other ubiquitous and foundational piece of software released as open-source in the 1990s was the Apache HTTP server, which played a key role in the growth of the web. This software traces its origins back to a 1993 project involving just eight developers. In the fifteen years after its initial 1995 release, the Apache HTTP server provided the basic server functionality for over 100 million websites.49

Whereas most companies built business models around not making their software freely available and certainly not releasing source code, many software developers strongly supported using and contributing to the open-source community. Thus, both proprietary and open-source streams of software development continued to grow in parallel.

A significant turning point came in 1998, when Netscape Communications Corporation, whose browser competed with Microsoft’s Internet Explorer, announced that it was releasing its browser source code to the public.50 Open-source was now growing through both corporate and private contributions.

In 1999, the originators of the already widely used Apache HTTP server founded the Apache Software Foundation, a decentralized open-source community of developers. The Apache Software Foundation is now the primary venue for releasing open-source big data software. Hadoop became an Apache project in 2006, and much of the software that runs on top of Hadoop’s HDFS is licensed under the Apache License.

Figure 5.2 shows the growth in Apache contributors 1999–2017.

Licensing

There are several commonly used licences for open-source software, differing in their restrictions on distribution, modification, sub-licensing, code linkage and so on. The original GPL from GNU is currently at version 3. The Apache Foundation has its own licence (the Apache License, currently at version 2.0), as do organizations such as MIT (the MIT License).

Figure 5.2 Growth in Apache contributors: 1999–2017.51

Code distribution

Open-source software is typically made available as source code or compiled code. Changes to code are managed on a version control system (VCS) such as Git, hosted on platforms such as GitHub or Bitbucket. These systems provide a transparent way to view each addition or modification in the code, including who made a change and when. Software developers use contributions to such projects as bragging rights on their CVs.
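
For example, Git can show exactly who changed what and when. The short Python sketch below simply calls standard Git commands; it assumes Git is installed and that the working directory is inside a repository.

  import subprocess

  # List the last five commits of the repository: who changed what, and when.
  # Assumes 'git' is installed and the current directory is inside a repository.
  log = subprocess.run(
      ["git", "log", "-5", "--pretty=format:%h %an %ad %s", "--date=short"],
      capture_output=True, text=True, check=True,
  )
  print(log.stdout)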

Advantages of open-source

Many programmers contribute to open-source projects out of conviction that software should be free. You might wonder why a corporation would contribute code, giving away software they’ve spent significant resources on developing. Software companies themselves have wondered this. In 2001, Jim Allchin, at the time a Microsoft executive, was quoted as saying:

‘I can’t imagine something that could be worse than [open-source] for the software business and the intellectual-property business.’52

Despite the strength of this statement, Microsoft has since made several contributions to open-source.

There are various reasons you might want to open-source your company’s software, including:

  • to encourage the spread of software for which you can sell supporting services or enhanced, non-public, versions;
  • to encourage the spread of software that will promote your other revenue streams. For example, when you open-source software running on hardware or (paid) software that you produce;
  • to harness the collective development power of the open-source community in debugging and improving software you’re developing to perform an important task within your company;
  • to satisfy licensing requirements when you’ve incorporated code subject to open-source licensing in a product that you are reselling; and
  • to promote your company as technologically advanced and help attract top talent.

Keep in mind

Adding software to open-source repositories is a good way to promote yourself and your company.

Open-source for big data

Open-source has been instrumental in the big data ecosystem. The concepts behind Hadoop were originally published in a research paper in 2003, and Hadoop itself was created in 2006 as an (open-source) Apache project. Since then, dozens of software tools connected with Hadoop have been created within or moved to the Apache Foundation.

Many big data tools not related to Hadoop are also open-source, often as part of the Apache Foundation or under the Apache licence; Cassandra is a prominent example, and MongoDB is another widely used open-source database (under its own licence). Spark, the in-memory big data framework mentioned previously, was developed at UC Berkeley’s AMPLab and subsequently released as an Apache project. Neo4j, a graph database for big data, is not an Apache project but has an open-source community edition, currently under version 3 of the GPL.

There are many big data tools, both hardware and software, that are still proprietary, and there are situations in which you might prefer to use these proprietary solutions rather than open-source software. We’ll talk more about why in Chapter 9.

Cloud computing

Simply put, cloud computing is a model for sharing centralized computer resources, whether hardware or software. Cloud computing comes in a variety of forms. There are public clouds, such as AWS, Azure and Google Cloud, and there are private clouds, whereby larger corporations maintain centralized computing resources that are allocated in a flexible manner to meet the fluctuating needs of internal business units.

Cloud computing has become ubiquitous, both in personal and business use. We use cloud computing when we use Gmail or allow Apple or Google to store the photos we take on our smartphones. Companies such as Netflix depend heavily on cloud computing to run their services, as do the people using Salesforce software.

There are several reasons cloud computing is such a key part of the big data ecosystem.

  • Speed and agility: Cloud computing makes it fast and easy for you to experiment and scale up your data initiatives. Once an idea is in place, you no longer need to get approval for large hardware purchases, followed by a waiting period for installation. Instead, a credit card and minimal budget can enable you to launch a data initiative within a few minutes.
  • Reduced reliance on specialized skills: Beyond the basic setup of hardware and networking, which falls under the category of Infrastructure as a Service (IaaS), there are companies that will also provide the software layers necessary to launch a (big) data initiative. These include Platform as a Service (PaaS), which encompasses the operating system, databases, etc., and Software as a Service (SaaS), which encompasses hosted applications such as Azure ML, Gmail, Salesforce, etc. The greater the number of non-core tasks you can offload to a service provider, the more you can focus on your key differentiators.
  • Cost savings: Whether cloud computing is cheaper than maintaining hardware and services in-house depends on your use case and on changing market prices. Either way, cloud computing provides you with an opportunity to move IT costs from CapEx to OpEx.

There are several more nuanced factors you should consider with cloud computing, and I’ll return to these when I talk about governance and legal compliance in Chapter 11. For now, the most important point is the tremendous gains in agility and scalability that cloud computing provides for your big data projects.

Now that we’ve described the main topics related to big data, we’re ready to talk about how to make it practical for your organization.

Takeaways

  • Data solutions today are designed to handle high volume, variety and velocity.
  • The key to processing such data is software that distributes the storage and computation loads over many smaller computers.
  • Publicly available, open-source software has been invaluable in spreading big data technology.
  • Cloud computing is a significant enabler for companies to initiate and scale their data and analytics endeavours.

Ask yourself

  • Which software applications do you absolutely need to build in-house or purchase from a vendor rather than leveraging open-source software? For all other applications, what open-source software is available and how might you benefit from it?
  • Which of your business applications would benefit from technology allowing you to analyse and provide appropriate responses in real time? Think, for example, of real-time personalization, recommendations, marketing, route planning and system tuning.
  • Are important applications in your organization being delayed because you are moving data back and forth to process it centrally? Which of those applications could be accelerated by decentralizing computations or utilizing streaming technologies such as Spark or Flink?
  • Which parts of your IT have you not yet moved to the cloud? If you previously thought the cloud to be too risky or too expensive, which of your previous concerns are still relevant today? Cloud technology is developing quickly, and traditionally cautious companies are becoming more open to it.