When referring to data as ‘big data’, we should expect to have one or more of ‘the three Vs’ first listed in 2001 by analyst Doug Laney, then of META Group (later acquired by Gartner): volume, velocity and variety. You might also see creative references to additional Vs, such as veracity.39
Biologists may soon become the largest public consumers of data storage. With the cost of sequencing a human genome now under $1000, sequencing speeds at over 10,000 giga base pairs per week, and the creation of over 1000 genomic sequencing centres spread across 50 countries, we are now seeing a doubling of stored genomic data every 7 months.
Researchers at the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory (CSHL) recently published a paper42 predicting that the field of genomics will soon become the world’s largest consumer of incremental storage. They predict that as many as 2 billion people will have their full genomes sequenced over the next ten years.
In addition, new genome sequencing technologies are revealing previously unimagined levels of genome variation, particularly in cancers, meaning that researchers may eventually sequence and store thousands of genomes per individual.
The three Vs describe major challenges you’ll need to overcome, but they also open tremendous opportunities for you to benefit from data in ways previously not possible.
As a rule of thumb, ‘big data’ refers to data challenges that could not be handled in an affordable, scalable way prior to recent developments in how we program ‘normal’ computers to work in unison.
There are three basic ways to deal with storage limitations:
A more expensive approach came from specialized vendors, who produced massively parallel processing (MPP) databases: specially networked hardware working in unison. These systems could scale up by adding machines, but doing so quickly became very expensive.
As we discussed in Chapter 1, two things changed the economics of data storage:
Hadoop made it possible to scale storage costs linearly using the Hadoop Distributed File System (HDFS). You no longer needed to spend five times the money for a bigger machine with twice the storage; you could instead get twice the storage with two smaller machines, or ten thousand times the storage with ten thousand smaller machines. As we’ll discuss later, there are now several alternatives to HDFS for low-cost, scalable storage.
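The difference between the two cost curves can be sketched with a few lines of arithmetic. This is only an illustration using the chapter’s rough numbers (five times the money for twice the storage when scaling up one machine, versus linear cost when adding commodity machines); the unit price is invented.

```python
# Illustrative comparison of scale-up vs. scale-out storage costs.
# Assumption: doubling the storage of a single big machine multiplies
# its price by ~5, while adding another commodity machine adds its
# price linearly. UNIT_COST is a hypothetical price for one machine
# holding one unit of storage.
UNIT_COST = 1_000

def scale_up_cost(units_of_storage):
    """One ever-bigger machine: each doubling of capacity costs ~5x."""
    cost, capacity = UNIT_COST, 1
    while capacity < units_of_storage:
        cost *= 5
        capacity *= 2
    return cost

def scale_out_cost(units_of_storage):
    """HDFS-style: n units of storage = n commodity machines."""
    return UNIT_COST * units_of_storage

for n in (2, 8, 1024):
    print(n, scale_up_cost(n), scale_out_cost(n))
```

Even at modest sizes the gap is dramatic, which is why linear scale-out changed the economics of keeping historical data.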
As the economics of scaling storage changed dramatically, so did our ability to work with data. The ability to ask new questions of old data that you would otherwise have discarded brings increased agility to your organization. You can analyse any historical event at any level of detail. Rather than solving a storage problem, you can focus on leveraging data for competitive advantage.
Consider that in a 2015 Dell survey,1 73 per cent of organizations reported they had big data that could be analysed, with 44 per cent still uncertain how to approach big data. A similar study by Capgemini and EMC highlighted the disruptive nature of big data, with 65 per cent of respondents perceiving they risk becoming irrelevant and/or uncompetitive if they do not embrace big data, 53 per cent expecting to face increased competition from start-ups enabled by data and 24 per cent already experiencing entrance of competitors from adjacent sectors.32
New big data technologies will help you do more than store data. They will help you compute solutions much more quickly. Consider the classic problem of searching for a needle in a haystack. If you split the haystack into 1000 small piles and put 1000 people on the project, your search will go much faster. The bigger the haystack, the more you’ll benefit from this approach. Many software applications work like this, and falling hardware prices have made it very attractive to purchase (or rent) additional computers to put to work on your most important problems.
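The haystack idea can be sketched in a few lines. This is a minimal single-process illustration: a thread pool stands in for the cluster of machines a real big data framework would use, and the data and worker count are invented.

```python
# Split the haystack into piles and let several workers scan the
# piles concurrently. In a real cluster the piles would live on
# different machines; here a thread pool stands in for the cluster.
from concurrent.futures import ThreadPoolExecutor

def find_needles(pile):
    # Each worker scans its own small pile.
    return [item for item in pile if item == "needle"]

def parallel_search(haystack, workers=4):
    # Split the haystack into roughly equal piles, one per worker.
    size = max(1, len(haystack) // workers)
    piles = [haystack[i:i + size] for i in range(0, len(haystack), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(find_needles, piles)
    # Gather each worker's findings back together.
    return [n for found in results for n in found]

hay = ["hay"] * 9_999 + ["needle"]
print(parallel_search(hay))  # ['needle']
```

The bigger the haystack, the more piles (and machines) you can profitably add, which is exactly the property that makes this style of computing scale.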
The original Hadoop framework had two core components:
MapReduce provided a method to spread certain tasks over many machines, much like the haystack illustration. MapReduce did for computations what HDFS did for storage. Computing problems that previously took days could be run in hours or minutes using normal programming languages and hardware.
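The MapReduce pattern itself is simple enough to show in miniature. The classic example is a word count: a map phase emits (word, 1) pairs, a shuffle groups pairs by word, and a reduce phase sums the counts. This single-process sketch shows only the shape of the pattern; a real MapReduce job runs each phase across many machines.

```python
# Toy word count in the MapReduce style.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key (word).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data", "fast data"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 1, 'data': 2, 'fast': 1}
```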
MapReduce is now being overtaken in many applications by a newer framework called Spark, developed at UC Berkeley’s AMPLab; it became a top-level Apache project in 2014. Spark has several advantages over MapReduce, including running up to 100× faster in many applications.
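Much of Spark’s speed advantage comes from keeping intermediate results in memory, whereas MapReduce-style jobs write them to disk between steps. The following sketch contrasts the two patterns in plain Python; the temporary file stands in for a distributed file system, and the iterative job is invented for illustration.

```python
# Contrast disk-backed and in-memory iteration for a job that
# repeatedly transforms the same working set.
import json
import os
import tempfile

data = list(range(10_000))

def disk_style(iterations):
    # MapReduce-style: persist the working set after every step,
    # then reload it before the next step.
    path = os.path.join(tempfile.gettempdir(), "intermediate.json")
    working = data
    for _ in range(iterations):
        working = [x + 1 for x in working]
        with open(path, "w") as f:
            json.dump(working, f)       # write intermediate result
        with open(path) as f:
            working = json.load(f)      # read it back for the next step
    return working

def memory_style(iterations):
    # Spark-style: keep the working set cached in RAM across steps.
    working = data
    for _ in range(iterations):
        working = [x + 1 for x in working]
    return working

# Same answer, but the in-memory version skips all the disk round trips.
assert disk_style(3) == memory_style(3)
```

For iterative workloads such as machine learning, which pass over the same data many times, eliminating those round trips is exactly where the large speed-ups come from.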
‘Fast data’ is high-velocity data requiring immediate reaction. Many organizations consider leveraging fast data to be more important than leveraging big data.32 Much of today’s data is both fast and big, and fast data is often seen as a subset of big data.46
Consider the benefits of analysing and acting on your data in real time, while also storing it for later use. In the process, you’ll want to combine your new streaming data with data you’ve already stored when making real-time decisions. You’ll face special challenges implementing such real-time applications, and you’ll want to refer to developments related to the lambda architecture and, more recently, Apache Beam.
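The core pattern, combining each new streaming event with data you have already stored, can be sketched in a few lines. Everything here is invented for illustration: the stored customer profiles, the fraud-style rule and the 10× threshold are stand-ins for whatever historical context and decision logic your application needs.

```python
# Combine a real-time event with stored historical context to make
# an immediate decision. The profile table and threshold are
# illustrative placeholders.
stored_profiles = {
    "c1": {"avg_purchase": 50.0},
    "c2": {"avg_purchase": 500.0},
}

def handle_event(event):
    # Look up historical context for this customer (default: none known).
    profile = stored_profiles.get(event["customer"], {"avg_purchase": 0.0})
    # Flag purchases far above the customer's historical average.
    if event["amount"] > 10 * profile["avg_purchase"]:
        return "flag_for_review"
    return "approve"

stream = [
    {"customer": "c1", "amount": 600.0},
    {"customer": "c2", "amount": 600.0},
]
print([handle_event(e) for e in stream])  # ['flag_for_review', 'approve']
```

Frameworks built around the lambda architecture and Apache Beam address what this sketch ignores: doing such lookups at scale, with guarantees about latency, ordering and fault tolerance.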
Why is processing streaming big data so challenging? Not only do you need to consider additional requirements in speed, bandwidth, consistency and timing, you often also need real-time analytics to drive real-time responses.
You’ll see more streaming data with IoT (Internet of Things) technology, such as from moving vehicles or manufacturing systems. Because you’ll have strict latency (time) and bandwidth (volume) restrictions in such applications, you’ll need to make stricter choices regarding what to process in real time or to store for later analysis. This brings us to the topic of fog computing.
Fog computing, also called ‘edge computing’, is processing data at the edges of a sensor network (see Figure 5.1). Such an architecture alleviates problems related to bandwidth and reliability.
If your sensor networks transfer the collected sensor data to a central computing hub and then return the results for execution, they will typically be limited by the transmission rates of technologies such as LoRaWAN (Long Range Wide Area Network), which is roughly 400 times slower than your phone’s 3G cellular network.
Such data movement may be completely unnecessary, and it introduces an additional potential point of failure, hence the push to move computing closer to the edge of the network.
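The fog-computing trade-off can be sketched simply: rather than shipping every raw reading across a slow link, the edge node processes data locally and transmits only a compact summary plus anything urgent. The alert threshold and readings below are invented for illustration.

```python
# Process sensor readings at the edge of the network and forward
# only a small summary message to the central hub.
def edge_filter(readings, alert_threshold=80.0):
    """Summarize raw readings locally; forward only aggregates and alerts."""
    alerts = [r for r in readings if r > alert_threshold]
    summary = {
        "count": len(readings),
        "mean": sum(readings) / len(readings),
        "alerts": alerts,
    }
    return summary  # this small message is all that crosses the network

raw = [21.0, 22.5, 95.2, 20.8]           # raw readings at the sensor
msg = edge_filter(raw)
print(msg["count"], len(msg["alerts"]))  # 4 readings reduced to 1 alert
```

On a constrained link such as LoRaWAN, sending one summary instead of every reading is often the difference between a workable system and one that saturates its own network.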
Open-source software is what allowed big data technology to spread so rapidly. It is impossible to talk about big data without making frequent reference to this broad ecosystem of computer code that has been made freely available for use and modification.
In the early days of computing, computer code could be considered an idea or method, preventing it from being protected by copyright. In 1980, copyright law was extended in the USA to include computer programs.47
In 1983, Richard Stallman of MIT countered this development by establishing a movement aimed at promoting free and open collaboration in software development. He created a project (1983), a manifesto (1985) and a legal framework (1989) for producing software that anyone was free to run, copy, distribute or modify, subject to a few basic conditions (chiefly that the same freedoms be preserved in any copies or modified versions passed on). He called this the GNU project48 (a recursive acronym for ‘GNU’s Not Unix’), and the legal framework became the first of several versions of the General Public License (GPL).
The GNU project supplied many components of a complete free operating system, and when Linus Torvalds released his Linux kernel under the GPL in 1992, the combination became the now ubiquitous operating system known as Linux. I would be exaggerating only slightly if I said that Linux is now or has at one time been used by just about every software developer on this planet.
The other ubiquitous and functionally foundational piece of software released as open-source in the 1990s was the Apache HTTP server, which played a key role in the growth of the web. This software traces its origins back to a 1993 project involving just eight developers. In the fifteen years after its initial 1995 release, the Apache HTTP server provided the basic server functionality for over 100 million websites.49
Whereas most companies built business models around not making their software freely available and certainly not releasing source code, many software developers strongly supported using and contributing to the open-source community. Thus, both proprietary and open-source streams of software development continued to grow in parallel.
Soon something very interesting happened, marking a significant turning point in open-source. In 1998, Netscape Communications Corporation, which had developed a browser competing with Microsoft’s Internet Explorer, announced that it was releasing its browser source code to the public.50 Open-source was now growing from both corporate and private contributions.
In 1999, the originators of the already widely used Apache HTTP server founded the Apache Software Foundation, a decentralized open-source community of developers. The Apache Software Foundation is now the primary venue for releasing open-source big data software. Hadoop was released to Apache in 2006, and much of the software that runs on top of Hadoop’s HDFS is licensed under the Apache License.
Figure 5.2 shows the growth in Apache contributors 1999–2017.
There are several commonly used licenses for open-source software, differing in restrictions on distribution, modification, sub-licensing, code linkage, etc. The original GPL from GNU is currently up to version 3. The Apache Foundation has its own version, as do organizations such as MIT.
Figure 5.2 Growth in Apache contributors: 1999–2017.51
Open-source software is typically made available as source code or compiled code. Changes to code are managed on a version control system (VCS) such as Git, hosted on platforms such as GitHub or Bitbucket. These systems provide a transparent way to view each addition or modification in the code, including who made a change and when. Software developers use contributions to such projects as bragging rights on their CVs.
Many programmers contribute to open-source projects out of conviction that software should be free. You might wonder why a corporation would contribute code, giving away software they’ve spent significant resources on developing. Software companies themselves have wondered this. In 2001, Jim Allchin, at the time a Microsoft executive, was quoted as saying:
‘I can’t imagine something that could be worse than [open-source] for the software business and the intellectual-property business.’52
Despite the strength of this statement, Microsoft has since made several contributions to open-source.
There are various reasons you might want to open-source your company’s software, including:
Adding software to open-source repositories is a good way to promote yourself and your company.
Open-source has been instrumental in the big data ecosystem. The concepts behind Hadoop were originally published in a research paper in 2003, and Hadoop itself was created in 2006 as an (open-source) Apache project. Since then, dozens of software tools connected with Hadoop have been created in or moved to the Apache Foundation.
Many big data tools not related to Hadoop are either part of the Apache Foundation or licensed under the Apache licence. MongoDB and Cassandra are two prominent examples. Spark, the in-memory big data framework that we mentioned previously, was developed at UC Berkeley’s AMPLab and subsequently released as an Apache project. Neo4j, a graph database for big data, is not an Apache project but offers an open-source community edition, currently licensed under version 3 of the GPL.
There are many big data tools, both hardware and software, that are still proprietary, and there are situations in which you might prefer to use these proprietary solutions rather than open-source software. We’ll talk more about why in Chapter 9.
Simply put, cloud computing is a model for sharing centralized computer resources, whether hardware or software. Cloud computing comes in a variety of forms. There are public clouds, such as AWS, Azure and Google Cloud, and there are private clouds, whereby larger corporations maintain centralized computing resources that are allocated in a flexible manner to meet the fluctuating needs of internal business units.
Cloud computing has become ubiquitous, both in personal and business use. We use cloud computing when we use Gmail or allow Apple or Google to store the photos we take on our smartphones. Companies such as Netflix depend heavily on cloud computing to run their services, as do the people using Salesforce software.
There are several reasons cloud computing is such a key part of the big data ecosystem.
There are several more nuanced factors you should consider with cloud computing, and I’ll return to these when I talk about governance and legal compliance in Chapter 11. For now, the most important point is the tremendous gains in agility and scalability that cloud computing provides for your big data projects.
Now that we’ve described the main topics related to big data, we’re ready to talk about how to make it practical for your organization.