Chapter 16
IN THIS CHAPTER
Identifying technological trends in predictive analytics
Exploring predictive analytics as a service
Applying open-source tools
Selecting a large-scale big data framework
At a broad level, big data is the mass of data generated at every moment. It includes, for openers, data emerging in real time from sources such as social networks, emails, posts and tweets, and browsing histories.
And that list barely scratches the surface; think of big data as a growing worldwide layer of data. What to do with it all? Answer: Start by using predictive analytics as a means to extract valuable information from the mass of big data, find patterns in the data, learn new insights, and predict future outcomes.
Here's a familiar example: You might have noticed ads appearing on websites while you're browsing that “just happen” to mention products that you already intended to buy. No magic at work here; the sites that show such ads are utilizing predictive analytics to data-mine (dig up insights from) big data: your buying patterns leave a trail of valuable data online; the well-targeted ads show that someone is using that data.
This chapter shows you how predictive analytics can crack the big-data nut and extract the nutmeat. First we guide you through a number of trends in the predictive analytics market — in particular, predictive analytics as a service — that offer significant applications for businesses. You're also introduced to some aspects of predictive analytics and how it can be used to tame big data:
Traditional analytical techniques can only provide insights on the basis of historical data. Your data — both past and incoming — can provide you with a reliable predictor that can help you make better decisions to achieve your business goals. The tool for accomplishing those goals is predictive analytics.
Companies that adopt and apply this tool extensively are looking not only for insights but also for usable, forward-looking insights from multiple sources of data. Using a wide range of data, companies want to predict their customers' next actions before they occur, predict marketing failures, detect fraud, and gauge the likelihood that future business decisions will succeed.
As the use of predictive analytics has become more common and widespread, an emerging trend is (understandably) toward greater ease of use. Arguably the easiest way to use predictive analytics is as software — whether as a standalone product or as a cloud-based service provided by a company whose business is providing predictive analytics solutions for other companies.
If your company's business is to offer predictive analytics, you can provide that capability in two major ways:
A simple example can be as straightforward as these three steps:
A growing trend is to apply predictive analytics to data gathered from diverse sources. Deploying a typical predictive analytics solution in a distributed environment requires collecting data, sometimes big data, from different sources, an approach that must rely on data-management capabilities. Data needs to be collected, pre-processed, and managed before it can be considered usable for generating actionable predictions.
The architects of predictive analytics solutions must always face the problem of how to collect and process data from different data sources. Consider, for example, a company that wants to predict the success of a business decision that affects one of its products by evaluating one of the following options:
The predictive analytics architect must engineer a model that helps the company make this decision, using data about the product from different departments (as illustrated in Figure 16-1):
For example, if a user posts a review about Product X that says, “I really like Product X and I'm happy with the price,” a sentiment extractor automatically labels this comment as positive. Such tools can classify responses as “happy,” “sad,” “angry,” and so on, basing the classification on the words that an author uses in text posted online. In the case of Product X, the predictive analytics solution would need to aggregate customer reviews from external sources (such as social networks and micro-blogs) with data derived from internal company sources.
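Real sentiment extractors use far more sophisticated models than simple word matching, but the basic idea of classifying a comment by the words its author uses can be sketched in a few lines of plain Java. The word lists, class name, and labels below are hypothetical toys, not part of any real tool:

```java
import java.util.*;

public class SentimentSketch {
    // Toy word lists; a real extractor would use a trained model.
    static final Set<String> POSITIVE = Set.of("like", "love", "happy", "great");
    static final Set<String> NEGATIVE = Set.of("hate", "sad", "angry", "bad");

    // Count positive vs. negative words and label the overall sentiment.
    static String label(String text) {
        int score = 0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(w)) score++;
            if (NEGATIVE.contains(w)) score--;
        }
        return score > 0 ? "positive" : score < 0 ? "negative" : "neutral";
    }

    public static void main(String[] args) {
        System.out.println(label("I really like Product X and I'm happy with the price"));
        // positive
    }
}
```

The Product X review contains two positive words ("like" and "happy") and no negative ones, so this sketch labels it positive, mirroring the behavior described above.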
Figure 16-1 shows such an aggregation of data from multiple sources, both internal and external — from the engineering and sales divisions (internal), and from customer reviews gleaned from social networks (external data that isn't found in the data warehouse) — which is also an instance of using big data in predictive analytics. A standard tool that can be used for such aggregation is Hadoop — a big-data framework that enables building predictive analytics solutions from different sources. (More about Hadoop in the next section.)
Delivering insights as new events occur in real time is a challenging task because so much is happening so fast. Modern high-speed processing has shifted the quest for business insight away from traditional data warehousing and toward real-time processing. But the volume of data is also high — a tremendous amount of varied data, from multiple sources, generated constantly and at different rates. Companies are eager for scalable predictive analytics solutions that can derive real-time insights from a flood of data that seems to carry “the world and all it contains.”
The demand is intensifying for analyzing data in real time and generating predictions quickly. Consider the real-life example (mentioned earlier in this chapter) of encountering an online ad placement that corresponds to a purchase you were already about to make. Companies are interested in predictive analytics solutions that can provide such capabilities as the following:
In addition to encouraging buying and voting along desired lines, real-time predictive analytics can serve as a critical tool for the automatic detection of cyber-attacks. For example, RSA — a well-known American computer and network security company — has recently adopted predictive analytics and big-data visualization as part of a web-threat detection solution. The tool analyzes a large number of websites and web users' activities to predict, detect, visualize, and score (rate) cyber-threats such as denial-of-service attacks and malicious web activities such as online banking fraud.
This section highlights two major open-source Apache projects that are important and relevant to big data. The first is Apache Hadoop, a framework for storing and processing various datasets using parallel processing on a cluster of computers. The other is Apache Spark, a more recently developed computational engine that handles the distribution, scheduling, and management of large-scale data applications across the many nodes in your cluster. Spark is designed to be flexible and relatively fast, performing as much computation as possible in memory (rather than going to disk).
Apache Hadoop is a free, open-source software platform for writing and running applications that process a large amount of data. It enables a distributed parallel processing of large datasets generated from different sources. Essentially, it's a powerful tool for storing and processing big data.
Hadoop stores any type of data, structured or unstructured, from different sources, and then aggregates that data in nearly any way you want. Hadoop handles heterogeneous data using distributed parallel processing, which makes it a very efficient framework to use in analytic software dealing with big data. No wonder some large companies have adopted Hadoop, including Facebook, Yahoo!, IBM, Twitter, and LinkedIn, some of which developed Hadoop components to solve big-data problems they were facing internally.
Before Hadoop, companies were unable to take advantage of big data, which went unanalyzed and was almost unusable. The cost of storing that data in a proprietary relational database and creating a structured format around it didn't justify the benefits of analyzing and making use of it. Hadoop, on the other hand, makes that task seamless, at a fraction of the cost, allowing companies to find valuable insights in the abundant data they have acquired and are accumulating.
The power of Hadoop lies in handling different types — in fact, any type — of data: text, speech, emails, photos, posts, tweets, you name it. Hadoop takes care of aggregating this data, in all its variety, and provides you with the ability to query all of the data at your convenience. You don't have to build a schema before you can make sense of your data; Hadoop allows you to query that data in its original format.
In addition to handling large amounts of varied data, Hadoop is fault-tolerant, using simple programs that handle the scheduling of the processing distributed over multiple machines. These programs can detect hardware failure and divert a task to another running machine. This arrangement enables Hadoop to deliver high availability, regardless of hardware failure.
Hadoop uses two main components to do its job: MapReduce and the Hadoop Distributed File System (HDFS). The two components work co-operatively:

The Map( ) function resides on the master node (a networked computer). It divides the input query or task into smaller subtasks, which it then distributes to worker nodes that process the smaller tasks and pass the answers back to the master node. The subtasks are run in parallel on multiple computers.

The Reduce( ) function collects the results of all the subtasks and combines them to produce an aggregated final result, which it returns as the answer to the original big query.

Consider a database of two billion people, and assume you want to compute the number of social friends of Mr. X and arrange them according to their geographical locations. That's a tall order. The data for two billion people could originate in widely different sources such as social networks, email contact address lists, posts, tweets, and browsing histories, and that's just for openers. Hadoop can aggregate this huge, diverse mass of data so you can investigate it with a simple query.
You would use MapReduce programming capabilities to solve this query. Defining Map and Reduce procedures makes even this large dataset manageable. Using the tools that the Hadoop framework offers, you would create a MapReduce implementation that would do the computation as two subtasks:
Your MapReduce implementation program would run these subtasks in parallel, manage communication between the subtasks, and assemble the results. Out of two billion people, you would know who Mr. X's online friends are.
Hadoop provides a range of Map processors; which one(s) you select will depend on your infrastructure. Each of your processors will handle a certain number of records. Suppose, for example, that each processor handles one million data records. Each processor executes a Map procedure that produces multiple records of key-value pairs <G, N>, where G (the key) is the geographical location of a person (the person's country) and N (the value) is the number of contacts the person has. Suppose each Map processor produces many pairs of the form <key, value>, such as the following:
<France, 45>
<Morocco, 23>
<USA, 334>
<Morocco, 443>
<France, 8>
<Morocco, 44>
In the Reduce phase, Hadoop assigns a task to a certain number of processors: execute the Reduce procedure, which aggregates the values of identical keys to produce a final result. For this example, the Reduce implementation sums the values for each key (each geographical location). So, after the Map phase, the Reduce phase produces the following:
<Morocco, 23+44+443=510> → <Morocco, 510>
<France, 8+45=53> → <France, 53>
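The aggregation just shown can be simulated in plain Java with no Hadoop cluster, using the example pairs above. The class and method names here are hypothetical; the point is only to show how the Reduce phase groups pairs by key and sums their values:

```java
import java.util.*;
import java.util.stream.*;

public class ReduceSketch {
    // Key-value pairs emitted by the Map phase: <country, contact count>,
    // copied from the example above.
    static final List<Map.Entry<String, Integer>> MAPPED = List.of(
        Map.entry("France", 45), Map.entry("Morocco", 23),
        Map.entry("USA", 334), Map.entry("Morocco", 443),
        Map.entry("France", 8), Map.entry("Morocco", 44));

    // Reduce phase: group the pairs by key and sum the values per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
            Map.Entry::getKey, Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        System.out.println(reduce(MAPPED));
        // Morocco=510, France=53, USA=334 (map order may vary)
    }
}
```

In a real Hadoop job, the grouping and summing happen on many Reduce processors at once; here a single stream pipeline stands in for the whole phase.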
Clearly, Mr. X is a popular guy, but this was a very simple example of how MapReduce can be used. Imagine you're dealing with a large dataset on which you want to perform complex operations, such as clustering billions of documents (as mentioned in Chapter 6), where the operation and the data are just too big for a single machine to handle. Hadoop is the tool to consider.
The latest versions of Hadoop decouple the MapReduce programming model from cluster resource management by introducing a new technology, called Yarn, within the Hadoop framework.

Yarn stands for Yet Another Resource Negotiator; it's a new component in the Hadoop framework, responsible for cluster management and scheduling. Cluster resource management, as its name implies, manages the resources (such as memory and CPU) of the different clusters, including data nodes. In recent versions of Hadoop, Yarn took over cluster resource management; MapReduce is now responsible only for data processing.

Yarn was introduced to overcome major scalability limitations in earlier versions of Hadoop. Users of the earliest versions experienced major scalability and job-tracker concurrency issues, and outright malfunctions, on clusters of more than about 3,000 data nodes. Yarn handles job-scheduling functionality as a separate, generic application, which allows different processing models to coexist with MapReduce. For example, with Yarn you can run machine-learning algorithms, MapReduce programs, and stream processing side by side.
Yarn is responsible for two main tasks:
Yarn can free some of the containers if they aren't being used or their work is complete, and it allocates the freed resources to another application on client request. MapReduce in Hadoop 1.0 relied on a fixed number of Map and Reduce processes allowed to run on a single node, which wasn't the best way to maximize the utilization of the nodes in a cluster. In later versions of Hadoop, Yarn allows client applications to request resources of different memory and CPU sizes; a Yarn application has full control over the resources needed to fulfill its work.
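As a concrete illustration of this resource-based scheduling, the resources a node offers and the ceiling on what one application may request are set in the cluster configuration. The following yarn-site.xml fragment is a sketch: the property names are real Yarn settings, but the values are purely illustrative and would have to be tuned to your hardware:

```xml
<!-- yarn-site.xml (illustrative values only) -->
<configuration>
  <!-- Total memory (MB) and CPU cores this node offers to containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <!-- Largest single memory allocation a client application may request -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
</configuration>
```

With settings like these, Yarn can pack containers of different sizes onto each node instead of reserving fixed Map and Reduce slots.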
The word ecosystem was coined in 1930 by botanist Roy Clapham. It refers to the physical and biological components, including plants, rocks, minerals, soil, water, and human beings, among others, that interact as a unit. Each component has a specific role and exists to support the other components. The same analogy applies to Hadoop.

The Hadoop framework is made up of components that interact with one another and work as a unit to provide a Hadoop-based data-analytics solution.
Figure 16-2 shows an overview of the software components currently available in the Hadoop ecosystem; some of them, such as Spark and Mahout, are explained later in this chapter. Because the Hadoop ecosystem changes frequently, treat the figure as a snapshot of the components you can adopt on Hadoop for such purposes as predictive analytics, data processing, and SQL-like processing. Apache's website maintains an up-to-date list of the Hadoop ecosystem: http://hadoop.apache.org
Apache Mahout is a machine-learning library that includes large-scale versions of clustering, classification, collaborative filtering, and other data-mining algorithms that can support a large-scale predictive analytics model. A highly recommended way to process the data needed for such a model is to run Mahout in a system that's already running Hadoop (see the preceding section). Hadoop designates a master machine that orchestrates the other machines (such as Map machines and Reduce machines) employed in its distributed processing. In most cases, Mahout should be installed on the server it will run on (for example, the master node).
Imagine you have a large amount of streamed data, such as Google news articles, and you would like to cluster the articles by topic, using one of the clustering algorithms mentioned in Chapter 6. After you install Hadoop and Mahout, you can execute one of those algorithms, such as K-means, on your data.
The implementation of K-means under Mahout uses a MapReduce approach, which makes it different from the normal implementation of K-means (described in Chapter 6). Mahout subdivides the K-means algorithm into these sub-procedures:
Apache Mahout is a relatively young project, and its functionality still has plenty of room for extensions. In the meantime, Mahout already uses MapReduce to implement classification, clustering, and other machine-learning techniques, and it can do so on a large scale.
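A MapReduce-style K-means splits naturally into an assignment step (each point goes to its nearest centroid, which parallelizes like a Map) and an update step (each centroid is recomputed as the mean of its assigned points, like a Reduce). The following plain-Java sketch of one such iteration requires no Mahout or Hadoop; the class name and values are hypothetical, and real data would of course be multi-dimensional and distributed:

```java
import java.util.*;

public class KMeansIterationSketch {
    // One K-means iteration on 1-D points: assign each point to its
    // nearest centroid (the "Map" step), then recompute each centroid
    // as the mean of its assigned points (the "Reduce" step).
    static double[] iterate(double[] points, double[] centroids) {
        Map<Integer, List<Double>> clusters = new HashMap<>();
        for (double p : points) {
            int nearest = 0;
            for (int c = 1; c < centroids.length; c++)
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest]))
                    nearest = c;
            clusters.computeIfAbsent(nearest, k -> new ArrayList<>()).add(p);
        }
        double[] updated = centroids.clone();
        clusters.forEach((c, pts) -> updated[c] =
            pts.stream().mapToDouble(Double::doubleValue).average().orElse(centroids[c]));
        return updated;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        double[] centroids = {2.0, 11.0};
        System.out.println(Arrays.toString(iterate(points, centroids))); // [2.0, 11.0]
    }
}
```

In Mahout, the assignment step runs as Map tasks over partitions of the data and the centroid update runs as Reduce tasks, with the driver repeating the iteration until the centroids converge.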
For a very straightforward installation of Hadoop, adopt Cloudera's distribution. Cloudera's Hadoop Distribution (CDH) is an integrated Apache Hadoop distribution that contains pre-installed and pre-configured components for a Hadoop production environment. CDH is free and is available in several formats: Linux packages, virtual-machine images, and tar files. You can also run CDH in the cloud. Everything in CDH is installed and ready to go; just download the image type that corresponds to your preferred virtual-machine tool. To download CDH and get started on Hadoop installation, visit www.cloudera.com/downloads.html
Hadoop can run in any of three modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.
Pseudo-distributed mode: In this mode, there can be several mappers and reducers, but all daemons run on one machine (your local machine) using the HDFS protocol.
This mode is very useful for developers to mimic a Hadoop environment for testing purposes.
Originally developed at the University of California, Berkeley, in 2009, Apache Spark has quickly emerged as a big-data analytics processing engine. It's positioned to become part of the next generation of data-analytics processing engines, one that can improve on, and coexist with, Hadoop in certain circumstances.
Cloudera (one of the main players in the big-data Hadoop-distribution arena) announced that Spark would be its default selection, positioning Spark as a replacement for MapReduce in the Hadoop ecosystem for processing data-analytics workloads. Last year, IBM endorsed and supported Apache Spark as "the most important new open source project in a decade that is being defined by data" and committed to educating over a million data scientists on Spark technology.

eBay uses Spark for analyzing and aggregating logs of transactional data. OpenTable uses Spark to support extract, transform, and load (ETL) jobs (see Chapter 9); Spark is also being used to power a recommender system (see Chapter 2) that leverages MLlib, Spark's machine-learning component. Elsevier Labs is evaluating Spark to build a machine-learning pipeline to support content as a service, and many other companies are flocking to adopt Spark. So, what is Apache Spark?
Apache Spark is a relatively fast, open-source engine for general-purpose, large-scale data processing. It's a computational engine responsible for scheduling and distributing tasks, and for managing applications, across the many nodes in a computing cluster.
Apache Spark is designed to be relatively fast, to support implicit data parallelism, to serve as a general workloads platform, and to provide fault-tolerance.
Spark is an extension of the MapReduce software paradigm. It adds support for two main features:
The Spark framework runs complex data computations in memory, which makes it, in most cases, more efficient and faster than MapReduce computations running on disk. It also makes it easy to combine a heterogeneous array of data-processing types that can be part of a data-analytics pipeline (streaming, interactive queries, and iterative machine-learning algorithms).
Spark can run in a Hadoop cluster and it can integrate different data sources. It provides several application programming interfaces in Java, Scala, Python and SQL.
Because it supports multiple data-processing types and integrates with different data sources, such as Cassandra, Amazon S3, or Hadoop HBase, Spark helps mitigate the burden of maintaining separate tools for specific processing types. In fact, Spark was designed as a unified stack that encapsulates multiple integrated components.
The elegance of Spark allows data scientists to combine and invoke multiple components defined in Spark (such as Spark SQL, Spark Streaming, and MLlib for machine learning) with the same ease as invoking libraries in a software-development project. For example, a large-scale software application for an international bank was developed using the Spark framework. The application relied on the graph library provided by Spark, known as GraphX, to perform graph-based computations (such as the PageRank and Dijkstra algorithms) on a large dataset modeled as a graph. The dataset was streamed live using Spark's streaming library. Business users used the same application to query the graph-structured data and the discovered insights, thanks to the Spark SQL library, all in one software application.
Under Spark’s integration and comprehensive unified stack, the team had to maintain only one software application that offers multiple services, including analytics, ingesting data streams, and interactive queries.
Spark’s main abstraction is based on resilient distributed datasets, known as RDDs. RDDs are created by either parallelizing collections in your program or referencing a dataset in an external storage.
The term collection (also known as container) in computer programming often refers to an object that combines a set of elements into a single unit; a collection is defined and used to manipulate, store, and retrieve data. In Java, for example, you can use the Spark framework to parallelize a collection that you define in your program (line 3 in the following listing). Here's a simple example:
SparkConf conf = new SparkConf().setAppName(name).setMaster(master);
JavaSparkContext sparkContext = new JavaSparkContext(conf);
List<Integer> data = Arrays.asList(1, 2, 3, 4);
JavaRDD<Integer> distributedData = sparkContext.parallelize(data);
Lines 1 and 2 initialize the Spark configuration and context. In lines 3 and 4, a collection of four numbers (the dataset) is parallelized by SparkContext and converted into an RDD named distributedData that can be operated on in parallel. To learn more about the operations that can now be applied to the RDD, visit http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
RDDs can also be created by referencing a dataset in external storage; Spark creates RDDs from datasets residing in any storage supported by Hadoop, including text files.

The SparkContext defined in the Spark application programming interface (API) is used to create a text-file RDD by calling the textFile method. Consider the following Java code:
SparkConf conf = new SparkConf().setAppName(name).setMaster(master);
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> distributedFile = sparkContext.textFile("dataset.txt");
The data in the dataset.txt file is converted to an RDD named distributedFile, to which you can apply any of the distributed operations, such as the map and reduce operations.
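As an illustration, Spark's JavaRDD exposes map and reduce methods, so distributedFile.map(s -> s.length()).reduce((a, b) -> a + b) would compute the total number of characters in the file. The following plain-Java sketch mimics those two calls locally, with no Spark cluster required; the class name and sample lines are hypothetical:

```java
import java.util.*;

public class RddOpsAnalogue {
    // Local analogue (using java.util.stream) of the Spark calls
    //   distributedFile.map(s -> s.length()).reduce((a, b) -> a + b)
    static int totalLength(List<String> lines) {
        return lines.stream().map(String::length).reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data", "spark");
        System.out.println(totalLength(lines)); // 13
    }
}
```

The difference in Spark is that the map step runs in parallel across the cluster and the reduce step combines the partial results from each node.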
For more information about RDDs, visit http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
The main components of Apache Spark are
In addition to the component capabilities previously explained, Spark can run on several cluster managers, including Hadoop Yarn, Apache Mesos, and Spark's own standalone scheduler. The cluster manager allows Spark to scale from a few nodes to several thousand.
Data-science pioneer Amr Awadallah co-founded Cloudera, the most widely used Hadoop distribution in the industry. He was right when he said in an interview published at KDnuggets: "I have no favorites among the projects in our distribution, I love them all equally. It is the combined power of these projects, the platform, where the true power lies."
We believe that newer versions of the Hadoop framework, along with new components (including new versions of Spark), will continue to emerge, and that more new technologies will be invented; sometimes it's hard to keep up with the trends. In the meantime, Hadoop and Spark coexist, and together they can support your enterprise with a comprehensive solution.
In leadership meetings at a large organization where we were evaluating Hadoop distributions, we were overwhelmed by sales presentations highlighting the technologies on offer. Behind closed doors, however, together with leadership we decided to focus on the following points to guide our choice: