Chapter 16

Targeting Big Data

IN THIS CHAPTER

Identifying technological trends in predictive analytics

Exploring predictive analytics as a service

Applying open-source tools

Selecting a large-scale big data framework

At a broad level, big data is the mass of data generated at every moment. It includes — for openers — data emerging in real time from

  • Online social networks such as Facebook
  • Micro-blogs such as Twitter
  • Online customer-transaction data
  • Climate data gathered from sensors
  • GPS locations for every device equipped with GPS
  • User queries on search engines such as Google

And that list barely scratches the surface; think of big data as a growing worldwide layer of data. What to do with it all? Answer: Start by using predictive analytics as a means to extract valuable information from the mass of big data, find patterns in the data, learn new insights, and predict future outcomes.

Here's a familiar example: You might have noticed ads appearing on websites while you're browsing that “just happen” to mention products that you already intended to buy. No magic at work here; the sites that show such ads are utilizing predictive analytics to data-mine (dig up insights from) big data: your buying patterns leave a trail of valuable data online; the well-targeted ads show that someone is using that data.

This chapter shows you how predictive analytics can crack the big-data nut and extract the nutmeat. First we guide you through a number of trends in the predictive analytics market — in particular, predictive analytics as a service — that offer significant applications for businesses. You're also introduced to some aspects of predictive analytics and how it can be used to tame big data:

  • Utilize big data to predict future outcomes
  • Explore predictive analytics as a service
  • Apply free, open-source tools to fuse and make use of data from different sources
  • Prepare to build a predictive analytics model
  • Build and test a proof-of-concept predictive analytics model

Major Technological Trends in Predictive Analytics

Traditional analytical techniques can only provide insights on the basis of historical data. Your data — both past and incoming — can provide you with a reliable predictor that can help you make better decisions to achieve your business goals. The tool for accomplishing those goals is predictive analytics.

Companies that adopt and apply this tool extensively are looking not only for insights from historical data, but also for usable, forward-looking insights drawn from multiple sources of data. Using that wide range of data, companies want to predict a customer's next action before it occurs, anticipate marketing failures, detect fraud, and estimate the likelihood that future business decisions will succeed.

Exploring predictive analytics as a service

As the use of predictive analytics has become more common and widespread, an emerging trend is (understandably) toward greater ease of use. Arguably the easiest way to use predictive analytics is as software — whether as a standalone product or as a cloud-based service provided by a company whose business is providing predictive analytics solutions for other companies.

If your company's business is to offer predictive analytics, you can provide that capability in two major ways:

  • As a standalone software application with an easy-to-use graphical user interface: The customer buys the predictive analytics product and uses it to build customized predictive models.
  • As a cloud-based set of software tools that help the user choose a predictive model to use: The customer applies the tools according to the requirements and specifications of the project at hand and the type of data that the model will be applied to. The tools can offer predictions quickly, without involving the client in the workings of the algorithms in use or the data management involved. These services don't require the user to have prior knowledge of the inner workings of the provided models and services. Cloud-based services also allow for third-party integration with no need for client interaction, and offer a marketplace for other services.

A simple example can be as straightforward as these three steps (a rough sketch in code follows the list):

  1. A client uploads data to your servers, or chooses data that already resides in the cloud.
  2. The customer applies one of the available predictive models to that data.
  3. The customer reviews visualized insights and predictions from the results of the analysis or service.
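As a rough illustration only, here is a minimal Java sketch of those three steps against a hypothetical cloud prediction service. The URL, endpoint, model name, and CSV fields are invented for this example; a real provider would document its own API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CloudPredictionSketch {
  public static void main(String[] args) throws Exception {
    // Step 1: a small, made-up customer dataset to upload.
    String csv = "customer_id,age,days_since_last_purchase\n101,34,12\n102,51,200\n";

    // Step 2: ask the (hypothetical) service to score the data with one of its
    // hosted models; "churn-v2" is an invented model name.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://analytics.example.com/v1/predictions?model=churn-v2"))
        .header("Content-Type", "text/csv")
        .POST(HttpRequest.BodyPublishers.ofString(csv))
        .build();

    HttpClient client = HttpClient.newHttpClient();
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());

    // Step 3: in practice, the returned predictions would feed a dashboard or report.
    System.out.println(response.body());
  }
}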

Aggregating distributed data for analysis

A growing trend is to apply predictive analytics to data gathered from diverse sources. Deploying a typical predictive analytics solution in a distributed environment requires collecting data — sometimes big data — from different sources, an approach that must rely on solid data management capabilities. Data needs to be collected, pre-processed, and managed before it can be considered usable for generating actionable predictions.

The architects of predictive analytics solutions must always face the problem of how to collect and process data from different data sources. Consider, for example, a company that wants to predict the success of a business decision that affects one of its products by evaluating one of the following options:

  • To put company resources into increasing the sales volume
  • To terminate manufacture of the product
  • To change the current sales strategy for the product

The predictive analytics architect must engineer a model that helps the company make this decision, using data about the product from different departments (as illustrated in Figure 16-1):

  • Technical data: The engineering department has data about the product's specifications, its lifecycle, and the resources and time needed to produce it.
  • Sales data: The sales department has information about the product's sales volume, the number of sales per region, and profits generated by those sales.
  • Customer data from surveys, reviews, and posts: The company may have no dedicated department that analyzes how customers feel about the product. Tools exist, however, that can automatically analyze data posted online and extract the attitudes of authors, speakers, or customers toward a topic, a phenomenon, or (in this case) a product. The process is known as sentiment analysis or opinion mining.

FIGURE 16-1: Data about Product X being aggregated from different sources.

For example, if a user posts a review about Product X that says, “I really like Product X and I'm happy with the price,” a sentiment extractor automatically labels this comment as positive. Such tools can classify responses as “happy,” “sad,” “angry,” and so on, basing the classification on the words that an author uses in text posted online. In the case of Product X, the predictive analytics solution would need to aggregate customer reviews from external sources (such as social networks and micro-blogs) with data derived from internal company sources.
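As a toy illustration of the idea, here's a deliberately simplified keyword-matching labeler in Java. Real sentiment tools use much richer linguistic and statistical models; the word lists below are our own invention.

import java.util.Arrays;
import java.util.List;

public class ToySentimentExtractor {
  // Tiny, hand-picked word lists, purely for illustration.
  private static final List<String> POSITIVE = Arrays.asList("like", "love", "happy", "great");
  private static final List<String> NEGATIVE = Arrays.asList("hate", "angry", "disappointed", "poor");

  public static String label(String review) {
    String text = review.toLowerCase();
    long positives = POSITIVE.stream().filter(text::contains).count();
    long negatives = NEGATIVE.stream().filter(text::contains).count();
    if (positives > negatives) return "positive";
    if (negatives > positives) return "negative";
    return "neutral";
  }

  public static void main(String[] args) {
    // Prints "positive" for the Product X review quoted above.
    System.out.println(label("I really like Product X and I'm happy with the price"));
  }
}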

Figure 16-1 shows such an aggregation of data from multiple sources, both internal and external — from the engineering and sales divisions (internal), and from customer reviews gleaned from social networks (external data that isn't found in the data warehouse) — which is also an instance of using big data in predictive analytics. A standard tool that can be used for such aggregation is Hadoop — a big-data framework that enables you to build predictive analytics solutions on data drawn from different sources. (More about Hadoop in the next section.)

Real-time data-driven analytics

Delivering insights as new events occur in real time is a challenging task because so much is happening so fast. Modern high-speed processing has shifted the quest for business insight away from traditional data warehousing and toward real-time processing. But the volume of data is also high — a tremendous amount of varied data, from multiple sources, generated constantly and at different rates. Companies are eager for scalable predictive analytics solutions that can derive real-time insights from a flood of data that seems to carry “the world and all it contains.”

The demand is intensifying for analyzing data in real time and generating predictions quickly. Consider the real-life example (mentioned earlier in this chapter) of encountering an online ad placement that corresponds to a purchase you were already about to make. Companies are interested in predictive analytics solutions that can provide such capabilities as the following:

  • Predict — in real time — the specific ad that a site visitor would most likely click (an approach called real-time ad placement).
  • Speculate accurately on which customers are about to quit a service or product in order to target those customers with a retention campaign (customer retention and churn modeling).
  • Identify voters who can be influenced through a specific communication strategy such as a home visit, TV ad, phone call, or email. (You can imagine the impact on political campaigning.)

In addition to encouraging buying and voting along desired lines, real-time predictive analytics can serve as a critical tool for the automatic detection of cyber-attacks. For example, RSA — a well-known American computer and network security company — has recently adopted predictive analytics and big-data visualization as part of a web-threat detection solution. The tool analyzes a large number of websites and web users' activities to predict, detect, visualize, and score (rate) cyber-threats such as denial-of-service attacks and malicious web activities such as online banking fraud.

Applying Open-Source Tools to Big Data

This section highlights two major open-source Apache projects that are important and relevant to big data. The most important one is Apache Hadoop, a framework that allows you to store and process varied datasets using parallel processing on a cluster of computers bundled together. The other project is Apache Spark, a more recently developed computational engine that is responsible for distributing, scheduling, and managing large-scale data applications across many nodes in your cluster. Spark is designed to be flexible and relatively fast, in part by performing as much computation as possible in memory (rather than going to disk).

Apache Hadoop

Apache Hadoop is a free, open-source software platform for writing and running applications that process large amounts of data. It enables distributed, parallel processing of large datasets generated from different sources. Essentially, it's a powerful tool for storing and processing big data.

Hadoop stores any type of data, structured or unstructured, from different sources — and then aggregates that data in nearly any way you want. Hadoop handles heterogeneous data using distributed parallel processing — which makes it a very efficient framework to use in analytic software dealing with big data. No wonder some large companies are adopting Hadoop, including Facebook, Yahoo!, IBM, Twitter, and LinkedIn, several of which developed Hadoop components to solve big data problems they were facing internally.

Before Hadoop, companies were unable to take advantage of big data, which went unanalyzed and was nearly unusable. The cost of storing that data in a proprietary relational database and creating a structured format around it didn't justify the benefits of analyzing and making use of it. Hadoop, on the other hand, makes that task seamless — at a fraction of the cost — allowing companies to find valuable insights in the abundant data they have acquired and are accumulating.

The power of Hadoop lies in handling different types — in fact, any type — of data: text, speech, emails, photos, posts, tweets, you name it. Hadoop takes care of aggregating this data, in all its variety, and provides you with the ability to query all of the data at your convenience. You don't have to build a schema before you can make sense of your data; Hadoop allows you to query that data in its original format.

In addition to handling large amounts of varied data, Hadoop is fault-tolerant, using simple programs that handle the scheduling of the processing distributed over multiple machines. These programs can detect hardware failure and divert a task to another running machine. This arrangement enables Hadoop to deliver high availability, regardless of hardware failure.

Hadoop uses two main components to do its job: MapReduce and Hadoop Distributed File System. The two components work co-operatively:

  • MapReduce: Hadoop's implementation of MapReduce is based on Google's research on programming models to process large datasets by dividing them into small blocks of tasks. MapReduce uses distributed algorithms, on a group of computers in a cluster, to process large datasets. It consists of two functions:
    • The Map() function resides on the master node (a networked computer). It divides the input query or task into smaller subtasks, which it distributes to worker nodes that process the subtasks and pass the answers back to the master node. The subtasks run in parallel on multiple computers.
    • The Reduce() function collects the results of all the subtasks and combines them to produce an aggregated final result — which it returns as the answer to the original big query.
  • Hadoop Distributed File System (HDFS): HDFS stores data blocks and replicates them on other computers in your data center (to ensure reliability), and manages the transfer of data to the various parts of your distributed system.

Consider a database of two billion people, and assume you want to compute the number of social friends of Mr. X and arrange them according to their geographical locations. That's a tall order. The data for two billion people could originate in widely different sources such as social networks, email contact address lists, posts, tweets, browsing histories — and that's just for openers. Hadoop can aggregate this huge, diverse mass of data so you can investigate it with a simple query.

You would use MapReduce programming capabilities to solve this query. Defining Map and Reduce procedures makes even this large dataset manageable. Using the tools that the Hadoop framework offers, you would create a MapReduce implementation that would do the computation as two subtasks:

  • Compute the number of Mr. X's social friends.
  • Arrange Mr. X's friends by geographical location.

Your MapReduce implementation program would run these subtasks in parallel, manage communication between the subtasks, and assemble the results. Out of two billion people, you would know who Mr. X's online friends are.

Hadoop provides a range of Map processors; which one(s) you select will depend on your infrastructure. Each of your processors will handle a certain number of records. Suppose, for example, that each processor handles one million data records. Each processor executes a Map procedure that produces multiple records of key-value pairs <G, N>, where G (the key) is the geographical location of a person (the person's country) and N (the value) is the number of contacts the person has.

Suppose each Map processor produces many pairs of the form <key, value>, such as the following:

  • Processor Map#1: <France, 45>
  • Processor Map#2: <Morocco, 23>
  • Processor Map#3: <USA, 334>
  • Processor Map#4: <Morocco, 443>
  • Processor Map#5: <France, 8>
  • Processor Map#6: <Morocco, 44>

In the Reduce phase, Hadoop assigns a task to a certain number of processors: Execute the Reduce procedure that aggregates the values of the same keys to produce a final result. For this example the Reduce implementation sums up the count of values for each key — geographical location. So, after the Map phase, the Reduce phase produces the following:

  • <Morocco, 23+44+443=510> → <Morocco, 510>
  • <France, 45+8=53> → <France, 53>
  • <USA, 334> → <USA, 334>

Clearly, Mr. X is a popular guy — but this was a very simple example of how MapReduce can be used. Imagine you're dealing with a large dataset on which you want to perform complex operations, such as clustering billions of documents (as mentioned in Chapter 6), where the operation and the data are just too big for a single machine to handle. Hadoop is the tool to consider.
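To make the division of labor concrete, here is a minimal sketch using the standard Hadoop MapReduce Java API. It assumes each input line holds a person's name, country, and contact count (for example, person,France,45); that input format is our own illustrative choice, not something Hadoop prescribes.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContactsByCountry {

  // Map: parse "person,country,contactCount" and emit <country, contactCount>.
  public static class CountryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text country = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 3) {
        country.set(fields[1].trim());
        context.write(country, new IntWritable(Integer.parseInt(fields[2].trim())));
      }
    }
  }

  // Reduce: sum the counts for each country, producing pairs such as <Morocco, 510>.
  public static class CountryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "contacts by country");
    job.setJarByClass(ContactsByCountry.class);
    job.setMapperClass(CountryMapper.class);
    job.setCombinerClass(CountryReducer.class); // combiner computes partial sums on each node
    job.setReducerClass(CountryReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}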

The latest versions of Hadoop decouple the MapReduce programming model from cluster resource management by introducing a new technology, called Yarn, within the Hadoop framework.

Apache Yarn

Yarn stands for Yet Another Resource Negotiator; it's a newer component in the Hadoop framework that is responsible for cluster management and scheduling. Cluster resource management, as the name implies, manages the resources (such as memory and CPU) of the machines in the cluster, including the data nodes. In recent versions of Hadoop, Yarn took over cluster resource management; MapReduce is now responsible only for data processing.

technicalstuff Yarn is sometimes referred to as the operating system of Hadoop, because Yarn manages Hadoop resources and ensures high availability of Hadoop features.

Yarn was introduced to overcome major limitations of scalability in earlier versions of Hadoop. Users of the earliest versions experienced major scalability problems, job-tracker concurrency issues, and malfunctions on clusters of more than about 3,000 data nodes. Yarn handles job-scheduling functionality as a separate, generic application. The latter allows processing models other than MapReduce to coexist in the same cluster. For example, with Yarn you can run machine learning algorithms, MapReduce programs, and stream processing side by side.

Yarn is responsible for two main tasks:

  • Answering a client request by allocating processes (known as containers) that will run on physical machines
  • Managing containers that were allocated for each application

Yarn can free some of the containers if they aren't being used or if they have completed, and allocates the freed resources to another application on client request. MapReduce in Hadoop 1.0 relied on a fixed number of Map and Reduce processes allowed to run on a single node, which wasn't the best way to maximize the utilization of the nodes in a cluster. In later versions of Hadoop, Yarn allows client applications to request resources of different memory and CPU sizes, and a Yarn application has full control over the resources needed to fulfill its work.

Hadoop ecosystem at a glance

The word ecosystem was coined in 1930 by botanist Roy Clapham. It refers to the physical and biological components (plants, rocks, minerals, soil, water, and human beings, among others) that interact as a unit. Each component has a specific role and exists to support the other components. The same analogy applies to Hadoop.

The Hadoop framework is made up of components that interact with each other and work as a unit to provide a Hadoop-based data analytics solution.

Figure 16-2 shows an overview of the software components available in the Hadoop ecosystem, which you can adopt on Hadoop for such purposes as predictive analytics, data processing, and SQL-like processing. Some of them, such as Spark and Mahout, are explained later in this chapter. The Hadoop ecosystem changes frequently; Apache's website maintains an up-to-date list of related projects: http://hadoop.apache.org


FIGURE 16-2: Hadoop Ecosystem.

Apache Mahout

Apache Mahout is a machine-learning library that includes large-scale versions of the clustering, classification, collaborative filtering, and other data-mining algorithms that can support a large-scale predictive analytics model. A highly recommended way to process the data needed for such a model is to run Mahout in a system that's already running Hadoop (see the preceding section). Hadoop designates a master machine that orchestrates the other machines (such as Map machines and Reduce machines) employed in its distributed processing. In most cases, Mahout should be installed on the server it will run on (for example, master node).

Imagine you have a large amount of streamed data, such as Google News articles, and you would like to cluster it by topic, using one of the clustering algorithms mentioned in Chapter 6. After you install Hadoop and Mahout, you can execute one of those algorithms — such as K-means — on your data.

The implementation of K-means under Mahout uses a MapReduce approach, which makes it different from the normal implementation of K-means (described in Chapter 6). Mahout subdivides the K-means algorithm into the following sub-procedures (a simplified conceptual sketch follows the list):

  • KmeansMapper reads the input dataset and assigns each input point to its nearest initially selected mean (cluster representative).
  • KmeansCombiner takes all the records — <key, value> pairs — produced by KmeansMapper and produces partial sums to ease the calculation of the subsequent cluster representatives.
  • KmeansReducer receives the values produced by all the subtasks (combiners) and calculates the actual centroids of the clusters, which are the final output of K-means.
  • KmeansDriver handles the iterations of the process until all clusters have converged. The output of a given iteration, a partial clustering output, is used as the input for the next iteration. Mapping and reducing the dataset repeats until the assignment of records to clusters shows no further changes. (For more about K-means, see Chapter 6.)
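The sketch below mirrors those four roles on a single machine, for one-dimensional points, so you can see the mapper/combiner/reducer/driver split at a glance. It is a conceptual illustration only; the class and method names are ours, not Mahout's API, and a real Mahout job would run these phases as distributed MapReduce tasks.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KMeansMapReduceSketch {

  // "Mapper" role: assign each point to the index of its nearest centroid.
  static Map<Integer, List<Double>> mapPhase(List<Double> points, double[] centroids) {
    Map<Integer, List<Double>> assignments = new HashMap<>();
    for (double p : points) {
      int nearest = 0;
      for (int c = 1; c < centroids.length; c++) {
        if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) nearest = c;
      }
      assignments.computeIfAbsent(nearest, k -> new ArrayList<>()).add(p);
    }
    return assignments;
  }

  // "Combiner"/"Reducer" role: partial sums per cluster, then the new centroids.
  static double[] reducePhase(Map<Integer, List<Double>> assignments, int k) {
    double[] newCentroids = new double[k];
    for (int c = 0; c < k; c++) {
      List<Double> members = assignments.getOrDefault(c, new ArrayList<>());
      double sum = 0;                                  // the combiner's partial sum
      for (double p : members) sum += p;
      newCentroids[c] = members.isEmpty() ? 0 : sum / members.size();
    }
    return newCentroids;
  }

  public static void main(String[] args) {
    // "Driver" role: iterate map + reduce until the centroids stop moving.
    List<Double> points = Arrays.asList(1.0, 1.2, 0.8, 9.0, 9.5, 10.1);
    double[] centroids = {0.0, 5.0};                   // initial cluster representatives
    for (int iteration = 0; iteration < 10; iteration++) {
      double[] updated = reducePhase(mapPhase(points, centroids), centroids.length);
      if (Arrays.equals(updated, centroids)) break;    // converged
      centroids = updated;
    }
    System.out.println(Arrays.toString(centroids));    // prints the two cluster centers
  }
}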

Apache Mahout is a relatively young project; its functionality still has a lot of room for extension. In the meantime, Mahout already uses MapReduce to implement classification, clustering, and other machine-learning techniques — and can do so on a large scale.

Installing Hadoop

For a very straightforward installation of Hadoop, adopt Cloudera's distribution. Cloudera's Hadoop Distribution (CDH) is an integrated Apache Hadoop distribution that contains pre-installed and configured components for a Hadoop production environment. CDH is free and is available in several formats: Linux packages, virtual machine images, and tar files. You can also run CDH in the cloud. Everything in CDH is installed and ready to go; you can download the image type that corresponds to your preferred virtual machine tool. To download CDH and get started on your Hadoop installation, visit www.cloudera.com/downloads.html. Hadoop can run in any of three modes:

  • Fully-distributed mode: Normal mode with namenodes and datanodes.
  • Pseudo-distributed mode: In this mode, there can be several mappers and reducers, but all daemons run on one machine (your local machine) using HDFS protocol.

    This mode is very useful for developers to mimic a Hadoop environment for testing purposes.

  • Hadoop in standalone mode: All jobs run as one mapper and one reducer in your local file system (not HDFS). Running Hadoop in standalone mode is a practical way for developers to write and debug applications developed in the MapReduce programming paradigm.

Apache Spark

Originally developed at the University of California, Berkeley, in 2009, Apache Spark has quickly emerged as a big data analytics processing engine. It's positioned to become part of the next generation of data analytics processing engines that can improve on, and coexist with, Hadoop in certain circumstances.

Cloudera (one of the main players in the big data Hadoop distribution arena) announced that Spark was going to be its default selection, positioning Spark as a replacement for MapReduce in the Hadoop ecosystem where data analytics workloads are processed. In 2015, IBM endorsed and supported Apache Spark as “the most important new open source project in a decade that is being defined by data” and committed to educating more than a million data scientists on Spark technology.

eBay uses Spark for analyzing and aggregating logs of transactional data. OpenTable uses Spark to support extract, transform, and load (ETL) jobs (see Chapter 9); it also uses Spark to power a recommender system (see Chapter 2) that leverages Spark's MLlib component. Elsevier Labs is evaluating Spark to build a machine-learning pipeline to support content as a service, and many other companies are flocking to adopt Spark. So, what is Apache Spark?

Apache Spark is a relatively fast open-source engine for general large-scale data processing. It's a computational engine that is responsible for scheduling and distributing tasks, and for managing applications, across many nodes in the computing cluster.

Apache Spark is designed to be relatively fast, to support implicit data parallelism, to serve as a general workloads platform, and to provide fault-tolerance.

Spark is an extension of the MapReduce software paradigm. It adds support for two main features:

  • Interactive queries over your data in lieu of exploring data and performing computations that could take hours
  • Ability to process data streams

The Spark framework runs complex data computations in memory, which makes it, in most cases, more efficient and faster than MapReduce's computations running on disk. It also makes it easy to combine a heterogeneous array of data processing types that can be part of a data analytics pipeline (streaming, interactive queries, and iterative machine learning algorithms).

Spark can run in a Hadoop cluster and it can integrate different data sources. It provides several application programming interfaces in Java, Scala, Python and SQL.

Because it supports multiple data processing types and can integrate with different data sources (such as Cassandra, Amazon S3, or Hadoop HBase), Spark helps mitigate the burden of maintaining separate tools for specific processing types. In fact, Spark was designed as a unified stack that encapsulates multiple integrated components.

The elegance of Spark allows data scientists to combine and invoke the multiple components defined in Spark (such as Spark SQL, Spark Streaming, and MLlib for machine learning) with the same ease as invoking libraries in a software development project. For example, a large-scale software application for an international bank was developed using the Spark framework. The application relied on the graph library provided by Spark, known as GraphX, to perform graph-based computations (such as the PageRank and Dijkstra algorithms) on a large dataset modeled as a graph. The dataset was streamed live using Spark's streaming library. The same application was used by business users to perform queries on the graph-structured data and on the discovered insights (thanks to the Spark SQL library), all in one software application.

Thanks to Spark's comprehensive, unified stack, the team had to maintain only one software application that offers multiple services, including analytics, data-stream ingestion, and interactive queries.

Spark’s main abstraction is based on resilient distributed datasets, known as RDDs. RDDs are created by either parallelizing collections in your program or referencing a dataset in an external storage.

The term collection (also known as a container) in computer programming often refers to an object that combines a set of elements into a single unit. A collection is defined and used to apply operations that manipulate, store, and retrieve data. In Java, for example, you can use the Spark framework to parallelize a collection that you define in your program (line 3 in the following listing). Here's a simple example:

  1. SparkConf conf = new SparkConf().setAppName(name).setMaster(master);
  2. JavaSparkContext sparkContext = new JavaSparkContext(conf);
  3. List<Integer> dataset = Arrays.asList(10, 30, 60, 33);
  4. JavaRDD<Integer> distributedData = sparkContext.parallelize(dataset);

Lines 1 and 2 are used to initialize the Spark configuration and context. In lines 3 and 4, a collection of four numbers (dataset) is parallelized by the SparkContext and converted into an RDD named distributedData that can be operated on in parallel. To learn more about the operations that can now be applied to the RDD, visit http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
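As a brief illustration of such operations, and continuing from the listing above, a transformation such as map and an action such as reduce can then run in parallel on distributedData:

JavaRDD<Integer> doubled = distributedData.map(x -> x * 2);   // transformation: builds a new RDD
Integer sum = distributedData.reduce((a, b) -> a + b);        // action: 10 + 30 + 60 + 33 = 133
System.out.println("Sum of the dataset: " + sum);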

RDDs can also be created by referencing a dataset in external storage. Spark creates RDDs from datasets residing in any storage supported by Hadoop, including text files.

The SparkContext, defined in the Spark application programming interface (API), is used to create a text-file RDD by calling the textFile method. Consider the following Java code:

SparkConf conf = new SparkConf().setAppName(name).setMaster(master);

JavaSparkContext sparkContext = new JavaSparkContext(conf);

JavaRDD<String> distributedFile = sparkContext.textFile("dataset.txt");

The data in the dataset.txt file is converted to an RDD named distributedFile, to which you can apply distributed operations such as map and reduce.
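For instance, continuing from the listing above (dataset.txt and the search term are just placeholders for this sketch), you could count characters and matching lines like this:

JavaRDD<Integer> lineLengths = distributedFile.map(line -> line.length());  // map transformation
Integer totalCharacters = lineLengths.reduce((a, b) -> a + b);              // reduce action
long mentions = distributedFile.filter(line -> line.contains("Product X")).count();
System.out.println(totalCharacters + " characters; " + mentions + " lines mention Product X");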

For more information about RDDs, visit http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds

Main Components of Spark

The main components of Apache Spark are

  • Spark Core. This is the main component in the Spark framework that consists of the API for creating and applying operations on the RDDs. Spark Core also is responsible for interacting with data sources, task scheduling, fault tolerance and recovery, and managing memory.
  • Spark Streaming. This is the component that is responsible for enabling your application to process live streams of data, such as logs of transactional data. Spark Streaming was designed to offer scalability and fault-tolerance.
  • Spark SQL. This is the component that is responsible for querying structured data. It supports SQL and Hive Query Language (HQL). It can be used to run queries on a variety of data sources and formats, such as JSON and Hive tables.
  • Spark MLlib. MLlib encapsulates a set of machine learning algorithms for tasks such as classification, clustering, recommender systems, predictive-model evaluation, and dimensionality reduction. (A small MLlib example appears at the end of this section.)
  • Spark GraphX. GraphX allows you to create graphs (for example, graphs of urban data or of a social network). The GraphX component also includes algorithms that can be applied to analyze graphs, such as the PageRank algorithm.

In addition to the capabilities of the components described above, Spark can run on several cluster managers, including Hadoop Yarn, Apache Mesos, and Spark's own standalone scheduler. The cluster manager allows Spark to scale from a few nodes to several thousand.
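As a taste of the MLlib component, here is a small, self-contained Java sketch that clusters a handful of made-up two-dimensional points with MLlib's K-means; in a real pipeline the input RDD would come from HDFS or another supported store rather than a hard-coded list.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class MLlibKMeansSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // A tiny, invented dataset: two obvious groups of points.
    List<Vector> points = Arrays.asList(
        Vectors.dense(0.0, 0.0), Vectors.dense(0.5, 0.2),
        Vectors.dense(9.0, 9.5), Vectors.dense(8.7, 9.1));
    JavaRDD<Vector> data = sc.parallelize(points);

    // Train a K-means model with k = 2 clusters and at most 20 iterations.
    KMeansModel model = KMeans.train(data.rdd(), 2, 20);

    for (Vector center : model.clusterCenters()) {
      System.out.println("Cluster center: " + center);
    }
    sc.stop();
  }
}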

Choosing a framework

The data science legend and inventor Amr Awadallah co-founded Cloudera, the most widely used Hadoop distribution in the industry. He was right when he said, in an interview published on KDnuggets: “I have no favorites among the projects in our distribution, I love them all equally. It is the combined power of these projects, the platform, where the true power lies.”

We believe that newer versions of the Hadoop framework, along with new components (including new versions of Spark), will continue to emerge, and that more new technologies will be invented. Sometimes it's hard to keep up with the trends. Hadoop and Spark, for example, coexist and can together support your enterprise with a comprehensive solution.

In leadership meetings at a large organization where we were evaluating Hadoop distributions, we were overwhelmed by sales presentations highlighting the technologies on offer. Behind closed doors, however, together with leadership we decided to focus on the following points to guide our choice:

  • High availability
  • Fault tolerance
  • Scalability
  • Data availability
  • Federation
  • Multi-programming paradigm support, including MapReduce
  • Security
  • Data privacy