Chapter 17

Getting Ready for Enterprise Analytics

IN THIS CHAPTER

Preparing a high-level architecture

Introducing the world of analytics-as-a-service

Getting ready for a rapid prototype of your proof-of-value

In the world of enterprise architecture for data analytics, there are no clear standards. The design of an architecture depends on the data science problem you're addressing for your business.

This chapter introduces high-level requirements that you might need to consider for your enterprise architecture for big data.

There's also a summary of the most widely adopted tools for enterprise analytics, including RapidMiner, KNIME, Google Analytics, IBM Watson, and Microsoft Revolution R Enterprise. The end of this chapter presents the fundamentals of getting ready to build a rapid prototype for your predictive analytics efforts for your organization.

Enterprise Architecture for Big Data

Put in perspective, the goal of designing an architecture for data analytics comes down to building a framework for capturing, sorting, and analyzing big data for the purpose of discovering actionable results, as shown in Figure 17-1.


FIGURE 17-1: Thinking of the architecture that will transform big data into actionable results.

There is no one correct way to design the architectural environment for big data analytics. However, most designs need to meet the following requirements to handle the challenges big data can bring. These criteria are distributed mainly over six layers, as shown in Figure 17-2, and can be summarized as follows:

  • Your architecture should include a big data platform for storage and computation, such as Hadoop or Spark, that is capable of scaling out.
  • Your architecture should include large-scale software and big data tools capable of analyzing, storing, and retrieving big data. These can consist of the components of Spark (discussed elsewhere in this book) or the components of the Hadoop ecosystem (such as Mahout and Apache Storm). You might also want to adopt a large-scale big data tool that the data scientists in your business will use. These include Radoop from RapidMiner, IBM Watson, and many others.
  • Your architecture should support virtualization. Virtualization is an essential element of cloud computing because it allows multiple operating systems and applications to run at the same time on the same server. Because of this capability, virtualization and cloud computing often go hand in hand. You might also adopt a private cloud in your architecture. A private cloud offers the same architecture as a public cloud, except that the services in a private cloud are restricted to a certain number of users through a firewall. Amazon Elastic Compute Cloud (EC2) is one of the major providers of private cloud solutions and storage space for businesses, and it can scale as they grow.
  • Your architecture might have to offer real-time analytics if your enterprise works with fast data (data that flows in streams at a fast rate). In such a scenario, you need an infrastructure that can support the derivation of insights from data in near real time, without waiting for the data to be written to disk. For example, Apache Spark's streaming library can be combined with other components to support analytics on fast data streams (see the sketch that follows Figure 17-2).
  • Your architecture should account for big data security by creating a system of governance around who is granted access to the data and to the results. The big data security architecture should be in line with the standard security practices and policies in your organization that govern access to data sources.

FIGURE 17-2: The layers of enterprise data architecture.
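
For example, here is a minimal sketch of streaming analytics on fast data using Spark Streaming from Python (PySpark). The socket source on port 9999 and the 5-second batch interval are assumptions for illustration only; a real deployment would typically read from a message queue and feed a predictive model rather than a word count.

# A minimal sketch, assuming PySpark is installed and some process is emitting
# text lines on localhost:9999; counts words in near real time, batch by batch.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "FastDataSketch")    # two threads: one receives, one processes
ssc = StreamingContext(sc, batchDuration=5)        # analyze the stream in 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical fast data source
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
word_counts.pprint()                               # print each batch's results as they arrive

ssc.start()                                        # start receiving and processing the stream
ssc.awaitTermination()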

If you're looking for a robust tool to help you get started on data analytics without the need for expertise in the algorithms and complexities behind building predictive models, then you should try KNIME (www.knime.org), RapidMiner (http://rapidminer.com), or IBM Watson (www.ibm.com/analytics/watson-analytics), among others.

Most of the preceding tools offer a comprehensive, ready-to-use toolbox that consists of capabilities that can get you started. For example, RapidMiner has a large number of algorithms from the different stages of the predictive analytics lifecycle, so it provides a straightforward path to quickly combining and deploying analytics models.

With RapidMiner, you can quickly load and prepare your data, create and evaluate predictive models, use those data processes in your applications, and share them with your business users. With just a few clicks, you can build a predictive analytics model as simple as the one shown in Figure 17-3.


FIGURE 17-3: Drag-and-drop analytics with RapidMiner.

RapidMiner can be used by both beginners and experts. RapidMiner Studio is open-source predictive analytics software with an easy-to-use graphical interface onto which you can drag and drop algorithms for data loading, data preprocessing, predictive analytics, and model evaluation to build your data analytics process.

RapidMiner was built to provide data scientists with a comprehensive toolbox that consists of more than a thousand different operations and algorithms. Data can be loaded quickly, whether your data source is Excel, Access, MS SQL, MySQL, SPSS, Salesforce, or any other format supported by RapidMiner. In addition to data loading, predictive model building, and model evaluation, the tool also provides data visualization tools that include adjustable self-organizing maps and 3-D graphs.

RapidMiner offers an open extension application programming interface (API) that allows you to integrate your own algorithms into any pipeline built in RapidMiner. It's also compatible with many platforms and runs on all major operating systems. There's also an emerging online community of data scientists who use RapidMiner, where members share their processes and ask and answer questions.

Another easy-to-use tool that is widely used in the analytics world is KNIME, which stands for the Konstanz Information Miner. It's an open-source data analytics platform that can help you build predictive models through a data-pipelining concept. The tool offers drag-and-drop components for ETL (Extraction, Transformation, and Loading) as well as components for predictive modeling and data visualization. KNIME and RapidMiner are tools you can arm your data science team with to get started easily on building predictive models. For an excellent use case on KNIME, we invite you to read the paper “The Seven Techniques for Dimensionality Reduction,” published at www.knime.org/files/knime_seventechniquesdatadimreduction.pdf.

RapidMiner Radoop for Big Data

RapidMiner Radoop is a product by RapidMiner that extends the predictive analytics toolbox of RapidMiner Studio to run on Hadoop and Spark environments. Radoop encapsulates MapReduce, Pig, Mahout, and Spark. After you define your workflows in Radoop, the instructions are executed in the Hadoop or Spark environment, so you don't have to program the predictive models yourself; you can focus instead on model evaluation and the development of new models.

For security, Radoop supports Kerberos authentication and integrates with Apache Ranger and Apache Sentry.

For more information about RapidMiner, visit www.rapidminer.com

Analytics as a Service

Imagine living in a world where you can order customized, a la carte analytics solutions, as depicted in Figure 17-4, that fit your business needs and improve your return on investment (ROI). We almost live in that world of analytics-as-a-service (AAAS).


FIGURE 17-4: Cloud-based Analytics Services.

The AAAS world has arrived, and it spans an array of services that include storage, deployment, cloud-based analytics, A/B testing-based analytics (see Chapter 2), data-at-rest analytics, and fast data analytics. According to a recent report by Research and Markets, the AAAS market is projected to grow from $5.9 billion in 2015 to $22 billion in 2020. For more information on the report, you can visit www.researchandmarkets.com/research/nz6l2h/global

According to the same report, primary users of AAAS will include retail, healthcare, manufacturing, and government. The main players in this space are Amazon, Google, Microsoft, IBM, and Intel.

Google Analytics

Google offers a suite of products that provide analytics services: Google Prediction API, the Google Analytics 360 suite, and Google BigQuery. We give an overview of each product that you may want to adopt in your enterprise architecture.

Google Prediction API

Google offers cloud-based services for large-scale machine learning that you can integrate into your application. The machine learning services offered by Google include, but aren't limited to, sentiment analysis (also known as opinion mining) on customer data, spam email detection, fraud detection, document and email classification, and recommendation systems (see Chapter 2). Using the representational state transfer (REST) application programming interface (API), you can leverage Google's cloud services to build software applications that derive predictions of class labels or numerical values. This service can be used with different formats and data types. We encourage you to start with the basic “Hello World” of the Google Prediction API at https://cloud.google.com/prediction/docs/hello_world
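
The sketch below shows only the general shape of such a REST call from Python. The endpoint URL, model name, token, and payload fields are placeholders for illustration, not the literal Prediction API interface; the real endpoints are documented at the link above.

# Illustrative only: the general shape of a REST prediction request with the
# requests library. The URL, model name, token, and fields are placeholders.
import requests

ENDPOINT = "https://example.googleapis.com/prediction/v1/models/my-model:predict"  # placeholder URL
payload = {"input": {"csvInstance": ["This product exceeded my expectations"]}}

response = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": "Bearer YOUR_OAUTH_TOKEN"},  # obtained through Google OAuth 2.0
)
print(response.json())  # for example, a predicted label such as "positive" plus per-class scores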

For more information on pricing and other details, visit https://cloud.google.com/prediction

Google Analytics 360 suite

In March 2016, Google announced the birth of a game changer in the world of AAAS. This product, Google Analytics 360, is directed towards customer segmentation and targeting.

The Google Analytics 360 suite is a newer version of the Google Analytics product. It consists of six products; these are three of the most relevant components:

  • Google Audience Center: This component can help your marketing department find customers among similar lines of business. It serves as a data management platform (DMP). We speculate that the backbone of the data is based on Gmail, search data and traffic, and the Android platform, which is the dominant mobile operating system.
  • Google Optimize: This product is an analytics-based A/B testing platform (see Chapter 2) that is an extension of Google Experiments. Your marketers can use this product to create several versions of your site, log the behavior of customers on your site and optimize your site to target customers, and create better experiences for them.
  • Google Analytics: This product analyzes customer data from different sources to power the ad products in support of marketing campaigns.

Google BigQuery

Through a cloud-based service named Google BigQuery, you can query large datasets and apply analytics to your data hosted on Google’s machines. BigQuery allows you to load, query, view, and manage big datasets on Google’s infrastructure. BigQuery offers a REST API that will allow you to make calls in your application using Java, Python, or .NET to query your data. BigQuery also offers a web-based, user-friendly interface and command-line tool.
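
For example, here is a minimal sketch of querying one of BigQuery's public sample tables from Python, assuming you have installed the google-cloud-bigquery client library and set up credentials for a project of your own (the project ID below is a placeholder).

# A minimal sketch, assuming the google-cloud-bigquery library and valid
# Google Cloud credentials; queries a public sample table of Shakespeare's works.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")   # placeholder project ID

query = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""

for row in client.query(query).result():              # run the query and iterate over the rows
    print(row.corpus, row.total_words)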

For more information on how to get started with BigQuery, visit https://cloud.google.com/bigquery

IBM Watson

The IBM Watson Developer Cloud can offer you an array of services that can be part of your analytics toolset. IBM Watson offers services including, but not limited to, entity and entity-relationship recognition, speech-to-text conversion, and personality insights that can extract a set of features from people’s text that define their personality characteristics, among many other machine learning services.

For more information about Watson services, visit the IBM Watson services catalog at

www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/services-catalog.html

IBM Watson services can be accessed via a Representational State Transfer (REST) Application Programming Interface (API).

Bluemix is another IBM cloud platform, one where Watson-based applications can be deployed. Bluemix offers Apache Spark, BigInsights for Apache Hadoop, cloud-based NoSQL databases, time series databases, and many other services under the IBM cloud infrastructure.

Microsoft Revolution R Enterprise

If you happen to know R, think of Revolution R as an R++, a bigger R, or an R for big data analytics. Revolution Analytics, the company behind Revolution R Enterprise, was recently acquired by Microsoft. The services offered include predictive analytics algorithms and statistical models that are compatible with the R language and suitable for massive data.

Revolution R Enterprise provides a scalable R that allows programs written in R to be deployed in enterprise clusters, including Hadoop environments. Revolution Analytics also offers a free, open-source distribution of R known as Microsoft R Open. It contains the most up-to-date R packages from the R Foundation for Statistical Computing and provides a high-performance R engine based on multi-threaded processing. Revolution R Enterprise, the commercial product, adds the distributed R algorithms (known as ScaleR) that you need in the data science lifecycle. For more details, visit www.revolutionanalytics.com

Preparing for a Proof-of-Value of Predictive Analytics Prototype

This section briefly addresses the steps needed to build a prototype for a proof-of-concept and a proof-of-value predictive model. (For more details on how to build a predictive model, refer to Chapters 8 to 11.)

Prototyping for predictive analytics

A good model needs a prototype as proof of concept. To build a prototype for your predictive analytics model, start by defining a potential use case (a scenario drawn from the typical operations of your business) that illustrates the need for predictive analytics.

Fitting the model to the type of decision

Business decisions take diverse forms. As you undertake to build a predictive analytics model, you're concerned with two main types of decision:

  • Strategic decisions often aren't well defined, and focus only on the big picture. Strategic decisions have an impact on the long-term performance of your company — and correlate directly with your company's mission and objectives. Senior managers are usually the ones who make strategic decisions.
  • Operational decisions focus on specific answers to existing problems and define specific future actions. Operational decisions usually don't require a manager's approval, although they generally have a standard set of guidelines for the decision-makers to reference.

The two classes of decisions require different predictive analytics models:

  • The chief financial officer of a bank might use predictive analytics to gauge broad trends in the banking industry that require a company-wide response (strategic).
  • A clerk in the same bank might use a predictive analytics model to determine the credit-worthiness of a particular customer who is requesting a loan (operational).

With these two major types of decisions in mind, you can identify colleagues at your company who make either operational or strategic decisions. Then you can determine which type of decision is most in need of predictive analytics — and design an appropriate prototype for your model.

Defining the problem for the model to address

The basic objective of a predictive analytics model is to make your business do better. You may find that a prototype of your model works best if you apply it to operational examples and then link those results to solving broad, high-level business issues. Here is a list of general steps to follow next:

  1. Focus on the operational decisions at your company that can have a major impact on a business process. One of the biggest reasons for this focus is that operational decisions are the kinds of decisions most likely to be deployed into an automated system.
  2. Select those processes that have a direct effect on overall profitability and efficiency.
  3. Conduct one-to-one interviews with the decision-makers whose support you want to cultivate for your project. Ask about
    • The process they go through to make decisions.
    • The data they use to make decisions.
    • How predictive analytics would help them make the right decisions.
  4. Analyze the stories you gathered from the interviews, looking for insights that clearly define the problem you're trying to solve.
  5. Pick one story that defines a problem of small enough scope that your prototype model can address it. Scaling down for a prototype is harder than it seems. In the example that follows, for instance, the complete model might send the optimal promotion to each customer, while the prototype model might focus on handling only one promotion.

    For example, suppose you interviewed a marketing specialist who is struggling to decide which customers should receive an ad for a specific product. He or she has a limited budget for the ad campaign and needs to have high confidence in the decision. If predictive analytics could help focus the campaign and generate the needed confidence, then you have an appropriate problem for your prototype to address.

Defining your objectives

An effective way to state your business objectives clearly is as a bulleted list of user decisions. Then run your prototype to generate predictions and scores for each possible decision. For example, in the earlier example of an ad campaign for a specific product (call it Product X), you could list your objectives as a range of possible business decisions to be assessed:

  • Increase the sales volume of Product X.
  • Terminate manufacture of Product X.
  • Change the marketing strategy behind Product X:
    • Increase ads in a specific geographical location.
    • Increase ads for specific customers.

The predictive model evaluates these decisions according to their likelihood of future profitability. The output might indicate, for example, that the company has an 80-percent chance of increasing profit by increasing the sales volume of Product X.

Find the right data

After you've clearly stated the business objective and the problem you're willing to tackle, the next step is to collect the data that your predictive model will use. In this phase, you have to identify your data source(s). For example, if you're developing a prototype for predicting the right decision on a specific product, then you need to gather both internal and external data for that product. You shouldn't restrict the type or source of data, as long as it's relevant to the business goal.

If (say) your company is considering the introduction of a new hybrid sports car, you can contact the sales department and gather information about the sales data generated by similar products. You can contact the engineering department to find out how much the components cost (how about those longer-lasting batteries?), as well as the resources and time needed to produce the product (any retooling needed?). You might also include data about previous decisions made about a similar product (say, an overpowered convertible introduced some years ago), and their outcome (market conditions and fuel prices depressed sales).

tip You might want to consider using big data related to the product in question. For example, download customer reviews of company products, or tweets and Facebook posts where the products are mentioned. One way to do that is to use the application programming interfaces (APIs) provided by those companies. For example, if you want to gather tweets that contain a specific word, Twitter provides a set of APIs that you can use to download such tweets (see the sketch that follows). There's a limit to how much data you can capture free of charge; in some cases, you might have to pay to keep downloading the needed data from Twitter.
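
For instance, here is a minimal sketch of gathering recent tweets that mention a product, assuming you have the tweepy library installed and a bearer token from a Twitter developer account; the product name in the query is a placeholder.

# A minimal sketch, assuming tweepy is installed and you have a bearer token
# from a Twitter developer account; pulls recent English tweets about a product.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")   # placeholder credential

response = client.search_recent_tweets(
    query='"Product X" -is:retweet lang:en',   # placeholder product name; skip retweets
    max_results=100,                           # free access caps how many tweets you can pull
    tweet_fields=["created_at"],
)

for tweet in response.data or []:              # response.data is None when nothing matches
    print(tweet.created_at, tweet.text)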

When you've determined the most relevant data and the most useful source from which to get it, start storing the data you intend to use for your predictive model. Data may need to undergo some preprocessing, for which you'd use techniques mentioned in Chapter 9.

Design your model

For a prototype, your input could be a data matrix (see Chapter 6) that represents known factors derived from historical data.

Such a data matrix, when analyzed, can produce output that looks something like this:

  • 57.6 percent of customers stated they were unhappy with the product.
  • The product requires three hours on average to produce.
  • Positive sentiment on the product is 80 percent.

Inputs to the prototype model could include historical data about similar products, the corresponding decisions made about them, and the impact of those decisions on your business processes. The prototype's output would be predictions and their corresponding scores as possible actions toward attaining the objectives you've set.

To get a usable prototype, you have to employ a mixture of techniques to build the model. For example, you could use a data-clustering algorithm such as DBSCAN or k-means to build clusters like these:

  • Products that were terminated — and that decision's impact on profit
  • Products that were increased in volume and that decision's impact on profit
  • Products whose marketing strategy was changed and that decision's impact on profit

Then you could use classification algorithms such as a decision tree or Naïve Bayes (see Chapter 7) to classify or predict missing values (such as the sales profit value) for the product in question (Product X), as in the sketch that follows.
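
Here is a minimal sketch of that two-stage idea in Python with scikit-learn. The feature columns echo the sample output listed earlier (percentage of unhappy customers, production hours, positive sentiment); the numbers and decision labels are made up purely for illustration.

# A minimal sketch, assuming scikit-learn and NumPy; the data is invented.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

# Historical products: [unhappy customers %, production hours, positive sentiment %]
X = np.array([
    [57.6, 3.0, 80.0],
    [12.0, 1.5, 91.0],
    [70.2, 4.0, 35.0],
    [30.5, 2.0, 75.0],
    [65.0, 3.5, 40.0],
    [18.3, 1.0, 88.0],
])
# Decision made for each product: 0 = terminated, 1 = volume increased, 2 = marketing changed
decisions = np.array([2, 1, 0, 2, 0, 1])

# Stage 1: group similar historical products (k-means here; DBSCAN would also work)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters)

# Stage 2: train a Naive Bayes classifier on past decisions, then predict one for Product X
model = GaussianNB().fit(X, decisions)
product_x = np.array([[57.6, 3.0, 80.0]])
print("predicted decision:", model.predict(product_x))
print("decision probabilities:", model.predict_proba(product_x))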

Testing your predictive analytics model

This section introduces testing your model by using a test dataset that's similar to your training dataset. (See Chapter 10 for more on testing and evaluating your model.)

Identify your test data

To evaluate your predictive analytics model, you have to run the model over some test data that it hasn't seen yet. You could run the model over several historical datasets as input and record how many of the model's predictions turn out to be correct.

Run the model on test data

Evaluating your predictive model is an iterative process — essentially trial and error. Effective models rarely result from a mere first test. If your predictive model produces 100-percent accuracy, consider that result too good to be true; suspect something wrong with your data or your algorithms. For example, if the first algorithm you use to build your prototype is the Naïve Bayes classifier and you aren't satisfied with the predictions it gives you when you run the test data, try another algorithm such as the Nearest Neighbor classifier (see Chapter 6). Keep running other algorithms until you find the one that's most consistently and reliably predictive.
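
Here is a minimal sketch of that trial-and-error loop in Python with scikit-learn; a bundled demo dataset stands in for your own historical data, and the two classifiers echo the ones mentioned above.

# A minimal sketch, assuming scikit-learn; a bundled dataset stands in for your data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)        # stand-in for your historical dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)          # hold out 30 percent as unseen test data

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("Nearest Neighbor", KNeighborsClassifier(n_neighbors=5))]:
    clf.fit(X_train, y_train)                      # train on the training portion only
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    print(name, "accuracy on unseen test data:", round(accuracy, 3))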

During the testing, you might find out that you need to revisit the initial data that you used to build the prototype model. You might need to find more relevant data for your analysis.

remember Consider adding these items to your checklist before closing the prototype lifecycle:

  • Always verify that the steps involved in building the model are correct.
  • Always compare the output of the model on the test dataset to the actual results; doing so helps you evaluate the accuracy of your model.

The higher the confidence in the results of your predictive model, the easier it is for the stakeholders to approve its deployment.

To make sure that your model is accurate, you need to evaluate whether the model meets its business objectives. Domain experts can help you interpret the results of your model.
