Chapter 17
IN THIS CHAPTER
Preparing a high-level architecture
Introducing the world of analytics-as-a-service
Getting ready for a rapid prototype of your proof-of-value
In the world of enterprise architecture for data analytics, there are no clear standards. The design of an architecture depends on the data science problem you're addressing for your business.
This chapter introduces high-level requirements that you might need to consider for your enterprise architecture for big data.
There's also a summary of the most widely adopted tools for enterprise analytics, including RapidMiner, KNIME, Google Analytics, IBM Watson, and Microsoft Revolution R Enterprise. The end of this chapter presents the fundamentals of getting ready to build a rapid prototype for your predictive analytics efforts for your organization.
In perspective, the goal for designing an architecture for data analytics comes down to building a framework for capturing, sorting, and analyzing big data for the purpose of discovering actionable results, as shown in Figure 17-1.
There is no one correct way to design the architectural environment for big data analytics. However, most designs need to meet the following requirements to support the challenges big data can bring. These criteria can be distributed mainly over six layers, as shown in Figure 17-2, and can be summarized as follows:
If you're looking for a robust tool to help you get started on data analytics without needing expertise in the algorithms and complexities behind building predictive models, try KNIME (www.knime.org), RapidMiner (http://rapidminer.com), or IBM Watson (www.ibm.com/analytics/watson-analytics), among others.
Most of the preceding tools offer a comprehensive, ready-to-use toolbox of capabilities that can get you started. For example, RapidMiner has a large number of algorithms from different stages of the predictive analytics lifecycle, so it provides a straightforward path to quickly combining and deploying analytics models.
With RapidMiner, you can quickly load and prepare your data, create and evaluate predictive models, use data processes in your applications, and share them with your business users. With just a few clicks, you can build a predictive analytics model as simple as the one shown in Figure 17-3.
RapidMiner can be used by both beginners and experts. RapidMiner Studio is open-source predictive analytics software with an easy-to-use graphical interface where you can drag and drop components for data loading, data preprocessing, predictive analytics algorithms, and model evaluation to build your data analytics process.
RapidMiner was built to provide data scientists with a comprehensive toolbox that consists of more than a thousand different operations and algorithms. The data can be loaded quickly, regardless of whether your data source is Excel, Access, MS SQL, MySQL, SPSS, Salesforce, or any other format supported by RapidMiner. In addition to data loading, predictive model building, and model evaluation, this tool also provides data visualization tools that include adjustable self-organizing maps and 3-D graphs.
RapidMiner offers an open extension application programming interface (API) that allows you to integrate your own algorithms into any pipeline built in RapidMiner. It's also compatible with many platforms and can run on major operating systems. There is an emerging online community of data scientists that use RapidMiner where they can share their processes, and ask and answer questions.
Another easy-to-use tool that is widely used in the analytics world is KNIME, which stands for the Konstanz Information Miner. It's an open-source data analytics platform that can help you build predictive models through a data-pipelining concept. The tool offers drag-and-drop components for ETL (Extraction, Transformation, and Loading), as well as components for predictive modeling and data visualization. KNIME and RapidMiner are tools you can arm your data science team with to get started building predictive models easily. For an excellent use case on KNIME, we invite you to go over the paper "The Seven Techniques for Dimensionality Reduction," published at www.knime.org/files/knime_seventechniquesdatadimreduction.pdf.
RapidMiner Radoop is a RapidMiner product that extends the predictive analytics toolbox of RapidMiner Studio to run on Hadoop and Spark environments. Radoop encapsulates MapReduce, Pig, Mahout, and Spark. After you define your workflows in Radoop, the instructions are executed in the Hadoop or Spark environment, so you don't have to program predictive models yourself and can focus instead on model evaluation and the development of new models.
For security, Radoop supports Kerberos authentication and integrates with Apache Ranger and Apache Sentry.
For more information about RapidMiner, visit www.rapidminer.com.
Imagine living in a world where you can order customized, a la carte analytics solutions, as depicted in Figure 17-4, that fit your business needs and improve your return on investment (ROI). We almost live in that world of analytics-as-a-service (AAAS).
The AAAS world has arrived, and it spans an array of services that include storage, deployment, cloud-based analytics, A/B-testing-based analytics (see Chapter 2), data-at-rest analytics, and fast-data analytics. According to a recent report by Research and Markets, the AAAS market is projected to grow from $5.9 billion in 2015 to $22 billion in 2020. For more information on the report, you can visit www.researchandmarkets.com/research/nz6l2h/global
According to the same report, primary users of AAAS will include retail, healthcare, manufacturing, and government. The main players in this space are Amazon, Google, Microsoft, IBM, and Intel.
Google offers a suite of products that provide analytics services: Google Prediction API, the Google Analytics 360 suite, and Google BigQuery. We give an overview of each product that you may want to adopt in your enterprise architecture.
Google offers cloud-based services for large scale machine learning that can integrate into your application. The machine learning services offered by Google consist of, but aren't limited to, sentiment analysis (also known as opinion mining) on customer data, spam email detection, fraud detection, document and email classification, and recommendation systems (see Chapter 2). Using the representational state transfer (REST) Application Programming Interface (API), you can leverage Google’s cloud services to build software applications that derive predictions of class labels or numerical values. This service can be used for different formats and data types. We encourage you to start with the basic “Hello World” of Google Prediction API at https://cloud.google.com/prediction/docs/hello_world
For more information on pricing and other details, visit https://cloud.google.com/prediction
In March 2016, Google announced the birth of a game changer in the world of AAAS. This product, Google Analytics 360, is directed towards customer segmentation and targeting.
Google Analytics 360 suite is a newer version of the Google Analytics product. It consists of six products. These are three of the most relevant components:
Through a cloud-based service named Google BigQuery, you can query large datasets and apply analytics to your data hosted on Google’s machines. BigQuery allows you to load, query, view, and manage big datasets on Google’s infrastructure. BigQuery offers a REST API that will allow you to make calls in your application using Java, Python, or .NET to query your data. BigQuery also offers a web-based, user-friendly interface and command-line tool.
For more information on how to get started with BigQuery, visit https://cloud.google.com/bigquery
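To make the REST interface concrete, here is a minimal Python sketch of the shape of a BigQuery `jobs.query` call. To our understanding of the v2 REST API, the request is a POST to `https://bigquery.googleapis.com/bigquery/v2/projects/{projectId}/queries` with a JSON body, and rows come back in a nested `{"f": [{"v": ...}]}` format; a live call also needs an OAuth 2.0 bearer token, so this sketch builds the request body and parses a canned response locally. Treat the table name and sample values as invented placeholders.

```python
import json

# Build the JSON body for BigQuery's jobs.query REST method.
# A real call POSTs this (with an OAuth 2.0 Authorization header) to:
#   https://bigquery.googleapis.com/bigquery/v2/projects/{projectId}/queries
def build_query_request(sql):
    return {"query": sql, "useLegacySql": False}

# BigQuery returns each row as {"f": [{"v": value}, ...]}; flatten the
# rows into plain Python lists, using the schema for the column names.
def parse_rows(response):
    names = [field["name"] for field in response["schema"]["fields"]]
    rows = [[cell["v"] for cell in row["f"]] for row in response.get("rows", [])]
    return names, rows

# A canned response in BigQuery's wire format, standing in for a live call.
sample_response = {
    "schema": {"fields": [{"name": "product"}, {"name": "units_sold"}]},
    "rows": [
        {"f": [{"v": "Product X"}, {"v": "120"}]},
        {"f": [{"v": "Product Y"}, {"v": "95"}]},
    ],
}

body = build_query_request("SELECT product, units_sold FROM sales.summary")
names, rows = parse_rows(sample_response)
print(json.dumps(body))
print(names, rows)
```

In production you would more likely use Google's client libraries, which wrap this REST plumbing for you; the sketch is only meant to show what travels over the wire.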
The IBM Watson Developer Cloud can offer you an array of services that can be part of your analytics toolset. IBM Watson offers services including, but not limited to, entity and entity-relationship recognition, speech-to-text conversion, and personality insights that can extract a set of features from people’s text that define their personality characteristics, among many other machine learning services.
For more information about Watson services, visit the IBM Watson services catalog at www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/services-catalog.html
IBM Watson services can be accessed via a REST API.
Bluemix is another IBM cloud platform where Watson-based applications can be deployed. Bluemix offers Apache Spark, BigInsights for Apache Hadoop, cloud-based NoSQL databases, time-series databases, and many other services under the IBM cloud infrastructure.
If you happen to know R, think of Revolution R as an R++, a bigger R, or an R for big data analytics. Revolution R's maker, Revolution Analytics, was recently acquired by Microsoft. The services offered include predictive analytics algorithms and statistical models that are compatible with the R language and suitable for massive data.
Revolution R Enterprise provides a scalable R that allows programs written in R to be deployed on enterprise clusters, including Hadoop environments. Revolution Analytics also offers a free, open-source distribution known as Microsoft R Open. It contains the most up-to-date R packages from the R Foundation for Statistical Computing and provides a high-performance R engine based on multithreaded processing. Revolution R Enterprise adds the distributed R algorithms, known as ScaleR, that you will need in the data science lifecycle. For more details, visit www.revolutionanalytics.com
This section addresses briefly the steps needed to build a prototype for a proof-of-concept and a proof-of-value predictive model. (For more details on how to build a predictive model, refer to Chapters 8 to 11.)
A good model needs a prototype as proof of concept. To build a prototype for your predictive analytics model, start by defining a potential use case (a scenario drawn from the typical operations of your business) that illustrates the need for predictive analytics.
Business decisions take diverse forms. As you undertake to build a predictive analytics model, you're concerned with two main types of decision:
The two classes of decisions require different predictive analytics models:
With these two major types of decisions in mind, you can identify colleagues at your company who make either operational or strategic decisions. Then you can determine which type of decision is most in need of predictive analytics — and design an appropriate prototype for your model.
The basic objective of a predictive analytics model is to make your business do better. You may find that a prototype of your model will work best if you apply it to operational examples and then link those results to solving broad, high-level business issues. Here is a list of general steps to follow next:
Pick one story that defines a problem of small scope that your prototype model should be able to address. Scaling down for a prototype is harder than it seems. For example, in the scenario that follows, the complete model might send the optimal promotion to each customer, while the prototype model might focus on handling just one promotion.
For example, suppose you interviewed a marketing specialist who is struggling to decide which customers to send a specific product ad. He or she has a limited budget for the ad campaign, and needs to have high confidence in the decision. If predictive analytics could help focus the campaign and generate the needed confidence, then you have an appropriate problem for your prototype to address.
An effective way to state your business objectives clearly is as a bulleted list of user decisions. Then run your prototype to generate predictions and scores for each possible decision. For example, in the earlier example of Product X, you could list your objectives as a range of possible business decisions to be assessed:
The predictive model will evaluate these decisions according to their future likelihood of successful profitability. The output might indicate, for example, that the company has an 80-percent chance of increasing profit by increasing the sales volume of Product X.
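The decision-scoring idea can be sketched in a few lines of Python. The candidate decisions and their success probabilities below are illustrative placeholders, not real model output; in practice the scores would come from your trained predictive model.

```python
# Each candidate business decision for Product X, paired with a
# model-estimated probability of increasing profit. Both the decisions
# and the numbers are invented for illustration.
candidate_decisions = {
    "increase sales volume of Product X": 0.80,
    "raise the price of Product X":       0.55,
    "discontinue Product X":              0.20,
}

# Rank the decisions by their predicted chance of success, best first.
ranked = sorted(candidate_decisions.items(), key=lambda kv: kv[1], reverse=True)
for decision, score in ranked:
    print(f"{score:.0%} chance of increased profit: {decision}")
```

Presenting the output this way, as a ranked list of decisions with scores, maps directly onto the bulleted list of business objectives you drew up.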
After you've clearly stated the business objective and the problem you're willing to tackle, the next step is to collect the data that your predictive model will use. In this phase, you have to identify your data source(s). For example, if you're developing a prototype for predicting the right decision on a specific product, then you need to gather both internal and external data for that product. You shouldn't restrict the type or source of data, as long as it's relevant to the business goal.
If (say) your company is considering the introduction of a new hybrid sports car, you can contact the sales department and gather information about the sales data generated by similar products. You can contact the engineering department to find out how much the components cost (how about those longer-lasting batteries?), as well as the resources and time needed to produce the product (any retooling needed?). You might also include data about previous decisions made about a similar product (say, an overpowered convertible introduced some years ago), and their outcome (market conditions and fuel prices depressed sales).
When you've determined the most relevant data and the most useful source from which to get it, start storing the data you intend to use for your predictive model. Data may need to undergo some preprocessing, for which you'd use techniques mentioned in Chapter 9.
For a prototype, your input could be a data matrix (see Chapter 6) that represents known factors derived from historical data.
Such a data matrix, when analyzed, can produce output that looks something like this:
Inputs to the prototype model could include historical data about similar products, the corresponding decisions made about them, and the impact of those decisions on your business processes. The prototype's output would be predictions and their corresponding scores as possible actions toward attaining the objectives you've set.
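As a concrete sketch, a toy data matrix of known factors might look like the following; the column names and values are invented for illustration, not drawn from any real dataset.

```python
# A toy data matrix of historical records: each row is one past product,
# each column a known factor derived from historical data.
columns = ["unit_cost", "ad_spend", "competitor_count", "units_sold"]
data_matrix = [
    [20.0, 5000.0, 3, 1200],   # similar product A
    [35.0, 8000.0, 5,  700],   # similar product B
    [15.0, 2000.0, 2, 1500],   # similar product C
]

# A simple per-column summary: the kind of sanity check you run
# before feeding the matrix to a predictive model.
for j, name in enumerate(columns):
    col = [row[j] for row in data_matrix]
    print(f"{name}: min={min(col)}, max={max(col)}, mean={sum(col)/len(col):.1f}")
```

Even at prototype scale, a quick summary like this helps you catch data problems (impossible values, wildly different ranges) before modeling begins.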
To get a usable prototype, you have to employ a mixture of techniques to build the model. For example, you could use a clustering algorithm such as DBSCAN or k-means to build clusters of similar products.
Then you could use classification algorithms such as a decision tree or Naïve Bayes (see Chapter 7) that would classify or predict missing values (such as sales profit value) for the product in question (Product X).
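To show the clustering step concretely, here is a minimal k-means implementation in pure Python. The data points are made-up (ad spend, units sold) pairs scaled to similar ranges, and the naive "first k points" initialization is a simplification; production code would use a library implementation with smarter seeding.

```python
import math

# Made-up (ad_spend, units_sold) pairs for past products, scaled to [0, 1].
points = [(0.10, 0.90), (0.15, 0.85), (0.20, 0.80),   # strong sellers
          (0.80, 0.20), (0.85, 0.15), (0.90, 0.10)]   # weak sellers

def kmeans(points, k, iters=20):
    # Naive initialization: take the first k points as starting centers.
    centers = [points[i] for i in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

centers, clusters = kmeans(points, k=2)
print("centers:", centers)
print("cluster sizes:", [len(c) for c in clusters])
```

On this toy data the algorithm separates the strong sellers from the weak sellers; the resulting cluster labels could then feed the classification step that predicts missing values for Product X.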
This section introduces testing your model by using a test dataset that's similar to your training dataset. (See Chapter 10 for more on testing and evaluating your model.)
To evaluate your predictive analytics model, you have to run the model over some test data that it hasn't seen yet. You could run the model over several historical datasets as input and record how many of the model's predictions turn out correctly.
Evaluating your predictive model is an iterative process — essentially trial and error. Effective models rarely result from a mere first test. If your predictive model produces 100-percent accuracy, consider that result too good to be true; suspect something wrong with your data or your algorithms. For example, if the first algorithm you use to build your prototype is the Naïve Bayes classifier and you aren't satisfied with the predictions it gives you when you run the test data, try another algorithm such as the Nearest Neighbor classifier (see Chapter 6). Keep running other algorithms until you find the one that's most consistently and reliably predictive.
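The try-an-algorithm-then-compare loop can be sketched as follows. The "classifiers" here are stand-in decision rules and the test records are invented, but the evaluation logic (score each candidate model on held-out data, keep the most accurate) is exactly the iteration the text describes.

```python
# Held-out test records: ((ad_spend, price), did profit increase?).
# All values are invented for illustration.
test_data = [
    ((9, 10), True), ((8, 30), True), ((2, 10), True),
    ((3, 30), False), ((7, 25), False),
]

def model_a(features):          # stand-in rule: success iff ad spend is high
    return features[0] > 5

def model_b(features):          # stand-in rule: success iff price is low
    return features[1] < 20

def accuracy(model, data):
    # Fraction of test records where the model's prediction matches reality.
    hits = sum(model(x) == y for x, y in data)
    return hits / len(data)

scores = {"model_a": accuracy(model_a, test_data),
          "model_b": accuracy(model_b, test_data)}
best = max(scores, key=scores.get)
print(scores, "-> keep", best)
```

In a real prototype the candidates would be trained classifiers (Naïve Bayes, Nearest Neighbor, and so on) rather than hand-written rules, but the comparison harness stays the same.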
During the testing, you might find out that you need to revisit the initial data that you used to build the prototype model. You might need to find more relevant data for your analysis.
The higher the confidence in the results of your predictive model, the easier it is for the stakeholders to approve its deployment.
To make sure that your model is accurate, you need to evaluate whether the model meets its business objectives. Domain experts can help you interpret the results of your model.