Chapter 11. Analytics: An Architecture Introduction

If the universe can be mathematically explained, can we use math for competitive business advantage, I wonder?

IT business units in enterprises have reached a local optimum (that is to say, they are at or close to the limit of what can be gained by automating business processes): almost every enterprise business unit—for example, HR, Accounting, Payroll, Operations, and so on—has standard IT Systems and automation. This standardization of IT Systems and automation leaves few options for driving differentiation between competing organizations. While standardization has its advantages, it also brings organizations to a level playing field.

Data has been touted as the currency of the twenty-first century, and for good reason. Modern technology that can emit, capture, ingest, and process data on the order of petabytes and zettabytes has become commonplace. By using this data to drive insights and using those insights to drive proactive decision making and actions, enterprises can create competitive and differentiated advantages (Davenport and Harris 2007; Davenport et al. 2010)—the survival kit! Analytics is the discipline that leverages data (of any size, form, and variety) to drive insights and support optimized decision making. It has become the most commonly referenced and sought-after discipline in both business and IT. There is hardly any organization of repute, credibility, or potential that does not have analytics as a part of its business strategy.

In this chapter I briefly touch upon the value of analytics and its various forms and also provide an architectural teaser around some key functional building blocks of an analytics blueprint. Such a blueprint may be used as a baseline to further expand on, customize, and develop analytics reference architectures supporting an enterprise’s business strategy around analytics and its adoption road map.

And as has been customary throughout this book, my hope is to drive another feather in your cap, to add more wood behind the tip of your architecture arrow, and armor you with some key know-how around designing analytically powered enterprise solutions. Your quiver gets fuller with more potent arrows!

Disclaimer: This chapter is not intended to provide a detailed discourse of analytics, covering all of its architectural aspects such as context diagrams, operational models, and infrastructure considerations. This is a conscious decision because such an exhaustive treatment would require a book unto itself—maybe a hint to myself!

Why We Need It

The need for software architectures for any domain (in this case, analytics) was discussed earlier in this book. Here, let’s discuss in more detail why analytics, in itself, is important. In the interest of brevity, I want to keep this discussion to just enough: a plethora of additional information is available on the Internet anyway.

Analytics is the new path to value creation—the value that holds the key to unlocking the major traits of effective decision making. The effectiveness of making a decision is characterized by its timeliness, the confidence levels around its accuracy, and a streamlined process to execute and act on it in ways that position the enterprise to seize the opportunity.

Let’s start by looking at an excerpt from an informative research work that the IBM Institute of Business Value (IBV) conducted and published as a paper titled “Analytics: The Speed Advantage.” Based on IBV’s extensive research:

• 63% of organizations realized a positive return on analytic investment within a year.

• 69% of speed-driven analytics organizations created a significant positive impact on business outcomes.

• Use of analytics is primarily focused on customer-centric objectives (53%) with operational efficiency not lagging too far behind (at 40%).

• An organization’s ability to convert analytical insights into decision-making actions is influenced by how pervasively analytics is used across the organization, along with the breadth of technical capabilities leveraged to support analytics. The leaders in that pack are the ones who can act with the speed required for competitive advantage, with 69% of those studied reporting a significant impact on business outcomes, 60% reporting a significant impact on revenues, and 53% reporting a significant competitive advantage gained.

• The leaders in the pack (see the bullet above) are the ones who are the most effective in their speed to acquire data, to analyze data and generate insight, and to act on it in a timely and opportune manner to drive positive impact and competitive advantage for the enterprise. (IBM Institute of Business Value n.d.)

The paper goes on to provide supporting evidence regarding why data-driven organizations are winning the race in the marketplace.

While all the points highlighted here are relevant to why analytics and its adoption have become essential for any enterprise to foster competitive advantage, the last point is particularly interesting, especially in the light of an analytics architecture. Acquiring data, analyzing it to extract information, and being able to generate optimized recommendations to act on the data require a foundational technology underpinning.

If analytics is relevant, essential, and imperative for an enterprise, it surely needs a good architecture treatment.

Dimensions of Analytics

Just as DNA holds the secret to all human characteristics, insight lies encoded in the various strands of data: what DNA is to humans, data is to business insights. The various forms of data (that is, its variety), the rates at which data is generated and ingested (that is, its velocity), the sizes in which data is generated (that is, its volume), and the trustworthiness of the data (that is, its veracity) constitute the four key characteristics of data—variety, velocity, volume, and veracity—that influence and provide clues on how the analytical imprints can be unlocked. Suffice it to say that as data volumes and velocity continue to expand at a staggering and relentless rate, the veracity of the data comes increasingly under scrutiny.

Analytics is being leveraged in a multitude of ways to foster better decision making. The use of analytics can be broadly classified into five categories or dimensions:

• Operational (real-time) Analytics

• Descriptive Analytics

• Predictive Analytics

• Prescriptive Analytics

• Cognitive Computing

The various forms of analytics form a spectrum and address a continuum for supporting business insights, starting from what is happening right now (that is, at the point of business impact) and extending to acting as an advisor to humans (that is, an extension to human cognition).

Let’s explore the continuum!

Operational Analytics

Operational Analytics focuses on highlighting what is happening right now and brings it to the attention of the relevant parties as and when it is happening. The “right now” connotation implies a real-time nature of analytics. Such an analytics capability, owing to its real-time nature, requires the generation of insights while the data is in motion. In such scenarios, in which the decision latency is in seconds or subseconds, data persistence (that is, storing data in a persistent store before retrieving it for analytics) is not conducive to generating insights in real time. Analytical insights need to be generated while the data is in motion—that is, at the point where the data is first seen by the system. The data flows through continuously (that is, it is streaming) while analytics is applied to it. This form of analytics is referred to by various names: data-in-motion analytics, Operational Analytics, or real-time analytics. Some examples of operational or real-time analytics are

• Providing stock prices and their temporal variation every second

• Collecting machine instrumentation (for example, temperature, pressure, and amperage) of fixed or moving assets (for example, pipelines, compressors, and pumps) and monitoring their operational patterns, in real time, to detect anomalies in their operating conditions

• Detecting motion in real time, through video imaging and acoustic vibrations (for example, analyzing video feeds from drones, in real time, to detect political threats)
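To make the notion of data in motion a little more concrete, here is a minimal Python sketch that evaluates machine instrumentation readings as they stream through, raising an alert before the data is ever persisted. The stream() generator, the asset names, and the operating limits are illustrative assumptions on my part, not any particular product’s interface.

# A minimal sketch of data-in-motion analytics: each sensor reading is evaluated
# as it arrives, before it is ever persisted. The stream() generator and the
# operating limits are illustrative assumptions.
OPERATING_LIMITS = {"temperature": (20.0, 95.0), "pressure": (1.0, 6.5)}

def stream():
    # Stand-in for a real feed (message queue, socket, or sensor gateway).
    yield {"asset": "pump-17", "temperature": 91.2, "pressure": 5.9}
    yield {"asset": "pump-17", "temperature": 98.4, "pressure": 6.1}

def detect_anomalies(readings):
    for reading in readings:
        for measure, (low, high) in OPERATING_LIMITS.items():
            value = reading.get(measure)
            if value is not None and not (low <= value <= high):
                # The insight is generated while the data is still in motion.
                yield {"asset": reading["asset"], "measure": measure, "value": value}

for alert in detect_anomalies(stream()):
    print("anomaly:", alert)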

Descriptive Analytics

The form known as Descriptive Analytics focuses on describing and analyzing what has already happened, providing various techniques to slice and dice (that is, get different views of the same data or a subset thereof) the information in multiple intuitive ways to drive analytical insights into historical events. It is also called after-the-fact analytics, owing to its nature of describing what happened in the past. Traditional business intelligence (or BI, as we know it) was primarily this form of analytics. The power of BI lies in the various techniques used to present information for analysis such that the root causes of business events (for example, historical trends in car battery recall reasons, or efficiency and productivity losses in a manufacturing assembly line) are easier to analyze, comprehend, and understand. Owing to its after-the-fact nature, it is performed on data at rest; that is, on persisted data. Some examples of Descriptive Analytics are

• Performing comparative analysis on production metrics across multiple similar production plants such as oil platforms and semiconductor fabrication assembly lines

• Comparing the productivity of field operators across multiple shifts in a day

• Comparing how the average availability of a machine degrades over the years

Predictive Analytics

Predictive Analytics primarily focuses on predicting what is going to happen in the future by either analyzing how something happened in the past or by detecting and learning patterns used to classify future behavior. Predictive Analytics relies on building predictive models that typically transform known data representing an entity into a classification, probability estimation, or some other relevant measure of future behavior. Predictive models are typically built using algorithms that look at large volumes of historical data. Algorithm-based models are primarily data driven; that is, various statistics of the data define the characteristics of the model. Models are developed using various statistical, probabilistic, and machine-learning techniques to predict future outcomes. Supervised learning and unsupervised learning form the basis of the two broad categories of machine-learning techniques. Building reliable and effective models generally requires transformation of the raw data inputs into features that the analytical algorithm can exploit. Some examples of Predictive Analytics are

• Predicting the chances of a given patient having a certain form of skin cancer given a sample of her skin

• Predicting the remaining life of a critical component of any expensive heavy equipment such as coal-mining equipment

• Predicting whether a person applying for a bank loan will default on his payment

Note: I encourage you to research further into supervised and unsupervised machine-learning techniques. You may start by understanding the fundamentals of regression, classification, and clustering schemes.
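To give the supervised flavor a concrete, if deliberately simplified, shape, the following Python sketch trains a classifier on a handful of historical loan outcomes and then estimates the probability that a new applicant will default (the third example above). The feature choices, the tiny training set, and the use of the scikit-learn library are illustrative assumptions.

# A minimal supervised-learning sketch (assumes scikit-learn is installed).
# Features are purely illustrative: [income_in_thousands, debt_ratio].
from sklearn.linear_model import LogisticRegression

X_train = [[45, 0.60], [85, 0.20], [30, 0.75], [120, 0.10], [52, 0.55], [95, 0.15]]
y_train = [1, 0, 1, 0, 1, 0]           # 1 = defaulted in the past, 0 = did not

model = LogisticRegression()
model.fit(X_train, y_train)            # learn from historical (labeled) outcomes

applicant = [[60, 0.50]]
probability_of_default = model.predict_proba(applicant)[0][1]
print(f"Estimated probability of default: {probability_of_default:.2f}")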

Prescriptive Analytics

The form known as Prescriptive Analytics focuses on answering the question of what you should do (that is, prescribe) if something is predicted to happen (that is, the predictive part) in the future. Stated differently, Prescriptive Analytics addresses what recommendations and actions to provide, and how, based on some future event that is predicted to happen.

Prescriptive Analytics relies on optimized decision making. It typically considers one or more predictive outcomes and combines them with other factors to arrive at an optimized recommendation that is typically actionable in nature. It may leverage tools and techniques, around business rules, optimization algorithms, or a combination thereof, to come up with recommendations. Whereas Operational, Descriptive, and Predictive Analytics tell us what is happening now, what happened in the past, or what is going to happen in the future, Prescriptive Analytics actually prescribes what to do or what actions to take if such events were to happen. Rules engines correlate multiple input events that take place across both space (that is, in multiple locations) and in time (that is, at different points in time) along with external events such as weather, operating conditions, maintenance schedules, and so on, to come up with actionable recommendations. Because this form of analytics may be a bit esoteric, in the spirit of practicality, let me provide an example as illustration.

An elderly man is driving his relatively new BMW M5 on a bright sunny Sunday morning on May 24, 2015. Imagine that a predictive model predicts that the car’s gearbox will stop functioning within the next 30 days and flashes an alert on the car dashboard. Other than getting upset, this man may not know what to do except turn around and make plans to take the car to the dealer immediately. A Prescriptive Analytics module comes in and intervenes! It figures out that the car has free servicing, under warranty, on June 14, 2015 (that is, in the next 21 days), and that the confidence level of the predictive outcome does not change much between 21 days and 30 days; in other words, the model predicts that the gearbox can break 21 days from now with a confidence level of 85 percent and 30 days from now with a confidence level of 88 percent. Given these three data points (that is, the time window of opportunity, the confidence levels, and the upcoming warrantied scheduled maintenance window), the Prescriptive Analytics system performs a business-rules–based optimization and subsequently sends the car owner a notification with a precise recommendation: “Bring in the car for service on June 14; your car is going to be just fine, and we will take care of it free of charge!” This is an example of a prescriptive and actionable recommendation.
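Reduced to its essence, that intervention is a matter of correlating the predictive output with a few other facts and walking a simple business rule. The following Python sketch captures the idea; the thresholds, the data structure, and the dates are illustrative assumptions rather than any product’s API.

# A minimal sketch of a rules-based prescriptive step for the gearbox scenario.
# The thresholds, confidence values, and dates are illustrative assumptions.
from datetime import date

prediction = {"component": "gearbox",
              "confidence_at_21_days": 0.85,
              "confidence_at_30_days": 0.88}
next_warrantied_service = date(2015, 6, 14)
today = date(2015, 5, 24)

def recommend(prediction, service_date, today):
    days_to_service = (service_date - today).days
    confidence_drift = (prediction["confidence_at_30_days"]
                        - prediction["confidence_at_21_days"])
    # Business rule: if the free service window falls inside the prediction
    # horizon and the confidence barely changes, defer to the service date.
    if days_to_service <= 30 and confidence_drift <= 0.05:
        return f"Bring in the car for service on {service_date:%B %d}; it will be repaired free of charge."
    return "Schedule an immediate dealer visit."

print(recommend(prediction, next_warrantied_service, today))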

Some examples of Prescriptive Analytics are

• Recommending creation of a work order (along with job procedures) to fix equipment

• Recommending deferring the maintenance of a high-valued equipment component close to its planned maintenance window

• Recommending an optimum price point to sell a specialty chemical (for example, an oxy-solvent) at which there will be a decent profit margin and also a higher probability of the buyer’s acceptance of the price

Cognitive Computing

Cognitive Computing focuses on systems that “think” to generate insights that are human-like; at least, that is the basic idea! This is a relatively new paradigm because there is a fundamental difference in how these systems are built and how they interact with humans. Traditional systems generate insights at various levels—descriptive, predictive, prescriptive, and so on—where humans perform most of the directing. Cognitive-based systems, in contrast, learn and build knowledge, understand natural language, and reason and interact more naturally with human beings than traditional programmable systems. Cognitive systems extend the capabilities of humans by augmenting human decision-making capacity and helping us make sense of the growing amount of data that is germane to a situation—a data corpus (that is, its sheer volume and wide variety) that is typically beyond the capacity of a human brain to process, analyze, and react to in a period of time that fosters competitive decision-making advantage.

Cognitive Computing is very much in its infancy (IBM Institute of Business Value n.d.), leaving ample room for its evolution. Organizations need to set realistic expectations, and they certainly should make long-term plans instead of trying to achieve immediate gains from it. Expecting immediate gains would not only frustrate the enterprise but also fail to acknowledge the true potential of Cognitive Computing; potential is the operative word here.

Cognitive systems, such as the technology behind the IBM Watson computer that participated in and won the Jeopardy event, are based on an open-domain question-answering technique called DeepQA (IBM n.d.). The technique, at a very high level, leverages sophisticated and deep natural language processing capabilities, along with advanced statistical and probabilistic algorithms, to arrive at the best possible answer to a question. The corpus of data to which it applies these techniques is primarily unstructured and semistructured in nature and can be a combination of data available in the public domain and privately held enterprise content. Some examples of cognitive systems are

• IBM Watson participating in the popular Jeopardy television show and winning the competition against the top-ranked Jeopardy participants

• An advisor system that assists oil and gas engineers in detecting a potential “stuck pipe” situation on an oil rig

• A cognitive system that can streamline the review processes between a patient’s physician and his health plan

The preceding discussion provided a brief introduction to the various dimensions of analytics. Each dimension is a field unto itself, and professionals could easily spend their entire career in any one discipline perfecting expertise and then building into adjacent domains, fields, and dimensions.

It is important to understand that the organizations that will enjoy competitive advantage in the marketplace are the ones that break away from the traditional sense-and-respond mode of business automation, based on human intuition and expertise. They will move to one in which next-generation efficiencies and differentiation are achieved by providing precise, contextual analytics at the point of business impact, thereby adopting a real-time, fact-driven predict-and-act modus operandi. This fundamental shift will be made possible only through a serious investment in analytics as a part of the organization’s business strategy. Any such strategic business case for investing in analytics needs to be supplemented with an innovative solution approach that is built on a strong foundation of complementary advanced analytics techniques that collectively provide a 360-degree view of whatever it takes to support insightful decisions.

Regarding advanced analytical techniques, a strong architectural foundation is paramount to consolidate the required features, techniques, and technologies to support the organization’s business strategy—a perfect segue into our next section!

Analytics Architecture: Foundation

Any nontrivial IT System must have an architecture foundation. What I describe in this section is a functional model of an analytics reference blueprint (or architecture or model). This blueprint addresses each layer of the architecture stack and strives to cover a wide range of use-case scenarios in which analytics applications may be implemented across businesses that consider analytics a strategic initiative focused on developing a distinctive business advantage. You can also think of it as an analytics capability model describing a set of capabilities that any enterprise may need to consider when it embarks on its analytics journey. The model does not require an enterprise to support all of the capabilities—at least not all at once. The maturity of an enterprise’s adoption of analytics, along with its prioritized business imperatives, typically dictates the iterative rollout of the capabilities.

Quite frequently, I have stumbled upon analytics architecture models (or blueprints) that focus on developing subsets of data architectures along with their access management and integration. An analytics reference architecture or model should place its primary focus on analytics while addressing data architecture only to the extent needed to enable an analytics framework or platform.

It is important for the technical community to realize that, while making data and information accessible and actionable (that is, analyzable) is imperative, the core discipline of analytics focuses on building systems of insight and hence requires a different mindset and focus. Systems of insight, the focus of analytics, aim at converting data into information, information into insight, and insight into actionable outcomes, and subsequently sharing that information, insight, and those actionable outcomes with the appropriate personas. The interaction with users forms the basis of what we call systems of engagement—putting a user in the driver’s seat while arming her with the information, insight, and actionable outcomes (which form the core of the systems of insight) required to drive home successfully. The type of information, insight, and actionable outcomes generated, along with the required analytic capabilities, may be categorized by user type or persona. The following are some illustrative examples of user types and their analytics focus:

• Business executives may be interested only in business metrics and, hence, on reports that highlight one or more performance measures with the ability to view the same data but through different views (for example, revenue by region, revenue by product, and so on).

• System engineers may be interested in root-cause analysis and, hence, expect to be able to drill down from a metrics-based view to a summary view and further down to a detailed and granular root-cause analysis, to determine the actual cause of critical events (for example, operations shutdown or random maintenance episodes).

• Data scientists are responsible for performing ad hoc analysis on a multitude of data sets, across heterogeneous systems, leveraging a wide variety of statistical and machine-learning algorithms to identify patterns, trends, correlations, and outliers that may be used to develop predictive and prescriptive analytic capabilities for the enterprise.

If we study the usage patterns and the expectations for extracting intelligence from data, we can categorize analytics into five dimensions, as described earlier. These categories or dimensions define the five pillars of corporate intelligence around which the discipline of analytics can be constructed. It is important to acknowledge that the focus of analytics is fundamentally different from that of data and its management; analytics focuses primarily on generating systems of insight that drive systems of engagement between humans and “things” (machines, processes, and the entire connected ecosystem).

Let’s dive a little deeper into the reference model.

The Layered View: Layers and Pillars

Figure 11.1 depicts the layered view of an analytics architecture reference model. The layers and pillars, along with the capabilities discussed, are meant to be used as guidelines and not a strict prescription for adherence. Architectures and architects alike need to have enough flexibility to be both adaptive and resilient; principles, guidelines, and constraints aim to provide such flexibility and resilience.

Before we get further ahead, let me state my intentional use of the terms analytics reference model (ARM), analytics reference blueprint (ARB) and analytics reference architecture (ARA) interchangeably; ARM, ARB, and ARA are one and the same for the sake of this discussion. You never know which phrase will stick with your team and your customers; having three options to choose from is not bad!

image

Figure 11.1 A layered view of an analytics reference architecture.

ARA is composed of a set of horizontal and cross-cutting layers. Some of the horizontal layers are focused on data acquisition, data preparation, data storage, and data consolidation, whereas some others cover the solutions and their end-user consumption. The cross-cutting layers, as the name suggests, provide a set of capabilities that are applicable to multiple horizontal layers.

ARA introduces the concept of pillars, representing the five dimensions of analytics (just below the Analytics Solutions layer in Figure 11.1). Each pillar represents a set of related capabilities. The capabilities supported by the pillars can cross-pollinate, commingle, or coexist (because they are at the same level and hence adhere to the fundamental principles of a layered architecture). They not only harness the capabilities from all of the horizontal layers that lie below the pillars but also can leverage the capabilities from the vertical cross-cutting layers.

Although some of the key characteristics of each layer may be highlighted, they are by no means fully exhaustive. In the spirit of just enough, my goal here is merely to introduce you to the concepts and provide a foundation on which you can build your ARA!

ARA/ARM/ARB is composed of seven horizontal and three vertical cross-cutting layers. The horizontal layers are built from the bottom up, with each layer building on the capabilities and functionalities of the layers below. The layers are Data Types, Data Acquisition and Access, Data Repository, Models, Data Integration and Consolidation, Analytics Solutions, and Consumers. The five layers from the bottom—Data Types, Data Acquisition and Access, Data Repository, Models, and Data Integration and Consolidation—form the data foundation based on which the analytic capabilities are built. The Analytics Solutions layer describes the various analytically powered solutions that can be offered to the consumer. The topmost layer (that is, the Consumers layer) represents a set of techniques that may be leveraged to interface with the end users—the visual interfaces.

The next sections elaborate on the horizontal layers, vertical layers, and the pillars.

The Horizontal Layers

The following sections define each of the horizontal layers and the collective functionality each one of them is expected to provide in the overall ARB.

Data Types

The lowest layer in the ARB, Data Types, acknowledges the fact that the various data types and data sources are spread across a broad spectrum ranging from traditional structured data to data types that are categorized as unstructured in nature.

This layer enforces the expectation of the ARB to address the broad spectrum of data sources and types that may be ingested into the system for further processing. Examples of structured data types include transactional data from routine maintenance, point-of-sales transactions, and so on. Semistructured data types represent common web content, click streams, and so on, whereas unstructured data is represented by textual content (for example, Twitter feeds), video (for example, surveillance camera feeds), audio (for example, acoustic vibration from operating machines), and so on.

Data Acquisition and Access

The Data Acquisition and Access layer focuses on supporting various techniques to acquire and ingest the data from the gamut of Data Types (the layer below) and make the data ready and available for provisioning and storage. The architecture components in this layer must support the abilities to acquire transactional (structured) data, content (semistructured) data, and highly unstructured data, while being able to accommodate various data ingest rates—from well-defined periodic data feeds to intermittent or frequent data feed updates to real-time streaming data.

Data Repository

The Data Repository layer, as its name suggests, focuses on provisioning the data. The purpose of this architecture layer is to focus on supporting the capabilities required to capture the ingested data from the Data Acquisition and Access layer and to store it based on the appropriate types of data. The layer should also provide storage optimization techniques to reduce the total cost of ownership of IT investments on technologies required to support the expected capabilities.

Models

The Models layer focuses on abstracting physical data and its storage into a technology-agnostic representation of information. The capabilities of this layer can also be viewed as consolidating and standardizing on the metadata definitions for an industry or an enterprise; the business and technical metadata collectively satisfies the metadata definition.

Some organizations may adopt a well-known industry standards model—for example, ACORD in insurance (ACORD n.d.) or HITSP in health care (Healthcare Information Technology Standards Panel [HITSP] n.d.)—and try to organize their own enterprise data around such standards. Other organizations may develop their own versions, whereas still others prefer meeting in the middle: starting with a relevant industry standard and extending it to fit their own enterprise data and information needs and guidelines. Regardless of the approach an enterprise adopts, a metadata definition of both the business and technical terms is essential; it shields the interfaces used to access the data from the underlying implementation of how data is persisted in the Data Repository layer.

The architectural building blocks in this layer aim to formulate a metadata schema definition that may be used to define the data and their relationships (semantics or otherwise) on entities provisioned in the Data Repository layer.

Data Integration and Consolidation

The Data Integration and Consolidation layer focuses on providing an integrated and consolidated view of data to the consuming applications. Components in this layer may serve as a gatekeeper and a single point of access to the data that is provisioned in the various components within the Data Repository layer. The components in this layer may leverage the metadata definitions enforced in the Models layer in an effort to standardize on a prescribed mechanism to access and interpret the enterprise data, allowing applications and users to formulate business-aligned information retrieval queries.

Consolidating data requires various integration techniques either to physically collate data from multiple, often disparate, data sources or to provide a set of virtual, queryable view interfaces to data that is physically federated across multiple systems. Physical data consolidation activities and techniques often manifest themselves as data warehouses or domain-specific data marts. Data virtualization techniques aim at providing virtual, queryable view interfaces to data sets that are physically distributed across multiple data sources and repositories.

Analytical Solutions

The Analytical Solutions layer focuses on classes of solutions that are powered by analytics at their core. Solution classes are typically industry specific (for example, retail, health care, oil and gas, mining, and so on); even within an industry, there are differences between a solution’s manifestations in different organizations. As an example, if a Question Answering Advisor is a type of solution, it could be implemented as a Drilling Advisor supporting deep-sea oil drilling as well as a Maintenance Advisor supporting optimized maintenance of costly equipment.

The solutions at this layer leverage one or more capabilities from the various dimensions of analytics and integrate them to support a specific genre of analytics solutions.

Consumers

The Consumers layer focuses on providing a set of user interface facades that may be leveraged to interact with and consume the features and functions of the analytical solutions.

The components in this layer ensure that existing enterprise applications can leverage the analytical solutions; there also exist user interface widgets (either standalone or integrated) that expose the analytics outcomes and allow users to interact with the solutions.

In the spirit of fostering collaboration and knowledge sharing, components in this layer have a collective responsibility to extend the value reach of analytics into the broader enterprise IT landscape.

The Vertical Layers

The three cross-cutting (that is, vertical) layers are as follows:

Governance—This is a discipline in its own right. Rather than illustrating governance as a foundational discipline, I focus on the three subdisciplines of governance—namely, data governance, integration governance, and analytic governance.

Metadata—This layer defines and describes the data that is used to describe other data.

Data and Information Security—This layer addresses the security underpinnings of how data needs to be stored, used, archived, and so on.

Note: Figure 11.1 does not depict Governance as a cross-cutting layer; rather it shows the three subdisciplines.

Data Governance

Data Governance focuses on managing data as an enterprise asset. It defines and enforces processes, procedures, roles, and responsibilities to keep enterprise data free from errors and corruption by leveraging practical disciplines. The purpose is to address business, technical, and organizational obstacles to ensuring and maintaining data quality.

Some of the areas that may be addressed under data governance include

Data Quality—Measuring the quality, classification, and value of the enterprise data.

Data Architecture—Modeling, provisioning, managing, and leveraging data consistently through the enterprise, ideally as a service.

Risk Management—Building trusted relationships between various stakeholders involved in the creation, management, and accountability of sensitive information.

Information Lifecycle Management (ILM)—Actively and systematically managing enterprise data assets throughout their lifetime to optimize availability of an organization’s data assets; support access to information in a timely manner; and ensure that the information is appropriately retained, archived, or shredded.

Audit and Reporting—Ensuring proper routing and timely audit checks are exercised and appropriate reports communicated to those who either need to take action or be informed about any data stewardship issues.

Organizational Awareness—Fostering a collaborative approach to data stewardship and governance across the enterprise, paying particular attention to the most critical areas of the business.

Stewardship—Implementing accountability for an organization’s information assets.

Security and Privacy Compliance—Ensuring the organization has implemented commensurate controls (for example, policies, processes, and technology) to provide adequate assurance to various stakeholders that the organization’s data is properly protected against misuse (accidental or malicious).

Value Creation—Using formulated metrics to quantify how an organization realizes returns on investment in its use and potential monetization of enterprise data.

Integration Governance

Integration Governance focuses on defining the process, methods, tools, and best practices around consolidating data from federated data sources to form an integrated and intuitive view of the enterprise business entities. The discipline also drives the adoption and usage of metadata that provides a technology-agnostic definition and vocabulary of business entities and their relationships, which may be leveraged to exchange information across applications and systems in a consistent (and ideally standardized) manner.

The areas covered by Integration Governance may include

• Developing best practices around integration architecture and patterns to consolidate data from multiple data sources

• Developing a standards-based canonical metadata and message model

• Exposing integration services for consumption and governing their use by other layers of the architecture

Analytic Governance

Analytic Governance focuses on managing, monitoring, developing, and deploying the right set of analytic artifacts across the five disciplines of Descriptive Analytics, Predictive Analytics, Prescriptive Analytics, Operational Analytics, and Cognitive Computing. The discipline defines the process and policies that should be formulated and executed to manage the life cycle of artifacts created from the various analytics pillars.

This relatively new construct exists in acknowledgment of the fact that analytics is a separate discipline requiring its own life-cycle management. This layer is evolving and therefore will only mature over time.

The focus of Analytic Governance may include

• Developing the best practices, guidelines, and recommendations that may be leveraged to maximize the value generated through analytics

• Developing processes, tools, and metrics to measure the use of and the value harnessed from analytics in an enterprise

• Developing processes around managing, maintaining, and monitoring the analytics artifacts across their life cycle

• Developing analytics patterns that may drive the use of a multitude of capabilities from and across the different analytics pillars to build analytic solutions

• Developing processes, methods, and tools on how analytic functions and capabilities may be exposed As-a-Service for use and consumption

Metadata

The Metadata layer focuses on establishing and formalizing a standardized definition of both business terms and technical entities for an enterprise. The architectural building blocks and their associated components in this layer encourage building a metadata schema definition that may be used to organize the data and their relationships (semantics or otherwise) on entities provisioned in the Data Repository layer. Such metadata definitions form the basis of the information models in the Models layer.

Data and Information Security

The Data and Information Security layer focuses on any additional data security and privacy requirements that assume importance in the context of analytics. Data, as it is prepared and curated for analytics, needs to be cleansed of any personal information and anonymized, masked, and deduplicated such that identities are protected and privacy is not compromised. During the data preparation tasks, the components in this layer enforce just that.

The Pillars

ARA/ARM/ARB is composed of five pillars, each of which focuses on one of the dimensions of analytics. The five pillars are Descriptive Analytics, Predictive Analytics, Prescriptive Analytics, Operational Analytics, and Cognitive Computing. The combined capability supported by the five pillars aims at providing a platform with reasonably holistic coverage of analytics capabilities for any enterprise.

The following sections provide high-level definitions of each pillar and the collective functionality each one of them is expected to provide in the overall ARB.

Descriptive Analytics

Descriptive Analytics, also known as after-the-fact analytics, focuses on providing intuitive ways to analyze business events that have already taken place—that is, a metric-driven analytical view of facts that have occurred in the past. It uses historical data to produce reports, charts, dashboards, and other forms of views that render insights into business performance against the strategic goals and objectives. For example, a mining company’s business goal may be to maintain or increase the amount of coal produced per unit time. Tonnage Per Hour is a key performance metric or measure for such an enterprise. A business goal for an electronics manufacturing company may be to reduce the rate of scraps generated during the manufacturing and assembly of electronics circuit boards. Cost of Product Quality could be a key performance metric for such an enterprise.
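To ground the metric-driven nature of this pillar, here is a minimal Python sketch (assuming the pandas library) that computes a Tonnage Per Hour measure from historical production records and summarizes it by plant; the records and column names are illustrative assumptions, and a real system would read them from a data warehouse or data mart.

# A minimal descriptive-analytics sketch (assumes pandas is installed).
import pandas as pd

production = pd.DataFrame({
    "plant": ["North", "North", "South", "South"],
    "shift_hours": [8, 8, 8, 8],
    "tons_produced": [640, 592, 710, 668],
})

summary = production.groupby("plant").agg(
    total_tons=("tons_produced", "sum"),
    total_hours=("shift_hours", "sum"),
)
summary["tonnage_per_hour"] = summary["total_tons"] / summary["total_hours"]
print(summary)   # an after-the-fact view of a key performance metric, by plant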

Some of the key characteristics or capabilities expected from the components in this pillar may include

• Leveraging predefined performance measures and metrics around strategic goals and objectives and using them to drive the design of the reports and dashboards.

• Supporting different views of the analytical data for different personas (that is, user roles) and user communities. Examples include executive dashboards displaying only a few top-level metrics and a field supervisor’s view of performance data for each equipment product line (such as for a truck, loader, or bulldozer). Also, they may provide drill-down (into reports) capabilities to perform root-cause analysis across one or more dimensions of the analytical data.

• Providing metadata definitions to support both precanned and ad hoc reports on data, which is consistent and quality controlled.

• Supporting the optimized retrieval of data from multiple database and data warehouse systems in ways such that the heterogeneity (of the data systems) is abstracted from the reporting widgets.

Predictive Analytics

Predictive Analytics focuses primarily on developing statistical and probabilistic models to predict the occurrence of business-critical events; it also qualifies each prediction with a confidence level quantifying the probability of the event’s occurrence.

As mentioned earlier in the chapter, the modeling techniques are categorized broadly into two categories: supervised and unsupervised learning. Supervised learning uses historical data, which contains instances of the past occurrences of a particular business critical event, to build predictive models that can predict the future occurrences of the same (or similar) business-critical event. Unsupervised learning does not have the luxury of any known business-critical events in the past; it finds patterns in a given data set that it uses to group (or cluster) the data without having prior knowledge of the groups. Components and techniques in this layer support the two broad classifications of modeling techniques.
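The unsupervised flavor can be contrasted with the earlier supervised sketch: here no outcome labels exist, and the algorithm discovers the groupings on its own. The following minimal Python sketch clusters a handful of illustrative machine readings into two groups; the data and the choice of a k-means algorithm (via scikit-learn) are assumptions made purely for illustration.

# A minimal unsupervised-learning sketch (assumes scikit-learn is installed).
# No labels are known in advance; the algorithm groups similar records itself.
from sklearn.cluster import KMeans

# Illustrative operating measurements: [vibration_level, temperature]
readings = [[0.2, 60], [0.3, 62], [0.25, 61],     # normal-looking readings
            [1.8, 95], [2.0, 97], [1.9, 96]]      # stressed-looking readings

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(readings)
print(kmeans.labels_)          # cluster assignments discovered from the data alone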

Continuously analyzing and looking for new trends and patterns necessitates access to data for intensive computations across a variety of data sources. As such, the need for a dedicated analytical development sandbox, with its own computational workload, influences some of the capabilities and components required to be supported in this layer.

The primary user of the capabilities in this layer is the data scientist community (the ones who are in the highest demand in this millennium!). These users leverage sophisticated statistical, stochastic, and probabilistic techniques and algorithms to build and train models that can predict the future with a sufficiently high confidence score.

Once some trend or pattern can be detected and proven to be able to predict a business event that drives value, its underlying analytical models may influence the metric-driven objectives and goals that are used in the Descriptive Analytics pillar. Hence, new reports (in the Descriptive Analytics pillar) often become relevant and important based on the outcome from the continuous analysis performed in this pillar.

Some of the key characteristics or capabilities expected from the components in this pillar may include

• Empowering and enabling data scientists with commensurate tools and infrastructure to perform exploratory and intensive data crunching and computing tasks

• Using a broad range of statistical techniques

• Supporting an integrated development environment (IDE) to automate model-building and deployment tasks

Prescriptive Analytics

Prescriptive Analytics focuses on optimizing the results of multiple, possibly disparate, analytical outcomes coupled with external conditions and factors. The main components in this pillar are the ones that provide various tools and techniques for developing mathematical optimization models and for correlating (typically business rules based) multiple events to generate prescribed outcomes that are both optimized and actionable.

An example of a mathematical optimization technique may be a linear programming model that provides an optimized price point for a spot price of any raw goods such as copper or gold. An example of a rules-based optimization may be to identify the most opportunistic time to decommission any costly production equipment for maintenance (based on a combination of a prediction of the equipment’s failure and its upcoming nearest window of time for planned maintenance).
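As an illustration of the mathematical-optimization side of this pillar, the following minimal Python sketch (assuming the SciPy library) solves a small linear program to find a production mix that maximizes margin under resource constraints; the products, margins, and constraints are illustrative assumptions, not a real pricing or maintenance model.

# A minimal constraint-based optimization sketch (assumes SciPy is installed).
from scipy.optimize import linprog

# Maximize 40*x1 + 30*x2 (margin per ton of two products); linprog minimizes,
# so the objective coefficients are negated.
objective = [-40, -30]

# Constraints: raw material  2*x1 + 1*x2 <= 100; reactor hours  1*x1 + 1*x2 <= 80
A_ub = [[2, 1], [1, 1]]
b_ub = [100, 80]

result = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal production plan:", result.x, "maximum margin:", -result.fun)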

Some of the key characteristics or capabilities expected from the components in this pillar may include

• Optimization engines with complex mathematical models and techniques for constraint-based optimization of a target outcome

• A business rules engine that is capable of correlating multiple discrete events that may occur at different locations (that is, in different coordinates in space) and at different times, and navigating decision trees to arrive at one of many possible recommendations

Operational Analytics

Operational Analytics, or real-time analytics, focuses on generating analytical insights from data in motion. It employs techniques that bring the analytical functions to the data. In traditional techniques, the data is at rest, and processing functions such as SQL or SQL-like queries are applied to data that is already persisted. In Operational Analytics, the analytical functions and algorithms are applied at various times, knowing full well that the data set on which the processing operates may be radically different between two points in time. As an example, if a sentiment analysis algorithm is being put to the test (during the cricket world cup finals) across a streaming data set from Facebook and Twitter, it is quite possible that, in a particular time window, the analytical algorithm works on a data set that has no Facebook data and contains only Twitter data, while in another time window, the same algorithm has to work on a data set with equal volumes from the Twitter and Facebook feeds.
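The window-by-window nature of that processing can be sketched in a few lines of Python. In the first window below, only Twitter events happen to arrive, whereas the second window mixes both sources—exactly the variability the analytical logic must tolerate. The event data and window size are illustrative assumptions.

# A minimal sketch of windowed processing over data in motion: events are
# grouped into fixed time windows and analyzed per window, with no guarantee
# about which sources contribute to any given window.
from collections import Counter, defaultdict

events = [  # (timestamp_in_seconds, source, sentiment) — illustrative data
    (1, "twitter", "positive"), (3, "twitter", "negative"),
    (62, "twitter", "positive"), (65, "facebook", "positive"),
    (66, "facebook", "negative"),
]

WINDOW_SECONDS = 60
windows = defaultdict(Counter)
for timestamp, source, sentiment in events:
    window_start = (timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start][(source, sentiment)] += 1

for window_start, counts in sorted(windows.items()):
    print(f"window starting at t={window_start}s:", dict(counts))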

Some of the key characteristics or capabilities expected from the components in this pillar may include

• Support for ingesting data at very high frequencies and generating insight from the streaming data (that is, on data in motion) before it is stored

• Ability to operate on newly generated data in operational data warehouses; complex event-processing techniques can then be applied to correlate events from multiple systems and trigger alerts

• Ability to invoke the predictive analytical models in real time; that is, on streaming data

• Support for both structured as well as unstructured data with an emphasis on generating insight from continuous streaming semistructured and unstructured data

Cognitive Computing

Cognitive Computing represents a relatively new field in the computing era, one in which computing systems are no longer just slaves of humans (that is, processing only what humans program them to do) but can build their own knowledge and “learn”; they can understand natural language and can engage and interact with humans more naturally and intuitively. Such systems are expected to provide insights and responses that are backed by confidence-weighted supporting evidence. As an example, a healthcare advisor may be a cognitive system that can advise doctors on the possible diagnoses of patients and suggest appropriate medical care. Of course, the doctor would have the discretion to accept or reject the advice.

Some of the key characteristics or capabilities expected from the components in this pillar may include

• Ability to provide expert assistance to humans in a timely and effective manner

• Ability to make decisions (and augment the human cognition) based on supporting evidence that keeps growing as the body of relevant information in the world continues to grow

• Ability to exploit the vast body of available information by deriving contextual relationships between entities and continuously generate new insights

I hope this description of the layers and pillars provides a base foundation for you to develop an analytics architecture blueprint. The architecture building blocks that further elaborate the capabilities of each layer may provide the next level of detail.

Architecture Building Blocks

This part of the chapter briefly touches on some of the main architecture building blocks that enable the realization of the capabilities in each of the layers and pillars of the ARA.

I do not claim to be exhaustive and complete in identifying every single hitherto conceived building block, for two main reasons. First, the discipline of analytics has still not fully matured; therefore, the list of such architecture components will only change, mature, or be enhanced over time. Second, in the spirit of flexibility, it is important not to pigeonhole architects into a set of basic architecture building blocks; we need room to innovate—combine the pieces, nix some, and introduce some more—all in the context of addressing the problem at hand and the solutions we seek!

So, the intent of the following sections is to get you thinking and may just get you started. I first address the ABBs in the horizontal and vertical (that is, cross-cutting) layers before addressing the same for the five analytics pillars.

Figure 11.2 provides an illustrative depiction of how an ARB might look. Yes, it may morph—changing its shape, size, content, form, and other dimensions. But we always look for a good starting point, don’t we?

image

Figure 11.2 Illustrative architecture building blocks of an analytics architecture blueprint.

The following sections focus on highlighting the architecture building blocks (ABBs) in each layer. The descriptions, by intent, are kept short, some shorter than others. Therefore, you must research deeper into the capabilities on your own.

Data Types ABBs

I do not illustrate any specific architecture building blocks in the Data Types layer. However, it is important to recognize that this layer must be able to inform the other layers about the different variety of data types that may be required to be ingested into and supported by the system.

Structured data is typically well formed, which means it conforms to a well-defined and designed data schema. Such data is grouped into semantic chunks that have the same attributes, follow the same order, and can be consistently defined. Examples are transactional data from trade executions, point-of-sales transactions of consumer retail products, and so on; they can be provisioned in relational databases, data warehouses, or data marts.

Semistructured data can typically be organized into semantic entities such that similar entities (which may or may not have the same attributes) can be grouped together, and relationships between entities can be formulated semantically. Examples are data captured from web clickstreams, data collected from web forms, and so on.

Unstructured data does not have any predefined format; can be of any type, shape, and form; and is not amenable to any structure, rules, or sequence. Examples include free-formed text and some types of audio.

Data Acquisition and Access ABBs

The Data Acquisition and Access layer is shown to support three ABBs: Transactional Data Access Services, Operational Data Access Services, and Real-Time Data Access Services. The services in this layer facilitate the ingestion of data of different types and generated at different rates. Appropriate technology components supporting the different services also reside in this layer.

Transactional Data Access Services focuses on Extract, Transform, Load (ETL) techniques used to acquire the data, primarily from transactional data sources, and to apply the data transformations and formatting necessary to convert the data into the standard format dictated by the schema designs of the database systems where the data is expected to be provisioned. As a part of the data transformation process, appropriate data quality rules and checks are expected to be applied to ensure that the data conforms to the metadata definitions of the data standards. This ABB primarily transfers data in batch mode, from the transactional source systems to the target data repository. The frequency of the batches may range from hourly to once or a few times a day.

Operational Data Access Services focuses on acquiring data from sources where the frequency of data generation is in real time (more or less) and hence is much higher in frequency than the data sources from where data is acquired by the Transactional Data Access Services ABB. It is important to note that the data source could still be transactional systems; however, the rate of data generation is far more than what may be supported by traditional batch-oriented systems. Various services are leveraged to acquire the data. A technique known as Change Data Capture (CDC) may be leveraged to move the data from the source to the data storage in a way that minimizes the additional workload on the transactional data source. In situations in which the traditional execution time intervals for batch data transfers may not be adequate, techniques like CDC may assist in mitigating risks of failure in long-running ETL jobs. Micro Batch is another technique that may be leveraged; it facilitates supporting a much shorter batch window of data acquisition. The difference between CDC and Micro Batch is in the specific techniques used to ingest the data. A third technique may be Data Queuing and Push, which uses a different process to acquire operational data, relying on asynchronous modes of sending the data from the data sources to the appropriate data storage. Asynchronous data push, similar to CDC, adds minimal workload on the transactional source systems.
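A hedged sketch of the micro-batch flavor, in Python: a high-water mark records how far acquisition has progressed, and each cycle pulls only what has changed since then, keeping the additional load on the transactional source small. The fetch_changed_rows() and load_batch() functions are hypothetical stand-ins for source and target adapters, not any product’s API.

# A minimal sketch of watermark-driven micro-batch acquisition.
# fetch_changed_rows() and load_batch() are hypothetical stand-ins.
import time

def fetch_changed_rows(since):
    # Stand-in for a source query such as: SELECT ... WHERE last_updated > :since
    return []

def load_batch(rows):
    pass  # stand-in for writing the rows into the target data repository

def run_micro_batches(interval_seconds=300, max_cycles=None):
    high_water_mark = 0.0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        batch_cutoff = time.time()
        rows = fetch_changed_rows(since=high_water_mark)
        if rows:
            load_batch(rows)
        high_water_mark = batch_cutoff   # advance the watermark only once the batch succeeds
        cycles += 1
        time.sleep(interval_seconds)

run_micro_batches(interval_seconds=1, max_cycles=3)   # small demo run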

Real-Time Data Access Services focuses on acquiring data from source systems that generate data at rates that even the Operational Data Access Services ABB cannot commensurately support; this resides in the realm of near real-time to real-time data feeds, and the types of data typically range from semistructured to unstructured. There is a limit to how far the window for batch data acquisition (supported by the other two services in this layer) can be reduced. Beyond this limit, different capabilities are needed to support the high to ultra-high data volumes and frequencies. This service may employ techniques such as Data Feed Querying or socket- or queue-based continuous data feeds to ingest data in near real time to real time. When the data is acquired, it may be normalized into a set of <key, value> pairs, among other formats (for example, JSON), which flattens the data into the basic constituents that encapsulate the information.
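That flattening step can be as simple as the following Python sketch, which normalizes a nested, JSON-like payload into dot-separated <key, value> pairs; the payload itself is an illustrative assumption.

# A minimal sketch of normalizing an ingested record into <key, value> pairs.
def flatten(record, prefix=""):
    pairs = {}
    for key, value in record.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            pairs.update(flatten(value, prefix=f"{full_key}."))
        else:
            pairs[full_key] = value
    return pairs

payload = {"asset": "compressor-9", "reading": {"temperature": 88.1, "pressure": 5.2}}
print(flatten(payload))
# {'asset': 'compressor-9', 'reading.temperature': 88.1, 'reading.pressure': 5.2}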

Data Repository ABBs

The Data Repository layer is shown to support four ABBs: Structured Data Store, Unstructured Data Store, Content Data Store, and Semantic Data Store. Each of the ABBs addresses specific capabilities.

Structured Data Store focuses on storage for data sets that are inherently structured in nature; that is, the data follows a well-defined schema. This approach is often called schema on write, which implies that the data schema is defined and designed before data is written to the persistent store. As such, the storage components are primarily relational in nature, supporting various data normalization techniques.

Unstructured Data Store focuses on storing primarily unstructured data sets. Examples of such data sets may include machine-generated data, from trading floor transactions (for example, from telephone conversations between customer and trader for trade transactions), and from social networking sites and the Internet in general (for example, customer sentiments on product, stock prices, world affairs affecting oil prices, weather patterns). The data stores are typically schema-less, which implies that data of any structure and form may be provisioned (also referred to as “dumped” in colloquial IT lingo). It is often called schema on read, which implies that the structure and semantics may be defined during retrieval of data from such data stores.

Content Data Store focuses mainly on storing enterprise content. Enterprise documents (for example, technical specifications, policies, and regulation laws) typically fall under this category. A separate class of technology called Content Management Systems (CMS) is purpose built to store, retrieve, archive, and search massive amounts of heterogeneous enterprise content.

Semantic Data Store focuses primarily on storing semistructured data sets that may have undergone semantic preprocessing. Triple Store is a technology that may be used to store semantic-aware data sets; it stores data in the form of a triplet (that is, a three-part tuple). Each tuple consists of a <subject, predicate, object> construct. A Search Index Repository technique, as its name suggests, may be used to store the indexes that are created after applying semantic processing to all searchable content.
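To make the triplet idea tangible, here is a minimal in-memory Python sketch of a triple store and a pattern-match query. Real triple stores add indexing, SPARQL-style querying, and inference; the sample triples are illustrative assumptions.

# A minimal sketch of a triple store held in memory: every fact is a
# <subject, predicate, object> tuple, and queries are simple pattern matches.
triples = [
    ("pump-17", "locatedIn", "plant-north"),
    ("pump-17", "manufacturedBy", "AcmeCorp"),
    ("plant-north", "partOf", "mining-division"),
]

def match(subject=None, predicate=None, obj=None):
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

print(match(subject="pump-17"))          # everything known about pump-17
print(match(predicate="partOf"))         # all part-of relationships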

Models ABBs

The Models layer may have three ABBs: the Industry Standard Models, Custom Enterprise Models, and Semantic Models.

Industry Standard Models represent data, information, or process models that are agreed upon at the industry level as standards and are typically maintained by a standards body or consortium. ACORD in insurance and HITSP in patient care are examples of such standards. Organizations that want to, or are required by regulation to, adopt an industry standard for information exchange implement the standard models (either in whole or in part) to realize data exchange between their IT Systems.

Custom Enterprise Models represent information or data models that are typically developed indigenously within an organization. Such models may either be a derivative, extension, or customization of an industry model or a completely home-grown model. The intent of the model is the same as that of an industry standard model—that is, to work as a facade between the physical representation of the data and the means by which it is exchanged and consumed by systems and applications.

Semantic Models focus on developing ontology models representing specific business domains or subsets thereof. The word ontology is typically used to denote at least three distinct kinds of resources with distinctly different uses, not all of which lie within the realm of natural language processing (NLP) and text processing. Such models are used to develop an interface to navigate and retrieve data from semantic stores, for consumption and use by components in other layers in the ARB. (Refer to the “Semantic Model” sidebar in Chapter 9, “Integration: Approaches and Patterns.”)

The collective capabilities of the representative ABBs in this layer are intended to facilitate a technology-agnostic and highly resilient integration approach as it pertains to efficient data and information exchange.

Data Integration and Consolidation ABBs

The Data Integration and Consolidation layer may be supported by three ABBs: Enterprise Data Warehouse, Data Virtualization, and Semantic Integration. These ABBs foster consolidated and virtualized access to heterogeneous data and are ideally expected to leverage the components and artifacts in the Models layer (refer to Figure 11.2) to standardize on a contextual representation (of the consolidated data) and access (through virtualization) to the data.

Enterprise Data Warehouse focuses on developing and providing a consolidated representation of the most critical enterprise information and knowledge—for example, performance metrics, business financials, operational metrics, and so on. This information is critical for enterprise reporting and insights into business operations and performance and is often considered a trusted source of enterprise information. Data marts, data warehouses, and their operational data warehouse variant typically fall under this category. Operational data warehouses support data feeds at a frequency much higher than the data currency maintained in a traditional data warehouse, without compromising data read performance. Data marts represent a subset of the data stored in a data warehouse. Each subset typically focuses on a specific business domain—for example, customer, product, sales, inventory, and so on. Data marts can also represent subdomains within a business domain in scenarios where the domain is complex and requires further classification. Examples of such subdomains may be product pricing and product inventory.

Data Virtualization focuses on providing virtualized access to multiple federated data repositories in such a way that the technology complexities of federated and distributed queries are encapsulated in this building block (thereby shielding the consuming applications and systems from those complexities). One key functionality that may be expected is to package and prefabricate frequently occurring correlated queries and expose the collection as a single queryable interface to the consuming and requesting applications. A typical technology implementation could take a user-defined or application-specific query request, abstract the routing of query subsets to potentially different data sources, and subsequently combine or consolidate the individual query subset results into a single integrated result set to return to the consuming applications.
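
The following minimal sketch illustrates the virtualization idea under simplifying assumptions: the two source-query functions and the customer identifier are hypothetical stand-ins for physical repositories, and the consolidation is a naive merge of the partial results.

# A minimal sketch of data virtualization: one logical query is split across
# several physical sources and the partial results are consolidated.
from concurrent.futures import ThreadPoolExecutor

def query_orders(customer_id):          # hypothetical: backed by a warehouse
    return [{"customer_id": customer_id, "order_id": 101, "total": 250.0}]

def query_sentiment(customer_id):       # hypothetical: backed by an unstructured store
    return [{"customer_id": customer_id, "sentiment": "positive"}]

SOURCES = [query_orders, query_sentiment]

def virtual_query(customer_id):
    """Fan the request out to each registered source and merge the results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda fn: fn(customer_id), SOURCES)
    merged = {}
    for rows in partials:
        for row in rows:
            merged.update(row)          # consolidate into one integrated record
    return merged

print(virtual_query("C-42"))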

Semantic Integration focuses on providing a set of interfaces that facilitate semantic query building and executing. SPARQL (which stands for SPARQL Protocol And RDF Query Language; see W3C 2008) is an example of a semantic query language for databases in which the data is typically stored in the form of a triple store (for example, in a Semantic Data Store).
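
As a minimal sketch of such semantic querying, the snippet below assumes the open-source Python rdflib library; the namespace and triples are illustrative only.

# A minimal sketch of querying a triple store with SPARQL, assuming the
# open-source rdflib library; the namespace and triples are illustrative.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.valve7, EX.locatedIn, EX.platformA))        # <subject, predicate, object>
g.add((EX.valve7, EX.maxPressure, Literal(350)))

results = g.query("""
    SELECT ?asset ?pressure
    WHERE {
        ?asset <http://example.org/locatedIn> <http://example.org/platformA> .
        ?asset <http://example.org/maxPressure> ?pressure .
    }
""")
for asset, pressure in results:
    print(asset, pressure)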

It is important to highlight that the integration facilitated through the Semantic Integration and Data Virtualization ABBs is runtime in nature, whereas Enterprise Data Warehouse is typically a physical integration or consolidation construct.

While it is not mandatory for the layers and pillars above (that is, Descriptive Analytics, Predictive Analytics, and so on) to leverage the functionality exposed by the ABBs in this layer, best practices advocate exercising due diligence and leveraging the capabilities of this layer as a mechanism to virtualize information access.

Analytics Solutions ABBs

The Analytics Solutions layer hosts prefabricated end-to-end solutions that focus on solving a specific class of business problem. It is impractical to point out specific building blocks here because the components at this layer are not really ABBs but rather packaged solutions. I kept the term ABBs in the heading to maintain consistency and not confuse you by introducing YAT (Yet Another Term)!

In the spirit of consistency, or at least the look and feel of the ABB view of the ARB, I depicted some representative solutions:

Predictive Customer Insight (IBM 2015) focuses on extending the benefits of an organization’s marketing and customer service systems. It does so by combining advanced analytics techniques with data around buyer sentiment to deliver the most important customer-related KPIs and a personalized customer experience.

Predictive Asset Optimization (IBM n.d.) focuses on leveraging a combination of advanced analytic techniques to improve the Overall Equipment Effectiveness (OEE) of critical enterprise assets (for example, heavy equipment, factory assembly-line machines, rotating and nonrotating equipment on an oil and gas platform, aircraft engines, among many more). It does so by predicting the health of costly and critical assets relative to potential failures, well ahead of time, so that proper actions may be taken to reduce costly unplanned downtime of the most important and critical assets.
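
For reference, OEE itself is a simple product of availability, performance, and quality; the following minimal sketch computes it from shift-level figures, which are illustrative only.

# A minimal sketch of the standard OEE calculation (Availability x Performance
# x Quality) that an asset-optimization solution might track per machine.
def oee(planned_time, run_time, ideal_cycle_time, total_count, good_count):
    availability = run_time / planned_time
    performance = (ideal_cycle_time * total_count) / run_time
    quality = good_count / total_count
    return availability * performance * quality

# Illustrative figures: an 8-hour shift (480 min) with 60 min downtime,
# 1.0 min ideal cycle time, 380 parts produced of which 370 were good.
print(round(oee(480, 420, 1.0, 380, 370), 3))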

Next Best Action (IBM 2012–2013) focuses on developing and providing optimized decisions and recommending actions that may be taken to minimize the potential adverse impact of a forthcoming business-critical event. Optimized decision making can be applied to various types of enterprise assets: customers with regard to increasing loyalty, products with regard to reducing the cost of production, employees with regard to reducing attrition rates, and so on.

Recommender Systems (Jones 2013) focuses on generating contextual recommendations to a user or a group of users on items or products that may be of interest to them, either individually or collectively. It leverages multiple machine-learning techniques such as collaborative filtering (CF), content-based filtering (CBF), hybrid approaches combining variations of CF and CBF, Pearson correlations, clustering algorithms, among other techniques, to arrive at an ordered (by relevance) set of recommendations. Netflix and Amazon employ such recommender systems, or variations thereof, to link customer preferences and buying or renting habits with recommendations and choices.
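
A minimal sketch of the user-based collaborative filtering flavor, using Pearson correlation over a tiny, illustrative ratings matrix, follows; it is not any specific product’s algorithm.

# A minimal sketch of user-based collaborative filtering with Pearson
# correlation; the tiny ratings matrix is illustrative only.
import numpy as np

ratings = {                 # user -> {item: rating}
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}

def pearson(u, v):
    """Pearson correlation over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    x = np.array([ratings[u][i] for i in common], dtype=float)
    y = np.array([ratings[v][i] for i in common], dtype=float)
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = pearson(user, other)
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(recommend("alice"))   # items alice has not rated, ordered by relevance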

Question Answering Advisor focuses on leveraging advanced natural language processing (NLP); Information Retrieval, Knowledge Representation & Reasoning; and machine-learning techniques and applying them to the field of open-domain question answering. An application of open-domain question answering is IBM’s DeepQA (IBM n.d.), which uses hypothesis-generation techniques to come up with a series of hypotheses to answer a specific question, and uses a massive amount of relevant data to gather evidence in support of or refuting the hypotheses, followed by scoring algorithms to ultimately arrive at the best possible answer. IBM’s Watson is a classic example of such a solution.

Consumers ABBs

The Consumers layer is represented by five ABBs: Enterprise Applications, Enterprise Mobile Applications, Reporting Dashboard, Operational Dashboard, and Enterprise Search. The focus of the ABBs in this layer is to provide different channels that expose analytics capabilities and solutions for enterprise consumption. The ABBs are strictly representative in nature, implying that other components may be supported in this layer.

Enterprise Applications represent the classes of applications in an enterprise that are used either by one or more lines of business or by the entire organization. Such applications may require interfacing with the analytics capabilities or solutions to extend their value. As an example, an SAP Plant Maintenance (SAP PM) system may receive a recommendation to create a maintenance work order from a decision optimization analytic solution.

Enterprise Mobile Applications represent a relatively new and growing class of enterprise applications that are primarily built for the mobile platform. Such applications benefit from receiving notifications for actions from analytic solutions. In other cases, an analytic application may be fully mobile enabled—that is, built as a native mobile application on the iOS or Android platform. One example is an application, running natively on iOS (think iPads) and powered by analytics, that helps airline pilots decide on optimal refueling for the aircraft.

Reporting Dashboard provides a platform to build, configure, customize, deploy, and consume reports and dashboards that not only are visual manifestations of data in data marts, cubes, or warehouses but also serve as various means to slice and dice the information and represent it in multiple intuitive ways for analysis.

Operational Dashboard provides a visual canvas and platform to render data and information that is generated and obtained in real time—that is, at a rate faster than it can be persisted and analyzed before being rendered. An example may be collecting data from a temperature and pressure sensor on a valve in an oil platform and visualizing the data as a real-time trend immediately upon its availability.

Enterprise Search represents a class of consumer applications that focus on providing different levels of analytical search capabilities to retrieve the most contextual and appropriate results from the body of enterprise content. It can also act as a front end to analytic solutions such as the Question Answering Advisor (refer to Figure 11.2).

Metadata ABBs

The Metadata layer is represented by three ABBs: Analytic Metadata, Semantic Metadata, and Structured Metadata. The ABBs in this layer work in close conjunction with the ABBs in the Models layer in an effort to develop a standardized abstraction to information management and representation.

Analytic Metadata focuses on defining, persisting, and maintaining the gamut of metadata required to support the various facets of analytics in an enterprise. The most common analytic metadata captures the data definitions required for all the precanned reports that are typically executed either periodically or upon user request. Reporting requires its own metadata definitions, which determine how the data elements on the reports are constructed and how they relate to each other and to the data sources from which content is retrieved to populate the reports. Additionally, the navigation design for multiple visual pages and widgets is also considered analytic metadata. Similarly, the data model representations required to train and execute predictive models are part of the analytic metadata, as are the definitions of business rules along with their input parameter sets. The scope of analytic metadata is determined by the variety of analytics supported in an enterprise.

Semantic Metadata focuses on the foundational components required to build a semantic information model for the entire information set or a subset thereof. Language models based on a dictionary of terms, a thesaurus, and grammar and rules around semantic relationships between entities and terms may define the ontologies that form the underpinning of semantic metadata.

Structured Metadata focuses on defining the metadata definitions for business entities along with their constraints and rules that influence how the Structured Data Store ABB (in the Data Repository layer) may define its schema definitions. It needs to address different types of metadata, for example, Business Metadata, Technical Metadata, and Metadata Rules. The Business Metadata may encapsulate the business entity concepts and their relationships; the Technical Metadata may be used to formulate the constraints on the attributes that define the business entities; the Metadata Rules may define rules and constraints governing the interrelationships between entities and their ultimate realization as physical schema definitions for the Structured Data Store ABB.

Data and Information Security ABBs

The Data and Information Security layer is represented by only one ABB: Identity Disambiguation. This is admittedly sparse; the field of information security is only now starting to get the attention it deserves, in light of data increasingly being considered an enterprise asset, and it will continue to grow and mature over time. As an example, with the Internet of Things (IoT) becoming increasingly pervasive, connectivity and interaction with device instrumentation (which runs critical operations such as oil production, refinery operations, steel production, and so on) require more secure networks and strict access mechanisms.

Identity Disambiguation focuses on ensuring that the proper masking and filtering algorithms are applied to disambiguate the identity of assets (especially humans) whose data and profile information may be leveraged in analytical decision making.

We’ve concluded our treatment of the representative ABBs in the various layers of the ARA. With the layers given some attention to identify a set of representative ABBs, now let’s apportion equal attention to the analytics pillars. They too deserve some further discussion.

Descriptive Analytics ABBs

The Descriptive Analytics pillar is represented by three ABBs: Reporting Workbench, Dimensional Analysis, and Descriptive Modeling.

Reporting Workbench provides and supports a comprehensive set of tools to define and design analytical reports that support a set of predefined business metrics, objectives, and goals. It should additionally support the ability and tooling to test and deploy the reports and widgets onto a deployment runtime. Some nonfunctional features worthy of consideration may include (but are not limited to)

• Ease of use for business users to configure and define reports and widgets

• The richness, fidelity, and advanced visual features to support attractive, intuitive, and information-rich visualizations

• Customization capabilities to connect to different data sources and graphical layouts

Dimensional Analysis provides the capability to slice and dice the data across various dimensions to develop a domain-specific view of data and its subsequent analysis. This ABB also supports tools and techniques for developing data marts and data cubes to represent data for specific domains and targeted analytical reports on historical data.
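
As a minimal sketch of slicing and dicing across dimensions, the following code uses pandas to roll a small, illustrative sales data set up into a region-by-product-by-quarter view.

# A minimal sketch of dimensional "slice and dice" using pandas; the sales
# records and dimension names are illustrative only.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["X", "Y", "X", "X", "Y"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "revenue": [100, 150, 90, 120, 80],
})

# Roll revenue up along the region and product dimensions, by quarter.
cube = sales.pivot_table(index=["region", "product"],
                         columns="quarter",
                         values="revenue",
                         aggfunc="sum",
                         fill_value=0)
print(cube)

# "Slice": fix one dimension value and analyze the remaining dimensions.
print(cube.loc["West"])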

Descriptive Modeling develops data models that specifically cater to the generation of business reports that can describe, in multiple ways, how users may like to analyze (and hence display) the information. Such models are built on top of the data models in data warehouses and data marts, focusing on generating flexible reports.

Predictive Analytics ABBs

The Predictive Analytics pillar is represented by three ABBs: Predictive Modeling, Analytics Workbench, and Analytics Sandbox.

Predictive Modeling focuses on employing data analytics along with statistical and probabilistic techniques to build predictive models, which can predict a future event along with a degree of confidence in the event’s occurrence. It leverages two broad classes of techniques: supervised and unsupervised learning. As illustrated earlier in the chapter, in supervised learning, the target outcome (or variable) to be predicted is known ahead of time (for example, failure of an aircraft engine). Statistical, algorithmic, and mathematical techniques are used to mine and analyze historical data to identify trends, patterns, anomalies, and outliers and quantify them into one or more analytical models containing a set of predictors that contribute to predicting the outcome. In unsupervised learning, the target is neither known ahead of time, nor are there any historical events available. Clustering techniques are used to segregate the data into a set of clusters, which help determine a natural grouping of features, and more importantly of behavior and patterns, in the data set.
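
The following minimal sketch contrasts the two classes of techniques using scikit-learn on synthetic data; the feature and label construction is illustrative only.

# A minimal sketch contrasting supervised learning (a labeled target is known)
# with unsupervised clustering (no target); the synthetic data is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # e.g., sensor-derived features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # known outcome, e.g., failure flag

# Supervised: learn to predict the labeled outcome with a confidence score.
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:1]))                # probability of each class

# Unsupervised: no labels; find a natural grouping of behavior in the data.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(clusters))                   # size of each discovered segment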

Analytics Workbench provides an integrated set of tools to help the data analysts and data scientists perform the activities around data understanding, data preparation, model development and training, model testing, and model deployment.

Some of the capabilities provided by the workbench may be (but are not limited to)

• Mathematical modeling tools and techniques (for example, linear and nonlinear programming, stochastic techniques, probability axioms and models)

• Ability to connect to the analytics sandbox

• Coverage of the most common techniques (for example, SQL, SPARQL, and MapReduce) for introspecting data from multiple storage types (that is, data warehouses, semantic data stores, and structured data stores)

• Ability to perform text parsing

• Ability to build semantic ontology models

Analytics Sandbox provides the infrastructure platform required to perform all activities necessary to build, maintain, and enhance the Predictive Analytics assets. The sandbox needs to ensure that commensurate capacity (shared or dedicated) is available to compute and run complex, intensive algorithms and their associated number crunching against very large data sets. The sandbox is expected to provide data scientists with access to any and all data sources and data sets that may be interesting or required to perform a commensurate level of data analysis necessary to build predictive models.

Some considerations may be (but are not limited to)

• A dedicated sandbox environment where the necessary data and tools are made available for analysis

• A shared sandbox environment that is configurable, appropriately partitioned, and workload optimized

Prescriptive Analytics ABBs

The Prescriptive Analytics pillar is represented by three ABBs: Business Systems Interface, Business Rules Engine, and Decision Optimization.

Business Systems Interface addresses making Prescriptive Analytics outcomes available to the various enterprise business systems of the organization. It exploits the capabilities of an Analytical Data Bus (a new term I just introduced!) to push the generated insights (from this layer) to the business systems.

Note that although the Analytical Data Bus is not represented explicitly in the reference architecture, its physical realization may be the standard Enterprise Service Bus (ESB), which is usually present in most IT integration middleware landscapes.

Business Rules Engine focuses on providing the necessary tooling and runtime environment to support building, authoring, and deploying business rules. The intent of this component could be to provide the flexibility for business users to author business rules by combining and correlating the outcomes from, for example, predictive models, external factors (such as environmental conditions and human skill sets), actions, and event trigger outputs. The purpose is to correlate them both in space (from multiple locations) and in time (occurring at different times) to come up with more prescriptive outcomes. It may serve as an enabler to the Decision Optimization building block.
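
A minimal sketch of such a rule, expressed as plain code rather than any particular rules-engine product, might correlate a predictive model’s failure score with an external site condition; the thresholds and field names are hypothetical.

# A minimal sketch of a prescriptive business rule that correlates a predictive
# model score with an external factor; thresholds and actions are hypothetical.
def maintenance_rule(failure_probability: float, ambient_temp_c: float) -> str:
    """Prescribe an action from a predicted failure risk and site conditions."""
    if failure_probability > 0.8:
        return "CREATE_URGENT_WORK_ORDER"
    if failure_probability > 0.5 and ambient_temp_c > 40:
        return "SCHEDULE_INSPECTION_NEXT_SHIFT"
    return "NO_ACTION"

print(maintenance_rule(0.62, 45))   # -> SCHEDULE_INSPECTION_NEXT_SHIFT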

Decision Optimization builds on top of capabilities realized from ABBs within the Prescriptive Analytics tower and from other analytics towers; it focuses on applying optimization techniques. Constrained and unconstrained optimization methods, linear programming, and nonlinear programming (such as quadratic programming) are some of the techniques used to formulate maximizations or minimizations of objective functions. Examples may be to maximize the profit margin of an energy and utilities company or to minimize the cost of servicing warrantied items for a retail company.
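
As a minimal sketch of such an optimization, the following code formulates a small, illustrative product-mix problem as a linear program using scipy.optimize.linprog (maximization is expressed by negating the objective).

# A minimal sketch of decision optimization as a linear program; the product
# mix, coefficients, and constraints are illustrative only.
from scipy.optimize import linprog

# Maximize profit 20*x1 + 30*x2 (linprog minimizes, so negate the objective).
c = [-20, -30]
A_ub = [[1, 2],    # machine hours: 1*x1 + 2*x2 <= 40
        [3, 1]]    # labor hours:   3*x1 + 1*x2 <= 45
b_ub = [40, 45]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print(result.x, -result.fun)   # optimal production plan and maximum profit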

Operational Analytics ABBs

The Operational Analytics pillar may be represented by three ABBs: Real-Time Model Scoring, Real-Time Rules Execution, and Real-Time KPIs and Alerts.

Real-Time Model Scoring focuses on executing the predictive models in real time; that is, on the data in motion. It allows the predictive models to be invoked at the point where data is ingested into the system, thereby enabling real-time scores that allow the business to take action in near real time. As an example, a predictive model can determine whether a semiconductor fabrication run will have quality issues and hence result in scrap. Such a model can be invoked at the time the fabrication assembly line produces the data from the robotic equipment. This results in early detection of scrap and thus reduces the Cost of Poor Quality (COPQ), which is a key business metric in the semiconductor manufacturing industry.

Real-Time Rules Execution focuses on executing the business rules in real time, that is, on the data in motion. It allows the business rules to be invoked at the point where data is ingested into the system, thereby enabling real-time execution of business rules. As an example, rules that can determine whether a credit card transaction is fraudulent can be invoked at the time when the transaction data is being captured.

Real-Time KPIs and Alerts focuses on computing key operational metrics defined as key performance indicators. The KPIs, which can range from simple formulations to complex state machine derivations, may be calculated on the data in motion; that is, they are calculated as and when the generated data becomes available in the system. Such KPIs can be annotated with thresholds and other measures that, when breached, result in the generation of alerts that are sent to the relevant users in near real time. As an example, the deviation of the operating conditions of a mining machine (for example, equipment working underground to produce coal) can be formulated into a set of complex state machines and associated KPIs. These state machines and KPIs can be computed in real time, and alerts can be generated to inform the operators that the machine is not being used to its optimum capacity. Such real-time KPIs and alerts enable the operators to make the necessary changes to obtain maximum production in shift operations.
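
A minimal sketch of such a KPI computed on data in motion follows; the rolling window, utilization threshold, and readings are illustrative only.

# A minimal sketch of computing a KPI over data in motion and raising an alert
# when a threshold is breached; the stream and threshold are illustrative.
from collections import deque

WINDOW = 10                      # rolling window of most recent readings
UTILIZATION_FLOOR = 0.7          # alert if average utilization drops below 70%

def monitor(readings):
    window = deque(maxlen=WINDOW)
    for value in readings:       # each value arrives as it is generated
        window.append(value)
        kpi = sum(window) / len(window)
        if len(window) == WINDOW and kpi < UTILIZATION_FLOOR:
            yield f"ALERT: rolling utilization {kpi:.2f} below target"

stream = [0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.62, 0.6, 0.58, 0.55, 0.5]
for alert in monitor(stream):
    print(alert)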

Cognitive Computing ABBs

The Cognitive Computing pillar may be represented by three ABBs: Insight Discovery, Semi-Autonomic Decisioning, and Human Advisor.

Insight Discovery focuses on continuously mining the combination of new and existing information corpora to discover new relationships between entities, in preparation for supporting richer evidence when faced with complex real-world questions.

Semi-Autonomic Decisioning focuses on parsing real-world questions, breaking them down into smaller constituent questions, generating multiple hypotheses for each subquestion, gathering evidence in support or refutation of each hypothesis, and then leveraging confidence weightings (that is, statistical and mathematical techniques to derive the best outcome) to finally combine and generate the best possible response. The component, at its current state of maturity, still serves as an aid to the human decision-making system (hence, semi-autonomic), with the ultimate future goal of being the decision maker!

Human Advisor focuses on combining the capabilities of the insight discovery and the semi-autonomic decisioning components to function as an interactive guide (with a rich and intuitive graphical user interface) to humans, helping them through question-answering sessions to arrive at a well-informed and evidence-supported answer.

This completes our illustration of the ABBs of an ARB!

It may be worthwhile to note that the market geared toward providing the components in these layers and pillars is competitive by its very nature. Product vendors will keep coming up with enhanced capabilities in support of various combinations of features and functions. Do not be surprised if you come across vendor products supporting multiple features within or across layers or pillars in the ARB.

Summary

The analytics clock should keep ticking, generating moments of insight.

Analytics is at work. Most organizations that are serious about identifying innovative ways of lowering costs, increasing revenue, and differentiating themselves for competitive advantages are making analytics a mainstream business strategy.

This chapter identified five foundational subdisciplines within analytics that form the analytics continuum: Descriptive Analytics, Predictive Analytics, Prescriptive Analytics, Operational Analytics, and Cognitive Computing.

Descriptive Analytics answers the question what already happened? Predictive Analytics attempts to foretell what may happen in the future. Prescriptive Analytics attempts to prescribe what we should do if something happens. Operational Analytics brings the application of analytics to where data is generated. Finally, Cognitive Computing attempts to aid the human as an advisor.

One theory postulates that an organization’s analytics maturity should follow this order—that is, start with Descriptive Analytics and then move into Predictive, Prescriptive, and then Cognitive. Another theory postulates that an organization can simultaneously mature itself in most if not all of the analytic disciplines. There is no one correct answer, and the choice depends on the business imperatives and strategy. Operational Analytics does not need to follow the sequence because it caters to real-time analytics on data in motion; not all organizations may require it, nor would it strictly depend on the other analytic disciplines as a prerequisite.

I framed an analytics reference architecture consisting of seven horizontal layers and three vertical, cross-cutting layers, along with five pillars (representing the analytics continuum). The architecture layers address how different data types require different data ingestion techniques; how different data storage capabilities provision the data; how a model-based approach, driven by metadata definitions, consolidates and virtualizes the data for consistent and standardized access; and how proper governance around data, integration, and analytic assets is maintained with appropriate data and information security measures. The pillars focus on the five analytic disciplines: Descriptive -> Predictive -> Prescriptive -> Operational -> Cognitive. Often a reference architecture is met with an unnecessary waste of energy spent debating whether it truly is a reference architecture; in such situations, it is okay for us, as practical architects, to give it a different, less conflicting title such as analytics reference model, analytics architecture blueprint, and so on.

It is important to acknowledge that the reference architecture serves as a guideline: a baseline from which you can innovate, improvise, and develop an analytics architecture that not only supports the business strategy and objectives but also acknowledges the IT capabilities of an organization. Furthermore, I did not illustrate all the concepts in exhaustive detail; I meant to make you aware of their relevance and hence of the imperative to exercise self-driven research in such topics (for example, ontologies, cognitive computing, industry standard information models).

For a practical software architect, having a firm understanding of analytics and its capabilities could be an important differentiation!

Like all good things, this book too needs to come to an end. I reflect back on the topics that I tried to cover and feel that I was able to address the areas in software architecture that I had in mind when I conceptualized this book. However, just as with anything close to our hearts that we do not want to leave or finish, I keep thinking about what else I could have shared with you. I made up my mind to dedicate one last chapter to sharing some of the experiences that I have picked up over my professional years. The next chapter, thus, is a collection of a few such experiences. Although they were gathered the hard way, they were rich in the lessons I learned from them.

References

ACORD. (n.d.). Retrieved from http://www.acord.org/. This insurance industry standards specification also includes a data and information standard.

Davenport, T., & Harris, J. (2007). Competing on analytics: The new science of winning. Boston: Harvard Business Review Press.

Davenport, T., Harris, J., & Morison, R. (2010). Analytics at work: Smarter decisions, better results. Boston: Harvard Business Review Press.

Healthcare Information Technology Standards Panel (HITSP). (n.d.). Retrieved from http://www.hitsp.org. This site shares information across organizations and systems.

IBM. (2012–2013). Smarter analytics: Driving customer interactions with the IBM Next Action Solution. Retrieved from http://www.redbooks.ibm.com/redpapers/pdfs/redp4888.pdf.

IBM. (2015). The new frontier for personalized customer experience: IBM Predictive Customer Intelligence. Retrieved from http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=WH&infotype=SA&appname=SWGE_YT_HY_USEN&htmlfid=YTW03379USEN&attachment=YTW03379USEN.PDF#loaded.

IBM. (n.d.). FAQs on DeepQA. Retrieved from https://www.research.ibm.com/deepqa/faq.shtml.

IBM. (n.d.). Predictive asset optimization. Retrieved from http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=GBW03217USEN.

IBM Institute of Business Value. (n.d.). Analytics: The speed advantage. Retrieved from http://www-935.ibm.com/services/us/gbs/thoughtleadership/2014analytics/.

IBM Institute of Business Value. (n.d.). Your cognitive future. Retrieved from http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=XB&infotype=PM&appname=CB_BU_B_CBUE_GB_TI_USEN&htmlfid=GBE03641USEN&attachment=GBE03641USEN.PDF#loaded.

Jones, T. (2013). Recommender systems. Retrieved from http://www.ibm.com/developerworks/library/os-recommender1/.

W3C. (2008, January 15). SPARQL specifications. Retrieved from http://www.w3.org/TR/rdf-sparql-query/.
