Solution reference architecture
This chapter describes possible reference architectures that you can use when you deploy big data and analytics solutions, including the integration of their components. It refers to architectures that are implemented on IBM Power Systems servers, and it focuses on a solution that implements the following capabilities:
Big data exploration
Relational database with acceleration for analytics
Analytical decision management
Reporting and dashboarding on top of many sources of data (structured and unstructured)
This chapter describes both the hardware architecture and the software components from a deployment point of view, and it illustrates how the components integrate to build a robust and easy-to-manage analytics solution.
You can create the environments by using all or part of the architecture, according to your goals, environment size, and so on.
This chapter also describes the following topics:
Hardware architecture, including virtualization and management
Software architecture for applications
Data store approach
IBM BigInsights® for Apache Hadoop cluster management deployment
2.1 Big data and analytics general architectures
With the continuous growth of data and its variety, big data and analytics disciplines and architectures are extensively studied and improved. To gain effective insights from data, it is necessary to understand and explore the many data sources, how to store the data effectively, and how to discover and visualize it.
Data sources vary in origin and structure. The traditional sources are transactional database applications that contain structured data. New sources with data in an unstructured format are explored more and more often. Certain sources of data are text-based documents, such as logs from web applications and data feeds from social-networking applications. The integration of these data sources forms the foundation for federated analytics. Correlating and analyzing these varieties of data is essential to gain valuable insights, and the speed with which those insights are obtained is critical to the enterprise.
Figure 2-1 shows a logical view of a high-level big data, data storage, analysis, discovery, and visualization reference architecture.
Figure 2-1 IBM big data and analytics reference architecture
From a big data perspective, Hadoop clusters are intensively used as data repositories for landing and exploring many kinds of structured and unstructured data, and for discovering ways of gaining insights from it. IBM BigInsights for Apache Hadoop is the IBM platform for managing and analyzing persistent big data. It is built on top of the IBM Open Data Platform for Apache Hadoop, which consists entirely of Apache Hadoop open source components, such as Apache Ambari, HDFS, Flume, Hive, and ZooKeeper (Figure 2-2 shows the components for this solution).
The components that apply to your deployment depend on the data sources that you plan to integrate. Consider your current plans and future needs when you decide about the initial deployment so that the infrastructure can easily grow to support your business as your business changes.
Figure 2-2 IBM BigInsights for Apache Hadoop major components
The data that is stored in Hadoop clusters can provide many kinds of actionable insights, and different sorts of analyses can be performed on this data. For example, based on past events and recorded transactions, you can use reporting and analysis tools, such as IBM Cognos, to help you gain insight and answer questions such as "What happened?" and "Why did it happen?" You can also use modeling and predictive analysis tools, such as IBM Statistical Package for the Social Sciences (SPSS), to help you search the data for answers to questions such as "What is likely to happen?"
Modeling and predictive analysis tools help with the decision-making processes.
In this way, analytics applications can use big data environments to help businesses in many areas, such as customer experience, fraud detection, and IT economics.
2.2 Hardware reference architecture
Selecting a hardware architecture to support the deployment of a big data and analytics solution requires an understanding of the relevant components and how they affect many aspects, such as performance, reliability, availability and serviceability (RAS), costs, and management. This publication shows a few available alternatives and how these components can help with the deployment of an efficient environment. Additionally, this book shows the deployment of a big data and analytics solution scenario that shows many of the reference architecture components.
One aspect to consider is which hardware architecture is correct for your environment. The answer to this question depends on whether you own any IBM Power Systems hardware, the type of hardware, whether you virtualize the hardware, or how you virtualize the hardware. To start this process, consider the following questions:
Do you use existing servers or new servers?
Do you use high-end Power Systems servers or entry-level Power Systems servers?
Do you use Linux-only Power Systems servers or general-purpose Power Systems servers?
Do you use virtualization or bare-metal servers?
Do you deploy the solution in a cloud environment or in a technical computing cloud environment?
2.2.1 IBM Power Systems servers
IBM Power Systems servers can provide high-processing capacity, high bandwidth, and highly reliable hardware. However, you must consider a few points and make several decisions when you select the hardware for a big data and analytics environment deployment.
IBM Power Systems servers come in two basic types, enterprise and scale-out:
Enterprise servers provide enough capacity to grow the environment by using the same hardware (vertically). They offer advantages, such as systems consolidation, reduced floor space, and energy savings. You can use these servers to deploy applications that benefit from a scale-up architecture in terms of processor and memory, and virtualization features.
Scale-out servers are based on the principle that the application will grow in capacity by adding new servers to the environment (horizontally). You can use these servers to deploy big data clustered workloads that do not benefit from a scale-up architecture in terms of memory and processor or from server virtualization (deployed in a bare-metal node architecture).
If you plan to deploy analytics services that can be managed within a virtualized environment, it makes sense to use high-end servers. You can manage logical partitions (LPARs) on your servers to create resource boundaries so that your application workload can fully use its allocated resources while you still consolidate your hardware. Moreover, you can create a multitenant (cloud-style) environment by using IBM Cloud Manager with OpenStack, which handles the cloud management and multitenancy for you. However, achieving symmetry might be more challenging in this case.
Scale-out servers can be managed by using IBM Platform Computing solutions, where you have a clearer distinction between node roles, such as management nodes and computing or data nodes. No physical hardware overlap exists among these roles. With this architecture type, it is easier to achieve symmetry of the hardware architecture. Also, you can use software, such as IBM Platform Cluster Manager - Advanced Edition, to automate the deployment of the IBM Open Data Platform for Apache Hadoop on the nodes.
Additionally, several predefined big data solutions can be deployed, such as IBM Data Engine for Analytics and IBM POSHv2, to take advantage of the IBM Open Data Platform for Apache Hadoop and IBM BigInsights for Apache Hadoop workloads.
Another consideration is whether to use virtualization or bare-metal servers. Consider the following important points:
Scale-out or scale-up components
Solution symmetry
Automated, available deployment tools
Size of the servers (scale-out or enterprise)
Perhaps you are considering a mixed implementation in terms of virtualization. For example, you might virtualize applications, such as Cognos, SPSS, and DB2, that traditionally scale up, and choose not to deploy them as clustered nodes. In a BigInsights for Apache Hadoop cluster, you can apply the same approach to the management nodes and use bare-metal nodes for the data nodes because data nodes are traditionally scale-out components.
Also keep in mind that application support might restrict you to a specific operating system. For example, BigInsights for Apache Hadoop supports Linux platforms only. Since the launch of the IBM POWER7® hardware, IBM has made available server models that run only Linux (as opposed to also running IBM AIX® and IBM i). At the time that this publication was written, IBM had the following available Power Systems server models that run only Linux:
Power S812L (POWER8)
Power S822L (POWER8)
Power S824L (POWER8)
These server models fit more into the entry-level or mid-level server category than the high-end server category. So, these servers can benefit more from the bare-metal approach than from the virtualized approach. However, if you must virtualize them, you can do so without difficulty. The POWER8 server models also benefit from PowerKVM virtualization. The S812L or S822L server model that is combined with PowerKVM-based virtualization provides a good price-performance option for virtualizing a Power Systems server for Linux-only environments.
For more information about the IBM Power Systems server models, see the IBM Power Systems Quick Reference Guide at the following website:
2.2.2 General architecture for BigInsights for Apache Hadoop
A BigInsights for Apache Hadoop cluster consists of management nodes and data nodes. Management nodes host the following services:
Ambari
Oozie
Big SQL
Catalog
ZooKeeper
HBase
Hive
IBM Platform Symphony®
Data nodes are the nodes that perform work based on the workloads that are running in the cluster. These nodes are interconnected through Internet Protocol networks. Therefore, a well-designed network architecture is important, and it plays a central role in cluster performance. As a preferred practice for a BigInsights for Apache Hadoop cluster, define at least three networks:
A data network
A public network
An administrative network
Another aspect of a BigInsights for Apache Hadoop hardware architecture is where the data is stored. Because of the way that the MapReduce framework works, jobs are scheduled on nodes where data is local to minimize network transfers of massive amounts of data. Therefore, the typical architecture uses disks that are assigned to only a single node, which is called a shared-nothing architecture. File systems must be aware of this architecture to achieve the goals of MapReduce workloads.
The Hadoop Distributed File System supports the shared-nothing architecture. Also, Spectrum Scale supports the shared-nothing architecture through its File Placement Optimizer feature. This book focuses on the use of Spectrum Scale as the file system in the architecture design, working with the File Placement Optimizer and also considering an IBM Elastic Storage™ Server approach. For more information about possible disk layouts, see 2.2.5, “Data storage” on page 14.
Going further into the architecture, you can add nodes that ease the management of the hardware in the BigInsights for Apache Hadoop cluster. Imagine that you want to either add or remove a node from an existing BigInsights for Apache Hadoop cluster, or that you want to create multiple, independent BigInsights for Apache Hadoop clusters. Performing these tasks manually is time-consuming. Cluster management software eases this work.
The scenarios and examples that are implemented in this book used IBM Platform Cluster Manager - Advanced Edition to provision the BigInsights for Apache Hadoop nodes. Platform Cluster Manager - Advanced Edition can perform bare-metal provisioning and apply cluster templates during this process so that you can conveniently deploy a whole cluster, with management and data nodes, in an automated fashion. The advantages for this configuration are described in 2.3.6, “Cluster management” on page 34.
You can add a system management node to your overall BigInsights for Apache Hadoop server farm to install Platform Cluster Manager - Advanced Edition for managing your hardware. This addition changes the layout of your network environment because you set Platform Cluster Manager - Advanced Edition to communicate with the flexible service processor (FSP) ports of your Power Systems hardware. Also, Platform Cluster Manager uses a provisioning network to deploy systems. You can use your administrative network or a fourth network to isolate the traffic for systems provisioning. A detailed explanation of the networking components is available in 2.2.4, “Networking” on page 12.
2.2.3 General architecture for analytics applications
Many analytic applications can be implemented in a big data and analytics environment. This publication focuses on several of these applications:
Reporting and analysis with the IBM Cognos Business Intelligence, which allows the visualization of data from different sources, such as BigInsights for Apache Hadoop and IBM DB2
Decision Management with IBM SPSS Analytical Decision Management so that you can set up a “business rules engine” for optimizing transactional decisions and consistently maximize outcomes
In-memory acceleration for transactional databases with DB2 BLU to maximize performance and efficiency when you analyze transactional data from an online analytical processing (OLAP) approach
These workloads provide flexibility for how you might deploy them. The workloads can easily be deployed as LPARs by using a scale-up hardware architecture, benefitting from virtualization and system consolidation.
DB2 BLU specifically accelerates analytic workloads by working with columnar tables and optimized in-memory processing. A scale-up architecture might provide better memory bandwidth, increased memory throughput, and better memory RAS, which helps reduce the risk of data loss.
In a virtualized environment deployment, you can use the Hardware Management Console (HMC) and other cloud management software, such as IBM PowerVC and IBM Cloud Manager with OpenStack, as the infrastructure management components.
In a deployment that uses individual nodes for each application or dedicated LPAR in scale-out systems, Platform Cluster Manager - Advanced Edition can also be used for the deployment and management of those nodes, similar to BigInsights for Apache Hadoop nodes.
Additionally, in a production environment, we suggest that you deploy a high-availability environment to minimize unplanned downtime and minimize the effect of hardware or software failures.
For data storage, consider the use of a storage area network (SAN) for images, repositories, and data. Take advantage of the benefits of storage consolidation, live partition mobility, and so on. Software-defined storage is also an approach that can be used for additional benefits. For more information, see 2.2.5, “Data storage” on page 14.
2.2.4 Networking
BigInsights for Apache Hadoop uses three networks: administrative, public, and data.
The administrative network is used for accessing the nodes to perform administrative tasks, such as verifying logs, starting or stopping services, and performing maintenance. Administrators use it to reach the nodes through Secure Shell (SSH) or through virtual network computing (VNC). This network can be as simple as a 1-Gb Ethernet port with high availability through Ethernet bonding.
Based on your environment's requirements, the administrative network can be segregated into separate virtual local area networks (VLANs) or subnets. It is connected to your company's administrative network through a firewall to prevent non-IT management personnel from reaching the IT servers.
The public network is the gateway to the applications and services that are provided by the BigInsights for Apache Hadoop cluster. Think of the public network as the public face of your corporate network. It is the network that you use to access the Ambari web interface or the BigInsights home web portal and perform your big data work. Although all cluster nodes can be connected to this network, the management nodes are the only nodes with configurable services, such as an HTTP server service, on them. The reason why you connect all of your nodes to the public network is to prevent cabling rework if you are working with a dynamic environment, for example, when you manage multiple clusters through Platform Cluster Manager - Advanced Edition.
The data network is a private, fast interconnect network for the cluster nodes. It is used to move data among nodes, and move data into or out of the Hadoop file system for processing. The data network can be built with 10-Gb Ethernet adapters, InfiniBand adapters, or any other technology that provides high-throughput and low-latency network data transfers.
A BigInsights for Apache Hadoop cluster can connect to the corporate data network by using one or more edge nodes. These edge nodes provide a layer between your BigInsights for Apache Hadoop cluster and your data network, and you can use them to import data into your cluster. These nodes can be other Power Systems servers that are running Linux, or any other server type. In a large BigInsights for Apache Hadoop cluster, each rack can have an edge node, although this configuration is not mandatory.
Applications, such as Cognos Business Intelligence and SPSS Modeler, can use the edge nodes to connect to the cluster and use the capabilities of the Big SQL component on BigInsights for Apache Hadoop to connect to those clusters. For more information about this integration, see Chapter 4, “Scenario: How to implement the solution components” on page 49.
Figure 2-3 shows how the networking architecture looks in an environment that follows the guidelines in this chapter.
Figure 2-3 High-level BigInsights for Apache Hadoop cluster architecture: nodes, networks, and disks
You can deploy the general architecture that is described in this chapter in any kind of environment, whether the environment is bare-metal nodes, logical partitions (LPARs) in larger servers, or even a cloud environment.
If you plan to use Platform Cluster Manager - Advanced Edition to perform cluster management, add two more networks to the hardware architecture: the provisioning and service networks.
The service network is used for the hardware-level management functions, such as power-cycling the nodes in the cluster, hardware status monitoring, firmware configuration, and hardware console access. The service network connects Platform Cluster Manager - Advanced Edition to the FSP port of your Power Systems hardware in the same fashion as an HMC is connected to those ports. In fact, the Power Systems hardware has two FSP ports through which it communicates with the external world for hardware management. If you use an HMC to manage your hardware, it uses the primary HMC port on the server. So, in this case, you can use the secondary HMC port to allow Platform Cluster Manager to manage the hardware, also.
The provisioning network is used by Platform Cluster Manager - Advanced Edition to transfer operating system installation images onto the hardware, and to perform preinstallation and post-installation scripts for deployment customization.
Figure 2-4 shows a complete network architecture of a BigInsights for Apache Hadoop environment that is managed by Platform Cluster Manager - Advanced Edition. This scenario implements a provisioning network that differs from the administrative network. Because of the low traffic of an administrative network, and because provisioning traffic happens at particular points only, these two networks can be the same network.
Figure 2-4 BigInsights for Apache Hadoop cluster network diagram
Similar to big data clusters, other analytic applications can have their own network configuration requirements. The applications that are the focus in this book use a traditional network arrangement. The applications basically need access to the corporate data network and the administrative network (to perform administrative tasks). You might need access to other networks if your infrastructure has special requirements, for example, a backup network.
2.2.5 Data storage
When you implement a big data and analytics solution, you must consider many options for data storage, for example, internal disks and external disks.
When you deploy applications, for example, DB2, SPSS, and Cognos, we suggest that you use external storage for its benefits, such as flexibility, performance, and management capabilities. A software-defined storage approach can also be considered by using IBM Spectrum Scale™. The storage solution can be deployed through an Elastic Storage Server implementation, which combines Power Systems servers, storage enclosures, and disks with Spectrum Scale and its Native RAID technology to provide storage and data services for analytic and technical computing workloads.
When you plan for analytics application storage subsystems, consider the requirements of the different applications, including bandwidth and operational throughput. For example, DB2 and Hadoop workloads behave differently and their data access profiles differ, which might require different disk layouts and capabilities.
When you deploy a Hadoop cluster, the simplest shared-nothing disk layout that can be used with MapReduce workloads is the use of internal disks in the cluster nodes. It is usually the most cost-effective scenario. However, you can still achieve a shared-nothing environment by using disks that are external to the machines, either on storage expansion units or storage devices.
Scenarios that work with an external storage device can provide high availability in terms of disk access and still ensure a shared-nothing layout. This is accomplished by assigning the disks to two nodes simultaneously and by using Spectrum Scale to assign primary and secondary Network Shared Disk (NSD) servers in an alternating fashion. The secondary node serves a disk only if the primary disk server node for that disk fails. This behavior is controlled by the Spectrum Scale File Placement Optimizer (FPO) failure groups configuration. For more information about how this technology works, see 2.3.5, “Spectrum Scale and File Placement Optimizer” on page 33.
Consider that Hadoop-based technologies process large amounts of data that is local on a server to reduce I/O transfers over the network and to use fast I/O. Assume that your environment consists of 100 nodes, each with access to 10 disks, for a total of 1,000 disks. Can a single SAN unit and its SAN topology provide enough bandwidth to feed I/O to all 1,000 disks with performance that is as good as though each of the 100 nodes accessed 10 internal disks? For Hadoop workloads, we do not recommend that you use a SAN architecture without considering I/O performance.
2.2.6 IBM Data Engine for Analytics reference architecture
The IBM Data Engine for Analytics - Power Systems Edition (IDEA) provides an expertly designed, tightly integrated solution for running big data workloads. Consider choosing this solution for your analytics environment.
IDEA consists of a hardware and software implementation. It uses IBM Power Systems servers, which are managed by Platform Cluster Manager Advanced Edition, to deploy a big data cluster. Standard open source MapReduce applications are enabled through the inclusion of IBM Open Platform for Apache Hadoop. Additional value-add analytics capabilities are available through the optional inclusion of BigInsights Data Scientist or BigInsights Data Analyst. This optimized configuration enables users to become productive quickly.
The IDEA architecture was designed to provide client value through the following capabilities:
Integrated complete cluster solution
It has the necessary hardware and software components for implementing a BigInsights for Apache Hadoop cluster and starting to develop applications on top of it.
Best-in-class hardware
The IBM Power Systems hardware is known for its performance, high availability through component redundancy, robustness, and reliability. Moreover, and especially for Hadoop workloads, because certain Power Systems server models are targeted to run Linux only, it is a compelling choice over other x86-based server models. In essence, you have all of the Power Systems servers’ advantages at prices that compete with x86 servers.
Innovative storage
The IBM Elastic Storage Server provides the cluster storage solution, with scalable Portable Operating System Interface (POSIX)-compliant storage to house both structured and unstructured data. The Elastic Storage Server is built by using IBM Spectrum Scale, which is based on the same General Parallel File System (GPFS) technology that solved the challenges of managing large data sets in High Performance Computing (HPC) environments for over two decades.
Flexibility for storage and computing sizing
Use the shared storage approach, through the Elastic Storage Server, to tailor the configuration ratio between CPUs and storage capacity on the data nodes according to the characteristics of the cluster usage. Also, you can scale them independently. Clusters that support more storage-intensive applications can require a lower CPU-to-storage ratio, and clusters that support more CPU-intensive applications can require a higher CPU-to-storage ratio.
Multitenancy
The architecture supports multitenancy, which is achieved by using Platform Symphony Advanced Edition to configure resource sharing among groups of users with guaranteed service level agreements (SLAs).
Ease of deployment
The full solution is assembled and installed at an IBM delivery center before delivery with all included software preinstalled. Onsite services personnel integrate the solution into the client data center. The solution includes Platform Cluster Manager Advanced Edition to simplify deployment and monitoring of the cluster.
IDEA is built by using IBM POWER8 systems and uses all of their benefits, such as CPU performance, high memory bandwidth, and high RAS capabilities. All of the servers in the cluster are configured with Red Hat Enterprise Linux.
The hardware architecture consists of the following components:
An HMC
The HMC’s main function is to manage the system management node.
A system management node
Platform Cluster Manager Advanced Edition is installed on a system management node. It is used to provision and monitor the nodes that make up the cluster. The system management node is also used as a repository for operating system and software images for initial installation and updates. Usually, one system management node is sufficient, but if the cluster is large, or if high availability is important, two or more system management nodes are required. An IBM Power S812L is used for this node.
Analytics node: Management nodes
Management nodes are used for the management services in the BigInsights for Apache Hadoop cluster. Management nodes are typically distributed across three to six nodes, depending on the services that will run and whether high availability is required. Two management nodes can run in LPARs in a single IBM Power S822L server.
Analytics node: Data nodes
These servers store the data in the distributed environment. In a default configuration, each node is configured in a Power S822L server in a full partition configuration. If Big SQL is included in the solution through the inclusion of the optional BigInsights Data Scientist or Analyst added value packages, the data nodes must be configured with two LPARs for each server instead of a single LPAR for each server.
Analytics node: Edge nodes
An edge node is an optional node type that acts as a gateway between the BigInsights for Apache Hadoop cluster and the external environment, providing a path to load and unload data. These nodes are configured as management nodes with additional connections to the external network. Edge nodes can run as either one or two LPARs for each IBM Power S822L server.
Elastic Storage Server
The Elastic Storage Server (ESS) is an integrated shared storage solution that provides a Hadoop Distributed File System (HDFS)-compatible file system through Spectrum Scale. The solution consists of two Power S822L servers, each of which runs a single LPAR, and one of two different disk enclosure types. The smaller enclosure has a 24-disk capacity and the larger enclosure has a 60-disk capacity. Four models use the smaller enclosure type: the ESS 5146-GS1, 5146-GS2, 5146-GS4, and 5146-GS6 use one, two, four, and six enclosures. Three models use the larger enclosure type: the ESS 5146-GL2, 5146-GL4, and 5146-GL6 use two, four, and six enclosures.
Figure 2-5 shows an example of a preconfigured IDEA cluster.
Figure 2-5 IDEA infrastructure components example
IDEA is configured with three networks: management, service, and data. The management network provides the functions of both the administrative and provisioning networks, as described in 2.2.4, “Networking” on page 12.
Both the management and service networks use a 1 Gb Ethernet top-of-rack (TOR) switch. The service network VLAN requires one connection to the system management node, one connection to each server for out-of-band FSP hardware management, one connection to each network switch for out-of-band switch management, and two connections to each Elastic Storage Server storage enclosure for out-of-band storage management. The management network VLAN requires one connection to the system management node and one connection for each analytics node and storage server.
The data network VLAN requires one or two connections to the system management node, one or two connections to each analytics node, and a variable number of physical links for each storage server, which is determined by balancing aggregate network bandwidth to the aggregate storage bandwidth. The data network can use one of the following high-performance switch options:
10 Gb Ethernet top-of-rack switch (Mellanox SX1410, SX1400, or SX1036)
40 Gb Ethernet top-of-rack switch (Mellanox SX1710)
InfiniBand Fourteen Data Rate (FDR) (56 Gbps) top-of-rack switch (Mellanox SX6036)
Figure 2-6 shows the IDEA networks.
Figure 2-6 Networks that are used in IDEA
2.3 Software reference architecture
A big data and analytics environment can support many processes, as described in 2.1, “Big data and analytics general architectures” on page 6. The deployments of these environments consist of a set of software components, each of which offers benefits to your enterprise.
This section focuses on a set of software components to help solve business problems and manage environments:
IBM BigInsights for Apache Hadoop for data storage and analysis
Cognos Business Intelligence for reporting and dashboarding
SPSS for predictive analytics
DB2 with BLU acceleration for transactional data and OLAP operations
Spectrum Scale for data storage and parallel access
Platform Cluster Manager - Advanced Edition for infrastructure management in a cluster
2.3.1 IBM BigInsights for Apache Hadoop and IBM Open Platform with Apache Hadoop clusters
IBM BigInsights for Apache Hadoop is a software platform for discovering, analyzing, and visualizing data from disparate sources. The solution is used to help process and analyze the volume, variety, and velocity of data that continually enters organizations every day. BigInsights is a collection of added value services that can be installed on top of the IBM Open Platform with Apache Hadoop, which is the open Hadoop foundation.
By combining these technologies, BigInsights for Apache Hadoop extends the Hadoop open source framework with enterprise-grade security, governance, availability, integration into existing data stores, tools that simplify developer productivity, and more.
Hadoop is a computing environment that is built on top of a distributed, clustered file system that is designed specifically for large-scale data operations. Hadoop is designed to scan through large data sets to produce its results through a highly scalable, distributed batch processing system. Hadoop consists of two main components: a file system, which is known as the Hadoop Distributed File System (HDFS), and a programming paradigm, which is known as Hadoop MapReduce. To develop applications for Hadoop and interact with HDFS, you use additional technologies and programming languages, such as Pig, Hive, Flume, and many others.
Figure 2-2 on page 7 shows the software components on the BigInsights for Apache Hadoop architecture. Figure 2-7 shows how these components are packaged from a licensing perspective. IBM BigInsights Analyst, IBM BigInsights Data Scientist, and IBM BigInsights Enterprise Management extend the Hadoop open source framework by adding value packages according to what will be implemented in your big data cluster environment.
Figure 2-7 IBM BigInsights for Apache Hadoop software components
These components are the basis for building a BigInsights for Apache Hadoop cluster. This solution can be deployed on the hardware architectures that are described in 2.2, “Hardware reference architecture” on page 8.
This software stack runs on Linux on Power. For BigInsights for Apache Hadoop on Power Systems, the only supported version at the time that this publication was written is Red Hat Enterprise Linux 7.1 little endian. For more information about operating system requirements, see 4.2.5, “Installing the BigInsights value-add packages” on page 73.
Hadoop Distributed File System (HDFS)
Traditional open source Hadoop deployments use the Hadoop Distributed File System to store and share data across the many nodes in a cluster. Data is broken into smaller pieces that are called blocks, which are then distributed throughout the nodes in the cluster. This process also includes copying the blocks to increase the fault tolerance of the cluster. It is common to keep a total of three copies of the data in HDFS deployments.
An HDFS implementation has two major components:
DataNode
Each HDFS cluster has a number of DataNodes, with one DataNode for each node in the cluster. DataNodes manage the storage that is attached to the nodes on which they run. When a file is split into blocks, the blocks are stored in a set of DataNodes that are spread throughout the cluster. DataNodes are responsible for serving read and write requests from the clients on the file system, and they also handle block creation, deletion, and replication.
NameNode
An HDFS cluster supports multiple NameNodes; an active NameNode and a standby NameNode are a common setup for high availability. The NameNode regulates the access to files by clients, and it tracks all data files in HDFS. The NameNode determines the mapping of blocks to DataNodes, and it handles operations, such as opening, closing, and renaming files and directories. All of the NameNode's information is stored in memory, which provides quick response times for storage additions or read requests.
The NameNode is the repository for all HDFS metadata, and the user data never flows through the NameNode. A typical HDFS deployment has a dedicated computer that runs only the NameNode because the NameNode stores metadata in memory. If the computer that runs the NameNode fails, the metadata for the entire cluster is lost, so this server is typically more robust than other servers in the cluster.
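To make this division of labor concrete, the following minimal Java sketch writes a file through the HDFS client API. It is an illustration only: the NameNode host name, port, and path are placeholders, and it assumes that the standard Apache Hadoop client libraries are on the classpath. The client asks the NameNode where to place blocks, but the data itself flows directly to the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client contacts the NameNode for metadata; host and port are placeholders.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Ask for three replicas of each block, the common default noted above.
        Path file = new Path("/user/demo/sample.txt");
        short replication = 3;
        try (FSDataOutputStream out = fs.create(file, replication)) {
            // The data blocks flow directly to DataNodes, never through the NameNode.
            out.writeUTF("hello HDFS");
        }

        // Block size and other metadata are served from the NameNode's memory.
        System.out.println("Block size: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}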
IBM Open Platform for Apache Hadoop can be deployed by using an HDFS implementation. BigInsights for Apache Hadoop extends these capabilities by providing Spectrum Scale with File Placement Optimization technology as an alternative that brings many advantages over HDFS. This approach and its benefits are described in 2.2.5, “Data storage” on page 14. It is also possible to use an existing Spectrum Scale installation with a new IBM BigInsights for Apache Hadoop implementation. From an architectural point of view, the use of Spectrum Scale does not change anything in terms of the number of cluster nodes. It simply replaces one file system software with another.
The Elastic Storage Server, which is described in 2.2.5, “Data storage” on page 14, is based on Spectrum Scale technology. Also, it can be used in these deployments, as a dedicated installation, storing only the BigInsights for Apache Hadoop data, or as a shared installation, where other Spectrum Scale file systems might be created to store data from other applications, such as Cognos and DB2.
MapReduce and YARN
MapReduce is a programming paradigm where applications are divided into self-contained units of work, and each of them can run on any node in the cluster. In a Hadoop cluster, a MapReduce program is known as a job. A job is run by being broken down into pieces that are known as tasks. These tasks are scheduled to run on the nodes in the cluster where the data exists.
In IBM Open Platform with Apache Hadoop, the MapReduce framework, MapReduce v2, runs as a YARN workload framework. The benefits of this new approach are that resource management is separated from workload management, and MapReduce applications can coexist with other types of workloads, such as Spark or Slider.
MapReduce v2 jobs are executed by YARN in the Hadoop cluster. The YARN ResourceManager creates a MapReduce ApplicationMaster container, which requests additional containers for mapper and reducer tasks. The ApplicationMaster communicates with the NameNode to determine where all of the data that is required for the job exists across the cluster. It attempts to schedule tasks on the cluster where the data is stored, rather than sending data across the network to complete a task. The YARN framework and the HDFS typically exist on the same set of nodes, which enables the ResourceManager program to schedule tasks on nodes where the data is stored.
A MapReduce job splits the input data set into independent chunks that are processed by map tasks, which run in parallel. The map outputs, which are known as tuples, are key and value pairs. The reduce task, which is always completed after the map task, takes the output from the map task as input and combines the tuples into a smaller set of tuples. Each MapReduce ApplicationMaster monitors its created tasks. If a task fails to complete, the ApplicationMaster reschedules that task on another node in the cluster.
This distribution of work enables map tasks and reduce tasks to run on smaller subsets of larger data sets, which ultimately provides maximum scalability. The MapReduce framework also maximizes parallelism by manipulating data that is stored across multiple cluster nodes. MapReduce applications do not have to be written in Java, although most MapReduce programs that run natively under Hadoop are written in Java.
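The following sketch shows the canonical word count example written against the standard org.apache.hadoop.mapreduce API, to illustrate the map and reduce phases that were just described. The mapper emits a (word, 1) tuple for each token, and the reducer combines the tuples for each key into a count. Input and output paths are supplied as arguments; this is a generic illustration, not code that is specific to this solution.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit a (word, 1) tuple for every token in this input split.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Combine all tuples for one key into a single count.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}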
Integrating BigInsights for Apache Hadoop and IBM Platform Symphony
IBM Platform Symphony is a resource scheduler for grid environments. It works with grid-enabled applications, and it can provide high resource utilization rates with low latency for certain types of jobs.
Platform Symphony can be used in a BigInsights for Apache Hadoop environment as a job scheduler for MapReduce tasks. Platform Symphony can replace the open source Hadoop scheduler in a framework that is based on MapReduce. It can provide the following advantages:
Better performance by providing lower latency for certain MapReduce-based jobs.
Dynamic resource management that is based on slot allocation according to job priority and server thresholds.
A fair-share scheduling scheme with 10,000 priority levels for the jobs of an application.
A complete set of management tools for providing reports, job tracking, and alerting.
Reliability by providing a redundant architecture for MapReduce jobs in terms of NameNodes (in cases where the Hadoop file system is in use), job trackers, and task trackers.
Support for rolling upgrades, maximizing the uptime of your applications.
Openness, because it is compatible with multiple application programming interfaces (APIs) and languages, such as Hive, Pig, and Java. Also, it is compatible with both HDFS and Spectrum Scale.
IBM value add package: Big SQL
One of the most valuable features that is added by IBM BigInsights for Apache Hadoop is IBM Big SQL. Big SQL is a software layer that allows users and applications to query the Hadoop cluster by using familiar SQL statements.
It is a massively parallel processing (MPP) SQL engine that deploys directly on the physical Hadoop Distributed File System (HDFS) or Spectrum Scale cluster. This SQL engine pushes processing down to the same nodes that hold the data. Big SQL uses a low-latency parallel execution infrastructure that accesses Hadoop data natively for reading and writing.
Big SQL consists of two services: head and worker. Queries are received by the head nodes, which push them to the worker nodes to process and return the results. Deployments typically place the head nodes, a primary and a secondary, on the Hadoop management nodes and the worker nodes on the data nodes.
Big SQL uses the Hive database catalog (HCatalog) for table definitions, location, storage format, and the encoding of input files. This catalog is on the head node. If the data is defined in the Hive Metastore and accessible in the Hadoop cluster, Big SQL can access it. Big SQL stores part of the metadata from the Hive catalog locally for ease of access and to facilitate query execution.
Big SQL uses the IBM Data Server Client drivers. This driver package uses the same standards-compliant Java Database Connectivity (JDBC), Java Combined Client (JCC), Open Database Connectivity (ODBC), call level interface (CLI), and .NET drivers that are used in other IBM software products, such as DB2 for Linux, UNIX, and Windows, IBM DB2 for z/OS®, and IBM Informix® database software. Because the same driver is shared across these platforms, other languages that already use these drivers, such as Ruby, Perl, Python, and PHP Hypertext Preprocessor (PHP), can interact with Big SQL with no additional custom configuration or drivers. Therefore, applications can interact between traditional database management systems (DBMSs) or data warehouse systems and Big SQL.
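Because Big SQL shares the IBM Data Server Client drivers, querying it from Java looks the same as querying DB2 over JDBC. The following sketch is an illustration under assumptions: the head node host name, port (often 32051, but it varies by deployment), credentials, and table name are placeholders, and the IBM JDBC driver (db2jcc4.jar) is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlQuerySketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: Big SQL head node host and port vary by deployment.
        String url = "jdbc:db2://headnode.example.com:32051/bigsql";
        try (Connection con = DriverManager.getConnection(url, "bigsql", "password");
             Statement stmt = con.createStatement();
             // The table is a placeholder; any table that is defined in the
             // Hive catalog and accessible in the cluster can be queried.
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sales")) {
            while (rs.next()) {
                System.out.println("Row count: " + rs.getLong(1));
            }
        }
    }
}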
BigSheets
BigSheets is a spreadsheet-like, web-based application that allows the dynamic analysis of data. With BigSheets, users can work with smaller subsets of the data to ensure that they are performing high-value transformations before they perform those transformations on the whole cluster. This approach reduces the workload on the system while still providing valuable insight.
BigR
R is an open source language that is used for statistical analysis and creating graphical displays of data. Although IBM BigInsights for Apache Hadoop does not install R itself, it does provide BigR. BigR is a collection of functions that integrate with R and remove the complexity of converting these jobs into MapReduce. The result is that your BigR jobs scale with your cluster.
2.3.2 DB2 with BLU Acceleration
IBM BLU® Acceleration® is one of the most significant advances in technology in DB2 and in the database market in general. Available with the IBM DB2 10.5 release, BLU Acceleration delivers performance improvements for analytic applications and reporting by using dynamic in-memory optimized columnar technologies.
Although BLU Acceleration is an important new technology in DB2, it is built directly into the DB2 kernel. BLU Acceleration is not merely an add-on feature: it is a part of DB2, and every component of DB2 is aware of it. BLU Acceleration still uses the same storage unit of pages, the same buffer pools, and the same backup and recovery mechanisms.
For more information about BLU Acceleration, see Architecting and Deploying DB2 with BLU Acceleration, SG24-8212, at the following website:
Typical experiences of using DB2 with BLU Acceleration show the following results:
Performance improvements of about 10x to 20x
Storage savings of about 5x to 20x versus uncompressed data
The performance and response time of IT systems, especially business intelligence systems, when you run reports are always a source of concern. No matter what is done, these systems’ performance and response times can always be improved.
Simple to implement and use
Keep it simple is a strong, almost mandatory concept in BLU Acceleration. Across its implementation, maintenance, and daily use, the following keep-it-simple characteristics are present:
Only one setting is necessary to optimize the DB2 system for BLU Acceleration: when the database is used for analytic workloads, you set a single registry variable (DB2_WORKLOAD=ANALYTICS) to optimize for analytics performance.
No additional workload and maintenance: No indexes, multidimensional clustering (MDC), statistical views, manual reorganization, or RUNSTATS (these last two tasks are automated).
All of the features are built into the DB2 kernel: SQL, language interfaces, administration, reusing the DB2 process model, storage concepts, and utilities.
Simple table creation and conversion.
Column store
The most basic and prominent feature of BLU Acceleration is the column-organized table type. Column-organized tables store each column on a separate set of pages on disk, which reduces the I/O that is necessary to process queries because only the referenced columns are loaded into memory from disk. The following features helped generate savings in tests:
Minimal I/O: By performing I/O only on the columns and values that match the query, and by reducing the working set of pages during the query progression
Work that is performed directly in columns: By working on individual columns for predicate evaluations, joins, scans, and so on, and not materializing rows until necessary to build the result set
Improved memory density and extreme compression: By keeping columnar data compressed in memory and by packing more data values into a small amount of memory or disk
Cache efficiency: By packing data into CPU cache-friendly structures
By being able to store both row-organized and column-organized tables in the same database, users can implement BLU Acceleration even in database environments where mixed online transaction processing (OLTP) and online analytical processing (OLAP) workloads are required. Again, BLU Acceleration is built into the DB2 engine. The SQL, optimizer, utilities, and other components are fully aware of both row-organized and column-organized tables at the same time.
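As a small illustration of this coexistence, the following hedged JDBC sketch creates one column-organized table for analytics and one row-organized table for OLTP in the same database. The host, port (50000 is a common DB2 default), database, credentials, and table definitions are placeholders, and a DB2 10.5 instance that is configured for analytic workloads (for example, with the DB2_WORKLOAD=ANALYTICS registry setting that is mentioned earlier) is assumed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BluTableSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, database, and credentials.
        String url = "jdbc:db2://db2host.example.com:50000/ANALYTDB";
        try (Connection con = DriverManager.getConnection(url, "db2inst1", "password");
             Statement stmt = con.createStatement()) {
            // A column-organized fact table for analytic queries.
            stmt.executeUpdate("CREATE TABLE sales_fact ("
                + " sale_date DATE, store_id INTEGER, amount DECIMAL(12,2)"
                + ") ORGANIZE BY COLUMN");
            // A row-organized table for OLTP work in the same database.
            stmt.executeUpdate("CREATE TABLE order_entry ("
                + " order_id INTEGER NOT NULL PRIMARY KEY, status CHAR(1)"
                + ") ORGANIZE BY ROW");
        }
    }
}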
Data skipping
Data skipping avoids the unnecessary processing of irrelevant data, further reducing the I/O that is required to complete a query.
Large sections of data that do not qualify for a query are automatically detected and ignored. Data skipping is used for data in memory (buffer pool) and on disk, and it helps to significantly reduce I/O, memory, and CPU consumption.
BLU Acceleration performs data skipping in the following way. As data is loaded into column-organized tables, BLU Acceleration tracks the minimum and maximum values on ranges of rows in metadata objects that are called the synopsis tables. These metadata objects (or synopsis tables) are dynamically managed and updated by the DB2 engine without intervention from the DBA.
When a query is run, BLU Acceleration looks up the synopsis tables for ranges of data that contain values that match the query. It effectively avoids the blocks of data values that do not satisfy the query, and it skips straight to the portions of data that match the query. The net effect is that only necessary data is read or loaded into system memory, which in turn provides a dramatic increase in the speed of query execution because much of the unnecessary scanning is avoided.
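The idea behind data skipping can be pictured with a few lines of code. The following toy Java sketch, which is not BLU Acceleration code, keeps the minimum and maximum value for each range of rows, as a synopsis table does, and scans only the ranges whose interval can contain the predicate value.

import java.util.List;

public class DataSkippingSketch {
    // A stand-in for one synopsis entry: the minimum and maximum values over a range of rows.
    record RangeSynopsis(int minValue, int maxValue, long firstRow, long lastRow) {}

    // Scan only the ranges whose [min, max] interval can contain the predicate value.
    static void scanForValue(List<RangeSynopsis> synopsis, int predicateValue) {
        for (RangeSynopsis r : synopsis) {
            if (predicateValue < r.minValue() || predicateValue > r.maxValue()) {
                continue; // skipped: no I/O is spent on rows firstRow..lastRow
            }
            System.out.printf("scan rows %d..%d%n", r.firstRow(), r.lastRow());
        }
    }

    public static void main(String[] args) {
        List<RangeSynopsis> synopsis = List.of(
            new RangeSynopsis(1, 150, 0, 1023),
            new RangeSynopsis(200, 450, 1024, 2047), // cannot contain 100, so it is skipped
            new RangeSynopsis(90, 120, 2048, 3071));
        scanForValue(synopsis, 100); // scans only the first and third ranges
    }
}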
Extreme or adaptive compression
The column data is compressed with actionable compression, which preserves order so that the data can be used without decompression, resulting in storage and CPU savings and a higher density of useful data that is held in memory.
This benefit is possible because of the following features:
Massive compression with approximate Huffman encoding, considering that the more frequent the value, the fewer bits it takes.
Encoded values that are packed into bits matching the register width of the CPU.
Late materialization, which is the ability to operate on the data while it is still compressed. Predicates and joins work directly on encoded values (actionable compression).
In addition to column-level compression, BLU Acceleration also uses page-level compression when appropriate to further compress the data based on the local clustering of values on individual data pages.
Because BLU Acceleration can handle query predicates without decoding the values, more data can be packed in the processor cache and buffer pools, which results in less disk I/O, better use of memory, and more effective use of the processor. Therefore, query performance is better and storage utilization is also reduced.
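A simplified illustration of operating on encoded values follows. The toy Java sketch below assigns dictionary codes in sorted value order so that a less-than predicate can be evaluated directly on the encoded column, without decompression. Real BLU Acceleration uses frequency-based, approximate Huffman encoding rather than this simple sorted dictionary; the sketch only conveys why order-preserving codes make compression actionable.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class OrderPreservingEncodingSketch {
    public static void main(String[] args) {
        // Build a dictionary over the distinct column values in sorted order, so that
        // comparing two codes gives the same answer as comparing the values themselves.
        List<String> column = List.of("DE", "BR", "US", "BR", "AU", "US");
        List<String> dictionary = new ArrayList<>(new TreeSet<>(column)); // AU, BR, DE, US

        int[] encoded = column.stream().mapToInt(dictionary::indexOf).toArray();

        // Evaluate the predicate value < 'DE' directly on the encoded column.
        int codeForDE = dictionary.indexOf("DE");
        for (int i = 0; i < encoded.length; i++) {
            if (encoded[i] < codeForDE) { // no decompression is needed
                System.out.println("row " + i + " qualifies: " + column.get(i));
            }
        }
    }
}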
Deep hardware exploitation
BLU Acceleration optimizes the entire access to the hardware and its usage, seeking every opportunity (memory, CPU, and I/O). BLU Acceleration is designed to fully use all of the computing resources that are provisioned to the DB2 server by using Single Instruction Multiple Data (SIMD)-capable CPUs.
SIMD instructions are low-level specific CPU instructions. DB2 can use a SIMD instruction to get results from multiple data elements (to perform equality predicate processing, for example) if they are in the same register. DB2 makes deep use of the processor and memory, and it integrates AIX Workload Management (WLM) characteristics within its own workload policies.
Considering the processor usage, DB2 offers these functions:
Deep usage of simultaneous multithreading (SMT)
Key IBM POWER® value proposition with the ability to dispatch many threads
Decimal arithmetic that is performed directly on the DECFLOAT accelerator
DB2 with BLU Acceleration has special algorithms that automatically take advantage of the built-in parallelism in the processors if SIMD-enabled hardware is available. The algorithms are another feature in BLU Acceleration that allows the use of special hardware instructions that work on multiple data elements with a single instruction.
Core-friendly parallelism
BLU Acceleration is a dynamic in-memory technology. It efficiently uses the processor cores in the current system, allowing queries to be processed with multi-core parallelism and to scale across processor sockets. Processing from the processor caches is maximized, and the latencies of reading from memory and, lastly, from disk are minimized.
Core-friendly parallelism consists of comprehensive algorithms that are designed to carefully place and align data that is likely to be revisited into the processor cache lines to maximize the hit rate to the processor cache, increasing the effectiveness of cache lines.
Parallel vector processing with multi-core parallelism and single-instruction, multiple-data (SIMD) parallelism helps to improve performance and better use the available CPU resources.
Core-friendly parallelism provides the following benefits:
Queries on BLU tables are automatically parallelized.
The power of multiple CPU cores is used fully.
CPU cache efficiency is maximized to optimize cache lines.
Optimal memory caching
DB2 automatically adapts the way that it operates based on the organization of the table (row-organized or column-organized) that is being accessed.
BLU Acceleration uses several attributes to optimize the memory cache:
New algorithms cache effectively in RAM (buffer pool).
Data can be larger than RAM. You do not need to ensure that all data fits in memory.
BLU Acceleration separates caching algorithms for BLU/OLTP data:
 – BLU: Scan-friendly caching that minimizes I/O
 – OLTP: Least recently used (LRU)-based page cleaning that reclaims buffer pool space without regard to future I/O.
BLU Acceleration includes a set of big data-aware algorithms for cleaning out memory pools that are more advanced than the typical LRU algorithms that are associated with traditional row-organized database technologies. These BLU Acceleration algorithms are designed from the bottom up to detect data patterns that are likely to be revisited, and to hold those pages in the buffer pool. These algorithms work with DB2 traditional row-based algorithms.
2.3.3 Predictive Analytics with SPSS
The ability to predict outcomes with a reasonable degree of confidence is of great strategic importance in current businesses. To address this need, the Statistical Package for the Social Sciences (SPSS) is available as part of the IBM Business Analytics solution portfolio.
IBM SPSS consists of a comprehensive set of tooling that uses data for decision-making. It is a solid product with over 40 years of presence in the market. It can be used to provide statistics, create predictive models, and deploy all of these analyses in your business.
The capability to predict is an advantage for your return on investment (ROI). You might have a good sense of your business, or perhaps several pre-built rules to help with decision making. However, it is not until you use analytics in decision management that you can, sustainably, choose the best path to follow for every point of impact of your business.
Predictive analytics can capture insights from historic data patterns and provide you with evidence from this data. Predictive analytics is flexible in its modeling to understand changes in trends, and it can analyze massive amounts of structured and unstructured data. Also, it can promptly make the information that is learned from the insights available to everyone who needs access to it. It increases ROI because business areas are not analyzed independently.
Cross-departmental analyses are performed. These analyses do not leave out business relationships that might otherwise remain undiscovered or be considered too small. As a result, every aspect of the business is optimized with these analyses.
IBM SPSS software is organized in product families:
Statistics provides evidence that is based on data and verifies hypotheses.
Modeling works out accurate predictions to aid decision making.
Deployment helps you act upon the impact points in your operations.
The following sections describe an overview and highlight the products from each of these families.
Statistics
This family is represented by the IBM SPSS Statistics software suite. It is based on sophisticated mathematics to validate hypotheses and assumptions. It is widely used by government, commercial, and academic institutions to solve business and research problems. You can use it to test an opinion on a new product, predict the acceptance of ideas, experiment with allocation within a supply chain, or test the efficiency of a medical treatment. This software uses data to back up (or refute) your theories. With this evidence, you are more confident when you decide.
Modeling
This family is represented by the IBM SPSS Modeler software. The previous family, statistics, is used to test hypotheses. With the SPSS Modeler software, you can create business models to predict future outcomes. It uncovers hard-to-identify relationships (within structured and unstructured data) that seem unrelated at first. You can predict the future and understand what happens based on what happened before. This capability is useful to prevent customer churn, for example, and to help people consistently make decisions. Another benefit of this software is that it can also explain the factors that drive future outcomes. You can use it to mitigate risks and take advantage of opportunities.
With the SPSS Modeler, you can create models in an intuitive and quick fashion, without programming. The SPSS Modeler includes the advanced, interactive visualization of models. Multiple techniques can be included within a model, and the results are easy to understand and communicate to your staff.
Deployment
This family is represented by the following software:
IBM SPSS Decision Management
IBM SPSS Collaboration and Deployment Services
The first software, SPSS Decision Management, is intended to automate and optimize small decisions that are made in day-to-day business operations in real time. It combines predictive analytics with business rules. The models are created in an easy-to-use interface with which the business user interacts without the specialized help of an analyst, statistician, or data miner. This more independent process allows people at any level of the organization to create automated models for making small decisions, helping to optimize every aspect of the overall business operation.
The second software, SPSS Collaboration and Deployment Services, enables widespread use and deployment of predictive analytics. It provides a centralized, secure, and auditable repository for analytical assets, advanced capabilities for managing and controlling predictive analytic processes, and sophisticated mechanisms for delivering the results of these assets to users.
IBM SPSS Collaboration and Deployment Services architecture
Figure 2-8 illustrates the typical architecture of an SPSS Collaboration and Deployment Services deployment, which consists of these components:
The central SPSS Collaboration and Deployment Services repository
The database server
Execution servers
Client servers that access the SPSS Collaboration and Deployment Services repository
Figure 2-8 SPSS Collaboration and Deployment Services architecture
In this architecture, all of the analytics assets are stored in the SPSS Collaboration and Deployment Services repository. Clients access these assets through web services or through specialized client tools.
The requests that these clients perform are sent to execution servers, which perform all of the work on the analytics data. The results are then stored in the SPSS Collaboration and Deployment Services repository, where the requesting clients can access them.
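As a sketch of that client interaction, the following Python fragment shows what a web service request against the repository might look like. The host name, port, path, and credentials are hypothetical placeholders and do not represent the documented SPSS Collaboration and Deployment Services API.

import requests

# Hypothetical server location and credentials.
BASE = "http://cads.example.com:8080"
session = requests.Session()
session.auth = ("analyst", "secret")

# Retrieve a stored analytical asset (the path is illustrative only);
# the execution servers do the actual scoring work on the server side.
resp = session.get(BASE + "/repository/assets/churn-model")
resp.raise_for_status()
print(resp.headers.get("Content-Type"), len(resp.content), "bytes")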
The following list describes the components that are presented in the architecture of Figure 2-8 on page 28:
IBM SPSS Collaboration and Deployment Services Repository
This component is used for collecting and storing analytical assets. It includes models and data at a centralized location.
IBM SPSS Collaboration and Deployment Services Deployment Manager
This component is responsible for creating, running, and automating analytical tasks, and users can use it to update the models that are stored in the repository.
IBM SPSS Collaboration and Deployment Services Deployment Portal
This web browser-based thin-client interface accesses the IBM SPSS Collaboration and Deployment Services Repository, runs analyses, and views output.
BIRT Report Designer for IBM SPSS
You can use BIRT Report Designer for IBM SPSS to create ad hoc reports against relational and file-based data sources.
IBM SPSS Collaboration and Deployment Services Enterprise View Driver
Use this component to access IBM SPSS Collaboration and Deployment Services Enterprise View objects that are stored in the repository from applications such as IBM SPSS Statistics and third-party products.
Browser-based Deployment Manager
The SPSS administrator uses this component to perform system management tasks, such as tuning and updates.
2.3.4 Reporting insights with IBM Cognos Business Intelligence
Organizations are constantly under pressure to understand and react quickly to new information. In addition, the complexity and volumes of data for all aspects of the environments in which organizations operate are increasing. Markets, regulatory environments, customer and supplier data, competitive information, and internal operational information all affect how data is viewed and interpreted. It is imperative for organizations to react correctly, dynamically, and in a timely fashion to answer key business questions and to outperform the competition.
From business intelligence to financial performance and strategy management to analytics applications, Cognos software can provide what your organization needs to become top-performing and analytics-driven. With products for the individual, workgroup, department, mid-sized business, and large enterprise, Cognos software is designed to help everyone in your organization make decisions that achieve better business outcomes for now and in the future.
Cognos Business Intelligence features
Cognos Business Intelligence provides the following features:
Reports
Cognos Business Intelligence software helps ensure that users are equipped with the reports they need to make fact-based decisions in a system that is simpler, faster, and easier to manage. From professional report authors who design one-to-many reports for the enterprise, to business users who need to create their own ad hoc queries or customize existing reports, Cognos Business Intelligence reporting capabilities fit the needs of users throughout your organization.
Analysis
With the analytics capabilities of Cognos Business Intelligence software, users can explore information and different perspectives easily and intuitively to ensure that they are making the correct decisions. General business users can easily view, assemble, and analyze the information that is required to make better decisions. Additionally, business and financial analysts can take advantage of more advanced, predictive, or what-if analysis capabilities.
Scorecards
Scorecards enable your organization to capture corporate strategy and communicate that strategy at the operational level. Executives and managers can define quantifiable goals and targets and track performance for business units, operating subsidiaries, and geographic regions to quickly identify the areas that need attention.
Dashboards
With dashboards, users can access, interact, and personalize content in a way that supports the unique way that they make decisions. Security-rich access to historic, current, and projected data means that users can quickly move from insight to action.
Statistics
Statistics capabilities help you incorporate statistical results with core business reporting, reducing the time that it takes to analyze data and prepare business presentations that are based on that analysis.
Mobile business intelligence
Mobile business intelligence capabilities enable your mobile workforce to interact with information in new ways by delivering relevant business intelligence wherever the workers are. Users interact with trusted business intelligence through a rich and visual experience, whether offline or online. The flexible platform ensures that mobile decision making is simple, reliable, and safe.
Real-time monitoring
Real-time monitoring capabilities provide your employees on the leading edge with a rich view of operational KPIs1 and measures while they occur to support up-to-the-moment decision making.
Collaboration
Collaboration capabilities help individuals, key stakeholders, workgroups, and teams align their strategic objectives, build stronger relationships, learn from history, and use resources for important decision making effectively.
Planning and budgets
Get the right information to the right people in the form that they need to plan, budget, and forecast. Planning and budgeting capabilities in IBM Cognos Business Intelligence software support a wide range of requirements, from high-performance, on-demand customer and profitability analysis and flexible modeling to enterprise contribution for a broad range of users.
Cognos Business Intelligence components overview
IBM Cognos Business Intelligence is an integrated business intelligence suite that provides a wide range of functionality to help you understand the data of your organization. Everyone in your organization can use Cognos Business Intelligence to view or create business reports, analyze data, and monitor events and metrics so that they can make effective business decisions.
Cognos Business Intelligence integrates the following business intelligence activities in one web-based solution, as shown in Table 2-1.
Table 2-1 IBM Cognos Business Intelligence list of components
Activity                                                                    Component
Publishing, managing, and viewing content                                   IBM Cognos Connection
Interactive workspaces                                                      IBM Cognos Business Insight™
Simple reporting and data exploration                                       IBM Cognos Business Insight Advanced
Ad hoc querying                                                             IBM Cognos Query Studio
Managed reporting                                                           IBM Cognos Report Studio
Event management and alerting                                               IBM Cognos Event Studio
Scorecarding and metrics                                                    IBM Cognos Metric Studio
Analyzing your business                                                     IBM Cognos Analysis Studio
Working with IBM Cognos Business Intelligence content in Microsoft Office   IBM Cognos for Microsoft Office
Before you use IBM Cognos Business Intelligence, you must understand how each of the components that make up the IBM Cognos Business Intelligence user interfaces can help you perform your job.
Cognos transaction flow
You can explore the transaction flow through the Cognos environment, which is primarily driven by HTTP requests. The IBM Cognos environment consists of three modules: the gateway module, the authentication and authorization module, and the report execution module, as shown in Figure 2-9.
Figure 2-9 Cognos transaction and workflow diagram
The three primary components in a Cognos deployment are listed:
Cognos Gateway
The Cognos Gateway is the primary user interface, and it is accessed through the Cognos Connection portal URL. User authentication and namespace validation happen at this layer; internally, the Gateway communicates with the Content Manager through the Dispatcher for authentication and validation. After successful authentication, the request is sent to the Content Manager for further processing. After report execution completes, the output returns to the Gateway, which sends it to the client.
Cognos Content Manager
In an IBM Cognos environment, the main controlling unit is the Content Manager. It performs several important functions. Every Cognos environment consists of only one primary Content Manager. However, a secondary Content Manager can be configured. The secondary Content Manager takes over as primary only when the primary Content Manager is unavailable. Even in a busy environment, all requests and transactions are handled by the single primary Content Manager.
Application tier and dispatcher
The dispatcher is the main working component in the Cognos environment, and it generates reports based on user requests. Each dispatcher hosts several services, including the Presentation Service, which is used in report generation. The dispatcher fetches the required data from the different data sources and renders this data in the report based on the report specification. Each dispatcher spawns multiple business intelligence processes, and each process handles one request at a time.
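The one-request-per-process model can be sketched with a few lines of Python. This is only an analogy for the dispatcher behavior that is described above, not Cognos code; the report requests and the render function are invented for illustration.

from multiprocessing import Pool

def render_report(request):
    # Stand-in for a business intelligence process: fetch data,
    # render the report, and return the output.
    name, fmt = request
    return f"rendered {name} as {fmt}"

if __name__ == "__main__":
    queue = [("sales-q1", "pdf"), ("sales-q2", "html"), ("kpi-summary", "pdf")]
    # Each worker process handles exactly one request at a time, so
    # concurrency comes from the number of processes, not from threads.
    with Pool(processes=2) as pool:
        for output in pool.imap(render_report, queue):
            print(output)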
2.3.5 Spectrum Scale and File Placement Optimizer
IBM Spectrum Scale is software-defined storage for high-performance, large-scale workloads on-premises or in the cloud. Built on the award-winning IBM General Parallel File System (GPFS), this scale-out storage solution provides file, object, and integrated data analytics for the following areas:
Compute clusters (technical computing)
Big data and analytics
Hadoop Distributed File System (HDFS)
Private cloud
Content repositories
File placement optimization
Spectrum Scale File Placement Optimizer (Spectrum Scale-FPO) is a high-performance, cost-effective storage methodology that started as a clustered file system and evolved into more than a file system. Today, Spectrum Scale is a full-featured set of file management tools, including advanced storage virtualization, integrated high availability, and automated tiered storage management, and offers the performance to effectively manage large quantities of file-based data.
Spectrum Scale supports various application workloads, and it is effective in large and demanding environments. Spectrum Scale is installed in clusters, and it supports big data, analytics, gene sequencing, digital media, and scalable file-serving workloads. All indications are that BigInsights for Apache Hadoop will bring even more unstructured, file-based data into these environments.
For high-performance computing environments, IBM Spectrum Scale offers a distributed, scalable, reliable, and single namespace file system. Spectrum Scale-FPO is based on a shared-nothing architecture so that each node on the file system can function independently and be self-sufficient within the cluster. Typically, Spectrum Scale-FPO can be a substitute for HDFS, removing the need for the HDFS NameNode, Secondary NameNode, and DataNode services.
In performance-sensitive environments, placing Spectrum Scale metadata on higher-speed drives can further improve the performance of the Spectrum Scale file system.
Spectrum Scale-FPO has significant and beneficial architectural differences from HDFS. HDFS is a file system that is based on Java that runs on top of the operating system file system, and it is not Portable Operating System Interface (POSIX)-compliant. Spectrum Scale-FPO is a POSIX-compliant, kernel-level file system that provides Hadoop with a single namespace, distributed file system with performance, manageability, and reliability advantages over HDFS.
As a kernel-level file system, Spectrum Scale avoids the overhead that HDFS incurs as a secondary file system that runs within a Java virtual machine (JVM) on top of the operating system's file system. Because Spectrum Scale-FPO is POSIX-compliant, files that are stored in it are visible to authorized users and applications through standard file access and management commands and APIs. An authorized user can list, copy, move, or delete files in Spectrum Scale-FPO by using traditional operating system file management commands without logging in to Hadoop.
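The following short Python sketch illustrates this POSIX access: it lists and copies files under a Spectrum Scale mount point by using only the standard library. The /gpfs/bigdata path is an assumed example mount; no Hadoop client or Hadoop login is involved.

import os
import shutil

MOUNT = "/gpfs/bigdata"  # assumed Spectrum Scale-FPO mount point

# Ordinary POSIX directory listing; no Hadoop tooling is required.
for name in os.listdir(MOUNT):
    path = os.path.join(MOUNT, name)
    size = os.path.getsize(path) if os.path.isfile(path) else "<dir>"
    print(name, size)

# Copy a file into the cluster file system with standard file I/O
# (the incoming directory is assumed to exist).
shutil.copy("/tmp/report.csv", os.path.join(MOUNT, "incoming", "report.csv"))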
Additionally, Spectrum Scale-FPO has significant advantages over HDFS for backup and replication. Spectrum Scale-FPO provides point-in-time snapshot backup and off-site replication capabilities that enhance cluster backup and replication capabilities.
When you use Spectrum Scale-FPO instead of HDFS as the cluster file system, the HDFS NameNode and Secondary NameNode daemons are not required on cluster management nodes, and the HDFS DataNode daemon is not required on cluster data nodes. Equivalent tasks are performed by Spectrum Scale in a distributed way across all nodes in the cluster, including the data nodes. From an infrastructure design perspective, including Spectrum Scale-FPO can reduce the number of required management nodes.
Because Spectrum Scale-FPO distributes metadata across the cluster, no dedicated name service is needed. Management nodes within the BigInsights for Apache Hadoop predefined configuration or HBase predefined configuration that are dedicated to running the HDFS NameNode or Secondary NameNode services can be eliminated from the design. The reduced number of required management nodes can provide sufficient space to allow more data nodes within a rack.
For more information about implementing IBM Spectrum Scale-FPO in an InfoSphere® BigInsights solution, see the Deploying a big data solution using IBM Spectrum Scale-FPO white paper at the following website:
2.3.6 Cluster management
When you manage the Power System nodes in your big data and analytics environment, you can choose from alternatives, such as the Hardware Management Console (HMC) or Platform Cluster Manager - Advanced Edition.
The HMC is the traditional way to manage Power Systems servers. It provides a graphical user interface (GUI) to perform management tasks, such as configuring hardware alerts and creating and configuring LPARs.
You can also use IBM Platform Cluster Manager - Advanced Edition to manage an environment of clusters. It provides the following benefits:
Management of multitenancy environments
You can create multiple, isolated clusters within your server farm.
Support for deploying multiple products
This book describes a scenario that implements Platform Cluster Manager - Advanced Edition to deploy a BigInsights for Apache Hadoop cluster, but the product can also automate the deployment of other solutions, such as solutions that are based on IBM Symphony, IBM Load Sharing Facility (LSF®), GridEngine, PBS Pro, and open source Hadoop.
On-demand and self-service provisioning
You can create cluster definitions and use them to deploy the cluster nodes automatically. A person with little or no cluster setup knowledge can then deploy a cluster environment quickly.
Increased server consolidation
Because you can grow or shrink a cluster environment dynamically, you minimize the idle resources that result from creating siloed clusters.
xCAT is one of the software components that make up a Platform Cluster Manager - Advanced Edition solution. Past versions of Platform Cluster Manager - Advanced Edition integrated xCAT into the whole solution as an add-on, external software component. With Platform Cluster Manager - Advanced Edition Version 4.2, xCAT is integrated within the solution.
During the Platform Cluster Manager - Advanced Edition Version 4.2 installation, xCAT is installed on the node, and part of the xCAT configuration is performed automatically during this step. However, you can use xCAT commands to further configure or reconfigure your cluster provisioning environment. At the time of writing, the following features still must be configured through xCAT:
Establishing a connection to the server’s FSP port for hardware operations management
Creating a hardware profile with the correct serial connection settings for connecting to the LPAR console
Optional: Using the Platform Cluster Manager environment as a Dynamic Host Configuration Protocol (DHCP) server for the FSP hardware management network
If you are running a version of Platform Cluster Manager - Advanced Edition that is older than Version 4.2, you must install and configure an xCAT environment separately.
2.4 Solution reference architecture
This section describes a big data and analytics solution deployment scenario and its reference architecture to illustrate possible implementations on an enterprise business operation.
2.4.1 Solution scenario overall topology
In this publication, big data and analytics tools were used to implement a business scenario, which can be used in many companies, such as retail companies, that sell products to their customers by using a webstore or e-commerce solution.
In this scenario, an executive from a fictional company views a Cognos Business Intelligence dashboard that consolidates sales information and results for the company and combines that information with customer social media sentiment analysis to produce a marketing campaign that boosts the company's sales. The dashboard contains the following information:
Revenue trend by product category
Revenue by product and year
Gross profit by country
Social media sentiment by product
The Cognos Business Intelligence dashboard pulls revenue, product categories, and gross profit data from a DB2 with BLU Acceleration database, which might be the company’s transactional system or even a data warehouse that might be fed from it.
The social media sentiment data is stored on a BigInsights for Apache Hadoop cluster, and the Big SQL technology is used to query this data and display the results on the dashboard. Before this data is stored, it can be captured by using tools such as IBM Streams or the IBM Bluemix® Insights for Twitter service.
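Because Big SQL accepts standard DB2-compatible client connections, the dashboard's sentiment query can be sketched with the ibm_db Python driver, as shown below. The host name, port, credentials, and the sentiment_by_product table are assumptions for illustration, not part of the scenario's actual schema.

import ibm_db

# Connection details are illustrative; Big SQL accepts standard
# DB2-compatible client connections.
conn = ibm_db.connect(
    "DATABASE=bigsql;HOSTNAME=bigsql-head.example.com;PORT=51000;"
    "PROTOCOL=TCPIP;UID=bigsql;PWD=secret;", "", "")

# Hypothetical table that holds per-product sentiment polarity scores.
stmt = ibm_db.exec_immediate(
    conn,
    "SELECT product, AVG(polarity) AS avg_sentiment "
    "FROM sentiment_by_product GROUP BY product")

row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["PRODUCT"], row["AVG_SENTIMENT"])
    row = ibm_db.fetch_assoc(stmt)
ibm_db.close(conn)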
Figure 2-10 shows how Cognos Business Intelligence consolidates the data into a single dashboard.
Figure 2-10 Cognos Business Intelligence consolidates data from different sources
By using this information, the company executive can identify products to target: products whose year-to-year revenue decreased but that have positive social media feedback. The company can then define a campaign and a target audience for those products.
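The selection rule itself is simple enough to express directly. The following Python sketch, with invented sample figures, filters for products with declining year-to-year revenue and positive sentiment:

# Hypothetical per-product figures: (revenue last year, revenue this year,
# average social media sentiment score in the range [-1, 1]).
products = {
    "tents":    (120_000, 95_000, 0.62),
    "lanterns": ( 80_000, 88_000, 0.40),
    "stoves":   ( 60_000, 51_000, -0.15),
}

# Target products whose revenue fell but whose sentiment is positive.
targets = [
    name for name, (prev, curr, sentiment) in products.items()
    if curr < prev and sentiment > 0
]
print("Campaign candidates:", targets)  # -> ['tents']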
Based on the executive's decision and goals, a company researcher first creates input data that contains an individual's demographic data (gender and geographic region), banking payment status, and sentiment polarity. This profile is the input for SPSS Modeler and SPSS Analytical Decision Management, which search for the scenario that predicts the highest expected profit.
After those steps, SPSS Analytical Decision Management for Customer Interactions identifies the best offer for each customer and provides recommendations for offers to deliver to customers through a call center or email. The offer is also made available on the webstore when the customer logs in.
Figure 2-11 illustrates the campaign solution process and data flow.
Figure 2-11 Campaign solution process and data flow
2.4.2 Solution scenario architecture
This solution used the capabilities and benefits of IBM Power Systems and IBM Data Engine for Analytics. To demonstrate the flexibility of IBM Power Systems, the implementation used both AIX and Red Hat Enterprise Linux (RHEL) to run the software components.
The following software components and operating systems were used to implement this solution:
IBM DB2 with BLU Acceleration on an RHEL 7.1 little endian edition node.
IBM Open Platform for Apache Hadoop on RHEL 7.1 little endian edition, through an IBM Data Engine for Analytics (IDEA) cluster implementation, which consists of two management nodes (also used as edge nodes) and three data nodes.
IBM SPSS Collaboration and Deployment Services, IBM SPSS Modeler, and IBM SPSS Analytical Decision Manager on an AIX 7.1 node.
IBM Cognos Business Intelligence on an RHEL 7.1 big endian edition node.
Figure 2-12 shows the logical architecture for this solution scenario implementation.
Figure 2-12 Solution scenario architecture
In this implementation, the HMC was used to manage only the System Management Node, which is used to deploy and manage the other nodes, connecting to the flexible service processors (FSPs) from the other nodes through the service network.
The management network has two major purposes: to provision the IBM Open Data Platform for Apache Hadoop, DB2, Cognos Business Intelligence, and SPSS nodes, as described in 2.2.4, “Networking” on page 12, and to perform administrative tasks.
All of the nodes, except the System Management Node, used the data network to send and receive application data, for example, when Cognos Business Intelligence queries revenue data from the DB2 database or sentiment analysis data from the Open Data Platform for Apache Hadoop (connecting to the Big SQL Head node).
The IBM Open Data Platform consists of two management nodes and three data nodes.
Figure 2-13 shows the hardware components that are used to implement the solution scenario that was described earlier in this section.
Figure 2-13 Hardware components that are used in the solution scenario implementation
For the IBM Data Engine for Analytics Components, the following configuration was used:
Elastic Storage Server Model GL2 for storing data.
Two Hadoop Management Node LPARs, each of which uses half of the resources of a Power S822L server. This server has the following configuration:
 – 24 x 3.02 GHz cores
 – 256 GB memory
 – 4 - 12 x 1.8 TB 10K 2.5-inch serial-attached SCSI (SAS) hard disk drives (HDDs)
 – Split backplane
 – 2 x 4-port 1-Gigabit Ethernet adapter
 – 2 x dual-port 10-Gigabit Ethernet adapter
 – 2 x partitions with equal resources on a server
Three analytic nodes, which are used as Hadoop data nodes, with each node as a full partition on a Power S822L with the following configuration:
 – 24 x 3.02 GHz cores
 – 256 GB Memory
 – 2 x 4-port 1-Gigabit Ethernet adapter
 – 2 x dual-port 10-Gigabit Ethernet adapter
One System Management Node (that uses Platform Cluster Manager - Advanced Edition administration capabilities) as a full partition on Power S812L with the following configuration:
 – 10 x 3.425 GHz cores
 – 32 GB memory
 – 2 x 1.8 TB 10K 2.5-inch SAS HDDs
 – 2 x 4-port 1-Gigabit Ethernet adapter
 – 2 x dual-port 10-Gigabit Ethernet adapter
For the other analytic applications, the following configuration was used:
One Power S822 server for the SPSS node, with the following configuration:
 – 24 x 3.02 GHz cores
 – 256 GB Memory
 – 4 x 1.8 TB 10K 2.5-inch SAS HDDs
 – 2 x 4-port 1-Gigabit Ethernet adapter
 – 2 x dual-port 10-Gigabit Ethernet adapter
Four Power S822L servers (one for each of the remaining nodes) with the following configuration:
 – 24 x 3.02 GHz cores
 – 256 GB Memory
 – 4 x 1.8 TB 10K 2.5-inch SAS HDDs
 – 2 x 4-port 1-Gigabit Ethernet adapter
 – 2 x dual-port 10-Gigabit Ethernet adapter
For LAN networking, two switches were used:
IBM System Networking RackSwitch G8052
1 GbE top-of-rack switch for the management and service networks
IBM System Networking RackSwitch G8264
10 GbE top-of-rack switch for the data networks
This configuration reflects the hardware that was available for this solution scenario deployment. The sizing for the server can vary according to the workload that the implementation will support and any special network requirements.
For more information about the installation of the software and hardware components, see Chapter 4, “Scenario: How to implement the solution components” on page 49. The details about the software component integration are in Chapter 5, “Scenario: Integration of the components for the solution” on page 133.

1 Key performance indicators (KPIs) are used by companies to better evaluate their current level of business success and to help plan for the future.