CHAPTER 10

Big Data Case Studies

This chapter examines how the various patterns discussed in previous chapters can be applied to business problems in different industries. To arrive at the solution of a given business problem, architects apply combinations of patterns across different layers of the entire application architecture as appropriate to the unique business requirements and priorities of the problem at hand. The following case studies exemplify how architects combine patterns to solve particular business problems.

Case Study: Mainframe to Hadoop-Based NoSQL Database

Problem

A financial organization’s current data warehouse solution is based on a legacy mainframe platform. This solution is becoming very expensive as more and more data is generated every day. Moreover, because the supported databases use legacy formats (such as IMS and IDMS), it is not easy to transform and merge this data with the other data sources in the enterprise for joint analytical processing. The CIO is looking for a less expensive and more current platform.

Solution

The CIO concluded that migrating the legacy data to a NoSQL-based platform (such as HP Vertica) would provide the following benefits:

  • A higher level of data compression, providing lower storage costs and improved performance
  • A native data load option, avoiding the need to use a third-party ELT tool
  • Easier integration
  • Better co-analysis of data from multiple data sources in the organization

Figure 10-1 shows the patterns implemented in migrating to a NoSQL platform.

Figure 10-1. NoSQL migration architecture

Examples of technologies used include the following:

  • HP Vertica
  • VSQL (for Native ELT: Extract, Load, Transform)
  • AutoSys (for scheduling)
  • Unix Shell/Perl scripting
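
Before legacy records can be bulk-loaded through a native load path, they usually have to be converted from the mainframe's fixed-width EBCDIC encoding into delimited text. The following Python sketch illustrates that step under stated assumptions: the three-field layout, the record length, and the choice of code page 037 are hypothetical and would in practice come from the copybook that describes the source file. It is not the migration code used in the case study.

```python
import csv
import io

# Hypothetical fixed-width layout: (field name, start offset, length in bytes).
# A real layout would come from the mainframe copybook.
LAYOUT = [("account_id", 0, 8), ("branch", 8, 4), ("balance", 12, 10)]
RECORD_LENGTH = 22


def decode_record(raw: bytes) -> dict:
    """Decode one EBCDIC (code page 037) record into a dict of trimmed strings."""
    text = raw.decode("cp037")
    return {name: text[start:start + length].strip() for name, start, length in LAYOUT}


def records_to_csv(blob: bytes) -> str:
    """Split a byte stream into fixed-length records and emit delimited text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=[name for name, _, _ in LAYOUT], lineterminator="\n"
    )
    writer.writeheader()
    for offset in range(0, len(blob) - RECORD_LENGTH + 1, RECORD_LENGTH):
        writer.writerow(decode_record(blob[offset:offset + RECORD_LENGTH]))
    return out.getvalue()


if __name__ == "__main__":
    # One made-up record, encoded to EBCDIC to stand in for a mainframe extract.
    sample = "00012345NYC1 001250.75".encode("cp037")
    print(records_to_csv(sample))
```

Packed-decimal (COMP-3) and binary fields cannot be decoded as text and would need separate handling, which this sketch leaves out.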

Table 10-1. Patterns implemented in the Mainframe to Hadoop case study

Pattern Type                                  Pattern Name
Big data storage pattern                      NoSQL Pattern
Ingestion and streaming pattern               Just-In-Time Transformation Pattern
Analysis and visualization pattern            Compression Pattern
Big data access pattern                       Stage Transform Pattern

Case Study: Geo-Redundancy and Near-Real-Time Data Ingestion

Problem

A high-tech organization has multiple applications spread geographically across multiple data centers. All application usage logs have to be synchronized with every data center for near-real-time analysis. The current implementation of the RDBMS is capable of providing replication across data centers, but it is very expensive and the cost is increasing as more data accumulates every day. What cost-efficient solution would enable active-active geo-redundant ingestion across data centers to address failover and provide more near-real-time access to data?

Solution

The big data architects choose an open-source (hence low-cost) NoSQL-based platform (such as Cassandra) that can be configured for fast data synchronization and replication across data centers, high availability, and a high level of data compression for lower storage costs and improved performance. This solution provides very high, terabyte-scale ingestion rates across data centers.
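
In Cassandra, replication across data centers is typically declared per keyspace with the NetworkTopologyStrategy replication class. The sketch below builds such a statement and computes the quorum size within one data center. The keyspace name, the data center names, and the replication factors are hypothetical; the case study does not state the values that were actually used.

```python
def keyspace_cql(name: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy."""
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    # Sort the data centers so the generated statement is deterministic.
    options = ", ".join(
        f"'{dc}': {rf}" for dc, rf in sorted(replicas_per_dc.items())
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}};"
    )


def local_quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a LOCAL_QUORUM request in one data center."""
    return replication_factor // 2 + 1


if __name__ == "__main__":
    # Hypothetical two-site deployment with three replicas per site.
    print(keyspace_cql("usage_logs", {"us_east": 3, "eu_west": 3}))
    print(local_quorum(3))
```

One reasonable design choice for an active-active setup is to write at the LOCAL_QUORUM consistency level, so that each data center can accept writes without waiting for the remote site while replication to the other data centers proceeds in the background.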

Figure 10-2 shows the patterns implemented in changing to a geo-redundant NoSQL-based platform.

Figure 10-2. Geo-redundancy architecture

Table 10-2. Patterns implemented in the Geo-Redundancy case study

Pattern Type                                  Pattern Name
Big data storage                              NoSQL Pattern
Ingestion and streaming pattern               Real-Time Streaming Pattern

Case Study: Recommendation Engine

Problem

An organization has an existing recommendation engine, but it is looking for a high-performing recommendation engine and reporting tool that can handle its increasing volumes of data. The existing implementation is based on a subset of the total data and hence is failing to generate optimal recommendations. What high-performing recommendation engine could look at the current volume of data in its totality and scale up to accommodate load increases going forward?

Solution

The organization’s combined solution is to move to a Hadoop-based storage mechanism (providing increased capacity), a NoSQL-based Cassandra database for real-time log processing (providing higher-speed data access), and an R-based solution for machine learning.

Figure 10-3 shows the patterns implemented to enable real-time streaming for machine learning.

Figure 10-3. Real-time streaming for machine-learning architecture

Table 10-3. Pattern implemented for the Recommendation Engine case study

Pattern Type                                  Pattern Name
Ingestion and streaming pattern               Real-Time Streaming Pattern

Examples of technologies used include the following:

  • Cassandra
  • HDFS, Hive, HBase, Pig
  • Map-R
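
The case study does not say which recommendation algorithm was used. As a purely illustrative sketch, the Python code below implements a simple item co-occurrence recommender and shows why working from a subset of the data can miss recommendations. The function names and the sample baskets are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return counts


def recommend(counts, owned, top_n=3):
    """Rank items that co-occur with the user's items, excluding owned ones."""
    scores = Counter()
    for item in owned:
        for other, n in counts.get(item, {}).items():
            if other not in owned:
                scores[other] += n
    # Sort by score descending, then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked][:top_n]


if __name__ == "__main__":
    baskets = [["a", "b", "c"], ["a", "b"], ["b", "d"], ["a", "c"]]
    subset = build_cooccurrence(baskets[:2])   # only part of the data
    full = build_cooccurrence(baskets)         # all of the data
    print(recommend(subset, {"d"}))  # nothing is known about item "d"
    print(recommend(full, {"d"}))    # the full data links "d" to "b"
```

With only the first two baskets, the engine has no evidence about item "d" and returns nothing for it; with all four baskets it can suggest "b". The same effect, at much larger scale, is what motivates analyzing the data in its totality.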

Case Study: Video-Streaming Analytics

Problem

A telecommunication organization needs a solution for analyzing customer behavior and viewing patterns in advance of a rollout of video-over-IP (VOIP) offerings. The logs have to be compared to region-specific, feature-specific existing system data spread across multiple applications. Because the volume of data is already huge and the VOIP log data will add many terabytes, the organization is looking for a robust solution to apply across all devices and systems.

Solution

The CTO chooses a Hadoop-based big data implementation capable of storing and analyzing the huge volume of raw system data and scaling up to accommodate the VOIP metadata: namely, a consolidated log-access, log-parse, and analysis platform that is able to transform data using Pig, to store data in HDFS and NoSQL MongoDB, and to incorporate machine-learning tools for analytics.

Figure 10-4 shows the patterns implemented to enable video-streaming analytics.

Figure 10-4. Video analytics architecture

Table 10-4. Pattern implemented for the Video Analytics case study

Pattern Type                                  Pattern Name
Ingestion and streaming pattern               Real-Time Streaming Pattern

Examples of technologies used include the following:

  • Hadoop
  • Python
  • Memcache
  • Jetty, Apache
  • Web/Mobile Dashboards/Analytics
  • Amazon EMR
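
The log-parse and analysis steps described in the solution can be sketched in a few lines of Python. The log format below (a timestamp followed by key=value pairs) and the field names are assumptions made for illustration; the case study does not describe the real log layout.

```python
from collections import defaultdict


def parse_line(line: str) -> dict:
    """Parse 'timestamp key=value key=value ...' into a dict (hypothetical format)."""
    timestamp, *pairs = line.split()
    record = {"timestamp": timestamp}
    for pair in pairs:
        key, _, value = pair.partition("=")
        record[key] = value
    return record


def watch_time_by(lines, dimension):
    """Sum watched seconds per value of the given dimension, such as region."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_line(line)
        totals[record.get(dimension, "unknown")] += int(record.get("watched_sec", 0))
    return dict(totals)


if __name__ == "__main__":
    logs = [
        "2014-01-05T20:15:00 region=west device=stb watched_sec=1800",
        "2014-01-05T20:16:10 region=east device=mobile watched_sec=300",
        "2014-01-05T20:17:30 region=west device=mobile watched_sec=600",
    ]
    print(watch_time_by(logs, "region"))
    print(watch_time_by(logs, "device"))
```

On a Hadoop cluster the same logic would be split into a parse step applied to each line and a grouped sum, which is the kind of transformation the solution assigns to Pig.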

Case Study: Sentiment Analysis and Log Processing

Problem

An existing ecommerce organization experienced system failures and data inconsistencies during the holiday season. Major issues included penalties tied to performance-based service-level agreements (SLAs). The organization is looking for a new platform that can handle the holiday season load, help it avoid penalties, and ensure customer satisfaction.

Solution

The company decided to set up a big data platform with Hadoop and Hive to enable historical and real-time analysis of web and application server logs: namely, a NoSQL-based solution (such as MongoDB) for analyzing the application logs, and an R-based machine-learning engine and a visualization tool (such as Tableau) for better viewing of requests, faster resolution of defects, reduced downtime, and better customer satisfaction.

Figure 10-5 shows the patterns implemented to enable scalable sentiment analysis and log processing.

Figure 10-5. Sentiment-analysis and log-processing architecture

Table 10-5. Patterns implemented for the Sentiment Analysis case study

Pattern Type                                  Pattern Name
Ingestion and streaming pattern               Real-Time Streaming Pattern
Big data analysis and visualization pattern   Zoning Pattern, Compression Pattern
Big data access pattern                       Stage Transform Pattern
Big data storage                              NoSQL Pattern

Examples of technologies used include the following:

  • HDFS, Hive, HBase
  • NoSQL – MongoDB
  • R
  • Log Data Processing
  • MapReduce
  • Compuware DynaTrace
  • Data Analytics – Tableau
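
Because the penalties in this case study are tied to performance-based SLAs, one natural use of the parsed server logs is to check them against SLA limits. The following Python sketch computes a 95th-percentile latency and an error rate from (status code, latency) pairs. The limits of 500 ms and 1 percent are hypothetical placeholders, not figures from the case study.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sla_report(requests, latency_limit_ms=500, max_error_rate=0.01):
    """Summarize (status_code, latency_ms) pairs against hypothetical SLA limits."""
    latencies = [latency for _, latency in requests]
    errors = sum(1 for status, _ in requests if status >= 500)
    p95 = percentile(latencies, 95)
    error_rate = errors / len(requests)
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "breach": p95 > latency_limit_ms or error_rate > max_error_rate,
    }


if __name__ == "__main__":
    # Eighteen fast requests, one slow request, and one server error.
    busy_hour = [(200, 100)] * 18 + [(200, 900), (503, 1200)]
    print(sla_report(busy_hour))
```

Running such a check over short time windows of the real-time log stream would surface a developing breach while there is still time to react, rather than after the penalty period has closed.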

Case Study: Real-Time Traffic Monitoring

Problem

An organization wants to create a real-time traffic analysis and prediction application that can be used to control traffic congestion and streamline traffic flow. The application must be targeted to provide cost optimization in commuting and help reduce waiting time and pollution levels.

Data has to be captured from existing government-provided datasets that include sources such as traffic-camera, traffic-sensor, GPS, and weather-prediction systems. The government data needs to be coupled with social media to assist in predicting traffic speed and volume on roads.

The analysis scenarios include the following:

  • Analysis of historical data to gain insights and understand patterns of behavior of traffic and road incidents
  • Prediction of traffic speed and volume well ahead of time, based on analysis of real-time and historical traffic data
  • Prediction of alternate cost-effective commute paths by analyzing situational traffic conditions across the entire transportation network

The application needs to provide a catalog of services based on social media, governmental data, and different dataset options.

Solution

The organization decided to set up a big data platform using Hadoop, an abstracted layer of data above HDFS in the form of HP Vertica, and a visualization tool. The organization opted to use the cloud-based Amazon Web Services for storage and analytics.

Multiple patterns are applied at various layers of the architecture, as depicted in Figure 10-6. The patterns shown in that figure were used to enable monitoring of traffic in real time.

Figure 10-6. Traffic-monitoring architecture

Table 10-6. Patterns implemented for Traffic Monitoring

Pattern Type                                  Pattern Name
Ingestion and streaming pattern               Real-Time Streaming Pattern
Big data analysis and visualization pattern   Zoning Pattern, Compression Pattern
Big data access pattern                       Service Locator Pattern
Big data storage pattern                      NoSQL Pattern
NFR patterns                                  Distributed Search Optimization Access Pattern

Examples of technologies used include the following:

  • Hadoop
  • HP Vertica
  • Web/Mobile Dashboards/Analytics
  • Amazon Web Services
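
The scenario of suggesting alternate commute paths from current traffic conditions can be illustrated with a shortest-path search over a road graph whose edge weights are travel times. The sketch below uses Dijkstra's algorithm, which is one standard choice and not necessarily what the organization implemented. The road names and travel times are invented for the example.

```python
import heapq


def fastest_route(graph, source, target):
    """Dijkstra's algorithm over {node: {neighbor: minutes}}.

    Returns (total_minutes, path), or None if the target is unreachable.
    """
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, minutes in graph.get(node, {}).items():
            candidate = cost + minutes
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return None
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


if __name__ == "__main__":
    roads = {
        "home": {"highway": 5, "side_street": 8},
        "highway": {"office": 10},
        "side_street": {"office": 9},
        "office": {},
    }
    print(fastest_route(roads, "home", "office"))
    roads["highway"]["office"] = 25  # congestion reported on the highway
    print(fastest_route(roads, "home", "office"))
```

When the highway segment slows from 10 to 25 minutes, the search switches to the side street. In a live system the edge weights would be refreshed from the real-time and predicted speeds rather than edited by hand.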

Case Study: Data Exploration for Suspicious Behavior on a Stock Exchange

Problem

A financial organization processes millions of order entries per day. Whenever online statistical surveillance models identify suspicious behavior, the organization wants to have enhanced capability to gather data pertinent to the suspicious behavior as quickly and cheaply as possible.

The solution needs to be able to do the following:

  • Integrate social media data with historical orders and trades
  • Gather information from other sources within the organization
  • Present this information in an integrated fashion

Solution

The lead architect applied the patterns shown in Figure 10-7. The solution is based on Hadoop, Storm, Flume, and IBM Netezza. DataStax Cassandra acted as the NoSQL database to enable real-time analysis.

Figure 10-7 shows the patterns implemented to enable data forensics on a stock exchange.

Figure 10-7. Data forensics on a stock exchange

Table 10-7. Patterns implemented for the Data Forensics case study

Pattern Type                                  Pattern Name
Ingestion and streaming pattern               Real-Time Streaming Pattern
Big data analysis and visualization pattern   Zoning Pattern, Compression Pattern
Big data access pattern                       Service Locator Pattern
Big data storage                              NoSQL Pattern

Examples of technologies used include the following:

  • Hadoop
  • IBM Netezza
  • DataStax Cassandra
  • Tableau
  • R
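
The flow described in the problem statement, in which a statistical model flags suspicious activity and related data is then gathered from several sources, can be sketched as follows. The z-score rule stands in for the organization's surveillance models, which the case study does not describe, and the source names and lookup functions are hypothetical.

```python
from statistics import mean, pstdev


def flag_suspicious(order_sizes, threshold=3.0):
    """Return indexes of orders more than `threshold` standard deviations from the mean."""
    mu = mean(order_sizes)
    sigma = pstdev(order_sizes)
    if sigma == 0:
        return []  # all orders are identical, so nothing stands out
    return [
        i for i, size in enumerate(order_sizes)
        if abs(size - mu) / sigma > threshold
    ]


def gather_context(order_id, sources):
    """Collect records about one order from each named source.

    `sources` maps a source name to a lookup callable; both are hypothetical.
    """
    return {name: lookup(order_id) for name, lookup in sources.items()}


if __name__ == "__main__":
    sizes = [100] * 20 + [5000]  # one unusually large order
    sources = {
        "historical_trades": lambda oid: f"trades for order {oid}",
        "social_media": lambda oid: [],
    }
    for order_id in flag_suspicious(sizes):
        print(order_id, gather_context(order_id, sources))
```

Gathering context only for flagged orders, rather than for every order, is what keeps the exploration quick and cheap when millions of orders arrive each day.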

Case Study: Environment Change Detection

Problem

An institute wants to build an application that detects environmental changes to water resources in real time. The application has to source data from multiple data sources (such as sensor and meteorological sources) hosted in various environmental institutes and government departments. The data has to be presented to scientists and energy analysts for real-time monitoring of the water resources and environmental data.

Solution

The CTO chooses an all-IBM big data platform with IBM BigInsights, IBM InfoSphere Streams, and IBM Vivisimo as the technologies applied against the patterns shown next.

Figure 10-8 shows the patterns implemented to enable environment change detection.

Figure 10-8. Environment change prediction

Table 10-8. Patterns implemented in Environment Change Prediction

Pattern Type                                  Pattern Name
Ingestion and streaming pattern               Real-Time Streaming Pattern
Ingestion and streaming pattern               Just-In-Time Transformation Pattern
Analysis and visualization patterns           Compression Pattern
Big data access pattern                       Stage Transform Pattern

Examples of technologies used include the following:

  • IBM Vivisimo
  • IBM BigInsights
  • IBM Cognos
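
Detecting a change in a stream of sensor readings can be illustrated with a rolling baseline: each new reading is compared with the mean of the most recent readings and flagged when it deviates by more than a tolerance. The Python sketch below is a generic illustration and does not use or represent the IBM products listed above. The window size, the 20 percent tolerance, and the sample readings are hypothetical.

```python
from collections import deque


class ChangeDetector:
    """Flag readings that drift from the mean of the previous `window` readings."""

    def __init__(self, window=5, tolerance=0.2):
        self.history = deque(maxlen=window)
        self.tolerance = tolerance  # allowed relative deviation (hypothetical)

    def observe(self, reading: float) -> bool:
        """Return True when the reading deviates from the rolling baseline."""
        changed = False
        # Only judge a reading once a full window of history is available.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            changed = abs(reading - baseline) > self.tolerance * abs(baseline)
        self.history.append(reading)
        return changed


if __name__ == "__main__":
    detector = ChangeDetector(window=5, tolerance=0.2)
    # Made-up water-quality readings with a jump at the end.
    for value in [7.0, 7.1, 6.9, 7.0, 7.0, 7.1, 9.5]:
        print(value, detector.observe(value))
```

Because the detector keeps only a fixed-size window, it uses constant memory per sensor, which suits a streaming setting where readings arrive continuously from many sources.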

Summary

A multitude of practical business, academic, financial, and scientific problems can be solved with big data architectures. The patterns described in this book can be applied to all the layers of your big data architecture. The rapid pace of technological advances in tools and products ensures the continual emergence of new patterns, new variants of existing patterns, and new combinations of patterns in increasingly industrialized out-of-the-box solutions.
