Big Data Case Studies
This chapter examines how the various patterns discussed in previous chapters can be applied to business problems in different industries. To arrive at the solution of a given business problem, architects apply combinations of patterns across different layers of the entire application architecture as appropriate to the unique business requirements and priorities of the problem at hand. The following case studies exemplify how architects combine patterns to solve particular business problems.
Case Study: Mainframe to Hadoop-Based NoSQL Database
Problem
A financial organization’s current data warehouse solution is based on a legacy mainframe platform. This solution is becoming very expensive as more and more data is generated every day. Moreover, because the supported databases use legacy formats (such as IMS and IDMS), it is not easy to transform and merge this data with the other data sources in the enterprise for joint analytical processing. The CIO is looking for a less expensive and more current platform.
Solution
The CIO concluded that migrating the legacy data to a NoSQL-based platform (such as HP Vertica) would provide the following benefits:
Figure 10-1 shows the patterns implemented in migrating to a NoSQL platform.
Figure 10-1. NoSQL migration architecture
Examples of technologies used include the following:
Table 10-1. Patterns implemented in the Mainframe to Hadoop case study
Pattern Type | Pattern Name |
---|---|
Big data storage pattern | NoSQL Pattern |
Ingestion and streaming pattern | Just-In-Time Transformation Pattern |
Analysis and visualization pattern | Compression Pattern |
Big data access pattern | Stage Transform Pattern |
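As an illustration of the Just-In-Time Transformation Pattern in this case study, the following minimal Python sketch decodes one fixed-width mainframe record into a JSON document that a NoSQL store could ingest. The record layout, field names, and sample values are hypothetical. The sketch also assumes that every field is stored as EBCDIC display characters (code page 037) rather than packed decimal; real IMS or IDMS extracts would need a layout derived from their own record definitions.

```python
import json

# Hypothetical fixed-width layout: (field name, start offset, length in characters).
LAYOUT = [
    ("account_id", 0, 8),
    ("branch", 8, 4),
    ("balance_cents", 12, 10),
]


def decode_record(raw: bytes, layout=LAYOUT, codec: str = "cp037") -> dict:
    """Decode one fixed-width EBCDIC record into a plain dict."""
    text = raw.decode(codec)  # cp037 is an EBCDIC codec shipped with Python
    record = {
        name: text[start:start + length].strip()
        for name, start, length in layout
    }
    # Assumption: the balance is stored as zero-padded display digits.
    record["balance_cents"] = int(record["balance_cents"])
    return record


def to_json_document(raw: bytes) -> str:
    """Serialize the decoded record as a JSON document for the target store."""
    return json.dumps(decode_record(raw), sort_keys=True)


if __name__ == "__main__":
    sample = "ACC00042NYC 0000012550".encode("cp037")
    print(to_json_document(sample))
```

Decoding at ingestion time in this way keeps the legacy format out of the downstream analytical layers, which is the point of transforming just in time.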
Case Study: Geo-Redundancy and Near-Real-Time Data Ingestion
Problem
A high-tech organization has multiple applications spread geographically across multiple data centers. All application usage logs have to be synchronized with every data center for near-real-time analysis. The current implementation of the RDBMS is capable of providing replication across data centers, but it is very expensive and the cost is increasing as more data accumulates every day. What cost-efficient solution would enable active-active geo-redundant ingestion across data centers to address failover and provide near-real-time access to data?
Solution
The big data architects choose an open-source (hence low-cost) NoSQL-based platform (such as Cassandra) that can be configured for fast data synchronization and replication across data centers, high availability, and a high level of data compression for lower storage costs and improved performance. This solution provides very high, terabyte-scale ingestion rates across data centers.
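One way to picture the cross-data-center replication described above is the keyspace definition that a Cassandra cluster might use. The Python sketch below only builds the CQL text and does not connect to a cluster. The keyspace name, data center names, and replica counts are assumptions, and the data center names would have to match those reported by the cluster's snitch.

```python
def keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement that replicates to every listed data center."""
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    # NetworkTopologyStrategy takes one replica count per data center.
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {count}" for dc, count in sorted(replicas_per_dc.items())]
    body = ", ".join(parts)
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{body}}};"


if __name__ == "__main__":
    # Hypothetical names: two data centers, three replicas in each.
    print(keyspace_cql("usage_logs", {"dc_east": 3, "dc_west": 3}))
```

With a definition of this kind, a write accepted in either data center is replicated to the other, which is what makes active-active ingestion possible.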
Figure 10-2 shows the patterns implemented in changing to a geo-redundant NoSQL-based platform.
Figure 10-2. Geo-redundancy architecture
Table 10-2. Patterns implemented in the Geo-Redundancy case study
Pattern Type | Pattern Name |
---|---|
Big data storage | NoSQL Pattern |
Ingestion and streaming pattern | Real-Time Streaming Pattern |
Case Study: Recommendation Engine
Problem
An organization has an existing recommendation engine, but it is looking for a higher-performing recommendation engine and reporting tool that can handle its increasing volumes of data. The existing implementation is based on a subset of the total data and hence fails to generate optimal recommendations. What high-performing recommendation engine could look at the current data volume in its entirety and scale up to accommodate load increases going forward?
Solution
The organization’s combined solution is to move to a Hadoop-based storage mechanism (providing increased capacity), a NoSQL-based Cassandra database for real-time log processing (providing higher-speed data access), and an R-based solution for machine learning.
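The case study uses R for machine learning. Purely as an illustration, the following Python sketch shows one simple recommendation technique, item co-occurrence counting, applied to complete user histories rather than to a sample. It is not the organization's actual model, and the item names and histories are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(histories):
    """Count how often each pair of items appears in the same user history."""
    co = defaultdict(Counter)
    for items in histories:
        for a, b in combinations(sorted(set(items)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(co, owned, k=2):
    """Rank items the user does not have by total co-occurrence with owned items."""
    scores = Counter()
    for item in owned:
        for other, count in co.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    # Sort by score (descending), then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


if __name__ == "__main__":
    histories = [["a", "b", "c"], ["a", "b"], ["b", "d"]]
    co = build_cooccurrence(histories)
    print(recommend(co, {"a"}))
```

Because the counts are built from every history, adding data changes the scores directly, which is why an engine restricted to a subset of the data can miss the best recommendations.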
Figure 10-3 shows the patterns implemented to enable real-time streaming for machine learning.
Figure 10-3. Real-time streaming for machine-learning architecture
Table 10-3. Pattern implemented for the Recommendation Engine case study
Pattern Type | Pattern Name |
---|---|
Ingestion and streaming pattern | Real-Time Streaming Pattern |
Examples of technologies used include the following:
Case Study: Video-Streaming Analytics
Problem
A telecommunication organization needs a solution for analyzing customer behavior and viewing patterns in advance of a rollout of video-over-IP (VOIP) offerings. The logs have to be compared to region-specific, feature-specific existing system data spread across multiple applications. Because the volume of data is already huge and the VOIP log data will add many terabytes, the organization is looking for a robust solution to apply across all devices and systems.
Solution
The CTO chooses a Hadoop-based big data implementation capable of storing and analyzing the huge volume of raw system data and scaling up to accommodate the VOIP metadata: namely, a consolidated log-access, log-parse, and analysis platform that is able to transform data using Pig, to store data in HDFS and NoSQL MongoDB, and to incorporate machine-learning tools for analytics.
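The log-parse and analysis steps can be sketched in a few lines. The case study performs the transformation with Pig; the Python below only illustrates the same parse-then-aggregate idea. The log format (a timestamp followed by key=value tokens) and the field names are hypothetical.

```python
from collections import defaultdict


def parse_line(line):
    """Parse 'timestamp key=value ...' into a dict; return None if malformed."""
    parts = line.split()
    if len(parts) < 2:
        return None
    event = {"timestamp": parts[0]}
    for token in parts[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            return None  # token without '=' means the line is not in the expected format
        event[key] = value
    return event


def viewing_seconds_by_region(lines):
    """Total viewing seconds per (region, channel), skipping unusable lines."""
    totals = defaultdict(int)
    for line in lines:
        event = parse_line(line)
        if event is None or "region" not in event or "seconds" not in event:
            continue
        try:
            seconds = int(event["seconds"])
        except ValueError:
            continue
        totals[(event["region"], event.get("channel", "unknown"))] += seconds
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "2014-01-05T20:15:00 region=EU device=stb channel=news seconds=340",
        "2014-01-05T20:17:00 region=US device=stb channel=sports seconds=900",
    ]
    print(viewing_seconds_by_region(sample))
```

The per-region totals produced this way are the kind of aggregate that would then be compared with the region-specific system data mentioned in the problem statement.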
Figure 10-4 shows the patterns implemented to enable video-streaming analytics.
Figure 10-4. Video analytics architecture
Table 10-4. Pattern implemented for the Video Analytics case study
Pattern Type | Pattern Name |
---|---|
Ingestion and streaming pattern | Real-Time Streaming Pattern |
Examples of technologies used include the following:
Case Study: Sentiment Analysis and Log Processing
Problem
An existing ecommerce organization experienced system failures and data inconsistencies during the holiday season. Major issues included penalties tied to performance-based service-level agreements (SLAs). The organization is looking for a new platform that can take the holiday season load, help it avoid penalties, and ensure customer satisfaction.
Solution
The company decided to set up a big data platform with Hadoop and Hive to enable historic and real-time analysis of web and application server logs: namely, a NoSQL-based solution (such as MongoDB) for analyzing the application logs and an R-based machine-learning engine and visualization tool (such as Tableau) for better viewing of requests, faster resolution of defects, reduced downtime, and better customer satisfaction.
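Because the penalties in this case study are tied to SLAs, one analysis that such a log platform might run is a per-hour latency percentile check. The following sketch assumes hypothetical (hour, latency in milliseconds) pairs already extracted from the server logs and uses the nearest-rank definition of a percentile. The 95th percentile and the 500 ms limit are illustrative choices, not values from the case study.

```python
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def hours_breaching_sla(requests, limit_ms, pct=95):
    """Return the hours whose pct-th percentile latency exceeds limit_ms."""
    by_hour = defaultdict(list)
    for hour, latency_ms in requests:
        by_hour[hour].append(latency_ms)
    return sorted(
        hour
        for hour, latencies in by_hour.items()
        if nearest_rank_percentile(latencies, pct) > limit_ms
    )


if __name__ == "__main__":
    requests = [(9, 120), (9, 150), (9, 140), (10, 130), (10, 2500), (10, 160)]
    print(hours_breaching_sla(requests, limit_ms=500))
```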
Figure 10-5 shows the patterns implemented to enable scalable sentiment analysis and log processing.
Figure 10-5. Sentiment-analysis and log-processing architecture
Table 10-5. Patterns implemented for the Sentiment Analysis case study
Pattern Type | Pattern Name |
---|---|
Ingestion and streaming pattern | Real-Time Streaming Pattern |
Big data analysis and visualization pattern | Zoning Pattern, Compression Pattern |
Big data access pattern | Stage Transform Pattern |
Big data storage | NoSQL Pattern |
Examples of technologies used include the following:
Case Study: Real-Time Traffic Monitoring
Problem
An organization wants to create a real-time traffic analysis and prediction application that can be used to control traffic congestion and streamline traffic flow. The application must be targeted to provide cost optimization in commuting and help reduce waiting time and pollution levels.
Data has to be captured from existing government-provided datasets that include sources such as traffic-camera, traffic-sensor, GPS, and weather-prediction systems. The government data needs to be coupled with social media to assist in predicting traffic speed and volume on roads.
The analysis scenarios include the following:
Analysis of historical data to gain insights and understand patterns of behavior of traffic and road incidents
Prediction of traffic speed and volume well ahead of time, based on analysis of real-time and historical traffic data
Prediction of alternate cost-effective commute paths by analyzing situational traffic conditions across the entire transportation network
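The third scenario, finding an alternate commute path under current conditions, is at its core a shortest-path problem over a road network whose edge weights are current travel times. The sketch below applies Dijkstra's algorithm to a small hypothetical network. The node names and travel times are invented for illustration, and a production system would derive the weights from the real-time and historical traffic feeds.

```python
import heapq


def fastest_route(graph, source, target):
    """Dijkstra over {node: {neighbor: minutes}}.

    Returns (total_minutes, path), or (inf, []) if the target is unreachable.
    """
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, minutes in graph.get(node, {}).items():
            new_cost = cost + minutes
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


if __name__ == "__main__":
    # Hypothetical road network; weights are current travel times in minutes.
    roads = {"A": {"B": 10, "C": 4}, "C": {"B": 3}, "B": {}}
    print(fastest_route(roads, "A", "B"))
    roads["C"]["B"] = 9  # congestion reported on the C-to-B segment
    print(fastest_route(roads, "A", "B"))
```

Re-running the search whenever the weights change is what turns a static route planner into one that reacts to situational traffic conditions.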
The application needs to provide a catalog of services based on social media, governmental data, and different dataset options.
Solution
The organization decided to set up a big data platform using Hadoop, an abstracted layer of data above HDFS in the form of HP Vertica, and a visualization tool. The organization opted to use the cloud-based Amazon Web Services for storage and analytics.
Multiple patterns are applied at various layers of the architecture, as depicted in Figure 10-6. The patterns shown in that figure were used to enable monitoring of traffic in real time.
Figure 10-6. Traffic-monitoring architecture
Table 10-6. Patterns implemented for Traffic Monitoring
Pattern Type | Pattern Name |
---|---|
Ingestion and streaming pattern | Real-Time Streaming Pattern |
Big data analysis and visualization pattern | Zoning Pattern, Compression Pattern |
Big data access pattern | Service Locator Pattern |
Big data storage pattern | NoSQL Pattern |
NFR patterns | Distributed Search Optimization Access Pattern |
Examples of technologies used include the following:
Case Study: Data Exploration for Suspicious Behavior on a Stock Exchange
Problem
A financial organization processes millions of order entries per day. Whenever online statistical surveillance models identify suspicious behavior, the organization wants to have enhanced capability to gather data pertinent to the suspicious behavior as quickly and cheaply as possible.
The solution needs to be able to do the following:
Solution
The lead architect applied the patterns mentioned in Figure 10-7. The solution is based on Hadoop, Storm, Flume, and IBM Netezza. DataStax Cassandra acted as the NoSQL database to enable real-time analysis.
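The problem statement refers to online statistical surveillance models without specifying them. As one simple stand-in, the sketch below flags an account when its order count for the current day lies more than a chosen number of sample standard deviations from that account's own history. The account names, counts, and threshold are hypothetical, and this is not the organization's actual model.

```python
from statistics import mean, stdev


def is_suspicious(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    spread = stdev(history)
    if spread == 0:
        return current != history[0]
    return abs(current - mean(history)) / spread > threshold


def accounts_to_investigate(history_by_account, today_by_account, threshold=3.0):
    """Return the accounts whose count today deviates from their own history."""
    return sorted(
        account
        for account, current in today_by_account.items()
        if is_suspicious(history_by_account.get(account, []), current, threshold)
    )


if __name__ == "__main__":
    history = {
        "acct_a": [100, 110, 95, 105, 98, 102],
        "acct_b": [20, 22, 19, 21, 20, 23],
    }
    today = {"acct_a": 104, "acct_b": 95}
    print(accounts_to_investigate(history, today))
```

The list of flagged accounts is the trigger for the data-gathering step: only the orders of those accounts need to be pulled for closer inspection.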
Figure 10-7 shows the patterns implemented to enable data forensics on a stock exchange.
Figure 10-7. Data forensics on a stock exchange
Table 10-7. Patterns implemented for the Data Forensics case study
Pattern Type | Pattern Name |
---|---|
Ingestion and streaming pattern | Real-Time Streaming Pattern |
Big data analysis and visualization pattern | Zoning Pattern, Compression Pattern |
Big data access pattern | Service Locator Pattern |
Big data storage | NoSQL Pattern |
Examples of technologies used include the following:
Case Study: Environment Change Detection
Problem
An institute wants to build an application that detects environmental changes to water resources in real time. The application has to source data from multiple data sources (such as sensor and meteorological sources) hosted in various environmental institutes and government departments. The data has to be presented to scientists and energy analysts for real-time monitoring of the water resources and environmental data.
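Real-time change detection on a sensor stream can be illustrated with a sliding-window baseline: each new reading is compared with the mean of the readings that preceded it. The Python sketch below is only an illustration of that idea. The window size, tolerance, and sample readings are hypothetical, and in practice the logic would run inside a stream-processing platform rather than a standalone script.

```python
from collections import deque


class ChangeDetector:
    """Flag a reading that departs from the mean of the previous
    `window` readings by more than `tolerance`."""

    def __init__(self, window=5, tolerance=1.0):
        self.recent = deque(maxlen=window)
        self.tolerance = tolerance

    def observe(self, value):
        """Return True if `value` signals a change, then add it to the window."""
        changed = False
        if len(self.recent) == self.recent.maxlen:
            baseline = sum(self.recent) / len(self.recent)
            changed = abs(value - baseline) > self.tolerance
        self.recent.append(value)
        return changed


if __name__ == "__main__":
    # Hypothetical water-quality readings arriving one at a time.
    detector = ChangeDetector(window=5, tolerance=1.0)
    for reading in [7.0, 7.1, 6.9, 7.0, 7.2, 7.1, 9.5, 9.6]:
        if detector.observe(reading):
            print("change detected at reading", reading)
```

No reading is flagged until the window is full, so the detector needs a short warm-up period before it can report changes.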
Solution
The CTO chooses an all-IBM big data platform with IBM BigInsights, IBM InfoSphere Streams, and IBM Vivisimo as the technologies applied against the patterns shown next.
Figure 10-8 shows the patterns implemented to enable environment change detection.
Figure 10-8. Environment change prediction
Table 10-8. Patterns implemented in Environment Change Prediction
Pattern Type | Pattern Name |
---|---|
Ingestion and streaming pattern | Real-Time Streaming Pattern |
Ingestion and streaming pattern | Just-In-Time Transformation Pattern |
Analysis and visualization patterns | Compression Pattern |
Big data access pattern | Stage Transform Pattern |
Examples of technologies used include the following:
Summary
A multitude of practical business, academic, financial, and scientific problems can be solved using big data architectures. The patterns described in this book can be applied to all the layers of your big data architecture. The rapid pace of technological advances in tools and products ensures the continual emergence of new patterns, new variants of existing patterns, and new combinations of patterns in increasingly industrialized out-of-the-box solutions.