Big data has disrupted entire industries. Innovative use cases in financial services, telecommunications, transportation, health care, retail, insurance, utilities, energy, and technology (to mention a few) have revolutionized the way organizations manage, process, and analyze data. In this chapter, I present real big data case studies from six innovative companies: Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard. Information and details about the case studies are referenced from Cloudera's website: www.Cloudera.com.
Navistar
Navistar is one of the leading manufacturers of commercial buses, trucks, defense vehicles, and engines.
Use Cases
As we collected more data, the analytic process slowed to a near halt on our legacy systems.
—Ashish Bayas, CTO at Navistar
Solution
Using IoT devices, machine learning and predictive analytics powered by Cloudera, Navistar has completely overhauled the way we sell, maintain and service our customers’ vehicle fleets.
—Ashish Bayas, CTO at Navistar
Technology and Applications
Data Platform: Cloudera Enterprise
Workloads: Analytic Database, Data Science & Engineering
Components: Apache Spark, Apache Impala (incubating), Apache Kafka
BI & Analytics Tools: Information Builders WebFOCUS In-Document Analytics, Microsoft Power BI, Microsoft SQL Server Analytic Services Models, Microsoft SQL Server Reporting Services, SAS Enterprise Guide, Tableau Desktop, Tableau Server
Data Science Tools: Python, R, Scala
ETL Tool: IBM InfoSphere DataStage
With Cloudera, we can analyze data in ways and speeds that were not previously possible. We can evaluate billions of rows of data from connected vehicles in hours, not weeks, to enable predictive maintenance.
—Terry Kline, CIO, Navistar
Outcome
The results are overwhelmingly positive. Using real-time big data to frame business decisions and deploy proactive maintenance has opened new revenue streams and delivered additional customer value.
—Troy Clarke, CEO, Navistar
Cerner
Cerner is a leader in the health care IT space, providing solutions to thousands of facilities, such as hospitals, ambulatory offices, and physicians’ offices.
Use Cases
Our vision is to bring all of this information into a common platform and then make sense of it – and it turns out, this is actually a very challenging problem. ii
—David Edwards, Vice President and Fellow, Cerner
Solution
Cerner accomplished its goal by implementing a comprehensive view of population health powered by Cloudera Enterprise. The big data platform currently stores two petabytes of data, ingesting data from multiple sources such as electronic medical records (EMR), HL7 feeds, Health Information Exchange information, claims data, and custom extracts from several different proprietary and client-owned data sources. Cerner uses Apache Kafka and Apache Storm to ingest real-time data into HBase or HDFS. Cerner is also exploring augmenting its platform with other real-time components such as Apache Flume, Apache Samza, and Apache Spark. iii
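The ingestion path described above, in which a stream of records is consumed and written to either a keyed store or an append-only file store, can be illustrated with a toy in-memory sketch. Nothing here calls Kafka, Storm, HBase, or HDFS APIs: the queue, the `route` function, the record fields, and the routing rule are all hypothetical stand-ins chosen for illustration, and the source does not describe how Cerner decides which sink receives a record.

```python
from collections import deque

# Hypothetical stand-ins: a deque plays the Kafka topic, a dict plays an
# HBase-style keyed store, and a list plays an HDFS-style append-only archive.
topic = deque([
    {"patient_id": "p1", "kind": "vitals", "value": 98.6},
    {"patient_id": "p2", "kind": "claim", "value": 1200.0},
    {"patient_id": "p1", "kind": "vitals", "value": 101.2},
])
keyed_store = {}   # latest vitals per patient, for low-latency lookups
archive = []       # every record, kept for later batch analysis


def route(record):
    """Assumed rule: vitals update the keyed store; all records are archived."""
    if record["kind"] == "vitals":
        keyed_store[record["patient_id"]] = record
    archive.append(record)


# Drain the stream one record at a time, as a streaming topology would.
while topic:
    route(topic.popleft())

print(len(archive), keyed_store["p1"]["value"])
```

The split between a keyed store for the latest state and an archive for full history is one common reason to pair a random-access store with a file store; whether Cerner divides its data this way is not stated in the case study.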
We're able to achieve much better outcomes, both patient-related and financial, than we ever could by just looking at pieces of the puzzle individually. It all comes down to bringing everything together and being able to extract value for any requirement. The enterprise data hub topology allows us to do exactly that.
—Ryan Brush, Senior Director and Distinguished Engineer, Cerner
Technology and Applications
Hadoop Platform: Cloudera Enterprise, Data Hub Edition
Components in Use: Apache Crunch, Apache HBase, Apache Hive, Apache Kafka, Apache Oozie, Apache Storm, Cloudera Manager, MapReduce
Servers: HP
Data Mart: HP Vertica
BI and Analytics Tools: SAP Business Objects, SAS
Outcome
Our clients are reporting that the new system has actually saved hundreds of lives by being able to predict if a patient is septic more effectively than they could before.
—Ryan Brush, Senior Director and Distinguished Engineer, Cerner
British Telecom
BT is one of the leading telecommunications companies in the United Kingdom with over 18 million customers and operations in 180 countries.
Use Cases
We had a proposal to re-platform the system to a new relational database. But as we sat down, our discussion turned to Hadoop. We realized we basically had a data velocity problem. We had to process the data faster and increase the volume that we could ingest—both of which Hadoop excels at.
—Phillip Radley, Chief Data Architect, BT
Solution
BT implemented a Cloudera Enterprise cluster and replaced their batch ETL jobs with MapReduce code. The platform not only solved BT’s ETL problem, but also addressed other data management challenges to help BT accelerate the delivery of new product offerings.
Because data is consolidated in a single, cost-effective infrastructure, BT gained a unified 360-degree view of its data across its multiple business units. The platform will also enable BT to extend data retention from 1 year to more than 10 years and to implement mission-critical data management and analytic use cases.
The company plans to use Apache Spark to combine batch, streaming, and interactive analytics, while Impala enables the business intelligence (BI) teams to run SQL queries on the data.
Technology and Applications
Hadoop Platform: Cloudera Enterprise, Data Hub Edition
Hadoop Components: Apache Hive, Apache Pig, Apache Sentry, Apache Spark, Cloudera Manager, Cloudera Navigator, Impala
Outcome
Processes 5x more customer data
Increased data velocity by 15x
Delivered ROI of 200–250% in one year
The move also delivered substantial cost savings for BT
We were able to increase data velocity by a factor of 15. We’re processing five times the data in a third of the time. The business sponsors don’t know that we moved to Hadoop and they don’t care. All they know is that they’re now working with today’s data instead of yesterday’s.
—Phillip Radley, Chief Data Architect, BT
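The two figures in the quote are consistent with each other, because throughput is data volume divided by elapsed time. The short check below works through that arithmetic. The variable names are illustrative only, and "data velocity" is assumed here to mean throughput.

```python
from fractions import Fraction

# Figures taken from the quote: five times the data in a third of the time.
data_factor = Fraction(5)      # relative data volume processed
time_factor = Fraction(1, 3)   # relative elapsed time

# Throughput scales as volume / time (assumed meaning of "data velocity").
velocity_factor = data_factor / time_factor

print(velocity_factor)
```

Exact fractions are used instead of floats so that the division by one third does not introduce rounding error.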
Shopzilla (Connexity)
Shopzilla is a leading e-commerce company headquartered in Los Angeles, California, with 100 million unique visitors connected to 100 million products from tens of thousands of retailers. v
Use Cases
Shopzilla had an existing 500-terabyte Oracle enterprise data warehouse that was growing by 5 terabytes a day. Given the amount of data and processing required to crunch through 100 million products per day, the legacy data warehouse had exceeded its capacity and could not scale further, taking hours to process each day's data.
Solution
Data scientists don’t typically need to consume data warehouse resources now because all of the most recent data is available in Cloudera via R or Mahout. We needed enormous processing capabilities, scalability, full redundancy, and extensive storage – all at a cost-effective price. Our Cloudera platform provides all that and more.
—Rony Sawdayi, Vice President, Engineering, Connexity
We are able to answer complex questions, such as how a user is behaving on a particular site and what ads would be most effective, as well as execute other sophisticated data mining queries. It improves Connexity’s ability to provide relevant results to users, and this is a core tenet of our business.
—Paramjit Singh, Director of Data, Connexity
Technology and Applications
Data Platform: Cloudera Enterprise
Hadoop Components: Apache HBase, Apache Hive, Apache Mahout, Apache Pig, Apache Spark, Apache Sqoop, Cloudera Impala, Cloudera Manager
Servers: Dell
EDW: Oracle
BI & Analytic Tools: Oracle BI Enterprise Edition (OBIEE); R
Outcome
Our legacy system delivers great performance for analytics and reporting, but didn't have the bandwidth for the intensive data transformations we needed – it would take hours to process 100 million products per day. We needed enormous processing capabilities, scalability, full redundancy, and extensive storage – at a cost-effective price. Our Cloudera platform provides all that and more, while complementing our current data warehouse system. We were able to reduce latency from days to hours and soon minutes.
—Paramjit Singh, Director of Data, Connexity
Thomson Reuters
Thomson Reuters is a leading mass media and information corporation that provides professionals with trusted information.
Use Cases
Thomson Reuters aims to classify tweets and distinguish fake news and opinions from real news in 40 milliseconds. viii
Solution
To assist in evaluating the veracity of an event, we rely on hundreds of features and have trained the platform to look at the history and diversity of sources, the language used in tweets, propagation patterns, and much more, just as an investigative journalist would do.
—Sameena Shah, Director of Research and Lead Scientist on Reuters Tracer
Cloudera provides us with state-of-the-art technology to help us analyze data, synthesize text, and extract value and meaning from data to deliver the insights that our customers are looking for. The whole application is very fast. It takes less than 40 milliseconds to capture and detect events.
—Khalid Al-Kofahi, Head, Corporate Research & Development, Thomson Reuters
Technology and Applications
Data Platform: Cloudera Enterprise
Workloads: Data Science & Engineering
Hadoop Components: Apache Spark
Outcome
Revealed newsworthy events ahead of major news outlets
Distinguishes newsworthy tweets from rumors and fake news across 13 million tweets in 40 milliseconds
We are in the business of building information-based solutions for our professional customers in the financial, legal, tax, and accounting industries, and for Reuters, one of the leading news organizations. With Reuters Tracer, we can alert our customers when market-moving events happen as they are reported, without delays. We have dozens and dozens of examples where Reuters Tracer discovered ground-breaking events ahead of major news organizations. Additionally, because we help journalists discover events, they can focus on higher value-add work as opposed to just reporting on events.
—Khalid Al-Kofahi, Head, Corporate Research & Development, Thomson Reuters
Mastercard
Mastercard is a leader in global payments that connects billions of consumers and millions of organizations around the world.
Use Cases
Mastercard built an anti-fraud system called MATCH (Mastercard Alert to Control High-risk Merchants) that allows users to search Mastercard’s proprietary database containing hundreds of millions of fraudulent businesses. Over time, it became evident that MATCH’s phonetic-based lookup feature was not versatile enough to satisfy the growing needs of MATCH users. Additionally, the relational database management system (RDBMS) that powered MATCH could not keep up with the growing volume of data. x
Solution
Mastercard implemented a new anti-fraud solution based on Cloudera Search (powered by Apache Solr), an integrated part of CDH that provides full-text search and faceted navigation. Cloudera Search provided increased scalability, richer search functionality, and better search accuracy. The new solution can use several search algorithms and new scoring capabilities that were previously hard to implement on their legacy RDBMS. The new platform will also allow Mastercard to add more data sets as opportunities arise.
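To make the contrast between a purely phonetic lookup and score-based matching concrete, here is a minimal sketch. It rests on two assumptions: the source does not name the phonetic algorithm MATCH used, so standard American Soundex stands in for it, and Python's `difflib.SequenceMatcher` stands in for a similarity score. Neither represents how Cloudera Search or Apache Solr ranks results, and the names being compared are made up.

```python
from difflib import SequenceMatcher

# Standard American Soundex letter groups.
_CODES = {ch: digit
          for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6"))
          for ch in letters}


def soundex(name):
    """Return the four-character Soundex key for a single word."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue            # h and w do not separate repeated codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code             # a vowel resets prev, allowing a repeat
    return (key + "000")[:4]


def similarity(a, b):
    """Score two strings between 0 and 1 by matching character runs."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Similar-sounding names share a phonetic key...
print(soundex("Robert"), soundex("Rupert"))
# ...but a transposition typo changes the key, while a similarity score
# still rates the two spellings as close.
print(soundex("Acme"), soundex("Amce"), similarity("Acme", "Amce"))
```

A phonetic key gives only a yes-or-no match, whereas a score can be thresholded or combined with other signals, which is one plausible reading of the "new scoring capabilities" mentioned above.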
Technology and Applications
Apache Hadoop Platform: Cloudera Enterprise, Data Hub Edition
Apache Hadoop Components: Apache Solr, Cloudera Search, Hue
Outcome
The new Cloudera-based solution helps Mastercard identify fraudulent merchants more easily, reducing risk. Mastercard users experienced dramatically improved search accuracy, and the platform supports 5X as many searches annually, with a 25X increase in searches per customer per day. This has allowed Mastercard to expand into new markets, resulting in increased revenue.
Summary
My goal is to inspire you to start your own big data use cases using effective and proven methodologies. I hope you found this chapter useful.
References
- i. Cloudera; “Navistar: Reducing Maintenance Costs more than 30 percent for Connected Vehicles,” Cloudera, 2018, https://www.cloudera.com/more/customers/navistar.html
- ii. Cloudera; “Cerner: Saving Lives with Big Data Analytics that Predict Patient Conditions,” Cloudera, 2018, https://www.cloudera.com/more/customers/cerner.html
- iii. Cloudera; “Cloudera Cerner Case Study: Saving Lives with Big Data Analytics that Predict Patient Conditions,” Cloudera, 2018, https://www.cloudera.com/content/dam/www/marketing/resources/case-studies/cloudera-cerner-casestudy.pdf.landing.html
- iv.
- v.
- vi.
- vii.
- viii.
- ix.
- x.