© Butch Quinto 2018
Butch Quinto, Next-Generation Big Data, https://doi.org/10.1007/978-1-4842-3147-0_13

13. Big Data Case Studies

Butch Quinto
Plumpton, Victoria, Australia

Big data has disrupted entire industries. Innovative use cases in financial services, telecommunications, transportation, health care, retail, insurance, utilities, energy, and technology (to mention a few) have revolutionized the way organizations manage, process, and analyze data. In this chapter, I present real big data case studies from six innovative companies: Navistar, Cerner, British Telecom, Shopzilla, Thomson Reuters, and Mastercard. Information and details about the case studies are referenced from Cloudera's website: www.Cloudera.com.

Navistar

Navistar is one of the leading manufacturers of commercial buses, trucks, defense vehicles, and engines.

Use Cases

Navistar’s use cases include predictive maintenance, remote diagnostics, and route optimization. Unscheduled vehicle repairs and breakdowns are costly and inefficient. When a service interruption occurs, the impact can be significant: vehicle owners usually lose US$1,000 in revenue per vehicle per day. Scheduling maintenance by mileage alone is primitive and did not address Navistar’s mounting vehicle maintenance problems. A more modern approach involving real-time data monitoring and predictive analytics was needed. Furthermore, Navistar’s traditional data warehouse could not support the increasing amount of real-time, high-volume telematics and sensor data that the company was ingesting. [i]

As we collected more data, the analytic process slowed to a near halt on our legacy systems.

—Ashish Bayas, CTO at Navistar

Solution

Navistar built an IoT-enabled remote diagnostic platform on Cloudera Enterprise that ingests over 70 telematics and sensor data feeds from more than 300,000 connected vehicles. The data is further enriched with third-party data such as meteorological, geolocation, traffic, vehicle usage, historical warranty, and parts inventory information. The platform uses machine learning to proactively detect vehicle issues and predict vehicle maintenance requirements. Navistar also uses the platform to help prevent accidents and promote road safety. After building a prototype in September 2014, it took Navistar just six months to put the platform into production.
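The contrast between mileage-based scheduling and sensor-driven monitoring can be sketched in a few lines. This is only an illustration, not Navistar's method: the source does not describe its models, so the rolling-baseline check below is a simple statistical stand-in for the machine learning it mentions. The function names, the 20,000 km interval, the window size, the threshold, and the coolant temperature readings are all hypothetical.

```python
from statistics import mean, stdev


def mileage_due(odometer_km, last_service_km, interval_km=20000):
    """Mileage-based rule: service is due only after a fixed distance (assumed interval)."""
    return odometer_km - last_service_km >= interval_km


def first_anomaly(readings, window=5, threshold=3.0):
    """Return the index of the first reading that deviates from the mean of the
    preceding `window` readings by more than `threshold` standard deviations,
    or None if no reading does."""
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            return i
    return None


# Hypothetical coolant temperatures (degrees C) from one vehicle's sensor feed.
coolant_temps = [90, 91, 89, 90, 91, 90, 104]

# The mileage rule sees nothing wrong: only 12,000 km since the last service.
due_by_mileage = mileage_due(odometer_km=112000, last_service_km=100000)

# The sensor check flags the sudden jump in the last reading.
anomaly_index = first_anomaly(coolant_temps)
```

In this made-up feed the mileage rule reports that no service is due, while the sensor check flags the final reading. That is the kind of gap between scheduled and condition-based maintenance that the use case describes.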

Using IoT devices, machine learning and predictive analytics powered by Cloudera, Navistar has completely overhauled the way we sell, maintain and service our customers’ vehicle fleets.

—Ashish Bayas, CTO at Navistar

Technology and Applications

  • Data Platform: Cloudera Enterprise

  • Workloads: Analytic Database, Data Science & Engineering

  • Components: Apache Spark, Apache Impala (incubating), Apache Kafka

  • BI & Analytics Tools: Information Builders WebFOCUS In-Document Analytics, Microsoft Power BI, Microsoft SQL Server Analysis Services Models, Microsoft SQL Server Reporting Services, SAS Enterprise Guide, Tableau Desktop, Tableau Server

  • Data Science Tools: Python, R, Scala

  • ETL Tool: IBM InfoSphere DataStage

With Cloudera, we can analyze data in ways and speeds that were not previously possible. We can evaluate billions of rows of data from connected vehicles in hours, not weeks, to enable predictive maintenance.

—Terry Kline, CIO, Navistar

Outcome

Navistar is now able to provide proactive vehicle diagnostics and real-time monitoring services for its customers. Cloudera Enterprise enabled Navistar to build a highly scalable real-time IoT platform that derives valuable insights from multiple data sources. The platform helped Navistar customers reduce maintenance costs by up to 40%, and early detection of vehicle issues reduced vehicle downtime by up to 40%.

The results are overwhelmingly positive. Using real-time big data to frame business decisions and deploy proactive maintenance has opened new revenue streams and delivered additional customer value.

—Troy Clarke, CEO, Navistar

Cerner

Cerner is a leader in the health care IT space, providing solutions to thousands of facilities, such as hospitals, ambulatory offices, and physicians’ offices.

Use Cases

Cerner’s goal is to consolidate the world’s health care data into a common platform in order to reduce cost, increase the efficiency of health care delivery, and improve patient outcomes. The project poses several challenges: the data must be secure, auditable, and easy to explore.

Our vision is to bring all of this information into a common platform and then make sense of it – and it turns out, this is actually a very challenging problem. [ii]

—David Edwards, Vice President and Fellow, Cerner

Solution

Cerner accomplished its goal by implementing a comprehensive view of population health powered by Cloudera Enterprise. The big data platform currently stores two petabytes of data, ingested from multiple sources such as electronic medical records (EMR), HL7 feeds, Health Information Exchange information, claims data, and custom extracts from several different proprietary and client-owned data sources. Cerner uses Apache Kafka to ingest real-time data into HBase or HDFS using Apache Storm. Cerner is exploring augmenting its platform with other real-time components such as Apache Flume, Apache Samza, and Apache Spark. [iii]

Data is transferred from the Cloudera platform to Cerner’s data mart running HP Vertica, which gives SAS and SAP Business Objects users access for reporting and analysis. Cerner uses the data to determine risks and opportunities for improvement across a population of people. Cerner also leverages SAS on Cloudera Enterprise for data science initiatives such as building prediction models for avoiding hospital readmissions. The Cerner team is evaluating Cloudera Search (Solr) and Impala to give hundreds of users across the organization direct access to data stored in Cloudera Enterprise. Security is extremely important for Cerner. Choosing Cloudera, one of the most secure Hadoop distributions on the market, as its big data platform gave Cerner confidence that patient data would be secure and completely protected.

We're able to achieve much better outcomes, both patient-related and financial, than we ever could by just looking at pieces of the puzzle individually. It all comes down to bringing everything together and being able to extract value for any requirement. The enterprise data hub topology allows us to do exactly that.

—Ryan Brush, Senior Director and Distinguished Engineer, Cerner

Technology and Applications

  • Hadoop Platform: Cloudera Enterprise, Data Hub Edition

  • Components in Use: Apache Crunch, Apache HBase, Apache Hive, Apache Kafka, Apache Oozie, Apache Storm, Cloudera Manager, MapReduce

  • Servers: HP

  • Data Mart: HP Vertica

  • BI and Analytics Tools: SAP Business Objects, SAS

Outcome

By consolidating health care data from multiple sources, Cerner was able to get a far more complete view of any patient, trend, or condition, helping them achieve much better patient-related and financial outcomes. For example, the big data platform gave them the ability to predict the probability of patient re-admission. Utilizing the same platform, Cerner was also able to accurately predict early onset of sepsis in patients.

Our clients are reporting that the new system has actually saved hundreds of lives by being able to predict if a patient is septic more effectively than they could before.

—Ryan Brush, Senior Director and Distinguished Engineer, Cerner

British Telecom

BT is one of the leading telecommunications companies in the United Kingdom with over 18 million customers and operations in 180 countries.

Use Cases

Like every organization, British Telecom is required to provide its business units with the most relevant and up-to-date information. Its legacy ETL system, built on traditional relational databases, could not scale and could barely process close to one billion rows of data in a timely manner. ETL jobs were taking more than 24 hours to process 24 hours of data, so business units had to contend with day-old data. [iv]

We had a proposal to re-platform the system to a new relational database. But as we sat down, our discussion turned to Hadoop. We realized we basically had a data velocity problem. We had to process the data faster and increase the volume that we could ingest—both of which Hadoop excels at.

—Phillip Radley, Chief Data Architect, BT

Solution

BT implemented a Cloudera Enterprise cluster and replaced their batch ETL jobs with MapReduce code. The platform not only solved BT’s ETL problem, but also addressed other data management challenges to help BT accelerate the delivery of new product offerings.

Consolidating data in a single, cost-effective infrastructure gave BT a unified 360-degree view of its data across its multiple business units. The platform will also enable BT to extend data retention from 1 year to more than 10 years and to implement mission-critical data management and analytic use cases.

The company plans to use Apache Spark to combine batch, streaming, and interactive analytics. Impala enables the business intelligence (BI) teams to run SQL queries on the data.

Technology and Applications

  • Hadoop Platform: Cloudera Enterprise, Data Hub Edition

  • Hadoop Components: Apache Hive, Apache Pig, Apache Sentry, Apache Spark, Cloudera Manager, Cloudera Navigator, Impala

Outcome

Moving ETL and data processing to Hadoop enabled BT to increase data velocity, providing business users with the information they need when they need it.

  • Processed 5x the customer data

  • Increased data velocity by 15x

  • Delivered ROI of 200–250% in one year

  • Delivered substantial cost savings for BT

We were able to increase data velocity by a factor of 15. We’re processing five times the data in a third of the time. The business sponsors don’t know that we moved to Hadoop and they don’t care. All they know is that they’re now working with today’s data instead of yesterday’s.

—Phillip Radley, Chief Data Architect, BT
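The two figures in the quote are consistent with each other if data velocity is read as data processed per unit of time. That reading is an assumption, since the source does not define the term. Five times the data in a third of the time then works out to a factor of 15. The function name below is hypothetical, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def velocity_gain(data_factor, time_factor):
    """Throughput ratio: (new data / old data) divided by (new time / old time)."""
    return Fraction(data_factor) / Fraction(time_factor)


# Five times the data, processed in one third of the time.
gain = velocity_gain(5, Fraction(1, 3))
print(gain)  # 15
```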

Shopzilla (Connexity)

Shopzilla is a leading e-commerce company headquartered in Los Angeles, California, with 100 million unique visitors connected to 100 million products from tens of thousands of retailers. [v]

Use Cases

Shopzilla had an existing 500-terabyte Oracle Enterprise Data Warehouse that was growing by 5 terabytes a day. Given the amount of data and processing required to crunch through 100 million products per day, the legacy data warehouse had exceeded its capacity and could not scale further: processing each day’s data took hours.

Solution

Shopzilla implemented a hybrid environment by complementing its Oracle Enterprise Data Warehouse with a Cloudera Enterprise cluster. Low-value ETL and data processing are handled by the CDH cluster. Using Apache Sqoop, aggregated data is then transferred to the Oracle EDW, freeing it to do what it was designed to do: serve analytics and reports to business users. Shopzilla plans to utilize Apache Impala and Apache Spark in the near future. [vi] The CDH cluster also supports online price comparison services, SEO, SEM, merchandising, audience scoring, and data science workloads.
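The offload pattern described above, heavy transformation on the cluster with only aggregated rows sent to the warehouse, can be sketched as follows. This is an illustration in plain Python, not Shopzilla's pipeline: the event fields, the sample data, and the `summarize` function are hypothetical. In the architecture described, this step would run on the CDH cluster and the transfer would be done with Apache Sqoop.

```python
from collections import defaultdict


def summarize(events):
    """Collapse raw click events into one row per (date, product_id)."""
    totals = defaultdict(lambda: {"clicks": 0, "revenue_cents": 0})
    for event in events:
        row = totals[(event["date"], event["product_id"])]
        row["clicks"] += 1
        row["revenue_cents"] += event["revenue_cents"]
    return [
        {"date": date, "product_id": pid, **row}
        for (date, pid), row in sorted(totals.items())
    ]


# Hypothetical raw events, as they might land on the cluster.
raw_events = [
    {"date": "2018-01-01", "product_id": "A1", "revenue_cents": 10},
    {"date": "2018-01-01", "product_id": "A1", "revenue_cents": 20},
    {"date": "2018-01-01", "product_id": "B2", "revenue_cents": 5},
    {"date": "2018-01-01", "product_id": "A1", "revenue_cents": 15},
]

# Only these summary rows would be exported to the data warehouse.
summary = summarize(raw_events)
```

Four raw events become two summary rows here. At production scale the same reduction keeps bulk transformation work off the warehouse.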

Data scientists don’t typically need to consume data warehouse resources now because all of the most recent data is available in Cloudera via R or Mahout. We needed enormous processing capabilities, scalability, full redundancy, and extensive storage – all at a cost-effective price. Our Cloudera platform provides all that and more.

—Rony Sawdayi, Vice President, Engineering, Connexity

We are able to answer complex questions, such as how a user is behaving on a particular site and what ads would be most effective, as well as execute other sophisticated data mining queries. It improves Connexity’s ability to provide relevant results to users, and this is a core tenet of our business.

—Paramjit Singh, Director of Data, Connexity

Technology and Applications

  • Data Platform: Cloudera Enterprise

  • Hadoop Components: Apache HBase, Apache Hive, Apache Mahout, Apache Pig, Apache Spark, Apache Sqoop, Cloudera Impala, Cloudera Manager

  • Servers: Dell

  • EDW: Oracle

  • BI & Analytic Tools: Oracle BI Enterprise Edition (OBIEE); R

Outcome

With Cloudera Enterprise, Connexity can now process data from 15,000 feeds and 100 million products from retailers in a matter of hours instead of several days. A new architecture is being tested that is expected to decrease processing time further, to minutes. The faster performance also enables Connexity to score and bid on 10 million keywords every day. [vii] This lets its search engine marketing activities scale to reach 100 million unique visitors and collect billions of data points that can be used for highly targeted marketing and innovative data analytics.

Our legacy system delivers great performance for analytics and reporting, but didn't have the bandwidth for the intensive data transformations we needed – it would take hours to process 100 million products per day. We needed enormous processing capabilities, scalability, full redundancy, and extensive storage – at a cost-effective price. Our Cloudera platform provides all that and more, while complementing our current data warehouse system. We were able to reduce latency from days to hours and soon minutes.

—Paramjit Singh, Director of Data, Connexity

Thomson Reuters

Thomson Reuters is a leading mass media and information corporation that provides professionals with trusted information.

Use Cases

Thomson Reuters aims to classify tweets and distinguish fake news and opinions from real news in 40 milliseconds. [viii]

Solution

Thomson Reuters turned to machine learning and advanced analytics to build Reuters Tracer, a “bot journalist in training.” Reuters Tracer analyzes 13 million tweets every day, processing events to determine whether a tweet is real news, an opinion, or fake news. [ix] Thomson Reuters uses Cloudera Enterprise and Apache Spark to provide the machine learning capabilities needed to implement Reuters Tracer. Spark’s fast in-memory processing enables Reuters Tracer to work through millions of tweets a day and to capture and detect events in less than 40 milliseconds.

To assist in evaluating the veracity of an event, we rely on hundreds of features and have trained the platform to look at the history and diversity of sources, the language used in tweets, propagation patterns, and much more, just as an investigative journalist would do.

—Sameena Shah, Director of Research and Lead Scientist on Reuters Tracer

Cloudera provides us with state-of-the-art technology to help us analyze data, synthesize text, and extract value and meaning from data to deliver the insights that our customers are looking for. The whole application is very fast. It takes less than 40 milliseconds to capture and detect events.

—Khalid Al-Kofahi, Head, Corporate Research & Development, Thomson Reuters

Technology and Applications

  • Data Platform: Cloudera Enterprise

  • Workloads: Data Science & Engineering

  • Hadoop Components: Apache Spark

Outcome

  • Revealed newsworthy events ahead of major news outlets

  • Distinguishes newsworthy tweets from rumors and fake news across 13 million tweets a day, detecting events in less than 40 milliseconds

We are in the business of building information-based solutions for our professional customers in the financial, legal, tax, and accounting industries, and for Reuters, one of the leading news organizations. With Reuters Tracer, we can alert our customers when market-moving events happen as they are reported, without delays. We have dozens and dozens of examples where Reuters Tracer discovered ground-breaking events ahead of major news organizations. Additionally, because we help journalists discover events, they can focus on higher value-add work as opposed to just reporting on events.

—Khalid Al-Kofahi, Head, Corporate Research & Development, Thomson Reuters

Mastercard

Mastercard is a leader in global payments that connects billions of consumers and millions of organizations around the world.

Use Cases

Mastercard built an anti-fraud system called MATCH (Mastercard Alert to Control High-risk Merchants) that allows users to search Mastercard’s proprietary database containing hundreds of millions of fraudulent businesses. Over time, it became evident that MATCH’s phonetic-based lookup feature was not versatile enough to satisfy the growing needs of MATCH users. Additionally, the relational database management system (RDBMS) powering MATCH could not keep up with the growing volume of data. [x]

Solution

Mastercard implemented a new anti-fraud solution based on Cloudera Search (powered by Apache Solr), an integrated part of CDH that provides full-text search and faceted navigation. Cloudera Search provided increased scalability, richer search functionality, and better search accuracy. The new solution can use several search algorithms and new scoring capabilities that were previously hard to implement on their legacy RDBMS. The new platform will also allow Mastercard to add more data sets as opportunities arise.
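To see why a purely phonetic lookup can be limiting, compare a phonetic key with a graded similarity score. The sketch below is an illustration only. The source does not say which phonetic algorithm MATCH used or which scoring Cloudera Search was configured with, so Soundex and character-trigram Jaccard similarity are stand-ins chosen for this example. The merchant names are made up.

```python
import re

# Standard American Soundex letter groups.
_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word):
    """Return the 4-character Soundex key for a single word."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset prev, so repeated codes around them are kept
    return (key + "000")[:4]


def trigrams(text):
    """Character trigrams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + 3] for i in range(len(norm) - 2)}


def trigram_score(a, b):
    """Jaccard similarity of the two trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # Phonetic keys are all-or-nothing: these two names collide...
    print(soundex("Robert"), soundex("Rupert"))  # R163 R163
    # ...while these differ only because of the first letter.
    print(soundex("Kwik"), soundex("Quick"))  # K200 Q200
    # A graded score still ranks the second pair as partially similar.
    print(round(trigram_score("Kwik Mart", "Quick Mart"), 2))  # 0.36
```

A phonetic key gives a yes-or-no answer, so it can neither rank near misses nor catch variants that change the first letter. A score-based approach returns candidates in order of similarity, which is closer to the "new scoring capabilities" the solution describes.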

Technology and Applications

  • Apache Hadoop Platform: Cloudera Enterprise, Data Hub Edition

  • Apache Hadoop Components: Apache Solr, Cloudera Search, Hue

Outcome

The new Cloudera-based solution is helping Mastercard easily identify fraudulent merchants to reduce risk. Mastercard users experienced dramatically improved search accuracy, while the number of searches supported annually increased 5x and searches per customer per day increased 25x. This has allowed Mastercard to expand into new markets, resulting in increased revenue.

Summary

My goal is to provide inspiration to encourage you to start your own big data use cases using effective and proven methodologies. I hope you found this chapter useful.

References
