Chapter 4
Hadoop

Hadoop is a popular, emerging technology for tackling big data and the ever-increasing amount of data being generated. It offers a new approach to managing your structured and semi-structured data, analyzing it, and delivering results. You can find many books, blogs, articles, and websites about Hadoop. In addition, many conferences focus on Hadoop, and many vendors are jumping on the Hadoop bandwagon to develop, integrate, and implement this technology. Customers and prospects are excited because it is open source and is claimed to be low cost. I will modestly cover Hadoop at a high level in its simplest form: how it relates to analytics and data management, and how it can fit into your architecture. If you desire a more in-depth understanding of Hadoop, I kindly suggest additional resources from the Internet and from software vendors such as Hortonworks or Cloudera. The following topics will be covered in this chapter:

  • What is Hadoop?
  • Why is Hadoop in the big data environment?
  • How does Hadoop fit in the modern architecture?
  • What are some best practices?
  • What are some use cases and success stories?
  • What are the benefits of using Hadoop in big data?

BACKGROUND

The history of Hadoop is fluid. According to my research, the fundamental technology behind Hadoop was invented by Google and had a tremendous amount of influence from Yahoo!. The underlying concept for this technology is to conveniently index and store all the rich textual and structured data being assimilated, analyze the data, and then present noteworthy and actionable results to users. It sounds pretty simple, but at that time, in the early 2000s, there was nothing in the market to process massive volumes of data (terabytes and beyond) in a distributed environment quickly and efficiently, and very little of that information was formatted in the traditional rows and columns of conventional databases. Google's innovative initiative was incorporated into an open source project called Nutch, which was then spun off in the mid-2000s into what is now known as Hadoop. Hadoop is the brainchild of Doug Cutting and Mike Cafarella, who continue to be influential today in open source technology. There are many definitions of what Hadoop is and what it can do. However, I try to use the most simplistic definition. In layman's terms, Hadoop is open source software that stores lots of data in a file system and processes massive volumes of data quickly in a distributed environment. Let's break down what Hadoop provides:

  1. Open source: Refers to software or a computer program in which the source code is available to the general public for use and/or alteration from its original design.
  2. Store: Refers to the Hadoop Distributed File System (HDFS), which is a distributed, scalable, and portable file system for storing large files (typically in the range of gigabytes to petabytes) across multiple servers.
  3. Process: Refers to MapReduce, which is a programming model for parallel processing of large data sets. MapReduce consists of two components:
    1. Map is a method that performs filtering and sorting of data.
    2. Reduce is a method that performs a summary of the data.

To break it down even further, MapReduce coordinates the processing of (big) data by executing the various tasks in parallel across a distributed network of machines and manages all data transfers between the various parts of the system. Figure 4.1 illustrates a simplistic view of the Hadoop architecture, where MR stands for MapReduce; a minimal code sketch of the map and reduce steps follows the figure.

Figure 4.1 Hadoop, HDFS, and MapReduce
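
To make the map and reduce steps more concrete, here is a minimal word-count sketch in Python. It is a local simulation of the flow, offered only as an illustration: on a real cluster the same two functions would typically be split into separate mapper and reducer scripts and submitted through Hadoop Streaming, which distributes them across worker nodes; the sample input below is invented for the example.

```python
# A minimal, local simulation of the MapReduce flow: map emits (key, value)
# pairs, a sort/shuffle groups identical keys, and reduce summarizes each group.
from itertools import groupby
from operator import itemgetter

def map_step(lines):
    """Map: filter the input and emit (word, 1) pairs."""
    for line in lines:
        for word in line.strip().lower().split():
            yield (word, 1)

def reduce_step(pairs):
    """Reduce: summarize all values that share a key (sum the counts)."""
    # The shuffle/sort phase groups identical keys before reduce runs.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["Hadoop stores big data", "Hadoop processes big data in parallel"]
    for word, total in reduce_step(map_step(sample)):
        print(word, total)
```

On a cluster, the framework, not your code, handles the distribution, sorting, and recovery; the sketch above only shows the logical shape of the two steps.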

Hadoop has grown over the years and is now an ecosystem of products from the open source and vendor communities. Today, Hadoop's framework and ecosystem of technologies are managed and maintained by the nonprofit Apache Software Foundation (ASF), a global community of software developers and contributors.

HADOOP IN THE BIG DATA ENVIRONMENT

Big data is among us, and the amount of data produced and collected is overwhelming. In recent years, due to the arrival of new technologies, devices, and communication channels such as social media, the amount of data produced is growing rapidly and exponentially every year. In addition, traditional data is also growing, which I consider a part of the data ecosystem. Table 4.1 shows the data sources that are considered big data and describes the data that you may be collecting for your business.

Table 4.1 Big Data Sources

Type: Data Description
Black Box: Data from airplanes, helicopters, and jets that capture the voices and recordings of the flight crew and the performance information of the aircraft.
Power Grid: Power grid data contain information about consumption by a particular node and customer with respect to a base station.
Sensor: Sensor data come from machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines. They can also include meteorological patterns, underground pressure during oil extraction, or a patient's vital statistics during recovery from a medical procedure.
Social Media: Sites such as Facebook, Twitter, Instagram, YouTube, etc., collect many data points from posts, tweets, chats, and videos by millions of people across the globe.
Stock Exchange: The stock exchange holds transaction data about the “buy” and “sell” decisions made on shares of different companies by customers.
Transport: Transport data encompass the model, capacity, distance, and availability of a car or truck.
Traditional (CRM, ERP, etc.): Traditional data in rows and columns coming from CRM, ERP, and financial systems about products, customers, sales, etc.

In addition to the data sources mentioned, big data also encompasses everything from call center voice data to genomic data from biological research and medicine. Companies that learn to take advantage of big data will use real-time information from sensors, radio frequency identification, and other identifying devices to understand their business environments at a more granular level, to create new products and services, and to respond to changes in usage patterns as they occur. In the life sciences, such capabilities may pave the way to treatments and cures for life-threatening diseases.

If you are collecting any or all of the data sources referenced in Table 4.1, then your traditional computing techniques and infrastructure may not be adequate to support the volume, variety, and velocity of the data being collected. Many organizations may need a new data platform that can handle exploding data volumes, variety, and velocity. They also need a scalable extension for existing IT systems in data warehousing, archiving, and content management. Others may need to finally get business intelligence value out of semi-structured data. Hadoop was designed to fulfill these needs, and is capable of doing so, with its ecosystem of products. Hadoop is not just a storage platform for big data; it's also a computational platform for analytics. This makes Hadoop ideal for companies that wish to compete on analytics, as well as retain customers, grow accounts, and improve operational excellence via analytics. Many firms believe that, for companies that get it right, big data will unleash new organizational capabilities and value that will increase their bottom line.

USE CASES FOR HADOOP

How Hadoop Fits in the Modern Architecture

Before we get into how Hadoop fits into the modern architecture, let's spend a few minutes on the traditional architecture. In the traditional architecture, you likely have a configuration that consists of a data warehouse storing the structured data, and the data can be accessed and analyzed using your analytical and/or business intelligence tools. Figure 4.2 shows the typical and traditional architecture.

Figure 4.2 Traditional architecture

  • Structured data sources: These data sources are most likely structured and can include ERP, CRM, legacy systems, financial data, tech support tickets, and e-commerce. As the data are created and captured, they are stored in an enterprise data warehouse or a relational database.
  • Data warehouse or database: This is the primary data storage component for the enterprise. It contains a centralized repository for reporting and data analysis. The ETL process is applied to extract, transform, and load the data into the data warehouse, which contains cleansed and integrated data. Other forms of storage can be an operational warehouse, an analytical warehouse, a data mart, an operational data store, or a data warehouse appliance. Data are then accessed with BI and analytical tools for analysis.
  • Business analytics: This is the analysis and action component in the traditional architecture. Users can access and apply business intelligence and analytics to the data from the warehouse and generate reports for data-driven decisions. Business intelligence and analytic applications may include ad hoc queries, data visualization, descriptive analytics, predictive analytics, online analytical processing (OLAP), operational reporting, and prescriptive analytics.

A data warehouse is good at providing consistent, integrated, and accurate data to the users. It can deliver in minutes the summaries, counts, and lists that make up the most common elements of business analytics. The combination of business analytics and the massively parallel processing of a data warehouse provides much of the self-service data access to users as well as interactivity. Providing self-service capability is a huge benefit in itself; the ability to interact with the data can teach the user more about the business and can produce new insights.

Another advantage of the warehouse is concurrency. Concurrency in a data warehouse ranges from tens to thousands of people simultaneously accessing and querying the system. The massively parallel processing engine behind the database supports this capability with a superior level of performance and efficiency.

HADOOP ARCHITECTURE

As your business evolves, the data warehouse may not meet the requirements of your organization. Organizations have information needs that are not completely served by a data warehouse. The needs are driven as much by the maturity of the data use in business as they are by new technology.

For example, the relational database at the center of the data warehouse is ideal for data processing that can be done via SQL. If the data cannot be processed via SQL, then the analysis of new data sources that are not in row-and-column format is limited. Other data sources that do not fit nicely in the data warehouse include text, images, audio, and video, all of which are considered semi-structured data. This is where Hadoop enters the architecture.

Hadoop is a family of products (Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, Mahout, Cassandra, YARN, Ambari, Avro, Chukwa, and ZooKeeper), each with different and multiple capabilities. Please visit www.apache.org for details on these products. These products are available as native open source from the Apache Software Foundation (ASF) and from software vendors.

Once the data are stored in Hadoop, the big data applications can be used to analyze the data. Figure 4.3 shows a simple stand-alone Hadoop architecture.

Figure 4.3 Hadoop architecture

  • Semi-structured data sources: The semi-structured data cannot be stored in a relational database (in column/row format). These data sources include email, social data, XML data, videos, audio files, photos, GPS, satellite images, sensor data, spreadsheets, web log data, mobile data, RFID tags, and PDF docs.
  • Hadoop: The Hadoop Distributed File System (HDFS), which is the data storage component of the open source Apache Hadoop project, is ideal for collecting the semi-structured data sources. (However, it can host structured data as well.) For this example, it is a simple architecture to capture all of the semi-structured data. Hadoop HDFS is designed to run on less expensive commodity hardware and is able to scale out quickly and inexpensively across a farm of servers.

    MapReduce is a key component of Hadoop. It is the resource management and processing component of Hadoop, and it allows programmers to write code that can process large volumes of data. For instance, a programmer can use MapReduce to locate friends or determine the number of contacts in a social network application, or process web access logs to analyze web traffic volume and patterns. In addition, MapReduce can process the data where it resides (in HDFS) instead of moving it around, as is sometimes the case in a traditional data warehouse system. It also comes with a built-in recovery system, so if one machine goes down, MapReduce knows where to go to get another copy of the data. Although MapReduce processing is fast when compared to more traditional methods, its jobs must be run in batch mode. This has proven to be a limitation for organizations that need to process data more frequently and/or closer to real time. The good news is that with the release of Hadoop 2.0, the resource management functionality has been packaged separately (it's called YARN) so that MapReduce doesn't get bogged down and can stay focused on processing big data.

  • Big data applications: This is the analysis and action component using data from Hadoop HDFS. These are the applications, tools, and utilities that have been natively built for users to access, interact, analyze, and make decisions using data in Hadoop and other nonrelational storage systems. It does not include traditional BI/analytics applications or tools that have been extended to support Hadoop.

    Since the inception of Hadoop, there has been a lot of noise and hype in the market as to what Hadoop can or cannot do. It is definitely not a silver bullet to solve all data management and analytical issues. Keep in mind that Hadoop is an ecosystem of products and provides multipurpose functions to manage and analyze big data. It is good at certain things and has shortcomings as well:

  • Not a database: A database provides many functions that help to manage the data. Hadoop has HDFS, which is a distributed file system and lacks database functions such as indexing, random access to data, support for standard SQL, and query optimization. As Hadoop matures and evolves, these functions will likely become part of the ecosystem. For now, you may consider HBase over HDFS to provide some DBMS functionality. Another option is Hive, which provides an SQL-like interface to manage data stored in Hadoop (a brief sketch of this option follows this list).
  • No multi-temperature data: The frequency at which data are accessed—often described as its “temperature”—can affect the performance and capability of the warehouse. Hadoop lacks mature query optimization and the ability to place “hot” and “cold” data on a variety of storage devices with different levels of performance. Thus, Hadoop is often complemented with a DBMS that processes the data on Hadoop and moves results to the data warehouse.
  • Lacks user interface: Users who prefer a user interface for ease of use may not find it with Hadoop. This is one big drawback cited in customer feedback. However, many vendors are creating applications that access data from Hadoop through a user interface to fill this gap.
  • Steep learning curve: From talking to customers, I know there is a steep learning curve, since Hadoop is an emerging technology and subject matter experts are few and far between. Even once you have learned Hadoop, writing code can be very tedious, time consuming, and costly. Because it is open source, the reliability and stability of the environment can also create issues.
  • Limited data manipulation: Joining data is a common practice to prepare data for analysis. For simple joins, Hadoop offers tools to accommodate those needs. However, for complex joins and SQL, it is not efficient. This process will require extensive programming, which requires time and resources from your organization.
  • Data governance and integrity: A database or a data warehouse provides the ACID (atomicity, consistency, isolation, durability) guarantee, which assures that database transactions are processed with integrity and can be governed. Hadoop does not provide these guarantees, which can pose a challenge for organizations trying to ensure the integrity of their analyses and results.
  • Lacks security: Hadoop lacks security features common in RDBMSs, such as row- and column-level security, and it provides minimal user-level controls, authentication options, and encryption.
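
As a small illustration of the Hive option mentioned in the first bullet above, here is a hedged sketch of querying data stored in Hadoop through Hive's SQL-like interface from Python. It assumes a HiveServer2 instance is running at the placeholder host below, that the PyHive client library is installed, and that a table named web_logs already exists; the host, table, and column names are invented for the example.

```python
# Sketch: issue a HiveQL query against data stored in Hadoop.
# Assumptions: HiveServer2 at the placeholder host, the PyHive library
# installed, and an existing Hive table named web_logs.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL looks like SQL, but it is compiled into jobs that run on the cluster.
cursor.execute(
    "SELECT status_code, COUNT(*) AS hits "
    "FROM web_logs "
    "GROUP BY status_code "
    "ORDER BY hits DESC"
)

for status_code, hits in cursor.fetchall():
    print(status_code, hits)

cursor.close()
conn.close()
```

Sketches like this are one reason Hive is a popular bridge for SQL-literate analysts, but they do not remove the underlying limitations listed above.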

For the above reasons, I highly recommend integrating Hadoop with your data warehouse. The combination of Hadoop and the data warehouse offers the best of both worlds: managing structured and semi-structured data and optimizing performance for analysis. More scenarios and explanations will be described in the next chapter when I bring all of the elements together.

Now that you know what Hadoop can do, let's shift our focus to some of the best practices for Hadoop.

BEST PRACTICES

As I talk to customers about Hadoop, they share some dos and don'ts based on their experience. Keep in mind that there will be many more best practices as the technology matures into the mainstream and is implemented.

  1. Start small and expand
    Every organization has many projects, and you rely heavily on technology to solve your business needs. You have probably been advised to focus on the project that gives you the biggest bang for your buck. For Hadoop, identify a small project with an immediate need. Too many times I have witnessed projects fail in companies due to the level of complexity, lack of resources, and high expenses. Selecting a small project to get started allows the IT and business staffs to become familiar with the inner workings of this emerging technology. The beauty of Hadoop is that it allows you to start small and add nodes as you go.
  2. Consider commodity servers
    Many database vendors offer a commercial Hadoop distribution with their appliances or hardware. This offering can easily be integrated with the current data warehouse infrastructure. However, I work with customers who adopted commodity servers to control costs and work within their budgets. Commodity servers are simply a bunch of disks with single power supplies to store data. Adopting commodity servers will require staff to be trained to manage the environment and to learn how Hadoop works on commodity hardware. Commodity servers can be an alternative but may not be suited for your environment.
  3. Monitor the infrastructure
    As mentioned earlier, Hadoop is an ecosystem, and there are many parts with multipurpose functions. These moving parts operate at both the data and management levels, and it is recommended to monitor the infrastructure. By monitoring the infrastructure, you can detect problems before they occur and know immediately when issues such as disk capacity limits or disk failures arise.
  4. Develop a data integration process
    With any data management technology, the process of integrating data is critical. Hadoop can store all data, both structured and semi-structured. Hadoop allows you to populate the data first and define the structure at a later time. Ensuring that the data are standardized, named, and located up front can make analysis easier. Many customers use Hadoop as a staging area for all of the data collected. The integration process enhances the reliability and integrity of the data.
  5. Compression is your companion
    IT has had a love–hate relationship with compression over the years. This is also true with an RDBMS. Compression saves space, but performance can be impacted when it comes to production systems. Hadoop, by contrast, flourishes on and leverages the use of compression, which can reduce storage requirements, and thus stretch storage capacity, many fold.
  6. Enforce a multiple environment infrastructure
    Just as with any other IT project, I always advise our customers to build a multiple-environment (development, test, production) infrastructure. Not only is this a common best practice, but it is also vital because Hadoop is an emerging technology. Each project within the Hadoop ecosystem is constantly changing, and having a nonproduction environment to test upgrades and new functionality is critical. Customers who have only one environment tend to pay the price when upgrades are not successful.
  7. Preserve separate functions of master and worker nodes
    Master nodes and worker nodes play two extremely different roles. Master nodes oversee and coordinate the data storage function (HDFS). They should be placed on servers that have fully redundant features such as a redundant array of independent disks (RAID) and multiple power supplies. Master nodes also play well in a virtual environment due to the extra redundancy this provides. Worker nodes, on the other hand, are the workhorses of the cluster and do all the dirty work of storing the data and running the computations. Therefore, these worker nodes need to be dedicated solely to this important function. Mixing these roles on the same server usually leads to undesirable results and unwanted issues.
  8. Keep it simple and don't overbuild
    Your first Hadoop project should be small and simple to build. You may get excited and get carried away overbuilding your first Hadoop cluster. The costs of hardware and software may be low initially, but it's important to build only to what your initial requirements demand to solve a business problem. You might find that the specifications of the servers you selected need to be modified based on the results and performance of your initial project. As they say, it's easier to add than to take away.
  9. Leverage the open source community
    Every project has its hiccups. When things go wrong, you have a local or online community. Hadoop is driven by the open source community, and there is a very good chance that a solution to the problem you are having is just a quick search away with an online search engine. The Hadoop ecosystem continues to mature, and the community is a great resource for conquering your issues.
  10. Implement some of these best practices
    There is a lot of hype about what Hadoop is or is not. You may consider Hadoop a big data store (a data lake) where you can just store all of your data for the time being and deal with it later. I highly recommend that you instill some of these best practices, and develop documentation and rules for the project, before it gets out of control. Once it gets out of control, it will be costly and time consuming to fix, which can delay the completion of the project.

Now that you are aware of the technology and some of the best practices, let's check out some of the benefits, use cases, and success stories.

BENEFITS OF HADOOP

As noted in the previous section, customers who implemented Hadoop saw business value and IT benefits. The benefits that follow are the most common ones I hear when talking to customers. As the technology becomes more mature, the benefits will continue to grow for both business and IT.

  1. Store and process big data of any kind quickly
    Hadoop is great at storing and processing massive amounts of data of any kind, inclusive of structured and semi-structured data. In big data, the three Vs (volume, variety, and velocity) constantly increase and continue to grow exponentially, especially from social media and the Internet of Things (IoT).
  2. Flexibility with data structure and preprocessing
    You will likely need to structure data into the columns and rows of a relational database or data warehouse, in an integrated format, before you can store it there. This preprocessing can be time consuming and resource intensive. With Hadoop, there is no preprocessing of the data; you can simply store as much data as you want and determine how to use it later (a brief schema-on-read sketch follows this list). The flexibility extends beyond structured data to semi-structured data such as text, images, and videos.
  3. Fault tolerance
    With fault tolerance, your data and application processing are protected against hardware failure. If a node or server goes down, your jobs are automatically redirected to other nodes within the network to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically, so you don't lose any of your strategic assets.
  4. Scalable
    Hadoop's distributed computing model processes large amounts of data fast. You may have terabytes or petabytes of data that need to be processed, analyzed, and managed. When you need more compute capacity, you can simply add more nodes to the cluster to get more processing power.
  5. Low cost
    I mentioned earlier the low cost of the commodity servers that you can use to run Hadoop and store large quantities of data. In addition, Hadoop is open source software and its framework is free. This benefit is the most popular among customers who consider and adopt Hadoop.
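
To illustrate the "store now, structure later" flexibility described in benefit 2, here is a small schema-on-read sketch in Python. The file name and field name are illustrative assumptions; on a real cluster the raw clickstream lines would live in HDFS rather than a local file, but the point is the same: no schema was imposed when the data landed, and structure is applied only at analysis time.

```python
# Schema-on-read sketch: raw JSON lines were stored with no upfront modeling;
# structure is imposed only when we analyze them. The path and the "page"
# field name (clickstream.jsonl) are illustrative assumptions.
import json
from collections import Counter

def parse_event(line):
    """Apply structure at read time; skip lines that don't fit this analysis."""
    try:
        event = json.loads(line)
        return event.get("page")
    except json.JSONDecodeError:
        return None  # The raw data stays intact; we just ignore it here.

page_views = Counter()
with open("clickstream.jsonl") as raw:
    for line in raw:
        page = parse_event(line)
        if page:
            page_views[page] += 1

for page, views in page_views.most_common(5):
    print(page, views)
```

A different analysis next month could reread the same raw files with a completely different notion of structure, which is exactly the flexibility this benefit describes.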

USE CASES AND SUCCESS STORIES

A recent “Best Practices Report” conducted by The Data Warehousing Institute (TDWI) revealed some interesting findings about the usage of Hadoop across all industries.1 The report consisted of several surveys, one of which asked participants to name the most useful application of Hadoop if their organization were to adopt and employ it. Following is a snapshot from the survey.

  1. 46%: Data warehouse extensions
    In other words, Hadoop complements the data warehouse or RDBMS. It is used to store semi-structured data while the data warehouse is storing the traditional structured data.
  2. 46%: Data exploration and discovery
    Hadoop is an ideal platform for data exploration and discovery to know what data you have and understand its potential business value before it gets analyzed. Once you know what you have, you can develop the data for analytics.
  3. 39%: Data staging area for data warehousing and data integration
    Instead of applying the ETL process at the data warehouse platform, you can apply the data integration process to stage or prepare the data for loading. This is another option for customers to manage the data and only load the data into the data warehouse as needed. Otherwise, the data can stay in Hadoop.
  4. 36%: Data lakes
    Data lakes are considered large repositories for all data types. For example, you can store all your raw data as it is generated and captured and decide what you want to keep at a later time.
  5. 36%: Queryable archive for nontraditional data (web logs, sensor, social, etc.)
    Just as the data warehouse is an archiving platform for traditional or structured data, Hadoop can be the archiving platform for semi-structured data or the nontraditional data sources. It is a less expensive option for storing data that may no longer be relevant or may be outdated.
  6. 33%: Computational platform and sandbox for advanced analytics
    A sandbox is an area to explore and examine new ideas and possibilities by combining new data with existing data to create experimental designs and ad hoc queries without interrupting the production environment. This is ideal for testing and experimenting with new functionality such as in-database or in-memory analytics.

In my research, I found many public use cases and success stories that you can find on the web by simply searching for them. It is no surprise that Google and Yahoo! leverage Hadoop, because they were both involved in the development and maturation of this technology. I also discovered a plethora of companies across industries using Hadoop. Many of the use cases are very similar in nature and share a common theme: support of semi-structured data using HDFS and MapReduce.

Global Online and Social Networking Website

Our first success story comes from an American corporation that provides online social networking services. With tens of millions of users and more than a billion page views every day, this company accumulates massive amounts of data and ends up with terabytes of data to process. This is likely far more than your typical organization handles, especially considering the amount of media it consumes. This corporation has leveraged Hadoop since its inception to explore the data and improve the user experience.

Background

One of the challenges that this company has faced is developing a scalable way of storing and processing all of the massive amounts of data it collects. Collecting and analyzing historical data is a very big part of how this company improves the user experience.

Years ago, it began playing around with the idea of implementing Hadoop to handle its massive data consumption and aggregation. In the early days of considering Hadoop, it was questionable whether importing some interesting data sets into a relatively small cluster of servers was feasible. Developers were quickly excited to learn that processing big data sets was possible with the MapReduce programming model, something that had not previously been possible because of the data's massive computational requirements.

Exploring Data with Hadoop

Hadoop is a big hit at this organization. It has multiple Hadoop clusters deployed now, with the biggest having about 2,500 central processing unit (CPU) cores and 1 petabyte of disk space. This company is loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and has hundreds of jobs running daily against these data sets. The list of projects that use this infrastructure has increased tremendously, from those generating mundane statistics about site usage to others being used to fight spam and determine application quality. A majority of the engineering staff has played with and run Hadoop at this company.

There are a number of reasons for the rapid adoption of Hadoop. First, developers are free to write MapReduce programs in the language of their choice. Second, engineers have embraced SQL as a familiar paradigm to address and operate on large data sets. Most data stored in Hadoop's file system are published as tables. Developers and engineers can explore the schemas and data of these tables much like they would do with a relational database management system (RDBMS). A combination of MapReduce and SQL can be very powerful to retrieve and process semi-structured and structured data.

At the corporate level, it is incredibly important that they use the information generated by and from their users to make decisions about improvements to the product and services. Hadoop has enabled them to make better use of the data at their disposal.

Global Consumer Transaction Company: Data Landing Zone

Our next customer is a global leader in consumer transaction technologies, turning everyday interactions with businesses into exceptional experiences. With its software, hardware, and portfolio of services, this company makes nearly 550 million transactions possible daily.

This company's motto is to help their customers respond to the demand for fast, easy, and convenient transactions with intuitive self-service and assisted-service options. But what they do goes beyond niche technologies or markets. Their solutions help businesses around the world increase revenue, build loyalty, reach new customers, and lower their costs of operations.

By continually learning about, and pioneering, how the world interacts and transacts, this global leader is helping companies not only reach their goals but also change the way all of us shop, eat, travel, bank, and connect. Its clients include thousands of banks and retailers that depend on tens of thousands of self-service kiosks, point-of-sale (POS) terminals, ATMs (automated teller machines), and barcode scanners, among other equipment.

Traditional Approach

In the traditional architecture, this customer could only collect a snapshot of data, which limited the accuracy of its predictive models. For example, the company might collect snapshots from a sampling of ATMs (perhaps once-daily performance data) where the real need is to collect the entire set.

In addition, this company needed to avoid downtime for its clients at all costs. Having downtime means that its clients would not be able to operate, and it hurts the reputation of the company. Yet the traditional architecture could not collect millions of records from many data sources and analyze them in a timely manner to predict the likelihood that a device needs preventive maintenance, replacement, or upgrades. Furthermore, it could not store real-time performance data from every deployed device in banks and retailers and, therefore, was unable to analyze them to detect and prevent downtime for these devices.

Hadoop: A Landing Zone for Data Exploration

As mentioned in Chapter 1, Hadoop is used in the data exploration phase of the analytical data life cycle. It is common for Hadoop to be used as a landing zone for all data, a place to explore what you have before you apply any analysis to the data.

This customer considered Hadoop to be a landing zone for all data, as Hadoop is good at storing both structured and semi-structured data. Hadoop provides the technology to efficiently and effectively store large amounts of data, including data that the company is not yet sure how it will use.

To resolve the downtime issue mentioned previously, the company is leveraging Hadoop as a “data lake” where it has the capacity to gather and store real-time performance data from every deployed device in banks and retailers. This type of data is not only big in volume (millions of records) but also in the rate, or velocity, at which it is collected. From the data, this company can create rich predictive models using advanced analytics to detect and prevent downtime. It is able to analyze the entire set of data, not just a subset, and to build predictive models from it. For example, the data lake built on Hadoop can monitor every ATM the company manufactures and develop data models to predict failures. Fortunately, downtime is now few and far between, thanks to how the big data is collected and analyzed.

Another use for Hadoop is collecting a new data source, such as telephony data, to analyze the use of telepresence devices worldwide. When communications occur between offices, this customer is able to determine the most efficient path and advise where to put telephony equipment in the network. This is a purely operational use of the data, and a business analyst responsible for financial analysis of the product portfolio may never have considered looking at retail service calls. With Hadoop collecting the telephony data, this company is able to expand the scope of its queries and root cause analysis to see how a colocated printer, for example, might be shared between a point-of-sale device, an ATM, and a self-checkout device. The ability to leverage all the data in one place for analytics across lines of business, and to share the analytical results, creates a culture of openness and information sharing.

Let's examine another customer success story using Hadoop.

Global Payment Services Company: Hadoop Combining Structured and Semi-Structured Data

Our next success story comes from a global payment services company that provides consumers and businesses with fast, reliable, and convenient ways to send and receive money around the world, to send payments, and to purchase money orders. With a combined network of over 500,000 agent locations in 200 countries and territories and over 100,000 ATMs and kiosks, this customer completed 255 million consumer-to-consumer transactions worldwide, moving $85 billion of principal between consumers, as well as 484 million business payments. Its core business includes

  • Retail, online, ATM, and mobile consumer-to-consumer money transfers
  • Stored-value solutions, including prepaid cards and money orders
  • Consumer bill payments
  • Cross-border business payments for small- and medium-sized enterprises

Background

This customer has been collecting structured data for many years. With the popularity of the Internet and web services, its customers are leveraging online services to send money to other people, pay bills, buy or reload prepaid debit cards, or find the locations of agents for in-person services. To give you a sense of the volume, more than 70 million users tapped into its services online, in person, or by phone, and the company said it averaged 29 transactions per second across 200 countries and territories. Thus, web data were growing fast and needed to be captured and analyzed for business needs. Log-based clickstream data about user activity from the website were captured in semi-structured nonrelational formats and then mixed and integrated with the relational data to give this company a true view of customer activity and engagement. All of the structured and semi-structured data were considered funnels for a data analysis pipeline. With the traditional infrastructure, only the structured data was captured, integrated, and analyzed. Now, there is a need to integrate large amounts of web data into the corporate workflow to provide a holistic view of operations.

The current infrastructure was unable to accommodate the business needs. The amount of data was so huge that analyzing it was a challenge and getting answers from massive amounts of data was very difficult. Because data are touched and shared across the enterprise, data scientists, business analysts, and marketing personnel were unable to rely on the existing technology for their data management and analytical needs.

Hadoop: Part of the Enterprise Architecture

At this money-transfer and payment services company, incorporating Hadoop into its enterprise operations also meant successfully integrating large amounts of web data into the corporate workflow. Significant work was required to make new forms of data readily accessible and usable by many staff members: data scientists, business analysts, marketing, and sales. Hadoop is the center of a self-service business intelligence and analytics platform that consists of a variety of other systems and software components. Hadoop feeds these applications the data they need, since it collects all of the semi-structured data.

This customer is adding data from mobile devices to its Hadoop environment and complementing it with online data from the Internet. Once the streams of data, structured and semi-structured, are collected, they are parsed and prepared for analysis using a collection of technologies:

  • Data integration tool to integrate and transform the data
  • Data warehousing technology to store the structured data
  • Analytics tools for analysis from Hadoop and the data warehouse

The Hadoop system continues to grow in importance at this global payment company. The enterprise architecture that includes Hadoop is expanding into risk analysis and compliance with financial regulation, anti-money laundering, and other financial crimes. Hadoop is widely used and crucial to the analytics process.

A COLLECTION OF USE CASES

  • A U.S.-based casino-entertainment company is using a new Hadoop environment to define customer segments and develop specific marketing campaigns tailored to each of those segments. With Hadoop, the new environment has reduced processing time for key jobs from about 6 hours to 45 minutes. The reduced processing time allows the casino to run faster and more accurate data analysis. In addition, it has improved customer experiences and increased security for meeting payment card industry (PCI) and other standards. The customer can now process more than 3 million records per hour, much more than what it could do without Hadoop.
  • A healthcare technology company has built an enterprise data hub using Hadoop to create a more complete view of any patient, condition, or trend. Hadoop is helping this company and its clients monitor more than one million patients every day. In addition, it is helping to detect the likelihood that a patient has the potentially fatal bloodstream infection, sepsis, with a much greater accuracy than was previously possible.
  • A relationship-minded online dating website uses Hadoop to analyze a massive volume and variety of data. The technology is helping this company to deliver new matches to millions of people every day, and the new environment accommodates more complex analyses to create more personalized results and improve the chances of relationship success.

There is no question that big data is here, and organizations are swimming in an expanding wealth of data that is either too voluminous or not structured enough to be managed and analyzed through traditional means and approaches. Hadoop is a technology that can complement your traditional architecture to support the semi-structured data that you are collecting. Big data has prompted organizations to rethink and restructure on both the business and IT sides. Managing data has always been at the forefront, and analytics has been more of an afterthought. Now, it seems big data has brought analytics to the forefront along with data management. Organizations need to realize that the world we live in is changing, and so is the data. It is prudent for businesses to recognize these changes and react quickly and intelligently to gain the competitive advantage and the upper hand. The new paradigm is no longer stability; it is now agility and discovery. As the volume of data explodes, organizations will need a platform to support new data sources and analytic capability. As big data evolves, the Hadoop architecture will develop into a modern information ecosystem that is a network of internal and external services continuously allocating information, optimizing data-driven decisions, communicating results, and generating new, strategic insights for businesses.

In the next chapter, we will bring it all together, combining Hadoop with the data warehouse and leveraging in-database analytics and in-memory analytics in an integrated architecture.

ENDNOTE
