Chapter 15

Integrating Data Sources

In This Chapter

▶ Identifying the data you need

▶ Understanding the fundamentals of big data integration

▶ Using Hadoop as ETL

▶ Knowing best practices for data integration

To get the most business value from big data, you need to integrate it into your business processes. How can you take action based on your analysis of big data unless you can understand the results in context with your operational data? Differentiating your company through good business decisions depends on many factors. One factor that is becoming increasingly important is your capability to integrate internal and external data sources composed of both traditional relational data and newer forms of unstructured data. While this may seem like a daunting task, the reality is that you probably already have a lot of experience with data integration. Don’t toss aside everything you have learned about delivering data as a trusted source to your organization. You will want to place a high priority on data quality as you move to make big data analytics actionable. However, to bring your big data environments and enterprise data environments together, you need to incorporate new methods of integration that support Hadoop and other nontraditional big data environments.

Two major categories of big data integration are covered in this chapter: the integration of multiple big data sources in big data environments and the integration of unstructured big data sources with structured enterprise data. We cover the traditional forms of integration such as extract, transform, and load (ETL) and new solutions designed for big data platforms.

Identifying the Data You Need

Before you can begin to plan for integration of your big data, you need to take stock of the type of data you are dealing with. Many organizations are recognizing that a lot of internally generated data has not been used to its full potential in the past. By leveraging new tools, organizations are gaining new insight from previously untapped sources of unstructured data in e-mails, customer service records, sensor data, and security logs. In addition, much interest exists in looking for new insight based on analysis of data that is primarily external to the organization, such as social media, mobile phone location, traffic, and weather.

Your analysis may require that you bring several of these big data sources together. To complete your analysis, you need to move large amounts of data from log files, Twitter feeds, RFID tags, and weather data feeds and integrate all these elements across highly distributed data systems. After you complete your analysis, you may need to integrate your big data with your operational data. For example, healthcare researchers explore unstructured information from patient records in combination with traditional patient data from medical records, such as test results, to improve both patient care and the quality of care. Big data sources like information from medical devices and clinical trials may be incorporated into the analysis as well.

As you begin your big data analysis, you probably do not know exactly what you will find. You may begin with petabytes of data and, as you look for patterns, narrow your results. The analysis typically moves through the following three stages, each described in more detail in the sections that follow:

✓ Exploratory stage

✓ Codifying stage

✓ Integration and incorporation stage

Exploratory stage

In the early stages of your analysis, you will want to search for patterns in the data. It is only by examining very large volumes (terabytes and petabytes) of data that new and unexpected relationships and correlations among elements may become apparent. These patterns can provide insight into customer preferences for a new product, for example. You will need a platform such as Hadoop for organizing your big data to look for these patterns.

As described in Chapters 9 and 10, Hadoop is widely used as an underlying building block for capturing and processing big data. Hadoop is designed with capabilities that speed the processing of big data and make it possible to identify patterns in huge amounts of data in a relatively short time. The two primary components of Hadoop — Hadoop Distributed File System (HDFS) and MapReduce — are used to manage and process your big data. In the exploratory stage, you are not so concerned about integration with operational data. That will come later.
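
If you want to see the shape of that processing, the following Python sketch is a minimal, hypothetical example of the kind of mapper and reducer you might run with Hadoop Streaming to count how often individual products are mentioned in raw log lines. The product= log convention and the map/reduce command-line switch are assumptions, not part of any real system; an actual job would be submitted to your cluster with the Hadoop Streaming jar.

#!/usr/bin/env python
# Minimal Hadoop Streaming-style sketch: count product mentions in raw log lines.
# The "product=" log convention and the map/reduce switch below are assumptions.
import sys

def mapper():
    # Emit a count of 1 for every product token found in each raw log line.
    for line in sys.stdin:
        for token in line.strip().split():
            if token.startswith("product="):
                print("%s\t1" % token.split("=", 1)[1])

def reducer():
    # Hadoop sorts mapper output by key, so counts for one product arrive together.
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    # Run as "pattern_count.py map" for the map phase or "pattern_count.py reduce" for the reduce phase.
    mapper() if sys.argv[1] == "map" else reducer()

Because Hadoop sorts and groups the mapper’s output by key before handing it to the reducer, the reducer only needs to total adjacent counts, and the same simple logic scales across a large cluster.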

Using FlumeNG for big data integration

However, one type of integration is critical during the exploratory stage. It is often necessary to collect, aggregate, and move extremely large amounts of streaming data to search for hidden patterns in big data. Traditional integration tools such as ETL would not be fast enough to move the large streams of data in time to deliver results for analysis such as real-time fraud detection. FlumeNG (a more advanced version of the original Flume) loads data in real time by streaming your data into Hadoop.

Typically, Flume is used to collect large amounts of log data from distributed servers. It keeps track of all the physical and logical nodes in a Flume installation. Agent nodes are installed on the servers and are responsible for managing the way a single stream of data is transferred and processed from its beginning point to its destination point. In addition, collectors are used to group the streams of data into larger streams that can be written to a Hadoop file system or other big data storage container. Flume is designed for scalability and can continually add more resources to a system to handle extremely large amounts of data in an efficient way. Flume’s output can be integrated with Hadoop and Hive for analysis of the data. Flume also has transformation elements to use on the data and can turn your Hadoop infrastructure into a streaming source of unstructured data.

Looking for patterns in big data

You find many examples of companies beginning to realize competitive advantages from big data analytics. For many companies, social media data streams are increasingly becoming an integral component of a digital marketing strategy. For example, Wal-Mart analyzes customer location-based data, tweets, and other social media streams to make more targeted product recommendations for customers and to tailor in-store product selection to customer demand. Wal-Mart acquired social media company Kosmix in 2011 to gain access to its technology platform for searching and analyzing real-time data streams. In the exploratory stage, this technology can be used to rapidly search through huge amounts of streaming data and pull out the trending patterns that relate to specific products or customers. The results can be used to optimize inventory based on the likes and dislikes of shoppers near a specific geographic location.

As companies search for patterns in big data, the huge data volumes are narrowed down as if they are passed through a funnel. You may start with petabytes of data and then, as you look for data with similar characteristics or data that forms a particular pattern, you eliminate data that does not match up.

Codifying stage

Making the leap from identifying a pattern to incorporating that trend into your business process requires some sort of process to follow. For example, if a large retailer monitors social media and identifies lots of chatter about an upcoming college football event near one of its stores, how will the company make use of this information? With hundreds of stores and many thousands of customers, you need a repeatable process to make the leap from pattern identification to implementation of new product selection and more targeted marketing. With a process in place, the retailer can quickly take action and stock the local store with clothing and accessories with the team logo. After you find something interesting in your big data analysis, you need to codify it and make it a part of your business process. You need to make the connection between your big data analytics and your inventory and product systems.

To codify the relationship between your big data analytics and your operational data, you need to integrate the data.

Integration and incorporation stage

Big data is having a major impact on many aspects of data management, including data integration. Traditionally, data integration has focused on the movement of data through middleware, including specifications on message passing and requirements for application programming interfaces (APIs). These concepts of data integration are more appropriate for managing data at rest rather than data in motion. The move into the new world of unstructured data and streaming data changes the conventional notion of data integration. If you want to incorporate your analysis of streaming data into your business process, you need advanced technology that is fast enough to enable you to make decisions in real time. One important goal with big data analytics is to look for patterns that apply to your business and narrow down the data set based on business context. Therefore, the analysis of big data is only one step in your implementation. After your big data analysis is complete, you need an approach that will allow you to integrate or incorporate the results of your big data analysis into your business process and real-time business actions.

Companies have high expectations for gaining real business value from big data analysis. In fact, many companies would like to begin a deeper analysis of internally generated big data, such as security log data, that was not previously possible due to technology limitations. Technologies for high-speed transport of very large and fast data are a requirement for integrating across distributed big data sources and between big data and operational data. Unstructured data sources often need to be moved quickly over large geographic distances for the sharing and collaboration required in everything from major scientific research projects to development and delivery of content for the entertainment industry.

For example, scientific researchers typically work with very large data sets. Researchers share data and collaborate more easily than in the past by using a combination of big data analytics and the cloud. One example of a large data set shared by scientific researchers across the world is the 1000 Genomes Project. Disease researchers study the human genome to identify and compile variations to help understand and treat diseases. The data for the 1000 Genomes Project, the largest and most detailed catalog of human genetic variation in the world, is maintained on Amazon Web Services (AWS). The data is made available to the international scientific research community. AWS is able to support the transfer of very large files at fast speeds over the Internet (700 megabytes per second) using technology by Aspera. Aspera provides a high-speed file transport technology called fasp. This software transfers big data at speeds that are many times faster than TCP-based file transfer technologies like FTP and HTTP. This speed can be guaranteed even with very large files, over long distances, and across geographic boundaries.

Linking traditional sources with big data is a multistaged process after you have looked at all the data from streaming big data sources and identified the relevant patterns. You may not have known what you were looking for when you started, but now you have some important information for your business. As you move from the exploratory stage and get closer to the real business problem, you need to begin thinking about metadata and rules and the structure of the data. After narrowing the amount of data you need to manage and analyze, now you need to think about integration.

A company that uses big data to predict customer interest in new products needs to make a connection between the big data and the operational data on customers and products to take action. If the company wants to use this information to buy new products, change pricing, or manage inventory, it needs to integrate its operational data with the results of its big data analysis. The retail industry is one market where companies are beginning to use big data analytics to deepen their relationships with customers and create more personalized and targeted offers. Integration of big data and operational data is key to the success of these efforts. For example, consider a customer who registers on a retailer’s website, providing her mobile number and e-mail address. Today, the customer receives e-mails about sales and coupon incentives to make purchases in the store or online. In the future, retailers plan to use location-based services from the customer’s mobile device to identify where the customer is located in the store and send a text message with a coupon for immediate use in that department. In other words, a customer might walk into the entertainment section of the store and receive a text message for a discount on the purchase of a Blu-ray disc player. To do this, the retailer needs real-time integration of big data feeds (location-based information) with operational data on customer history and in-store inventory. The analysis needs to take place immediately, and communication with the customer needs to happen at the same time. Even a delay of ten minutes may be too long, and the moment of customer interaction will be lost.
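
To make the timing challenge concrete, here is a minimal Python sketch of what that real-time join might look like: one location event triggers a lookup against customer history and department inventory, and a coupon goes out while the customer is still in the aisle. The data stores, lookup keys, and coupon rule are hypothetical stand-ins; in production this logic would run inside a streaming engine backed by operational data services, not in-memory dictionaries.

# Hedged sketch: reacting to one streaming location event with operational data.
# The data stores, lookup keys, and coupon rule are illustrative stand-ins.

CUSTOMER_HISTORY = {"C1001": {"recent_views": ["blu-ray player"]}}
DEPARTMENT_INVENTORY = {("store-42", "entertainment"): ["blu-ray player", "soundbar"]}

def send_text(customer_id, message):
    # Placeholder for the retailer's SMS gateway.
    print("SMS to %s: %s" % (customer_id, message))

def on_location_event(event):
    # Join the streaming event with customer history and in-store inventory.
    history = CUSTOMER_HISTORY.get(event["customer_id"], {})
    stocked = DEPARTMENT_INVENTORY.get((event["store_id"], event["department"]), [])
    matches = [p for p in history.get("recent_views", []) if p in stocked]
    if matches:
        send_text(event["customer_id"],
                  "10%% off the %s in our %s department today!" % (matches[0], event["department"]))

on_location_event({"customer_id": "C1001", "store_id": "store-42", "department": "entertainment"})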

Understanding the Fundamentals of Big Data Integration

The elements of the big data platform manage data in new ways as compared to the traditional relational database. This is because of the need for the scalability and high performance required to manage both structured and unstructured data. Components of the big data ecosystem, ranging from Hadoop to NoSQL databases such as MongoDB, Cassandra, and HBase, all have their own approaches for extracting and loading data. As a result, your teams may need to develop new skills to manage the integration process across these platforms. However, many of your company’s data management best practices will become even more important as you move into the world of big data.

While big data introduces a new level of integration complexity, the basic fundamental principles still apply. Your business objective needs to be focused on delivering quality and trusted data to the organization at the right time and in the right context. To ensure this trust, you need to establish common rules for data quality with an emphasis on accuracy and completeness of data. In addition, you need a comprehensive approach to developing enterprise metadata, keeping track of data lineage and governance to support integration of your data.

At the same time, traditional tools for data integration are evolving to handle the increasing variety of unstructured data and the growing volume and velocity of big data. While traditional forms of integration take on new meanings in a big data world, your integration technologies need a common platform that supports data quality and profiling.

To make sound business decisions based on big data analysis, this information needs to be trusted and understood at all levels of the organization. While it will probably not be cost or time effective to be overly concerned with data quality in the exploratory stage of a big data analysis, eventually quality and trust must play a role if the results are to be incorporated in the business process. Information needs to be delivered to the business in a trusted, controlled, consistent, and flexible way across the enterprise, regardless of the requirements specific to individual systems or applications. To accomplish this goal, three basic principles apply:

✓ You must create a common understanding of data definitions. At the initial stages of your big data analysis, you are not likely to have the same level of control over data definitions as you do with your operational data. However, once you have identified the patterns that are most relevant to your business, you need the capability to map data elements to a common definition. That common definition is then carried forward into operational data, data warehouses, reporting, and business processes.

✓ You must develop a set of data services to qualify the data and make it consistent and ultimately trustworthy. When your unstructured and big data sources are integrated with structured operational data, you need to be confident that the results will be meaningful.

✓ You need a streamlined way to integrate your big data sources and systems of record. In order to make good decisions based on the results of your big data analysis, you need to deliver information at the right time and with the right context. Your big data integration process should ensure consistency and reliability.

To integrate data across mixed application environments, you need to get data from one data environment (source) to another data environment (target). Extract, transform, and load (ETL) technologies have been used to accomplish this in traditional data warehouse environments. The role of ETL is evolving to handle newer data management environments like Hadoop. In a big data environment, you may need to combine tools that support batch integration processes (using ETL) with real-time integration and federation across multiple sources. For example, a pharmaceutical company may need to blend data stored in its Master Data Management (MDM) system with big data sources on medical outcomes of customer drug usage. Companies use MDM to facilitate the collecting, aggregating, consolidating, and delivering of consistent and reliable data in a controlled manner across the enterprise. In addition, new tools like Sqoop and Scribe are used to support integration of big data environments. You also find an increasing emphasis on using extract, load, and transform (ELT) technologies. These technologies are described next.

Defining Traditional ETL

ETL tools combine three important functions required to get data from one data environment and put it into another data environment. Traditionally, ETL has been used with batch processing in data warehouse environments. Data warehouses provide business users with a way to consolidate information across disparate sources (such as enterprise resource planning [ERP] and customer relationship management [CRM]) to analyze and report on data relevant to their specific business focus. ETL tools are used to transform the data into the format required by the data warehouse. The transformation is actually done in an intermediate location before the data is loaded into the data warehouse. Many software vendors, including IBM, Informatica, Pervasive, Talend, and Pentaho, provide ETL software tools.

ETL provides the underlying infrastructure for integration by performing three important functions:

✓ Extract: Read data from the source database.

✓ Transform: Convert the format of the extracted data so that it conforms to the requirements of the target database. The transformation is done by applying rules or by merging the data with other data.

✓ Load: Write data to the target database. (A minimal code sketch of these three steps follows this list.)
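
To make those three steps concrete, here is a minimal Python sketch of a batch ETL pass. SQLite and the table and column names are stand-ins for whatever source systems and warehouse you actually run; the point is simply the division of labor among extract, transform, and load.

# Minimal ETL sketch: extract from a source database, transform, load into a target.
# SQLite and the table/column names stand in for your real source and warehouse.
import sqlite3

def extract(source):
    # Extract: read only the rows needed for the integration.
    return source.execute(
        "SELECT customer_id, first_name, last_name, amount_cents FROM orders").fetchall()

def transform(rows):
    # Transform: conform the rows to the target warehouse's format and rules.
    return [(cid, ("%s %s" % (first, last)).upper(), cents / 100.0)
            for cid, first, last, cents in rows]

def load(target, rows):
    # Load: write the conformed rows into the target table.
    target.executemany(
        "INSERT INTO fact_orders (customer_id, customer_name, amount_usd) VALUES (?, ?, ?)", rows)
    target.commit()

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE orders (customer_id, first_name, last_name, amount_cents)")
    src.execute("INSERT INTO orders VALUES ('C1', 'Ada', 'Lovelace', 2599)")
    tgt = sqlite3.connect(":memory:")
    tgt.execute("CREATE TABLE fact_orders (customer_id, customer_name, amount_usd)")
    load(tgt, transform(extract(src)))
    print(tgt.execute("SELECT * FROM fact_orders").fetchall())

In a real deployment, the extract step would read from systems such as ERP or CRM and the load step would write to your data warehouse, but the pattern stays the same.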

However, ETL is evolving to support integration across much more than traditional data warehouses. ETL can support integration across transactional systems, operational data stores, BI platforms, MDM hubs, the cloud, and Hadoop platforms. ETL software vendors are extending their solutions to provide big data extraction, transformation, and loading between Hadoop and traditional data management platforms. ETL and software tools for other data integration processes like data cleansing, profiling, and auditing all work on different aspects of the data to ensure that the data will be deemed trustworthy. ETL tools integrate with data quality tools, and many incorporate tools for data cleansing, data mapping, and identifying data lineage. With ETL, you only extract the data you will need for the integration.

ETL tools are needed for the loading and conversion of structured and unstructured data into Hadoop. Advanced ETL tools can read and write multiple files in parallel from and to Hadoop to simplify how data is merged into a common transformation process. Some solutions incorporate libraries of prebuilt ETL transformations for both the transaction and interaction data that run on Hadoop or a traditional grid infrastructure.

Data transformation

Data transformation is the process of changing the format of data so that it can be used by different applications. This may mean converting data from the format in which it is stored into the format needed by the application that will use it. This process also includes mapping instructions that tell applications how to get the data they need to process.

The process of data transformation is made far more complex because of the staggering growth in the amount of unstructured data. A business application such as a customer relationship management or sales management system typically has specific requirements for how the data it needs should be stored. The data is likely to be structured in the organized rows and columns of a relational database. Data is semi-structured or unstructured if it does not follow these very rigid format requirements. The information contained in an e-mail message is considered unstructured, for example. Some of a company’s most important information is in unstructured and semi-structured forms such as documents, e-mail messages, complex messaging formats, customer support interactions, transactions, and information coming from packaged applications like ERP and CRM.

Data transformation tools are not designed to work well with unstructured data. As a result, companies needing to incorporate unstructured information into their business process decision making have faced a significant amount of manual coding to accomplish the required data integration. Given the growth and importance of unstructured data to decision making, ETL solutions from major vendors are beginning to offer standardized approaches to transforming unstructured data so that it can be more easily integrated with operational structured data.
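
As a small illustration of what that transformation can involve, the following Python sketch pulls a few structured fields out of the body of a customer support e-mail so that they can be joined with operational records. The message layout and regular expressions are assumptions, and commercial solutions rely on far more robust parsing and text analytics than a handful of patterns.

# Hedged sketch: pulling a few structured fields out of one unstructured e-mail.
# The message layout and regular expressions are illustrative assumptions.
import re

EMAIL = """From: pat@example.com
Subject: Order 84512 arrived damaged
My order number is 84512 and the player arrived with a cracked case."""

def email_to_record(raw):
    sender = re.search(r"^From:\s*(\S+@\S+)", raw, re.MULTILINE)
    order = re.search(r"order(?:\s+number)?(?:\s+is)?\s+#?(\d+)", raw, re.IGNORECASE)
    return {
        "customer_email": sender.group(1) if sender else None,
        "order_id": order.group(1) if order else None,
        "complaint_type": "damaged" if "damaged" in raw.lower() else "other",
    }

print(email_to_record(EMAIL))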

Understanding ELT — Extract, Load, and Transform

ELT stands for extract, load, and transform. It performs the same functions as ETL, but in a different order. Early databases did not have the technical capability to transform the data. Therefore, ETL tools extracted the data to an intermediary location to perform the transformation before loading the data to the data warehouse. However, this restriction is no longer a problem, thanks to technology advances such as massively parallel processing systems and columnar databases. As a result, ELT tools can transform the data in the source or target database without requiring an ETL server. Why use ELT with big data? The performance is faster and more easily scalable. ELT uses structured query language (SQL) to transform the data. Many traditional ETL tools also offer ELT so that you can use both, depending on which option is best for your situation.
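
The difference from ETL is mostly a question of where the transformation runs. In the hedged Python sketch that follows, raw rows are loaded into the target database untouched and then reshaped by a SQL statement executed inside that database; the staging and target table names are assumptions, and in practice the target would be a massively parallel or columnar warehouse rather than SQLite.

# Minimal ELT sketch: load the raw rows first, then transform them with SQL
# inside the target database. Table names and the SQL are illustrative assumptions.
import sqlite3

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE staging_orders (customer_id, first_name, last_name, amount_cents)")
target.execute("CREATE TABLE fact_orders (customer_id, customer_name, amount_usd)")

# Load: copy the raw rows into a staging table with no reshaping at all.
raw_rows = [("C1", "Ada", "Lovelace", 2599), ("C2", "Alan", "Turing", 1250)]
target.executemany("INSERT INTO staging_orders VALUES (?, ?, ?, ?)", raw_rows)

# Transform: let the target database's own SQL engine do the reshaping.
target.execute("""
    INSERT INTO fact_orders (customer_id, customer_name, amount_usd)
    SELECT customer_id, UPPER(first_name || ' ' || last_name), amount_cents / 100.0
    FROM staging_orders
""")
print(target.execute("SELECT * FROM fact_orders").fetchall())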

Prioritizing Big Data Quality

Getting the right perspective on data quality can be very challenging in the world of big data. With the majority of big data sources, you need to assume that you are working with data that is not clean. In fact, the overwhelming abundance of seemingly random and disconnected data in streams of social media data is one of the things that make it so useful to businesses. You start by searching petabytes of data without knowing what you might find after you start looking for patterns in the data. You need to accept the fact that a lot of noise will exist in the data. It is only by searching and pattern matching that you will be able to find some sparks of truth in the midst of some very dirty data. Of course, some big data sources such as data from RFID tags or sensors have better-established rules than social media data. Sensor data should be reasonably clean, although you may expect to find some errors. It is always your responsibility when analyzing massive amounts of data to plan for the quality level of that data. You should follow a two-phase approach to data quality:

Phase 1: Look for patterns in big data without concern for data quality.

Phase 2: After you locate your patterns and establish results that are important to the business, apply the same data quality standards that you apply to your traditional data sources. You want to avoid collecting and managing big data that is not important to the business and will potentially corrupt other data elements in Hadoop or other big data platforms.

As you begin to incorporate the outcomes of your big data analysis into your business process, recognize that high-quality data is essential for a company to make sound business decisions. This is true for big data as well as traditional data. The quality of data refers to characteristics about the data, including consistency, accuracy, reliability, completeness, timeliness, reasonableness, and validity. Data quality software makes sure that data elements are represented in the same way across different data stores or systems to increase the consistency of the data.

For example, one data store may use two lines for a customer’s address and another data store may use one line. This difference in the way the data is represented can result in inaccurate information about customers, such as one customer being identified as two different customers. A corporation might use dozens of variations of its company name when it buys products. Data quality software can be used to identify all the variations of the company name in your different data stores and ensure that you know everything that this customer purchases from your business. This process is called providing a single view of customer or product. Data quality software matches data across different systems and cleans up or removes redundant data. The data quality process provides the business with information that is easier to use, interpret, and understand.
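
As a very small illustration of the matching idea, the Python sketch below normalizes company-name variants and uses fuzzy string similarity to decide whether two records probably refer to the same customer. The suffix list, similarity threshold, and sample names are assumptions; commercial data quality tools apply far richer matching rules.

# Hedged sketch of matching customer-name variations for a single view of the customer.
# The suffix list, threshold, and sample names are illustrative assumptions.
from difflib import SequenceMatcher

SUFFIXES = {"inc", "inc.", "incorporated", "corp", "corp.", "corporation", "co", "co.", "ltd"}

def normalize(name):
    # Lowercase the name, drop common corporate suffixes, and collapse whitespace.
    tokens = [t for t in name.lower().replace(",", " ").split() if t not in SUFFIXES]
    return " ".join(tokens)

def same_customer(a, b, threshold=0.85):
    # Treat two records as one customer when their normalized names are nearly identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(same_customer("Acme Corp.", "ACME Corporation"))  # True: one customer, two spellings
print(same_customer("Acme Corp.", "Apex Industries"))   # False: genuinely different companies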

Data profiling tools are used in the data quality process to help you to understand the content, structure, and condition of your data. They collect information on the characteristics of the data in a database or other data store to begin the process of turning the data into a more trusted form. The tools analyze the data to identify errors and inconsistencies. They can make adjustments for these problems and correct errors. The tools check for acceptable values, patterns, and ranges and help identify overlapping data. The data-profiling process, for example, checks to see whether the data is expected to be alpha or numeric. The tools also check for dependencies or to see how the data relates to data from other databases.
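
As a rough illustration of the checks just described, this Python sketch scans a handful of records and reports, for each column, how many values are missing, whether the values look numeric, and whether they fall within an expected range. The sample rows and the expected age range are assumptions; real profiling tools run these checks at scale across databases and other data stores.

# Hedged data-profiling sketch: completeness, type, and range checks per column.
# The sample rows and the expected age range are illustrative assumptions.

ROWS = [
    {"customer_id": "C1", "age": "34", "state": "CA"},
    {"customer_id": "C2", "age": None, "state": "NY"},
    {"customer_id": "C3", "age": "141", "state": "Z9"},
]
EXPECTED_RANGE = {"age": (0, 120)}

def profile(rows):
    report = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values if v not in (None, "")]
        numeric = [v for v in present if str(v).isdigit()]
        out_of_range = []
        if column in EXPECTED_RANGE:
            low, high = EXPECTED_RANGE[column]
            out_of_range = [v for v in numeric if not low <= int(v) <= high]
        report[column] = {
            "missing": len(values) - len(present),                          # completeness check
            "all_numeric": bool(present) and len(numeric) == len(present),  # type check
            "out_of_range": out_of_range,                                   # range check
        }
    return report

for column, stats in profile(ROWS).items():
    print(column, stats)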

Data-profiling tools for big data have a similar function to data-profiling tools for traditional data. Data-profiling tools for Hadoop will provide you with important information about the data in Hadoop clusters. These tools can be used to look for matches and remove duplications on extremely large data sets. As a result, you can ensure that your big data is complete and consistent. Hadoop tools like HiveQL and Pig Latin can be used for the transformation process.

Using Hadoop as ETL

Many organizations with big data platforms are concerned that ETL tools are too slow and cumbersome to use with large volumes of data. Some have found that Hadoop can be used to handle some of the transformation process and to otherwise improve on the ETL and data-staging processes. You can speed up the data integration process by loading both unstructured data and traditional operational and transactional data directly into Hadoop, regardless of the initial structure of the data. After the data is loaded into Hadoop, it can be further integrated using traditional ETL tools. When Hadoop is used as an aid to the ETL process, it speeds the analytics process.

The use of Hadoop as an integration tool is a work in progress. Vendors with traditional ETL solutions, such as IBM, Informatica, Talend, Pentaho, and Datameer, are incorporating Hadoop into their integration offerings. By relying on the capabilities of Hadoop as a massively parallel system, developers can perform data quality and transformation functions that were not previously possible. However, Hadoop does not stand on its own as a replacement for ETL.

Best Practices for Data Integration in a Big Data World

You find a lot of potential in using big data to look at a range of business and scientific problems in new ways, find answers to unanswered questions, and begin to take immediate action that delivers significant results. Many companies are exploring big data problems and coming up with some innovative solutions. Chapters 21 and 22 present some interesting case examples. The future is exciting. However, now is the time to pay attention to some basic principles that will serve you well as you begin your big data journey.

In reality, big data integration fits into the overall process of integration of data across your company. Therefore, you can’t simply toss aside everything you have learned from data integration of traditional data sources. The same rules apply whether you are thinking about traditional data management or big data management. So keep these key issues at the top of your priority list:

✓ Keep data quality in perspective. Your emphasis on data quality depends on the stage of your big data analysis. Don’t expect to be able to control data quality when you do your initial analysis on huge volumes of data. However, when you narrow down your big data to identify the subset that is most meaningful to your organization, that is when you need to focus on data quality. Ultimately, data quality becomes important if you want your results to be understood in context with your historical data. As your company relies more and more on analytics as a key planning tool, data quality can mean the difference between success and failure.

✓ Consider real-time data requirements. Big data brings streaming data to the forefront. Therefore, you need a clear understanding of how you will integrate data in motion into your environment for predictable analysis.

✓ Don’t create new silos of information. While so much of the emphasis around big data is focused on Hadoop and other unstructured and semi-structured sources, remember that you have to manage this data in context with the business. You therefore need to integrate these sources with your line-of-business data and your data warehouse.
