Chapter 2
In-Database Processing

Organizations are collecting more structured and semi-structured data than ever before, which presents both great opportunities and great challenges in analyzing all of this complex data. In this volatile and competitive economy, there has never been a greater need for proactive and agile strategies that overcome these challenges by applying the analytics directly to the data rather than shuffling the data around. The point is that two key technologies dramatically improve performance when analyzing big data: in-database and in-memory analytics. I will focus on in-database analytics (processing) in this chapter.

BACKGROUND

The concept of in-database processing was introduced in the mid-1990s, when vendors such as Teradata, IBM, and Oracle made it commercially available in their object-relational database systems. The in-database capabilities were still in their infancy and did not really catch on with customers until the mid-2000s. The concept of migrating analytics from analytical workstations and personal computers into a centralized enterprise data warehouse sounded promising, but customers were wary of how it would work within their processes and cultures. In addition, IT and business users questioned whether the capabilities on offer would add value to an organization looking to adopt the technology. At the same time, big data arrived and became the buzzword and trend of the IT industry. Needless to say, this created the perfect storm for in-database processing to emerge as a strong choice for solving business problems.

Demand for this technology has grown as the amount of data available to collect and analyze continues to expand at an alarming rate, largely driven by the growth of the Internet and mobile devices. Customers that once collected megabytes and gigabytes of data now collect terabytes, petabytes, and beyond. As the speed of data collection increases, the pace of business has accelerated to the point where a performance gain from hours to seconds can change decision making in some industries. In addition, as more people and industries rely on data and analytics to answer important business questions, the questions they ask have become more complex, demanding more sophisticated technologies and even more precise results. Thus, the surge of big data is one of the primary reasons for the growth of in-database processing, an enabling technology for gathering, processing, and analyzing data efficiently and effectively. Let's examine how analytics are being handled currently.

TRADITIONAL APPROACH

Traditionally, in order to analyze data, the data must be extracted from a database or data warehouse to a server or personal computer. Many customers still approach analytics this way, moving the data out of the database into a separate analytics environment for processing and then back into the database. Users have largely been shielded from this movement, so they have not had to contend with the performance constraints it imposes as data are shuttled between the database and a separate analytics environment. The process also leaves duplicate and redundant data in many places, and because the data are scattered rather than centralized, users may not know where to look. Nor have they had to consider the security issues inherent in moving data from one environment to another. Figure 2.1 illustrates the traditional approach to analytics.

Figure 2.1 Traditional approach to analytics

Figure 2.1 shows the common, traditional process and architecture that many businesses use to develop and deploy analytic data models in their infrastructure. In this approach, data are extracted from a variety of sources (mainframes, CRM applications, call centers, etc.) ranging from enterprise data warehouses to data marts across multiple lines of business. The data are collected, transformed, and integrated into a development analytic data set (ADS). The ADS is typically a large flat data structure, such as a flat file, that contains hundreds of candidate variables, where each row represents an observation on the unit of interest, such as a customer (this is also known as the modeling ADS). These data are then used to develop analytic data models within an analytic environment. Once the model is developed, tested, and validated, it is exported into the scoring environment, which is typically based on production or operational data. For scoring purposes, the data are again extracted and prepared, based on the model requirements, into the scoring ADS (also sometimes referred to as the score table). This table contains only the variables used in the analytical model but may contain millions of records to be scored. Scoring is often executed on a scoring server.

The traditional approach has many moving parts and processes, including extensive data movement and replication. It presents many challenges within an organization, in particular around data validation, data quality, performance, economics, and governance. The downsides of the traditional process include:

  • Moving the data through the network can be slow: imagine transferring terabytes or petabytes of data across a network with limited bandwidth. This process can take hours or even days to complete, and along the way it may degrade the corporate network and the productivity of others within the organization.
  • Duplicating the data adds cost and creates another data silo: copying millions or billions of rows of data produces another data set to maintain and support, which creates a data silo and adds cost to the infrastructure (storage, backup, etc.).
  • Governing the data can be challenging: because data are scattered and collected from many sources to be analyzed, it is cumbersome to track where the data are being used and how they are being analyzed.

In the traditional process, the data management process is separate from the analytical application, and the two are rarely integrated to provide end-to-end capabilities. As mentioned in the Introduction and Chapter 1, data management is a key element of effective data analysis, and the data preparation work on the data management side can be very time consuming and labor intensive. Having separate processes and disconnected technologies is costly and inefficient.

Applying the analytics where the data reside and leveraging the power of the database eliminates the cost, time, data quality, and security issues associated with the traditional approach, simply by processing the analytics in the data warehouse.

IN-DATABASE APPROACH

Most organizations will confess that there is no shortage of data to analyze. Organizations often have more data than they are even aware of, and more data are collected daily. Of course, the challenge is to find the most effective way to process the data, ensure its accuracy, store it inexpensively, and increase the speed and flexibility with which end users can examine it from multiple perspectives to uncover the valuable and strategic insights it contains. The good news is that there is a way: in-database processing.

In-database processing refers to the integration of advanced analytics into the data warehousing platform functionality. Many analytical computing solutions and large databases use this approach because it provides significant performance improvements over the traditional methods. Thus, many business analysts have adopted in-database processing and have been able to realize the valuable business benefits of streamlined processing and increased performance. Figure 2.2 shows the in-database approach to integrating data management and analytics.

Figure 2.2 In-database approach to analytics

With in-database processing, business analysts and IT users have the ability to explore the data, prepare it for analysis in an analytical data set, develop complex data models, and score the model, covering the end-to-end data life cycle entirely within the database or data warehouse. Doing so removes the need either to move or extract the data into a separate environment or to convert the analytical code into something that can be executed on the data platform.
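To make the contrast concrete, the following is a minimal sketch of the push-down idea, assuming a generic database connection and purely illustrative object names (a warehouse.db file and a transactions table with customer_id and amount columns): instead of extracting raw rows to an analytics server, a summary query is submitted to the database and only the small result set comes back.

    # Minimal sketch of pushing work to the database (illustrative names only).
    import sqlite3

    conn = sqlite3.connect("warehouse.db")   # stand-in for any warehouse connection

    # Instead of "SELECT *" and aggregating client-side, let the database do
    # the heavy lifting and return only the summarized result.
    summary_sql = """
        SELECT customer_id,
               COUNT(*)    AS txn_count,
               SUM(amount) AS total_spend,
               AVG(amount) AS avg_spend
        FROM   transactions
        GROUP  BY customer_id
    """
    summary_rows = conn.execute(summary_sql).fetchall()   # small result set, not raw data
    conn.close()

The same pattern underlies the success stories that follow: the computation runs where the data live, and only results move.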

THE NEED FOR IN-DATABASE ANALYTICS

An in-database analytics approach is much faster, more efficient, more productive, and more secure than traditional analytics approaches. In-database analytics delivers immediate performance, scalability, and governance improvements because the data never leave the database; only filtered or processed results do. One of the main advantages of in-database analytics is eliminating the movement of data, which is accomplished by integrating the analytical functionality with the database.

By leveraging the in-database analytics capabilities, organizations can now make business decisions that were not previously possible. By executing predictive or advanced analytics directly in a database, users can perform critical drill-downs and deep analyses that were previously not possible or practical. In addition, the in-database approach allows businesses to analyze big data faster, more accurately, and more cost-effectively than ever before in even the most complex, data-intensive environments.

Innovative and leading-edge organizations have realized that effective use of analytics enables them to outperform their peers and improve key processes such as customer targeting and retention, product development, pricing, risk assessment and management, marketing, and sales. It has become a motto that the more advanced your analytical capabilities are, the better equipped you will be to construct and fine-tune successful business strategies for data-driven decisions. These data-driven decisions will increase productivity, efficiency, the bottom line, and even brand value.

This means that, for example, a telecommunications organization can conduct more thorough and meticulous churn analysis and create optimal pricing programs. Insurance companies can run more effective exposure assessments, improve fraud detection, and develop targeted customer retention programs. Financial institutions can enhance their ongoing portfolio analysis activities, upsell customers with new products or offerings, better mitigate risk, and increase individual customer profitability. Marketers in virtually every industry can conduct real-time campaign analysis, enabling them to tweak or abandon campaigns in order to maximize returns on their marketing spend.

From C-level executives to line-of-business managers and frontline employees in sales, service, finance, marketing and other disciplines, leveraging in-database analytics will improve performance, productivity, innovation, and overall competitive advantage.

SUCCESS STORIES AND USE CASES

There have been a number of success stories and use cases for in-database analytics since its introduction to the industry. As previously mentioned, in-database analytics is used to explore the data, prepare the data for analysis, and develop and deploy the model. Let's examine some success stories and use cases for in-database processing.

E-Commerce: Data Preparation and Data Processing

Our first success story comes from a multinational corporation and e-commerce company that provides consumer-to-consumer and business-to-consumer sales services via the Internet. This corporation leverages in-database processing to prepare the data and increase end-to-end performance.

Alongside a very large and sophisticated data management system, the company runs an extensive analytic ecosystem that analyzes the petabytes of data stored in its data platform. The data platform is used to discover, explore, analyze, and report on the data using various analytics tools. Figure 2.3 shows data integration, the data platform, and data analytics.

Figure 2.3 Data and analytic ecosystem

The data platform is where in-database processing is executed. In Figure 2.3, the data platform provides a strong foundation for analytics, holding 36 petabytes of structured and semi-structured customer and behavioral data consisting of transactions and web logs. Analyzing millions of rows and hundreds of columns of data within a finite time frame using the traditional approach presented many constraints, including transferring large amounts of data, managing massive storage space with duplicated data, and administering system resources. In addition, the volume and frequency of the data to be processed opened up several challenges around data preparation and transformation, data transfer between the data platform and the data access and analytics layers, data set storage, and optimal system resource utilization, leading to increased data set processing costs and scalability bottlenecks.

Because of these limitations in the traditional approach and infrastructure, the customer decided to adopt in-database technologies as a flexible and efficient way to analyze the increasing amounts of data being collected. By integrating analytics into the database or data warehouse and applying the analytics where the data reside, the company is able to scale and achieve better performance than with its previous approach and infrastructure. Moving data preparation, analytics, and reporting tasks to where the data reside offers many benefits, such as increased speed and performance, reduced data movement, and enhanced data governance for the IT department. For decision makers, the in-database approach provides faster access to analytical results, which leads to more agile and accurate data-driven decisions. Let's examine the before-and-after picture and the benefits of in-database processing.

Figure 2.4 provides a look at the traditional approach to analyzing customer behavior from summarized tables processed on a daily basis. The process involves creating a data set from an underlying summarized table in the data warehouse and consists of four major steps, all triggered by a wrapper script. First, the parameters necessary for the analysis are entered into a text file (the parameter source file) as macro variables, and a sequence of SQL statements is executed on the database, which yields a summarized table in the data warehouse. This table acts as the source data for all subsequent processing. In the next step, a program uses a fast-export process with multiple sessions to transfer the table from the data warehouse onto the analytics server, where a data set is created. The analysis is then performed on this data set to provide test analysis at different dimensions based on multiple aggregations, and the output for each dimension is appended to the final output data set. The final step transfers the results from the analytical data set back to a table in the data warehouse using a fast-load process. This output is visualized on dashboards using a visualization or business intelligence tool for end-user consumption.

Figure 2.4 Traditional approach
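For contrast with the in-database sketch shown earlier, here is a minimal, hedged sketch of the pull-compute-push pattern just described, assuming a database connection plus pandas and purely illustrative object names (daily_summary, clickstream_events, test_cell, metric); the actual implementation used fast-export and fast-load utilities rather than a simple query.

    # Sketch of the traditional pattern: summarize in the warehouse, export the
    # table to the analytics server, aggregate there, and load results back.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("warehouse.db")   # stand-in for the data warehouse

    # Parameters drive a SQL step that builds the summarized source table.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS daily_summary AS
        SELECT customer_id, test_cell, metric
        FROM   clickstream_events
    """)

    # Fast-export stand-in: pull the whole table to the analytics environment
    # (the expensive transfer).
    ads = pd.read_sql("SELECT * FROM daily_summary", conn)

    # Aggregate at different dimensions on the analytics server.
    by_cell = ads.groupby("test_cell")["metric"].agg(["count", "mean", "std"])

    # Fast-load stand-in: write the output data set back to the warehouse for dashboards.
    by_cell.reset_index().to_sql("test_results", conn, if_exists="replace", index=False)
    conn.close()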

Table 2.1 illustrates the process and time for each step in the traditional approach. A typical test analysis usually consists of over 200 million records with over 100 variables of customer behavior.

Table 2.1 Traditional Run Times by Process

Process | Description | Time (Minutes)
SQL running on the data warehouse | Create a table from the data warehouse to be analyzed in the analytics server | 32
Data transfer (data warehouse to analytics server) | Transfer of table from data warehouse to analytical data set using fast-export utility | 52
Statistical computation | Processing of data at different dimensions | 233
Data transfer (analytics to data warehouse) | Transfer of output data set to data warehouse | 6
Total | | 323

Table 2.1 clearly shows that the SQL step takes around 30 minutes to create a summarized table in the data warehouse. Transferring the data from the data warehouse to the analytics server takes almost another hour. Looking forward, as the data volume increases, the processing time will increase proportionally, so this approach does not scale. The analysis is provided for different dimensions of the data set: the data set is aggregated multiple times, and the analysis is performed on the summary data set created after each aggregation. This statistical computation takes around four hours, which is the bulk of the total processing time. The final transfer from the output data set back to the data warehouse takes about six minutes. On the whole, the standard process consumes around five hours, which is too time consuming and resource intensive to maintain. In addition, any "tweaks" or changes to the process take several business days to test and implement.

There are several challenges with the traditional process. First and foremost, the process uses a fast-export approach with multiple sessions, which can consume a lot of bandwidth on the server. In addition, the data transfer may fail if the volume of data is massive, forcing the process to be restarted. More system resources are needed to process such a volume of data, so only one or two processes can be executed at a time, which restricts the ability to complete the analysis within a tight time frame.

In-Database Process

Now let's examine the effects of leveraging in-database technology in this process. Table 2.2 compares the processing time between the traditional and in-database approach.

Table 2.2 Traditional versus In-Database Run Times by Process

Process | Description | Time (Minutes), Traditional Approach | Time (Minutes), In-Database
SQL running on the data warehouse | Create a table from the data warehouse to be analyzed in the analytics server | 32 | 32
Data transfer (data warehouse to analytics server) | Transfer of table from data warehouse to analytical data set using fast-export utility | 52 | 0
Statistical computation | Processing of data at different dimensions | 233 | 45
Data transfer (analytics output to data warehouse) | Transfer of output data set to data warehouse table | 6 | 6
Total | | 323 | 83

With in-database processing, the transfer of data from the data warehouse to the analytical server is eliminated, which cuts the processing time by almost an hour. Statistical computation takes around 45 minutes because the aggregation is pushed to the database side and a summary data set is created there, which acts as the source for further processing. The final transfer from the analytical data set to the data warehouse remains around 6 minutes. On the whole, the in-database process takes 83 minutes.

The initial step of running SQL on the database is the same in both processes, taking around 32 minutes. The data transfer step is completely eliminated in the in-database process because the aggregations are performed in the database rather than on the analytics server. The run time for statistical computation is drastically reduced in the in-database process compared to the standard process. This is a major improvement over the traditional process and the way analytics is executed, and it is achieved by using in-database processing, which converts analytical procedures to SQL. The SQL code is executed on the database, aggregating the data warehouse table multiple times by taking advantage of a highly optimized, massively parallel processing system and performing the functions in parallel. In the standard process, by contrast, the data are transferred from the data warehouse table to an analytical data set, and the aggregations are performed on that analytical data set. The final step remains the same in both processes: the analytical data set is loaded back into the data warehouse table. The traditional process takes 323 minutes; the in-database approach reduces that to 83 minutes, roughly a fourfold improvement in performance.

This approach decreases end-to-end processing times while increasing processing throughput. There has been an overall reduction in system resource utilization, which allows for a greater degree of parallelism and performance, and a dramatic reduction in disk space utilization because data no longer have to be stored on the analytics server. Because the processing time was reduced, additional insights and analyses could be provided for all the tests without a major increase in run time. The process also consumes less data transfer bandwidth on both servers (analytics and data warehouse) and on the network, so other processes can take advantage of the servers, and data governance is enhanced. Finally, it is less prone to issues such as timed-out sessions and terminations.

Advantages of in-database processing include:

  • Decreasing processing time by a factor of four
  • Eliminating input and output costs, as there is no transfer of data
  • Summarizing large amounts of data on the DBMS side, which is faster because most database systems are highly optimized, scalable, and tuned
  • Applying analytic functions directly to the data
  • Increasing throughput by reducing system resource utilization and allowing for a greater degree of parallelism
  • Increasing performance and eliminating data movement
  • Optimizing resource utilization across an integrated analytic/warehousing environment
  • Improving data quality and integrity
  • Reducing network traffic
  • Enhancing the effectiveness of analysts, who can stay focused on higher-value tasks
  • Lowering the total cost of ownership/increasing return on investment

This e-commerce company continues to use in-database processing on a daily basis. One piece of feedback they shared is that not all of the analytical functions they use are enabled in-database, so some processing continues to run through the traditional process. In-database processing is great when it is available, and it remains complementary to the overall analytical architecture.

Future Direction

This e-commerce customer continues to use in-database analytics. It is planning to expand the use of in-database analytics to other departments and is exploring the integration with Hadoop. As data volumes continue to grow, the company is adding more capacity to manage the enterprise data warehouse and complement the in-database analytics to prepare and explore various types of data.

Telecommunication: Developing a Data Model Using In-Database Technology

In our next success story and use case, we have a customer based in Europe, one of the largest telecommunication carriers in Central Europe. With company earnings under pressure from falling prices, declining customer loyalty, and a highly competitive market, there is an enormous need to understand customer behavior in order to stem rising churn rates.

Churn analysis is important in every industry, but particularly in the telecommunication sector. Churn analysis is the process of identifying potential defectors and determining the causal factors so that the company can try to prevent the attrition from happening in the future. Every telecommunication provider uses churn prediction for segment early warning, customer retention, and marketing analysis. Segment early warning is an application of churn analysis in campaign management; it is essentially a combination of customer churn segmentation and customer value segmentation. The resulting subsegments (e.g., highest-value customers with neutral churn risk between 30% and 70%) are used for differentiated retention offers to the customer base and help reduce the overall costs of customer retention. Building an effective churn analysis relies heavily on granular customer data, which means large amounts of data and a variety of data sources.

One of the key aspects of churn analysis is minimizing modeling latency so that the business can react to market changes in a timely manner. To complete the end-to-end analysis, an automated scoring process should be included so that churn scores are available on an hourly, daily, or weekly basis. At the same time, taking advantage of the large amounts of granular usage data in the form of call detail records, and optimizing the churn model, requires a tightly integrated data warehousing and analytics environment that minimizes latency between the two environments, which historically had been only loosely integrated at this organization.

Background

Working with this customer, the project started from the business background and a well-defined business problem. From the business background, we learned that the analysts work closely with business representatives, for example from organizational groups such as segment management or the retention office, to formulate the analytical "problem" at hand. The analyst then begins the model-building process, typically starting with gathering and preparing the data (statistical data preparation) before actually developing and validating the model. Once the analyst settles on the model deemed best suited to predict customer churn and address the business problem, that model has to be deployed into an automated and governed IT process for the recurring scoring of new data.

The following sections explain how in-database processing and analytics can be applied in the various phases of this model-building process to predict customer churn and act on the results.

Creating a Data Set

The most essential component of information is the customer-level analytical data set. A single data record in this data set is also referred to as the customer analytic record. A series of data preparation steps is usually required to create this record-level analytical data set. The data include contract information, customer contact information, recharging for prepaid accounts, customer lifetime value, and other behavioral characteristics including service usage. Behavioral information can be found in the call detail records. Such usage information, gathered and transformed at the record level for each customer entity, typically includes the number of calls or minutes of use, the ratio of weekday versus weekend calls, or similar metrics. The granular call detail record data at the call level are typically aggregated to a weekly or monthly granularity before being used in model development.

The behavioral data aggregated from call detail records are combined with contract, recharging, and customer contact data. Everything is available in full history and detail in the data warehouse. The aggregation and combination steps are executed directly in the data warehouse and yield several hundred attributes ready to be used in churn modeling. The combination of data is typically implemented as joins, sorts, and merges within the data warehouse.

One best practice is to maintain the resulting analytic data set at the data warehouse level. This means ensuring that all relevant changes in customer behavior are traced and recorded and that proper timestamps are assigned to these changes. Generally, this process is a prerequisite to establishing the data warehouse as the corporate repository. In this case, all customer-related behaviors are tracked over a certain period and used for the actual modeling to develop a churn model.
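As an illustration of these aggregation and combination steps, the following is a hedged SQL sketch of building the customer analytic record inside the warehouse; all table and column names (contracts, call_detail_records, call_minutes, is_weekend) are assumptions made for the example, not the carrier's actual schema.

    # Sketch: aggregate call detail records per subscription and join the result
    # with contract data, entirely inside the data warehouse. Names are illustrative.
    car_sql = """
        CREATE TABLE customer_analytic_record AS
        SELECT c.subscription_id,
               c.tariff_plan,
               c.tenure_months,
               u.call_count,
               u.total_minutes,
               u.weekend_call_ratio
        FROM   contracts c
        JOIN  (SELECT subscription_id,
                      COUNT(*)          AS call_count,
                      SUM(call_minutes) AS total_minutes,
                      AVG(is_weekend)   AS weekend_call_ratio
               FROM   call_detail_records
               GROUP  BY subscription_id) u
          ON   u.subscription_id = c.subscription_id
    """
    # conn.execute(car_sql)  # submitted via the warehouse connection; no rows leave the database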

Data Exploration

Before building the analytical data set, the analytic user would work against tables in a customer database, either a stand-alone data model or a domain within an enterprise data model inside the database or data warehouse, for all customers in the active population. Based on the understanding of the source data, the analytic user would then move to the next step, building the analytical data set as input for developing the analytic model.

Using in-database functions such as summary, frequency, and means, the analyst further explores the analytical data set at this stage to understand the behavior of attributes across subscriptions or groups of customers within the population, identifying expected behavior as well as unexpected patterns in subgroups. This is especially effective when frequency analysis is combined with stratified analysis, computing statistics across the full analytical data set as well as within the churner and nonchurner groups of subscriptions. Summary, means, and frequency all compute descriptive statistics for variables across all customer records.
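The exact procedure syntax is product-specific, so the following is only a generic SQL stand-in for the in-database summary, means, and frequency functions, stratified by a churn flag; the column names (churn_flag, total_minutes, weekend_call_ratio) are illustrative assumptions.

    # Sketch: descriptive statistics computed inside the database, stratified by
    # churner versus nonchurner subscriptions. Column names are illustrative.
    explore_sql = """
        SELECT churn_flag,
               COUNT(*)                AS subscriptions,
               AVG(total_minutes)      AS avg_minutes,
               MIN(total_minutes)      AS min_minutes,
               MAX(total_minutes)      AS max_minutes,
               AVG(weekend_call_ratio) AS avg_weekend_ratio
        FROM   customer_analytic_record
        GROUP  BY churn_flag
    """
    # Only this small summary table is returned to the analyst; the detail rows stay put.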

Data Preparation

Having explored the source data for mediated call data, billing data, and recharging data, we can now transform these sources into customer analytic records at subscription-level granularity (a single customer can have multiple subscriptions), containing calling behavior, information about how a customer recharges his or her phone, and many more attributes for model development. For churn-prediction modeling, the analytic data set captures customer analytic records for both active and churned subscriptions.

In the data preparation step, the analyst uses the customer or related subscription behavior derived from the analytical data set to refine existing attributes or generate additional attributes in the analytic data set using statistical derivations. A good example is principal component analysis (PCA) using the in-database principal components procedure. Depending on the research objectives, a similar technique would be factor analysis, also available as an in-database factor function.

One mathematical technique used in this churn analysis is combining variables into factors and reducing the number of variables under consideration in order to obtain the most information about the customers and develop an effective data model.

A typical application in churn-prediction modeling would be to reduce a group of usage behavior attributes (e.g., the 3-month, 6-month, and 12-month average number of calls) with factor analysis. As a result, we can use a reduced number of attributes for modeling, which has a number of advantages, both statistical and practical, over using a larger number of attributes.

Alternatively, the modeler can use the in-database-capable variable selection procedure to run through this step in a more automated fashion. Many data-mining databases have hundreds or thousands of potential inputs, or model attributes. The variable selection procedure allows quick identification of the input variables that are useful for predicting the target variable based on a linear models framework.
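To illustrate the dimension-reduction idea (not the vendor's in-database procedure itself), here is a minimal scikit-learn sketch that collapses a group of correlated usage attributes into two components on a sampled extract; the file name and column names are assumptions made for the example.

    # Sketch of reducing correlated usage attributes (3-, 6-, 12-month averages)
    # to a few components. Runs client-side on a sample purely for illustration;
    # the chapter's approach performs the equivalent step in-database.
    import pandas as pd
    from sklearn.decomposition import PCA

    usage_cols = ["avg_calls_3m", "avg_calls_6m", "avg_calls_12m",
                  "avg_minutes_3m", "avg_minutes_6m", "avg_minutes_12m"]
    sample = pd.read_csv("car_sample.csv")          # assumed sample of the analytic data set

    pca = PCA(n_components=2)
    sample[["usage_factor_1", "usage_factor_2"]] = pca.fit_transform(sample[usage_cols])
    print(pca.explained_variance_ratio_)            # information retained by the two factors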

Model Development

There are several data-mining methodologies on the market that are recommended sources for more process-specific details. In this section, we will focus on demonstrating how the in-database technology can be applied in the model-development process.

Logistic regression is a robust method, and we see it used very often for building churn-prediction models. To prepare for modeling with logistic regression, a certain number of churned lines are combined with a number of still-active lines.

The analyst can use in-database procedures such as linear and logistic regression. Linear regression attempts to predict the value of a continuous target as a linear function of one or more independent inputs. Logistic regression attempts to predict the probability that a categorical (binary, ordinal, or nominal) target takes on the event of interest as a function of one or more independent inputs.

In our example of churn prediction modeling, the logistic regression will be used when a categorical target attribute is used (note that this is the predominant case). The function supports forward, backward, and stepwise selection methods. It also allows the scoring of a data set.

The regression procedure computes a model in which the dependent churn target variable is modeled as a linear equation of multiple independent variables by a least squares function. We might consider REG, for example, when we decide to define the churn target attribute as a percentage decrease in usage. The REG procedure has numerous model-fitting options, including many model selection variations and hypothesis tests. Regression analysis is often used to identify a subset of independent variables that have unique and significant relationships with the dependent variable. In this case, the REG procedure is used for exploratory analysis, in addition to its extensive model-fitting and reporting functions.
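As a generic illustration of this modeling step (using scikit-learn rather than the in-database logistic regression procedure), the sketch below fits a churn model on a sampled modeling data set; the file name, feature list, and churn_flag target are assumptions.

    # Sketch: logistic regression churn model on a sampled modeling data set.
    # Feature names and the churn_flag target are illustrative.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    ads = pd.read_csv("modeling_ads.csv")           # churned plus still-active subscriptions
    features = ["total_minutes", "weekend_call_ratio", "tenure_months",
                "usage_factor_1", "usage_factor_2"]

    X_train, X_test, y_train, y_test = train_test_split(
        ads[features], ads["churn_flag"], test_size=0.3, random_state=42)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))
    churn_risk = model.predict_proba(X_test)[:, 1]  # predicted churn probability per line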

As a result of this core model development step, the analyst defines the churn prediction model to be deployed for regular scoring of newly incoming data inside the database.

Model Deployment

In-database scoring is used in the deployment step of the churn prediction modeling. Deployment of models in the customer retention business process requires recurrent scoring of the active subscriber base with the selected churn model.

The scoring function produces churn scores per subscription in the form of a table in the data warehouse that contains predicted churn risk and/or residuals from a model that is estimated using the regression procedure in-database. The in-database scoring function dynamically generates SQL code for the given model. This function then submits the SQL code to the database, which produces a table without having to extract any rows of data.
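The essence of that translation can be sketched as follows: the fitted model's intercept and coefficients are folded into a SQL expression that the warehouse evaluates for every subscription. The coefficient values and column names here are made up for illustration; the real scoring function generates this SQL automatically.

    # Sketch: turn a fitted logistic model into SQL so scoring happens in-database.
    # Intercept, coefficients, and column names are illustrative only.
    intercept = -2.1
    coefficients = {"total_minutes": -0.004,
                    "weekend_call_ratio": 1.3,
                    "tenure_months": -0.08}

    linear_term = " + ".join(f"({weight}) * {column}" for column, weight in coefficients.items())
    score_sql = f"""
        CREATE TABLE churn_scores AS
        SELECT subscription_id,
               1.0 / (1.0 + EXP(-({intercept} + {linear_term}))) AS churn_probability
        FROM   customer_analytic_record
    """
    # conn.execute(score_sql)  # the database computes the scores; no rows are extracted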

Table 2.3 summarizes the value and benefits this telecommunication company achieved using in-database technology.

Table 2.3 Benefits of In-Database Processing

Capability Value
Streamline analytic workflow
  • Accelerate data discovery
  • Minimize data preparation
  • Decrease time to value
Scalability and performance
  • Reduce data movement
  • Leverage the database for parallelization
Data consistency
  • Reduce data redundancy
  • Reduce information latency
Fit for IT
  • Enable data governance
  • Increase hardware utilization
  • Integrate with resource management
  • Facilitate standardization on a single enterprise analytics platform
Productivity
  • Increase business user productivity
  • Minimize network traffic
  • Provide high quality analytical data models

In addition, the customer shared these positive impacts on the business:

  • Receives an early warning when customer behavior changes, enabling fast reaction with new offers.
  • Improves customer retention and the bottom line. Sales people can bring up all relevant data on screen while talking to a customer.
  • Isolates high churn rates in a customer's geographical or social network neighborhood, which enables proactive retention and marketing campaigns before churn propagates.
  • Expands lines of business by executing cross-sell promotions into other markets.

Churn analysis is applicable in every industry, but more so in the telecommunication industry due to its highly competitive nature and plethora of options for consumers. By having the right technology and infrastructure, this telecommunication company was able to turn around its declining business and ultimately be proactive by responding to customer behavior, instead of just reacting when it is too late. To borrow a quote from Bill Gates, “Your most unhappy customers are your greatest source of learning.”1

Financial: In-Database Model Development and Deployment

A major European financial organization identified that its cycle time from model initiation to model deployment was unsatisfactory given the changes in a global, competitive economy. The process was manual, error-prone, and resource-intensive, with little or no monitoring to identify model degradation.

This organization used in-database technology to develop a flexible and efficient infrastructure for data management, model development, and model deployment. The platform harnesses the power of the database environment for data preparation and model scoring and uses the power of analytics to build complex and comprehensive analytical data models. By leveraging the massively parallel power of the database, over 55 million records can be scored within the database many times a day, something that could not have been accomplished with the older, traditional process. Figure 2.5 shows the process that this customer has implemented within its organization.

Figure 2.5 In-database process

By adopting in-database processing, the organization reduced the many months it typically took to promote a model into production down to days. There was a 40 percent reduction in data preparation time, and analysts are 50 percent more productive, freeing them to pursue more strategic initiatives. This financial company was able to improve performance, economics, and governance.

Let's examine how this financial institution was able to apply analytics and the challenges they encountered.

Challenges with the Traditional Approach

Due to the competitive nature of the financial industry and the evolving needs of their valued customers, the traditional approach was not meeting their needs to maintain a competitive advantage and leadership in their market. The top three challenges that this institution faces are common to many other organizations:

  1. Manual process
    The steps to prepare the data and build a data model require multiple areas of the business to be involved. In a typical project, an analyst builds the model, an IT team executes and deploys the model, and a database administrator centrally manages the data warehouse. As each business area hands off the results to another area, there is a delay, because each area has to check and validate what has happened to the model and data. It is also common for a model to be recoded from one language to another, and there may be many iterations of the model before the best model is ready. Because of so much manual involvement, incorrectly coded models can be deployed into a production environment, leading to incorrect results, which could have devastating consequences.
  2. Data governance
    In the traditional approach, data sources are scattered across the organization, and the data used for model development are not in the right format. Analysts tend to massage the data by creating their own analytical data marts or data silos, and they generate additional summarizations of the data. All of these data preparation methods lead to inconsistent data, with different versions of the data used to build the model. As a result, the data in the model development environment are inconsistent and may lead to different results. Because the data are distributed, some are kept on file servers, some on mapped drives, and, in extreme cases, some on local hard drives. This causes major data governance issues because there are so many silos of data, and data on local hard drives also means processing on the desktop, which can mean poor performance.
  3. Disjointed technologies
    Data in a data warehouse are used by many different analytical teams, which may result in multiple copies of data in multiple locations. All of these duplicates lead to extra storage and large data transfers across networks. Needless to say, analysts might use different analytical software to develop the models, so when it is time to promote a model to a production environment, analysts may have to translate the model or recode it into a common standard. Models degrade over time, so they must be constantly monitored and managed, yet model monitoring in the production environment is not consistent because different spreadsheets can be used to monitor the model. Because different technologies are leveraged at different stages and none of them are integrated, the result can be inconsistent models and outputs.

In-Database Approach

To mitigate the risks and challenges just described, the customer conducted a proof of concept with in-database technologies and saw immediate value from the readout. The customer saw that in-database processing takes advantage of the massively parallel processing (MPP) architecture of the database or data warehouse for scalability and better performance. Moving data management and analytics to where the data reside is beneficial in terms of speed and performance; it also reduces unnecessary data movement and promotes greater data governance.

  1. Data preparation
    The data warehouse provides the primary source of data for analysis. An ETL tool is used to extract, load, and transform (ELT) the data that produce the analytical data set. During the extraction phase, a data source or system is accessed and only the data that are needed are obtained. The data are then staged in the data warehouse using high-speed connectors to ensure that they are loaded quickly, and they are transformed into a structure that is fit for model development.
  2. Model development
    Developing an analytical data model is typically performed by an analyst. The modeler provides various options that automatically treat the data to handle many scenarios, such as outliers, missing values, rare target events, skewed data, correlated variables, variable selection, and model building and selection. When model development is complete, the analyst registers a model package centrally in metadata. A model package contains all of the scoring code that is needed when the model is deployed; the scoring code is optimized to include only the necessary transformations, eliminating any that are redundant.

    The model package contains information about the name of the algorithm that was used, the name of the data-mining analyst, and the time of creation. This is useful information for data governance. A model package is associated with a project. This enables the model deployment team to see the model that has been built by the analyst and all of the associated output. A review of the model ensures that the right steps have been taken and a suitable and robust model is released into the production environment.

  3. Model deployment
    Once a model has been reviewed, signed off, and declared ready for production, it is designated the champion model. The model is converted into a vendor-defined function and then placed in the data warehouse. A vendor-defined function is a database-specific function that contains all of the scoring logic required for the model to run. By placing the vendor-defined function in the data warehouse, security, auditing, and administration can be applied and leveraged.

    The model can now be scored. The scoring process can take place in-database and is scheduled to run at a specific time interval or when it is initiated by a trigger (a minimal sketch of this scheduled scoring and monitoring flow follows this list).

  4. Model monitoring and retirement
    Once a model is in a production environment and is being executed at regular intervals, it is centrally monitored, because its predictive performance will degrade over time. You can use a model manager tool to view a variety of monitoring reports that have been created. When the model's performance degradation hits a certain threshold, the model is retired and replaced with a new model that has been recalibrated or rebuilt.
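The following is a minimal sketch of steps 3 and 4, assuming a generic database connection, a hypothetical deployed scoring function named score_model_v1, and illustrative table names and threshold; it is not the organization's actual implementation.

    # Sketch: run the deployed scoring function in-database on a schedule and
    # flag the model for retirement when a monitored metric crosses a threshold.
    # The scoring function, tables, and threshold are hypothetical.
    import sqlite3

    RETIREMENT_THRESHOLD = 0.65   # e.g., minimum acceptable model performance statistic

    def score_and_monitor(conn):
        # Step 3: scoring runs inside the database via the deployed function.
        conn.execute("""
            CREATE TABLE IF NOT EXISTS account_scores AS
            SELECT account_id,
                   score_model_v1(balance, utilization, delinquencies) AS risk_score
            FROM   accounts
        """)
        # Step 4: retrieve only the monitoring metric, never the scored rows.
        (performance,) = conn.execute(
            "SELECT current_performance FROM model_monitoring WHERE model = 'risk_v1'"
        ).fetchone()
        if performance < RETIREMENT_THRESHOLD:
            print("Model degraded; flag for recalibration or retirement.")

    # score_and_monitor(sqlite3.connect("warehouse.db"))  # typically run by a scheduler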

Benefits of Using In-Database Processing

In addition to the increased productivity as indicated previously, this financial company experienced many other value-added benefits with in-database processing. Following is a list of tangible and intangible benefits gained from using in-database technologies:

  • Reducing model development time since data are integrated and consistent.
  • Decreasing deployment time because the scoring code needs no conversion, which eliminates the time-consuming and error-prone manual process of translating the model.
  • Faster scoring processes, since the model can be scored in the database, leveraging the scalability and processing speed the database offers. This reduces scoring times from hours and days to minutes and seconds.
  • Minimizing risk by using consistent processes and integrated technologies for model development and deployment.

The customer successfully focused its efforts and attention on the core of analytics: model development and deployment. Now I want to draw your attention to a topic that is near and dear to my profession: data quality. Without addressing data quality, analyses and results can be misleading and deliver incorrect insights for making business decisions.

IN-DATABASE DATA QUALITY

The adage “garbage in, garbage out” becomes an unfortunate reality when data quality is not addressed. This is the information age, and we base our decisions on insights gained from data. If inaccurate data are entered without subsequent data quality checks, only inaccurate information will prevail. Bad data can affect businesses in varying degrees, ranging from simple misrepresentation of information to multimillion-dollar mistakes.

In fact, numerous studies have concluded that data quality is the culprit behind many failed data warehousing and analytics projects. With the large price tags on these high-profile initiatives and the importance of accurate information to business analytics, improving data quality has become a top management priority.

With big data, you now have to store and manage large amounts of data that need to be cleansed and integrated. By integrating data management and data quality in-database, you gain a consistent, reliable view of your organization and ensure that critical decisions are based only on accurate information. If you trust your data, you can trust your decisions.

Background

According to a Gartner report titled “Measuring the Business Value of Data Quality,” 40% of business initiatives fail due to poor data quality, and big data warehousing and analytics projects have failed because of data quality issues. The Data Warehousing Institute reports that billions of dollars are lost because of poor data quality.2 That figure accounts only for U.S. business losses in postage, printing, and staffing overhead; frighteningly, the real cost of poor data quality is much higher. Beyond wasted resources, there are disgruntled customers, decreases in sales revenue, erosion of credibility, and the inability to make sound business decisions. Sometimes the effects of bad data cause enough damage to bring about complete business failure.

Organizations depend on data. Regardless of the industry, revenue size, or the market it serves, every organization relies on its data to produce useful information for effective business decision making. Unfortunately, with so much emphasis on such information, data quality rarely gets the attention it deserves. No one wants to consciously admit that business decisions are based on inaccurate or incomplete data. TDWI revealed that 75 percent of organizations have no data quality processes in place.3 It appears that the majority of the businesses have taken no steps to determine the severity of data quality issues and its impact on the bottom line.

Companies invest hundreds of thousands of dollars and significant portions of their information technology (IT) budgets in building sophisticated databases and data warehouses. In the quest for successful business intelligence, various applications and systems are deployed and information-gathering processes are created. However, many overlook the essential fact that it is the underlying data that matter. All of the fantastic screens and reports in the world will not make a positive difference if the data that support the system are inconsistent, redundant, and full of errors.

There are many reasons why the quality of the data that companies collect and analyze is so poor. The reasons range from the ambiguous nature of the data itself to a reliance on perfect data entry. But none is more compelling than the simple fact that companies rely on so many different data sources to obtain a holistic view of the business.

Information collection is increasing more than tenfold each year, with the Internet a major driver in this trend. As more and more data are collected, the reality of a multi-channel world that includes e-business, direct sales, call centers, and existing systems sets in. Bad data (i.e., inconsistent, incomplete, duplicate, or redundant data) are affecting companies at an alarming rate, and the dilemma is how to optimize the use of corporate data within every application, system, and database throughout the enterprise.

Take into consideration the director of data warehousing at a major electronic component manufacturer who realized there was a problem linking information between an inventory database and a customer order database. The inventory database represented product numbers differently than the customer order database did. Nothing was done about it: there were hundreds of thousands of records to review and change, and no resources available to take on this tremendous, time-consuming task. As a result, more than 500 customer orders went unfulfilled. At an average order of $5,000 per customer, the resulting loss of US$2.5 million in revenue was significant.

Data Quality Defined

Data quality is often defined as the process of arranging information so that individual records are accurate, up to date, and consistently represented. Accurate information relies on clean and consistent data, which usually includes names, addresses, email addresses, phone numbers, and so on. Good data quality means that an organization's data are accurate, complete, consistent, timely, unique, and valid. The better the data, the more clearly they present an accurate, consolidated view of the organization across systems, departments, and lines of business.

Technological advancements that use data pattern analysis, smart clustering, numerous data algorithms, and a host of other sophisticated capabilities ensure that data gathered throughout the organization is accurate, usable, and consistent. By intelligently identifying, standardizing, correcting, matching, and consolidating data, software solutions offer much needed relief to organizational data quality headaches.

Today organizations are looking for a wide range of features in data quality tools. According to the report by TDWI, standardization and verification top the list of desired capabilities, followed by tools that define and validate business rules. Other important features include matching, consolidation, and integration with other enterprise applications such as analytics.

Business Challenges

Bad data originates from a variety of sources—errors in data entry, erroneous data received from Internet forms, faulty data purchased or acquired from an outside source, or simply combining good data with outdated data and not having the ability to distinguish between the two.

One of the first steps in creating good data is simply examining what is available and developing a plan for how it will be used. You will need to determine which data are good, which are bad, and how bad the bad data are. Most organizations have little or no idea about the quality of the data residing in their systems and applications.

According to TDWI, almost half (44 percent) of the respondents said the quality of the data within their companies was “worse than everyone thinks.”4 This same report chronicles examples of costs and missed opportunities due to inaccurate or incomplete data:

  • A telecommunications firm lost $8 million a month because data entry errors incorrectly coded accounts, preventing bills from being sent out.
  • An insurance company lost hundreds of thousands of dollars annually in mailing costs (postage, returns, collateral, and staff to process returns) due to duplicate customer and prospect records.
  • A global chemical company discovered it was losing millions of dollars in volume discounts in procuring supplies because it could not correctly identify and reconcile suppliers on a global basis.

Sadly, most companies are oblivious to the true business impact of poor quality data. The simple truth is that poor data quality absolutely affects the bottom line. Accurate, complete data reduces costs in a number of ways, from the simple and obvious marketing savings (postage and production costs on a direct marketing piece, for example) to the less obvious organizational efficiency savings.

According to the same Data Quality Survey (mentioned earlier), almost half of the surveyed companies suffered losses, problems, or costs due to poor quality data. Companies also cited extra costs due to duplicate mailings, excess inventory, inaccurate billing, and lost discounts, as well as customer dissatisfaction, delays in deploying new systems, and loss of revenue.

Data as a Strategic Asset

Data should be treated as a key strategic asset, so ensuring its quality is imperative. Organizations collect data from various sources: legacy, databases, external providers, the Web, and so on. Due to the tremendous amount of data variety and sources, quality is often compromised. It is a common problem that many organizations are reluctant to admit and address. The single most challenging aspect for companies is to recognize and determine the severity of their data quality issues and face the problem head-on to obtain resolution. Spending the money, time, and resources to collect massive volumes of data without ensuring the quality of the data is futile and only leads to disappointment.

Cleansing data at the source is a significant way to enhance the success of a data warehouse and analytics. Thus, it becomes a proactive rather than reactive model. Simply collecting data is no longer sufficient. It is more important to make proper sense of the data and to ensure its accuracy. As the amount of data escalates, so does the amount of inaccurate information obtained from the data. Data should be cleansed at the source in order to detect and address problems early in the process so that quality issues are prevented further down the line.

Information is all about integration and interaction of data points. Inaccuracies in a single data column can ultimately affect the results and may directly affect the cost of doing business and the quality of business decisions. Preventive measures to ensure data quality usually are more economical and less painful. Delaying the inevitable data cleansing dramatically increases the cost of doing so, as well as increases how long the cleansing process takes to complete.

Improved synergy between the extraction, transformation, and loading (ETL) warehousing process and data quality makes it easier to manage complex data integration. By applying data quality within the ETL process, data integrity and accuracy are assured. Much of the data warehousing effort is concentrated in the ETL process: extracting records and fields from various data sources, converting the data to new formats, and loading the data into target destinations such as a warehouse or a data mart. The purpose of the ETL process is to load the warehouse with integrated and cleansed data. Data quality focuses on the contents of the individual records to ensure the data loaded into the target destination are accurate, reliable, and consistent.

The Value of In-Database Data Quality

The technology for in-database data quality has been on the market for only about three years, and not many vendors offer this capability; it is neither as mature nor as advanced as in-database analytics. Enabling in-database data quality provides value to both business analysts and IT professionals. Many IT professionals focus heavily on the ETL process, since its purpose is to load the data warehouse with integrated and cleansed data. However, data quality is a key component in the preparation of data for entry into the data warehouse. The best place to clean the data is in the source system, so that defects cannot propagate to the data warehouse and other interrelated systems. There are several methods and processes for achieving integrated and cleansed data.

Data Auditing and Standardization

One of the most popular data quality processes is standardizing the data. Now, you can standardize the data in-database. Data in a database or data store typically are inconsistent and lack conformity. There are many ways to say the same thing. As Table 2.4 illustrates, a title may be expressed in various ways.

Table 2.4 Variations of Title

Name         Title
Jane Doe1    VP of Sales
Jane Doe2    Vice President of Sales
Jane Doe3    V.P. of Sales
Jane Doe4    Vice Pres. of Sales

If a direct mailing campaign plans to extract 5,000 “VP of Sales” contacts from a database by writing a query, and the query does not include every possible way that “vice president of sales” is represented, then the search will miss some of the target names. Inconsistently represented data are more difficult to aggregate or query, and it is even harder to perform any meaningful analysis on them.

The first step in the cleansing process is data auditing. Data auditing provides counts and frequencies for data fields; it also identifies unique values and produces range reports with maximum and minimum values. In this phase, you should also define your business and cleansing rules. Once you have analyzed the values, you can standardize the titles for consistency.
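As a rough illustration of these two steps, the Python sketch below first audits a title column with frequency counts and then standardizes the variations from Table 2.4 to a single form. The mapping table is an assumption built for this example; a real in-database implementation would run equivalent logic inside the database rather than extracting the rows.

    from collections import Counter

    # Hypothetical title values, mirroring the variations in Table 2.4.
    titles = ["VP of Sales", "Vice President of Sales", "V.P. of Sales", "Vice Pres. of Sales"]

    # Step 1: audit -- frequency counts reveal how inconsistent the column is.
    for value, count in Counter(titles).items():
        print(f"{value!r}: {count}")

    # Step 2: standardize -- a simple rule set derived from the audit.
    standard_titles = {
        "vp of sales": "Vice President of Sales",
        "v.p. of sales": "Vice President of Sales",
        "vice pres. of sales": "Vice President of Sales",
        "vice president of sales": "Vice President of Sales",
    }

    def standardize_title(raw):
        """Map any known variation to the standard form; pass unknowns through."""
        return standard_titles.get(raw.strip().lower(), raw)

    print({t: standardize_title(t) for t in titles})
    # Every variation now maps to 'Vice President of Sales'.

A query against the standardized value now finds all of the variations that the direct mail example above would otherwise miss.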

Data Consolidation and Matching Analysis

Another important data quality process is matching and consolidating records. Anyone who receives two or three copies of the same advertisement or magazine instead of the single copy they subscribed to has likely experienced a “duplicate record” problem in the customer database. In most such cases, there are small variations in the way the subscriber's name or address appears in the database. Table 2.5 illustrates the example.

Table 2.5 Duplicate Records

Name            Address             Country
William Doe     900 Sunset Dr       USA
Will Doe        900 Sun set Dr.     U.S.A
Bill Doe        900 Sunset Drive    US
Billy Doe       900 Sunset          United States
Mr. W. N. Doe   900 SunSet Drive    U.S.

William Doe may have several contact points and may have entered his contact information in various ways. A human being would look at these variations and instantly recognize that the names represent the same person, but a computer stores them as different records, hence the multiple copies of the same advertisement or magazine. As a result, resources are wasted on inaccurate information, and duplicate or redundant records are extracted and loaded into the data warehouse. This problem can obviously be very costly for any organization that routinely mails against a large customer database.

Data consolidation and matching analysis are fundamental to the data management and data quality process. These capabilities link data from multiple sources and identify records that represent the same individual, company, or entity. Consolidation then combines the matching records into a single, complete record (often called the golden or master record).

Matching and standardizing routines are necessary to analyze, cleanse, and consolidate data from a variety of platforms and sources. These routines include defect analysis, corrections for invalid data, and data authentication processes.
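A bare-bones sketch of matching and consolidation appears below. It groups the records from Table 2.5 with a crude match key (normalized last name plus a normalized street address) and merges each group into a single survivor record. The normalization rules and the survivorship rule (keep the longest value per field) are simplifying assumptions; production matching engines rely on far more sophisticated fuzzy and probabilistic techniques.

    import re
    from collections import defaultdict

    # Hypothetical records, mirroring the variations in Table 2.5.
    records = [
        {"name": "William Doe",   "address": "900 Sunset Dr",    "country": "USA"},
        {"name": "Will Doe",      "address": "900 Sun set Dr.",  "country": "U.S.A"},
        {"name": "Bill Doe",      "address": "900 Sunset Drive", "country": "US"},
        {"name": "Billy Doe",     "address": "900 Sunset",       "country": "United States"},
        {"name": "Mr. W. N. Doe", "address": "900 SunSet Drive", "country": "U.S."},
    ]

    def match_key(rec):
        """Crude match key: last word of the name plus a normalized street address."""
        last_name = rec["name"].split()[-1].lower()
        street = re.sub(r"[^a-z0-9]", "", rec["address"].lower())
        street = street.replace("drive", "").replace("dr", "")  # drop suffix variations
        return (last_name, street)

    clusters = defaultdict(list)
    for rec in records:
        clusters[match_key(rec)].append(rec)

    def consolidate(cluster):
        """Build a 'golden record' by keeping the longest value seen for each field."""
        return {field: max((r[field] for r in cluster), key=len) for field in cluster[0]}

    golden_records = [consolidate(cluster) for cluster in clusters.values()]
    print(golden_records)  # one consolidated record instead of five near-duplicates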

Other In-Database Data Quality Functions

Standardizing and matching are the most prevalent data quality functions, and they are the ones most useful to run in-database. Other in-database functions include the following (a brief sketch of pattern analysis follows this list):

  • Casing: Ensures context-appropriate casing is used in a data column.
  • Extraction: Extracts context-specific entities or attributes from a text string.
  • Gender analysis: Determines the gender of a name.
  • Identification analysis: Determines the type of data represented by a text string.
  • Parsing: Segments a string into separate, discrete entities.
  • Pattern analysis: Shows a simple representation of a text string's character pattern. It is useful for determining whether further data quality processing is necessary.
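As one small illustration of the pattern analysis function, the Python sketch below reduces each value in a column to a character pattern (A for letters, 9 for digits, other characters kept as-is), which is enough to flag values whose shape deviates from the norm. The pattern codes and the phone number column are assumptions made for this example, not any vendor's specification.

    def character_pattern(value):
        """Reduce a string to a simple shape: A = letter, 9 = digit, others kept."""
        pattern = []
        for ch in value:
            if ch.isalpha():
                pattern.append("A")
            elif ch.isdigit():
                pattern.append("9")
            else:
                pattern.append(ch)
        return "".join(pattern)

    # Hypothetical phone number column with one malformed value.
    phones = ["919-555-0100", "800-555-0199", "call me"]
    for value in phones:
        print(f"{value!r} -> {character_pattern(value)}")
    # '919-555-0100' -> '999-999-9999'
    # '800-555-0199' -> '999-999-9999'
    # 'call me'      -> 'AAAA AA'

Values whose pattern deviates from the dominant shape (999-999-9999 here) are good candidates for further cleansing.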

As in-database data quality evolves and customers demand more in-database data quality functions, I truly believe it will become an essential element in conquering big data and analytics.

Benefits of In-Database Data Quality

Like in-database analytics, in-database data quality offers valuable benefits and advantages. Imagine what you can accomplish when data quality processes you once considered too complex are not only possible but can be done quickly and with minimal effort. You will get your big data under control in no time, which leads to big ideas, big potential, and possibly an even bigger payoff.

With in-database data quality, you can:

  • Become more agile, reacting with lightning speed to new opportunities and challenges.
  • Improve productivity by minimizing time spent on moving and manipulating data.
  • Boost security by keeping data intact and in a single, highly governed environment.
  • Seamlessly manage the flow of cleansed data throughout the enterprise.
  • Keep costs in check by using an integrated solution instead of piecemeal technology.

It is a fact that ignoring data quality is costly, and it affects every industry that relies on accurate and consistent information. Data quality is a continuous process and effort. By addressing data quality at the source, data cleansing becomes a proactive rather than a reactive process. Integrating data quality with the database minimizes the risk of failure, the cost, and the number of resources needed to manage the data. The synergy between ETL and data quality provides integrated and cleansed data for your warehouse, mart, or other repository. Data are a vital resource and should be treated as a key strategic asset in order to obtain a reliable and accurate view of your organization.

INVESTMENT FOR IN-DATABASE PROCESSING

Customers often express interest in in-database processing, but they are concerned about the investment or costs of adopting and implementing this technology. I frequently tell customers and prospects that it will require some investment in software, hardware, and training the staff on a new process. A handful of analytic vendors are working closely with the database vendors to deliver in-database processing.

From the software perspective, the analytic marketplace varies: open source, small boutique start-ups, visionaries, and market leaders. The open source vendors are starting to integrate with some of the database vendors to offer cost-effective in-database solutions with a lower total cost of ownership. With open source, there is a large and growing user community to leverage for sample code and implementation approaches. However, open source tools can be limited when analyzing large data volumes because they are often memory bound and single-threaded, which constrains performance. Niche or small boutique analytic vendors are also entering the in-database market. They offer another low-cost alternative with more modern user interfaces for those who prefer an alternative to traditional manual coding. On the other hand, these vendors may not offer the depth and breadth of analytic capabilities and may not be certified as a standard in industries such as health care, finance, or pharmaceuticals. Finally, the visionaries and leaders in analytics offer the most robust and mature in-database solutions. Not only do they offer the depth and breadth of in-database functions, but they also deliver proven methodologies, integrated consulting services, and documented best practices. Users can adopt these solutions with minimal changes to their existing code or processes when applying in-database techniques.

As the name implies, in-database processing requires the database vendors to deliver the hardware and software for the database that hosts the data for analytics. Unlike the analytics market space, most of the database vendors that offer in-database solutions are visionaries and leaders. Yes, a select few open source companies have started to work with the analytic vendors in the last couple of years, but as expected, adoption has been few and far between, and these offerings lack the depth and breadth, methodology, and proven successes of the leaders. The leaders in the database market have developed mature, high-performance technologies to enable the heavy lifting of data for analysis. Because in-database processing can impact the performance of the database, it may require additional hardware purchases to handle the additional capacity for executing analytics. As you consider in-database technology, vendors may ask for the following details to configure the right solution:

  • What is the largest data size to be processed in-database? Whether it is small (in the gigabytes) or large (in the terabytes), in-database processing can impact the performance of your production data warehouse environment. Most of our customers who leverage in-database processing have a development or testing environment, in the form of a data lab or sandbox, to test before moving into production.
  • Do you know the number of rows and columns of the data being analyzed in-database? As mentioned, in-database processing can assist in joining, merging, and aggregating data in-database. Knowing the number of rows and columns and the types of data to be processed in-database helps the consultants estimate the amount of effort involved.
  • How many total users will have access to the in-database capabilities? If your organization intends only a single department, such as the operations research group, to have in-database technology, then the licensing and purchasing costs of the in-database solution can be minimal. On the other hand, if it is intended for enterprisewide usage, then expect a bigger investment in both software and hardware to accommodate that number of users.
  • How many concurrent in-database sessions may be running? In-database processing may be used in many concurrent sessions, at an enterprise or departmental level, to prepare the data and/or run the analytics. Depending on the size of your organization, consider giving one or two departments access to in-database capabilities first to demonstrate the value before enabling it enterprisewide.

In-database processing will require some retraining on how users handle, manage, and analyze the data. As you select the analytic or database vendor, consider how much the users will need to change the way they work. For example, select a package that will minimally disrupt the current data preparation and analytic process. Similarly, many analytic professionals will need to reuse existing code; if they can leverage that code with minor tweaks, it will minimize the learning curve and the disruption to the company's culture. The technology is far more likely to gain acceptance and adoption when change is minimal.

I have demonstrated the value of in-database technology in an e-commerce company, a telecommunications provider, and a financial institution. In addition, I highlighted the importance of data quality and the in-database approach to accurately deliver data-driven decisions. Depending on your organization's needs, the in-database investment can be minimal. Selecting the appropriate in-database solution from the right analytic and database vendor will make a big difference in value, adoption, and success. Now, let's move to in-memory analytics and reveal even more business benefits and value added from this technology.
