Data Warehousing

Introduction

A substantial problem faced by businesses is that their data can often be dispersed, incongruent, vulnerable, and slow to access. What do I mean by these terms?
  • Dispersed datasets are those stored in multiple different locations, for example in different information systems and applications, varied geographical locations, on different servers or personal computers, and so forth. Especially noteworthy here is the fact that most businesses initially store data in many different user applications. For example, marketing may have a specific “customer management system,” which is an application that stores data on various customers and service or sale processes. HR may have one system to store employee data and an entirely different system in which to run payroll data. Your operations centers may, in turn, run yet another database application, and the same for finance. Some of these may run in different geographies, on different servers, and so on. This dispersion can make it tricky to integrate data when it is needed.
  • Incongruent datasets are those that are hard to analyze together, especially if they are difficult to combine together into coherent datasets for an integrated analysis. Say, for instance, that you are a bank which has a high-value (“wealth”) customer database and a database of customer representatives who service these clients. You may wish to combine these datasets together for analysis. Specifically, you may intend to merge the datasets by matching the customer name field in each dataset. However, you may find that the full customer name is used in the customer database; whereas the service representatives table uses only the surname and initials of the customer (e.g. I might be Lee, Gregory in the one dataset and Lee, G.J. in the other). Also, there may be financial data in one dataset measured in dollars and financial data in another dataset measured in yen, or time data in one set measured in hours and time data in another dataset measured in seconds. Finally, very unstructured data – such as social media data – can present its own challenges as we discussed in the previous section on big data. Disconnects such as these may make it difficult to combine and analyze the datasets together.
  • Vulnerable datasets are those that are not protected from various risks. Many such risks exist. Databases can be damaged or even lost altogether, for example when a server gets damaged by a lightning strike or when a personal computer is stolen and there is no backup. Security risks also exist, such as hacking of data.
  • Slow to access data is a frequent problem with major data storage applications, especially the older ones. These applications are often designed to enter and store data one row at a time. They are not always designed to call data out in large volumes quickly. Many businesses store incredibly large volumes of data in the original, raw datasets, which cannot easily or quickly be drawn out, integrated and analyzed. Speed, on the other hand, is often desired by managerial users.
Data warehouses[1], which integrate data and provide certain views and analyses, can help with all these issues to some extent, although designing and implementing a data warehouse needs very careful thought and expertise.

Steps in Traditional Data Warehousing

In traditional data warehousing, you are thinking of collecting, integrating, standardizing and presenting the traditional, structured datasets of the company. Think about the different finance, marketing, operational, HR and other datasets of your company, which, as discussed above, may be dispersed, incongruent, vulnerable and slow to access. Traditional data warehousing can solve some of these issues. Although there are some different ways to warehouse, Figure 18.1 One version of traditional data warehousing below shows a common set of steps, which I explain in the following points.
Figure 18.1 One version of traditional data warehousing
The process in Figure 18.1 One version of traditional data warehousing above is only one possible view of data warehousing, and there are, of course, variations. With this in mind, let us look at each of the steps presented here, which do include most elements in the traditional process.
  • Step 1: The raw datasets. In the initial phase, you begin by identifying the difference, disparate datasets in your organization that might benefit from being integrated into the warehouse, such as marketing, financial, operational and other datasets.
  • Step 2: The staging phase. In many warehousing solutions, the raw data from the different databases is extracted from their separate ”homes” and stored in an initial staging area, which essentially means that they are copied to the central storage. This is part of what we term the Extraction, Transformation and Load (ETL) process, which usually involves specific suites of software. Note here we are only extracting from some datasets and loading temporarily to another. SAS is particularly excellent at ETL.
  • Step 3: The integration step. In this step, transformation of the original data occurs to integrate and standardize it. For example, names of things (like people, products or business units) might be standardized to ensure that data can be linked. Units of things like money or time might be standardized, for instance all money might be translated using certain business rules to dollars, and all time data to minutes. Data can also be cleaned according to certain business rules. This transformation is part of the ETL process. This step can be done programmatically using SAS or similar software, or it can involve storage in yet another database (often called an Operational Data Store or ODS), which is increasingly unnecessary when in-memory solutions like SAS exist.
  • Step 4: Loading to the data warehouse. Having been transformed, the data is now stored in the data warehouse. Typically, this means storage in large-memory server or cloud environments, which can involve significant investments in hardware or other costs. The data is stored in a multi-format way. It includes the raw data itself, often certain prescribed summaries, and metadata (higher level data about the data).
  • Step 5: Extraction to data marts. In many warehouses, further work is done to make specific, limited views of data available for specific user needs (“data marts”). For example, you may speak to sales managers to ascertain what data from the warehouse would best serve their specific needs, and build a query that presents only that data for them, on command. The data marts can be quite detailed in nature, or involve processed, higher-level strategic information for higher level managerial analysis. Not all warehousing architectures necessarily build in data marts.
  • Step 6: Analysis of the data. In the final step, specific analyses (such as standard statistical analysis, predictive analysis, data mining, etc.) are performed on the data, either from the warehouse or the specific data marts.
There are advantages to such data warehouses:
  • The standardization and cleaning of institutional data has the obvious advantages of data integrity and integration of previously disparate data.
  • The formalization of data cleaning and integration into a system means that these large tasks are not left to the dedication of specific users at the time that an analysis becomes necessary, and the results of these efforts are accessible to a wide user pool.
  • Data warehousing often allows for significant speed in getting and analyzing data, especially when good quality, focused data marts have been made available.
  • Data warehousing helps to protect data, first by creating a back-up from the original databases, and second by usually having redundancy (duplicate data and copies) built in, which ,means that complete losses are less likely. In addition, although the original databases may have varied levels of security, high levels of security can be built into the warehouse, which contains the cleaned and perhaps more useful data.
However, there are also issues and problems with traditional warehouses, and alternatives, as briefly discussed next.

Issues & Alternatives in Data Warehousing

Traditional data warehousing has certain disadvantages that bear careful thought. These include:
  • The traditional process such as that seen in Figure 18.1 One version of traditional data warehousing often involves significant investments in local hardware (large server-type solutions) or cloud storage, which can become obsolete quickly.
  • Traditional data warehouses can sometimes still be somewhat slow compared to alternate solutions. Real-time access to data is nearly impossible with a warehouse where the data has to go through several layers of processing and storage.
  • Traditional data warehouses can often involve moving data in and out of several storage locations.
  • There is usually a lot of redundant (overlapping) data in a warehouse. While this can have advantages, it also leads to huge data stores and may reduce flexibility. In some situations, organizations are forced to either destroy older data or to invest in increasingly more hardware.
  • Traditional warehouses are not designed to store and analyze unstructured data of the types discussed in the previous section on big data.
These and other challenges have led to multiple calls for changes to the traditional structure and process of data warehousing. These include:
  • Simplify and improve data staging and processing. As discussed earlier, traditional designs sometimes involve multiple-location storage of initial data after it has been initially extracted from native databases and while it is being transformed. In addition, data gets processed and moved several times thereafter. This may be a needless waste of space and time. Many modern systems can improve on the storage and speed involved, for example by extracting, processing and loading the raw data to the warehouse in memory, which means processing and analyzing the data in computing processors without storing it on disk in the initial phases. SAS LASR is an example of this technique, which also applies to other parts of the warehousing process. In addition, open-source software initially developed for big data (Hadoop) has been suggested as a good way to initially process data, due to its low cost and more efficient processing abilities.
  • Cheaper and less restrictive storage solutions: There are solutions that have made storage of data far easier and cheaper, and that more easily allow for storage of data that warehouses have not traditionally dealt with. Once again, the big data Hadoop solution is an example. As pointed out by Dull (2014), instead of using huge hardware storage devices for centralized storage, Hadoop is designed to operate on cheap, commodity machines, and is able to expand easily across literally thousands of devices to achieve its storage – this is known as distributed computing. In addition, Hadoop is less restrictive, in that it can store and deal with very unstructured data such as Twitter posts, as discussed earlier in the big data section. It is therefore a potential solution not only for big data but also for combinations of big and ”small” data, where for the latter it can replace or enhance the traditional Extract-Transform-Load (ETL) solutions. In fact, Hadoop can be used as an effective data warehouse in its own right.
This is not to say that solutions like Hadoop are a surefire replacement for traditional data warehousing. The skills to implement such technology are sorely lacking, and organizations are finding other ways to achieve improvements to warehouses. As with all big organizational decisions, careful thought and design is needed before considering any data warehousing solution.

Conclusion on Data Warehousing

For larger organizations especially, data warehousing is probably going to become increasingly important. Whether alternate technologies like Hadoop take over is up for debate, however. The sheer size and increasing complexity of the data universe will continue to call for storage and processing solutions.
Last updated: April 18, 2017
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
52.15.38.176