In traditional data warehousing, you are thinking of collecting, integrating, standardizing
and presenting the traditional, structured datasets of the company. Think about the
different finance, marketing, operational, HR and other datasets of your company,
which, as discussed above, may be dispersed, incongruent, vulnerable and slow to access.
Traditional data warehousing can solve some of these issues. Although there are some
different ways to warehouse,
Figure 18.1 One version of traditional data warehousing below shows a common set of steps, which I explain in the following points.
The process in
Figure 18.1 One version of traditional data warehousing above is only one possible view of data warehousing, and there are, of course, variations.
With this in mind, let us look at each of the steps presented here, which do include
most elements in the traditional process.
-
Step 1: The raw datasets. In the initial phase, you begin by identifying the difference, disparate datasets
in your organization that might benefit from being integrated into the warehouse,
such as marketing, financial, operational and other datasets.
-
Step 2: The staging phase. In many warehousing solutions, the raw data from the different databases is extracted
from their separate ”homes” and stored in an initial staging area, which essentially
means that they are copied to the central storage. This is part of what we term the
Extraction, Transformation and Load (ETL) process, which usually involves specific
suites of software. Note here we are only extracting from some datasets and loading
temporarily to another. SAS is particularly excellent at ETL.
-
Step 3: The integration step. In this step, transformation of the original data occurs to integrate and standardize
it. For example, names of things (like people, products or business units) might be
standardized to ensure that data can be linked. Units of things like money or time
might be standardized, for instance all money might be translated using certain business
rules to dollars, and all time data to minutes. Data can also be cleaned according
to certain business rules. This transformation is part of the ETL process. This step
can be done programmatically using SAS or similar software, or it can involve storage
in yet another database (often called an Operational Data Store or ODS), which is
increasingly unnecessary when in-memory solutions like SAS exist.
-
Step 4: Loading to the data warehouse. Having been transformed, the data is now stored in the data warehouse. Typically,
this means storage in large-memory server or cloud environments, which can involve
significant investments in hardware or other costs. The data is stored in a multi-format
way. It includes the raw data itself, often certain prescribed summaries, and metadata
(higher level data about the data).
-
Step 5: Extraction to data marts. In many warehouses, further work is done to make specific, limited views of data
available for specific user needs (“data marts”). For example, you may speak to sales
managers to ascertain what data from the warehouse would best serve their specific
needs, and build a query that presents only that data for them, on command. The data
marts can be quite detailed in nature, or involve processed, higher-level strategic
information for higher level managerial analysis. Not all warehousing architectures
necessarily build in data marts.
-
Step 6: Analysis of the data. In the final step, specific analyses (such as standard statistical analysis, predictive
analysis, data mining, etc.) are performed on the data, either from the warehouse
or the specific data marts.
There are advantages to such data warehouses:
-
The standardization and cleaning of institutional data has the obvious advantages
of data integrity and integration of previously disparate data.
-
The formalization of data cleaning and integration into a system means that these
large tasks are not left to the dedication of specific users at the time that an analysis
becomes necessary, and the results of these efforts are accessible to a wide user
pool.
-
Data warehousing often allows for significant speed in getting and analyzing data,
especially when good quality, focused data marts have been made available.
-
Data warehousing helps to protect data, first by creating a back-up from the original
databases, and second by usually having redundancy (duplicate data and copies) built
in, which ,means that complete losses are less likely. In addition, although the original
databases may have varied levels of security, high levels of security can be built
into the warehouse, which contains the cleaned and perhaps more useful data.
However, there are also issues and problems with traditional warehouses, and alternatives,
as briefly discussed next.