Chapter 4
Data Warehousing Technical Architecture

From an architect's point of view, the hard part is to find the natural places to locate boundaries and define the appropriate interfaces needed to build a working system.

Einar Landre

In this chapter, you will learn about technical architecture, data architecture, and data warehouse architecture, and how to define them. You will learn how to:

  Understand technical architecture terms and concepts.

  Identify data architecture components needed for data warehousing and business intelligence.

  Understand and present the big picture of data warehouse architecture.

  Present the data warehouse technology stack: the set of technologies selected for a data warehouse implementation.

  Recognize databases needed to produce your data warehouse.

  Compare data warehouses to data marts, understanding the function of each and where they are the same and where they differ.

  Identify data sources: the places where you will get the data to populate your data marts and data warehouses.

  Manage the business intelligence application architecture portfolio.

  Put together the technologies needed to support data warehousing and business intelligence.

Solid architecture is critical to data warehousing. The ANSI/IEEE Standard 1471-2000 defines the architecture of software-intensive systems as “the fundamental organization of a system, embodied in its components, their relationships to each other and the environment, and the principles governing its design and evolution.” This chapter presents Technical Architecture, Data Architecture and Data Warehouse Technical Architecture. Data warehousing architecture is an application of technical architecture and data architecture targeted to the data warehouse.

Technical Architecture as presented in this book is a combination of Application Architecture and Infrastructure Architecture. Major topics of technical architecture are described in the following sections:

  Functional and non-functional architecture requirements

  Technical architecture principles

  Buy, build, or re-use approach

  Architecture roadmaps.

The recommendation "Begin with the end in mind" holds especially true for data warehousing and business intelligence. The end that we have in mind is a system that satisfies both functional and non-functional requirements: a data warehouse that does what it is supposed to do.

Functional requirements (business requirements) are needs identified by the business relating to data and business processes. For example, you may be looking for a system that provides information about customers, territories and products that supports the business processes of selling and customer support. See Chapter 3, Requirements for Data Warehousing and Business Intelligence, for guidance on how to elicit and organize business requirements.

Non-functional requirements are needs relating to performance and chosen IT practices. Performance includes issues such as required system availability and recoverability, and depends on the volume of data and the number of users expected for the data warehouse. Chosen IT practices include selected technologies (the "technology stack") and standards.

Technical architecture principles are critical to the success of data warehousing efforts. In my experience, these principles are often expressed in terms of “ilities”, such as those described in Table 04-01.


Table 04-01: Technical Architecture Principles

Adaptability: The ability to be effective in changing circumstances.

Affordability: The ability to achieve goals at costs that are within the financial means of an enterprise.

Auditability: The ability to record the circumstances and history of a process for later review.

Extensibility: The ability of a solution to accept new functionality.

Interoperability: The ability to exchange and use functionality across organizations and systems.

Maintainability: The ability to make changes to a system to keep it operating correctly.

Manageability: The ability to obtain information about a system and to control it.

Recoverability: The ability to restore a system to operational status without loss of data and within a required time period.

Reliability: The ability to consistently perform according to specified standards.

Scalability: The ability of a system to increase throughput and maintain responsiveness under conditions of increased load.

Supportability: The ability of a system to be serviced, such as through the use of a help desk.

Understandability: The ability of a system to be comprehended or interpreted by the people who use or support it.

A critical question for data warehousing efforts is how to obtain the software resources that make up the data warehousing system. There are three options:

1. Re-use existing resources

2. Buy a new resource

3. Build a resource.

Re-using existing resources can often save money and deliver a superior and more maintainable solution. If an organization buys or builds new components every time there is a new project, then the portfolio of resources will soon become bloated and expensive to maintain. Re-use can also have drawbacks. Existing resources may not meet current functional or non-functional requirements.

Buying a new data warehousing resource can save time and money over building one. Buying is a good choice when products that meet a large percentage of your requirements are available for less than the cost of building a new resource. Purchased software may have more features and fewer problems than homegrown software, for example. In addition to software, pre-built industry data models are available, which can also save time and money.

Complete turn-key solutions, known as data warehouse appliances, are available for some applications. The data warehouse appliance integrates multiple components: hardware, software, and data models. This can result in money savings, risk reduction, and faster implementation.

Building a resource can be a good answer when there are no existing resources to re-use and when purchased resources that meet requirements are not available for a reasonable price. Building a solution, or part of a solution, can result in a competitive advantage for your organization when it has a capability that is not readily duplicated by competitors. Cost is also a factor. Purchased software often has per user or per computer charges, while in-house developed software can be made available to internal users without additional licensing fees. On the other hand, building a solution can take a lot more time than adapting existing resources or purchasing a new data warehouse resource.

In general, I recommend "re-use before buy, and buy before build." Re-use means using an existing asset. For example, if your organization owns an ETL tool that meets requirements, it may cost less to re-use that tool than to acquire a new one. Buying is acquiring an asset by purchasing or licensing it. The alternative is to build an asset. Table 04-02 illustrates considerations for this decision; some combination is likely. When the requirements and technology stack are known, create a list of needed resources and specify possible sources of supply for each item. This organized approach leads to effective sourcing with minimal duplication and waste.


Table 04-02: Comparing Re-use, Buy, and Build of Software Tools

 

Re-use

Pro: Avoids new license fees; current staff is capable of using the tool; does not add complexity to the technology portfolio.

Con: Existing tools may not be best of breed; may be built on obsolete technology; may no longer be supported by vendors.

Buy

Pro: A newer tool may result in more productivity; may support newer environments.

Con: Additional purchase or licensing cost; additional training costs; complexity added to the technology portfolio.

Build

Pro: Avoids licensing fees; may result in competitive advantage; standard languages such as Java will be supported for years.

Con: Pulls focus away from business goals; depends on a limited number of internal developers.

Data warehousing architecture helps to answer critical questions:

  What is the current state of data warehousing and business intelligence?

  What is the desired future state of data warehousing and business intelligence?

  What are the high-level steps that must be taken to move from the current state to the future state?

Roadmaps identify actions that must be taken to achieve the desired future state. In addition, roadmaps specify intermediate future states that must be passed through to achieve the ultimate future state.

Data Architecture or Information Architecture is a blueprint for the management of data in an enterprise. The data architect builds a picture of how multiple sub-domains of data combine to support the organization's stated vision and mission. The multiple sub-domains include data governance, data quality, ILM (Information Lifecycle Management), data framework, metadata and semantics, master data, and finally business intelligence. Figure 04-01 visually depicts these sub-domains.

Figure 04-01: Data Architecture Sub-domains

Each of the data architecture sub-domains has a role in the successful data warehousing and business intelligence project. Table 04-03 identifies and describes each of the sub-domains and identifies places in this book that provide further information.

Table 04-03: Data Architecture Sub-domain Descriptions

Data Governance (DG)

The overall management of data and information, including the people, processes, and technology that improve the value obtained from data and information by treating data as an asset. It is the cornerstone of the data architecture. See Chapter 8, Data Governance and Metadata Management, for more information.

Data Quality Management (DQM)

The discipline of ensuring that data is fit for use by the enterprise. It includes obtaining requirements and rules that specify the dimensions of quality required such as: accuracy, completeness, timeliness, and allowed values. Chapter 9, Data Sources, describes the use of DQM to improve the data utilized in DWBI.

Information Lifecycle Management (ILM)

The discipline of specifying and managing information through its life from its inception to disposal. Information activities that make up ILM include classification, creation, distribution, use, maintenance, and disposal. ILM is described in Chapter 8, Data Governance and Metadata Management.

Data Framework

A description of data-related systems in terms of a set of fundamental parts and the recommended methods for assembling those parts using patterns. The data framework can include database management, data storage, and data integration. This chapter shows the organization of the data framework and additional chapters provide supporting information:

    Chapters 5, 6, and 7 – Data Attributes and Data Modeling

    Chapter 8 – Data Governance and Metadata

    Chapter 9 – Data Sources

    Chapter 10 – Database Technology

    Chapter 11 – Data Integration

Metadata and Semantics

Information that describes and specifies data-related objects. This description can include: structure and storage of data, business use of data, and processes that act on the data. Semantics refers to the meaning of data. Metadata is described in Chapter 8, Data Governance and Metadata Management.

Master Data Management (MDM)

An activity focused on producing and making available a single authoritative source of master data, the essential business entities such as customers, products, and financial accounts. Master data is data describing major subjects of interest that are shared by multiple applications. Consistent master data is a critical ingredient in the data warehouse. See Chapter 9, Data Sources, for more information.

Business Intelligence

The people, tools, and processes that support planning and decision-making, both strategic and operational, for an organization. Business intelligence is a focus of this book and its many aspects are described in:

    Chapter 12 – BI Operations and Tools

    Chapter 13 – Statistics

    Chapter 14 – Data Mining and Analytics

    Chapter 15 – Analytic Pattern

    Chapter 16 – Presenting Data

    Chapter 17 – BI Applications

    Chapter 18 – Customer Analytics

 

Given the frameworks of Technical Architecture and Data Architecture, it is now time to focus on Data Warehousing Technical Architecture.

Data warehousing technical architecture builds on the classic system pattern: input, process, and output. Figure 04-02 shows a simplified view of the data warehouse system. This is the type of diagram included in earlier-generation books and presentations about data warehousing. We will start with this view and then dive deeper.

Figure 04-02: Simplified Picture – The Data Warehouse System

Each part of the data warehousing technical architecture is addressed in this chapter and expanded in later chapters.

  Data Sources

  Data Integration

  Data Stores

  BI / Analytics

  Data Access

  Data Governance

  Metadata

  Data Quality

  Infrastructure

  Security

  Systems Monitoring

The data warehousing system flow starts with a data extract process, which pulls data out of data sources and loads the extracted data into an intermediate data store such as staging areas for the atomic data warehouse. Data may be transformed at the same time as the extract or after the load. Figure 04-03 shows the next level of detail.

Figure 04-03: Next Level – The Data Warehouse System

Next, data from data stores may be manipulated and moved between and within internal data stores. For example, data might be moved from an atomic data warehouse to a dimensional data mart. Query, reporting, data mining, statistics and data profiling tools each access data in their own particular format. This book describes how to design each of these areas so that analysts can access that data.

The data warehousing architecture includes a metadata layer and a system monitoring layer.

As you will recall from previous discussions, data warehousing and business intelligence are complementary. Data warehousing is about the data: it integrates and stores the data needed for the analytics required by the business. It includes databases, data movement, data integration, and data cleansing.

Business intelligence is the result of analyzing the data that was brought together through data warehousing. It includes producing reports, dashboards, scorecards, and analysis.

Starting from the beginning, data sources (inputs) are the places where data is obtained. There are typically multiple such places. This architecture chapter introduces data sources, and Chapter 9, Data Sources, provides greater depth on this topic.

Data sources are the origination points of data. These founts of data include:

  Transactional Systems, which contain business activity such as receipts and purchase orders

  Process Oriented Systems, which contain workflow information

  Specification Systems, which contain plans, settings, and rules

  Syndicated Data, which is obtained outside the organization, and is used to enrich internal data, such as customer profiles

  Big Data, which is typically unstructured and consists of text, images, video, and audio. The advent of social media data has increased the importance of Big Data.

Data is obtained from data sources through data integration approaches such as:

  Extract, Transform and Load (ETL) – a process where data is obtained from a data source, modified, and placed in a data target through batch means. This is the primary method for obtaining data for most data warehouses. A related method is Extract, Load and Transform (ELT), where data is transformed after it is loaded. (A minimal sketch follows this list.)

  Enterprise Application Integration (EAI) – a method of integration between software applications that utilizes application program interfaces (APIs). This approach provides immediate data and is useful for near real-time data warehousing.

  Enterprise Information Integration (EII) – a method of data integration that provides access to multiple data sources without the need to move the data to a common database such as a data warehouse. This method supports a federated schema, which makes multiple data sources act like a single data source, and can save the cost of developing new physical data warehouses and copying data.
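
To make the ETL pattern concrete, here is a minimal sketch in Python. It is illustrative only: the table names (customer_src, stg_customer) and the use of SQLite as both source and staging store are assumptions, not part of any particular product.

    import sqlite3  # SQLite stands in for the source and staging databases

    def extract(source_conn):
        # Extract: pull raw customer rows from the source system.
        return source_conn.execute(
            "SELECT cust_id, cust_name, country FROM customer_src").fetchall()

    def transform(rows):
        # Transform: standardize names and country codes before loading.
        return [(cust_id, name.strip().title(), country.upper())
                for cust_id, name, country in rows]

    def load(staging_conn, rows):
        # Load: insert the cleansed rows into the staging area.
        staging_conn.executemany(
            "INSERT INTO stg_customer (cust_id, cust_name, country) "
            "VALUES (?, ?, ?)", rows)
        staging_conn.commit()

In the ELT variant, the load step would run first and the transformation would then be performed inside the staging database, typically as SQL.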

Data integration tools move data into a data integration and storage area, which is another part of the architecture.

So, what factors should be considered when integrating data? There could be a customer who is defined in the Enterprise Resource Planning (ERP) system with their orders, shipping information, and so forth. The same customer is then identified in the Customer Relationship Management (CRM) system or database where there is information recorded about potential sales activities, customer service, and so forth. The data warehouse is a place where this data can be put together, providing a 360 degree view of the customer and our relationship with them.

After data has been extracted and the physical storage areas created, it is time to pump the data through the data warehousing system. It is moved from the data sources to a staging database, then to the data warehouse and to the data mart, where it is then available for BI queries by the business user.

These key activities are needed to support this process:

  Mapping – Data sources are aligned with their data targets. You will decide what data goes where. (A mapping sketch follows this list.)

  Transforming – Data is modified to meet the requirements and formatting of where it lands.

  Enriching – Additional data may be added, such as geocoding.

  Loading – Data is inserted into the databases.
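
Before any of these activities are coded, the mapping is usually captured as a source-to-target specification. The sketch below shows one hypothetical way to record such a mapping in Python; all column names and rules are illustrative.

    # Hypothetical source-to-target mapping for a customer dimension.
    # Each entry: target column -> (source column or None, transformation rule).
    customer_mapping = {
        "customer_key":  ("cust_id",   "assign surrogate key"),
        "customer_name": ("cust_name", "trim and title-case"),
        "country_code":  ("country",   "standardize to ISO country code"),
        "geo_latitude":  (None,        "enrich via geocoding service"),
        "load_date":     (None,        "set to load timestamp"),
    }

    # Print the mapping as a simple source-to-target worksheet.
    for target, (source, rule) in customer_mapping.items():
        print(f"{target:14} <- {source or '(derived)':10} [{rule}]")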

Data Integration Tools are a category of software that moves or copies data from one location to another, as well as scrubbing and cleansing it. ETL (extract, transform, and load) tools, as well as more real-time integration tools like the ESB (enterprise service bus), fall into this category. As you look at the overall effort that goes into a data warehousing project, the bulk of the work actually falls to the data integration tool, so this tool is critical to the success of the project.

Chapter 11, Data Integration, provides further information on this subject.

Once data is integrated and stored, it is moved to an exposed data area where it is available to analysts and managers. Those who are going to make use of that data will access it using business intelligence and analytic tools.

Scalability was a primary driver when establishing the technical architecture for the successful 3M Global Enterprise Data Warehouse (GEDW). The system must support a global enterprise with over 100,000 employees and 50,000 trading partners that require data returned in seconds. The system would start with 500 users and grow to 30,000 users. The GEDW team evaluated a number of platforms and then decided on the Teradata® database engine to manage the databases associated with GEDW, and Information Advantage® MyEureka!® to support analytic applications. The technical architecture started with two database server complexes and one application server complex.

The database servers were built on Teradata Massively Parallel Processing (MPP) technology, with server complexes consisting of 48 to 80 CPUs organized into multiple nodes. Storage started at fifteen terabytes in a disk configuration that supports failover (RAID5). High availability is supported by failover control between processing nodes and by the RAID5 disks.

The data includes structured data, such as customer orders and shipment records, as well as unstructured data like documents and product images. GEDW also provides data critical to data warehouses, such as master data, reference data, digital asset data, and metadata. (Messerli 1999)

Allen Messerli recalls the achievement, “We just did it and made it work! 3M had independent transaction systems in 62 countries, all feeding into the GEDW.” This resulted in a streamlined system. Messerli explained, “We got rid of all our separate physical data marts and other data stores.” (Hackathorn 2002)

Now, let's look at the data stores required to support the data warehouse system.

The data warehouse system is not going to be composed of just one database. Multiple databases or areas of databases will be required, with each database having a particular role in the data warehousing architecture. Sometimes it may be efficient and effective to combine multiple roles into a single database like the Enterprise Data Warehouse (EDW) or the Operational Data Store (ODS).

Data Warehouse: This database is organized into subjects, such as customer, product, sales transaction and so forth. The EDW contains a full set of data with an enterprise scope. The Atomic Data Warehouse area contains normalized data, meaning one fact or piece of information in one place. The data warehouse is larger in scope than the data mart and may provide data to one or more data marts.

Data Mart: This database is organized by business process, rather than data subject, so there may be receipts, issues, insurance policy transactions, and so forth. It is a sub-set of data needed to analyze particular events or facts, so it is smaller in scope than the data warehouse. The data mart is:

  Focused on presentation and querying, to enable business people to quickly understand and use data. Fast data output is critical, as is the ability to pull data out, understand it, and use it.

  Integrated using conformed dimensions, which are database tables that contain properties that consistently identify and categorize data, using agreed upon labels. These are descriptors that help us index and cross reference business processes.

  Denormalized or designed as cubes or a star schema, which provides faster access for reporting and analytics. (A star schema sketch follows this list.)
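
As an illustration of the star schema pattern, the following sketch builds a tiny sales data mart with one fact table and two conformed dimensions. SQLite and all table names are assumptions chosen to keep the example self-contained.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # illustrative in-memory database
    conn.executescript("""
        -- Conformed dimension: consistent across all data marts that use it.
        CREATE TABLE dim_customer (
            customer_key     INTEGER PRIMARY KEY,  -- surrogate key
            customer_name    TEXT,
            customer_segment TEXT
        );
        CREATE TABLE dim_date (
            date_key       INTEGER PRIMARY KEY,    -- e.g. 20240131
            calendar_date  TEXT,
            fiscal_quarter TEXT
        );
        -- Fact table: one row per sales transaction line.
        CREATE TABLE fact_sales (
            customer_key INTEGER REFERENCES dim_customer (customer_key),
            date_key     INTEGER REFERENCES dim_date (date_key),
            quantity     INTEGER,
            sales_amount REAL
        );
    """)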

Staging Areas: These are databases where data is staged, integrated, and stored, prior to being made available through data warehouses and data marts.

MDM Hub (Master Data Management): Master data includes the core business and reference entities, such as customers, products, and locations.

ODS (Operational Data Store): ODS stores the data resulting from operating a business, such as orders, deliveries, receipts, issues, purchase orders, etc. The ODS can also act as a data integration hub and data warehouse.

PDS (Process Data Store): Data for supporting business process activity to provide near real-time business intelligence. This data is associated with BAM (Business Activity Monitoring). The PDS is a relatively new kind of database that is gaining in importance.

Data Mining Database: This database contains data that is optimized for use by data mining software. For data mining, it is best to store the data in flat structures, unlike the data mart described above. (A flattening sketch follows.)
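
To show what "flat" means in practice, the hypothetical SQL below (held as a Python string, and assuming the star schema tables from the earlier sketch) denormalizes facts and dimensions into one wide table that mining tools can read directly.

    # Assumes the fact_sales, dim_customer, and dim_date tables sketched earlier.
    flatten_for_mining_sql = """
        CREATE TABLE mining_sales AS
        SELECT c.customer_segment,
               d.fiscal_quarter,
               f.quantity,
               f.sales_amount
        FROM fact_sales f
        JOIN dim_customer c ON c.customer_key = f.customer_key
        JOIN dim_date     d ON d.date_key     = f.date_key;
    """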

Big Data: Data that is so voluminous (over 2TB) that it cannot be managed using traditional databases such as relational databases. This data is typically unstructured, and consists of text, images, video, and audio. This data is spread across many servers.

So what is the difference between a data warehouse and a data mart? Sometimes these terms are used interchangeably, but technically they are not the same thing. For the purposes of our architecture, we will distinguish between the two.


The data warehouse is:

  Focused on integration and storage.

  Loaded quickly to provide users with the most current data.

  Organized by subjects, such as customer, product, and so forth.

  May contain both normalized data, called an atomic schema, meaning one fact or piece of information in one place, and denormalized data, which allows planned data redundancy.

Data Warehouse Architecture Patterns

The choice of where and how to store the data for the data warehousing system is a critical architectural question. Part of the issue is the data warehousing architecture pattern, which is explained in Table 04-04. Data warehousing patterns include:

  Independent Data Marts

  Coordinated Data Marts

  Centralized Data Warehouse

  Hub and Spoke

  Federated.

Table 04-04: DW Architecture Patterns

Independent Data Marts: Multiple databases containing analytic data are created and maintained by different organizational units. The databases tend to be inconsistent with each other, having different dimensions, measures, and semantics. There is "no single version of the truth" and it is a challenge to perform cross-data-mart analysis. These weaknesses make the independent data mart the least desirable of the architecture patterns.

Coordinated Data Marts: Data is harmonized across multiple analytic databases or areas of databases. Each database meets a specific need, such as customer segmentation or product analysis. There are shared dimensions and semantics between the databases; for example, customer and product dimensions are consistent across the databases. Consistent dimensions and semantics support a logical enterprise view.

Centralized Data Warehouse: An Enterprise Data Warehouse (EDW) contains atomic data, summarized data, and dimensional data, provided through a single database. Logical data marts are built from this database to support specific business requirements. Analysis between subject areas is facilitated by data being stored in a single database.

Hub and Spoke: The hub and spoke architecture is similar to the centralized data warehouse architecture, except that there are physical data marts instead of logical data marts. A cost and complexity element is added due to the need to copy data from the central data warehouse to the distributed data marts.

Federated: The federated data warehouse architecture makes data that is distributed across multiple databases look to users like a single data source. Common keys, such as customer and product identifiers, are used to tie the data together. This approach reduces the expense and time needed for data movement, but it may have high overhead and slow response times because queries between data sources in multiple locations can be inefficient.

Data Model Patterns for Data Warehousing

A data model is a graphical view of data created for analysis and design purposes. While architecture does not include designing data warehouse databases in detail, it does include defining principles and patterns for modeling specialized parts of the data warehouse system.

Areas that require specialized patterns are:

  Staging / landing area – looks like the source systems

  Atomic Data Warehouse – uses a normalized ERD (Entity Relationship Diagram)

  Data mart – uses dimensional modeling: the star schema, the snowflake schema, or the cube.

In addition to these specialized patterns, the architecture should include other pattern descriptions (an example follows this list) for:

  Naming of tables and columns

  Assignment of keys

  Indexing

  Referential Integrity (RI)

  Audit history.
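
As a hypothetical example of such pattern descriptions, the DDL below (held as a Python string) shows one possible set of conventions: a dim_ table prefix, a surrogate key with a _key suffix, an index on the natural key, and audit-history columns. These conventions are illustrative, not prescriptive.

    dim_product_ddl = """
        -- Naming: dim_ prefix for dimensions; surrogate keys end in _key.
        CREATE TABLE dim_product (
            product_key     INTEGER PRIMARY KEY,  -- surrogate key
            product_code    TEXT NOT NULL,        -- natural (business) key
            product_name    TEXT,
            effective_date  TEXT NOT NULL,        -- audit history: valid from
            expiration_date TEXT,                 -- audit history: valid to
            load_batch_id   INTEGER               -- ties the row to its ETL run
        );
        -- Indexing: support lookups by the natural key during loads.
        CREATE INDEX ix_dim_product_code ON dim_product (product_code);
    """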

A Data Modeling Tool is a software tool that provides a way to analyze database requirements using a graphical representation of the data needed to support the business, and then design the database(s) that will house the data. In other words, pictures are used to describe the data. On the surface, this looks like a drawing tool for creating graphics, but a data modeling tool provides a detailed design capability, which includes the design of tables, columns, relationships, rules, and specifying business definitions. A Data Modeling Tool will enable you to design a better database and also to communicate with team members about the data.

Data modeling is described in greater detail in Chapter 6, Data Modeling, and Chapter 7, Dimensional Modeling.

Operational BI

Figure 04-04, Operational BI, shows operational business intelligence: decision-making about the business for more tactical use. Distinctive features of Operational BI include: Process Data Stores, Business Rule Engines, Business Activity Monitoring, Virtual Data Warehouses, and Complex Event Processing. Operational BI supports the organization that needs to make decisions such as whether to upgrade a customer from coach class to first class, or whether to grant credit. Immediate answers may be required, and data stores such as the ODS and the PDS may be consulted. Those data stores are updated in near real-time fashion, which means the data is as up to date as is practical.

Figure 04-04: Operational BI (See oval entries.)

The data that populates transactional systems and the data warehouse is the tip of the iceberg compared to the multi-terabyte volumes of unstructured data known as Big Data. Your organization may have upwards of 80 percent of its data stored as text, images, video, and audio. Technical architecture needs to account for this Big Data factor.

Database Versions

Database categories such as data warehouse, data mart, ODS, and PDS were just described. During the system lifecycle, which includes the development, testing, and ultimate use of these databases, multiple versions of each database should be maintained. Data warehouse architects need to be aware of and plan for multiple databases and versions of databases. Figure 04-05 shows an example of multiple database versions.

Figure 04-05: Database Versions

Development is where all sourcing, loading, application, and reporting SQL and program code is coded and prepared. ETL developers, DBAs, and data modelers are also developing the databases. One or more versions of the data warehouse database will be needed to create and try out the proposed designs. In Figure 04-05, a number is appended to the database role name to indicate a version number. In this example, there are three copies of the development database so that multiple people can develop at the same time without interfering with each other.
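
A minimal sketch of how these versions might be tracked in configuration, assuming a hypothetical naming scheme that appends the version number to the role name as in Figure 04-05:

    # Hypothetical database names per environment.
    environments = {
        "development": ["DW_DEV1", "DW_DEV2", "DW_DEV3"],  # parallel dev copies
        "test":        ["DW_TEST1", "DW_TEST2"],           # QA and user testing
        "staging":     ["DW_STAGE1"],                      # initial load preparation
        "production":  ["DW_PROD1", "DW_DR1"],             # live plus disaster recovery
    }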

Once the database has been designed and developed, and coding completed and unit tested, there will be a need to perform testing on the data, so multiple versions of the test database may be needed. Testing may be performed by QA (Quality Assurance) people, often with the participation of business users.

After testing has been completed, a staging version of the database is needed to prepare for the initial data load for production. This is not the same data staging that is part of the daily data movement pipeline.

When the system is ready to “go live” and people are ready to use it for business, then a production version of the system is created. In addition, the architectural plan should cover disaster recovery (DR), in case the production database goes down. This could mean a backup copy of the database.

Decision Making, which includes Business Intelligence, is the part of the data warehousing system where business users make use of the data stored in the data warehouse by analyzing it and preparing presentations. Business people tend to act either like farmers, harvesting a crop of known information, or like explorers, seeking new patterns. The data warehouse architecture must support both types of access.

Data warehouse architecture includes the selection and use of the following types of BI/Analytics tools:

  Query Tools are business intelligence software tools that select data through queries and present it as reports and/or graphical displays. The business user or analyst is able to explore the data and produce the reports and outputs needed to understand that data.

  Reporting Tools are software tools that produce repeatable reports and graphic outputs. Reports are output in predefined formats, often on a schedule, unlike the output of query tools, which are more ad hoc in nature.

  Data Mining Tools are software tools that find patterns in stores of data or databases; these patterns are useful for predictive analytics and optimization analytics. See Chapter 14, Data Mining and Analytics, for more information about this topic.

  Statistics Tools are software tools that enable the analysis of data. There is substantial overlap between statistics tools and data mining tools: both can be used to describe and understand data, as well as to draw conclusions about it and make predictions. Statistics tools typically work with data samples, while data mining tools work with larger volumes of data (Big Data).

  Data Profiling Tools are software tools that support understanding data through exploration and comparison. These tools are used to access existing data and explore it. The tool looks for patterns such as typical values, outlying values, ranges, and allowed values, so that you can better understand the content and quality of the data. It is very important to not just look at the definition of the data, but to actually look at the content of the data to see what is there. Application developers are notorious for using data columns for unintended purposes, so knowing what should be in a column and knowing what's actually populated in that column is very important for accuracy in reporting.
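
A minimal profiling sketch in Python using the pandas library, assuming a hypothetical customer_extract.csv file; dedicated profiling tools do far more, but the core idea is the same.

    import pandas as pd

    # Hypothetical extract of a source table to be profiled.
    df = pd.read_csv("customer_extract.csv")

    for column in df.columns:
        print(f"Column: {column}")
        print(f"  null count:      {df[column].isna().sum()}")
        print(f"  distinct values: {df[column].nunique()}")
        # The most frequent values often reveal unintended uses of a column.
        print(f"  top values:      {df[column].value_counts().head(3).to_dict()}")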

Chapter 12, Business Intelligence Operations and Tools, supplies additional information about each tool category.

The topics of data governance, metadata and data quality were introduced earlier in this chapter under the heading of Data Architecture.

Metadata

Supporting this process is metadata, which is information that describes the whole process and the data. System monitoring, a function that supports all areas, tracks what is happening as data moves from one place to another. More details are provided in Figure 04-03, Data Warehousing Architecture.

A Metadata Repository is a software tool that has both a user interface and a database component to record, maintain, and look up data that describes other data. There are two kinds of metadata: business metadata and technical metadata. A metadata repository should enable you to build a metadata dictionary, which describes important business terms, as well as the technical components of the data warehouse, such as tables, columns, data maps, and indexes. See Chapter 8, Data Governance and Metadata Management, for more information.
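
As a toy illustration of the two kinds of metadata, the sketch below records technical and business metadata for a single column; real repositories are far richer, and all names here are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class ColumnMetadata:
        table: str          # technical metadata: where the column lives
        column: str
        data_type: str
        business_term: str  # business metadata: what the column means
        definition: str

    entry = ColumnMetadata(
        table="dim_customer",
        column="customer_segment",
        data_type="TEXT",
        business_term="Customer Segment",
        definition="Marketing grouping used for campaign targeting.",
    )
    print(entry)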

The successful data warehouse system is built on a fundamental framework of hardware and software known as the infrastructure. Data warehouse technical architecture includes a big-picture view of the infrastructure as well as recommendations concerning the components that make up the infrastructure.

A view of a data warehouse infrastructure from a layered perspective is illustrated in Figure 04-06. Components that can influence data warehouse and business intelligence performance are described in this section, including:

  Communication Channels

  Storage Devices

  Database Servers

  Database Appliances

  ETL Servers

  Query and Reporting Servers

  Analytics Servers

  Applications

  Workstations.

The diagram shows that disk drives or disk pools can be shared by multiple database servers. The database servers can in turn be configured as a cluster, enabling high availability and redundancy. Multiple databases can be hosted by a single database server.

Figure 04-06: Data Warehouse Layered Infrastructure Architecture

Servers are computers configured to host or run specific applications or services. They may be designed as standalone computers or in a group of load sharing computers known as a cluster. A single physical computer or cluster of computers can support multiple servers by employing virtual machines. Each virtual server has its own operating system (OS) and can be controlled independently. Virtual servers offer flexibility with a potential loss of performance due to shared resources.

In the case of data warehousing, each server tends to be assigned to a specific role. When developing data warehouses it is possible to place multiple functions on a single workstation computer. My main workstation computer is loaded with a variety of database management software and other tools. This will not work well for a production system with more than a few users or with over 200 GB of data.

The base characteristics of a server can be described in a specification such as the one shown in Table 04-05. Each high-level component of the server is identified; together these characteristics describe the server configuration. I recommend that your organization standardize on configurations for each server role: database server, calculation server, ETL server, application server, web server, etc. These standard configurations will be part of the data warehouse reference architecture.

Table 04-05: Database Server Specification Example

Processor: Intel Itanium 3.22 GHz

Virtual: No

Operating System: Microsoft Windows Server 2008

Server Model: Dell PowerEdge 600x

CPU / Core Count: 4 CPUs with 4 cores each – 16 cores total

Internal Memory: 48 GB

Internal Storage: 4 x 300 GB 15,000 rpm HDD

Fiber Channel HBAs: 8 x Dual Port 4 Gbps

LAN Connections: 4 x 10/100/1000 Ethernet

Database Software: Microsoft SQL Server 2008 R2 Enterprise

These specifications are heavy in the use of acronyms, abbreviations and technical terms. Table 04-06 contains terms and definitions to help you to interpret server specifications. In addition, this chapter includes further explanations about components of the infrastructure architecture.

Table 04-06: Infrastructure Terms

Cache: Main memory that is reserved to contain data. Performance is improved because main memory is faster to access than HDD data.

Core: A processor within a CPU chip that can execute instructions; an engine within a physical CPU.

CPU: Central Processing Unit – a physical chip that handles the processing for a computer.

Fiber Channel HBA: A high speed network connection typically used to communicate with HDDs. HBA stands for Host Bus Adapter.

GHz: Gigahertz – a measure of CPU cycle speed. Expect 1.5 to 4 GHz.

Gigabyte (GB): 1 billion (1,000,000,000) characters.

Grid: A high speed network of computers that are utilized to solve a single problem in parallel.

HDD: Hard Disk Drive.

LAN: Local Area Network.

MPP: Massively Parallel Processing – multiple independent processors are coordinated to produce results.

Petabyte (PB): 1 quadrillion (1,000,000,000,000,000) characters.

RAID: Redundant Array of Independent Disks – data is stored across multiple hard disks, with the same data being stored in multiple places.

RPM: Revolutions Per Minute – the rotation speed of hard disk drives. Current top speed is 15,000 RPM.

SAN: Storage Area Network – a pool of disk drives that is accessed through a high speed network connection.

SSD: Solid State Disk Drive.

Server: A computer system configured to host or run specific applications or services.

SMP: Symmetric Multi-Processor – a computer system consisting of multiple tightly coupled processors of the same type, running the same operating system and sharing resources.

Terabyte (TB): 1 trillion (1,000,000,000,000) characters.

Virtual Server: A server that runs on a Virtual Machine, which is software that gives the appearance of a physical computer, including support for an operating system.

 

Communication Channels – sometimes referred to as “the pipe” – are the means of carrying data between devices. Types of communication channel include:

  Ethernet Network – A family of networking protocols that enable Local Area Networks (LANs). Multiple transmission rates are available, ranging from 1 Gigabit per second to 100 Gigabits per second, with 10 Gigabit considered high performance.

  Fiber Channel HBA – a high speed network optimized for use in connecting to disk drives.

  InfiniBand – A high speed connection (up to 120 Gigabits per second) that is used to connect processors, such as within a data warehouse appliance.

Storage Devices – devices that store data include Hard Disk Drives (HDD) and Solid State Disk (SSD) devices. Hard disk drives utilize a physical spinning disk, which can be the factor that slows down the data warehouse. Make sure to use 15,000 RPM disks or faster, rather than the roughly 5,000 RPM disks found on workstations and home computers. SSDs have no moving parts and so are much faster than HDDs, roughly equivalent to 50,000 RPM. Unfortunately, SSDs may have longevity issues, supporting a limited number of data rewrites.

The size of the storage device is a critical consideration. In general, it is faster to store data on four 250 GB devices than on a single one TB device because the four devices can be searching data in parallel.
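
A back-of-the-envelope calculation illustrates the point, assuming a hypothetical sequential scan rate of 150 MB per second per device:

    scan_rate_mb_s = 150    # assumed per-device sequential scan rate
    data_mb = 1000 * 1024   # 1 TB of data to scan

    one_device_min = data_mb / scan_rate_mb_s / 60          # all data on one device
    four_devices_min = (data_mb / 4) / scan_rate_mb_s / 60  # striped across four

    print(f"1 x 1 TB device:    {one_device_min:.0f} minutes")    # ~114 minutes
    print(f"4 x 250 GB devices: {four_devices_min:.0f} minutes")  # ~28 minutes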

Storage devices can be organized into a Storage Area Network (SAN) such as the example described in Table 04-07. SANs commonly have 50 to 100 individual HDDs to improve performance and reliability. The SAN can provide data to multiple database and application servers through a high speed Fiber Channel network. This approach is an economical way to organize and share data.

Table 04-07: Storage Area Network (SAN) Specification Example

Fiber Channel HBAs: 8 x Dual Port 4 Gigabits per second (Gbps)

SAN Switch: 16 port SAN Fiber Channel Switch

External Storage: 1 x IBM DS4800 with 10 x EXP810 Expansion Units – 80 x 300 GB 15K rpm HDD

RAID Type: RAID-5 (striping with parity)

Raw / Usable Storage Space: 24 TB / 8 TB

 

Database Servers – systems that run database management software such as IBM DB2, Microsoft SQL Server, or Oracle. A database server may also run specialized columnar database software such as SybaseIQ or Vertica. Database servers are typically based on general purpose SMP architecture, which can also be used for other purposes, such as running application software.

Database software manufacturers provide reference architecture documents, which show example configurations that support specified performance levels. For example, Microsoft has developed a “Fast Track” architecture, which shows the sizes of database servers needed to support SQL Server 2008 for data warehouse workloads. The specification includes the amount of storage supported (in terabytes), processor specification (number of CPUs and cores), main memory, operating system, DBMS version, fiber channel HBAs, and disk storage.

Data Warehouse Appliances – a specialized system that is optimized to provide database services using MPP architecture. The Data Warehouse Appliance includes multiple tightly coupled computers with specialized functions plus one or more arrays of storage devices which are accessed in parallel. Specialized functions include system controller, database access, data load and data backup.

Data Warehouse Appliances provide high performance. They can be up to 100 times faster than the typical Database Server. Consider the Data Warehouse Appliance when more than 2TB of data must be stored.

See Chapter 10, Database Technology, to learn more about database technology including data warehouse appliances and columnar databases.

ETL Servers – servers that run Extract, Transform and Load (ETL) software such as Informatica PowerCenter or Microsoft SQL Server Integration Services (SSIS). The ETL software should be placed on its own server.

Query and Reporting Servers – servers that support query and reporting software such as IBM Cognos or SAP Business Objects. The query and reporting software should be placed on its own server.

Analytics Servers – servers that support data mining and statistical software. This software should be placed on its own server. Some analytic software places a heavy load on hardware. For example, software that projects insurance capital requirements may generate a large number of scenarios and require hundreds of CPUs and/or cores in order to complete calculations in a timely manner. This may require high performance computing approaches such as grid computing, supercomputing or cloud computing.

Applications – systems that are the data sources for the data warehouse systems. See the discussion of Data Sources in Chapter 9, Data Sources and Data Quality Management.

Workstations – computers used directly by data warehouse end users. Workstations may include multiple large display screens to support data visualization.

Now we will catalog the technologies selected to support our warehousing architecture. The data warehouse technology stack is the set of technologies selected for a data warehouse implementation. Table 04-08 is an example of the data warehousing technology stack for First Place Toys, Inc.

Table 04-08: Data Warehousing Technology Stack Example

DBMS for Data Warehouse: Microsoft SQL Server 2008 R2

DBMS for Data Marts: Microsoft SQL Server 2008 R2

Data Integration Tool: Microsoft SSIS 2008

Query and Report Software: Microsoft SSRS 2008

Statistics Software: R (Open Source)

Data Mining Software: RapidMiner from Rapid-I

Data Modeling Tool: Computer Associates ERwin 8.2

Scenario Generator: Internally developed

Cloud computing can be an effective way of managing data warehouse infrastructure and the technology stack. Using cloud computing, a data warehouse environment can be rapidly established by purchasing services from either internal or external sources. This means less money and time to reach productivity; however, security and performance issues must be considered.

What is cloud computing? The National Institute of Standards and Technology (NIST) has created a practical and often-quoted definition.

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models. (NIST 2012)

A cloud computing environment includes the following essential characteristics (NIST 2012):

  On-demand self-service – the cloud environment enables consumers to provision system resources (e.g. data storage, server time) through system requests, without the need to interact with provider personnel

  Broad network access – a variety of customer devices such as workstations, laptops and smart phones can access cloud resources over the network.

  Resource pooling – resources such as server time and data storage are shared among multiple tenants and allocated based on consumer demand

  Rapid elasticity – virtually unlimited cloud resources can be acquired and released

  Measured service – use of cloud resources via services is monitored and consumers are charged on a pay-per-use basis.

Cloud computing is deployed “as a Service” (aaS) and includes a number of service models:

  Data as a Service (DaaS) – data and/or database services (e.g. Microsoft SQL Azure) are provided

  Software as a Service (SaaS) – consumers are provided complete applications (e.g. Netsuite®, Oracle On Demand®, salesforce.com®) through thin client interfaces such as web browsers or mobile applications

  Platform as a Service (PaaS) – providers (e.g. Amazon Web Services®, force.com®, Microsoft Windows Azure®) supply consumers with a platform including computer languages, libraries, services and utilities which can be used to support consumer provided applications

  Infrastructure as a Service (IaaS) – providers (e.g. Amazon Web Services®, Rackspace®, Verizon®) supply consumers with basic computer resources such as CPU, networks and data storage enabling the customer to deploy supporting software and applications.

Clouds can be deployed through a number of delivery modes (NIST 2012):

  Public cloud – the cloud provider makes its service available to the general public, so consumer resources are pooled with those of other consumers in the general public

  Private cloud – the cloud provider makes its service available for the exclusive use of a single organization with infrastructure being on-premises or off-premises

  Community cloud – the cloud provider makes its service available to a group of organizations or consumers who share common interests such as an industry association

  Hybrid cloud – a cloud that has characteristics in common with multiple other modes (e.g. public, private, community clouds).

Data warehousing in the cloud can provide a number of benefits. Advantages of cloud-based data warehousing include:

  Rapid start-up – a lengthy process to acquire and build infrastructure is avoided

  Avoid over-buying – a data warehouse can start small and grow dynamically, avoiding the need to buy excess capacity to support future growth

  Support peak periods – capacity can be increased to support processing for month-end and year-end peak periods, for example

  Scalability and elasticity – resources and their costs can be easily scaled up and cut back as needed

  Reduced costs – economies of scale drive down the cost of infrastructure

  Load balancing – additional computer resources can be utilized to maintain high availability and performance

  Environment support – multiple environments such as development, test, staging, and disaster recovery are supported.

Data warehousing in the Cloud does present challenges. The Cloud is not the best approach for every situation. Challenges associated with the Cloud include:

  Importing data – mechanisms must be developed to provide enterprise data to the cloud

  Communication channels – data movement over the network can be a bottleneck

  High costs for data transfer – cloud vendors often charge by the volume of data moved to the cloud

  Data latency – the data in the cloud may not be current

  Query speed and computation speed – cloud solutions are often slower than on-premises platforms such as data warehouse appliances

  Control of security – requirements for security must be considered such as authentication, authorization, vulnerability and security event monitoring, and data access logging

  Legal protection – data could be lost or unavailable in cases where the cloud provider runs into legal problems.

Use of best practices can help to overcome many of the challenges of data warehousing in the cloud.

  Standards – use standard APIs and languages to provide flexibility in case it is necessary to change cloud vendors

  Reference architecture – develop architectural patterns that can be consistently applied to the cloud including data warehousing

  Plan for required workload – determine the patterns and volumes of likely workload

  Network bandwidth – make sure that there is enough network bandwidth: get a big pipe

  Upload changes only – send only changed data to the cloud rather than complete data refreshes, to reduce bandwidth costs and improve performance (see the sketch after this list)

  Determine best delivery model – private clouds or hybrid clouds may address the weaknesses of public clouds

  Security – require the cloud provider to supply security features such as strong authentication, data encryption, and audit logs of data access

  Monitor performance – detect and avoid problems such as “killer queries” that slow performance

  High performance guarantees – negotiate SLA guarantees that support calculation-intense activities and timely data warehouse queries

  High availability – negotiate high availability SLA guarantees

  Contingency – have a backup plan in case of service outages or other problems, which includes a backup copy of the data.
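
For the "upload changes only" practice, a common technique is a high-water mark on a last-modified timestamp. A minimal sketch, assuming a hypothetical orders table with a last_updated column:

    def extract_changes(conn, last_watermark):
        """Pull only rows changed since the previous run (high-water mark)."""
        rows = conn.execute(
            "SELECT order_id, amount, last_updated FROM orders "
            "WHERE last_updated > ? ORDER BY last_updated",
            (last_watermark,),
        ).fetchall()
        # Persist the new watermark so the next run picks up where this one ended.
        new_watermark = rows[-1][2] if rows else last_watermark
        return rows, new_watermark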

Data warehousing in the cloud provides many opportunities to improve results, increase flexibility and reduce costs. This is a technology that should be considered as part of data warehouse technical architecture.

To achieve benefits from business intelligence, it is important to manage the data warehouse to make sure that it continues to provide value and that risk of loss is avoided. This includes activities such as:

  Obtaining feedback from BI users

  Assuring the quality of the data

  Monitoring the performance of the data warehouse

  Securing the data warehouse from threats

  Enabling governance of the overall system.

The management of data warehousing is further described in Chapter 19, Testing, Rolling Out, and Sustaining the Data Warehouse.

Data Warehouse Security

It is essential that data warehouse data be secured both when it is in motion and when it is at rest. Data that is being copied as part of data integration is in motion while data stored in a database such as the data warehouse is at rest.

All data should be assigned a security classification based on its sensitivity, such as:

  Public – data can be freely shared.

  Confidential – data must be protected, such as credit card numbers, social security numbers, or account balances.

  Highly Confidential – data is of strategic importance such as plans to acquire a company.

Data must be secured appropriately. Those who access data must have appropriate security clearances, depending on their roles and need to access data. This includes login authentication and determination of access permissions. Network, database and application security controls should regulate who can access data. Sensitive data should be encrypted both when stored in databases and when on the move through ETL.

Data masking techniques should be used to hide sensitive data from view. This can be used for test data as well as for data displays.
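
A minimal masking sketch for a credit card number, showing only the last four digits; production masking tools offer format-preserving and deterministic options well beyond this.

    def mask_card_number(card_number: str) -> str:
        # Replace all but the last four digits, e.g. for test data or displays.
        return "*" * (len(card_number) - 4) + card_number[-4:]

    print(mask_card_number("4111111111111111"))  # ************1111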

This section explains things that you can and should do to make your data warehouse architecture more sound and flexible.

First, begin to work with real data early. That means looking at your data sources and understanding what is there, rather than just theorizing about the data and not understanding what it really is.

Second, I recommend using a consistent technology stack. You should decide on technology stack items such as the data integration tool, metadata repository, and so forth. Deciding on a technology stack and sticking to a consistent tool for each task makes data warehouse tasks easier. You will build up skills and capabilities in each area and be able to reuse work between areas.

Third, work with data source experts. Data source experts are people who understand the data sources, including people in the business who are working with the systems – business experts. People in information technology who work with that data and make changes for maintaining programs and so forth can also be included.

Fourth, develop architecture for the real world, which has multiple data sources and multiple data marts, using a federated approach. Generally, you will find that it is unrealistic to assume there will be just a single data source, a single data warehouse, or a single data mart, because multiples may be required. New systems often include something called a data mart, which might be reused or included in the overall federated approach.

Fifth, I advise including external data such as syndicated data, which can really improve the value delivered by the data warehouse system. Include unstructured data such as text, images, audio, and video in your plans. This is important to the modern data warehouse.

Finally, I recommend designing for scalability. Work with infrastructure architects who can design for growth. A scalable system can grow by expanding the hardware and software, enabling the data warehouse to support more data and more work.

On the other hand, there are traps – things that you should avoid. Become aware of the following traps of data warehouse architecture.

First, do not attempt to do the do-all data warehouse on your first project. In other words, avoid building a big data warehouse to cover everything. It will take too long, get complicated, and probably not deliver the value that everyone is expecting. Begin with smaller steps and successes.

Second, do not build a rigid system. Make sure that you create a flexible system that can be enhanced and expanded as the need grows.

Third, avoid using a variety of tools, which is just the opposite of a consistent technical stack. If each person picks only the tools they are comfortable with, it is going to be more complicated and expensive to build and maintain. Flexibility and consistency will be lost and people will be unable to move between projects. Instead, use a consistent technical stack.

Fourth, do not start without preparation. Obtain training on how to use new tools and learn what the best practices are. Attend classes, and if your budget permits, bring in consultants who can demonstrate how to use the tools for a given project to reduce the learning curve.

Fifth, do not save the most difficult work for last. This is important, because you may get to the end of the project, discover problems, and be unable to get the most difficult parts of the project to work. Instead, tackle the most difficult parts of the project early and the rest of the project is more likely to be straightforward.

Sixth, do not report on staged data. As illustrated in the warehouse diagram, the staged data area contains data that comes directly from its data sources; it has not been cleansed and integrated for use by the analytics tool. Ralph Kimball explicitly warns against using staged data for reporting and analytics. (Kimball 2008)

Finally, avoid fighting a religious war. You may encounter people who subscribe to the Ralph Kimball, William Inmon, or Agile philosophies of data warehouse architecture. Each has its strengths and weaknesses which can be adapted for your project. A religious war would mean spending your time debating the differing views and deciding between them, rather than producing results.

Build your knowhow in the areas of data warehousing technical architecture using these resources.

Visit a Website!

Wikipedia provides a great overview of data warehousing, including a history beginning in the 1960s:

http://en.wikipedia.org/wiki/Data_Warehouse

http://en.wikipedia.org/wiki/Data_warehouse_appliance

Get Research!

Search the web for research reports (filetype:pdf):

  Forrester Wave Data Warehouse

  Gartner Magic Quadrant Data Warehouse

Read about it!

This book shows how to architect data warehouses using Microsoft SQL Server technology.

Rainardi, Vincent. Building a Data Warehouse: With Examples in SQL Server. Apress, 2008.

Include unstructured data to take your data warehouse to the next level.

Inmon, W. H. and Krish Krishnan. Building the Unstructured Data Warehouse. Technics Publications, LLC; First edition (January 1, 2011).

Key Points

  Technical Architecture Principles provide non-functional requirements such as adaptability, extensibility, scalability, and understandability.

  The Data Warehousing Architecture Picture includes data sources, data movers (ETL), data integration and storage, and data that are exposed for analysis.

  Metadata, often defined as “data about data”, provides critical documentation about the DWBI system.

  The Data Warehouse Architectural Pattern is a critical choice. The top patterns include independent data marts, coordinated data marts, centralized data warehouse, hub and spoke, and federated.

  The Data Warehouse Technical Stack is the selection of tools used to analyze, design, build, operate, and maintain the data warehouse. Tool categories in the technical stack include metadata repository, data modeling tool, data movement tools, databases, and reporting/BI tools.

  The Data Warehouse Infrastructure Architecture is the combination of hardware and software that enables the operation of the data warehouse. Infrastructure elements include servers, storage devices, communication channels and operating systems.
