SQL Server Components


SQL Server 2005 consists of a number of integrated components, as shown in Figure 2-1. When you run the SQL Server installation program on a server, you can choose which of these services to install. We focus on those components relevant to a BI solution, but SQL Server also includes all the services required for building all kinds of secure, reliable, and robust data-centric applications.

Figure 2-1 SQL Server components

Development and Management Tools

SQL Server includes two complementary environments for developing and managing BI solutions. SQL Server Management Studio replaces both Enterprise Manager and Query Analyzer used in SQL Server 2000. SQL Server Management Studio enables you to administer all the aspects of solutions within a single management environment. Administrators can manage many different servers in the same place, including the database engine, Analysis Services, Integration Services, and Reporting Services components. A powerful feature of the Management Studio is that every command that an administrator performs using the tool can also be saved as a script for future use.

For developing solutions, BI developers can use the new Business Intelligence Development Studio. This is a single, rich environment for building Analysis Services cubes and data mining structures, building Integration Services packages, and designing reports. BI Development Studio is built on Visual Studio technology, so it integrates well with existing tools such as source control repositories.

Deploying Components

You can decide where you would like to install the SQL Server components based on the particular environment into which they are deployed. The installation program makes it easy to pick which services you want to install on a server, and you can configure them at the same time.

In a large solution supporting thousands of users with a significantly sized data warehouse, you could decide to have one server running the database engine to store the data warehouse, another server running Analysis Services and Integration Services, and several servers all running Reporting Services. Each of these servers would, of course, need to be licensed for SQL Server, because a single server license does not allow you to run the various components on separate machines.

Figure 2-2 shows how these components fit together to create an environment to support your BI solution.

Figure 2-2 SQL Server BI architecture

SQL Server Database Engine

The SQL Server 2005 database engine is the service that provides support for relational databases. It is a highly reliable, available, and scalable database server. As shown in Figure 2-2, its primary role in BI solutions is to be the long-term authoritative data store for enterprise data from which Analysis Services will build OLAP databases and cubes.

Usually a single server machine runs a single instance of the database engine, but each server can be configured to run more than one instance of the database engine at the same time if necessary. Each instance has its own completely separate set of databases and tables. It is usually better to have a single instance on a machine because the overall performance of the server will be improved and the administration simplified, but sometimes it is useful to be able to run multiple isolated environments on the same large server, such as when you need to run different versions or service pack levels to support existing applications. SQL Server 2005 supports up to 50 instances on a single server.

Management

SQL Server 2005 greatly simplifies the job of managing a database because it is largely self-tuning. Whether you need to manage a few or hundreds of SQL Servers, tools are included that will reduce the cost of implementing and maintaining SQL Server instances. Most management operations can be performed with the database online.

For backup and index maintenance operations, a wizard will create and deploy scheduled maintenance plans. It requires only a few minutes to implement a maintenance plan for a SQL Server instance. Backups of the data or the log are performed online, so there is no need to take the database down to create a backup. Most index reorganizations or rebuilds are also performed with the database online.

To help track down resource-intensive queries or operations, SQL Profiler can monitor all statements or a subset of the statements sent to SQL Server, as shown in Figure 2-3. You can see the statement, who issued it, how long it ran, and many other metrics and attributes. You can view the list directly or record it to a table or file. You can capture the text of a query and have the Database Engine Tuning Advisor provide suggestions for creating new indexes to help the query perform better.

Figure 2-3 Analyzing query performance with SQL Server Profiler

SQL Server is extensively instrumented to provide information about how it is using resources. To monitor the resource consumption or internal metrics of SQL Server’s operation, you can run the System Monitor (previously known as Perfmon). You can use this information to determine or predict requirements for additional resources.

If you are managing more than a few SQL Servers, a separate product called Microsoft Operations Manager (MOM) can collect events and errors raised by each instance and filter out the important information for forwarding to the system operators. With the MOM interface, you can see at a glance the state of any of the servers. Predictive logic in MOM can alert you to potential issues before they become critical issues.

Scheduled Execution of Processes

SQL Server Agent is a Windows service that can schedule and execute jobs. A job is a sequence of one or more steps. A step can invoke an operating system command, a SQL script, an Analysis Services query or command, an Integration Services package, an ActiveX script, or a replication management command. The job can be executed on demand or scheduled to recur on a regular basis. The sequence of job steps can be conditional on the results of a prior job step. Notifications of the job’s success or failure can be sent by e-mail, pager, or the net send command.
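The idea of job steps whose sequence depends on the result of a prior step can be illustrated with a small Python model. This is only a sketch of the concept and not the SQL Server Agent interface: the Step class, the run_job function, and the on_success and on_failure fields are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One job step (hypothetical model, not a SQL Server Agent type)."""
    name: str
    action: Callable[[], None]        # raises an exception to signal failure
    on_success: Optional[str] = None  # next step name, or None to end the job
    on_failure: Optional[str] = None  # next step name, or None to end the job


def run_job(steps: List[Step]) -> List[str]:
    """Run the first step, then follow the success/failure links.

    Returns a log of "name:outcome" entries in execution order.
    """
    by_name: Dict[str, Step] = {step.name: step for step in steps}
    log: List[str] = []
    visited = set()
    current = steps[0] if steps else None
    while current is not None:
        if current.name in visited:
            raise ValueError(f"step {current.name!r} would run twice")
        visited.add(current.name)
        try:
            current.action()
        except Exception:
            log.append(f"{current.name}:failed")
            next_name = current.on_failure
        else:
            log.append(f"{current.name}:succeeded")
            next_name = current.on_success
        current = by_name[next_name] if next_name else None
    return log


def load_sales() -> None:
    # Stand-in for a step that fails, for example because a file is missing.
    raise RuntimeError("source file has not arrived")


def do_nothing() -> None:
    # Stand-in for a step that succeeds.
    pass


job = [
    Step("load_sales", load_sales,
         on_success="process_cube", on_failure="notify_operator"),
    Step("process_cube", do_nothing),
    Step("notify_operator", do_nothing),
]
job_log = run_job(job)
```

Here the first step fails, so the job skips the cube processing step and goes directly to the notification step.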

Security

SQL Server supports authentication of connections using either Windows authentication alone or Windows and SQL Server authentication. Windows authentication uses your Windows credentials to identify you; SQL authentication uses a login name and password that a SQL Server administrator creates for you. This login is valid only for connecting to a single SQL Server instance.

SQL Server 2005 authentication has been enhanced over that of SQL Server 2000 to provide security rules such as password expiration and strong password requirements. Access to a database is not possible without an authenticated connection. You can use Windows groups or individual accounts to assign database permissions. Using groups makes administration easier because the database administrator (DBA) no longer needs to administer individual users, just a smaller number of groups. Windows authentication is preferred because it provides a single logon for users, it is more secure, and you don’t have to maintain two sets of accounts.

Database permissions can be very granular. You can set read, write, update, delete, or deny permissions at the database level, table level, or column level. In a medical application, you could deny read permission on personal identification data to most users, while allowing reads on a Diagnosis column. This would allow statistical analysis of disease data, without letting a user associate this information with any patient. The ability to invoke any management or database operation, such as backup or database creation, can be granted, or denied, to any user or group. As with all rights and permissions in SQL Server, deny overrides any granting of rights or permissions.
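The rule that deny overrides any grant can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how SQL Server evaluates permissions internally: the is_allowed function, the tuple layout, and the principal and column names are all hypothetical, and the sketch ignores the hierarchy between database, table, and column.

```python
from typing import Iterable, Set, Tuple

# Each entry is (principal, permission, securable). All names are hypothetical.
Entry = Tuple[str, str, str]


def is_allowed(principals: Set[str], permission: str, securable: str,
               grants: Iterable[Entry], denies: Iterable[Entry]) -> bool:
    """Return True only if some grant applies and no deny applies.

    `principals` holds the user's own account plus every group it belongs to.
    """
    def matches(entry: Entry) -> bool:
        principal, perm, target = entry
        return (principal in principals
                and perm == permission
                and target == securable)

    # A single matching deny wins over any number of matching grants.
    if any(matches(entry) for entry in denies):
        return False
    return any(matches(entry) for entry in grants)


grants = [
    ("Analysts", "SELECT", "Patient.Diagnosis"),
    ("Analysts", "SELECT", "Patient.Name"),
    ("Physicians", "SELECT", "Patient.Name"),
]
denies = [("Analysts", "SELECT", "Patient.Name")]

analyst = {"alice", "Analysts"}
physician_analyst = {"bob", "Physicians", "Analysts"}
```

In this model, a user who belongs to both groups still cannot read Patient.Name, because the deny on the Analysts group outweighs the grant on the Physicians group.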

Transmissions “over the wire” can be encrypted to protect data being sent between the client and server. You can also encrypt stored data down to the column level, using certificates, symmetric keys, or asymmetric keys.

Availability

SQL Server provides a wide range of options to ensure availability under almost any circumstances. Whether you need server redundancy simply to be able to perform server maintenance, or geographically remote synchronized databases for disaster recovery, you will find the features and tools to support these requirements. Clustering provides automatic failover to one or more alternative servers connected to the same storage area network (SAN). Database mirroring is a cost-effective solution that provides complete synchronization over the network between databases and fast automatic failover to a completely separate set of servers and storage. Replication offers the ability to synchronize a subset of the database with another active server. All the options maintain transactional integrity.

Scalability

You can run SQL Server on either 32-bit or 64-bit Windows platforms. The file structures are exactly the same on both, so you can freely move databases from 32-bit to 64-bit and back again. The 64-bit architecture gives you the advantages of a much larger and more efficient memory space, and more processors per server. The larger memory space matters because it supports larger data and query caches. You can use up to 32TB of RAM on a 64-bit platform.

You can also add more processors to improve performance or handle a larger workload. Eight-way servers are commonplace now and are appropriate for many situations. For larger data warehouses, you can scale up SQL Server to 128 processors on a single server.

Multi-terabyte data warehouses are supported with good design and infrastructure in most contexts. The maximum single database size is 1,048,516 terabytes. We will avoid becoming famous for stating that this “should be enough for anyone.” However, it is likely enough for the next few years, for most uses.

Support for Very Large Databases

Partitioned tables and distributed partitioned views are two features of the database engine that enhance support for very large databases. A partitioned table appears as a single table to a query, but the rows in the table are physically divided across a number of filegroups in the same database. A distributed partitioned view is similar in concept, but the tables are distributed across several SQL Servers and presented to the user through a view. If the SQL Servers are multiple instances on a single server, this is called simply a partitioned view. These features improve performance through parallel queries, and improve manageability and maintenance because you can treat each partition independently in many respects.
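The way a partitioned table divides rows while still looking like one table can be modeled with a short Python sketch. It assumes range partitioning on a single key column, with each boundary value falling into the partition to its right; the function names and the OrderYear column are hypothetical, and the sketch says nothing about filegroups or the engine's actual storage.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def partition_index(boundaries: Sequence[Any], key: Any) -> int:
    """Return the partition number for `key`.

    `boundaries` must be sorted. With N boundaries there are N + 1 partitions,
    and a key equal to a boundary lands in the partition to its right.
    """
    return bisect_right(boundaries, key)


def partition_rows(rows: List[Dict[str, Any]], boundaries: Sequence[Any],
                   key_column: str) -> Dict[int, List[Dict[str, Any]]]:
    """Group rows by partition number, as a stand-in for physical placement."""
    partitions: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        partitions[partition_index(boundaries, row[key_column])].append(row)
    return dict(partitions)


orders = [
    {"OrderID": 1, "OrderYear": 2003},
    {"OrderID": 2, "OrderYear": 2004},
    {"OrderID": 3, "OrderYear": 2005},
    {"OrderID": 4, "OrderYear": 2006},
]
# Two boundaries give three partitions: before 2004, 2004, and 2005 onward.
placed = partition_rows(orders, [2004, 2005], "OrderYear")
```

A query that filters on one year only needs to look at one entry of the result, which is the intuition behind treating each partition independently.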

Integration Services

SQL Server Integration Services (SSIS) provides the data ETL services that you use to deliver clean, validated data to your data warehouse. Integration Services also enables you to invoke administrative tasks, monitor external events, and maintain audit logs of Integration Services runtime events. The design and runtime environments are totally new in SQL Server 2005, replacing Data Transformation Services (DTS) of SQL Server 2000. DTS packages may continue to be executed, but not modified, because Integration Services has a completely different architecture.

Integration Services is an independent service that you can choose to install and run on any server, as shown in Figure 2-4, regardless of whether the SQL Server engine is installed on that server. You create packages to access, cleanse, and conform source data; load data into the relational data warehouse and Analysis Services databases; and audit the overall ETL process. Packages are usually executed by a job scheduled by the SQL Agent, or an active package can wait on an external event such as the arrival of a file.

Figure 2-4 Integration Services architecture

Designing Packages

BI Development Studio is the development environment for Integration Services packages. You create an Integration Services project, which may contain one or more packages. A graphical designer is used to build the packages, and you can configure most complex tasks or transforms via a wizard. The designer retains metadata about all the data flowing through the package. You can break data flows, insert new transforms, and reconnect the data flow without fear of losing column mappings going into or out of a transform.

A package primarily contains a control flow and, usually, one or more data flows invoked by the control flow. You can think of a control flow as a high-level description of the steps needed to accomplish a major task. For example, the steps to update the data warehouse might be “initiate an FTP download from regional offices,” “load the sales data,” and “load the inventory data.”

The details of how to load the sales and inventory data are not part of the control flow, but are each a separate data flow. The data flow tasks would define the source for the data, which columns you needed, probably some key lookups, validation, and eventually would write the transformed data to the data warehouse.

Defining the Control Flow

Even though our goal is frequently just to move data from our sources to the data warehouse, quite a bit of administration and overhead is required to implement a full production-ready ETL solution. You might need to empty tables, update audit logs, or wait for an event to occur indicating the availability of new data. Some tasks must be performed before others. This is what a control flow is for. Integration Services provides a number of different types of tasks that you can link together to perform all the steps necessary for your ETL solution.

You graphically design the control flow by dragging tasks from the toolbox onto the work surface, as shown in Figure 2-5. Simple tasks do things such as execute an SQL statement, invoke a data flow task, or invoke another package. Variables can be defined and used to pass information between tasks or to other packages. You can define a sequence for their execution by linking one task to another, or you can define a group of tasks that can execute in parallel by putting them in a sequence container and simply linking other tasks to or from the container. You can put a set of tasks in a loop to be executed until some condition is satisfied, or have them repeated while enumerating the values on a list, such as a list of file names to be loaded.

Figure 2-5 Control flow in a package
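The sequencing rules of a control flow (some tasks must finish before others start, while unrelated tasks may run in parallel) can be sketched with the standard graphlib module in Python 3.9 or later. This is an illustration of the ordering idea only, not the Integration Services runtime; the execution_waves function and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_waves(precedence: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves.

    `precedence` maps each task to the set of tasks that must finish first.
    Tasks in the same wave have no dependency on each other, so they could
    run in parallel.
    """
    sorter = TopologicalSorter(precedence)
    sorter.prepare()  # raises graphlib.CycleError if the links form a cycle
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


precedence = {
    "download_files": set(),
    "load_sales": {"download_files"},
    "load_inventory": {"download_files"},
    "update_audit_log": {"load_sales", "load_inventory"},
}
waves = execution_waves(precedence)
```

The two load tasks end up in the same wave, which corresponds to placing them side by side after the download task and linking both to the audit step.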

Other tasks are related to interacting with external events and processes rather than data. You can work with a message queue to send or wait for a message. You can listen for any Windows Management Instrumentation (WMI) event, such as a new file added to a directory, and begin the control flow task when this occurs. You can use a web service to receive data or a command to initiate processing. You can initiate FTP sessions to send or receive data files between systems with no other common interface.

Defining Data Flows

A data flow defines where the data comes from (the data source), the transformations required to make it ready for the data warehouse, and where the data goes to (the data destination), as shown in Figure 2-6. This is generally the core of a package. Many data flows can be invoked by a control flow, and they may be invoked in parallel. A data flow is initiated when a data flow task is executed in a control flow.

Figure 2-6 Data flow

Data Sources and Destinations

Integration Services supports a wide variety of data sources and data destinations. Common relational databases such as SQL Server, Oracle, and DB2 are supported directly “out of the box.” In addition, Excel, Access, XML documents, and flat files connectors are provided. Connections can also be made to Analysis Services cubes, Directory Services, and Outlook, among many other services with OLE DB providers. You can use Integration Services for essentially all your ETL requirements between any data sources and destinations. There is no requirement at all that a SQL Server database be either the source or the destination of a data flow.

Data Transformations

Data transformations are used to define the specific actions to be performed on the data in a data flow task as it flows from a data source to a data destination. You graphically design the sequence of actions by dragging data sources, transforms, and data destinations onto a design surface, configuring them, and linking them together. Simple transforms provide a means of changing data types, computing new columns, or looking up values in a reference table based on one or more columns in the data flow.

Many other powerful transforms make it easy to solve some difficult problems you might encounter in the course of importing data into a data warehouse, such as slowly changing dimensions, which are described in Chapter 8, “Managing Changing Data.” If you have duplicate rows in an address table, a Fuzzy Grouping transform will provide a ranking of rows that are probably the same, even with minor differences in spelling or punctuation. If you receive data in a spreadsheet, it is often denormalized, with multiple time periods across the columns when you really need one row per time period. An Unpivot transform will normalize the data stream, putting each column on its own row, retaining the row key and adding an additional key to indicate which column the row corresponds to.
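What an unpivot operation does to a denormalized spreadsheet row can be shown with a minimal Python sketch. It is not the Unpivot transform itself; the unpivot function and the Store, Period, and Amount column names are hypothetical choices for this example.

```python
from typing import Any, Dict, Iterable, Iterator, List


def unpivot(rows: Iterable[Dict[str, Any]], key_column: str,
            value_columns: List[str], name_column: str = "Period",
            value_column: str = "Amount") -> Iterator[Dict[str, Any]]:
    """Emit one output row per (input row, value column) pair.

    The row key is retained, and `name_column` records which source column
    the value came from.
    """
    for row in rows:
        for column in value_columns:
            yield {
                key_column: row[key_column],
                name_column: column,
                value_column: row[column],
            }


spreadsheet = [{"Store": "Seattle", "Jan": 100, "Feb": 120}]
unpivoted = list(unpivot(spreadsheet, "Store", ["Jan", "Feb"]))
```

One input row with two period columns becomes two rows, each carrying the store key and the name of the period it came from.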

You can also add transforms to split or merge a data flow. If you need to process some rows differently than others based on some value in the row, you can use a Conditional Split transform to create multiple independent data flows. You can perform unique transforms on each data flow, and then send each one to unique destinations in the data warehouse or other data target, or you can merge some of the data flows back into the main stream.
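The routing behavior of a conditional split can be modeled as follows. This sketch assumes that conditions are evaluated in order and that each row goes to the first output whose condition it satisfies, with unmatched rows going to a default output; the conditional_split function and the output names are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Condition = Tuple[str, Callable[[Row], bool]]  # (output name, predicate)


def conditional_split(rows: Iterable[Row], conditions: List[Condition],
                      default_output: str = "default") -> Dict[str, List[Row]]:
    """Send each row to the first output whose predicate is true."""
    outputs: Dict[str, List[Row]] = {name: [] for name, _ in conditions}
    outputs[default_output] = []
    for row in rows:
        for name, predicate in conditions:
            if predicate(row):
                outputs[name].append(row)
                break
        else:
            # No condition matched, so the row goes to the default output.
            outputs[default_output].append(row)
    return outputs


sales = [
    {"Country": "US", "Amount": 50},
    {"Country": "CA", "Amount": 2000},
    {"Country": "CA", "Amount": 10},
    {"Country": "US", "Amount": 5000},
]
split = conditional_split(sales, [
    ("domestic", lambda r: r["Country"] == "US"),
    ("large", lambda r: r["Amount"] >= 1000),
])
```

Each output list can then be processed by its own chain of transforms or merged back with the others.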

Data flows quickly through most transforms thanks to the new pipeline architecture in Integration Services. You will see that a typical data flow consists of reading data from a source, passing it through several transforms, and then finally writing it to a destination. The data is not written to disk between each transform. Instead, it is retained in memory and passed between the transforms. For large volumes of data, a block of records is read from the source and then passed on to the first transform. When the transform completes its work on the block, it passes the data on to the next transform and then receives another block to continue working. Both transforms can now work in parallel. This design means there is little overhead spent writing intermediate results to disk only to be read back in again immediately.
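The block-at-a-time, in-memory hand-off between transforms can be approximated with Python generators. This is a loose analogy under stated assumptions: the function names are hypothetical, and generators interleave work on one thread rather than running transforms truly in parallel as the pipeline engine does. What the sketch does show is that each block moves from stage to stage without any intermediate result being written out.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def read_blocks(source: Iterable[Any], block_size: int) -> Iterator[List[Any]]:
    """Read the source in blocks of at most `block_size` rows."""
    iterator = iter(source)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


def transform_blocks(blocks: Iterable[List[Any]],
                     fn: Callable[[Any], Any]) -> Iterator[List[Any]]:
    """Apply `fn` to every row of each block as the block arrives."""
    for block in blocks:
        yield [fn(row) for row in block]


def write_blocks(blocks: Iterable[List[Any]], destination: List[Any]) -> int:
    """Append every block to the destination and return the row count."""
    count = 0
    for block in blocks:
        destination.extend(block)
        count += len(block)
    return count


destination: List[int] = []
blocks = read_blocks(range(1, 8), block_size=3)
doubled = transform_blocks(blocks, lambda value: value * 2)
shifted = transform_blocks(doubled, lambda value: value + 1)
rows_written = write_blocks(shifted, destination)
```

Nothing is read from the source until write_blocks starts pulling blocks, and only one block per stage is held in memory at a time.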

Debugging

Debugging packages is easy, too. When the package is executed in the BI Development Studio, each task is color coded by its state. Running tasks are yellow, successfully completed tasks turn green, and failing tasks turn red. Row counts display along each data flow path so that you can observe the progress and traffic along each path. If you need to view the data flowing along a path, you can add a data viewer to the path. A data viewer can show you the value of each column in each row in a grid, or you can choose to view column values as a histogram, scatter plot, or column chart. If a transform or task fails, a descriptive error is written to a progress file. You can set breakpoints at any task, or at any point in a script task or transform, step through each task or script, and view the values of variables as they change.

Data Quality

The ETL process is critical to ensuring high quality of the data reaching the data warehouse. Integration Services transforms are designed so that data containing errors can be redirected to a different path for remediation. Common errors such as missing business keys or string truncation errors automatically raise an error condition by default, but you can specify alternative actions. You can also use a Conditional Split transform to redirect rows with values that are out of a predefined range. Nearly every transform provides multiple data flow outputs that you can simply drag to some other transform to create a new data flow that you use to handle the data that has failed some test.
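Redirecting failing rows to a separate path instead of stopping the load can be sketched like this. The function name, the ErrorDescription column, and the choice of which exceptions count as row errors are assumptions made for the example, not Integration Services behavior.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def convert_with_error_output(rows: Iterable[Row], column: str,
                              converter: Callable[[Any], Any]
                              ) -> Tuple[List[Row], List[Row]]:
    """Convert one column, returning (good rows, redirected error rows).

    Rows whose value cannot be converted keep their original data and gain
    an "ErrorDescription" column, so a later step can repair or report them.
    """
    good: List[Row] = []
    errors: List[Row] = []
    for row in rows:
        try:
            good.append({**row, column: converter(row[column])})
        except (ValueError, TypeError, KeyError) as exc:
            errors.append({**row, "ErrorDescription": str(exc)})
    return good, errors


incoming = [
    {"OrderID": 1, "Quantity": "3"},
    {"OrderID": 2, "Quantity": "abc"},
    {"OrderID": 3, "Quantity": None},
]
good_rows, error_rows = convert_with_error_output(incoming, "Quantity", int)
```

The good rows continue toward the warehouse, while the error rows form a second flow that can be written to a holding table for remediation.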

Deploying and Configuring Packages

You can deploy packages to other environments such as test or production one at a time from the development studio, or in a batch using a command line. Using package configuration sources, you can reconfigure properties such as connection strings, server names, or parameters at runtime. The source for these properties can be environment variables, the registry, a database table, or an XML file.
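The effect of configuration sources, where the same package picks up different property values in each environment, can be modeled in Python. The precedence used here (design-time defaults, then values from a file or table, then environment variables) and the ETL_ prefix are assumptions of this sketch, not the rules Integration Services applies; resolve_properties is a hypothetical name.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_properties(defaults: Mapping[str, str],
                       file_values: Optional[Mapping[str, str]] = None,
                       env: Optional[Mapping[str, str]] = None,
                       env_prefix: str = "ETL_") -> Dict[str, str]:
    """Overlay configuration sources on top of design-time defaults.

    Only properties that already exist in `defaults` can be overridden.
    """
    resolved = dict(defaults)
    # Values that would come from a file or a database table.
    for name, value in (file_values or {}).items():
        if name in resolved:
            resolved[name] = value
    # Environment variables, looked up as PREFIX + upper-cased property name.
    environment = os.environ if env is None else env
    for name in resolved:
        env_name = env_prefix + name.upper()
        if env_name in environment:
            resolved[name] = environment[env_name]
    return resolved


design_time = {"server_name": "devsql01", "database": "SalesDW"}
test_settings = resolve_properties(
    design_time, file_values={"server_name": "testsql01"}, env={})
prod_settings = resolve_properties(
    design_time, file_values={"server_name": "testsql01"},
    env={"ETL_SERVER_NAME": "prodsql01"})
```

The package definition stays the same in both cases; only the values supplied at runtime differ.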

Executing Packages

Integration Services packages can be run from the BI Development Studio designer, by starting them in SQL Server Management Studio, from a command line, or through the SQL Agent to schedule the execution. You can also invoke a package from another package. You can pass parameters to the packages using any of these methods. The parameters can set package variables that can be used to set task and transform properties, such as a server name, or to control other aspects of the package execution. You can use Management Studio to view or stop currently executing packages, regardless of how they were started.

Analysis Services

SQL Server Analysis Services is an engine designed to support storing and querying large amounts of data based on dimensional models. Analysis Services implicitly understands concepts such as dimensions, hierarchies, slicing, and filtering. Using Analysis Services, you no longer need to worry about how to construct complex SQL statements to do the kind of analysis of your data commonly performed in BI applications.

In addition to simply presenting a set of values as output, Analysis Services can assist in interpreting these values. Data mining capabilities in Analysis Services can provide insight into the relationships between different aspects of your data (for example, how level of education correlates with credit risk). Another common application in BI is key performance indicators (KPI), where you are measuring success against some pre-established goals.

Analysis Services Architecture

Analysis Services reads data from one or more sources to populate the dimensions and cubes you have designed. It is a distinct service from the SQL Server database engine. Analysis Services can use most common databases as data sources. You can just as easily create Analysis Services databases with Oracle or Access databases as you can with SQL Server databases.

Like the SQL Server database engine, Analysis Services is a server application, as shown in Figure 2-7, not an end-user application. Queries in the form of Multidimensional Expressions (MDX) statements are submitted to the server, and results are typically returned to the user through Excel, Reporting Services, Business Scorecard Manager, or third-party tools such as ProClarity, Cognos, and Panorama. Communication with end-user applications is done using XML for Analysis (XMLA), an open standard for interfacing with data sources. The XMLA council has more than 20 vendors (and many more subscribers to the standard).

Figure 2-7 Analysis Services architecture

Usually a single server machine runs a single instance of Analysis Services, just like the SQL Server database engine, but you can configure a server to run more than one instance of the Analysis Services engine at the same time if necessary.

With Analysis Services 2005, you can have on-demand, real-enough-time, or real-time updating of the analysis database. The notification of new data being available can be automatic if you are using SQL Server or via polling for other databases. You choose how long to wait before processing new data, and how old the data is allowed to be, and Analysis Services will ensure that if the data is not available in its database within that timeframe, it will revert to the relational data store until it is available. This feature is called proactive caching and is an important feature not just for real-time scenarios but for high availability, too. Updating performed using proactive caching does not mean taking the database offline.
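The decision described here, whether to answer a query from the Analysis Services database or to fall back to the relational store, can be written as a small function. This is a deliberately simplified model: the function name, its parameters, and the use of plain numbers of seconds are hypothetical, and the real proactive caching settings are more detailed than a single staleness limit.

```python
def choose_query_source(now: float, source_changed_at: float,
                        cache_processed_at: float,
                        max_staleness: float) -> str:
    """Pick where a query should be answered from.

    All arguments are times in seconds on the same clock. `max_staleness`
    is how old the data is allowed to be before the cache may not be used.
    """
    # The cache already reflects the latest change in the source.
    if cache_processed_at >= source_changed_at:
        return "olap_cache"
    # The cache is out of date, but still within the allowed age.
    if now - source_changed_at <= max_staleness:
        return "olap_cache"
    # Too stale: answer from the relational data store until reprocessing ends.
    return "relational_store"


up_to_date = choose_query_source(100, 50, 60, 30)
slightly_stale = choose_query_source(100, 90, 60, 30)
too_stale = choose_query_source(100, 50, 40, 30)
```

Once the cache has been reprocessed, cache_processed_at moves past source_changed_at and queries return to the faster source.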

Development Environment

In a BI development environment, you need to specify what the data sources are, what your dimensions and measures are, the goals of your KPIs, and other design criteria. This is not an end-user task, but a task for a BI developer.

BI Development Studio is the graphical development environment where you create your Analysis Services database design, as shown in Figure 2-8. This is the same environment used to develop Integration Services packages and Reporting Services reports. Analysis Services projects can be included in source control services, and you can define multiple project targets, such as “dev,” “test,” and “production” environments.

Figure 2-8 Cube designer in BI Development Studio

Building a basic cube is extremely simple, thanks to a wizard. All you need to do is tell the wizard which tables to include, and it will determine which tables represent dimensions and which represent facts. If you have enough data in the tables, the wizard can determine some of the natural hierarchies of each dimension. You can be well on your way to having the framework for a cube within minutes. Many other wizards are available to help you build special objects such as time dimensions.

In the designer, you also specify any KPIs, actions, data partitioning, and other options you require. The result of building an Analysis Services project is an XML file that completely describes your design. You can have this file deployed to any server running Analysis Services, which creates the Analysis Services database and performs the initial population of the dimensions and cubes. You can use the Analysis Services Deployment Wizard to deploy your database to other servers and environments, and as part of the deployment, specify properties to change so that the solution will work in the new environment, such as the data source server, the log files, and so on.

You can also reverse engineer an existing Analysis Services database into a BI Development Studio project. This is important because changes made to a live database through SQL Server Management Studio, or through programmatic means, do not modify the underlying project.

Managing and Securing Analysis Services

You use SQL Server Management Studio, as shown in Figure 2-9, to perform routine maintenance, manage security, and browse the dimensions and cubes.

Figure 2-9 Management Studio with Analysis Services

If you need to understand the kinds of queries being presented to a cube, you can use SQL Profiler (the same one used to trace relational queries). You can filter on the duration of a query, who issued the query, and many other attributes. You can capture the query string, put it into an Analysis Services query window in Management Studio, and execute it to review its results and test modifications to the query.

Analysis Services by default requires authentication of connections using Windows authentication. When you log in to Windows, you are authenticated, and the credentials you receive are used by Analysis Services to determine your permissions. In this default mode, access to a database is not possible without an authenticated connection and explicit permission granted for that database. In Analysis Services, you create roles to which you give permissions to all or portions of a cube. You can place Windows groups or individual user accounts in these roles. Using groups makes administration easier because you no longer need to administer individual users, just a smaller number of groups, and the specific groups rarely change.

You can configure Analysis Services to use Basic or Digest authentication or to simply grant unauthenticated users access (although, of course, the latter is not generally recommended).

The Unified Dimensional Model

OLAP technology can usually support all the different design elements covered in Chapter 1, “Introduction to Business Intelligence,” including the ability to easily handle stars or snowflakes and to define hierarchies from the attributes in a dimension. However, in the past there has always been one major reason that people continued to use relational reporting as well as OLAP. Most OLAP technologies restrict users to drilling down through summary data along predefined hierarchies; so when users get to a point in their analysis where they want to see detailed transactional information, they need to switch to a relational report.

SQL Server 2005 includes OLAP innovations, collectively called the Unified Dimensional Model (UDM), that can unify these previously separate relational and dimensional reporting models. The term UDM refers to the extended set of capabilities offered by Analysis Services 2005: cubes are not restricted to providing classic drilldown access through a star or snowflake schema, but can support detail-level reporting from complex real-world source databases.

The major difference is that users of Analysis Services 2005 cubes are not restricted to a predefined set of hierarchies for querying the cube. Instead, they can use any descriptive attribute on a dimension to analyze information. This means that in addition to the classic OLAP-style reports with summary information, users can include attributes such as order numbers to generate reports with the most detailed level of information available, such as a list of order-line items.

Support for Large and Mission-Critical BI Solutions

As BI solutions become a key part of the strategy of a company, BI will quickly move from being simply an important initiative to a mission-critical system. In large and more complex BI solutions, Analysis Services’ support for availability, scalability, and very large volumes of data becomes essential.