9.4. Data compression

Predicting the future is a risky business, but if there's one thing that's guaranteed in future databases, it's an ever-growing volume of data to store, manage, and back up. Growing regulatory requirements, the plummeting cost of disk storage, and new types of information such as digitized video are converging to create what some call an information storage explosion. Managing this information while making it readily accessible to a wide variety of client applications is a challenge for all involved, particularly the DBA. Fortunately, SQL Server 2008 delivers a number of new features and enhancements to assist DBAs in this regard.

In the previous section we looked at FileStream, which enhances the storage options for BLOB data. One of the great things about FileStream is the ability to store BLOBs outside the database on compressed NTFS volumes. Until SQL Server 2008, compressing data inside the database was limited to basic options such as variable-length columns, or complex options such as custom-developed compression routines.

In this section, we'll focus on a new feature in SQL Server 2008, data compression, which enables us to natively compress data inside the database without requiring any application changes. We'll begin with an overview of data compression and its advantages before looking at the two main ways in which SQL Server implements it. We'll finish with a number of important considerations in designing and implementing a compressed database.

9.4.1. Data compression overview

Data compression, available only in the Enterprise edition of SQL Server 2008, allows you to compress individual tables and indexes using either page compression or row compression, both of which we'll cover shortly. Due to its potentially adverse impact on performance, there's no option to compress the entire database in one action.

As you can see in figure 9.10, you can manage a table's compression by right-clicking it and choosing Storage > Manage Compression.

When considering compression in a broad sense, lossy and lossless are terms used to categorize the compression method used. Lossy compression techniques are used in situations where a certain level of data loss between the compressed and uncompressed file is accepted as a consequence of gaining higher and faster compression rates. JPEG images are a good example of lossy compression, where a reduction in data quality between the original image and the compressed JPEG is acceptable. Video and audio streaming are other common applications for lossy compression. It goes without saying that lossy compression is unacceptable in a database environment.

Figure 9.10. Individual tables can be selected for compression using SQL Server Management Studio.
 

SQL Server implements its own custom lossless compression algorithm and attempts to strike a balance between high compression rates and low CPU overhead. Compression rates and overhead will vary, and are dependent on a number of factors that we'll discuss, including fragmentation levels, the compression method chosen, and the nature of the data being compressed.

Arguably the most powerful aspect of SQL Server's implementation of data compression is the fact that the compressed data remains compressed, on disk and in the buffer cache, until it's actually required, at which point only the individual columns that are required are uncompressed. Compared to a file system-based compression solution, this results in the lowest CPU overhead while maximizing cache efficiency, and is clearly tailored toward the unique needs of a database management system.

Let's consider some of the benefits of data compression:

  • Lower storage costs—Despite the rapidly decreasing cost of retail storage, the storage found in high-end SANs, typically used by SQL Server implementations, is certainly not cheap, particularly when you consider actual-to-usable RAID ratios and the duplication of data for various purposes, as figure 9.11 shows.

  • Lower administration costs—As databases grow in size, disk I/O-bound administration tasks such as backups, DBCC checks, and index maintenance take longer and longer. By compressing certain parts of the database, we're able to reduce the administration impact. For example, a database that's compressed to half its size will take roughly half the time to back up.

    (In addition to data compression, SQL Server 2008 introduces backup compression, covered in detail in the next chapter.)

    Figure 9.11. Compared to retail disk, enterprise SAN storage is expensive, a cost magnified by RAID protection and data duplication such as that shown here.
     
  • RAM and disk efficiency—As mentioned earlier, compressed data read into the buffer cache will remain compressed until required, effectively boosting the buffer size. Further, as the data is compressed on disk, the same quantity of disk time will effectively read more data, thus boosting disk performance as well.

SQL Server 2008 implements two different methods of data compression: page compression and row compression. The makeup of the data in the table or index determines which of these two will yield the best outcome. As we'll see shortly, we can use supplied tools to estimate the effectiveness of each method before proceeding.

9.4.2. Row compression

Row compression extends the variable-length storage format found in previous versions of SQL Server to all fixed-length data types. For example, just as the varchar data type reduces the storage footprint of variable-length strings, SQL Server 2008 can store integer, char, and float data using only the space each value actually requires. Crucially, compressing fixed-length data doesn't change how the data type is exposed to applications, so the benefits of compression are gained without requiring any application changes.

As an example, consider a table with millions of rows containing an integer column with a maximum value of 19. We could convert the column to tinyint, but not if we need to support the possibility of much larger values. In this example, significant disk savings could be derived through row compression, without requiring any application changes.
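
To make the scenario concrete, here's a minimal sketch (the table name OrderAudit is hypothetical) showing row compression being enabled. The column keeps its declared int type, but small values such as 19 are stored in fewer bytes on disk:

-- Enable row compression; small int values are stored in fewer bytes
-- (table name is hypothetical)
ALTER TABLE [dbo].[OrderAudit]
REBUILD WITH (DATA_COMPRESSION = ROW);
GO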

An alternative to row compression is page compression, our next topic.

9.4.3. Page compression

In contrast to row compression, page compression, as the name suggests, operates at the page level and uses techniques known as column prefix and page dictionary to identify common data patterns in rows on each page of the compressed object. When common patterns are found, they're stored once on the page, with references made to the common storage in place of the original data pattern. In addition to these techniques, page compression includes row compression, and therefore delivers the higher compression rate of the two options.

Page compression removes redundant storage of data. Consider an example of a large table containing columns with a default value specification. If a large percentage of the table's rows have the default value, there's obviously a good opportunity to store this value once on each page and refer all instances of that value to the common store.
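
As a rough illustration of this scenario (the table and column names here are hypothetical), consider a column whose default value appears on most rows; page compression can store that repeated value once per page:

-- Hypothetical table: StatusCode defaults to 'PENDING' on most rows,
-- making it a good candidate for the page dictionary
CREATE TABLE [dbo].[OrderFlags]
(
    OrderID    int      NOT NULL,
    StatusCode char(10) NOT NULL DEFAULT 'PENDING'
);
GO

-- Page compression stores the repeated value once per page and
-- replaces each occurrence with a reference to it
ALTER TABLE [dbo].[OrderFlags]
REBUILD WITH (DATA_COMPRESSION = PAGE);
GO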

Compressing a table, using either the page or row technique, involves a considerable amount of work by the SQL Server engine. To ensure the benefits outweigh the costs, you must take a number of factors into account.

9.4.4. Data compression considerations

In considering the merits of data compression for a given table or index, the first and most straightforward consideration is the potential compression rate.

Compression rate

The compression rate achieved depends on the underlying data and the compression method you choose. SQL Server 2008 includes two tools for estimating disk savings: a Management Studio GUI-based wizard (shown in figure 9.12) and the sp_estimate_data_compression_savings procedure. Let's look at the wizard first.

Figure 9.12. Using SQL Server Management Studio, you can estimate the effectiveness of both row and page compression for a particular table.
 

Figure 9.13. Use the sp_estimate_data_compression_savings procedure to estimate the disk savings for a table and optionally all its indexes using both page and row compression.
 

You can access the wizard by right-clicking a table and choosing Storage > Manage Compression. The wizard can be used to estimate, script, and compress the table using the selected compression technique.

The second tool, the sp_estimate_data_compression_savings procedure, as shown in figure 9.13, lists, for a given table and optionally all its indexes, the estimated size before and after compression. Like the Management Studio wizard, it can produce estimates for both row and page compression.
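
For reference, a call to the procedure looks something like the following sketch. Sales.SalesPerson is used here only because it appears in the ALTER TABLE example below, and NULL for the index and partition parameters covers all indexes and partitions:

-- Estimate page compression savings for a table and all its indexes
EXEC sp_estimate_data_compression_savings
      @schema_name      = 'Sales'
    , @object_name      = 'SalesPerson'
    , @index_id         = NULL    -- NULL = all indexes
    , @partition_number = NULL    -- NULL = all partitions
    , @data_compression = 'PAGE'; -- repeat with 'ROW' to compare
GO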

Using these estimate tools is an important step in evaluating the benefits of compression before implementing it. Once the evaluation is complete, you can implement compression using the same Management Studio wizard used for estimating the savings. Alternatively, use the ALTER TABLE statement as shown here:

-- Compress a table using a maximum of 4 CPUs
ALTER TABLE [Sales].[SalesPerson]
REBUILD WITH (DATA_COMPRESSION = PAGE, MAXDOP = 4);

One of the nice things about the ALTER TABLE method of implementing compression is its ability to accept a MAXDOP value for controlling CPU usage during the initial compression process. Depending on the size of the table and/or indexes being compressed, CPU usage may be very high for an extended length of time, so the MAXDOP setting allows some degree of control in this regard.

(The ALTER INDEX statement also supports the DATA_COMPRESSION option.)
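
As a brief sketch of that option (the index name below is hypothetical), an individual index can be rebuilt with compression in much the same way as a table:

-- Compress a single index; index name is hypothetical
ALTER INDEX [IX_SalesOrder_ProductID]
ON [Sales].[SalesOrder_Detail_Copy]
REBUILD WITH (DATA_COMPRESSION = PAGE, MAXDOP = 4);
GO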

Finally, you should consider the tables and indexes proposed for compression. Compressing a table that represents a very small percentage of the overall database size will not yield much of a space gain. Further, if that same table is used frequently, then the performance overhead may outweigh the small gain in disk savings. In contrast, a very large table representing a significant portion of the total database size may yield a large percentage gain, and if the table is used infrequently, the gain comes with little performance overhead.

Performance overhead

As with any compression technique, space savings and increased CPU usage go hand in hand. On systems close to CPU capacity, the additional overhead may preclude data compression from being an option. For other systems, measuring the level of overhead is an important consideration.

The ideal targets for compression are tables and indexes that are used infrequently yet represent a significant percentage of the database size. Targeting such tables minimizes the performance impact while maximizing disk space savings.

Dynamic management functions and views such as sys.dm_db_index_operational_stats and sys.dm_db_index_usage_stats assist in the process of identifying the least frequently used objects, and we'll cover these in detail in chapter 13. For frequently used objects, the performance impact of data compression needs to be carefully measured in a volume-testing environment capable of simulating production load.
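
As a taste of what's covered in chapter 13, the following sketch ranks user tables by read activity using sys.dm_db_index_usage_stats. Tables near the bottom of the list, or absent from it entirely (the DMV only records objects accessed since the last restart), are potential compression candidates:

-- Rank user tables in the current database by total read activity
-- since the last SQL Server restart
SELECT OBJECT_NAME(s.object_id) AS table_name
     , SUM(s.user_seeks + s.user_scans + s.user_lookups) AS total_reads
FROM sys.dm_db_index_usage_stats AS s
WHERE s.database_id = DB_ID()
  AND OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
GROUP BY s.object_id
ORDER BY total_reads ASC;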

Despite the CPU overhead, certain operations such as table scans can actually receive a performance boost with data compression enabled. Let's look at two examples that illustrate both the negative and positive performance impacts of data compression. In viewing these examples, keep in mind that the results of any tests such as these depend heavily on the makeup of the underlying data. These tests were conducted on modified versions of the tables in the AdventureWorks sample database; results from real-world customer databases will obviously vary.

The first example tests the time taken to insert the contents of a modified version of the AdventureWorks SalesOrder_Detail table containing 1.6 million records into a blank table with the same structure. The insert was repeated multiple times to observe the insert time and resultant table size with both page and row compression enabled. For comparison purposes, we also ran the test against an uncompressed table.

-- Measure the size and execution time of various compression settings
TRUNCATE TABLE [Sales].[SalesOrder_Detail_Copy];
GO

ALTER TABLE [Sales].[SalesOrder_Detail_Copy]
REBUILD WITH (DATA_COMPRESSION = PAGE) -- repeat for ROW, NONE
GO

INSERT [Sales].[SalesOrder_Detail_Copy]
SELECT *
FROM [Sales].[SalesOrder_Detail];
GO

Rather than execute DBCC DROPCLEANBUFFERS between executions to clear the buffer cache, each test was run multiple times to ensure the data to insert was cached in memory for all three tests. This method lets you more accurately compare the relative performance differences between the compression methods by narrowing the focus to the time taken to write the new rows to disk.

The results of the three tests, shown in figure 9.14, clearly indicate higher compression rates for page compression over row compression, but at a correspondingly higher cost in terms of execution time.

Performance increase

Despite the CPU overhead required to compress and decompress data, in certain cases compressed data can actually boost performance. This is particularly evident in disk I/O-bound range scans. If the data is compressed on disk, it follows that fewer pages need to be read from disk into memory, which translates to a performance boost. Let's use another example to demonstrate.

Figure 9.14. Inserting 1.6 million rows into a destination table with three different compression settings. Actual results will differ based on various factors.
 

In this example, we'll select the average unit price from the Sales.SalesOrder_Detail_Copy table, again modified for the purposes of the test and increased in size to 6.7 million rows. Given that the UnitPrice column isn't indexed, a full table scan will result, which is ideal for our purposes. We'll run the query three times: against an uncompressed table, and with each form of compression enabled. Before each run, we'll clear the buffer cache with DBCC DROPCLEANBUFFERS to ensure the query reads from disk each time. The script used for this test looks like this:

-- Measure the table scan time of various compression settings
ALTER TABLE [Sales].[SalesOrder_Detail_Copy]
REBUILD WITH (DATA_COMPRESSION = ROW) -- repeat for PAGE, NONE
GO

DBCC DROPCLEANBUFFERS;
GO

SELECT AVG(UnitPrice)
FROM Sales.SalesOrder_Detail_Copy;
GO

The results of the three tests, shown in figure 9.15, clearly indicate that page compression enables the fastest execution time for this particular example—almost three times quicker than the query against the uncompressed table.

Figure 9.15. Table scan execution time using three different compression levels. Actual results will differ based on various factors.
 
