Chapter 8
Understanding and designing tables

This chapter discusses a fundamental concept in relational databases: tables. Tables are the database objects that actually store the data in the database. Thoroughly understanding the concepts in this chapter is a requirement for designing and maintaining an effective database. We first discuss table design fundamentals, including data types, keys, and constraints. Next, we cover several specialized table types, such as temporal tables and graph tables. Then, we examine the specialized nature of storing binary large objects (BLOBs) in relational tables. Your understanding of table design would not be complete without including both vertical and horizontal partitioning, which we review before the chapter ends with an overview of change tracking methods.

Reviewing table design

In this section, we review information that is relevant when designing tables. First, we look at system data types, emphasizing the data design decisions surrounding their use. Next, we briefly discuss primary and foreign key concepts. Then, we cover constraints and their impact on table design. The section ends with user-defined data types and computed columns.

Generic data types

Selecting the appropriate data type when designing relational databases is a crucial activity. Even though you can change the data types of columns, it is an expensive operation at best; at worst, it could lead to a loss of data. A poorly chosen data type can result in suboptimal performance or might allow for unexpected values to be stored in the column. The intent of this section is not to provide exhaustive coverage of each system data type available in SQL Server 2017. Instead, the focus will be on providing the information and guidance necessary to make solid table design decisions.

Inside OUT

When should I use a numeric data type instead of an alphanumeric data type?

You can store any numeric value, such as an amount or an identifier consisting only of digits, in either an alphanumeric column or a numeric data type column. Generally, you would choose a numeric data type if you use the values in calculations or when magnitude matters. For example, you might need to calculate a discount on a monetary value. Quantities are another example: you might need to adjust a quantity by adding or subtracting units. On the other hand, a US Zip code is best stored as an alphanumeric value because leading zeroes must be preserved. The same can be true of an employee ID number.

In addition to considering whether you need to use the value in calculations, there are also differences in how values are sorted. In a numeric column sorted in ascending order, the value 12 will come before 100. But in an alphanumeric column, 100 will come before 12. Either one can produce the desired answer based on your use case.

If you decide that using a numeric data type is the right choice for a column, you then will need to decide which numeric data type is the best. This decision is a balancing act between precision and storage size required. There is, however, one golden rule: if you are storing financial information, such as monetary values or interest rate percentages, you should always use the decimal type (which is the same as numeric in SQL Server) because it has a fixed precision. This means that there will never be an approximate value stored, as could be the case with the float and real data types.
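To see why this rule matters, consider the following short script, which is a minimal sketch: it accumulates the value 0.1 ten times in both a float variable and a decimal variable. The float total is not exactly equal to 1 because of binary floating-point rounding, whereas the decimal total is exactly 1.00.

-- Accumulate 0.1 ten times using an approximate type and an exact type
DECLARE @f float = 0,
        @d decimal(9, 2) = 0,
        @i int = 1;
WHILE @i <= 10
BEGIN
    SET @f = @f + 0.1e0;  -- approximate: stored as binary floating point
    SET @d = @d + 0.1;    -- exact: stored with fixed precision and scale
    SET @i += 1;
END
SELECT @f AS FloatSum,    -- very close to, but not exactly, 1
       @d AS DecimalSum,  -- exactly 1.00
       CASE WHEN @f = 1.0e0 THEN 'equal' ELSE 'not equal' END AS FloatEqualsOne;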

Alphanumeric types

Alphanumeric types in SQL Server are commonly discussed in terms of fixed versus variable length, and with Unicode versus without Unicode support. The char data type is a fixed length type and varchar is variable length. The nchar and nvarchar data types are fixed and variable length, respectively, and support Unicode.

With string data, collation becomes an important consideration. Among other things, collation determines how the high-order bit in each character's byte is interpreted. Collation supports internationalization by allowing different character representations for characters whose integer value is greater than 127; this mapping is determined by the code page, which is one element of the collation. Collation also determines how data is compared and sorted, such as whether casing and accented letters are considered different.

Image For complete details about collation and Unicode support, refer to Microsoft Docs at https://docs.microsoft.com/sql/relational-databases/collations/collation-and-unicode-support.

Finally, no discussion of alphanumeric types would be complete without taking a look at (N)VARCHAR(MAX). By specifying MAX instead of a value between 1 and 8,000 characters (for VARCHAR) or 1 and 4,000 characters (for NVARCHAR), the maximum storage size increases to 2 GB. Values that do not fit in the data row are not kept in the table's regular storage structure; such large-value data is stored out-of-row, and for each such column, 24 bytes of overhead is stored in the row.

Numeric types

When considering numeric types in computer systems, it is important to distinguish between exact and approximate numeric types. Approximate types store values using scientific notation, with the number of bits in the mantissa limited to 24 or 53. Due to the nature of scientific notation and the limited number of bits, these types cannot accurately store every number in the supported range. On the other hand, exact types store numbers without a loss of precision, but this comes at the cost of a smaller range.

SQL Server provides real and float as approximate data types, and their implementations are closely related. The real data type has lower precision than the float data type. You can specify the number of bits for the mantissa when defining float, but SQL Server will always use either 24 bits or 53 bits; any other value you specify is rounded up to 24 or 53. The real data type is the same as specifying float(24), or in effect any number of mantissa bits between 1 and 24.

Exact numeric types include tinyint, smallint, int, and bigint, which are all whole numbers of varying byte sizes and therefore range. SQL Server does not support unsigned integers.

There are exact numeric types that support decimal-point numbers. Foremost among those is the decimal data type. In SQL Server, another name for decimal is numeric. The decimal data type supports a precision of up to 38 digits. The number of digits determines the storage size. In addition, you can specify the scale, which determines the number of digits to the right of the decimal point.

Another category of exact numeric types that support decimal-point numbers is money and smallmoney. These data types store monetary data with up to four digits to the right of the decimal point, so the precision is to the ten-thousandth. Choosing between decimal and money or smallmoney is primarily determined by your need for range and for digits to the right of the decimal point. For monetary values, if your calculations return the desired result when using only four digits to the right of the decimal point, smallmoney and money are good choices because they require less storage space. For higher precision and scale, decimal is the right choice.

Date and time types

Date and time data types available in SQL Server 2017 include the venerable datetime and smalldatetime types. Although these are not technically deprecated, we strongly caution against using them for new development due to issues surrounding precision, available date range, and lack of control over the storage size. Additionally, these data types are not aligned with the SQL standard, lowering portability of the data between platforms. Their immediate replacement is datetime2, which in no case consumes more than eight bytes of storage space (the same as datetime), but addresses precision, increases the date range, and can store dates in less than eight bytes in return for lower precision.

This does not mean, however, that all date or time values should be stored in datetime2. There are three additional data types that you should consider for storing date or time values:

  • date. The date data type stores only a date and supports the same date range as datetime2. It stores the date in only three bytes, thereby being much more efficient than datetime (fixed at eight bytes) and datetime2 (minimally six bytes). If you need to store only a date without time or time zone information, this is your best choice. An example of such a case is a date of birth. A date of birth is commonly stored to calculate someone’s age, and that is not generally dependent on the time zone or on the time. One of the authors was born at 11 PM Central European Summer Time. If he moved to southeast Asia, he would not celebrate his birthday a day later, even though the date in Southeast Asia was the next day. (We appreciate that some applications, such as one used in a neonatal facility, might need to store a more precise “time of birth,” but in most cases, the aging logic above holds up.)

  • datetimeoffset. The datetimeoffset data type provides the same precision and range as datetime2 but includes an offset value in hours and minutes used to indicate the difference from UTC. Even though a discussion of time zones and the impact of Daylight Saving Time (DST), also known as summer time, is beyond the scope of this book, we will note that this data type neither tracks nor understands actual time zones or DST. It would be up to the application to track the time zone where the value originated to allow the application or recent versions of SQL Server (see the note that follows) to perform correct date arithmetic.

  • time. The time data type stores only a time-of-day value consisting of hours, minutes, seconds, and fractional seconds, with a precision up to 100 nanoseconds. The exact fractional second precision and storage size is user defined by optionally specifying a precision between 0 and 7. The time data type is a good choice when storing only a time-of-day value that is not time-zone sensitive, such as for a reminder. A reminder set for 11 AM might need to be activated at 11 AM regardless of time zone and date.

Inside OUT

How can I correctly retrieve the current system date and time?

In addition to continued use of the datetime data type, we also observe the common use of the lower-precision functions CURRENT_TIMESTAMP, GETDATE(), and GETUTCDATE(). Although these functions continue to work, they return values of the datetime type.

There are replacement functions available in SYSDATETIME(), SYSDATETIMEOFFSET(), and SYSUTCDATETIME(). Although their names don't make it immediately clear, the SYSDATETIME() and SYSUTCDATETIME() functions return the improved datetime2(7) type. SYSDATETIMEOFFSET() returns a value of type datetimeoffset(7). SYSDATETIME() and SYSUTCDATETIME() are functionally equivalent to GETDATE() and GETUTCDATE(), respectively. SYSDATETIMEOFFSET() has no legacy equivalent and is thus the only option if you need to include the time zone offset on the server.

Even if you are unable to change the schema of your database to change all date-time columns to datetime2 (or datetimeoffset), you might benefit in the long term from adopting the improved functions now. Even though the range of valid dates for datetime is much smaller on the lower end than for datetime2 and datetimeoffset, the upper end is the same (December 31, 9999). Discarding the future possibility of time-travel to before the year 1753, none of the improved functions will return a datetime2 or datetimeoffset value that cannot be cast to datetime.
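As a quick illustration (a sketch you can run anywhere), the following query calls the legacy and improved functions side by side and then uses SQL_VARIANT_PROPERTY to reveal the data type each one returns:

-- Legacy functions return datetime; the SYS* functions return datetime2(7) or datetimeoffset(7)
SELECT GETDATE()           AS LegacyLocal,
       SYSDATETIME()       AS ImprovedLocal,
       GETUTCDATE()        AS LegacyUtc,
       SYSUTCDATETIME()    AS ImprovedUtc,
       SYSDATETIMEOFFSET() AS LocalWithOffset;
-- Reveal the return type of each function
SELECT SQL_VARIANT_PROPERTY(CAST(GETDATE() AS sql_variant), 'BaseType')           AS GetDateType,
       SQL_VARIANT_PROPERTY(CAST(SYSDATETIME() AS sql_variant), 'BaseType')       AS SysDateTimeType,
       SQL_VARIANT_PROPERTY(CAST(SYSDATETIMEOFFSET() AS sql_variant), 'BaseType') AS SysDateTimeOffsetType;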

Image For detailed technical comparisons between the available date and time data types, consult Microsoft Docs at https://docs.microsoft.com/sql/t-sql/functions/date-and-time-data-types-and-functions-transact-sql.

Binary types

Some data cannot be efficiently represented as an alphanumeric string. For example, data that has been encrypted by the application should be stored as a binary value. The same might also apply to storing contents of binary file formats, such as PDF files.

SQL Server provides the binary data type to store fixed-length binary values, and varbinary to store variable-length binary values. (The image data type is deprecated, and you should no longer use it.) For both data types, you specify the number of bytes that will be stored, up to 8,000. If you need to store more than 8,000 bytes, you can specify varbinary(max). This will allow up to 2 GB to be stored, although those bytes are not stored in the data row.

You can store the data for varchar(max) and varbinary(max) columns in a separate filegroup without needing to resort to FILESTREAM. In the CREATE TABLE statement, use the TEXTIMAGE_ON clause to specify the name of the filegroup where large object (LOB) data should be stored.
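The following sketch illustrates the clause; it assumes that a filegroup named LOBData has already been added to the database:

-- Row data goes to the default filegroup; varbinary(max) data goes to LOBData (assumed to exist)
CREATE TABLE dbo.Documents (
    DocumentId   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    DocumentName nvarchar(260) NOT NULL,
    FileContents varbinary(max) NULL
)
ON [PRIMARY]
TEXTIMAGE_ON [LOBData];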

Specialized data types

In addition to the data types that are designed to store traditional numeric, alphanumeric, and date and time values, SQL Server provides more specialized data types. These data types are more specific to a purpose or use case than the generic data types.

Some specialized data types have SQL Common Language Runtime (CLR) functions that make working with them significantly easier. For example, the hierarchyid data type has a ToString() function that converts the stored binary value into a human-readable format. Such SQL CLR function names are case-sensitive, regardless of the case sensitivity of the instance or database.

Spatial data types

The spatial data types provide a way to work with planar (flat) or ellipsoidal (round-earth) coordinates. The geometry data type is for a flat coordinate system, whereas the geography data type is for round-earth coordinates. In addition, both data types also support elevation, or Z, values. Both data types are CLR types available in every database, regardless of whether the SQL CLR feature is turned on.

SQL Server provides several methods to work with the values of these data types, including finding intersections, calculating surface area and distance, and many more. SQL Server supports methods defined by the Open Geospatial Consortium (OGC) as well as extended methods designed by Microsoft. The methods defined by the OGC are identified by their ST prefix.

Generally, you create a geometry or geography value by using the static STGeomFromText method. You can use this method to define points, lines, and polygons (closed shapes). The code example that follows creates two geometric points, one with coordinates (0, 0) and the second with coordinates (10, 10), and then the distance between both points is calculated and output:

-- Define the variables
DECLARE @point1 GEOMETRY, @point2 GEOMETRY, @distance FLOAT;
-- Initialize the geometric points
SET @point1 = geometry::STGeomFromText('POINT( 0  0)', 0);
SET @point2 = geometry::STGeomFromText('POINT(10 10)', 0);
-- Calculate the distance
SET @distance = @point1.STDistance(@point2);
SELECT @distance;

The result in the output is approximately 14.14 (see Figure 8-1; note that no units are defined here). The second argument in the STGeomFromText method is the spatial reference ID (SRID), which is relevant only for the geography data type. Still, it is a required parameter for the function and you should specify 0 for geometry data.

Image

Figure 8-1 The geometry defined in the sample script.

Using spatial data types in a database is valuable when you use the Database Engine to perform spatial queries. You have probably experienced the results of spatial queries in many applications; for example, when searching for nearby pizza restaurants on Bing Maps. Application code could certainly also perform those spatial queries; however, that would require the database to return all pizza restaurants along with their coordinates. By performing the spatial query in the database, the data size that is returned to the application is significantly reduced. SQL Server supports indexing spatial data such that spatial queries can perform optimally.
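For a round-earth example (a sketch with illustrative coordinates), the geography data type can calculate the distance in meters between two points expressed as latitude and longitude using SRID 4326 (WGS 84):

-- Distance in meters between two illustrative points defined by latitude and longitude
DECLARE @point1 GEOGRAPHY = geography::Point(47.6062, -122.3321, 4326),
        @point2 GEOGRAPHY = geography::Point(45.5152, -122.6784, 4326);
SELECT @point1.STDistance(@point2) AS DistanceInMeters;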

Image For a complete reference on the geometry and geography data types, the methods they support, and spatial reference identifiers, refer to Microsoft Docs at https://docs.microsoft.com/sql/t-sql/spatial-geometry/spatial-types-geometry-transact-sql and https://docs.microsoft.com/sql/t-sql/spatial-geography/spatial-types-geography.

The XML data type

A relational database is generally used to store highly structured data, by which we mean data that has a known schema. And even though schemas can change, at any given time every row in a table will have the same columns. Yet, for some scenarios, this strict schema is not appropriate. It might be necessary to accommodate storing data where different rows have different attributes. Sometimes, you can meet this requirement by adding additional nullable sparse columns. Column sets are a feature by which you can manage a group of sparse columns as XML data.

Image You can read more about sparse columns in the section “Sparse columns” later in the chapter.

Other times, this becomes onerous because a substantial number of columns introduces additional challenges in working with the table. In such cases, the xml data type can alleviate the column sprawl. Additionally, if data is frequently used in XML format, it might be more efficient to store the data in that format in the database.

Although XML data could be stored in (N)VARCHAR columns, using the specialized data type allows SQL Server to provide functionality for validating, querying, indexing, and modifying the XML data.

SQL Server 2016 introduced support for JSON, though it is not a data type. JSON support includes parsing, querying, and modifying JSON stored in varchar columns.
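As a brief sketch of that support, the ISJSON, JSON_VALUE, and OPENJSON functions validate a JSON string, extract a scalar value from it, and shred an embedded array into rows and columns:

DECLARE @json nvarchar(max) = N'{"OrderId": 42, "Lines": [
    {"Product": "Widget", "Qty": 3},
    {"Product": "Gadget", "Qty": 1}]}';
-- Validate the document and extract a scalar value
SELECT ISJSON(@json) AS IsValidJson,
       JSON_VALUE(@json, '$.OrderId') AS OrderId;
-- Shred the Lines array into rows and columns
SELECT Product, Qty
FROM OPENJSON(@json, '$.Lines')
    WITH (Product nvarchar(50) '$.Product',
          Qty     int          '$.Qty');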

Image For complete information on handling JSON-formatted data in SQL Server 2016 and later, refer to https://docs.microsoft.com/sql/relational-databases/json/json-data-sql-server.

Rowversion

This data type generates a database-wide unique binary value upon each modification of row data. This binary value increments with each insert or update statement that affects the row, even if no other row data is actually modified.

A rowversion column in a table is an excellent way to implement optimistic concurrency. In optimistic concurrency, a client reads data with the intent of updating it. However, unlike with pessimistic concurrency, a lock is not maintained. Instead, in the same transaction as the update statement, the client will verify that the rowversion was not changed by another process. If it hasn’t, the update proceeds. But if the rowversion no longer matches what the client originally read, the update will fail. The client application then can retrieve the current values and present the user with a notification and suitable options, depending on the application needs. Many object-relational mappers (ORMs), including Entity Framework, support using a rowversion column type to implement optimistic concurrency.
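The following sketch shows the essence of the pattern. It assumes a hypothetical table, dbo.Customers, with a rowversion column named RowVer; the client supplies the RowVer value it read earlier along with the new column values:

-- Assumed table: dbo.Customers (CustomerID int PRIMARY KEY, Name nvarchar(100), RowVer rowversion)
DECLARE @CustomerID int = 42,
        @OriginalRowVer binary(8) = 0x00000000000007D1,  -- value the client read earlier
        @NewName nvarchar(100) = N'Contoso, Ltd.';
UPDATE dbo.Customers
SET Name = @NewName
WHERE CustomerID = @CustomerID
  AND RowVer = @OriginalRowVer;  -- succeeds only if no other process changed the row
IF @@ROWCOUNT = 0
    THROW 50000, 'The row was modified or deleted by another process.', 1;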

Inside OUT

Can I implement optimistic concurrency without rowversion?

There are other ways to implement optimistic concurrency. A client application could track the value of each individual column in the row to be updated and verify that the columns its own update statement will modify have not been changed by another process. For example, client A reads a row of data and intends to change only the Name column. Client B reads the same row of data and updates the Address column. When client A attempts to update the Name, it finds that the Name column's value is unchanged and proceeds with the update.

This approach is suitable in some scenarios, but it has some drawbacks. First, each client needs to maintain additional state. In a web application, the amount of state to maintain can grow very large and consume a lot of memory. In a web farm scenario, maintaining such state might require shared state configuration because the web client might not communicate with the same web server on the POST that it did on the GET. But, perhaps more important, the data row can be inconsistent after the second update. If each client updates a column in the same row, the row’s data might not reflect a valid business scenario. Certainly, the row’s values would not reflect what each client believes it would be.

When designing tables with rowversion, keep the following restrictions in mind:

  • A table can have only a single rowversion column.

  • You cannot specify a value for the rowversion column in insert or update statements. However, unlike with identity or computed columns, you must specify a column list (one that omits the rowversion column) in insert statements for tables with a rowversion column. We should note that omitting the column list is not recommended anyway.

  • Although the Database Engine will not generate duplicate rowversion values within a database, rowversion values are not unique across databases or across instances.

  • Duplicate rowversion values can exist in a single database if a new table is created by using the SELECT INTO syntax. The new table’s rowversion values will be the same as those of the source table. This behavior might be desired, for example, when modifying a table’s schema by creating a new table and copying all the data into it. In other instances, this behavior might not be desired. In those cases, do not include the rowversion column in the SELECT INTO statement. Instead, alter the new table and add a rowversion column. This behavior and workaround are illustrated in a sample script file in the accompanying downloads for this book, which are available at
    https://aka.ms/SQLServ2017Admin/downloads.

The uniqueidentifier data type

The uniqueidentifier data type stores a 16-byte value known as a globally unique identifier (GUID). SQL Server can generate GUIDs using one of two functions: NEWID() and NEWSEQUENTIALID(). NEWSEQUENTIALID() generates a GUID that is greater than any GUID previously generated by that function on the server since its last restart. You can use NEWSEQUENTIALID() only as the default value for a column, which makes it suitable for a clustered primary key. Unlike NEWID(), which generates random values, the increasing nature of the GUIDs generated by NEWSEQUENTIALID() means that data and index pages will fill completely.
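The following sketch, with an illustrative table, shows NEWSEQUENTIALID() in its only supported position, a default constraint:

-- NEWSEQUENTIALID() is valid only in a default constraint
CREATE TABLE dbo.AuditEvents (
    AuditEventId uniqueidentifier NOT NULL
        CONSTRAINT DF_AuditEvents_AuditEventId DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_AuditEvents PRIMARY KEY CLUSTERED,
    EventTime datetime2(0) NOT NULL,
    EventDescription nvarchar(200) NOT NULL
);
-- Each insert receives an increasing GUID
INSERT INTO dbo.AuditEvents (EventTime, EventDescription)
VALUES (SYSUTCDATETIME(), N'Example event');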

Image The uniqueidentifier data type plays an important role in some replication techniques. For more information, see Chapter 12.

The hierarchyid data type

The hierarchyid data type provides a way for an application to store and query hierarchical data in a tree structure. A tree structure means that a row will have one parent and zero or more children. There is a single root element denoted by a single forward slash (/). hierarchyid values are stored as a binary format but are commonly represented in their string format. Each element at the same level in the hierarchy (referred to as siblings) has a unique numeric value (which might include a decimal point). In the string representation of a hierarchyid value, each level is separated by a forward slash. The string representation always begins with a slash (to denote the root element) and ends with a slash.

For example, as illustrated in Figure 8-2, a hierarchyid whose string representation is /1/10/ is a descendant of the /1/ element, which itself is a descendant of the implicit root element /. It must be noted, however, that SQL Server does not enforce the existence of a row with the ancestor element. This means that it is possible to create an element /3/1/ without its ancestor /3/ being a value in a row. Implicitly, it is a child of /3/, even if no row with hierarchyid value /3/ exists. Similarly, the row with hierarchyid element /1/ can be deleted if another row has hierarchyid value /1/10/. If you don’t want this, the application or database will need to include logic to enforce the existence of an ancestor when inserting and to prevent the deletion of an ancestor.

Image

Figure 8-2 hierarchyid values—the value /3/ is in gray to indicate it is implicit.

Perhaps surprisingly, SQL Server does not enforce uniqueness of the hierarchyid values unless you define a unique index or constraint on the hierarchyid column. It is, therefore, possible that the element /3/1/ is defined twice. This is likely not the desired situation, so we recommend that you ensure uniqueness of the hierarchyid values.

Using the hierarchyid data type is an appropriate choice if the tree is most commonly queried to find consecutive children. That is because the hierarchyid stores rows depth-first if it is indexed. You can create a breadth-first index by adding a computed column to the table, which uses the GetLevel() method on the hierarchyid column, and then creating an index on the computed column followed by the hierarchyid column. However, you cannot use a computed column in a clustered index, so this solution will still be less efficient compared to creating a clustered index on the hierarchyid value alone.
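The following sketch, using an illustrative org-chart table, shows the breadth-first pattern: a computed column based on GetLevel() is indexed together with the hierarchyid column:

CREATE TABLE dbo.OrgChart (
    OrgNode      hierarchyid NOT NULL PRIMARY KEY,  -- clustered PK provides depth-first order
    OrgLevel     AS OrgNode.GetLevel(),             -- computed column: depth of the node
    EmployeeName nvarchar(100) NOT NULL
);
-- Breadth-first index: nodes at the same level are stored together
CREATE INDEX IX_OrgChart_Breadth ON dbo.OrgChart (OrgLevel, OrgNode);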

Image For a complete overview of the hierarchyid data type, refer to https://docs.microsoft.com/sql/relational-databases/hierarchical-data-sql-server.

The sql_variant data type

The sql_variant data type allows a single column to store data of diverse types. You can also use the type for a parameter or a variable. In addition to storing the actual value, each sql_variant instance also stores metadata about the value, which includes the system data type of the value, its maximum size, scale and precision, and collation. Using sql_variant can be indicative of a poor database design, and you should use it judiciously. Client libraries that do not know how to handle the type might convert values to NVARCHAR(4000).

Keys and relationships

Proper database design calls for a process called normalization. Normalization is not covered in this book; suffice it to say that it leads to breaking logical entities into multiple related tables. Without intending to wax poetic, keys are like the nervous system of a relational database: they establish the relationships between the tables. A relational database system provides primary and foreign keys. Within a table, primary key values must be unique; those values can then be used as foreign key values in a related table. The foreign key values could also be unique in the related table, in which case the established relationship is a one-to-one relationship. This is referenced in the section “Vertical partitioning” later in the chapter.

A table can have exactly one primary key. The primary key can consist of multiple columns, in which case it’s referred to as a compound primary key. However, in no case can a nullable column be (part of) the primary key. If additional columns’ values should be unique, you can apply a unique index or constraint. (See the next section for coverage on constraints.)

Foreign keys are intended to establish referential integrity. Referential integrity enforces that the values found in the foreign key column(s) exist in the primary key column(s). By default, foreign keys in SQL Server have referential integrity enforced. It is possible to establish a foreign key without referential integrity enforced, or alter the foreign key to turn referential integrity off and on. This functionality is useful during import operations or certain types of database maintenance. Foreign keys should have referential integrity turned on to protect the integrity of your data.

One table can have multiple foreign keys. In addition to referencing a primary key, a foreign key can also reference the columns in a nonfiltered unique index.

When defining a foreign key, you can specify cascading. Cascading can occur when a parent row's primary key value is updated, or when the parent row is deleted. Cascading specifically means that the same operation is run on the child row(s) as was run on the parent. Thus, if the primary key value is updated, the foreign key values are updated to match; if the parent row is deleted, the child rows are deleted. Alternatively, on updates or deletes in the parent table, no action can be taken (the default, which causes the update or delete statement to fail if referential integrity is enforced), the foreign-key value can be set to null (effectively creating an orphaned row), or the foreign-key value can be set to its default constraint's specification (effectively mapping the child row to another parent).
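The following sketch, using illustrative invoice tables, declares a foreign key that cascades deletes from the parent to the child rows while taking the default (no action) on updates:

CREATE TABLE dbo.Invoices (
    InvoiceId int NOT NULL PRIMARY KEY
);
CREATE TABLE dbo.InvoiceLines (
    InvoiceLineId int NOT NULL PRIMARY KEY,
    InvoiceId     int NOT NULL
        CONSTRAINT FK_InvoiceLines_Invoices
        FOREIGN KEY REFERENCES dbo.Invoices (InvoiceId)
        ON DELETE CASCADE     -- deleting an invoice deletes its lines
        ON UPDATE NO ACTION   -- the default: updates to the parent key fail if lines exist
);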

Constraints

Constraints define rules to which your data must adhere, and those rules are enforced by the Database Engine. This makes constraints a very powerful mechanism for guaranteeing data integrity. In the previous section, we covered two types of constraints: primary keys and foreign keys. In this section, we provide details on unique, check, and default constraints.

A unique constraint enforces unique values in one column or selected columns. Unlike a primary key, the unique constraint allows the column(s) to be nullable.

A check constraint enforces rules that can be expressed by using a Boolean expression. For example, in the Sales.Invoices table in the sample WideWorldImporters database, there is a check constraint defined that requires the ReturnedDeliveryData column to either be null or contain valid JSON. Check constraints can reference more than one column. A frequently encountered requirement is that when one column contains a particular value, another column cannot be null.

Check constraints with compound conditions also give you a way to accommodate changing requirements. If a new business rule requires that a nullable column must now contain a value, but no suitable default can be provided for the existing rows, you could create a check constraint that verifies whether an incrementing ID column or date column is larger than the value it held when the rule took effect. For example, consider the table Sales.Invoices, which has a nullable column Comments. If, effective February 1, 2018, every new and modified invoice must have a value in the Comments column, the table could be altered using the following script:

ALTER TABLE Sales.Invoices WITH CHECK
    ADD CONSTRAINT CH_Comments CHECK (LastEditedWhen < '2018-02-01'
    OR Comments IS NOT NULL);

One requirement that you cannot meet by using a constraint is a column that must contain unique values only when a value is provided. In other words, the column should allow multiple rows with null but otherwise require uniqueness. The solution in that case is a filtered unique index, as shown in the example that follows.
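The following sketch, with an illustrative table and column, allows any number of NULL values while rejecting duplicates among the values that are present:

-- Unique only when a value is provided; multiple NULL rows are allowed
CREATE UNIQUE NONCLUSTERED INDEX UQ_Employees_NationalIDNumber
    ON dbo.Employees (NationalIDNumber)
    WHERE NationalIDNumber IS NOT NULL;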

The third and final constraint type is the default constraint. A default constraint specifies the value that will be used as the default value when an INSERT statement does not specify a value for the column.

Sequences

A sequence is a database object that generates a sequence of numeric values. How the sequence is generated is determined by its start value, increment value, and minimum and maximum values. A sequence can be ascending, which is the case when the increment value is positive. When the increment value is negative, the sequence is descending. A sequence object has some similarities to a column with an identity specification, but there are important distinctions:

  • You can define a sequence to cycle, meaning that when the numbers in the sequence are exhausted, the next use will return a previously generated number. Which number will be returned when the sequence cycles is determined by the increment: if it is an ascending sequence, the minimum value is returned; if it is a descending sequence, the maximum value is returned.

  • A sequence is not bound to any table. You can use numbers generated by the sequence in any table in the database, or outside of a table.

  • Sequence numbers can be generated without inserting a new row in a table.

  • Uniqueness is not enforced. If unique values are desired, a unique index should be placed on the column where the generated numbers are stored.

  • Values generated from a sequence can be updated.

Sequences are used when the application wants to have a numeric sequence generated at any time; for example, before inserting one or more rows. Consider the common case of a parent–child relationship. Even though most developer tools expect to work with identity columns, knowing a new parent row's primary key value up front and using it as the foreign key value in the child rows can have benefits for the application. A sequence is also useful when a single incrementing range is desired across multiple tables. More creative uses include defining a cycling sequence with a small range (five, for example) to automatically distribute new rows across five buckets in round-robin fashion.

To create a sequence, use the CREATE SEQUENCE command. When creating the sequence, you specify the integer data type; the start, increment, minimum and maximum values; and whether the numbers should cycle when the minimum or maximum value is reached. However, all of these are optional. If no data type is specified, the type will be bigint. If no increment is specified, it will be 1. If no minimum or maximum value is specified, the minimum and maximum value of the underlying data type will be used. By default, a sequence does not cycle. The sample script that follows creates a sequence called MySequence of type int. The values start at 1001 and increment by 1 until 1003 is reached, after which 1001 will be generated again. The script demonstrates the cycling of the values using the WHILE loop.

-- Define the sequence
CREATE SEQUENCE dbo.MySequence AS int
    START WITH 1001
    INCREMENT BY 1
    MINVALUE 1001
    MAXVALUE 1003
    CYCLE;
-- Declare a loop counter
DECLARE @i int = 1;
-- Execute 4 times
WHILE (@i <= 4)
BEGIN
    -- Retrieve the next value from the sequence
    SELECT NEXT VALUE FOR dbo.MySequence AS NextValue;
    -- Increment the loop counter
    SET @i = @i + 1;
END

The output of the script will be 1001, 1002, 1003, and 1001. The sequence is used by calling NEXT VALUE FOR. You can use NEXT VALUE FOR as a default constraint or as a function parameter. There are, however, quite a few places where NEXT VALUE FOR cannot be used.

Image For a listing of limitations, refer to Microsoft Docs at https://docs.microsoft.com/sql/t-sql/functions/next-value-for-transact-sql#limitations-and-restrictions.

NEXT VALUE FOR generates and returns a single value at a time. If multiple values should be generated at once, the application can use the sp_sequence_get_range stored procedure. This procedure generates as many numbers from the sequence as specified and returns metadata about the generated numbers. The actual values that have been generated are not returned. The sample script that follows uses the MySequence sequence to generate five numbers. The metadata is captured in variables and later output. You’ll note that the data type of most output parameters is sql_variant. The underlying type of those parameters is the data type of the sequence.

-- Declare variables to hold the metadata
DECLARE @FirstVal sql_variant, @LastVal sql_variant,
    @Increment sql_variant, @CycleCount int,
    @MinVal sql_variant, @MaxVal sql_variant;
-- Generate 5 numbers and capture all metadata
EXEC sp_sequence_get_range 'MySequence'
    ,  @range_size = 5
    , @range_first_value = @FirstVal OUTPUT
    , @range_last_value = @LastVal OUTPUT
    , @range_cycle_count = @CycleCount OUTPUT
    , @sequence_increment = @Increment OUTPUT
    , @sequence_min_value = @MinVal OUTPUT
    , @sequence_max_value = @MaxVal OUTPUT;
-- Output the values of the output parameters
SELECT @FirstVal AS FirstVal, @LastVal AS LastVal
    , @CycleCount AS CycleCount, @Increment AS Increment
    , @MinVal AS MinVal, @MaxVal AS MaxVal;

The output of this sample script will vary with each run; however, every three cycles, it will repeat.

User-defined data types and user-defined types

SQL Server supports creating new data types. Two variations exist: user-defined data types (UDTs), which alias existing data types, and user-defined types, which are .NET Framework types. UDTs have no specific requirements because they are merely a new name for an existing system data type, optionally augmented with length or precision, nullability, a default value, or a validation rule. For example, if you want to ensure that a customer name was always defined as an nvarchar column with a max length of 100 characters, you might use the CREATE TYPE statement, as shown here:

CREATE TYPE CustomerNameType FROM NVARCHAR(100);
GO

After creating this UDT, in any place where you would ordinarily specify NVARCHAR(100), you can use CustomerNameType, instead. This can be in a table’s column definition, as the return type of a scalar function, or as a parameter to a stored procedure. The following abbreviated CREATE TABLE statement, which is based on the WideWorldImporters sample Customers table, illustrates how CustomerNameType replaces NVARCHAR(100):

CREATE TABLE [Sales].[Customers] (
    [CustomerID] INT NOT NULL,
    [CustomerName] CustomerNameType,
    …

Although this functionality is deprecated, you can further extend the CustomerNameType UDT by providing a default value and enforcing validation using a rule. The T-SQL snippet that follows first defines a default with the value 'NA' and a rule that requires that the length is at least three characters or equals 'NA'. It then binds the default and the rule to the CustomerNameType UDT:

-- Create a database-wide default
CREATE DEFAULT CustomerNameDefault
AS 'NA';
GO
-- Create a database-wide validation rule
CREATE RULE CustomerNameRule
AS
-- Allow the value 'NA' (the default)
@value = 'NA' OR
-- Or require at least 3 characters
LEN(@value) >= 3;
GO
-- Bind the default and the rule to the CustomerNameType UDT
EXEC sp_bindefault 'CustomerNameDefault', 'CustomerNameType';
EXEC sp_bindrule 'CustomerNameRule', 'CustomerNameType';

After running these statements, most uses of the CustomerNameType will enforce the validation rule. The notable exception is when the data type is used in a variable declaration.

You develop user-defined types in a .NET language such as C#, and you must compile them into a .NET assembly. This .NET assembly is then registered in the database where the UDT will be used. A database can use these types only if SQL CLR is turned on.

We should warn against the liberal use of either variant of custom data types. They can make a database schema significantly more difficult to understand and troubleshoot. Alias types, in particular, add little value, because they do not create new behavior. SQL CLR types allow SQL Server to expose new behavior, but they come with a significant security risk.

Sparse columns

As mentioned earlier in the discussion of the xml data type, a potential way to save storage space in tables with many nullable columns that contain many null values is to use sparse columns. When using sparse columns, the storage space saved on null values is traded for increased overhead when retrieving non-null values. Microsoft suggests that you should expect a space savings of at least 20 percent before the overhead is worth it.

Image The Microsoft Docs at https://docs.microsoft.com/sql/relational-databases/tables/use-sparse-columns define the space savings by data type when using sparse columns.

Sparse columns are defined in CREATE or ALTER TABLE statements by using the SPARSE keyword. The sample script that follows creates a table, OrderDetails, with two sparse columns, ReturnedDate and ReturnedReason. Sparse columns are useful here because we expect most products not to be returned and because the ReturnedDate and ReturnedReason columns will be retrieved only occasionally.

CREATE TABLE OrderDetails (
    OrderId int NOT NULL,
    OrderDetailId int NOT NULL,
    ProductId int NOT NULL,
    Quantity int NOT NULL,
    ReturnedDate date SPARSE NULL,
    ReturnedReason varchar(50) SPARSE NULL);

Computed columns

Most columns store original data. Derived data—that is, data that is the result of a calculation—is not ordinarily stored. Instead, the application derives it every time it's needed. In some circumstances, storing derived data in the database can be beneficial. SQL Server supports storing derived data using computed columns and indexed views (indexed views are beyond the scope of this chapter). Computed columns are defined in a table as the result of an expression involving other columns in the table, function calls, and perhaps constants.

Although derived data can be saved in the database, by default a computed column is not persisted. Using a computed column is always a trade-off. With either type of computed column, you have decided that there is some benefit to having the database be aware of the derived data, which requires putting the business logic that computes the value in the database. You might find that beneficial because the database can be the central source of the computation instead of having it spread out across multiple systems. When you use persisted computed columns, you are trading storage space for compute efficiency.

If the expression that computes the computed column’s value is deterministic, the computed column can also be indexed. An expression is deterministic if that expression will always return the same result for the same inputs. An example of a nondeterministic expression is one that uses the SYSDATETIME() function.

Image For a complete discussion of indexing computed columns, refer to Microsoft Docs at https://docs.microsoft.com/sql/relational-databases/indexes/indexes-on-computed-columns.

Using the WideWorldImporters sample database, you will find two computed columns in the Sales.Invoices table. One of those columns is ConfirmedDeliveryTime. It is derived by examining the contents of the JSON value stored in the ReturnedDeliveryData column and converting it to a datetime2 value. The datetime2 value is not persisted in this case, which means that each time ConfirmedDeliveryTime is queried, the expression is evaluated. If the column were persisted, the expression would be evaluated only when the row is created or updated.

When you are defining a computed column, instead of specifying a data type, you specify an expression following the AS clause. Using the Sales.OrderLines table in the same sample database, you can create a computed column to calculate the order line’s extended price. The following sample SQL statement illustrates how:

ALTER TABLE Sales.OrderLines
    ADD ExtendedPrice AS (Quantity * UnitPrice) PERSISTED;

This statement creates a new column in the table called ExtendedPrice. This column's value is computed by using the expression Quantity * UnitPrice. The column is persisted because we expect to query this value frequently. The data type of the computed column is determined by SQL Server based on the result of the expression; in this case, it is decimal(29,2). If the determined data type is not suitable for your needs, you can apply a cast to a data type that is more appropriate.

Special table types

As data storage needs have become more specialized, SQL Server has gained extended functionality to support these scenarios in the form of special table types. These table types support scenarios that would otherwise have required significant effort on behalf of the DBA to implement. Thus, although the table types discussed in this section support requirements that can be implemented without them, using these tables makes it possible for you to offload some of the work to the Database Engine. This section discusses temporal tables, memory-optimized tables, external tables, and graph tables. We discuss another special table type, FileTable, in the next section.

System-versioned temporal tables

System-versioned temporal tables, or “temporal tables” for short, are designed to keep not only current values of rows, but also historic values. When a table is designed to be a temporal table, it has two explicitly defined columns of type datetime2 that are used to indicate the validity period of the row. In addition to the current table, there is a companion history table with the same schema. SQL Server can create the history table at the time the current table is created, but you will need to specify the history table’s name. Alternatively, you might use an existing table as the history table, in which case the Database Engine will validate that the schema matches.
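The following sketch creates a new system-versioned temporal table with an illustrative schema; the two period columns and the history table name are choices you make:

CREATE TABLE dbo.Products (
    ProductId   int           NOT NULL PRIMARY KEY,
    ProductName nvarchar(50)  NOT NULL,
    UnitPrice   decimal(9, 2) NOT NULL,
    ValidFrom   datetime2(7) GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo     datetime2(7) GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductsHistory));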

SQL Server manages the movement of data from the current table to the history table. The following list first indicates what happens with each Data Manipulation Language (DML) operation, and then describes the subclauses of the FOR SYSTEM_TIME clause that you use to query current and history data together:

  • INSERT and BULK INSERT. A new row is added to the current table. The row's validity start time is set to the transaction's start time. The validity end time is set to the maximum value of the datetime2 type: December 31, 9999, a fractional second (or a whole second when using datetime2(0)) before midnight. There is no change in the history table.

  • UPDATE. A new row is added to the history table with the old values. The validity end time is set to the transaction’s start time. In the current table, the row is updated with the new values and the validity start time is updated to the transaction’s start time. Should the same row be updated multiple times in the same transaction, multiple history rows with the same validity start and end time will be inserted.

  • DELETE. A new row is added to the history table containing the values from the current table. The validity end period is set to the transaction’s start time. The row is removed from the current table.

  • ALL. When querying a temporal table with FOR SYSTEM_TIME, the ALL subclause returns essentially the union of the current and the history tables. Multiple rows can be returned for the same primary key in the current table. This will be the case for any row that has one or more history entries, as shown here:

    SELECT PersonID, FullName,
        CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
             ELSE 0 END AS IsCurrent
    FROM Application.People FOR SYSTEM_TIME ALL
    ORDER BY ValidFrom;

  • AS OF. The AS OF subclause returns the rows that were valid at a single point in time, interpreted in UTC. Rows that had been deleted from the current table or that didn't exist yet will not be included:

    SELECT PersonID, FullName,
        CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
             ELSE 0 END AS IsCurrent
    FROM Application.People FOR SYSTEM_TIME AS OF '2016-03-13'
    ORDER BY ValidFrom;

  • FROM … TO. This subclause returns all rows that were active between the specified lower bound and upper bound. In other words, a row is included if its validity start time is before the upper bound and its validity end time is after the lower bound. Rows that became active exactly on the upper bound are not included. Rows that closed exactly on the lower bound are not included. This clause might return multiple rows for the same primary key value:

    SELECT PersonID, FullName,
        CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
             ELSE 0 END AS IsCurrent
    FROM Application.People FOR SYSTEM_TIME FROM '2016-03-13' TO '2016-04-23'
    ORDER BY ValidFrom;

  • BETWEEN … AND. This is like FROM … TO, but rows that opened exactly on the upper bound are included, as well:

    SELECT PersonID, FullName,
        CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
             ELSE 0 END AS IsCurrent
    FROM Application.People FOR SYSTEM_TIME BETWEEN '2016-03-13' AND '2016-04-23'
    ORDER BY ValidFrom;

  • CONTAINED IN (,). This subclause returns rows that were active only between the lower and the upper bound. If a row was valid earlier than the lower bound or valid past the upper bound, it is not included. A row that was opened exactly on the lower bound or closed exactly on the upper bound will be included. If the upper bound is earlier than the maximum value for datetime2, only history rows will be included:

    DECLARE @now DATETIME2 = SYSUTCDATETIME();
    SELECT PersonID, FullName,
        CASE WHEN ValidTo = '9999-12-31 23:59:59.9999999' THEN 1
             ELSE 0 END AS IsCurrent
    FROM Application.People FOR SYSTEM_TIME CONTAINED IN ('2016-03-13', @now)
    ORDER BY ValidFrom;

Inside OUT

How can I design temporal tables to use the least amount of space?

New for SQL Server 2017, you can use the new HISTORY_RETENTION_PERIOD option with the SYSTEM_VERSIONING option when defining or altering system-versioned temporal tables. By default, the HISTORY_RETENTION_PERIOD is INFINITE, meaning that SQL Server will not automatically purge history data. By defining a finite period, such as 6 MONTHS, SQL Server will automatically purge history records with a valid end time older than the finite period. Two conditions must hold true: first, the temporal history retention flag must be turned on for the database (which it is by default), and second, the history table must have a clustered or Columnstore index.
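As a sketch, the following statement sets a six-month retention period on the illustrative dbo.Products temporal table created earlier in this chapter:

-- History rows whose validity ended more than 6 months ago are purged automatically
ALTER TABLE dbo.Products
    SET (SYSTEM_VERSIONING = ON (HISTORY_RETENTION_PERIOD = 6 MONTHS));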

In addition to using automatic retention, which controls the growth of the history table, you can also consider vertically partitioning the temporal table. Vertical partitioning is discussed in more detail in the section “Vertical partitioning” later in this chapter. By splitting the table vertically into two tables, and only making one table system versioned, you can achieve significant space savings because history will be kept only for the columns in the system-versioned table. This does come at the expense of potentially frequent JOINs between both tables. This approach is also not suitable if you are system-versioning tables for compliance requirements for which row data must be available in exactly the form it was at any given point.

In addition to reducing the history kept by setting a retention period and using vertical partitioning to avoid keeping history for columns that do not require it, you might also consider Stretch Database, horizontal partitioning, or a custom cleanup script to manage the history data. These options are described in detail at https://docs.microsoft.com/sql/relational-databases/tables/manage-retention-of-historical-data-in-system-versioned-temporal-tables.

Memory-optimized tables

A traditional database table's data is loaded into memory as needed to efficiently run queries. The operations of loading data from durable storage into memory and later removing it from memory are handled by the Database Engine, and many factors play a role in when data is loaded or released. Memory-optimized tables avoid this by ensuring that the data is always available in memory. The data is durable by default, using the transaction log and storage on the drive. A “schema only” option is available, which does not preserve data across service restarts and certain other operations.

The benefit of keeping all data from specific tables in memory is blazing-fast performance, which often can be improved by another order of magnitude by applying a Columnstore index to the memory-optimized table. (Columnstore indexes are covered in Chapter 9 and Chapter 10.) This of course requires that the server has sufficient memory to hold the memory-optimized tables' data while still leaving enough room for other operations.

Image This chapter discusses only the setup and configuration of memory-optimized tables along with caveats. You can find complete discussion of the purpose and the use of memory-optimized tables in Chapter 10.

Memory-optimized tables are available in all editions of SQL Server and in Azure SQL Database’s Premium Tier.

Database preparation for memory-optimized tables

First, you must prepare the database. There are two database requirements: the database compatibility level must be at least 130 and the snapshot isolation level must be supported. For SQL Server, you need to create a memory-optimized filegroup. There is no such requirement for Azure SQL Database; or, more accurately, the filegroup is intrinsically present.

Microsoft provides a T-SQL script to ensure that these settings are correct and that a memory-optimized filegroup is created. You can even run the script in Azure SQL Database to ensure that the database supports memory-optimized tables. Rather than reprinting this script here, we refer you to the GitHub content.

Image See the GitHub user content page at https://raw.githubusercontent.com/Microsoft/sql-server-samples/master/samples/features/in-memory/t-sql-scripts/enable-in-memory-oltp.sql.

The script first checks to ensure that the instance or database supports memory-optimized tables. If you’re running on SQL Server, the script will create a memory-optimized filegroup and container if none already exist. The script also checks and sets the database compatibility level.

After these actions are complete, you are ready to create one or more memory-optimized tables. The WITH (MEMORY_OPTIMIZED = ON) clause is the key to creating a memory-optimized table. Memory-optimized tables support indexing, but you must create and delete indexes using ALTER TABLE … ADD/DROP INDEX statements instead of CREATE/DROP INDEX statements.
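The following sketch creates a durable memory-optimized table with an illustrative schema and then adds a nonclustered index using ALTER TABLE:

CREATE TABLE dbo.ShoppingCart (
    CartId     int           NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    CreatedAt  datetime2(0)  NOT NULL,
    TotalValue decimal(9, 2) NULL
)
WITH (MEMORY_OPTIMIZED = ON,
      DURABILITY = SCHEMA_AND_DATA);  -- the default; SCHEMA_ONLY skips data durability
-- Indexes are added and removed with ALTER TABLE, not CREATE/DROP INDEX
ALTER TABLE dbo.ShoppingCart
    ADD INDEX IX_ShoppingCart_CreatedAt NONCLUSTERED (CreatedAt);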

Natively compiled stored procedures and user-defined functions

You can access memory-optimized tables via standard T-SQL statements and stored procedures. However, you can achieve significant additional performance gains if you use natively compiled stored procedures. These stored procedures are compiled to machine code the first time they are run rather than evaluated every time they run.

To create a natively compiled stored procedure, use the WITH NATIVE_COMPILATION clause of the CREATE PROCEDURE statement. In addition, the BEGIN ATOMIC statement is required instead of BEGIN TRANSACTION for natively compiled procedures and functions. This statement either begins a new transaction or creates a savepoint in an existing transaction on the session. The BEGIN ATOMIC statement has two options:

  • TRANSACTION_ISOLATION. You must set this value to one of the three supported isolation levels: snapshot, repeatable read, or serializable.

  • LANGUAGE. This is a name value from the sys.syslanguages system compatibility view. For example, for United States English, it is us_english, and for Dutch it is Nederlands.

The BEGIN ATOMIC statement is also where delayed durability can be specified (DELAYED_DURABILITY = ON). Delayed durability means that the Database Engine will report to the client that the transaction committed before the log record has been committed to a drive. This creates a risk of data loss should the service or server shut down before the asynchronous log write is completed. You should take the same care to use delayed durability with BEGIN ATOMIC as with BEGIN TRANSACTION. Further, to use delayed durability, it must also be allowed at the database level. Schema-only memory-optimized tables already do not use transaction logging, so when modifying data in those tables, there is no benefit in specifying delayed durability.
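The following sketch brings these options together in a natively compiled procedure that inserts into the illustrative dbo.ShoppingCart table created earlier; SCHEMABINDING is required for natively compiled modules:

CREATE PROCEDURE dbo.InsertShoppingCart
    @CartId int,
    @TotalValue decimal(9, 2)
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH (
    TRANSACTION ISOLATION LEVEL = SNAPSHOT,
    LANGUAGE = N'us_english')
    INSERT INTO dbo.ShoppingCart (CartId, CreatedAt, TotalValue)
    VALUES (@CartId, SYSUTCDATETIME(), @TotalValue);
END;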

Caveats to memory-optimized tables

To put it plainly, you should probably not convert all of your tables to memory-optimized tables. There are several caveats that must be considered before adopting memory-optimized tables and when deciding which tables to turn into memory-optimized tables. We discuss these caveats in this section.

Memory-optimized tables support only three transaction isolation levels: snapshot, repeatable read, and serializable. If your application has a need for other isolation levels, you will not be able to implement memory-optimized tables.

Because all table data is kept in memory, you would naturally expect additional memory requirements. However, when planning for memory size, consider that the memory requirement of a memory-optimized table can be more than twice the size of the data in the table. This is because of processing overhead requirements.

Image To review specific guidance on planning for memory size, refer to Microsoft Docs at https://docs.microsoft.com/sql/relational-databases/in-memory-oltp/estimate-memory-requirements-for-memory-optimized-tables.

When durable memory-optimized tables are used, upon service start, the Database Engine loads all of their data from the drive into memory. The database is not available while this operation takes place. With large tables, this can lead to significantly longer startup times. Even though you might carefully plan your service or server restarts for a maintenance window, an unplanned failover on a failover cluster instance (FCI) can take significantly longer. This might be detrimental to meeting your Service-Level Agreement (SLA), which might have been the entire reason to configure an FCI in the first place. If the performance of memory-optimized tables is needed in combination with a high-availability configuration, you might consider Always On availability groups instead. Because the Database Engine service is already running on the secondary, there is no delay caused by having to read the data from a drive.

One way to reduce database startup time due to memory-optimized tables is to ensure that checkpoints are taken frequently, because a checkpoint writes the changed rows of memory-optimized tables to the checkpoint files on the drive. Any data that has not been written to the checkpoint files must be recovered from the transaction log. However, for large tables, this benefit is likely small.

Another contributor to delays, though occurring after service start, is that the first run of a natively compiled stored procedure can take about as long as running a traditional, interpreted stored procedure. This is because the compiled machine code is not persisted and must be regenerated on first execution. Subsequent executions of the natively compiled stored procedure use the compiled version and are faster.

Memory-optimized tables use an optimistic concurrency model: neither locks nor latches are taken, and row versioning is used instead. This means that a client application might experience unexpected update conflicts when two sessions modify the same row. You should design the application to detect and retry such conflicts.

Not unlike when faster drive storage is used for SQL Server, when adopting memory-optimized tables, you might find that the CPU usage is much higher. This is because much less time is spent waiting for I/O operations to complete. The first consideration on this point is that this is exactly the reason why you implemented memory-optimized tables: the CPU utilization is higher because data is being processed faster! Another consideration could be that you might inadvertently reduce the number of concurrent requests that can be served, especially if one instance runs multiple databases. If this is a concern, you can consider using Resource Governor to manage the relative CPU usage of each database.

PolyBase external tables

PolyBase external tables bring SQL Server into the NoSQL era. An external table is a reference to one of two types of nonrelational data stores. (A third type is a reference to a table in another Azure SQL Database to configure elastic query, but that scenario is beyond the scope of this chapter.) Specifically, an external table can reference data in a Hadoop cluster or an Azure Storage blob. The data in the Hadoop cluster or Storage blob must be in a known format; for example, a delimited text file.

After the external table is defined, you can query it by using T-SQL alongside the relational data stored in the SQL Server database. You can also import the external data into relational tables.

To define an external table referencing a Hadoop cluster or an Azure Storage blob, three steps are required. First, an external data source is defined, which determines the type and location of the external data. Next, an external file format is defined, which determines the format of the external data. SQL Server recognizes delimited text, RCFile, and ORC formats. The third and final step is to define the external table itself.
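
The following sketch walks through the three steps for delimited text files in Azure Blob storage; the data source location, credential name, folder path, and column definitions are illustrative assumptions:

-- 1. Define the external data source (assumes a database-scoped credential exists)
CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://container@storageaccount.blob.core.windows.net',
      CREDENTIAL = AzureStorageCredential);
-- 2. Define the external file format
CREATE EXTERNAL FILE FORMAT CommaDelimitedText
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));
-- 3. Define the external table over the files in the specified folder
CREATE EXTERNAL TABLE dbo.ExternalSales (
    SaleId INT,
    SaleAmount DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/',
      DATA_SOURCE = AzureBlobStore,
      FILE_FORMAT = CommaDelimitedText);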

Image For details on creating and working with PolyBase external tables, refer to Microsoft Docs at https://docs.microsoft.com/sql/t-sql/statements/create-external-table-transact-sql.

Graph tables

New to SQL Server 2017, graph functionality provides schema extensions to store directed graph data—that is, nodes and edges—in the relational database. Fitting graph data in a relational database is challenging, and this new feature attempts to resolve these challenges. The graph features in SQL Server 2017 solve common issues but are not a complete replacement for dedicated graph databases that support advanced scenarios.

Graph data is often associated with networks, such as social networks. More generally, graphs are data structures that consist of nodes and edges. The nodes represent entities, and the edges represent the connections between those entities. We are using the term “entities” here in a more generic fashion than what you commonly think of as entities in a relational model. Nodes are also referred to as vertices, and edges as relationships.

Some use cases lend themselves particularly well to be stored in a graph model. Examples of such use cases include the following:

  • Interconnected data. A commonly used example of interconnected data is that of social networks. Social network data expresses relationships among people, organizations, posts, pictures, events, and more. In such a data model, each entity can be connected to any other entity, creating lots of many-to-many relationships. In a relational database, this requires the creation of a join table for each many-to-many relationship. Querying such relationships requires two or more INNER JOIN clauses, which can quickly create lengthy SELECT statements. Such statements can be difficult to digest and are potentially error prone. Graph databases support flexible definitions of relationships.

  • Hierarchical data. SQL Server provides a specialized data type, hierarchyid (discussed earlier), that supports modeling simple tree hierarchies. One limitation of this data type is its inability to support multiple parents. A node in a graph has no such limitation: a single node can have many parents in addition to many children.

  • Many-to-many relationships that can be extended at any time during the data life cycle. Relational databases have strict requirements for the definition of tables and relationships. For a data model that can evolve quickly to require new relationships, this strict schema requirement can get in the way of meeting evolving requirements.

You can effectively implement these use cases by employing a graph database, but it is important to note that SQL Server 2017’s graph features do not (yet) provide a solution that is on-par with dedicated graph databases. We discuss some of the limitations of the current implementation in the section “Current graph table shortcomings” later in the chapter.

Defining graph tables

In SQL Server 2017 and Azure SQL Database, you can store graph data in two new types of table: node and edge tables. These table types are still stored internally as relational structures, but the Database Engine has additional capabilities to manage and query the data that is stored within them. The T-SQL CREATE TABLE syntax has been extended with two new clauses: AS NODE and AS EDGE. The following T-SQL script creates a People node table and a Relationships edge table. You can run the script in any existing or new database; there are no specific requirements of the database:

CREATE TABLE dbo.People (
    PersonId INT NOT NULL PRIMARY KEY CLUSTERED,
    FirstName NVARCHAR(50) NOT NULL,
    LastName NVARCHAR(50) NOT NULL
) AS NODE;
CREATE TABLE Relationships (
    RelationshipType NVARCHAR(50) NOT NULL
) AS EDGE;

In the sample script, both the node and the edge table contain user-defined columns. Edge tables are not required to have user-defined columns, but node tables must have at least one. In the case of edge tables, which model relationships, simply modeling the relationship without additional attributes can be desirable. In the case of node tables, which model entities, there is no value in a node without properties. Designing a node table is comparable to designing a relational table; you would still consider normalization and other concepts.

In addition to user-defined columns, both table types also have one or more implicit columns. Node tables have one implicit (also called pseudo) column, $node_id, which uniquely identifies the node in the database. This pseudo-column is backed by two actual columns:

  • graph_id_<hex_string_1>. This is a BIGINT column, which stores the internally generated graph ID for the row. This column is internal and cannot be explicitly queried.

  • $node_id_<hex_string_2>. This column can be queried and returns a computed NVARCHAR value that includes the internally generated BIGINT value and schema information. You should avoid explicitly querying this column. Instead, you should query the $node_id implicit pseudo column.

In addition to optional user-defined columns, edge tables have three implicit columns:

  • $edge_id_<hex_string_3>. This is a system-managed value, comparable to the $node_id column in a node table.

  • $from_id_<hex_string_4>. This references a node ID from any node table in the graph. This is the source node in the directed graph.

  • $to_id_<hex_string_5>. This references a node ID from any node table in the graph, whether it is the same or a different table than the $from_id node’s table. This is the target node in the directed graph.

Inside OUT

When should I choose graph tables over relational tables?

First, it’s important to understand that there is nothing inherent to a graph database that makes it possible for you to solve a problem that you cannot also solve using a relational database. The relational database concept has been around for nearly five decades, and relational database management systems are as popular as ever.

The use cases described earlier and queries described momentarily are examples of data models and operations that might be better addressed by a graph database. This is because the Database Engine has specific optimizations to address some of the particular types of queries that are often run against such models.

In addition, a graph table can still contain foreign keys referring to relational tables, and a relational table can contain a foreign key referring to a graph table.

Working with graph data

DML statements generally work the same in graph tables as they do in relational tables. Some operations are not supported, though. An edge table does not support updating the $from_id or $to_id values of an existing edge. Thus, to update a relationship, the existing edge row must be deleted and a new one inserted. User-defined columns of edge tables do support update operations.

When querying graph data, you can write your own table joins to join nodes to edges to nodes, though this approach offers none of the benefits of graph tables. Instead, using the new MATCH subclause in the WHERE clause, you can use a new style of expression, referred to as ASCII art, to indicate how nodes and edges should be traversed. You might be surprised to find that the node and edge tables are joined using old-style join syntax first. The MATCH subclause then performs the actual equi-joins necessary to traverse the graph.

The brief example that follows is intended to provide an introduction only. It builds on the creation of the People and Relationship tables shown in the previous example. First, a few rows of sample data are inserted. Then, the sample data is queried by using the MATCH subclause:

-- Insert a few sample people
-- $node_id is implicit and skipped
INSERT INTO People VALUES
    (1, 'Karina', 'Jakobsen'),
    (2, 'David', 'Hamilton'),
    (3, 'James', 'Hamilton');
-- Insert a few sample relationships
-- The first sub-select retrieves the $node_id of the from_node
-- The second sub-select retrieves the $node_id of the to node
INSERT INTO Relationships VALUES
    ((SELECT $node_id FROM People WHERE PersonId = 1),
     (SELECT $node_id FROM People WHERE PersonId = 2),
     'spouse'),
    ((SELECT $node_id FROM People WHERE PersonId = 2),
     (SELECT $node_id FROM People WHERE PersonId = 3),
     'father');
-- Simple graph query
SELECT P1.FirstName + ' is the ' + R.RelationshipType +
    ' of ' + P2.FirstName + '.'
FROM People P1, People P2, Relationships R
WHERE MATCH(P1-(R)->P2);

The ASCII art syntax used in the MATCH subclause means that a node in the People table should be related to another node in the People table using the Relationships edge. As with self-referencing many-to-many relationships, the People table needs to be present in the FROM clause twice to allow the second People node to be different from the first. Otherwise, the query would retrieve only edges in which people are related to themselves (there is no such relationship in our sample).

The true power of the MATCH subclause is evident when traversing edges between three or more nodes. One such example would be finding restaurants your friends have liked in the city where your friends live and where you intend to travel.

Image For a more comprehensive sample graph database, refer to Microsoft Docs at https://docs.microsoft.com/sql/relational-databases/graphs/sql-graph-sample.

Current graph table shortcomings

This first release of SQL Server with support for graph models is missing several features that dedicated graph databases usually have. This doesn’t mean that the current graph support isn’t useful, but if your short-term needs for graph support require these features, SQL Server might not yet be the right solution for you. The following list contains the most notable missing graph features and a brief description of their significance for graph processing. Hopefully, this will provide the information you need to make an informed decision about using SQL Server for graph data.

  • No graph analytic functions. Graph processing commonly involves answering questions such as, “What is the shortest path between two nodes?” Specific algorithms exist to answer these questions; however, SQL Server does not implement any graph analytic functions.

  • Need to explicitly define edges as tables. Graphs model pairwise relations between entities (the nodes). Flexibility can be key in maximizing the benefits of graph models. Even though the nodes are often well understood, including their properties, new relationships can be modeled as new needs arise or additional possibilities emerge. The need to make schema modifications to support new types of edges reduces flexibility. Some of this can be addressed by defining one or a few edge tables and storing the edge properties as XML or JSON. This approach, too, has drawbacks in terms of performance and ease of writing queries against the data.

  • No transitive closures. The current MATCH subclause syntax does not support querying recursively. A transitive closure is when node A is connected to node B, which is in turn connected to node C, and we want to traverse the graph from node A to node C, but we don’t know that node B is involved at all; or, indeed, that there might even be a node B1 involved. Currently, MATCH can traverse multiple edges but the query needs to be defined knowing which edges and how many edges are involved. As an alternative, you could write a recursive common table expression (CTE) or a T-SQL loop. Dedicated graph databases can traverse graphs without being directed as to which edges and how many edges to use.

  • No polymorphism. Polymorphism is the ability to find a node of any type connected to a specified starting node. In SQL Server, a workaround for graph models with few node and edge types is to query all known node and edge types and combine the result sets by using a UNION clause. For large graph models, this solution becomes impractical.

Storing BLOBs

Storing LOBs, and more specifically BLOBs, in the relational database has been known to cause debate. Prior to SQL Server offering the FILESTREAM feature as a specialized way for the Database Engine to manage BLOBs using the file system, database designers had two suboptimal choices:

  • Store the BLOB, such as an image, video, or document file, in a VARBINARY column. Downsides of this approach include rapid growth of the data file, frequent page splits, and pollution of the buffer pool. Benefits include transactional integrity and integrated backup and recovery of the BLOB data.

  • Have the application store the BLOB in the file system and use an NVARCHAR column to store the file system path to the file. Downsides of this approach include requiring the application to manage data integrity (e.g., avoiding orphaned or missing files) and lack of integrated security (i.e., the security mechanism to secure the BLOBs is an entirely different model than that for protecting the database). There are some benefits, though, primarily around performance and ease of programming for the client to work with the BLOBs (i.e., using traditional file I/O APIs provided by the OS).

The FILESTREAM feature is designed to provide a way to have the best of both alternatives. FILESTREAM is not a data type as much as it is an extension to VARBINARY(max). This section discusses FILESTREAM and FileTable. FileTable builds on FILESTREAM, so we first cover FILESTREAM and then FileTable.

Understanding FILESTREAM

To take advantage of FILESTREAM, there are three requirements. First, the instance must be configured to allow at least one of several levels of FILESTREAM. Second, your database will need to have at least one FILESTREAM filegroup. Third, any table containing a FILESTREAM column requires a unique, non-null ROWGUIDCOL column (a uniqueidentifier). A FILESTREAM filegroup refers to a location on an NT File System (NTFS) or Resilient File System (ReFS) volume that is under the control of the Database Engine. The Database Engine uses this location to store the binary data and the log files for the binary data.

When a FILESTREAM filegroup is available in the database, FILESTREAM can be used as a modifier on VARBINARY(MAX) columns. When creating a table with a FILESTREAM column, you can specify on which filegroup the FILESTREAM data will be stored. When multiple FILESTREAM files are added to a single filegroup, the files will be used in round-robin fashion, as long as they don’t exceed their maximum size.
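
The following minimal sketch adds a FILESTREAM filegroup and container to a database and creates a table with a FILESTREAM column; the database name, folder path, and table definition are illustrative assumptions:

-- Add a FILESTREAM filegroup and a container (a folder managed by the Database Engine)
ALTER DATABASE WideWorldImporters
    ADD FILEGROUP DocumentsFSFG CONTAINS FILESTREAM;
ALTER DATABASE WideWorldImporters
    ADD FILE (NAME = DocumentsFSContainer,
              FILENAME = 'D:\Data\DocumentsFSContainer')
    TO FILEGROUP DocumentsFSFG;
-- Create a table with a FILESTREAM column; note the required unique ROWGUIDCOL column
CREATE TABLE dbo.Documents (
    DocumentId INT NOT NULL PRIMARY KEY,
    RowGuid UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
    DocumentData VARBINARY(MAX) FILESTREAM NULL
) FILESTREAM_ON DocumentsFSFG;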

In general, FILESTREAM’s performance benefits kick in when the average BLOB size is 1 MB or larger. For smaller BLOB sizes, storing the BLOBs in the database file using a VARBINARY(MAX) column is better for performance. However, you might determine that the ease of programming against file I/O APIs in the client application is an overriding factor and decide to use FILESTREAM even with smaller BLOBs.

Additionally, if any of your BLOBs exceed 2 GB in size, you will need to use FILESTREAM; VARBINARY(MAX) supports a maximum BLOB size of 2 GB. Another reason for choosing FILESTREAM is the ability to integrate the BLOBs with SQL Server semantic search. To be clear, VARBINARY(MAX) columns can also be integrated with semantic search, but BLOBs stored in traditional file system files cannot.

Image More information about Semantic Search is available online at https://docs.microsoft.com/sql/relational-databases/search/semantic-search-sql-server.

Inside OUT

How can I move data from a VARBINARY(MAX) column to FILESTREAM?

Unfortunately, moving from VARBINARY(MAX) columns to FILESTREAM is not as easy as modifying the column to add the FILESTREAM modifier. Attempting to modify the column in that way will result in an error. Instead, after creating a FILESTREAM filegroup, you should use the following three-step process:

  1. Create a new VARBINARY(MAX) FILESTREAM column in the table or in another table if you want to use vertical partitioning.

  2. Copy the data from the existing VARBINARY(MAX) column to the new FILESTREAM column. The amount of database activity that will be caused by this operation can be significant, depending on the number of rows and the size of the BLOBs.

  3. Drop the VARBINARY(MAX) column. Optionally, you can then rename the FILESTREAM column to the name of the dropped column. Until you have (optionally) deployed a modified application that uses the I/O APIs with FILESTREAM, the existing T-SQL statements will continue to work on the FILESTREAM column.

In addition to potentially causing significant database activity, you also need to ensure that sufficient storage space is available to hold both copies of the data. Perhaps you are using the opportunity to move the BLOBs to different storage hardware, in which case this might be less of a concern.
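
A sketch of these steps for a hypothetical dbo.Contracts table with an existing VARBINARY(MAX) column named ContractDocument follows; it assumes the table already has the required ROWGUIDCOL column and that a FILESTREAM filegroup exists:

-- Step 1: add a new FILESTREAM column
ALTER TABLE dbo.Contracts
    ADD ContractDocumentFS VARBINARY(MAX) FILESTREAM NULL;
-- Step 2: copy the data (consider batching this for large tables)
UPDATE dbo.Contracts
    SET ContractDocumentFS = ContractDocument;
-- Step 3: drop the old column and, optionally, rename the new column to the old name
ALTER TABLE dbo.Contracts
    DROP COLUMN ContractDocument;
EXEC sp_rename 'dbo.Contracts.ContractDocumentFS', 'ContractDocument', 'COLUMN';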

Even though FILESTREAM BLOBs are stored in the file system, they are managed by the Database Engine. That includes transactional consistency and point-in-time restores. Thus, when a BLOB is deleted, the file on the drive backing that BLOB is not immediately deleted. Similarly, when a BLOB is updated, an entirely new file is written and the previous version is kept on the drive. When the deleted file or previous file version is no longer needed, the Database Engine will eventually delete the file using a garbage collection process. You are already aware of the importance of taking transaction log backups with databases in the full recovery model. That way, the transaction log can be truncated and stop growing. When using FILESTREAM, this mantra applies double: the number of files will keep growing until they are deleted by the garbage collector.

FileTable

FileTable makes it possible to access BLOBs managed by the Database Engine using traditional file share semantics. Applications that can read and write from a file share can access the BLOBs managed by the SQL Server engine. Although clients can use file I/O APIs to work with FILESTREAM, obtaining a handle to the BLOB requires using specific client libraries and application modifications. There might be applications that cannot be modified to work with FILESTREAM but for which having the BLOBs managed by the relational engine would have significant advantages. To that end, FileTable, which is a special table type, was developed.

A FileTable has a fixed schema. This means that you can neither add user-defined columns nor can you remove columns. The only control provided is the ability to define indexes on some FileTable columns. The fixed schema has a FILESTREAM column that stores the actual file data in addition to many metadata columns and the non-null unique rowguid column required of any table containing FILESTREAM data. FileTable can organize data hierarchically, meaning folders and subfolders are supported concepts.
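
Because the schema is fixed, creating a FileTable requires no column definitions. The following minimal sketch assumes that FILESTREAM and nontransacted access have already been configured at the instance and database levels; the table and directory names are illustrative:

CREATE TABLE dbo.DocumentStore AS FILETABLE
    WITH (FILETABLE_DIRECTORY = 'DocumentStore',
          FILETABLE_COLLATE_FILENAME = database_default);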

Image For a detailed discussion of the FileTable schema, refer to Microsoft Docs at https://docs.microsoft.com/sql/relational-databases/blob/filetable-schema.

Table partitioning

Table partitioning occurs when you design a table that stores data from a single logical entity in physically separate structures. In other words, rather than storing all of the entity’s data in a single physical data structure, it is split into multiple physical data structures. They continue to be treated by the user as a single unit. Table partitioning has multiple purposes, most of which relate to performance, either when querying or when loading data. We discuss this later in detail. We first distinguish between horizontal and vertical partitioning, as illustrated in Figure 8-3, and then discuss each separately with its common use cases and recommendations.

Image

Figure 8-3 (A) Horizontal partitioning splits a table’s data rows. (B) Vertical partitioning splits a table’s columns.

As Figure 8-3 illustrates, horizontal and vertical partitioning are distinctly different. Horizontal partitioning splits the data rows, and each partition has the same schema. Vertical partitioning splits the entity’s columns across multiple tables. The diagram shows a table partitioned in only two partitions. However, you can partition tables into many partitions. You can also mix horizontal and vertical partitioning.

Horizontally partitioned tables and indexes

In a large-scale database, in which a single table can grow to hundreds of gigabytes and more, some operations become more difficult. For example, adding new rows can take an excessive amount of time and might also cause SELECT queries against the table to fail due to lock escalation. Similar concerns exist with respect to removing data and index maintenance.

Horizontal partitioning can address these concerns, but it is also important to understand that it is not a silver bullet that will make all performance problems in large tables disappear. On the contrary, when applied incorrectly, horizontal partitioning can have a negative effect on your database workload. This section builds on the brief discussion of partitioning found in Chapter 3.

About horizontal partitioning

When partitioning a table, the rows of the table are not stored in the same physical place. When designing partitions, you decide on a partition key, which is the column that will be used to assign a row to exactly one partition. From a logical viewpoint, however, all rows belong to the same table. A query without a WHERE clause returns all rows, regardless of which partition they are stored in. This means that the Database Engine must do more work to retrieve rows from different partitions. Your goal when partitioning for query performance should be to write queries that eliminate partitions. You can accomplish this by including the partition key in the WHERE clause.

Additional benefits of horizontal partitioning include the ability to set specific filegroups to read-only. By mapping partitions containing older data to read-only filegroups, you can be assured that this data is unchangeable without affecting your ability to insert new rows. In addition, you could exclude the read-only filegroups from regular backups. Finally, during a restore, filegroups containing the most recent data could be restored first, allowing new transactions to be recorded faster than if the entire database would need to be restored.

In addition to horizontal table partitioning, SQL Server also supports index partitioning. A partitioned index is said to be aligned with the table if they are partitioned in the same number of partitions using the same column and boundary values. When a partitioned index is aligned, you can direct index maintenance operations to a specific partition. This can significantly speed up the maintenance operation compared to rebuilding the index for the entire table. On the other hand, if the entire index needs to be rebuilt, SQL Server will attempt to do so in a parallel fashion. Rebuilding multiple indexes simultaneously will create memory pressure. Because of this concern, we recommend that you do not use partitioning on a server with less than 16 GB of RAM.

You might benefit from creating a partitioned index without partitioning the table. You can still use this nonaligned index to improve query efficiency if only one or a few of the index partitions need to be used. In this case, you will also use the index’s partition key in the WHERE clause to gain the performance benefit of eliminating partitions.

Defining partitions and partitioning a table

We now demonstrate how to create a partitioned table. Three database objects are involved in defining partitions and partitioning a table:

  • A partition function, which defines the number of partitions and the boundary values

  • A partition scheme, which defines on which filegroup each partition is placed

  • The partitioned table

-- Create a partition function for February 1, 2017 through January 1, 2018
CREATE PARTITION FUNCTION MonthPartitioningFx (datetime2)
    -- Store the boundary values in the right partition
    AS RANGE RIGHT
    -- Each month is defined by its first day (the boundary value)
    FOR VALUES ('20170201', '20170301', '20170401',
        '20170501', '20170601', '20170701', '20170801',
        '20170901', '20171001', '20171101', '20171201', '20180101');
-- Create a partition scheme using the partition function
-- Place four partitions on each filegroup
-- The last filegroup is listed five times because it also holds the final partition
CREATE PARTITION SCHEME MonthPartitioningScheme
    AS PARTITION MonthPartitioningFx
    TO (FILEGROUP2, FILEGROUP2, FILEGROUP2, FILEGROUP2,
        FILEGROUP3, FILEGROUP3, FILEGROUP3, FILEGROUP3,
        FILEGROUP4, FILEGROUP4, FILEGROUP4, FILEGROUP4, FILEGROUP4);
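
The third object is the partitioned table itself, which is created on the partition scheme. The following sketch uses a hypothetical fact table; note that the partitioning column must be part of the clustered index key:

-- Create the partitioned table on the partition scheme
CREATE TABLE dbo.FactSales (
    SaleId BIGINT NOT NULL,
    SaleDate DATETIME2 NOT NULL,
    SaleAmount DECIMAL(18, 2) NOT NULL,
    CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (SaleId, SaleDate)
) ON MonthPartitioningScheme (SaleDate);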

If you visualize the table data as being sorted by the partition key in ascending order, the left partition is the partition that is on top. When defining a partition function, you indicate whether the boundary value—in our example, the first day of each month—will be stored in the partition on the left (which is the default), or the partition on the right (as specified in the sample).

Figure 8-4 shows the relationship between the partition function and the partition scheme. The partition function created 13 partitions using 12 boundary values. The partition scheme directed these 13 partitions to three filegroups by specifying each filegroup four times, and the last filegroup five times because it will hold the last partition.

Image

Figure 8-4 The relationship between the partition function and the partition scheme, and the filegroups on which the partitions will be stored.

Partition design guidelines

When designing partitions, keep the following guidelines in mind, though your mileage may vary:

  • The number of parallel operations that can be run depends on the number of processor cores in the system. Using more partitions than processor cores will limit the number of partitions that will be processed in parallel. So, even though SQL Server now supports up to 15,000 partitions, on a system with 12 processor cores, at most 12 partitions will be processed in parallel.

  • Choose a partition key whose values grow over time, such as a date value or an incrementing identity column. The goal is for new rows to always be added to the right-most partition.

  • The selected partition key should be immutable, meaning that there should be no business reason for this key value to change. A narrow data type is preferable over a wide data type.

  • To achieve most of the benefits of partitioning, you will need to put each partition into its own filegroup. This is not a requirement, and some or all partitions can share a single filegroup. For example, the next section discusses a sliding window partition strategy, in which partitioning is beneficial even if all partitions are on the same filegroup.

  • Consider the storage that is backing the filegroups. Your storage system might not provide higher performance if all filegroups have been placed on the same physical drives.

  • Tables that are good candidates for partitioning are tables with many—as in millions or billions—rows for which data is mostly added as opposed to updated, and against which queries are frequently run that would return data from one or a few partitions.

Implementing a sliding window partition strategy

Horizontal partitioning is often applied to relational data warehouses. A common data warehouse operation is loading a significant amount of data to a fact table while simultaneously purging old data. The sliding window partition strategy is particularly well suited for tables for which data is regularly added and removed. For example, data in a data warehouse fact table can be purged after 13 months. Perhaps each time data is loaded into the data warehouse, rows older than 13 months are removed while new rows are added. This is a sliding window in so much as the fact table always contains the most recent 13 months of data.

To set up a sliding window, you’ll need a partition function and scheme as well as the fact table. You should also set up a stored procedure that modifies the partition function to accommodate the new boundary values. You will also need a staging table with the same columns and clustered index as the partitioned table.

Figure 8-5 illustrates what happens on March 1, 2018, when data is loaded for the month of February 2018. The fact table is partitioned into 13 partitions, one for each month. An automated process takes care of modifying the partition function to accommodate the new date range by splitting the last partition in two. Then, the oldest month’s data is switched out to a staging table and the new data is switched in from a staging table. Finally, the left-most partition, which is now empty, is merged.

Image

Figure 8-5 An overview of the sliding window partition strategy.

You can optimize switching the old partition out and the new partition in by using a memory-optimized table as the staging table.
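
The following sketch outlines the T-SQL for one monthly cycle, building on the earlier partition function, partition scheme, and hypothetical dbo.FactSales table. The staging table names and partition numbers are illustrative; the staging tables must match the partitioned table’s columns, clustered index, and filegroup:

-- 1. Designate the filegroup for the new partition, then split to add the new month
ALTER PARTITION SCHEME MonthPartitioningScheme NEXT USED FILEGROUP4;
ALTER PARTITION FUNCTION MonthPartitioningFx() SPLIT RANGE ('20180201');
-- 2. Switch the oldest month out to an empty staging table
ALTER TABLE dbo.FactSales SWITCH PARTITION 2 TO dbo.FactSales_Purge;
-- 3. Switch the newly loaded month in from its staging table
--    (the staging table needs a CHECK constraint matching the new partition's range)
ALTER TABLE dbo.FactSales_Load SWITCH TO dbo.FactSales PARTITION 14;
-- 4. Merge away the boundary of the now-empty oldest partition
ALTER PARTITION FUNCTION MonthPartitioningFx() MERGE RANGE ('20170201');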

Vertical partitioning

An entity that is vertically partitioned into multiple tables can usually be identified by the fact that the tables share the same primary key. A one-to-one relationship exists between the tables, which you can enforce by using a foreign key constraint.

Unlike horizontal partitioning, SQL Server does not have extensive support for vertical partitioning. As the database designer, you will need to create the necessary schema to vertically partition tables yourself.

Vertical partitioning makes sense when a single table would ordinarily contain a large number of columns, some of which might contain large values that are infrequently queried. You can improve performance by storing the infrequently accessed columns in another table. Another problem that you can solve by vertical partitioning is when you run into a maximum row size limit or maximum column count limit. We would encourage you to first review your database design to ensure that one logical entity really needs such wide columns or that many attributes. If that’s the case, splitting the entity into two or more tables can be a workaround.
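
As a simple sketch, the following splits a hypothetical product entity so that large, infrequently queried columns live in a companion table that shares the primary key:

CREATE TABLE dbo.Products (
    ProductId INT NOT NULL PRIMARY KEY,
    ProductName NVARCHAR(100) NOT NULL,
    UnitPrice DECIMAL(18, 2) NOT NULL
);
-- Large, infrequently queried columns are stored in a companion table
-- The shared primary key and the foreign key enforce the one-to-one relationship
CREATE TABLE dbo.ProductDetails (
    ProductId INT NOT NULL PRIMARY KEY
        REFERENCES dbo.Products (ProductId),
    MarketingDescription NVARCHAR(MAX) NULL,
    Brochure VARBINARY(MAX) NULL
);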

Be careful not to abuse vertical partitioning as a strategy. Every time data from two tables is needed in a single result set, a join operation is required. These joins are more expensive than reading the data from a single table and might nullify other performance benefits if you run them frequently.

There is a special case for using vertical partitioning, and it relates to FileTable. FileTables, as described previously in this chapter, have a fixed schema. You might, however, need to store additional metadata about the files. Because you are unable to extend the schema, you will need to create a new table which uses the same primary key as the FileTable. Using insert and delete triggers, you can guarantee data integrity by ensuring that for every row in the FileTable, there is a matching row in your extended metadata table.

Capturing modifications to data

SQL Server supports several methods for capturing row data that has been modified. Temporal tables have been discussed earlier in this chapter. In this section, we discuss change tracking and change data capture. Although these features allow applications to detect when data has changed, they operate very differently and serve different purposes. This section first discusses change tracking, then change data capture, and then finally provides recommendations on their use.

Using change tracking

Change tracking does not actually track the data that has changed, but merely that a row has changed. You use it mostly for synchronizing copies of data with occasionally offline clients or for Extract, Transform, and Load (ETL) operations. For example, an application that facilitates offline editing of data will need to perform a two-way synchronization when reconnected. One approach to implementing this requirement is to copy (a subset of) the data to the client. When the client goes offline, the application reads and updates data using the offline copy. When the client reestablishes connectivity to the server, changes can be merged efficiently. The application is responsible for detecting and managing conflicting changes.

Configuring change tracking is a two-step process: first, change tracking must be turned on for the database. Then, change tracking must be turned on for the table(s) that you want to track. Before performing these steps, we recommend setting up snapshot isolation for the database. Snapshot isolation is not required for proper operation of change tracking, but it is very helpful for accurately querying the changes. Because data can change while you are querying it, using the snapshot isolation level inside an explicit transaction ensures that you see consistent results until you commit the transaction. This is described in detail in the article “Working with change tracking” referenced at https://docs.microsoft.com/sql/relational-databases/track-changes/work-with-change-tracking-sql-server.

The sample script that follows turns on snapshot isolation for the WideWorldImporters sample database. Then, change tracking is turned on for the database and for two tables, Sales.Orders and Sales.OrderLines; column tracking is activated only on Sales.Orders. Next, change tracking is turned off for Sales.OrderLines. Finally, the sys.change_tracking_tables catalog view is queried to retrieve a list of tables with change tracking turned on.

USE master;
GO
-- Enable snapshot isolation for the database
ALTER DATABASE WideWorldImporters
    SET ALLOW_SNAPSHOT_ISOLATION ON;
-- Enable change tracking for the database
ALTER DATABASE WideWorldImporters
    SET CHANGE_TRACKING = ON  
    (CHANGE_RETENTION = 5 DAYS, AUTO_CLEANUP = ON);
USE WideWorldImporters;
GO
-- Enable change tracking for Orders
ALTER TABLE Sales.Orders
    ENABLE CHANGE_TRACKING  
    -- and track which columns changed
    WITH (TRACK_COLUMNS_UPDATED = ON);
-- Enable change tracking for OrderLines
ALTER TABLE Sales.OrderLines
    ENABLE CHANGE_TRACKING;
-- Disable change tracking for OrderLines
ALTER TABLE Sales.OrderLines
    DISABLE CHANGE_TRACKING;
-- Query the current state of change tracking in the database
SELECT *
FROM sys.change_tracking_tables;

Image For an end-to-end example of how an application can use change tracking to accomplish two-way data synchronization with an occasionally offline data store, see https://docs.microsoft.com/sql/relational-databases/track-changes/work-with-change-tracking-sql-server.

A major benefit of change tracking compared to implementing a custom solution is that change tracking does not make any schema changes to the user tables that are tracked. In addition, change tracking is available in all editions of SQL Server and in Azure SQL Database. Autocleanup ensures that the database does not grow unchecked.

Although change tracking can track which rows, and optionally which columns, have changed, it cannot indicate what the old values were or how often the row has been changed. If your use case does not require this, change tracking provides a lightweight option for tracking. If your use case does require one or both, change data capture might offer a solution.

We discuss change data capture in the next section, along with the mechanics of querying the captured changes.

Using change data capture

Change data capture varies in some important ways from change tracking. Foremost, change data capture actually captures the historical values of the data. This requires a significantly higher amount of storage than change tracking. Unlike change tracking, change data capture uses an asynchronous process for writing the change data. This means that the client does not need to wait for the change data to be committed before the database returns the result of the DML operation.

Change data capture stores the captured changes in internal change tables. You can query that data by using the table-valued functions that are generated when change data capture is enabled for a table, such as cdc.fn_cdc_get_all_changes_<capture_instance>. To request the data, the client specifies a range of log sequence numbers (LSNs) indicating the window of changes to retrieve. The following script turns on change data capture by using the sys.sp_cdc_enable_db stored procedure. Then, the script turns on change data capture for a fictitious dbo.Orders table. The script assumes that a database role cdc_reader has been created.

USE WideWorldImporters;
GO
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = 'dbo',
    @source_name = 'Orders',
    @role_name = 'cdc_reader';

Additional settings for change data capture include specifying which columns to capture, on which filegroup to store the change table, and more. These are discussed in the Microsoft Docs at https://docs.microsoft.com/sql/relational-databases/track-changes/enable-and-disable-change-data-capture-sql-server.
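
As a brief sketch, and assuming the default capture instance name dbo_Orders, the captured changes can be read for a range of log sequence numbers as follows:

-- Retrieve all captured changes for the dbo_Orders capture instance
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
DECLARE @to_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');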

The following list includes a few important considerations when designing applications that take advantage of change tracking:

  • An application should identify itself by using a source ID when synchronizing data. By providing a source ID, the client can avoid obtaining the same data again. A client specifies its own source ID by using the WITH CHANGE_TRACKING_CONTEXT clause at the start of statements.

  • An application should perform the request for changed data in a snapshot isolation–level transaction. This avoids another application updating data between the check for changes and the retrieval of those changes. The snapshot isolation level needs to be turned on at the database level, which was demonstrated in the previous section. A combined sketch follows this list.
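
This minimal sketch reads the changes to Sales.Orders since a client’s last synchronization; the stored version value is an illustrative placeholder:

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
-- The version the client recorded after its previous synchronization
DECLARE @last_sync_version BIGINT = 0;
SELECT c.OrderID, c.SYS_CHANGE_OPERATION, c.SYS_CHANGE_COLUMNS
FROM CHANGETABLE(CHANGES Sales.Orders, @last_sync_version) AS c;
-- Record the current version to use as the baseline for the next synchronization
SELECT CHANGE_TRACKING_CURRENT_VERSION() AS current_version;
COMMIT TRANSACTION;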

Comparing change tracking, change data capture, and temporal tables

In this section, we present a comparison between three features that have common use cases. Table 8-1 should prove helpful when you’re deciding which change tracking feature is appropriate for your needs.

Table 8-1 A comparison of features and uses of change tracking, change data capture, and temporal tables

 

Feature                            Change tracking    Change data capture    Temporal tables
Requires schema modification       No                 No                     Yes
Available in Azure SQL Database    Yes                No                     Yes
Edition support                    Any                Enterprise only        Any
Provides historical data           No                 Yes                    Yes
History end-user queryable         No                 Yes                    Yes
Tracks DML type                    Yes                Yes                    No
Has autocleanup                    Yes                Yes                    Yes
Change indicator                   LSN                LSN                    datetime2
