Chapter 2. SQL Server tools for maintaining data integrity

Louis Davidson

You’ve probably seen it before: a client has called the help desk and reported that a couple of queries have seemingly inconsistent results. You’re called in to test the queries, so you start digging. You run two queries, and the results for one query don’t seem right, considering the other query’s results. So you start to wonder, “Is it my understanding of the schema?” You then start looking at the data model and the tables seem wonderfully designed. Your query should work. Dang, now what?

You dig deeper, and realize... the only objects you find in the database are tables. No constraints, no relationships, no nothing. A quick query later and you find that orphaned rows in a child table show up in one query but not another. At this point it’s clear that almost all of the data in these tables is suspect because if you can’t trust that a simple relationship works, can you trust that the child rows that exist are for the proper parent? Sadly, the answer is no, because the software may have created a new parent row that reused a key value. Now ideally you’ll test the consistency of the data every time you use the data, to verify that it’s correct. It doesn’t take a genius to determine how long before that gets old.

So how do you avoid this calamity? Protect the data using all of the tools you have available. In this chapter, I’ll present many of the tools that SQL Server gives you to help protect your data from corruption, usually in a behind-the-scenes manner that the client doesn’t really have to worry about. I’ll also present guidance on how to decide what type of protection to use and where to apply it, and give examples of each type of protection. I’ll break down the topic into the following two sections:

  • Protection tools— Introducing the tools that SQL Server gives us and techniques for using them
  • When and why to use what tool— Examining the different types of protection tools and discussing when and where they should be used

The goal will be to protect the data such that it’s guaranteed to be fundamentally correct, meaning that completely illogical or impossible data isn’t allowed. For example, consider a column that holds an employee’s salary. Logically, it’ll be a number value, and should probably have a reasonable upper and lower bound. What those bounds are is up to the individual situation, but the minimal goal is to make sure completely illogical data is never stored. Without any real concern, you can set a lower bound of 0 on a salary, as by definition a salary is pay, not a fee to come to work (that is what timesheets and staff meetings are for). You might use a numeric(15,2) data type to establish an upper bound of 999,999,999,999.99 for the salary value (even the CEO of Enron never made that kind of money).

The application code will be used to warn users that the average salary for a data architect is less than 200 million dollars... but if you want to pay that much, I’ll take it. The final result is that there’s no way that a “too freaking little” value will show up in the salary column.
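As a minimal sketch of that idea (the employee table here is hypothetical, and the bounds come straight from the preceding discussion; the “Check constraints” section later in the chapter shows how to enforce the lower bound), the column might be declared like this:

--Hypothetical employee table: numeric(15,2) bounds the magnitude of the salary
--at 999,999,999,999.99, and NOT NULL means a salary must always be recorded
CREATE TABLE employee
(
    employeeId int identity NOT NULL CONSTRAINT PKemployee PRIMARY KEY,
    lastName varchar(50) NOT NULL,
    salary numeric(15,2) NOT NULL
)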

Protection tools

In this section I’ll introduce the different tools available for you to protect the quality of your data. Understanding the tools at your disposal for protecting data quality is the second most important thing to know and know well (the third being how to write set-based queries, and the first being normalization). This chapter is about automatic data protection, meaning tools that can be used to seamlessly protect the integrity of the data without any further coding effort or user action, and with no way to override the protection without cheating (you can disable some protections, and you can get around them using bulk copy methods).

You can also take manual steps to protect your data; they generally require the client to adhere to a given API to work. This could include stored procedures, client code, defaults, and so forth. I won’t discuss manual protection in this chapter, but the point of using manual data protection techniques is that they enforce rules that can be overridden or applied situationally, or that are frequently changed. These rules are generally apt to change, so you wouldn’t be able to easily write a query to find data that didn’t meet the rules over a long period of time.

We’ll discuss the following types of protection that allow you to design integrity into the base structure:

  • Data types— Defining the storage of values of the proper base type and size
  • NULL specifications— Making sure that only columns where you might have a NULL value are defined to allow NULLs
  • Uniqueness constraints— Preventing duplicate values for a column or set of columns
  • Filtered unique indexes— Preventing duplicate values for a column or set of columns for a subset of rows in the table
  • Foreign key constraints— Making sure every foreign key value has a corresponding primary key value in the related table
  • Check constraints— Checking that values in the row meet the minimum standards
  • Triggers— The catch-all of data protection, which allows you to execute a batch of code for any DML operation to validate data and even introduce side effects to the DML operation that happen without the knowledge of the client

As you’ll see in the following subsections, this list is loosely ordered from the most desirable to the least desirable, particularly where there’s overlap.

Data types

The most fundamental thing you can do to ensure the quality of your data is to choose a reasonable container for the value being stored. Want to store an integer value? Use an integer type. Want to store an image? Use a varbinary(max). Could you use a varchar(50) or varchar(max) to store every value? In almost every situation, you sure could. You could also stick your tongue in a lamp socket, but I guarantee you it’s a bad idea unless you want to say hi to your cat that your mother told you was “sent off to live on the farm.”

Way too often, I see designs where every column is nvarchar(50), unless they want to store something larger; then it’s an nvarchar(max). Storagewise, unless you expect that some values will contain special characters, using a Unicode-enabled type just costs you space; and worse, using a generic string type for everything is absolutely horrible for data integrity.

It’s perfectly possible to do this and get by, as all values can be represented as a textual value in a T-SQL DML statement. So why bother? First, consider performance. As an example, consider storing a GUID value in a varchar column. Using the uniqueidentifier data type, you can store the 36-character value in 16 bytes. Put it in a varchar column and you’ll need 36 bytes, or 72 if the column is nvarchar.

The second, more important problem is data integrity. A uniqueidentifier column requires a properly formatted GUID to be stored, whereas a varchar column is happy to take “Your data integrity blows” as a value.
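You can see the difference with a quick throwaway test (the table variable and values here are purely illustrative):

--Throwaway illustration: the uniqueidentifier column rejects a malformed GUID,
--whereas the varchar column will take anything at all
DECLARE @demo TABLE
(
    goodGuid uniqueidentifier,
    anyText varchar(36)
)

INSERT INTO @demo (goodGuid, anyText)
VALUES (NEWID(), 'not even close to a GUID')      --succeeds

INSERT INTO @demo (goodGuid, anyText)
VALUES ('not even close to a GUID', 'whatever')   --fails: conversion to uniqueidentifier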

NULL specification

Setting the NULL specification of a column to allow or disallow NULLs is an important step in getting correct data. It’s quite common to see that someone has created a perfectly acceptable design but left every column nullable. Part of this is probably laziness, as leaving off a column’s NULL specification will, by default, result in a column that allows NULL values.

NULLs greatly increase the complexity of using a column, because NULL isn’t equal to anything, including itself. Yet NULL = NULL isn’t false; rather, it evaluates to NULL. Then consider the expression NOT(NULL), which also evaluates to NULL, even though it really looks like it should be TRUE, right? You might be thinking that this is silly and that you’d never need to care about this sort of thing. Well, consider the following:

SELECT *
FROM table
WHERE NOT(nullableColumn = @value)

If all values for nullableColumn were NULL, then that statement would return the same rows as the following:

SELECT *
FROM table
WHERE (nullableColumn = @value)

When nullableColumn is NULL, the expression nullableColumn = @value doesn’t evaluate to TRUE or FALSE; it evaluates to NULL, and so does NOT(nullableColumn = @value). Because only rows where the WHERE clause evaluates to TRUE are returned, rows with a NULL nullableColumn are returned by neither query.
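You can see the three-valued logic in action with a small throwaway example (the table variable and values are purely illustrative):

--Purely illustrative: the row with a NULL is returned by neither query, because
--both the comparison and NOT(comparison) evaluate to NULL rather than TRUE
DECLARE @sample TABLE (nullableColumn int NULL)
INSERT INTO @sample VALUES (NULL),(1),(2)

DECLARE @value int
SET @value = 1

SELECT * FROM @sample WHERE NOT(nullableColumn = @value)   --returns only the row with 2
SELECT * FROM @sample WHERE (nullableColumn = @value)      --returns only the row with 1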

If you need to allow a column to have no known value, then allowing NULLs is acceptable. But in many cases, columns allow NULLs simply because the implementer didn’t think about it or care, and left it to some other code to manage whether data is optional. The problem here is twofold:

  • The SQL Server engine uses all information it has to make your queries faster. The fact that all values must be filled in helps the optimizer to not worry about the case where a value is NULL.
  • You have to write code that validates this over and over. Placing it in the definition of the data makes sure that it’s always adhered to, regardless of whether the UI has checked to make sure that the user entered a value.

So the fact is, there’s no reason to allow NULLs in columns where a value should always exist; disallowing them can only help your system, and never hurt it. I’m not going to get any deeper into the NULL discussion than this, and I leave it to you to find all of the ways that NULL makes comparisons more interesting. The point of this section is to show that although NULL can be a useful tool, it behooves you to protect against unnecessary NULL values, as unhandled NULLs in comparisons will mess up your query results in ways that won’t be obvious to you without some thought.

Uniqueness constraints

One of the most common newbie questions is how to remove duplicate data from a table. After demonstrating the requisite query to remove the duplicated data (removing what might actually be a row of important data that the user updated instead of the kept copy), a nice but stern rant follows on why you must protect your data against duplicated data. In SQL Server, uniqueness is enforced using either a primary key or a unique constraint. The primary key is used for the main value that each row will be referenced as. Unique constraints are then applied to all other candidate keys, which are frequently other sets of columns that a user will want to reference a row as.

Every table should have a primary key to make sure that you have some way to distinguish one row from other rows. In addition, it’s important to make sure that you have either the primary key or a unique key on a value in the table that has some meaning. An identity value in the table doesn’t have any actual meaning, as you could have two rows with exactly the same values other than the identity key, and the two rows would be technically indistinguishable from one another. This isn’t always completely possible, as there are some tables where a key that has full meaning can’t be created, such as a table that logs events that have occurred, where multiple events of the same type could occur at the same instant. For logging situations, adding a sequence number is an acceptable trade-off, though that shouldn’t be the only value in the key.

As an example, if you had a customer table with an identity value customerId for a surrogate key (the name given a key that’s generated without regard to any meaningful data to stand in for a more complex key value) and a customerNumber, which the user enters, the table would be defined as the following:

CREATE TABLE customer
(
    customerId int identity NOT NULL
        CONSTRAINT PKcustomer PRIMARY KEY,
    customerNumber char(10) NOT NULL
        CONSTRAINT AKcustomer_customerNumber UNIQUE
)

You can also use ALTER TABLE syntax to add a PRIMARY KEY or UNIQUE constraint once the table has been created:

ALTER TABLE customer
    ADD CONSTRAINT PKcustomer PRIMARY KEY (customerId)

ALTER TABLE customer
    ADD CONSTRAINT AKcustomer_customerNumber UNIQUE (customerNumber)

PRIMARY KEY and UNIQUE constraints are actually hybrid objects: they’re constraints, but physically they’re implemented by creating a unique index. The syntax for creating a PRIMARY KEY or UNIQUE constraint includes several of the different settings for an index, such as whether the index to create is clustered, what partition to place it on, and so on.

Semantically, the distinction between a unique index and a constraint should be considered as part of your design. A constraint declares that some factor must be true, and for key constraints, this is declaring that the values in the selected columns must be unique. An index is a physical construct used specifically to speed up some operation; in this case, SQL Server is applying a unique index to speed up checking for duplicate values. I generally think of an index as something that can be created or dropped by the production staff without helping or harming anything other than performance.
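If you’re curious, you can see the indexes created behind the scenes by peeking at the catalog views (a quick sketch, assuming the customer table created earlier lives in your default schema):

--The PRIMARY KEY and UNIQUE constraints on the customer table show up as unique
--indexes; is_primary_key and is_unique_constraint tell you which constraint each backs
SELECT name, type_desc, is_unique, is_primary_key, is_unique_constraint
FROM sys.indexes
WHERE object_id = OBJECT_ID('customer')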

Filtered unique indexes

As mentioned in the previous section, indexes are, in general, to be used to improve performance, not for data protection. But in SQL Server 2008, Microsoft has given the CREATE INDEX syntax a WHERE clause, which adds data protection capabilities not found in UNIQUE constraints. If you want to enforce uniqueness except when a column has some given value (most commonly, uniqueness only for values that aren’t NULL), you can use a unique index with a WHERE clause. For example, if you have a table that holds persons, and you want to associate a unique customer number with a person if she becomes a customer, but not until, you could define an index such as the following:

CREATE UNIQUE INDEX AKSperson_customerNumber ON person(customerNumber)
WHERE customerNumber IS NOT NULL

If you’re using a previous version of SQL Server, there are other methods to use, such as an indexed view, where you define the view to exclude the same data that the WHERE clause on the CREATE INDEX statement excluded, and then add a unique clustered index to the view. Either method is acceptable and implements the same sort of construct with similar overhead.
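Here’s a sketch of that pre-2008 approach for the same person.customerNumber example (the view name is arbitrary, and the person table is assumed to be in the dbo schema, which schema binding requires you to spell out):

--Pre-2008 alternative: a schema-bound view that filters out the NULLs, with a
--unique clustered index to enforce uniqueness on the rows that remain
CREATE VIEW dbo.person_customerNumberUnique
WITH SCHEMABINDING
AS
SELECT customerNumber
FROM dbo.person
WHERE customerNumber IS NOT NULL
GO
CREATE UNIQUE CLUSTERED INDEX AKperson_customerNumber
ON dbo.person_customerNumberUnique (customerNumber)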

Foreign key constraints

A foreign key constraint is used to make sure that columns involved in a relationship contain only correct data. I started out the chapter with an example that centered on a foreign key relationship. The reason is that this is probably the most frustrating type of error: it’s rarely a problem in most databases, so it’s not the first thing that will pop to mind.

For example, say you have an invoice table and a table to represent what’s on the invoice. The keys for these tables might look like the following:

CREATE TABLE invoice
(
    invoiceNumber char(10) CONSTRAINT PKinvoice PRIMARY KEY,
    invoiceStatus varchar(10) CONSTRAINT invoice$invoiceStatus
        CHECK (invoiceStatus in ('Open','Closed'))
)

CREATE TABLE invoiceItem
(
    invoiceNumber char(10),
    invoiceItemNumber int,
    CONSTRAINT PKinvoiceItem PRIMARY KEY (invoiceNumber, invoiceItemNumber)
)

If you don’t take the time to add the foreign key constraint, then the value for invoiceNumber in the invoiceItem table isn’t bound to any domain of values. It could be “Hello!” for all it matters, even if that value was never created in the invoice table. The problem is that no matter the intentions of the non–data tier implementers, it’s difficult to test for all possible scenarios, and impossible to stop a DBA from messing up the data by accident.

A good example of the kinds of problems you can run into is a query against a child table, such as an invoice line item (for this example, assume a hypothetical invoiceLineItem table keyed by invoiceId, with an amount column). A query against this table should give consistent results whether or not you include the invoice in the query. So executing a query like the following

SELECT sum(invoiceLineItem.amount)
FROM invoiceLineItem

should give the same result as this:

SELECT sum(invoiceLineItem.amount)
FROM invoiceLineItem
JOIN invoice
ON invoiceLineItem.invoiceId = invoice.invoiceId

On the surface, this seems a safe bet, as why would you put data in invoiceLineItem without a matching value in invoice? But unless you’ve created a constraint between invoice and invoiceLineItem, you could have values for invoiceLineItem.invoiceId that aren’t in the invoice table. These values would be excluded in the second query, causing the user to have to think about why this is true. Simply adding

ALTER TABLE invoiceLineItem
ADD CONSTRAINT invoice$has$invoiceLineItems
FOREIGN KEY (invoiceId) REFERENCES invoice (invoiceId)

could prevent a lot of heartache spent trying to separate the random bits and bytes from the meaningful data. Even worse, without the constraint you’d be required to check for bad data like this in any program that uses this data (such as the ETL for your warehouse), or you’d corrupt the downstream stores that consume this data, requiring even more data cleanup.
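If you’re adding the constraint to an existing table and aren’t sure the data is clean, a quick orphan check like this one (a sketch against the same hypothetical invoiceLineItem table) will find the rows that would keep the constraint from being created:

--Find child rows whose invoiceId has no matching invoice; these would have to be
--fixed or removed before the foreign key could be created with full checking
SELECT invoiceLineItem.*
FROM invoiceLineItem
WHERE NOT EXISTS (SELECT *
                  FROM invoice
                  WHERE invoice.invoiceId = invoiceLineItem.invoiceId)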

Check constraints

Check constraints allow you to define, for an individual column or row, a condition of the data that must hold. You use them to constrict the set of values to a certain domain. For example, if you want to make sure that only certain values can be entered into the status column of the invoice table, you’d define a condition such as invoiceStatus in ('Open','Closed'). The following is the syntax for adding a constraint to an existing table:

ALTER TABLE <tableName>
ADD [CONSTRAINT <constraintName>] CHECK (<BooleanExpression>)

For the invoice table example, you could code the following:

ALTER TABLE invoice
ADD CONSTRAINT invoice$statusDomain
CHECK (invoiceStatus in ('Open','Closed'))

No matter how many rows you modify in a statement, the condition specified by a check constraint is checked one row at a time, and it can’t directly see which other rows were modified. You can reference the columns of the row being checked by name, with no alias, and you can access data in other rows, or even other tables, by calling a user-defined function. As another example, consider the employee.salary example we used previously in the chapter. If you wanted to make sure it’s greater than 0, you might define the following:

ALTER TABLE employee
ADD CONSTRAINT employee$salaryNonNegative CHECK (salary > 0)

The Boolean expression can’t directly reference another table, but you can use a function that queries the other table. Note that I said the constraint can’t see which other rows were modified; but because the modified data is already in the table when the constraint is checked, the function’s query can see it.

One example of using a function is limiting the cardinality of a value in a database (or the count of the times a value can appear). Using a unique constraint, you can easily limit the cardinality to 1, but if you want to allow a number different than 1, there isn’t an easy way to do this declaratively. So we can use a function and a check constraint. Take the following table:

CREATE TABLE cardinalityLimit
(
    value varchar(10)
)

If you want no more than two rows with the same value in the value column, you can build a simple function like this:

CREATE FUNCTION dbo.cardinalityLimit$countValue
(
    @value varchar(10)
)
RETURNS int
AS
BEGIN
    RETURN (SELECT count(*)
            FROM cardinalityLimit
            WHERE value = @value)
END

It counts the number of values in the table that match the @value parameter. In the following constraint, you’ll check to see if the count is 2 or less:

ALTER TABLE cardinalityLimit
ADD CONSTRAINT cardinalityLimit$valueCardinality2
CHECK (dbo.cardinalityLimit$countValue(value) <= 2)

Constraints see the data in the table as modified by the current DML statement. Now you should be careful to test the statements. First, insert rows one at a time:

INSERT INTO cardinalityLimit
VALUES ('one')
INSERT INTO cardinalityLimit
VALUES ('one')
INSERT INTO cardinalityLimit
VALUES ('one')

After the third one, you’ll get an error. After creating a constraint that accesses other rows or tables, you should test all of the combinations of rows you may get. For example, we’ve tested sending one row at a time, but you should also test modifying multiple rows in a single DML statement to make sure that it behaves correctly. In the next statement, I’ll check to make sure that it works for a set of two rows:

INSERT INTO cardinalityLimit
VALUES ('two'),('two')

Which it does. Then try three at a time:

INSERT INTO cardinalityLimit
VALUES ('three'),('three'),('three')

This again fails with the same error message. (I’ll leave it to you to test the UPDATE scenarios.) This leads to the most difficult problem with check constraints: terrible error messages. The following are the errors you get from our cardinality example:

Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the CHECK constraint "cardinalityLimit$valueCardinality2". The conflict occurred in database "tempdb", table "dbo.cardinalityLimit", column 'value'.

Proper naming conventions help, and in my database design book, Pro SQL Server 2008 Relational Database Design and Implementation (Apress, 2008), I present a method of mapping these messages by parsing the message and creating a table of constraint names, but it’s a cumbersome process at the least.

Check constraints have a downside, but they’re important to the process of keeping the data clean of illogical data. Done right, your reporting and ETL process won’t have to check for out-of-range values that can trip up calculations (and as mentioned, make users wary of trusting “your” numbers). Just like we did with the salary example, you should make judicious use of check constraints to prevent data from being fouled up in an unreasonable manner. For example, I generally put a constraint on all varchar columns to make sure that the empty string isn’t inserted. If the employee table had a lastName column, I’d add the following:

ALTER TABLE employee
ADD CONSTRAINT employee$nameNotEmpty CHECK (LEN(RTRIM(lastName)) > 0)

This isn’t to say that I always enforce this sort of constraint, as sometimes allowing an empty string is a good thing, because it allows you to differentiate between no value and a NULL value. But, when a column value is required, it’s usually not the best plan of attack to allow the user to enter a single space character in lieu of a “real” value. And for unique indexing purposes, I really don’t want strings containing different numbers of spaces to be treated as different values; users sometimes use this technique to get around putting in reasonable data values (though you can’t stop them from putting in gibberish for values, at least not easily).

One last note: when you create a CHECK constraint, you can specify WITH NOCHECK to tell SQL Server not to test existing data against the constraint. This can be a faster way to add a constraint, but generally speaking, it’s not a good idea. Two problems often come up. First, until the existing data has been validated, the constraint is marked as not trusted and can’t be used by the optimizer to know what data could exist in the column(s). Second, if invalid data does exist, an UPDATE that sets a column back to its current value will fail. UPDATE table SET column = column is something that should never fail, but untrusted constraints can leave you open to these kinds of spurious errors, which the user interface probably couldn’t even help prevent. You can determine whether your constraints are trusted by looking at the is_not_trusted column in sys.check_constraints.
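Here’s a sketch of the whole cycle, assuming the salary constraint from earlier hadn’t yet been created and you chose to add it without checking the existing rows:

--Added WITH NOCHECK: existing rows aren't validated, so the constraint is created
--quickly but is marked as not trusted
ALTER TABLE employee WITH NOCHECK
ADD CONSTRAINT employee$salaryNonNegative CHECK (salary > 0)

--is_not_trusted = 1 for constraints added (or re-enabled) without checking
SELECT name, is_not_trusted
FROM sys.check_constraints
WHERE parent_object_id = OBJECT_ID('employee')

--revalidates the existing data and marks the constraint as trusted again
ALTER TABLE employee WITH CHECK CHECK CONSTRAINT employee$salaryNonNegative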

Triggers

Triggers are stored batches of T-SQL, much like stored procedures. But instead of being called directly, they execute indirectly when you execute an INSERT, UPDATE, or DELETE operation. You can do almost anything in them that you can do in T-SQL, with only a few caveats (check Books Online for more information). In capable hands—particularly the hands of those who understand the caveats of triggers—they can fill the gaps left by constraints, handling problems that constraints can’t and making the solution occur automatically, without user or programmer involvement. On the other hand, in the wrong hands, they can drag down the performance of your system faster than you can kiss a duck.

Some people feel that triggers are more evil than cursors (and putting a cursor in a trigger is a capital offense in some locales). Others use them for just about everything, often to do the sorts of simple checks that you could use a CHECK constraint for. The truth is somewhere in between. There’s overhead in using the inserted and deleted virtual tables to access the modified rows inside the trigger, and this overhead needs to be considered before using triggers.

There are two different types of triggers. First there are AFTER triggers, which execute after the rows are in the table and after constraints have been applied. These triggers are most often used to validate some condition or to cause some cascading operation that can’t be done declaratively using a CHECK constraint.

The second type is the INSTEAD OF trigger, which fires as the first operation, before any other actions and even before the row is in the table. In an INSTEAD OF trigger, you have to manually execute the operation you want to occur. For example, in an INSTEAD OF INSERT trigger, you have to code the INSERT statement that performs the INSERT the trigger fired for. INSTEAD OF triggers can be useful for formatting data, or possibly redirecting data to a different table. An easy example is directing unusual data to a holding table, such as taking all data from a data entry clerk but putting out-of-normal-range data into a table for more validation. In this scenario, the user doesn’t know there was anything wrong; his job is just to key in data (and yes, no doubt a slap on the wrist will come to users who enter too much out-of-range data, but we can’t govern the social aspects of our software, now can we?). INSTEAD OF triggers can be placed on views as well as on tables, allowing DML against a view to modify more than one table at a time.
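As a sketch of the redirection idea (the dataEntry and dataEntryHolding tables and the 0-to-10,000 range are entirely hypothetical), an INSTEAD OF INSERT trigger might look something like this:

--Hypothetical example: rows with an out-of-range amount are redirected to a
--holding table for review; everything else is inserted normally. The trigger
--must perform the INSERT itself, because the original statement no longer does.
CREATE TRIGGER dbo.dataEntry$INSERTTrigger
ON dbo.dataEntry
INSTEAD OF INSERT AS
BEGIN
   SET NOCOUNT ON

   --normal-range rows go into the real table
   INSERT INTO dbo.dataEntry (entryId, amount)
   SELECT entryId, amount
   FROM inserted
   WHERE amount BETWEEN 0 AND 10000

   --out-of-range (or missing) amounts go to the holding table for later validation
   INSERT INTO dbo.dataEntryHolding (entryId, amount)
   SELECT entryId, amount
   FROM inserted
   WHERE amount NOT BETWEEN 0 AND 10000
      OR amount IS NULL
END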

Triggers are useful as a last resort, when you can’t use one of the previously mentioned tools to do the same thing. They have the benefit that you can’t “forget” to invoke them, particularly if you don’t have coding practices that require all modifications to a table to go through a common object that knows how to protect the data. The following list represents the main types of uses I have for triggers, and even this list is apt to start a minor argument between T-SQL experts:

  • Cross-database referential integrity (RI)— SQL Server doesn’t allow declarative constraints across database boundaries, so triggers can be written to validate that data exists in a different database. Databases are independent containers, so this method isn’t nearly as good as having constraints in the same database, but as long as you understand that you have to be careful with cross-database references, it’s definitely usable.
  • Intra-table, inter-row constraints— These are used when you need to see that the sum of a column value over multiple rows is less than some value (possibly in another table). You can use a constraint with a function, but the query will be executed once per row modified, whereas a trigger need only run the query a single time.
  • Inter-table constraints— When a value in one table relies on the value in another, triggers are useful for checking for correct values. This might also be written as a functions-based CHECK constraint, but it’s often more maintainable to use a trigger.
  • Introducing desired side effects to your queries— Constraints support cascading operations on UPDATE, but it’s possible to come up with many different side effects that might be desired. For example, cascading inserts, maintaining denormalized data, logging who modified a row, and so on.

Triggers come at a price, and the biggest concern is performance. Triggers fire once per DML operation, so whether you modify 1 or 1,000 rows, the trigger fires only once. During trigger execution, SQL Server sets up two virtual tables: one called inserted, which contains the new versions of the rows being created or updated (or, for an INSTEAD OF trigger, the rows that would have been), and another called deleted, which contains the removed rows or the prior versions of updated rows. You have to be very careful that your code considers the number of rows involved in the operation and that the trigger can handle multiple rows being modified.

The inserted and deleted tables aren’t indexed; they’re reflections of the changes being made to the table, materialized in tempdb. Hence, you need to be careful with the queries in the trigger to make sure that you’re ready for 1 or 1,000 rows to be updated. In some cases, you won’t be able to optimize a trigger for large numbers of rows, so you might have to set a cutoff point and fail the operation when too many rows are changed. To start the examples, I present the template shown in listing 1. It’s the basic form I use for all triggers I write.

Listing 1. Basic template for triggers
CREATE TRIGGER <schema>.<tablename>$<actions>[<purpose>]Trigger
ON <schema>.<tablename>
INSTEAD OF <comma delimited actions> AS
--or
AFTER <comma delimited actions> AS
BEGIN

   DECLARE @rowsAffected int,   --stores the number of rows affected
           @msg varchar(2000)   --used to hold the error message

   SET @rowsAffected = @@rowcount

   --no need to continue on if no rows affected
   IF @rowsAffected = 0 RETURN

   SET NOCOUNT ON --to avoid the rowcount messages
   SET ROWCOUNT 0 --in case the client has modified the rowcount

   BEGIN TRY
      --[validation section]
      --[modification section]
   END TRY
   BEGIN CATCH
      IF @@trancount > 0
         ROLLBACK TRANSACTION

      --[log errors section]

      DECLARE @ERROR_MESSAGE nvarchar(4000)
      SET @ERROR_MESSAGE = ERROR_MESSAGE()
      RAISERROR (@ERROR_MESSAGE,16,1)

   END CATCH
END

Each of the values in angle brackets (such as <tablename>) is something to replace, and there are three sections where code is placed. The validation section is where to put code that’s used to validate the data that has been updated. I rarely use this section in an INSTEAD OF trigger because any validation you can do in an INSTEAD OF trigger you should do in a constraint. The modification section is where to update data in this or another table, or, for an INSTEAD OF trigger, to perform the action that the trigger fired for. And the log errors section is used to log errors after the transaction has been rolled back.

For example, if you had an invoiceLineItem table and you wanted to make sure that the total of the line item amounts for an invoice stayed greater than 0, and then create a row in an invoiceLog table, you might build the trigger shown in listing 2 for the UPDATE action. The code added to the template appears in the validation and modification sections.

Listing 2. Trigger to ensure line item total > 0
CREATE TRIGGER dbo.invoiceLineItem$UPDATETrigger
ON dbo.invoiceLineItem
AFTER UPDATE AS
BEGIN

   DECLARE @rowsAffected int,   --stores the number of rows affected
           @msg varchar(2000)   --used to hold the error message

   SET @rowsAffected = @@rowcount

   --no need to continue on if no rows affected
   IF @rowsAffected = 0 RETURN

   SET NOCOUNT ON --to avoid the rowcount messages
   SET ROWCOUNT 0 --in case the client has modified the rowcount

   BEGIN TRY
      --[validation section]
      IF EXISTS( SELECT *
                 FROM ( SELECT sum(amount) as total
                        FROM invoiceLineItem
                        WHERE invoiceId in (SELECT invoiceId
                                            FROM inserted
                                            UNION ALL
                                            SELECT invoiceId
                                            FROM deleted)
                        GROUP BY invoiceId) as totals
                 WHERE totals.total <= 0)
         RAISERROR ('The sum of amounts for the invoice must be > 0',16,1)

      --[modification section]
      INSERT INTO invoiceLog(invoiceNumber, action, changedByLoginId)
      SELECT invoice.invoiceNumber, 'Modified LineItems', ORIGINAL_LOGIN()
      FROM inserted
           JOIN invoice
              ON inserted.invoiceId = invoice.invoiceId
   END TRY
   BEGIN CATCH
      IF @@trancount > 0
         ROLLBACK TRANSACTION

      --[log errors if desired]

      DECLARE @ERROR_MESSAGE nvarchar(4000)
      SET @ERROR_MESSAGE = ERROR_MESSAGE()
      RAISERROR (@ERROR_MESSAGE,16,1)

   END CATCH
END

The query is built in three layers, the first of which is almost always a join or a correlated subquery against the inserted or deleted table to get the rows to check. For the INSERT and DELETE triggers that you’d need to complete the solution, you could vary the subquery to reference only the inserted or deleted table, respectively. In this particular example, you could even use the same trigger body for every action (because the inserted and deleted tables are available in every trigger), but I usually write one trigger per action, which makes it easier to add additional code and easier to test. Admittedly, I use tools to create triggers that allow parts of the code to be built from a macro, which gives coding modularity while still producing specific code. I do this because in most cases it’s better for T-SQL code to be as specific as possible, even if that means writing hundreds of similar procedures or triggers. No, it doesn’t seem like a good idea for code reuse, but it’s much better for performance.

If you want to stop the trigger because of invalid conditions, you raise an error and the TRY...CATCH block handles the rest. If you use the trigger to make modifications to other tables, you don’t really need to do anything else; the error handler will capture any errors that occur, meaning that you don’t have to check the error level after each statement. If an error occurs, it’ll take you to the CATCH block immediately.


Note

Something I won’t cover is that you can also build triggers using the CLR. I haven’t heard of any valuable reason to use the CLR for triggers, but it can be done, and all of the same caveats are still a consideration.


When and why to use what tool

It’s important to consider how to apply all of these tools to common situations. Bear in mind that most of the situations we protect against, and the errors that will be sent to the client, should rarely if ever occur; the UI and application layer should prevent them. Our goal in placing these protections is twofold.

First, without a guarantee, mistakes will be made. There are too many clients and too many situations to guarantee perfection in the object layers that are created. Having this impenetrable layer of protection shields us from the mistakes that do get made.

Second, there are performance benefits as well. Many of the objects that we’ve looked at will improve performance:

  • Using proper data types can reduce storage overhead and prevent comparisons between unlike types, which can eliminate index utilization.
  • Limiting NULLs can help the optimizer not consider NULL as a possibility, and make comparisons easier.
  • Uniqueness constraints can help the optimizer know that only one value can match for a query.
  • Using proper indexes (including unique indexes) generally helps performance unless they’re grossly overused; this includes unique indexes on views and filtered indexes (those with a WHERE clause).
  • Using foreign key constraints helps the optimizer to know whether related values exist.
  • Using check constraints (when trusted) lets the optimizer know what sorts of values can be in a table, so searching for an illegal value is a known fail (see the sketch just after this list).
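To illustrate the last point in that list, with the trusted check constraint on invoice.invoiceStatus created earlier, a search for a value outside the allowed domain is a contradiction the optimizer can detect. If you look at the execution plan for a query like the following, you’ll often see a constant scan with no access of the table at all (this assumes the constraint is trusted and that the plan was fully optimized rather than simplified away by parameterization):

--With the trusted CHECK constraint limiting invoiceStatus to 'Open' and 'Closed',
--the optimizer knows this predicate can never be true, so it can skip the table
SELECT *
FROM invoice
WHERE invoiceStatus = 'Deleted'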

In general, the major problem with all of these objects is that they take work. So unless you have an advocate for the process, it’s often quite hard to make the case for database protection that duplicates some of the protection that’s in the application code. The goal of this section is to give you a succinct overview of the different types of objects in terms of types of things you’re trying to protect against.

Independent of performance, there’s a tremendous amount of overlap in which tools you use to solve any problem. Triggers could be written to do all data validation, and in the early (dark) ages of SQL Server, triggers, NULL specifications, and unique indexes were all we had for implementation. There were no constraints whatsoever and we made do. Clearly a trigger-only approach isn’t the best practice today; rather, the best practice is to use the tools in the order we first discussed, considering whether you can use the technique mentioned to cover your particular need:

  • Data types to constrain data to a basic physical structure that best matches the need.
  • NULL specifications to keep NULL data out of columns that shouldn’t have NULLs (one of the worst offenses people commit is letting values be NULL so that the UI can send multiple updates).
  • Uniqueness constraints to make sure that the set of values for some column or columns only have unique values.
  • Filtered indexes to implement selective uniqueness (and performance, but this isn’t a performance-tuning chapter).
  • Foreign key constraints to make certain that related values have matches.
  • Check constraints to implement checks on the same row, as well as checks on other tables when you want them to happen row by row.
  • Triggers for everything else you need to validate in the table, as well as for introducing side effects to a query. They’re seldom needed, but they’re a tool you can reach for on occasion.

Sadly, this also becomes a polarizing political topic. There are many application developers who’ll be annoyed that you want to use constraints to manage anything. They’ll become incensed at the errors that are sent to their client by your constraints, and will demand their removal. Even though your goal is to only use constraints to manage “never-occurring” errors, the fact is that they weren’t prepared for bugs in the application code that allow errors to occur. This is a defining point in your career when you can, while making them feel good about themselves, explain why this is spurious reasoning.

There are two reasons why this reasoning is less than sound. First, if the errors are not supposed to occur, nobody should notice your constraints in the first place. Second, you can’t stop all errors from happening. SQL Server could encounter a hardware glitch and an error might occur. More importantly, deadlocks aren’t 100 percent preventable, so your application code needs to be aware that a deadlock might occur and be able to handle it. Knowing what changes were running during a transaction (which you need for a deadlock handler to retry the transaction) is the same thing you need to know for an error handler.

You already need to protect against errors like these, and it’s rare, if ever, that an error from a constraint should even be raised to the application code (usually the only reason you’d ever see one is a bug or some trouble with your database). Given that, it should be clear why you need to protect your data in SQL Server.

Now the onus is on you to carefully consider how enforceable and how permanent each rule you’re attempting to enforce really is. It doesn’t behoove you to build a lot of constraints and triggers that have to be frequently changed, and it’s best if you can use some client code or stored values to data-drive as many of the variables in your system as you can. For example, if you can give a customer a variable amount of discount based on external settings, don’t hard-code the settings: build a table and let the client code look it up.

As a final example, consider the salary example from the previous section. You, being the intrepid reader, probably noted that salaries are bound by position, and a new DBA’s salary might be bound by the current year’s salary bounds. You may even want to create a table that contains the minimum and maximum salary that can be offered. Although this might sound like a rigid rule that could easily be built into a constraint or trigger, this isn’t a likely candidate for rigid database constraints, because no matter what the database says, we all know that if you had a shot to get Kalen Delaney to come work for your company as a regular employee, you’d quickly break the salary rules and give her what she wanted (and perhaps a tad more than she wanted to make sure she stayed happy). Having rules be optional like this prevents the user from needing to make a new phony position with its own salary rules to handle fringe cases. A simple piece of data indicating who chose to override the common rules is more than enough. On the other hand, you definitely want to make sure that you use a proper data type to store the salary amount (possibly a data type such as numeric(12,2)) and a check constraint to make sure only positive values are inserted. In some cases, you might add a column that indicates who overrode the rules and then ignore checking the rules, but again, it can end up like spaghetti code if you start to build a web of enforcement rules that allow too much overriding.
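A minimal sketch of that kind of data-driven rule (the positionSalaryRange table and its columns are purely illustrative) might be nothing more than this:

--Purely illustrative: the bounds live in data rather than in a constraint or
--trigger, so they can change without schema changes, and overrides can simply
--be recorded rather than forbidden
CREATE TABLE positionSalaryRange
(
    positionId int NOT NULL CONSTRAINT PKpositionSalaryRange PRIMARY KEY,
    minimumSalary numeric(12,2) NOT NULL,
    maximumSalary numeric(12,2) NOT NULL,
    CONSTRAINT positionSalaryRange$minMax CHECK (minimumSalary <= maximumSalary)
)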

Summary

SQL Server provides quite a few ways to implement data protection, methods that can help you take your well-designed database from simple data storage that has to be constantly validated on use to a trustable source of information. Every tool we presented can be coded in such a way that the database user needn’t have any idea that it’s running.

We can’t stop the user who wants to spell the name Michael as Micheal (or even F-R-E-D, for that matter), but we can stop someone from putting in a salary of -100, or entering NULL for required data, and we can do our best to avoid the most basic issues that take tons of time to code around when data is used. SQL Server provides tools such as data types to define the reasonable storage of values, NULL specifications to require data when it must be available, uniqueness constraints to prevent duplicate values, filtered unique indexes to prevent duplicate values on a subset of rows, foreign key constraints to ensure related data is correct, check constraints to make sure that values in the row meet the minimum standards, and finally, as a last resort, triggers to catch all other needs.

None of this is all that hard to do. Even triggers are straightforward to code once you get the concept of multi-row handling. The hardest part is gaining the acceptance of the programmer community. During the development process, they’re likely to get pinged with errors that they don’t expect because they haven’t written their code to cover all of the cases that might occur. Remind them that you’re building an application for the future, not just to get done faster.

About the author

Louis Davidson has more than 15 years of experience as a corporate database developer and architect. Currently he’s the data architect for the Christian Broadcasting Network. Nearly all of Louis’s professional experience has been with Microsoft SQL Server, from the early days to the latest version he can get access to legitimately. Louis has been the principal author of four editions of a book on database design, including one for SQL Server 2008 entitled Pro SQL Server 2008 Relational Database Design and Implementation (Apress, 2008). Louis’s primary areas of interest are database architecture and coding in T-SQL; he’s had experience designing many databases and writing thousands of stored procedures and triggers through the years.
