CHAPTER 6


Physical Model Implementation Case Study

Even in literature and art, no man who bothers about originality will ever be original: whereas if you simply try to tell the truth (without caring twopence how often it has been told before) you will, nine times out of ten, become original without ever having noticed it.

—C.S. Lewis

When the normalization task is complete, you have the basic structures ready for implementation, but several tasks remain in the process of completing the transformation from the logical model to the physical, relational model. Throughout the normalization process, you have produced legal, normalized tables that can be implemented and, using the information that should be produced during the logical modeling phase, are now ready for the finishing touches that will turn your theoretical model into something that users (or at least developers!) can start using. At a minimum, between normalization and actual implementation, take plenty of time to review the model to make sure you are completely happy with it.

In this chapter, I’ll take the normalized model and convert it into the final blueprint for the database implementation. Even starting from the same logical model, different people tasked with implementing the relational database will take a subtly (or even dramatically) different approach to the process. The final physical design will always be, to some extent, a reflection of the person/organization who designed it, although the reasonable solutions usually “should” resemble one another at their core.

The normalized model you have created is pretty much database agnostic and unaffected by whether the final implementation would be on Microsoft SQL Server, Microsoft Access, Oracle, Sybase, or any relational database management system. (You should expect a lot of changes if you end up implementing with a nonrelational engine, naturally.) However, during this stage, in terms of the naming conventions that are defined, the datatypes chosen, and so on, the design is geared specifically for implementation on SQL Server 2012. Each of the relational engines has its own intricacies and quirks, so it is helpful to understand how to implement on the system you are tasked with. In this book, we will stick with SQL Server.

We will go through the following steps:

  • Choosing names: We’ll look at naming concerns for tables and columns. The biggest thing here is making sure to have a standard and to follow it.
  • Choosing key implementation: Throughout the earlier bits of the book, we’ve made several types of key choices. In this section, we will go ahead and finalize the implementation keys for the model.
  • Determining domain implementation: We’ll cover the basics of choosing datatypes, nullability, and simple computed columns. Another decision will be choosing between using a domain table or a column with a constraint for types of values where you want to limit column values to a given set.
  • Setting up schemas: This section provides some basic guidance in creating and naming your schemas. Beginning in SQL Server 2005, you could set up groups of tables as schemas that provide groupings of tables for usage, as well as security.
  • Adding implementation columns: We’ll consider columns that are common to almost every database that people implement but that are not part of the logical design.
  • Using Data Definition Language (DDL) to create the database: In this section, we will go through the common DDL that is needed to build almost every database you will encounter.
  • Baseline testing your creation: Because it is a great practice to load some data and test your complex constraints, this section offers guidance on how you should approach and implement testing.

sq.jpg  Note   For this and subsequent chapters, I’ll assume that you have SQL Server 2012 installed on your machine. For the purposes of this book, I recommend you use the Developer Edition, which is available for a small cost from www.microsoft.com/sql/howtobuy/default.aspx . The Developer Edition gives you all of the functionality of the Enterprise Edition of SQL Server for developing software. It also includes the fully functional Management Studio for developing queries and managing your databases. (The Enterprise Evaluation Edition will also work just fine if you don't have any money to spend. Bear in mind that licensing changes are not uncommon, so your mileage may vary. In any case, there should be a version of SQL Server available to you to work through the examples.)

Another possibility is SQL Server Express Edition, which is free but doesn’t come with the full complement of features of the Developer Edition. For the most part, the feature list is complete enough to use with this book. I won’t make required use of any of the extended features, but if you’re learning SQL Server, you’ll probably want to have the full feature set to play around with. You can acquire the Express Edition in the download section at www.microsoft.com/sql/ .

Finally, I’ll work on a complete (if really small) database example in this chapter, rather than continue with any of the examples from previous chapters. The example database is tailored to keeping the chapter simple and to avoiding difficult design decisions, which we will cover in the next few chapters.

The main example in this chapter is based on a simple messaging database that a hypothetical company is building for its upcoming conference. Any similarities to other systems are purely coincidental, and the model is specifically created not to be overly functional but to be very, very small. The following are the simple requirements for the database:

  • Messages can be 200 characters of Unicode text. Messages can be sent privately to one user, to everyone, or both. The user cannot send a message with the exact same text more than once per hour (to cut down on mistakes where users click send too often).
  • Users will be identified by a handle that must be 5–20 characters, and they will use their conference attendee numbers and the key values on their badges to access the system. To keep up with their own group of people, apart from other users, users can connect themselves to other users. Connections are one-way, allowing users to see all of the speakers’ information without the reverse being true.

Figure 6-1 shows the logical database design for this application, on which I’ll base the physical design.

9781430236955_Fig06-01.jpg

Figure 6-1 . Simple logical model of conferencing message database

The following is a brief documentation of the tables and columns in the model. I won’t be too specific with things like datatypes in this list. To keep things simple, I will expand on each need individually as we get to it.

  • User : Represents a user of the messaging system, preloaded from another system with attendee information.
    • UserHandle : The name the user wants to be known as. Initially preloaded with a value based on the person’s first and last name, plus an integer value, and changeable by the user.
    • AccessKey : A password-like value given to the users on their badges to gain access.
    • AttendeeNumber : The number that the attendees are given to identify themselves, printed on front of their badges.
    • TypeOfAttendee : Used to give the user special privileges, such as access to speaker materials, vendor areas, and so on.
    • FirstName, LastName: Name of the user printed on badge for people to see.
  • UserConnection : Represents the connection of one user to another in order to filter results to a given set of users.
    • UserHandle: Handle of the user who is going to connect to another user.
    • ConnectedToUser : Handle of the user who is being connected to.
  • Message : Represents a single message in the system.
    • UserHandle: Handle of the user sending the message.
    • Text : The text of the message being sent.
    • RoundedMessageTime : The time of the message, rounded to the hour.
    • SentToUserHandle : The handle of the user that is being sent a message.
    • MessageTime : The time the message is sent, at a grain of one second.
  • MessageTopic : Relates a message to a topic.
    • UserHandle: User handle from the user who sent the message.
    • RoundedMessageTime : The time of the message, rounded to the hour.
    • TopicName : The name of the topic being sent.
    • UserDefinedTopicName : Allows the users to choose the UserDefined topic styles and set their own topics.
  • Topic : Predefined topics for messages.
    • TopicName: The name of the topic.
    • Description : Description of the purpose and utilization of the topics.

Choosing Names

The target database for our model is SQL Server, so our table and column naming conventions must adhere to the rules imposed by this database and generally be consistent and logical. In this section, I’ll briefly cover some of the different concerns when naming tables and columns. All of the system constraints on names have been the same for the past few versions of SQL Server, including 2000, 2005, and 2008.

Names of columns, tables, procedures, and so on are referred to technically as identifiers . Identifiers in SQL Server are stored in a system datatype of sysname . The system-defined type named sysname is defined as a 128-character (or less, of course) string using double-byte Unicode characters. SQL Server’s rules for identifiers allow two distinct naming methods:

  • Regular identifiers : This is the preferred method, with the following rules:
    • The first character must be a letter as defined by Unicode Standard 3.2 (generally speaking, Roman letters A to Z, uppercase and lowercase, although this also includes other letters from other languages) or the underscore character (_). You can find the Unicode Standard at www.unicode.org .
    • Subsequent characters can be Unicode letters, numbers, the “at” sign (@), or the dollar sign ($).
    • The name must not be a SQL Server reserved word. You can find a large list of reserved words in SQL Server 2012 Books Online, in the “Reserved Keywords” section. Some of the keywords won’t cause an error, but it’s better to avoid all keywords if possible. Some of these are tough, like user, transaction, and table, as they do often come up in the real world. (Note that our original model includes the name User, which we will have to correct.)
    • The name cannot contain spaces.
  • Delimited identifiers : These have either square brackets ([ ]) or double quotes ("), which are allowed only when the SET QUOTED_IDENTIFIER option is set to on, around the name. By placing delimiters around an object’s name, you can use any string as the name. For example, [Table Name], [3232 fjfa*&(&^(], or [Drop Database Master] would be legal (but really annoying and dangerous) names. Names requiring delimiters are generally a bad idea when creating new tables and should be avoided if possible, because they make coding more difficult. However, they can be necessary for interacting with data tables in other environments. Delimiters are generally to be used when scripting objects, because a name like [Drop Database Master] can cause “problems” if you don’t delimit it.

If you need to put a closing square bracket (]) or even a double quote character in the name, you have to include two closing square brackets (]]), just like when you need to include a single quote within a string. So, the name fred]olicious would have to be delimited as [fred]]olicious]. However, if you find yourself needing to include special characters of any sort in your names, take a good long moment to consider whether you really do need this. If you determine after some thinking that you do, please ask someone else for help naming your objects, or e-mail me at [email protected]. This is a pretty horrible thing to do and will make working with your objects very cumbersome. Even just including space characters is a bad enough practice that you and your users will regret it for years. Note too that [name] and [name ] are treated as different names (see the embedded space). I once had a DBA name a database with a trailing space by accident . . . very annoying.
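
To make the rules concrete, here is a minimal sketch (the table and column names are throwaway examples, not part of the chapter’s model) showing a regular identifier, a couple of delimited identifiers, and the doubled closing bracket:

--a regular identifier needs no delimiters
CREATE TABLE UserConnectionDemo (UserConnectionDemoId int NOT NULL);

--delimited identifiers allow otherwise illegal names (legal, but best avoided)
CREATE TABLE [Table Name] ([Column Name] int NOT NULL);

--a closing square bracket in a name must be doubled, so fred]olicious becomes:
CREATE TABLE [fred]]olicious] (SomeColumnName int NOT NULL);

DROP TABLE UserConnectionDemo, [Table Name], [fred]]olicious];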

image.jpg Note   Using policy-based management, you can create naming standard checks for whenever a new object is created. Policy-based management is a management tool rather than a design one, though it could pay to create naming standard checks to make sure you don’t accidentally create objects with names you won’t accept. In general, I find doing things that way too restrictive, because there are always exceptions to the rules and automated policy enforcement only works with a dictator’s hand. (Think Darth Vader, development manager!)

Table Naming

While the rules for creating an object name are pretty straightforward, the more important question is, “What kind of names should be chosen?” The answer is predictable: “Whatever you feel is best, as long as others can read it.” This might sound like a cop-out, but there are more naming standards than there are data architects. (On the day this paragraph was written, I actually had two independent discussions about how to name several objects and neither person wanted to follow the same standard.) The standard I generally go with is the standard that was used in the logical model, that being Pascal-cased names, little if any abbreviation, and as descriptive as possible. With space for 128 characters, there’s little reason to do much abbreviating (other than extending the life of your keyboard, I would suppose).

sq.jpg  Caution   Because most companies have existing systems, it’s a must to know the shop standard for naming tables so that it matches existing systems and so that new developers on your project will be more likely to understand your database and get up to speed more quickly. The key thing to make sure of is that you keep your full logical names intact for documentation purposes.

As an example, let’s consider the name of the UserConnection table we will be building later in this chapter. The following list shows several different ways to build the name of this object:

  • user_connection (or sometimes, by some awful mandate, an all-caps version USER_CONNECTION ): Use underscores to separate values. Most programmers aren’t big friends of underscores, because they’re cumbersome to type until you get used to them. Plus, they have a COBOLesque quality that doesn’t please anyone.
  • [user connection] or "user connection": This name is delimited by brackets or quotes. As I have already mentioned, this isn’t really favored by anyone who has done any programming, because it’s impossible to use this name when building variables in code, and it’s very easy to make mistakes with them. Being forced to use delimiters is annoying, and many other languages use double quotes to denote strings. (In SQL, you should always use single quotes!) On the other hand, the brackets [ and ] don’t denote strings, although they are a Microsoft-only convention that will not port well if you need to do any kind of cross-platform programming. Bottom line: delimited names are a bad idea anywhere except perhaps in a SELECT clause for a quickie report.
  • UserConnection or userConnection: Pascal or camel case (respectively), using mixed case to delimit between words. I’ll use Pascal style in the examples, because it’s the style I like. (Hey, it’s my book. You can choose whatever style you want!)
  • usrCnnct or usCnct: The abbreviated forms are problematic, because you must be careful always to abbreviate the same word in the same way in all your databases. You must maintain a dictionary of abbreviations, or you’ll get multiple abbreviations for the same word—for example, getting “description” as “desc,” “descr,” “descrip,” and/or “description.”

Choosing names for objects is ultimately a personal choice but should never be made arbitrarily; it should be based first on existing corporate standards, then existing software, and finally legibility and readability. The most important thing to try to achieve is internal consistency. Naming, ownership, and datatypes are all things that will drive you nuts when not done consistently, because they keep everyone guessing what will be used next time. Your goal as an architect is to ensure that your users can use your objects easily and with as little thinking about structure as possible. Even a pretty bad naming convention will be better than having ten different good ones implemented by warring architect/developer factions. And lest you think I am kidding, in many ways the Cold War was civil compared to the internal politics of database/application design.

sq.jpg  Note   There is something to be said about the quality of corporate standards as well. If you have an archaic ­standard, like one that was based on the mainframe team’s standard back in the 19th century, you really need to consider trying to change the standards when creating new databases so you don’t end up with names like HWWG01_TAB_USR_CONCT_T just because the shop standards say so (and yes, I do know when the 19th century was).

Naming Columns

The naming rules for columns are the same as for tables as far as SQL Server is concerned. As for how to choose a name for a column—again, it’s one of those tasks for the individual architect, based on the same sorts of criteria as before (shop standards, best usage, and so on). This book follows this set of guidelines:

  • Other than the primary key, my feeling is that the table name should rarely be included in the column name. For example, in an entity named Person, it isn’t necessary to have columns called PersonName or PersonSocialSecurityNumber . Most columns should not be prefixed with the table name other than with the following two exceptions:
    • A surrogate key such as PersonId : This reduces the need for role naming (modifying names of attributes to adjust meaning, especially used in cases where multiple migrated foreign keys exist).
    • Columns that are naturally named with the entity name in them, such as PersonNumber , PurchaseOrderNumber , or something that’s common in the language of the client and used as a domain-specific term.
  • The name should be as descriptive as possible. Use few abbreviations in names, with a couple of notable exceptions:
    • Highly recognized abbreviations : As an example, if you were writing a purchasing system and you needed a column for a purchase-order table, you could name the object PO, because this is widely understood. Often, users will desire this, even if some abbreviations don’t seem that obvious.
    • Pronounced abbreviations : If a value is read naturally as the abbreviation, then it can be better to use the abbreviation. For example, I always use id instead of identifier, first because it’s a common abbreviation that’s known to most people and second because the surrogate key of the Widget table is naturally pronounced Widget-Eye-Dee, not ­Widget-Identifier.
  • Usually, the name should end in a “class” word that distinguishes the main function of the column. This class word gives a general idea of the purpose of the attribute and general expectation of datatype. It should not be the same thing as the datatype—for example:
    • StoreId is the identifier for the store.
    • UserName is a textual string, but whether or not it is a varchar(30) or nvarchar(128) is immaterial.
    • EndDate is the date when something ends and does not include a time part.
    • SaveTime is the point in time when the row was saved.
    • PledgeAmount is an amount of money (using a numeric(12,2), money, or any similar type).
    • DistributionDescription is a textual string that is used to describe how funds are distributed.
    • TickerCode is a short textual string used to identify a ticker row.
    • OptInFlag is a two-value column (possibly three including NULL) that indicates a status, such as in this case if the person has opted in for some particular reason.

sq.jpg  Note   Many possible class words could be used, and this book is not about giving you all the standards to follow at that level. Too many variances from organization to organization make that too difficult.

I should point out that I didn’t mention a Hungarian-style notation in the guidelines to denote the datatype of the column for a reason. I’ve never been a big fan of this style, and neither are a large number of the professional architects that I know. If you aren’t familiar with Hungarian notation, it means prefixing the names of columns and variables with an indicator of the datatype and possible usage. For example, you might have a variable called vc100_columnName to indicate a varchar(100) datatype. Or you might have a Boolean or bit column named bIsCar or bCarFlag .

In my opinion, such prefixes are very much overkill, because it’s easy to tell the type from other metadata you can get from SQL Server. Class word usage indicators go at the end of the name and give you enough of an idea of what a column is used for without spelling out the exact implementation details. Consider what happens if you want to change the type of a column from varchar(100) to varchar(200) because the data is of a different size than was estimated, or even because the requirements have changed. If you then have to change the name of the column, the user interface must change, the ETL to the data warehouse has to change, and all scripts and procedures have to change, even if there’s no other reason to change. Otherwise, the change could be trivial, possibly needing to expand the size of only a few variables (and in some languages, even this wouldn’t be required). Take the name bCarFlag. It is typical to use a bit datatype, but it is just as acceptable to use a char(1) with a domain of 'Y' and 'N' or any implementation that makes sense for your organization.
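
As a quick illustration of why leaving the implementation out of the name pays off, consider a hypothetical Customer table (not part of this chapter’s model) whose column needs to grow:

--because the column name describes meaning rather than type, resizing it does
--not force a rename in the UI, ETL, reports, or any stored code
ALTER TABLE Customer ALTER COLUMN StreetAddressLine varchar(200) NOT NULL;
--a Hungarian-style name such as vc100_StreetAddressLine would now be misleading,
--or would require a rename in every consumer of the table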

A particularly hideous practice that is somewhat common with people who have grown up working with procedural languages (particularly interpreted languages) is to include something in the name to indicate that a column is a column, such as colFirstName or columnCity . Please don’t do this (please?). It’s clear by the context in which columns are used that a column is a column. It can be used only as a column. This practice, just like the other Hungarian-style notations, makes good sense in a procedural programming language where the type of object isn’t always clear just from context, but this practice is never needed with SQL tables.

sq.jpg  Note   I’ll use the same naming conventions for the implementation model as I did for the logical model: Pascal-cased names with a few abbreviations (mostly in the class words, like “id” for “identifier”). I’ll also use a Hungarian-style notation for objects other than tables and columns, such as constraints, and for coded objects, such as procedures. This is mostly to keep the names unique and avoid clashes with the table and column names, plus it is easier to read in a list that contains multiple types of objects (the tables are the objects with no prefixes). Tables and columns are commonly used directly by users. They write queries and build reports directly using database object names and shouldn’t need to change the displayed name of every column and table.

Model Name Adjustments

In our demonstration model, the first thing we will do is to rename the User table to MessagingUser because “User” is a SQL Server reserved word. While User is the more natural name than MessagingUser, it is one of the tradeoffs we have made because of the legal values of names. In rare cases, when a suitable name can’t be created, I may use a bracketed name, but even though it took me four hours to redraw graphics and undo my original choice of User as a name, I don’t want to give you that as a good practice. If you find you have used a reserved word in your model (and you are not writing a chapter in a book that is 70+ pages long about it), it is usually a very minor change.

In the model snippet in Figure 6-2, I have made that change.

9781430236955_Fig06-02.jpg

Figure 6-2 . Table User has been changed to MessagingUser

The next change we will make will be to a few of the columns in this table. We will start off with the TypeOfAttendee column. The standard we discussed was to use a class word at the end of the column. In this case, Type will make an acceptable class word, as when you see AttendeeType , it will be clear what it means. The implementation will be a value of up to 20 characters.

The second change will be to the AccessKey column. Key itself would be acceptable as a class word, but it would imply that the value is a key in the database (a standard I have used in my data warehousing dimensional database designs). So adding Value to the name will make the name clearer and more distinctive. Figure 6-3 reflects the change in name.

9781430236955_Fig06-03.jpg

Figure 6-3 . MessagingUser table after change to AccessKey column name

Choosing Key Implementation

The next step in the process is to choose how to implement the keys for the table. At this point, the model has one key identified for each table: the primary key. In this section, we will look at the issues surrounding key choice and, in the end, will set the keys for the demonstration model. We will look at choices for implementing primary keys and then note the choices for creating alternate keys as needed.

Primary Key

Choosing the style of implementation for primary keys is an important choice. Depending on the style you go with, the look and feel of the rest of the database project will be affected. This is the case because whatever method you go with, the primary key value will be migrated to other tables as a reference to the particular row. Choosing a primary key style is also one of the most argued about topics on the forums and occasionally over dinner after a SQL Saturday event. In this book, I’ll be reasonably agnostic about the whole thing, and I’ll present several methods for choosing the implemented primary key throughout the book. In this chapter, I will use a very specific method, of course.

Presumably, during the logical phase, you’ve identified the different ways to uniquely identify a row. Hence, there should be several choices for the primary key, including the following:

  • Using an existing column (or set of columns)
  • Deriving a new surrogate column to represent the row

Each of these choices has pros and cons. I’ll look at them in the following sections.

Basing a Primary Key on Existing Columns

In many cases, a table will have an obvious, easy-to-use primary key. This is especially true when talking about independent entities. For example, take a table such as product. It would often have a productNumber defined. A person usually has some sort of identifier, either government or company issued. (For example, my company has an employeeNumber that I have to put on all documents, particularly when the company needs to write me a check.)

The primary keys for dependent tables can generally take the primary key of the independent table, add one or more attributes, and—presto!—primary key.

For example, I have a Ford SVT Focus, made by the Ford Motor Company, so to identify this particular model, I might have a row in the Manufacturer table for Ford Motor Company (as opposed to GM or something). Then, I’d have an automobileMake row with a key of manufacturerName = 'Ford Motor Company' and makeName = 'Ford' (instead of Lincoln, Mercury, Jaguar, and so on), style = 'SVT', and so on, for the other values. This can get a bit messy to deal with, because the key of the automobileModelStyle table would be used in many places to describe which products are being shipped to which dealership. Note that this isn’t primarily about the performance of the key, just the number of values that make up the key. Performance will generally be better with a smaller key, but that is true not only of the number of columns; it also depends on the size of the values.

Note that the complexity in a real system such as this would be compounded by the realization that you have to be concerned with model year, possibly body style, different prebuilt packages, and so on. The key of the table may frequently have many parts, particularly in tables that are the child of a child of a child, and so on.
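
To visualize how the key grows as you move down the chain, here is a minimal sketch under assumed names and sizes (a real set of automobile tables would obviously carry far more attributes):

CREATE TABLE Manufacturer
(
    ManufacturerName varchar(60) NOT NULL CONSTRAINT PKManufacturer PRIMARY KEY
);
CREATE TABLE AutomobileMake
(
    ManufacturerName varchar(60) NOT NULL
        CONSTRAINT FKAutomobileMake$Manufacturer REFERENCES Manufacturer (ManufacturerName),
    MakeName         varchar(60) NOT NULL,
    CONSTRAINT PKAutomobileMake PRIMARY KEY (ManufacturerName, MakeName)
);
CREATE TABLE AutomobileModelStyle
(
    ManufacturerName varchar(60) NOT NULL,
    MakeName         varchar(60) NOT NULL,
    StyleName        varchar(60) NOT NULL,
    --the three-column key is what every child of this table would have to inherit
    CONSTRAINT PKAutomobileModelStyle PRIMARY KEY (ManufacturerName, MakeName, StyleName),
    CONSTRAINT FKAutomobileModelStyle$AutomobileMake
        FOREIGN KEY (ManufacturerName, MakeName)
        REFERENCES AutomobileMake (ManufacturerName, MakeName)
);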

Basing a Primary Key on a New, Surrogate Value

The other common key style is to use only a single column for the primary key, regardless of the size of the other keys. In this case, you’d specify that every table will have a single primary key and implement alternate keys in your tables, as shown in Figure 6-4.

image

Figure 6-4. Single-column key example

Note that in this scenario, all of your relationships will be implemented in the database as nonidentifying type relationships, though you will implement them to all be required values (no NULLs). Functionally, this is the same as if the parentKeyValue was migrated from parent through child and down to grandChild, though it makes it harder to see in the model.

In the model in Figure 6-4, the most important thing you should notice is that each table not only has the primary key but also an alternate key. The term “surrogate” has a very specific meaning, even outside of computer science, and that is that it serves as a replacement. So the surrogate key for the parent object of parentKeyValue can be used as a substitute for the defined key, in this case otherColumnsForAltKey .
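
A minimal sketch of this pattern follows; the column names echo Figure 6-4 as described in the text, while the datatypes and constraint names are assumptions, since the chapter’s real DDL comes later:

CREATE TABLE parent
(
    parentId              int IDENTITY(1,1) CONSTRAINT PKparent PRIMARY KEY,
    otherColumnsForAltKey varchar(10) NOT NULL CONSTRAINT AKparent UNIQUE
);
CREATE TABLE child
(
    childId               int IDENTITY(1,1) CONSTRAINT PKchild PRIMARY KEY,
    parentId              int NOT NULL CONSTRAINT FKchild$parent REFERENCES parent (parentId),
    otherColumnsForAltKey varchar(10) NOT NULL,
    --the alternate key carries the parent's key plus this table's own attribute,
    --preserving the uniqueness an identifying relationship would have enforced
    CONSTRAINT AKchild UNIQUE (parentId, otherColumnsForAltKey)
);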

This method does have some useful advantages:

  • Every table has a single-column primary key: It’s much easier to develop applications that use this key, because every table will have a key that follows the same pattern. It also makes code generation easier to follow, because it is always understood how the table will look, relieving you from having to deal with all the other possible permutations of key setups.
  • The primary key index will be small: Thus, operations that use the index to access a row in the table will be faster. Most update and delete operations will likely modify the data by accessing the data based on primary keys that will use this index.
  • Joins between tables will be easier to code: That’s because all migrated keys will be a single column. Plus, if you use a surrogate key that is named TableName + Suffix, there will be less thinking to do when setting up the join.

There are also disadvantages to this method, such as always having to join to a table to find out the meaning of the surrogate key value; plus, as in our example tables in Figure 6-4, you would have to join from the grandChild table through the child table to get values from parent. Another issue is that some parts of the self-documenting nature of relationships are obviated, because using only single-column keys eliminates the obviousness of all identifying relationships. So in order to know that the logical relationship between parent and grandchild is identifying, you will have to trace the relationship and look at the uniqueness constraints.

Assuming you have chosen to use a surrogate key, the next choice is to decide what data to use for the key. Let’s look at two methods of implementing these keys, either by deriving the key from some other data or by using a meaningless surrogate value.

A popular way to define a primary key is to simply use a meaningless surrogate key like we’ve modeled previously, such as using a column with the IDENTITY property, which automatically generates a unique value. In this case, you rarely let the user have access to the value of the key but use it primarily for programming.

It’s exactly what was done for most of the entities in the logical models worked on in previous chapters: simply employing the surrogate key while we didn’t know what the actual value for the primary key would be. This method has one nice property:

You never have to worry about what to do when the primary key value changes.

Once the key is generated for a row, it never changes, even if all the data changes. This is an especially nice property when you need to do analysis over time. No matter what any of the other values in the table have been changed to, as long as the surrogate key value represents the same thing, you can still relate it to its usage in previous times. (This is something you have to be clear about with the DBA/programming staff as well. Sometimes, they may want to delete all data and reload it, but if the surrogate changes, your link to the unchanging nature of the surrogate key is likely broken.) Consider the case of a row that identifies a company. If the company is named Bob’s Car Parts and it’s located in Topeka, Kansas, but then it hits it big, moves to Detroit, and changes the company name to Car Parts Amalgamated, only one row is touched: the row where the name is located. Just change the name, and it’s done. Keys may change, but not primary keys. Also, if the method of determining uniqueness changes for the object, the structure of the database needn’t change beyond dropping one UNIQUE constraint and adding another.
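
The following sketch illustrates the point with a hypothetical Company table (the names and sizes are made up for this example): the company’s data changes completely, but the surrogate key value, and therefore every reference to the row, is untouched.

CREATE TABLE Company
(
    CompanyId   int IDENTITY(1,1) CONSTRAINT PKCompany PRIMARY KEY,
    CompanyName varchar(60) NOT NULL CONSTRAINT AKCompany UNIQUE,
    City        varchar(30) NOT NULL
);
INSERT INTO Company (CompanyName, City)
VALUES ('Bob''s Car Parts','Topeka');

--the business hits it big, renames, and relocates; only the row's data changes,
--the generated CompanyId (and every row that references it) stays the same
UPDATE Company
SET    CompanyName = 'Car Parts Amalgamated',
       City = 'Detroit'
WHERE  CompanyName = 'Bob''s Car Parts';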

Using a surrogate key value doesn’t in any way prevent you from creating additional single part keys, like we did in the previous section. In fact, it pretty much demands it. For most tables, having a small code value is likely going to be a desired thing. Many clients hate long values, because they involve “too much typing.” For example, say you have a value such as “Fred’s Car Mart.” You might want to have a code of “FREDS” for it as the shorthand value for the name. Some people are even so programmed by their experiences with ancient database systems that had arcane codes that they desire codes such as “XC10” to refer to “Fred’s Car Mart.”

In the demonstration model, I set all of the keys to use natural keys based on how one might do a logical model, so a table like MessagingUser in Figure 6-5 uses the entire handle of the user as its key.

image

Figure 6-5. MessagingUser table before changing model to use surrogate key

This value is the most logical, but this name, based on the requirements, can change. Changing this to a surrogate value will make it easier to make the name change and not have to worry about existing data in the table. Making this change to the model results in the change shown in Figure 6-6, and now, the key is a value that is clearly recognizable as being associated with the MessagingUser, no matter what the uniqueness of the row may be. Note that I made UserHandle an alternate key when I switched it from being the primary key.

image

Figure 6-6. MessagingUser table after changing model to use surrogate key

Next up, we will take a look at the Message table shown in Figure 6-7. Note that the two columns that were named UserHandle and SentToUserHandle have had their role names changed to indicate the change in names from when the key of MessagingUser was UserHandle.

image

Figure 6-7. Message table before changing model to use surrogate key

We will transform this table to use a surrogate key by moving all three columns to nonkey columns, placing them in a uniqueness constraint, and adding the new MessageId column. Notice, too, in Figure 6-8 that the table is no longer modeled with rounded corners, because the primary key no longer contains any migrated keys.

image

Figure 6-8. Message table after changing model to use surrogate key

One additional benefit of giving every table a single-column surrogate key is that all tables follow a common pattern. Having a common pattern for every table is useful for programming with the tables as well. Because every table has a single-column key that isn’t updatable and is the same datatype, it’s possible to exploit this in code, making code generation a far more straightforward process. Note once more that nothing should be lost when you use surrogate keys, because a surrogate of this style replaces an existing natural key. Many of the object relational mapping (ORM) tools that are popular (if controversial in the database community) require a single-column integer key as their primary implementation pattern. I don’t favor forcing the database to be designed in any manner to suit client tools, but sometimes, what is good for the database is the same as what is good for the tools, making for a relatively happy ending, at least.

By implementing tables using this pattern, I’m covered in two ways: I always have a single primary key value, and I always have a key that cannot be modified, which eases the difficulty of loading a data warehouse. No matter the choice of human-accessible key, surrogate keys are the style of key that I use for every table in the databases I create. In Figure 6-9, I have completed the transformation to using surrogate keys.

image

Figure 6-9. Messaging Database Model progression after surrogate key choices

Keep in mind that I haven’t specified any sort of implementation details for the surrogate key at this point, and clearly, in a real system, I would already have done this during the transformation. For this chapter example, I am using a deliberately detailed process to separate each individual step, so I will put off that discussion until the DDL section of this chapter, where I will present code to deal with this need along with creating the objects.

Alternate Keys

In the model so far, we have already identified alternate keys as part of the model creation (MessagingUser.AttendeeNumber was our only initial alternate key), but I wanted to just take a quick stop on the model and make it clear in case you have missed it. Every table should have a minimum of one natural key—that is, a key that is tied to the meaning of what the table is modeling. This step in the modeling process is exceedingly important if you have chosen to do your logical model with surrogates, and if you chose to implement with single part surrogate keys, you should at least review the keys you specified.

A primary key that’s manufactured or even meaningless in the logical model shouldn’t be your only defined key. One of the ultimate mistakes made by people using such keys is to ignore the fact that two rows whose only difference is a system-generated value are not different. That’s because, from the user’s perspective, all the data that is of value is the same. At this point, it becomes more or less impossible to tell one row from another.

For example, take Table 6-1, a snippet of a Part table, where PartID is an IDENTITY column and is the primary key for the table.

Table 6-1. Sample Data to Demonstrate How Surrogate Keys Don’t Make Good Logical Keys

PartID   PartNumber   Description
1        XXXXXXXX     The X part
2        XXXXXXXX     The X part
3        YYYYYYYY     The Y part

How many individual items are represented by the rows in this table? Well, there seem to be three, but are rows with PartIDs 1 and 2 actually the same row, duplicated? Or are they two different rows that should be unique but were keyed in incorrectly? You need to consider at every step along the way whether a human being could pick a desired row from a table without knowledge of the surrogate key. This is why there should be a key of some sort on the table to guarantee uniqueness, in this case likely on PartNumber .
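
For a table like this Part example, the fix is a simple alternate key; a minimal sketch (the sizes and constraint names are assumed for illustration):

CREATE TABLE Part
(
    PartID      int IDENTITY(1,1) CONSTRAINT PKPart PRIMARY KEY,
    PartNumber  char(8)     NOT NULL CONSTRAINT AKPart_PartNumber UNIQUE,
    Description varchar(60) NOT NULL
);
--the second insert now fails with a uniqueness violation rather than silently
--creating a row that is indistinguishable from the first
INSERT INTO Part (PartNumber, Description) VALUES ('XXXXXXXX','The X part');
INSERT INTO Part (PartNumber, Description) VALUES ('XXXXXXXX','The X part');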

sq.jpg  Caution   As a rule, each of your tables should have a natural key that means something to the user and that can uniquely identify each row in your table. In the very rare event that you cannot find a natural key (perhaps, for example, in a table that provides a log of events), then it is acceptable to make up some artificial key, but usually, it is part of a larger key that helps you tell two rows apart.

In a well-designed model, there should be nothing left to do at this point with keys. The architect (probably yourself) has already determined some manner of uniqueness that can be implemented. For example, in Figure 6-10, a MessagingUser row can be identified by either the UserHandle or the AttendeeNumber.

image

Figure 6-10. MessagingUser table for review

A bit more interesting is the Message table, shown in Figure 6-11. The key is the RoundedMessageTime (the time of the message, rounded to the hour), the text of the message, and the UserId .

image

Figure 6-11. Message table for review

In the business rules, it was declared that the user could not post the same message more than once an hour. Constraints such as this are not terribly easy to implement in a simple manner, but breaking it down to the data you need to implement the constraint can make it easier. In our case, by putting a key on the message, user, and the time rounded to the hour, the implementation is quite easy.

Of course, by putting this key on the table, if the UI sends the same data twice, an error will be raised when a duplicate message is sent. This error will need to be dealt with at the client side, either by just ignoring the message, or translating the error message to something nicer.
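
As a preview of how this might look (the chapter’s DDL section builds the real table; the names, datatypes, and rounding expression here are assumptions), a computed, persisted RoundedMessageTime column can participate in the uniqueness constraint:

CREATE TABLE MessageDemo
(
    MessageId          int IDENTITY(1,1) CONSTRAINT PKMessageDemo PRIMARY KEY,
    MessagingUserId    int           NOT NULL,
    Text               nvarchar(200) NOT NULL,
    MessageTime        datetime2(0)  NOT NULL,
    --deterministic expression that strips the minutes and seconds from MessageTime
    RoundedMessageTime AS DATEADD(HOUR, DATEPART(HOUR, MessageTime),
                           CAST(CAST(MessageTime AS date) AS datetime2(0))) PERSISTED,
    --the same user may not send the same text twice within the same hour
    CONSTRAINT AKMessageDemo UNIQUE (MessagingUserId, Text, RoundedMessageTime)
);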

The last table I will cover here is the MessageTopic table, shown in Figure 6-12.

image

Figure 6-12. MessageTopic table for review

What is interesting about this table is the optional UserDefinedTopicName value. Later, when we are creating this table, we will load some seed data that indicates that a TopicId represents the UserDefined topic, which means that the UserDefinedTopicName column can be used. I will use a surrogate key value of 0 for that row, and along with the seed data, we will create a check constraint on this table to make sure that all data fits the required criteria.
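
A minimal sketch of what that constraint could look like follows; it assumes the MessageTopic table already exists, and the constraint name and exact predicate are illustrative (the real version is built in the DDL section):

ALTER TABLE MessageTopic
  ADD CONSTRAINT CHKMessageTopic$UserDefinedTopicName
      --the user-defined name is filled in only for the special TopicId 0 row
      CHECK ((TopicId = 0 AND UserDefinedTopicName IS NOT NULL)
          OR (TopicId <> 0 AND UserDefinedTopicName IS NULL));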

At this point, to review, we have the model at the point in Figure 6-13.

image

Figure 6-13. Messaging model for review

Determining Domain Implementation

In logical modeling, the concept of domains is used to specify a template for datatypes and column properties that are used over and over again. In physical modeling, domains are used to choose the datatype to use and give us a guide as to the validations we will need to implement.

For example, in the logical modeling phase, domains are defined for such columns as name and description, which occur regularly across a database/enterprise. The reason for defining domains might not have been completely obvious at the time of logical design, but it becomes clear during physical modeling. During implementation, domains serve several purposes:

  • Consistency : Define every TopicName column in precisely the same manner; there will never be any question about how to treat the column.
  • Ease of implementation : If the tool you use to model and implement databases supports the creation of domain and template columns, you can simply use the template to build similar columns with the same pattern, and you won’t have to set the values over and over, which leads to mistakes! (Even using proper tools, I always miss some minor naming or typing issue that ends up in the final model that just irks me forever.) If you have tool support for property inheritance, when you change a property in the definition, the values change everywhere.
  • Documentation : Even if every column used a different domain and there was no reuse, the column/domain documentation would be very useful for programmers to be able to see what datatype to use for a given column. In the final section of this chapter, I will include the domain as part of the metadata I will add to the extended properties.

Domains aren’t a requirement of logical or physical database design, nor does SQL Server actually make it easy for you to add them, but even if you just use them in a spreadsheet or design tool, they can enable easy and consistent design and are a great idea. Of course, consistent modeling is always a good idea regardless of whether you use a tool to do the work for you. I personally have seen a particular column type implemented in four different ways in five different columns when proper domain definitions were not available. So, tool or not, having a data dictionary that identifies columns that share a common type by definition is extremely useful.

For example, for the TopicName domain that’s used often in the Topic and MessageTopic tables in our ConferenceMessage model, the domain may have been specified by the contents of Table 6-2.

Table 6-2. Sample Domain: TopicName

Property             Setting
Name                 TopicName
Optional             No
Datatype             Unicode text, 30 characters
Value Limitations    Must not be empty string or only space characters
Default Value        n/a

I’ll defer the CHECK constraint and DEFAULT bits until later in this chapter, where I discuss implementation in more depth. Several tables will have a TopicName column, and you’ll use this template to build every one of them, which will ensure that every time you build one of these columns it will have a type of nvarchar(30). Note that we will discuss data types and their exact implementation later in this chapter.
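
As a quick preview of how the domain might translate to a column, here is a simplified sketch (the actual Topic table, constraint names, and sizes are defined later in the chapter):

CREATE TABLE TopicDemo
(
    TopicName nvarchar(30) NOT NULL
        --disallow empty strings and strings of only spaces, per the domain
        CONSTRAINT CHKTopicDemo$TopicName CHECK (LEN(RTRIM(TopicName)) > 0)
);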

A second domain that is used very often is SurrogateKey , shown in Table 6-3.

Table 6-3. Sample Domain: SurrogateKey

Property             Setting
Name                 SurrogateKey
Optional             When used for a primary key, not optional and typically auto-generated. When used as a nonkey (foreign key) reference, optionality is determined by how the nonkey column is used.
Datatype             int
Value Limitations    N/A
Default Value        N/A

This domain is a bit different, in that it will be implemented exactly as specified for a primary key attribute, but when it is migrated for use as a foreign key, some of the properties will be changed. First, it won’t have the IDENTITY property set. Second, an optional relationship will allow nulls in the migrated key, whereas the primary key would not allow them. Finally, let’s add one more domain definition to our sample: the UserHandle domain, shown in Table 6-4.

Table 6-4. Sample Domain: UserHandle

Property             Setting
Name                 UserHandle
Optional             No
Datatype             Basic character set, 20 characters maximum
Value Limitations    Must be 5–20 simple alphanumeric characters and must start with a letter
Default Value        n/a

In the next four subsections, I’ll discuss the following topics concerning how to implement domains:

  • Implementing as a column or table: You need to decide whether a value should simply be entered into a column or whether to implement a new table to manage the values.
  • Choosing the datatype: SQL Server gives you a wide range of datatypes to work with, and I’ll discuss some of the issues concerning making the right choice.
  • Choosing nullability: We’ll decide whether or not columns will allow NULL values as we implement the datatype choices in the example model.
  • Choosing a collation: The collation determines how data is sorted and compared.

Getting the datatype right is the first step in getting the implementation correct, so it can really help to spend a reasonable amount of time here making sure it is right. Too many databases end up with all datatypes the same size and nullable (except for primary keys, if they have them) and lose the integrity of having properly sized and constrained datatypes.

Implement as a Column or Table?

Although many domains have only minimal limitations on values, often a domain will specify a fixed set of named values that a column might have, which is far smaller than what would fit in one of the base datatypes. For example, in the demonstration table MessagingUser shown in Figure 6-14, a column AttendeeType has a domain of AttendeeType.

image

Figure 6-14. MessagingUser table for reference

This domain might be specified as in Table 6-5.

Table 6-5. AttendeeType Domain

Property             Setting
Name                 AttendeeType
Optional             No
Datatype             Basic character set, maximum 20 characters
Value Limitations    Regular, Volunteer, Speaker, Administrator
Default Value        Regular

The value limitation limits the values to a fixed list of values. We could choose to implement the column using a check constraint with a predicate of AttendeeType IN ('Regular', 'Volunteer', 'Speaker', 'Administrator') and a literal default value of 'Regular' (a sketch of this form appears after the following list). There are a couple of minor annoyances with this form:

  • There is no place for table consumers to know the domain: Unless you have a row with one of each of the values specified in the CHECK constraint and you do the dreaded DISTINCT query over the column, it isn’t easy to know what the possible values are without either having foreknowledge of the system or looking in the metadata. If you’re doing Conference Messaging system utilization reports by AttendeeType, it won’t be easy to find out what AttendeeTypes had no activity for a time period, certainly not using a simple, straightforward SQL query that has no hard-coded values.
  • Often, a value such as this could easily have additional information associated with it: For example, this domain might have information about actions that a given type of user could do. If, say, a Volunteer attendee is limited to using certain Topics, you would have to manage the types in a different table, with no real control on the spelling of the names. Ideally, if you define the domain value in a table, any other uses of the domain are easier to maintain.
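
Here is a minimal sketch of the constraint-only form just described (a throwaway demo table, not the model’s final MessagingUser DDL):

CREATE TABLE MessagingUserDemo
(
    MessagingUserDemoId int NOT NULL CONSTRAINT PKMessagingUserDemo PRIMARY KEY,
    AttendeeType        varchar(20) NOT NULL
        CONSTRAINT DFLTMessagingUserDemo$AttendeeType DEFAULT ('Regular')
        CONSTRAINT CHKMessagingUserDemo$AttendeeType
            CHECK (AttendeeType IN ('Regular','Volunteer','Speaker','Administrator'))
);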

I nearly always include tables for all domains that are essentially “lists” of items, as it is just far easier to manage, even if it requires more tables. The choice of key is a bit different than most tables, and sometimes, I use a surrogate key for the actual primary key and other times, use a natural key. The general difference is whether or not using the integer or GUID key value has value to the client’s implementation. In the model, I have two examples of such types of domain implementations. In Figure 6-15, I have added a table to implement the domain for attendee types, and for this table, I will use the natural key.

image

Figure 6-15. AttendeeType domain implemented as a table

This lets an application treat the value as if it is a simple value. So if the application wants to manage the value as simple string values, I don’t have to know about it from the database standpoint. I still get the value and validation that the table implementation affords me.

In the original model, we had the Topic table, shown in Figure 6-16, which is a domain similar to the AttendeeType, but that is designed to allow a user to make changes to the topic list.

image

Figure 6-16. Topic table for reference

The Topic entity has the special case that it can be added to by the application managers, so it will be implemented with a numeric surrogate value. We will initialize the table with a row that represents the user-defined topic, which allows the user to enter their own topic in the MessageTopic table. Note too that we discussed earlier that the Topic table would be loaded with a seed value for the user-defined topic.
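
A minimal sketch of seeding that special row follows; it assumes the Topic table already exists with an IDENTITY key, the value 0 and the column names are as discussed above, and the description text is purely illustrative (the real seed script appears later in the chapter):

SET IDENTITY_INSERT Topic ON;
INSERT INTO Topic (TopicId, TopicName, Description)
VALUES (0, 'User Defined', 'Topic that indicates the user will supply a name in MessageTopic');
SET IDENTITY_INSERT Topic OFF;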

Choosing the Datatype

Choosing proper datatypes to match the domain chosen during logical modeling is an important task. One datatype might be more efficient than another of a similar type. For example, you can store integer data in an integer datatype, a numeric datatype, a floating-point datatype or even a varchar(10) type, but these datatypes are certainly not alike in implementation or performance.

sq.jpg  Note   I have broken up the discussion of datatypes into two parts. First, there is this and other sections in this chapter in which I provide some basic guidance on the types of datatypes that exist for SQL Server and some light discussion on what to use. Appendix A at the end of this book is an expanded look at all of the datatypes and is dedicated to giving examples and example code snippets with all the types.

It’s important to choose the best possible datatype when building the column. The following list contains the intrinsic datatypes and a brief explanation of each of them:

  • Precise numeric data : Stores numeric data with no possible loss of precision.
    • bit : Stores either 1, 0, or NULL; frequently used for Boolean-like columns (1 = True, 0 = False, NULL = Unknown). Up to 8-bit columns can fit in 1 byte.
    • tinyint : Nonnegative values between 0 and 255 (1 byte).
    • smallint : Integers between –32,768 and 32,767 (2 bytes).
    • int : Integers between –2,147,483,648 and 2,147,483,647 (–2^31 to 2^31 – 1) (4 bytes).
    • bigint : Integers between –9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 (that is, –2^63 to 2^63 – 1) (8 bytes).
    • decimal ( numeric is a synonym): All numbers between –10^38 + 1 and 10^38 – 1 (between 5 and 17 bytes, depending on precision).
  • Approximate numeric data : Stores approximations of numbers, typically for scientific usage. Gives a large range of values with a high amount of precision but might lose precision of very large or very small numbers.
    • float(N) : Values in the range from –1.79E+308 through 1.79E+308 (storage varies from 4 bytes for N between 1 and 24, and 8 bytes for N between 25 and 53).
    • real : Values in the range from –3.40E+38 through 3.40E+38. real is a synonym for a float(24) datatype (4 bytes).
  • Date and time : Stores date values, including time of day.
    • date : Date-only values from January 1, 0001, to December 31, 9999 (3 bytes).
    • time : Time-only values to 100 nanoseconds (3 to 5 bytes).
    • datetime2(N) : Despite the hideous name, this type will store dates from January 1, 0001, to December 31, 9999, with accuracy ranging from 1 second (0) to 100-nanosecond accuracy (7) (6 to 8 bytes).
    • datetimeoffset : Same as datetime2, but includes an offset for time zone (8 to 10 bytes).
    • smalldatetime : Dates from January 1, 1900, through June 6, 2079, with accuracy to 1 minute (4 bytes). (Note: it is suggested to phase out usage of this type and use the more standards oriented datetime2, though smalldatetime is not technically deprecated.)
    • datetime : Dates from January 1, 1753, to December 31, 9999, with accuracy to 3.33 milliseconds (8 bytes). (Note: it is suggested to phase out usage of this type and use the more standards oriented datetime2, though datetime is not technically deprecated.)
  • Binary data : Strings of bits, for example, files or images. Storage for these datatypes is based on the size of the data stored.
    • binary(N) : Fixed-length binary data up to 8,000 bytes long.
    • varbinary(N) : Variable-length binary data up to 8,000 bytes long.
    • varbinary(max) : Variable-length binary data up to (2^31) – 1 bytes (2 GB) long. All the typical functionality of the varbinary columns is allowed on these types.
  • Character (or string) data :
    • char(N) : Fixed-length character data up to 8,000 characters long.
    • varchar(N) : Variable-length character data up to 8,000 characters long.
    • varchar(max) : Variable-length character data up to (2^31) – 1 bytes (2 GB) long. All the typical functionality of the varchar columns is allowed on these types.
    • nchar(N), nvarchar(N), nvarchar(max) : Unicode equivalents of char, varchar, and varchar(max).
  • Other datatypes:
    • sql_variant : Stores any datatype. It’s generally a bad idea to use this datatype, but it is handy in cases where you don’t know the datatype of a value before storing. Best practice would be to describe the type in your own metadata when using this type.
    • rowversion ( timestamp is a synonym): Used for optimistic locking to version-stamp a row. It changes on every modification. The name of this type was timestamp in all SQL Server versions before 2000, but in the ANSI SQL standards, the timestamp type is ­equivalent to the datetime datatype. I’ll demonstrate the rowversion datatype in detail in Chapter 10, which is about concurrency.
    • uniqueidentifier : Stores a GUID value.
    • XML : Allows you to store an XML document in a column. The XML type gives you a rich set of functionality when dealing with structured data that cannot be easily managed using typical relational tables. You shouldn’t use the XML type as a crutch to violate the First Normal Form by storing multiple values in a single column. I will not use XML in any of the designs in this book.
    • Spatial types (geometry, geography, circularString, compoundCurve, and curvePolygon): Used for storing spatial data, like for maps. I will not be using this type in this book.
    • hierarchyId : Used to store data about a hierarchy, along with providing methods for manipulating the hierarchy. We will cover more about manipulating hierarchies in Chapter 8.

Choice of datatype is a tremendously important part of the process, but if you have defined the domain well, it is not that difficult of a task. In the following sections, we will look at a few of the more important parts of the choice. A few of the considerations we will include are

  • Deprecated or bad choice types
  • Common datatype configurations
  • Large-value datatype columns
  • Complex datatypes

I didn’t use too many of the different datatypes in the sample model, because my goal was to keep the model very simple and not try to be an AdventureWorks-esque model that tries to show every possible SQL Server datatype in one model. In the upcoming chapters on patterns, we will include a good number of the datatypes in our examples, because there are good pattern usages of almost all of the common types.

Deprecated or Bad Choice Types

I didn’t include several datatypes in the previous list because they have been deprecated for quite some time, and it wouldn’t be surprising if they were completely removed from the version after 2012 (even though I said the same thing in the previous edition of this book, so be sure to stop using them as soon as possible). Their use was common in versions of SQL Server before 2005, but they’ve been replaced by types that are far easier to use:

  • image: Replace with varbinary(max)
  • text or ntext: Replace with varchar(max) and nvarchar(max)

If you have ever tried to use the text datatype in SQL code, you know it is not a pleasant thing. Few of the common text operators were implemented to work with it, and in general, it just doesn’t work like the other native types for storing string data. The same can be said with image and other binary types. Changing from text to varchar(max), and so on, is definitely a no-brainer choice.

The second set of types that I generally advise against using are the two money types:

  • money: Money values from –922,337,203,685,477.5808 through 922,337,203,685,477.5807 (8 bytes)
  • smallmoney: Money values from –214,748.3648 through 214,748.3647 (4 bytes)

In general, the money datatype sounds like a good idea, but using it has some confusing consequences. In Appendix A, I spend a bit more time covering these consequences, but here are two problems:

  • There are definite issues with rounding off, because intermediate results for calculations are calculated using only four decimal places.
  • Money data input allows a monetary sign (such as $ or £), but the sign is not stored, so inserting $100 and £100 results in the same value being represented in the variable or column.

Hence, it’s generally accepted that it’s best to store monetary data in decimal datatypes. This also gives you the ability to assign the numeric types to sizes that are reasonable for the situation. For example, in a grocery store, allowing the price of a grocery item to exceed 200,000 dollars is probably unnecessary, even figuring in a heck of a lot of inflation. Note that in Appendix A I will include a more thorough example of the types of issues you will see.
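
As a quick, hedged illustration of both problems (Appendix A covers them in more depth), the following sketch shows the intermediate rounding and the currency-symbol behavior; the exact decimal output will vary slightly by precision and scale:

 DECLARE @money1 money = 1.00, @money2 money = 800.00;
 SELECT @money1 / @money2 AS moneyResult;       --0.0012; the exact value 0.00125 is lost to four decimal places
 DECLARE @decimal1 decimal(19,4) = 1.00, @decimal2 decimal(19,4) = 800.00;
 SELECT @decimal1 / @decimal2 AS decimalResult; --keeps more fractional precision (0.00125...)
 SELECT CAST('$100' AS money) AS dollarValue,
        CAST('£100' AS money) AS poundValue;    --both return 100.00; the symbol is not stored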

Common Datatype Configurations

In this section, I will briefly cover concerns and issues relating to Boolean/logical values, large datatypes, and complex types and then summarize datatype concerns in order to discuss the most important thing you need to know about choosing a datatype.

Boolean/Logical Values

Booleans are another of the hotly debated choices that are made for SQL Server data. There’s no Boolean datatype in SQL Server (every SQL Server type must support NULL, and a nullable Boolean makes life far more difficult for the people who implement the engine), so a suitable datatype needs to be chosen through which to represent Boolean values. Truthfully, though, what we really want from a Boolean is the ability to say that the property of the modeled entity “is” or “is not” for some basic setting.

There are three common choices to implement a value of this sort (a minimal code sketch follows the list):

  • Using a bit datatype, where a value of 1 means True and 0 means False: This is, by far, the most common choice because it works directly with programming languages such as VB .NET with no translation. The check box and option controls can directly connect to these values, even though VB uses -1 to indicate True. It does, however, draw the ire of purists, because it is too much like a Boolean. Such columns are commonly named with “Flag” as a class word, like for a special sale indicator: SpecialSaleFlag. Some people who don’t do the suffix thing as a rule often start the name off with Is, like IsSpecialSale. Microsoft uses the Is prefix in the catalog views quite often, like in sys.databases: is_ansi_nulls_on, is_read_only, and so on.
  • A char(1) value with a domain of 'Y' and 'N', 'T' and 'F', or other values: This is the easiest for ad hoc users who don’t want to think about what 0 or 1 means, but it’s generally the most difficult from a programming standpoint. Sometimes, a char(3) is even better, to go with 'yes' and 'no'. Usually named the same as the bit version, just with slightly more attractive-looking output.
  • A full, textual value that describes the need: For example, for a preferred customer indicator, instead of PreferredCustomerFlag, use PreferredCustomerIndicator, with values 'Preferred Customer' and 'Not Preferred Customer'. Popular for reporting types of databases, for sure, it is also more flexible when more than two values are needed, since the database structure needn’t change if you need to add 'Sorta Preferred Customer' to the domain of PreferredCustomerIndicator.
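
A minimal sketch of what these three styles might look like as column declarations, using a hypothetical demo table (CHECK constraints are covered in detail later in the book):

 CREATE TABLE dbo.BooleanStyleDemo
 (
     SpecialSaleFlag     bit     NOT NULL,  --1 = True, 0 = False
     SpecialSaleCode     char(1) NOT NULL
         CONSTRAINT CHKBooleanStyleDemo_SpecialSaleCode
             CHECK (SpecialSaleCode IN ('Y','N')),
     PreferredCustomerIndicator varchar(30) NOT NULL
         CONSTRAINT CHKBooleanStyleDemo_PreferredCustomerIndicator
             CHECK (PreferredCustomerIndicator IN
                      ('Preferred Customer','Not Preferred Customer'))
 );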

As an example of a Boolean column in our messaging database, I’ll add a simple flag to the MessagingUser table that tells whether the account has been disabled, as shown in Figure 6-17. As before, we are keeping things simple, and in simple cases, a simple flag might do it. But of course, in a sophisticated system, you would probably want to have more information, like who did the disabling, and why they did it.

image

Figure 6-17. MessagingUser table with DisabledFlag bit column

Large-Value Datatype Columns

As of SQL Server 2005, dealing with large datatypes changed quite a bit. By using the max specifier on varchar, nvarchar, and varbinary types, you can store far more data than was possible in previous versions using a “normal” type, while still being able to deal with the data using the same functions and techniques you can on a simple varchar(10) column, though performance will differ slightly.

As with all datatype questions, use the varchar(max) types only when they’re required, and always use the smallest types possible. The larger the datatype, the more data it can hold, and the more trouble the row size can cause when trying to get optimal storage and retrieval times. In cases where you know you need large amounts of data, or where you sometimes need more than 8,000 bytes in a column, the max specifier is a fantastic thing.

sq.jpg  Note   Keep on the lookout for uses that don’t meet the normalization needs. Most databases have a “comments” column somewhere that morphs from comments to a semistructured mess that your DBA staff then needs to dissect using the dreaded SUBSTRING and CHARINDEX functions .

There are two special concerns when using these types:

  • There’s no automatic datatype conversion from the normal character types to the large-value types.
  • Because of the possible large sizes of data, a special clause is added to the UPDATE statement.

The first issue is pretty simple, but it can be a bit confusing at times. For example, concatenate '12345' + '67890'. You’ve taken two varchar(5) values, and the result will be contained in a value that is automatically recast as a varchar(10). But if you concatenate two varchar(8000) values, you don’t get a varchar(16000) value, and you don’t get a varchar(max) value. The values get truncated to a varchar(8000) value. This isn’t always intuitively obvious. For example, consider the following code:

 SELECT LEN(CAST(replicate('a',8000) AS varchar(8000))

          + CAST(replicate('a',8000) AS varchar(8000))

        );

This returns 8000, because the concatenation is done in a varchar(8000) intermediate value and the result is truncated. If you cast one of the varchar(8000) values to varchar(max), then the result will be 16,000:

 SELECT LEN(CAST(replicate('a',8000) AS varchar(max))

          + CAST(replicate('a',8000) AS varchar(8000))

        );

Second, because the size of columns stored using the varchar(max) datatype can be so huge, it wouldn’t be favorable to always pass around these values just like you do with smaller values. Because the maximum size of a varchar(max) value is 2 GB, imagine having to update a value of this size in its entirety. Such an update would be pretty nasty, because the client would need to get the whole value, make its changes, and then send the value back to the server. Most client machines may only have 2 GB of physical RAM, so paging would likely occur on the client machine, and the whole process would crawl and probably crash. So, you can do what are referred to as chunked updates. These are done using the .WRITE clause in the UPDATE statement. For example

 UPDATE TableName

 SET varcharMaxCol.WRITE('the value', <offset>, <length>)

 WHERE . . .
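
As a minimal, hedged illustration (the table and column names here are hypothetical), the following shows a chunked update that replaces the first five characters of a large value:

 CREATE TABLE dbo.LargeValueDemo
 (
     LargeValueDemoId int NOT NULL IDENTITY(1,1),
     varcharMaxCol    varchar(max) NOT NULL
 );
 INSERT INTO dbo.LargeValueDemo (varcharMaxCol)
 VALUES (REPLICATE(CAST('a' AS varchar(max)), 10000));

 --.WRITE(expression, offset, length); offsets are zero based
 UPDATE dbo.LargeValueDemo
 SET    varcharMaxCol.WRITE('bbbbb', 0, 5)
 WHERE  LargeValueDemoId = 1;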

One important thing to note is that varchar(max) values will easily cause the size of rows to go greater than the 8060-byte limit with the data placed on overflow pages. Overflow pages are not terribly efficient because SQL Server has to go fetch extra pages that will not be in line with other data pages. (Overflow pages are covered more in Chapter 10 when the physical structures are covered.)

I won’t go over large types in any more detail at this point. Just understand that you might have to treat the data in the (max) columns differently if you’re going to allow large quantities of data to be stored. In our model, we’ve used a varbinary(max) column in the Customer table to store the image of the customer.

The main point to understand here is that having a datatype with virtually unlimited storage comes at a price. SQL Server 2008 allows you some additional freedom when dealing with varbinary(max) data by placing it in the file system using what is called filestream storage. I will discuss large object storage in Chapter 8 in more detail, including filestreams.

User Defined Type/Alias

One excellent-sounding feature that you can use to help make your code cleaner is a user-defined type, which in this form is really an alias for an intrinsic type. I don’t want to get too much into syntax yet, but you can use a datatype alias to specify a commonly used datatype configuration that’s used in multiple places, using the following syntax:

 CREATE TYPE <typeName>

   FROM <intrinsic type> --any type that can be used as a column of a

                         --table, with precision and scale or length,

                         --as required by the intrinsic type

   [NULL | NOT NULL]

When declaring a table, if nullability isn’t specified, then NULL or NOT NULL is based on the setting of ANSI_NULL_DFLT_ON, except when using an alias type (variables will always be nullable). In general, it is best to always specify the nullability in the table declaration.

For example, consider the UserHandle column. Earlier, we defined its domain as being varchar(20), not optional, alphanumeric, with the data required to be between 5 and 20 characters. The datatype alias would allow us to specify:

 CREATE TYPE UserHandle FROM varchar(20) NOT NULL;

Then, in the table create statement, we could specify

 CREATE TABLE MessagingUser

 …

 UserHandle UserHandle,

By declaring that the UserHandle type will be varchar(20), you can ensure that every place the UserHandle type is used, in table declarations and variable declarations, it will be varchar(20), and it will be NOT NULL as long as you don’t specify NULL or NOT NULL. It is not possible to implement the requirement that data be between 5 and 20 characters, or any other constraint beyond the null specification, on the alias type itself.

For another example, consider an SSN type. It’s char(11), so you cannot put a 12-character value in, sure. But what if the user had entered 234433432 instead of including the dashes? The datatype would have allowed it, but it isn’t what’s desired. The data will still have to be checked using other methods, such as CHECK constraints.
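
A minimal sketch of that situation, using a hypothetical SSNType alias (the value without dashes fits the type, so nothing stops it):

 CREATE TYPE SSNType FROM char(11) NOT NULL;
 DECLARE @ssn SSNType = '234433432'; --no dashes, but only 9 characters, so the type accepts it
 SELECT @ssn AS ssnValue;            --returns '234433432  ' padded to 11 characters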

I am personally not a user of these types, because you cannot do anything with them other than simply alias an intrinsic type. Any changes to the type also require removal of all references to the type.

I will note, however, that I have a few architect friends who make extensive use of them to help keep data storage consistent. I have found that using domains and a data modeling tool serves me a bit better, but I do want to make sure that you have at least heard of them and know the pros and cons.

Complex Datatypes

In SQL Server 2005 and later, we can build our own datatypes using the SQL CLR. Unfortunately, they are quite cumbersome, and the implementation of these types does not lend itself to the types behaving like the intrinsic types. Utilizing CLR types requires you to install the type on the client for the client to get the full benefit of the type.

Hence, while it is possible to do, for the most part, you should use them only in cases where there is a very compelling reason to do so. There are a few different possible scenarios where you could reasonably use user-defined types (UDTs) to extend the SQL Server type system with additional scalar types or different ranges of data of existing datatypes. Some potential uses of UDTs might be:

  • Complex types that are provided by an owner of a particular format, such as a media format that could be used to interpret a varbinary(max) value as a movie or an audio clip. This type would have to be loaded on the client to get any value from the datatype.
  • Complex types for a specialized application that has complex needs, when you’re sure your application will be the only user.

Although the possibilities are virtually unlimited, I suggest that CLR UDTs be considered only for specialized circumstances that make the database design extremely more robust and easy to work with. CLR UDTs are a nice addition to the DBA’s and developer’s toolkit, but they should be reserved for those times when adding a new scalar datatype solves a business problem.

In SQL Server 2008, Microsoft provided several CLR user-defined types to implement hierarchies and spatial datatypes. I point this out here to note that if Microsoft is using the CLR to implement complex types (and the spatial types at the very least are pretty darn complex), the sky is the limit. I should note that the spatial and hierarchyId types push the limits of what should be in a type, and some of the data stored (like a polygon) is really an array of connected points.

The Most Important Consideration for Choosing Datatypes

When all is said and done, the most important consideration when choosing a datatype is to keep things simple and choose the right types for the job. SQL Server gives you a wide range of datatypes, and many of them can be declared in a wide variety of sizes. I never cease to be amazed by the number of databases around where every single column is either an integer, a varchar(N) (where N is the same for every single string column), or a varchar(max). One particular example I’ve worked with had everything, including GUID-based primary keys, stored in nvarchar(200) columns! It is bad enough to store your GUIDs in a varchar column at all, since a GUID is stored natively as a 16-byte binary value, and as a varchar column it will take 36 bytes; however, store it in an nvarchar column, and now it takes at least 72 bytes! What a hideous waste of space. Even worse, now all data could be up to 200 characters wide, even if you plan to give entry space for only 30 characters. Now, people using the data will feel like they need to allow for 200 characters on reports and such for the data. Time wasted, space wasted, money wasted.

As another example, say you want to store a person’s name and date of birth. You could choose to store the name in a varchar(max) column and the date of birth in a varchar(max) column. In all cases, these choices would certainly store the data that the user wanted, but they wouldn’t be good choices. The name should be in something such as a varchar(30) column and the date of birth in a date column. Notice that I used a variable size type for the name. This is because you don’t know the length and not all names are the same size. Because most names aren’t nearly 30 bytes, using a variable-sized type will save space in your database.

Of course, in reality, seldom would anyone make such poor choices of a datatype as putting a date value in a varchar(max) column. Most choices are reasonably easy. However, it’s important to keep in mind that the datatype is the first level of domain enforcement. Thinking back to our domain for UserHandle, we had the following datatype definition and value limitations:

Table 6-6. Sample Domain: UserHandle

Property           Setting
------------------ ------------------------------------------------------------------------
Name               UserHandle
Optional           no
Datatype           Basic character set, maximum 20 characters
Value Limitations  Must be between 5 and 20 characters, simple alphanumeric, and start with a letter
Default Value      n/a

You can enforce the first part of this at the database level by declaring the column as a varchar(20). A column of type varchar(20) won’t even allow a 21-character or longer value to be entered. It isn’t possible to enforce the rule of greater than or equal to five characters using only a datatype. I’ll discuss more about how to enforce simple domain requirements later in this chapter, and in Chapter 9, we will discuss patterns of integrity enforcements that are more complex.
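
For instance, a minimal sketch using a hypothetical temporary table shows the datatype doing this first level of enforcement on its own:

 CREATE TABLE #HandleTest (UserHandle varchar(20) NOT NULL);
 INSERT INTO #HandleTest (UserHandle)
 VALUES ('WayTooLongUserHandleValue'); --25 characters
 --fails: String or binary data would be truncated.
 DROP TABLE #HandleTest;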

In the earlier part of the process, we defined domains for every one of our columns (well, theoretically, in actuality some of them are simply named now, but we will make assumptions about each column in this chapter, so we can bring it in under 200 pages).

Initially, we had the model in Figure 6-18 for the MessagingUser table.

image

Figure 6-18. MessagingUser table before choosing exact datatypes

Choosing types, we will use an int for the surrogate keys, replacing items of the SurrogateKey domain with int types (in the DDL section, we will implement the rest of the optionality rules set in the domain: “Not optional, auto generated for keys, optionality determined by utilization for nonkey”). UserHandle was discussed earlier in this section. In Figure 6-19, I chose some other basic types for the Name, AccessKeyValue, and AttendeeType columns.

image

Figure 6-19. MessagingUser after datatype choice

Sometimes, you won’t have any real domain definition, and you will use common sizes. For these, I suggest either using a standard size (if you can find one, like on the Internet) or looking through data you already have in your system. Until the system gets into production, changing types is fairly easy from a database standpoint, but the more code that accesses the structures, the more difficult it gets to make changes.
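
For example, a minimal sketch of the kind of ad hoc query you might run against existing data to pick a size (the database, table, and column names here are hypothetical):

 SELECT MAX(LEN(CompanyName)) AS longestValue,
        AVG(LEN(CompanyName)) AS averageLength
 FROM   ExistingDatabase.dbo.Customer;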

For the Message table in Figure 6-20, we will choose types.

image

Figure 6-20. Message table before datatype choice

The Text column isn’t the text datatype but rather the text of the message, limited to 200 characters. For the time columns in Figure 6-21, I chose datetime2(0) for MessageTime, since the requirements specified time down to the second. For RoundedMessageTime, we are rounding to the hour, so I also expect it to be datetime2(0), though it will be a calculated column based on the MessageTime value. Hence, MessageTime and RoundedMessageTime are two views of the same data value.

image

Figure 6-21. Message table after datatype choice, with calculated column denoted

Since I am going to use a calculated column, as shown in Figure 6-21, I will specify the type of RoundedMessageTime as a nonexistent datatype for now (so if I try to create the table before implementing the calculation, it will fail). A calculated column is a special type of column that isn’t directly modifiable, as it is based on the result of an expression.

Later in this chapter, we will specify the actual implementation, but for now, we basically just set a placeholder. Of course, in reality, I would specify the implementation immediately, but again, for this first learning process, I am doing things in this deliberate manner to keep things orderly. So, in Figure 6-22, I have the model with all of the datatypes set.

image

Figure 6-22. Messaging system model after datatype choices

Choosing Nullability

The next step in the process is to set nullability of columns. In our domains, we specified if the columns were optional, so this will generally be a simple task. For the Message table in Figure 6-23, I have chosen the following nullabilities.

image

Figure 6-23. Message table for review

The interesting choice was for the two MessagingUserId columns. In Figure 6-24, you can see the full model, but note the relationships from MessagingUser to Message. The relationship for the user that sent the message (MessagingUserId) is NOT NULL, because every message is sent by a user. However, the relationship representing the user the message was sent to is nullable, since not every message needs to be sent to a user.

image

Figure 6-24. Messaging System model, with NULLs chosen

At this point, our model is very nearly done and very much resembles a database that could be used. Just a bit more information is needed to finish out the model.

Choosing a Collation

Many character sets are used by the many different cultures around the world. While you can choose a Unicode datatype if you need to store the characters for almost any character set, there still is the question of how data is sorted (case sensitive or not) and compared (accent sensitive or not). The collation sequence determines how data is sorted and compared. SQL Server and Windows provide a tremendous number of collation types to choose from. The collation is specified at many levels, starting with the server. The server collation determines how the system metadata is stored and compared. Then the database has a collation, and finally, each column may have a different collation.

It’s a somewhat uncommon need for the average database to change the collation from the default, which is usually chosen to be the most useful for most uses. This is usually a case-insensitive collation, which allows that when doing comparisons and sorts, 'A' = 'a'. I’ve only used an alternative collation a few times for columns where case sensitivity was desired (one time was so that a client could force more four-character codes than a case-insensitive collation would allow!).

To see the current collation type for the server and database, you can execute the following commands:

 SELECT serverproperty('collation');

 SELECT databasepropertyex('MovieRental','collation');

On most systems installed in English-speaking countries, the default collation type is SQL_Latin1_General_CP1_CI_AS, where Latin1_General represents the normal Latin alphabet, CP1 refers to code page 1252 (the SQL Server default Latin 1 ANSI character set), and the last parts represent case insensitive and accent sensitive, respectively. You can find full coverage of all collation types in the SQL Server 2012 documentation.

To list all the sort orders installed in a given SQL Server instance, you can execute the following statement:

 SELECT *

 FROM fn_helpcollations();

On the computer on which I do testing, this query returned more than 1,000 rows, but usually, you don’t need to change from the default that the database administrator initially chooses. To set the collation sequence for a char, varchar, text, nchar, nvarchar, or ntext column when creating a column, you specify it using the COLLATE clause of the column definition, like so:

 CREATE TABLE alt.OtherCollate

 (

  OtherCollateId integer IDENTITY

      CONSTRAINT PKAlt_OtherCollate PRIMARY KEY ,

  Name nvarchar(30) NOT NULL,

  FrenchName nvarchar(30) COLLATE French_CI_AS_WS NULL,

  SpanishName nvarchar(30) COLLATE Modern_Spanish_CI_AS_WS NULL

 );

Now, when you sort output by FrenchName, it’s case insensitive, but arranges the rows according to the order of the French character set. The same applies with Spanish, regarding the SpanishName column. For this chapter we will stick with the default, and I would suggest taking a look at books online if you have the need to store data in multiple languages.

One quick note, you can specify the collation in a WHERE clause using the COLLATE keyword:

 SELECT Name

 FROM alt.OtherCollate

 WHERE Name COLLATE Latin1_General_CS_AI

         LIKE '[A-Z]%' collate Latin1_General_CS_AI; --case sensitive and

                                                   --accent insensitive

It is important to be careful when choosing a collation that is different from the default because at the server level it is extremely hard to change, and at the database level it is no picnic. You can change the collation of a column with an ALTER command, but it can’t have constraints or indexes referencing it, and you may need to recompile all of your objects that reference the tables.
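
For example, a minimal sketch of changing a column’s collation, assuming the alt.OtherCollate table from earlier and no constraints or indexes referencing the Name column:

 ALTER TABLE alt.OtherCollate
     ALTER COLUMN Name nvarchar(30) COLLATE Latin1_General_CS_AS NOT NULL;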

Setting Up Schemas

A schema is a namespace: a container where database objects are contained, all within the confines of a database. We will use them to group our tables in functional groups. Naming schemas is a bit different from naming tables or columns. Schema names should sound right, so sometimes it makes sense for them to be plural, and other times singular. It depends on how they are being used. I find myself using plural names most of the time because it sounds better, and because sometimes you will end up with a table named the same thing as the schema if both are singular.

In our model in Figure 6-25, we will put the tables that are used to represent messages in a Messages schema, and the ones that represent Attendees and their relationships to one another in a schema we will name Attendees.

image

Figure 6-25. Messages model with schemas assigned

Note, too, that I often will set up schemas late in the process, and it might seem more correct to start there. I find that it is often easier to discover the different areas of the implementation, and that schemas aren’t necessarily easy to start with, but that different areas come and go until I get to the final solution. Sometimes, it is by necessity because you have multiple tables with the same name, though this can be a sign of a bad design. In this manufactured solution, I simply did it last to make the point that it could be last.

What makes schemas so nice is that you can deal with permissions on a schema level, rather than on an object-by-object level. Schemas also give you a logical grouping of objects when you view them within a list, such as in Management Studio.

I’m not going to go any further into the security aspects of using schemas at this point in the book, but I’ll just mention that they’re a good idea. Throughout this book, I’ll always name the schema that a table is in when doing examples. Schemas will be part of any system I design in this book, simply because it’s going to be best practice to do so going further. On a brief trip back to the land of reality, I would expect that beginning to use schemas in production systems will be a slow process, because it hasn’t been the normal method in years past. Chapter 9 will discuss using schemas for security in more detail.

Adding Implementation Columns

Finally, I will add one more thing to the database, and these are columns to support the implementation. A very common use is to have columns that indicate when the row was created, when it was updated, by whom, and so on. In our model, I will stick to the simple case of the times mentioned and will demonstrate how to implement this in the database. A lot of implementers like to leave these values to the client, but I very much prefer using the database code because then we have one clock managing times, rather than multiples.

So in Figure 6-26, I add two NOT NULL columns, RowCreateTime and RowLastUpdateTime, to every table except the AttendeeType table. Since we specified that table as not user manageable, I chose not to include the modified columns for it. Of course, you might want to include them anyway to let the user know when the row was first available.

image

Figure 6-26. Message model after adding RowCreateTime and RowLastUpdateTime to tables

As a final note, it is generally best to only use these implementation columns strictly for metadata purposes. For example, consider the Messages.Message table. If you need to know when the message was created, you should use the MessageTime column as that value may represent the time when the user clicked the create button, even if it took five minutes to actually store the data. If you want to know when the row was created that represented the message, use the RowCreateTime. That is why I use such clunky names for the implementation column. Many tables will include the creation time, but that data may be modifiable. I don’t want users changing the time when the row was created, so the name notes that the time of creation is strictly for the row, and I don’t allow this column to be modified by anyone.

Sometimes, I will use these columns in concurrency control, but most of the time, I will use a rowversion type if the client can (and will) make use of it. Concurrency control is a very important topic that I will spend a full chapter on (Chapter 11).

Using DDL to Create the Database

So far, we have been molding the model to make it fit our needs to implement. Columns were added, tables were added, and constraints were specified. Now, in this latter half of the chapter, we move toward the mechanical bits of the process, in that all that’s left is to implement the tables we have spent so much time designing. The blueprints have been drawn up, and now, we can finally grab a hammer and start driving nails.

Just like in the rest of this book, I’ll do this work manually using DDL, because it will help you understand what a tool is building for you. It’s also a good exercise for any database architect or DBA to review the SQL Server syntax; I personally wouldn’t suggest doing this on a database with 300 tables, but I definitely do know people who do this and wouldn’t consider using a tool to create any of their database objects. On the other hand, the same data modeling tools that could be used to do the logical modeling can usually create the tables and often some of the associated code, saving your fingers from added wear and tear, plus giving you more time to help Mario save the princess who always seems to get herself captured. No matter how you do the work, you need to make sure that you end up with scripts of DDL that you or the tool uses to create objects in some manner in the file system, because they’re invaluable tools for the DBA to apply changes to production, test, development, QA, or whatever environments have been set up to allow developers, users, and DBAs to coexist throughout the process.

It is not uncommon for DBAs to do all their work using scripts and never use a database design/generation tool, especially when they work for a company with smallish resources that they have already spent purchasing gold-plated laptops for all of the developers. Make sure that your scripts are in a source control system too, or at the very least backed up. Now in SQL Server 2012, we have two tools that we can work in: Management Studio and the newer SQL Server Data Tools (SSDT). SQL Server Data Tools is the development-oriented tool that is intended to replace Management Studio for developer tasks and introduces an offline way of creating database code. In this chapter, I am going to stick to the online paradigm where I create a database directly in the server that most readers who have any experience will understand naturally.

Before starting to build anything else, you’ll need a database. I’ll create this database using all default values, and my installation is very generic on my laptop. I use the Developer Edition, and I used all the default settings when installing. I hate to state the completely obvious, but you’ll need to do this with an account that has rights to create a database, especially if you’re doing this on a shared server, such as your company development server.

Choosing a database name is just as important as naming other objects, and I tend to take the same sort of naming stance. Keep it as simple as possible while differentiating it from all other databases, and follow the naming standards in place for your organization. I would also be careful to standardize names across instances of SQL Server to allow moving of databases from server to server. In the code downloads, I will name the database ConferenceMessaging.

The steps I’ll take along the way are as follows:

  • Creating the basic table structures: Building the base objects with columns.
  • Adding uniqueness constraints: Using primary and unique constraints to enforce uniqueness between rows in the table.
  • Building default constraints: Assisting users in choosing proper values when it isn’t obvious.
  • Adding relationships: Defining how tables relate to one another (foreign keys).
  • Implementing Basic Check Constraints: Some domains need to be implemented a bit more strictly than using a simple datatype.
  • Documenting the database: Including documentation directly in the SQL Server objects.
  • Validating the dependency information: Using the catalog views and dynamic management views, you can validate that the objects you expect to depend on the existence of one another do, in fact, exist, keeping your database cleaner to manage.

I will use the following statement to create a small database in the directories I have created:

 CREATE DATABASE ConferenceMessaging ON

  PRIMARY ( NAME = N'ConferenceMessaging', FILENAME =

       N'C:\SQL\DATA\ConferenceMessaging.mdf' ,

    SIZE = 1024MB , MAXSIZE = 1024MB)

  LOG ON ( NAME = N'ConferenceMessaging_log', FILENAME =

       N'C:\SQL\LOG\ConferenceMessaging_log.ldf' ,

    SIZE = 100MB, MAXSIZE = 2048GB , FILEGROWTH = 100MB);

Or, if you want to just take the defaults, you can simply execute CREATE DATABASE ConferenceMessaging. In either case, you can see where the database files were placed by running the following statement (note that size is presented in 8 KB pages—more on the internal structures of the database storage in Chapter 10):

 SELECT type_desc, size*8/1024 as [size (MB)],physical_name

 FROM sys.master_files

 WHERE database_id = db_id('ConferenceMessaging');

This returns

type_desc size (MB) physical_name
------------ --------------- ------------------------------------
ROWS 1024 C:\SQL\DATA\ConferenceMessaging.mdf
LOG 100 C:\SQL\LOG\ConferenceMessaging_log.ldf

Next, we want to deal with the owner of the database. The database is owned by the user who created the database, as you can see from the following query (note that you should be in the context of the ConferenceMessaging database for this to work):

 --determine the login that is linked to the dbo user in the database

 SELECT   suser_sname(sid) as databaseOwner

 FROM     sys.database_principals

 WHERE    name = 'dbo';

On my instance, I created the database using a user named denali on a machine named DENALI-PC:

databaseOwner
--------------
denali-PC\denali

You can see the owner of all databases on an instance using the following query:

 --Get the login of the owner of each database on the instance

 SELECT   SUSER_SNAME(owner_sid) as databaseOwner, name

 FROM     sys.databases;

On a typical corporate production server, I almost always will set the owner of the database to be the system administrator account so that all databases are owned by the same user. The only reason to not do this is when you are sharing databases or when you have implemented cross-database security that needs to be different for multiple databases (more information about security in Chapter 9). You can change the owner of the database by using the ALTER AUTHORIZATION statement:

 ALTER AUTHORIZATION ON Database::ConferenceMessaging TO SA;

Going back and checking the code, you will see that the owner is now SA.

sq.jpg  Tip   Placing semicolons at the end of every statement in your T-SQL is fast becoming a standard that will, in future versions of SQL Server, be required.

Creating the Basic Table Structures

The next step is to create the basic tables of the database. In this section, we will form CREATE TABLE statements to create the tables. The following is the basic syntax for the CREATE TABLE statement:

 CREATE TABLE [<database>.][<schema>.]<tablename>

 (

     <column specification>

 );

If you look in Books Online, you will see a lot of additional settings that allow you to place the table on a filegroup, partition the table onto multiple filegroups, control where maximum/overflow data is placed, and so on. Where (and why) to place your data on different filegroups other than the default will be discussed in Chapter 10 on table structures and indexing. The reason for not doing this now is that it is important to segregate the tuning process from the basic data storage implementation. Once you have the database created, developers can use it to start creating code, and tuning can commence as the usage patterns emerge.

sq.jpg  Tip   Don’t make this your only source of information about the DDL in SQL Server. Books Online is another great place to get exhaustive coverage of the DDL, and other sorts of books will cover the physical aspects of table creation in great detail. In this book, we focus largely on the relational aspects of database design with enough of the physical implementation to start you on the right direction. Hardware and tuning are really deep subjects that can spin in many different directions based on just how busy and large your databases are.

The base CREATE clause is straightforward:

 CREATE TABLE [<database>.][<schema>.] <tablename>

I’ll expand on the items between the angle brackets ( < and >). Anything in square brackets ([ and ]) is optional.

  • <database>: It’s seldom necessary to specify the database in the CREATE TABLE statements. If not specified, this defaults to the current database where the statement is being executed. Specifying the database means that the script will be executable only in a single database, which precludes us from using the script unchanged to build alternately named databases on the same server, should the need arise.
  • <schema>: This is the schema to which the table will belong. We specified a schema in our model, and we will create schemas as part of this section.
  • <tablename>: This is the name of the table.

For the table name, if the first character of the table name is a single # symbol, the table is a temporary table. If the first two characters of the table name are ##, it’s a global temporary table. Temporary tables are not so much a part of database design as a mechanism to hold intermediate results in complex queries, so don’t use them in your database design. You can also create a table variable, which has the same scope as a variable, by declaring it with an @ in front of the name; table variables can be used to hold small sets of data.
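
For example, a minimal sketch of a table variable used to hold a small intermediate set (the column list here is hypothetical):

 DECLARE @WorkingSet TABLE
 (
     MessagingUserId int NOT NULL,
     UserHandle      varchar(20) NOT NULL
 );
 INSERT INTO @WorkingSet (MessagingUserId, UserHandle)
 VALUES (1, 'FredFlintstone');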

The combination of schema and tablename must be unique in a database. In versions of SQL Server prior to 2005, the second part of the name was the owner, and almost every best-practice guide would suggest that all tables were owned by the dbo (database user).

Schema

As discussed in the earlier section where we defined schemas for our database, a schema is a namespace: a container where database objects are contained, all within the confines of a database. One thing that is nice is that because the schema isn’t tightly tied to a user, you can drop the user without changing the exposed name of the object. Changing owners of the schema changes owners of the table. (This is done using the ALTER AUTHORIZATION statement.)

In SQL Server 2000 and earlier, the table was owned by a user, which made using two (or more) part names difficult. Without getting too deep into security, objects owned by the same user are easier to handle with security due to ownership chaining. If one object references another and they are owned by the same user, the ownership chain isn’t broken. So we had every object owned by the same user.

Starting with SQL Server 2005, a schema is owned by a user, and tables are contained in a schema. Just as in 2000, the generally suggested best practice was that all tables were owned by the dbo user. Now, this is done by having the schema owned by dbo, but this doesn’t mean you have to have every schema named dbo since an object in schema A can reference objects in schema B without breaking the ownership chain as long as they are both owned by dbo.

Not just tables are bound to a given schema; just about every object is schema bound. You can access objects using the following naming method:

 [<databaseName>.][<schemaName>.]objectName

The <databaseName> defaults to the current database. The <schemaName> defaults to the user’s default schema. In general, it is best to always specify the schema in any and all SQL statements because it saves SQL the work of having to decide which schema to use (when the schema is not specified, the call is considered to be caller dependent because what it refers to may change user to user).

Schemas are of great use to segregate objects within a database for clarity of use. In our database, we have already specified two schemas, Messages and Attendees. The basic syntax is simple, just CREATE SCHEMA <schemaName> (it must be the first statement in the batch.) So I will create them using the following commands:

 CREATE SCHEMA Messages; --tables pertaining to the messages being sent

 GO

 CREATE SCHEMA Attendees; --tables pertaining to the attendees and how they can send messages

 GO

You can view the schemas created using the sys.schemas catalog view:

 SELECT name, USER_NAME(principal_id) as principal

 FROM   sys.schemas

 WHERE  name <> USER_NAME(principal_id); --don't list user schemas

This returns

name principal
---------- ------------
Messages dbo
Attendees dbo

Sometimes, schemas end up owned by a user other than dbo, like when a developer without db_owner privileges creates a schema. You can change the ownership using ALTER AUTHORIZATION, just like for the database:

 ALTER AUTHORIZATION ON SCHEMA::Messages To DBO;

As a note, it is suggested to always specify the two-part name for objects in code. It is safer, because you know what schema it is using, and it doesn’t need to check the default on every execution. However, for ad-hoc access, it can be annoying to type the schema if you are commonly using a certain schema. You can set a default schema for a user in the CREATE and ALTER USER statements, like this:

 CREATE USER <schemaUser>

        FOR LOGIN <schemaUser>

        WITH DEFAULT_SCHEMA = schemaname;

There’s also an ALTER USER command that allows changing the default schema for existing users (and in SQL Server 2012, it now works for Windows Group–based users as well; for 2005–2008 R2, it only worked for standard users).
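
A minimal sketch (the user name Fred is hypothetical):

 ALTER USER Fred WITH DEFAULT_SCHEMA = Messages;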

Columns and Base Datatypes

The next part of the CREATE TABLE statement is for the column specifications:

 CREATE TABLE [<database>.][<schema>.]<tablename>

  (

  <columnName> <datatype> [<NULL specification>]

        [IDENTITY [(seed,increment)]]

  --or

  <columnName> AS <computed definition>

  );

The <columnName> placeholder is where you specify the name of the column.

There are two types of columns:

  • Implemented: This is an ordinary column, in which physical storage is allocated and data is stored for the value.
  • Computed (or virtual): These columns are made up by a calculation derived from any of the physical columns in the table.

Most of the columns in any database will be implemented columns, but computed columns have some pretty cool uses, so don’t think they’re of no use just because they aren’t talked about much. You can avoid plenty of code-based denormalizations by using computed columns. In our example tables, we specified one computed column shown in Figure 6-27.

image

Figure 6-27. Message table with computed column highlighted

So the basic columns (other than the computed column) are fairly simple, just name and datatype:

 MessageId               int,

 SentToMessagingUserId   int,

 MessagingUserId         int,

 Text                    nvarchar(200),

 MessageTime             datetime2(0),

 RowCreateTime           datetime2(0),

 RowLastUpdateTime       datetime2(0)

The requirements called for the user not to send the same message more than once an hour. So we need an expression that takes the MessageTime value, which is a datetime2(0) at a level of seconds, and rounds it to the hour. So first, we develop an expression that will do this. I start out with a variable of the type of the column we are deriving from and then set it to some value. I start with a variable of datetime2(0) and load it with the time from SYSDATETIME():

 declare @time datetime2(0)

 set @time = SYSDATETIME()

Next, I write the following expression:

 dateadd(hour,datepart(hour,@time),cast(cast(@time as date)as datetime2(0)) )

which can be broken down fairly simply, but basically takes the number of hours since midnight and adds that to the date-only value by casting it to a date and then to a datetime2, which allows you to add hours to it. Once the column is tested, you replace the variable with the MessageTime column. So, our calculated column will be specified as follows:

 ,RoundedMessageTime as dateadd(hour,datepart(hour,MessageTime),

    cast(cast(MessageTime as date)as datetime2(0)) ) PERSISTED

The persisted specification indicates that the value will be calculated and saved as part of the definition and storage of the table, just like a fully implemented column. In order to be persisted, the expression must be deterministic, which basically means that for the same input, you will always get the same output (much like we covered in normalization). You can also use a deterministic expression as a column in an index (we will use it as part of the uniqueness constraint for this table). So an expression like getdate() is possible, but you could not index it.
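
To sanity check the expression before committing it to the table definition, a quick test with an arbitrary literal (a minimal sketch) can help:

 DECLARE @time datetime2(0) = '2012-06-15 14:35:27';
 SELECT dateadd(hour,datepart(hour,@time),
                cast(cast(@time as date) as datetime2(0))) AS RoundedToHour;
 --returns 2012-06-15 14:00:00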

Nullability

In the column-create phrase, simply change the <NULL specification> in your physical model to NULL to allow NULLs, or NOT NULL not to allow NULLs:

 <columnName> <data type> [<NULL specification>]

There’s nothing particularly surprising here. For the noncomputed columns in the Messages.Message table back in Figure 6-27, we will specify the following nullabilities:

 MessageId                 int             NOT NULL ,

 SentToMessagingUserId     int             NULL ,

 MessagingUserId           int             NOT NULL ,

 Text                      nvarchar(200)   NOT NULL ,

 MessageTime               datetime2(0)    NOT NULL ,

 RowCreateTime             datetime2(0)    NOT NULL ,

 RowLastUpdateTime         datetime2(0)    NOT NULL

sq.jpg  Note   If you leave off the NULL specification altogether, the SQL Server default is used. To determine the current default property for a database, check the is_ansi_null_default_on column for the database in sys.databases; it can be changed using ALTER DATABASE … SET ANSI_NULL_DEFAULT. It is so much of a best practice to ALWAYS specify the nullability of a column that I won’t attempt to demonstrate how the default works, as it is quite confusing.
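
For example, a minimal sketch of checking the setting for our database:

 SELECT name, is_ansi_null_default_on
 FROM   sys.databases
 WHERE  name = 'ConferenceMessaging';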

Managing Nonnatural Primary Keys

Finally, before getting too excited and completing the table creation script, there’s one more thing to discuss. Earlier in this chapter, I discussed the basics of using a surrogate key in your table. In this section, I’ll present the method that I typically use. I break down surrogate key values into the types that I use:

  • Manually managed, where you let the client choose the surrogate value, usually a code or short value that has meaning to the client. Sometimes, this might be a meaningless value, if the textual representation of the code needs to change frequently, but often, it will just be a simple code value. For example, if you had a table of U.S. states, you might use 'TN' for the code of the state of Tennessee.
  • Automatically generated, using the IDENTITY property or a GUID stored in a uniqueidentifier type column.
  • A cross between the two, where you use the IDENTITY property for user created data but manually load some values to give programmers direct access to the surrogate value.

Of course, if your tables don’t use any sort of surrogate values, you can move on to the next section.

Manually Managed

In the example model, I have one such situation in which I set up a domain table where I won’t allow users to add or subtract rows from the table. Changes to the rows in the table could require changes to the code of the system and application layers of the application. Hence, instead of building tables that require code to manage, as well as user interfaces, we simply choose a permanent value for each of the surrogate values. This gives you control over the values in the key (which you pretty much won’t have when using the IDENTITY property) and allows usage of the surrogate key directly in code if desired (likely as a constant construct in the host language). It also allows a user interface to cache values from this table or to even implement them as constants, with confidence that they won’t change without the knowledge of the programmer who is using them (see Figure 6-28).

image

Figure 6-28. AttendeeType table for reference

Note that it’s generally expected that once you manually create a value, the meaning of this value will never change. For example, you might have a row, ('SPEAKER', 'Persons who are speaking at the conference and have special privileges'). In this case, it would be fine to change the Description but not the value for AttendeeType.

Generation Using the IDENTITY Property

Most of the time, tables are created to allow users to create new rows. Implementing a surrogate key on these tables is commonly done using what are referred to as IDENTITY columns. For any of the precise numeric datatypes, there’s an option to create an automatically incrementing (or decrementing, depending on the increment value) column. The identity value increments automatically, and it operates outside of transactions and locking, so it is extremely fast. The column that implements this IDENTITY column should also be defined as NOT NULL. From our initial section on columns, I had this for the column specification:

 <columnName> <data type> [<NULL specification>] IDENTITY [(seed,increment)]

The seed portion specifies the number that the column values will start with, and the increment is how much the next value will increase. For example, take the Message table we have been working with, this time implementing the IDENTITY-based surrogate key:

 MessageId                 int             NOT NULL IDENTITY(1,1) ,

 SentToMessagingUserId     int             NULL ,

 MessagingUserId           int             NOT NULL ,

 Text                      nvarchar(200)   NOT NULL ,

 MessageTime               datetime2(0)    NOT NULL ,

 RowCreateTime             datetime2(0)    NOT NULL ,

 RowLastUpdateTime         datetime2(0)    NOT NULL

To the column declaration for the MessageId column of the Message table we have been using in the past few sections, I’ve added the IDENTITY property. The seed of 1 indicates that the values will start at 1, and the increment says that the second value will be 1 greater, in this case 2, the next 3, and so on. You can set the seed and increment to any value that is of the datatype of the column it is being applied to. For example, you could declare the column as IDENTITY(1000,-50), and the first value would be 1000, the second 950, the third 900, and so on.

The IDENTITY property is useful for creating a surrogate primary key that’s small and fast. The int datatype requires only 4 bytes and is good because most tables will have fewer than 2 billion rows. There are, however, a couple of major caveats that you have to understand about IDENTITY values:

  • IDENTITY values are apt to have holes in the sequence. If an error occurs when creating a new row, the IDENTITY value that was going to be used will be lost to the identity sequence. This is one of the things that allows them to be good performers when you have heavy concurrency needs. Because IDENTITY values aren’t affected by transactions, other connections don’t have to wait until another’s transaction completes.
  • If a row gets deleted, the deleted value won’t be reused unless you insert a row yourself (not a simple task). Hence, you shouldn’t use IDENTITY columns if you cannot accept this constraint on the values in your table.
  • The value of a column with the IDENTITY property cannot be updated. You can insert your own value by using SET IDENTITY_INSERT <tablename> ON, but for the most part, you should use this only when starting a table using values from another table.
  • You cannot alter a column to turn on the IDENTITY property, but you can add an IDENTITY column to an existing table.

Keep in mind the fact (I hope I’ve said this enough) that the surrogate key should not be the only key on the table; otherwise, the only uniqueness in the table is a more or less random value!
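
A minimal sketch of SET IDENTITY_INSERT in action, using a hypothetical demo table:

 CREATE TABLE dbo.IdentityDemo
 (
     IdentityDemoId int NOT NULL IDENTITY(1,1),
     Name           varchar(30) NOT NULL
 );
 INSERT INTO dbo.IdentityDemo (Name)
 VALUES ('first row');                      --IdentityDemoId = 1, generated

 SET IDENTITY_INSERT dbo.IdentityDemo ON;
 INSERT INTO dbo.IdentityDemo (IdentityDemoId, Name)
 VALUES (100, 'explicitly keyed row');      --explicit value allowed only while the setting is ON
 SET IDENTITY_INSERT dbo.IdentityDemo OFF;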

Generation Using a Default Constraint

Using identity values, you get a very strict key management system, where you have to use special syntax (SET IDENTITY_INSERT) to add a new row to the table with an explicit key value. Instead of using a strict key generation tool like the identity, you can use a sequence object to generate new values for you. Like an identity column, it is not subject to transactions, so it is really fast, but a rollback will not recover a value that is used, leaving gaps on errors/rollbacks.

For our database, I will use the sequence object with a default constraint instead of the identity column for the key generator of the Topic table. Users can add new general topics, but special topics will be added manually with a specific value. I will start the user generated key values at 10000, since it is unlikely that 10,000 specially coded topics will be needed.

 CREATE SEQUENCE Messages.TopicIdGenerator

 AS INT

 MINVALUE 10000 --starting value

 NO MAXVALUE --technically will max out at max int

 START WITH 10000 --value where the sequence will start, differs from min based on

                  --cycle property

 INCREMENT BY 1 --number that is added to the previous value

 NO CYCLE --if setting is cycle, when it reaches max value it starts over

 CACHE 100; --Used to adjust the number of values that SQL Server caches. Cached values would

            --be lost if the server is restarted, but keeping them in RAM makes access faster;

You can get the first values using the NEXT VALUE FOR expression for sequence objects.

 SELECT NEXT VALUE FOR Messages.TopicIdGenerator AS TopicId

 UNION ALL

 SELECT NEXT VALUE FOR Messages.TopicIdGenerator AS TopicId;

This returns

TopicId
--------
10000
10001

sq.jpg  Note   The default datatype for a sequence object is bigint, and the default starting point is the smallest number that the sequence supports. So if you declared CREATE SEQUENCE dbo.test and fetched the first value, you would get -9223372036854775808, which is an annoying starting place for most usages. Like almost every DDL you will use in T-SQL, it is generally desirable to specify most settings, especially those that control settings that affect the way the object works for you.

You can then reset the sequence to the START WITH value using the ALTER SEQUENCE statement with a RESTART clause:

 --To start a certain number add WITH <starting value literal>

 ALTER SEQUENCE Messages.TopicIdGenerator RESTART;

For the Topic table, I will use the following column declaration to use the sequence object in a default. This is the first time I have used a default constraint, so I will note that the name I gave the default object starts with a prefix of DFLT, followed by the schema name, an underscore, the table name, an underscore, and then the column the default pertains to. This will be sufficient to keep the names unique and to identify the object in a query of the system catalog.

 TopicId int NOT NULL CONSTRAINT DFLTMessage_Topic_TopicId

  DEFAULT(NEXT VALUE FOR Messages.TopicIdGenerator),

In the final section of this chapter, I will load some data for the table to give an idea of how all the parts work together. One additional super-nice property of sequence objects is that you can preallocate values to allow for bulk inserts. So if you want to load 100 topic rows, you can get the values to use, build your set, and then do the insert. The allocation is done using a system stored procedure:

 DECLARE @range_first_value sql_variant, @range_last_value sql_variant,

         @sequence_increment sql_variant;

 EXEC    sp_sequence_get_range @sequence_name = N'Messages.TopicIdGenerator'

        ,@range_size = 100

        ,@range_first_value = @range_first_value OUTPUT

        ,@range_last_value = @range_last_value OUTPUT

        ,@sequence_increment = @sequence_increment OUTPUT;

 SELECT  CAST(@range_first_value as int) as firstTopicId,

         CAST(@range_last_value as int) as lastTopicId,

         CAST(@sequence_increment as int) as increment;

Since our object was just reset, the first 100 values are returned, along with the increment (which you should not simply assume when you use these values, if you want to follow the rules of the object):

firstTopicId lastTopicId increment
------------ ----------- ----------
10000 10099 1

If you want to get metadata about the sequences in the database, you can use the sys.sequences catalog view.

 SELECT start_value, increment, current_value

 FROM   sys.sequences

 WHERE SCHEMA_NAME(schema_id) = 'Messages'

 AND name = 'TopicIdGenerator';

For the TopicIdGenerator object we set up, this returns

start_value increment current_value
------------ ----------- --------------------------
10000 1 10099

Sequences can be a great improvement on identities, especially whenever you have any need to control the values in the surrogate key (like having unique values across multiple tables). They are a bit more work than identity values, but the flexibility is worth it when you need it. I expect identity columns to remain the standard way of creating surrogate keys for most purposes, as their inflexibility offers some protection against having to manage data in the surrogate key: you have to go out of your way, using SET IDENTITY_INSERT ON, to insert a value other than the next identity value.
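To see just how much ceremony the identity approach requires, here is a minimal sketch using a hypothetical IdentityDemo table in the Examples schema (the schema assumed for throwaway examples in this chapter, not part of the ConferenceMessaging model):

 CREATE TABLE Examples.IdentityDemo
 (
  IdentityDemoId int NOT NULL IDENTITY(1,1) CONSTRAINT PK_Examples_IdentityDemo PRIMARY KEY,
  Name varchar(20) NOT NULL
 );
 --INSERT INTO Examples.IdentityDemo (IdentityDemoId, Name) VALUES (100,'explicit'); --fails
 SET IDENTITY_INSERT Examples.IdentityDemo ON;  --required to set the key value yourself
 INSERT INTO Examples.IdentityDemo (IdentityDemoId, Name) VALUES (100,'explicit');
 SET IDENTITY_INSERT Examples.IdentityDemo OFF;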

image.jpg Tip   An alternative method for creating a surrogate key is to use GUIDs. If you use GUIDs, you would use the uniqueidentifier data type and a default of NewID() or NewSequentialID(). Both generate new GUID values; NewSequentialID() generates them sequentially to limit the number of page splits that need to occur (page splits are a topic for Chapter 10 on data structures), which makes the values more predictable and theoretically less unique, though actually getting a duplicate would still be very unlikely.
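A minimal sketch of the GUID approach, using a hypothetical table in the Examples schema, might look like the following (NEWSEQUENTIALID() may only be used in a default constraint, which is exactly how it is used here):

 CREATE TABLE Examples.GuidKeyDemo
 (
  GuidKeyDemoId uniqueidentifier NOT NULL
      CONSTRAINT DFLTExamples_GuidKeyDemo_GuidKeyDemoId DEFAULT (NEWSEQUENTIALID())
      CONSTRAINT PK_Examples_GuidKeyDemo PRIMARY KEY,
  Name varchar(20) NOT NULL
 );
 INSERT INTO Examples.GuidKeyDemo (Name) VALUES ('first'),('second');
 SELECT GuidKeyDemoId, Name FROM Examples.GuidKeyDemo; --two sequentially generated GUID keys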

DDL to Build Tables

We have finally reached the point where we are going to create the basic table structures we have specified, including the key generation (identity columns and the sequence-based default) and the calculated column that we created. Note that we have already created the schema and sequence objects earlier in the chapter.

 CREATE TABLE Attendees.AttendeeType (

     AttendeeType varchar(20) NOT NULL ,

     Description varchar(60) NOT NULL

 );

 --As this is a non-editable table, we load the data here to

 --start with

 INSERT INTO Attendees.AttendeeType

 VALUES ('Regular', 'Typical conference attendee'),

  ('Speaker', 'Person scheduled to speak'),

  ('Administrator','Manages System');

 CREATE TABLE Attendees.MessagingUser (

  MessagingUserId      int NOT NULL IDENTITY ( 1,1 ) ,

  UserHandle         varchar(20) NOT NULL ,

  AccessKeyValue      char(10) NOT NULL ,

  AttendeeNumber      char(8) NOT NULL ,

  FirstName         varchar(50) NULL ,

  LastName            varchar(50) NULL ,

  AttendeeType         varchar(20) NOT NULL ,

  DisabledFlag         bit NOT NULL ,

  RowCreateTime      datetime2(0) NOT NULL ,

  RowLastUpdateTime      datetime2(0) NOT NULL

 );

 CREATE TABLE Attendees.UserConnection

 (

  UserConnectionId int NOT NULL IDENTITY ( 1,1 ) ,

  ConnectedToMessagingUserId int NOT NULL ,

  MessagingUserId int NOT NULL ,

  RowCreateTime    datetime2(0) NOT NULL ,

  RowLastUpdateTime datetime2(0) NOT NULL

 );

 CREATE TABLE Messages.Message (

  MessageId int NOT NULL IDENTITY ( 1,1 ) ,

  RoundedMessageTime as (dateadd(hour,datepart(hour,MessageTime),

   CAST(CAST(MessageTime as date)as datetime2(0)) ))

   PERSISTED,

  SentToMessagingUserId   int NULL ,

  MessagingUserId        int NOT NULL ,

  Text                  nvarchar(200) NOT NULL ,

  MessageTime          datetime2(0) NOT NULL ,

  RowCreateTime       datetime2(0) NOT NULL ,

  RowLastUpdateTime    datetime2(0) NOT NULL

 );

 CREATE TABLE Messages.MessageTopic (

  MessageTopicId      int NOT NULL IDENTITY ( 1,1 ) ,

  MessageId            int NOT NULL ,

  UserDefinedTopicName       nvarchar(30) NULL ,

  TopicId               int NOT NULL ,

  RowCreateTime         datetime2(0) NOT NULL ,

  RowLastUpdateTime            datetime2(0) NOT NULL

 );

 CREATE TABLE Messages.Topic (

  TopicId               int NOT NULL CONSTRAINT DFLTMessage_Topic_TopicId

  DEFAULT(NEXT VALUE FOR Messages.TopicIdGenerator),

  Name                  nvarchar(30) NOT NULL ,

  Description            varchar(60) NOT NULL ,

  RowCreateTime               datetime2(0) NOT NULL ,

  RowLastUpdateTime            datetime2(0) NOT NULL

 );

After running this script, you are getting pretty far down the path, but there are still quite a few steps to go before we are finished. Sometimes, this is as far as people go when building a “small” system, but it is important to do all of the steps in this chapter for almost every database you create to maintain a reasonable level of data integrity.

Adding Uniqueness Constraints

As I’ve mentioned several times, it’s important that every table have at least one constraint that prevents duplicate rows from being created. In this section, I’ll introduce the following tasks, plus a topic (indexes) that inevitably comes to mind when I start talking about keys that are implemented with indexes:

  • Adding primary key constraints
  • Adding alternate (UNIQUE) key constraints
  • Viewing uniqueness constraints
  • Where other indexes fit in

Both types of constraints are implemented on top of unique indexes to enforce the uniqueness. It’s conceivable that you could use unique indexes instead of constraints, but using a constraint is the favored method of implementing a key and enforcing uniqueness.

Constraints are intended to semantically represent and enforce constraints on data, and indexes (which are covered in detail in Chapter 10) are intended to speed access to data. In actuality, it doesn’t matter how the uniqueness is implemented, but it is necessary to have either unique indexes or unique constraints in place. In some cases, other RDBMSs don’t always use indexes to enforce uniqueness by default; they can use hash tables that are good only for seeing whether a value exists, not for looking up values. By and large, when you need to enforce uniqueness, it’s also the case that a user or process will be searching for values in the table, often for a single row, which indexes are perfect for.

Adding Primary Key Constraints

The first set of constraints we will add to the tables will be the primary key constraints. The syntax of the primary key declaration is straightforward:

 [CONSTRAINT constraintname] PRIMARY KEY [CLUSTERED | NONCLUSTERED]

As with all constraints, the constraint name is optional, but you should never treat it as such. I’ll name primary key constraints using a name such as PK_<schema>_<tablename>. In almost all cases, you will want to make the primary key clustered, especially when it is the most frequently used key for accessing rows. In Chapter 10, I will describe the physical/internal structures of the database and will give more indications of when you might stray from the clustered primary key path, but I generally start with clustered and adjust if the usage patterns lead down a different path.

sq.jpg  Tip   The primary key and other constraints of the table will be members of the table’s schema, so you don’t need to name your constraints for uniqueness over all objects, just those in the schema.

You can specify the primary key constraint when creating the table, just like we did with the default from the sequence object. If it is a single-column key, you could add it to the statement like this:

 CREATE TABLE Messages.Topic (

  TopicId int NOT NULL CONSTRAINT DFLTMessage_Topic_TopicId

  DEFAULT(NEXT VALUE FOR Messages.TopicIdGenerator)

 CONSTRAINT PK_Messages_Topic PRIMARY KEY,

  Name                  nvarchar(30) NOT NULL ,

  Description            varchar(60) NOT NULL ,

  RowCreateTime            datetime2(0) NOT NULL ,

  RowLastUpdateTime         datetime2(0) NOT NULL

 );

Or, if it were a multiple-column key, you could declare it as a table-level constraint after the columns, like the following example:

 CREATE TABLE Examples.ExampleKey

 (

 ExampleKeyColumn1 int NOT NULL,

 ExampleKeyColumn2 int NOT NULL,

 CONSTRAINT PK_Examples_ExampleKey

  PRIMARY KEY (ExampleKeyColumn1, ExampleKeyColumn2)

 )

The more common method is to use the ALTER TABLE statement and simply alter the table to add the constraint, like the following, which is the code in the downloads that will add the primary keys:

 ALTER TABLE Attendees.AttendeeType

  ADD CONSTRAINT PK_Attendees_AttendeeType PRIMARY KEY CLUSTERED (AttendeeType);

 ALTER TABLE Attendees.MessagingUser

  ADD CONSTRAINT PK_Attendees_MessagingUser PRIMARY KEY CLUSTERED (MessagingUserId);

 ALTER TABLE Attendees.UserConnection

  ADD CONSTRAINT PK_Attendees_UserConnection PRIMARY KEY CLUSTERED (UserConnectionId);

 ALTER TABLE Messages.Message

  ADD CONSTRAINT PK_Messages_Message PRIMARY KEY CLUSTERED (MessageId);

 ALTER TABLE Messages.MessageTopic

  ADD CONSTRAINT PK_Messages_MessageTopic PRIMARY KEY CLUSTERED (MessageTopicId);

 ALTER TABLE Messages.Topic

  ADD CONSTRAINT PK_Messages_Topic PRIMARY KEY CLUSTERED (TopicId);

sq.jpg  Tip   Although the CONSTRAINT <constraintName> part of any constraint declaration isn’t required, it’s a very good idea to always name your constraint declarations. Otherwise, SQL Server will assign a name for you, and it will be ugly and different each and every time you execute the statement. For example, create the following object in tempdb:

CREATE TABLE TestConstraintName (TestConstraintNameId int PRIMARY KEY);

Look at the object name with this query:

SELECT constraint_name
FROM information_schema.table_constraints
WHERE table_schema = 'dbo'
AND table_name = 'TestConstraintName';

You see the name chosen is something ugly like PK__TestCons__BA850E1F645CD7F4.

Adding Alternate Key Constraints

Alternate key creation is an important task of implementation modeling. Enforcing these keys is arguably even more important than enforcing the primary key, especially when the primary key is an artificial value. When implementing alternate keys, it’s best to use a UNIQUE constraint. These are pretty much the same thing as primary key constraints and can even be used as the target of a relationship (relationships are covered later in the chapter).

The syntax for their creation is as follows:

 [CONSTRAINT constraintname] UNIQUE [CLUSTERED | NONCLUSTERED] [(ColumnList)]

Just like the primary key, you can declare it during table creation or in an ALTER statement. I usually use an ALTER statement for code I am managing, because keeping the table creation separate from the constraints seems cleaner, but as long as the constraints get implemented, either way is fine.

 ALTER TABLE Messages.Message

  ADD CONSTRAINT AK_Messages_Message_TimeUserAndText UNIQUE

  (RoundedMessageTime, MessagingUserId, Text);

 ALTER TABLE Messages.Topic

  ADD CONSTRAINT AK_Messages_Topic_Name UNIQUE (Name);

 ALTER TABLE Messages.MessageTopic

  ADD CONSTRAINT AK_Messages_MessageTopic_TopicAndMessage UNIQUE

  (MessageId, TopicId, UserDefinedTopicName);

 ALTER TABLE Attendees.MessagingUser

  ADD CONSTRAINT AK_Attendees_MessagingUser_UserHandle UNIQUE (UserHandle);

 ALTER TABLE Attendees.MessagingUser

  ADD CONSTRAINT AK_Attendees_MessagingUser_AttendeeNumber UNIQUE

  (AttendeeNumber);

 ALTER TABLE Attendees.UserConnection

  ADD CONSTRAINT AK_Attendees_UserConnection_Users UNIQUE

  (MessagingUserId, ConnectedToMessagingUserId);

The only really interesting tidbit here is in the Messages.Message declaration. Remember that in the table declaration, RoundedMessageTime was a computed column, so now, by adding this constraint, we have prevented the same message from being entered more than once per hour. This should show you that you can implement some fairly complex constraints using the basic building blocks we have covered so far. I will note again that a computed column must be deterministic to be used in an index. Being able to declare the column as PERSISTED is a good way to know whether it is deterministic or not.

In the next few chapters, we will cover many different patterns for using these building blocks in very interesting ways. Now, we have covered all of the uniqueness constraints that were needed in our ConferenceMessaging database.

What About Indexes?

The topic of indexes is one that generally begins to be discussed before the first row of data is loaded into the first table. Indexes have a singular responsibility for increasing performance. At the same time, they have to be maintained, so they decrease performance too, though hopefully considerably less than they increase it. This conundrum is the foundation of the “science” of performance tuning. Hence, it is best to leave any discussion of adding indexes until data is loaded into tables and queries are executed that show the need for indexes.

In the previous section, we created uniqueness constraints whose purpose is to constrain the data in some form to make sure integrity is met. These uniqueness constraints we have just created are actually built using unique indexes and will also incur some performance penalty just like any index will. To see the indexes that have been created for your constraints, you can use the sys.indexes catalog view:

 SELECT OBJECT_SCHEMA_NAME(object_id) + '.'

  + OBJECT_NAME(object_id) as object_name,

  Name, is_primary_key, is_unique_constraint

 FROM sys.indexes

 WHERE OBJECT_SCHEMA_NAME(object_id) <> 'sys'

  AND (is_primary_key = 1 OR is_unique_constraint = 1)

 ORDER BY object_name;

which, for the constraints we have created so far, returns (with some names truncated to fit):

object_name Name primary_key unique_constraint
------------------------ ------------------------------- ------------ -----------------
Attendees.AttendeeType PK_Attendees_AttendeeType 1 0
Attendees.MessagingUser PK_Attendees_MessagingUser 1 0
Attendees.MessagingUser AK_Attendees_MessagingUser_U 0 1
Attendees.MessagingUser AK_Attendees_MessagingUser_A 0 1
Attendees.UserConnection PK_Attendees_UserConnection 1 0
Attendees.UserConnection AK_Attendees_UserConnection_ 0 1
Messages.Message PK_Messages_Message 1 0
Messages.Message AK_Messages_Message_TimeUser 0 1
Messages.MessageTopic PK_Messages_MessageTopic 1 0
Messages.MessageTopic AK_Messages_MessageTopic_Top 0 1
Messages.Topic PK_Messages_Topic 1 0
Messages.Topic AK_Messages_Topic_Name 0 1

As you start to do index tuning, one of the major tasks will be to determine whether indexes are being used and eliminate the indexes that are never (or very rarely) used to optimize queries, but you will not want to remove any indexes that show up in the results of the previous query, because they are there for data integrity purposes.
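As a rough sketch of the kind of check you might eventually run (not part of this chapter’s scripts), the sys.dm_db_index_usage_stats dynamic management view can be joined to sys.indexes to see whether an index has ever been read; the counters are reset when the server restarts, so treat the numbers only as a sample:

 SELECT OBJECT_SCHEMA_NAME(i.object_id) + '.' + OBJECT_NAME(i.object_id) AS object_name,
        i.name AS index_name,
        s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
 FROM   sys.indexes AS i
          LEFT JOIN sys.dm_db_index_usage_stats AS s
              ON  s.object_id = i.object_id
              AND s.index_id = i.index_id
              AND s.database_id = DB_ID()
 WHERE  OBJECT_SCHEMA_NAME(i.object_id) <> 'sys'
 ORDER  BY object_name, index_name;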

Building Default Constraints

If a user doesn’t know what value to enter into a table, the value can be omitted, and the default constraint sets it to a valid, predetermined value. This helps users avoid having to make up illogical, inappropriate values when they don’t know what they want to put in a column yet need to create a row. However, the true value of defaults is lost in most applications, because for a default constraint to matter, the user interface has to honor it by not referencing the column in the insert operation (or by using the DEFAULT keyword for the column value); a small sketch following Table 6-7 makes this concrete.

We used a default constraint earlier to implement the primary key generation, but here, I will spend a bit more time describing how it works. The basic syntax for the default constraint is

 [CONSTRAINT constraintname] DEFAULT (<simple scalar expression>)

The scalar expression must be a literal, or it can use a function, even a user defined one that accesses a table. A literal is a simple single value in the same datatype that requires no translation by SQL Server. For example, Table 6-7 has sample literal values that can be used as defaults for a few datatypes.

Table 6-7. Sample Default Values

Datatype Possible Default Value
Int 1
varchar(10) 'Value'
binary(2) 0x0000
Datetime '20080101'
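To make the mechanics concrete, here is a minimal sketch using a hypothetical DefaultDemo table in the Examples schema (the schema assumed for throwaway examples in this chapter); the first INSERT omits the column, and the second uses the DEFAULT keyword, the only two ways the default constraint comes into play:

 CREATE TABLE Examples.DefaultDemo
 (
  DefaultDemoId int NOT NULL CONSTRAINT PK_Examples_DefaultDemo PRIMARY KEY,
  StatusFlag    bit NOT NULL CONSTRAINT DFLTExamples_DefaultDemo_StatusFlag DEFAULT (0)
 );
 INSERT INTO Examples.DefaultDemo (DefaultDemoId) VALUES (1);                      --column omitted
 INSERT INTO Examples.DefaultDemo (DefaultDemoId, StatusFlag) VALUES (2, DEFAULT); --DEFAULT keyword
 SELECT DefaultDemoId, StatusFlag FROM Examples.DefaultDemo; --both rows have StatusFlag = 0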

As an example in our sample database, we have the DisabledFlag on the Attendees.MessagingUser table. I’ll set the default value to 0 for this column here:

 ALTER TABLE Attendees.MessagingUser

  ADD CONSTRAINT DFLTAttendees_MessagingUser_DisabledFlag

  DEFAULT (0) FOR DisabledFlag;

Beyond literals, you can use system functions to implement default constraints. In our model, we will use a default on all of the tables’ RowCreateTime and RowLastUpdateTime columns. To create these constraints, I will demonstrate one of the most useful tools in a DBA’s toolbox: using the system views to generate code. Since we have to write the same code over and over, I will query the metadata in the INFORMATION_SCHEMA.COLUMNS view and put together a query that generates the default constraints (you will need to set your output to text, not grids, in SSMS to use this code):

 SELECT 'ALTER TABLE ' + TABLE_SCHEMA + '.' + TABLE_NAME + CHAR(13) + CHAR(10) +

  ' ADD CONSTRAINT DFLT' + TABLE_SCHEMA + '_' + TABLE_NAME + '_' +

  COLUMN_NAME + CHAR(13) + CHAR(10) +

  ' DEFAULT (SYSDATETIME()) FOR ' + COLUMN_NAME + ';'

 FROM INFORMATION_SCHEMA.COLUMNS

 WHERE COLUMN_NAME in ('RowCreateTime', 'RowLastUpdateTime')

  and TABLE_SCHEMA in ('Messages','Attendees')

 ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;

This code will generate the code for ten constraints:

 ALTER TABLE Attendees.MessagingUser

  ADD CONSTRAINT DFLTAttendees_MessagingUser_RowCreateTime

  DEFAULT (SYSDATETIME()) FOR RowCreateTime;

 ALTER TABLE Attendees.MessagingUser

  ADD CONSTRAINT DFLTAttendees_MessagingUser_RowLastUpdateTime

  DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;

 ALTER TABLE Attendees.UserConnection

  ADD CONSTRAINT DFLTAttendees_UserConnection_RowCreateTime

  DEFAULT (SYSDATETIME()) FOR RowCreateTime;

 ALTER TABLE Attendees.UserConnection

  ADD CONSTRAINT DFLTAttendees_UserConnection_RowLastUpdateTime

  DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;

 ALTER TABLE Messages.Message

  ADD CONSTRAINT DFLTMessages_Message_RowCreateTime

  DEFAULT (SYSDATETIME()) FOR RowCreateTime;

 ALTER TABLE Messages.Message

  ADD CONSTRAINT DFLTMessages_Message_RowLastUpdateTime

  DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;

 ALTER TABLE Messages.MessageTopic

  ADD CONSTRAINT DFLTMessages_MessageTopic_RowCreateTime

  DEFAULT (SYSDATETIME()) FOR RowCreateTime;

 ALTER TABLE Messages.MessageTopic

  ADD CONSTRAINT DFLTMessages_MessageTopic_RowLastUpdateTime

  DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;

 ALTER TABLE Messages.Topic

  ADD CONSTRAINT DFLTMessages_Topic_RowCreateTime

  DEFAULT (SYSDATETIME()) FOR RowCreateTime;

 ALTER TABLE Messages.Topic

  ADD CONSTRAINT DFLTMessages_Topic_RowLastUpdateTime

  DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;

Obviously it’s not the point of this section, but generating code with the system metadata is a very useful skill to have, particularly when you need to add some type of code over and over.

Adding Relationships (Foreign Keys)

I’ve covered relationships in previous chapters already, so I’ll try to avoid saying too much more about why to use them. In this section, I’ll simply discuss how to implement relationships. It’s common to add constraints using the ALTER TABLE statement, but you can also do this using the CREATE TABLE statement. However, because tables are frequently created all at once, it’s usually more likely that the ALTER TABLE command will be used, because parent tables needn’t be created before dependent child tables in scripts.

The typical foreign key is implemented as the primary key of one table migrated to the child table that represents the entity from which it comes. You can also reference a unique constraint, but that is pretty rare and a very atypical implementation. An example is a table with an identity key and a textual code; you could migrate the textual code to a child table to make it easier to read, if user requirements demanded it and you failed to win the argument against doing something that will confuse everyone for years to come.

The syntax of the statement for adding foreign key constraints is pretty simple:

 [CONSTRAINT <constraintName>]

 FOREIGN KEY (<columns>) REFERENCES <referenceTable> (<referenceColumns>)

 [ON DELETE <NO ACTION | CASCADE | SET NULL | SET DEFAULT> ]

 [ON UPDATE <NO ACTION | CASCADE | SET NULL | SET DEFAULT> ]

where

  • <columns> is a comma-delimited list of the referencing columns in the child table.
  • <referenceTable> is the parent table in the relationship.
  • <referenceColumns> is a comma-delimited list of the key columns in the parent table, in the same order as the corresponding <columns> in the child table.
  • ON DELETE or ON UPDATE clauses specify what to do when a row is deleted or updated. Options are
    • NO ACTION: Raises an error if you end up with a child with no matching parent after the statement completes
    • CASCADE: Applies the action on the parent to the child, either updates the migrated key values in the child to the new parent key value or deletes the child row
    • SET NULL: If you delete or change the value of the parent, you set the child key to NULL
    • SET DEFAULT: If you delete or change the value of the parent, the child key is set to the default value from the default constraint, or NULL if no constraint exists.

If you are using surrogate keys, you will very rarely need either of the ON UPDATE options, since the value of a surrogate is rarely editable. For deletes, 98% of the time you will use NO ACTION, because most of the time you will simply want to make the user delete the children first, to avoid accidentally deleting a lot of data. Lots of NO ACTION foreign key constraints also tend to make it much harder to execute an accidental DELETE FROM <tableName> when you didn’t highlight the WHERE clause in SSMS. The most useful of the actions is ON DELETE CASCADE, which is frequently useful for table sets where the child table is, in essence, just a part of the parent table, for example, invoice ← invoiceLineItem. Usually, if you are going to delete the invoice, you are doing so because it is bad, and you will want the line items to go away too. On the other hand, you want to avoid it for relationships like Customer ← Invoice. Deleting a customer who has invoices is probably a bad idea.
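As a small illustration of the difference (using hypothetical invoice tables in the Examples schema, not part of the ConferenceMessaging model), deleting the parent row here removes its line items as well:

 CREATE TABLE Examples.Invoice
 (
  InvoiceId int NOT NULL CONSTRAINT PK_Examples_Invoice PRIMARY KEY
 );
 CREATE TABLE Examples.InvoiceLineItem
 (
  InvoiceLineItemId int NOT NULL CONSTRAINT PK_Examples_InvoiceLineItem PRIMARY KEY,
  InvoiceId         int NOT NULL
      CONSTRAINT FK__Examples_Invoice$Contains$Examples_InvoiceLineItem
      FOREIGN KEY REFERENCES Examples.Invoice (InvoiceId)
      ON DELETE CASCADE
 );
 INSERT INTO Examples.Invoice (InvoiceId) VALUES (1);
 INSERT INTO Examples.InvoiceLineItem (InvoiceLineItemId, InvoiceId) VALUES (1,1),(2,1);
 DELETE FROM Examples.Invoice WHERE InvoiceId = 1; --the two line item rows are removed as well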

Thinking back to our modeling, there were optional and required relationships, such as the one in Figure 6-29.

image

Figure 6-29. Optional parent-to-child relationship requires NULL on the migrated key

The child.parentId column needs to allow NULLs (which it does on the model). For a required relationship, the child.parentId would not be null, like in Figure 6-30.

image

Figure 6-30. Required parent-to-child relationship requires NOT NULL on the migrated key

This is all you need to do, because SQL Server knows that when the referencing key allows a NULL, the relationship value is optional. You don’t need to have a NULL primary key value for the relationship because, as discussed, it’s impossible to have a NULL attribute in a primary key. In our model, represented here in Figure 6-31, we have seven relationships modeled.

image

Figure 6-31. Messaging model for reference

In Figure 6-9, you will remember that we had given the relationships verb phrases, which are used when reading the relationship. For example, between User and Message, we have two relationships. One of them was verb phrased as "Is Sent", as in User-Is Sent-Message. In order to get interesting usage of these verb phrases, I will use them as part of the name of the constraint, so that constraint will be named:

 FK__Attendees_MessagingUser$IsSent$Messages_Message

Doing this greatly improves the value of the constraint names, particularly when you have more than one foreign key going between the same two tables. Now, let’s go through the seven constraints and decide the options to use on each foreign key relationship. First up is the relationship between AttendeeType and MessagingUser. Since it uses a natural key, it is a target for the UPDATE CASCADE option. However, it should be noted that if you have a lot of MessagingUser rows, this operation can be very costly, so it should be done during off hours. And, if it turns out it is done very often, the choice to use a volatile natural key value ought to be reconsidered. We will use ON DELETE NO ACTION, because we don’t usually want to cascade a delete from a table that is strictly there to implement a domain.

 ALTER TABLE Attendees.MessagingUser

 ADD CONSTRAINT

 FK__Attendees_AttendeeType$Classifies$Attendees_MessagingUser

  FOREIGN KEY (AttendeeType) REFERENCES Attendees.AttendeeType(AttendeeType)

  ON UPDATE CASCADE

  ON DELETE NO ACTION;

Next, let’s consider the two relationships between MessagingUser and UserConnection. Since we modeled both of the relationships as required, if one user is deleted (as opposed to being disabled), we would want to delete all connections to and from that MessagingUser. Hence, you might consider implementing both of these as ON DELETE CASCADE. However, if you execute the following statements, the second one fails with an error:

 ALTER TABLE Attendees.UserConnection

 ADD CONSTRAINT

 FK__Attendees_MessagingUser$ConnectsToUserVia$Attendees_UserConnection

 FOREIGN KEY (MessagingUserId) REFERENCES Attendees.MessagingUser(MessagingUserId)

 ON UPDATE NO ACTION

 ON DELETE CASCADE;

 ALTER TABLE Attendees.UserConnection

 ADD CONSTRAINT

 FK__Attendees_MessagingUser$IsConnectedToUserVia$Attendees_UserConnection

 FOREIGN KEY (ConnectedToMessagingUserId)

  REFERENCES Attendees.MessagingUser(MessagingUserId)

 ON UPDATE NO ACTION

 ON DELETE CASCADE;

 Introducing FOREIGN KEY constraint

 'FK__Attendees_MessagingUser$IsConnectedToUserVia$Attendees_UserConnection' on table 'UserConnection'

 may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or

 modify other FOREIGN KEY constraints.

Basically, this is stating that you cannot have multiple cascade paths to the same table. This further limits the value of the CASCADE operations. Instead, we will use NO ACTION for the DELETE and will just have to implement the cascade in the client code or using a trigger. I will also note that, in many ways, this is probably a good thing. Too much automatically executing code is going to make developers antsy about what is going on with the data, and if you accidentally delete a user, having NO ACTION specified can actually be a good thing to stop dumb mistakes. So I will change the constraints to NO ACTION and re-create them (dropping the one that was created first):

 ALTER TABLE Attendees.UserConnection

  DROP CONSTRAINT

 FK__Attendees_MessagingUser$ConnectsToUserVia$Attendees_UserConnection

 GO

 ALTER TABLE Attendees.UserConnection

  ADD CONSTRAINT

 FK__Attendees_MessagingUser$ConnectsToUserVia$Attendees_UserConnection

  FOREIGN KEY (MessagingUserId) REFERENCES Attendees.MessagingUser(MessagingUserId)

  ON UPDATE NO ACTION

  ON DELETE NO ACTION;

 ALTER TABLE Attendees.UserConnection

  ADD CONSTRAINT

 FK__Attendees_MessagingUser$IsConnectedToUserVia$Attendees_UserConnection

  FOREIGN KEY (ConnectedToMessagingUserId)

  REFERENCES Attendees.MessagingUser(MessagingUserId)

  ON UPDATE NO ACTION

  ON DELETE NO ACTION;

 GO

For the two relationships between MessagingUser and Message, you again might want to consider using ON DELETE CASCADE, because if you delete a user, you would likely want to get rid of all of that user’s messages. Deleting a user should be rare, since we implemented a disabled indicator; a user who quit the service or repeatedly used the wrong access key would simply be disabled, and you would want to keep their messages. However, for the same multiple-cascade-path reasons as the previous constraints, we will use NO ACTION here too, so if you do decide to delete a user, you will have to delete the messages manually, which also gives you the chance to reconsider whether the user just needs to be disabled.

 ALTER TABLE Messages.Message

  ADD CONSTRAINT FK__Attendees_MessagingUser$Sends$Messages_Message FOREIGN KEY

 (MessagingUserId) REFERENCES Attendees.MessagingUser(MessagingUserId)

  ON UPDATE NO ACTION

  ON DELETE NO ACTION;

 ALTER TABLE Messages.Message

  ADD CONSTRAINT FK__Attendees_MessagingUser$IsSent$Messages_Message FOREIGN KEY

 (SentToMessagingUserId) REFERENCES Attendees.MessagingUser(MessagingUserId)

  ON UPDATE NO ACTION

  ON DELETE NO ACTION;

The next relationship we will deal with is between Topic and MessageTopic. We don’t want Topic rows to be deleted once they are set up for use, other than perhaps by an administrator as a special operation with its own special requirements, not as a normal activity. Hence, we use DELETE NO ACTION.

 ALTER TABLE Messages.MessageTopic

  ADD CONSTRAINT

 FK__Messages_Topic$CategorizesMessagesVia$Messages_MessageTopic FOREIGN KEY

 (TopicId) REFERENCES Messages.Topic(TopicId)

  ON UPDATE NO ACTION

  ON DELETE NO ACTION;

The last relationship to implement is the Message to MessageTopic relationship. Just like the Topic to MessageTopic relationship, there is no need to cascade the delete automatically, so we will use NO ACTION here as well.

 ALTER TABLE Messages.MessageTopic

  ADD CONSTRAINT FK__Messages_Message$IsCategorizedVia$Messages_MessageTopic FOREIGN KEY

 (MessageId) REFERENCES Messages.Message(MessageId)

  ON UPDATE NO ACTION

  ON DELETE NO ACTION;

One of the primary limitations on constraint-based foreign keys is that the tables participating in the relationship cannot span different databases. When this situation occurs, these relationship types need to be implemented via triggers.

It’s generally a bad idea to design databases with cross-database relationships. A database should be considered a unit of related tables that are always kept in sync. When designing solutions that extend over different databases or even servers, carefully consider how spreading around references to data that isn’t within the scope of the database will affect your solution. You need to understand that SQL Server cannot guarantee the existence of the value, because SQL Server uses databases as its “container,” and another user could restore a database with improper values, even an empty database, and the cross-database RI would be invalidated. Of course, as is almost always the case with anything that isn’t best-practice material, there are times when cross-database relationships are unavoidable, and I’ll demonstrate building triggers to support this need in the next chapter on data protection.

In the security chapter (Chapter 9), we will discuss more about how to secure cross-database access, but it is generally considered a less-than-optimal approach. In SQL Server 2012, with the concept of contained databases, and even in SQL Azure, the ability to cross database boundaries is changing in ways that will generally be helpful for building secure databases that exist on the same server.

Adding Basic Check Constraints

In our database, we have specified a couple of domains that need to be implemented a bit more strictly. In most cases, we can implement validation routines using simple check constraints. Check constraints are simple, single-row predicates that can be used to validate the data in a row. The basic syntax is

 ALTER TABLE <tableName> [WITH CHECK | WITH NOCHECK]

  ADD [CONSTRAINT <constraintName>]

  CHECK (<BooleanExpression>)

There are two interesting parts of check constraints that are different from what you have seen in previous constraints, and we need to cover them briefly before we start creating the constraints. The first is <BooleanExpression>. The <BooleanExpression> component is similar to the WHERE clause of a typical SELECT statement, but with the caveat that no subqueries are allowed. (Subqueries are allowed in standard SQL but not in T-SQL; in T-SQL, you must use a function to access other tables, something I will use later in this book as I create more interesting check constraints to implement the data protection patterns of Chapter 8.)

CHECK constraints can reference system and user-defined functions and use the name or names of any columns in the table. However, they cannot access any other table, and they cannot access any row other than the current row being modified (except through a function, and the row values you will be checking will already exist in the table). If multiple rows are modified, each row is checked against this expression individually.

The interesting thing about this expression is that unlike a WHERE clause, the condition is checked for falseness rather than truth. Without going completely into a discussion of NULL, it’s important to understand that CHECK constraints fail only on rows that are explicitly False. If the result of a comparison is UNKNOWN because of a NULL comparison, the row will pass the check constraint and be entered.

Even if this isn’t immediately confusing, it is often confusing when figuring out why an operation on a row did or did not work as you might have expected. For example, consider the Boolean expression value <> 'fred'. If value is NULL, the row is accepted, because NULL <> 'fred' evaluates to UNKNOWN. If value is 'fred', it fails because 'fred' <> 'fred' is False. The reason CHECK constraints work this way is that if the column is defined as NULL, it is assumed that you wish to allow a NULL value for the column. You can look for NULL values by explicitly checking for them using IS NULL or IS NOT NULL. This is useful when you want to ensure that a column that technically allows NULLs does not allow them if another column has a given value. As an example, if you have a column defined name varchar(10) null, a check constraint that says name = 'fred' technically says name = 'fred' or name is null. If you want to ensure it is not null when the column NameIsNotNullFlag = 1, you would state (NameIsNotNullFlag = 1 and Name is not null) or (NameIsNotNullFlag = 0).
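Here is a minimal sketch of that behavior, using a hypothetical CheckNullDemo table in the Examples schema (the schema assumed for throwaway examples in this chapter):

 CREATE TABLE Examples.CheckNullDemo
 (
  value varchar(10) NULL CONSTRAINT CHK__Examples_CheckNullDemo_value CHECK (value <> 'fred')
 );
 INSERT INTO Examples.CheckNullDemo (value) VALUES (NULL);     --succeeds, NULL <> 'fred' is UNKNOWN
 INSERT INTO Examples.CheckNullDemo (value) VALUES ('barney'); --succeeds, 'barney' <> 'fred' is True
 INSERT INTO Examples.CheckNullDemo (value) VALUES ('fred');   --fails, 'fred' <> 'fred' is False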

The second interesting part of the statement is the [WITH CHECK | WITH NOCHECK] specification. When you create a CHECK constraint, the WITH CHECK setting (the default) validates the existing data; WITH NOCHECK gives you the opportunity to create the constraint without checking existing data. Using NOCHECK and leaving the values unchecked is a pretty bad thing to do, in my opinion. First off, when a row with invalid existing data is later updated, even to resave the exact same data, you will get an error. Plus, if the constraint is built WITH CHECK, the optimizer can possibly make use of the constraint when building plans, if the constraint didn’t use any functions and just used simple comparisons such as less than, greater than, and so on. For example, imagine you have a constraint that says that a value must be less than 10. If, in a query, you look for all values 11 and greater, the optimizer can use this fact and immediately return zero rows, rather than having to scan the table to see whether any value matches.
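The following is a minimal sketch (again using a hypothetical table in the Examples schema) of what WITH NOCHECK costs you; the is_not_trusted column of sys.check_constraints shows whether the optimizer can rely on a given constraint:

 CREATE TABLE Examples.TrustDemo (value int NOT NULL);
 INSERT INTO Examples.TrustDemo (value) VALUES (100); --violates the rule we are about to add

 ALTER TABLE Examples.TrustDemo WITH NOCHECK
  ADD CONSTRAINT CHK__Examples_TrustDemo_value_LessThan10 CHECK (value < 10);

 SELECT name, is_not_trusted
 FROM   sys.check_constraints
 WHERE  OBJECT_SCHEMA_NAME(parent_object_id) = 'Examples'
   AND  OBJECT_NAME(parent_object_id) = 'TrustDemo';
 --is_not_trusted = 1, so a query such as WHERE value >= 11 still has to scan the table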

In our model, we had two domains specified in the text that we will implement here. The first is the TopicName, which called for us to make sure that the value is not an empty string or all space characters. I repeat it here in Table 6-8 for review.

Table 6-8. Domain: TopicName

Property Setting
Name TopicName
Optional No
Datatype Unicode text, 30 characters
Value Limitations Must not be an empty string or only space characters
Default Value n/a

The maximum length of 30 characters was handled by the nvarchar(30) datatype we used, but now we will implement the rest of the value limitations. The method I will use is to RTRIM the value and then check the length; if it is 0, the value is either empty or all spaces. We used the TopicName domain for two columns: the Name column from Messages.Topic and the UserDefinedTopicName column from the Messages.MessageTopic table:

 ALTER TABLE Messages.Topic

  ADD CONSTRAINT CHK__Messages_Topic_Name_NotEmpty

  CHECK (LEN(RTRIM(Name)) > 0);

 ALTER TABLE Messages.MessageTopic

  ADD CONSTRAINT CHK__Messages_MessageTopic_UserDefinedTopicName_NotEmpty

  CHECK (LEN(RTRIM(UserDefinedTopicName)) > 0);

The other domain we specifically mentioned was for the UserHandle, as repeated in Table 6-9.

Table 6-9. Domain: UserHandle

Property Setting
Name UserHandle
Optional No
Datatype Basic character set, maximum of 20 characters
Value Limitations Must be 5–20 simple alphanumeric characters and start with a letter
Default Value n/a

To implement this domain, we will get a bit more interesting.

 ALTER TABLE Attendees.MessagingUser

  ADD CONSTRAINT CHK__Attendees_MessagingUser_UserHandle_LengthAndStart

  CHECK (LEN(Rtrim(UserHandle)) >= 5

  AND LTRIM(UserHandle) LIKE '[a-z]' + REPLICATE('[a-z0-9]',LEN(RTRIM(UserHandle)) -1));

The first part of the CHECK constraint’s Boolean expression simply checks that the string is at least five characters long. The latter part builds a LIKE expression that checks that the name starts with a letter and that the remaining characters are only alphanumeric. It looks like it might be slow, based on the way we are taught to write WHERE clause expressions, but in this case, you aren’t searching; you are working on a single row already in memory.
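If you want to convince yourself that the expression behaves as intended, a quick sanity check against a few made-up handles (this is just a test query, not part of the schema) works nicely:

 SELECT UserHandle,
        CASE WHEN LEN(RTRIM(UserHandle)) >= 5
              AND LTRIM(UserHandle) LIKE '[a-z]' + REPLICATE('[a-z0-9]', LEN(RTRIM(UserHandle)) - 1)
             THEN 'passes' ELSE 'fails' END AS outcome
 FROM  (VALUES ('FredF'),('3BadStart'),('Fred Flintstone'),('Shrt')) AS samples (UserHandle);

Only the first handle passes; the others fail the starting character, character set, or length rules.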

Finally, we had one other predicate to implement. Back in the requirements, it was specified that, in the MessageTopic table, the UserDefinedTopicName must be NULL unless the Topic that is chosen is the one set up for user-defined topics. So first we will create that special Topic row. Since the surrogate key of Topic uses a default constraint based on a sequence, we can simply enter the row, specifying the TopicId as 0:

 INSERT INTO Messages.Topic(TopicId, Name, Description)

 VALUES (0,'User Defined','User Enters Their Own User Defined Topic');

Then, we add the constraint, checking to make sure that the UserDefinedTopicName is NULL when the TopicId is not 0, and NOT NULL when the TopicId is 0.

 ALTER TABLE Messages.MessageTopic

  ADD CONSTRAINT CHK__Messages_MessageTopic_UserDefinedTopicName_NullUnlessUserDefined

  CHECK ((UserDefinedTopicName is NULL and TopicId <> 0)

  or (TopicId = 0 and UserDefinedTopicName is NOT NULL));

Be sure to be as specific as possible with your check criteria, as it will make the implementation a lot safer. Now, we have implemented all of the check constraints we are going to for our demonstration database. In the testing section later in this chapter, one of the most important things to test is the check constraints (along with any advanced data integrity work you have done in triggers, which we will leave to later chapters).
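For example, a quick negative test of the CHK__Messages_Topic_Name_NotEmpty constraint (not part of the schema creation script) should fail with a constraint violation error:

 INSERT INTO Messages.Topic (Name, Description)
 VALUES ('   ','Should fail because the name is only spaces');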

Triggers to Maintain Automatic Values

For all of our tables, we included two columns that we are going to implement as automatically maintained columns. These columns are the RowCreateTime and RowLastUpdateTime that we added earlier in this chapter (shown in Figure 6-27). These columns are useful to help us get an idea of some of the actions that have occurred on a row without resorting to looking through change tracking. Sometimes, these values have meaning to the end users as well, but most often, we are implementing them strictly for the software’s sake, hence the reason that we will implement them in such a manner that the client cannot modify the values.

I will do this with an “instead of” trigger, which will, for the most part, be a very smooth way to manage automatic operations on the base table, but it does have a few downsides.

  • SCOPE_IDENTITY() will no longer return the identity value for the row you inserted, since the actual insert is done in the trigger, outside of the scope of your code. @@identity will work, but it has its own issues, particularly with triggers that perform cascading operations.
  • The OUTPUT clause will not return results directly to the client if the table has an enabled trigger (OUTPUT ... INTO a table still works).

The SCOPE_IDENTITY() issue can be gotten around by using an AFTER trigger for the insert (I will include a sample AFTER trigger later in this section). I personally suggest using one of the natural keys you have implemented to get the inserted value if you are inserting a single row anyhow.
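For example, assuming the 'Regular' AttendeeType row loaded earlier and the default constraints from this chapter are in place (the handle and key values here are just made up for illustration), you could insert a single user and then fetch the new surrogate value using the UserHandle alternate key:

 INSERT INTO Attendees.MessagingUser
        (UserHandle, AccessKeyValue, AttendeeNumber, AttendeeType, DisabledFlag)
 VALUES ('SampleUser','0000000000','00000001','Regular',0);

 SELECT MessagingUserId
 FROM   Attendees.MessagingUser
 WHERE  UserHandle = 'SampleUser';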

One of the downsides of triggers can be performance, so sometimes automatically generated values will simply be maintained by the SQL code that uses the tables or perhaps the columns are simply removed. I far prefer a server-based solution, because clock synchronization can be an issue when even two distinct servers are involved with keeping time. So if an action says it occurred at 12:00 AM by the table, you look in the log and at 12:00 AM, everything looks fine, but at 11:50 PM there was a glitch of some sort. Are they related? It is not possible to know to the degree you might desire.

As it is my favored mechanism for maintaining automatically maintained columns, I will implement triggers for all of the tables other than Attendees.AttendeeType, because, as you should recall, we will not enable end users to make changes to that table’s data, so tracking changes to it will not be needed.

To build the triggers, I will use the trigger templates that are included in Appendix B as the basis for the trigger. If you want to know more about the basics of triggers and how these templates are constructed, check Appendix B. The basics of how the triggers work should be very self explanatory. The code added to the base trigger template from the appendix will be highlighted in bold.

In the following “instead of” insert trigger, we will replicate the operation of insert on the table, passing through the values from the user insert operation, but replacing the RowCreateTime and RowLastUpdateTime with the function SYSDATETIME() . One quick topic we should hit on here is multirow operations. Well written triggers take into consideration that any INSERT, UPDATE, or DELETE operation must support multiple rows being operated on simultaneously. The inserted and deleted virtual tables house the rows that have been inserted or deleted in the operation. (For an update, think of rows as being deleted and then inserted, at least logically.)

 CREATE TRIGGER Messages.MessageTopic$InsteadOfInsertTrigger

 ON Messages.MessageTopic

 INSTEAD OF INSERT AS

 BEGIN

  DECLARE @msg varchar(2000), --used to hold the error message

  --use inserted for insert or update trigger, deleted for update or delete trigger

  --count instead of @@rowcount due to merge behavior that sets @@rowcount to a number

  --that is equal to number of merged rows, not rows being checked in trigger

  @rowsAffected int = (select count(*) from inserted)

  --@rowsAffected = (select count(*) from deleted)

  --no need to continue on if no rows affected

  IF @rowsAffected = 0 RETURN;

  SET NOCOUNT ON; --to avoid the rowcount messages

  SET ROWCOUNT 0; --in case the client has modified the rowcount

  BEGIN TRY

  --[validation section]

  --[modification section]

  --<perform action>

 INSERT INTO Messages.MessageTopic (MessageId, UserDefinedTopicName,

 TopicId,RowCreateTime,RowLastUpdateTime)

 SELECT MessageId, UserDefinedTopicName, TopicId, SYSDATETIME(), SYSDATETIME()

 FROM inserted ;

  END TRY

  BEGIN CATCH

  IF @@trancount > 0

  ROLLBACK TRANSACTION;

  THROW; --will halt the batch or be caught by the caller's catch block

  END CATCH

 END

 GO

For the update operation, we will do very much the same thing, only when we replicate the update operation, we will make sure that the RowCreateTime stays the same, no matter what the user might send in the update, and the RowLastUpdateTime will be replaced by SYSDATETIME().

 CREATE TRIGGER Messages.MessageTopic$InsteadOfUpdateTrigger

 ON Messages.MessageTopic

 INSTEAD OF UPDATE AS

 BEGIN

  DECLARE @msg varchar(2000), --used to hold the error message

  --use inserted for insert or update trigger, deleted for update or delete trigger

  --count instead of @@rowcount due to merge behavior that sets @@rowcount to a number

  --that is equal to number of merged rows, not rows being checked in trigger

  @rowsAffected int = (select count(*) from inserted)

  --@rowsAffected = (select count(*) from deleted)

  --no need to continue on if no rows affected

  IF @rowsAffected = 0 return;

  SET NOCOUNT ON; --to avoid the rowcount messages

  SET ROWCOUNT 0; --in case the client has modified the rowcount

  BEGIN TRY

  --[validation section]

  --[modification section]

  --<perform action>

 UPDATE MessageTopic

 SET MessageId = Inserted.MessageId,

  UserDefinedTopicName = Inserted.UserDefinedTopicName,

  TopicId = Inserted.TopicId,

  RowCreateTime = MessageTopic.RowCreateTime, --no changes allowed

  RowLastUpdateTime = SYSDATETIME()

 FROM inserted

 JOIN Messages.MessageTopic

 on inserted.MessageTopicId = MessageTopic.MessageTopicId;

  END TRY

  BEGIN CATCH

  IF @@trancount > 0

  ROLLBACK TRANSACTION;

  THROW; --will halt the batch or be caught by the caller's catch block

  END CATCH

 END

If you find that using an “instead of” insert trigger is too invasive a technique, you can change to using an AFTER trigger. For an AFTER trigger, you only need to update the columns that are important. It is a bit slower because it is updating the row after it is in the table, but it does work quite well. Another reason an “instead of” trigger may not be allowed is if you have a cascade operation. For example, consider our relationship from MessagingUser to AttendeeType:

 ALTER TABLE [Attendees].[MessagingUser]

  ADD CONSTRAINT [FK__Attendees_AttendeeType$Classifies$Attendees_MessagingUser]

  FOREIGN KEY([AttendeeType])

  REFERENCES [Attendees].[AttendeeType] ([AttendeeType])

 ON UPDATE CASCADE;

Since this foreign key cascades, SQL Server will not allow an INSTEAD OF UPDATE trigger on the MessagingUser table (when the cascade occurs, the automatic operation works only on the base table and would bypass the trigger). So we will implement the update trigger for this table as an AFTER trigger:

 CREATE TRIGGER Attendees.MessagingUser$UpdateRowControlsTrigger

 ON Attendees.MessagingUser

 AFTER UPDATE AS

 BEGIN

  DECLARE @msg varchar(2000), --used to hold the error message

  --use inserted for insert or update trigger, deleted for update or delete trigger

  --count instead of @@rowcount due to merge behavior that sets @@rowcount to a number

  --that is equal to number of merged rows, not rows being checked in trigger

  @rowsAffected int = (select count(*) from inserted)

  --@rowsAffected = (select count(*) from deleted)

  --no need to continue on if no rows affected

  IF @rowsAffected = 0 return;

  SET NOCOUNT ON; --to avoid the rowcount messages

  SET ROWCOUNT 0; --in case the client has modified the rowcount

  BEGIN TRY

  --[validation section]

  --[modification section]

 UPDATE MessagingUser

 SET RowCreateTime = deleted.RowCreateTime, --no changes allowed

  RowLastUpdateTime = SYSDATETIME()

 FROM inserted

 JOIN deleted

  on inserted.MessagingUserId = deleted.MessagingUserId

 JOIN Attendees.MessagingUser

  on inserted.MessagingUserId = MessagingUser.MessagingUserId;

  END TRY

  BEGIN CATCH

  IF @@trancount > 0

  ROLLBACK TRANSACTION;

  THROW; --will halt the batch or be caught by the caller's catch block

  END CATCH

 END

In the downloads for this chapter, I will include “instead of” triggers for all of the tables in our database. They will all follow the same pattern, and because of this, I will almost always use some form of code generation tool to create these triggers. We discussed code generation earlier for building default constraints for these same columns, and you could do the very same thing for building triggers. I generally use a third-party tool to do code generation, but it is essential to the learning process that you code the first ones yourself so you know how things work.

Documenting Your Database

In your modeling, you’ve created descriptions, notes, and various pieces of data that will be extremely useful in helping the developer understand the whys and wherefores of using the tables you’ve created. In previous versions of SQL Server, it was difficult to make any use of this data directly in the server. In SQL Server 2000, Microsoft introduced extended properties that allow you to store specific information about objects. This is great, because it allows you to extend the metadata of your tables in ways that can be used by your applications using simple SQL statements.

By creating these properties, you can build a repository of information that the application developers can use to do the following:

  • Understand what the data in the columns is used for
  • Store information to use in applications, such as the following:
    • Captions to show on a form when a column is displayed
    • Error messages to display when a constraint is violated
    • Formatting rules for displaying or entering data
    • Domain information, like the domain you have chosen for the column during design

To maintain extended properties, you’re given the following functions and stored procedures:

  • sys.sp_addextendedproperty: Used to add a new extended property
  • sys.sp_dropextendedproperty: Used to delete an existing extended property
  • sys.sp_updateextendedproperty: Used to modify an existing extended property
  • fn_listextendedproperty: A system-defined function that can be used to list extended properties
  • sys.extended_properties: A catalog view that can be used to list all extended properties in a database; less friendly than fn_listextendedproperty

Each (other than sys.extended_properties) has the following parameters:

  • @name: The name of the user-defined property.
  • @value: What to set the value to when creating or modifying a property.
  • @level0type: Top-level object type, often schema, especially for most objects that users will use (tables, procedures, and so on).
  • @level0name: The name of the object of the type that’s identified in the @level0type parameter.
  • @level1type: The name of the type of object such as Table, View, and so on.
  • @level1name: The name of the object of the type that’s identified in the @level1type parameter.
  • @level2type: The name of the type of object that’s on the level 2 branch of the tree under the value in the @level1Type value. For example, if @level1type is Table, then @level2type might be Column, Index, Constraint, or Trigger.
  • @level2name: The name of the object of the type that’s identified in the @level2type parameter.

For our example, let’s use the Messages.Topic table, which was defined by the following DDL:

 CREATE TABLE Messages.Topic (

  TopicId int NOT NULL CONSTRAINT DFLTMessage_Topic_TopicId

  DEFAULT(NEXT VALUE FOR Messages.TopicIdGenerator),

  Name                nvarchar(30) NOT NULL ,

  Description             varchar(60) NOT NULL ,

  RowCreateTime            datetime2(0) NOT NULL ,

  RowLastUpdateTime        datetime2(0) NOT NULL

 );

For simplicity’s sake, I will just add a property with a description of the table, but you can add whatever bits of information you want to enhance the schema, both in usage and for management tasks. For example, you might add an extended property to tell the reindexing schemes when or how to reindex a table’s indexes. To document this table, let’s add a property named Description to the table and its columns. Execute the following script after creating the table (note that I used the descriptions as outlined at the start of the chapter for the objects):

 --Messages schema

 EXEC sp_addextendedproperty @name = 'Description',

  @value = 'Messaging objects',

  @level0type = 'Schema', @level0name = 'Messages';

 --Messages.Topic table

 EXEC sp_addextendedproperty @name = 'Description',

  @value = 'Pre-defined topics for messages',

  @level0type = 'Schema', @level0name = 'Messages',

  @level1type = 'Table', @level1name = 'Topic';

 --Messages.Topic.TopicId

 EXEC sp_addextendedproperty @name = 'Description',

  @value = 'Surrogate key representing a Topic',

  @level0type = 'Schema', @level0name = 'Messages',

  @level1type = 'Table', @level1name = 'Topic',

  @level2type = 'Column', @level2name = 'TopicId';

 --Messages.Topic.Name

 EXEC sp_addextendedproperty @name = 'Description',

  @value = 'The name of the topic',

  @level0type = 'Schema', @level0name = 'Messages',

  @level1type = 'Table', @level1name = 'Topic',

  @level2type = 'Column', @level2name = 'Name';

 --Messages.Topic.Description

 EXEC sp_addextendedproperty @name = 'Description',

  @value = 'Description of the purpose and utilization of the topics',

  @level0type = 'Schema', @level0name = 'Messages',

  @level1type = 'Table', @level1name = 'Topic',

  @level2type = 'Column', @level2name = 'Description';

 --Messages.Topic.RowCreateTime

 EXEC sp_addextendedproperty @name = 'Description',

  @value = 'Time when the row was created',

  @level0type = 'Schema', @level0name = 'Messages',

  @level1type = 'Table', @level1name = 'Topic',

  @level2type = 'Column', @level2name = 'RowCreateTime';

 --Messages.Topic.RowLastUpdateTime

 EXEC sp_addextendedproperty @name = 'Description',

  @value = 'Time when the row was last updated',

  @level0type = 'Schema', @level0name = 'Messages',

  @level1type = 'Table', @level1name = 'Topic',

  @level2type = 'Column', @level2name = 'RowLastUpdateTime';

Now, when you go into Management Studio, right-click the Messages.Topic table, and select Properties. Choose Extended Properties, and you see your description, as shown in Figure 6-32.

image

Figure 6-32. Reward for hard work done. Descriptions in Management Studio.

The fn_listExtendedProperty object is a system-defined function you can use to fetch the extended properties (the parameters are as discussed earlier—the name of the property and then each level of the hierarchy):

 SELECT objname, value

 FROM fn_listExtendedProperty ( 'Description',

  'Schema','Messages',

  'Table','Topic',

  'Column',null);

This code returns the following results:

objname value
----------------- --------------------------------------------------------
TopicId Surrogate key representing a Topic
Name The name of the topic
Description Description of the purpose and utilization of the topics
RowCreateTime Time when the row was created
RowLastUpdateTime Time when the row was last updated

There’s some pretty cool value in using extended properties, and not just for documentation. Because the property value is a sql_variant, you can put just about anything in there (within a 7,500-byte limitation, that is). A possible use could be to store data entry masks and other information that the client could read in once and use to make the client experience richer. In the code download, I have included descriptions for all of the columns in the database.

You aren’t limited to tables, columns, and schemas either. Constraints, databases, and many other objects in the database can have extended properties. For more information, check the SQL Server 2012 Books Online section “Using Extended Properties on Database Objects.”

Viewing the Basic Metadata

In the process of creating a model, knowing where to look in the system metadata for descriptive information about the model is extremely useful. Futzing around in the UI will give you a headache and is certainly not the easiest way to see all of the objects at once, particularly to make sure everything seems to make sense.

There are a plethora of sys schema objects. However, they can be a bit messier to use and aren’t based on standards, so they’re apt to change in future versions of SQL Server, just as these views replaced the system tables from versions of SQL Server before 2005. Of course, with the changes in 2005, it became a lot easier to use the sys schema objects (commonly referred to as the system catalog) to get metadata as well. I will stick to the INFORMATION_SCHEMA views as much as I can because they are based on the SQL standards.

First, let’s get a list of the schemas in our database. To view these, use the INFORMATION_SCHEMA.SCHEMATA view.

 SELECT SCHEMA_NAME, SCHEMA_OWNER

 FROM INFORMATION_SCHEMA.SCHEMATA

 WHERE SCHEMA_NAME <> SCHEMA_OWNER;

Note that I limit the schemas to the ones that don’t match their owners. SQL Server automatically creates a schema for every user that gets created.

 SCHEMA_NAME      SCHEMA_OWNER

 ------------      ------------

 Messages             dbo

 Attendees            dbo

sq.jpg  Note   If you are really paying attention, you probably are thinking “didn’t he use the sys.schema catalog view before?” And yes, that is true. I tend to use the INFORMATION_SCHEMA views for reporting on metadata that I want to view, and the catalog views when doing development work, as they can be a bit easier on the eyes since these views include the database name, often as the first column at 128 characters. However, the INFORMATION_SCHEMA has a lot of niceties that are useful, and the schema is based on standards so is less likely to change from version to version.

For tables and columns, we can use INFORMATION_SCHEMA.COLUMNS, and with a little massaging, you can see the table, the column name, and the datatype in a format that is easy to use:

 SELECT TABLE_SCHEMA + '.' + TABLE_NAME AS TABLE_NAME, COLUMN_NAME,
   --types that have a character or binary length
   CASE WHEN DATA_TYPE IN ('varchar','char','nvarchar','nchar','varbinary')
             THEN DATA_TYPE + CASE WHEN CHARACTER_MAXIMUM_LENGTH = -1 THEN '(max)'
                                   ELSE '(' + CAST(CHARACTER_MAXIMUM_LENGTH AS
                                                   varchar(4)) + ')' END
        --types with a datetime precision
        WHEN DATA_TYPE IN ('time','datetime2','datetimeoffset')
             THEN DATA_TYPE + '(' + CAST(DATETIME_PRECISION AS varchar(4)) + ')'
        --types with a precision/scale
        WHEN DATA_TYPE IN ('numeric','decimal')
             THEN DATA_TYPE + '(' + CAST(NUMERIC_PRECISION AS varchar(4)) + ',' +
                  CAST(NUMERIC_SCALE AS varchar(4)) + ')'
        --timestamp should be reported as rowversion
        WHEN DATA_TYPE = 'timestamp' THEN 'rowversion'
        --and the rest. Note, float is declared with a bit length, but is
        --represented as either float or real in types
        ELSE DATA_TYPE END AS DECLARED_DATA_TYPE,
   COLUMN_DEFAULT
 FROM INFORMATION_SCHEMA.COLUMNS
 ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION

which, for the database we have been working on throughout the chapter, returns the following (COLUMN_NAME values are omitted here to fit the page):

TABLE_NAME DECLARED_DATA_TYPE COLUMN_DEFAULT
------------------------ ------------------------------- ------------
Attendees.AttendeeType varchar(20) NULL
Attendees.AttendeeType varchar(60) NULL
Attendees.MessagingUser int NULL
Attendees.MessagingUser varchar(20) NULL
Attendees.MessagingUser char(10) NULL
Attendees.MessagingUser char(8) NULL
Attendees.MessagingUser varchar(50) NULL
Attendees.MessagingUser varchar(50) NULL
Attendees.MessagingUser varchar(20) NULL
Attendees.MessagingUser bit ((0))
Attendees.MessagingUser datetime2(0) (getdate())
Attendees.MessagingUser datetime2(0) (getdate())
Attendees.UserConnection int NULL
Attendees.UserConnection int NULL
Attendees.UserConnection int NULL
Attendees.UserConnection datetime2(0) (getdate())
Attendees.UserConnection datetime2(0) (getdate())
Messages.Message int NULL
Messages.Message datetime2(0) NULL
Messages.Message int NULL
Messages.Message int NULL
Messages.Message nvarchar(200) NULL
Messages.Message datetime2(0) NULL
Messages.Message datetime2(0) (getdate())
Messages.Message datetime2(0) (getdate())
Messages.MessageTopic int NULL
Messages.MessageTopic int NULL
Messages.MessageTopic nvarchar(30) NULL
Messages.MessageTopic int NULL
Messages.MessageTopic datetime2(0) (getdate())
Messages.MessageTopic datetime2(0) (getdate())
Messages.Topic int (NEXT VALUE FOR…
Messages.Topic nvarchar(30) NULL
Messages.Topic varchar(60) NULL
Messages.Topic datetime2(0) (getdate())
Messages.Topic datetime2(0) (getdate())

To see the constraints we have added to these objects (other than the defaults, which were included in the previous results), use this code:

 SELECT TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE
 FROM   INFORMATION_SCHEMA.TABLE_CONSTRAINTS
 WHERE  CONSTRAINT_SCHEMA IN ('Attendees','Messages')
 ORDER BY CONSTRAINT_SCHEMA, TABLE_NAME

This returns the following results (with some of the constraint names truncated to fit on the page):

TABLE_SCHEMA TABLE_NAME CONSTRAINT_NAME CONSTRAINT_TYPE
------------- --------------- ------------------------------ ---------------
Attendees AttendeeType PK_Attendees_AttendeeType PRIMARY KEY
Attendees MessagingUser PK_Attendees_MessagingUser PRIMARY KEY
Attendees MessagingUser AK_Attendees_MessagingUser_Us… UNIQUE
Attendees MessagingUser AK_Attendees_MessagingUser_At… UNIQUE
Attendees MessagingUser AK_Attendees_MessagingUser_At… FOREIGN KEY
Attendees MessagingUser CHK__Attendees_MessagingUser_… CHECK
Attendees UserConnection FK__Attendees_MessagingUser$C… FOREIGN KEY
Attendees UserConnection AK_Attendees_UserConnection_U… UNIQUE
Attendees UserConnection PK_Attendees_UserConnection PRIMARY KEY
Messages Message PK_Messages_Message PRIMARY KEY
Messages Message AK_Messages_Message_TimeUserA… UNIQUE
Messages Message FK__Messages_MessagingUser$Se… FOREIGN KEY
Messages Message FK__Messages_MessagingUser$Is… FOREIGN KEY
Messages MessageTopic FK__Messages_Topic$Categorize… FOREIGN KEY
Messages MessageTopic FK__Message$iscCategorizedVia… FOREIGN KEY
Messages MessageTopic AK_Messages_MessageTopic_Topi… UNIQUE
Messages MessageTopic CHK__Messages_MessageTopic_Us… CHECK
Messages MessageTopic PK_Messages_MessageTopic PRIMARY KEY
Messages Topic PK_Messages_Topic PRIMARY KEY
Messages Topic AK_Messages_Topic_Name UNIQUE
Messages Topic CHK__Messages_Topic_Name_NotE… CHECK
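
If you prefer the catalog views, a rough equivalent (a sketch only; it isn’t in the chapter script) filters sys.objects by constraint type:

 SELECT OBJECT_SCHEMA_NAME(parent_object_id) + '.' + OBJECT_NAME(parent_object_id)
            AS TABLE_NAME,
        name AS CONSTRAINT_NAME, type_desc AS CONSTRAINT_TYPE
 FROM   sys.objects
 WHERE  type IN ('PK','UQ','F','C') --primary key, unique, foreign key, check
   AND  OBJECT_SCHEMA_NAME(parent_object_id) IN ('Attendees','Messages')
 ORDER BY TABLE_NAME, CONSTRAINT_NAME;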

Listing the constraints either way will at least help you get an idea of what tables you have created and whether you have followed your naming standards. Next, the following query will give you the list of triggers that have been created:

 SELECT OBJECT_SCHEMA_NAME(parent_id) + '.' + OBJECT_NAME(parent_id) AS TABLE_NAME,
        name AS TRIGGER_NAME,
        CASE WHEN is_instead_of_trigger = 1 THEN 'INSTEAD OF' ELSE 'AFTER' END
            AS TRIGGER_FIRE_TYPE
 FROM   sys.triggers
 WHERE  type_desc = 'SQL_TRIGGER' --not a CLR trigger
   AND  parent_class = 1 --DML triggers
 ORDER BY TABLE_NAME, TRIGGER_NAME

In the text of the chapter, we created three triggers. In the downloads, I have finished the other seven triggers needed to implement the database. The following results include all ten triggers that are included in the downloads:

TABLE_NAME TRIGGER_NAME TRIGGER_FIRE_TYPE
------------------------ -------------------------------------- -----------------
Attendees.MessagingUser MessagingUser$InsteadOfInsertTrigger INSTEAD OF
Attendees.MessagingUser MessagingUser$UpdateRowControlsTrigger AFTER
Attendees.UserConnection UserConnection$InsteadOfInsertTrigger INSTEAD OF
Attendees.UserConnection UserConnection$InsteadOfUpdateTrigger INSTEAD OF
Messages.Message Message$InsteadOfInsertTrigger INSTEAD OF
Messages.Message Message$InsteadOfUpdateTrigger INSTEAD OF
Messages.MessageTopic MessageTopic$InsteadOfInsertTrigger INSTEAD OF
Messages.MessageTopic MessageTopic$InsteadOfUpdateTrigger INSTEAD OF
Messages.Topic Topic$InsteadOfInsertTrigger INSTEAD OF
Messages.Topic Topic$InsteadOfUpdateTrigger INSTEAD OF

Finally, if you need to see the check constraints in the database, you can use the following:

 SELECT TABLE_SCHEMA + '.' + TABLE_NAME AS TABLE_NAME,
        TABLE_CONSTRAINTS.CONSTRAINT_NAME, CHECK_CLAUSE
 FROM   INFORMATION_SCHEMA.TABLE_CONSTRAINTS
          JOIN INFORMATION_SCHEMA.CHECK_CONSTRAINTS
              ON TABLE_CONSTRAINTS.CONSTRAINT_SCHEMA =
                           CHECK_CONSTRAINTS.CONSTRAINT_SCHEMA
                 AND TABLE_CONSTRAINTS.CONSTRAINT_NAME = CHECK_CONSTRAINTS.CONSTRAINT_NAME

This will return

TABLE_NAME CONSTRAINT_NAME CHECK_CLAUSE
----------------------- ---------------------------------- --------------------------
Messages.Topic CHK__Messages_Topic_Name_NotEmpty (len(rtrim([Name]))>(0))
Messages.MessageTopic CHK__Messages_MessageTopic_UserDe… (len(rtrim([UserDefinedT…
Attendees.MessagingUser CHK__Attendees_MessagingUser_User… (len(rtrim([UserHandle])…
Messages.MessageTopic CHK__Messages_MessageTopic_UserDe… ([UserDefinedTopicName] …
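
Along the same lines, though not shown in the chapter script, you can list the indexes (including those that back the PRIMARY KEY and UNIQUE constraints) with sys.indexes; a sketch:

 SELECT OBJECT_SCHEMA_NAME(object_id) + '.' + OBJECT_NAME(object_id) AS TABLE_NAME,
        name AS INDEX_NAME, type_desc AS INDEX_TYPE, is_unique
 FROM   sys.indexes
 WHERE  OBJECT_SCHEMA_NAME(object_id) IN ('Attendees','Messages')
   AND  name IS NOT NULL --skip unnamed heap entries
 ORDER BY TABLE_NAME, INDEX_NAME;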

This is just a taste of the metadata available, and we will make use of the INFORMATION_SCHEMA and other catalog views throughout this book rather than giving you screenshots of Management Studio or other development tools.

sq.jpg  Tip   The INFORMATION_SCHEMA and catalog views are important resources for the DBA to find out what is in the database. Throughout this book, I will try to give insight into some of them, but there is another book’s worth of information out there on the metadata of SQL Server.

Unit Testing Your Structures

Designing tables is a lot of fun, but it is by no means the end of the process. Once you have structures created, you need to create test scripts that insert good and bad data into your tables to make sure they work as expected. There are automated tools set up to help you with this task (a cursory scan of SourceForge will turn up at least five tools you could use).

People have different ideas about testing; a common one is to treat the database like any other coded object: set up a known state (create or read in some data), run the code under test (often a stored procedure), and then delete the data. This technique works well when you are testing a single module of code, but it gets pretty tedious when you want to test the entire database and have to load 20 tables to test one.

In this section, I will give you a simplistic version of testing your database structures, using a single script that inserts data into the entire database. So far in this chapter, we have created a script that builds a completely empty database; this is the database we will use for our testing. Performance is of no concern, nor is any consideration of concurrency. For this pass of testing, we want to make sure that the database will save and deal with data and will catch data outside of the reasonable norm. I say “reasonable” because, unless we have a real reason to do so, we won’t be testing minimum and maximum values for a datatype, since we will trust that SQL Server can handle them. We will also assume that foreign key constraints work to validate inserts and updates and will not take time seeing what happens when we violate a constraint with an invalid key value. We will check deletes to make sure that cascading operations work where they need to and not where they do not. We will test any check constraints we have built, because these are apt to be an issue, and we will check that the triggers we have created work as well.

sq.jpg  Note   A comprehensive discussion of testing is out of scope for this book, as complete testing requires involvement of the entire team. The unit test is generally the first of several testing steps, including integration testing, user testing, performance testing, and so on. These will flow out of the application implementation that will be tested simultaneously. During unit testing, the goal is simply to prove to ourselves that the code we created does the minimum that it is programmed to do.

In the text, I will include an abridged version of the test script that you can get with the downloads. We will work with two types of statements: those we expect to succeed and those we expect to fail. For the statements we expect to succeed, we will check the row count after the statement and see whether it is what we expect; if an error is raised, that will be self-explanatory:

 <statement to test>
 if @@ROWCOUNT <> 1 THROW 50000,'Description of Operation Failed',16;

For statements that we expect to fail, we will use a TRY . . . CATCH block to capture the error. If no error occurs, the THROW statement will force an error to occur. Then, in the CATCH block, we check to make sure the error message references the constraint we are testing:

 BEGIN TRY
     <statement to test>
     THROW 50000,'No error raised',16;
 END TRY
 BEGIN CATCH
     if ERROR_MESSAGE() not like '%<constraint being violated>%'
         THROW 50000,'<Description of Operation> Failed',16;
 END CATCH

The preceding example is a very minimalist method to test your structures, but even this will take quite a while to build, even for a smallish database. As the number of tables climbs, the complexity rises quickly because of the intertable relationships that have to be satisfied. The goal is to build a test script that loads up a complete database full of data, tests expected failures along the way (using our technique to quash the errors we expect), and ends up with a full database.

sq.jpg  Note   In the download, I have included a script file named Chapter 6 – Database Create Only.sql that includes the minimal script to create the database and return the metadata. This will allow you to start with a clean database over and over without working through the entire chapter script.

The first step is to include delete statements to clear out all of the data in the database, except for any data that is part of the base load. The goal here is to make your test script repeatable so you can run your script over and over, particularly if you get an error that you don’t expect and you have to go fix your structures.

 SET NOCOUNT ON;
 USE ConferenceMessaging;
 GO
 DELETE FROM Messages.MessageTopic;
 DELETE FROM Messages.Message;
 DELETE FROM Messages.Topic WHERE TopicId <> 0; --Leave the User Defined Topic
 DELETE FROM Attendees.UserConnection;
 DELETE FROM Attendees.MessagingUser;

Deleting the data in the tables clears them out, but it won’t reset the identity values or the sequence objects, which helps make sure you aren’t relying on specific identity values in your tests. (If you did want to restart the generators between runs, a sketch follows.)
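
Restarting these generators isn’t part of the test script, but if you wanted to, DBCC CHECKIDENT reseeds an identity column and ALTER SEQUENCE restarts a sequence object. The sequence name shown here is an assumption for illustration; substitute whatever name the Topic default actually references:

 --Not part of the test script; shown only as a sketch
 DBCC CHECKIDENT ('Attendees.MessagingUser', RESEED, 0); --reseed an identity column
 ALTER SEQUENCE Messages.TopicIdGenerator RESTART;       --sequence name is assumed

Next, I will add a legal user to the MessagingUser table: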

 INSERT INTO [Attendees].[MessagingUser]
            ([UserHandle],[AccessKeyValue],[AttendeeNumber]
            ,[FirstName],[LastName],[AttendeeType]
            ,[DisabledFlag])
 VALUES ('FredF','0000000000','00000000','Fred','Flintstone','Regular',0);
 if @@ROWCOUNT <> 1 THROW 50000,'Attendees.MessagingUser Single Row Failed',16;

Next, we will test entering data that fails one of the check constraints. In the next statement, I will enter data with a user handle that is too short:

 BEGIN TRY --Check UserHandle Check Constraint
     INSERT INTO [Attendees].[MessagingUser]
                ([UserHandle],[AccessKeyValue],[AttendeeNumber]
                ,[FirstName],[LastName],[AttendeeType]
                ,[DisabledFlag])
     VALUES ('Wil','0000000000','00000001','Wilma','Flintstone','Regular',0);
     THROW 50000,'No error raised',16;
 END TRY
 BEGIN CATCH
     if ERROR_MESSAGE() not like
              '%CHK__Attendees_MessagingUser_UserHandle_LenthAndStart%'
         THROW 50000,'Check Attendees.MessagingUser.UserHandle didn''t work',16;
 END CATCH

When you execute this statement, you won’t get an error if the constraint you expect to fail is mentioned in the error message (and it will be if you have built the same database I have). Then, I will enter another row that fails the check constraint due to use of a non-alphanumeric character in the handle:

 BEGIN TRY --Check UserHandle Check Constraint
     INSERT INTO [Attendees].[MessagingUser]
                ([UserHandle],[AccessKeyValue],[AttendeeNumber]
                ,[FirstName],[LastName],[AttendeeType]
                ,[DisabledFlag])
     VALUES ('Wilma@','0000000000','00000001','Wilma','Flintstone','Regular',0);
     THROW 50000,'No error raised',16;
 END TRY
 BEGIN CATCH
     if ERROR_MESSAGE() not like
              '%CHK__Attendees_MessagingUser_UserHandle_LenthAndStart%'
         THROW 50000,'Check Attendees.MessagingUser.UserHandle didn''t work',16;
 END CATCH
 GO

sq.jpg  Note   This method of unit testing can be a bit confusing when casually checking things. In the previous block of code, the statement fails, but no error is returned. The goal is that you can run your test script over and over and get no output other than seeing rows in your tables. In practice, it is a lot cleaner to see only problematic output. If you prefer, add more output to your test script as best suits your needs.
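
One simple way to add output, if you want it, is to finish the script with a row count per table so you can eyeball what was loaded (a sketch; sys.dm_db_partition_stats requires VIEW DATABASE STATE permission):

 SELECT OBJECT_SCHEMA_NAME(object_id) + '.' + OBJECT_NAME(object_id) AS TABLE_NAME,
        SUM(row_count) AS TOTAL_ROWS
 FROM   sys.dm_db_partition_stats
 WHERE  index_id IN (0,1) --heap or clustered index only, to avoid double counting
   AND  OBJECT_SCHEMA_NAME(object_id) IN ('Attendees','Messages')
 GROUP BY OBJECT_SCHEMA_NAME(object_id) + '.' + OBJECT_NAME(object_id)
 ORDER BY TABLE_NAME;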

Skipping some of the simpler test items, we now arrive at a test of the unique constraint we set up based on RoundedMessageTime, which rounds MessageTime to the hour. (Some of the data to support these tests is included in the sample code; search for the comments in the example code to find our place in the download.) To test this, I will enter a row into the table and then immediately enter another at exactly the same time. Unless you happen to run this on a slow machine right at the turn of the hour, which is extremely unlikely, the two statements will execute in the same second (probably even the same millisecond), and certainly within the same hour.

 INSERT INTO [Messages].[Message]
            ([MessagingUserId]
            ,[SentToMessagingUserId]
            ,[Text]
            ,[MessageTime])
 VALUES
    ((SELECT MessagingUserId FROM Attendees.MessagingUser WHERE UserHandle = 'FredF')
    ,(SELECT MessagingUserId FROM Attendees.MessagingUser WHERE UserHandle = 'WilmaF')
    ,'It looks like I will be late tonight'
    ,GETDATE());
 IF @@ROWCOUNT <> 1 THROW 50000,'Messages.Messages Single Insert Failed',16;
 GO

Then, this statement will cause an error that should be caught in the CATCH block:

 BEGIN TRY --Unique Message Error…
     INSERT INTO [Messages].[Message]
                ([MessagingUserId]
                ,[SentToMessagingUserId]
                ,[Text]
                ,[MessageTime])
     VALUES
        ((SELECT MessagingUserId FROM Attendees.MessagingUser WHERE UserHandle = 'FredF')
        ,(SELECT MessagingUserId FROM Attendees.MessagingUser WHERE UserHandle = 'WilmaF')
        ,'It looks like I will be late tonight'
        ,GETDATE());
     THROW 50000,'No error raised',16;
 END TRY
 BEGIN CATCH
     if ERROR_MESSAGE() NOT LIKE '%AK_Messages_Message_TimeUserAndText%'
         THROW 50000,'Unique Message Error didn''t work (check times)',16;
 END CATCH
 GO

If the error occurs, it is trapped, and we know the constraint is working. If no error occurs, the “No error raised” THROW will raise one. Finally, I will show in the text the most complicated error-checking block we have to deal with, involving the Message and MessageTopic tables. In the download, I insert the two successful cases, first with a specific topic and then with a user-defined topic. In the next block, I will show the failure case.

 --Do this in a more natural way. Usually the client would pass in these values
 DECLARE @messagingUserId int, @text nvarchar(200),
         @messageTime datetime2, @RoundedMessageTime datetime2(0);
 SELECT @messagingUserId = (SELECT MessagingUserId FROM Attendees.MessagingUser
                            WHERE UserHandle = 'FredF'),
        @text = 'Oops Why Did I say That?', @messageTime = SYSDATETIME();
 --uses the same algorithm as the check constraint to calculate part of the key
 SELECT @RoundedMessageTime = (
     DATEADD(HOUR,DATEPART(HOUR,@MessageTime),CONVERT(datetime2(0),CONVERT(date,@MessageTime))));
 BEGIN TRY
     BEGIN TRANSACTION;
     --first create a new message
     INSERT INTO [Messages].[Message]
                ([MessagingUserId],[SentToMessagingUserId]
                ,[Text],[MessageTime])
     VALUES (@messagingUserId,NULL,@text,@messageTime);
     --then insert the topic, but this will fail because the General topic is not
     --compatible with a UserDefinedTopicName value
     INSERT INTO Messages.MessageTopic(MessageId, TopicId, UserDefinedTopicName)
     VALUES ((SELECT MessageId
              FROM   Messages.Message
              WHERE  MessagingUserId = @messagingUserId
                AND  Text = @text
                AND  RoundedMessageTime = @RoundedMessageTime),
             (SELECT TopicId
              FROM   Messages.Topic
              WHERE  Name = 'General'),'Stupid Stuff');
     COMMIT TRANSACTION;
 END TRY
 BEGIN CATCH
     IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION; --don't leave the transaction open
     if ERROR_MESSAGE() not like
              '%CHK__Messages_MessageTopic_UserDefinedTopicName_NullUnlessUserDefined%'
         THROW 50000,'User Defined Message Check Failed',16;
 END CATCH

The test script provided with the download is just a very basic example, and it will take a while to get a good one created. It took me several hours to create this one for this simple six-table database, and inserting data in a script like this requires some ugly coding to make it work. In reality, I usually start with a simple script that only creates data, without testing the check constraints, unique constraints, and so on, because it is rare that you have time to build the complete test script before turning the database over to developers.

Once the process of building your unit tests is completed, you will find that it has helped you find issues with your design and any problems with constraints. In many cases, you may not want to put certain constraints on the development server immediately but instead work with the developers to decide when they are ready. As a DB developer, and a lapsed UI developer, I personally liked it when the database prevented me from breaking a fundamental rule, but your mileage may vary as to what works best. I will say this: as I created this script, I discovered a few semi-significant issues with the demo design for this chapter that likely wouldn’t have been noticed otherwise, except by you, the intrepid reader, as you work through the design.

Best Practices

The following are some of the most important best practices when implementing your database structures. Pay particular attention to the advice about UNIQUE constraints. Having only a surrogate key on a table is one of the worst mistakes made when implementing a database.

  • Invest in database generation tools: Do this after you know what the tool should be doing (not before). Implementing tables, columns, relationships, and so on is a tedious and painful task when done by hand. There are many great tools that double as logical data modeling tools and also generate these objects, as well as sometimes the objects and code to be covered in the upcoming three chapters.
  • Maintain normalization: As much as possible, try to maintain the normalizations that were designed in Chapter 5. It will help keep the data better protected and will be more resilient to change.
  • Develop a real strategy for naming objects: Keep the basics in mind:
    • Give all objects reasonably user-friendly names. Make sure that it’s obvious—at least to support personnel—what the purpose of every object, column, and so on is without digging into documentation, if at all possible.
    • Have either all plural or all singular names for tables. Consistency is the key.
    • Have all singular names for columns.
    • Personally, I use singular names for both tables and columns.
  • Develop template domains: Reuse in every case where a similar datatype is needed. This cuts down on time spent implementing and makes users of the data happy, because every time they see a column called Description, it’s likely that it will have the same characteristics of other like columns.
  • Carefully choose the datatype and nullability for each column: These are the first level of data protection to keep your data clean and pure. Also, improper datatypes can cause precision difficulties with numbers and even performance issues.
  • Make certain that every table has at least one UNIQUE constraint that doesn’t include an artificial value: It’s a good idea to consider using an IDENTITY column as the primary key. However, if that is the only UNIQUE constraint on the table, then there can (and usually will) be duplication in the real columns of the table—a bad idea (see the sketch following this list).
  • Implement foreign keys using foreign key constraints: They’re fast, and no matter what kind of gaffes a client makes, the relationship between tables cannot be gotten wrong if a foreign key constraint is in place.
  • Document and script everything: Using extended properties to document your objects can be extremely valuable. Most of all, when you create objects in the database, keep scripts of the T-SQL code for later use when moving to the QA and production environments. A further step of keeping your scripts in a source control repository is a definite good next step as well so you can see where you are, where you are going, and where you have been in one neat location.
  • Develop a test script: Test your structures as much as possible. Testing is often the forgotten step in database design, but good testing is essential to know your design works.
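
As a sketch of the UNIQUE constraint advice, using a table from this chapter (the chapter script already includes an equivalent alternate key, so this is purely illustrative):

 --a natural key on UserHandle in addition to the surrogate primary key
 ALTER TABLE Attendees.MessagingUser
     ADD CONSTRAINT AK_Attendees_MessagingUser_UserHandle UNIQUE (UserHandle);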

Summary

This has been a long chapter covering a large amount of ground. Understanding how to build tables, and how they’re implemented, is the backbone of every database designer’s knowledge.

After being satisfied that the model was ready to implement, I took a deep look at SQL Server tables, walking through the process of creating the database using the CREATE TABLE and ALTER TABLE syntax for adding constraints and modifying columns, and even creating triggers to manage automatically maintained columns. General guidelines were given for naming tables, columns, indexes, and foreign key constraints. The key to good naming is consistency, and if the naming standards I suggested here seem too ugly, messy, or just plain weird to you, choose your own. Consistency is the most important thing for any naming convention.

The two most important sections of this chapter were on choosing datatypes and implementing unique keys. I completed this chapter by discussing the process of choosing primary keys and at least one natural key per table. Of course, the section on testing is important too, as good testing is the key to finding those obvious errors that some developers will eagerly point out to anyone who will listen while complaining about how much slower the process is with constraints, triggers, and such.

In the next chapter, I’ll show how to finish the task of implementing the base OLTP system by implementing the rest of the business rules required to keep the data in your database as clean as possible.
