CHAPTER  2


Getting It Right: Designing the Database for Performance

By Louis Davidson

Most of the topics in the book are related to admin tasks: indexing, performance tuning, hardware, backup, compression, and so forth. Many times, database administrators (DBAs) don’t have input into the design of the database they are responsible for. But even if you don’t have control over the design, it’s nice to know what a good design ought to look like—if for no other reason than you can be specific in your complaints when you are struggling with a piece of software.

In this chapter, I’ll present an overview of the entire process of designing a database system, discussing the factors that make a database perform well. Good performance starts early in the process, during the definition of the project, well before code is written. (Unfortunately, it really starts well above the pay grade of anyone who is likely to read a book like this.) When projects are hatched in the board room, there’s little understanding of how to create software and even less understanding of the unforgiving laws of time. Often the plan boils down to “We need software X, and we need it by a date that is completely pulled out of…well, somewhere unsavory.” So the planning process is cut short of what it needs to be, and you get stuck with the seed of a mess. No matter what part you play in this process, there are steps required to end up with an acceptable design that works, and as a database professional you need to do what you can to make sure those steps occur.

In this chapter, I’ll give you a look at the process of database design and highlight the factors that make the goals of many of the other chapters of this book a whole lot easier to attain. Here are the main topics I’ll talk about:

  • Requirements  Before your new database comes to life, there are preparations to be made. Many database systems perform horribly because the system that was initially designed bore no resemblance to the system that was needed.
  • Table Structure  The structure of the database is very important. Microsoft SQL Server works a certain way, and it is important that you build your tables in a way that gets the most out of the SQL Server engine.
  • Design Testing  The process of testing should go beyond code and extend into the design process, with the primary goal being to make sure the software that is created matches the needs of the user.

The process of database design is not an overly difficult one, yet it is so often done poorly. Throughout my years of writing, working, speaking, and answering questions in forums, easily 90 percent of the problems I have seen came from databases that were poorly planned, poorly built, and (perhaps most importantly) unchangeable because of the mounds of code accessing those structures. With just a little bit of planning and some knowledge of the SQL Server engine’s base patterns, you’ll be amazed at what you can achieve, and you can do it in a lot less time than initially expected.

Requirements

The foundation of any implementation is an understanding of what the heck is supposed to be created. Your goal in writing software is to solve some problem. Often, this is a simple business problem like creating an accounting system, but sometimes it’s a fun problem like shuffling riders on and off of a theme-park ride, creating a television schedule, creating a music library, or solving some other problem. As software creators, our goal ought to be to automate the brainless work that people have to do and let them use their brains to make decisions.

Requirements take a big slap in the face because they are the first step in the classic “waterfall” software-creation methodology. The biggest lie that is common in the programming community is that the waterfall method is completely wrong. The waterfall method states that a project should be run in the following steps:

  • Requirements Gathering  Document what a system is to be, and identify the criteria that will make the project a success.
  • Design  Translate the requirements into a plan for implementation.
  • Implementation  Code the software.
  • Testing  Verify that the software does what it is supposed to do.
  • Maintenance  Make changes to address problems not caught in testing.
  • Repeat the process.

The problem with the typical implementation of the waterfall method isn’t the steps, nor is it the order of the steps, but rather the magnitude of the steps. Projects can spend months or even years gathering requirements, followed by still more months or years doing design. After this long span of time, the programmers finally receive the design to start coding from. (Generally, it is slid under their door so that the people who devised it can avoid going into the dungeons where the programmers are located, shackled to their desks.) The problem with this approach is that the needs of the users change frequently in the years that pass before the software is completed. Or (even worse) as programming begins, it is realized that the requirements are wrong, and the process has to start again.

As an example, on one of my first projects as a consultant, we were designing a system for a chemical company. A key requirement we were given stated something along the lines of: “Product is only sold when the quality rating is not below 100.” So, being the hotshot consultant programmer who wanted to please his bosses and customers, I implemented the database to prevent shipping the product when the rating was 99.9999 or less, as did the UI programmer. About a week after the system was shipped, the true requirement was learned: “Product is only sold when the quality rating is not below 100…or the customer overrides the rating because they want to.” D’oh! So after a crazy few days where sleep was something we only dreamt about, we corrected the issues. It was an excellent life lesson, however. Make sure requirements make sense before programming them (or at least get it down in writing that you made sure)!

As the years have passed and many projects have failed, the pendulum has swung away from the pure waterfall method of spending years planning to build software, but too often the opposite now occurs. As a reaction to the waterfall method, a movement known as Agile has arisen. The goal of Agile is to shorten the amount of time between gathering requirements and shipping software from years to merely weeks. (If you want to know more about Agile, start with the manifesto at http://agilemanifesto.org/.) The criticisms of Agile are very often the exact opposite of those of the waterfall method: very little time is spent truly understanding the problems of the users, and after the words “We need a program to do…” are spoken, the coding is underway. The results are almost always predictably (if somewhat more quickly delivered…) horrible.

Note  In reality, Agile and Waterfall both can actually work well in their place, particularly when executed in the right manner by the right people. Agile methodologies in particular are very effective when used by professionals who really understand the needs of the software development process, but it does take considerable discipline to keep the process from devolving into chaos.

The ideal situation for software design and implementation lies somewhere between spending no time on requirements and spending years on them, but the fact is, the waterfall method at least has the order right, because each step I listed earlier needs to follow the one that comes before it. Without understanding the needs of the user (both now and in the future), the output is very unlikely to come close to what is needed.

So in this chapter on performance, what do requirements have to do with anything? You might say they have everything to do with everything. (Or you might not—what do I know about what you would say?) The best way to optimize the use of software is to build correct software. The database is the foundation for the code in almost every business system, so once it gets built, changing the structure can be extremely challenging. So what happens is that as requirements are discovered late in the process, the database is morphed to meet new requirements. This can leave the database in a state that is hard to use and generally difficult to optimize because, most of the time, the requirements you miss will be the obscure kind that people don’t think of immediately but that are super-important when they crop up in use. The SQL Server engine has patterns of use that work best, and well-designed databases fit the requirements of the user to the requirements of the engine.

As I mentioned in the opening section of this chapter, you might not be the person gathering requirements, but you will certainly be affected by them. Very often (even as a production DBA who does no programming), you might be called on to give opinions on a database. The problem almost always is that if you don’t know the requirements, almost any database can appear to be correct. If the requirements are too loose, your code might have to optimize for a ridiculously wide array of situations that might not even be physically possible. If the requirements are too strict, the software might not even work.

Going back to the chemical plant example, suppose that my consulting firm had completed our part of the project, we had been paid, the programmers had packed up to spend their bonuses at Disney World, and the software could not be changed. What then? The user would then find some combination of data that is illogical to the system but tricks the system into working. For example, they might enter a quality rating of 10,000 + the actual rating. This is greater than 100, so the product can ship. But now every usage of the data has to take into consideration that a value of 10,000 or greater actually means the real rating is the stored value minus 10,000 and that the customer accepted the product, while values under 100 are failed products that the customer did not accept. In the next section, I’ll discuss normalization, but for now take my word for it that designing a column that holds multiple values and/or multiple meanings is not a practice that you would call good, and it makes it more and more difficult to achieve adequate performance.
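To see the cost concretely, here is a minimal sketch (the table and column names are hypothetical, not from the example system) of the decoding logic every query would then have to repeat once the rating column carries two meanings:

-- Hypothetical decoding that every query must repeat once the rating column
-- is overloaded with the "10,000 + rating means customer override" trick.
SELECT ProductBatchId,
       CASE WHEN QualityRating >= 10000
            THEN QualityRating - 10000   -- strip the override marker
            ELSE QualityRating
       END AS ActualQualityRating,
       CASE WHEN QualityRating >= 10000
            THEN 1 ELSE 0
       END AS CustomerOverrideFlag
FROM   dbo.ProductBatch;

Every consumer of the data has to know this convention, and a single missed CASE expression quietly produces wrong answers.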

As a final note about requirements, requirements should be written in such a way that they are not implementation-specific. Stating that “We need to have a table to store product names and product quality scores” can reduce the quality of the software because it is hard to discern the actual requirements from poorly written prose. This statement could mean, “Products have names, and each batch of that product has a quality score” or “Each batch will have a different name and a quality score.” As you build your tables, your goal will be to match your design to the version that is correct. Sure, the first makes sense, but business doesn’t always make sense.

Note  In this short chapter on design and its effect on performance, I am going to assume you are acclimated to the terminology of the relational database and understand the basics. I also will not spend too much time on the overall process of design, but you should spend time between getting requirements and writing code visualizing the output, likely with a data-modeling tool. For more information on the overall process of database design, might I shamelessly suggest Pro SQL Server 2012 Relational Database Design and Implementation (Apress, 2012)? My goal for the rest of this chapter will be to cover the parts of design that most affect the performance of (as well as your happiness when dealing with) your SQL Server databases.

Table Structure

The engineers who work for Microsoft on the SQL Server team are amazing. They have built a product that, with each successive version, continues to improve at taking whatever set of structures is thrown at it, and whatever code a person of any skill level throws at it, and making that code work well. Yet for all of their hard work, the fact remains that the heart of SQL Server is a relational database engine, and you get a lot more value from the engine by following good relational practices. In this section, I will discuss what goes into making a database “relational,” and how you can improve your data quality and database performance by following the relational pattern as closely as possible.

To this end, I will cover the following topics:

  • A Really Quick Taste of History  Knowing why relational databases are what they are can help to make it clear why SQL Server is built the way it is and why you should structure your databases that way too.
  • Why a Normal Database Is Better Than an Extraordinary One  Normalization is the process of making a database work well with the relational engine. I will describe normalization and how to achieve it in a practical manner.
  • Physical Model Choices  There are variations on the physical structure of a database that can have distinct implications for your performance.

Getting the database structure to match the needs of the engine is the second most important part in the process of performance tuning your database code. (Matching your structure to the user’s needs being the most important!)

A Really Quick Taste of History

The concept of a relational database originated around 1970. (The term relation is mostly analogous to a table in SQL and does not reference relationships.) In 1970, Edgar F. Codd, who worked for the IBM Research Laboratory at the time, wrote a paper titled “A Relational Model of Data for Large Shared Data Banks,” which was printed in “Communications of the ACM.” (“ACM” stands for the Association for Computing Machinery, which you can learn more about at www.acm.org.) In this 11-page paper, which really should be required reading, Codd introduced to the world outside of academia the fairly revolutionary idea of how to break the physical barriers of the types of databases in use at that time.

Following this paper, in the 1980s Codd presented 13 rules (commonly known as Codd’s 12 rules, numbered 0 through 12) for what made a database “relational,” and it is quite useful to see just how much of the original vision persists today. I won’t regurgitate all of the rules, but the gist of them was that, in a relational database, all of the physical attributes of the database are to be encapsulated away from the user. It shouldn’t matter if the data is stored locally, on a storage area network (SAN), elsewhere on the local area network (LAN), or even in the cloud (though that concept wasn’t quite the buzzworthy one it is today). Data is stored in containers that have a consistent shape (the same number of columns in every row), and the system works out the internal storage details for the user. The larger point is that the implementation details are hidden from the users, so that data is interacted with only at a very high level.

Following these principles makes data far more accessible using a high-level language: you access data by knowing the table where the data resides, the column name (or names), and some piece of unique data that can be used to identify a row (also known as a “key”). This made the data far more accessible to the common user because, to access data, you didn’t need to know what sector on a disk the data was on, or a file name, or even the names of physical structures like indexes. All of those details were handled by software.

Changes to objects that are not directly referenced by other code should not cause the rest of the system to fail. So dropping a column that isn’t referenced would not bring down the system. Moreover, data should be treated not only row by row but also a set at a time. And if that isn’t enough, these tables should be able to protect themselves with integrity constraints, including uniqueness constraints and referential integrity, and the user shouldn’t need to know these constraints exist (unless they violate one, naturally). More criteria are included in Codd’s rules, including the need for NULLs, but this is enough for a brief infomercial on the concept of the relational database.

The fact is, SQL Server—and really all relational database management systems (RDBMSs)—is just now getting to the point where some of these dreams are achievable. Computing power in the 1980s was nothing compared to what we have now. My first SQL Server (running the accounting for a midsized, nonprofit organization) had less than 100 MB of disk space and 16 MB of RAM. My phone five years ago had more power than that server. All of this power we have now can go to a developer’s head and give him the impression that he doesn’t need to spend the time doing design. The problem is that data is addictive to companies: they get their first taste, they realize the power of data, and the quantity of data explodes, leading us back to the need to understand how the relational engine works. Do it right the first time…. That sounds familiar, right?

Why a Normal Database Is Better Than an Extraordinary One

In the previous section, I mentioned that the only (desirable) method of directly accessing data in a relational database is by the table, row key, and column name. This pattern of access ought to permeate your thinking when you are assembling your databases. The goal is that users have exactly the right number of buckets to put their data into, and when you are writing code, you do not need to break down data any further for usage.

As the years passed, the most important structural desires for a relational database were formulated into a set of criteria known as the normal forms. A table is normalized when it cannot be rewritten in a simpler manner without changing its meaning. That test should be concerned more with actual utilization than with academic exercises, and just because you can break a value into pieces doesn’t mean that you have to. Your decision should be based mostly on how data is used. Much of what you will see in the normal forms will seem very obvious. As a matter of fact, database design is not terribly difficult to get right, but if you don’t know what right is, it is a lot easier to get things wrong.

There are two distinct ways that normalization is approached. In a very formal manner, you have a progressive set of draconian “rules” that specify “forms” to adhere to. There is nothing wrong with that approach, and it’s an essential exercise to do before building a “great” database, but progressing through the forms in a stepwise manner is not how any seasoned data architect is likely to approach the process of design. Instead, you design with the principles of normalization in mind, and use the normal forms as a way to test your design.

The problem with getting a great database design is compounded by how natural the process seems. The first database that “past, uneducated me” built had 10+ tables—all of the obvious ones, like customers, orders, and so forth, set up so that the user interface could be produced to satisfy the client. However, addresses, order items, and other items were left as part of the main tables, making it a beast to work with for queries. As my employer wanted more and more out of the system, the design became more and more taxed (and the data became more and more polluted). The basics were there, but the internals were all wrong, and the design could have used about 50 or so tables to flesh out the correct solution. Soon after (at my next company, sorry Terry), I gained a real education in the basics of database design, and the little 1000-lumen light bulb in my head went off.

That light bulb went off because what had looked in my college database class like a more complicated database than a normal person would ever create was actually there to help designs fit the tools I was using (SQL Server 1.0). And because the people who create relational database engines use the same concepts of normalization to guide how the engine works, it was a win/win situation. So if the relational engine vendors are using a set of concepts to guide how they create the engine, it turns out to be quite helpful if you follow along.

In this section, I will cover the concept of normalization in two stages:

  • (Semi-)Formal Definition  Using the normal-form definitions, I will establish what the normal forms are.
  • Practical Application  Using a simple restatement of the goals of normalization, I will work through a few examples of normalization and demonstrate how violations of these principles will harm your performance as well as your data quality.

In the end, I will have established at least a basic version of what “right” is, helping you to guide your designs toward correctness. Here’s a simple word of warning, though: all of these principles must be guided by the user’s desires, or the best-looking database will be a failure.

(Semi-)Formal Definition

First, let’s look at the “formal” rules in a semi-formal manner. Normalization is stated in terms of “forms,” starting with the first normal form and including several others. Some forms are numbered, and others are named for the creators of the rule. (Note that in the strictest terms, to be in a greater form, you must also conform to the lesser forms. So you can’t be in the third normal form without also meeting the definition of the first.) It’s rare that a data architect actually refers to the normal forms specifically in conversation, unless they are trying to impress their manager at review time, but understanding the basics of normalization is essential to understanding why it is needed. What follows is a quick restatement of the normal forms:

  • First Normal Form /Definition of a Table  Attribute and row shape:
    • All columns must be atomic—one individual value per column that needn’t be broken down for use.
    • All rows of a table must contain the same number of values—no arrays or repeating groups (usually denoted by columns with numbers at the end of the name, such as the following: payment1, payment2, and so on).
    • Each row should be different from all other rows in the table. Rows should be unique.
  • Boyce-Codd Normal Form  Every possible key is identified, and all attributes are fully dependent on a key. All non-key columns must represent a fact about a key, a whole key, and nothing but a key. This form is an extension of the second and third normal forms, which are subsets of the Boyce-Codd normal form because they were initially defined in terms of a single primary key:
    • Second Normal Form  All attributes must be a fact about the entire primary key and not a subset of the primary key.
    • Third Normal Form  All attributes must be a fact about the primary key and nothing but the primary key.
  • Fourth Normal Form  There must not be more than one multivalued dependency represented in the entity. This form deals specifically with the relationship of attributes within the key, making sure that the table represents a single entity.
  • Fifth Normal Form    A general rule that breaks out any data redundancy that has not specifically been culled out by additional rules. If a table has a key with more than two columns and you can break the table into tables with two column keys and be guaranteed to get the original table by joining them together, the table is not in Fifth Normal Form. The form of data redundancy covered by Fifth Normal Form is very rarely violated in typical designs.

Note  The Fourth and Fifth Normal Forms will become more obvious to you when you get to the practical applications in the next section. One of the main reasons why they are seldom covered isn’t that they aren’t interesting, but more because they are not terribly easy to describe. However, examples of both are very accessible.

There are other, more theoretical forms that I won’t mention because it’s rare that you would even encounter them. In the reality of the software development life cycle, the stated rules are not hard-and-fast rules, but merely guiding principles you can use to avoid certain pitfalls. In practice, you might end up with denormalization (meaning purposely violating a normalization principle for a stated, understood purpose, not ignoring the rules to get the job done faster, which should be referred to as unnormalized). Denormalization occurs mostly to satisfy some programming or performance need from the consumer of the data (programmers, queriers, and other users).

Once you deeply “get” the concepts of normalization, you’ll find that you build a database like a well-thought-out Lego creation. You’ll design how each piece fits into the creation before putting the pieces together, because just as disassembling 1000 Lego bricks to make a small change turns Lego building into work rather than fun, redesigning a database after the fact is almost always work, usually accompanied by a manager who keeps looking at a watch while making accusatory faces at you. Some rebuilding based on keeping your process agile might be needed, but the more you plan ahead, the less data you will have to reshuffle.

Practical Application

In actual practice, the formal definitions of the rules aren’t referenced at all. Instead, the guiding principles that they encompass are referenced. I keep the following four concepts in the back of my mind to guide the design of the database I am building, falling back to the more specific rules for the really annoying or complex problems I am trying to solve:

  • Columns  Make sure every column represents only one value.
  • Table/row uniqueness  One row represents one independent thing, and that thing isn’t represented anywhere else.
  • Columns depend only on an entire key  Columns either are part of a key or describe something about the row identified by the key.
  • Keys always represent a single expression  Make sure the dependencies between three or more key values are correct.

Throughout this section, I’ll provide some examples to fortify these definitions, but this is a good point to define the term atomic. Atomic is a common way to describe a value that cannot be broken down further without changing it into something else. For example, a water molecule is made up of hydrogen and oxygen. Inside the molecule, you can see both types of atoms if you look really closely. You can split them up, and you still have hydrogen and oxygen. Try to split a hydrogen atom, and it will turn into something else altogether (and your neighbors are not going to be pleased one little bit). In SQL, you want to break things down to a level that makes them easy to work with without changing the meaning beyond what is necessary.

Tables and columns split to their atomic level have one, and only one, meaning in their programming interface. If you never need to use part of a column using SQL, a single column is perfect. (A set of notes that the user uses on a screen is a good example.) You wouldn’t want a paragraph, sentence, and character table to store this information, because the whole value should be useful only as a whole. If you are building a system to count the characters in that document, it could be a great idea to have one row per character.

If your tables are too coarsely designed, your rows will have multiple meanings that never share commonality. For example, if one row represents a baboon and the other represents a manager, even though the comedic value is worth its weight in gold, there is very likely never going to be a programming reason to combine the two in the same row. Too many people try to make objects extremely generic, and the result is that they lose all meaning. Still others make the table so specific that they spend extreme amounts of coding and programming time reassembling items for use.

As a column example, consider a column that holds the make, model, and color of a vehicle. Users will have to parse the data to pick out blue vehicles. So they will need to know the format of the data to get the data out, leading the database administrator to the eventual realization that all this parsing of data is slowing down the system and that having three columns in the first place would have made life much better.

At the same time, we can probably agree that the car model name should have a single column to store the data, right? But what if you made a column for the first character, the last character, and middle characters? Wouldn’t that be more normalized? Possibly, but only if you actually needed to program with the first and last characters independently on a regular basis. You can see that the example here is quite silly, and most designers stop designing before things get weird. But like the doctor will tell you when looking at a wound you think is disgusting, “That is nothing, you should have seen the…” and a few words later you are glad to be a computer programmer. The real examples of poor design are horribly worse than any example you can put in a chapter.

Columns

Make sure every column represents only one value.

Your goal for columns is to make sure every column represents only one value, and the primary purpose of this is performance. Indexes in SQL Server have key values that are complete column values, and they are sorted on complete column values. This leads to the desire that most (if not all, but certainly most) searches use the entire value. Indexes are best used for equality comparisons, and their next-best use is for range comparisons. Partial values are generally unsavory, with the only decent partial-value usage being a search on the leftmost characters of a string or binary value, because that is how the data is sorted in the index. To be fair, indexes can also be scanned to alleviate the need to touch the table’s data (and possibly overflow) pages, but this is definitely not the ideal utilization.
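For example, assuming a hypothetical Customer table with an index on LastName (this table is not part of the book example), the first query below can seek on the leftmost characters of the index key, while the second generally forces a scan:

-- Assumes a hypothetical dbo.Customer table with an index on LastName.
-- Leftmost-prefix search: the index can be used for a seek.
SELECT CustomerId, LastName
FROM   dbo.Customer
WHERE  LastName LIKE 'David%';

-- Leading wildcard: the index (or the table) generally has to be scanned.
SELECT CustomerId, LastName
FROM   dbo.Customer
WHERE  LastName LIKE '%son';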

To maximize index usage, you should never need to parse a column to get to a singular piece of data. A common scenario is a column that contains a comma-delimited list of values. For example, you have a table that holds books for sale. To make displaying the data more natural, the following table is built (the key of the table is BookISBN):

BookISBN    BookTitle      BookPublisher   Authors
----------- -------------  --------------- -----------
111111111   Normalization  Apress          Louis
222222222   T-SQL          Apress          Michael
333333333   Indexing       Microsoft       Kim
444444444   DB Design      Apress          Louis, Jessica

On the face of things, this design makes it easy for the developer to create a screen for editing, for the user to enter the data, and so forth. However, although the initial development is not terribly difficult, using the data for any reason that requires differentiating between authors is. What are the books that Louis was an author of? Well, how about the following query? It’s easy, right?

SELECT BookISBN, BookTitle
FROM   Book
WHERE  Authors LIKE '%Louis%'

Yes, this is exactly what most designers will do to start with. And with our data, it would actually work. But what happens when an author named “Louise” is added? And it is probably obvious to anyone that two people named Louis might write a book, so you need more than the author’s first name. So the question becomes whether you should have AuthorFirstName and AuthorLastName—that is, two columns, one containing “Louis, Jessica” and the other containing “Davidson, Moss”. And what about other bits of information about authors? What happens when a user uses an ampersand (&) instead of a comma (,)? And…well, these are the types of questions you should be thinking about when you are doing design, not after the code is written.

If you have multiple columns for the name, it might not seem logical to use the comma-delimited solution, so users often come up with other ingenious solutions. If you enter a book with the ISBN number of 444444444, the table looks like this (the key of this set is the BookISBN column):

BookISBN    BookTitle      BookPublisher   AuthorFirstName  AuthorLastName
----------- -------------  --------------- ---------------  --------------
444444444   DB Design      Apress          Jessica          Moss

That’s fine, but now the user needs to add another author, and her manager says to make it work. So, being the intelligent human being she is, the user must figure out some way to make it work. The comma-delimited solution feels weird and definitely not “right”:

BookISBN    BookTitle      BookPublisher   AuthorFirstName  AuthorLastName
----------- -------------  --------------- ---------------  --------------
444444444   DB Design      Apress          Jessica, Louis   Moss, Davidson

So the user decides to add another row and just duplicate the ISBN number. The uniqueness constraint won’t let her do this, so voila! The user adds the row with the ISBN slightly modified:

BookISBN     BookTitle      BookPublisher   Author
------------ -------------  --------------- -----------
444444444    DB Design      Apress          Jessica
444444444-1  DB Design      Apress          Louis

You might think this is grounds to fire the user, but the fact is, she was just doing her job. Until the system can be changed to handle this situation, your code has to treat these two rows as one row when talking about books, and treat them as two rows when dealing with authors. This means grouping rows when dealing with substringed BookISBN values or with foreign key values that could include the first or second value. And the mess just grows from there. To the table structures, the data looks fine, so nothing you can do in this design can prevent this from occurring. (Perhaps the format of ISBNs could have been enforced, but it is possible the user’s next alternative solution might have been worse.)
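As a sketch of what that grouping workaround looks like in code (assuming the nine-character ISBN values and the “-1” suffix convention shown above), every book-level query ends up doing something like this:

-- Hypothetical query once '-1' suffixes creep into BookISBN: every query
-- about books has to collapse the suffixed rows back into one logical book.
SELECT LEFT(BookISBN, 9)   AS BookISBN,
       MAX(BookTitle)      AS BookTitle,
       MAX(BookPublisher)  AS BookPublisher
FROM   dbo.Book
GROUP  BY LEFT(BookISBN, 9);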

Designing this book and author solution with the following two tables would be better. In a second table (named BookAuthor), the BookISBN is a foreign key to the first table (named Book), and the key to BookAuthor is BookISBN and AuthorName. Here’s what this solution looks like:

BookISBN    BookTitle      BookPublisher
----------- -------------  ---------------
111111111   Normalization  Apress
222222222   T-SQL          Apress
333333333   Indexing       Microsoft
444444444   DB Design      Apress

BookISBN    AuthorName  ContributionType
----------- ----------- --------------------
111111111   Louis       Primary Author
222222222   Michael     Primary Author
333333333   Kim         Primary Author
444444444   Louis       Primary Author
444444444   Jessica     Contributor

Note too that adding more data about the author’s contribution to the book was a very natural process of simply adding a column. In the single-table solution, identifying the author’s contribution would have been a nightmare. Furthermore, if you wanted to add royalty percentages or other information about the book’s author, it would be an equally simple process. You should also note that it would be easy to add a table for authors and expand the information about the author. In the example, you would not want to duplicate the data for Louis, even though he wrote two of the books in the example.
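With the two-table design, the “books by Louis” question also becomes a simple, index-friendly join (a sketch against the sample tables above):

-- Books written by Louis, using equality predicates the engine can seek on.
SELECT b.BookISBN, b.BookTitle
FROM   dbo.Book AS b
       JOIN dbo.BookAuthor AS ba
           ON ba.BookISBN = b.BookISBN
WHERE  ba.AuthorName = 'Louis';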

Table/row uniqueness

One row represents one independent thing, and that thing isn’t represented anywhere else.

The first normal form tells you that rows need to be unique. This is a very important point, but it needs to be more than a simple mechanical choice. Just having a uniqueness constraint with a meaningless value technically makes the data unique. As an example of how generated values lead to confusion, consider the following subset of a table that lists school mascots. (The primary key is on MascotId.)

MascotId    Name
----------- ------------------
1           Smokey
112         Bear
4567        Bear
9757        Bear

The rows are technically unique, because the ID values are different. If those ID numbers represent a number that the user uses to identify rows in all cases, this might be a fine table design. However, in the far more likely case where MascotId is just a number generated when the row is created and has no actual meaning, this data is a disaster waiting to occur. The first user will use MascotId 9757, the next user might use 4567, and the user after that might use 112. There is no real way to tell the rows apart. And although the Internet seems to tell me that the mascot name “Smokey” is used only by the University of Tennessee, the bear is a common mascot used not only by my high school but by many other schools as well.

Ideally, the table will contain a natural key (that is, a key based on columns that have a relationship to the meaning of the data being modeled, instead of an artificial key that has no such relationship). In this case, the combination of SchoolName and the mascot Name probably will suffice:

MascotId    Name               SchoolName
----------- ------------------ ----------------------------------
1           Smokey             University of Tennessee
112         Bear               Bradley High School
4567        Bear               Baylor University
9757        Bear               Washington University

You might also think that the SchoolName value is unique in and of itself, but many schools have more than one mascot. Because of this, you may need multiple rows for each SchoolName. It is important to understand what you are modeling and make sure it matches what your key is representing.
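If the requirements confirm that the school-plus-mascot-name combination must be unique, a minimal sketch of enforcing that alongside the surrogate key (datatypes and constraint names are assumptions) might look like this:

-- Hypothetical table: surrogate primary key plus a natural-key uniqueness
-- constraint so duplicate school/mascot combinations are rejected.
CREATE TABLE dbo.Mascot
(
    MascotId   int          NOT NULL CONSTRAINT PKMascot PRIMARY KEY,
    Name       varchar(50)  NOT NULL,
    SchoolName varchar(100) NOT NULL,
    CONSTRAINT AKMascot_SchoolName_Name UNIQUE (SchoolName, Name)
);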

Note  Key choice can be a contentious discussion, and it’s also a very important part of any design. The essential part of any design is that you can tell one row from another in a manner that makes sense to the users of the system. SQL Server physical considerations include what column is used to cluster the table (or physically order the internal structures), what columns are frequently used to fetch data based on user usage, and so forth. The physical considerations should be secondary to making sure the data is correct.

Why is uniqueness so important to performance? When users can do a search and know that they have the one unique item that meets their needs, their job will be much easier. When duplicated data goes unchecked in the database design and user interface, all additional usage has to deal with the fact that where the user expects one row, he might get back more than one.

One additional uniqueness consideration is that a row represents one unique thing. When you look at the columns in your tables, ask whether each column represents something independent of what the table is named and means. In the following table that represents a customer, check each column:

CustomerId        Name              Payment1    Payment2      Payment3
----------------- ----------------- ------------ ------------- --------------
0000002323        Joe's Fish Market 100.03      120.23        NULL
0000230003        Fred's Cat Shop   200.23      NULL          NULL

CustomerId and Name clearly are customer related, but the payment columns are completely different things from customers. So now two different sorts of objects are intermingled in one table. This is important because it becomes difficult to add a new payment object. How do you know what the difference is between Payment1, Payment2, and Payment3? And what if there turns out to be a fourth payment? To add the next payment for Fred’s Cat Shop, you might use some SQL code along these lines:

-- Find the first payment column that is still NULL and fill it; every new
-- payment slot requires another layer of CASE logic.
UPDATE dbo.Customer
SET Payment1 = CASE WHEN Payment1 IS NULL THEN 1000.00 ELSE Payment1 END,
    Payment2 = CASE WHEN Payment1 IS NOT NULL AND Payment2 IS NULL
                    THEN 1000.00 ELSE Payment2 END,
    Payment3 = CASE WHEN Payment1 IS NOT NULL
                         AND Payment2 IS NOT NULL
                         AND Payment3 IS NULL
                     THEN 1000.00 ELSE Payment3 END
WHERE CustomerId = '0000230003';

If payments were implemented as their own table, the table might look like this:

CustomerId      PaymentNumber  Amount        Date
--------------- -------------- ------------- --------------
0000002323      1              120.23        2012-10-15
0000002323      2              100.03        2012-11-12
0000230003      1              200.23        2012-08-13

Adding a new payment would be as simple as adding another row, and adding metadata about the payment is as easy as adding a single column. Although the payments example is probably a bit unrealistic, and certainly a case where most anyone would see the need for multiple rows, there are many cases where designers think that just using columns is the best answer. A system I am working with right now has email addresses stored as emailAddress1, emailAddress2, and emailAddress3. The emailAddress1 value is used straightforwardly as the primary email address, but the values in emailAddress2 and emailAddress3 are used either as previous email addresses or alternate email addresses. And without adding 10 to 20 new columns of metadata (one set for each email address), it will be nearly impossible to rein in usage and know what all the values mean.

Keep in mind that even if all of the payment entries are done manually through the UI, things like counting the number of payments tend to be difficult tasks. How many payments has Fred made? You could do something like this:

SELECT  CASE WHEN Payment1 IS NOT NULL THEN 1 ELSE 0 END +
        CASE WHEN Payment2 IS NOT NULL THEN 1 ELSE 0 END +
        CASE WHEN Payment3 IS NOT NULL THEN 1 ELSE 0 END AS PaymentCount
FROM    dbo.Customer
WHERE   CustomerId = '0000230003';

When you do it that way and you have to start doing this work for multiple accounts simultaneously, it gets complicated. In many cases, the easiest way to deal with this condition is to normalize the set, probably through a view:

CREATE VIEW  dbo.CustomerPayment
AS
   SELECT  CustomerId, 1 AS PaymentNumber, Payment1 AS PaymentAmount
   FROM    dbo.Customer
   WHERE   Payment1 IS NOT NULL
   UNION ALL
   SELECT  CustomerId, 2 AS PaymentNumber, Payment2 AS PaymentAmount
   FROM    dbo.Customer
   WHERE   Payment2 IS NOT NULL
   UNION ALL
   SELECT  CustomerId, 3 AS PaymentNumber, Payment3 AS PaymentAmount
   FROM    dbo.Customer
   WHERE   Payment3 IS NOT NULL

Now you can do all of your queries just as if the table were properly structured, although it’s not going to perform nearly as well as if the table were designed correctly:

SELECT CustomerId, COUNT(*)
FROM   dbo.CustomerPayment
GROUP  BY CustomerId

Now the customer table holds just the columns that are unique to the customer, and the view presents one unique row for each customer payment.
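Of course, the view is only a stopgap. If the design itself can be changed, then once the stopgap view is dropped, a properly structured payment table might look something like this (a minimal sketch; the names and datatypes are assumptions, and it presumes CustomerId is the key of dbo.Customer):

-- Hypothetical normalized payment table: one row per payment.
CREATE TABLE dbo.CustomerPayment
(
    CustomerId    char(10)      NOT NULL
        CONSTRAINT FKCustomerPayment_Customer REFERENCES dbo.Customer (CustomerId),
    PaymentNumber int           NOT NULL,
    Amount        decimal(12,2) NOT NULL,
    PaymentDate   date          NOT NULL,
    CONSTRAINT PKCustomerPayment PRIMARY KEY (CustomerId, PaymentNumber)
);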

Columns depend only on an entire key

Columns either are part of a key or describe something about the row identified by the key.

In the previous section, I focused on getting unique rows based on the correct kind of data. In this section, I focus on finding keys that might have been missed earlier in the process. The issues I am describing are simply dependencies between columns that aren’t quite right. For example, consider the following table (where X is the declared key of the table):

X             Y             Z
------------- ------------- ----------------
1             1             2
2             2             4
3             2             4

Values in the X column are unique, so that is fine. You can determine the corresponding Y and Z values from the value of X. Now look at the other, non-key columns. Given a value of Y, you can’t determine a specific value of X, but you seemingly can determine the value of Z. For all cases where Y = 1, you know that Z = 2, and when Y = 2, you know that Z = 4. Before you pass judgment, consider that this could be a coincidence. It is very much up to the requirements to help you decide if Y and Z are related (and it could be that the Z value determines the Y value also).

When a table is designed properly, any update to a column requires updating one and only one value. In this case, if Z is defined as Y*2, updating the Y column would require updating the Z column as well. If Y could be a key of your table, this would be acceptable as well, but Y is not unique in the table. By discovering that Y is the determinant of Z, you have discovered that YZ should be its own independent table. So instead of the single table you had before, you have two tables that express the previous table with no invalid dependencies, like this (where X is the key of the first table and Y is the key of the second):

X             Y
------------- -------------
1             1
2             2
3             2

Y             Z
------------- ----------------
1             2
2             4
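A minimal sketch of the split as declared tables (the names are placeholders) shows how the discovered dependency becomes explicit through a key and a foreign key:

-- Y determines Z, so the Y-Z pair gets its own table with Y as its key.
CREATE TABLE dbo.YZ
(
    Y int NOT NULL CONSTRAINT PKYZ PRIMARY KEY,
    Z int NOT NULL
);

CREATE TABLE dbo.XY
(
    X int NOT NULL CONSTRAINT PKXY PRIMARY KEY,
    Y int NOT NULL CONSTRAINT FKXY_YZ REFERENCES dbo.YZ (Y)
);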

For a somewhat less abstract example, consider the following set of data, representing book information (the key of the table is BookISBN):

BookISBN    BookTitle      PublisherName   PublisherLocation
----------- -------------  --------------- -------------------
111111111   Normalization  Apress          California
222222222   T-SQL          Apress          California
444444444   DMV Book       Simple Talk     England

BookISBN is the defined key, so every one of the columns should be completely dependent on this value. The title of the book is dependent on the BookISBN, and so is the publisher. The concern in this table is the PublisherLocation value. A book doesn’t have a publisher location, a publisher does. So if you needed to change the publisher, you would also need to change the publisher location.

To correct this situation, you need to create a separate table for the publisher. Following is one approach you can take (in which the key of the first table is BookISBN and the key of the second table is Publisher):

BookISBN    BookTitle      PublisherName
----------- -------------  ---------------
111111111   Normalization  Apress
222222222   T-SQL          Apress
444444444   DMV Book       Simple Talk

Publisher     PublisherLocation
------------- ---------------------
Apress        California
Simple Talk   England

Now a change of publisher for a book requires only changing the publisher value in the Book table, and a change to publisher location requires only a single update to the Publisher table. Note, however, that if PublisherLocation actually represented the location of the publisher at the time the book was printed, the design might have been just fine because you might have a new book with Apress and they could be in New York now:

BookISBN    BookTitle        PublisherName   PublisherLocation
----------- ---------------  --------------- -------------------
111111111   Normalization    Apress          California
222222222   T-SQL            Apress          California
444444444   DMV Book         Simple Talk     England
555555555   Expert Practices Apress          New York

Publisher     PublisherLocation
------------- ---------------------
Apress        New York
Simple Talk   England

Now you can get the current location of Apress, and the location of Apress when the book was published. Is this important information? If you think you know, you missed the point of the first section of this chapter. What matters is the requirements, not what seems to be right when you are designing.
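If the requirements do turn out to call for both values, a sketch of the query against the two tables above might look like this:

-- Location at publication time (carried on the Book row) alongside the
-- publisher's current location (from the Publisher table).
SELECT b.BookISBN,
       b.BookTitle,
       b.PublisherLocation AS LocationWhenPublished,
       p.PublisherLocation AS CurrentLocation
FROM   dbo.Book AS b
       JOIN dbo.Publisher AS p
           ON p.Publisher = b.PublisherName;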

Of course, you should consider all sorts of dependencies within the same table, like calculations (which ought to be done with a calculated column if the value is necessary and cannot be overridden), summary data like the total on an invoice, or the count of employees who work for a department. Usually, the cost to store and maintain incorrect dependencies is far more than the cost of calculating the values as you need them. In our example, if updating the Publisher table required updates to the Book table, you probably would have issues (except in the case of fixing a mistake in the current location, perhaps).
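For example, a simple derived value can be declared once as a computed column rather than stored and maintained separately (a minimal sketch with hypothetical names):

-- Hypothetical invoice line: the extended amount is derived, not stored twice.
CREATE TABLE dbo.InvoiceLineItem
(
    InvoiceLineItemId int           NOT NULL CONSTRAINT PKInvoiceLineItem PRIMARY KEY,
    Quantity          int           NOT NULL,
    UnitPrice         decimal(12,2) NOT NULL,
    ExtendedAmount    AS (Quantity * UnitPrice) PERSISTED
);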

Are there cases where calculating summary values at the time they are needed is more costly than the maintenance? Yes, but the best rule of thumb is to try the correct or easy way first. Usually, if your tables follow the normal patterns, optimization almost just happens.

Keys always represent a single expression

Make sure dependency relationships between three values or tables are correct.

Once you have reached the point that your tables have only atomic values for columns and tables have a single meaning, unique rows, and seemingly no dependency issues, you are really close to being done. In fact, you’re basically in Boyce–Codd normal form, which (as I stated earlier) is a better version of the third normal form that is so highly touted as the ultimate normal form. However, you aren’t quite done, and the last test is still interesting and corresponds to the fourth and fifth normal forms.

You need to check out whether the keys that have been chosen represent only one thing. Consider the following example table that represents the types of vehicles a driver likes to drive (where the keys of the table are Driver and VehicleStyle):

Driver   VehicleStyle     DesiredModelLevel
-------- ---------------- ------------------
Louis    CUV              Premium
Louis    Sedan            Standard
Ted      Coupe            Performance

The relationship between Driver and VehicleStyle represents a multivalued dependency for the Driver and the VehicleStyle entities. The driver Louis will drive either CUV or Sedan vehicles, and Louis is the only driver who drives the CUV style. As you add more data, each vehicle style will have many drivers that choose the type as a preference. A table such as this one for Driver and VehicleStyle is used frequently to resolve a many-to-many relationship between two tables—in this case, the Driver and VehicleStyle tables, which define the characteristics that pertain specifically to the driver and the vehicle style. The DesiredModelLevel in this case is a characteristic that applies only to the intersection of Driver and VehicleStyle.

The modeling problem comes when you need to model a relationship among three entities, represented (usually) as three key columns that come from three separate tables. As an example, consider the following table representing the assignment of a trainer to a class that is assigned to use a certain book (where all three columns currently make up the key):

Trainer    Class          Book
---------- -------------- -----------------------------
Louis      Normalization  DB Design & Implementation
Chuck      Normalization  DB Design & Implementation
Fred       Implementation DB Design & Implementation
Fred       Golf           Topics for the Non-Technical

To decide if this table is acceptable, you need to look at each column and determine how it is related to the others. If any two columns are not directly related to one another, there will be an issue with the table design because the key actually represents more than one independent thing. Here are the possible combinations and their relationships:

  • Class (in this case, the type of class) and Trainer are related, and a class might have multiple trainers.
  • Book and Class are related, and a book might be used for multiple classes.
  • Trainer and Book are not directly related, because the rule stated that the class uses a specific book.

Hence, regardless of how correct the table initially seemed, what you really have here are two independent types of information being represented in a single table. To deal with this, you will split the table on the column that is common to the two dependent relationships. Take this one table and make two tables that express the data that was in the first table:

Class           Trainer
--------------- --------------
Normalization   Louis
Normalization   Chuck
Implementation  Fred
Golf            Fred

Class          Book
-------------- --------------------------
Normalization  DB Design & Implementation
Implementation DB Design & Implementation
Golf           Topics for the Non-Technical

Joining these two tables together on Class, you will find that you get the exact same data as before. However, if you change the book for the Normalization class, it will be changed immediately for all of the trainers who teach that class. Note that initially it seems like you have more data because you have more rows and more tables. However, notice the redundancy in the following data from the original design:

Louis  Normalization  DB Design & Implementation
Chuck  Normalization  DB Design & Implementation

The redundancy comes from stating twice that the book DB Design & Implementation is used for the Normalization class. The new design conveys that same information with one less row of data. When the system grows to the point of having 50 Normalization classes being taught, you will have much less data, making the storage of data more efficient, possibly creating some performance benefits, and reducing the amount of redundant data that can get out of sync.
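A sketch of the join that reconstructs the original three-column view from the two smaller tables (assuming they are named ClassTrainer and ClassBook, names not used in the text) might look like this:

-- Rebuild the original Trainer/Class/Book rows from the two smaller tables.
SELECT ct.Trainer, ct.Class, cb.Book
FROM   dbo.ClassTrainer AS ct
       JOIN dbo.ClassBook AS cb
           ON cb.Class = ct.Class;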

Note, too, that if the table was modeling an instance of the class, the key would be class, time, location, and the teacher and book would be simply columns. (This assumes a class can have only one teacher and classroom; otherwise, you end up with more multivalued dependencies.) Most of these conditions are caught during testing, but not always.

As an alternate situation, consider the following table of data, which might be part of the car rental system that I used in the earlier example. This table defines the brands of vehicles that the driver will drive:

Driver              VehicleStyle                 VehicleBrand
------------------- ---------------------------- -----------------------
Louis               Station Wagon                Ford
Louis               Sedan                        Hyundai
Ted                 Coupe                        Chevrolet

  • Driver and VehicleStyle are related, representing the style the driver will drive.
  • Driver and VehicleBrand are related, representing the brand of vehicle the driver will drive.
  • VehicleStyle and VehicleBrand are related, defining the styles of vehicles the brand offers that can be rented.

This table defines the types of vehicles that the driver will take. Each of the columns has a relationship to the others, so the key properly represents a single thing. As our final multiple-key example, consider the following key that is meant to represent book Authors and Editors:

Book                Author                     Editor
------------------- -------------------------- ----------------
Design              Louis                      Jonathan
Design              Jeff                       Leroy
Golf                Louis                      Steve
Golf                Fred                       Tony

There are two possible interpretations that might be made somewhat clear by the name of the table (and the documentation as well):

  • This key does not represent a singular value if it represents the following independent relationships:
    • The book Design has authors Louis and Jeff, and it has editors Jonathan and Leroy.
    • The book Golf has authors Louis and Fred, and it has editors Steve and Tony.
  • The key does represent a singular value if it represents the following:
    • For the book Design, editor Jonathan edits Louis’ work and editor Leroy edits Jeff’s work.
    • For the book Golf, editor Steve edits Louis’ work and editor Tony edits Fred’s work.

In the first case, the author and editor are independent of each other, meaning that, technically, you should have a table for the Book-to-Author relationship and another for the Book-to-Editor relationship. In the second case, the author and editor are directly related. Hence, all three of the values are required to express the single thought of “for book X, editor Y edits Z’s work.”

As in many cases we have discussed so far, the goal of this rule is to get the design right so that you don’t have to mess around with poorly formatted data. When the keys that you expect to represent a single value actually represent multiple values, your data gets messy, and dealing with bad data is the single largest cause of poor performance on many levels. Of course, sometimes the performance issues arise only after you discover that your data is massively messed up and you have layered on tons of hacks to keep it in order.

Physical Model Choices

In this section, I will discuss a few of the physical model choices that can greatly affect the performance of your database system. Normalization gets you to the point of having the data structured in the correct way, but next you have to fine-tune the implementation details. In this section, I will talk about the following topics:

  • Datatype  Datatypes are often implemented as an afterthought. This is actually one of the more important choices you make, both for data-integrity purposes and in terms of performance.
  • Constraints  Protecting the integrity of the data might not have an immediate positive effect on the performance of your system, but not having to worry about the quality of the data pays tremendous benefits when the data is used.
  • Indexing  A database without indexes is probably unusable. (For starters, uniqueness constraints are implemented with indexes.) Getting the indexing right during design (and testing) can limit the amount of tuning you need to do once the cranky users get their hands on your code.

One choice that I won’t talk about any more than I have already is primary key choice. Choosing to use natural key values or surrogates is not likely to kill your system’s performance as long as your choice is reasonable. Either approach has its own difficulties depending on how complex your natural key values are or what you choose for a surrogate value. I will touch on the indexing considerations in the indexing section, but as long as your indexes are optimum and your data is unique enough that the user doesn’t have to randomly choose between rows, any reasonable primary key choice is going to be just fine.

Datatype

Choosing proper datatypes to match the domain chosen during logical modeling is an important task. One datatype might be more efficient than another of a similar type. For example, you can store integer data in an integer datatype, a numeric datatype, a floating-point datatype, or even a varchar(10) type, but these datatypes are certainly not alike in implementation or performance. Matching the datatype to the need is a large step in getting the data quality that will make your users happy with their database.

Getting the datatype right is the first step in getting the implementation correct, so it can really help to spend a reasonable amount of time here making sure it is right. Too many databases end up with all datatypes the same size and nullable (except for primary keys, if they have them) and lose the integrity of having properly sized and constrained data storage.

It usually isn’t that difficult a task, but all too often it is a choice that is ignored. Too many databases contain no more than three datatype configurations (consider varchar(10) and varchar(20) as two configurations of varchar). Using int, varchar(200), and varchar(max), you could implement almost any database. Of course, there are two problems with using only very generic datatypes for every column.

First, although you could implement most numbers as integers, some numbers need decimal places. (Yes, you could store your monetary values in pennies instead of dollars, but that would confuse the heck out of most users and make using the data confusing.) Similarly, you can put almost any other type of data in a varchar(200) or a varchar(max): any data that you can store in SQL Server can be translated into a textual representation, but not necessarily in a manner that is conducive to use, either by the user or the computer system. (The second problem, performance, is covered shortly.) The first goal of datatypes is to represent the data in the most natural manner so that the computer can work with it best and the user will be less likely to make a mistake. SQL Server provides a rich set of datatypes, including the following groups of types:

  • Precise numeric data  Stores data with no loss of precision due to storage. This group includes integer types, which are implemented in hardware (bit, tinyint, smallint, int, bigint); decimal; money; and smallmoney. (The money types are based on integers, which leads to some mathematical limitations, and they should rarely be used. Use decimal instead. A sketch of this rounding behavior follows this list.)
  • Approximate numeric data  Stores approximations of numbers. This group provides for a large range of values. It includes the float and real floating point types, which are based on IEEE standards and are supported via hardware but have round-off issues that have to be understood. Use this group primarily for scientific applications that may need to store extremely small or extremely large values in the same column, and can handle a small amount of rounding error.
  • Date and time  Stores date values, including the time of day. This group includes the types date, time, datetime2, datetimeoffset, smalldatetime, and datetime. (Note: You should phase out usage of smalldatetime and datetime and use the more standards-oriented datetime2, though neither is technically deprecated.)
  • Character (or string) data  Used to store textual data, such as names, descriptions, notes, and so on. This group includes char, varchar, varchar(max), and their Unicode equivalents: nchar, nvarchar, and nvarchar(max). (Note that there are also text/ntext datatypes that are deprecated (and fairly unpleasant) and that you should phase out for the varchar(max)/nvarchar(max) datatypes, immediately if not sooner.)
  • Binary data  Data stored in bytes (rather than as human-readable values), such as files or images. This group includes binary, varbinary, and varbinary(max). (Note that there is also an image datatype that is deprecated and that you should phase out for the varbinary(max) datatype, immediately if not sooner.)
  • And the rest  Datatypes that don’t fit into any other groups nicely but are still interesting. This group includes rowversion (also known as timestamp), uniqueidentifier, sql_variant, XML, hierarchyId, and spatial types (such as geometry and geography).
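As a small illustration of the money type’s intermediate rounding mentioned in the list above (a sketch only; the exact drift depends on the values used):

DECLARE @money   money         = 100.00,
        @decimal decimal(19,4) = 100.00;

-- money rounds each intermediate result to four decimal places, so dividing
-- and then multiplying back can drift away from the original value, while
-- the decimal version carries more precision through the intermediate step.
SELECT  @money   / 345 * 345 AS MoneyResult,
        @decimal / 345 * 345 AS DecimalResult;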

Picking the right datatype from these types to match your need is an important step in establishing data integrity, and it’s crucial for performance. (There is even the sql_variant type that you can use in the rare case that you can’t actually predict the datatype.) SQL Server (believe it or not) is optimized to work with data in a certain way, and to do comparisons in each datatype in its natural way, not in the way that you may imagine it will in a given situation. For example, say you store numbers in a character string and then evaluate the following expression: '100' > '2'. Just looking at this, your mind really wants to say that it is true. CAST('100' AS int) > CAST('2' AS int) will evaluate to true, but '100' > '2' will evaluate to false, because character strings are ordered from left to right, character by character. This is obvious when you ask whether 'Apple' < 'Microsoft'. That expression evaluates to true because ‘A’ is less than ‘M’, regardless of the length of the strings (and for no other reason, no matter what humor I might be trying to derive from the names of those companies).
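You can verify this behavior directly; the following sketch returns 'false' for the string comparison and 'true' once the values are cast to integers:

-- Character strings compare left to right, character by character,
-- so '100' sorts before '2' even though 100 is the larger number.
SELECT  CASE WHEN '100' > '2'
             THEN 'true' ELSE 'false' END AS StringComparison,
        CASE WHEN CAST('100' AS int) > CAST('2' AS int)
             THEN 'true' ELSE 'false' END AS IntegerComparison;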

If you don’t get how a relational database uses indexes, the significance of the previous paragraph has probably not yet hit you. (You also really need to read the indexing chapter later in this book.) In procedural programming, executing CAST(value AS type) > CAST(value2 AS type) is no slower than executing value > value2. You’ll see this if you frequently use SQL Server Integration Services (SSIS) to do comparisons, because when working with the data row by row you do a lot of type casts and uppercasing before comparing, since those comparisons are case sensitive.

But in a set-based and declarative language like SQL, your goal isn’t to take complete control over the entire process, but rather to set up the scenario for the engine to get the right answer fast. There are two ways that indexes are used optimally in a query like the following one:

SELECT <columnList>
FROM   <tableName>
WHERE  <column1> = <comparisonValue1>

The ideal case is that column1 and comparisonValue1 have the exact same datatype. The next best case is that the comparisonValue can be implicitly converted to the datatype of the column. Implicit conversion is generally possible when one datatype is less complex than the other. For example, you can implicitly convert an ASCII string to a Unicode string, but you cannot do it the other way around. For the complete chart, check the “Data Type Precedence” topic in SQL Server Books Online. As an example, consider a table such as this (which I will create in tempdb):

CREATE TABLE dbo.testType
(
    varcharType  varchar(20),   -- non-Unicode character column
    nvarcharType nvarchar(20)   -- Unicode character column
)

Obviously, this is not the most useful table, but even a table this simple will show you how performance is affected by a simple datatype choice. In the following query, the column is higher in datatype precedence, so the literal value, which is lower in precedence, will be converted to the column’s type:

SELECT  *
FROM    testType
WHERE   nvarcharType = 'Test'

This is okay, because the optimizer will convert the literal value to the more complex type. (The value that is implicitly converted will always be less complex, because implicit conversions are always non-lossy in nature.) In the query plan, you will find the following expression: [tempdb].[dbo].[testType].[nvarcharType] = CONVERT_IMPLICIT(nvarchar(4000),[@1],0). However, consider the following query:

SELECT  *
FROM    testType
WHERE   varcharType = N'Test' --Note: A capital N before a string literal makes it UNICODE

You will see the following expression: CONVERT_IMPLICIT(nvarchar(20),[tempdb].[dbo].[testType].[varcharType],0)=[@1]. This indicates that the column value has been converted, effectively rendering any indexes on the varcharType column useless for any seek operation. Note, too, that character values have a collation that controls how they are sorted and compared, and collation differences can cause similar mismatches.
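One way to avoid that second problem is to make sure the comparison value has the same type as the column. A minimal sketch, using a hypothetical local variable in place of the literal:

-- Declaring the search value as varchar(20) matches the column's type, so no
-- conversion is applied to the column side and an index on varcharType can
-- still be used for a seek.
DECLARE @searchValue varchar(20) = 'Test';

SELECT  *
FROM    testType
WHERE   varcharType = @searchValue;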

These are simple examples, but these issues crop up all of the time because there are plenty of places where you have to compare values: in joins, searches, and so on. Attention to the detail of getting the datatypes right can pay great dividends in terms of performance and data quality. In the next section, I’ll expand the topic of data quality and limiting data beyond datatypes, but the datatype is quite often the first physical model choice you will make.

Constraints

Once you have designed the tables and picked the datatypes to match the need, you need to constrain the data to only what is “legal.” The more correct you can keep the data, the less time you’ll spend correcting mistakes. Any data that can realistically be constrained ought to be. For example, often someone will design a column to hold the country name in an address. There are fewer than two hundred countries in the world, but if you give users a simple text column to enter the country name, you might get over 200 different spellings of each country. With a little ingenuity, you can prevent misspellings by using one of SQL Server’s declarative constraints. It might not be possible to validate all data—for example, in the US, it is pretty easy to validate states, but validation gets more difficult as you get more granular. Validating counties and cities gets costly, and making sure postal codes match the city and state is a bit more difficult because there are so many allowable exceptions. Validating complete addresses may be worth the effort only when you are doing bulk mailing operations that will give you a mailing discount.

To aid in constraining and guiding data-integrity challenges, SQL Server has five kinds of declarative constraints:

  • NULL  Determines if a column will accept NULL for its value. Though NULL constraints aren’t objects like the other constraint types, they behave very much in the same manner.
  • PRIMARY KEY and UNIQUE constraints  Used to make sure your rows contain only unique combinations of values over a given set of key columns. This type of constraint is used to protect the keys you design from duplicate data.
  • FOREIGN KEY  Used to make sure that any foreign-key references have only valid values that match the key columns they reference.
  • DEFAULT  Used to set an acceptable default value for a column when the user doesn’t provide one. (Some people don’t count defaults as constraints, because defaults don’t constrain updates.)
  • CHECK  Used to limit the values that can be entered into a single column or an entire row.

Constraints are part of the base implementation of a table that helps to ensure data quality. You should use constraints as extensively as possible to protect your data, because they’re simple and, for the most part, have minimal overhead. Of course, they are not free, and they can slow down your data creation to some extent (definitely for CHECK, FOREIGN KEY, and PRIMARY KEY/UNIQUE constraints), but poorly formatted data is far worse for your performance over time.
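As a minimal sketch of the five constraint types together (the table, column, and constraint names here are made up purely for illustration):

-- Hypothetical lookup table to support the FOREIGN KEY below.
CREATE TABLE dbo.Department
(
    DepartmentId int NOT NULL CONSTRAINT PKDepartment PRIMARY KEY
);

CREATE TABLE dbo.Employee
(
    EmployeeId   int NOT NULL                                   -- NULL constraint
        CONSTRAINT PKEmployee PRIMARY KEY,                      -- PRIMARY KEY
    EmployeeCode char(5) NOT NULL
        CONSTRAINT AKEmployee_EmployeeCode UNIQUE,              -- UNIQUE
    DepartmentId int NOT NULL
        CONSTRAINT FKEmployee_Department
            REFERENCES dbo.Department (DepartmentId),           -- FOREIGN KEY
    HireDate     date NOT NULL
        CONSTRAINT DFLTEmployee_HireDate
            DEFAULT (CAST(SYSDATETIME() AS date)),              -- DEFAULT
    Salary       decimal(19,4) NOT NULL
        CONSTRAINT CHKEmployee_SalaryPositive
            CHECK (Salary > 0)                                  -- CHECK
);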

Note  SQL Server also provides triggers that can be used for complex data-integrity implementation. However, they should be used only as a last resort because they can be tricky to get correct and can be very harmful to data integrity if not used correctly.

One of the greatest aspects of all of SQL Server’s constraints (other than defaults) is that the query optimizer can use them to optimize queries, because the constraints tell the optimizer about some additional quality aspect of the data. For example, say you place a constraint on a column that requires all values for that column to fall between 5 and 10. If a query is executed that asks for all rows with a value greater than 100 for that column, the optimizer will know without even looking at the data that no rows meet the criteria.
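A hedged sketch of the 5-to-10 example just described; with the trusted CHECK constraint in place, the contradictory query can typically be answered from the plan alone (a Constant Scan) without touching the table:

CREATE TABLE dbo.BoundedValue
(
    BoundedValueId int NOT NULL CONSTRAINT PKBoundedValue PRIMARY KEY,
    Value          int NOT NULL
        CONSTRAINT CHKBoundedValue_Range CHECK (Value BETWEEN 5 AND 10)
);

-- The constraint guarantees no row can have a Value greater than 10, so the
-- optimizer can detect the contradiction and skip reading the table.
SELECT  *
FROM    dbo.BoundedValue
WHERE   Value > 100;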

Regardless of whether constraints help query performance directly, their part in maintaining data quality is one of the biggest factors in overall system performance. Beyond the topics I covered on normalization and datatypes, consider a database that contains the calendar of speakers for an event. Suppose that the database is perfectly designed but a few speaker assignments are wrong. Now every user has to qualify every use of the data by manually checking for bad values (a significant waste of time) and then dealing with the values that are wrong.

Of course, while you can do a lot with constraints, you can only do so much. Users are humans (for the most part), so sometimes they hear things wrong. Sometimes they spell things wrong. (Trust me, the name Louis has always been a delight to see spelled; at least some of the spellings are masculine.) For example, say you have spent a large amount of time designing a customer database and your user has the following conversation with the customer:

Computer operator: “Hi, what is your first name?”

Phone Voice: “Fred.”

[Computer keys click; pan to computer screen.]

Input invalid: 23720823437 is not a valid person name

Computer operator: “Hmm, I think I misspelled that. Can you spell it again?”

Phone Voice (less happily): “F-r-e-d.”

[Computer keys click; pan to computer screen.]

First Name (as shown on screen): Jerkface

There is really nothing you can do about these types of “mistakes.” First, the caller’s name might actually have been Jerkface (though probably spelled Jearkfase!), so maybe it was entered correctly, and the software has no way of knowing the difference. Second, it might have been a female voice, and the name “Fred” could have been wrong because it is a very uncommon female name (though the name Frederica could be shortened to Fred). These sorts of data issues are just the start of how complicated it is to stop every kind of data-integrity problem. I have worked with databases of addresses (all adequately designed for their purpose) with many permutations of American state, country, city, and street names that just couldn’t be correct. Yes, there is a country named Georgia, but it most likely doesn’t have a state named Atlanta.

A useful implementation pattern is to take a column with a naturally fixed domain (such as the primary colors, or names of football teams in a league) and instead of trying to have a constraint that hard-codes a list of values (or perhaps hard-codes the list of values in the UI layer), create a table that implements a fixed domain. For example, suppose you have an address table like this:

CREATE TABLE Address
(
       …
       CountryName varchar(30)
       …
);

This table would allow users to enter anything for the country. Instead of leaving this open for users to enter fifteen different spellings of “United States,” adding a table to implement the domain of countries such as the following forces users to always enter (or choose) an actual, correctly spelled value.

CREATE TABLE Country
(
       CountryName varchar(30) NOT NULL PRIMARY KEY
);
ALTER TABLE Address
    ADD FOREIGN KEY (CountryName) REFERENCES Country (CountryName);

Now the only values that can be contained in the Address.CountryName column must also exist in the Country table. Without changing the existing schema, you have constrained the values to a set of known values that cannot be violated. Using a domain table has several benefits:

  • The domain is known to users  Because the values are in a table, users can use them to see what the possible values are without having foreknowledge of the system or looking in the metadata.
  • Associating additional information with the value is easy  This domain might have information that is useful for users of the data. For example, in the country example mentioned earlier, you might associate an abbreviation with the country, population, geographic information, and so forth.

I nearly always include tables for all domains that are essentially lists of items because it’s far easier to manage, even if it requires more tables. (This is particularly true when you want to allow users to make changes on their own.) The choice of key in this case is a bit different than for most tables. Sometimes I use an artificial key for the actual primary key, and other times I use a natural key. The general difference in key choice is whether or not using an integer or GUID key value has value for the client’s implementation, usually based on whether their implementation tools care about tables’ primary key structures. Either a natural or an artificial primary key will suffice for tables that are specifically built to implement a domain.

Of course, the key benefit of constraints of any sort is the improved performance users experience when using the data. This improvement occurs because you don’t need to clean the data on every usage. When the data contains “USA,” “U.S.A,” and “United States,” this looks to your code like three values, even though you know it’s actually just one. And while known variants like those can be repaired easily enough, a misspelling like “Urnited States,” not so much.
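A short sketch of the behavior, ignoring the columns elided from the Address definition above; the misspelled insert fails with a foreign key violation rather than quietly polluting the data:

INSERT INTO Country (CountryName)
VALUES ('United States'), ('Canada');

-- Succeeds: the value exists in the Country domain table.
INSERT INTO Address (CountryName)
VALUES ('Canada');

-- Fails: 'Urnited States' is not in Country, so the foreign key rejects it.
INSERT INTO Address (CountryName)
VALUES ('Urnited States');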

Indexing

Probably the most common area of overdesign and underdesign is indexes. The difficult part of indexing is that there are multiple places along the journey of implementation where you have to apply indexes. During the design phase, you specify indexes in the form of uniqueness constraints. Sometimes the indexes are useful only to protect uniqueness, but often the columns that are required to be unique are also the same ones people use for common searches. You also add indexes in some typically useful situations, like most foreign keys. During implementation, you apply indexes to places where it is obvious that there will be a need for performance improvements, often because these places are obvious even on the small sets you work with in development. Finally, during the testing and production phases of the database’s lifecycle, you need to add indexes to tune queries that don’t perform well enough.

Indexing is a pretty complex topic, but it is one that every DBA and database developer should understand deeply. Because there is an indexing chapter later in the book, I will simply point out a few concerns you should consider during the design phase. Typically, even before development, you specify the following types of indexes:

  • Uniqueness  Indexes should be declared as unique whenever possible because the optimizer can use that information when determining how to optimize a query. Commonly, you try to enforce uniqueness with constraints and speed up queries with indexes. Indexes should typically not change the meaning of the system. (An exception is that sometimes you can use filtered indexes to implement uniqueness on a subset of rows, such as non-null values.)
  • Foreign Keys  Foreign key columns are a special case where you often need an index of some type. This is because you build foreign keys so that you can match up rows in one table to rows in another, so you have to take a value in one table and match it to a value in another. Be careful to consider how many unique values will be in the rows of the index. Foreign keys used with domain tables are sometimes not useful to index because of the low number of values in the domain compared to the base table. But even then, an index can be useful when you delete a domain value, because the engine must check for rows that reference it. (Sketches of a unique filtered index and a foreign key index follow this list.)
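Here is a minimal sketch of both patterns, with hypothetical table and column names:

-- Unique filtered index: enforces uniqueness over only the non-NULL values
-- (hypothetical Person.TaxId column).
CREATE UNIQUE NONCLUSTERED INDEX AKPerson_TaxId
    ON dbo.Person (TaxId)
    WHERE TaxId IS NOT NULL;

-- Nonclustered index on a foreign key column (hypothetical SalesOrder table),
-- which supports joins back to Customer and the check that runs when a
-- Customer row is deleted.
CREATE NONCLUSTERED INDEX IXSalesOrder_CustomerId
    ON dbo.SalesOrder (CustomerId);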

The biggest key to indexing is to manage how useful the indexes that you apply actually are. If indexes are never used in a query, they can be removed (unless they are implementing a uniqueness constraint). You can use the sys.dm_db_index_usage_stats dynamic management view to see how often SQL Server is using each index.
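A sketch of such a usage query, joining sys.indexes to the usage statistics for the current database (the counters reset when the instance restarts):

SELECT  OBJECT_NAME(i.object_id) AS TableName,
        i.name AS IndexName,
        ius.user_seeks,
        ius.user_scans,
        ius.user_lookups,
        ius.user_updates
FROM    sys.indexes AS i
        LEFT JOIN sys.dm_db_index_usage_stats AS ius
            ON  ius.object_id = i.object_id
            AND ius.index_id = i.index_id
            AND ius.database_id = DB_ID()
WHERE   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY TableName, IndexName;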

Design Testing

Testing is a large topic, and one that I will not even attempt to do complete justice to. However, I do want to mention testing during the phases that happen even before the programming begins. The first section of this chapter was titled “Requirements,” and it is essential that the requirements be very well done for the rest of any computer project to succeed in any organized fashion.

Throughout the design phase, you might think that you can’t start doing any testing because there is no software created. However, every step in the process can be tested. From the requirements, the design team creates a design that will serve as the blueprint for the implementation. In many methodologies (and sometimes just because the design takes too long to actually create), the requirements are created somewhat simultaneously with the design. (And yes, too often the requirements are barely even done, or are mixed in with implementation details. All I can do is advise you on what is the easiest path to well-built software. I won’t lie to you and pretend I always do things perfectly either.) In any case, the goal of testing will always be to make sure the target you hit is the target you planned to hit.

After you have designed your tables, it is time to test your design. By working through scenarios of utilization, you will probably discover many holes in your initial designs. When a design is not that well thought out, the holes usually stem from a desire to keep the design reasonable: the requirements-gathering team considers only the most typical cases and ignores the abnormal ones. For example, suppose your company has a maximum salary it will pay a new DBA, even if that person has lots of experience. Then the best DBA in the world (one of the other writers of this book would do as an example!) might want the job but needs 10 percent more than the maximum allowed. The system denies the request for an increased salary amount, so the company has to hire this DBA as a second Chief Information Officer (CIO). Writing software to handle only the reasonable case, as I just described, is not the kind of “reasonable” I am talking about. During the design testing process, look for obvious holes in your design (and in the requirements as well) and make sure that the final product does what the users realistically would desire (and make sure that they agree, naturally).

On the other end of the spectrum, you have the case where, during the normalization phase, the designer went hog wild breaking tables down into more and more separate tables. In some cases, this is absolutely necessary, but the goal of design is to match the actual user’s needs to the software created. In the previous example, the database took too simplistic a view of the system. The opposite case, where the system takes an overly complex view of the problem being solved, is equally difficult.

Suppose you were designing a database to store information about camp activities and manage signups for the activities. It might be tempting to have an individual table for archery classes, polka dances, tango dances, snorkeling, basket weaving, s’mores baking, aerobics, canoeing, swimming lessons, fishing, and so on, because each of these activities is really nothing like the others. So you go off and model, in great detail, each camp activity that was described in the requirements. If there were 50 activities at the camp, you would design 50 tables, plus a bunch of other tables to tie these 50 together. All of these tables together are massive enough to document the entire camp and track the movements of all the counselors and campers…if the users had the time and inclination to enter all of that data.

It’s probably more likely that the client really wanted to automate the administrative processes of the camp, such as assigning an instructor, signing up attendees, and associating a description with each class. Rather than having the system meticulously model each activity, what you truly needed to model was the abstraction of a camp activity that can then have people assigned to it. Camp activity description sheets are probably a Microsoft Word document that gets printed out for each new season (and could possibly be stored as a Word document in the database to make sure the documents don’t get lost).
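A minimal sketch of that abstraction, with made-up names and the instructor and attendee tables omitted for brevity:

-- One generic activity table instead of one table per activity type.
CREATE TABLE dbo.CampActivity
(
    CampActivityId int NOT NULL CONSTRAINT PKCampActivity PRIMARY KEY,
    Name           varchar(50) NOT NULL CONSTRAINT AKCampActivity_Name UNIQUE,
    Description    varchar(max) NOT NULL,
    InstructorId   int NOT NULL  -- foreign key to an Instructor table, omitted here
);

-- Signups relate attendees to activities without needing per-activity tables.
CREATE TABLE dbo.CampActivitySignup
(
    CampActivityId int NOT NULL
        CONSTRAINT FKCampActivitySignup_CampActivity
            REFERENCES dbo.CampActivity (CampActivityId),
    AttendeeId     int NOT NULL, -- foreign key to an Attendee table, omitted here
    CONSTRAINT PKCampActivitySignup PRIMARY KEY (CampActivityId, AttendeeId)
);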

By taking the time to test your design and make sure that it matches the actual requirements and needs of the user/application, you can make changes to the design when all that is required is the eraser end of the pencil. That is always the value of testing: to catch errors before they compound themselves into bigger issues. As you move to implementation, use the requirements and design to get the implementation correct. If you have already coded and tested the system in your mind, that is a solid first step toward creating a great piece of software. Matching the needs of the data with the needs of the user will make sure that what the user needs is covered without the performance overhead of software running that is never even used.

Conclusion

Performance is often mistaken as a purely technical topic. If you are great at SQL and know everything there is to know about the internals of SQL Server, you probably feel pretty confident that you are good at optimizing performance. That might be true, and sometimes, even without knowing the right way to implement a design, you can squeeze good performance out of even the worst databases. But unless you are a consultant who makes money off of fixing poorly created software, actual performance considerations start way before the first line of code is written and even before anyone technical is involved with the process.

The first, most important step to great performance is actually understanding the problem you are trying to solve. It doesn’t matter if you use the waterfall method or an Agile methodology with nearly religious reverence. It doesn’t matter how fast you perform a task if you aren’t headed in the right direction. The process for planning for a trip to DisneyWorld or an Alaskan cruise will be quite similar, but the outcome will be worlds apart. Without knowing where the final destination is, successfully arriving at your destination with the right clothes in your luggage is nearly impossible. If you end up in Alaska in February wearing Bermuda shorts and with summer tires on your car, your performance traveling from point A to point B is going to suffer.

Once you have gathered the requirements and you know what you want to build, the process takes a turn toward how you will implement the solution. The table structure is formed, using patterns that were established in the 1970s and are still valid today. Tables are normalized, so that each table is atomic, with a single meaning, and each column in the table holds a single value. A normalized table has little redundancy of information, leading not only to a high level of data integrity but to structures that are well suited to the relational engine that SQL Server is based upon. A very important part of normalization is to match the needs of the users to the needs of the relational engine, not as some form of academic one-upmanship to show that you understand the data better than your user. A database that is overengineered, with fifty tables where five meet the requirements, frequently leads to poor performance.

After getting tables normalized in a practical manner, I discussed several choices for building your tables. Datatypes are chosen to closely match the shape and physical characteristics of the data being stored to the engine’s method of storage. Constraints are chosen to make sure that the data that is created matches the set of rules (requirements!) that govern the data. Indexes are applied as needed to make access to the data optimal. All three of these choices are most often made easier by the normalization process. Scalar values (as defined by the First Normal Form) make choosing a datatype straightforward because only one value needs to be stored. Indexes are straightforward because most common access to data can be done using entire data values.

The final topic was a process I referred to as design testing. This process is basically testing before actual implementation occurs, and then throughout the rest of the process. It is not always possible to catch every issue before the user gets to the software, but spending time mapping the expected and typical usage patterns to the design at every step, from the normalized data structures to the final table structures you plan to turn into physical tables in a database, can help you see issues while they can still be changed.

This chapter is titled as being about performance, but in the end, performance is not something you achieve by accident, any more than you arrive at a perfectly restful vacation without planning ahead. Rather, database performance boils down to three things: solid requirements, good design, and defensive coding. Get the design right, and the performance will follow. The real cases you will deal with will be less simplistic than the ones shown in this chapter, and the politics that surround the requirements and the eventual design and implementation often will eat up tiny parts of your soul that you will wish you could get back. Getting the design right the first time does matter, however, because databases last for a very long time.
