CHAPTER 3

image

The Language of Data Modeling

The aim of art is to represent not the outward appearance of things but their inward significance.

—Aristotle

A data model is one of the most important tools in the design process, but it has to be done right. A common misconception is that a data model is a picture of a database. That is partly true, but a model can do so much more. A great data model covers pretty much everything about a database and serves as the primary documentation for the life cycle of the database. Aspects of the model will be useful to developers, users, and the database administrators (DBAs) who maintain the system.

In this chapter, I will introduce the basic concept of data modeling, in which a representation of your database will be produced that shows the objects involved in the database design and how they interrelate. It is really a description of the exterior and interior parts database, with a graphical representation being just one facet of the model (the graphical part of the model is probably the most interesting to a general audience, because it gives a very quick and easy-to-work-with overview of your objects and their relationships). Best of all, using a good tool, you can practically design the basics of a system live, right with your clients as they describe what they want (hopefully, someone else is gathering client requirements that are not data-structure related).

In the next section, I’ll provide some basic information about data modeling and introduce the language I prefer for data modeling (and will use for many examples in this book): IDEF1X. I’ll then cover how to use the IDEF1X methodology to model and document the following:

  • Entities/tables
  • Attributes/columns
  • Relationships
  • Descriptive information

In the process of creating a database, we will start out modeling entities and attributes, which do not follow very strict definitions, and refine the models until we end up producing tables and columns, which, as discussed in Chapter 1, have very formal definitions that we have started to define and will refine even more in Chapter 5. For this chapter and the next, we will primarily refer to entities during the modeling exercises, unless we’re trying to demonstrate something that would be created in SQL Server. The same data modeling language will be used for the entire process of modeling the database, with some changes in terminology to describe an entity or a table later in this book.

After introducing IDEF1X, we will briefly introduce several other alternative modeling methodology styles, including information engineering (also known as “crow’s feet”) and the Chen Entity Relationship Model (ERD) methodology. I’ll also show an example of the diagramming capabilities built into SQL Server Management Studio.

image  Note   This chapter will mainly cover the concepts of modeling. In the next chapter, we will apply these concepts to build a data model.

Introducing Data Modeling

Data modeling is a skill at the foundation of database design. In order to start designing databases, it is very useful to be able to effectively communicate the design as well as make it easier to visualize. Many of the concepts introduced in Chapter 1 have graphical representations that make it easy to get an overview of a vast amount of database structure and metadata in a very small amount of space. As mentioned earlier, a common misconception about the data model is that it is solely about painting a pretty picture. In fact, the model itself can exist without the graphical parts; it can consist of just textual information, and almost everything in the data model can be read in a manner that makes grammatical sense to almost any interested party. The graphical nature is simply there to fulfill the baking powder prophecy—that a picture is worth a thousand words. It is a bit of a stretch, because as you will see, the data model will have lots of words on it!

image  Note   There are many types of models or diagrams: process models, data flow diagrams, data models, sequence diagrams, and others. For the purpose of database design, however, I will focus only on data models.

Several popular modeling languages are available to use, and each is generally just as good as the others at the job of documenting a database design. The major difference will be some of the symbology that is used to convey the information. When choosing my data modeling methodology, I looked for one that was easy to read and could display and store everything required to implement very complex systems. The modeling language I use is Integration Definition for Information Modeling (IDEF1X) . (It didn’t hurt that the organization I have worked for over ten years has used it for that amount of time too.)

IDEF1X is based on Federal Information Processing Standards Publication 184, published September 21, 1993. To be fair, the other major methodology, Information Engineering, is good too, but I like the way IDEF1X works, and it is based on a publicly available standard. IDEF1X was originally developed by the U.S. Air Force in 1985 to meet the following requirements:

  • Support the development of data models.
  • Be a language that is both easy to learn and robust.
  • Be teachable.
  • Be well tested and proven.
  • Be suitable for automation.

image  Note   At the time of this writing, the full specification for IDEF1X is available at http://www.itl.nist.gov/fipspubs/idef1x.doc . The exact URL of this specification is subject to change, but you can likely locate it by searching the http://www.itl.nist.gov site for “IDEF1X.”

While the selection of a data modeling methodology may be a personal choice, economics, company standards, or features usually influence tool choice. IDEF1X is implemented in many of the popular design tools, such as the following, which are just a few of the products available that claim to support IDEF1X (note that the URLs listed here were correct at the time of this writing, but are subject to change in the future):

Let’s next move on to practice modeling and documenting, starting with entities.

Entities

In the IDEF1X standard, entities (which, as discussed previously, are loosely synonymous with tables) are modeled as rectangular boxes, as they are in most data modeling methodologies. Two types of entities can be modeled: identifier-independent and identifier-dependent, usually referred to as “independent” and “dependent,” respectively.

The difference between a dependent entity and an independent entity lies in how the primary key of the entity is structured. The independent entity is so named because it has no primary key dependencies on any other entity, or in other words, the primary key contains no foreign key columns from other entities.

Chapter 1 introduced the term “foreign key,” and the IDEF1X specification introduces an additional term: migrated. A foreign key is referred to as a migrated key when the key of a parent table is moved into the child table. The term “migrated” can be slightly misleading, because the primary key of one entity is not actually moving; rather, in this context, the primary key of one entity is copied as an attribute in a different entity to establish a relationship between the two entities. However, knowing its meaning in this context (and a slight release of your data-architect anal-retentive behavior), “migrated” is a good term to indicate what occurs when you put the primary key of one entity into another table to set up the reference.

If the primary key of one entity is migrated into the primary key of another, it is considered dependent on the other entity, because one entity’s meaning depends on the existence of the other. If the attributes are migrated to the nonprimary key attributes, they are independent of any other entities. All attributes that are not migrated as foreign keys from other entities are owned, as they have their origins in the current entity. Other methodologies and tools may use the terms “identifying” and “nonidentifying” instead of “owned” and “independent.”

For example, consider an invoice that has one or more line items. The primary key of the invoice entity might be invoiceNumber. If the invoice has two line items, a reasonable choice for the primary key would be invoiceNumber and lineNumber. Since the primary key contains invoiceNumber, it would be dependent on the invoice entity. If you had an invoiceStatus entity that was also related to invoice, it would be independent, because an invoice’s existence is not really predicated on the existence of a status (even if a value for the invoiceStatus to invoice relationship is required (in other words, the foreign key column would be NOT NULL).

An independent entity is drawn with square corners, as follows:

9781430236955_Ill03-01.jpg

The dependent entity is the converse of the independent entity—it will have the primary key of one or more entities migrated into its primary key. It is called “dependent” because its identifier depends on the existence of another entity. It is drawn with rounded corners, as follows:

9781430236955_Ill03-02.jpg

image  Note   The concept of dependent and independent entities lead us to a bit of a chicken and egg paradox (not to mention, a fork in the road). The dependent entity is dependent on a certain type of relationship. However, the introduction of entity creation can’t wait until after the relationships are determined, since the relationships couldn’t exist without entities. If this is the first time you’ve looked at data models, this chapter may require a reread to get the full picture, as the concept of independent and dependent objects is linked to relationships.

As we start to identify entities, we need to deal with the topic of naming. One of the most important aspects of designing or implementing any system is how objects, variables, and so forth are named. Long discussions about names always seem like a waste of time, but if you have ever gone back to work on code that you wrote months ago, you understand what I mean. For example, @x might seem like an OK variable name when you first write some code, and it certainly saves a lot of keystrokes versus typing @holdEmployeeNameForCleaningInvalidCharacters, but the latter is much easier to understand after a period of time has passed (for me, this period of time is approximately 14.5 seconds).

Naming database objects is no different; actually, naming database objects clearly is more important than naming other programming objects, as your end users will almost certainly get used to these names: the names given to entities will be translated into table names that will be accessed by programmers and users alike. The conceptual and logical model will be considered your primary schematic of the data in the database and should be a living document that you change before changing any implemented structures.

Frequently, discussions on how objects should be named can get heated because there are several different schools of thought about how to name objects. The central issue is whether to use plural or singular names. Both have merit, but one style has to be chosen. I choose to follow the IDEF1X standard for object names, which says to use singular names. By this standard, the name itself doesn’t name the container but, instead, refers to an instance of what is being modeled. Other standards use the table’s name for the container/set of rows.

Is either way more correct? Each has benefits; for example, in IDEF1X, singular entity/table names lead to the ability to read the names of relationships naturally. But honestly, plural or singular naming might be worth a few long discussions with fellow architects, but it is certainly not something to get burned at the stake over. If the organization you find yourself beholden to uses plural names, that doesn’t make it a bad place to work. The most important thing is to be consistent and not let your style go all higgledy-piggledy as you go along. Even a bad set of naming standards is better than no standards at all, so if the databases you inherit use plural names, follow the “when in Rome” principle and use plural names so as not to confuse anyone else.

In this book, I will follow these basic guidelines for naming entities:

  • Entity names should never be plural. The primary reason for this is that the name should refer to an instance of the object being modeled, rather than the collection. This allows you to easily use the name in a sentence. It is uncomfortable to say that you have an “automobiles row,” for example—you have an “automobile row.” If you had two of these, you would have two automobile rows.
  • The name given should directly correspond to the essence of what the entity is modeling. For instance, if you are modeling a person, name the entity . If you are modeling an automobile, call it . Naming is not always this straightforward, but keeping the name simple and to the point is wise. If you need to be more specific, that is fine too. Just keep it succinct (unlike this explanation!).

Entity names frequently need to be made up of several words. During the conceptual and logical modeling phases, including spaces, underscores, and other characters when multiple words are necessary in the name is acceptable but not required. For example, an entity that stores a person’s addresses might be named Person Address, Person_Address, or using the style I have recently become accustomed to and the one I’ll use in this book, PersonAddress. This type of naming is known as Pascal case or mixed case. (When you don’t capitalize the first letter, but capitalize the first letter of the second word, this style is known as camelCase.) Just as in the plural/singular argument, there really is no “correct” way; these are just the guidelines that I will follow to keep everything uniform.

Regardless of any style choices you make, very few abbreviations should be used in the logical naming of entities unless it is a universal abbreviation that every person reading your model will know. Every word ought to be fully spelled out, because abbreviations lower the value of the names as documentation and tend to cause confusion. Abbreviations may be necessary in the implemented model because of some naming standard that is forced on you or a very common industry standard term. Be careful of assuming the industry-standard terms are universally known. For example, at the time of this writing, I am helping breaking in a new developer at work, and every few minutes, he asks what a term means—and the terms are industry standard.

If you decide to use abbreviations in any of your names, make sure that you have a standard in place to ensure the same abbreviation is used every time. One of the primary reasons to avoid abbreviations is so you don’t have to worry about different people using Description, Descry, Desc, Descrip, and Descriptn for the same attribute on different entities.

Often, novice database designers (particularly those who come from interpretive or procedural programming backgrounds) feel the need to use a form of Hungarian notation and include prefixes or suffixes in names to indicate the kind of object—for example, tblEmployee or tblCustomer. Prefixes like this are generally considered a bad practice, because names in relational databases are almost always used in an obvious context. Using Hungarian notation is a good idea when writing procedural code (like Visual Basic or C#), since objects don’t always have a very strict contextual meaning that can be seen immediately upon usage, especially if you are implementing one interface with many different types of objects. In SQL Server Integration Services (SSIS) packages, I commonly name each control with a three- or four-letter prefix to help identify them in logs. However, with database objects, questioning whether a name refers to a column or a table is rare. Plus, if the object type isn’t obvious, querying the system catalog to determine it is easy. I won’t go too far into implementation right now, but you can use the sys.objects catalog view to see the type of any object. For example, this query will list all of the different object types in the catalog (your results may vary; this query was executed against the AdventureWorks2012 database we will use for some of the examples in this book):

 SELECT distinct type_desc

 FROM sys.objects

Here’s the result:

 type_desc

 -------------------------

 CHECK_CONSTRAINT

 DEFAULT_CONSTRAINT

 FOREIGN_KEY_CONSTRAINT

 INTERNAL_TABLE

 PRIMARY_KEY_CONSTRAINT

 SERVICE_QUEUE

 SQL_SCALAR_FUNCTION

 SQL_STORED_PROCEDURE

 SQL_TABLE_VALUED_FUNCTION

 SQL_TRIGGER

 SYNONYM

 SYSTEM_TABLE

 UNIQUE_CONSTRAINT

 USER_TABLE

 VIEW

We will use sys.objects and other catalog views throughout this book to view properties of objects that we create.

Attributes

All attributes in the entity must be uniquely named within it. They are represented by a list of names inside of the entity rectangle:

9781430236955_Ill03-03.jpg

image  Note   The preceding image shows a technically invalid entity, as there is no primary key defined (a requirement of IDEF1X). I’ll cover the notation for keys in the following section.

At this point, you would simply enter all of the attributes that you discover from the requirements (the next chapter will demonstrate this process). In practice, you would likely have combined the process of discovering entities and attributes with the initial modeling phase (we will do so in Chapter 4 as we go through the process of creating a logical data model). Your process will depend on how well the tools you use work. Most data modeling tools cater for building models fast and storing a wealth of information along the way to document their entities and attributes.

In the early stages of logical modeling, there can be quite a large difference between an attribute and what will be implemented as a column. As I will demonstrate in Chapter 5, the attributes will be transformed a great deal during the normalization process. For example, the attributes of an Employee entity may start out as follows:

9781430236955_Ill03-04.jpg

However, during the normalization process, tables like this will often be broken down into many attributes (e.g., address might be broken into number, street name, city, state, zip code, etc.) and possibly many different entities.

image  Note   Attribute naming is one place where I tend to deviate from IDEF1X standard. The standard is that names are unique within a model. This tends to produce names that include the table name followed by the attribute name, which can result in unwieldy, long names that look archaic. You can follow many naming standards you can follow to avoid unwieldy names (and even if I don’t particularly like them), some with specific abbreviations, name formats, and so forth. For example, a common one has each name formed by a descriptive name and a class word, which is an abbreviation like EmployeeNumber, ShipDate, or HouseDescription. For sake of nonpartisan naming politics, I am happy to say that any decent naming standard is acceptable, as long as it is followed.

Just as with entity names, there is no need to include Hungarian notation prefixes or suffixes in the attribute or implementation names. The type of the attribute can be retrieved from the system catalog if there is any question about it.

Next, we will go over the following aspects of attributes on your data model:

  • Primary keys
  • Alternate keys
  • Foreign keys
  • Domains
  • Attribute naming

Primary Keys

As noted in the previous section, an IDEF1X entity must have a primary key. This is convenient for us, because an entity is defined such that each instance must be unique (see Chapter 1). The primary key may be a single attribute, or it may be a composite of multiple attributes. A value is required for every attribute in the key (logically speaking, no NULLs are allowed in the primary key).

The primary key is denoted by placing attributes above a horizontal line through the entity rectangle. Note that no additional notation is required to indicate that the value is the primary key.

9781430236955_Ill03-05.jpg

For example, consider the Employee entity from the previous section. The EmployeeNumber attribute is unique, and logically, every employee would have one, so this would be an acceptable primary key:

9781430236955_Ill03-06.jpg

The choice of primary key is an interesting one. In the early logical modeling phase, I generally do not like to spend time choosing the final primary key attribute(s). The main reason for this is to avoid worrying too much about what the key is going to be. I tend to create a simple surrogate primary key to migrate to other entities to help me see when there is any ownership. In the current example, EmployeeNumber clearly refers to an employee, but not every entity will be so clear—not to mention that more advanced business rules may dictate that EmployeeNumber is not always unique. (For example, the company also may have contractors in the table. That’s not a good practice perhaps, but no matter how much I try to describe perfect databases in this book, not every table will end up being perfect.) Having to repeatedly go back and change the entity used for the primary key in the logical model over and over can be tiresome, particularly when you have a very large model.

It is also quite likely that you may have multiple column sets that uniquely identify a given instance of many of your entities. As an example, consider an entity that models a product manufactured by a company. The company may identify the product by the type, style, size, and series:

9781430236955_Ill03-07.jpg

The name may also be a good key, and more than likely, there is also a product code. Which attribute is the best key—or which is even truly a key—may not become completely apparent until later in the process. There are many ways to implement a good key, and the best way may not be recognizable right away.

Instead of choosing a primary key at this point, I add a value to the entity for identification purposes and then model all candidate keys as alternate keys (which I will discuss in the next section). As a result, the logical model clearly shows what entities are in an ownership role to other entities, since the key that is migrated contains the name of the modeled entity. I would model this entity as follows:

9781430236955_Ill03-08.jpg

image  Note   Using surrogate keys is certainly not a requirement in logical modeling; it is a personal preference that I have found a useful documentation method to keep models clean, and it corresponds to my method of implementation later. Not only is using a natural key as the primary key in the logical modeling phase reasonable but many architects find it preferable. Either method is perfectly acceptable (and just as likely to start a religious debate at a table of data modelers. You have been warned, so start the debate after the desert course).

Alternate Keys

As defined in Chapter 1, an alternate key is a grouping of one or more attributes whose uniqueness needs to be guaranteed over all of the instances of the entity. Alternate keys do not have specific locations in the entity graphic like primary keys, nor are they typically migrated for any relationship (you can reference an alternate key with a foreign key based on the SQL standards, but this feature is very rarely used feature, and when used, it will often really confuse even the best DBAs). They are identified on the model in a very simple manner:

9781430236955_Ill03-09.jpg

In this example, there are two alternate key groups: group AK1, which has one attribute as a member, and group AK2, which has two attributes. There also is nothing wrong with overlapping alternate keys, which could be denoted as (AK1,AK2). Thinking back to the product example, the two keys could then be modeled as follows:

9781430236955_Ill03-10.jpg

One extension that Computer Associates’ ERwin adds to this notation is shown here:

9781430236955_Ill03-11.jpg

A position number notation is tacked onto the name of each key (AK1 and AK2) to denote the position of the attribute in the key. In the logical model, technically, the order of attributes in the key should not be considered even if the tool does display them (unique is unique, regardless of key column order). Which attribute comes first in the key really does not matter; all that ] matters is that you make sure there are unique values across multiple attributes. When a key is implemented, the order of columns will become interesting for performance reasons (because SQL Server implements uniqueness with an index), but uniqueness will be served no matter what the order of the columns of the key is.

image  Note   The discussion of index utilization for performance reasons is left to Chapter 10. Do your best to more or less ignore performance tuning needs during the conceptual, logical, and even most of the physical modeling phases. Defer performance tuning issues until the storage modeling phase, which includes query tuning, indexing, and disk layout.

Foreign Keys

Foreign key attributes, as I’ve alluded to, are also referred to as migrated attributes. They are primary keys from one entity that serve as references to an instance in another entity. They are, again, a result of relationships (we’ll look at their graphical representation later in this chapter). They are indicated, much like alternate keys, by adding the letters “FK” after the foreign key:

9781430236955_Ill03-12.jpg

As an example of a table with foreign keys, consider an entity that is modeling a music album:

9781430236955_Ill03-13.jpg

The artistId and publisherId represent migrated foreign keys from the artist and publisher entities. We’ll revisit this example in the “Relationships” section later in this chapter.

One tricky thing about this example is that the diagram doesn’t show what entity the key is migrated from. This can tend to make things a little messy, depending on how you choose your primary keys. This lack of clarity about what table a foreign key migrates from is a limitation of most modeling methodologies, because displaying the name of the entity where the key came from would be unnecessarily confusing for a couple of reasons:

  • There is no limit (nor should there be) on how far a key will migrate from its original owner entity (the entity where the key value was not a migrated foreign key reference).
  • It is not completely unreasonable that the same attribute might migrate from two separate entities with the same name, especially early in the logical design process. This is certainly not a design goal, but it is possible and can make for interesting situations.

One of the reasons for the primary key scheme I will employ in logical models is to add a key named < entityName > Id as the identifier for entities, so the name of the entity is easily identifiable and lets us easily know where the original source of the attribute is. Also, we can see the attribute migrated from entity to entity even without any additional documentation. For example, in the Album entity example, we instinctively know that the ArtistId attribute is a foreign key and most likely was migrated from the Artist entity just because of the name alone.

Domains

In Chapter 1, the term “domain” referred to a set of valid values for an attribute. In IDEF1X, you can formalize domains and define named, reusable specifications known as domains, for example:

  • String: A character string
  • SocialSecurityNumber: A character value with a format of ###-##-####
  • PositiveInteger: An integer value with an implied domain of 0 to max(integer value)
  • Truth: A five-character value with a domain of ('FALSE','TRUE')

Domains in the specification not only allow us to define the valid values that can be stored in an attribute but also provide a form of inheritance in the datatype definitions. Subclasses can then be defined of the domains that inherit the settings from the base domain. It is a good practice to build domains for any attributes that get used regularly, as well as domains that are base templates for infrequently used attributes. For example, you might have a character type domain where you specify a basic length, like 60. Then, you may specify common domains, like name and description, to use in many entities. For these, you should choose a reasonable length for the values, plus you could include requirements that the data in the column cannot be just space characters to prevent a user from having one, two, or three spaces each look like different values—except in the rare cases where that is desirable.

Regardless of whether or not you are using an automated tool for modeling, try to define common domains that you use for specific types of things (applying a common pattern to solve a common problem). For example, a person’s first name might be a domain. This is cool because you don’t have to answer “Hmm, how long to make a person’s name?” or “what is the format of our part numbers?” or any similar questions more than once. After you make a decision, you just use what you have used before.

image  Note   Defining common domains during design fights against varchar(200) syndrome, where every column in a database stores textual data in columns of exactly same length. Putting in some early thought on the minimum and maximum lengths of data is easier than doing it when the project manager is screaming for results later in the process, and the programmers are champing at the bit to get at your database and get coding.

Early in the modeling process, you’ll commonly want to gather a few bits of information, such as the general type of the attribute: character, numeric, logical, or even binary data. Determining minimum and maximum lengths may or may not be possible, but the more information you can gather without crushing the process the better. Another good thing to start is documenting the legal values for an attribute that is classified as being of the domain type. This is generally done using some pseudocode or in a textual manner, either in your modeling tool or even in a spreadsheet.

It is extremely important to keep these domains as implementation-independent datatype descriptions. For example, you might specify a domain of GloballyUniqueIdentifier, a value that will be unique no matter where it is generated. In SQL Server, a unique identifier could be used (GUID value) to implement this domain. In another operating system (created by a company other than Microsoft, perhaps) where there is not exactly the same mechanism, it might be implemented differently; the point is that this value is statistically guaranteed to be unique every time it is generated. The conceptual/logical modeling phase should be done without too much thinking about what SQL Server can do, if for no other reason than to prevent you from starting to impose limitations on the future solution prior to understanding the actual problem. Another sort of domain might be a set of legal values, like if the business users had defined three customer types, you could specify the legal string values that could be used.

When you start the physical modeling of the relational structures, you will use the same domains to assign the implementation properties. This is the real value in using domains. By creating reusable template attributes that will also be used when you start creating columns, you’ll spend less effort and time building simple entities, which make up the bulk of your work. Doing so also provides a way for you to enforce companywide standards, by reusing the same domains on all corporate models (predicated, of course, on you being diligent with your data modeling processes over time!).

Later on, implementation details such as exact datatypes, constraints, and so forth will be chosen, just to name a few of the more basic properties that may be inherited (and if Microsoft adds a better datatype in the future, you can simply change all of the columns with that domain type to the new type). Since it is very likely that you will have fewer domains than implemented attributes, the double benefit of speedy and consistent model assembly is achieved. However, it is probably not overly reasonable or even useful to employ the inheritance mechanisms when building tables by hand. Implementation of a flat domain structure is enough work without a tool.

As an example of a domain hierarchy, consider this set of character string domains:

9781430236955_Ill03-14.jpg

Here, String is the base domain from which you can then inherit Name and Description. FileName, FirstName, and LastName are inherited from Name. During logical modeling, this might seem like a lot of work for nothing, because most of these domains will share only a few basic details, such as not allowing NULLs or blank data. However, FileName may be optional, whereas LastName might be mandatory. Setting up domains for as many distinct attribute types as possible is important, in case rules or datatypes are discovered that are common to any domains that already exist. Things get good when you need to change all of your datatypes for all string types, for example, if you decide to make a blanket change from ANSI character sets to UNICODE, or to implement compression on all description type attributes but not name ones.

Domains are a nice feature of IDEF1X (and other methodologies or tools that support them). They provide an easy method of building standard attribute types, reducing both the length of time required for repeating common attribute types and the number of errors that occur in doing so. Specific tools implement domains with the ability to define and inherit more properties throughout the domain chain to make creating databases easier. During logical modeling, domains might optionally be shown to the right of the attribute name in the entity:

9781430236955_Ill03-15.jpg

So if I have an entity that holds domain values for describing a type of person, I might model it as follows:

9781430236955_Ill03-16.jpg

To model this example, I defined four domains:

  • SurrogateKey: The surrogate key value. (Implementation of the surrogate should not be implied by building a domain, so later, this can be implemented in any manner.) I could also choose to use a natural key.
  • Description: Holds the description of “something” (can be 60 characters maximum).
  • PersonFirstName: A person’s first name (30 characters maximum).
  • PersonLastName: A person’s last name (50 characters maximum).

The choice of the length of name is an interesting one. I searched on Google for “person first name varchar” and found lots of different possibilities: 10, 35, unlimited, 25, 20, and 15— all on the first page of the search! Just as you should use a consistent naming standard, you should use standard lengths every time like data is represented, so when you hit implementation, the likelihood that two columns storing like data will have different definitions is minimized.

During the implementation phase, all of the domains will get mapped to some form of datatype or, if you are so inclined, a user-defined type in SQL Server. However, the future implementation isn’t quite the point at this point of the process. The point of a domain in the logical model is to define common types of storage patterns that can be applied in a common manner, including all of the business rules that will govern their usage.

Naming

Attribute naming is a bit more interesting than entity naming. I stated earlier that my preference is to use singular, not plural, entity names. The same issues that apply in entity naming are technically true for attribute naming (and no one disagrees with this!). However, until the logical model is ­completed, the model may still have attribute names that are plural. Leaving a name plural during early modeling phases can be a good reminder that you expect multiple values, but in my final relational table model, all attributes are almost always singular. For example, consider a Person entity with a ­Children attribute identified. The Person entity would identify a single person, and the Children attribute would identify sons and daughters of that person.

The naming standard I follow is very straightforward and is intended to guide the names without being overly specific:

  • Avoid repeating the entity name in the attribute name for most attributes except where it is natural to do so. The most common place where we include the table name is with some primary key attributes, particularly surrogate keys, since the key is specific for that table and some code values that are used in common language. For attributes that are not migrated to other entities, there is no need to prefix the name with the entity.
  • Choose an attribute name to reflect precisely what is contained in the attribute and how it relates to the entity.
  • Use abbreviations rarely in attribute names, especially for the conceptual/logical model. Every word ought to be spelled out in its entirety. If, for some reason, an abbreviation must be used (for example, due to the naming standard currently in use), a method should be put into place to make sure the abbreviation is used consistently, as discussed earlier in this chapter. For example, if your organization has a ZRF “thing” that is commonly used and referred to in general conversation as a ZRF, you might use this abbreviation. In general, however, I recommend avoiding abbreviations in all naming unless the client is insistent.
  • Include no information in the name other than that necessary to explain the meaning of the attribute. This means no Hungarian notation of the type of data it represents (e.g., ) or prefix notation to tell you that it is, in fact, an attribute.
  • End the name with a suffix that denotes general usage (often called a classword). It helps to standardize names and to let the user know the purpose of the column. Examples are:
    • UserName: A textual string referencing the name of the user. Whether or not it is a or is immaterial.
    • PaymentId: An identifier for a payment, usually a surrogate key. It can be implemented using a GUID, an integer, a random set of characters, and so on.
    • EndDate: The date when something ends. It does not include the time.
    • SaveTime: The point in time when the row was saved.
    • PledgeAmount: An amount of money (using a , or , or any sort of type).
    • DistributionDescription: A textual string that is used to describe how funds are distributed.
    • TickerCode: A short textual string used to identify a ticker row.

image  Note   Attribute names in the finalized logical and implementation models will not be plural, but we’ll work this out in Chapter 5 when normalizing the model. At this point, it is not a big deal at all and actually is desirable if the values represented by a column are plural (and not simple atomic values as discussed in Chapter 1). Implementing this way would be bad, but during design, the only time it matters that the design is perfect is when it is being used for production.

As in pretty much all things, consistency is the key to proper naming, so if you or your organization does not have a standard naming policy, developing one is worthwhile, even if it is very simple in nature. The overarching principle of my naming philosophy is to keep it simple and readable and to avoid all but universally standard corporate abbreviations. This standard will be followed from logical modeling into the implementation phase. Whatever your standard is, establishing a pattern of naming will make your models easy to follow, both for yourself and for your programmers and users. As always, any standard is better than no standard, and sometimes just the statement “make new columns and tables use the same names as those around them” is a good enough start if you have existing systems you are modifying. Trying to enforce a new standard on new work in an old system can make life more complicated than it is worth.

Relationships

Up to this point, the constructs we have looked at have been pretty much the same across most data modeling methodologies. Entities are always signified by rectangles, and attributes are quite often words within the rectangles. Relationships are where things start to diverge greatly, as many of the different modeling languages approach representing relationships graphically a bit differently. To make the concept of relationships clear, I need to go back to the terms “parent” and “child.” Consider the following definitions from the IDEF1X specification’s glossary (as these are remarkably lucid definitions to have been taken straight from a government specification!):

  • Entity, Child: The entity in a specific connection relationship whose instances can be related to zero or one instance of the other entity (parent entity)
  • Entity, Parent: An entity in a specific connection relationship whose instances can be related to a number of instances of another entity (child entity)
  • Relationship: An association between two entities or between instances of the same entity

In IDEF1X, every relationship is denoted by a line drawn between two entities, with a solid circle at one end of that line to indicate where the primary key attribute is migrated to as a foreign key. In the following image, the primary key of the parent will be migrated to the child. This is how to denote a foreign key on a model.

9781430236955_Ill03-17.jpg

Relationships come in several different flavors that indicate how the parent table is related to the child. We will look at examples of several different types of relationships in this section:

  • Identifying, where the primary key of one table is migrated to the primary key of another. The child will be a dependent entity.
  • Nonidentifying, where the primary key of one table is migrated to the nonprimary key attributes of another. The child will be an independent entity as long as no identifying relationships exist.
  • Optional identifying, when the nonidentifying relationship does not require a child value.
  • Recursive relationships, when a table is related to itself.
  • Subtype or categorization, which is a one-to-one relationship used to let one entity extend another.
  • Many-to-many, where an instance of an entity can be related to many in another, and in turn, many instances of the second entity can be related to multiples in the other.

We’ll also cover the cardinality of the relationship (how many of the parent relate to how many of the child), role names (changing the name of a key in a relationship), and verb phrases (the name of the relationship).

Relationships are a key topic in database design but not a completely simple one. A lot of information is related using a few dots and lines.

image  Note    All of the relationships discussed in this section (except many-to-many) are of the one-to-many variety, which encompasses one-to-zero, one-to-one, one-to-many, or perhaps exactly-n relationships. Technically, it is more accurately one-to-(from M to N), as this enables specification of the many in very precise (or very loose) terms as the situation dictates. However, the more standard term is “one-to-many,” and I will not try to make an already confusing term more so.

Identifying Relationships

The concept of a relationship being identifying is used to indicate containership, that the essence (defined as the intrinsic or indispensable properties that serve to characterize or identify something) of the child instance is defined by the existence of a parent. Another way to look at this is that generally the child in an identifying relationship is an inseparable part of the parent. Without the existence of the parent, the child would make no sense.

The relationship is drawn as follows:

9781430236955_Ill03-18.jpg

To implement this relationship in the model, the primary key attribute(s) are migrated to the primary key of the child. Because of this, the key of a parent instance is needed to be able to identify a child instance record, which is why the name “identifying relationship” is used. In the following example, you can see that the ParentId attribute is a foreign key in the Child entity, from the Parent entity.

9781430236955_Ill03-19.jpg

The child entity in the relationship is drawn as a rounded-off rectangle, which, as mentioned earlier in this chapter, means it is a dependent entity. A common example is an invoice and the line items being charged to the customer on the invoice:

9781430236955_Ill03-20.jpg

Without the existence of the invoice, the line items would have no purpose to exist. It can also be said that the line items are identified as being part of the parent.

Nonidentifying Relationships

In contrast to identifying relationships, where relationships indicated that the child was an essential part of the parent entity, the nonidentifying relationship indicates that the child represents a more informational attribute of the parent.

When implementing the nonidentifying relationship, the primary key attribute is not migrated to the primary key of the child. It is denoted by a dashed line between the entities. Note too that the rectangle representing the child now has squared off corners, since it stands alone, rather than being dependent on the Parent:

9781430236955_Ill03-21.jpg

This time, the key attribute of the parent will not migrate to the primary key of the child entity; instead, it will be in the nonprimary-key attributes.

9781430236955_Ill03-22.jpg

Taking again the example of an invoice, consider the vendor of the products that have been sold and documented as such in the line items. The product vendor does not define the existence of a line item, because with or without specifying the exact vendor the product originates from, the line item still makes sense.

The difference between identifying and nonidentifying relationships can be tricky but is essential to understanding the relationship between tables and their keys. If the parent entity defines the need for the existence of the child (as stated in the previous section), then use an identifying relationship. If, on the other hand, the relationship defines one of the child’s attributes, use a nonidentifying relationship.

Here are some examples:

  • Identifying: You have an entity that stores a contact and another that stores the contact’s telephone number. defines the phone number, and without the contact, there would be no need for the .

9781430236955_Ill03-23.jpg

  • Nonidentifying: Consider the entities that were defined for the identifying relationship, along with an additional entity called ContactPhoneNumberType.. This entity is related to the entity ContactPhoneNumber, but in a nonidentifying way, and defines a set of possible phone number types (Voice, Fax, etc.) that a ContactPhoneNumber might be. The type of phone number does not identify the phone number; it simply classifies it. Even if the type wasn’t known, recording the phone number could still be valid, as the number still has informational merit. However, a row associating a contact with a phone number would be useless information without the contact’s existence.

9781430236955_Ill03-24.jpg

The ContactPhoneNumberType entity is commonly known as a domain entity or domain table, as it serves to implement an attributes domain in a nonspecific manner. Rather than having a fixed domain for an attribute, an entity is designed that allows programmatic changes to the domain with no recoding of constraints or client code. As an added bonus, you can add columns to define, describe, and extend the domain values to implement business rules. It also allows the client user to build lists for users to choose values with very little programming.

While every nonidentifying relationship defines the domain of an attribute of the child table, sometimes when the row is created, the values don’t need to be selected. For example, consider a database where you model houses, like for a neighborhood. Every house would have a color, a style, and so forth. However, not every house would have an alarm company, a mortgage holder, and so on. The relationship between the alarm company and bank would be optional in this case, while the color and style relationships would be mandatory. The difference in the implemented table will be whether or not the child table’s foreign key will allow nulls. If a value is required, then it is considered mandatory. If a value of the migrated key can be null, then it is considered optional.

The optional case is signified by an open diamond at the opposite end of the dashed line from the black circle, as shown here:

9781430236955_Ill03-25.jpg

In the mandatory case, the relationship is drawn as before, without the diamond. Note that an optional relationship means that the cardinality of the relationship may be zero, but a mandatory relationship must have a cardinality of one or greater (cardinality refers to the number of values that can be related to another value, and the concept will be discussed further in the next section).

So why would you make a relationship optional? Consider once again the nonidentifying ­relationship between the invoice line item and the product vendor. The vendor in this case may be required or not required as the business rules dictate. If it is not required, you should make the relationship optional.

image  Note    You might be wondering why there is not an optional identifying relationship. This is because you may not have any optional attributes in a primary key, which is true in relational theory and for SQL Server as well.

For a one-to-many, optional relationship, consider the following:

9781430236955_Ill03-26.jpg

The invoiceLineItem entity is where items are placed onto an invoice to receive payment. The user may sometimes apply a standard discount amount to the line item. The relationship, then, from the invoiceLineItem to the discountType entity is an optional one, as no discount may have been applied to the line item.

For most optional relationships like this, there is another possible solution, which can be modeled as required, and in the implementation, a row can be added to the discountType table that indicates “none.” An example of such a mandatory relationship could be genre to movie in a movie rental system­database:

9781430236955_Ill03-27.jpg

The relationship is genre<classifies>movie, where the genre entity represents the “one” and movie represents the “many” in the one-to-many relationship. Every movie being rented must have a genre, so that it can be organized in the inventory and then placed on the appropriate rental shelf.

Role Names

A role name is an alternative name you can give an attribute when it is used as a foreign key. The purpose of a role name is to clarify the usage of a migrated key, because either the parent entity is generic and a more specific name is needed or the same entity has multiple relationships. As attribute names must be unique, assigning different names for the child foreign key references is often necessary. Consider these tables:

9781430236955_Ill03-28.jpg

In this diagram, the Parent and Child entities share two relationships, and the migrated attributes have been role named as FirstParentPrimaryKey and SecondParentPrimaryKey. In diagrams, you can indicate the original name of the migrated attribute after the role name, separated by a period (.), as follows (but usually it takes up too much space on the model when you are using it):

9781430236955_Ill03-29.jpg

As an example, say you have a User entity, and you want to store the name or ID of the user who created a DatabaseObject entity instance as well as the user that the DatabaseObject instance was created for. It would then end up as follows:

9781430236955_Ill03-30.jpg

Note that there are two relationships to the DatabaseObject entity from the User entity. Due to the way the lines are drawn on a diagram, it is not clear from the diagram which foreign key goes to which relationship. Once you name the relationship (with a verb phrase, which will be covered later in this chapter), the keys’ relationships will be easier to determine, but often, determining which line indicates which child attribute is simply trial and error.

Relationship Cardinality

The cardinality of the relationship denotes the number of child instances that can be inserted for each parent of that relationship. Cardinality may seem like a fringe topic because the logic to implement can be tricky, but the reality is that if the requirements state the cardinality, it can be important to document the cardinality requirements and to implement restrictions on cardinality in the database constraints where possible. At this point, however, logical modeling is not about how to implement but about documenting what should be. We will talk implementation of data constraints starting in Chapter 6 when we implement our first database, and even more in Chapter 7, the data protection chapter.

Figures 3-1 through 3-6 show the six possible cardinalities that relationships can take. The cardinality indicators are applicable to either mandatory or optional relationships.

9781430236955_Fig03-01.jpg

Figure 3-1. One-to-zero or more

9781430236955_Fig03-02.jpg

Figure 3-2. One-to-one or more (at least one), indicated by P

9781430236955_Fig03-03.jpg

Figure 3-3. One-to-zero or one (no more than one), indicated by Z

9781430236955_Fig03-04.jpg

Figure 3-4. One-to-some fixed range (in this case, between 4 and 8 inclusive)

9781430236955_Fig03-05.jpg

Figure 3-5. One-to-exactly N (in this case, 5, meaning each parent must have five children)

9781430236955_Fig03-06.jpg

Figure 3-6. Specialized note describing the cardinality

9781430236955_Ill03-31.jpg

This is a good example of a zero-or-one to one-or-more relationship, and a fairly interesting one at that. It says that for a guardian instance to exist, a related student must exist, but a student need not have a guardian for us to wish to store the student’s data. Next, let’s consider the case of a club that has members with certain positions that they should or could fill, as shown in Figures 3-7 through 3-9.

9781430236955_Fig03-07.jpg

Figure 3-7. One-to-many allows unlimited positions for the member.

9781430236955_Fig03-08.jpg

Figure 3-8. One-to-one allows only one position per member

9781430236955_Fig03-09.jpg

Figure 3-9. A one-to-zero, one-to-one, or one-to-two relationship specifies a limit of two positions per person.

Figure 3-7 shows that a member can take as many positions as are possible. Figure 3-8 shows that a member can serve in no position or one position, but no more. Finally, Figure 3-9 shows that a member can serve in zero, one, or two positions. They all look pretty similar, but the Z or 0–2 is important in signifying the cardinality.

image  Note    It is a fairly rare occurrence that I have needed anything other than the basic one-many, one-zero, or one relationship types, but your experience may lend itself to the specialized relationship cardinalities.

Recursive Relationships

One of the more difficult—and often important—relationship types to implement is the recursive relationship, also known as a self-join, hierarchical, self-referencing, or self-relationship (I have even heard them referred to as fish-hook relationships, but that name always seems silly to me). This is modeled by drawing a nonidentifying relationship not to a different entity, but to the same entity. The migrated key of the relationship is given a role name. (In many cases, a naming convention of adding “parent” to the front of the attribute name is useful if no natural naming is available.)

9781430236955_Ill03-32.jpg

The recursive relationship is useful for creating tree structures, as in the following organizational chart:

9781430236955_Ill03-33.jpg

To explain this concept fully, I will show the data that would be stored to implement this hierarchy:

9781430236955_Ill03-34.jpg

Here is the sample data for this table:

OrganizationName ParentOrganizationName
---------------- ----------------------
All
IT ALL
HR ALL
Marketing ALL
Programming IT
Database Management IT

The organizational chart can now be traversed by starting at All and getting the children of ALL, for example: IT. Then, you get the children of those values, like for IT one of the values is Programming.

As a final example, consider the case of a Person entity. If you wanted to associate a person with a single other person as the first person’s spouse, you might design the following:

9781430236955_Ill03-35.jpg

Notice that this is a one-to-zero or one-to-one relationship, since (in most places) a person may have no more than a single spouse but need not have one. If you require one person to be related as a child to two parents, another table entity is required to link two people together.

image  Note    Hierarchies will be covered in some detail in Chapter 8 when we discuss various modeling patterns and techniques In this chapter, it is just important to understand what the hierarchical relationship looks like on a data model.

Subtypes

Subtypes (also referred to as categorization relationships) are another special type of one-to-zero or one-to-one relationship used to indicate whether one entity is a specific type of a generic entity. Note also that there are no black dots at either end of the lines; the specific entities are drawn with rounded corners, signifying that they are, indeed, dependent on the generic entity.

9781430236955_Ill03-36.jpg

There are three distinct parts of the subtype relationship:

  • Generic entity: This entity contains all of the attributes common to all of the subtyped entities.
  • Discriminator: This attribute acts as a switch to determine the entity where the additional, more specific information is stored.
  • Specific entity: This is the place where the specific information is stored, based on the ­discriminator.

For example, consider an inventory of your home video library. If you wanted to store information about each of the videos that you owned, regardless of format, you might build a categorization relationship like the following:

9781430236955_Ill03-37.jpg

In this manner, you might represent each video’s price, title, actors, length, and possibly description of the content in the VideoProgram entity, and then, based on format—which is the discriminator—you might store the information that is specific to Discs or FileSystem in its own separate entity (e.g., physical location, special features, format [BluRay, DVD, digital copy] for Disc based video and directory and format for FileSystem).

There are two distinct category types: complete and incomplete. The complete set of categories is modeled with a double line on the discriminator, and the incomplete set is modeled with a single line (see Figure 3-10).

9781430236955_Fig03-10.jpg

Figure 3-10. Complete (left) and incomplete (right) sets of categories

The primary difference between the complete and incomplete categories is that in the complete categorization relationship, each generic instance must have one specific instance, whereas in the incomplete case, this is not necessarily true. An instance of the generic entity can be associated with an instance of only one of the category entities in the cluster, and each instance of a category entity is associated with exactly one instance of the generic entity. In other words, overlapping subentities are not allowed.

For example, you might have a complete set of categories like this:

9781430236955_Ill03-38.jpg

This relationship is read as follows: “A Person must be either Male or Female.” This is certainly a complete category. This is not to say that you know the gender of every person in every instance of all entities. Rather, it simply means that if the instance has been categorized, any person must fall in one of the two buckets (male or female).

However, consider the following incomplete set of categories:

9781430236955_Ill03-39.jpg

This is an incomplete subtype, because employees are either salaried or hourly, but there may be other categories, such as contract workers. You may not need to store any additional information about them, though, so there is no need to implement the specific entity. This relationship is read as follows: “An Employee can be either Salaried or Hourly or other.”

Many-to-Many Relationship

The many-to-many relationship is also known as the nonspecific relationship, which is actually a better name, but far less well known. Having quite a few many-to-many relationships in the data model is common, particularly in the early conceptual model. These relationships are modeled by a line with a solid black dot at either end:

9781430236955_Ill03-40.jpg

There is one real problem with modeling a many-to-many relationship: it is often necessary to have more information about the relationship than that simply many EntityX instances are connected to many EntityY instances. So the relationship is usually modeled as follows:

9781430236955_Ill03-41.jpg

Here, the intermediate EntityX_EntityY entity is known as an associative entity (names like bridge, tweener, and joiner are not uncommon either). In early modeling, I will often stick with the former representation when I haven’t identified any extended attributes to describe the relationship and the latter representation when I need to add additional information to the model. To clarify the concept, let’s look at the following example:

9781430236955_Ill03-42.jpg

Here, I have set up a relationship where many customers are related to many products. This situation is common situation because in most cases, companies don’t create specific products for specific customers; rather, any customer can purchase any of the company’s products. At this point in the modeling, it is reasonable to use the many-to-many representation. Note that I am generalizing the customer-to-product relationship. It is not uncommon to have a company build specific products for only one customer to purchase.

Consider, however, the case where the Customer need only be related to a Product for a certain period of time. To implement this, you can use the following representation:

9781430236955_Ill03-43.jpg

In fact, almost all of the many-to-many relationships tend to require some additional information like this to make them complete. It is not uncommon to have no many-to-many relationships modeled with the black circle on both ends of a model, so you will need to look for entities modeled like this to be able to discern them.

image  Note    I should also note that you can’t implement a many-to-many relationship in SQL without using a table for the resolution. This is because there is no way to migrate keys both ways. In the database, you are required to implement all many-to-many relationships using a resolution entity.

Verb Phrases (Relationship Names)

Relationships are given names, called verb phrases, to make the relationship between a parent and child entity a readable sentence and to incorporate the entity names and the relationship cardinality. The name is usually expressed from parent to child, but it can be expressed in the other direction, or even in both directions. The verb phrase is located on the model somewhere close to the line that forms the relationship:

9781430236955_Ill03-44.jpg

The relationship should be named such that it fits into the following general structure for reading the entire relationship: parent cardinality – parent entity name – relationship name – child cardinality – child entity name.

For example, the following relationship

9781430236955_Ill03-45.jpg

would be read as “one Contact is phoned using zero, one, or more PhoneNumbers.”

Of course, the sentence may or may not make perfect sense in normal conversational language; for example, this one brings up the question of how a contact is phoned using zero phone numbers. If presenting this phrase to a nontechnical person, it would make more sense to read it as follows: “Each contact can have either no phone number or one or more phone numbers.”

The modeling language does not take linguistics into consideration when building this specification, but from a technical standpoint, it does not matter that the contact is phoned using zero phone numbers, since it follows that the contact would have no phone number. Being able to read the relationship helps you to notice obvious problems. For instance, consider the following relationship:

9781430236955_Ill03-46.jpg

It looks fine at first glance, but when read as “one contactType classifies zero or one Contacts,” it doesn’t make logical sense. It means to categorize all of the contacts, a unique ContactType row would be required for each Contact, which clearly is not at all desirable. This would be properly modeled as follows:

9781430236955_Ill03-47.jpg

which now reads, “one contactType classifies zero or more Contacts.”

Note that the type of relationship, whether it is identifying, nonidentifying, optional, or mandatory, makes no difference when reading the relationship. You can also include a verb phrase that reads from child to parent. For a one-to-many relationship, this would be of the following format: “One child instance (relationship) exactly one parent instance.”

In the case of the first example, you could have added an additional verb phrase:

9781430236955_Ill03-48.jpg

The parent-to-child relationship again is read as “one Contact is phoned using zero, one, or more phoneNumbers.”

You can then read the relationship from child to parent. Note that, when reading in this direction, you are in the context of zero or one phone number to one and only one contact: “zero or one phoneNumbers may be used to phone exactly one contact.”

Since this relationship is going from many to one, the parent in the relationship is assumed to have one related value, and since you are reading in the context of the existence of the child, you can also assume that there is zero or one child record to consider in the sentence.

For the many-to-many relationship, the scheme is pretty much the same. As both entities are parents in this kind of relationship, you read the verb phrase written above the line from left to right and from right to left for the verb phrase written below it.

9781430236955_Ill03-49.jpg

image  Note    Taking the time to define verb phrases can be a hard sell at times, because they are not actually used in a substantive way in the implementation of the database, and often people consider doing work that doesn’t produce code directly to be a waste of time. However, well-defined verb phrases make for great documentation, giving the reader a good idea of why the relationship exists and what it means. I usually use the verb phrase when naming the foreign key constraints too, which you will see in Chapter 6 when we actually create a database with foreign keys.

Descriptive Information

Take a picture of a beautiful mountain, and it will inspire thousands of words about the beauty of the trees, the plants, the babbling brook (my relative ability to describe a landscape being one of the reasons I write technical books). What it won’t tell you is how to get there yourself, what the temperature is, and whether you should bring a sweater and mittens or your swim trunks.

Data models are the same way. You can get a great start on understanding the database from the model, as I have discussed in the previous sections of this chapter. We started the documentation process by giving good names to entities, attributes, and the relationships, but even with well-formed names, there will still likely be confusion as to what exactly an attribute is used for and how it might be used.

For this, we need to add our own thousand words (give or take) to the pictures in the model. When sharing the model, descriptions will let the eventual reader—and even a future version of yourself—know what you originally had in mind. Remember that not everyone who views the models will be on the same technical level: some will be nonrelational programmers, or indeed users or (nontechnical) product managers who have no modeling experience.

Descriptive information need not be in any special format. It simply needs to be detailed, up to date, and capable of answering as many questions as can be anticipated. Each bit of descriptive information should be stored in a manner that makes it easy for users to quickly connect it to the part of the model where it was used, and it should be stored either in a document or as metadata in a modeling tool.

You should start creating this descriptive text by asking questions such as the following:

  • What is the object supposed to represent?
  • How will the object be used?
  • Who might use the object?
  • What are the future plans for the object?
  • What constraints are not specifically implied by the model?

The scope of the descriptions should not extend past the object or entities that are affected. For example, the entity description should refer only to the entity, and not any related entities, relationships, or even attributes unless completely necessary. An attribute definition should only speak to the single attribute and where its values might come from.

Maintaining good descriptive information is equivalent to putting decent comments in code. As the eventual database that you are modeling is usually the central part of any computer system, comments at this level are more important than at any others. We can also say that this is the bread and butter of having a logical model. For most people, being able to go back and review notes that were taken about each object and why things were implemented is invaluable, especially true for organizations that bring in new employees and need to bring them up to speed on complex systems.

For example, say the following two entities have been modeled:

9781430236955_Ill03-50.jpg

The very basic set of descriptive information in Tables 3-1 and 3-2 could be captured to describe the attributes created.

Table 3-1. Entities

Entity Attribute Description
Contact Persons that can be contacted to do business with
ContactId Surrogate key representing a Contact
ContactTypeId Primary key reference for a contactType, classifies the type of contact
Name The full name of a contact
ContactType Domain of different contact types
ContactTypeId Surrogate key representing a ContactType
Name The name that the contact type will be uniquely known as
Description The description of exactly how the contact should be used as

Table 3-2. Relationships

Parent Entity Name Phrase Child Entity Name Definition
ContactType Classifies Contact Contact type classification

Alternative Modeling Methodologies

In this section, I will briefly describe a few of the other modeling methodologies that you will likely run into with tools you may use when looking for database information on the Web. You will see a lot of similarities among them—for example, most every methodology uses a rectangle to represent a table and a line to indicate a relationship. You will also see some big differences among them, such as how the cardinality and direction of a relationship is indicated. Where IDEF1X uses a filled circle on the child end and an optional diamond on the other, one of the most popular methodologies uses multiple lines on one end and several dashes to indicate the same things. Still others use an arrow to point from the child to the parent to indicate where the migrated key comes from (that one really confuses people who are used to IDEF1X and crow’s feet.)

While all of the examples in this book will be done in IDEF1X, knowing about the other methodologies will be helpful when you are surfing around the Internet, looking for sample diagrams to help you design the database you are working on. (Architects are often particularly bad about not looking for existing designs, because frankly, solving the problem at hand is one of the best parts of the job. However, don’t reinvent the wheel every time!)

I will briefly discuss the following:

  • Information engineering (IE): The other main methodology, which is commonly referred to as the crow’s feet method
  • Chen Entity Relationship Model (ERD): The methodology used mostly by academics, though you can run into these models online
  • Visio: A tool that many developers have handy that will do an admirable job of helping you to design a database
  • Management Studio database diagrams: The database object viewer that can be used to view the database as a diagram right in Management Studio

image  Note    This list is by no means exhaustive. For example, several variations loosely based on the Unified Modeling Language (UML) class modeling methodology are not listed. These types of diagrams are common, particularly with people who use the other components of UML, but these models really have no standards. Some further reading on UML data models can be found in Clare Churcher’s book Beginning Database Design (Apress, 2007), on Scott Adler's AgileData site ( http://www.agiledata.org/essays/umlDataModelingProfile.html ), and on IBM's Rational UML documentation site ( http://www.ibm.com/software/rational/support/documentation/ ), and many others. (The typical caveat that these URLs are apt to change applies.)

Information Engineering

The information engineering (IE) methodology is well known and widely used. Like IDEF1X, it does a very good job of displaying the necessary information in a clean, compact manner that is easy to follow. The biggest difference is in how this method denotes relationship cardinalities: it uses a crow’s foot instead of a dot and lines and dashes instead of diamonds and some letters.

Tables in this method are denoted as rectangles, basically the same as in IDEF1X. According to the IE standard, attributes are not shown on the model, but most models show them the same as in IDEF1X—as a list, although the primary key is usually denoted by underlining the attributes, rather than the position in the table. (I have seen other ways of denoting the primary key, as well as alternate/foreign keys, but they are all very clear.) Where things get very different using IE is when dealing with relationships.

Just like in IDEF1X, IE has a set of symbols that have to be understood to indicate the cardinality and ownership of the data in the relationships. By varying the basic symbols at the end of the line, you can arrive at all of the various possibilities for relationships. Table 3-3 shows the different symbols that can be employed to build relationship representations.

Table 3-3. Information Engineering Symbols

Symbol Relationship Type Description
images/Tab03-01.jpg Many The entity on the end with the crow’s foot denotes that there can be greater than one value related to the other entity.
images/Tab03-02.jpg Required The key of the entity on the other end of the relationship is required to exist. A line on both ends indicates that a child is required for a parent row to exist, just like a “P” on the end of an IDEF1X model.
images/Tab03-03.jpg Optional This symbol indicates that there does not have to be a related instance on this end of the relationship for one to exist on the other. It can appear at the parent or the child.
images/Tab03-04.jpg Nonrequired A set of dashed lines on one end of the relationship line indicates that the migrated key may be null.

Figures 3-11 through 3-14 show some examples of relationships in IE.

9781430236955_Fig03-11.jpg

Figure 3-11. One-to-many: Specifically, one row in Table A may be related to zero, one or more rows in Table B. A related row must exist in Table A for a row to exist in Table B.

9781430236955_Fig03-12.jpg

Figure 3-12. One-to-many: Specifically, one Row in Table A may be related to one or more rows in Table B. A related row must exist in Table A and a row must exist in Table B.

9781430236955_Fig03-13.jpg

Figure 3-13. One-to-one: Specifically, zero or one row in Table A can be related to zero or one row in Table B. A row needn't exist in Table A for a row to exist in Table B (the key value would be optional).

9781430236955_Fig03-14.jpg

Figure 3-14. Many-to-many relationship

I have never felt that this notation was as clean as IDEF1X, because much of the symbology seems a bit complicated to master (I have heard the same about IDEF1X from crow’s feet users as well, so it could just be a taste thing). IE conveys the information well though and is likely to be used in some of the documents that you will come across in your work as a data architect or developer. IE is also not always fully implemented in tools; however, usually the circle, dashes, and crow’s feet are implemented properly.

image  Note    You can find more details about the Information Engineering methodology in the book Information Engineering, Books 1, 2, and 3 by James Martin (Prentice Hall, 1990).

Chen ERD

The Chen Entity Relationship Model (ERD) methodology is quite a bit different from IDEF1X, but it’s pretty easy to follow and largely self-explanatory. You will seldom see this methodology anywhere other than in academic texts, but since quite a few of these types of diagrams are on the Internet, it’s good to understand the basics of the methodology. Here’s a very simple Chen ERD diagram showing the basic constructs:

9781430236955_Ill03-51.jpg

Each entity is again a rectangle; however, the attributes are not shown in the entity but are instead attached to the entity in circles. The primary key either is not denoted or, in some variations, is underlined. The relationship is denoted by a rhombus, or diamond shape.

The cardinality for a relationship is denoted in text. In the example, it is 1 and Only 1 Parent rows < relationship name > 0 to Many Child rows. The primary reason for including the Chen ERD format is for contrast. Several other modeling methodologies—for example, Object Role Modeling (ORM) and Bachman—implement attributes in this style, where they are not displayed in the rectangle.

While I understand the logic behind this approach (entities and attributes are separate things), I have found that models I have seen using the format with attributes attached to the entity like this seemed overly cluttered, even for fairly small diagrams. The methodology does, however, do an admirable job with the logical model of showing what is desired and also does not rely on overcomplicated symbology to describe relationships and cardinality.

image  Note    You can find further details on the Chen ERD methodology in the paper “The Entity Relationship Model—Toward a Unified View of Data” by Peter Chen (it can be found by performing a Google search for the title of the paper).

Also, note that I am not saying that such a tool to create Chen diagrams does not exist; rather, I personally have not seen the Chen ERD methodology implemented in a mainstream database design tool other than some early versions of Microsoft Visio. Quite a few of the diagrams you will find on the Internet will be in this style, however, so understanding at least the basics of the Chen ERD methodology is useful.

Visio

Visio is a tool that many developers use for designing databases; often, they already have it in their tool belts for other types of drawings and models (such as process flow diagrams). By nature, Visio is a multipurpose drawing tool and, as such, does not lend itself well to being a fully featured database design tool. That said, Visio is not the world’s worst tool to design a database either. It does lack a refined manner of going from conceptual to logical and finally to an implemented database, but unless you are doing serious enterprise-level designs, this limitation may not matter much to you. (Also, like many of us, you may not have the ducats to shell out for a more fully featured tool, and using Visio is better than just using Management Studio’s diagrams.)

Models created using Visio have a distinctive style that shares some features with object-oriented design tools. While the tool supports many of the features of more powerful tools, the picture of the model is pretty basic:

9781430236955_Ill03-52.jpg

The arrow points to the parent in all cases but does not indicate ownership, cardinality, or even optionality. It tells you what columns are primary keys in two ways (using the line and the “PK”), as well as telling you what columns are part of foreign keys with “FK” plus a number, in case there are more than one. Alternate keys are denoted with “U” plus a number. In the preceding model, the Parent entity/table has two alternate keys.

Visio implements a good amount of detail to define columns, include comments, and set cardinalities via dialogs and editors. All in all, the tool is not a terrible choice for modeling if it is the only one available, but far better choices are out there for what, over the lifetime of the tool, will be a pretty reasonable amount of money.

Management Studio Database Diagrams

The database diagramming capability built into SQL Server Management Studio is not a modeling tool, though often, it can be a developer’s only available tool for modeling. It is a very useful tool to provide a graphical view the structure of an implemented database (a picture really is worth a thousand words!), but because it works directly against the implemented tables, it is not overly useful for design but only for the final implementation. You can use this view to modify structures as well, but I would not suggest it. I typically suggest that one use the T-SQL code in the Management Studio query window to make table changes. (All table structures in this book will be done in this way, and doing so is a best practice for repeatability purposes. I will talk more on this in the latter half of the book.) New to SQL Server 2012 is a tool called SQL Server Developer Tools that will allow you to create a database as well. In this book, I will largely be tool agnostic, choosing to do almost everything via scripts, because it will behoove you as a designer/coder to know what the tools do behind the scenes anyhow.

You will find that Management Studio diagrams are excellent tools when you are looking at a database in an environment where you have no other tools, just to look at the structures. As such, the following is an example of a one-to-many relationship in Management Studio:

9781430236955_Ill03-53.jpg

The primary keys are identified by the little key in an attribute. The relationship is denoted by the line between the entities, with the “one” end having a key and the “many” end having an infinity sign. You can display the entities in several formats by just showing the names of the entities or by showing all of the attributes with datatypes, for example:

9781430236955_Ill03-54.jpg

While the database diagram tool does have its place, I must stress that it isn’t a full-featured data modeling tool and shouldn’t be used as such if you can avoid it. I included coverage of the SQL Server modeling capabilities here because it’s included in SQL Server, and in some situations, it’s the best tool you may have access to. It does give access to all implementation-specific features of SQL Server, including the ability to annotate your tables and columns with descriptive information. Unfortunately, if you decide to implement a relationship in a trigger, it will not know that the trigger exists. (I cover triggers in Chapter 6 and Chapter 7, so if you have no idea what a trigger is right now, don’t worry.)

In most cases, the SQL Server tool isn’t the optimal way to see actual relationship information that is designed into the database, but it does offer a serviceable look at the database structure when needed.

image  Note   In the Business Intelligence tools for SQL Server 2005 and later, there is also another tool that resembles a data model in the Data Source view. It is used to build a view of a set of tables, views, and (not implemented) queries for building reports from. It is pretty much self-explanatory as a tool, but it uses an arrow on the parent end of the relation line to indicate where a primary key came from, much like Visio does. This tool is not pertinent to the task of building or changing a database, but I felt I should at least mention it briefly, as it does look very much like a data modeling tool.

Best Practices

The following are some basic best practices that can be very useful to follow when doing data ­modeling:

  • Modeling language: Pick a model language, understand it, and use it fully. This chapter has been a basic coverage of much of the symbology of the IDEF1X modeling language. IDEF1X is not the only modeling language, and after using one style, it is likely you will develop a favorite flavor and not like the others (guess which one I like best). The plain fact is that almost all of the modeling options have some merit. The important thing is that you understand your chosen language and can use it to communicate with users and programmers at the levels they need and can explain the intricacies as necessary.
  • Entity names: There are two ways you can go with these: plural or singular. I feel that names should be singular (meaning that the name of the table describes a single instance, or row, of the entity, much like an OO object name describes the instance of an object, not a group of them), but many other highly regarded data architects and authors feel that the table name refers to the set of rows and should be plural. Whichever way you decide to go, it’s most important that you are consistent. Anyone reading your model shouldn’t have to guess why some entity names are plural and others aren’t.
  • Attribute names: It’s generally not necessary to repeat the entity name in the attribute name, except for the primary key and some common terms. The entity name is implied by the attribute’s inclusion in the entity. The attribute name should reflect precisely what is contained in the attribute and how it relates to the entity. And as with entities, abbreviations ought to be used extremely sparingly in naming of attributes and columns; every word should be spelled out in its entirety. If any abbreviation is to be used, because of some naming standard currently in place for example, a method should be put into place to make sure the abbreviation is used consistently.
  • Relationships: Name relationships with verb phrases, which make the relationship between a parent and child entity a readable sentence. The sentence expresses the relationship using the entity names and the relationship cardinality. The relationship sentence is a very powerful tool for communicating the purpose of the relationships with nontechnical members of the project team (e.g., customer representatives).
  • Domains: Using defined, reusable domains gives you a set of standard templates to apply when building databases to ensure consistency across your database and, if the templates are used extensively, all of your databases. Implement type inheritance wherever possible to take advantage of domains that are similar and maximize reusability.

Summary

One of the primary tools of a database designer is the data model. It’s such a great tool because it can show not only the details of single table at a time but the relationships between several entities at a time. Of course, it is not the only way to document a database; each of the following is useful, but not nearly as useful as a full-featured data model:

  • Often a product that features a database as the central focus will include a document that lists all tables, datatypes, and relationships.
  • Every good DBA has a script of the database saved somewhere for re-creating the database.
  • SQL Server’s metadata includes ways to add properties to the database to describe the objects.

A good data modeling tool (often a costly thing to purchase but definitely well worth the investment over time) will do all of these things and more for you. I won’t give you any guidance as to which tool to purchase, as this is not an advertisement for any tool (not even for the basic Microsoft tools that you likely have your hands on already, which frankly are not the best-in-class tools that you need to get). Clearly, you need to do a bit of research to find a tool that suits you.

Of course, you don’t need a model to design a database, and I know several very talented data architects who don’t use any kind of tools to model with, sticking with SQL scripts to create their databases, so using a modeling tool is not necessary to get the database created. However, a graphical representation of the data model is a very useful tool to quickly share the structure of the database with developers and even end users. And the key to this task is to have common symbology to communicate on a level that all people involved can understand on some level.

In this chapter, I presented the basic process of graphically documenting the objects that were introduced in the first chapter. I focused heavily on the IDEF1X modeling methodology, taking a detailed look at the symbology that will be used through database designs. The base set of symbols outlined here will enable us to fully model logical databases (and later physical databases) in great detail.

All it takes is a little bit of training, and the rest is easy. For example, take the model in Figure 3-15.

9781430236955_Fig03-15.jpg

Figure 3-15. Reading this basic model is not difficult at all, if you simply apply the explanations from this chapter.

Customers place orders. Orders have line items. The line items are used to order products. With very little effort, nontechnical users can understand your data models, allowing you to communicate very easily, rather than using large spreadsheets as your primary communication method. The finer points of cardinality and ownership might not be completely clear, but usually, those technical details are not as important as the larger picture of knowing which entities relate to which.

If you have named attributes well, users won’t be confused about what most attributes are, but if so, your spreadsheet of information should be able clear up confusion about the finer points of the model.

Now that we’ve considered the symbology required to model a database, I’ll use data models throughout this book to describe the entities in the conceptual model in Chapter 4, and then, in many other models throughout the implementations presented in the rest of the book as shorthand to give you an overview of the scenario I am setting up, often in addition to scripts to demonstrate how to create the tables in the model.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.223.206.69