Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

1
Database Design Goals

Using modern database tools, just about anyone can build a database. The question is, will the resulting database be useful?

A database won't do you much good if you can't get data out of it quickly, reliably, and consistently. It won't be useful if it's full of incorrect or contradictory data, nor will it be useful if it is stolen, lost, or corrupted by data that was only half written when the system crashed.

You can address all of these potential problems by using modern database tools, a good database design, and a pinch of common sense, but only if you understand what those problems are so you can avoid them.

The first step in the quest for a useful database is understanding database goals. What should a database do? What makes a database useful and what problems can it solve? Working with a powerful database tool without goals is like flying a plane through clouds without a compass—you have the tools you need but no sense of direction.

This chapter describes the goals of database design. By studying information containers, such as files that can play the role of a database, the text defines properties that good databases have and problems that they should avoid.

In this chapter, you will learn about the following:

Why a good database design is important
The strengths and weaknesses of various kinds of information containers that can act as databases
How computerized databases can benefit from those strengths and avoid those weaknesses
How good database design helps achieve database goals
What CRUD, ACID, and BASE are, and why they are relevant to database design

THE IMPORTANCE OF DESIGN

Forget for a moment that this book is about designing databases and consider software design in general. Software design plays a critical role in software development. The design lays out the general structure and direction that future development will take. It determines which parts of the system will interact with other parts. It decides which subsystems will provide support for other pieces of the application.

If an application's underlying design is flawed, the system as a whole is at risk. Bad assumptions in the design creep into the code at the application's lowest levels, resulting in flawed subsystems. Higher-level systems built on those subsystems inherit those design flaws, and soon their code is corrupted, too.

Sometimes, a sort of decay pervades the entire system and nobody notices until relatively late in the project. The longer the project continues, the more entrenched the incorrect assumptions become, and the more reluctant developers are to scrap the whole design and start over. The longer problems remain in the system, the harder they are to remove. At some point, it might be easier to throw everything away and start over from scratch, a decision that few managers will want to present to upper management.

SPACE SPAT

An engineer friend of mine was working on a really huge satellite project. After a while, the engineers all realized that the project just wasn't feasible given the current state of technology and the design. Eventually, the project manager was forced to admit this to upper management and he was fired. The new project manager stuck it out for a while and then he, too, was forced to confess to upper management that the project was unfeasible. He, too, was fired.

For a while, this process continued—with a new manager taking over, realizing the hopelessness of the design, and being fired. That is, until eventually even upper management had to admit the project wasn't going to work out and the whole thing collapsed.

They could have saved time, money, and several careers if they had spent more time up-front on the design and either fixed the problems or realized right away that the project wasn't going to work and scrapped it at the start.

Building an application is often compared to building a house or skyscraper. You probably wouldn't start building a multibillion-dollar skyscraper without a comprehensive design that is based on well-established architectural principles. Unfortunately, software developers often rush off to start coding as soon as they possibly can because coding is more fun and interesting than design is. Coding also lets developers tell management and customers how many lines of code they have written, so it seems like they are making progress even if the lines of code are corrupted by false assumptions. Only later do they realize that the underlying design is flawed, the code they wrote is worthless, and the project is in serious trouble.

Now, let's get back to database design. Few parts of an application's design are as critical as the database's design. The database is the repository of the information that the rest of the application manages and displays to the users. If the database doesn't store the right data, doesn't keep the data safe, or doesn't let the application find the data it needs, then the application has little chance for success. Here, the garbage-in, garbage-out (GIGO) principle is in full effect. If the underlying data is unsound, it doesn't matter what the application does with it; the results will be suspect at best.

For example, imagine that you've built an order-tracking system that can quickly fetch information about a customer's past orders. Unfortunately, every time you ask the program to fetch a certain customer's records, it returns a slightly different result. Although the program can find data quickly, the results are not trustworthy enough to be usable.

For another example, imagine that you have built an amazing program that can track the thousands of tasks that make up a single complex job, such as building a cruise liner or passenger jet. It can track each task's state of completion, determine when you need to order new parts for them to be ready for future construction phases, and can even determine the present value of future purchases so you can decide whether it is better to buy parts now or wait until they are needed. Unfortunately, the program takes hours to recalculate the complex task schedule and pricing details. Although the calculations are correct, they are so slow that users cannot reasonably make any changes. Changing the color of the fabric of a plane's seats or the tile used in a cruise liner's hallways could delay the whole project. (I once worked on a project with a similar issue. It worked, but it was so slow that it became a serious problem.)

For a final example, suppose you have built an efficient subscription application that lets customers subscribe to your company's quarterly newsletters, data services, and sarcastic demotivational quote of the day. It lets you quickly find and update any customer's subscriptions, and it always consistently shows the same values for a particular customer. Unfortunately, when you change the price of one of your publications, you find that not all of the customers' records show the updated price. Some customers' subscriptions are at the new rate, some are at the old rate, and some seem to be at a rate you've never seen before. (This example isn't as far-fetched as it may seem. Some systems allow you to offer sale prices or special incentives to groups of customers, or they allow sales reps to offer special prices to particular customers. That kind of system requires careful design if you want to be able to do things like change standard prices without messing up customized pricing.)

Poor database design can lead to these and other annoying and potentially expensive scenarios. A good design creates a solid foundation on which you can build the rest of the application.

Experienced developers know that the longer a bug remains in a system, the harder it is to find and fix. From that it logically follows that it is extremely important to get the design right before you start building on it.

Database design is no exception. A flawed database design can doom a project to failure before it has begun as surely as ill-conceived software architecture, poor implementation, or incompetent programming can.

INFORMATION CONTAINERS

What is a database? This may seem like a trivial question, but if you take it seriously the result can be pretty enlightening. By studying the strengths and weaknesses of some physical objects that meet the definition of a database, you can learn about the features that you might like a computerized database to have.

This is a pretty broad definition and includes a lot of physical objects that most people don't think of as modern databases. For example, Figure 1.1 shows a box full of business cards, a notebook, a filing cabinet full of customer records, and your brain, all of which fit this definition. Each of these physical databases has advantages and disadvantages that can give insight into the features that you might like in a computer database.

A representation of an ER diagram. — **FIGURE 1.1**

A box of business cards is useful as long as it doesn't contain too many cards. You can find a particular piece of data (for example, the phone number for your favorite Canadian restaurant) by looking through all the cards. You can easily expand the database by shoving more cards into the box, at least up to a point. If you have more than a dozen or so business cards, finding a particular card can be time consuming. You can even rearrange the cards a bit to improve performance for cards you use often. Each time you use a card, you can move it to the front of the box. Over time, those that are used most often will migrate to the front.

A notebook (the cardboard and paper kind, not the small laptop kind) is small, easy to use, easy to carry, doesn't require electricity, and doesn't need to boot before you can use it. A notebook database is also easily extensible because you can buy another notebook to add to your collection when the first one is full. However, a notebook's contents are arranged sequentially. If you want to find information about a particular topic, you'll have to look through the pages one at a time until you find what you want. The more data you have, the harder this kind of search becomes.

A filing cabinet can store a lot more information than a notebook, and you can easily expand the database by adding more files or cabinets. Finding a particular piece of information in the filing cabinet can be easier than finding it in a notebook, as long as you are searching for the type of data used to arrange the records. If the filing cabinet is full of customer information sorted by customer name, and you want to find a particular customer's data, then you're in luck. If you want to find all of the customers who live in a certain city, you'll have to dig through the files one at a time.

Your brain is the most sophisticated database ever created. It can store an incredible amount of data and allows you to retrieve a particular piece of data in several different ways. For example, right now you could probably easily answer the following questions about the restaurants that you visit frequently:

Which is closest to your current location?
Which has the best desserts?
Which has the best service?
Which is least expensive?
Which is the best for a business lunch?
Which is your overall favorite?
Why don't we go there tonight?

Your brain provides many different ways you can access the same restaurant information. You can search based on a variety of keys (such as location, quality of dessert, expense, and so forth). To answer these questions with a box of business cards, a notebook, or a filing cabinet would require a long and grueling search.

Still your brain has some drawbacks, at least as a database. Most notably it forgets. You may be able to remember an incredible number of things, but some become less reliable or disappear completely over time. Do you remember the names of all of your elementary school teachers? I don't. (I don't remember my own teachers' names, much less yours!)

Your brain also gets tired, and when it is tired it is less accurate.

Although your brain is good at certain tasks, such as recognizing faces or picking restaurants, it is not so good at other tasks like providing an accurate list of every item a particular customer purchased in the past year. Those items have less emotional significance than, for example, your spouse's name, so they're harder to remember.

All of these information containers (business cards, notebooks, filing cabinets, and your brain) can become contaminated with misleading, incorrect, and contradictory information. If you write different versions of the same information in a notebook, the data won't be consistent. Later when you try to look up the data, you may find either version first and may not even remember that there's another version. (Your brain can become especially cluttered with inconsistent and contradictory information, particularly if you listen to politicians during an election year.)

The following section summarizes some of the strengths and weaknesses of these information containers.

STRENGTHS AND WEAKNESSES OF INFORMATION CONTAINERS

By understanding the strengths and weaknesses of information containers like those described in the previous section, you can learn about features that would be useful in a computerized database. So, what are some of those strengths and weaknesses?

The following list summarizes the advantages of some information containers:

None of these databases require electricity so they are safe from power failures (although your brain requires food; as the dormouse said, feed your head).
These databases keep data fairly safe and permanent (barring fires and memory loss). The data doesn't just disappear.
These databases (excluding your brain) are inexpensive and easy to buy.
These databases have simple user interfaces so that almost anyone can use them.
Using these databases, it's fairly easy to add, edit, and remove data.
The filing cabinet lets you quickly locate data if you search for it in the same way it is arranged (for example, by customer name).
Your brain lets you find data by using different keys (for example, by location, cost, or quality of service).
All of these databases allow you to find every piece of information that they contain, although it may take a while to dig through it all.
All of these databases (except possibly your brain) provide consistent results as long as the facts they store are consistent. For example, two people using the same notebook will find the same data. Similarly if you look at the same notebook at a later time, it will show the same data you saw before (if it hasn't been modified).
All of these databases except the filing cabinet are portable.
Your brain can perform complex calculations, at least of a limited type and number.
All of these databases provide atomic transactions.

The final advantage is a bit more abstract than the others so it deserves some additional explanation. An atomic transaction is a possibly complex series of actions that is considered to be a single operation by those who are not involved directly in performing the transaction.

The classic example is transferring money from one bank account to another. Suppose Alice writes Bob a check for $100 and you need to transfer the money between their accounts. You pick up the account book, subtract $100 from Alice's record, add $100 to Bob's record, and then put the notebook down. Someone else who uses the notebook might see it before the transaction (when Alice has the $100) or after the transaction (when Bob has the $100), but they won't see it during the transaction where the $100 has been subtracted from Alice but not yet given to Bob. The office bullies aren't allowed to grab the notebook from your hands when you're halfway through and play keep-away. It's an all-or-nothing transaction.

In addition to their advantages, information containers like notebooks and filing cabinets have some disadvantages. It's worth studying these disadvantages so that you can try to avoid them when you build computerized databases.

The following list summarizes some of the disadvantages that these information containers have:

All of these databases can hold incomplete, incorrect, or contradictory data.
Some of them are easy to lose or steal. Someone could grab your notebook while you're eating lunch or read over your shoulder on the bus. You could even forget your notebook at the security counter as you dash to catch your flight.
In all of these databases, correcting large errors in the data can be difficult. For example, it's easy to use a pen to change one person's address in an address notebook. It's much harder to update hundreds of addresses if a new city is created in your area. (This recently happened near where I live.) Such a circumstance might require a tedious search through a set of business cards, a notebook, or a filing cabinet. It may be years before your brain makes the switch completely.
These databases are relatively slow at creating, retrieving, updating, and deleting data. Your brain is much faster than the others at some tasks but is not good at manipulating a lot of information all at once. For example, how quickly can you list your 20 closest friends in alphabetical order? Even picking your closest friends can be difficult at times. (And don't think you can cheat by using Facebook because you probably have hundreds of friends there and we only want your top 20.)
Your brain can give different results at different times depending on uncontrollable factors such as your mood, how tired you are, and even whether you're hungry.
Each of these databases is located in a single place so it cannot be easily shared. Each also cannot be easily backed up, so if the original is lost or destroyed, you lose your data.

The following section considers how you can translate these strengths and weaknesses into features to prefer or avoid in a computerized database.

DESIRABLE DATABASE FEATURES

By looking at the advantages and disadvantages of physical databases, you can create a list of features that a computerized database should have. Some are fundamental characteristics that any database must have. (“You should be able to get data from it.” How obvious is that?)

Most of these features, however, depend at least in part on good database design. If you don't craft a good design, you'll miss out on some or all of the benefits of these features. For example, any decent database provides backup features, but a good design can make backup and recovery a lot quicker and easier.

The following sections describe some of the features that a good database system should provide and explain to what degree they depend on good database design.

CRUD

CRUD stands for the four fundamental database operations that any database should provide: Create, Read, Update, and Delete. If you read database articles and discussions on the web, you will often see people tossing around the term CRUD. (They may be using the term just to sound edgy and cool. Now that you know the term, you can sound cool, too!)

You can imagine some specialized data-gathering devices that don't support all these methods. For example, an airplane's black-box flight data recorders record flight information and later play it back without allowing you to modify the data. In general, however, if it doesn't have CRUD, it's not a database.

CRUD is more a general feature of databases than it is a feature of good database design. However, a good database design provides CRUD efficiently. For example, suppose you design a database to track times for your canuggling league (look it up online and don't let autocorrect send you to definitions of “snuggling”), and you require that the addresses for participants include a State value that is present in the States table. When you create a new record (the C in CRUD), the database must validate the new State entry (so no one enters “confusion” as their state, even if it is true). Similarly, when you update a record (the U in CRUD), the database must validate the modified State entry. When you delete an entry in the States table (the D in CRUD), the database must verify that no Participant records use that state. Finally, when you read data (the R in CRUD), the database design determines whether you find the data you want in seconds, hours, or not at all.

Many of the concepts described in the following sections relate to CRUD operations.

Retrieval

Retrieval is another word for “read,” the R in CRUD. (Happily “retrieval” also starts with R, so you don't need to memorize a new acronym.) Your database should allow you to find every piece of data. There's no point putting something in the database if there's no way to retrieve it later. (That would be a “data black hole,” not a database.)

The database should allow you to structure the data so that you can find particular pieces of data in one or more specific ways. For example, you should be able to find a customer's billing record by searching for customer name or customer ID.

Ideally the database will also allow you to structure the data so that it is relatively quick and easy to fetch data in a particular manner.

For example, suppose you want to see where your customers live so you can decide whether you should start a delivery service in a new city. To obtain this information, it would be helpful to be able to find customers based on their addresses. Ideally you could optimize the database structure so that you can quickly search for customers by address.

In contrast, you probably don't need to search for customers by their middle name too frequently. (Imagine a customer calling you and saying, “Can you look up my record? I don't remember if I paid my bill last month. I also don't remember my account number or my last name but my middle name is ‘Konfused.’”) It would be nice if the common search by address was faster than the rare search by middle name.

Being able to find all of the data in the database quickly and reliably is an important part of database design. Finding the data that you need in a poorly designed database can take hours or days instead of mere seconds.

Consistency

Another aspect of the R in CRUD is consistency. (The fact that “consistency” and CRUD both start with a C is purely coincidental. Don't get too excited.) A database should provide consistent results. If you perform the same search twice in a row, you should get the same results. Another user who performs the same search should also get the same results. (Of course, this assumes that the underlying data hasn't changed in the meantime. You can't expect your net worth query to return the same results every day when stock prices fluctuate wildly.)

A well-built database product can ensure that the exact same query returns the same result, but the design also plays an important role. If the database is poorly designed, you might be able to store conflicting data in different parts of the database. For example, you might be able to store one set of contact information in a customer's order and a different set of information in the main customer record. Later, if you need to contact the customer with a question about the order, which contact information should you use?

Validity

Validity is closely related to the idea of consistency. Consistency means different parts of the database don't hold contradictory views of the same information. Validity means data is checked where possible against other pieces of data in the database. In CRUD terms, data can be validated when a record is created, updated, or deleted.

Just like physical data containers, a computerized database can hold incomplete, incorrect, or contradictory data. You can never protect a database from users who can't spell or who just plain enter the wrong information, but a good database design can help prevent some kinds of errors that a physical database cannot prevent.

For example, the database can easily verify that data has the correct type. If the user sees a Date field and enters “No thanks, I'm married,” the database can tell that this is not a valid date format and can refuse to accept the value. Similarly, it can tell that “Old” is not a valid Age, “Lots” is not a valid Quantity, and 3 is not a valid PhoneNumber.

The database can also verify that a value entered by the user is present in another part of the database. For example, a poor typist trying to enter CO in a State field might type CP instead. The database can check a list of valid states and refuse to accept the data when it doesn't find CP listed. The user interface could also make the user pick the state from a drop-down list and avoid this problem, but the database should still protect itself against invalid data wherever possible.

The database can also check some kinds of constraints. Suppose the database contains a book-ordering system. When the customer orders 500 copies of this book (who wouldn't want that many copies?), the database can check another part of the database to see if that many copies are available (most bookstores carry only a few copies of any given book) and suggest that the customer backorder the copies if there aren't enough.

A good database design also helps protect the database against incorrect changes. Suppose you run a high-end coffee delivery service called the Brew Crew, and you've decided to drop coverage for a nearby city. When you try to remove that city from your list of valid locations, the database can tell you if you have existing customers in that city. Depending on the database's design, it could refuse to allow you to remove the city until you have apologized to those customers and removed them from the database.

All these techniques rely on a good, solid database design. They still can't protect you from a user who types first names in the last name field or who keeps accidentally bumping THE CAPS LOCK KEY, but it can prevent many types of errors that a paper and cardboard notebook can't.

Easy Error Correction

Even a perfectly designed database cannot ensure perfect validity. How can the database know that a customer's name is supposed to be spelled Pheidaux, not Fido as typed by your order entry clerk?

Correcting a single error in a notebook is fairly easy. Simply cross out the wrong value and write in the new one.

Correcting systematic errors in a notebook is a lot harder. Suppose you hire a summer intern to go door-to-door selling household products and he writes up a lot of orders for “duck tape” not realizing that the actual product is “duct tape.” Fixing all the mistakes could be tedious and time consuming. (Of course tedious and time-consuming jobs are what summer interns are for, so you could make him fix them.) You could just ignore the problem and leave the orders misspelled, but then how would you tell when a customer really wants to tape a duck?

In a computerized database, this sort of correction is trivial. A simple database command can update every occurrence of the product name “duck tape” throughout the whole system. (In fact, this kind of fix is sometimes too easy to make. If you aren't careful, you may accidentally change the names of every product to “duct tape.” You can prevent this by building a safe user interface for the database or by being really, really careful.)

Easy correction of errors is a built-in feature of computerized databases, but to get the best advantage from this feature you need a good design. If order information is contained in a free-formatted text section, the database will have trouble fixing typos. If you put the product name in a separate field, the database can make this change easily.

Although easy corrections are almost free, you need to do a little design work to make them as efficiently and effectively as possible.

Speed

An important aspect of all of the CRUD components is speed. A well-designed database can create, read, update, and delete records quickly.

There's no denying that a computerized database is a lot faster than a notebook or a filing cabinet. Instead of processing dozens of records per hour, a computerized database can process dozens or hundreds per second. (I once worked with a billing center that processed around 3 million accounts every three days.)

Good design plays a critical role in database efficiency. A poorly organized database may still be faster than the paper equivalent, but it will be a lot slower than a well-designed database.

CAN'T SEE THE FOREST FOR THE TREES

The billing center I mentioned two paragraphs earlier had a simple but annoying problem: they couldn't find the customers who owed them the most money. Every three days, the computer would print out a list of customers who owed money, and the list created a stack of paper almost 3 feet tall. (Sadly, I'm not exaggerating. Three feet of paper every three days! That's a small forest of trees' worth of paper every month.) Unfortunately, the list was randomly ordered (probably ordered by customer ID or shoe size or something equally unhelpful), so they couldn't figure out who owed the most money. The majority of the customers owed only a few dollars—too little to worry about—but a few customers owed tens of thousands of dollars.

We captured this printout electronically and sorted the accounts by balance. It turned out that the really problematic customers only filled two or three pages and the first five or so customers owed more than all of the others in the entire 3-foot tall stack combined.

I didn't include this story just to impress you with my mad ninja programming skillz (to be honest, it was a pretty easy project), but to illustrate how database design can make a big difference in performance. Had the database been designed properly to begin with, it would have been trivial to find the most problematic customers in a just few seconds.

Not all changes to a database's design can produce dramatic results, but design definitely plays an important role in performance.

Atomic Transactions

Recall that an atomic transaction is a possibly complex series of actions that is considered as a single operation by those not involved directly in performing the transaction. If you transfer $100 from Alice's account to Bob's account, no one else can see the database while it is in an intermediate state where the money has been removed from Alice's account and not yet added to Bob's.

The transaction either happens completely or none of its pieces happen—it cannot happen halfway.

Atomic transactions are important for maintaining consistency and validity, and are thus important for the R and U parts of CRUD.

Physical data containers like notebooks support atomic transactions because typically only one person at a time can use them. Unless you're distracted and wander off to lunch halfway through, you can finish a series of operations before you let someone else have a turn at the notebook.

Some of the most primitive kinds of electronic databases, such as flat files and XML files (which are described later in this book), don't intrinsically support atomic transactions, but the more advanced relational database products do. Those databases allow you to start a transaction and perform a series of operations. You can then either commit the transaction to make the changes permanent or roll back the transaction to undo them all and restore the database to the state it had before you started the transaction.

These databases also automatically roll back any transaction that is open when the database halts unexpectedly. For example, suppose you start a transaction, take $100 from Alice's account, and then your company's mascot (a miniature horse) walks through the computer room, steps on a power strip, and kills the power to your main computer. When you restart the database (after sending the mascot to the HR department for a stern lecture), it automatically rolls the transaction back so that Alice gets her money back. You'll need to try the transaction again, but at least no money has been lost by the system.

Atomic transactions are more a matter of properly using database features rather than database design. If you pick a reasonably advanced database product and use transactions properly, you gain their benefits. If you decide to use flat files to store your data, you'll need to implement transactions yourself.

ACID

This section provides some more detail about the transactions described in the previous section rather than discussing a new feature of physical data containers and computerized databases.

ACID is an acronym describing four features that an effective transaction system should provide. ACID stands for Atomicity, Consistency, Isolation, and Durability.

Atomicity means transactions are atomic. The operations in a transaction either all happen or none of them happen.

Consistency means the transaction ensures that the database is in a consistent state before and after the transaction. In other words, if the operations within the transaction would violate the database's rules, the transaction is rolled back. For example, suppose the database's rules say an account cannot make a payment that would result in a balance less than $0. Also, suppose that Alice's account holds only $75. You start a transaction, add $100 to Bob's account, and then try to remove $100 from Alice's. That would put Alice $25 below $0, violating the database's rules, so the transaction is canceled and we all try to forget that this ugly incident ever occurred. (Actually, we probably bill Alice an outrageous overdraft fee for writing a bad check.)

Isolation means the transaction isolates the details of the transaction from everyone except the person making the transaction. Suppose you start a transaction, remove $100 from Alice's account, and add $100 to Bob's account. Another person cannot peek at the database while you're in the middle of this process and see a state where neither Alice nor Bob has the $100. Anyone who looks in the database will see the $100 somewhere, either in Alice's account before the transaction or in Bob's account afterward.

In particular, two transactions operate in isolation and cannot interfere with each other. Suppose one transaction transfers $100 from Alice to Bob, and then a second transaction transfers $100 from Bob to Cindy. Logically one of these transactions occurs first and finishes before the other starts. For example, when the second transaction starts, it will not see the $100 missing from Alice's account unless it is already in Bob's account.

OUT OF ORDER

Note that the order in which the transactions occur may make a big difference. Suppose Alice starts with $150, Bob starts with $50, and Cindy starts with $50.

Now suppose the second Bob-to-Cindy transaction occurs first. If the transaction starts by removing $100 from Bob's account, Bob is overdrawn, this transaction is rolled back, we charge Bob an overdraft fee, and we try to sell Bob overdraft protection for the very low price of only $10 per month. After this, the Alice-to-Bob transaction occurs and we successfully move $100 into Bob's account.

In contrast, suppose the Alice-to-Bob transaction occurs first. That transaction succeeds with no problem so, when the Bob-to-Cindy transaction starts, Bob has $150 and the second transaction can complete successfully.

The database won't determine which transaction occurs first, just that each commits or rolls back before the other starts.

Durability means that once a transaction is committed, it will not disappear later. If the power fails, the effects of this transaction will still be there when the database restarts.

The durability requirement relies on the consistency rule. Consistency ensures that the transaction will not complete if it would leave the database in a state that violates the database's rules. Durability means that the database will not later decide that the transaction caused such a state and retroactively remove the transaction.

Once the transaction is committed, it is final.

SHADOW OF (NO) DOUBT

A high-end database might provide durability through continuous shadowing. Every time a database operation occurs, it is shadowed to another system. If the main system crashes, the shadow database can spring instantly into service. Other databases provide durability through logs. Every time the database performs an operation, it writes a record of the operation into the log. Now suppose the system crashes. When the database restarts, it reloads its last saved data and then reapplies all of the operations described by the log. This takes longer than restarting from a shadow database but requires fewer resources so it's generally less expensive.

To provide durability, the database cannot consider the transaction as committed until its changes are shadowed or recorded in the log so that the database will not lose the changes if it crashes.

BASE

Recently, some kinds of nonrelational, NoSQL databases have become increasingly popular, and they don't always satisfy the ACID rules. Their structure, plus the fact that they are often distributed across multiple servers, makes them behave differently. (We'll talk more about NoSQL databases a bit later. For now, just think of them as alternatives to relational databases.)

To better suit the asynchronous, distributed, sometimes in the cloud NoSQL lifestyle, these databases have their own feature acronym: BASE. (I'm sure this acronym was inspired partly by the fact that acids and bases are chemical opposites.)

BASE stands for Basically Available, Soft state, and Eventually consistent.

Basically Available means that the data is available. (It wouldn't be much of a database if you couldn't retrieve the data!) Relational databases spend a lot of effort requiring data to be immediately consistent (which is why you'll spend a fair amount of time studying normalization and other relational concepts later in this book). NoSQL databases spread their data across the database's clusters, so the data might not be immediately consistent. These databases do, however, guarantee that any query will return some sort of result, even if it's a failure (the electronic equivalent of a shrug).

Soft state means that the state of the data may change over time. Because these databases do not require immediate consistency, it may take a while for changes to filter through the entire system. Instead of enforcing consistency, these databases rely on the developer to provide whatever consistency is needed by the application.

Eventually consistent means that these databases do eventually become consistent, just not necessarily before the next time that you read the data (usually we're talking about seconds, not days).

We'll talk more about NoSQL databases in Chapter 3, “NoSQL Overview.”

NewSQL

NewSQL is a relatively new breed of relational database management system (RDBMS) that attempts to combine the ACID guarantees of traditional databases with the flexibility and scalability of NoSQL databases. These are particularly useful for high-volume online transaction processing (OLTP) systems where distributing the data can greatly improve performance.

One way to create a NewSQL database is to split the data into disjoint subsets called partitions or shards.

Persistence and Backups

The data must be persistent—it shouldn't change or disappear by itself. If you can't trust the database to keep the data safe, then the database is pretty much worthless.

Database products do their best to keep the data safe, and in normal operation you don't need to do much to get the benefit of data persistence. When something unusual happens, however, you may need to take special action and that requires prior planning. For example, suppose the disk drives holding the database simply break, a fire reduces the computer to a smoldering puddle of slag, or a user accidentally or intentionally deletes the database. (A user tried that once on a project I was working on. We were not amused!)

In these extreme cases, the database alone cannot help you. To protect against this sort of trouble, you need to perform regular backups.

Physical data containers like notebooks are generally hard to back up, so it's hard to protect them against damage. If a fire burns up your accounts receivable notebook, you'll have to rely on your customers' honesty in paying what they owe you. Even though we like customers, I'm not sure most businesses trust them to that extent.

In theory, you could make copies of a notebook and store them in separate locations to protect against these sorts of accidents, but in practice few legit businesses do (except perhaps money laundering, smuggling, and other shady endeavors where it's handy to show law enforcement officials one set of books and the “shareholders” another).

Computerized databases, however, are relatively easy to back up. If the loss of a little data won't hurt you too badly, you can back up the database daily. If fire, a computer virus, or some other accident destroys the main database, you can reload the backup and be ready to resume operation in an hour or two.

If the database is very volatile or if losing even a little data could cause big problems (how much money do you think gets traded through the New York Stock Exchange during a busy hour?), then you need a different backup strategy. Many higher-end database products allow you to shadow every database operation as it occurs, so you always have a complete copy of everything that happens. If the main database is destroyed, you can be back in business within minutes. Some database architectures can switch to a backup database so quickly that the users don't even know it's happened.

BACKUP PLANS

It's always best to store backups away from the computer that you're backing up. Then if a big accident like a fire, tornado, or attack by space aliens occurs and destroys the whole building holding the database, the backup is still safe.

I've known of several development groups that stored their backups right next to the computer they were backing up. That guards against some kinds of stupidity but doesn't protect against big accidents. (In the teams I've worked on, about once every 10 person-years or so, someone accidentally deleted a file that we needed to recover from backups.)

I've also known of companies that had an official backup plan, but once you submitted a backup for proper storage it was shipped off-site and it took a long time to get it back if you needed it. A backup doesn't do much good if you can't access it!

In a very extreme example, I had a customer who was concerned that backups were stored only 30 miles from the database. Their thought was that the backups might not be safe in the event of a volcanic eruption or nuclear explosion. Everyone needs to pick their own level of paranoia.

Exactly how you implement database backups depends on several factors, such as how likely you think a problem will be, how quickly you need to recover from it, and how disastrous it would be to lose some data and spend time restoring from a backup. In any case, a computerized database gives you a lot more options than a notebook does.

Good database design can help make backups a bit easier. If you arrange the data so that changes occur in a fairly localized area, you can back up that area frequently and not waste time backing up data that changes only rarely.

Low Cost and Extensibility

Ideally, the database should be simple to obtain and install, inexpensive, and easily extensible. If you discover that you need to process a lot more data per day than you had expected, you should be able to somehow increase the database's capacity.

Although some database products are quite expensive, most have reasonable upgrade paths, which means you can buy the least expensive license that can handle your needs, at least in the beginning. For example, SQL Server, Oracle, and MySQL provide free editions that you can use to get started building small, single-user applications. They also provide more expensive editions that are suitable for very large applications with hundreds of users.

Similarly, many cloud databases have a free tier for smaller projects and more expensive options for larger groups. Their pricing can be fairly complicated and can involve such factors as the number of cores (the part of the computer that executes code) you get to use, the amount of memory available, the type of computer holding the data, the type of backups you want, and the number of database transaction units (DTUs) that you use per month. (DTUs are a weighted measure of CPU, memory, reads, and writes.)

Installing a database will never be as easy and inexpensive as buying a new notebook, but it also doesn't need to be a time-consuming financial nightmare.

Although expense and capacity are more features of the particular database product than database design, good design can help with a different kind of extensibility. Suppose you have been using a notebook database for a while and discover that you need to capture a new kind of information. Perhaps you decide that you need to track customers' dining habits so you know what restaurant gift certificate to give them on special occasions. In this case, it would be nice if you could add extra paper to the bottom of the pages in the notebook. Good database design can make that kind of extension possible.

Ease of Use

Notebooks and filing cabinets have simple user interfaces so almost anyone can use them effectively—although sometimes even they get messed up pretty badly. Should you file “United States Postal Service” under “United States?” “Postal Service?” “Snail Mail?”

A computer application's user interface determines how usable it is by average users. User interface design is not a part of database design, so you might wonder why ease of use is mentioned here.

The first-level users of a database are often programmers and relatively sophisticated database users who understand how to navigate through a database. A good database design makes the database much more accessible to those users. Just by looking at the names of the tables, fields, and other database entities that organize the data, this type of user should be able to figure out how different pieces of data go together and how to use them to retrieve the data they need. If those sophisticated users can easily understand the database, then they can build better user interfaces for the less advanced users to come.

Portability

A computerized database allows for a portability that is even more powerful than the portability of a notebook. It allows you to access the data from anywhere that you have access to the web without actually moving the physical database. You can access the database from just about anywhere, while the data itself remains safely at home, far from the dangers of pickpockets, being dropped in a puddle, and getting forgotten on the bus.

DEFINITION Cloud providers may move your data around from server to server to balance their computing and network loads. That can become a problem if your data is moved from one country to another, because some countries have laws about what kinds of data you're allowed to store and where you can move it. This brings up two new terms: data residency and data sovereignty.

Data residency is where the data is stored. Data sovereignty is the idea that data is subject to the laws of the country in which it originated. For example, Canada passed laws requiring that Canadian data be stored on Canadian servers to protect the privacy of Canadian citizens from snooping if the data was stored in another country. (Think of the U.S. National Security Agency and the USA Patriot Act.)

In a similar manner, the General Data Protection Regulation (GDPR) regulates the transfer of data outside of the European Union (EU) and the European Economic Area (EEA).

These issues mean that, in some cases, companies may require that their cloud providers keep their data physically stored in certain countries. (I'm sure this causes headaches for both the companies and the cloud providers—and keeps many lawyers employed.)

Unfortunately, the new kind of portability may be a little too easy. Although someone in the seat behind you on the airplane can't peek over your shoulder to read computerized data the way they can a notebook (well, maybe they can if you're using your laptop in airplane mode), a hacker located on the other side of the planet may try to sneak into your database and rifle through your secret cookie recipes while you're asleep.

This leads to the next topic: security.

Security

A notebook is relatively easy to lose or steal, but a highly portable database can be even easier to compromise. If you can access your database from all over the world, then so can cyber-banditos and other ne'er-do-wells.

Locking down your database is mostly a security issue that you should address by using the security tools provided by your network and database. However, there are some design techniques that you can use to make securing the database easier.

If you separate the data into categories that different types of users need to manipulate, then you can grant different levels of permission to the different kinds of users. Giving users access to only the data they absolutely need not only reduces the chance of a legitimate user doing something stupid or improper, but also decreases the chance that an attacker can pose as that user and do something malicious. Even if Clueless Carl won't mistreat your data intentionally, an online attacker might be able to guess Carl's password (which naturally is “Carl”) and try to wreak havoc. If Carl doesn't have permission to trash the accounting data, then neither does the cyber-mugger.

Yet another novel aspect to database security is the fact that users can access the database remotely without holding a copy of the database locally. You can use your laptop or phablet to access a database without storing the data on your computer. That means if you do somehow lose your computer, the data may still be safe on the database server.

This is more an application architecture issue than a database design issue (don't store the data locally on laptops), but using a database design that restricts users' access to what they really need to know can help.

Sharing

It's not easy to share a notebook or a box full of business cards among many people. Two people can't really use a notebook at the same time, and there's some overhead in shipping the notebook back and forth among users. Taking time to walk across the room a dozen times a day would be annoying; express mailing a notebook across the country every day would be just plain silly.

Modern networks can let hundreds or even thousands of users access the same database at the same time from locations scattered across the globe. Although this is largely an exercise in networking and the tools provided by a particular database product, some design issues come into play. (And don't forget what I said earlier about data residency and data sovereignty.)

If you compartmentalize the data into categories that different types of users need to use as described in the previous section, this not only aids with security, but also helps reduce the amount of data that must be shipped across the network.

Breaking the data into reasonable pieces can also aid in coordination among multiple users. When your coworker in London starts editing a customer's record, that record must be locked so that other users can't sneak in and mess up the record before the edit is finished. Grouping the data appropriately lets you lock the smallest amount of data possible, which makes more data available for other users to edit.

Careful design can allow the database to perform some calculations on the server and send only the results to your boss (who's working hard on the beaches of Hawaii) instead of shipping the whole database out there and making the boss's computer do all of the work.

Good application design is also important. Even after you prepare the database for efficient use, the application still needs to use it properly. But without a good database design, these techniques aren't possible.

Ability to Perform Complex Calculations

Compared to the human brain, computers are idiots. It takes seriously powerful hardware and frighteningly sophisticated algorithms to perform tasks that you take for granted, such as recognizing faces, speaker-independent speech recognition, and handwriting recognition (although neither the human brain nor computers have yet deciphered doctors' prescriptions). The human brain is also self-programming, so it can learn new tasks flexibly and relatively quickly.

Although a computer lacks the adaptability of the human brain, it is great at performing a series of well-defined tasks quickly, repeatedly, and reliably. A computer doesn't get bored, let its attention wander, or make simple arithmetic mistakes (unless it suffers from the Pentium FDIV bug, the f00f bug, the Cyrix coma bug, or a few others). The point is, if the underlying hardware and software work correctly, the computer can perform the same tasks again and again millions of times per second without making mistakes.

When it comes to balancing checkbooks, searching for accounts with balances less than $0, and performing a host of other number-crunching tasks, the computer is much faster and less error-prone than a human brain.

The computer is naturally faster at these sorts of calculations, but even its blazing speed won't help you if your database is poorly designed. A good design can make the difference between finding the data you need in seconds rather than hours, days, or not at all.

CAP Theorem

The CAP theorem (also called Brewer's theorem after computer scientist Eric Brewer, who devised it) says that any distributed data store can only provide two of the following reliably:

Consistency—Every read receives the most recently written data (or an error).
Availability—Every read receives a non-error response, if you don't guarantee that it receives the most recently written data.
Partition tolerance—The data store continues to work even if messages are dropped or delayed by the network between the store's partitions.

When a partition failure occurs, you need to decide whether to either (a) cancel the operation (and reduce availability), or (b) have the operation continue (but risk inconsistency).

Basically all of this says that network errors or tardiness can make the data temporarily unavailable, or you can use the most recent value to which you have access.

Brewer noted that you only need to sacrifice consistency or availability if the data store is partitioned. If the data is all stored in one place, then you should be able to achieve both consistency and availability.

Cloud Considerations

A cloud database hosts data in the cloud so it is accessible over a network. There are two common cloud deployment models. First, you can rent space on a virtual machine and run the database there. A virtual machine (VM) is a simulation of a physical computer running on a server somewhere.

Virtual machines have the advantage that the cloud provider can move them around, possibly hosting multiple VMs on a single physical machine. Conversely, you may also be able to use multiple computers to host a single VM. Those two capabilities together make it easier to scale an application up or down as needed.

For example, suppose you write some software to schedule appointments for nail salons. Initially you serve only a few salons, so your VM uses a small fraction of one physical server. Over time, as more and more salons sign up for your service, you need more space and faster processing, so you start using more of the server. Soon your provider moves you onto faster hardware and eventually onto a small group of servers (for a price, of course).

A second way to build a cloud database is to rent a database from a cloud database provider. This has the same advantages as the virtual machine, plus the database provider will maintain the database for you. For example, the provider may perform backups, guarantee a certain level of availability, and so forth (again, for a price).

In this scenario, the database is provided as a service (aaS). More precisely, it is considered a database as a service (DBaaS), which is a kind of software as a service (SaaS). (The phrase “as a service” is very in vogue lately.)

Note that you can create both relational and NoSQL databases in the cloud. You can also find some that are free as long as the database is relatively small. The provider hopes that your nail salon will become wildly successful and that you'll then share your success with them by renting larger virtual machines or databases.

Legal and Security Considerations

I won't talk too much about legal issues in this book, but you should determine whether you might encounter any of them. For example, I already mentioned data residency and data sovereignty earlier in this chapter. Some countries require that certain kinds of data reside physically within their borders, and you could be in big trouble if your data is stored in the cloud on foreign servers.

In addition to ensuring that your cloud servers have allowed physical locations, you need to ensure that your data is properly protected. For example, in the United States, HIPAA (which stands for the Health Insurance Portability and Accountability Act and is pronounced “hip-uh”) prohibits the disclosure of a patient's sensitive medical information without their consent or knowledge. I don't believe HIPAA requires data residency (but I'm not a lawyer, so don't take my word for it), but some states have their own special requirements. For example, all 50 U.S. states plus Washington D.C., Puerto Rico, and the U.S. Virgin Islands have some sort of law requiring you to notify residents if their personal information is compromised in a security breach.

Obviously sensitive information like credit card numbers, bank account numbers, Social Security numbers, driver's license numbers, website passwords, biometric data, business information, and other important items require top-notch security.

Certain other kinds of data are also considered personal and/or sensitive and may or may not be protected by law. Personally identifiable information (PII) is information that could be used to assist with identity theft and includes such items as a person's name, mother's maiden name, address and former addresses, phone numbers, and so on. Sensitive data may include gender identity, ethnic background, political or religious affiliation, union membership, and more.

When you're shopping for cloud databases, consider the kinds of data that you will store, whether it's subject to data residency requirements, and whether it needs more than the normal amount of protection. Then mercilessly grill your cloud provider to ensure that they can handle your needs.

If you don't think these aren't important issues, think again! Governments take these things very seriously. For example, here are a few of the larger fines levied against companies that didn't protect their data properly.

Athens Orthopedic Clinic—$1.5 million HIPAA penalty
CHSPSC LLC—$2.3 million HIPAA penalty and $5 million settlement for a 28-state legal action
Google LLC—€60 million ($68 million) GDPR fine
Facebook—€60 million ($68 million) GDPR fine
Premera Blue Cross—$6.85 million HIPAA penalty, plus a multistate legal settlement for $10 million and a class action lawsuit for $74 million, for a total of over $90 million
Google Ireland—€90 million ($102 million) GDPR fine
Uber—$148 million for 600,000 driver and 57 million user accounts breached
Capital One—$190 million to settle a class-action lawsuit over a breach that exposed the personal data of 100 million people
Home Depot—$200 million paid to credit companies, banks, other financial institutions, and customers because attackers breached their system and compromised their point-of-sale system
WhatsApp—€225 million ($255 million) GDPR fine
Equifax—$575 million–$700 million for “failure to take reasonable steps to secure its network”
Amazon—€746 million ($877 million) GDPR fine

Those dozen cases alone add up to more than $2.5 billion. You can find so many more examples by searching for “biggest data breach fines” that it's not funny. In fact, it's terrifying. Many of these fines were not levied because a company was being evil; they were mostly just being lazy (not following the rules) and careless (not properly securing their data).

So find out what your exposure is and take data security very seriously.

Consequences of Good and Bad Design

Table 1.1 summarizes how good and bad design can affect the features described in the previous sections.

TABLE 1.1: Good vs. bad design

FEATURE	GOOD DESIGN	BAD DESIGN
CRUD	You can find the data you need quickly and easily. The database prevents inconsistent changes.	You find the data you need either very slowly or not at all. You can enter inconsistent data or modify and delete data to make the result inconsistent. (Your products ship to the wrong address or the wrong person.)
Retrieval	You can find the correct data quickly and easily.	You cannot find the data that you need quickly. (Your customer waits on hold for 45 minutes to get information, gets tired of waiting, and then buys from your competition.)
Consistency	All parts of the database agree on common facts.	Different pieces of information hold contradictory data. (A customer's bills are sent to one address, but late payment notices are sent to another.)
Validity	Fields contain valid data.	Fields contain gibberish. (Your company's address has the State value “Chaos.” Although if the database does hold that value, it's probably correct on some level.)
Error Correction	It's easy to update incorrect data.	Simple and large-scale changes never happen. (Thousands of your customers' bills are returned to you because their city changed and the database didn't get updated.)
Speed	You can quickly find customers by name, account number, or phone number.	You can only find a customer's record if they know their 37-digit account number. Searching by name takes half an hour.
Atomic Transactions	Either all related transactions happen or none of them happen.	Related transactions may occur partially. (Alice loses $100, but Bob doesn't receive it. Prepare for customer complaints!)
Persistence and Backups	You can recover from computer failure. The data is safe.	Recovering lost data is slow and painful or even impossible. (You lose all of the orders placed in the last week!)
Low Cost and Extensibility	You can move to a bigger database when your needs grow.	You're stuck on a small-scale database. (When your website starts receiving hundreds of orders per second, the database cannot keep up, and you lose thousands of orders per day.)
Ease of Use	The database design is clear so developers understand it and can build a great user interface.	The database design is confusing so the developers produce an “anthill” program—confusing and buggy. (I've worked on projects like that and it's no picnic!)
Portability	The design allows different users to download relevant data quickly and easily.	Users must download much more data than they need, slowing performance and giving them access to sensitive data (such as the Corporate Mission Statement, which proves that management has no clue).
Security	Users have access to the data that they need and nothing else.	Hackers and disgruntled former employees have access to everything.
Sharing	Users can manipulate the data they need.	Users lock data they don't really need and get in each other's way, slowing them down.
Complex Calculations	Users can easily perform complex analysis to support their jobs.	Poor design makes calculations take far longer than necessary. (I worked on a project where a simple change to a data model could force a 20-minute recalculation.)

SUMMARY

Database design plays an important role in application development. If the database design doesn't provide a solid foundation for the rest of the project to build upon, the application as a whole will fail.

Physical data containers like notebooks and file cabinets can behave as databases. They have strengths and weaknesses that you can use to think about corresponding strengths and weaknesses in computerized databases. With some effort, you can provide the strengths while avoiding the weaknesses.

In this chapter, you learned that a good database provides:

CRUD
Retrieval
Consistency
Validity
Easy error correction
Speed
Atomic transactions
ACID
Persistence and backups
Low cost and extensibility
Ease of use
Portability
Security
Sharing
Ability to perform complex calculations

The next chapter provides an overview of the most commonly used databases today: relational databases. NoSQL databases, which are described in the chapter after that, are coming on strong, but relational databases still have some powerful advantages in many scenarios.

Before you move on to Chapter 2, however, take a look at the following exercises and test your knowledge of the database design goals described in this chapter. You can find the solutions to these exercises in Appendix A.

EXERCISES

Compare this book to a database (assuming you don't just use it as a notebook, scribbling in the margins). What features does it provide? What features are missing?
Describe two features that this book provides to help you look for particular pieces of data in different ways.
What does CRUD stand for? What do the terms mean?
How does a chalkboard implement the CRUD methods? How does a chalkboard's database features compare to those of this book?
Consider a recipe file that uses a single index card for each recipe with the cards stored alphabetically. How does that database's features compare to those of a book?
What does ACID stand for? What do the terms mean?
What does BASE stand for? What do the terms mean?
What does the CAP theorem say? Under what conditions is this not a “pick two of three” situation?
Suppose Alice, Bob, and Cindy all have account balances of $100 and the database does not allow an account's balance to ever drop below $0. Now consider three transactions: (1) Alice transfers $125 to Bob, (2) Bob transfers $150 to Cindy, and (3) Cindy transfers $25 to Alice and $50 to Bob. In what order(s) can those transactions be executed successfully?
Explain how a central database can protect your confidential data.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.