Chapter 15: The Database Is Key

The database is the backbone and most important part of almost all web applications, regardless of the programming language used. This is because the database is where all of an application's data is stored, and in most cases, an application's data is more important than the application itself. In this chapter, you'll learn why proper database design is so important, some important database design principles, and why it is best to use the database's full capabilities. You'll also learn how to choose a model layer for your application, and how to handle database errors when using that model layer.

We will cover the following topics:

  • Learning why database design is so important
  • Understanding the most important database design principles
  • Treating the database as not just dumb storage
  • Choosing the model layer
  • Handling database and model errors

By the end of this chapter, you'll have a better understanding of database design and usage as it applies to Ruby web applications.

Technical requirements

In this chapter and all chapters of this book, code given in code blocks is designed to execute on Ruby 3.0. Many of the code examples will work on earlier versions of Ruby, but not all. The code for this chapter is available online at https://github.com/PacktPublishing/Polished-Ruby-Programming/tree/main/Chapter15.

Learning why database design is so important

In most cases, while your code is important, your data is far more important than your code. Let's say you have an online shop. If you have a bug in your code that prevents you from accepting orders, you are definitely losing money. However, at least the customer that is attempting to order realizes the order didn't go through. They might be annoyed, but they'll probably just shop somewhere else. However, if you suffer data loss or data corruption after accepting an order, that could be way worse. In that case, you have customers that expect to receive orders that they won't be receiving, and they will probably become very irate. It's even worse if the data loss or data corruption happens after the customer was charged, or the data corruption results in orders for customer A being shipped to customer B. Instead of just loss of goodwill, you may have an expensive lawsuit on your hands. For almost all businesses and applications, the data stored is more valuable than the application code itself.

In addition to data being more important than code, data almost always lasts far longer than code. It's not unusual for a business application to be completely rewritten in a new programming language, while the database remains exactly the same. For many large organizations, there is often a wide variety of separate applications that access the same data. The specific applications used may come and go with the ebb and flow of time. However, the data remains stored in the same database and is of primary importance.

Not only is the data more important and longer-lasting than code, but in most cases, the performance of applications is more dependent on how data is stored and structured than on the code itself. In the majority of applications, proper database design and indexing will make a larger performance difference than how code is written in a particular programming language, and often a larger performance difference than which programming language is used. An application written in Ruby using intelligent database design and indexing will often outperform an application written in C with less intelligent database design and improper indexing.

Often, you have a limited amount of time when designing your application, due to external constraints. It's wise to spend the majority of time thinking about the data you will be storing and how best to store it, what your access patterns will be, and how best to design your database indexes to make those access patterns as fast as possible.

Because data is so important, the database you use is probably the most important decision you will make for your application, even more than the choice of programming language. A poorly chosen database almost dooms an application to failure. In general, you should choose the most robust database that meets the needs of your application. Let's see how to decide that in the next section.

Deciding on a database to use

The first decision to make is what type of database to use, such as a relational (SQL) database, a schemaless document database, a key-value database, a graph database, or a more specialized database such as a time-series database. If you value your data at all and your database has any structure at all, a schemaless document database is usually a poor choice that you will end up regretting later when you discover anomalies in your data, far too late to fix them. Most key-value databases are too limited for the use of structured data unless they are treated as document databases, in which case they have the same issues as document databases. Unless you have studied and have had experience with graph databases and are sure they are the best database type for your application, they probably aren't. Similarly, unless you have very specialized needs, a specialized database such as a time series database is probably the wrong choice for your application.

For the vast majority of applications, the best database choice will be a relational database. Because a relational database requires that you explicitly design the schema you will be using, using a relational database requires more upfront thinking about the data you will be storing and how you will be accessing it. It is always worth spending the time thinking upfront about your data. Almost all the time you spend upfront designing a good database structure will be paid back in spades over the life of your application.

Assuming you will be using a relational database, you have to choose which one to use. In terms of open source databases, for the majority of applications, the best choice is PostgreSQL. PostgreSQL is the most robust open source database and has greater support for advanced data types and indexes. It is possible to be successful with MySQL, but you need to be careful to configure MySQL in a way that will not result in silent data loss.

For cases where you must use an embedded database and cannot use a client-server database, SQLite is the natural choice. However, be aware that SQLite does not enforce column types, treating declared types as recommendations. If you must use SQLite, make sure to use CHECK constraints as much as possible to actually enforce types. Another thing to be aware of with SQLite is that its support for altering tables is very limited. Making most changes to a table other than adding a column requires creating a new table with the structure you want, copying all data from the old table into the new table, dropping the old table, and renaming the new table with the old table's name.
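As a sketch, the table-recreation procedure for SQLite looks like the following, assuming a hypothetical tracks table whose name column we want to start enforcing as text (table and column names are illustrative, and a production migration would also need to recreate any indexes and triggers on the old table):

```sql
BEGIN;

-- 1. Create the new table with the desired structure.
CREATE TABLE tracks_new (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL CHECK (typeof(name) = 'text'),
  release_date TEXT
);

-- 2. Copy all data from the old table.
INSERT INTO tracks_new SELECT id, name, release_date FROM tracks;

-- 3. Drop the old table and rename the new one into place.
DROP TABLE tracks;
ALTER TABLE tracks_new RENAME TO tracks;

COMMIT;
```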

In this section, you learned why database design is so important, and how to decide on a database to use for your application. In the next section, you'll learn the most important principles of good database design.

Understanding the most important database design principles

In general, when designing a database structure, you want it to be normalized. Very briefly, here are the basic rules regarding data normalization:

  • Each row can be uniquely referenced by a single column or combination of columns (that is, always have a primary key).
  • Each column should contain a single value, not a collection of values (that is, avoid columns such as artist_names that contain comma-separated names of multiple artists).
  • Each column should contain different data than contained in other columns (that is, avoid columns such as artist_name_1, artist_name_2, and so on).
  • Each column in the table is only dependent on the primary key, and not dependent on another column (that is, avoid an artist_name and artist_country column in a tracks table).

There are a lot of books that cover database normalization. Please see the Further reading section at the end of the chapter for some examples. In general, it's best to start with a normalized database design, as normalized designs usually make inserting, updating, and deleting data the simplest.
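As a brief illustration, a normalized design following these rules for tracks and artists might look like this (table and column names are hypothetical):

```sql
-- Each table has a primary key (rule 1).
CREATE TABLE artists (
  id integer PRIMARY KEY,
  name text NOT NULL,
  country text NOT NULL
);

CREATE TABLE tracks (
  id integer PRIMARY KEY,
  name text NOT NULL
);

-- One row per artist per track, instead of an artist_names column
-- (rule 2) or artist_name_1/artist_name_2 columns (rule 3). The
-- artist's name and country live only in artists, so no column in
-- tracks depends on anything but the tracks primary key (rule 4).
CREATE TABLE artists_tracks (
  artist_id integer NOT NULL REFERENCES artists,
  track_id integer NOT NULL REFERENCES tracks,
  PRIMARY KEY (artist_id, track_id)
);
```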

Considerations when denormalizing your database design

There are cases where a normalized database design may not provide the necessary performance for the application. In some cases, that can be worked around with intelligent caching or the use of materialized views, but in some cases it cannot and you need to at least partially denormalize your database design. Denormalizing your database should only be done as a form of performance optimization. As you learned from reading Chapter 14, Optimizing Your Library, performance optimization should only be done if you are sure it is necessary.

If you are sure that performance optimization is necessary, you should profile the code beforehand and determine that the reason that performance is not up to the necessary level is due to database queries, and not other factors. Then you should benchmark to determine a baseline of current performance. When profiling and benchmarking, try to use a copy of the production database, because database performance is often dependent on what data is being stored in the database. After you have determined a baseline of performance, you can try denormalizing the design and see if it performs better.

As an example of such denormalization, let's say you have three database tables, one for tracks, one for artists, and one join table named artists_tracks with a foreign key for tracks and a foreign key for artists. In order to retrieve a selection of tracks and artists, you either need:

  • A three-way join between the tracks, artists_tracks, and artists tables
  • Three separate queries: one for tracks, one for related rows in artists_tracks, and one for the related rows in artists
  • Two separate queries: one for tracks, and the other with a join between the artists_tracks and artists tables

If retrieving tracks with the associated artists is critically important to performance, you could switch to storing an array of artist names in the tracks table. Then you do not need the artists_tracks join table, and you can retrieve the desired information in a single query without a join. However, by denormalizing, updating artist information becomes more difficult. Instead of updating an artist's name in one place, you need to update it in every track the artist worked on. That's much more difficult and performance-intensive than the normalized approach. However, in the real world, artist names are unlikely to change for existing tracks, so this particular denormalization is unlikely to cause problems. Remember that denormalizing is always a tradeoff to get increased performance in particular cases while making other cases more difficult.
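The normalized and denormalized access patterns described above might look like this (PostgreSQL array syntax, with hypothetical column names):

```sql
-- Normalized: a three-way join to get each track with its artists.
SELECT tracks.name, artists.name AS artist_name
FROM tracks
JOIN artists_tracks ON artists_tracks.track_id = tracks.id
JOIN artists ON artists.id = artists_tracks.artist_id;

-- Denormalized: artist names stored as an array column on tracks,
-- retrieved in a single query with no join:
--   ALTER TABLE tracks ADD COLUMN artist_names text[];
SELECT name, artist_names FROM tracks;
```

Note the tradeoff in the second approach: renaming an artist now requires updating the artist_names array in every affected row of tracks.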

Other database design principles

In general, you should design your database tables to be as small as possible. Only collect the information you will need for your application. Do not blindly add created_at and updated_at columns to every table, only add them to the tables that really need both. By limiting the size of your database tables, you'll make access to those tables faster. Remember that you can always add columns later if you determine you really need the information. Do not assume you will need information in the future that you currently don't need, because in most cases, you probably won't need it.

Spend time thinking about how you are going to access your data, and design your database indexes around those use cases. If you are going to be accessing tracks by name, you'll want an index on the name column. If you will be accessing tracks by release date, you'll want an index on the release_date column. If you are going to be accessing tracks by release date, but then showing them in order by name, you'll want a composite index on the release_date column and the name column in that order, because that will allow the database to avoid a separate sort step when retrieving the data. Similarly, if you are accessing data by name and then showing the matching tracks by release date, a composite index on the name column and the release_date column in that order is what you want.

Every index you add has a cost. Each index added to a table will make inserting, updating, and deleting slower. So, only add the indexes you need. Remember that a single index can serve multiple purposes. With one composite index on the release_date column and the name column, and another composite index on the name column and the release_date column, you can optimize all four of the cases described previously. When adding an index on a single column, always consider whether it is a better idea to add a composite index with that column as the first column and an additional column added after. Such composite indexes are in general more flexible and can optimize more types of queries, though they may not optimize a specific query as well.
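The two composite indexes described above could be created as follows (assuming a tracks table with name and release_date columns):

```sql
-- Optimizes lookups by release_date, and lookups by release_date
-- that display results ordered by name.
CREATE INDEX tracks_release_date_name_idx
  ON tracks (release_date, name);

-- Optimizes lookups by name, and lookups by name that display
-- results ordered by release_date.
CREATE INDEX tracks_name_release_date_idx
  ON tracks (name, release_date);
```

Between them, these two indexes cover all four access patterns, so no single-column index on name or release_date is needed.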

You should enforce referential integrity whenever you can by using foreign keys. It's almost a guarantee that if you do not enforce referential integrity in your database, you'll eventually have foreign key columns that point to rows that no longer exist. You should avoid any feature that does not allow the enforcement of referential integrity, such as polymorphic associations. There are only rare exceptions where referential integrity shouldn't be enforced with foreign keys, such as audit logs stored in the database where you need to allow the deletion of rows but still retain the audit logs.

In this section, you've learned important database design principles. In the next section, you'll learn how to use all of a database's features to maximize the value it brings to your application.

Treating the database as not just dumb storage

You should avoid treating your database as just a place to store and retrieve your data, as a good relational database offers so much more.

In general, retrieving rows from a database table, performing operations on each row, and saving the modified row back to the database is relatively slow. If it is possible, it is much faster to write a database function for performing the same operation, and issue a single UPDATE statement to update all rows at once. This is usually at least one order of magnitude faster, and sometimes multiple orders of magnitude faster.
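For example, if every track's price needed to be raised by 10%, a single statement pushes the work into the database instead of a row-at-a-time loop in application code (the tracks table and price column are hypothetical):

```sql
-- One statement and one round trip, instead of a SELECT followed
-- by a separate UPDATE per modified row.
UPDATE tracks SET price = price * 1.1;
```

For logic too complex for a single expression, the same principle applies: define a database function and call it from the UPDATE statement.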

Even in cases where you aren't updating data, you can use database operations to improve performance. For example, let's say you have a table with first names and last names, and you want to return a single string combining the first and last names. Do the concatenation in the database via a query such as this:

SELECT first_name || ' ' || last_name FROM names;

The preceding command is going to be significantly faster than a query such as this:

SELECT first_name, last_name FROM names;

And then doing the concatenation in Ruby via code such as this:

"#{row[:first_name]} #{row[:last_name]}"

One of the most important benefits of a database is enforcing data consistency. However, enforcing specific data types is only part of consistency. As much as possible, you should use database constraints to enforce consistency.

NOT NULL constraints allow you to enforce that database columns cannot have NULL values. You should use NOT NULL constraints on any column that will not need to contain NULL values. You should almost always default to using NOT NULL on all columns, and only skip adding a NOT NULL constraint if you are sure that you will need to store NULL values. If you actually have to deal with missing data, consider moving the column or columns to a separate table. That way if the data is not available, there is no row in the related table. In addition to avoiding NULL storage, this makes access to the main table faster, at the expense of requiring a separate query to get the column or columns from a separate table.
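For example, if a biography is only available for some artists, rather than adding a nullable biography column to the artists table, you could move it to a separate table (table and column names are hypothetical):

```sql
-- Artists without a biography simply have no row here, so neither
-- table needs to store NULL values.
CREATE TABLE artist_biographies (
  artist_id integer PRIMARY KEY REFERENCES artists,
  biography text NOT NULL
);
```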

CHECK constraints enforce that a column meets particular criteria, such as being in a certain range. You should use CHECK constraints to enforce specific values for a single column or a combination of columns. If a number column should only have values between 1 and 99, use a CHECK constraint to enforce that. If you have a release_date column for tracks, you probably don't want to support tracks released in the year 200 or the year 20000, so a CHECK constraint to enforce a specific date range will add benefits later when someone leaves off a zero or adds a zero when updating the release_date column. Avoid trying to enforce consistency for column values solely using application layer rules such as model validations, as that will not catch cases where the model is not used to modify the data.
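The constraints just described could be sketched as follows (assuming hypothetical number and release_date columns on a tracks table):

```sql
-- Reject track numbers outside the valid range.
ALTER TABLE tracks
  ADD CONSTRAINT valid_number
  CHECK (number BETWEEN 1 AND 99);

-- Reject release dates with a missing or extra digit in the year.
ALTER TABLE tracks
  ADD CONSTRAINT valid_release_date
  CHECK (release_date BETWEEN '1900-01-01' AND '2100-01-01');
```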

To enforce uniqueness on a single column, use a UNIQUE constraint or unique index on that column. To enforce uniqueness on a combination of columns, use a composite UNIQUE constraint or a unique index on the combination of columns. There are some cases where you want to enforce uniqueness, but only when certain criteria are met. For example, if you have a table of employees, and you want to enforce that the username column for the employee is unique, but only for active employees, you cannot just add a unique index on the username column. This is because if a previous employee had that username 10 years ago, a new employee wouldn't be able to use it. For this case, use a partial unique index, where you enforce uniqueness on username, but only when the active column is true.
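The partial unique index for the employees example could look like this (PostgreSQL syntax; the employees table and its columns are hypothetical):

```sql
-- Usernames must be unique, but only among active employees, so a
-- former employee's username can be reused.
CREATE UNIQUE INDEX employees_active_username_idx
  ON employees (username)
  WHERE active;
```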

To enforce consistency across multiple tables, use database triggers. For example, if you have a database of albums and tracks, and want to have a num_tracks column in the albums table automatically updated when you add or remove a track for the album, that is the perfect case for a database trigger. Avoid trying to solely enforce consistency across multiple tables using application layer rules such as model hooks, since they cannot catch all cases and are prone to race conditions. Always use a database-level trigger to ensure data consistency across multiple tables.
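A minimal sketch of such a trigger in PostgreSQL follows, assuming hypothetical albums and tracks tables where tracks has an album_id foreign key. Note that this sketch only handles inserts and deletes; an UPDATE that moves a track between albums would need additional handling:

```sql
CREATE FUNCTION update_num_tracks() RETURNS trigger AS $$
DECLARE
  aid integer;
BEGIN
  -- For deletes, the affected album comes from the old row;
  -- for inserts, from the new row.
  aid := CASE TG_OP WHEN 'DELETE' THEN OLD.album_id ELSE NEW.album_id END;
  UPDATE albums
    SET num_tracks = (SELECT count(*) FROM tracks WHERE album_id = aid)
    WHERE id = aid;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER albums_num_tracks
  AFTER INSERT OR DELETE ON tracks
  FOR EACH ROW EXECUTE FUNCTION update_num_tracks();
```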

In this section, you learned to not treat the database as dumb storage, and instead to use database features to improve the consistency of your data. In the next section, you'll learn how to choose a model layer for your application.

Choosing the model layer

Except in rare cases where you need the absolute maximum performance and are coding directly against the database driver, you will almost always want to use a model layer or other database abstraction layer in your application. Which model layer you choose can have significant performance effects on your application, as well as influence what other libraries you can use in your application.

For SQL database model layers in Ruby, there are only two libraries with significant popularity (that is, over one million downloads), Active Record and Sequel. Active Record is by far the most popular option, due to it being the default model layer for the most popular web framework, Ruby on Rails. Outside of usage in Ruby on Rails, Sequel is probably slightly more popular than Active Record, but there is still a lot of Active Record use outside of Ruby on Rails.

Due to Active Record's popularity, many more libraries integrate with Active Record than integrate with Sequel. Similarly, many more developers have experience of using Active Record than Sequel. In cases where you need to find other developers that already have experience with the model layer or other libraries that need to integrate with the model layer, Active Record provides many more options.

On the other hand, the advantage of using Sequel is that it is usually significantly faster, often multiple times faster for the same operation. In general, it is considered technically superior, easier to use, and less prone to bugs. When bugs are found and reported in Sequel, they tend to be fixed much more quickly than bugs found and reported in Active Record. Sequel also has a much more regular release cycle than Active Record, so after a bug is fixed, you never need to wait more than a month for the fix to appear in a released version.

One of the largest interface differences between Active Record and Sequel is that Active Record is designed to work with SQL string fragments, while Sequel is designed to work with Ruby expressions. For example, in Active Record, to retrieve records with more than 10 tracks, you would use an SQL string fragment like the following:

tracks = 10

Album.where("num_tracks > ?", tracks).to_a

Whereas in Sequel, you would typically use a Ruby expression:

tracks = 10

Album.where{num_tracks > tracks}.all

By avoiding SQL string fragments, applications using Sequel are less prone to SQL injection attacks than applications using Active Record. Active Record does offer support for using Ruby expressions similar to what Sequel offers, using an internal library called Arel. However, Arel is considered a private API and is not officially supported. It is also more cumbersome to use and more prone to breakage in new versions.

In this section, you learned the principles of choosing a model layer for your application. In the next section, you'll learn how best to handle database errors.

Handling database and model errors

In the previous section, you learned some differences between Active Record and Sequel. One additional difference is their default approach to error handling. By default in Active Record, saving a model instance can return false if the model is not valid:

model_instance.save

# => true or false

This is different than Sequel, where saving a model instance by default raises an exception:

model_instance.save

# => model

# or raise Sequel::ValidationFailed

Both Active Record and Sequel have options for either behavior, but this shows the philosophical difference between the two libraries. With Active Record, you need to be diligent to make sure you check every call to save to determine whether it succeeded or failed. If you forget to check a call to save, your code will continue running, and you may not realize your model instance was never saved. As you learned in Chapter 5, Handling Errors, this is a fail-open design. With Sequel, because exceptions are raised for every failure, you do not need to worry about missing an error. If you do miss handling an error, you end up with an unhandled exception, which you will probably be alerted to. Sequel uses a fail-closed design by default.
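The difference can be sketched in plain Ruby. These toy classes are purely illustrative, not the actual Active Record or Sequel implementations:

```ruby
class ValidationFailed < StandardError; end

# Fail-open style (Active Record's default): save returns false on
# failure, so a forgotten return-value check lets execution continue
# silently with an unsaved instance.
class FailOpenModel
  def initialize(valid)
    @valid = valid
  end

  def save
    @valid
  end
end

# Fail-closed style (Sequel's default): save raises on failure, so a
# missed error surfaces as an unhandled exception.
class FailClosedModel
  def initialize(valid)
    @valid = valid
  end

  def save
    raise ValidationFailed, "instance is not valid" unless @valid
    self
  end
end

FailOpenModel.new(false).save   # => false, easy to ignore

begin
  FailClosedModel.new(false).save
rescue ValidationFailed => e
  # The failure cannot be silently ignored; it must be handled here.
  e.message
end
```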

In general, when dealing with database access, it's best to catch errors as soon as possible, ideally identifying problems before sending a query to the database. In the model layer, the way you do this is generally by validating the model object before saving it. This validation happens automatically when calling the save method, before sending a query to the database.

With Active Record, you are encouraged to manually define model validations for each of the columns in your model, and avoid the use of database constraints. If you miss defining a validation with Active Record, you often will end up with bad data in the database, since there is no database constraint to enforce consistency. With Sequel, you are encouraged to use database constraints as the primary method of enforcing data consistency, with model validations only being used to provide nicer error messages. The advantage of Sequel's approach is that if you modify the database without going through the model layer, data consistency issues will still be caught by the database.

Sequel is shipped with multiple plugins that work with the database to add validations automatically, such as the following:

  • The auto_validations plugin, which adds type validations, presence validations, uniqueness validations, and max length validations based on the database schema.
  • The constraint_validations plugin, which defines constraints in your database for each validation, with metadata stored in the database to automatically set up model validations for each constraint.
  • The pg_auto_constraint_validations plugin, which will handle constraint violation exceptions raised when sending the database query, and automatically translate them into validation failures (this plugin is PostgreSQL specific).
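Enabling these plugins is a one-line configuration per model, or globally via Sequel::Model. A sketch, assuming a hypothetical DATABASE_URL environment variable and a Track model backed by a tracks table:

```ruby
require "sequel"

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))

# Apply to all model classes created after this point:
Sequel::Model.plugin :auto_validations
Sequel::Model.plugin :constraint_validations

class Track < Sequel::Model
  # PostgreSQL-specific: translate constraint-violation exceptions
  # raised by the database into validation failures on the instance.
  plugin :pg_auto_constraint_validations
end
```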

When saving a model instance, your code always needs to be prepared to handle exceptions raised by the database driver. In many cases, there may be nothing you can do other than report the error. If you can translate the exception and provide a nice validation error to the user, that is best, but is often not possible.

Summary

In this chapter, you learned that database design is probably the most important part of your application. You've learned important database principles such as only storing information that is actually needed, and when to use a denormalized design. You've learned to take advantage of database features such as functions, triggers, constraints, and unique indexes. You've learned some differences between popular Ruby libraries for database modeling, and some principles for choosing the library that is best for your application. Finally, you've learned how to handle database and model errors in your application. With all of the information you've learned, you are now better able to design an appropriate data storage layer for your application.

In the next chapter, you'll learn important principles for designing your web application at levels above the database.

Further reading

Questions

  1. Why is database design the most important part of your application design?
  2. When should you consider denormalizing your database?
  3. What is a good reason to use a database trigger?
  4. If your primary consideration when developing is how many external libraries you can use to save development time, which model layer is probably best?
  5. Which model layer uses a fail-closed design for handling model errors during saving?