Chapter 5. Optimizing Schema and Data Types

Good logical and physical design is the cornerstone of high performance, and you must design your schema for the specific queries you will run. This often involves trade-offs. For example, a denormalized schema can speed up some types of queries but slow down others. Adding counter and summary tables is a great way to optimize queries, but they can be expensive to maintain. MySQL’s particular features and implementation details influence this quite a bit.

This chapter and the following one, which focuses on indexing, cover the MySQL-specific bits of schema design. We assume that you know how to design databases, so this is not an introductory chapter, or even an advanced chapter, on database design. It’s a chapter on MySQL database design - it’s about what is different when designing databases with MySQL rather than other relational database management systems. If you need to study the basics of database design, we suggest Clare Churcher’s book Beginning Database Design (Apress).

This chapter is preparation for the two that follow. In these three chapters, we will explore the interaction of logical design, physical design, and query execution. This requires a big-picture approach as well as attention to details. You need to understand the whole system to understand how each piece will affect others. You might find it useful to review this chapter after reading the chapters on indexing and query optimization. Many of the topics discussed can’t be considered in isolation.

Choosing Optimal Data Types

MySQL supports a large variety of data types, and choosing the correct type to store your data is crucial to getting good performance. The following simple guidelines can help you make better choices, no matter what type of data you are storing:

Smaller is usually better.
In general, try to use the smallest data type that can correctly store and represent your data. Smaller data types are usually faster, because they use less space on the disk, in memory, and in the CPU cache. They also generally require fewer CPU cycles to process.
Make sure you don’t underestimate the range of values you need to store, though, because increasing the data type range in multiple places in your schema can be a painful and time-consuming operation. If you’re in doubt as to which is the best data type to use, choose the smallest one that you don’t think you’ll exceed. (If the system is not very busy or doesn’t store much data, or if you’re at an early phase in the design process, you can change it easily later.)
Simple is good.
Fewer CPU cycles are typically required to process operations on simpler data types. For example, integers are cheaper to compare than characters, because character sets and collations (sorting rules) make character comparisons complicated. Here are two examples: you should store dates and times in MySQL’s built-in types instead of as strings, and you should use integers for IP addresses. We discuss these topics further later.
Avoid NULL if possible.
A lot of tables include nullable columns even when the application does not need to store NULL (the absence of a value), merely because it’s the default. It’s usually best to specify columns as NOT NULL unless you intend to store NULL in them.
It’s harder for MySQL to optimize queries that refer to nullable columns, because they make indexes, index statistics, and value comparisons more complicated. A nullable column uses more storage space and requires special processing inside MySQL.
The performance improvement from changing NULL columns to NOT NULL is usually small, so don’t make it a priority to find and change them on an existing schema unless you know they are causing problems.
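
To make these guidelines concrete, here is a minimal sketch of a table definition; the table and column names are hypothetical, not from any real application. It uses small types, avoids strings where integers will do, declares NOT NULL wherever the application never stores the absence of a value, and keeps one column nullable because "unknown" is genuinely part of its meaning:

CREATE TABLE customer_profile (
  customer_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  country_id  SMALLINT UNSIGNED NOT NULL,       -- small integer, not a string
  signup_date DATE NOT NULL,                    -- built-in temporal type, not VARCHAR
  login_count INT UNSIGNED NOT NULL DEFAULT 0,  -- NOT NULL with an explicit default
  last_login  DATETIME NULL                     -- NULL here really means "never logged in"
) ENGINE=InnoDB;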

The first step in deciding what data type to use for a given column is to determine what general class of types is appropriate: numeric, string, temporal, and so on. This is usually pretty straightforward, but we mention some special cases where the choice is unintuitive.

The next step is to choose the specific type. Many of MySQL’s data types can store the same kind of data but vary in the range of values they can store, the precision they permit, or the physical space (on disk and in memory) they require. Some data types also have special behaviors or properties.

For example, a DATETIME and a TIMESTAMP column can store the same kind of data: date and time, to a precision of one second. However, TIMESTAMP uses only half as much storage space, is time zone-aware, and has special autoupdating capabilities. On the other hand, it has a much smaller range of allowable values, and sometimes its special capabilities can be a handicap.

We discuss base data types here. MySQL supports many aliases for compatibility, such as INTEGER (maps to INT), BOOL (maps to TINYINT), and NUMERIC (maps to DECIMAL). These are only aliases. They can be confusing, but they don’t affect performance. If you create a table with an aliased data type and then examine SHOW CREATE TABLE, you’ll see that MySQL reports the base type, not the alias you used.

Whole Numbers

There are two kinds of numbers: whole numbers and real numbers (numbers with a fractional part). If you’re storing whole numbers, use one of the integer types: TINYINT, SMALLINT, MEDIUMINT, INT, or BIGINT. These require 8, 16, 24, 32, and 64 bits of storage space, respectively. They can store values from −2^(N−1) to 2^(N−1) − 1, where N is the number of bits of storage space they use.

Integer types can optionally have the UNSIGNED attribute, which disallows negative values and approximately doubles the upper limit of positive values you can store. For example, a TINYINT UNSIGNED can store values ranging from 0 to 255 instead of from −128 to 127.

Signed and unsigned types use the same amount of storage space and have the same performance, so use whatever’s best for your data range.

Your choice determines how MySQL stores the data, in memory and on disk. However, integer computations generally use 64-bit BIGINT integers. (The exceptions are some aggregate functions, which use DECIMAL or DOUBLE to perform computations.)

MySQL lets you specify a “width” for integer types, such as INT(11). This is meaningless for most applications: it does not restrict the legal range of values, but simply specifies the number of characters MySQL’s interactive tools (such as the command-line client) will reserve for display purposes. For storage and computational purposes, INT(1) is identical to INT(20).
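
To make the sizes and ranges concrete, here is a small sketch using a hypothetical table; the comments show the storage each column uses and the values it can hold:

CREATE TABLE int_sizes (
  tiny_u  TINYINT UNSIGNED,   -- 1 byte:  0 to 255
  tiny_s  TINYINT,            -- 1 byte:  -128 to 127
  small_s SMALLINT,           -- 2 bytes: -32,768 to 32,767
  int_u   INT UNSIGNED,       -- 4 bytes: 0 to 4,294,967,295
  big_s   BIGINT              -- 8 bytes: roughly -9.2 x 10^18 to 9.2 x 10^18
) ENGINE=InnoDB;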

Real Numbers

Real numbers are numbers that have a fractional part. However, they aren’t just for fractional numbers; you can also use DECIMAL to store integers that are so large they don’t fit in BIGINT. MySQL supports both exact and inexact types.

The FLOAT and DOUBLE types support approximate calculations with standard floating-point math. If you need to know exactly how floating-point results are calculated, you will need to research your platform’s floating-point implementation.

You can specify a floating-point column’s desired precision in a couple of ways, which can cause MySQL to silently choose a different data type or to round values when you store them. These precision specifiers are nonstandard, so we suggest that you specify the type you want but not the precision.

Floating-point types typically use less space than DECIMAL to store the same range of values. A FLOAT column uses four bytes of storage. DOUBLE consumes eight bytes and has greater precision and a larger range of values than FLOAT. As with integers, you’re choosing only the storage type; MySQL uses DOUBLE for its internal calculations on floating-point types.

Because of the additional space requirements and computational cost, you should use DECIMAL only when you need exact results for fractional numbers - for example, when storing financial data. But in some high-volume cases it actually makes sense to use a BIGINT instead, and store the data as some multiple of the smallest fraction of currency you need to handle. Suppose you are required to store financial data to the ten-thousandth of a cent. You can multiply all dollar amounts by a million and store the result in a BIGINT, avoiding both the imprecision of floating-point storage and the cost of the precise DECIMAL math.
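
Here is a minimal sketch of that last approach, with hypothetical table and column names: amounts are stored as an integer count of millionths of a dollar and converted only for display:

-- Store amounts as millionths of a dollar (ten-thousandths of a cent)
CREATE TABLE ledger (
  entry_id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  amount_millionths BIGINT NOT NULL
) ENGINE=InnoDB;

INSERT INTO ledger(amount_millionths) VALUES (19995000);   -- $19.995

SELECT amount_millionths / 1000000 AS amount_usd FROM ledger;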

String Types

MySQL supports quite a few string data types, with many variations on each. Each string column can have its own character set and set of sorting rules for that character set, or collation.

VARCHAR and CHAR types

The two major string types are VARCHAR and CHAR, which store character values. Unfortunately, it’s hard to explain exactly how these values are stored on disk and in memory, because the implementations are storage engine-dependent. We assume you are using InnoDB; if not, you should read the documentation for your storage engine.

Let’s take a look at how VARCHAR and CHAR values are typically stored on disk. Be aware that a storage engine may store a CHAR or VARCHAR value differently in memory from how it stores that value on disk, and that the server may translate the value into yet another storage format when it retrieves it from the storage engine. Here’s a general comparison of the two types:

VARCHAR
VARCHAR stores variable-length character strings and is the most common string data type. It can require less storage space than fixed-length types, because it uses only as much space as it needs (i.e., less space is used to store shorter values).
VARCHAR uses 1 or 2 extra bytes to record the value’s length: 1 byte if the column’s maximum length is 255 bytes or less, and 2 bytes if it’s more. Assuming the latin1 character set, a VARCHAR(10) will use up to 11 bytes of storage space. A VARCHAR(1000) can use up to 1002 bytes, because it needs 2 bytes to store length information.
VARCHAR helps performance because it saves space. However, because the rows are variable-length, they can grow when you update them, which can cause extra work. If a row grows and no longer fits in its original location, the behavior is storage engine-dependent. For example, InnoDB may need to split the page to fit the row into it. Other storage engines may never update data in-place at all.
It’s usually worth using VARCHAR when the maximum column length is much larger than the average length; when updates to the field are rare, so fragmentation is not a problem; and when you’re using a complex character set such as UTF-8, where each character uses a variable number of bytes of storage.
It’s trickier with InnoDB, which can store long VARCHAR values as BLOBs. We discuss this later.
CHAR
CHAR is fixed-length: MySQL always allocates enough space for the specified number of characters. When storing a CHAR value, MySQL removes any trailing spaces. Values are padded with spaces as needed for comparisons.
CHAR is useful if you want to store very short strings, or if all the values are nearly the same length. For example, CHAR is a good choice for MD5 values for user passwords, which are always the same length. CHAR is also better than VARCHAR for data that’s changed frequently, because a fixed-length row is not prone to fragmentation. For very short columns, CHAR is also more efficient than VARCHAR; a CHAR(1) designed to hold only Y and N values will use only one byte in a single-byte character set,1 but a VARCHAR(1) would use two bytes because of the length byte.

This behavior can be a little confusing, so we’ll illustrate with an example. First, we create a table with a single CHAR(10) column and store some values in it:

mysql> CREATE TABLE char_test( char_col CHAR(10));
mysql> INSERT INTO char_test(char_col) VALUES
 -> ('string1'), (' string2'), ('string3 ');

When we retrieve the values, the trailing spaces have been stripped away:

mysql> SELECT CONCAT("'", char_col, "'") FROM char_test;
+----------------------------+
| CONCAT("'", char_col, "'") |
+----------------------------+
| 'string1'                  |
| ' string2'                 |
| 'string3'                  |
+----------------------------+

If we store the same values into a VARCHAR(10) column, we get the following result upon retrieval, where the trailing space on string3 has not been removed:

mysql> SELECT CONCAT("'", varchar_col, "'") FROM varchar_test;
+-------------------------------+
| CONCAT("'", varchar_col, "'") |
+-------------------------------+
| 'string1'                     |
| ' string2'                    |
| 'string3 '                    |
+-------------------------------+

The sibling types for CHAR and VARCHAR are BINARY and VARBINARY, which store binary strings. Binary strings are very similar to conventional strings, but they store bytes instead of characters. Padding is also different: MySQL pads BINARY values with \0 (the zero byte) instead of spaces and doesn’t strip the pad value on retrieval2.

These types are useful when you need to store binary data and want MySQL to compare the values as bytes instead of characters. The advantage of byte-wise comparisons is more than just a matter of case insensitivity. MySQL literally compares BINARY strings one byte at a time, according to the numeric value of each byte. As a result, binary comparisons can be much simpler than character comparisons, so they are faster.
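
Here is a small sketch of the padding behavior described above, using a hypothetical table; HEX() makes the zero-byte padding visible:

mysql> CREATE TABLE binary_test (b BINARY(5));
mysql> INSERT INTO binary_test VALUES ('abc');
mysql> SELECT HEX(b) FROM binary_test;
+------------+
| HEX(b)     |
+------------+
| 6162630000 |
+------------+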

Generosity Can Be Unwise

Storing the value ‘hello’ requires the same amount of space in a VARCHAR(5) and a VARCHAR(200) column. Is there any advantage to using the shorter column?

As it turns out, there is a big advantage. The larger column can use much more memory, because MySQL often allocates fixed-size chunks of memory to hold values internally. This is especially bad for sorting or operations that use in-memory temporary tables. The same thing happens with filesorts that use on-disk temporary tables.

The best strategy is to allocate only as much space as you really need.

BLOB and TEXT types

BLOB and TEXT are string data types designed to store large amounts of data as either binary or character strings, respectively.

In fact, they are each families of data types: the character types are TINYTEXT, SMALLTEXT, TEXT, MEDIUMTEXT, and LONGTEXT, and the binary types are TINYBLOB, SMALLBLOB, BLOB, MEDIUMBLOB, and LONGBLOB. BLOB is a synonym for SMALLBLOB, and TEXT is a synonym for SMALLTEXT.

Unlike with all other data types, MySQL handles each BLOB and TEXT value as an object with its own identity. Storage engines often store them specially; InnoDB may use a separate “external” storage area for them when they’re large. Each value requires from one to four bytes of storage space in the row and enough space in external storage to actually hold the value.

The only difference between the BLOB and TEXT families is that BLOB types store binary data with no collation or character set, but TEXT types have a character set and collation.

MySQL sorts BLOB and TEXT columns differently from other types: instead of sorting the full length of the string, it sorts only the first max_sort_length bytes of such columns. If you need to sort by only the first few characters, you can decrease the max_sort_length server variable.

MySQL can’t index the full length of these data types and can’t use the indexes for sorting.
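
If you do need an index on a BLOB or TEXT column, the usual workaround is a prefix index covering only the first few characters. Here is a sketch with a hypothetical table; prefix indexes come up again in the indexing chapter:

CREATE TABLE notes (
  note_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  body    TEXT,
  KEY idx_body_prefix (body(20))   -- index only the first 20 characters
) ENGINE=InnoDB;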

Images in a Database?

In the past, it was not uncommon for some applications to accept uploaded images and store them as BLOB data in a MySQL database. This method was convenient for keeping an application’s data together; however, as the size of the data grew, operations like schema changes got slower and slower due to the size of that BLOB data.

If you can avoid it, don’t store data like images in a database. Instead, write them to disk separately and use the table to track the location or filename on disk for the image.

Using ENUM instead of a string type

Sometimes you can use an ENUM column instead of conventional string types. An ENUM column can store a predefined set of distinct string values. MySQL stores them very compactly, packed into one or two bytes depending on the number of values in the list. It stores each value internally as an integer representing its position in the field definition list. Here’s an example:

mysql> CREATE TABLE enum_test(
-> e ENUM('fish', 'apple', 'dog') NOT NULL
-> );
mysql> INSERT INTO enum_test(e) VALUES('fish'), ('dog'), ('apple');

The three rows actually store integers, not strings. You can see the dual nature of the values by retrieving them in a numeric context:

mysql> SELECT e + 0 FROM enum_test;
+-------+
| e + 0 |
+-------+
| 1     |
| 3     |
| 2     |
+-------+

This duality can be terribly confusing if you specify numbers for your ENUM constants, as in ENUM('1', '2', '3'). We suggest you don’t do this.

Another surprise is that an ENUM field sorts by the internal integer values, not by the strings themselves:

mysql> SELECT e FROM enum_test ORDER BY e;
+-------+
| e     |
+-------+
| fish  |
| apple |
| dog   |
+-------+

You can work around this by specifying ENUM members in the order in which you want them to sort. You can also use FIELD() to specify a sort order explicitly in your queries, but this prevents MySQL from using the index for sorting:

mysql> SELECT e FROM enum_test ORDER BY FIELD(e, 'apple', 'dog', 'fish');
+-------+
| e     |
+-------+
| apple |
| dog   |
| fish  |
+-------+

If we’d defined the values in alphabetical order, we wouldn’t have needed to do that.

Because MySQL stores each value as an integer and has to do a lookup to convert it to its string representation, ENUM columns have some overhead. This is usually offset by their smaller size, but not always. In particular, it can be slower to join a CHAR or VARCHAR column to an ENUM column than to another CHAR or VARCHAR column.

To illustrate, we benchmarked how quickly MySQL performs such a join on a table in one of our applications. The table has a fairly wide primary key:

CREATE TABLE webservicecalls (
  day date NOT NULL,
  account smallint NOT NULL,
  service varchar(10) NOT NULL,
  method varchar(50) NOT NULL,
  calls int NOT NULL,
  items int NOT NULL,
  time float NOT NULL,
  cost decimal(9,5) NOT NULL,
  updated datetime,
  PRIMARY KEY (day, account, service, method)
) ENGINE=InnoDB;

The table contains about 110,000 rows and is only about 10 MB, so it fits entirely in memory. The service column contains 5 distinct values with an average length of 4 characters, and the method column contains 71 values with an average length of 20 characters.

We made a copy of this table and converted the service and method columns to ENUM, as follows:

CREATE TABLE webservicecalls_enum (
  ... omitted ...
  service ENUM(...values omitted...) NOT NULL,
  method ENUM(...values omitted...) NOT NULL,
  ... omitted ...
) ENGINE=InnoDB;

We then measured the performance of joining the tables by the primary key columns. Here is the query we used:

mysql> SELECT SQL_NO_CACHE COUNT(*)
  -> FROM webservicecalls AS w1
  -> JOIN webservicecalls AS w2 USING(day, account, service, method);

We varied this query to join the VARCHAR and ENUM columns in different combinations. Table 5-1 shows the results3.

Table 5-1. Speed of joining VARCHAR and ENUM columns
Test                       Queries per second
VARCHAR joined to VARCHAR  2.6
VARCHAR joined to ENUM     1.7
ENUM joined to VARCHAR     1.8
ENUM joined to ENUM        3.5

The join is faster after converting the columns to ENUM, but joining the ENUM columns to VARCHAR columns is slower. In this case, it looks like a good idea to convert these columns, as long as they don’t have to be joined to VARCHAR columns. It’s a common design practice to use “lookup tables” with integer primary keys to avoid using character-based values in joins.

However, there’s another benefit to converting the columns: according to the Data_length column from SHOW TABLE STATUS, converting these two columns to ENUM made the table about 1/3 smaller. In some cases, this might be beneficial even if the ENUM columns have to be joined to VARCHAR columns. Also, the primary key itself is only about half the size after the conversion. Because this is an InnoDB table, if there are any other indexes on this table, reducing the primary key size will make them much smaller, too.

Date and Time Types

MySQL has many types for various kinds of date and time values, such as YEAR and DATE. The finest granularity of time MySQL can store is microsecond. Most of the temporal types have no alternatives, so there is no question of which one is the best choice. The only question is what to do when you need to store both the date and the time. MySQL offers two very similar data types for this purpose: DATETIME and TIMESTAMP. For many applications, either will work, but in some cases, one works better than the other. Let’s take a look:

DATETIME
This type can hold a large range of values, from the year 1000 to the year 9999, with a precision of one microsecond. It stores the date and time packed into an integer in YYYYMMDDHHMMSS format, independent of time zone. This uses eight bytes of storage space.
By default, MySQL displays DATETIME values in a sortable, unambiguous format, such as 2008-01-16 22:37:08. This is the ANSI standard way to represent dates and times.
TIMESTAMP
As its name implies, the TIMESTAMP type stores the number of seconds elapsed since midnight, January 1, 1970, Greenwich Mean Time (GMT) - the same as a Unix timestamp. TIMESTAMP uses only four bytes of storage, so it has a much smaller range than DATETIME: from the year 1970 to January 19, 2038. MySQL provides the FROM_UNIXTIME() and UNIX_TIMESTAMP() functions to convert a Unix timestamp to a date, and vice versa.
The value a TIMESTAMP displays also depends on the time zone. The MySQL server, operating system, and client connections all have time zone settings.
Thus, a TIMESTAMP that stores the value 0 actually displays it as 1969-12-31 19:00:00 in Eastern Standard Time (EST), which has a five-hour offset from GMT. It’s worth emphasizing this difference: if you store or access data from multiple time zones, the behavior of TIMESTAMP and DATETIME will be very different. The former preserves values relative to the time zone in use, while the latter preserves the textual representation of the date and time.
TIMESTAMP also has special properties that DATETIME doesn’t have. By default, MySQL will set the first TIMESTAMP column to the current time when you insert a row without specifying a value for the column4. MySQL also updates the first TIMESTAMP column’s value by default when you update the row, unless you assign a value explicitly in the UPDATE statement. You can configure the insertion and update behaviors for any TIMESTAMP column. Finally, TIMESTAMP columns are NOT NULL by default, which is different from every other data type.
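
Here is a minimal sketch, using a hypothetical table, of spelling out those insertion and update behaviors explicitly rather than relying on the defaults:

CREATE TABLE events (
  event_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  note       VARCHAR(100) NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                       ON UPDATE CURRENT_TIMESTAMP
) ENGINE=InnoDB;

The same DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP clauses also work on DATETIME columns in MySQL 5.6 and later, which can be the simpler choice when TIMESTAMP’s limited range is a concern.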

Bit-Packed Data Types

MySQL has a few storage types that use individual bits within a value to store data compactly. All of these types are technically string types, regardless of the underlying storage format and manipulations:

BIT
You can use a BIT column to store one or many true/false values in a single column. BIT(1) defines a field that contains a single bit, BIT(2) stores 2 bits, and so on; the maximum length of a BIT column is 64 bits.
BIT behavior varies between storage engines. MyISAM packs the columns together for storage purposes, so 17 individual BIT columns require only 17 bits to store (assuming none of the columns permits NULL). InnoDB stores each column as the smallest integer type large enough to contain the bits, so you don’t save any storage space.

MySQL treats BIT as a string type, not a numeric type. When you retrieve a BIT(1) value, the result is a string but the contents are the binary value 0 or 1, not the ASCII value “0” or “1”. However, if you retrieve the value in a numeric context, the result is the number to which the bit string converts. Keep this in mind if you need to compare the result to another value. For example, if you store the value b’00111001’ (which is the binary equivalent of 57) into a BIT(8) column and retrieve it, you will get the string containing the character code 57. This happens to be the ASCII character code for “9”. But in a numeric context, you’ll get the value 57:

mysql> CREATE TABLE bittest(a bit(8));
mysql> INSERT INTO bittest VALUES(b'00111001');
mysql> SELECT a, a + 0 FROM bittest;
+------+-------+
| a    | a + 0 |
+------+-------+
| 9    | 57    |
+------+-------+

This can be very confusing, so we recommend that you use BIT with caution. For most applications, we think it is a better idea to avoid this type.

If you want to store a true/false value in a single bit of storage space, another option is to create a nullable CHAR(0) column. This column is capable of storing either the absence of a value (NULL) or a zero-length value (the empty string). This works in practice, but it can be confusing to others using the data in the database, and it makes queries harder to write. Unless you’re hyper-focused on saving space, we still recommend using TINYINT.
SET
If you need to store many true/false values, consider combining many columns into one with MySQL’s native SET data type, which MySQL represents internally as a packed set of bits. It uses storage efficiently, and MySQL has functions such as FIND_IN_SET() and FIELD() that make it easy to use in queries. The major drawback is the cost of changing the column’s definition: this requires an ALTER TABLE, which is very expensive on large tables (but see the workaround later in this chapter). In general, you also can’t use indexes for lookups on SET columns.
Bitwise operations on integer columns
An alternative to SET is to use an integer as a packed set of bits. For example, you can pack eight bits in a TINYINT and manipulate them with bitwise operators. You can make this easier by defining named constants for each bit in your application code.
The major advantage of this approach over SET is that you can change the “enumeration” the field represents without an ALTER TABLE. The drawback is that your queries are harder to write and understand (what does it mean when bit 5 is set?). Some people are comfortable with bitwise manipulations and some aren’t, so whether you’ll want to try this technique is largely a matter of taste.

An example application for packed bits is an access control list (ACL) that stores permissions. Each bit or SET element represents a value such as CAN_READ, CAN_WRITE, or CAN_DELETE. If you use a SET column, you’ll let MySQL store the bit-to-value mapping in the column definition; if you use an integer column, you’ll store the mapping in your application code. Here’s what the queries would look like with a SET column:

mysql> CREATE TABLE acl (
    -> perms SET('CAN_READ', 'CAN_WRITE', 'CAN_DELETE') NOT NULL
    -> );
mysql> INSERT INTO acl(perms) VALUES ('CAN_READ,CAN_DELETE');
mysql> SELECT perms FROM acl WHERE FIND_IN_SET('CAN_READ', perms);
+---------------------+
| perms               |
+---------------------+
| CAN_READ,CAN_DELETE |
+---------------------+

If you used an integer, you could write that example as follows:

mysql> SET @CAN_READ := 1 << 0,
    -> @CAN_WRITE := 1 << 1,
    -> @CAN_DELETE := 1 << 2;
mysql> CREATE TABLE acl (
    -> perms TINYINT UNSIGNED NOT NULL DEFAULT 0
    -> );
mysql> INSERT INTO acl(perms) VALUES(@CAN_READ + @CAN_DELETE);
mysql> SELECT perms FROM acl WHERE perms & @CAN_READ;
+-------+
| perms |
+-------+
| 5     |
+-------+

We’ve used variables to define the values, but you can use constants in your code instead.

JSON Data

It is becoming increasingly common to use JSON as a format for interchanging data between systems. MySQL has a native JSON data type that makes it easy to operate on parts of the JSON structure directly within the table. Purists may suggest that storing raw JSON in a database is an anti-pattern, because ideally the schema itself should represent the fields contained in the JSON. Newcomers may look at the JSON data type and see a shortcut that avoids creating and managing independent fields. Which method is better is largely subjective, but we’ll be objective by presenting a sample use case and comparing both query speed and data size.

Our sample data was a list of 202 near-earth asteroids and comets5, courtesy of NASA. Tests were performed under MySQL 8.0.22 on a 4-core, 16 GB RAM virtual machine. Here is an example of the data:

[{"designation":"419880 (2011 AH37)","discovery_date":"2011-01-07T00:00:00.000","h_mag":"19.7","moid_au":"0.035","q_au_1":"0.84","q_au_2":"4.26","period_yr":"4.06","i_deg":"9.65","pha":"Y","orbit_class":"Apollo"}

Each entry includes a designation, the date the object was discovered, and additional data collected about it, including both numeric and text fields.

First, we took the JSON data set and loaded it into a table with one row per entry. This resulted in a schema that looks relatively simple:

> DESC asteroids_json;
+-----------+------+------+-----+---------+-------+
| Field     | Type | Null | Key | Default | Extra |
+-----------+------+------+-----+---------+-------+
| json_data | json | YES  |     | NULL    |       |
+-----------+------+------+-----+---------+-------+

Second, we took this JSON and converted the fields to columns using a suitable data type for the data. This resulted in the following schema:

> desc asteroids_sql;
+----------------+-------------+------+-----+---------+-------+
| Field          | Type        | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| designation    | varchar(30) | YES  |     | NULL    |       |
| discovery_date | date        | YES  |     | NULL    |       |
| h_mag          | float       | YES  |     | NULL    |       |
| moid_au        | float       | YES  |     | NULL    |       |
| q_au_1         | float       | YES  |     | NULL    |       |
| q_au_2         | float       | YES  |     | NULL    |       |
| period_yr      | float       | YES  |     | NULL    |       |
| i_deg          | float       | YES  |     | NULL    |       |
| pha            | char(3)     | YES  |     | NULL    |       |
| orbit_class    | varchar(30) | YES  |     | NULL    |       |
+----------------+-------------+------+-----+---------+-------+

The first comparison is on data size:

> SHOW TABLE STATUS\G
*************************** 1. row ***************************
Name: asteroids_json
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 202
Avg_row_length: 405
Data_length: 81920
Max_data_length: 0
Index_length: 0

*************************** 2. row ***************************
Name: asteroids_sql
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 202
Avg_row_length: 243
Data_length: 49152
Max_data_length: 0
Index_length: 0

Our SQL version uses three 16 KB pages, and our JSON version uses five 16 KB pages. This doesn’t come as much of a surprise: a JSON data type uses more space to store the additional characters that define the JSON (braces, brackets, colons, etc.) as well as the whitespace. In this small example, the size of data storage can be improved by converting JSON to specific data types.

There may be valid use cases where data size is not that important, though. How does query latency compare between the two? First, let’s look at queries with no indexes on either table.

To select a single column for every row in the SQL version, our syntax is straightforward:

> select designation from asteroids_sql;

On our first run of this query, with the data not yet cached in InnoDB’s buffer pool, we got a result of 1.14 ms. On the second execution, with the data in memory, we got 0.44 ms.

For JSON, we can access a field inside the JSON structure:

> select json_data->'$.designation' from asteroids_json;

Similarly, our first execution, uncached, ran in 1.13 ms, and subsequent executions took around 0.80 ms. At these speeds we expect a reasonable amount of variation - we’re talking about differences of hundreds of microseconds in a virtual machine environment. In our opinion, both queries executed reasonably quickly, although it’s worth noting that the JSON query still takes about twice as long.

What about accessing specific rows, though? For the single row lookup, we take advantage of using indexes:

ALTER TABLE asteroids_sql add index ( designation );

When we do a single row lookup, our SQL version runs in 0.33ms and our JSON version runs in 0.58ms, giving an edge to the SQL version. This is easily explained - our index is allowing InnoDB to return 1 row instead of 202 rows.

Comparing an indexed query to a full table scan is unfair, though. To level the playing field, we need to use the generated columns feature to extract the designation and then create an index on that virtual generated column:

ALTER TABLE asteroids_json ADD COLUMN designation VARCHAR(30) GENERATED ALWAYS AS (json_data->"$.designation"), ADD INDEX ( designation );

This gives us a schema on our JSON table that looks like this:

> desc asteroids_json;
+-------------+-------------+------+-----+---------+-------------------+
| Field       | Type        | Null | Key | Default | Extra             |
+-------------+-------------+------+-----+---------+-------------------+
| json_data   | json        | YES  |     | NULL    |                   |
| designation | varchar(30) | YES  | MUL | NULL    | VIRTUAL GENERATED |
+-------------+-------------+------+-----+---------+-------------------+

Our schema now generates a virtual column for the designation from the json_data column and indexes it. Now, we re-run our single row lookup to use the indexed column instead of the JSON column path operator (->). Since the field data is quoted in the JSON, we need to search for it quoted in our SQL as well:

> select * from asteroids_json where designation='"(2010 GW62)"';

This query executed in 0.4ms, fairly close to our SQL version of 0.33ms.
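
As an aside that is not part of the measurements above, MySQL also offers the ->> operator, which unquotes the extracted value; defining the generated column with it lets you search for the bare string instead. The column name here is hypothetical:

ALTER TABLE asteroids_json
  ADD COLUMN designation_txt VARCHAR(30)
    GENERATED ALWAYS AS (json_data->>'$.designation'),
  ADD INDEX (designation_txt);

> select * from asteroids_json where designation_txt='(2010 GW62)';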

From our simple test case above, the amount of table space used seems to be the primary reason to use SQL columns rather than storing a raw JSON document, and speed is still better with SQL columns. Overall, the decision to use native SQL columns versus JSON comes down to whether the convenience of storing raw JSON in the database outweighs the performance cost. If you’re accessing this data millions or billions of times per day, the speed difference will add up.

Choosing Identifiers

In general, an identifier is the way you refer to a row and often what makes it unique. For example, if you have a table about users, you might want to assign each user a numerical ID or a unique username. This field may be some or all of your PRIMARY KEY.

Choosing a good data type for an identifier column is very important. You’re more likely to compare these columns to other values (for example, in joins) and to use them for lookups than other columns. You’re also likely to use them in other tables as foreign keys, so when you choose a data type for an identifier column, you’re probably choosing the type in related tables as well. (As we demonstrated earlier in this chapter, it’s a good idea to use the same data types in related tables, because you’re likely to use them for joins.)

When choosing a type for an identifier column, you need to consider not only the storage type, but also how MySQL performs computations and comparisons on that type. For example, MySQL stores ENUM and SET types internally as integers but converts them to strings when doing comparisons in a string context.

Once you choose a type, make sure you use the same type in all related tables. The types should match exactly, including properties such as UNSIGNED.6 Mixing different data types can cause performance problems, and even if it doesn’t, implicit type conversions during comparisons can create hard-to-find errors. These may even crop up much later, after you’ve forgotten that you’re comparing different data types.

Choose the smallest size that can hold your required range of values, and leave room for future growth if necessary. For example, if you have a state_id column that stores US state names, you don’t need thousands or millions of values, so don’t use an INT. A TINYINT should be sufficient and is three bytes smaller. If you use this value as a foreign key in other tables, three bytes can make a big difference. Here are a few tips:

Integer types
Integers are usually the best choice for identifiers, because they’re fast and they work with AUTO_INCREMENT.
AUTO_INCREMENT is a column attribute that generates a new integer value for each new row. For example, a billing system may need to generate a new invoice number for each customer. Using AUTO_INCREMENT means that the first invoice generated would be 1, the second 2, and so on. Be aware that you should choose an integer size that accommodates the growth you expect: there has been more than one story of system downtime caused by unexpectedly running out of integer values here.
ENUM and SET
The ENUM and SET types are generally a poor choice for identifiers, though they can be okay for static “definition tables” that contain status or “type” values. ENUM and SET columns are appropriate for holding information such as an order’s status or a product’s type.
As an example, if you use an ENUM field to define a product’s type, you might want a lookup table primary keyed on an identical ENUM field. (You could add columns to the lookup table for descriptive text, to generate a glossary, or to provide meaningful labels in a pull-down menu on a website.) In this case, you’ll want to use the ENUM as an identifier, but for most purposes you should avoid doing so.
String types
Avoid string types for identifiers if possible, because they take up a lot of space and are generally slower than integer types.
You should also be very careful with completely “random” strings, such as those produced by MD5(), SHA1(), or UUID(). Each new value you generate with them will be distributed in arbitrary ways over a large space, which can slow INSERT and some types of SELECT queries.7
  • They slow INSERT queries because the inserted value has to go in a random location in indexes. This causes page splits, random disk accesses, and clustered index fragmentation for clustered storage engines.
  • They slow SELECT queries because logically adjacent rows will be widely dispersed on disk and in memory.
  • Random values cause caches to perform poorly for all types of queries because they defeat locality of reference, which is how caching works. If the entire dataset is equally “hot,” there is no advantage to having any particular part of the data cached in memory, and if the working set does not fit in memory, the cache will have a lot of flushes and misses.

If you do store UUID values, you should remove the dashes or, even better, convert the UUID values to 16-byte numbers with UNHEX() and store them in a BINARY(16) column. You can retrieve the values in hexadecimal format with the HEX() function.
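
Here is a minimal sketch of that approach with a hypothetical table; note that MySQL 8.0 also provides the UUID_TO_BIN() and BIN_TO_UUID() functions, which perform the same conversion:

CREATE TABLE users (
  user_id BINARY(16) NOT NULL PRIMARY KEY,
  name    VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

-- Strip the dashes and pack the 32 hex digits into 16 bytes
INSERT INTO users(user_id, name)
VALUES (UNHEX(REPLACE(UUID(), '-', '')), 'alice');

SELECT HEX(user_id), name FROM users;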

Special Types of Data

Some kinds of data don’t correspond directly to the available built-in types. A good example is an IPv4 address. People often use VARCHAR(15) columns to store IP addresses. However, they are really unsigned 32-bit integers, not strings. The dotted-quad notation is just a way of writing it out so that humans can read it more easily. You should store IP addresses as unsigned integers. MySQL provides the INET_ATON() and INET_NTOA() functions to convert between the two representations. The space used shrinks from ~16 bytes for a VARCHAR(15), down to 4 bytes for an unsigned 32-bit integer.
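
Here is a short sketch, with a hypothetical table, of storing and reading back IPv4 addresses this way:

CREATE TABLE access_log (
  ip INT UNSIGNED NOT NULL
) ENGINE=InnoDB;

INSERT INTO access_log(ip) VALUES (INET_ATON('192.168.1.1'));  -- stored as 3232235777

SELECT INET_NTOA(ip) FROM access_log;                          -- returns '192.168.1.1'

For IPv6 addresses, the analogous INET6_ATON() and INET6_NTOA() functions convert to and from a VARBINARY(16) value.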

Schema Design Gotchas in MySQL

Although there are universally bad and good design principles, there are also issues that arise from how MySQL is implemented, and that means you can make MySQL-specific mistakes, too. This section discusses problems that we’ve observed in schema designs with MySQL. It might help you avoid those mistakes and choose alternatives that work better with MySQL’s specific implementation.

Too many columns
MySQL’s storage engine API works by copying rows between the server and the storage engine in a row buffer format; the server then decodes the buffer into columns. It can be costly to turn the row buffer into the row data structure with the decoded columns. InnoDB’s row format always requires conversion. The cost of this conversion depends on the number of columns. We discovered that this can become expensive when we investigated an issue with high CPU consumption for a customer with extremely wide tables (hundreds of columns), even though only a few columns were actually used. If you’re planning for hundreds of columns, be aware that the server’s performance characteristics will be a bit different.
Too many joins
The so-called entity-attribute-value (EAV) design pattern is a classic case of a universally bad design pattern that especially doesn’t work well in MySQL. MySQL has a limitation of 61 tables per join, and EAV databases require many self-joins. We’ve seen more than a few EAV databases eventually exceed this limit. Even at many fewer joins than 61, however, the cost of planning and optimizing the query can become problematic for MySQL. As a rough rule of thumb, it’s better to have a dozen or fewer tables per query if you need queries to execute very fast with high concurrency.
The all-powerful ENUM
Beware of overusing ENUM. Here’s an example we saw:
   CREATE TABLE ... (
      country enum('','0','1','2',...,'31')
The schema was sprinkled liberally with this pattern. This would probably be a questionable design decision in any database with an enumerated value type, because it really should be an integer that is foreign-keyed to a “dictionary” or “lookup” table anyway.
The ENUM in disguise
An ENUM permits the column to hold one value from a set of defined values. A SET permits the column to hold one or more values from a set of defined values. Sometimes these can be easy to confuse. Here’s an example:
   CREATE TABLE ... (
      is_default set('Y','N') NOT NULL default 'N'
That almost surely ought to be an ENUM instead of a SET, assuming that it can’t be both true and false at the same time.
NULL not invented here
We wrote earlier about the benefits of avoiding NULL, and indeed we suggest considering alternatives when possible. Even when you do need to store a “no value” fact in a table, you might not need to use NULL. Perhaps you can use zero, a special value, or an empty string instead.
However, you can take this to extremes. Don’t be too afraid of using NULL when you need to represent an unknown value. In some cases, it’s better to use NULL than a magical constant. Selecting one value from the domain of a constrained type, such as using −1 to represent an unknown integer, can complicate your code a lot, introduce bugs, and just generally make a total mess out of things. Handling NULL isn’t always easy, but it’s often better than the alternative.
Here’s one example we’ve seen pretty frequently:
   CREATE TABLE ... (
      dt DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00'
That bogus all-zeros value can cause lots of problems. (You can configure MySQL’s SQL_MODE to disallow nonsense dates, as sketched after this list; that’s an especially good practice for a new application that hasn’t yet created a database full of bad data.)
On a related topic, MySQL does index NULLs, unlike Oracle, which doesn’t include non-values in indexes.
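
Here is the SQL_MODE sketch referenced in the NULL discussion above; the mode names assume MySQL 5.7 or 8.0:

-- Reject '0000-00-00' and dates with zero parts; strict mode turns the
-- resulting warnings into errors for transactional tables.
SET SESSION sql_mode = 'STRICT_TRANS_TABLES,NO_ZERO_DATE,NO_ZERO_IN_DATE';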

Summary

Good schema design is pretty universal, but of course MySQL has special implementation details to consider. In a nutshell, it’s a good idea to keep things as small and simple as you can. MySQL likes simplicity, and so will the people who have to work with your database:

  • Try to avoid extremes in your design, such as a schema that will force enormously complex queries, or tables with oodles and oodles of columns. (An oodle is somewhere between a scad and a gazillion.)
  • Use small, simple, appropriate data types, and avoid NULL unless it’s actually the right way to model your data’s reality.
  • Try to use the same data types to store similar or related values, especially if they’ll be used in a join condition.
  • Watch out for variable-length strings, which might cause pessimistic full-length memory allocation for temporary tables and sorting.
  • Try to use integers for identifiers if you can.
  • Avoid the legacy MySQL-isms such as specifying precisions for floating-point numbers or display widths for integers.
  • Be careful with ENUM and SET. They’re handy, but they can be abused, and they’re tricky sometimes. BIT is best avoided.

Database design is a science. If you’re strongly interested in database design, consider consulting dedicated source material8. Don’t let this hold you back, however: modern versions of MySQL allow you to make schema changes fairly easily.

1 Remember that the length is specified in characters, not bytes. A multibyte character set can require more than one byte to store each character.

2 Be careful with the BINARY type if the value must remain unchanged after retrieval. MySQL will pad it to the required length with \0s.

3 Times are for relative comparison, as the speed of CPUs, memory and other hardware changes over time.

4 The rules for TIMESTAMP behavior are complex and have changed in various MySQL versions, so you should verify that you are getting the behavior you want. It’s usually a good idea to examine the output of SHOW CREATE TABLE after making changes to TIMESTAMP columns.

5 https://data.nasa.gov/resource/2vr3-k9wn.json

6 If you’re using the InnoDB storage engine, you may not be able to create foreign keys unless the data types match exactly. The resulting error message, “ERROR 1005 (HY000): Can’t create table,” can be confusing depending on the context, and questions about it come up often on MySQL mailing lists. (Oddly, you can create foreign keys between VARCHAR columns of different lengths.)

7 On the other hand, for some very large tables with many writers, such pseudo random values can actually help eliminate “hot spots.”

8 For another in-depth read, consider Database Design for Mere Mortals by Michael J. Hernandez.
