© Jesper Wisborg Krogh 2020
J. W. KroghMySQL 8 Query Performance Tuninghttps://doi.org/10.1007/978-1-4842-5584-1_13

13. Data Types

Jesper Wisborg Krogh1 
(1)
Hornsby, NSW, Australia
 

When you create a table in MySQL (and other relational databases), you specify the data type for each column. Why not just store everything as strings? After all, when the number 42 is used in this book, it is represented as a string, so why not just use strings for everything and allow all kinds of values for every column? There is some merit to that idea. This is partly how NoSQL databases work (though there is more to it than that), and the author of this book has seen tables with all columns defined as varchar(255) strings. Why bother with integers, decimals, floats, dates, strings, and so on? There are several reasons for this, and that is the topic of this chapter.

First, the benefits of using different data types for different types of values will be discussed. Then there will be an overview of the data types supported in MySQL. Finally, it will be discussed how data types affect query performance and how to choose the data type for a column.

Why Data Types?

The data type of a column defines what type of values can be stored and how the values are stored. Additionally, there may be meta properties associated with the data type, such as the size (e.g., number of bytes used for numbers and maximum number of characters in strings) and for strings the character set and collation. While the data type properties may seem like an unnecessary restriction, they also have benefits. These benefits include
  • Data validation

  • Documentation

  • Optimized storage

  • Performance

  • Correct sorting

The rest of this section will discuss these benefits.

Data Validation

At their core, data types define what kind of values are allowed. A column defined as an integer data type can only store integer values. This is also a safeguard. If you make a mistake and try to store a value into a column with a different data type than what was defined, it is possible to reject it or convert the value.

Tip

Whether assigning a value of a wrong data type to a column results in an error or the data type being converted depends on whether you have the STRICT_TRANS_TABLES (for transactional storage engines) and STRICT_ALL_TABLES (for all storage engines) SQL modes enabled and whether it is considered safe to convert the data type. Some conversions that are considered safe are always allowed, for example, converting '42' to 42 and vice versa. It is recommended to always enable strict mode which makes DML queries fail when an unsafe conversion or a truncation of data is attempted.

When you can be sure the data stored in your tables always have the expected data types, it makes life easier for you. If you query a column with an integer, you know it is safe to do arithmetic operations on the returned value. Likewise, if you know the value is a string, you can safely perform string operations. It requires a little more planning up front, but once it is done, you will learn to appreciate that you know the data type of your data.

There is one more consideration about the data type and data validation. In general, there are properties associated with the data type. In the simplest case, you have the maximum size. An integer can, for example, be 1, 2, 3, 4, or 8 bytes in size. This affects the range of values that can be stored. Additionally, integers can be signed or unsigned. A more complex example is strings which not only have a limit on how much text they store but also need a character set to define how the data is encoded and a collation to define how the data sorts.

Listing 13-1 shows an example of how MySQL validates the data against the data type.
mysql> SELECT @@sql_modeG
*************************** 1. row ***************************
@@sql_mode: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
1 row in set (0.0003 sec)
mysql> SHOW CREATE TABLE t1G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `val1` int(10) unsigned DEFAULT NULL,
  `val2` varchar(5) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.0011 sec)
mysql> INSERT INTO t1 (val1) VALUES ('abc');
ERROR: 1366: Incorrect integer value: 'abc' for column 'val1' at row 1
mysql> INSERT INTO t1 (val1) VALUES (-5);
ERROR: 1264: Out of range value for column 'val1' at row 1
mysql> INSERT INTO t1 (val2) VALUES ('abcdef');
ERROR: 1406: Data too long for column 'val2' at row 1
mysql> INSERT INTO t1 (val1, val2) VALUES ('42', 42);
Query OK, 1 row affected (0.0825 sec)
Listing 13-1

Data validation based on data type

The SQL mode is set to the default which includes STRICT_TRANS_TABLES. The table has two columns in addition to the primary key, of which one column is an unsigned integer and the other is a varchar(5) meaning it can store up to five characters. When it is attempted to insert a string or a negative integer into the val1 column, the value is rejected as it cannot be converted safely to an unsigned integer. Similarly, attempting to store a string with six characters into the val2 column fails. However, storing the string '42' into val1 and the integer 42 into val2 is considered safe and is thus allowed.

A side effect of the data validation is that you also describe what data you expect – this is an implicit documentation of the column.

Documentation

When you design a table, you know what the expected usage of the table is. This is however not necessarily clear when you or someone else uses the table later. There are several ways to document the columns: use a column name that describes the values, the COMMENT column clause, CHECK constraints, and the data type.

While not the most detailed way to document a column – and it certainly should not stand on its own – the data type does help to describe what kind of data you expect. If you choose the date column instead of datetime, you are clearly only intending for the date part to be stored. Similarly, using tinyint instead of int shows you only expect relatively small values. This all helps for yourself or others to understand what kind of data can be expected. The better understanding of the data, the better change you have to succeed when you need to optimize queries, so in that way it can indirectly help in query optimization.

Tip

The best way to provide documentation within the table is to use the COMMENT clause and CHECK constraints. These are however often not visible in table diagrams which is where the data type helps to give a better mental picture of the kind of data that is expected.

On the topic of performance, there are benefits of explicitly choosing the data type as well. One of them is related to how the values are stored.

Optimized Storage

MySQL does not store all data in the same way. The storage format for a given data type is chosen to be as compact as possible to reduce the storage needed. As an example, consider the value 123456. If this is stored as a string, it will require at least 6 bytes plus possible 1 byte to store the length of the string. If you instead choose an integer, you only need 3 bytes (for integers, all values are always using the same number of bytes depending on the maximum storage allowed for the column). Additionally, reading an integer from storage does not require any interpretation of the value,1 whereas for a string it is necessary to decode the value using its character set.

Choosing the correct maximum size of a column can reduce the amount of storage needed. If you need to store integers and know that you never need values that need more than 4 bytes of storage, you can use the int data type instead of bigint which uses 8 bytes of storage. This is half the amount of storage needed for the column. If you work with big data, the storage (and memory) savings may become large enough to be significant. However, be careful not to overoptimize. Changing the data type or size of a column in many cases requires rebuilding the whole table which can be an expensive operation, if the table is large. In that way, it can be better to use a little more storage now to save work later.

Tip

As with other types of optimization, be careful not to overoptimize the data types. A relatively small saving in storage now can cause pain later.

How the data is stored also affects the performance.

Performance

Not all data types are created equal. Integers are very cheap to use in computations and comparisons, whereas strings where the bytes stored must be decoded using a character set are relatively expensive. By choosing the correct data type, you can significantly improve the performance of your queries. Particularly, if you need to compare values in two columns (possibly in different tables), make sure they have the same data type including character set and collation for strings. Otherwise, the data in one of the columns will have to be converted before it can be compared with the other column.

While it is simple to understand why an integer performs better than a string, exactly what makes one data type perform better or worse than another is relatively complex and depends on how the data type is implemented (stored on disk). Thus, further discussion of performance will be deferred until after the walkthrough of the MySQL data types in the next section.

The last benefit that will be discussed is sorting.

Correct Sorting

The date type has a major impact on how values are sorted. While the human brain usually can make sense of the data intuitively, a computer needs some help understanding how two values compare to each other. The data type and for strings the collation are the key properties that are used to ensure your data is sorted correctly.

Why is sorting important? There are a couple of reasons for this:
  • Correct sorting requires knowledge of whether two values are equal or whether a value is in a given range. This is essential to have WHERE clauses and join conditions work as expected.

  • When you create indexes, the sorting is used to ensure that MySQL quickly can find the row(s) with the value you are looking for.2 Indexes are covered in detail in the next chapter.

Consider the values 8 and 10. How do they sort? If you consider them to be integers, 8 comes before 10. However, if you consider them as strings, then ‘10’ (ASCII: 0x3130) comes before ‘8’ (ASCII: 0x38). Whether you expect one or the other depends on your application, but unless there are also values with non-numeric parts, you likely expect the integer behavior which requires the data type to be of an integer type.

Now that it has been discussed what the benefits of explicit data types are, it is time to go through the data types that MySQL supports.

MySQL Data Types

There are more than 30 different data types in MySQL. Several of these can be fine-tuned with respect to the size, precision, and whether they accept signed values. It can at first seem overwhelming, but if you group the data types into categories, you can do a stepwise approach to select the correct data type for your data.

The data types in MySQL can be considered as part of one of the following categories:
  • Numeric: This includes integers, fixed precision decimal types, approximate precision floating point types, and bit types.

  • Temporal: This includes years, dates, times, datetime, and timestamp values.

  • Strings: This includes both binary objects and strings with a character set.

  • JSON: The JSON data type can store JSON documents.

  • Spatial: These types are used to store values that describe one or more points in a coordinate system.

  • Hybrid: MySQL has two data types that both can be used as integers and as strings.

Tip

The MySQL reference manual has a comprehensive discussion about data types in https://dev.mysql.com/doc/refman/8.0/en/data-types.html and references therein.

The rest of this section will go through the data types and discuss their specifics.

Numeric Data Types

Numeric data types are the simplest of the data types supported by MySQL. You can choose between integers, fixed precision decimal values, and approximate floating point values.

Table 13-1 summarizes the numeric data types including their storage requirements in bytes and the supported range of values. For integers, you can choose whether the values are signed or unsigned which affects the range of supported values. For the supported values, both the start and end values are included in the range of allowed values.
Table 13-1

The numeric data types (integers, fixed point, and floating point)

Data Type

Bytes Stored

Range

tinyint

1

Signed: -128–127

Unsigned: 0–255

smallint

2

Signed: -32768–32767

Unsigned: 0–65535

mediumint

3

Signed: -8388608–8388607

Unsigned: 0–16777215

int

4

Signed: -2147483648–2147483647

Unsigned: 0–4294967295

bigint

8

Signed: -263–263-1

Unsigned: 0–264-1

decimal(M, N)

1–29

Depends on M and N

float

4

Variable

double

8

Variable

bit(M)

1–8

 

The integer data types are the simplest with a fixed storage requirement and fixed ranges of supported values. A synonym for tinyint is bool (for a Boolean value).

The decimal data type (numeric is a synonym) takes two arguments, M and N, which define the precision and scale of the values. If you have decimal(5,2), the values will have at most five digits, of which two are decimals (to the right of the decimal point). That means that values between -999.99 and 999.99 are allowed. Up to 65 digits are supported. The amount of storage for decimals depends on the number of digits with each multiple of nine digits using 4 bytes and the remaining digits using 0–4 bytes.

The float and double data types store approximate values. These types are efficient for numeric calculations, but at the cost that there is an uncertainty in their values. They use 4 and 8 bytes, respectively, for storage.

Tip

Never use floating point data types to store exact data, such as monetary amounts. Use the exact precision decimal data type instead. For approximate floating point data types, you should never use the equal (=) and not equal (<>) operators as comparing two approximate values will in general not return that they are equal even if they are meant to be.

The final numeric data type is the bit type. It can store between 1 and 64 bits in one value. This can, for example, be used for bit masks. The storage required depends on the number of bits required (the M value); it can be approximated as FLOOR((M+7)/8) bytes.

A category of data types related to the numeric types are temporal data types, which are the next category that will be covered.

Temporal Data Types

Temporal data defines a point in time. The precision can range from a year to a microsecond. Except for the year data type, values are entered as strings, but internally an optimized format is used, and the values will sort correctly according to the point in time the values represent.

Table 13-2 shows the temporal data types supported by MySQL, the amount of storage in bytes each type uses, and the range of values supported.
Table 13-2

The temporal data types

Data Type

Bytes Stored

Range

year

1

1901–2155

date

3–6

'1000-01-01' to '9999-12-31'

datetime

5–8

'1000-01-01 00:00:00.000000' to '9999-12-31 23:59:59.999999'

timestamp

4–7

'1970-01-01 00:00:01.000000' to '2038-01-19 03:14:07.999999'

time

3–6

'-838:59:59.000000' to '838:59:59.000000'

The datetime, timestamp, and time types all support fractional seconds up to microsecond resolution. The storage requirement for the fractional seconds is 0–3 bytes depending on the number of digits (1 byte per two digits).

The datetime and timestamp columns differ in a subtle way. When you store a value in a datetime column, MySQL stores it as you specify it. For a timestamp column on the other hand, the value is converted to UTC using the time zone MySQL has been configured to use – the @@session.time_zone variable (by default the system time zone). In the same way, when you retrieve the data, datetime values are returned as you originally specified them, whereas timestamp columns are converted to the time zone set in the @@session.time_zone variable.

Tip

When using datetime columns, store the data in the UTC time zone and convert to the time zone required when using the data. By always storing the value in UTC, there is less chance of problems if the operating system time zone or MySQL Server time zone is changed, or you share data with users from different time zones.

While you enter and retrieve dates and times using a string, they are stored internally in a dedicated format. What about actual strings? Let’s take a look at the string and binary data types.

String and Binary Data Types

Strings and binary data types are very flexible types for storing arbitrary data. The difference between a binary value and a string is that the string has a character set associated with it, so MySQL knows how to interpret the data. Binary values on the other hand store raw data which means you can use them for any kind of data including images and custom data formats.

While strings and binary data are very flexible, they come with a cost. For strings, MySQL needs to interpret the bytes to determine which characters they represent. This is relatively expensive in terms of the computational power required. Some character sets, including UTF-8 which is the default character set in MySQL 8, are variable width, that is, a character uses a variable number of bytes; for UTF-8 it ranges from 1 to 4 bytes per character. This means that if you request the first four characters of a string, it can require reading between 4 and 16 bytes depending on which characters it is, so MySQL will need to analyze the bytes to determine when four characters have been found. For binary strings, the interpretation of the meaning of the data is put back on the application.

Table 13-3 shows the data types in MySQL representing strings and binary data. The table includes the maximum amount of data that can be stored as well as a description of the storage requirements. For the data types, (M) is the maximum number of characters that the column must be able to store, and in bytes stored L is the number of bytes required to represent the string value in character set used for the encoding.
Table 13-3

The string and binary data types

Data Type

Bytes stored

Max Length

char(M)

M*char width

255 chars

varchar(M)

L+1 or L+2

16383 chars for utf8mb4 and 65532 for latin1

tinytext

L+1

255 bytes

text

L+2

65535 bytes

mediumtext

L+3

16777216 bytes

longtext

L+4

4294967296 bytes

binary(M)

M

255 bytes

varbinary(M)

L+1 or L+2

65532 bytes

tinyblob

L+1

255 bytes

blob

L+2

65536 bytes

mediumblob

L+3

16777216 bytes

longblob

L+4

4294967296 bytes

The storage requirements for strings and binary objects depend on the length of the data. L is the number of bytes required to store the value; for text strings, the character set must be taken into account as well. For variable width types, 1–4 bytes are used to store the length of the value. For char(M) columns, the required storage may be less than M times the character width when the compact family of InnoDB storage formats is used and the string is encoded with a variable width character set.

For all but char and varchar, the maximum supported length of the strings is specified in bytes. This means that the number of characters that can be stored in string types depends on the character set. Additionally, char, varchar, binary, and varbinary columns count toward the row width which in total must be less than 64 kiB, which in practice means that it rarely is possible to create columns using the theoretical maximum length. (This is also the reason that varchar and varbinary columns at most can store 65532 characters/bytes.) For longtext and longblob columns, it should be noted that while they in principle can store up to 4 GiB of data, in practice the storage is limited by the max_allowed_packet variable which at most can be 1 GiB.

One additional consideration for the data types that store strings is that you must choose a character set and collation for the column. If you do not choose one explicitly, then the default for the table will be used. In MySQL 8, the default character set is utf8mb4 using the utf8mb4_0900_ai_ci collation. What does utf8mb4 and utf8mb4_0900_ai_ci mean?

The utf8mb4 character set is UTF-8 supporting up to 4 bytes per character (required, e.g., for some emojis). Originally, MySQL only supported up to 3 bytes per character for UTF-8, and later utf8mb4 was added to extend the support. Today, you should not use utf8mb3 (at most 3 bytes per character) or its utf8 alias (deprecated, so it later can be changed to mean utf8mb4). When you use UTF-8, always choose the 4-byte variant as there is little benefit from the 3-byte variant and it has been deprecated. In MySQL 5.7 and earlier, Latin 1 was the default character set, but with the improvements for UTF-8 in MySQL 8, it is recommended to use utf8mb4 unless you have a specific reason to choose another character set.

The utf8mb4_0900_ai_ci collation is a general-purpose collation for utf8mb4. A collation defines the sorting and comparison rules, so when you compare two strings they compare correctly. The rules can be quite complex and include that some character sequences compare as equal to other single characters (e.g., the German sharp ß is the same as “ss” in some collations). The collation name consists of several parts which are
  • utf8mb4: The character set the collation belongs to.

  • 0900: This means the collation is one of the Unicode Collation Algorithm (UCA) 9.0.0–based collations. These were introduced in MySQL 8 and provide a significant performance improvement compared to older UTF-8 collations.

  • ai: A collation can be accent insensitive (ai) or accent sensitive (as). When a collation is accent insensitive, an accented character like à is considered equal to the non-accented character a. In this case, it is accent insensitive.

  • ci: A collation can be case insensitive (ci) or case sensitive (cs). In this case, it is case insensitive.

The name can include other parts as well, and other character sets have other collations. Particularly, there are several country-specific character sets to take local sorting and comparison rules into account; for those, the country code is added to the name. It is recommended to use one of the UCA 9.0.0 collations as these have better performance and are more modern than the other collations. The information_schema.COLLATIONS view includes all collations supported by MySQL with support for filtering by the character set. As of 8.0.18, there are 75 collations available for utf8mb4, of which 49 are UCA 9.0.0 collations.

Tip

Character sets and collations are a large and interesting topic on their own. If you would like to dive further into that topic, a starting point is the following blog by the author of this book and the references therein: https://mysql.wisborg.dk/mysql-8_charset.

A special kind of strings is JSON documents. MySQL has a dedicated data type for them.

JSON Data Type

A popular format for storing data with more flexibility than relational tables is the JavaScript Object Notation (JSON) format. This is also the format that has been chosen for the MySQL Document Store that is available in MySQL 8. Support for the json data type was introduced in MySQL 5.7.

A JSON document is a combination of JSON objects (keys and values), JSON arrays, and JSON values. A simple example of a JSON document can be seen in the following:
{
    "name": "Sydney",
    "demographics": {
        "population": 5500000
    },
    "geography": {
        "country": "Australia",
        "state": "NSW"
    },
    "suburbs": [
        "The Rocks",
        "Surry Hills",
        "Paramatta"
    ]
}

As a JSON document is a string (or binary object) as well, it can also be stored in a string or binary object column. However, by having a dedicated data type, it is possible to add validation, and the storage is optimized for accessing specific elements in the document.

One great performance-related feature of JSON documents in MySQL 8 is that there is support for partial updates. That makes the change in-place which reduces not only the amount of work done during the update, but it is also possible to write just the partial change to the binary log. There are some requirements for a partial in-place update to be possible. These are as follows:
  • Only the JSON_SET(), JSON_REPLACE(), and JSON_REMOVE() functions are supported.

  • Only updates within a column are supported. That is, setting a column to the return value of one of the three JSON functions working on another column is not supported.

  • It must be an existing value that is replaced. Adding new object or array elements causes the whole document to be rewritten.

  • The new value must at most be the same size as the value that is replaced. The exceptions are cases where space freed by a previous partial update can be reused.

In order to log the partial updates to the binary log as partial updates, you need to set the binlog_row_value_options option to PARTIAL_JSON. The option can be set dynamically both at the session and global levels.

Internally, the document is stored as a long binary object (longblob) with the text interpreted using the utf8mb4 character set. The maximum storage is limited to 1 GiB. The storage requirements are similar to those for longblob, but it is necessary to take the overhead of metadata and the dictionaries used for lookups into account.

Thus far, numbers, temporal data, strings, binary objects, and JSON documents have been covered. What about data specifying a point in space? This is the next category of data types to cover.

Spatial Data Types

Spatial data specifies one or more points in a coordinate system, possibly forming an object such as a polygon. This is useful, for example, to specify the location of an item on a map.

MySQL 8 added support for specifying which reference system is used; this is called the Spatial Reference System Identifier (SRID). The reference systems supported can be found in the information_schema.ST_SPATIAL_REFERENCE_SYSTEMS view (the SRS_ID column has the value to use for the SRID); there are more than 5000 to choose from. Each spatial value has a reference system associated with it in order to make it possible for MySQL to correctly identify the relation between two values, for example, to calculate the distance between two points. To use Earth as the reference system, set the SRID to 4326.

There is support for eight different spatial data types, of which four are single-value types and four are collections of values. Table 13-4 summarizes the spatial types with the required storage listed in bytes.
Table 13-4

The spatial data types

Data Type

Bytes Stored

Description

geometry

Variable

A single spatial object of any type.

point

25

A single point, for example, the location of a person.

linestring

9+16*#points

A set of points that form a line, that is, it is not a closed object.

polygon

13+16*#points

A set of points that encloses an area. One polygon can include several such sets, for example, to create an inner and outer ring of a donut-shaped object.

multipoint

13+21*#points

A collection of points.

multilinestring

Variable

A collection of linestring values.

multipolygon

Variable

A collection of polygons.

geometrycollection

Variable

A collection of geometry values.

MySQL uses a binary format to store the data. The storage requirements for the geometry, multilinestring, multipolygon, and geometrycollection types depend on the size of the objects contained in the value. The storage for these collections of objects is a little larger than storing the objects in individual columns. You can use the LENGTH() function to get the size of the spatial object and then add 4 bytes to store the SRID to get the total storage required for the data.

That leaves one category of data types to be discussed: hybrids between numeric and string data types.

Hybrid Data Types

There are two special data types that combine the properties of integers and strings: enum and set. Both can be considered a collection of possible values with the difference that the enum data type allows you to choose exactly one of the possible values, whereas the set data type allows you to choose any of the possible values.

What makes the enum and set data types hybrid is that you can both use them as integers and as strings. The latter is the most common and the most user-friendly. Internally, the values are stored as integers which gives compact and efficient storage while still allowing to use strings when setting or querying the columns. Both data types can as an alternative be implemented using lookup tables.

The enum data type is the most commonly used of the two. When you create the column, you specify a list of allowed values, for example:
CREATE TABLE t1 (
   id int unsigned NOT NULL PRIMARY KEY,
   val enum('Sydney', 'Melbourne', 'Brisbane')
);

The numeric value is the position in the list starting with 1. That is, Sydney has the integer value 1, Melbourne 2, and Brisbane 3. The total storage requirement is just 1 or 2 bytes depending on the number of members in the list, and up to 65535 members are supported.

The set data type works similarly to enum except you can select more than one of the options. To create it, list the members you want to be available, for example:
CREATE TABLE t1 (
   id int unsigned NOT NULL PRIMARY KEY,
   val set('Sydney', 'Melbourne', 'Brisbane')
);
Each member in the list gets a numeric value in the series 1, 2, 4, 8, and so on based on the member’s position in the list. In the example, Sydney has the value 1, Melbourne 2, and Brisbane 4. What does the value 3 then represent? It is Sydney and Melbourne. If you want to include multiple values, you sum their individual values. In this way, the set data type works the same as the bit type. It is simpler when you specify the value as a string, as you include the members for the value in a comma-separated list. Listing 13-2 shows two examples of inserting set values with each example inserting the same value twice using both the numeric and string values.
mysql> INSERT INTO t1
       VALUES (1, 4),
              (2, 'Brisbane');
Query OK, 2 rows affected (0.0812 sec)
Records: 2  Duplicates: 0  Warnings: 0
mysql> INSERT INTO t1
       VALUES (3, 7),
              (4, 'Sydney,Melbourne,Brisbane');
Query OK, 2 rows affected (0.0919 sec)
Records: 2  Duplicates: 0  Warnings: 0
mysql> SELECT *
         FROM t1G
*************************** 1. row ***************************
 id: 1
val: Brisbane
*************************** 2. row ***************************
 id: 2
val: Brisbane
*************************** 3. row ***************************
 id: 3
val: Brisbane,Melbourne,Sydney
*************************** 4. row ***************************
 id: 4
val: Brisbane,Melbourne,Sydney
4 rows in set (0.0006 sec)
Listing 13-2

Working with set values

First, the value of 'Brisbane' is inserted. Since it is the third element in the set, it has a numeric value of 4. Then the sets Sydney, Melbourne, and Brisbane are inserted. Here you need to sum 1, 2, and 4. Notice in the SELECT query that the order of the elements is not the same as in the set definition.

A set column uses 1, 2, 3, 4, or 8 bytes of storage depending on the number of members in the set. It is possible to have up to 64 members in a set.

This concludes the discussion of the available data types. How does the data type influence the performance of your queries? Potentially quite a lot, so that is worth some consideration.

Performance

The choice of data type is not only important with respect to the data integrity and to tell what kind of data is expected, but also different data types have different performance characteristics. This section will discuss how the performance varies when comparing data types.

In general, the simpler the data type, the better it performs. Integers have the best performance, and floating point (approximate values) follows closely. Decimal (exact) values have a higher overhead than the approximate floating point values. Binary objects perform better than text strings as binary objects do not have the overhead of character sets.

When it comes to a data type like JSON, you may think it performs worse than using a binary object as JSON documents have some storage overhead as described earlier in the chapter. However, exactly this storage overhead means that a JSON data type will perform better than storing the same data as a blob. The overhead consists of metadata and a dictionary for lookups, and that means accessing the data is faster. Additionally, JSON documents support in-place updates, whereas the text and blob data types replace the entire object even if only a single character or byte is replaced.

Within a given family of data types (e.g., int versus bigint), the smaller data type performs better than the larger; however, in practice, there are also considerations about the alignment within the registers of the hardware, so for in-memory workloads the difference may be negligible or even the reverse.

So which data types should you use? That’s the final topic of the chapter.

Which Data Type Should You Choose?

In the beginning of the chapter, it was discussed how it could seem like a good idea just to store all data in strings or binary objects to have the greatest flexibility. During the course of the chapter, it has been discussed how there are benefits from using specific data types, and in the previous section the performance of different data types was discussed. So which data type should you choose?

You can start asking yourself some questions about the data you need to store in the column. Some examples of questions are as follows:
  • What is the native format for the data?

  • How large values can be expected initially?

  • Will the size of the values grow over time? If so, how much and how quickly?

  • How often will the data be retrieved in queries?

  • How many unique values do you expect?

  • Do you need to index the values? Particularly, is it the primary key of the table?

  • Do you need to store the data, or can it, for example, be fetched through a foreign key in another table (using an integer reference column)?

You should choose a data type that is native for the data you need to store. If you need to store integers, choose an integer data type, usually int or bigint depending on how large values you need. You can choose a smaller integer type if you want to restrict the values; for example, the number of children for a table storing data about parents need not be a bigint, but a tinyint suffices. Similarly, if you want to store JSON documents, use the json type instead of longtext or longblob.

For the size of the data type, you need to consider both the current need and the future need. If you expect within long to need larger values, it is likely best to choose the larger data type right away. That saves changing the table definition at a later date. However, if the expected change is years away, it may be better to go with the smaller data type now and reevaluate your needs over time. For the varchar and varbinary, you can also change the width in-place as long as you do not change the number of bytes required to store the length of the string or the character set.

When you work with strings and binary objects, you can also consider storing the data in a separate table and reference the values using an integer. This will add a join when you need to retrieve the values; however, if you only rarely need the actual string values, it may be an overall win to keep the main table small. The benefit of this approach also depends on the number of rows in the table and how you query the rows; large scans retrieving many rows will benefit more than single-row lookups, and using SELECT * even when not all columns are needed will benefit more than selecting just the columns needed.

If you only have a few unique string values, it can also be worth considering using the enum data type. It works similar to a lookup table but saves the join and allows you to retrieve the string values directly.

For non-integer numeric data, you have the choice between the exact decimal data type and the approximate float and double data types. If you need to store data such as monetary values that must be exact, you should always choose the decimal data type. This is also the type to choose if you need to do equality and non-equality comparisons. If you do not need the data to be exact, the float and double data types perform better.

With respect to string values, then the char, varchar, tinytext, text, mediumtext, and longtext data types require a character set and a collation. In general, it is recommended to choose utf8mb4 with one of the UCA 9.0.0–based collations (the collations with _0900_ in the name). The default utf8mb4_0900_ai_ci is a good choice, if you do not have specific requirements. Latin 1 will perform marginally better, but rarely enough to warrant the added complexity of having different character sets for different needs. The UCA 9.0.0 collations also provide more modern sorting rules than those collations that are available for Latin 1.

When you need to decide how large values to allow, go for the smallest data type or width that supports the values you need now and in the near future. Smaller data types also mean that less space is used toward the row size limit (64 kiB) and more data can fit into an InnoDB page. As the InnoDB buffer pool can store a certain number of pages according the size of the buffer pool and the pages, it in turn means more data can fit into the buffer pool and thus help to reduce disk I/O. At the same time, remember that optimization is also about knowing when you have optimized enough. Do not spend a long time to shave off a few bytes, just to end up having to do an expensive table rebuild in a year.

A last thing to consider is whether the value is included in indexes. The larger the values, the larger the index also becomes. This is a particular issue for the primary key. InnoDB organizes the data according to the primary key (as the clustered index), so when you add a secondary index, the primary key is added to the end of the index to provide the link to the row. Additionally, this organization of the data means that in general monotonically increasing values perform best for the primary key. If the column with the primary key changes randomly over time and/or it is large, you may likely be better off adding a dummy column with an auto-increment integer and using that as the primary key.

Indexes are themselves an important and large topic which will be discussed in the next chapter.

Summary

This chapter has gone through the concept of data types. There are several benefits from using data types: data validation, documentation, optimized storage, performance, and correct sorting.

MySQL supports a large range of data types, from simple integers over strings and spatial objects to complex JSON documents. Each data type was discussed with focus on the supported values, the supported size of the values, and the amount of storage required.

The final part of the chapter discussed how the data type can impact performance and how to determine which data type to choose for a column. This included considerations of whether the column will be indexed, which also relates to one of the benefits of data types: correct sorting. Indexes is a very important topic, and indeed the next chapter will be covering them.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.188.119.219