© Charles Bell 2018
Charles Bell, Introducing the MySQL 8 Document Store, https://doi.org/10.1007/978-1-4842-2725-1_10

10. Planning for MySQL 8 and the Document Store

Charles Bell1 
(1)
Warsaw, Virginia, USA
 

This book has covered a lot of material, including a brief overview of some of the newest features of MySQL 8. I focused on the MySQL Document Store and all its components: the X Protocol, the X DevAPI, the MySQL Shell, and changes to the server via the MySQL X Plugin. Not only that, but I also gave a walkthrough of how to develop applications using the X DevAPI—from SQL-based to hybrid to NoSQL solutions. These technologies are fantastic additions to the server and promise far more return on your development resources than traditional relational database application development. It is clear that there is a lot more to MySQL 8 than just a new, jaunty jump in version numbering.

Recall that we also got a glimpse of some new high availability features such as Group Replication and InnoDB Cluster. But it doesn’t end there, does it? We also have new authentication mechanisms, the new data dictionary, and many small but significant updates. So, where does one start when considering the implications of migrating and upgrading to MySQL 8? In this chapter, I look at some strategies for migrating to MySQL 8, including considerations and best practices for moving applications to the document store, along with another example of migrating an existing database application. I also explore some tips and tricks for working with MySQL 8.

Let’s begin by briefly discussing some strategies for considering upgrading to MySQL 8 from MySQL 5.7 and earlier.

Upgrading from MySQL 5.7 and Earlier

Although this book is not a tutorial on how to upgrade to MySQL 8, there are some things you should consider before adopting the MySQL Document Store, which will likely result in upgrading your existing MySQL servers.

There are several ways you can go about learning how to do an upgrade. The most obvious and recommended route is to read the online MySQL reference manual, which contains a section on upgrading MySQL (providing critical information you must know). However, there are some higher-level or general practices that apply to any form of upgrade or migration. This section presents upgrade practices that will help you avoid some of the trouble with upgrading a major system like MySQL.

In this section, we look at the types of upgrades you will encounter with MySQL then discuss some general practices for planning and executing the upgrade. We conclude the section with a brief discussion about reasons for performing the upgrade. We discuss the reasons for doing an upgrade last so that you will have a better understanding of what is involved including implied risks.

Let’s begin by looking at the types of upgrades you are likely to encounter.

Types of Upgrades

The online MySQL reference manual and similar publications describe two basic upgrade methods, which are strategies and procedures for how to do the upgrade. The following is a summary of the methods.
  • In-Place: MySQL server instances are upgraded by replacing the binaries and reusing the existing data directory. This method employs various utilities and tools to ensure a smooth transition to the new version.

  • Logical: The data is backed up (for example, with mysqldump), the new version is installed over the old installation, and the data is restored after the upgrade.

Although these describe two general strategies for upgrading MySQL, they don’t cover all possible options. In fact, we will see another method in a later section. After all, your installation is likely to be slightly different—especially if you’ve been using MySQL for a long time or have a lot of MySQL servers configured for high availability or are using third-party applications and components with your own applications. These factors can make following a given, generic procedure problematic.

Rather than try to expand on the upgrade methods, let’s look at it from the point of view of a system administrator. In particular, what do we do if we have version x.y.z and want to upgrade to a.b.c? The following sections describe upgrades based on versions.

Caution

Oracle recommends upgrading only GA versions. Upgrading other releases is not recommended and may require additional time to migrate as well as accepting potential incompatibilities. Upgrade non-GA releases at your own risk.

MySQL Version Number Terminology

MySQL uses a 3-digit version number in the form of major, minor, and revision (odd that it is also called the version in the documentation). This is often expressed with dot notation. For example, version 5.7.20 defines the major version as 5, the minor version as 7, and the revision as 20. Often, the version number is followed by text (called the suffix in the documentation) indicating additional version history, stability, or alignment such as general availability (GA), release candidate (RC), and so forth. For a complete explanation of the version number in MySQL, see https://dev.mysql.com/doc/refman/8.0/en/which-version.html .
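To make the terminology concrete, here is a short Python sketch (illustrative only; not part of MySQL) that splits a version string into its major, minor, revision, and suffix parts:

```python
import re

def parse_mysql_version(version):
    """Split a MySQL version string into (major, minor, revision, suffix).

    The suffix (e.g. '-rc') is optional and returned as '' when absent.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)(.*)$", version)
    if match is None:
        raise ValueError("not a valid MySQL version string: %s" % version)
    major, minor, revision = (int(part) for part in match.groups()[:3])
    return major, minor, revision, match.group(4)

print(parse_mysql_version("5.7.20"))    # (5, 7, 20, '')
print(parse_mysql_version("8.0.3-rc"))  # (8, 0, 3, '-rc')
```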

Revision Upgrade

The simplest form of upgrade is one where only the revision number changes. This is commonly referred to as the Z in the X.Y.Z version number or simply “the version of the major.minor release.” For example, version 5.7.20 is revision 20 (or version 20) of 5.7.

Upgrading at this version level is generally safe and, although not guaranteed to work flawlessly, is low risk. However, you still should take the precaution of reading the release notes before executing the upgrade. This is especially true if you are working with non-general availability (non-GA) releases. If the release is not a GA release, you must pay attention to the release notes and the upgrade section in the online MySQL reference manual. Although it is rare, sometimes there are special considerations you must plan for and overcome to achieve the upgrade. Fortunately, Oracle does an excellent job of communicating any necessary steps and procedures—you just need to read the documentation!

Minor Upgrade

The next form of upgrade is one where the minor number changes. This is commonly referred to as the Y in the X.Y.Z version number—for example, upgrading from 5.6 to 5.7.

Upgrades are generally acceptable and documented for a single-digit increment of the minor version. For example, the upgrade from 5.6 to 5.7 is supported, but an upgrade directly from 5.0 to 5.7 is not. This is because there are too many differences between the versions to make a direct upgrade viable (though not impossible).
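As a rough illustration of this policy, a small Python helper might encode the “single-digit increment” rule like this. This is a toy sketch, not an official compatibility matrix; the 5.7-to-8.0 major-upgrade case is hardcoded purely for illustration:

```python
def is_supported_upgrade(old, new):
    """Illustrative check for a directly supported upgrade path.

    Revision-only upgrades within a series are fine, and minor upgrades
    are supported only for a single-digit increment (5.6 -> 5.7, but
    not 5.0 -> 5.7 directly).
    """
    old_major, old_minor = old
    new_major, new_minor = new
    if old_major == new_major:
        return new_minor - old_minor in (0, 1)
    # Hardcoded illustration: the documented major upgrade path is from
    # the last minor of the prior series, e.g. 5.7 -> 8.0.
    return (old_major, old_minor) == (5, 7) and (new_major, new_minor) == (8, 0)

print(is_supported_upgrade((5, 6), (5, 7)))  # True
print(is_supported_upgrade((5, 0), (5, 7)))  # False
print(is_supported_upgrade((5, 7), (8, 0)))  # True
```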

Nevertheless, you can upgrade minor version changes with manageable risk if you plan accordingly. More about managing the risk in later sections.

Major Upgrade

The next form of upgrade is one where the major number changes. This category—aside from the incompatible upgrades—is the one with the most risk and is potentially the most likely to require extra work.

Upgrades of versions at the major version are rare and only occur when Oracle has released a new, major set of changes (hence the name) to the server. MySQL 8 server contains many improvements over MySQL 5—most have been tremendous increases in performance, advanced features, and stability. However, there have been a few changes that have rendered some features in older versions incompatible.

For example, once MySQL 8.0 is released as GA, upgrading from MySQL 5.7 to MySQL 8.0 is supported but you may have to migrate certain features to complete the upgrade.

It is fortunate that Oracle has documented all the problem areas in detail with suggestions on how to migrate to the new features. We’ve even seen this extend beyond major versions—the MySQL Document Store is a very good example.

Incompatible Upgrades

As you may have surmised, there are some upgrades that are not recommended, either due to a lack of features to support the upgrade or due to major incompatibilities. For example, you should not consider upgrading from MySQL 5.0 to MySQL 8.0, simply because some of the older 5.0 features have no support in 8.0. Because these types of upgrades are not common, the following list summarizes some of the incompatible upgrades. Note that the subject of the incompatibility isn’t the new version to which you want to upgrade; it is the old version from which you are upgrading.
  • Skipping major versions: Upgrading major versions may introduce incompatible changes.

  • Skipping minor versions: Some upgrades of minor versions may introduce incompatible changes.

  • Upgrading incompatible hardware: Data from hardware of one endianness may not be compatible with hardware of another. For example, moving from big-endian to little-endian may not work.

  • Versions that change the InnoDB format: There have been some changes where the InnoDB storage engine internals have changed. Most have been planned for compatible minor.revision upgrades (e.g. 5.7.3 to 5.7.12), but some have required a few extra steps to prepare the data.

  • New features: Less frequently, there are new features introduced that may introduce incompatibilities. For example, the data dictionary was added rendering the .FRM metadata obsolete.

  • Platform changes: Some upgrades that include changing platforms may require additional work or introduce potential incompatibilities. For example, moving from a platform without case sensitivity support in the file system to one that does support case sensitivity.

  • Upgrading non-GA releases: Upgrades from a non-GA release to a GA release, from a GA release to a non-GA release, and among non-GA releases are not recommended.

Without a doubt, the incompatibilities are dependent on certain features, hardware, or internal storage mechanisms. In most cases, the online documentation outlines what you can do to ensure success. Sometimes this requires following a specific upgrade path such as first upgrading to one version before upgrading to your target version.

What If I Must Upgrade An Incompatible Version?

If you find that your upgrade falls into this list of incompatible upgrades, do not despair. You may still be able to perform the upgrade, but it may be costlier and require more work. For example, you could perform a logical upgrade by backing up your data as SQL statements with mysqldump or mysqlpump, installing the new version, and then editing the SQL files to remove any incompatibilities. Although this introduces considerable risk that you may not be able to import all your data cleanly, it is still possible. If you find yourself in this situation, be sure to spend more time addressing the risks using strategies such as parallel installation and extended periods of testing.

Now that we have a good idea of what types of upgrades are possible, let’s look at some best practices for performing the upgrade.

Upgrade Practices

When upgrading any system, there are some general practices you should adhere to, or at least use as a guide. This section describes some of the fundamental practices you should consider when upgrading your MySQL servers. Again, some of these may be familiar, and some may not be ones you would think to apply to upgrading MySQL. Further, some of these are not outlined in the online MySQL reference manual.

As you will see, these practices are not necessarily sequential, nor is each a prerequisite for the next. For example, planning should also include time for testing. The practices discussed here are therefore in a general order of importance, but they need not be considered or implemented strictly in that order.

Check Prerequisites

The first thing you should do when upgrading MySQL is check the documentation for any prerequisites. Sometimes this is simply safely backing up your data, but it can also include things such as which utilities and tools you need to use to migrate certain features (or data). Be sure you have met all the prerequisites before you begin the upgrade.

The upgrade documentation also will include incompatibility issues. Most often, this occurs when upgrading major versions but sometimes this happens for minor versions. It is fortunate that these are outlined in the online MySQL reference manual. Checking the prerequisites also can help you by providing details you can use to plan the upgrade.

Caution

The online MySQL reference manual section on upgrading should be your first stop, not your last stop when things go wrong. Forewarned is forearmed.

Once you’ve read through the documentation, one of the things you will want to do as a prerequisite is use the mysqlcheck utility to check your MySQL installation for incompatibilities. For example, one of the prerequisites for upgrading to MySQL 8 is that, per the section entitled “MySQL Upgrade Strategies” in the online MySQL reference manual, “there must be no tables that use obsolete data types, obsolete functions, orphan .frm files, InnoDB tables that use nonnative partitioning, or triggers that have a missing or empty definer or an invalid creation context.” We can use the mysqlcheck utility to identify any of these conditions, as shown in Listing 10-1.

$ mysqlcheck -u root -p --all-databases --check-upgrade
Enter password:
library_v1.authors                                 OK
library_v1.books                                   OK
library_v1.books_authors                           OK
library_v1.notes                                   OK
library_v1.publishers                              OK
library_v2.books                                   OK
library_v2.notes                                   OK
library_v2.publishers                              OK
library_v3.books                                   OK
...
mysql.user                                         OK
sys.sys_config                                     OK
Listing 10-1

Using mysqlcheck to Identify Upgrade Issues

For best results, you should use the mysqlcheck utility from the version to which you are upgrading. This ensures the utility is the most up to date and should identify more upgrade issues.

Plan the Upgrade

Once you have all the prerequisites mapped out and have identified any features that require special handling to solve incompatibilities, it is time to plan for upgrading your server. This may be an obvious thing to do if you have thousands of servers, but less obvious to those with only a few (or even one) server to upgrade.

You should resist the temptation to simply run the upgrade without planning what you are going to do. Recall, we want to ensure the upgrade goes smoothly by reducing (or eliminating) risk. This is much more critical for production environments, but any potential loss of availability, performance, or data can result in loss of productivity.

You can get most of what you need for the plan from the documentation, but the documentation won’t be specific to your installation, servers, platform, and so forth. Therefore, you must fill in those blanks and adapt the procedures suggested in the documentation to your own installation. You can learn quite a lot by reading the section “What’s New in MySQL 8.0,” paying attention to any subsections labeled “Ramifications for Upgrades” in the online MySQL reference manual. There you will find tips that may help you avoid some complicated decisions or, better, avoid complex repairs.

This also includes making sure you have the right personnel on hand to do the upgrade or to be ready to jump in case something goes wrong.1 For example, don’t forget your developers, web administrators, and other critical roles.

The form of the plan is up to you; however, I suggest that you write what you plan to do and share it with others. This way, everyone in the chain of ownership of the upgrade will know what is to be done. You will be surprised how much a little communication can do to reduce risk of things going wrong.

Caution

If you are using or plan to use a platform that supports automatic updates and those facilities include repositories that monitor MySQL, you may want to consider excluding MySQL from automatic updates. This is especially true for production environments. You should never automatically update MySQL in a production environment for any mission critical data.

Consider Parallel Deployment

One practice that can help the most when upgrading systems that require more than a trivial amount of work is installing the new version in parallel with the existing version. This is a well-known practice in software engineering, designed to ensure the existing data and applications remain unchanged while the new system is being installed and configured. The new installation is treated as a development platform and often goes into production once sufficient testing of the migration is complete.

Although this isn’t an upgrade per se (it’s a new installation), having a new parallel version of MySQL running gives considerable freedom in how to attack the migration of your existing data and applications. After all, if something goes wrong, your data is still operational on the old system.

This practice also provides you another benefit: you can change platforms or other major hardware without having to risk your existing data. Therefore, if your existing servers have hardware that is to be updated at the same time, you can use a parallel installation to install MySQL on the new hardware thus isolating the risks with the new hardware.

Finally, employing a parallel installation may help with scheduling and planning your migration by ensuring the existing systems are fully capable. And, better, you can always go back to the old system if during the migration something goes wrong.

Parallel deployment often includes keeping both systems running for some period. The length of time may depend on the amount of risk you’re willing to take, or it may be based on how long it takes to fully switch over all your applications.

It is unfortunate that some may not have the resources available to consider parallel deployments, as having two installations of MySQL running at the same time may place a greater burden on developers, administrators, and support personnel. Given the benefits of parallel deployment, however, it may be worth adding extra resources or accepting reduced productivity from some personnel for a short period.

However, even this safety net is tenuous if you don’t perform enough testing.

Test, Test, Test!

This practice, along with planning, is often overlooked or given far less importance. Sometimes this is due to external forces such as not having the right personnel available or failures in planning that results in no time for extensive testing. Regardless of the excuse, failing to adequately test your upgrade increases risk beyond what most would be willing to endure.

Testing should ensure that all the data has been migrated, all applications work completely, and all access (user accounts, permissions, etc.) is functional. However, don’t stop there. You should also ensure all your operational practices have been updated for the new version. More specifically, verify that your maintenance scripts, procedures, and tools all work correctly with the new version.

Furthermore, your testing should result in a go/no-go decision on whether to accept the upgrade. If things are not working or there are too many issues, you may need to reject the upgrade. The parallel installation practice helps here because you don’t destroy the existing data or installation until you are certain everything is working. Make sure you write those acceptance criteria into your plan to ensure success.

Tip

Be sure to test all existing operational procedures as part of your acceptance criteria.

Production Deployment Strategies

If you have a production and a development (or test) environment, you should also consider how to move the development or test deployment to production. If you are using parallel installations, it may be as simple as switching application routers and similar appliances and applications. If you are using in-place installations, it may be more involved. For example, you may need to plan for a period of downtime to complete the migration.

For parallel installations, planning the downtime can be more precise and involve a shorter period because you have had more time to test things. However, for in-place upgrades, you may need to set aside a longer period to complete the migration. As expected, you will want to minimize the downtime by doing as much of the migration in advance as you can. But in the case of MySQL, this may be nothing more than forming a plan and gathering resources. The bottom line is, don’t forsake including production deployment in your plan.

Now that we’ve discussed upgrade practices, let’s take a moment to discuss some reasons we may want to consider performing the upgrade, which clearly can be a very involved process with a certain amount of risk.

Reasons for Upgrading

If you’re like most avid users of platforms or systems, you will want to upgrade to the latest and greatest versions whenever a new one is released. Savvy administrators and planners know there is little room in a production database environment for such behavior. Thus, reasons for upgrading will require some genuine bang for the buck. That is, it must be worth your while. The main driving reasons for upgrading MySQL include the following.
  • Features: A new feature is released that can improve your applications or data. Examples include the Document Store, Group Replication, and InnoDB Cluster.

  • Performance: The newer version improves performance making your applications better. For example, the latest 5.7 release is many times faster than previous versions and MySQL 8 promises to improve on that.

  • Maintenance: There are new features that help you maintain the system better. Examples include the new data dictionary, Group Replication, and ancillary tools such as MySQL Enterprise Backup.

  • Bug fixes: There may be defects in older versions that required workarounds or limitations. Newer versions may contain fixes for critical bugs so you can remove the workarounds and limitations caused by the defect.

  • Compliance: Your platform, standard operating procedures, or external entities require the upgrade for compliance. For example, you may be required to run a specific version of MySQL for contractual agreements.

The bottom line is you must answer the question, “Why should I upgrade?” and that answer must result in some benefit for you, your data, clients, workforce, and the company’s future. It makes little sense to spend resources on an upgrade that has little or no benefit, which is another reason companies often skip version upgrades. Alas, skipping too many upgrades can make later upgrades more problematic. However, given how much improvement MySQL 8.0 is over MySQL 5.7 and earlier, many will want to plan to upgrade to MySQL 8.

Tip

For more details about migrating to MySQL 8 including platform-specific steps, see http://dev.mysql.com/doc/refman/8.0/en/upgrading-from-previous-series.html .

So, Should I Upgrade To MySQL 8 or Not?

The discussion in this section may cast some doubt on whether you should upgrade to MySQL 8. That is not the case. In fact, this book should convince you to upgrade to MySQL 8 as soon as you can do so in a safe, risk-free manner. Thus, in this section I suggest that you need to plan your upgrade and execute it carefully to ensure success.

Considerations for Upgrading to MySQL 8

There are several compatibility issues identified in the online MySQL reference manual for MySQL 8.0. The following are a few that you should be aware of when planning your MySQL 8.0 upgrade.
  • Data dictionary: The new transactional metadata storage mechanism is a major change in the architecture. If you have DevOps processes that work with .frm files and other metadata, you may need to make changes to migrate to using the data dictionary.

  • Authentication plugin: The default authentication plugin has changed. This may result in connection issues for those that use older authentication mechanisms.

  • Error codes: Some error codes have changed. If you have applications that use error codes, you will want to review these changes to avoid application errors after upgrading.

  • Partitioning: The generic partitioning handler has been removed, so partitioned tables must use a storage engine that provides native partitioning (such as InnoDB). If you are using a custom storage engine (or an old one) for partitioned tables, you will need to ensure an upgraded version exists for use with MySQL 8.

  • INFORMATION_SCHEMA: There are minor changes to the views. If your applications or DevOps processes use these views, be sure to check whether any of the views you are using have been removed or changed.

  • SQL commands: There are some new and obsolete SQL commands. Be sure to check your SQL statements to see if you are using some of the older, removed commands.

  • Default charset: The default character set (charset) has been changed to utf8mb4. If you have character set support in your applications, you may need to test with the new default to ensure compatibility.

Once again, be sure to read the online MySQL reference manual section, “Verifying Upgrade Prerequisites for Your MySQL 5.7 Installation” and the section, “Changes Affecting Upgrades to MySQL 8.0” for the most up-to-date information about these and other prerequisites and migration tasks needed to upgrade to MySQL 8.0.

Another excellent resource is the engineering blogs at https://mysqlserverteam.com/ . These blogs often discuss new features before they are released as GA and are a fount of knowledge about how the feature works as well as any upgrade issues that the engineering team has identified or are working to overcome. Watching the blogs will give an excellent early warning of changes.

Tip

See the engineering blogs at https://mysqlserverteam.com/ for early announcements about new features and how to work with them.

Now that we’ve discussed upgrading MySQL in general and some specifics about upgrading to MySQL 8.0, let’s discuss a very important development concept you should spend some time to understand before adopting the MySQL Document Store fully.

Migrating to Schemaless Documents

What does schemaless mean? It simply means that we don’t restrict our data storage to rigid formats with specific fields of a given type. The key factor to remember in adopting a schemaless mindset is to play to the strengths of JSON documents: the ability to store only the data needed, including the ability to add document elements where needed and keep all meaningful data together. This is also known as flexibility, and it is the cornerstone of designing schemaless documents.

For example, if a developer finds a new field must be added, it can. Or, if the developer discovers embedding information results in faster, easier code, the decision is one of weighing benefits of the application and the user experience rather than strict data storage rules.

Flexibility has another angle. Relational databases are typically designed using the same set of rules and tools, so much so that what applies to one database will show up in another. In schemaless designs, decisions about keeping data in separate collections or embedding it are largely made based on how the data will be used, and thus can vary from one application to another.

However, you should not conclude from this that developers have a free ride and can run amuck whenever they choose. Rather, you should consider flexibility a tool that you can use if warranted after due process to evaluate the change.

Another benefit of the schemaless mindset is the ability to scale the data without retrofitting. For example, if our applications gain more features resulting in more data in the documents, there is no need to go back and force older documents into a new structure. We simply scale the document as the application features mature. If we do need to go back and add data to the old documents, it’s a simple coding affair.
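The idea can be sketched in a few lines of Python using plain dictionaries as stand-in documents (the field names here are hypothetical). Application code tolerates older documents that predate a new element simply by supplying a default:

```python
import json

# Two generations of the same document: the older one predates the
# hypothetical 'ratings' feature added later.
old_doc = json.loads('{"title": "First Edition", "year": 2005}')
new_doc = json.loads('{"title": "Second Edition", "year": 2018, "ratings": [5, 4]}')

def average_rating(doc):
    """Tolerate documents created before the 'ratings' element existed."""
    ratings = doc.get("ratings", [])
    return sum(ratings) / len(ratings) if ratings else None

print(average_rating(old_doc))  # None - the old document needs no retrofit
print(average_rating(new_doc))  # 4.5
```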

Although reducing unnecessary complexity and ambiguity should be goals in any data store design, a schemaless mindset should be more willing to accept responsibility for the tradeoffs and to work to minimize their impact rather than adhering to a set of fixed rules.

Thus, the schemaless mindset should be one of flexibility and scalability where we emphasize these qualities over data structure and conformity. Remember, we’re striving to keep the data together to reduce the number of times it must be retrieved.

The following sections identify some of the areas you may need to think about when adopting a schemaless mindset.

Normalization vs. Denormalization

One of the most fundamental mindset challenges is recognizing the difference between normalization and denormalization. In basic terms, normalization is the goal relational database designers strive for: reducing the amount of data stored while avoiding duplication and ambiguity. Denormalization is the goal document store designers strive for: keeping the data as local as possible by describing an entity in its fullest, with duplication a much lower concern.

In a schemaless world, we use denormalization to remove the need for joins, thereby potentially increasing performance. However, it doesn’t end there. The goal is to make the data models store data that is used as a unit. In other words, the document should contain all the pertinent data.

In some cases, this may result in embedding data in the document that would normally be stored in a separate collection (or table). Denormalizing the data therefore may introduce some duplication. For example, we saw this in the last chapter where we stored the author names in the book document. In that case, the author names were indeed duplicated among books that had the same authors. However, the way the data is used—to view bibliography information—meant there was no practical need to search or query the author data on its own. Simply listing the authors was all that was needed. Thus, the cost of retrieving the normalized author data was an artificial application of fixed rules.

When approaching the question of normalization or denormalization, take some time to analyze your data not only from the standpoint of how it is organized but also in how it is used. Sometimes you may find your reasons for isolating data may not be important. For example, if there is no need to have a child table and some duplication is acceptable, you can embed the information in one or more JSON fields thereby applying enough denormalization to get the benefits of having the data in one place (retrieved with one pass).
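The contrast can be sketched in Python using plain dictionaries as stand-in documents (the titles and names below are illustrative only, not actual data from the book's examples):

```python
# Normalized: authors live in their own collection and each book
# references them by id, so displaying a bibliography entry takes
# two lookups (a join in relational terms).
authors = {1: "Charles Bell", 2: "Mats Kindahl"}
book_normalized = {"title": "MySQL High Availability", "author_ids": [1, 2]}

# Denormalized: the names are embedded in the book document itself, so
# one read returns everything the bibliography view needs. The
# duplication is acceptable because we only display the names; we never
# query authors on their own.
book_denormalized = {
    "title": "MySQL High Availability",
    "authors": ["Charles Bell", "Mats Kindahl"],
}

# Normalized access requires a join-like lookup...
names_via_join = [authors[i] for i in book_normalized["author_ids"]]
# ...while the denormalized document already has the data local.
names_embedded = book_denormalized["authors"]

print(names_via_join == names_embedded)  # True
```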

Formal Rules vs. Heuristics

Another area to consider is how data storage is designed or, more appropriately, what mechanisms are used to drive the development. In the relational database world, we have a set of rules (e.g., the unyielding and sometimes unforgiving2 rules called normal forms) that guide designers to achieve the least redundant, most accurate retrieval solution possible. In the schemaless world, we use heuristics, or rules of thumb, to arrive at the design that best describes the things we’re modeling and makes the data accessible.

The difference is not in how the data is designed so much as how the data is revealed. In the relational database world, the data store is designed largely around how the data is stored, whereas in the schemaless world, the documents are designed based on how the data will be used and how the users will view it.

Therefore, we can predict reasonably well how a relational database will be accessed and how to form the queries (often in advance of the application) and, with the right tools, even how the queries will perform. However, in a schemaless solution, we cannot tell by examining the document how it will perform. We must test it with the application to learn how to access it most effectively. Sometimes this results in making minor changes, which can be in the code or in the document itself. It is fortunate that the process for making such modifications is much easier than in a relational database. Thus, in the schemaless world we must adopt a more change-friendly attitude.

This is one of the things that sets the two apart. For example, in a relational database, when we need to change a table (or set of tables), we often must plan well in advance because changing the strict schema often forces application developers to change their applications. On the other hand, changing a document doesn't require a lengthy retooling effort. In fact, developers can simply add the new data item in code without ever troubling the database administrator. Let's look at an example.

Consider a solution where we are storing addresses. We all know what they look like—street1, street2, city, state, and zip code. But what if you had to make your database available for international data? Now, we’re looking at the possibility of adding a country name at the very least if not additional fields for country-specific addresses. Without a doubt, this would require modifying the table (as well as the code). Now, consider a document that stores addresses. If we need new fields, we just add them in code and write the code to detect the new fields.
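To make this concrete, here is a minimal sketch in plain Python, with dictionaries standing in for decoded JSON documents (the street, city, and country values are invented for illustration):

```python
# A domestic address, using the field names from this chapter's examples.
domestic = {
    "address_type": "home",
    "street1": "123 Main Street",
    "city": "Anywhere",
    "state": "VT",
    "zip": "12388",
}

# Going international, the new keys are simply added to the documents that
# need them -- no ALTER TABLE and no retooling of existing rows.
international = {
    "address_type": "work",
    "street1": "10 Downing Street",
    "city": "London",
    "zip": "SW1A 2AA",
    "country": "United Kingdom",
}

# Reading code detects which fields each document actually carries.
for addr in (domestic, international):
    parts = [addr[k] for k in ("city", "state", "country") if k in addr]
    print(", ".join(parts))
```

The same loop handles both shapes because the code asks each document what it contains rather than assuming a fixed column list.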

View Data as Code

One of the hardest concepts about schemaless designs for those who have worked with relational databases is that the data (document) should (can) be viewed as code. Consider the JSON structure—it’s code! Thus, thinking about your data as part of your code will help you design better documents.

For example, if you know your document contains elements that are lists, and you need to iterate over the items in those lists, the code to do this is typically some form of loop or iterator mechanism such as a for each or for X in Y construct. Thus, you can view the document as having data that is used in those looping constructs. Yes, this is akin to thinking in result sets (arrays of rows), but in this case, we're much closer to the real code. In fact, due to the uniqueness of how JSON works, we can write code to reference the elements of the document directly rather than going through an abstraction layer as we do in result set processing. That is, we access the field by name rather than asking a library to give us the "nth" field. This results in code that is easier to read and is described by the data (and vice versa).
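As a small sketch (a plain dictionary stands in for a decoded JSON document; the phone values mirror the examples later in this chapter), iterating an embedded list by field name reads almost exactly like the data it describes:

```python
# A contact document with an embedded list, mirroring the chapter's data.
contact = {
    "first": "Bill",
    "last": "Smith",
    "phones": [
        {"area_code": 301, "exchange": 555, "number": 1212},
    ],
}

# A "for X in Y" construct over the named list: we ask for the field by
# name rather than asking a library for the nth column of a result set.
for phone in contact["phones"]:
    print("({0}) {1}-{2}".format(phone["area_code"],
                                 phone["exchange"],
                                 phone["number"]))
```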

Take Storage for Granted

This may sound a little odd, but the storage mechanism—placing a document in the document store—can largely be ignored by schemaless designers and code developers. For instance, we aren’t concerned about table rows, fields, and so forth. The schema is flexible, and the focus is on collecting documents.

The APIs for document stores in general, and the X DevAPI in particular, make these operations ubiquitous, thereby freeing the designers and developers to focus on how the document is used rather than how it is stored.

As expected, there are cases where we want to ensure we're not over denormalizing, and in those cases, we will need to think logically about how to organize the data in collections, but this too is at a higher level and not directly linked to storage mechanisms.

Embed or Separate?

Knowing when to embed data is one of the skills you will learn as you design more schemaless documents. However, there are some general rules to follow to help answer the question “to embed or not?” The following lists a few conditions under which you may need to decide to embed or separate the data.
  • Integrity: The embedded data applies only to this document. It is not used elsewhere and is seldom changed (or viewed) without the document. If it can be used in other documents and changes must apply to all references or it is a separate entity, it should not be embedded.

  • Limited growth: The embedded data is not likely to grow in length. For example, if the embedded data is an array, the size of the array (number of items) will remain small (or few). If there is a chance to grow beyond a reasonable size, it should be made a separate collection.

  • Containership: If a relationship exists where one document contains another document and the documents are only accessed as a set, you may want to embed. However, if the documents can be accessed or changed separately (and it makes sense to do so), you may want to consider not embedding the document.

  • Frequency of edits: If there is data in a document that is seldom changed, you can embed it. However, if the data can change frequently, either from another access point (view) or from a mechanism in the application that does not use the original document, you may want to consider moving the data to its own collection.

  • Links: If the data you want to embed is referenced only infrequently and from one document, you can consider embedding it. However, if it is referenced by more than one document and the data must be the same for all references, you should place the linked data in its own collection.
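The choice these rules drive can be sketched as two document shapes (plain Python dictionaries stand in for documents; the ids and values are invented for illustration):

```python
# Shape 1: embedded -- the address travels inside the contact, so a
# single read retrieves everything.
contact_embedded = {
    "first": "Bill", "last": "Smith",
    "addresses": [{"address_type": "home", "city": "Anywhere", "state": "VT"}],
}

# Shape 2: linked -- shared data lives once in its own collection; the
# contact embeds only an array of ids, and resolving them takes one
# extra step.
addresses = {
    "addr_001": {"address_type": "work", "city": "Anywhere", "state": "VT"},
}
contact_linked = {
    "first": "Jane", "last": "Smith",
    "address_ids": ["addr_001"],
}

resolved = [addresses[aid] for aid in contact_linked["address_ids"]]
print(resolved[0]["city"])
```

If the address must stay identical for every contact that references it, shape 2 wins; if it belongs to this contact alone, shape 1 avoids the extra lookup.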

Strategies for Migrating to a Document Store

Now that we have a better understanding of the schemaless mindset, let us review some strategies you can employ for migrating an existing relational database to a document store. This section reinforces the lessons learned thus far in the book using another example of migrating to a document store.

As we saw in Chapter 9, we do not have to migrate all our databases and data in one go—although you can do that if you have the resources, time, and sufficient need. However, most will want to migrate slowly to a document store. It may also be the case that a pure document store does not meet your needs or is too costly, making a full migration a longer-term goal.

Let's use a commonly known database solution for a contact list. Here, we are storing names, addresses, and phone numbers. A typical relational database solution for a contact list would group all the data in a single database with one-to-many relationships for addresses, phone numbers, email addresses, and so forth. This is because we know each contact will have one or more of these data items. Figure 10-1 shows an entity relationship diagram (ERD) for such a relational database design.
../images/432285_1_En_10_Chapter/432285_1_En_10_Fig1_HTML.jpg
Figure 10-1

Contact list (relational database)

First, note that this database design is not completely normalized. For example, it is possible that a contact list could include two or more people who have the same work or home address as well as the same phone number. Without a doubt, we could create many-to-many relationships, but that would be taking normalization too far. More specifically, we do not store addresses and phone numbers separately from the contact data—it doesn't make any sense to use the data in that way.

Second, note that the database does support storing multiple email addresses, addresses, and phone numbers for each contact. That is, there exists a one-to-many relationship between the contacts table and the email_addresses, addresses, and phones tables. We will use the database name contact_list1 so that we can migrate this to other forms for comparison.

In this example, there are foreign key constraints as well as primary keys on all the tables. Listing 10-2 shows the SQL statements for the tables in the database.

CREATE DATABASE IF NOT EXISTS `contact_list1`;
CREATE TABLE `contact_list1`.`contacts` (
    `contact_id` int(11) NOT NULL AUTO_INCREMENT,
    `first` char(30) DEFAULT NULL,
    `last` char(30) DEFAULT NULL,
    PRIMARY KEY (`contact_id`)
) ENGINE=InnoDB;
CREATE TABLE `contact_list1`.`addresses` (
    `addr_id` int(11) NOT NULL AUTO_INCREMENT,
    `contact_id` int(11) NOT NULL,
    `address_type` ENUM('work', 'home', 'other') DEFAULT 'home',
    `street1` char(100) DEFAULT NULL,
    `street2` char(100) DEFAULT NULL,
    `city` char(30) DEFAULT NULL,
    `state` char(30) DEFAULT NULL,
    `zip` char(10) DEFAULT NULL,
    PRIMARY KEY (`addr_id`,`contact_id`),
    KEY `contact_id` (`contact_id`),
    CONSTRAINT `addresses_ibfk_1` FOREIGN KEY (`contact_id`) REFERENCES `contacts` (`contact_id`)
) ENGINE=InnoDB;
CREATE TABLE `contact_list1`.`email_addresses` (
    `email_id` int(11) NOT NULL AUTO_INCREMENT,
    `contact_id` int(11) NOT NULL,
    `email_address` char(64) DEFAULT NULL,
    PRIMARY KEY (`email_id`,`contact_id`),
    KEY `contact_id` (`contact_id`),
    CONSTRAINT `email_addresses_ibfk_1` FOREIGN KEY (`contact_id`) REFERENCES `contacts` (`contact_id`)
) ENGINE=InnoDB;
CREATE TABLE `contact_list1`.`phones` (
    `phone_id` int(11) NOT NULL AUTO_INCREMENT,
    `contact_id` int(11) NOT NULL,
    `phone` char(30) DEFAULT NULL,
    PRIMARY KEY (`phone_id`,`contact_id`),
    KEY `contact_id` (`contact_id`),
    CONSTRAINT `phones_ibfk_1` FOREIGN KEY (`contact_id`) REFERENCES `contacts` (`contact_id`)
) ENGINE=InnoDB;
Listing 10-2

Contact List Relational Database

Note

Savvy database designers will note the phone number field is denormalized too far. Can you spot the issue?3

Now that we have seen the database design, let’s consider some things that a document store designer would notice and want to change. That is, let’s look at this design a bit more critically. Note that for any contact we want to view, we have potentially up to three additional queries to retrieve all the data. This is because we have broken out the phone, email, and address into separate tables. We could issue a single join query to get all the data, but that will result in extra data (unless you use an outer join or similar tricks).

To keep things simple, let’s stick with the one query per dependent table. Even so, that gives us a total of four queries to execute to retrieve all the data for a given contact. Listing 10-3 shows the queries we would need to execute to get the data for a contact named ‘Bill Smith’.

MySQL  localhost:33060+  SQL > SELECT * FROM contact_list1.contacts WHERE first = 'Bill' AND last = 'Smith';
+------------+-------+-------+
| contact_id | first | last  |
+------------+-------+-------+
|          1 | Bill  | Smith |
+------------+-------+-------+
1 row in set (0.00 sec)
 MySQL  localhost:33060+  SQL > SELECT * FROM contact_list1.addresses WHERE contact_id = 1;
+---------+------------+--------------+-----------------+---------+----------+-------+-------+
| addr_id | contact_id | address_type | street1         | street2 | city     | state | zip   |
+---------+------------+--------------+-----------------+---------+----------+-------+-------+
|       1 |          1 | home         | 123 Main Street | NULL    | Anywhere | VT    | 12388 |
+---------+------------+--------------+-----------------+---------+----------+-------+-------+
1 row in set (0.00 sec)
 MySQL  localhost:33060+  SQL > SELECT * FROM contact_list1.email_addresses WHERE contact_id = 1;
+----------+------------+----------------------------+
| email_id | contact_id | email_address              |
+----------+------------+----------------------------+
|        1 |          1 | [email protected] |
|        2 |          1 | [email protected]      |
+----------+------------+----------------------------+
2 rows in set (0.00 sec)
 MySQL  localhost:33060+  SQL > SELECT * FROM contact_list1.phones WHERE contact_id = 1;
+----------+------------+----------------+
| phone_id | contact_id | phone          |
+----------+------------+----------------+
|        1 |          1 | (301) 555-1212 |
+----------+------------+----------------+
1 row in set (0.00 sec)
Listing 10-3

Queries to Retrieve a Contact (Relational Database)

As you can see, this involves several trips to the database server to get the data. If your application is designed to use tabs or some other user interface mechanism to hide the phone, address, and emails until the user clicks to reveal the information, having four queries might be okay and might save you some effort. However, almost every contact list solution includes name, address, and phone number. So, we’re not saving much in the general terms of round trips to the database.

Note also that we must keep track of the contact_id to pass to each of the three dependent queries. A good relational database designer would say, "so what?" to these observations. However, the schemaless mindset tells us to try to minimize joins and to keep all the data together. Let's see how we can apply a schemaless mindset to migrate the database to a hybrid solution, keeping the base table but incorporating JSON fields.

Note

I don't go into the create, update, and delete operations because these are very familiar from relational database systems.

Migrating to a Hybrid Solution

Removing joins is a good strategy for improving performance when retrieving data. It also plays to one of the standards of schemaless design: keep the data together. Both are strategies that fall under denormalizing the data. The example contact list database in the last section was normalized into four tables: one for the contact name, and one each to contain all the addresses, phone numbers, and email addresses.

The goal wasn't to make the separate tables searchable or even accessible and presented on their own. After all, what use would anyone have to see a list of phone numbers without any connection to the owner? Rather, the normalization was meant to keep like data together and remove duplication.

For example, it is possible if you know several people who work at the same place that their work address and phone numbers will be the same. Likewise, members of the same family may have the same address and phone numbers.4 Normalization then results in a main table with three dependent tables in a one-to-many relationship.

However, if you consider that we will seldom need to query the phone number, email address, or address table separately and that the data in those tables is associated with a contact and that it only makes sense to view the data as a set, we can denormalize the data by embedding it in the contact data.

We can do this easily by simply adding three fields to the contacts table using JSON fields to embed the data and still maintain the structure. Recall, we can use a JSON document in code and thus all the field names from the original tables can be used. When you do it this way, migrating the code is easier because you will be referring to the same data names. The following shows a redesign of the database to use a hybrid solution.

CREATE DATABASE IF NOT EXISTS `contact_list2`;
CREATE TABLE `contact_list2`.`contacts` (
    `contact_id` int(11) NOT NULL AUTO_INCREMENT,
    `first` char(30) DEFAULT NULL,
    `last` char(30) DEFAULT NULL,
    `addresses` json DEFAULT NULL,
    `email_addresses` json DEFAULT NULL,
    `phones` json DEFAULT NULL,
    PRIMARY KEY (`contact_id`)
) ENGINE=InnoDB;

Here, we eliminated the three tables by adding the relations as JSON arrays. But wait, how do we format those JSON documents? Isn’t that going to be a problem? No, not really. To migrate to a hybrid solution you would carry over the field names from the embedded data using them as keys in the JSON document.

We can also improve the data for the phone numbers by adding keys to reference the area code, exchange, and phone number. Better still, if we later find that we need to add a country code value, we can do so for just those contacts that require it. Remember, the beauty of JSON documents is that they are mutable, and you can add or leave out fields as needed. The only catch is that your code must be written to expect the omissions and new fields. Thus, there is no need to retool your existing data when adding a country code for new contacts.
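A quick sketch of that mutability in plain Python (country_code is the hypothetical new key; json.dumps produces the kind of string you would store in the JSON column, and it is safer than hand-typing the literal, where a stray curly quote breaks the parse):

```python
import json

# The original phone document.
phone = {"area_code": 301, "exchange": 555, "number": 1212}

# Some contacts later need a country code: add the key to just those
# documents; existing rows are untouched and no ALTER TABLE is needed.
intl_phone = dict(phone, country_code=1)

# json.dumps yields a valid JSON string ready to store in the JSON column.
doc = json.dumps({"phones": [intl_phone]})
print(doc)
```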

The following shows how this is done for the create operation using the same contact shown in the last section.

INSERT INTO contact_list2.contacts VALUES(
  NULL, 'Bill', 'Smith',
  '{"addresses":[{“address_type”:”home”, "street1":"123 Main Street","street2":"","city":"Anywhere","state":"VT","zip":12388}]}',
  '{"email_addresses":["[email protected]","[email protected]"]}',
  '{"phones":[{"area_code":301,"exchange":555,"number":1212}]}'
);

This results in an interesting row returned from a SELECT query. We see the result in the following.

MySQL  localhost:33060+  SQL > SELECT * FROM contact_list2.contacts WHERE first = 'Bill' AND last = 'Smith' \G
*************************** 1. row ***************************
     contact_id: 1
          first: Bill
           last: Smith
      addresses: {"addresses": [{"zip": 12388, "city": "Anywhere", "state": "VT", "street1": "123 Main Street", "street2": "", "address_type": "home"}]}
email_addresses: {"email_addresses": ["[email protected]", "[email protected]"]}
         phones: {"phones": [{"number": 1212, "exchange": 555, "area_code": 301}]}
1 row in set (0.00 sec)

Here, we see that we've used the same names as the fields in the original tables. Accessing the data is made easier because we can migrate our code from looking for columns in row objects to using the field names directly in code. For example, to display the embedded lists of addresses, phones, and emails, we can use loops. Listing 10-4 shows an example script to do this.

import mysqlx
from json import JSONDecoder
GET_BILL = """
SELECT * FROM contact_list2.contacts
WHERE last = 'Smith' AND first = 'Bill'
"""
# Connect to database
session = mysqlx.get_session("root:password@localhost:33060")
# Read the row
row = session.sql(GET_BILL).execute().fetch_one()
# Convert JSON strings to Python dictionaries
addresses = JSONDecoder().decode(row["addresses"])["addresses"]
phones = JSONDecoder().decode(row["phones"])["phones"]
email_addresses = JSONDecoder().decode(row["email_addresses"])["email_addresses"]
# Display the data
print("Contact List (Hybrid)")
print("---------------------")
print("Name: {0} {1}".format(row["first"],row["last"]))
print(" Addresses:")
for address in addresses:
    print(" ({0})".format(address["address_type"].upper()))
    print(" {0}".format(address["street1"]))
    if address["street2"]:
        print(" {0}".format(address["street2"]))
    print(" {0}, {1} {2}".format(address["city"],
                                  address["state"],
                                  address["zip"]))
print(" Phones:")
for phone in phones:
    print(" ({0}) {1}-{2}".format(phone["area_code"],
                                   phone["exchange"],
                                   phone["number"]))
print(" eMail Addresses:")
for email in email_addresses:
    print(" {0}".format(email))
print("")
Listing 10-4

Sample Read Operation for Contact List (Hybrid)

Note that the code reads well and we can see exactly which data we’re accessing in the loops. However, there is some duplication when converting the JSON strings to Python dictionaries. This is because we have a field in the table and a key in the JSON string with the same name. For example, there is an addresses field and the key in the JSON document is addresses. This might look a little odd, but it is precisely how you would access the JSON document in the fields. Some may want to rename the field or JSON key to make it a bit less ambiguous.

The following shows this code executing. Note that the output resembles how you would expect to see the data from reading the code. The use of tabs (\t) is helpful for printing strings to a console.

$ python ./hybrid_read.py
Contact List (Hybrid)
---------------------
Name: Bill Smith
Addresses:
    (HOME)
    123 Main Street
    Anywhere, VT 12388
Phones:
    (301) 555-1212
eMail Addresses:
    [email protected]
    [email protected]

This solution is better and does solve the issue of removing joins and keeping the data together, but what if we need to store a surname, suffix, or title, or if there are contacts with more than two names? Likewise, what if you found you needed to find all contacts that live or work in a certain area code? The data is there, but because the phone data is embedded in a single JSON string, it is harder to search (but not impossible).

We can solve this problem (and similar issues) by simply altering the table to break the data out into separate fields. That will work, and it is what most developers will do. However, what do you do with any existing data? Do you go back and reformat it using special tools you create yourself? You don't have much choice, and if there is a lot of data, the conversion can be painful and time consuming.

Sadly, hybrid solutions with relational parts (table, fields) are still fixed and thus hybrid solutions are not immune to changes. What we need is to achieve mutability—to be able to change the structure whenever we need to do so without having to retool. If you’re thinking, “there’s got to be a better way,” you are correct—there is. Let’s see how to overcome these issues by converting the database to a pure document store.

Converting to a Document Store

Perhaps the best attribute of document store solutions is mutability. That, along with the “data as code” concept makes working with document stores so much easier than relational databases. Although we saw some improvements in a hybrid solution for the contacts database, we are not quite to the point where mutability is possible.

To be specific, we still have fixed fields in the hybrid solution. If these fields were the complete set (for all time), we might be satisfied with a hybrid solution. But if you work in an international setting, you will find that storing first and last names is far too casual and, in some cases, insufficient.

For example, consider the name Geraldo Jose Miguel Gomez. What do you do with such a name? Split it arbitrarily, placing the parts in the two fields? What if the individual uses Miguel as his first name? Now your database will list his name as "Miguel", "Geraldo Jose Gomez", which is not correct. Furthermore, if you split the name in such a manner, any queries on first name (or last name, for that matter) are subject to incorrect results, or at least require additional parsing after the query to sort out these anomalies.

If we used a document store, we can add whatever fields we need. We just need to keep our code and data synchronized. That is, when we add new fields, we must also add code in our CRUD operations to compensate. For example, adding a nickname field to the document is easy, but the code that reads the data and displays it must allow for working with the nickname. Best of all, we can add the change and don’t have to rework any existing data.
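That synchronization can be sketched in plain Python (nickname is the hypothetical new field; dictionaries stand in for documents): readers tolerate the missing key, so older documents keep working unchanged.

```python
# An old document created before the nickname field existed, and a new one.
old_contact = {"first": "Bill", "last": "Smith"}
new_contact = {"first": "Jane", "last": "Smith", "nickname": "JJ"}

def display_name(contact):
    # Tolerate the absent key so existing documents need no rework.
    nickname = contact.get("nickname")
    if nickname:
        return '{0} "{1}" {2}'.format(contact["first"], nickname,
                                      contact["last"])
    return "{0} {1}".format(contact["first"], contact["last"])

print(display_name(old_contact))
print(display_name(new_contact))
```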

This, then, is the goal: to make your data schemaless and integrated tightly with the code. Once you adopt this mindset, you will find the stigma of normalization easy to cast off. Although that doesn’t mean all schemaless solutions will outperform their relational counterparts (unlikely in fact), it does mean you can make the data work for you instead of against you. Developers especially will appreciate the freedom.

As we learned in Chapter 9 and earlier in the book, we can create a collection easily in code. Recall, this includes connecting to the server, getting a schema object instance, and creating the collection. We can then create, read, update, or delete documents in the collection. Figure 10-2 shows a snapshot of using the MySQL Shell to create the schema and collection for the contact list. We also add the row we’ve been using as an example.
../images/432285_1_En_10_Chapter/432285_1_En_10_Fig2_HTML.jpg
Figure 10-2

Creating a document store

Here, we see we have migrated our rigid relational database model to a mutable JSON document with embedded data that keeps all the data together for each contact.

Now, let’s look at the code to perform a read operation on the document store. Listing 10-5 shows the code to read the document from the collection and print it to the console.

import mysqlx
# Connect to server
session = mysqlx.get_session("root:password@localhost:33060")
# Get the schema
schema = session.get_schema("contact_list3")
# Get the collection
contacts = schema.get_collection("contacts")
# Read the row
row = contacts.find("first = '{0}' and last = '{1}'".format('Bill',
                                                        'Smith')).execute()
contact = row.fetch_one()
addresses = contact["addresses"]
phones = contact["phones"]
email_addresses = contact["email_addresses"]
# Display the data
print("Contact List (DocStore)")
print("-----------------------")
suffix = ""
if "suffix" in contact.keys():
    suffix = ", {0}".format(contact["suffix"])
print("Name: {0} {1}{2}".format(contact["first"],contact["last"],suffix))
if "title" in contact.keys():
    print("Title: {0}".format(contact["title"]))
print(" Addresses:")
for address in addresses:
    print(" ({0})".format(address["address_type"].upper()))
    print(" {0}".format(address["street1"]))
    if "street2" in address.keys():
        print(" {0}".format(address["street2"]))
    print(" {0}, {1} {2}".format(address["city"],
                                  address["state"],
                                  address["zip"]))
print(" Phones:")
for phone in phones:
    print(" ({0}) {1}-{2}".format(phone["area_code"],
                                   phone["exchange"],
                                   phone["number"]))
print(" eMail Addresses:")
for email in email_addresses:
    print(" {0}".format(email))
print("")
Listing 10-5

Sample Read Operation for Contact List (Document Store)

Note that the code is very similar to the hybrid solution. In fact, the print sections are the same. The difference is seen early on when retrieving the data. In this case, we can retrieve the document and then store the addresses, phones, and email addresses as dictionaries, which makes the code even easier to read. Very nice!

The following shows the code executing. As you can see, the output is essentially the same as the hybrid solution, with the addition of the suffix and title fields.

$ python ./docstore_read.py
Contact List (DocStore)
-----------------------
Name: Bill Smith, Jr
Title: Salesman
Addresses:
    (HOME)
    123 Main Street
    Anywhere, VT 12388
Phones:
    (301) 555-1212
eMail Addresses:
    [email protected]
    [email protected]

Now that we’ve discussed what a schemaless mindset is, let’s review some tips and tricks for working with the MySQL Document Store.

Document Store Tips and Tricks

The following are several best practices for planning, developing, and managing applications using the MySQL Document Store. Some may seem intuitive, whereas others may simply remind you to do the things we all know we should do but sometimes shortcut for brevity. They are presented in a bulleted list and are intended to be a resource you can refer to periodically at the start of a migration or development effort.
  • Minimize joins: Joins can be expensive. Reducing how many places you need to join data can help speed up your queries. Removing joins may result in some level of denormalization but can result in faster access to the data.

  • Plan for mutability: Schemaless designs are focused on mutability. Build your applications with the ability to modify the document as needed (and within reason).

  • Remove many-to-many relationships: Use embedded arrays and lists to store relationships among documents. This can be as simple as embedding the data in the document or embedding an array of document ids in the document. In the first case, the data is available as soon as you read the document and in the second, it takes only one additional step to retrieve the data. In cases of seldom read (used) relationships, having the data linked with an array of ids can be more efficient (less data to read on the first pass).

  • Avoid over denormalizing: It is possible to take denormalization too far. If you denormalize your data by embedding things in your document at the expense of duplication, you may at some point discover you need to change the duplicated data. If this happens, you've crossed the line and your data update nightmares commence. Thus, whenever you denormalize, always consider how (or if) the data will be updated. If it can be updated in isolation (say, only for one or more documents) and those changes do not need to exist in other documents, your denormalization should be fine. However, if you think even for a moment that you will need to update all occurrences of the embedded data, you must consider moving the data to another collection and using embedded lists to link the documents by id.

  • Know your data: This may sound obvious, but you must understand the data you are using in your design. Not just what it can (or must) contain, but also how it will be used. Often, relational database designers are only concerned about the ability to retrieve any part of data at the expense of how the data is used. Thus, in the relational database world we often find ourselves optimizing queries after the application and data are designed. In the schemaless world, we must focus on how the data will be used from the start so that we can store the pertinent data together in a single or sometimes linked document. Knowing how your data will be used in the application can make a difference to how you form your document. It also can help you determine how to write the code to retrieve the data before you start to write the code.

  • Avoid large documents: Storing all the data in a single document is indeed one of the goals of schemaless design, but this too must be applied with some judgment. If your document ends up being very large, you could encounter performance problems trying to retrieve more than a single document (say, for a list or to perform an operation over a set of documents). Thus, you should consider which parts of the document are used and when. You may find that you can split your document into several smaller documents (each in its own collection). This way, you can optimize retrieval for most operations, retrieving the less frequently used data only when you need it.

  • Use JSON columns for embedding data in tables: If you want to improve your existing relational database by reducing the number of joins, you can use a JSON field to embed the data. For example, use a JSON field to collapse many-to-many join tables by storing an array of pointers (keys) to the dependent table. One obvious candidate is text or BLOB fields that contain encoded data.
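To make the over-denormalization tip concrete, here is a small sketch in plain Python (dictionaries stand in for documents; the employer data is invented for illustration):

```python
# Duplicated embedded data: every document carries its own copy, so a
# change to the shared value means touching every document.
contacts = [
    {"name": "Bill", "employer": {"name": "ACME", "phone": "555-1212"}},
    {"name": "Jane", "employer": {"name": "ACME", "phone": "555-1212"}},
]
for c in contacts:                           # N writes
    c["employer"]["phone"] = "555-9999"

# Linked data: the shared entity lives once in its own collection and
# each document embeds only its id -- one write updates all references.
employers = {"acme": {"name": "ACME", "phone": "555-1212"}}
contacts_linked = [
    {"name": "Bill", "employer_id": "acme"},
    {"name": "Jane", "employer_id": "acme"},
]
employers["acme"]["phone"] = "555-9999"      # one write
print(employers[contacts_linked[0]["employer_id"]]["phone"])
```

If the employer's phone number must change for everyone, the linked shape is the one to keep.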

Summary

The MySQL Document Store and the latest incarnation of the server, MySQL 8, represent a huge leap forward in functionality, reliability, and availability. Best of all, MySQL 8 doesn’t force you into a new paradigm. The NoSQL option via the X DevAPI perfectly complements the advanced clustering and availability of NDB Cluster. But unlike NDB Cluster, you can use your existing MySQL servers.5

What this means is you don’t have to learn and completely retool your infrastructure and applications to use the newest features. What we saw in this book and in practice in Chapter 9 is that you can choose to use MySQL 8 as a traditional relational database store, migrate your applications to a hybrid of relational data with one or more JSON fields, or completely rethink your data by migrating to a pure Document Store solution.

In fact, MySQL makes it easy to migrate your applications because the X DevAPI supports both an SQL and a NoSQL interface. Thus, the first step is to migrate all your SQL interface-based applications to the X DevAPI; then you have the option of migrating those to a hybrid or pure document store solution.

This is an exciting time for MySQL users. Oracle continues to keep its promise not only to continue developing MySQL but also to pour resources into improving and expanding the feature set. Keep a close watch for more excellent features and further refinements and updates. MySQL 8 is here, and now is the time to jump on board. Look for more titles from Apress on MySQL 8!
