Google Cloud Datastore

One of the services available to App Engine developers is Google Cloud Datastore, a NoSQL document database built for automatic scaling and high performance. Its deliberately limited feature set is what allows it to scale so well, but understanding the caveats and best practices is vital to a successful project.

Denormalizing data

Developers with experience of relational databases (RDBMS) will often aim to reduce data redundancy (trying to have each piece of data appear only once in their database) by normalizing it: spreading it across many tables, adding references (foreign keys), and then joining it back together in a query to build a complete picture. In schemaless and NoSQL databases, we tend to do the opposite: we denormalize data so that each document contains the complete picture it needs, which makes reads extremely fast because only a single record has to be fetched.

For example, consider how we might model tweets in a relational database such as MySQL or Postgres:

[Figure: the normalized design, with Tweets, Users, and URLs tables linked by foreign keys]

A tweet itself contains only its unique ID and a foreign key reference to the Users table representing its author; any URLs mentioned in TweetBody are stored as rows in a separate URLs table that refer back to the tweet.

One nice feature of this design is that a user can change their Name or AvatarURL and it will be reflected in all of their tweets, past and future, something you wouldn't get for free in a denormalized world.

However, in order to present a tweet to the user, we must load the tweet itself, look up (via a join) the user to get their name and avatar URL, and then load the associated rows from the URLs table in order to show a preview of any links. At scale, this becomes difficult because the three tables of data might well be physically separated from each other, which means several lookups across different machines have to happen just to build up this complete picture.
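To make the cost of that read path concrete, here is a minimal sketch of the normalized model in Go; the types and the loadTweet, loadUser, and loadURLs helpers are hypothetical stand-ins for real queries against the three tables:

    package main

    import "fmt"

    // Normalized model: each piece of data lives in exactly one table.
    type User struct {
        ID        int64
        Name      string
        AvatarURL string
    }

    type Tweet struct {
        ID     int64
        UserID int64 // foreign key into the Users table
        Body   string
    }

    // Hypothetical query helpers; in a real system each one is a separate
    // round trip to the database (or one leg of a three-way join).
    func loadTweet(id int64) (Tweet, error) {
        return Tweet{ID: id, UserID: 1, Body: "gophers are great"}, nil
    }
    func loadUser(id int64) (User, error) {
        return User{ID: id, Name: "gopher", AvatarURL: "http://example.com/a.png"}, nil
    }
    func loadURLs(tweetID int64) ([]string, error) {
        return []string{"http://example.com"}, nil
    }

    // renderTweet cannot show anything until all three lookups complete.
    func renderTweet(tweetID int64) (body, name, avatarURL string, urls []string, err error) {
        tweet, err := loadTweet(tweetID)
        if err != nil {
            return
        }
        user, err := loadUser(tweet.UserID)
        if err != nil {
            return
        }
        urls, err = loadURLs(tweet.ID)
        if err != nil {
            return
        }
        return tweet.Body, user.Name, user.AvatarURL, urls, nil
    }

    func main() {
        body, name, _, _, err := renderTweet(42)
        if err != nil {
            panic(err)
        }
        fmt.Println(name+":", body)
    }

Each helper call represents its own query; nothing can be rendered until all three have completed, which is exactly the problem when the tables live on different machines.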

Consider what a denormalized design would look like instead:

[Figure: the denormalized design, in which each tweet embeds the user and URL data it needs to render]

We still have the same three buckets of data, except that now our tweet contains everything it needs in order to render to the user without having to look up data from anywhere else. The hardcore relational database designers out there will have realized what this means by now, and it is no doubt making them feel uneasy.

Following this approach means that:

  • Data is repeated: AvatarURL in User is repeated as UserAvatarURL in the tweet, as the sketch after this list shows (a waste of space, right?)
  • If the user changes their AvatarURL, UserAvatarURL in the tweet will be out of date
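To ground those two points, here is a minimal sketch in Go of what the denormalized Tweet entity might look like; field names such as UserName and UserAvatarURL follow the discussion above but are otherwise hypothetical:

    package tweets

    // Tweet is a sketch of the denormalized entity: everything needed to
    // render the tweet lives on the tweet itself.
    type Tweet struct {
        UserID        string   // reference back to the master User record
        UserName      string   // copied from the user's Name at write time
        UserAvatarURL string   // copied from the user's AvatarURL at write time
        Body          string
        URLs          []string // link-preview URLs embedded directly in the tweet
    }

A single read of this entity returns everything needed to render the tweet, with no joins and no extra lookups; the cost is exactly the two points above, since the copied fields take more space and can drift out of date when the master User record changes.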

Database design, at the end of the day, comes down to physics. We are deciding that our tweet is going to be read far more times than it is written, so we'd rather take the pain upfront and accept the hit in storage. There's nothing wrong with repeated data as long as it is understood which set is the master set and which is duplicated for speed.

Changing data is an interesting topic in itself, but let's think about a few reasons why we might be OK with the trade-offs.

Firstly, the speed benefit when reading tweets is probably worth the unexpected behavior of changes to master data not being reflected in historical documents; it would be perfectly acceptable to decide to live with this emergent behavior for that reason.

Secondly, we might decide that it makes sense to keep a snapshot of data at a specific moment in time. For example, imagine if someone tweets asking whether people like their profile picture. If the picture changed, the tweet context would be lost. For a more serious example, consider what might happen if you were pointing to a row in an Addresses table for an order delivery and the address later changed. Suddenly, the order might look like it was shipped to a different place.

Finally, storage is becoming ever cheaper, so the need to normalize data to save space is lessened. Twitter even goes as far as copying the entire tweet document into the feed of each of your followers. Having 100 followers on Twitter means that your tweet will be copied at least 100 times, maybe more for redundancy. This sounds like madness to relational database enthusiasts, but Twitter is making smart trade-offs based on its user experience: they'll happily spend a lot of time writing a tweet and storing it many times over to ensure that when you refresh your feed, you don't have to wait very long to get updates.
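To make that write-time trade-off concrete, here is a minimal, self-contained sketch of fan-out on write in Go; the in-memory Timeline map is a hypothetical stand-in for whatever per-follower storage a real system would use:

    package main

    import "fmt"

    // Tweet is the denormalized document that gets duplicated.
    type Tweet struct {
        UserName string
        Body     string
    }

    // Timeline maps a follower ID to the tweets copied into their feed.
    type Timeline map[string][]Tweet

    // fanOut does the expensive work at write time: one copy per follower.
    func fanOut(timelines Timeline, followers []string, t Tweet) {
        for _, follower := range followers {
            timelines[follower] = append(timelines[follower], t)
        }
    }

    func main() {
        timelines := Timeline{}
        followers := []string{"alice", "bob", "carol"}
        fanOut(timelines, followers, Tweet{UserName: "gopher", Body: "Hello, world"})
        // Reading a feed is now a single cheap lookup with no joins.
        fmt.Println(timelines["alice"])
    }

The write does work proportional to the number of followers, but reading any follower's feed becomes a single lookup, which is the trade being described here.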

Note

If you want to get a sense of the scale of this, check out the Twitter API and look at what a tweet document consists of. It's a lot of data. Then, go and look at how many followers Lady Gaga has. This has become known in some circles as "the Lady Gaga problem" and is addressed by a variety of different technologies and techniques that are out of the scope of this chapter.

Now that we have an understanding of good NoSQL design practices, let's implement the types, functions, and methods required to drive the data part of our API.
