CHAPTER 27


Scaling Drupal

by Károly Négyesi

To define scaling, let's look at a little café. When it opens up, because it's little, the owner does everything: she takes the order, prepares the drink, and takes the payment in exchange for the coffee. Some time passes and the café becomes popular so the owner hires a barista and a waiter. Now the waiter takes the order and gets the money. The barista gets a slip with the order, prepares the drink, and gives it to the customer. What you should notice here is that when one person did everything, the exchange of the money and the drink happened at the same time. But now the customer hands over the money first and only eventually gets a drink. Is there a risk involved with handing over money before getting the coffee? If a fire broke out suddenly, the customer would have paid his money but received nothing in exchange. However, no one is really bothered by this possibility; they would rather take this extremely small risk in order to get their coffee a lot faster. By separating the “taking of money” and the “serving coffee” actions, the shop can serve a lot more customers a lot faster. It can hire any number of baristas and any number of cashiers to better accommodate the traffic. This is scaling: to accommodate the traffic in such a way that more customers do not slow down the process.

But note that adding baristas does not make the time between paying and receiving any shorter. The shop owner makes sure there is always a free barista, but you still need to wait for him to prepare your drink. If, however, the shop employs crafty red robots that make coffee much faster, this will help shrink this waiting time. This is performance: the time that a web application takes between receiving a request and finishing serving it.

Performance is important because people tend to abandon slow web sites; scaling enables a web site to perform when there is a lot of traffic but scaling in itself won't make a web site fast. It is performance that makes it fast. However, if you have enough visitors, even the fastest web site will end up having to make some visitors wait before it begins serving their requests.

Now that I have defined scaling and performance, I'll spend the next part of the chapter telling you why you should care about them. Then I'll discuss some of the available scaling options for Drupal 7. That discussion will focus mainly on databases, because databases are integral to scaling Drupal. Drupal isn't always as scalable as one wants it to be, but the changes I'll discuss can make Drupal scale much more effectively.

Do You Need to Care About Scaling?

When you start out building a web site, handling more traffic is not usually your focus. Getting any traffic at all is usually what concerns most people. After all, only people with big, successful web sites need to worry about traffic. No web site started with millions of users. However, in time, you might just achieve that number or more. If you are aggressively marketing your site, that's all the more reason to plan for growth. Even if you just get a fixed percentage of all Internet traffic, you might be in for a surprise; new devices, even new kinds of devices, allow more and more people to spend more and more time browsing. What are you going to do when success strikes? Will your site be able to adapt?

If your site is not able to adapt to an increase in visitors, you are in for a lot of suffering: what should be success is instead frustrated users, both the new arrivals and the long-time fans, waiting for pages to load and perhaps receiving no response from the site at all. To make this worse, very often the web site is coded in such a way that adding additional hardware doesn't help much; in that case, the site needs to be rebuilt, often from the ground up. This can't happen overnight, of course, and it typically happens at a time when the business is busy with other things: growing. Simply put, more traffic very likely means more potential business, so at the same time the company is trying to cope with the necessary growth to deal with new demand, it suddenly needs to deal with a significant refactoring of the web site. While that happens, the web site might crash daily. The company or organization struggles to keep the lights on, in effect, instead of realizing the success that seems to be tantalizingly close. This is a scenario to avoid.

But there is another pitfall: becoming obsessed with scalability in the beginning of the life cycle of the site to the point where this takes away precious resources from developing functionality. Drupal's modularity has always provided a solution to these issues, and Drupal 7 rises to new heights in this regard.

Cache

Caching means to temporarily store some processed data. It can be structured data or a string of HTML-formatted text. While serving cached data is faster than retrieving and processing it from multiple database tables, the data is not really editable, so processing it into another format is not possible. Because of this, the raw data needs to stay in the database. So now you have more than one copy of the data, and the original data and the cache can become out of sync. In this case, the cache data is “stale.” Sometimes that's fine; if you produce a few articles a day, it probably doesn't matter if fresh content is not visible to anonymous users for a few minutes after the original publication date. This is another important lesson in scalability: the practical trumps the theoretical. Scalability is always a matter of making compromises; it's just a question of which compromises are acceptable for a given web site.

Tip Take careful note of that phrase "for a given web site." There is no silver bullet, no single solution to all scaling challenges. Scalability is always web-site–specific, although some practices apply to many similar sites.

Drupal can utilize caching to store the whole page, the HTML as it is, for anonymous visitors. Enabling this is very simple at admin/config/development/performance and it works even on the simplest shared hosting. Note that it only works for anonymous users but more often than not, a significant portion of the traffic is anonymous. By default, submitting new content deletes this cache, but a minimum lifetime can be set up. As mentioned, it's very site-specific.
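If you prefer to keep such settings in code (for example, to deploy them consistently across environments), the same options can be forced from settings.php. Here is a minimal sketch using core Drupal 7 variables; the values shown are only examples:

$conf['cache'] = 1;                     // Serve cached pages to anonymous users.
$conf['page_compression'] = 1;          // Compress cached pages.
$conf['cache_lifetime'] = 300;          // Minimum cache lifetime, in seconds.
$conf['page_cache_maximum_age'] = 300;  // Expiration of cached pages for proxies.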

Tip Another option that works with shared hosts and is even faster than the built-in page caching is the Boost module (drupal.org/project/boost). This module can serve pages to your anonymous visitors, completely bypassing PHP.

Let's look at how developers can utilize caching to store the results of a slow query. Suppose you have a very slow function called very_slow_find(). This is how to utilize caching:

$cache = cache_get('very_slow');
if ($cache) {
  $very_slow_result = $cache->data;
}
else {
  // Run the very slow query.
  $very_slow_result = very_slow_find();
  cache_set('very_slow', $very_slow_result);
}

First you try to retrieve the cache. If there is a cache hit, the stored data is used; if there is a cache miss, you run the query and store its result. This way the very slow query rarely needs to happen, but see the previous caveats about stale caching. In this example, very_slow is the cache ID (cid) or cache key and $very_slow_result is the data you store.

Cache is used very widely and it's stored in various bins. While it's certainly possible to store everything in one big pile, there are two advantages to separating the data into what Drupal calls bins: the contents of a bin can be trashed separately from the other bins as necessary to avoid stale data, and a different storage mechanism can be set per bin. (I will describe some of the available storage mechanisms later in the chapter.) While using cache_get and cache_set and the other functions that the Drupal caching API provides, you don't need to worry about which storage mechanism is in place. That's why pluggable subsystems are so useful: you can pick a different storage mechanism and you don't need to change any code.
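The rest of the API is equally storage-agnostic: cache_set() accepts a bin and an expiration, and cache_clear_all() invalidates entries. A short sketch building on the earlier example (the bin name and expiration value are only illustrations):

// Store the result in the default 'cache' bin and let it expire in an hour.
cache_set('very_slow', $very_slow_result, 'cache', REQUEST_TIME + 3600);
// When the underlying data changes, drop just this entry...
cache_clear_all('very_slow', 'cache');
// ...or every entry in the bin whose ID starts with this prefix.
cache_clear_all('very_slow', 'cache', TRUE);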

Drupal, by default, uses the database to store cached data—not because it's the best way, but because Drupal knows it's there. Fortunately, there are alternatives. First, you'll see an example that has more to do with helping development and shows how to configure pluggable subsystems; then you'll see some performance and scalability solutions.

Disabling Caching During Development

For development purposes, I recommend the simplest cache implementation: nothing. There is a cache implementation that is the equivalent of a black hole: cache writes and clears don't do anything, reads always fail. Drupal uses this fake cache during installation because there is no information yet available about where it could store its data. This fake cache is quite useful while developing, too. One caveat is that multistep (and consequently AJAX) forms require a working cache, and so will not work when all caches are black-holed in this way. To short-circuit Drupal's caching using its own fake cache, add the three lines below to settings.php, which will be in the /sites folder for your site; typically it will be (relative to the root directory of your Drupal install) at /sites/default/settings.php.

$conf['cache_backends'][] = 'includes/cache-install.inc';
$conf['cache_class_cache_form'] = 'DrupalDatabaseCache';
$conf['cache_default_class'] = 'DrupalFakeCache';

By doing this, complex data structures like the theme registry will be rebuilt on every page load (adding classes and menu items still requires an explicit cache flush on the admin/config/development/performance page). This slows down the site significantly but simplifies the life of the developer because changes to code are reflected immediately in Drupal's behavior.

The $conf array in settings.php is the central place to specify Drupal configuration. While some configuration options have a UI and store their state in the database, there are too many options to provide a UI for each, and most of these UI-less options are not needed by the typical user. Another reason to use $conf instead of a UI is that some configuration is needed before the database is available. The cache is an example of both: it's not something a user needs to set from the UI (most will never need to) and it can be used before the database is loaded. However, because the database is not available yet, configuring the cache is a bit more complicated than usual; you need to specify the file containing the backend (in $conf['cache_backends']), not just the class to be used (as in $conf['cache_default_class']). For other settings, specifying the class is usually enough because Drupal can read the location of the file containing the class from the database.

memcached

The solutions from now on will not work on shared hosts. You need to control your environment in order to perform and scale well.

memcached is a separate program that stores cached data in memory and allows access to it across the network. Being a separate program isn't so unusual; the database Drupal uses (like MySQL) is also such an application. The distinguishing feature of memcached is that it stores data solely in memory, which makes it very, very fast. Using this program instead of a database for caching can help Drupal performance quite a lot. And not just Drupal—this solution is very mature and used practically on every large web site.

Note that memcached is not just a performance solution but also one that scales very well and very easily: you just need to start as many instances of it on as many servers as necessary and configure Drupal to use them. There is nothing to set up in memcached itself because the separate memcached instances don't need to know about each other. It's Drupal that talks to every one of them. This is very different from, say, MySQL, where master-slave replication needs to be configured explicitly.

There are three parts to the memcached puzzle: the application itself, a PHP extension, and a Drupal module. memcached is available at memcached.org. The installation and configuration are detailed on the web site. As mentioned, you need to add a PHP extension to allow PHP to communicate with memcached. There are two, confusingly named “memcache” and “memcached.” Even experts have different opinions on which one is better. I cautiously recommend the newer “memcached” one. PHP extensions can be installed with

pecl install memcached

You can install them in operating system-specific ways, too, such as the following on Debian or Ubuntu:

apt-get install php5-memcached

Third and finally, you make Drupal use memcached by installing the Memcache module, drupal.org/project/memcache. The project page has extensive documentation that I won't repeat here.
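Still, to give a flavor of what that configuration looks like, here is a minimal settings.php sketch for a single local memcached instance; the module path and the server list are assumptions for a typical setup, so treat the module's README as authoritative:

$conf['cache_backends'][] = 'sites/all/modules/memcache/memcache.inc';
$conf['cache_default_class'] = 'MemCacheDrupal';
// Forms need reliable storage, so keep that bin in the database.
$conf['cache_class_cache_form'] = 'DrupalDatabaseCache';
// One local memcached instance; add more 'host:port' => 'bin' pairs to scale out.
$conf['memcache_servers'] = array('127.0.0.1:11211' => 'default');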

Varnish

Another important part of the scalability toolset is Varnish. Varnish is an external program for storing and serving full pages. Normal page caching requires a request to reach your web server, which in turn bootstraps Drupal; Drupal then loads the cached page and sends the response. The Boost module provides a faster solution because the request only needs to reach your web server and Drupal is not started at all. Varnish is even faster because it handles the request itself. It's a really, really fast and massively scalable solution for serving anonymous pages. Its motto is “Varnish makes web sites fly” and it lives up to that. The application is at www.varnish-cache.org/ and the Drupal integration lives at drupal.org/project/varnish.

On Databases

So far you've seen how some pluggable subsystems are configured and you've briefly reviewed various solutions to serve anonymous pages quickly. By only employing the solutions listed so far, you can make your site perform well (thanks to memcached) and scale well for anonymous visitors (thanks to Varnish). However, social web sites are all the rage today and they require serving logged in users. It gets harder; while memcached does buy you some performance, there are many problems to overcome. You need to step back and get an overview of how web sites operate, store, and retrieve data; the problems encountered; and a fairly new solution that solves many somewhat unrelated problems.

On a high level, most web sites do the same basic things: collect data (either a user/administrator entering it via a form in the browser or aggregating it from another web site), store it in a database, and later show it to users. The operations to show and modify data are commonly referred to as Create, Read, Update, and Delete (CRUD for short). A typical web site would use some sort of SQL database to execute these operations.

As you will soon see, SQL and especially the currently widespread SQL implementations (including MySQL/MariaDB/Drizzle, PostgreSQL, Oracle, and Microsoft's SQL Server) are not always the best fit for many problems that a web site faces. But if SQL isn't optimal, why is it so widespread? Similarly, if it is widespread and working, is going another route really a good idea? In short, people have always used these SQL databases, so what's the fuss?

Good questions. Let's look at just how long that "always" is. Most databases that are in use today have their roots in the late seventies. True, a given database program such as MySQL or PostgreSQL might retain no code from, say, UNIREG (the ancestor of MySQL) or from Postgres or even INGRES (the ancestors of PostgreSQL). Like the folklore axe of Abraham Lincoln, proudly kept as a piece of history by a family that honored it with use, the handle was replaced five times and the head three times over the centuries. The mindset when SQL databases were designed and the resulting limitations, alas, do not go away as easily as the code is replaced. Abraham Lincoln's axe is still an axe.

In the years since these databases were constructed, factors like CPU speed, available memory, and available disk space have grown a millionfold. Note that disk speed did not keep up. And although the tasks have grown as well, few tasks have grown a millionfold. And finally, operating systems have grown more sophisticated.

All this means that some new, previously unthinkable design decisions are necessary and now make a lot of sense. Because disk space is abundant and disk speed is now the throttling factor, you can focus on making databases faster rather than smaller. Sacrificing disk space to benefit performance looks like a terrific investment when there is such an abundance of cheap storage. One can presume the whole database fits in memory; if not, the operating system will handle the problem. These design decisions affect more than databases: Varnish, for example, employs these modern programming paradigms and that's one of the reasons it's so fast.

MEMORIES OF MEMORY

Aside from design decisions made before disk space and memory became cheap and plentiful, the fundamentals of SQL also mandate a critical look. SQL is based on tables that have columns. For example, you might have a profile table that contains a column for first name and a column for last name. This means that every single profile will have a first name and a last name—exactly one of each. Even staying within Western culture, it's easy to find examples where such a rigid structure crumbles. For example, what happens when Hillary Diane Rodham Clinton wants to register? Or people who take their online nickname as a legal name and consequently have only one name? For better storage you would also need a profile names table that has a profile ID and a name column, making it possible to store as many names as you want. You will see shortly the drawback of this approach. In Drupal 7, you can easily make “name” a multiple-value text field, but although Drupal hides the ugliness from us, that doesn't mean it's not there.

While names are known to be problematic like that, it's not too hard to find a problem with almost every kind of data. For example, if you are describing cars, then you have a table called car and a column called horsepower that stores a number, a column called transmission steps that again stores a number, and a column called color that stores a string. After all, a car has an engine with a fixed horsepower, right? (Well, not exactly; there are cars with both a gas and an electric motor, which of course have different horsepower ratings, and remember, you can only enter a single number.) As for transmission steps, a continuously variable transmission can't be stored as nicely as a number. As for color, the list is endless: there are wildly successful two-colored small cars as well as thermochromic paints that make your car a chameleon. This problem could be resolved by storing more complex data structures than a single string or number.

The point of all this is that you can't store all the data in a single table with SQL. Why is this such a big deal? Let's look at how the database finds data to answer this question.

Indexes

How do you find a recipe in a cookbook? You look in the index, of course. While alphabetic ordering of the recipes helps, will you remember the name of that excellent salad with chorizo and avocado? Even if you do, cookbooks are often ordered by some other topic, like seasons or occasions, so even the exact recipe name won't be of much help. So let's build an index on ingredients that contains the page numbers where each ingredient appears, like so:

avocado
 40, 60, 233
chorizo
 50, 60, 155

This is somewhat helpful, but to find anything you would need to go over every recipe listed because there is absolutely no information about what's on those pages. Let's create an alphabetical index per ingredient and then per name, like so:

avocado
 Guacamole
   40
 Tortilla Soup
   233
 Warm Chorizo Salad Cesar style
   60
chorizo
 Black Bean Chorizo Burritos
   50
 Scrambled Eggs Mexican style
   155
 Warm Chorizo Salad Cesar style
   60

While this index is useful, it is only useful if you are searching for ingredients or for ingredients and names. It is also useful if you want to list recipes containing an ingredient, ordered by recipe name. However, if you want to search for a name only, it becomes completely useless. For example, if you want to find the Scrambled Eggs Mexican style, you need to go over the avocado recipes first, then the bacon recipes, then the chorizo—not much better than just paging through the whole book itself.

Remember that you started browsing your cookbook looking for a recipe containing both avocado and chorizo. How would you create an index that makes this search simple and fast? If you start your index with recipe names, then you need to go through every single recipe to find them, so that's not a good idea. If you do something like your first index, you can easily find all the recipes containing avocado and also all the recipes containing chorizo. Say you had 1000 avocado and 1000 chorizo recipes but only one that has both. You would need to compare 2000 numbers to find the single one. To make this operation fast you actually need an index like the following:

avocado
 bacon
   30, 37, 48
 chorizo
   60
bacon
 avocado
   30, 37, 48
 chorizo
   70

That's plain horrible. It's not just that the storage requirements would be quite high but, more importantly, in real life where data changes, maintaining something like this would become untenable very quickly. Say the average recipe had a mere eight ingredients; you can make 8*7 = 56 pairs, so every change would require updating 56 index entries! (Even if you don't store both avocado-bacon and bacon-avocado, merely halving this does not solve anything and in fact raises several problems when querying, as an index on ordered pairs cannot be used backwards.) This is the very same way SQL stores and indexes data, and this is the very problem with scaling SQL: you can't use indexes on most queries spanning multiple tables—and you are forced to use multiple tables due to the rigidity of SQL. Let's see a Drupal example!

In Drupal, nodes have comments on them. Nodes are stored in one table (node) and, because there can be multiple comments on a single node, comments are stored in a separate table (comment). If you want to show the ten 'page' nodes most recently commented on, then you are facing a similar problem: you can have a list of comments ordered by creation date and the nodes belonging to them if you have an index on creation date, and you can have a list of 'page' nodes if you have an index on node type; but if, say, the last 200,000 comments are all on 'story' nodes and you have 50,000 pages, then you can either go over 200,000 comments first to find that none of them are pages, or you need to go over all 50,000 pages to find the latest comment on each. That's certainly not fast.
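To make the problem concrete, here is roughly that query written with Drupal 7's database API (a simplified sketch; core's tracker-style queries are more involved). No single index can cover both the type condition on {node} and the ordering by {comment}.created, so the database is forced into one of the two bad plans just described:

// The ten most recently commented-on 'page' nodes.
$result = db_query_range("
  SELECT n.nid, MAX(c.created) AS last_comment
  FROM {node} n
  INNER JOIN {comment} c ON c.nid = n.nid
  WHERE n.type = :type
  GROUP BY n.nid
  ORDER BY last_comment DESC", 0, 10, array(':type' => 'page'));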

This example can easily be fixed by storing the node type in the comment table, because node type does not change. But if you wanted to query node change information and comments together, then you would be forced to store the node change information in the comment table, and a node edit would trigger an update of hundreds of rows in the comment table. Such a practice is called denormalization and it is used widely. However, it should raise red flags: what sort of system are you working with that can only be made to perform by throwing out its very fundamental theory (normalization)? I will show one very good solution to this problem later, but let's continue with listing the problems with SQL.

NULL in SQL

NULL is a wonderful construct that defies conventional logic; its usage is contradictory and it's not even consistent across databases. NULL is used to signal that some data is missing or invalid; it is never equal to anything else, but then again it's neither less than nor greater than anything either. Whatever you do with it, the operation will result in NULL. You need a special IS NULL operator to detect NULLs, like so:

mysql> SELECT 0 > NULL, 0 = NULL, 0 < NULL, 0 IS NULL, NULL IS NULL;
+----------+----------+----------+-----------+--------------+
| 0 > NULL | 0 = NULL | 0 < NULL | 0 IS NULL | NULL IS NULL |
+----------+----------+----------+-----------+--------------+
|     NULL |     NULL |     NULL |         0 |            1 |
+----------+----------+----------+-----------+--------------+

This is called three-valued logic. Instead of a statement being true or false, it can be NULL. This leads to many weird problems with NULL columns. Drupal (not by any conscious decision, mind you) doesn't have many NULL columns, which by sheer luck avoids most of this madness. Alas, this is not consistent, and sometimes you are forced to enter the NULL pit.

There is even more to NULL than just the common-sense–defying three-valued logic, such as:

CREATE TABLE test1 (a int, b int);
CREATE TABLE test2 (a int, b int);
INSERT INTO test1 (a, b) VALUES (1, 0);
INSERT INTO test2 (a, b) VALUES (NULL, 1);
SELECT test1.a test1_a, test1.b test1_b, test2.a test2_a, test2.b test2_b
FROM test1
LEFT JOIN test2 ON test1.a=test2.a
WHERE test2.a IS NULL;
+---------+---------+---------+---------+
| test1_a | test1_b | test2_a | test2_b |
+---------+---------+---------+---------+
|       1 |       0 |    NULL |    NULL |
+---------+---------+---------+---------+

There is, of course, no such (NULL, NULL) row in test2, but LEFT JOIN fills in the missing columns with NULLs, and this behavior is what so-called anti-joins rely on. These are used to SELECT rows from the first table for which no match is found in the second table.

Now, if 1 = NULL were true, the JOIN would have found the single row you inserted into test2. It didn't do that. 1 = NULL, remember, is neither true nor false; it's just NULL. But just from looking at the results this is absolutely not evident; after all, the results contain a test2_a that is NULL, and didn't the JOIN prescribe test1.a=test2.a? Yes, it did, but in this particular case the JOIN condition does not need to be true. The result has missing data that is indicated by a NULL, and it has nothing to do with the JOIN. If you find this confusing, it is. Not only do you need to live with three-valued logic, you need to make exceptions for LEFT JOINs.

There's one more database architecture detail to cover before I can show the answers to these problems. The current SQL implementations are focused on transactions (the term comes from financial transactions but it can mean any unit of work the database executes). You will quickly see that the effects of this focus fly in the face of the expectations of web site users, because the expectations of a web site user are different from the expectations of a bank customer.

A CAP Between ACID and BASE

The typical example of a transaction is sending money to someone else. This is a very complex process, but at the end of the day you expect that your balance is decreased by the amount you sent and the receiver's balance is increased by that amount (minus fees). You also expect that once the bank has said “you sent money,” it is actually sent, no matter how long the actual sending takes (it's completely ridiculous how long a so-called SWIFT transfer can take). Another expectation is that if the money disappears from your account, it will appear in the account of the other party, no matter what happens to the computer system of the bank. You certainly don't expect your money to disappear without a trace if a computer crashes. A set of properties on the database is necessary to make transactions work according to these expectations—these are commonly abbreviated as ACID:

  • Atomicity: Either the complete transaction succeeds or none of it does.
  • Consistency: The database is in a consistent state between transactions. For example, if a record refers to another record and such a reference is invalid by the end of a transaction, then the whole transaction must be rolled back.
  • Isolation: Transactions do not see data changed by other transactions before those are finished.
  • Durability: Once the database system has notified the user of the success of a transaction, the data is never lost.

On the other hand, most web applications want a whole different set of properties, aptly named BASE:

  • Basically Available: Users have the silly expectation that when they point the browser to a web page, some kind of information appears. You expect this to happen even if some parts of the system you are reaching are down. This is not as trivial as it sounds; there are many systems where if one server goes down, the whole system goes down.
  • Scalable: Adding more servers makes it possible to serve more clients. Adding more servers instead of trying to create a single huge monster server is much more future-proof and usually cheaper, too. Once again, this is a user expectation: information not only appears, it appears fast. (Note that this user expectation also requires performance, not just scalability.)
  • Eventually Consistent: It's enough if data eventually becomes available in all the places to which it is copied (you can have many copies, as described in the previous bullet point). The expectation is that the information appearing fast is somewhat current. When you post an ad to Craigslist, it does not immediately appear in lists and searches, which is not really a problem. It would be a problem if it took days for an ad to appear; no one would use the site.

On this last point, if you have a consistent system as described in the ACID property list, it is also “eventually consistent” because it follows a stronger requirement: data immediately becomes available in all copies. (I'll get back to this one in detail just a little bit later.)

It would be awesome if there were a database system that could be both ACID and BASE, but alas, that's not actually possible. Remember that BASE was basically about having a system consisting of a big number of servers. If you mandate that all writes are strongly consistent, then every write needs to go out to every server and you need to wait until all of them finish and communicate back this fact. Now, if you have servers scattered in datacenters around the globe, then network problems can be expected. Let's say a transatlantic link goes down—this is called a network partition. If you want consistency, then the system must go down together with the transatlantic link—even if all other servers are still up—because writes in Europe can't get to the USA and vice versa, so the two parts would have inconsistent data. You need to give up something: either the Availability from BASE (the system goes down to achieve consistency) or the Consistency from ACID (you stay up but the system is inconsistent). In other words, out of Consistency, Availability, and Partition Tolerance (CAP), you can pick two but not three. So far, I have only defined consistency for databases, so it's time to be more generic about it: at any point in time it doesn't matter which server responds to a request; all of them will provide the same answer. Availability, as described above, means that even when some servers go down, the whole system doesn't. Finally, partition tolerance means that some communication between two servers can be lost and yet the system will work. Note how "soft" these two criteria are, and yet they pose a hard problem when you want them together with consistency.

I have just shown that the three parts of CAP don't work together, and I can show that any two can live together easily. If you can make sure the network never fails, then providing Consistency and Availability is not a problem. However unlikely this sounds, Google's BigTable, used by more than 60 Google projects, is actually such a system. If you throw out Availability, then the other two requirements are easily met: just don't switch on the servers; the data will stay consistent even if parts of the network fail. Finally, if you don't care about Consistency, then you can simply not replicate writes from one server to the next—this system will certainly not care about network partitions (because it doesn't use the network at all) and if you reach a working server, it will answer something. While the latter two examples are extreme and don't describe a useful system, the point is, once again, that while Consistency, Availability, and Partition Tolerance can't all live together, any two of them can.

Once again, remember the tale of the coffee shop: in order to be able to employ more baristas (to scale) you need to accept that for a short time the customer is without money and without coffee (eventually and not immediately consistent). And people accept that even with a web application. If your comment takes a short time (even a few minutes) to appear to everyone in the world, that's fine. Of course, you expect it to appear to the poster of the comment immediately or at least as soon as the data is sent to the server and a reply has arrived. But for others, it takes some time to load a web page over the Internet anyway and so whether the server shows that new comment to those who have started loading the page just before or just after the Send button has been pressed matters little. It's more important that the web page is always responsive (available) no matter how many are browsing at the same time (scalability).

So while the SQL databases are mainly ACID compliant, databases that are BASE-focused fit a lot better with web applications. On top of that, the following databases take the vastly changed hardware and operating system possibilities into account to become an even better fit for these purposes. One of the first databases built along these principles is CouchDB, which was first released in 2005. Cassandra was released to the public in 2008, and MongoDB was first released in 2009.

It was also in 2009 that these new, non-relational database systems got the somewhat catchy “NoSQL” moniker. This is, of course, just a buzzword, as it would be possible to drive a non-ACID compliant database with SQL (in fact, MySQL used SQL for more than a decade without being ACID compliant), but the feature set of these databases is so different from what SQL is geared to work with that it would be pointless to try. For example, none of these databases offer the ability to JOIN tables.

As MongoDB is the best fit for many Drupal tasks, I will discuss it now in detail.

MongoDB

Getting started with MongoDB is super easy: just point a browser to try.mongodb.org. It provides a tutorial and lets you play with the database without downloading anything. If you want to use it on your own computer, you can download from mongodb.org/downloads and get up and running in no time; there is no complicated configuration file to write. The Drupal integration project is at drupal.org/project/mongodb.

MongoDB fits Drupal 7 surprisingly well despite the fact that much of Drupal 7 development predated MongoDB. It's a database designed for the Web so it's no wonder it matches so well the world's best software for making web sites—Drupal.

Basically, MongoDB stores documents in a JSON-like format (the actual storage differs from JSON only in minor ways). A document is the rough equivalent of an SQL record. Any number of documents constitutes a collection, which is the rough equivalent of an SQL table. Finally, databases contain collections, much like MySQL databases contain tables.

While a MySQL table can only have fixed records, a single MongoDB collection can store any kind of documents, like so:

{ title: 'first document', length: 255 },
{ name: 'John Doe', weight: 20 }

And so on. Anything goes. There is no CREATE TABLE command because one document can wildly differ from the next.

Remember the problems with storing names in SQL? Not a problem here!

 db.people.insert({ name: ['Juan', 'Carlos', 'Alfonso', 'Víctor', 'María', 'de', 'Borbón',
'y', 'Borbón-Dos', 'Sicilias'], title: 'King of Spain'})

If you wanted to get a list of the documents of people named Carlos, you could run db.people.find({name: 'Carlos'}). It will find the king of Spain, regardless of where Carlos is in the list of names.
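From PHP, the same operations go through the Mongo extension that the Drupal module uses under the hood. Here is a minimal sketch with the legacy driver; the database name is an assumption, and on newer versions of the extension the class is MongoClient rather than Mongo:

<?php
$mongo = new Mongo();                          // Connects to localhost:27017 by default.
$people = $mongo->selectDB('example')->people; // Pick a database and a collection.
$people->insert(array(
  'name' => array('Juan', 'Carlos', 'Alfonso'),
  'title' => 'King of Spain',
));
// Matches the document no matter where 'Carlos' sits in the name array.
foreach ($people->find(array('name' => 'Carlos')) as $document) {
  print $document['title'] . "\n";
}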

Also, properties can be a lot more than just numbers or strings.

   db.test.insert({
     'title': 'This is an example',
     'body': 'This can be a very long string. The whole document is limited to a number of
megabytes.',
     'votes': 56,
     'options':
       {
          'sticky': true,
          'promoted': false,
       },
     'comments': [
       {
          'title': 'first comment',
          'author': 'joe',
          'published': true,

       },
       {
          'title': 'first comment',
          'author': 'harry',
          'homepage': 'http://example.com',
          'published': false,
       }
      ]
   });

So a document has properties (title, body, etc. in the previous example) and the properties can have all sorts of values, such as strings (title), numbers (votes), Booleans (options.sticky), arrays (comments), and more objects (also called sub-documents; an example is options). There is no limitation on how complex a document can be. There is, however, a limit on document size (it was 4MB for quite some time; it's currently 16MB but plans call for 32MB); for most web pages this limit doesn't pose a real problem. Sticking with this example, just how many comments will any post have? There's space for several thousand. If that's not enough, the actual body (which is never queried) can be stored separately.

So compared to an SQL table, the major advantages are that there is no need to specify a schema ahead of time (indeed, there is no fixed schema at all) and that values can be complex structures.

Here are a few more interesting find commands against the document shown previously:

db.test.find({title: /^This/});
db.test.find({'options.sticky': true});
db.test.find({'comments.author': 'joe'});

As you can see, the find command uses the same JSON documents as insert. The first find command shown uses a regular expression to find posts where the title starts with "This". The second shows how easy it is to search inside objects. The third shows that arrays of objects are equally easy. It's also possible to back all three of these queries with indexes.

If you look at a Drupal 7 entity with fields, it's actually an object containing arrays. You can store and index those as a whole in MongoDB. And that's what matters most: in SQL, while you could store entities that have the same structure (for example, nodes of the same type) in a single table, if you need to query across these (for example, to show all the recent content favorited by a user, regardless of whether they are articles or photos), then you can't index that query. Remember, to make such a query fast, you need to denormalize; in other words, keep copies of the data you need to query together in a single table. This is hard to maintain and slow to upgrade because every piece of data has so many copies. In MongoDB, you can easily run this cross-content-type query because all nodes can sit in a single collection and you can just create an index on 'favorite' and 'created' and you're all set.
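As a sketch of that last point, using the mongodb module's helper and the same collection the stats.php example later in this chapter uses; the field name 'favorite' and its 'value' property are illustrative assumptions about how your fields are named:

// One compound index covers both the filter and the ordering.
$nodes = mongodb_collection('fields_current', 'node');
$nodes->ensureIndex(array('favorite.value' => 1, 'created' => -1));
// The ten most recently created pieces of favorited content, regardless of type.
$cursor = $nodes->find(array('favorite.value' => 1))
  ->sort(array('created' => -1))
  ->limit(10);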

In summary, SQL is not able to store complex structures. While MongoDB is not the only solution to this problem, it is much simpler to implement and maintain than SQL's approach of using multiple tables and denormalization. This problem existed in Drupal 6 with CCK but became central in Drupal 7 with entities and field API.

To use MongoDB as the default field storage engine, add

$conf['field_storage_default'] = 'mongodb_field_storage';

to settings.php (in sites/default or the relevant sites directory). If you are creating fields from code, then setting the 'storage' key of the field array also works. The UI doesn't offer this choice, so if you are solely using the UI, you can't pick the storage per field.
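A minimal sketch of that code approach; the field name and type are illustrative:

$field = array(
  'field_name' => 'field_views',
  'type' => 'number_integer',
  'cardinality' => 1,
  // Route this field's data to MongoDB instead of the default SQL storage.
  'storage' => array('type' => 'mongodb_field_storage'),
);
field_create_field($field);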

There is even more to MongoDB than being a very nice field storage engine, although that's probably the best thing about it. But it solves other problems, too. Once again, it's not the only solution but it's a very nice one.

First, let's look at some more features relating to writes. Let's increment the votes in our test document, like so:

o = db.test.findOne({nid: 12345678});
o.votes++;
db.test.save(o);

While this code works, it's ugly and prone to a race condition. Race conditions are the bane of developers as they are extremely hard to reproduce but can produce mysterious failures. Let's look at an example. Here is what can happen in the listed order when two users try to vote:

1. User A runs o = db.test.findOne({nid: 12345678}); votes are at 56.

2. User A runs o.votes++.

3. User B runs o = db.test.findOne({nid: 12345678}); votes are still at 56.

4. User A runs db.test.save(o), setting votes to 57.

5. User B runs o.votes++.

6. User B runs db.test.save(o), setting votes to 57 again; User A's vote is lost.

In MongoDB, you can do instead:

   db.test.update({nid: 12345678}, {$inc : { votes : 1 }});
  • User A runs this command, votes are incremented by 1 to 57
  • User B runs this command, votes are incremented by 1 to 58

This is an atomic operation. There is no possibility for errors caused by a race condition. Note that update also operates with JSON documents; there are special update operators like $inc shown but it's still the same syntax.

You can also specify a multiple-document update:

db.test.update({nid: {'$in': [123, 456]}}, {$inc : { votes : 1 }}, {multi: true});

and while single documents will be incremented atomically (without a race condition), the multiple-document update as a whole is not atomic. When I listed transaction properties previously, I mentioned that in the context of a transaction, atomicity means that either all documents update or none. This is not the case here, as there are no transactions in MongoDB: it's possible that the update of one document fails but this has no consequence for the other documents, which update successfully. MongoDB also lacks another transaction capability, namely isolation: while one document is already updated and visible to every client, the others can still have their old values.

Atomic per-document increments make MongoDB ideal for storing various statistics. For example, you might want to store and display the number of times a node is viewed. First, add a numeric field to the node called 'views.' Next, you will see a little script that takes a node ID as an argument and increments the value of the views field by one. Note that it emits the same headers Drupal does to stop browsers and proxies from caching it, and finally it prints a 1x1 transparent GIF.

<?php
if (!empty($_GET['nid']) && $_GET['nid'] == (int) $_GET['nid']) {
 define('DRUPAL_ROOT', getcwd());
 require_once DRUPAL_ROOT . '/includes/bootstrap.inc';
 drupal_bootstrap(DRUPAL_BOOTSTRAP_CONFIGURATION);

 require_once DRUPAL_ROOT . '/sites/all/modules/mongodb/mongodb.module';
 $find = array('_id' => (int) $_GET['nid']);
 $update = array('$inc' => array('views.value' => 1));
 mongodb_collection('fields_current', 'node')->update($find, $update);
}

header('Content-type: image/gif'); header('Content-length: 43');
header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
header("Expires: Sun, 19 Nov 1978 05:00:00 GMT");
header("Cache-Control: must-revalidate");
printf('%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c
', 71, 73, 70, 56, 57, 97, 1, 0, 1, 0, 128, 0, 0, 0, 0, 0, 0, 0, 0, 33, 249, 4, 1, 0, 0, 0, 0,
44, 0, 0, 0, 0, 1, 0, 1, 0, 0, 2, 2, 68, 1, 0, 59);

Now all you need to do is save this as stats.php in your Drupal root and add an <img src="<?php print base_path() . 'stats.php?nid=' . $node->nid; ?>"/> tag in your node.tpl.php. By counting node views this way, the view is counted even if the page is served from the page cache or from Varnish.

Is this going to kill the site's performance? That depends. It does cause a PHP hit every time someone loads a page, which may or may not be acceptable. For most sites, it's OK. If the site works with that, MongoDB won't cause a problem. Given that MongoDB works with the database in memory and only occasionally writes back to disk, writes are a lot faster because they only need to happen in memory. Also, MongoDB works hard to update values in place so that indexes don't need to be updated needlessly.

New in 2011 is the ability not to lose any writes even when using a single server. One of the biggest complaints about MongoDB was that, because it was only writing back occasionally, data might get lost if the server crashed. Previously this was mitigated by making the application wait for the write to be replicated to another server, hoping that two servers won't crash at the same time. But now, if mongod is started with the --dur option, writes don't get lost.

Watchdog, Session, and Queue

Drupal has other areas that write a lot: watchdog and session. The session subsystem keeps the user logged in; therefore it needs to write on every single page load. The fast writes of MongoDB make this a non-issue. Once you have mongod running and the Drupal module installed, just add

$conf['session_inc'] = DRUPAL_ROOT
.  '/sites/all/modules/mongodb/mongodb_session/mongodb_session.inc';

to settings.php (in sites/default or the relevant sites directory) and MongoDB takes over the sessions, speeding up the site again.

Watchdog is problematic because if for some reason there are a lot of error messages (like a massive worm outbreak that tries to retrieve nonexistent URLs), an SQL table might grow to such a size that the only viable option is to trash it completely (TRUNCATE). Deleting older rows might not be able to keep up with the torrent of writes, plus the writes themselves are a bit slow. The traditional solution is to use syslog, which is just a text file, so it's not the simplest thing to query. With MongoDB, you can specify that a collection should only keep the last N documents and then automatically start over, overwriting the oldest messages, so there will never be more than the specified number of messages, which keeps the log easy and convenient to query. Also, the Drupal implementation of watchdog puts different message types into different collections, so there won't be more than the specified number of messages recorded for each type. For example, “comment created” messages won't be crowded out by PHP error messages. To use this facility, enable the mongodb_watchdog module and disable the dblog module.
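Under the hood this relies on MongoDB's capped collections: fixed-size collections that overwrite their oldest documents when they fill up. The watchdog module creates these for you; purely as an illustration of the mechanism, here is a sketch using the module's mongodb() helper and the legacy PHP driver (the collection name, size, and document limit are made up, and newer driver versions take an options array instead of positional arguments):

// A capped collection of at most 10,000 documents in roughly 10 MB;
// once full, the oldest documents are overwritten automatically.
$db = mongodb();
$db->createCollection('watchdog_example', TRUE, 10 * 1024 * 1024, 10000);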

Finally, Drupal 7 has a message queue, which can also be implemented with MongoDB. There are many queue implementations, but if you have already deployed MongoDB for field storage, watchdog, and sessions, then the write speed of MongoDB alone makes it handy enough to use instead of SQL. Just add

$conf['queue_default_class'] = 'MongoDBQueue';

to settings.php and you're done. You have seen how to utilize MongoDB as a field storage engine, to store session data, to log watchdog messages, and as a queue mechanism. It's also possible to use it as a cache, but for that purpose memcached is simply better (because of its trivial scalability).
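Whichever backend is configured, module code talks to the queue through the same API, so swapping in the MongoDB implementation requires no code changes. A small sketch (the queue name and payload are illustrative):

// Producer: push an item onto a named queue.
$queue = DrupalQueue::get('example_tasks');
$queue->createQueue();
$queue->createItem(array('nid' => 123, 'op' => 'reindex'));

// Consumer, for example from hook_cron(): claim, process, delete.
if ($item = $queue->claimItem(60)) {
  // ... do the work with $item->data ...
  $queue->deleteItem($item);
}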

Null Values in MongoDB

So far you have seen how MongoDB solves some of the SQL problems. Let's see whether it ameliorates SQL's previously discussed weirdness with NULLs (and while you do that, you'll learn more about MongoDB and see more examples of MongoDB queries).

The way MongoDB handles NULL is a little bit saner than SQL but it surely has its own NULL quirks. The following find is completely valid and shows that NULL values need no special operator:

db.test.find({something: null});

This will find the documents where something has a NULL value. Very easy. NULL is a type in itself, and comparing different types is always false because MongoDB is strictly typed and never casts values for you. So comparing a number to NULL is always false, but the same is true of comparing a number to a string. This is not really a problem—just something you need to be aware of. An example will be shown soon. At least MongoDB does not employ three-valued logic; comparisons can only be true or false, never NULL.

However, there is a caveat to NULLs: nonexistent values are treated as NULLs.

> db.test.drop()
> db.test.insert({a:1});
> db.test.insert({something:null});
> db.test.insert({something:1});   
> db.test.find({something:null});
{ "a" : 1 }
{ "something" : null }#

To actually find what you wanted, do this:

> db.test.find({something:{$in: [null], $exists:true}});
{ "something" : null }

The new operators are $in and $exists. And here is the example for comparing NULL to 1:

> db.test.find({something: {$gte: null }});
{ "a" : 1 }
{ "something" : null }

Once again, the first document doesn't contain a something property, so when comparing, it is treated as NULL and matches. The second document contains a something: NULL pair and the operation is “greater than or equal” ($gte), so it matches again. You also have a document where something is 1, but that won't match; a number is a different type and is never considered greater than or equal to NULL.

Summary

As you've seen, thinking about scaling in the early stages of a site is not always a high priority. However, it always pays to think about it early before your headaches really start. This chapter went over why you should care about scaling early and what techniques are available in Drupal 7 to sort out scaling.

The main focus was on databases, because they are absolutely integral to scaling in Drupal. Caching and the like are of course useful, too, so you looked into that as well. All in all, this chapter's changes will take your Drupal site some way towards scaling effectively.

Tip Stay up to speed with occasional updates, discussion, and resources at dgd7.org/scale.
