Chapter 2. Implementation Tips

Tip #14: Use the correct types

Storing data using the correct types will make your life easier. Data type affects how data can be queried, the order in which MongoDB will sort it, and how many bytes of storage it takes up.

Numbers

Any field you’ll be using as a number should be saved as a number. That applies whenever you want to increment the value or sort it in numeric order. But what kind of number? Often it doesn’t matter; sometimes it does.

Sorting compares all numeric types equally: if you had a 32-bit integer, a 64-bit integer, and a double with values 2, 1, and 1.5, they would end up sorted in the correct order. However, certain operations demand certain types: bit operations (AND and OR) only work on integer fields (not doubles).

The database will automatically turn 32-bit integers into 64-bit integers if they are going to overflow (due to an $inc, say), so you don’t have to worry about that.
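As a rough analogy in plain Python (not MongoDB itself), the sketch below shows mixed numeric types sorting together, a bit operation rejecting a double, and the jump from a 32-bit to a 64-bit encoding once a value stops fitting. The helper names bitwise_and and encoded_width are made up for illustration.

```python
import struct

# An integer, another integer, and a double: they still sort numerically.
values = [2, 1, 1.5]
in_order = sorted(values)

def bitwise_and(value, mask):
    """Bitwise AND is only defined here for integers, not doubles."""
    if not isinstance(value, int):
        raise TypeError("bit operations need an integer field")
    return value & mask

def encoded_width(n: int) -> int:
    """Bytes needed for n as a signed 32-bit integer if it fits, else 64-bit."""
    try:
        return len(struct.pack("<i", n))
    except struct.error:
        # Value is outside the signed 32-bit range, so fall back to 64 bits.
        return len(struct.pack("<q", n))
```

The width helper only mimics the promotion described above; the database does this for you.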

Dates

As with numbers, exact dates should be saved using the date type. However, dates such as birthdays are not exact; who knows their birth time down to the millisecond? For dates such as these, it often works just as well to use ISO-format dates: a string of the form yyyy-mm-dd. This will sort birthdays correctly and match them more flexibly than if you used dates, which force you to match birthdays to the millisecond.
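Here is a small sketch of why yyyy-mm-dd strings behave well, using plain Python lists in place of a collection. The names and dates are invented for the example.

```python
# Hypothetical birthday records stored as ISO-format strings (yyyy-mm-dd).
birthdays = [
    {"name": "Ada", "birthday": "1990-12-10"},
    {"name": "Grace", "birthday": "1985-12-09"},
    {"name": "Alan", "birthday": "1990-06-23"},
]

# Lexicographic order on zero-padded ISO strings is also chronological order.
by_date = sorted(birthdays, key=lambda doc: doc["birthday"])
sorted_names = [doc["name"] for doc in by_date]

# Matching a whole day is a plain string comparison; no milliseconds involved.
born_on_day = [d["name"] for d in birthdays if d["birthday"] == "1990-06-23"]

# A prefix match finds everyone born in a given year (or year and month).
born_in_1990 = [d["name"] for d in birthdays if d["birthday"].startswith("1990-")]
```

The zero padding matters: "1990-6-23" would sort after "1990-12-10".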

Strings

All strings in MongoDB must be UTF-8 encoded, so strings in other encodings must be either converted to UTF-8 or saved as binary data.
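If you know the source encoding, converting is short. The sketch below assumes Latin-1 input purely for illustration, and is_valid_utf8 is a made-up helper for checking bytes before you try to save them as a string.

```python
# Bytes in a legacy encoding (Latin-1 here, chosen only for illustration).
legacy_bytes = "café".encode("latin-1")

# Decode with the known source encoding, then re-encode as UTF-8.
text = legacy_bytes.decode("latin-1")
utf8_bytes = text.encode("utf-8")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# If the source encoding is unknown, keep the raw bytes and store them
# as binary data rather than guessing.
```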

ObjectIds

Always save ObjectIds as ObjectIds, not as strings. This is important for several reasons. First, queryability: strings do not match ObjectIds and ObjectIds do not match strings. Second, ObjectIds are useful: most drivers have methods that can automatically extract the date a document was created from its ObjectId. Finally, the string representation of an ObjectId takes up more than twice as much space on disk as the ObjectId itself.
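The sketch below shows both points without a driver: an ObjectId is 12 bytes whose first 4 bytes are a big-endian timestamp in seconds, while its hex string form is 24 characters. The id value and the creation_time helper are made up for the example; your driver will have its own method for this.

```python
from datetime import datetime, timezone

# Hypothetical ObjectId, written in its 24-character hex string form.
oid_hex = "4e7020cb7cac81af7136236b"

# The ObjectId itself is 12 raw bytes; the hex string is 24 characters.
oid_bytes = bytes.fromhex(oid_hex)

def creation_time(oid: bytes) -> datetime:
    """Decode the leading 4-byte big-endian timestamp (seconds since the epoch)."""
    seconds = int.from_bytes(oid[:4], "big")
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

created = creation_time(oid_bytes)
```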

Tip #15: Override _id when you have your own simple, unique id

If your data doesn’t have a naturally occurring unique field (often the case), go ahead and use the default ObjectId for _ids. However, if your data does have a unique field and you don’t need the properties of an ObjectId, then go ahead and override the default _id—use your own unique value. This saves a bit of space and is particularly useful if you were going to index your unique id, as this will save you an entire index in space and resources (a very significant savings).

There are a couple of reasons not to use your own _id that you should consider. First, you must be very confident that it is unique or be willing to handle duplicate key exceptions. Second, you should keep in mind the tree structure of an index (see Tip #22: Use indexes to do more with less memory) and how random or non-random your insertion order will be. ObjectIds have an excellent insertion order as far as the index tree is concerned: they are always increasing, meaning they are always being inserted at the right edge of the B-tree. This, in turn, means that MongoDB only has to keep the right edge of the B-tree in memory.

Conversely, a random value in the _id field means that _ids will be inserted all over the tree. Then the machine must move a page of the index into memory, update a tiny piece of it, then probably ignore it until it slides out of memory again. This is less efficient.
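To get a feel for the difference, here is a toy model, not a real B-tree: a sorted list cut into fixed-size "pages", counting how many distinct pages the most recent inserts landed on. PAGE_SIZE, the key counts, and pages_touched are all arbitrary choices for the illustration.

```python
import bisect
import random

PAGE_SIZE = 100  # hypothetical number of keys per index page

def pages_touched(keys, window):
    """Insert keys into a sorted list and count the distinct pages
    (fixed-size slices of the sorted order) hit by the last `window` inserts."""
    index = []
    touched = set()
    for i, key in enumerate(keys):
        position = bisect.bisect_left(index, key)
        index.insert(position, key)
        if i >= len(keys) - window:
            touched.add(position // PAGE_SIZE)
    return len(touched)

random.seed(0)
n = 10_000
increasing = list(range(n))            # like ObjectIds: always growing
shuffled = random.sample(range(n), n)  # like random _id values

right_edge_pages = pages_touched(increasing, window=200)
scattered_pages = pages_touched(shuffled, window=200)
```

With increasing keys the recent inserts stay on the last couple of pages; with shuffled keys they spread across most of the index.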

Tip #16: Avoid using a document for _id

You should almost never use a document as your _id value, although it may be unavoidable in certain situations (such as the output of a MapReduce). The problem with using a document as _id is that indexing a document is very different from indexing the fields within a document. So, if you aren’t planning to query for the whole subdocument every time, you may end up with multiple indexes on _id, _id.foo, _id.bar, etc., anyway.

You also cannot change _id without overwriting the entire document, so it’s impractical to use it if fields of the subdocument might change.

Tip #17: Do not use database references

Note

This tip is specifically about the special database reference subdocument type, not references (as discussed in the previous chapter) in general.

Database references are normal subdocuments of the form {$id : identifier, $ref : collectionName} (they can, optionally, also have a $db field for the database name). They feel a bit relational: you’re sort of referencing a document in another collection. However, you’re not really referencing another collection; this is just a normal subdocument. It does absolutely nothing magical. MongoDB cannot dereference database references on the fly; they are not a way of doing joins in MongoDB. They are just subdocuments holding an _id and collection name. This means that, in order to dereference them, you must query the database a second time.

If you are referencing a document but already know the collection, you might as well save the space and store just the _id, not the _id and the collection name. A database reference is a waste of space unless you do not know what collection the referenced document will be in.
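The sketch below uses plain dicts as stand-ins for collections to show that resolving a database reference is just a second lookup, and that a bare _id does the same job when you already know the collection. Every name and value here (find_one, dereference, the users and posts data) is hypothetical.

```python
# In-memory stand-ins for collections, keyed by collection name, then _id.
database = {
    "users": {1: {"_id": 1, "name": "kristina"}},
    "posts": {
        # This post uses a database reference subdocument.
        10: {"_id": 10, "title": "Hello",
             "author": {"$ref": "users", "$id": 1}},
        # This post stores just the _id of the referenced document.
        11: {"_id": 11, "title": "Again", "author_id": 1},
    },
}

def find_one(collection, _id):
    """Stand-in for a single query against one collection."""
    return database[collection].get(_id)

def dereference(dbref):
    """Resolving a database reference is just another query."""
    return find_one(dbref["$ref"], dbref["$id"])

post = find_one("posts", 10)          # first query
author = dereference(post["author"])  # second query

# With a known target collection, the bare _id carries the same information.
other_post = find_one("posts", 11)
same_author = find_one("users", other_post["author_id"])
```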

The only time I’ve heard of a database reference being used to good effect was for a system that allowed users to comment on anything in the system. They had a comments collection, and stored comments in that with references to nearly every other collection and database in the system.

Tip #18: Don’t use GridFS for small binary data

GridFS requires two queries: one to fetch a file’s metadata and one to fetch its contents (Figure 2-1). Thus, if you use GridFS to store small files, you are doubling the number of queries that your application has to do. GridFS is basically a way of breaking up large binary objects for storage in the database.

Figure 2-1. GridFS breaks up large pieces of data and stores them in chunks.

GridFS is for storing big data—larger than will fit in a single document. As a rule of thumb, anything that is too big to load all at once on the client is probably not something you want to load all at once on the server. Therefore, anything you’re going to stream to a client is a good candidate for GridFS. Things that will be loaded all at once on the client, such as images, sounds, or even small video clips, should generally just be embedded in your main document.
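To make the two-query shape concrete, here is a minimal in-memory sketch of the idea: one metadata document per file, plus one document per chunk. It is not a driver API; put, get, and the tiny CHUNK_SIZE are assumptions for illustration.

```python
CHUNK_SIZE = 4  # bytes; tiny on purpose, real chunk sizes are far larger

files = {}   # stand-in for the file metadata collection
chunks = []  # stand-in for the chunks collection

def put(file_id, filename, data):
    """Store one metadata document plus one document per chunk."""
    files[file_id] = {"_id": file_id, "filename": filename,
                      "length": len(data), "chunkSize": CHUNK_SIZE}
    for n, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunks.append({"files_id": file_id, "n": n,
                       "data": data[start:start + CHUNK_SIZE]})

def get(file_id):
    """Two lookups: the metadata first, then the chunks in order."""
    meta = files[file_id]                                  # query 1
    parts = sorted((c for c in chunks if c["files_id"] == file_id),
                   key=lambda c: c["n"])                   # query 2
    data = b"".join(c["data"] for c in parts)
    if len(data) != meta["length"]:
        raise ValueError("missing or corrupt chunks")
    return data

put(1, "greeting.txt", b"hello, gridfs")
```

For a 13-byte file this is pure overhead, which is the point of the tip.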

Tip #19: Handle “seamless” failover

Often people have heard that MongoDB handles failover seamlessly and are surprised when they start getting exceptions. MongoDB tries to recover from failures without intervention, but handling certain errors is impossible for it to do automatically.

Suppose that you send a request to the server and you get back a network error. Now your driver has a couple of options.

If the driver can’t reconnect to the database, then it obviously can’t automatically retry sending the request to that server. However, suppose you have another server that the driver knows about. Could it automatically send the request to that one? Well, it depends on what the request was. If you were going to send a write to the primary, there is probably no other primary yet. If you were going to do a read, it could be something like a long-running MapReduce to a slave that’s now down, and the driver shouldn’t send that to some other random server (the primary?). So, it can’t auto-retry to a different server.

If the error was a temporary network blip and the driver reconnects to the server immediately, it still shouldn’t try to send the request again. What if the driver sent the original message and then encountered a network error, or errored out on the database response? Then the request might already be being processed by the database, so you wouldn’t want to send it a second time.

This is a tricky problem that is often application-dependent, so the drivers punt on the issue. You must catch whatever exception is thrown on network errors (you should be able to find info about how to do this in your driver’s documentation). Handle the exception and then figure out on a request-by-request basis: do you want to resend the message? Do you need to check state on the database first? Can you just give up, or do you need to keep retrying?
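One way to structure that decision is sketched below. It uses Python's built-in ConnectionError as a stand-in for whatever exception your driver actually throws. The function send_with_retry and its safe_to_resend flag are made up for illustration; you decide per request whether resending is safe.

```python
import time

def send_with_retry(request, *, safe_to_resend, attempts=3, delay=0.1):
    """Call request(); on a network error, resend only if the caller
    has declared the request safe to repeat."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ConnectionError:
            # The server may or may not have processed the request already.
            if not safe_to_resend or attempt == attempts:
                raise
            time.sleep(delay)
```

A read or an idempotent update might pass safe_to_resend=True. Something like an increment probably should not, at least not without checking state on the database first.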

Tip #20: Handle replica set failure and failover

Your application should be able to handle all of the exciting failure scenarios that could occur with a replica set.

Suppose your application throws a “not master” error. There are a couple of possible causes for this error. Your set might be failing over to a new primary, in which case you have to handle the time during the primary’s election gracefully. The time it takes for an election varies: it’s usually a few seconds, but if you’re unlucky it could be 30 seconds or more. If you’re on the wrong side of a network partition, you might not be able to see a master for hours.

Not being able to see a master at all is an important case to handle: can your application drop into read-only mode if this happens? Your application should be able to handle being read-only for short periods (during a master election) and long periods (when a majority is down or partitioned).

Regardless of whether there’s a master, you should be able to continue sending reads to whichever members of the set you can reach.
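A rough sketch of dropping into read-only mode follows. NotPrimaryError, FakeReplicaSet, and App are all hypothetical stand-ins; a real application would catch its driver's own error type and talk to a real replica set.

```python
class NotPrimaryError(Exception):
    """Hypothetical stand-in for whatever 'not master' error a driver raises."""

class FakeReplicaSet:
    """Toy backend: reads always work, writes need a primary."""
    def __init__(self):
        self.has_primary = True
        self.data = {}

    def write(self, key, value):
        if not self.has_primary:
            raise NotPrimaryError("not master")
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

class App:
    """Drops into read-only mode when a write hits a 'not master' error."""
    def __init__(self, backend):
        self.backend = backend
        self.read_only = False

    def save(self, key, value):
        try:
            self.backend.write(key, value)
        except NotPrimaryError:
            self.read_only = True   # e.g. disable edit forms or queue the write
            return False
        self.read_only = False      # a primary is reachable again
        return True

    def load(self, key):
        # Reads keep going to whichever members are reachable.
        return self.backend.read(key)
```

Whether a failed write is dropped, queued, or reported to the user is an application decision; the sketch only tracks the mode.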

Members may briefly go through an unreadable “recovering” phase during elections: members in this state will throw errors about not being masters or secondaries if your driver tries to read from them, and may be in this state so fleetingly that these errors slip in between the pings drivers send to the database.
