Storing data using the correct types will make your life easier. Data type affects how data can be queried, the order in which MongoDB will sort it, and how many bytes of storage it takes up.
Any field you’ll be using as a number should be saved as a number. That applies whenever you want to increment the value or sort it in numeric order. But what kind of number? Often it doesn’t matter, but sometimes it does.
Sorting compares all numeric types equally: if you had a 32-bit integer, a 64-bit integer, and a double with values 2, 1, and 1.5, they would end up sorted in the correct order. However, certain operations demand certain types: bit operations (AND and OR) only work on integer fields (not doubles).
The database will automatically turn 32-bit integers into 64-bit integers if they are going to overflow (due to an $inc, say), so you don’t have to worry about that.
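As a rough illustration of why the stored type matters, here is a plain-Python analogy (it does not run MongoDB's own comparison rules). It shows that numbers saved as strings sort lexicographically rather than numerically, that mixed integer and floating-point values still sort in numeric order, and that bitwise operators reject floating-point values. All sample values are made up.

```python
# Analogy in plain Python; MongoDB itself is not involved here.
as_strings = ["10", "9", "100"]
as_numbers = [10, 9, 100]

sorted_strings = sorted(as_strings)   # lexicographic: "10" < "100" < "9"
sorted_numbers = sorted(as_numbers)   # numeric: 9 < 10 < 100

# Mixed integer and floating-point values still compare numerically.
mixed = sorted([2, 1, 1.5])

# Bitwise operators are defined for integers, not for floats.
flags = 0b0110 & 0b0100
try:
    1.5 & 1
    float_bitop_ok = True
except TypeError:
    float_bitop_ok = False
```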
Similarly to numbers, exact dates should be saved using the date type. However, dates such as birthdays are not exact; who knows their birth time down to the millisecond? For dates such as these, it often works just as well to use ISO-format dates: a string of the form yyyy-mm-dd. This will sort birthdays correctly and match them more flexibly than if you used dates, which force you to match birthdays to the millisecond.
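A small sketch of the claim about yyyy-mm-dd strings, using only Python's standard library and made-up birthdays: the strings sort in the same order as the dates they represent, and a simple prefix match selects a whole month without knowing the day or time.

```python
from datetime import date

# Hypothetical birthdays stored as ISO-format strings.
birthdays = ["1990-11-02", "1985-03-14", "1985-03-01", "2001-07-30"]

# Zero-padded yyyy-mm-dd strings sort the same way the dates they represent do.
by_string = sorted(birthdays)
by_date = sorted(birthdays, key=date.fromisoformat)

# A prefix match finds everyone born in March 1985.
march_1985 = [b for b in birthdays if b.startswith("1985-03")]
```

In a query, the same prefix match would typically be written as an anchored regular expression on the string field.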
All strings in MongoDB must be UTF-8 encoded, so strings in other encodings must be either converted to UTF-8 or saved as binary data.
ObjectIds
Always save ObjectIds as ObjectIds, not as strings. This is important for several reasons. First, queryability: strings do not match ObjectIds and ObjectIds do not match strings. Second, ObjectIds are useful: most drivers have methods that can automatically extract the date a document was created from its ObjectId. Finally, the string representation of an ObjectId is more than twice the size, on disk, of an ObjectId.
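To make the size and creation-date points concrete, the following sketch decodes an ObjectId by hand with the standard library instead of a driver. The hex value is just a sample; real drivers expose their own helper methods for this, which are not shown here.

```python
from datetime import datetime, timezone

# Sample ObjectId written as its 24-character hex string.
oid_hex = "507f1f77bcf86cd799439011"
oid_bytes = bytes.fromhex(oid_hex)

# The raw value is 12 bytes; the hex string needs 24 characters
# before any per-string storage overhead is counted.
raw_size = len(oid_bytes)
string_size = len(oid_hex)

# The first four bytes are a big-endian Unix timestamp in seconds.
created_seconds = int.from_bytes(oid_bytes[:4], "big")
created_at = datetime.fromtimestamp(created_seconds, tz=timezone.utc)
```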
If your data doesn’t have a naturally occurring unique field (often the case), go ahead and use the default ObjectId for _ids. However, if your data does have a unique field and you don’t need the properties of an ObjectId, then go ahead and override the default _id and use your own unique value. This saves a bit of space and is particularly useful if you were going to index your unique id, as this will save you an entire index in space and resources (a very significant savings).
There are a couple of reasons not to use your own _id that you should consider. First, you must be very confident that it is unique or be willing to handle duplicate key exceptions. Second, you should keep in mind the tree structure of an index (see Tip #22: Use indexes to do more with less memory) and how random or non-random your insertion order will be. ObjectIds have an excellent insertion order as far as the index tree is concerned: they are always increasing, meaning they are always being inserted at the right edge of the B-tree. This, in turn, means that MongoDB only has to keep the right edge of the B-tree in memory.
Conversely, a random value in the _id field means that _ids will be inserted all over the tree. Then the machine must move a page of the index into memory, update a tiny piece of it, then probably ignore it until it slides out of memory again. This is less efficient.
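The difference between increasing and random keys can be sketched with a toy model. This is not how MongoDB implements its B-tree: a sorted Python list stands in for the index's key order, and PAGE_SIZE is a hypothetical number of keys per page. The function measures what share of inserts land on the page where an append would go, that is, the right edge.

```python
import bisect
import random

PAGE_SIZE = 100  # hypothetical number of keys per index page


def right_edge_fraction(keys):
    """Return the share of inserts that land on the rightmost page."""
    index = []  # sorted list standing in for the index's key order
    hits = 0
    for key in keys:
        pos = bisect.bisect_right(index, key)
        # Compare the target page with the page where an append would land.
        if pos // PAGE_SIZE == len(index) // PAGE_SIZE:
            hits += 1
        index.insert(pos, key)
    return hits / len(keys)


rng = random.Random(0)
increasing = right_edge_fraction(range(10_000))
scattered = right_edge_fraction([rng.random() for _ in range(10_000)])
```

In this model every increasing key hits the right edge, while random keys mostly land on older pages that would have to be brought back into memory.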
You should almost never use a document as your _id value, although it may be unavoidable in certain situations (such as the output of a MapReduce). The problem with using a document as _id is that indexing a document is very different from indexing the fields within a document. So, if you aren’t planning to query for the whole subdocument every time, you may end up with multiple indexes on _id, _id.foo, _id.bar, etc., anyway.
You also cannot change _id without overwriting the entire document, so it’s impractical to use it if fields of the subdocument might change.
This tip is specifically about the special database reference subdocument type, not references (as discussed in the previous chapter) in general.
Database references are normal subdocuments of the form {$id : identifier, $ref : collectionName} (they can, optionally, also have a $db field for the database name). They feel a bit relational: you’re sort of referencing a document in another collection. However, you’re not really referencing another collection; this is just a normal subdocument. It does absolutely nothing magical. MongoDB cannot dereference database references on the fly; they are not a way of doing joins in MongoDB. They are just subdocuments holding an _id and collection name. This means that, in order to dereference them, you must query the database a second time.
If you are referencing a document but already know the collection, you might as well save the space and store just the _id, not the _id and the collection name. A database reference is a waste of space unless you do not know what collection the referenced document will be in.
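The following sketch illustrates the point with in-memory dictionaries standing in for collections. The data and the helper names find_one and dereference are hypothetical, not a driver API. Resolving a database reference is just a second lookup, and when the target collection is already known, a bare _id carries the same information.

```python
# In-memory stand-ins for collections, keyed by _id (hypothetical data).
database = {
    "posts": {1: {"_id": 1, "title": "Types matter"}},
    "photos": {7: {"_id": 7, "caption": "Sunset"}},
    "comments": {
        # May point at any collection, so the reference must name it.
        100: {"_id": 100, "text": "Nice!", "target": {"$ref": "photos", "$id": 7}},
        # Only ever points at posts, so the bare _id is enough.
        101: {"_id": 101, "text": "Agreed", "post_id": 1},
    },
}

lookups = 0


def find_one(collection, _id):
    """Stand-in for one round trip to the database."""
    global lookups
    lookups += 1
    return database[collection].get(_id)


def dereference(ref):
    """Resolve a reference subdocument with a second, separate lookup."""
    return find_one(ref["$ref"], ref["$id"])


comment = find_one("comments", 100)        # first query
target = dereference(comment["target"])    # second query
lookups_for_dbref = lookups

post_comment = find_one("comments", 101)
post = find_one("posts", post_comment["post_id"])
```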
The only time I’ve heard of a database reference being used to good effect was for a system that allowed users to comment on anything in the system. They had a comments collection, and stored comments in that with references to nearly every other collection and database in the system.
GridFS requires two queries: one to fetch a file’s metadata and one to fetch its contents (Figure 2-1). Thus, if you use GridFS to store small files, you are doubling the number of queries that your application has to do. GridFS is basically a way of breaking up large binary objects for storage in the database.
GridFS is for storing big data—larger than will fit in a single document. As a rule of thumb, anything that is too big to load all at once on the client is probably not something you want to load all at once on the server. Therefore, anything you’re going to stream to a client is a good candidate for GridFS. Things that will be loaded all at once on the client, such as images, sounds, or even small video clips, should generally just be embedded in your main document.
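To show why reading a chunked file costs two lookups, here is a toy model of the idea using plain Python containers. It is a sketch, not the GridFS implementation: the field names are loosely modeled on GridFS conventions, and the tiny CHUNK_SIZE is a hypothetical value chosen to keep the example readable.

```python
CHUNK_SIZE = 4  # hypothetical, tiny so the example stays readable

files = {}    # stand-in for the metadata collection
chunks = []   # stand-in for the chunks collection


def put(file_id, data):
    """Store metadata once and the content as fixed-size chunks."""
    files[file_id] = {"_id": file_id, "length": len(data), "chunkSize": CHUNK_SIZE}
    for n, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunks.append(
            {"files_id": file_id, "n": n, "data": data[start:start + CHUNK_SIZE]}
        )


def get(file_id):
    """Read a file back: one lookup for metadata, one for its chunks."""
    meta = files[file_id]                                     # lookup 1
    parts = [c for c in chunks if c["files_id"] == file_id]   # lookup 2
    parts.sort(key=lambda c: c["n"])
    data = b"".join(c["data"] for c in parts)
    if len(data) != meta["length"]:
        raise ValueError("stored chunks do not match the recorded length")
    return data


put("report", b"0123456789")
restored = get("report")
```

For a file small enough to embed in the main document, both lookups would collapse into the single query that fetches that document.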
Often people have heard that MongoDB handles failover seamlessly and are surprised when they start getting exceptions. MongoDB tries to recover from failures without intervention, but handling certain errors is impossible for it to do automatically.
Suppose that you send a request to the server and you get back a network error. Now your driver has a couple of options.
If the driver can’t reconnect to the database, then it obviously can’t automatically retry sending the request to that server. However, suppose you have another server that the driver knows about. Could it automatically send the request to that one? Well, it depends on what the request was. If you were going to send a write to the primary, there is probably no other primary yet. If you were going to do a read, it could be something like a long-running MapReduce to a slave that’s now down, and the driver shouldn’t send that to some other random server (the primary?). So, it can’t auto-retry to a different server.
If the error was a temporary network blip and the driver reconnects to the server immediately, it still shouldn’t try to send the request again. What if the driver sent the original message and then encountered a network error, or errored out on the database response? Then the request might already be being processed by the database, so you wouldn’t want to send it a second time.
This is a tricky problem that is often application-dependent, so the drivers punt on the issue. You must catch whatever exception is thrown on network errors (you should be able to find info about how to do this in your driver’s documentation). Handle the exception and then figure out on a request-by-request basis: do you want to resend the message? Do you need to check state on the database first? Can you just give up, or do you need to keep retrying?
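One way to structure the request-by-request decision is sketched below. It is only an illustration: Python's built-in ConnectionError stands in for whatever exception your driver raises on network errors, and send_with_retry, safe_to_resend, and FlakyServer are hypothetical names invented for this example.

```python
import time


def send_with_retry(request, safe_to_resend, attempts=3, delay=0.0):
    """Call request(); resend after a network error only if the caller allows it."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ConnectionError:
            # The server may already have processed the message, so only
            # requests the application marked as safe are sent again.
            if not safe_to_resend or attempt == attempts:
                raise
            time.sleep(delay)


class FlakyServer:
    """Hypothetical server stub that fails its first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def read(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("network blip")
        return {"status": "ok"}


server = FlakyServer(failures=2)
result = send_with_retry(server.read, safe_to_resend=True)
```

A request that is not safe to repeat would be called with safe_to_resend=False, so the error reaches the application, which can then check state on the database before deciding what to do.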
Your application should be able to handle all of the exciting failure scenarios that could occur with a replica set.
Suppose your application throws a “not master” error. There are a couple of possible causes for this error: your set might be failing over to a new primary and you have to handle the time during the primary’s election gracefully. The time it takes for an election varies: it’s usually a few seconds, but if you’re unlucky it could be 30 seconds or more. If you’re on the wrong side of a network partition, you might not be able to see a master for hours.
Not being able to see a master at all is an important case to handle: can your application drop into read-only mode if this happens? Your application should be able to handle being read-only for short periods (during a master election) and long periods (when a majority is down or partitioned).
Regardless of whether there’s a master, you should be able to continue sending reads to whichever members of the set you can reach.
Members may briefly go through an unreadable “recovering” phase during elections: members in this state will throw errors about not being masters or secondaries if your driver tries to read from them, and may be in this state so fleetingly that these errors slip in between the pings drivers send to the database.
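Dropping into read-only mode could look roughly like the sketch below. Everything here is hypothetical: NotMasterError stands in for the driver's real “not master” exception, and FakeSet is a stub instead of a real replica set connection.

```python
class NotMasterError(Exception):
    """Hypothetical stand-in for a driver's "not master" exception."""


class FakeSet:
    """Stub replica set that accepts writes only while it has a primary."""

    def __init__(self):
        self.has_primary = True
        self.docs = []

    def write(self, doc):
        if not self.has_primary:
            raise NotMasterError("not master")
        self.docs.append(doc)


class App:
    def __init__(self, write_fn):
        self.write_fn = write_fn
        self.read_only = False

    def save(self, doc):
        """Try a write; switch to read-only mode if no primary is visible."""
        try:
            self.write_fn(doc)
        except NotMasterError:
            # Keep serving reads, but tell the caller the write did not happen.
            self.read_only = True
            return False
        self.read_only = False
        return True


replica_set = FakeSet()
app = App(replica_set.write)

replica_set.has_primary = False          # election or partition in progress
saved_during_outage = app.save({"x": 1})
mode_during_outage = app.read_only

replica_set.has_primary = True           # a primary is visible again
saved_after = app.save({"x": 2})
```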