CHAPTER 5: GridFS

We live in a world of high-definition video, 12MP cameras, and storage media that can hold 50GB of data on a disc the size of a CD-ROM. In that context, the 16MB limit for the maximum size of a MongoDB document might seem laughably inadequate. Indeed, you might wonder why MongoDB, which has been designed as a database for today’s high-tech age, has such a seemingly strange limitation. The short answer is performance.

If data were stored in the document itself, it would obviously get very large, which in turn would make the data harder to work with. For example, pulling back the whole document would require loading the files in the document, as well. You could work around this issue, but you would still need to pull back the entire file whenever you accessed it, even if you only wanted a small section of it. You can’t ask for a chunk of data in the middle of a document—it’s an all-or-nothing proposition. Fortunately, MongoDB features a unique and somewhat elegant solution to this problem. MongoDB enables you to store large files quite easily, yet it also allows you to access parts of the file without retrieving the entire thing—all while maintaining high performance. It achieves this by leveraging a specification known as GridFS.

Note One interesting thing about GridFS is that it isn’t actually a software feature. For example, there isn’t any special server-side code in MongoDB that manages GridFS. Instead, GridFS is a simple specification used by all of the supported drivers on MongoDB. The key benefit of such a specification is that files stored by one driver can be accessed by any other driver that follows the same convention.

This approach adheres closely to the MongoDB principle of keeping things simple. Because GridFS uses standard MongoDB features, it’s easy to implement and work with the specification from the driver’s point of view. It also means you can poke around by hand if you really want to because, to MongoDB, files stored under the GridFS specification are just documents in normal collections.

Filling in Some Background

Chapter 1 touched on the fact that we have been taught to use databases for even simple storage for many years. For example, the book one of us bought to help improve his PHP more than 15 years ago introduced MySQL in Chapter 3. Considering the complexity of SQL and databases in the real world (not to mention in theory), you might wonder why a book intended for beginners would practically start off with SQL. After all, it was a PHP book and not a MySQL book.

One thing most people don’t appreciate until they try it is that reading and writing data directly to disk is hard. Some people don’t agree with us on this point; after all, opening and reading files in PHP might seem trivial. And it is: in simpler scenarios, working with files is rather painless when using PHP. If all you want to do is read in lines and process them, you’re unlikely to have any trouble.

On the other hand, things become a lot harder if you want to search a file or store complicated or structured data. Even if you can work out how to do this and create a solution, your solution is unlikely to be faster or more efficient than relying on a database instead. Today’s applications depend on finding and storing data quickly—and databases make this possible for those of us who can’t or don’t want to write such a system ourselves.

One area that is glossed over by many books is the storing of files. Most books that teach you to use a database to store your data also teach you to read and write to the filesystem instead when you need to store files. In some ways, this isn’t usually a problem, because it’s much easier to read and write simple files than to process what’s in them. There are some issues, however. First, the developer must have permission to write those files in the first place, and that requires giving the web server permission to write to the local filesystem. This might not seem likely to pose a problem, but it gives system administrators nightmares—getting files onto a server is the first stage in being able to compromise it.

Databases can store binary files; typically, it’s just not elegant for them to do so. MySQL has a special column type called BLOB. PostgreSQL requires special procedures to be followed to store such files—and the data isn’t stored in the table itself. In other words, it’s messy. These solutions are obviously bolt-ons. Thus, it’s not surprising that people choose to write data to the disk instead. But that approach also has issues. Apart from the problems with security, it adds another directory that needs to be backed up, and you must also ensure that this information is replicated to all the appropriate servers. There are filesystems that provide the ability to write to disk and have that content fully replicated (including GFS); but these solutions are complex and add overhead; moreover, these features typically make your solution harder to maintain.

MongoDB, on the other hand, enforces a maximum document size of 16MB. This is more than enough for storing rich documents, and it might have sufficed a few years ago for storing many other types of files as well. However, this limit is wholly inadequate for today’s environment.

Working with GridFS

Next, we’ll take a brief look at how GridFS is implemented. As the MongoDB website points out, you do not need to understand or be aware of the underlying implementation of GridFS to use it. In fact, you can simply let the driver handle the heavy lifting for you. For the most part, the drivers that support GridFS implement file handling in a language-specific way. For example, the MongoDB driver for Python works in a manner that is wholly consistent with Python, as you’ll see shortly. If the ins-and-outs of GridFS don’t interest you, then just skip ahead to the next section. We promise you won’t miss anything that enables you to use MongoDB effectively!

GridFS consists of two parts. More specifically, it consists of two collections. One collection holds the filename and related information such as size (called metadata), while the other collection holds the file data itself, usually in 256K chunks. The specification calls for these to be named files and chunks, respectively. By default, the files and chunks collections are created in the fs namespace, but this can be changed. The ability to change the default namespace is useful if you want to store different types of files. For example, you might want to keep image and movie files separate.
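To make the two-collection layout more concrete, here is a minimal in-memory sketch in Python. It doesn’t talk to MongoDB at all: the helper names split_into_documents and reassemble are our own, and the file_id argument is only a stand-in for the ObjectId a real driver would generate. The keys mirror the ones you will see in the files and chunks documents later in this chapter.

```python
# In-memory sketch of the GridFS layout: one "files" document plus
# numbered "chunks" documents. Nothing here talks to MongoDB.

DEFAULT_CHUNK_SIZE = 256 * 1024  # 262144 bytes, the default used in this chapter


def split_into_documents(file_id, filename, data, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (files_doc, chunk_docs) for the given bytes.

    file_id is a hypothetical stand-in for a driver-generated ObjectId.
    """
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


def reassemble(chunk_docs):
    """Rebuild the original bytes by joining the chunks in order of n."""
    ordered = sorted(chunk_docs, key=lambda doc: doc["n"])
    return b"".join(doc["data"] for doc in ordered)
```

Each chunk document points back to its parent through files_id, and n records the chunk’s position, which is all a driver needs to put the file back together.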

Getting Started with the Command-Line Tools

Now that we have some of the background out of the way, let’s look at how to get started with GridFS by exploring the command-line tools available to leverage it. First, we will need a file to play with. To keep things simple, let’s use the dictionary file. On Ubuntu, you can find this at /usr/share/dict/words. However, there are various levels of symbolic links, so you might want to run this command first:

root@core2:/usr/share/dict# cat words > /tmp/dictionary

Note In Ubuntu, you might need to use apt-get install wbritish to get the dictionary file installed.

This command copies all the contents of the file to a nice and simple path that you can use easily. Of course, you can use any file that you wish for this example; it doesn’t need to be any particular size or type.

Rather than describe all the options you can use with mongofiles, let’s jump right in and start playing with some of the tool’s features. This book assumes that you’re running mongofiles on the same machine as MongoDB. If you’re not, then you’ll need to use the -h option to specify the host that MongoDB is running on. You’ll learn about the other options available in the mongofiles command after putting it through its paces.

First, let’s list all the files in the database. We’re not expecting any files to be in there yet, but let’s make sure. The list command lists the files in the database so far:

$ mongofiles list
connected to: 127.0.0.1
$

OK, so that probably wasn’t very exciting. Keep in mind that mongofiles is a proof-of-concept tool; it’s probably not a tool you will use much with your own applications. However, mongofiles is great for learning and testing. Once you create a file, you can use the tool to explore the files and chunks that are created.

Let’s kick things up a notch and use the put command to add the dictionary file created previously (remember: you can use any file that you like for this example):

$ mongofiles put /tmp/dictionary
connected to: 127.0.0.1
added file: { _id: ObjectId('51cb61b26487b3d8ce7af440'), filename: "/tmp/dictionary", chunkSize:
262144, uploadDate: new Date(1372283314621), md5: "40c0825855792bd20e8a2d515fe9c3e3", length:
4953699 }
done!
$

This example returns some useful information; however, let’s double-check the information it shows by confirming that the file is there. Do so by rerunning the list command:

$  mongofiles list
connected to: 127.0.0.1
/tmp/dictionary       4953699
$

This example shows the dictionary file, along with its size. The information clearly comes from the files collection, but we’re getting ahead of ourselves. Let’s take a moment to step back and examine the output returned from the put command in this example.

Using the _id Key

As you know, each document in MongoDB includes a unique identifier stored in the _id key. Like MySQL’s auto_increment field, the _id key is not of much direct interest, apart from the fact that it allows you to pick out a specific file.

Working with Filenames

The output from the put command also shows a filename key, which itself needs a little explanation. Generally, you will want to keep this field unique to help prevent major confusion; however, that’s not entirely necessary. In fact, if you run the put command again, you’ll end up with two documents that look identical. In this case, the files and metadata are identical, apart from the _id key. You might be surprised by this and wonder why MongoDB doesn’t update the file that exists rather than create a new one. The reason is that there could be many cases where you would have filenames that are identical. For example, if you built a system to store student assignments, then chances are pretty good that at least some of the filenames would be the same. MongoDB cannot assume that identical filenames (even those with identical sizes) are in fact the same file. Thus, there are many cases where it would be a mistake for MongoDB to update the file. Of course, you can use the _id key to update a specific file; and you’ll learn more about this topic in the upcoming Python-based experiments.

Determining a File’s Length

The put command also returns a file’s length, which is both useful information and critical to how GridFS works. While it is nice to know how big a file is for reference, the file’s size also plays a big part when you write your own applications. For example, when sending a file over the Web (through HTTP, for example), you need to specify how big the file is. Not all servers do this; for example, when downloading files from certain sites, you may have noticed that your browser can tell you the speed you’re downloading the file at, but not how long it will take to finish downloading the file. This is because the server did not provide size information.

Knowing the size of your file is important in one other respect. Earlier, we mentioned that a file is broken up into chunks—that is, the file is split into smaller pieces. By default, the chunk size is 256K, but that can be changed to another value if you wish. To work out how many chunks a file takes up, you need to know two things. First you must know how big each chunk is; and second, you must know the file size, so that you can tell how many chunks there are.

You might think that this shouldn’t be important. After all, if you have a 1MB file and the chunk size is 256K, then you know that you must start with chunk number four if you want to access data starting at the 800K mark. Yet you still need to know how big the overall file is for the following reason: if you don’t know the size, you cannot work out how many valid chunks there are. In the previous example, there’s nothing to stop you asking for data that starts at 1.26MB (that is, the sixth chunk). In this case, that chunk doesn’t exist, but there is no way to know that without a reference to the file size. Of course, the driver handles all of this for you, so there’s no need to worry too much about it; however, knowing how GridFS works “behind the scenes” will certainly help when it comes to debugging your applications.
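The arithmetic described here is simple enough to sketch in a few lines of Python. The function names are our own, and chunk numbers are counted from zero (so “chunk number four” is n = 3); treat this as an illustration of the reasoning rather than the code any particular driver uses.

```python
CHUNK_SIZE = 256 * 1024  # 262144 bytes, the default chunk size in this chapter


def chunk_count(length, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold a file of the given length."""
    return (length + chunk_size - 1) // chunk_size


def chunk_index(offset, chunk_size=CHUNK_SIZE):
    """Zero-based chunk number (the n key) that contains the byte offset."""
    return offset // chunk_size


def offset_is_valid(offset, length):
    """An offset can only be read if it falls inside the file."""
    return 0 <= offset < length


one_mb = 1024 * 1024
print(chunk_count(one_mb))            # a 1MB file needs 4 chunks: n = 0..3
print(chunk_index(800 * 1024))        # the 800K mark lives in n = 3, the fourth chunk
print(chunk_index(int(1.26 * one_mb)))            # n = 5, the sixth chunk
print(offset_is_valid(int(1.26 * one_mb), one_mb))  # False: no such chunk in a 1MB file
print(chunk_count(4953699))           # the dictionary file from earlier
```

For the 4,953,699-byte dictionary file, this works out to 19 chunks, numbered 0 through 18. Without the length, nothing in the chunk size alone tells you that asking for a twentieth chunk is a mistake.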

Working with Chunk Sizes

The put command also returns the chunk size because, although there is a default chunk size, this default can be changed on a file-by-file basis. This allows flexible sizing. If your website streams video, you might want to have many chunks so that you can easily skip to any part of a given video. If you had one big file, you would have to return the whole file, and then find the starting point for the specified section in it. With GridFS, you can pull back data at the chunk level. If you’re using the default size, then you can start retrieving data from any 256K chunk. Of course, you can also specify the bit of data you actually want (for example, you might want only five minutes in the middle of a sixty-minute movie). This is a very efficient system, and 256K is a pretty good chunk size for most purposes. If you decide to change it, you should have a good reason for doing so. As always, don’t forget to benchmark and test the performance of your custom chunk size; it’s not uncommon for theoretically better systems to fail to live up to expectations.
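The idea of pulling back only the chunks you need can be sketched as follows. This is an in-memory illustration, not driver code: read_range is a hypothetical helper, and chunks is assumed to be a plain list in which position n holds the bytes of chunk number n.

```python
def read_range(chunks, chunk_size, start, size):
    """Return (data, touched): `size` bytes beginning at `start`, plus the
    list of chunk numbers that had to be read to produce them.

    `chunks` is a list of bytes objects where chunks[n] is chunk number n.
    """
    if size <= 0:
        return b"", []
    first = start // chunk_size
    last = (start + size - 1) // chunk_size
    # Never look past the final chunk, even if the request runs off the end.
    touched = list(range(first, min(last, len(chunks) - 1) + 1))
    joined = b"".join(chunks[n] for n in touched)
    skip = start - first * chunk_size
    return joined[skip:skip + size], touched
```

A request for ten bytes in the middle of a large file touches one or two chunks, no matter how big the file is. In PyMongo, the file-like object returned by get offers seek and read methods, so in practice you would normally let the driver do this bookkeeping for you.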

Note MongoDB has a 16MB restriction on document size. Because GridFS is simply a different way of storing files in the standard MongoDB framework, this restriction also exists in GridFS. That is, you can’t create chunks larger than 16MB. This shouldn’t pose a problem, because the whole point of GridFS is to alleviate the need for huge document sizes. If you’re worried that you’re storing huge files, and this will give you too many chunk documents, you needn’t worry—there are MongoDB systems in production with significantly more than a billion documents!

Tracking the Upload Date

The uploadDate key does exactly what its name suggests: it stores the date the file was created in MongoDB. This is a good time to mention that the files collection is just a normal MongoDB collection, containing normal documents. This means that you can add any additional key and value pairs that you need, in the same way you would for any other collection.

For example, consider the case of a real-world application that needs to store text content that you extract from various files. You might need to do this so you could perform some additional indexing and searching. To accomplish this, you might add a file_text key and store the text in there. The elegance of the GridFS system means that you can do anything with this system you can do with any other MongoDB documents. Elegance and power are two of the defining characteristics of working in MongoDB.

Hashing Your Files

MongoDB ships with the MD5 hashing algorithm. You may have come across the algorithm previously when downloading software over the Internet. The theory behind MD5 is that each file has a unique signature. Changing a single bit anywhere in that file will drastically (and noticeably) change the signature. This signature is used for two reasons: security and integrity. For security, if you know what the MD5 hash is supposed to be and you trust the source (perhaps a friend gave it to you), then you can be assured that the file has not been altered if the hash (often called the checksum) is correct. This also ensures that the file integrity has been maintained and that no data has been lost or damaged. The MD5 hash of a particular file acts like a fingerprint for a file. The hash can be also used to identify files that have different filenames but have the same contents.

Warning The MD5 algorithm is no longer considered secure, and it has been demonstrated that it is possible to create two different files that have the same MD5 checksum, even though their contents are different. In cryptographic terms, this is called a collision. Such collisions are bad because they mean it is possible for an attacker to alter a file in such a way that it cannot be detected. This caveat remains somewhat theoretical because a great deal of effort and time would be required to create such collisions intentionally; and even then, the files could be so different as to be obviously not the same file. For this reason, MD5 is still the preferred method of determining file integrity because it is so widely supported. However, if you want to use hashing for its security benefits, you are much better off using one of the SHA family specifications—ideally SHA-256 or SHA-512. Even these hashing families have some theoretical vulnerabilities; however, no one has yet demonstrated a practical case of creating intentional collisions for the SHA family of hashes. MongoDB uses MD5 to ensure file integrity, which is fine for most purposes. However, if you want to hash important data (such as user passwords), you should probably consider using the SHA family of hashes instead.
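If you want to check a checksum yourself, Python’s standard hashlib module is enough. The following sketch hashes a file in fixed-size blocks, so the file never has to fit in memory, and lets you swap MD5 for a SHA-family algorithm by name. The helper names file_digest and bytes_digest are our own.

```python
import hashlib


def file_digest(path, algorithm="md5", block_size=256 * 1024):
    """Hash a file in fixed-size blocks so it never has to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def bytes_digest(data, algorithm="md5"):
    """Hash an in-memory bytes object."""
    return hashlib.new(algorithm, data).hexdigest()
```

Calling file_digest("/tmp/dictionary") should give you a value you can compare with the md5 key that mongofiles printed for your own copy of the file, while file_digest("/tmp/dictionary", "sha256") produces a SHA-256 digest instead.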

Looking Under MongoDB’s Hood

At this point, you have some data in a MongoDB database. Now let’s take a closer look at that data under the covers. To do this, you’ll again use some command-line tools to connect to the database and query it. For example, try running the find() command against the file created earlier:

$ mongo test
MongoDB shell version: 2.5.1-pre
connecting to: test
 
> db.fs.files.find()
{ "_id" : ObjectId("51cb61b26487b3d8ce7af440"), "filename" : "/tmp/dictionary",
"chunkSize" : 262144, "uploadDate" : ISODate("2013-06-26T21:48:34.621Z"), "md5" :
"40c0825855792bd20e8a2d515fe9c3e3", "length" : 4953699 }
>

The output should look familiar—after all, it’s the same data that you saw earlier in this chapter. Now you can see that the information printed by mongofiles was taken from the file’s entry in the fs.files collection.

Next, let’s take a look at the chunks collection (we have to add a filter; otherwise, it will show us all of the raw binary data as well):

$ mongo test
MongoDB shell version: 2.5.1-pre
connecting to: test
> db.fs.chunks.find({},{"data":0});
{ "_id" : ObjectId("51cb61b29b2daad9857ca205"),  "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 4 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca206"),  "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 5 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca207"),  "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 6 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca208"),  "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 7 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca209"),  "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 8 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca20a"),  "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 9 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca20b"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 10 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca20c"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 11 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca20d"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 12 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca20e"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 13 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca20f"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 14 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca210"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 15 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca211"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 16 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca212"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 17 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca201"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 0 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca202"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 1 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca203"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 2 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca204"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 3 }
{ "_id" : ObjectId("51cb61b29b2daad9857ca213"), "files_id" : ObjectId("51cb61b26487b3d8ce7af440"), "n" : 18 }
>

You might wonder why the output here has so many entries. As noted previously, GridFS is just a specification. That is, it uses what MongoDB already provides. While we were testing the commands for the book, the dictionary file was added a couple of times. Later, this file was deleted when we emptied the fs.files collection. You can see for yourself what happened next! The fact that some documents were removed from a collection has no bearing on what happens in another collection. Remember: MongoDB doesn’t treat these documents or collections in any special way. If the file had been deleted properly through a driver or the mongofiles tool, that tool would also have cleaned up the chunks collection.
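Spotting leftover chunks of this kind comes down to a set difference: collect every files_id mentioned in the chunks collection and subtract the _id values that still exist in the files collection. Here is a small sketch that works on plain lists of dictionaries; orphaned_file_ids is a hypothetical helper, and how you fetch the documents is up to you.

```python
def orphaned_file_ids(files_docs, chunk_docs):
    """Return the set of files_id values that appear in chunks
    but no longer have a matching document in files."""
    known_ids = {doc["_id"] for doc in files_docs}
    referenced_ids = {doc["files_id"] for doc in chunk_docs}
    return referenced_ids - known_ids
```

Against a real database, you could feed this function the results of queries on fs.files and fs.chunks that return only the _id and files_id keys, so that the binary data never has to leave the server.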

Warning Accessing documents and collections directly is a powerful feature, but you need to be careful. This feature also makes it much easier to shoot yourself in both feet at the same time. Make sure you know what you’re doing and that you perform a great deal of testing if you decide to edit these documents and collections manually. Also, keep in mind that the GridFS support in MongoDB’s drivers won’t know anything about any customizations that you’ve made.

Using the search Command

Next, let’s take a closer look at the mongofiles search command. Thus far, there is only a single file in the database, which greatly limits the types of searches you might conduct! So let’s add something else. The following snippet copies the dictionary to another file, and then imports that file:

$ cp /tmp/dictionary /tmp/hello_world
$ mongofiles put /tmp/hello_world
connected to: 127.0.0.1
added file: { _id: ObjectId('51cb63d167961ebc919edbd5'), filename: "/tmp/hello_world", chunkSize:
262144, uploadDate: new Date(1372283858021), md5: "40c0825855792bd20e8a2d515fe9c3e3", length:
4953699 }
done!
$ mongofiles list
connected to: 127.0.0.1
/tmp/dictionary    4953699
/tmp/hello_world    4953699
$

The first line copies the file, and the second line imports it into MongoDB. As in the earlier example, the put command prints out the new document that MongoDB has created. Next, you might run the mongofiles command list to check that the files were correctly stored. If you do so, you can see that there are now two files in the collection; unsurprisingly, both files have the same size.

The search command works exactly as you would expect. All you need to do is tell mongofiles what you are looking for, and it will try to find it for you, as in this example:

$  mongofiles search hello
connected to: 127.0.0.1
/tmp/hello_world    4953699
$  mongofiles search dict
connected to: 127.0.0.1
/tmp/dictionary    4953699
$

Again, nothing too exciting happens here. However, there is an important takeaway that’s worth noting. MongoDB can be as simple or as complex as you need it to be. The mongofiles tool is only for reference use, and it includes very basic debugging. The good news: MongoDB makes it easy to perform simple searches against your files. The even better news: MongoDB also has your back if you want to write some insanely complicated searches.
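Judging by the output above, search behaves like a substring match on the filename. If you wanted something similar from your own code, one way to sketch it is to build a query document with a regular expression. Both helpers below are our own invention; the second applies the same match to an in-memory list so you can try the idea without a database.

```python
import re


def filename_query(term):
    """Build a query document matching filenames that contain `term`.

    re.escape() stops characters such as "." from acting as wildcards.
    """
    return {"filename": {"$regex": re.escape(term)}}


def search_locally(files_docs, term):
    """Apply the same substring match to an in-memory list of files documents."""
    pattern = re.compile(re.escape(term))
    return [doc for doc in files_docs if pattern.search(doc["filename"])]
```

With PyMongo, a query document like the one returned by filename_query could be passed to find on the fs.files collection; because that collection is an ordinary one, you are free to combine it with any other criteria you like.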

Deleting

The mongofiles command delete doesn’t require much explanation, but it does deserve a big warning. This command deletes files based on the filename. Thus, if you have more than one file with the same name, this command will delete all of them. The following snippet shows how to use the delete command:

$ mongofiles delete /tmp/hello_world
connected to: 127.0.0.1
$ mongofiles list
connected to: 127.0.0.1
/tmp/dictionary       4953699
$

Note Many people have commented in connection with this issue that deleting multiple files with the same name is not a problem because no application would have duplicate names. This is simply not true; and in many cases, it doesn’t even make sense to enforce unique names. For example, if your app lets users upload photos to their profiles, there’s a good chance that half the files you receive will be called photo.jpg or me.png.

Of course, you are unlikely to use mongofiles to manage your live data (in truth, no one ever expected it to be used that way), but you still need to be careful when deleting data in general.

Retrieving Files from MongoDB

So far, you haven’t actually pulled any files out from MongoDB. The most important feature of any database is that it lets you find and retrieve data once it’s been put in. The following snippet retrieves a file from MongoDB using the mongofiles command get:

$ mongofiles get /tmp/dictionary
connected to: 127.0.0.1
done write to: /tmp/dictionary
$

This example includes an intentional mistake. Because it specifies the full name and path of the file you want to retrieve (as required), mongofiles writes the data to a file with the same name and path. Effectively, this overwrites the original dictionary file! This isn’t exactly a great loss, because it is being overwritten by the same file—and the dictionary file was only a temporary copy in the first place. Nevertheless, this behavior could give you a rather nasty shock if you accidentally erase two weeks of work. Trust us, you won’t figure out where all your work went until sometime after the event! As when using the delete command, you need to be careful when using the get command.
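When you write retrieved data to disk from your own code, you can guard against this kind of accident by refusing to overwrite an existing file. The sketch below uses Python’s exclusive-creation mode for that; write_new_file is a hypothetical helper name. (The mongofiles tool itself has a --local option for choosing a different local filename; check mongofiles --help for the version you have installed.)

```python
def write_new_file(path, data):
    """Write bytes to `path`, refusing to overwrite an existing file.

    Mode "xb" makes open() raise FileExistsError if the path already exists,
    so a stray call can never silently replace two weeks of work.
    """
    with open(path, "xb") as handle:
        handle.write(data)
    return path
```

A caller can catch FileExistsError and decide what to do next, such as picking a new name or asking the user for confirmation.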

Summing Up mongofiles

The mongofiles utility is a useful tool for quickly looking at what’s in your database. If you’ve written some software, and you suspect something might be amiss with it, then you can use mongofiles to double-check what’s going on.

It’s an extremely simple implementation, so it doesn’t require any fancy logic that could complicate accomplishing the task at hand. Whether you would use mongofiles in a production environment is a matter of personal taste. It’s not exactly a Swiss army knife; however, it does provide a useful set of commands that you’ll be grateful to have if your application begins misbehaving. In short, you should be familiar with this tool because someday it might be exactly the tool you require to solve an otherwise nettlesome problem.

Exploiting the Power of Python

At this point, you have a solid idea of how GridFS works. Next, you will learn how to access GridFS from Python. Chapter 2 covered how to install PyMongo; if you have any trouble with the examples, please refer back to Chapter 2 and make sure everything is installed correctly.

If you’ve been following along with the previous examples in this chapter, you should now have one file in GridFS. You’ll also recall that the file is a dictionary file, so it contains a list of words. In this section, you will learn how to write a simple Python script that prints out all the words in the dictionary file. Sure, it would be simpler and more efficient to simply cat the original file—but where would the fun be in that?

Begin by firing up Python:

Python 2.6.6 (r266:84292, Oct 12 2012, 14:23:48)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

The standard driver for Python is called PyMongo, and it was written by Mike Dirolf. Because the PyMongo driver is supported directly by MongoDB, Inc., the company that publishes MongoDB, you can rest assured that it will be regularly updated and maintained. So, let’s go ahead and import the library. You should see something like the following:

>>> from pymongo import Connection
>>> import gridfs
>>>

If PyMongo isn’t installed correctly, you will get an error similar to this:

>>> import gridfs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named gridfs
>>>

If you see the latter message, chances are something was missed during installation. In that case, pop back to Chapter 2 and follow the instructions to install PyMongo again.

Connecting to the Database

Before you can retrieve information from a database, you must first establish a connection to it. When you were using the mongofiles utility earlier in this chapter, you probably noticed the reference to 127.0.0.1. This value is also known as the localhost, and it represents your computer’s loopback address. This value is simply a shortcut for telling a computer to talk to itself. The reason mongofiles mentioned this IP address is that it was actually connecting to MongoDB through the network. The default is to connect to the local machine on the default MongoDB port. Because you haven’t changed the default settings, mongofiles can find and connect to your database without any trouble.

When using MongoDB with Python, however, you need to connect to the database and then set up GridFS. Fortunately, this is easy to do:

>>> db = Connection().test
>>> fs = gridfs.GridFS(db)
>>>

The first line opens the connection and selects the database. By default, mongofiles uses the test database; hence, you’ll find your dictionary file in test. The second line sets up GridFS and prepares it for use.
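The Connection class belongs to the PyMongo 2.x line used in this chapter. Later releases of the driver replaced it with MongoClient, and Connection was removed in PyMongo 3.0. If you are on a newer driver, a rough equivalent might look like the following sketch; the function name open_gridfs and its default arguments are our own choices, not part of PyMongo.

```python
def open_gridfs(database_name="test", host="localhost", port=27017, collection="fs"):
    """Connect to MongoDB and return a GridFS handle for the given database.

    The imports are done inside the function so that this snippet loads even
    where PyMongo is not installed; calling it does require PyMongo.
    """
    import gridfs
    from pymongo import MongoClient

    client = MongoClient(host, port)
    # `collection` is the namespace prefix: "fs" gives fs.files and fs.chunks.
    return gridfs.GridFS(client[database_name], collection=collection)
```

Calling fs = open_gridfs() would then play the same role as the two lines above, and passing a different collection value is how you would keep, say, images and movies in separate namespaces.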

Accessing the Words

In its original implementation, the PyMongo driver used a file-like interface to leverage GridFS. This is somewhat different from what you saw in this chapter’s earlier examples with mongofiles, which were more FTP-like in nature. In the original implementation of PyMongo, you could read and write data just as you do for a normal file.

This made PyMongo very much like Python to use, and it allowed for easy integration with existing scripts. However, this behavior was changed in version 1.6 of the driver, and this functionality is no longer supported. While very Python-like, the behavior had some problems that made the tool less effective overall.

Generally speaking, the PyMongo driver attempts to make GridFS files look and feel like ordinary files on the filesystem. On the one hand, this is nice because it means there’s no learning curve, and the driver is usable with any method that requires a file. On the other hand, this approach is somewhat limiting and doesn’t give a good feel for how powerful GridFS is. Important changes were made to how PyMongo works in version 1.6, particularly in how get and put work.

Note This revised version of PyMongo isn’t too dissimilar from previous versions of the tool, and many people who used the previous API have found it easy to adapt to the revised version. That said, Mike’s changes haven’t gone down well with everybody. For example, some people found the file-based keying in the old API to be extremely useful and easy to use. The revised version of PyMongo supports the ability to create filenames, so the missing behavior can be replicated in the revised version; however, doing so does require a bit more code.

Putting Files into MongoDB

Getting files into GridFS through PyMongo is straightforward and intentionally similar to the way you do so using command-line tools. MongoDB is all about throughput, and the changes to the API in the revised version of PyMongo reflect this. Not only do you get better performance, but the changes also bring the Python driver in line with the other GridFS implementations.

Let’s put the dictionary into GridFS (again):

>>> with open("/tmp/dictionary", "rb") as dictionary:
...   uid = fs.put(dictionary)
...
>>> uid
ObjectId('51cb65be2f50332093f67b98')
>>>

In this example, you use the put method to insert the file. It’s important that you capture the result from this method because it contains the document _id for your file. PyMongo takes a different approach than mongofiles, which assumes the filename is effectively the key (even though you can have duplicates). Instead, PyMongo references files based on their _id. If you don’t capture this information, then you won’t be able to reliably find the file again. Actually, that’s not strictly true—you could search for a file quite easily—but if you want to link this file to a particular user account, then you need this _id.

Two useful arguments that can be used in conjunction with the put command are filename and content_type. As you might expect, these arguments let you set the filename and the content type of the file, respectively. This is useful for loading files directly from disk. However, it is even handier when you’re handling files that have been received over the Internet or generated in memory because, in those cases, you can use file-like semantics, but without actually having to create a real file on the disk.
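As an illustration of the in-memory case, the following sketch stores a bytes object together with a filename and content type. The wrapper store_bytes is our own, and it assumes only that fs offers a put method accepting keyword arguments, as PyMongo’s GridFS object does; the default content type is likewise just a sensible placeholder.

```python
def store_bytes(fs, data, filename, content_type="application/octet-stream"):
    """Store in-memory bytes via a GridFS-like object and return the new _id.

    `fs` is expected to offer put(data, **kwargs), as PyMongo's GridFS does.
    """
    return fs.put(data, filename=filename, content_type=content_type)
```

A web application could call store_bytes(fs, uploaded_bytes, "me.png", "image/png") straight from its upload handler and keep the returned _id alongside the user’s account, without ever writing a temporary file.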

Retrieving Files from GridFS

At long last, you’re now ready to return your data! At this point, you have your unique _id, so finding the file is easy. The get method retrieves a file from GridFS:

>>> new_dictionary = fs.get(uid)

That’s it! The preceding snippet returns a file-like object; thus, you can print all the words in the dictionary using the following snippet:

>>> for word in new_dictionary:
...   print word

Now watch in awe as a list of words quickly scrolls up the screen! Okay, so this isn’t exactly rocket science. However, the fact that it isn’t rocket science or in any way difficult is part of the beauty of GridFS—it does work as advertised, and it does so in an intuitive and easily understood way!
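One caveat is worth a sketch. As far as we are aware, what you get when you loop directly over a GridFS file object has differed between PyMongo releases (chunk-sized pieces in some, lines in others), and on Python 3 the data arrives as bytes rather than text. The hypothetical helper below relies only on read(), so it yields one decoded word per line regardless; the UTF-8 encoding is an assumption about the dictionary file.

```python
def iter_words(file_obj, block_size=64 * 1024):
    """Yield decoded lines from any binary file-like object.

    Only read() is used, so the result does not depend on how the object
    behaves when it is iterated directly.
    """
    pending = b""
    while True:
        block = file_obj.read(block_size)
        if not block:
            break
        pending += block
        # Everything before the last newline is complete; keep the rest.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line.decode("utf-8", errors="replace")
    if pending:
        yield pending.decode("utf-8", errors="replace")
```

With this in place, a loop such as for word in iter_words(fs.get(uid)) should print one word at a time, and the same function works on an ordinary file opened in binary mode.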

Deleting Files

Deleting a file is also easy. All you have to do is call fs.delete() and pass the _id of the file, as in the following example:

>>> fs.delete(uid)
>>> new_dictionary = fs.get(uid)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/gridfs/__init__.py",
line 140, in get
    return GridOut(self.__collection, file_id)
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/gridfs/grid_file.py",
line 392, in __init__
    (files, file_id))
gridfs.errors.NoFile: no file in gridfs collection Collection(Database(Connection('localhost',
27017), u'test'), u'fs.files') with _id ObjectId('51cb65be2f50332093f67b98')
>>>

These results could look a bit scary, but they are just PyMongo’s way of saying that it couldn’t find the file. This isn’t surprising, because you just deleted it!

Summary

In this chapter, you undertook a fast-paced tour of GridFS. You learned what GridFS is, how it fits together with MongoDB, and how to use its basic syntax. This chapter didn’t explore GridFS in great depth, but in the next chapter, you’ll learn how to integrate GridFS with a real application using PHP. For now, it’s enough to understand how GridFS can save you time and hassle when storing files and other large pieces of data.

In the next chapter, you’ll start putting what you’ve learned to real use—specifically, you’ll learn how to build a fully functional address book!
