Throttling Node.js

With our RDF parsing module in place and well tested, let’s turn our attention to getting all those thousands of records into the database.

But one word of caution before we proceed: the code in this section attempts to demonstrate performance-related problems and their solutions. The speed of your hardware and the settings of your operating system may make it easier or harder to see these effects.

Alright—to crawl the cache directory, we’ll use a module called file, which is available through npm.[25] Install and save it as usual, and we’ll begin:

 
$ npm install --save file

The file module has a convenient method called walk() that traverses a directory tree, calling a callback for each directory it visits and passing along the files found there.
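Here's a minimal sketch of its shape (the directory path is just a placeholder):

const file = require('file');

// walk() invokes the callback once per directory visited.
file.walk('/some/dir', function(err, dirPath, dirs, files) {
  // dirPath: the directory currently being visited
  // dirs:    paths of its subdirectories
  // files:   paths of the files it contains
  console.log(dirPath + ': ' + files.length + ' files');
});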

Naive File Parsing at Scale

Let’s use this method to find all the RDF files and send them through our RDF parser. Open a text editor and enter this:

databases/list-books.js
 
'use strict';
const
  file = require('file'),
  rdfParser = require('./lib/rdf-parser.js');

console.log('beginning directory walk');
file.walk(__dirname + '/cache', function(err, dirPath, dirs, files){
  // walk() hands us every file in each directory it visits.
  files.forEach(function(path){
    rdfParser(path, function(err, doc) {
      if (err) {
        throw err;
      } else {
        console.log(doc);
      }
    });
  });
});

Save the file as list-books.js in your databases project. This short program walks down the cache directory and passes each RDF file it finds into the RDF parser. The parser’s callback just echoes the JSON out to the console if there wasn’t an error.

Run the program, and let’s see what it produces:

 
$ node --harmony list-books.js
beginning directory walk

./list-books.js:12
        throw err;
        ^
Error: EMFILE, open './cache/epub/12292/pg12292.rdf'

Failure! The problem here is masked by the innocuous-looking Error: EMFILE. This kind of error occurs when you've exhausted the number of file descriptors available to the process; the default limit varies by operating system and configuration, but it's easy to hit when you try to open thousands of files at once.

There are a couple of ways to solve this problem. One way is to increase the maximum number of file descriptors in the operating system. However, that’s only a temporary fix, until you run into even bigger file sets.
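On most Unix-like systems, you can inspect and raise the limit with the shell's ulimit builtin, though the change lasts only for the current session and is capped by the hard limit:

$ ulimit -n          # show the current file-descriptor limit
$ ulimit -n 20000    # raise it for this shell session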

Another way would be to modify the rdf-parser module to retry when it receives an EMFILE error. A third-party module called graceful-fs does this, for example.[26]
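Since graceful-fs is designed as a drop-in replacement for the built-in fs module, a module like our rdf-parser could adopt it by changing a single require (a sketch, assuming you've installed graceful-fs):

// Instead of: const fs = require('fs');
const fs = require('graceful-fs');  // queues and retries opens on EMFILE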

But you can’t always count on your dependency modules to be graceful with file handling, so we’ll use another approach: work queuing. Rather than immediately sending each file path into the RDF parser, we’ll queue them as work to be done and let the queue throttle the number of concurrently running tasks.

Queuing to Limit Work in Progress

For this we’ll use the async module.[27] Go ahead and install it now.

 
$ npm install --save async

Async offers low-overhead mechanisms for managing asynchronous code. For example, it has methods for executing a sequence of asynchronous tasks sequentially or in parallel, with a callback to invoke when they’re all done.
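For instance, here's a small sketch (separate from our project) of async.parallel(), which kicks off tasks concurrently and collects their results:

const async = require('async');

async.parallel([
  function(callback) {
    setTimeout(function() { callback(null, 'one'); }, 200);
  },
  function(callback) {
    setTimeout(function() { callback(null, 'two'); }, 100);
  }
], function(err, results) {
  // Runs once both tasks have called back; results preserve task order.
  console.log(results);  // -> [ 'one', 'two' ]
});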

We need the ability to run a whole bunch of tasks, but limit the number that are running at any time. For this we need async.queue().

Open your editor and enter this:

databases/list-books-queued.js
 
'use strict';
const
  async = require('async'),
  file = require('file'),
  rdfParser = require('./lib/rdf-parser.js'),

  // A queue that runs at most 1,000 worker tasks concurrently.
  work = async.queue(function(path, done) {
    rdfParser(path, function(err, doc) {
      console.log(doc);
      done();  // tell the queue this task is finished
    });
  }, 1000);

console.log('beginning directory walk');
file.walk(__dirname + '/cache', function(err, dirPath, dirs, files){
  files.forEach(function(path){
    work.push(path);  // enqueue the path instead of parsing immediately
  });
});

Save this file as list-books-queued.js. Notice that we create a work object by passing async.queue() a worker function and a concurrency limit of 1,000. This object has a push() method that we use to add more paths for processing.
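The work object exposes other hooks besides push(). One handy one is drain, which fires when the queue has emptied. Here's a sketch; note that in versions of async contemporary with this code, drain is a property you assign, while newer releases of async treat it as a method:

// Runs when the last task finishes and no work remains queued.
work.drain = function() {
  console.log('all queued files have been processed');
};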

The worker function we used to create the queue takes two arguments: path and done. The path argument is the path to an RDF file discovered by walking the directory tree. done is a callback that our worker function has to call to signal to the work queue that it’s free to dequeue the next path.

In Node.js, it's common for a function's last parameter to be a done or next callback. Technically, you can name this parameter whatever you want (it's your function, after all). But naming it done or next signals that this is a callback that takes no arguments and should be called exactly once, when you're finished doing whatever it is that you're doing. By contrast, callback functions named callback often take one or more arguments, starting with an err argument.
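To make the distinction concrete, here's a hypothetical pair of functions, one in each style:

'use strict';
const fs = require('fs');

// "done" style: takes no arguments; called exactly once when finished.
function worker(path, done) {
  console.log('processing ' + path);  // do some work
  done();
}

// "callback" style: reports an error (or null) first, then any results.
function readConfig(path, callback) {
  fs.readFile(path, 'utf8', function(err, data) {
    if (err) { return callback(err); }
    callback(null, JSON.parse(data));
  });
}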

Now let’s give this program a try and see if it works better. There may be a long pause between when you kick off the program and when you start seeing anything happen—this is to be expected.

 
$ node --harmony list-books-queued.js
{ _id: 0, title: '', authors: [], subjects: [] }
{ _id: 10,
  title: 'The Bible, Old and New Testaments, King James Version',
  authors: [],
  subjects: [ 'Religion' ] }
{ _id: 1,
  title: 'United States Declaration of Independence',
  authors: [ 'United States' ],
  subjects: [ 'United States -- History -- Revolution, 1775-1783 -- Sources' ] }
{ _id: 1000,
  title: 'La Divina Commedia di Dante',
  authors: [ 'Dante Alighieri' ],
  subjects: [ 'Poetry' ] }
...

Great! This program will run for quite a while if you let it. Instead, go ahead and kill it with Ctrl-C.

Now on to the last step: pumping these records into the database.

Putting It All Together

With a working parser and a throttled queue, we’re ready to add the last piece: putting the records into the database.

Open your editor to the list-books-queued.js file one more time. First, add an extra require to the top of the file to pull in the request module like we did for the dbcli.js program.
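With that change, the const block at the top would begin something like this (the worker body is elided here; its new contents follow):

const
  async = require('async'),
  file = require('file'),
  request = require('request'),  // new: the HTTP client we used in dbcli.js
  rdfParser = require('./lib/rdf-parser.js'),
  work = async.queue(function(path, done) {
    // worker body shown below
  }, 1000);

Next, find the callback we pass into rdfParser(). Replace what's there with this: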

databases/import-books.js
 
rdfParser(path, function(err, doc) {
  // PUT each parsed document into CouchDB, keyed by its _id.
  request({
    method: 'PUT',
    url: 'http://localhost:5984/books/' + doc._id,
    json: doc
  }, function(err, res, body) {
    if (err) {
      throw Error(err);
    }
    console.log(res.statusCode, body);
    done();
  });
});

Save this new file as import-books.js. Instead of dumping the parsed object (doc) directly to the console, now we PUT it into the database using request(). Then we take the response we get from the database and send that to the console.

Time to find out if it works. Kick it off on the command line:

 
$ node --harmony import-books.js
beginning directory walk

./import-books.js:18
        throw Error(err);
        ^
Error: Error: write ECONNRESET
    at Error (<anonymous>)
    at Request._callback (./import-books.js:18:17)
    at self.callback (./node_modules/request/index.js:148:22)
    at Request.EventEmitter.emit (events.js:100:17)
    ...

Well, the error isn’t EMFILE, so we didn’t run out of file descriptors. The error ECONNRESET means that the TCP connection to the database died abruptly. This shouldn’t be very surprising; we tried to open upwards of a thousand connections to it at the same time.

The easiest thing we can do is dial back the concurrency for the work queue. Rather than allowing 1,000 concurrent jobs, change the queue's concurrency limit to only 10. Only the second argument to async.queue() changes:
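work = async.queue(function(path, done) {
  // ...worker body unchanged...
}, 10);  // down from 1000

When you're done, clear out the books database and try running the import again: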

 
$ ./dbcli.js DELETE books
200 { ok: true }
$ ./dbcli.js PUT books
201 { ok: true }
$ node --harmony import-books.js
beginning directory walk
201 { ok: true, id: '0', rev: '1-453265faaa77a714d46bd72f62326186' }
201 { ok: true, id: '1', rev: '1-52a8081284aa74919298d724e5b73589' }
201 { ok: true,
  id: '10002',
  rev: '1-cf704abd67ec797f318dc6c949c7beed' }
201 { ok: true,
  id: '10000',
  rev: '1-4ccce019854dd45ba6a7c5c064c4460b' }
...

Recall that 201 is the HTTP status code for Created. The 201s above mean the database is successfully creating our documents.

This program will take quite a while to complete. You can check on its progress using the dbcli program we made earlier from a second terminal. Run ./dbcli.js GET books and look at the doc_count property in the output to see how many records have been imported.

Dealing with Update Conflicts

But one more thing before we move on to querying. There’s something strange about that first entry—let’s take a look at book 0:

 
$ ./dbcli.js GET books/0
200 { _id: '0',
  _rev: '1-453265faaa77a714d46bd72f62326186',
  title: '',
  authors: [],
  subjects: [] }

The book with id 0 has no title, no authors, and no subjects. It turns out that this record is just a template; the RDF file that produced this document is pretty much empty.

We don’t need this document, so let’s delete it. Try removing it with DELETE:

 
$ ./dbcli.js DELETE books/0
409 { error: 'conflict', reason: 'Document update conflict.' }

CouchDB responded with a 409 Conflict status code. This is actually a good thing—CouchDB guards against conflicts and race conditions by demanding the current revision ID for any update to an existing document. Since our request didn't include a rev parameter at all, CouchDB refused the delete rather than risk clobbering a revision we hadn't seen.

So, to DELETE or PUT a document that exists, you need to provide a revision:

 
$ ./dbcli.js DELETE books/0?rev=1-453265faaa77a714d46bd72f62326186
200 { ok: true, id: '0', rev: '2-1ab4213771c4fbf4c48e11c246e8a6fc' }

Different databases and RESTful services deal with conflicts differently. Relying on a rev parameter is a characteristic specific to CouchDB.
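If you'd rather perform this kind of revision-aware delete from Node.js than from the shell, one approach (a sketch using the same request module) is to fetch the document first to discover its current revision:

// Sketch: look up a document's current _rev, then delete that revision.
const request = require('request');

const url = 'http://localhost:5984/books/0';
request({ method: 'GET', url: url, json: true }, function(err, res, doc) {
  if (err) { throw err; }
  request({
    method: 'DELETE',
    url: url + '?rev=' + doc._rev,  // pass the revision we just saw
    json: true
  }, function(err, res, body) {
    if (err) { throw err; }
    console.log(res.statusCode, body);
  });
});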

With that extra document out of the way, it’s time to start querying the data.
