Use case – distributing work

Some time ago I wanted to write an application to index content from Usenet groups. Usenet groups were the precursor to today's forums. Each group encompasses a topic, people can create new threads (called articles), and anyone can reply to those articles.

As opposed to forums, though, you don't access Usenet through a web browser. Instead, you have to connect through a client that supports the NNTP protocol. Also, a Usenet server allows only a limited number of concurrent connections; let's say 40 connections.

In summary, the purpose of the application is to use all the available connections to retrieve and parse as many articles as possible, categorize them, and store the processed information in a database. A naïve approach to this challenge would be to have one coroutine per connection (40 in the preceding example), each of them fetching an article, parsing it, cataloging its contents, and then putting it in the database. Each coroutine moves on to the next article as it finishes the current one.

A representation of this approach could be something like the following:
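In code, this naive approach might look like the following minimal sketch. The suspending functions `fetchArticle`, `parseArticle`, `catalogArticle`, and `storeArticle` are hypothetical stand-ins (not from the original text), with `delay()` simulating the average times given below:

```kotlin
import kotlinx.coroutines.*

// Hypothetical pipeline steps; delay() stands in for real work,
// using the average times discussed in the text.
suspend fun fetchArticle(id: Int): String { delay(30); return "raw-$id" }
suspend fun parseArticle(raw: String): String { delay(20); return "parsed:$raw" }
suspend fun catalogArticle(parsed: String): String { delay(30); return "cataloged:$parsed" }
suspend fun storeArticle(cataloged: String) { delay(20) }

fun main() = runBlocking {
    val connections = 40
    val jobs = List(connections) { worker ->
        launch(Dispatchers.Default) {
            // Each coroutine owns one connection and performs every step
            // itself, so the connection sits idle during the 70 ms of
            // processing that follows each 30 ms fetch.
            for (id in (worker + 1)..120 step connections) {
                storeArticle(catalogArticle(parseArticle(fetchArticle(id))))
            }
        }
    }
    jobs.joinAll()
}
```

Note that each coroutine here holds its connection for the full 100 ms per article, even though the connection is only needed for the first 30 ms of that time.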

The problem with this approach is that we aren't maximizing the use of the available connections, which means that we are indexing more slowly than we could. Suppose that, on average, fetching an article takes 30 milliseconds, parsing its content takes 20, cataloging it takes another 30, and storing it in the database another 20. Here is a simple breakdown for reference:

    Fetch article         30 ms
    Parse content         20 ms
    Catalog contents      30 ms
    Store in database     20 ms
    Total                100 ms

Only 30 percent of the time it takes to index an article is spent retrieving it; the other 70 percent goes to processing. Nevertheless, since we are using one coroutine for all the steps of the indexation, the connection sits unused for 70 of the 100 milliseconds each article takes.

A better approach is to have 40 coroutines dedicated only to retrieving articles, and a different group of coroutines doing the processing; say, 80 coroutines. With the right balance between the two groups, it would be possible to index content more than three times faster (assuming we have enough hardware to handle the load), because all the connections would be in use for the whole duration of the indexation. We could retrieve data non-stop, and then use as much of the hardware as possible to process it concurrently.

To implement this, we only need to put a channel between the group of 40 article retrievers and the group of 80 article processors. The channel works as a pipeline: the retrievers put raw articles into it as they fetch them, and the processors take and process them as they become available.

The following is a representation of this design:
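A minimal sketch of this design is shown below. The step functions are hypothetical stand-ins (not from the original text), with `delay()` simulating the work; the processing steps are collapsed into a single 70 ms `processArticle` for brevity:

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.*

// Hypothetical steps; delay() stands in for real work.
suspend fun fetchArticle(id: Int): String { delay(30); return "raw-$id" }
suspend fun processArticle(raw: String) { delay(70) } // parse + catalog + store

fun main() = runBlocking {
    // The channel is the bridge: retrievers send raw articles into it,
    // and processors receive them as they become available.
    val articles = Channel<String>()
    val ids = Channel<Int>()

    // 40 coroutines dedicated exclusively to fetching, one per connection.
    val retrievers = List(40) {
        launch(Dispatchers.Default) {
            for (id in ids) articles.send(fetchArticle(id))
        }
    }

    // 80 coroutines dedicated exclusively to processing.
    val processors = List(80) {
        launch(Dispatchers.Default) {
            for (raw in articles) processArticle(raw)
        }
    }

    (1..240).forEach { ids.send(it) }
    ids.close()             // no more article ids to fetch
    retrievers.joinAll()
    articles.close()        // no more raw articles will arrive
    processors.joinAll()
}
```

Closing each channel lets the `for` loops over it terminate cleanly once the remaining elements have been consumed, so the whole pipeline drains and shuts down in order.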

In this example, the channel is the bridge between the coroutines that fetch and the ones that process. The fetching coroutines only need to care about fetching and putting the raw response into the channel (so they maximize the usage of each connection), whereas the processing coroutines process data as they receive it through the channel.

Because of how channels work, we don't need to worry about distributing the load—as coroutines become available to process, they will receive data from the channel.
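This automatic hand-off can be seen in a toy example (hypothetical, not from the text): several consumers reading from the same channel each receive a share of the elements, with no explicit scheduling on our part:

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.*
import java.util.concurrent.atomic.AtomicInteger

fun main() = runBlocking {
    val channel = Channel<Int>()
    val received = List(4) { AtomicInteger(0) }

    // Four consumers compete for the same channel; whichever one is
    // suspended in receive() when an element arrives gets that element.
    val consumers = List(4) { i ->
        launch {
            for (item in channel) {
                received[i].incrementAndGet()
                delay(10) // simulate processing
            }
        }
    }

    repeat(40) { channel.send(it) }
    channel.close()
    consumers.joinAll()

    // Every element was consumed exactly once, spread across consumers.
    check(received.sumOf { it.get() } == 40)
}
```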

Now, after this change, we will be retrieving around three articles per connection every 90 milliseconds, which is more than three times faster than before. Whether the processing keeps up with that pace depends on there being enough hardware for other coroutines to process the data as soon as it is retrieved.
