Reading large amounts of data

Some applications require large amounts of data that have to be read by the database. Usually, two major cases can be distinguished:

  • Reading large amounts of data and shipping them to the client
  • Reading large amounts of data so that the server can process them

In this section, both cases will be covered because they have slightly different requirements.

Making use of cursors

Cursors are a fundamental concept provided by all reasonable relational databases. However, cursors are often ignored and not used the way they should be, which leads to problems all along the way.

Why would anybody want to use a cursor, and what is it all about anyway? Consider the following example:

SELECT * FROM very_large_table;

For PostgreSQL, this is no problem at all. It will calculate the result and send it to the client. However, the client may instantly crash because it will try to store all of the data in memory. If the table is really large, the client may just not be able to cope with all of the data coming from the server; it will consume a lot of memory or even die. The biggest problem is that whether your client survives or not depends on the amount of data in the table, which is a sure recipe for disaster and one that is also difficult to identify.

In general, your applications should be designed in a way such that they will always work, regardless of the amount of data and regardless of the result sets coming back from the server. A cursor is here to fix this very problem. The idea behind a cursor is to allow the client to keep only as much data around as will be needed at a specific time.

To make use of cursors, the DECLARE command has to be used:

test=# \h DECLARE
Command:     DECLARE
Description: define a cursor
Syntax:
DECLARE name [ BINARY ] [ INSENSITIVE ] [ [ NO ] SCROLL ]
    CURSOR [ { WITH | WITHOUT } HOLD ] FOR query

In our example, the goal is to read 10 rows without having to keep all 10 rows in memory at once. Demo data can be created easily:

test=# CREATE TABLE t_test AS 
   SELECT * FROM generate_series(1, 10) AS x;
SELECT 10

In the next step, the data should be read by the application. The most important thing to do here is to wrap the operation inside a transaction (in this book, transactions will also be covered in detail in one of the later chapters). The cursor can then be easily defined:

test=# BEGIN;
BEGIN
test=# DECLARE mycur CURSOR FOR SELECT * FROM t_test;
DECLARE CURSOR

Note that creating the cursor does not do anything at this point. It merely means that PostgreSQL is now prepared to send a lot of data to you. The real work is done as soon as FETCH is called:

test=# FETCH NEXT FROM mycur;
 x 
---
 1
(1 row)

FETCH NEXT will yield one row from the result set. It is now possible to call FETCH NEXT for every row in the result. The trouble with FETCH NEXT is that it is really slow. If you fetch 1 million rows one by one, it means a lot of network overhead, because every FETCH is a separate round trip to the server. Therefore, it is highly recommended to fetch data in larger buckets. In general, buckets of several hundred to several thousand rows make perfect sense, as memory consumption and network overhead are then reasonably balanced. However, don't take this as a fixed guideline. Things depend on your type of application, your data, your network, and so on. It can make sense to experiment.

Here is how more than one row can be fetched at a time:

test=# FETCH 3 FROM mycur;
 x 
---
 2
 3
 4
(3 rows)

It is now possible to loop through the entire result as needed.
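
To illustrate how such a loop might look, here is one possible continuation of the session shown above, assuming the rows come back in the same order as before. The remaining rows are fetched in buckets of three until FETCH returns an empty result, which tells the application that the cursor has been exhausted:

test=# FETCH 3 FROM mycur;
 x 
---
 5
 6
 7
(3 rows)

test=# FETCH 3 FROM mycur;
 x  
----
  8
  9
 10
(3 rows)

test=# FETCH 3 FROM mycur;
 x 
---
(0 rows)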

To finish the operation, it is enough to commit the transaction. All cursors will be closed automatically:

test=# COMMIT;
COMMIT

It is important to note that after a COMMIT statement, the cursor is not there anymore:

test=# FETCH NEXT FROM mycur;
ERROR:  cursor "mycur" does not exist

By default, a cursor is only visible inside a transaction, and therefore, very long cursor-related operations can also cause issues such as replication conflicts. If you really want a cursor to survive a transaction, you have to create a so-called WITH HOLD cursor. In this case, however, you have to take care of cleaning up the cursor again by calling CLOSE:

test=# \h CLOSE
Command:     CLOSE
Description: close a cursor
Syntax:
CLOSE { name | ALL }

WITH HOLD cursors should be handled with care. The most important observation is that a WITH HOLD cursor has to materialize its result, which can have a lot of impact on the amount of storage required on the database server.
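
To give you an impression of how this works in practice, here is a small sketch using the t_test table from before (the cursor name, holdcur, is of course arbitrary). The cursor is created with WITH HOLD, survives the COMMIT, and has to be removed explicitly with CLOSE:

test=# BEGIN;
BEGIN
test=# DECLARE holdcur CURSOR WITH HOLD FOR SELECT * FROM t_test;
DECLARE CURSOR
test=# COMMIT;
COMMIT
test=# FETCH 3 FROM holdcur;
 x 
---
 1
 2
 3
(3 rows)

test=# CLOSE holdcur;
CLOSE CURSOR

Note that the result set is materialized when the transaction commits, so for a large result the COMMIT itself can take a while and consume a fair amount of space on the server.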

Synchronized scanning

Sometimes, reading is not about shipping data to the client but about processing it on the server side. This is exactly when synchronized seq scans come into play.

Let's imagine the following scenario: a 10 TB table contains some old reporting data. It is stored on a server containing a couple of traditional mechanical disks. Let's consider the following query:

SELECT error_type, count(*) FROM big_table GROUP BY 1;

Assuming that error_type has just a couple of different values, PostgreSQL will turn to a seq scan and process the query. The I/O system might deliver data at the rate of 500 MB per second. All disks will be spinning nicely.

Let's assume now that the query shown here has already processed 2 TB of data. Then somebody comes along and fires up a second, similar query. Even though the data has already been read by the previous job, it has definitely fallen out of the cache by now. Therefore, it has to be read once again. But things are even worse: those spinning disks will start jumping back and forth between the first and the second query. The more queries are added, the more random the disk access pattern becomes. After about 10 queries, the total throughput may drop to approximately 60 MB per second, which means 6 MB per second per database connection. Clearly, this is not an acceptable level of performance.

Synchronized seq scans come to the rescue. Let's assume the first query has already processed 2 TB of the table at the full speed of 500 MB per second. When a second, similar query kicks off on the same table, PostgreSQL will make it start reading at around the 2 TB mark, which is where the first scan currently is; once the second scan reaches the end of the table, it simply wraps around to the beginning to pick up the rows it skipped. The advantage of synchronizing these scans is that one can benefit from the cache entries generated by the other. Reading a block from disk once can, under good conditions, feed both (or even more) of the scans with data. Thus, performance is not 500 MB per second divided by two minus additional random access per scan, but ideally 500 MB per second for each of the scans. Running scans in parallel can improve speed dramatically by drastically reducing the I/O bottleneck and slowly turning it into a CPU bottleneck, which is a lot easier to scale.

The important thing is that synchronized sequential scans are on by default, and in general, it is a good idea to ensure that this setting is really set to on:

test=# SHOW synchronize_seqscans ;
 synchronize_seqscans 
----------------------
 on
(1 row)

With the default configuration, you can already enjoy their full potential.
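
If the parameter has been turned off at some point, it can simply be switched on again. The following lines are just a sketch showing how to do this for the current session; for a permanent change, the same parameter can be adjusted in postgresql.conf:

test=# SET synchronize_seqscans TO on;
SET
test=# SHOW synchronize_seqscans ;
 synchronize_seqscans 
----------------------
 on
(1 row)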

In many cases, it makes more sense to execute things in parallel than in sequence because many analytical jobs may share the same I/O.

Note

Keep in mind that this works only for sequential scans.
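
To check whether a query of this kind will actually use a sequential scan, and can therefore benefit from scan synchronization, EXPLAIN can be used. The following is merely a sketch, assuming the big_table from the earlier example exists; cost and row estimates have been stripped from the output, and the aggregation node on top may vary depending on your data:

test=# EXPLAIN SELECT error_type, count(*) FROM big_table GROUP BY 1;
            QUERY PLAN            
----------------------------------
 HashAggregate  (...)
   ->  Seq Scan on big_table  (...)

The important part is the Seq Scan node; a plan based on an index scan or a bitmap scan would not be synchronized.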
