Chapter 10. Replication and External Data

PostgreSQL has a number of options for sharing data with external servers or data sources. The first option is PostgreSQL’s own built-in replication, which allows you to maintain a ready copy of your server on another PostgreSQL server. The second option, introduced in 9.1, is the Foreign Data Wrapper, which allows you to query and copy data from many kinds of external data sources using the SQL/MED (Management of External Data) standard. The third option is to use third-party add-ons, many of which are freely available and time-tested.

Replication Overview

You can probably enumerate countless reasons for replicating, but they all boil down to two: availability and scalability. If your main server goes down, you want another to immediately assume its role. For small databases, you could just make sure you have another physical server ready and restore the database onto it, but for large databases (say, in the terabytes), the restore itself could take many hours. To avoid the downtime, you’ll need to replicate. The other main reason is scalability. You set up a database to handle your collection of fancy elephant beetles. After a few years of unbridled breeding, you now have millions of fancy elephant beetles. People all over the world now come to your site to check out the beetles, and you’re overwhelmed by the traffic. Replication comes to your aid: you set up a read-only slave server to replicate with your main server. People who just want to learn about your beetles will pull data from the slave. As your audience grows, you can add more slave servers.

Replication Lingo

Before we get too carried away with replication, we had better lay down some common terminology used in replication.

Master

The master server is the database server that is the source of the data being replicated and where all updates happen. As of now, you can have only one master when using the built-in replication features of PostgreSQL. Plans are in place to support multi-master replication in future releases of PostgreSQL.

Slave

A slave is a server to which data is copied. More aesthetically pleasing terms such as subscriber or agent have been bandied about, but slave is still the most apropos. PostgreSQL built-in replication currently supports only read-only slaves.

Write-ahead Log (WAL)

The WAL is the log that keeps track of all transactions; it’s often referred to as the transaction log in other databases. To set up replication, PostgreSQL simply makes the logs available for slaves to pull down. Once slaves have the logs, they just need to replay the transactions therein.

Synchronous

A transaction on the master is not considered complete until the synchronous slaves confirm they have received it, guaranteeing zero data loss on those slaves.

Asynchronous

A transaction on the master will commit even if the slaves haven’t yet been updated. This is useful for distant servers where you don’t want transactions to wait because of network latency, but the downside is that the dataset on the slave may lag behind, and the slave may miss transactions entirely in the event of transmission failure.

Streaming

The streaming replication model was introduced in 9.0. Unlike in prior versions, it does not require direct file access between master and slaves. Instead, it relies on the PostgreSQL connection protocol to transmit the WALs.

Cascading Replication

Introduced in 9.2, cascading replication lets slaves receive logs from a nearby slave instead of directly from the master. This allows a slave to behave like a master for replication purposes while still accepting only read-only queries.

PostgreSQL Built-in Replication Advancements

When you set up replication, the additional servers can be on the same physical hardware running on a different port, or in a cloud halfway around the globe. Prior to 9.0, PostgreSQL offered only asynchronous warm slaves. A warm slave retrieves the WAL and keeps itself in sync but is not available for queries; it acts only as a standby. Version 9.0 introduced asynchronous hot slaves, which users can run read-only queries against, as well as streaming replication, which ships logs over database connections rather than requiring direct file access between the servers. Finally, with 9.1, synchronous replication became a reality. Version 9.2 introduced cascading streaming replication, whose main benefit is reduced latency: it’s much faster for a slave to receive updates from a nearby slave than from a master far away. Built-in replication relies on WAL shipping to perform the replication. The disadvantage is that your slaves need to run the same version of PostgreSQL and the same OS to ensure faithful replay of the received logs.

Third-Party Replication Options

In addition to the built-in replication, common third-party options abound. Slony and Bucardo are two of the most popular open source ones. Although PostgreSQL improves replication with each new release, Slony, Bucardo, and other third-party options still offer more flexibility. Both allow you to replicate individual databases or even individual tables instead of the entire server, so they don’t require that all masters and slaves run the same PostgreSQL version and OS. Both also support multi-master scenarios. However, both rely on additional triggers to initiate the replication and often don’t support DDL commands such as creating new tables, installing extensions, and so on. This makes them more invasive than merely shipping logs. Postgres-XC, still in beta, is starting to gain an audience. Postgres-XC is not an add-on to PostgreSQL; rather, it’s a completely separate fork focused on providing a write-scalable, multi-master symmetric cluster, very similar in purpose to Oracle RAC. To this end, the raison d’être of Postgres-XC is not replication but distributed query processing; it is designed with scalability in mind rather than high availability.

We urge you to consult a comparison matrix of popular third-party options here: http://wiki.postgresql.org/wiki/Replication%2C_Clustering%2C_and_Connection_Pooling.

Setting Up Replication

Let’s go over the steps to set up replication. We’ll take advantage of the streaming replication introduced in 9.0, so that master and slaves need only be connected at the PostgreSQL connection level, rather than at the directory level, to sustain replication. We’ll also use features introduced in 9.1 that allow you to easily set up authentication accounts specifically for replication.

Configuring the Master

The basic steps for setting up the master server are as follows:

  1. Create a replication account.

    CREATE ROLE pgrepuser REPLICATION LOGIN PASSWORD 'woohoo';
  2. Alter the following configuration settings in postgresql.conf.

    wal_level = hot_standby
    archive_mode = on
    max_wal_senders = 10
  3. Use the archive_command setting to indicate where the WALs will be saved. With streaming, you’re free to choose any directory. More details on this setting can be found in the PostgreSQL pg_standby documentation.

    On Linux/Unix your archive_command line should look something like:

    archive_command = 'cp %p ../archive/%f'

    On Windows:

    archive_command = 'copy %p ..\archive\%f'
  4. In pg_hba.conf, add a rule allowing the slaves to act as replication agents. As an example, the following rule allows a PostgreSQL account named pgrepuser on a private network with IPs in the range 192.168.0.1 to 192.168.0.254 to replicate using an md5 password.

    host replication pgrepuser 192.168.0.0/24 md5
  5. Shut down the PostgreSQL service and copy all the files in the data folder EXCEPT for the pg_xlog and pg_log folders to the slaves. Make sure that the pg_xlog and pg_log folders are both present on the slaves, but devoid of any files.

    If you have a large database cluster and can’t afford a long shutdown while copying, you can use the pg_basebackup utility, located in the bin folder of your PostgreSQL install. It creates a copy of the data cluster files in a specified directory and allows you to take a base backup while the server is running and people are using the system, as shown in the sketch below.
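
    A minimal sketch of such a base backup, run from the slave (the host and user are from this example; the target directory is hypothetical, and -P merely reports progress):

    pg_basebackup -h 192.168.0.1 -U pgrepuser -D /var/lib/postgresql/slave_data -P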

Configuring the Slaves

To minimize headaches, slaves should have the same configuration as the master, especially if you’ll be using them for failover. In addition, a slave must be able to play back the WAL transactions of the master. The basic steps for setting up a slave are as follows:

  1. Create a new instance of PostgreSQL with the same version (preferably even the same micro version) as your master server, on the same OS at the same patch level. Keeping servers identical is not a strict requirement, and you’re more than welcome to take one for the team (TOFTT) and see how far you can deviate.

  2. Shut down the PostgreSQL service.

  3. Overwrite the data folder files with those you copied from the master.

  4. Set the following configuration setting in postgresql.conf:

    hot_standby = on
  5. You don’t need to run the slaves on the same port as the master, so you can optionally change the port either in postgresql.conf or via an OS-specific startup script that sets PGPORT before startup. Any such startup script will override the port set in postgresql.conf.

  6. Create a new file in the data folder called recovery.conf that contains the following lines:

    standby_mode = 'on'
    primary_conninfo = 'host=192.168.0.1 port=5432 user=pgrepuser password=woohoo'
    trigger_file = 'failover.now'

    Host name, IP, and port should be those of the master.

  7. Add to the recovery.conf file the following line, which varies, depending on the OS:

    On Linux/Unix:

    restore_command = 'cp %p ../archive/%f'

    On Windows:

    restore_command = 'copy %p ..\archive\%f'

    This command is needed only if the slave can’t replay the WALs fast enough, so it needs a location where they’re cached.
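
Putting steps 6 and 7 together, the complete recovery.conf for a Linux slave in this example would read:

    standby_mode = 'on'
    primary_conninfo = 'host=192.168.0.1 port=5432 user=pgrepuser password=woohoo'
    trigger_file = 'failover.now'
    restore_command = 'cp %p ../archive/%f'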

Initiate the Replication Process

  1. Start up the slave server first. You’ll get an error in the logs saying it can’t connect to the master; ignore it.

  2. Start up the master server.

You should now be able to connect to both servers. Any changes you make on the master, even structural changes like installing extensions or creating tables, should trickle down to the slaves. You should also be able to query the slaves, but only with read-only queries.
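
To verify that streaming is active, you can query the pg_stat_replication view on the master (available in 9.1 and later); you should see one row per connected slave:

    SELECT client_addr, state, sent_location FROM pg_stat_replication;

On a slave, SELECT pg_is_in_recovery(); returns true for as long as the server remains a slave.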

When and if the time comes to liberate a chosen slave, create a blank file called failover.now in the data folder of the slave. PostgreSQL will then finish playing back the WAL and rename recovery.conf to recovery.done. At that point, your slave is unshackled from the master and continues life on its own with all the data from the last WAL it received. Once the slave has tasted freedom, there’s no going back. To make it a slave again, you’ll need to go through the whole replication setup from the beginning.
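
For example, on a Linux slave whose data folder is /var/lib/postgresql/data (a hypothetical path; substitute your own), triggering the failover is as simple as:

    touch /var/lib/postgresql/data/failover.now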

Warning

Unlogged tables don’t participate in replication.
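
For example, rows written to a table like the following (the table is our own invention) exist only on the master:

    -- contents are skipped by the WAL, so they never reach the slaves;
    -- querying an unlogged table on a hot standby slave raises an error
    CREATE UNLOGGED TABLE beetle_page_hits (hit_time timestamptz, page text);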

Foreign Data Wrappers (FDW)

Foreign Data Wrappers are mechanisms for querying external data sources. PostgreSQL 9.1 introduced this SQL/MED standards-compliant feature. At the center of the concept is what is called a foreign table. In this section, we’ll demonstrate how to register foreign servers, foreign users, and foreign tables, and finally, how to query foreign tables. You can find a catalog of foreign data wrappers for PostgreSQL at PGXN FDW and PGXN Foreign Data Wrapper. You can also find examples of usage in PostgreSQL Wiki FDW. At this time, it’s rare to find FDWs packaged with PostgreSQL except for file_fdw. For wrapping anything else, you’ll need to compile your own or get them from someone who already did the work. In PostgreSQL 9.3, you can expect an FDW that will at least wrap other PostgreSQL databases. Also, you’re currently limited to SELECT queries against an FDW, but this will hopefully change in the future so that you can use them to update foreign data as well.

Querying Simple Flat File Data Sources

We’ll gain an introduction to FDW using the file_fdw wrapper. To install, use the command:

CREATE EXTENSION file_fdw;

Although file_fdw can only read from files on your local server, you still need to define a server for it. You register an FDW server with the following command:

CREATE SERVER my_server FOREIGN DATA WRAPPER file_fdw;

Next, you have to register the tables. You can place foreign tables in any schema you want. We usually create a separate schema to house foreign data. For this example, we’ll use our staging schema.

Example 10-1. Make a Foreign Table from a Delimited File

CREATE FOREIGN TABLE staging.devs (developer VARCHAR(150), company VARCHAR(150))
SERVER my_server
OPTIONS (format 'csv', header 'true', filename '/postgresql_book/ch10/devs.psv', delimiter '|', null '');

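For reference, the first few lines of a pipe-delimited file like devs.psv might look as follows (the rows are hypothetical; the first line is the header row that the header 'true' option expects):

developer|company
Tobias Tester|Example Corp
Angela Hacker|Acme Analytics
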
When all the setup is finished, we can finally query our pipe-delimited file directly:

SELECT * FROM staging.devs WHERE developer LIKE 'T%';

Once we no longer need our foreign table, we can drop it with the basic SQL command:

DROP FOREIGN TABLE staging.devs;
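
If the server and the extension are no longer needed either, those can be dropped as well:

DROP SERVER my_server;
DROP EXTENSION file_fdw;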

Querying More Complex Data Sources

The database world does not appear to be getting more homogeneous. We’re witnessing exotic databases sprouting up left and right. Some are fads that go away. Some aspire to dethrone the relational databases altogether. Some could hardly be considered databases. The introduction of foreign data wrappers is in part a response to the growing diversity. Resistance is futile. FDW assimilates.

In this next example, we’ll demonstrate how to use the www_fdw foreign data wrapper to query web services. We borrowed the example from www_fdw Examples.

Note

The www_fdw foreign data wrapper is not generally packaged with PostgreSQL installs. If you are on Linux/Unix, it’s an easy compile if you have the PostgreSQL development packages installed. We did the work of compiling for Windows—you can download our binaries here: Windows-32 9.1 FDWs.

The first step to perform after you have copied the binaries and extension files is to install the extension in your database:

CREATE EXTENSION www_fdw;

We then create our Twitter foreign data server:

CREATE SERVER www_fdw_server_twitter
FOREIGN DATA WRAPPER www_fdw
OPTIONS (uri 'http://search.twitter.com/search.json');

The default format supported by the www_fdw is JSON, so we didn’t need to include it in the OPTIONS modifier. The other supported format is XML. For details on additional parameters that you can set, refer to the www_fdw documentation. Each FDW is different and comes with its own API settings.

Next, we define at least one user for our FDW server. All users that connect to our server should be able to access the Twitter server, so we create one mapping for the entire public group.

CREATE USER MAPPING FOR public SERVER www_fdw_server_twitter;

Now we create our foreign table:

Example 10-2. Make a Foreign Table from Twitter

CREATE FOREIGN TABLE www_fdw_twitter (
  /* parameters used in request */
  q text, page text, rpp text, result_type text,
  /* fields in response */
  created_at text, from_user text, from_user_id text, from_user_id_str text,
  geo text, id text, id_str text,
  is_language_code text, profile_image_url text,
  source text, text text, to_user text, to_user_id text)
SERVER www_fdw_server_twitter;

The user mapping doesn’t imply rights. We still need to grant rights before we can query the foreign table.

GRANT SELECT ON TABLE www_fdw_twitter TO public;

Now comes the fun part. Here, we ask for page two of any tweets that have something to do with postgresql, mysql, and nosql:

SELECT DISTINCT left(text, 75) As part_txt
FROM www_fdw_twitter
WHERE q = 'postgresql AND mysql AND nosql' AND page = '2';

Voilà! We have our response:

                                  part_txt
-----------------------------------------------------------------------------
 MySQL Is Done. NoSQL Is Done. It's the Postgres Age http://t.co/4DfqG75d
 RT @mjasay: .@451Research: <0.002% of paid MySQL deployments being repla
 @alanzeino: I know MySQL... but anyone with a brain is using PostgreSQL
 Hstore FTW! RT @mjasay: .@451Research: <0.002% of MySQL deployments bein
 @al3xandru: MySQL Is Done. NoSQL Is Done. It's the Postgres Age http://t