Chapter 12. Change Data Capture

Raghotham Murthy

Change data capture (CDC) is a solution to a specific problem. You have your most valuable data in your production databases. You want to analyze that data, but you don’t want to add any more load to those production databases. Instead, you want to rely on a data warehouse or data lake. Once you decide that you want to analyze data from your production database in a different system, you need a reliable way to replicate that data from the production database to your data warehouse.

It turns out that, at scale, this is a hard problem to solve. You can't simply copy entire tables from the production database to the warehouse; that would put significant extra load on the production database, especially if you want high fidelity. And if you fetched only the records that changed, you would miss deletes.

Thankfully, all modern production databases write out a write-ahead log (WAL), or change log, as part of their normal transaction processing. This log captures every single change to each row/cell in each table in the database and can be used in database replication to create replicas of the production database. In CDC, a tool reads this write-ahead log and applies the changes to the data warehouse. This technique is a lot more robust than batch exports of the tables and has a low footprint on the production database.
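The mechanics can be sketched with a toy example. The event shape below is hypothetical, not any particular tool's format; the point is that the change log carries inserts, updates, and deletes in commit order, so the warehouse copy converges to the source:

```python
# Toy illustration: applying a stream of WAL-style change events to a
# warehouse copy of a table, keyed by primary key. The event format is
# a hypothetical simplification, not a specific tool's output.

def apply_changes(table, events):
    """Apply insert/update/delete events to a dict keyed by primary key."""
    for ev in events:
        if ev["op"] == "insert":
            table[ev["pk"]] = ev["row"]
        elif ev["op"] == "update":
            table[ev["pk"]].update(ev["row"])
        elif ev["op"] == "delete":
            # Deletes appear in the log too, which is exactly what a
            # naive "fetch the changed rows" batch export would miss.
            table.pop(ev["pk"], None)
    return table

events = [
    {"op": "insert", "pk": 1, "row": {"email": "a@example.com", "plan": "free"}},
    {"op": "update", "pk": 1, "row": {"plan": "pro"}},
    {"op": "insert", "pk": 2, "row": {"email": "b@example.com", "plan": "free"}},
    {"op": "delete", "pk": 2},
]
replica = apply_changes({}, events)
```

After replay, the replica holds only row 1, with its updated plan; row 2 was inserted and then deleted, and both facts came through the log.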

However, you have to treat CDC as an end-to-end data pipeline in order to correctly replicate data initially as well as on an ongoing basis. You need to consider several aspects of the CDC connector life cycle. Here are a few examples:

Scale
The CDC pipeline has to be robust enough to handle high data volumes. In PostgreSQL, for example, if the CDC reader falls behind, the database retains the unread WAL on disk and can eventually run out of disk space!
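One way to watch for this in PostgreSQL is to track how much WAL a replication slot is forcing the server to retain. A sketch, using the documented LSN format (a "hi/lo" pair of hex numbers encoding a byte position); the SQL in the docstring uses the real `pg_wal_lsn_diff` and `pg_replication_slots` catalog, but wiring it into your monitoring is left as an assumption:

```python
# Sketch: estimating WAL retained on disk for a lagging replication slot.
# A PostgreSQL LSN is written "hi/lo" in hex; the byte position is
# hi * 2**32 + lo.

def lsn_to_bytes(lsn: str) -> int:
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def retained_wal_bytes(current_lsn: str, slot_restart_lsn: str) -> int:
    """Bytes of WAL the server must keep for a slot that stopped reading.

    The equivalent server-side query is:
        SELECT slot_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
        FROM pg_replication_slots;
    """
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(slot_restart_lsn)
```

If the CDC reader stalls, this number only grows, so alert on it well before the disk fills.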
Replication lag
This refers to the duration between the time a transaction is committed in the primary database and the time it becomes available in the data warehouse. Build checks that verify lag is within acceptable bounds before downstream transformations are run.
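Such a check can be as simple as a freshness gate in front of the transformation job. This sketch assumes the pipeline records, per table, the commit timestamp of the last transaction applied to the warehouse (the watermark lookup is a hypothetical helper):

```python
# Sketch: a freshness gate run before transformations. Assumes the CDC
# pipeline tracks, per table, the commit timestamp of the last applied
# transaction; "warehouse_watermark" below is a hypothetical helper.
from datetime import datetime, timedelta, timezone

def is_fresh(last_commit_applied: datetime,
             max_lag: timedelta = timedelta(minutes=15)) -> bool:
    """True if replication lag is within the allowed bound."""
    lag = datetime.now(timezone.utc) - last_commit_applied
    return lag <= max_lag

# Gate the transformation run on freshness, for example:
# if not is_fresh(warehouse_watermark("orders")):
#     raise RuntimeError("orders is stale; skipping transformation run")
```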
Schema changes
Over time, database schemas evolve because tables or columns are added or removed or types are updated. It is important to propagate the schema changes to the data warehouse. Sometimes a schema change might require a historical sync.
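Additive changes can often be propagated automatically, while type changes are the ones that may force a historical sync. A sketch of that comparison, where the table name and type strings are illustrative assumptions:

```python
# Sketch: propagating schema changes to the warehouse. Additive columns
# become ALTER statements; type changes are flagged because they may
# require a historical re-sync. Names and types here are illustrative.

def additive_schema_changes(table, source_cols, warehouse_cols):
    """Return statements for schema drift between source and warehouse.

    source_cols / warehouse_cols: dicts mapping column name -> type.
    Columns dropped at the source are typically kept in the warehouse
    (as NULLs going forward) to preserve history.
    """
    stmts = []
    for name, col_type in source_cols.items():
        if name not in warehouse_cols:
            stmts.append(f"ALTER TABLE {table} ADD COLUMN {name} {col_type}")
        elif warehouse_cols[name] != col_type:
            stmts.append(f"-- {table}.{name}: type changed "
                         f"({warehouse_cols[name]} -> {col_type}); "
                         f"may require a historical sync")
    return stmts
```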
Masking
You have to mask sensitive columns for compliance purposes.
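One common approach is to replace sensitive values with a keyed hash before loading, which hides the raw data while keeping values joinable across tables. The column list and salt handling below are illustrative assumptions, not a prescription:

```python
# Sketch: masking sensitive columns before load. A keyed hash (HMAC)
# hides raw values but stays deterministic, so joins across tables
# still work. The column set and salt handling are assumptions.
import hashlib
import hmac

SENSITIVE = {"email", "phone"}

def mask_row(row: dict, salt: bytes) -> dict:
    masked = {}
    for col, val in row.items():
        if col in SENSITIVE and val is not None:
            digest = hmac.new(salt, str(val).encode(), hashlib.sha256)
            masked[col] = digest.hexdigest()
        else:
            masked[col] = val
    return masked
```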
Historical syncs
Before applying the CDC changes, an initial historical sync of the tables is needed. This sync can take a long time and can overload the source, so it's better to run historical syncs against a replica database, which speeds them up and reduces load on the primary. And because the WAL stream can be partially interrupted, you'll want to support partial historical syncs of only the affected tables, which recover much faster than a full sync.
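Resumability is what makes partial syncs practical. One common pattern, sketched here under the assumption of a numeric primary key, is to walk the key space in chunks and checkpoint the last completed chunk, so an interrupted sync restarts from where it left off instead of from scratch:

```python
# Sketch: chunking a historical sync over a numeric primary-key range,
# so an interrupted sync can resume from the last completed chunk.
# Table and column names in the comment below are illustrative.

def key_chunks(min_id: int, max_id: int, chunk_size: int):
    """Yield inclusive (lo, hi) key ranges covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk_size - 1, max_id)
        yield (lo, hi)
        lo = hi + 1

# Each chunk would be fetched from a read replica, e.g.:
#   SELECT * FROM orders WHERE id BETWEEN lo AND hi
# with the last completed chunk checkpointed for resumability.
```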

There are typically no strong reasons to build your own CDC connectors. Use existing tools instead!
