Now that you are familiar with Spanner, it’s time to look more closely at its underlying concepts and features. As with any database, it is beneficial to do a deep dive before starting to work with it. This chapter explores the concepts that are important when working with Spanner, starting with instances. The chapter covers pointers for databases, tables, and keys. Along the way, it also covers the way operations are handled in Spanner.
Data modeling concepts
Design considerations to keep in mind when modeling with Spanner
Data replication
Transactions and understanding concurrent transactions, with an example
Operations in the distributed environment
The next section begins with instances.
Instance
The resources are allocated during instance creation and are controlled by two main configurations. These configurations decide the compute and the way data is replicated, stored, and computed.
As you can see, you get equal resources (both compute and storage) across all zones. Within each zone, data of all the databases of the instance will be distributed across these nodes, which are responsible for processing their part of the data.
Note
Although nine nodes might seem like overkill, this is in line with the Cloud Spanner guarantee of rich querying, strong consistency, high availability, and performance.
Later sections will discuss the way data distribution happens across the nodes. The next section looks at the data modeling concepts found in Spanner.
Data Modeling
Cloud Spanner behaves like an RDBMS in many ways but with a few differences. This section explains Spanner’s data modeling concepts, covering the differences wherever applicable. It begins with databases and tables.
Databases and Tables
Databases are created to hold related data together in the form of tables.
Tables store data in a structured format using rows and columns.
Schemas are predefined with a number of columns, column names, and data types and size (if applicable).
Spanner Supported Data Types
Category | Data Type |
---|---|
Number | INT64, FLOAT64 |
Text | STRING |
Date Time | TIMESTAMP, DATE |
Boolean | BOOL |
Bytes | Bytes |
Complex | ARRAY |
Primary Keys
Primary key creation in RDBMS is not mandatory but in Spanner it’s mandatory.
The primary key of RDBMS contains one or more columns. The Cloud Spanner primary key can have zero columns as well. In this case, the table will contain only one row.
The primary key of RDBMS can never be NULL; however, Spanner’s primary key column can be NULLABLE. Since it’s a unique identifier, only one row can have the NULL value.
Artists Table Schema
Column Name | Data Type | |
---|---|---|
Artist ID | INT64 | Primary Key |
FirstName | STRING [10] | |
LastName | STRING [20] | |
EMailId | STRING [50] |
Artists Data
Artist ID | First Name | Last Name | EMailId |
---|---|---|---|
1 | Mike | Jonas | |
2 | Steffi | Graph | |
3 | Stephan | Chart | |
4 | Leslie | John | |
5 | Kate | Charlie |
Being distributed, Spanner partitions the data automatically and distributes across its allotted node. The partitioning is done by Spanner along the primary key dimension. So, it’s important to choose the primary key carefully.
Choose a Key to Avoid Hotspots
This scenario illustrates hot spotting, as the load weighs down a single node only. This effectively kills parallelization in Spanner , therefore affecting performance. In order to handle such scenarios, you should avoid keys that are sequentially ordered or monotonically increasing, like the ArtistId column. Instead, you should choose a key that evenly distributes data and enables even distribution of workload (reads and writes) and does not create a hotspot.
However, if it is unavoidable and you have to use a sequentially ordered key, you should see if you can hash it. Hashing has the effect of distributing across different servers quite evenly. In the previous example, if you hash Artist IDs, even distribution can be achieved as the hash values of the ArtistId columns will contain random alphanumeric data, which is highly unlikely to be monotonically increasing or decreasing. As a result, the representation will be free of hot spotting.
Note
There are various other techniques defined on the product site that you should take into consideration for efficient selection of the key.
The next section looks at the way related tables should be created for efficient querying .
Interleaving
Spanner enables you to choose the physical representation of your table data based on your querying pattern. In order to handle relationships where tables are related and are frequently queried together, Spanner enables you to define hierarchical (parent-child) relationship between the tables, which enables efficient retrieval as the data of the hierarchically related tables will be co-located and stored together on the disk.
Note
The parent/child relationship can be specified between the tables in up to seven levels of nesting.
The relationship is defined by creating the child table (table containing related information) as an interleaved table of the parent (main) table.
Albums Table Schema
Column Name | Data Type | |
---|---|---|
AlbumId | INT64 | Primary Key |
ArtistId | INT64 | |
Album Title | STRING [50] |
Albums Data
Album ID | Artist ID | Album Title |
---|---|---|
1 | 1 | Junk yard |
2 | 1 | Go On |
3 | 2 | Girls like you |
4 | 2 | Color Green |
5 | 2 | Purple |
Albums Table Schema
Column Name | Data Type | |
---|---|---|
ArtistId | INT64 | Primary Key (Column 1) |
AlbumId | INT64 | Primary Key (Column 2) |
Album Title | STRING [50] |
As you can see, the data is interleaved. Every row from the Artists table is followed by all its rows from the child Albums table, which is determined by the ArtistId column value. This entire representation looks like a big flat table.
In effect, this process of inserting the child rows between the parent rows along the primary key dimension is called interleaving. The child tables are called interleaved tables and the parent tables are called the root table or top-level tables. As can be seen, this will enable fast querying, as data that’s queried together is stored together.
So while designing your schema, it is important to identify the querying pattern to identify the tables that need to be co-located. That way, when you’re actually creating the tables, proper interleaving can be defined.
You are now familiar with two important design decisions—choosing the proper primary key to avoid hot spotting and identifying the querying pattern to interleave the related tables. The next step is to understand another interesting concept, splits .
Splits
As you know by now, Cloud Spanner uses the primary key dimension to partition the data. Splits determine the point within the dimension where the partitions should happen. In effect, splits are used by Cloud Spanner to logically group the rows so that they are independent of each other and can be moved around independently without impacting the other group of data and performance.
Although you don’t control when or how to split, you can control the way you define the root table and the row tree. In effect, it’s important to choose the correct root table while designing the database to enable proper scaling.
These are groups of rows that can be moved around without affecting any of the other rows. Splits complement the data distribution in Spanner. Having looked at important concepts of choosing a proper primary key, interleaving, and splits, it’s time to look at the indexes .
Secondary Indexes
As you know by now, data is stored in sorted order by the primary key, so any query involving the primary key field will always result in faster querying. However, querying on other non-key fields will lead to full table scan, leading to slower query performances. In order to cater to this scenario, like other databases, Spanner enables you to create secondary indexes on the non-key fields.
ArtistsByLastName(Ascending)
Last Name | Artist ID |
---|---|
Charlie | 5 |
Chart | 3 |
Graph | 2 |
John | 4 |
Jonas | 1 |
So, if you query Fetch Artist ID's with LastName values in the Range specified, this index will be able to return results faster without any table scan as all the data is available within the index itself.
Storing Clause
Now you can rephrase the query to Fetch Artists Id and FirstName where the LastName value is in the range specified. Since in this case FirstName is not there in the index, this query will require a JOIN with the Artists table to fetch the FirstName, leading to a negative impact on query performance.
ArtistsByLastName (Ascending) with FirstName Field Within the STORING Clause
Last Name | Artist ID | First Name |
---|---|---|
Charlie | 5 | Kate |
Chart | 3 | Stephan |
Graph | 2 | Steffi |
John | 4 | Leslie |
Jonas | 1 | Mike |
Interleaved Indexes
AlbumsByNameForEachArtist
Artist ID | Album Name | Album ID |
---|---|---|
1 | Go On | 2 |
1 | Junk Yard | 1 |
2 | Color Green | 4 |
2 | Girls Like You | 3 |
2 | Purple | 5 |
As with interleaved tables, this enables co-location of related rows near each other, in effect improving performance, as the data to be referred together is always placed together.
So, indexes can be really powerful as they make the querying faster by saving data in the format you need. However, the implications of secondary indexes are that you end up paying a storage penalty, as some data is going to be stored multiple times, like with interleaved indexes and the STORING clause. This has a further impact on the write performance.
With all the concepts in place, it’s time to look at the way data will be distributed and replicated, with an example.
Replication
Replicate the data based on the instance configuration chosen
Distribute the data based on the nodes allocated
Split the data along the primary key dimension
TableA Schema
Column Name | Data Type | |
---|---|---|
ID | INT64 | Primary Key |
Value | STRING |
TableA Split Details
Split | Key Range |
---|---|
0 | <100 |
1 | >=100 and <500 |
2 | >=500 and <1000 |
3 | >=1000 and <2000 |
4 | >=2000 and <2500 |
5 | >=2500 and <4000 |
6 | >=4000 and <7000 |
7 | >=7000 |
Note
This split distribution is transparent to the user and is handled automatically by Cloud Spanner. The section is for understanding purposes only.
Data is stored in the form of splits.
A node hosts one or many splits. This will be the case when the number of splits is more than the number of nodes allotted to the instance.
Replication is at the split level.
The split copies (or replicas), which are replicated across independent failure zones, are determined by the instance configuration. In this example, you have three replicas for each split.
Replica Types
Read/Write | Read Only | Witness |
---|---|---|
Replica holds full copy of the split | Replica holds full copy of the split | Do not maintain a full copy of data |
Data is always up-to-date | Data can be stale | Not applicable |
Serves both read and write traffic | Server read traffic only These nodes enable us to scale our read capacity | Serves neither read nor writes |
Can elect the leader and vote for whether the write is to be committed or not | Cannot participate in the leadership election nor can vote for committing writes; this implies increasing read replicas doesn’t increase the quorum size required for writes nor its location contributes to write latency | Can elect the leader and vote for whether the write is to be committed or not. These replicas make it easier to achieve quorums for writes without the storage and compute resources |
Can be the leader as well | Cannot be a leader | Cannot be a leader |
Spanner uses synchronous , Paxos (consensus protocol) based replication schemes where, whenever a write is issued, every voting replica votes before the writes can be committed. The major concern Cloud Spanner addresses is database replication on a global scale and provisioning of data consistency, even when multiple users conduct transactions that require the system to connect to datacenters across the globe. Cloud Spanner eliminates the bottleneck by using Paxos, which relies on an algorithm to create consensus among globally distributed nodes to conduct the commit. Later in the chapter, you will delve deeper into the way this works while carrying out the read/write operations in Spanner. Prior to that, it’s important that you understand the way operations happen.
Transactions
In Spanner, all operations—whether they are read, write, or read and write both—happen within a transaction. Spanner supports two modes of transaction: read/write and read only. These transaction modes differ in the way they handle access to the data. Let’s begin with the read/write transaction mode.
Read/Write
Savings Account Table Schema
Column Name | Data Type |
---|---|
Account ID | STRING |
Amount | INT64 |
Last Operation | STRING |
Last Modified Date | DATE |
As you can see, the read acquires a shared lock. Within the database at any point in time, you can have multiple concurrent transactions running. This lock enables the other transactions to continue reading. In effect, it does not block the other reads. When commit is initiated, it implies that the transaction is ready with the data to be applied back to the database. At this point it tries to acquire an exclusive lock on the data point. This will block all the new read requests for the data and will wait for the existing locks to be cleared. Once all the locks are cleared, it places the exclusive lock and applies all the writes. Once this is done, the lock is released. The Amount field will show the updated data.
Note that the updates in Cloud Spanner is not in-place updates, instead Spanner uses the MVCC (Multi Version Concurrent Control) method to modify the data. This implies that the writes do not overwrite the data. Instead, an immutable copy is created, which is timestamped with the write’s transaction commit time. In effect, Cloud Spanner maintains multiple immutable versions of data. The next section looks at the read only transaction mode.
Read Only
Accounts Data
Account ID | Amount | Last Operation | Last Modified Date |
---|---|---|---|
1 | $1000 | Credit | 23rd August 2019 |
2 | $90 | Debit | 18th August 2019 |
As seen, you can have reads only with no updates. In general, you can use the reads to process and display the output to the end users. When you have multiple reads happening together, it’s best to use the read only transaction mode. In TxnR all reads will return the result as $1000, as the Begin transaction has captured a snapshot of the data, and all reads refer to that snapshot.
There might be cases where you need to do single read calls or need to read data in parallel. Cloud Spanner enables you to do this with single reads options, which are reads outside the context of a transaction. The single reads are the simplest and fastest operations in Spanner.
By default, the reads within a read only transaction and single reads pick the most recent timestamp; however, Cloud Spanner enables you to control the freshness of the data. At any point in time, Cloud Spanner can perform reads up to one hour in the past.
Note
As you saw earlier, write leads to a creation of an immutable copy. In effect, at any given time, you will have multiple copies maintained. Cloud Spanner continuously garbage collects the old copy in the background to reclaim the storage space. This process is known as “version GC”. By default, version GC reclaims the versions after they are one hour old. Because of this, Cloud Spanner cannot perform reads at read timestamps more than one hour in the past.
In effect, Cloud Spanner enables you to do two types of reads: strong reads and stale reads. Strong reads are the default reads that get the freshest data, wherein reads get to see the effect of all transactions that were committed prior to when the read started. Stale reads enable you to read past data and are further categorized into two types: exact staleness or bounded staleness. Exact staleness reads data at a user-specified timestamp, e.g., 2019-09-16T15:01:23. In bounded staleness, instead of specifying a timestamp, the user specifies a bound, such as 7s. Whenever the transaction is run based on the bound specified, Spanner chooses the newest timestamp within the bound. Bounded staleness is slower than the exact staleness reads; however, they return the freshest data. The read type is controlled by specifying the timestamp bound on the reads.
The read types are useful when you use read replicas to scale out reads. The read replicas can be a bit outdated. If the application is willing to tolerate staleness, then it can get faster responses, as they will be able to read from any nearest read replica and will not have to communicate with the leader replica, which may be geographically distant.
Now that you are familiar with the transaction modes available, let’s next look at the way Spanner handles concurrent transactions across its distributed environment.
Handling Multiple Transactions
Cloud Spanner provides the strictest concurrency control, which is external consistency. This implies that multiple transactions can run side-by-side on different servers (which are possibly spread out across different datacenters in different time zones) without causing any inconsistencies. Proper timestamping is mandatory for achieving external consistency. For this, Spanner uses TrueTime, a Google developed technology. This is a highly available, distributed clock enabling applications running on Google Servers to generate monotonically increasing timestamps without any need for global communication.
Salary Table Schema
Column Name | Data Type |
---|---|
Employee ID | INT64 |
Month | STRING |
Salary | INT64 |
Processed On | DATE |
Advance Taken Table Schema
Column Name | Data Type |
---|---|
Employee ID | INT64 |
Month | STRING |
Advance Taken | INT64 |
Is Adjusted | BOOLEAN |
Processed On | DATE |
Reimbursement Table Schema
Column Name | Data Type |
---|---|
Employee ID | INT64 |
Month | STRING |
Reimbursement Amount | INT64 |
Is Processed | BOOLEAN |
Processed On | DATE |
Read/Write Transactions
Sequential Occurrence
As TrueTime is used to ensure proper timestamping, TrueTime guarantees that because Tx2 starts to commit after Tx1 finishes, all readers of the database will observe that Tx1 occurred before Tx2. This guarantees no update is ever lost.
Simultaneous Occurrence
In this process, the Tx2 commit fails and will be retried later. This is done to avoid deadlock. In addition, this also guarantees that no updates are ever lost if multiple concurrent transactions start, as at any point in time only one will be updating the data and the other will retry after some time. Under the hood, the age of each transaction is used to resolve the conflicting lock issue. The younger transaction will wait, but an older transaction will wound (abort) a younger transaction. By giving priority to the older transaction, Spanner ensures that eventually every transaction gets a chance to update.
Data Invalidated
Since Tx2 started earlier than the Tx1 commit, it will read the Salary value, which is already invalidated due to changes made by Tx1. In this case as well, Tx2 will be terminated so that the update of Tx1 is not overwritten. In effect, prior to making the update, the transaction waits for the read locks to be released and also validates that all the read locks it acquired within the transaction are still valid. This implies no change is made to it after it has read the data.
Different Fields Updates
Transaction Tx1 is working on the Salary field, whereas Tx2 works on the Processed On field. It is quite likely that both transactions will attempt to update the same row of data simultaneously. However, unlike the previous scenario, where transaction Tx2 was aborted to avoid the deadlock situation, here, both transactions will be executed successfully. This is due to the fact that Cloud Spanner locks at the cell level (a row and a column). The row is the same but the columns that both transactions are working on are different. So, there will be no conflict, as the cells that will be updated are different.
Blind Writes
Let’s now look at the scenario of concurrent read/write transactions, where you have only write operations that are not followed by any reads. Such operations are called blind writes. Inserting the current month’s salary details in the Salary table is an example. Write locks are usually exclusive, where only one writer updates the data. Although this prevents data corruption by preventing the race condition on the data, it does impact the throughput.
With blind writes in Spanner, the write locks are allowed to be shared. This is possible due to usage of TrueTime. TrueTime guarantees a unique timestamp to each transaction, so multiple blind writes can run in parallel, thereby updating the same data without worrying about data corruption or loss. This feature not only avoids race conditions but also brings in significant throughput for blind writes .
Read/Write with Read Only
Salary Data at t0
Employee ID | Month | Salary | Processed On |
---|---|---|---|
1 | Aug 2019 | $1000 | 23rd Aug 2019 |
Salary Data at t1
Employee ID | Month | Salary | Processed On |
---|---|---|---|
1 | Aug 2019 | $500 | 23rd Aug 2019 |
Salary Data at t2
Employee ID | Month | Salary | Processed On |
---|---|---|---|
1 | Aug 2019 | $10000 | 23rd Aug 2019 |
In Txr1, all reads will show values, as on timestamp t0, which is $1000 only, as it started prior to the Tx1 commit and the snapshot taken was of timestamp t0. All reads within the transaction return the value from the snapshot, so even though the second read is after the Tx1 commit timestamp, the changes will not be reflected in this transaction read.
The change of the Tx1 commit will be visible in the Txr2 read, as the snapshot for this transaction is after the Tx1 commit. Every read in this transaction will show $500. As with Txr1, reads in Txr2 will not show the changes made in transaction Tx2.
This guarantees safety against non-repeatable reads and dirty reads. Non-repeatable reads are reads within a transaction that see different values due to interleaved read/write transactions. Dirty reads are reads that see a change that has not yet been committed.
Having looked at the way operations happen in Spanner, it’s time to go back and look at the way Spanner carries out the read/write operations across its distributed dataset.
Distributed Transactions Explained
TableA Schema
Column Name | Data Type | |
---|---|---|
ID | INT64 | Primary Key |
Value | STRING |
TableA Split Details
Split | Key Range |
---|---|
0 | <100 |
1 | >=100 and <500 |
2 | >=500 and <1000 |
3 | >=1000 and <2000 |
4 | >=2000 and <2500 |
5 | >=2500 and <4000 |
6 | >=4000 and <7000 |
7 | >=7000 |
Let’s begin with writes.
Writes
Involving a single split. An update statement is issued for row ID 7.
Involving multiple splits. An update value of row IDs 410, 500, 1300, and 2300.
Step1: Identifies the leader replica for the split. In this example, the split is 0 and the leader replica resides on node 1.
Step2: The leader replica tries to acquire an exclusive lock on its local copy of the data. In addition, it forwards requests for agreement on committing to voting replicas.
Step 3: If the lock is acquired and a majority of votes are received, it tries to assign a timestamp for commit and wait for its certainty. In parallel, it sends the writes to the other replicas hosting the data copy. Replicas store the writes to stable storage and communicate back to the leader. Once the majority of replicas have responded and the commit wait is over, the leader is sure the commit will begin.
Step 4: Communicates the timestamp back to the client and leader. The other replicas start applying the changes.
Step 5: After completion, the locks are released.
Step 1: Identify the leader replica for the splits. The splits are 1, 2, 3, and 4. The leader replicas reside across nodes 1, 2, and 3.
Step 2: Since multiple splits are involved, one split leader picks up the role of the coordinator and the remaining leaders become the participants. For this example, assume that split 1 becomes the coordinator. As coordinator, its job is to ensure the atomicity of the transaction—the changes are applied on all the splits or no splits at all.
Step 3: The first phase of the two phase commit.
Like the single split reads, all the respective split leaders (irrespective of the role) try to acquire an exclusive lock on the local data copy. In this example, the following splits tries to acquire a write lock on the row IDs Split 1 => Row Id 410, Split 2 => Row Id 500, Split 3 => Row Id 1300, and Split 4 =>Row Id 2300.
Each split leader records its set of locks by replicating them to at least a majority of its split replicas. Once they are able to hold the locks, they notify the coordinator.
If all the split leaders are able to successfully acquire the lock, the coordinator moves to the next step.
Like in a single split read, the coordinator then tries to assign a timestamp to the commit and wait for its certainty. In parallel, it sends the writes to all the replicas. Replicas store the writes to its stable storage and communicate back to the leader. Once the majority has responded and the commit wait is over, the second phase of the commit begins.
Step 4: The second phase of the two phase commit.
The coordinator communicates the outcome to all the participant’s leaders. The participant’s leaders communicate the same to all its replicas.
The coordinator and all its participants apply the changes to their respective data. The coordinator leader communicates to the client that the transaction has been committed.
Step 5: After the completion, the locks are released
Irrespective of whether the write involves a single node or is multi-node, a write is marked to be committed only when write-quorum is achieved or, in other words, the majority of voting replicas (replicas who are eligible to vote for a write commit) agree to commit the write. Writes always happen on the leader replica and the non-witness replicas of the replica set. The next section covers read operations.
Reads
As you already know by now, reads can be part of read/write transactions as well as read only transactions. Reads that are part of the read/write transactions will always be served by the leader replica, as it requires locking. Reads that are part of read only transactions (or single reads) can be served from any replica. The need to communicate with the leader replica depends on the read type method: strong or stale.
- 1.
Splits are identified. In this case, the splits are 0, 1, and 2. Being a strong read, the current TrueTime timestamp is picked and the request is sent to any replica of the splits 0, 1, and 2, along with the read timestamp.
- 2.This step depends on the replica type. In this case, you have only two possibilities—the replica is a leader or is not a leader.
- a)
If the replica turns out to be the leader, reads can proceed directly, as data is always up to date on the leader.
- b)If it’s not a leader replica, there are two new possibilities. The data is updated or is not updated.
- i.
If the data is updated, reads are served as is.
- ii.
If it’s not up to date, it interacts with the leader to get the latest update. Once the updates are applied, the data state becomes the latest and the reads are served.
- i.
- a)
- 3.
Results received from the replicas are combined and returned to the client.
If you do stale reads, then you can proceed without Step 2.b.(ii).
Summary
This chapter covered the way data is stored and looked at important modeling concepts, including a brief understanding of the way transactions are handled in Spanner. The next chapter starts on developing with Spanner.