Chapter 3

Concepts

It’s time to dive deeper into CockroachDB! This chapter explores some of CockroachDB’s building blocks, including data types, indexes, and geo-partitioning.
First, let’s explore the top-level objects that make up CockroachDB.
Database Objects
At the top level of the CockroachDB object hierarchy, there are databases. Databases contain schemas that, in turn, contain schema objects like tables and views. Every new cluster comes with three preloaded databases:
defaultdb – The database that clients connect to if they don’t provide a database name in their connection configuration. This database is empty and provides no contextual information to the schema objects contained within it. For that reason, it’s best to avoid using this database and instead create and reference named databases explicitly. If you don’t need this database, you can delete it.
postgres – The postgres database exists to provide compatibility with Postgres clients. If you don’t need this database, you can delete it.
system – The system database provides core information to the CockroachDB cluster, including information relating to jobs, ranges, replication, geo-locations, users, and more.
In every database, there are schemas. Schemas are CockroachDB’s second-level object. They allow you to separate objects like tables and views into logical areas. Say, for example, you have two business areas, each needing their own tables; having a schema per business area not only keeps those tables logically separated but also resolves naming collisions. Having two tables named “employee,” for example, would not be possible within the same schema. With two schemas, this is not an issue. Alongside any schemas you create yourself, every database contains the following preloaded schemas:
crdb_internal – The crdb_internal schema contains information relating to the internal objects used by a database. This includes things like metrics, jobs, alerts, leases, and audit entries for CREATE operations.
information_schema – The information_schema schema contains information on user-defined objects such as columns, indexes, tables, and views.
pg_catalog – The pg_catalog schema provides compatibility with Postgres clients that are expecting a schema with this name to exist. It contains information relating to databases and the objects they contain.
pg_extension – The pg_extension schema provides extension information. By default, this includes information from the Spatial Features extension, which provides spatial data types.
public – In the same way that the defaultdb database is used if no database name is provided, the public schema is the default schema that is used if no user-defined schema is provided. All user-defined objects go into this schema unless the user provides another.
Underneath schemas are CockroachDB’s third- and final-level objects. These include indexes, sequences, tables, views, and temporary objects (objects, like temporary tables, that are not persisted).
Data Types
CockroachDB boasts all of the data types you’ll need to build a rich database. In this section, I’ll show you how and where to use these data types.
UUID
The UUID data type stores a 128-bit UUID value. Values stored in this column can be any UUID version but will all be formatted to the RFC 4122 standard. Let’s create a table with a UUID column to learn more about it.
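A minimal sketch of such a table follows; the table and column names are illustrative.

```sql
-- A UUID primary key with a generated default value.
CREATE TABLE person (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name STRING NOT NULL
);

-- Omitting the id column lets CockroachDB generate one for us.
INSERT INTO person (name) VALUES ('Annie Easley');
```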
CockroachDB will only generate a default value if you don’t provide one, so it’s still possible to provide values yourself.
ARRAY
The ARRAY data type stores a flat (or one-dimensional) collection of another data type. It’s indexable using inverted indexes designed to work with tokenizable data, such as the values of an array or the key-value pairs in a JSON object. Let’s create a table with an ARRAY column to learn more about it.
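The following sketch creates a table with an INT array column; the names are illustrative.

```sql
CREATE TABLE person (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    favourite_numbers INT[]
);

-- An inverted index makes the array's values searchable.
CREATE INVERTED INDEX favourite_numbers_idx ON person (favourite_numbers);

INSERT INTO person (favourite_numbers) VALUES (ARRAY[1, 2, 3]);
```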
There are many operations you can perform against ARRAY columns. We’ll cover just the most common.
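Assuming an illustrative person table with an INT[] column named favourite_numbers, the most common operations look something like this:

```sql
-- Access an element by (1-based) index.
SELECT favourite_numbers[1] FROM person;

-- Append a value to the array.
UPDATE person SET favourite_numbers = array_append(favourite_numbers, 4);

-- Test for membership.
SELECT * FROM person WHERE 3 = ANY (favourite_numbers);
```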
BIT
BOOL
BYTES
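The BYTES data type stores binary values. As a sketch (table and column names are illustrative), the same byte sequence can be inserted using several equivalent literal formats:

```sql
CREATE TABLE credential (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    password BYTES
);

-- Each of these literals encodes the same byte sequence.
INSERT INTO credential (password)
    VALUES (b'cockroach'),
           (b'\x636f636b726f616368'),
           (x'636f636b726f616368');
```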
Every resulting row from the preceding insert will have an identical password column value.
DATE
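The DATE data type stores date values without a time component. As a sketch (names are illustrative), the same date can be inserted using several equivalent representations:

```sql
CREATE TABLE person (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    date_of_birth DATE
);

-- Both values resolve to the same date.
INSERT INTO person (date_of_birth)
    VALUES ('1941-05-19'), ('1941-05-19T00:00:00Z');
```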
Every resulting row from the preceding insert will have an identical date_of_birth column value.
ENUM
DECIMAL
The DECIMAL, DEC, or NUMERIC data type stores exact, fixed-point numbers of variable size and can be created either with or without a precision.
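As a sketch (names are illustrative), a DECIMAL column can be bounded with a precision and scale, or left unbounded; unbounded DECIMAL columns even accept infinity:

```sql
CREATE TABLE account (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    balance DECIMAL(10, 2),  -- up to ten digits, two after the decimal point
    btc_balance DECIMAL      -- variable size
);

INSERT INTO account (balance, btc_balance) VALUES (100.50, 'inf');
```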
Positive infinity Bitcoin. One can dream.
FLOAT
INET
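The INET data type stores IPv4 and IPv6 addresses, optionally with a subnet mask. A sketch (names are illustrative):

```sql
CREATE TABLE machine (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    ip INET
);

INSERT INTO machine (ip)
    VALUES ('10.0.1.0/24'), ('10.0.1.1/32'),
           ('2001:db8::/48'), ('2001:db8::1/128');

SELECT ip FROM machine;
```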
Notice that the 10.0.1.0/24 address keeps its mask because a /24 mask for this address includes addresses from 10.0.1.0 to 10.0.1.255. On the other hand, the 10.0.1.1/32 address has a mask of /32, which means “this IP address only,” meaning we have a specific IP address, and the mask is superfluous. The same holds for the IPv6 addresses, where a /128 means “this IP address only.”
INTERVAL
JSONB
If you plan on performing complex queries against a JSON column, you’ll want to create an inverted index for the column. I’ll cover these in a subsequent chapter on performance.
You can also access specific JSON fields as regular columns, create indexes on them, or even build PRIMARY KEYs from fields within JSON documents; this also works with the JSON alias of the JSONB data type.
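The following sketch (names are illustrative) combines an inverted index with a computed column that surfaces a JSON field as a regular column:

```sql
CREATE TABLE session (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    payload JSONB,
    -- A stored computed column derived from a JSON field.
    user_email STRING AS (payload->>'email') STORED
);

-- An inverted index to speed up queries against the JSON payload.
CREATE INVERTED INDEX payload_idx ON session (payload);

INSERT INTO session (payload)
    VALUES ('{"email": "annie@example.com", "theme": "dark"}');

-- The @> operator tests JSON containment and can use the inverted index.
SELECT user_email FROM session WHERE payload @> '{"theme": "dark"}';
```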
SERIAL
The SERIAL data type is not strictly a data type but rather an INT, INT2, INT4, or INT8 with a default value applied, resulting in an auto-incrementing number. SERIAL was introduced to CockroachDB to provide Postgres compatibility. As such, you should consider using the UUID data type with a default value of gen_random_uuid() instead of the SERIAL data type, as this provides 128 bits of randomness vs. SERIAL’s maximum of 64 bits.
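As a sketch (names are illustrative), a SERIAL column behaves like an integer column with a generated default:

```sql
CREATE TABLE event (
    id SERIAL PRIMARY KEY,  -- equivalent to INT8 DEFAULT unique_rowid()
    name STRING
);

INSERT INTO event (name) VALUES ('signup'), ('login');

SELECT id FROM event;
```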
As we can see from the values generated, regardless of the integer size provided by the data type definition, the default mode for generating these values (unique_rowid()) will always result in a 64-bit integer.
STRING
Note that VARCHAR(n) is, in fact, an alias of STRING(n), but for Postgres compatibility, it is still represented as VARCHAR(n) here, as Postgres does not have a STRING data type.
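A sketch of a table using the STRING variants (names are illustrative):

```sql
CREATE TABLE greeting (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    message STRING,         -- unbounded length
    short_code STRING(10),  -- at most ten characters
    fixed_code CHAR(2)      -- fixed length of two characters
);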
Inserting data into this table will reveal that CockroachDB faithfully represents Unicode characters:
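Assuming an illustrative greeting table with a STRING message column, a sketch of such an insert might be:

```sql
INSERT INTO greeting (message)
    VALUES ('Hello'), ('こんにちは'), ('안녕하세요');

SELECT message FROM greeting;
```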
As in Postgres, you’ll receive an error from CockroachDB if you try to insert data that will not fit into a fixed-length or limited variable-length column.
TIME/TIMETZ
The TIME and TIMETZ data types store time (minus date) values in UTC and zoned representations, respectively. They can store time values at various levels of precision, ranging from second precision down to microsecond precision.
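A sketch of the two types and an explicit precision (names are illustrative):

```sql
CREATE TABLE meeting (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    start_utc TIME,        -- no time zone information
    start_zoned TIMETZ,    -- retains a time zone offset
    start_coarse TIME(0)   -- second precision
);

INSERT INTO meeting (start_utc, start_zoned, start_coarse)
    VALUES ('09:00:00.123456', '09:00:00.123456+05:00', '09:00:00.123456');
```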
TIMESTAMP/TIMESTAMPTZ
The TIMESTAMP and TIMESTAMPTZ data types both store timestamps in UTC; TIMESTAMP displays values in UTC, while TIMESTAMPTZ converts them into the session time zone for display.
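A sketch of a table using both types, with and without an explicit precision (names are illustrative):

```sql
CREATE TABLE payment (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ DEFAULT now(),  -- default microsecond precision
    created_coarse TIMESTAMP(0)            -- second precision
);
```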
As you can see, the TIMESTAMP data types allow for an optional precision to be provided. This precision accepts a value between zero (representing second precision) and six (representing microsecond precision). The default precision for TIMESTAMP data types is 6, so for most cases, you can omit the precision entirely.
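Assuming an illustrative payment table with a TIMESTAMPTZ created_at column, changing the session time zone changes how the same stored value is displayed:

```sql
INSERT INTO payment (created_at) VALUES ('2021-11-01 14:00:00+00:00');

SET TIME ZONE 'UTC';
SELECT created_at FROM payment;

SET TIME ZONE 'America/New_York';
SELECT created_at FROM payment;
```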
As you can see, users in the UTC time zone see the data as it was inserted into the database, while users in the EDT time zone see the data in their local time, at a -04:00 offset from UTC.
GEOMETRY
GEOMETRYCOLLECTION – A collection of other spatial objects (including other GEOMETRYCOLLECTION objects)
LINESTRING – A collection of points making up an arbitrary line, useful for roads and rivers
MULTILINESTRING – A collection of LINESTRING objects
MULTIPOINT – A collection of POINT objects
MULTIPOLYGON – A collection of POLYGON objects
POINT – A single point comprising two coordinates (e.g., latitude and longitude)
POLYGON – A collection of points that represent an area
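As a sketch of working with GEOMETRY data, the following creates a table of properties, each with a POINT location, and finds the ones inside a search polygon. All names, coordinates, and the ST_Point/ST_Contains usage are illustrative:

```sql
CREATE TABLE property (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name STRING,
    location GEOMETRY
);

INSERT INTO property (name, location)
    VALUES ('Knightsbridge One', ST_Point(-0.1607, 51.5015)),
           ('Knightsbridge Two', ST_Point(-0.1580, 51.5000)),
           ('Chelsea House',     ST_Point(-0.1687, 51.4875));

-- Return only the properties that fall within the search polygon.
SELECT name FROM property
WHERE ST_Contains(
    ST_GeomFromText('POLYGON((-0.17 51.49, -0.15 51.49, -0.15 51.51, -0.17 51.51, -0.17 51.49))'),
    location
);
```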
Their query returns just the two properties in Knightsbridge, leaving the equally lavish Chelsea property for another fabulously wealthy user.
Functions
CockroachDB provides built-in functions to make certain operations easier, and there are functions for working with most of the data types introduced in this chapter.
Built-in functions either provide functionality for working with data types or expose core system functionality such as working with stream ingestion and getting system information.
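A few examples of built-in functions in action (the selected functions are illustrative, not exhaustive):

```sql
SELECT gen_random_uuid();              -- generate a UUID
SELECT now();                          -- the current transaction timestamp
SELECT lower('COCKROACH');             -- string manipulation
SELECT array_length(ARRAY[1, 2, 3], 1);  -- array helpers
SELECT version();                      -- system information
```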
There are many more functions to try; for a complete list of available functions, visit www.cockroachlabs.com/docs/stable/functions-and-operators.html.
Geo-partitioned Data
Geo-partitioning is one of CockroachDB’s most powerful Enterprise features, and as of v21.1.10, working with geo-partitioned data has become much more straightforward.
To get started with geo-partitioned data, you first need to decide on a partitioning strategy. For example, will all of the rows within a table need to remain in one location, or will specific rows within that table need to be separated by location? If you need all of the rows within a table to remain in one location, the REGIONAL BY TABLE table locality is what you need. On the other hand, if you need to pin rows to different places, the REGIONAL BY ROW table locality is a good choice.
Let’s create some database tables to demonstrate geo-partitioning for both locality strategies.
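A sketch of the demo command explained below; the region and zone names are illustrative, and the exact flag syntax may vary between CockroachDB versions:

```shell
cockroach demo \
  --empty \
  --nodes 9 \
  --demo-locality=region=us-east1,az=a:region=us-east1,az=b:region=us-east1,az=c:region=europe-west1,az=a:region=europe-west1,az=b:region=europe-west1,az=c:region=asia-northwest1,az=a:region=asia-northwest1,az=b:region=asia-northwest1,az=c
```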
This command creates an insecure, empty cluster with nine nodes. The --demo-locality argument allows us to specify the region and availability zones for the nodes to be distributed across: nine availability zones, one for each of our nine nodes. The syntax of this argument’s value looks tricky but is actually very simple; for each of the requested localities, we simply provide a colon-separated collection of region and availability zone pairs.
It’s now time to create some partitioned tables.
REGIONAL BY ROW
The REGIONAL BY ROW table locality ensures that individual rows within a table are pinned to specific locations, depending on the value of a specific field. In human speak, a location can be a continent, a country, a state, or even a specific machine. In infrastructure speak, a location can be a region, an availability zone, or a node.
We don’t need to manually set a survival goal in this example because for now, we’ll stick with the default configuration, which allows CockroachDB to survive zonal failures.
country – The country that is associated with the row.
crdb_region – The region that determines where CockroachDB will pin the row’s data. This column is populated automatically on insert with a value derived from the country column. Using a CASE statement, we’ll ensure that when the country column is “JP” for Japan, “KR” for South Korea, or “TH” for Thailand, the crdb_region column receives the value “asia-northwest1”, and so on for the other regions.
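The list above can be sketched as follows. The database name, region names (other than asia-northwest1, which appears in the text), and country mappings are illustrative, and the database’s regions must match those of the cluster:

```sql
CREATE DATABASE people;
ALTER DATABASE people SET PRIMARY REGION "us-east1";
ALTER DATABASE people ADD REGION "europe-west1";
ALTER DATABASE people ADD REGION "asia-northwest1";

USE people;

CREATE TABLE person (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name STRING NOT NULL,
    country STRING NOT NULL,
    -- Derive the row's home region from its country on insert.
    crdb_region crdb_internal_region NOT NULL AS (
        CASE
            WHEN country IN ('JP', 'KR', 'TH') THEN 'asia-northwest1'
            WHEN country IN ('DE', 'FR', 'GB') THEN 'europe-west1'
            ELSE 'us-east1'
        END
    ) STORED
) LOCALITY REGIONAL BY ROW AS crdb_region;
```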
From the preceding output, you’ll see that data in the person table is replicated across five nodes, as per the default replication factor for multi-region databases in CockroachDB. A replication factor of five provides additional robustness if a region becomes unavailable.
The preceding ALTER statement will be blocked by CockroachDB, as any request to update the number of replicas in a table or database requires special permissions. However, as seen in the preceding statement, we can ask CockroachDB to permit us to make this change by overriding the multiregion zone configuration flag.
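A sketch of the override in practice; the database name is illustrative, and the session variable name assumes a recent CockroachDB version:

```sql
-- Blocked by default in a multi-region database:
--   ALTER DATABASE people CONFIGURE ZONE USING num_replicas = 3;

-- Overriding the multiregion zone configuration guard rail
-- allows the change to proceed.
SET override_multi_region_zone_config = true;
ALTER DATABASE people CONFIGURE ZONE USING num_replicas = 3;
SET override_multi_region_zone_config = false;
```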
REGIONAL BY TABLE
The REGIONAL BY TABLE table locality ensures that all of the rows within a table remain in a particular location. In human speak, a location can be a continent, a country, a state, or even a specific machine. In infrastructure speak, a location can be a region, an availability zone, or a node.
The REGIONAL BY TABLE table locality is the default configuration for tables in a multi-region CockroachDB database. By default, a regional table’s data is homed in the database’s primary region, so access to it will be fastest from that region.
Use this option if you wish to pin all data within a table to one region. In the following example, we’ll create a regional table.
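A sketch of a regional table pinned to a single region, followed by the replica reduction discussed next; the column definitions and region name are illustrative:

```sql
CREATE TABLE american_scientists (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name STRING NOT NULL
) LOCALITY REGIONAL BY TABLE IN "us-east1";

-- Reduce the table's replication factor from five to three.
SET override_multi_region_zone_config = true;
ALTER TABLE american_scientists CONFIGURE ZONE USING num_replicas = 3;
SET override_multi_region_zone_config = false;
```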
In the previous step, we reduced the number of replicas for the american_scientists table from five to three. Doing so will impact the resilience of the cluster, given that CockroachDB will now replicate data across fewer nodes. In later chapters, I share some cluster size and replication number recommendations for production environments.