Most applications depend on a database to store data. Imagine an application that keeps track of a warehouse’s inventory. The more inventory moves through the warehouse, the more requests the application serves and the more queries the database has to process. Sooner or later, the database becomes too busy and latency increases to a level that limits the warehouse’s productivity. At this point, you have to scale the database to help the business. You can do this in two ways:
Vertically—You can use faster hardware for your database machine; for example, you can add memory or replace the CPU with a more powerful model.
Horizontally—You can add a second database machine. Both machines then form a database cluster.
Scaling a database vertically is the easier option, but it gets expensive. High-end hardware is more expensive than commodity hardware. Besides that, at some point, you will not find more powerful machines on the market anymore.
Scaling a traditional relational database horizontally is difficult because transactional guarantees such as atomicity, consistency, isolation, and durability—also known as ACID—require communication among all nodes of the database during a two-phase commit. To see what we’re talking about, here’s how a simplified two-phase commit with two nodes works, as illustrated in figure 12.1:
A query that changes data (INSERT, UPDATE, or DELETE) is sent to the database cluster.
The database transaction coordinator sends a commit request to the two nodes.
Node 1 checks whether it could execute the query and sends its decision back to the coordinator. If a node decides yes, it must be able to fulfill this promise later. There is no way back.
Node 2 checks whether the query could be executed. The decision is sent back to the coordinator.
The coordinator receives all decisions. If all nodes decide that the query could be executed, the coordinator instructs the nodes to finally commit.
Nodes 1 and 2 finally change the data. At this point, the nodes must fulfill the request. This step must not fail.
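The flow above can be sketched as a toy simulation (our own illustration, not actual database code): the coordinator collects prepare votes from all nodes and instructs them to commit only if every node votes yes.

```javascript
// Toy model of a two-phase commit: nodes vote in a prepare phase,
// and the coordinator commits only if all votes are "yes".
const twoPhaseCommit = (nodes, query) => {
  // Phase 1: ask every node whether it could execute the query.
  const votes = nodes.map((node) => node.prepare(query));
  if (votes.every((vote) => vote === true)) {
    // Phase 2: all nodes promised to commit; there is no way back.
    nodes.forEach((node) => node.commit(query));
    return 'committed';
  }
  // At least one node voted no: the whole transaction is aborted.
  return 'aborted';
};

// Two fake nodes; a node rejects queries that violate a constraint.
const makeNode = () => ({
  data: [],
  prepare(query) { return !query.violatesConstraint; },
  commit(query) { this.data.push(query.value); }
});

const node1 = makeNode();
const node2 = makeNode();
console.log(twoPhaseCommit([node1, node2], { value: 'item-1' })); // committed
console.log(twoPhaseCommit([node1, node2],
  { value: 'item-2', violatesConstraint: true })); // aborted
```

Every additional node adds another prepare/commit round trip to this loop, which is exactly why coordination gets more expensive as the cluster grows.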
The problem is, the more nodes you add, the slower your database becomes, because more nodes must coordinate transactions between each other. The way to tackle this has been to use databases that don’t adhere to these guarantees. They’re called NoSQL databases.
There are four types of NoSQL databases—document, graph, columnar, and key-value store—each with its own uses and applications. Amazon provides a NoSQL database service called DynamoDB, a key-value store. DynamoDB is a fully managed, proprietary, closed source key-value store with document support. In other words, DynamoDB persists objects identified by a unique key, which you might know from the concept of a hash table. Fully managed means that you only use the service and AWS operates it for you. DynamoDB is highly available and highly durable. You can scale from one item to billions and from one request per second to tens of thousands of requests per second. AWS also offers other types of NoSQL database systems like Keyspaces, Neptune, DocumentDB, and MemoryDB for Redis—more about these options later.
To use DynamoDB, your application needs to be built for this particular NoSQL database. You cannot point your legacy application to DynamoDB instead of a MySQL database, for example. Therefore, this chapter focuses on how to write an application for storing and retrieving data from DynamoDB. At the end of the chapter, we will discuss what is needed to administer DynamoDB.
Some typical use cases for DynamoDB follow:
When building systems that need to deal with a massive amount of requests or spiky workloads, the ability to scale horizontally is a game changer. We have used DynamoDB to track client-side errors from a web application, for example.
When building small applications with a simple data structure, the pay-per-request pricing model and the simplicity of a fully managed service are good reasons to go with DynamoDB. For example, we used DynamoDB to track the progress of batch jobs.
While following this chapter, you will implement a simple to-do application called nodetodo, the equivalent of a “Hello World” example for databases. You will learn how to write, fetch, and query data. Figure 12.2 shows nodetodo in action.
DynamoDB is a key-value store that organizes your data into tables. For example, you can have a table to store your users and another table to store tasks. The items contained in the table are identified by a unique key. An item could be a user or a task; think of an item as a row in a relational database.
To minimize the overhead of a programming language, you’ll use Node.js/JavaScript to create a small to-do application, nodetodo, which you can use via the terminal on your local machine. nodetodo uses DynamoDB as a database and comes with the following features:
To implement an intuitive CLI, nodetodo uses docopt, a command-line interface description language, to describe the CLI interface. The supported commands follow:
In the following sections, you’ll implement these commands to learn about DynamoDB hands-on. This listing shows the full CLI description of all the commands, including parameters.
nodetodo Usage: nodetodo user-add <uid> <email> <phone> nodetodo user-rm <uid> nodetodo user-ls [--limit=<limit>] [--next=<id>] ① nodetodo user <uid> nodetodo task-add <uid> <description> ➥ [<category>] [--dueat=<yyyymmdd>] ② nodetodo task-rm <uid> <tid> nodetodo task-ls <uid> [<category>] ➥ [--overdue|--due|--withoutdue|--futuredue] ③ nodetodo task-la <category> ➥ [--overdue|--due|--withoutdue|--futuredue] nodetodo task-done <uid> <tid> nodetodo -h | --help ④ nodetodo --version ⑤ Options: -h --help Show this screen. --version Show version.
① The named parameters limit and next are optional.
② The category parameter is optional.
③ The filters for due dates are optional and mutually exclusive.
④ help prints information about how to use nodetodo.
⑤ version prints the version of nodetodo.
DynamoDB isn’t comparable to a traditional relational database in which you create, read, update, or delete data with SQL. You’ll use the AWS SDK to send requests to the REST API. You must integrate DynamoDB into your application; you can’t take an existing application that uses an SQL database and run it on DynamoDB. To use DynamoDB, you need to write code!
Each DynamoDB table has a name and organizes a collection of items. An item is a collection of attributes, and an attribute is a name-value pair. The attribute value can be scalar (number, string, binary, Boolean), multivalued (number set, string set, binary set), or a JSON document (object, array). Items in a table aren’t required to have the same attributes; there is no enforced schema. Figure 12.3 demonstrates these terms.
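Sketched in the low-level notation used by the AWS SDK (which you will see throughout this chapter), such an item might look like the following; the attribute names and values are made up for illustration:

```javascript
// A hypothetical item in DynamoDB's low-level attribute notation:
// every value is wrapped in an object whose key states the type.
const item = {
  uid: { S: 'emma' },                       // string (scalar)
  age: { N: '27' },                         // number (always sent as a string)
  verified: { BOOL: true },                 // Boolean (scalar)
  categories: { SS: ['shopping', 'work'] }, // string set (multivalued)
  address: {                                // JSON document (object)
    M: {
      city: { S: 'Berlin' },
      zip: { S: '10115' }
    }
  }
};
console.log(Object.keys(item).length); // 5
```

Another item in the same table could carry a completely different set of attributes; only the primary key attributes are mandatory.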
DynamoDB doesn’t need a static schema like a relational database does, but you must define the attributes that are used as the primary key in your table. In other words, you must define the table’s primary key schema. Next, you will create a table for the users of the nodetodo application as well as a table that will store all the tasks.
To be able to assign a task to a user, the nodetodo application needs to store some information about users. Therefore, we came up with the following data structure for storing information about users:
{ "uid": "emma", ① "email": "[email protected]", ② "phone": "0123456789" ③ }
① The unique ID of the user
② The email address of the user
③ The phone number belonging to the user
How do you create a DynamoDB table based on this information? First, you need to think about the table's name. We suggest that you prefix all your tables with the name of your application. In this case, the table name would be todo-user.
Next, a DynamoDB table requires the definition of a primary key. A primary key consists of one or two attributes. A primary key is unique within a table and identifies an item. You need the primary key to retrieve, update, or delete an item. Note that DynamoDB does not care about any other attributes of an item because it does not require a fixed schema, as relational databases do.
For the todo-user table, we recommend you use the uid as the primary key because the attribute is unique. Also, the only command that queries data is user, which fetches a user by its uid. When a single attribute is used as the primary key, DynamoDB calls it the partition key of the table.
You can create a table by using the Management Console, CloudFormation, SDKs, or the CLI. Within this chapter, we use the AWS CLI. The aws dynamodb create-table command has the following four mandatory options:
table-name—The name of the table (for example, todo-user).
attribute-definitions—Name and type of attributes used as the primary key. Multiple definitions can be given using the syntax AttributeName=attr1,AttributeType=S, separated by a space character. Valid types are S (string), N (number), and B (binary).
key-schema—Name of attributes that are part of the primary key (can't be changed). Contains a single entry using the syntax AttributeName=attr1,KeyType=HASH for a partition key, or two entries separated by a space for a partition key and sort key. Valid types are HASH and RANGE.
provisioned-throughput—Performance settings for this table defined as ReadCapacityUnits=5,WriteCapacityUnits=5 (you'll learn about this in section 12.11).
Execute the following command to create the todo-user table with the uid attribute as the partition key:
$ aws dynamodb create-table --table-name todo-user ① ➥ --attribute-definitions AttributeName=uid,AttributeType=S ② ➥ --key-schema AttributeName=uid,KeyType=HASH ③ ➥ --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5 ④
① Prefixing tables with the name of your application will prevent name clashes in the future.
② Items must at least have one attribute uid of type string.
③ The partition key (type HASH) uses the uid attribute.
④ You’ll learn about this in section 12.11.
Creating a table takes some time. Wait until the status changes to ACTIVE. You can check the status of a table as follows.
$ aws dynamodb describe-table --table-name todo-user ① { "Table": { "AttributeDefinitions": [ ② { "AttributeName": "uid", "AttributeType": "S" } ], "TableName": "todo-user", "KeySchema": [ ③ { "AttributeName": "uid", "KeyType": "HASH" } ], "TableStatus": "ACTIVE", ④ "CreationDateTime": "2022-01-24T16:00:29.105000+01:00", "ProvisionedThroughput": { "NumberOfDecreasesToday": 0, "ReadCapacityUnits": 5, "WriteCapacityUnits": 5 }, "TableSizeBytes": 0, "ItemCount": 0, "TableArn": "arn:aws:dynamodb:us-east-1:111111111111:table/todo-user", "TableId": "0697ea25-5901-421c-af29-8288a024392a" } }
① The CLI command to check the table status
② Attributes defined for that table
③ Attributes used as the primary key
④ The status of the table; wait until it is ACTIVE
So far, we created the table todo-user to store information about the users of nodetodo. Next, we need a table to store the tasks. A task belongs to a user and contains a description of the task. Therefore, we came up with the following data structure:
{ "uid": "john", ① "tid": 1643037060000, ② "description": "prepare customer presentation" }
① The task is assigned to the user with this ID.
② The creation time (number of milliseconds elapsed since January 1, 1970 00:00:00 UTC) is used as ID for the task.
How does nodetodo query the task table? By using the task-ls command, which we need to implement. This command lists all the tasks belonging to a user. Therefore, choosing the tid as the primary key is not sufficient. We recommend a combination of uid and tid instead. DynamoDB calls the components of a primary key with two attributes the partition key and the sort key. Neither the partition key nor the sort key needs to be unique on its own, but the combination of both parts must be unique.
Note This solution has one limitation: users can add only one task per timestamp. Because tasks are uniquely identified by uid and tid (the primary key), there can't be two tasks for the same user at the same time. Our timestamp comes with millisecond resolution, so it should be fine.
Using a partition key and a sort key uses two of your table’s attributes. For the partition key, an unordered hash index is maintained; the sort key is kept in a sorted index for each partition key. The combination of the partition key and the sort key uniquely identifies an item if they are used as the primary key. The following data set shows the combination of unsorted partition keys and sorted sort keys:
["john", 1] => { ① "uid": "john", "tid": 1, "description": "prepare customer presentation" } ["john", 2] => { ② "uid": "john", "tid": 2, "description": "plan holidays" } ["emma", 1] => { ③ "uid": "emma", "tid": 1, "description": "prepare lunch" } ["emma", 2] => { "uid": "emma", "tid": 2, "description": "buy nice flowers for mum" } ["emma", 3] => { "uid": "emma", "tid": 3, "description": "prepare talk for conference" }
① The primary key consists of the uid (john) used as the partition key and tid (1) used as the sort key.
② The sort keys are sorted within a partition key.
③ There is no order in the partition keys.
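To build intuition for this layout, here is a tiny in-memory model (our own sketch, not how DynamoDB is implemented): partitions are kept in an unordered map, and items within each partition are kept sorted by their sort key.

```javascript
// Toy model of a table with a partition key (uid) and sort key (tid):
// partitions are unordered, but items within a partition stay sorted.
const partitions = new Map();

const putItem = (item) => {
  if (!partitions.has(item.uid)) partitions.set(item.uid, []);
  const partition = partitions.get(item.uid);
  partition.push(item);
  partition.sort((a, b) => a.tid - b.tid); // keep the sort key sorted
};

// A query with only the partition key returns the whole partition, in order.
const query = (uid) => partitions.get(uid) || [];

putItem({ uid: 'emma', tid: 3, description: 'prepare talk for conference' });
putItem({ uid: 'emma', tid: 1, description: 'prepare lunch' });
putItem({ uid: 'john', tid: 1, description: 'prepare customer presentation' });

console.log(query('emma').map((t) => t.tid)); // [ 1, 3 ]
```

A query that specifies only the partition key therefore returns all of a user's tasks, already in sort key order.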
Execute the following command to create the todo-task table with a primary key consisting of the partition key uid and the sort key tid:
$ aws dynamodb create-table --table-name todo-task ➥ --attribute-definitions AttributeName=uid,AttributeType=S ① ➥ AttributeName=tid,AttributeType=N ➥ --key-schema AttributeName=uid,KeyType=HASH ➥ AttributeName=tid,KeyType=RANGE ② ➥ --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
① At least two attributes are needed for a partition key and sort key.
② The tid attribute is the sort key.
Wait until the table status changes to ACTIVE when you run aws dynamodb describe-table --table-name todo-task.
The todo-user and todo-task tables are ready. Time to add some data.
Now you have two tables up and running to store users and their tasks. To use them, you need to add data. You’ll access DynamoDB via the Node.js SDK, so it’s time to set up the SDK and some boilerplate code before you implement adding users and tasks.
To get started with Node.js and docopt, you need some magic lines to load all the dependencies and do some configuration work. Listing 12.3 shows how this can be done.
Docopt is responsible for reading all the arguments passed to the process. It returns a JavaScript object, where the arguments are mapped to the parameters in the CLI description.
const fs = require('fs'); ① const docopt = require('docopt'); ② const moment = require('moment'); ③ const AWS = require('aws-sdk'); ④ const db = new AWS.DynamoDB({ region: 'us-east-1' }); const cli = fs.readFileSync('./cli.txt', ➥ {encoding: 'utf8'}); ⑤ const input = docopt.docopt(cli, { ⑥ version: '1.0', argv: process.argv.splice(2) });
① Loads the fs module to access the filesystem
② Loads the docopt module to read input arguments
③ Loads the moment module to simplify temporal types in JavaScript
④ Loads the AWS SDK and creates a DynamoDB client for the us-east-1 region
⑤ Reads the CLI description from the file cli.txt
⑥ Parses the arguments, and saves them to an input variable
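For example, for the command node index.js user-add emma emma@example.com 0123456789, the parsed input object would look roughly like the following sketch (the exact shape depends on the docopt version):

```javascript
// Sketch of the object docopt produces for:
//   node index.js user-add emma emma@example.com 0123456789
// Commands map to booleans, positional parameters to their values.
const input = {
  'user-add': true,
  'user-rm': false,
  '<uid>': 'emma',
  '<email>': 'emma@example.com',
  '<phone>': '0123456789',
  '--limit': null, // unused named parameters are null
  '--next': null
};
console.log(input['user-add'] && input['<uid>']); // emma
```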
Next, you will add an item to a table. The next listing explains the putItem method of the AWS SDK for Node.js.
const params = { Item: { ① attr1: {S: 'val1'}, ② attr2: {N: '2'} ③ }, TableName: 'app-entity' ④ }; db.putItem(params, (err) => { ⑤ if (err) { ⑥ console.error('error', err); } else { console.log('success'); } });
① All item attribute name-value pairs
② Strings are indicated by an S.
③ Numbers (floats and integers) are indicated by an N.
④ Adds item to the app-entity table
⑤ Invokes the putItem operation on DynamoDB
⑥ Checks whether the call returned an error
The first step is to add an item to the todo-user table. The following listing shows the code executed by the user-add command.
if (input['user-add'] === true) { const params = { Item: { ① uid: {S: input['<uid>']}, ② email: {S: input['<email>']}, ③ phone: {S: input['<phone>']} ④ }, TableName: 'todo-user', ⑤ ConditionExpression: 'attribute_not_exists(uid)' ⑥ }; db.putItem(params, (err) => { ⑦ if (err) { console.error('error', err); } else { console.log('user added'); } }); }
① Item contains all attributes. Keys are also attributes, and that’s why you do not need to tell DynamoDB which attributes are keys if you add data.
② The uid attribute is of type string and contains the uid parameter value.
③ The email attribute is of type string and contains the email parameter value.
④ The phone attribute is of type string and contains the phone parameter value.
⑤ Adds the item to the todo-user table
⑥ If putItem is called twice on the same key, data is replaced. ConditionExpression allows the putItem only if the key isn't yet present.
⑦ Invokes the putItem operation on DynamoDB
When you make a call to the AWS API, you always do the following:
Create a JavaScript object (map) filled with the needed parameters (the params variable).
Invoke the operation, passing the parameters and a callback function.
Check whether the response contains an error, and if not, process the returned data.
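Because this pattern repeats for every operation, you could wrap it once; the following is our own sketch, using a stubbed operation instead of a real DynamoDB call:

```javascript
// Sketch: wrapping the repeating params/callback pattern in a promise.
// `operation` stands for any SDK method like db.putItem or db.getItem.
const call = (operation, params) => new Promise((resolve, reject) => {
  operation(params, (err, data) => {
    if (err) {
      reject(err);   // the response contains an error
    } else {
      resolve(data); // process the returned data
    }
  });
});

// A stub standing in for an SDK operation (no network involved).
const fakePutItem = (params, cb) => cb(null, { table: params.TableName });

call(fakePutItem, { TableName: 'todo-user' })
  .then((data) => console.log('success', data.table)); // success todo-user
```

The AWS SDK for JavaScript v2 also offers a built-in .promise() method on each request object, which you would likely prefer over a hand-rolled wrapper in real code.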
Therefore, you only need to change the content of params if you want to add a task instead of a user. Execute the following commands to add two users:
node index.js user-add john [email protected] +11111111 node index.js user-add emma [email protected] +22222222
John and Emma want to add tasks to their to-do lists to better organize their everyday life. Adding a task is similar to adding a user. The next listing shows the code used to create a task.
if (input['task-add'] === true) { const tid = Date.now(); ① const params = { Item: { uid: {S: input['<uid>']}, tid: {N: tid.toString()}, ② description: {S: input['<description>']}, created: {N: moment(tid).format('YYYYMMDD')} ③ }, TableName: 'todo-task', ④ ConditionExpression: 'attribute_not_exists(uid) ➥ and attribute_not_exists(tid)' ⑤ }; if (input['--dueat'] !== null) { ⑥ params.Item.due = {N: input['--dueat']}; } if (input['<category>'] !== null) { ⑦ params.Item.category = {S: input['<category>']}; } db.putItem(params, (err) => { ⑧ if (err) { console.error('error', err); } else { console.log('task added with tid ' + tid); } }); }
① Creates the task ID (tid) based on the current timestamp
② The tid attribute is of type number and contains the tid value.
③ The created attribute is of type number (format 20150525).
④ Adds the item to the todo-task table
⑤ Ensures that an existing item is not overridden
⑥ If the optional named parameter dueat is set, adds this value to the item
⑦ If the optional named parameter category is set, adds this value to the item
⑧ Invokes the putItem operation on DynamoDB
Now we’ll add some tasks. Use the following command to remind Emma about buying some milk and a task asking John to put out the garbage:
node index.js task-add emma "buy milk" "shopping" node index.js task-add john "put out the garbage" "housekeeping" --dueat "20220224"
Now that you are able to add users and tasks to nodetodo, wouldn’t it be nice if you could retrieve all this data?
So far, you have learned how to insert users and tasks into two different DynamoDB tables. Next, you will learn how to query the data, for example, to get a list of all tasks assigned to Emma.
DynamoDB is a key-value store. The key is usually the only way to retrieve data from such a store. When designing a data model for DynamoDB, you must be aware of that limitation when you create tables (as you did in section 12.2). If you can use only the key to look up data, you’ll sooner or later experience difficulties. Luckily, DynamoDB provides two other ways to look up items: a secondary index lookup and the scan operation. You’ll start by retrieving data by the items’ primary key and continue with more sophisticated methods of data retrieval.
Let’s start with a simple example. You want to find out the contact details about Emma, which are stored in the todo-user table.
The simplest form of data retrieval is looking up a single item by its primary key, for example, a user by its ID. The getItem SDK operation to get a single item from DynamoDB can be used like this.
const params = { Key: { attr1: {S: 'val1'} ① }, TableName: 'app-entity' }; db.getItem(params, (err, data) => { ② if (err) { console.error('error', err); } else { if (data.Item) { ③ console.log('item', data.Item); } else { console.error('no item found'); } } });
① Specifies the attributes of the primary key
② Invokes the getItem operation on DynamoDB
③ Checks whether an item was found
The user <uid> command retrieves a user by the user’s ID. The code to fetch a user is shown in listing 12.8.
const mapUserItem = (item) => { ① return { uid: item.uid.S, email: item.email.S, phone: item.phone.S }; }; if (input['user'] === true) { const params = { Key: { uid: {S: input['<uid>']} ② }, TableName: 'todo-user' ③ }; db.getItem(params, (err, data) => { ④ if (err) { console.error('error', err); } else { if (data.Item) { ⑤ console.log('user', mapUserItem(data.Item)); } else { console.error('user not found'); } } }); }
① Helper function to transform DynamoDB result
② Looks up a user by primary key
③ Fetches the item from the todo-user table
④ Invokes the getItem operation on DynamoDB
⑤ Checks whether data was found for the primary key
For example, use the following command to fetch information about Emma:
node index.js user emma
You can also use the getItem operation to retrieve data by partition key and sort key, for example, to look up a specific task. The only change is that the Key has two entries instead of one. getItem returns one item or no items; if you want to get multiple items, you need to query DynamoDB.
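For the todo-task table, such a two-part Key might look like the following sketch (the tid value is made up for illustration):

```javascript
// Sketch: getItem parameters for a table with a partition and sort key.
// Both parts of the primary key must be specified to identify one item.
const params = {
  Key: {
    uid: { S: 'emma' },          // partition key
    tid: { N: '1643037060000' }  // sort key (numbers are sent as strings)
  },
  TableName: 'todo-task'
};
// db.getItem(params, ...) would return exactly this one task, or no item.
console.log(Object.keys(params.Key).length); // 2
```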
Emma wants to check her to-do list for things she needs to do. Therefore, you will query the todo-task table next to retrieve all tasks assigned to Emma.
If you want to retrieve a collection of items rather than a single item, such as all tasks for a user, you must query DynamoDB. Retrieving multiple items by primary key works only if your table has a partition key and sort key. Otherwise, the partition key will identify only a single item. Listing 12.9 shows how the query SDK operation can be used to get a collection of items from DynamoDB.
const params = { KeyConditionExpression: 'attr1 = :attr1val ➥ AND attr2 = :attr2val', ① ExpressionAttributeValues: { ':attr1val': {S: 'val1'}, ② ':attr2val': {N: '2'} ③ }, TableName: 'app-entity' }; db.query(params, (err, data) => { ④ if (err) { console.error('error', err); } else { console.log('items', data.Items); } });
① The condition the key must match. Use AND if you’re querying both a partition and sort key. Only the = operator is allowed for partition keys. Allowed operators for sort keys are =, >, <, >=, <=, BETWEEN x AND y, and begins_with. Sort key operators are blazing fast because the data is already sorted.
② Dynamic values are referenced in the expression.
③ Always specify the correct type (S, N, B).
④ Invokes the query operation on DynamoDB
The query operation also lets you specify an optional FilterExpression to include only items that match the filter in addition to the key condition. This is helpful to reduce the result set, for example, to show only tasks of a specific category. The syntax of FilterExpression works like KeyConditionExpression, but no index is used for filters. Filters are applied to all matches that KeyConditionExpression returns.
To list all tasks for a certain user, you must query DynamoDB. The primary key of a task is the combination of the uid and the tid. To get all tasks for a user, KeyConditionExpression requires only the partition key.
The next listing shows two helper functions used to implement the task-ls command.
const getValue = (attribute, type) => { ① if (attribute === undefined) { return null; } return attribute[type]; }; const mapTaskItem = (item) => { ② return { tid: item.tid.N, description: item.description.S, created: item.created.N, due: getValue(item.due, 'N'), category: getValue(item.category, 'S'), completed: getValue(item.completed, 'N') }; };
① Helper function to access optional attributes
② Helper function to transform the DynamoDB result
The real magic happens in listing 12.11, which shows the implementation of the task-ls command.
if (input['task-ls'] === true) { const yyyymmdd = moment().format('YYYYMMDD'); const params = { KeyConditionExpression: 'uid = :uid', ① ExpressionAttributeValues: { ':uid': {S: input['<uid>']} ② }, TableName: 'todo-task', Limit: input['--limit'] }; if (input['--next'] !== null) { params.KeyConditionExpression += ' AND tid > :next'; params.ExpressionAttributeValues[':next'] = {N: input['--next']}; } if (input['--overdue'] === true) { params.FilterExpression = 'due < :yyyymmdd'; ③ params.ExpressionAttributeValues ➥ [':yyyymmdd'] = {N: yyyymmdd}; ④ } else if (input['--due'] === true) { params.FilterExpression = 'due = :yyyymmdd'; params.ExpressionAttributeValues[':yyyymmdd'] = {N: yyyymmdd}; } else if (input['--withoutdue'] === true) { params.FilterExpression = ➥ 'attribute_not_exists(due)'; ⑤ } else if (input['--futuredue'] === true) { params.FilterExpression = 'due > :yyyymmdd'; params.ExpressionAttributeValues[':yyyymmdd'] = {N: yyyymmdd}; } else if (input['--dueafter'] !== null) { params.FilterExpression = 'due > :yyyymmdd'; params.ExpressionAttributeValues[':yyyymmdd'] = {N: input['--dueafter']}; } else if (input['--duebefore'] !== null) { params.FilterExpression = 'due < :yyyymmdd'; params.ExpressionAttributeValues[':yyyymmdd'] = {N: input['--duebefore']}; } if (input['<category>'] !== null) { if (params.FilterExpression === undefined) { params.FilterExpression = ''; } else { params.FilterExpression += ' AND '; ⑥ } params.FilterExpression += 'category = :category'; params.ExpressionAttributeValues[':category'] = {S: input['<category>']}; } db.query(params, (err, data) => { ⑦ if (err) { console.error('error', err); } else { console.log('tasks', data.Items.map(mapTaskItem)); if (data.LastEvaluatedKey !== undefined) { console.log('more tasks available with --next=' + data.LastEvaluatedKey.tid.N); } } }); }
① Primary key query. The task table uses a partition and sort key. Only the partition key is defined in the query, so all tasks belonging to a user are returned.
② Query attributes must be passed this way.
③ Filtering uses no index; it’s applied over all elements returned from the primary key query.
④ Filter attributes must be passed this way.
⑤ attribute_not_exists(due) is true when the attribute is missing (opposite of attribute_exists).
⑥ Multiple filters can be combined with logical operators.
⑦ Invokes the query operation on DynamoDB
As an example, you could use the following command to get a list of Emma’s shopping tasks. The query fetches all tasks by uid first and then filters the items based on the category attribute:
node index.js task-ls emma shopping
Two problems arise with the query approach:
Depending on the result size from the primary key query, filtering may be slow. Filters work without an index: every item must be inspected. Imagine you have stock prices in DynamoDB, with a partition key and sort key: the partition key is a ticker like AAPL, and the sort key is a timestamp. You can make a query to retrieve all stock prices of Apple (AAPL) between two timestamps (20100101 and 20150101). But if you want to return prices only on Mondays, you need to filter over all prices to return only 20% (one out of five trading days each week) of them. That’s wasting a lot of resources!
You can only query the primary key. Returning a list of all tasks that belong to a certain category for all users isn’t possible, because you can’t query the category attribute.
You can solve these problems with secondary indexes. Let’s look at how they work.
In this section, you will create a secondary index that allows you to query the tasks belonging to a certain category.
A global secondary index is a projection of your original table that is automatically maintained by DynamoDB. Items in an index don’t have a primary key, just a key. This key is not necessarily unique within the index. Imagine a table of users where each user has a country attribute. You then create a global secondary index where the country is the new partition key. As you can see, many users can live in the same country, so that key is not unique in the index.
You can query a global secondary index like you would query the table. You can imagine a global secondary index as a read-only DynamoDB table that is automatically maintained by DynamoDB: whenever you change the parent table, all indexes are asynchronously (eventually consistent!) updated as well.
In our example, we will create a global secondary index for the todo-task table that uses the category as the partition key and the tid as the sort key. Doing so allows us to query tasks by category. Figure 12.4 shows how a global secondary index works.
A global secondary index comes at a price: the index requires storage (the same cost as for the original table). You must provision additional write-capacity units for the index as well, because a write to your table will cause a write to the global secondary index as well.
A huge benefit of DynamoDB is that you can provision capacity based on your workload. If one of your global secondary indexes gets tons of read traffic, you can increase the read capacity of that index. You can fine-tune your database performance by provisioning sufficient capacity for your tables and indexes. You’ll learn more about that in section 13.9.
Back to nodetodo. John needs the shopping list including Emma’s and his tasks for his trip to town.
To implement the retrieval of tasks by category, you’ll add a secondary index to the todo-task table. This will allow you to make queries by category. A partition key and sort key are used: the partition key is the category attribute, and the sort key is the tid attribute. The index also needs a name: category-index. You can find the following CLI command in the README.md file in nodetodo’s code folder and simply copy and paste it:
$ aws dynamodb update-table --table-name todo-task ① ➥ --attribute-definitions AttributeName=uid,AttributeType=S ➥ AttributeName=tid,AttributeType=N ➥ AttributeName=category,AttributeType=S ② ➥ --global-secondary-index-updates '[{ ➥ "Create": { ③ ➥ "IndexName": "category-index", ➥ "KeySchema": [{"AttributeName": "category", ➥ "KeyType": "HASH"}, ④ ➥ {"AttributeName": "tid", "KeyType": "RANGE"}], ➥ "Projection": {"ProjectionType": "ALL"}, ⑤ ➥ "ProvisionedThroughput": {"ReadCapacityUnits": 5, ➥ "WriteCapacityUnits": 5} ➥ }}]'
① Adds a global secondary index by updating the table that has already been created previously
② Adds a category attribute, because the attribute will be used in the index
③ Creates a new secondary index
④ The category attribute is the partition key, and the tid attribute is the sort key.
⑤ All attributes are projected into the index.
Creating a global secondary index takes about five minutes. You can use the CLI to find out whether the index is ready, like this:
$ aws dynamodb describe-table --table-name todo-task --query "Table.GlobalSecondaryIndexes"
The next listing shows the code to query the global secondary index to fetch tasks by category with the help of the task-la command.
if (input['task-la'] === true) { const yyyymmdd = moment().format('YYYYMMDD'); const params = { KeyConditionExpression: 'category = :category', ① ExpressionAttributeValues: { ':category': {S: input['<category>']} }, TableName: 'todo-task', IndexName: 'category-index', ② Limit: input['--limit'] }; if (input['--next'] !== null) { params.KeyConditionExpression += ' AND tid > :next'; params.ExpressionAttributeValues[':next'] = {N: input['--next']}; } if (input['--overdue'] === true) { params.FilterExpression = 'due < :yyyymmdd'; ③ params.ExpressionAttributeValues[':yyyymmdd'] = {N: yyyymmdd}; } [...] db.query(params, (err, data) => { if (err) { console.error('error', err); } else { console.log('tasks', data.Items.map(mapTaskItem)); if (data.LastEvaluatedKey !== undefined) { console.log('more tasks available with --next=' + data.LastEvaluatedKey.tid.N); } } }); }
① A query against an index works the same as a query against a table ...
② ... but you must specify the index you want to use.
③ Filtering works the same as with tables.
When following our example, use the following command to fetch John and Emma’s shopping tasks:
node index.js task-la shopping
But you’ll still have situations where a query doesn’t work. For example, you can’t retrieve all users from todo-user with a query. Therefore, let’s look at how to scan through all items of a table.
John wants to know who else is using nodetodo and asks for a list of all users. You will learn how to list all items of a table without using an index next.
Sometimes you can’t work with keys because you don’t know them up front; instead, you need to go through all the items in a table. That’s not very efficient, but in rare situations, like daily batch jobs or infrequent requests, it’s fine. DynamoDB provides the scan operation to scan all items in a table, as shown next.
const params = {
  TableName: 'app-entity',
  Limit: 50 ①
};
db.scan(params, (err, data) => { ②
  if (err) {
    console.error('error', err);
  } else {
    console.log('items', data.Items);
    if (data.LastEvaluatedKey !== undefined) { ③
      console.log('more items available');
    }
  }
});
① Specifies the maximum number of items to return
② Invokes the scan operation on DynamoDB
③ Checks whether there are more items that can be scanned
The following listing shows how to scan through all items of the todo-user table. A paging mechanism is used to prevent too many items from being returned at a time.
if (input['user-ls'] === true) {
  const params = {
    TableName: 'todo-user',
    Limit: input['--limit'] ①
  };
  if (input['--next'] !== null) {
    params.ExclusiveStartKey = { ②
      uid: {S: input['--next']}
    };
  }
  db.scan(params, (err, data) => { ③
    if (err) {
      console.error('error', err);
    } else {
      console.log('users', data.Items.map(mapUserItem));
      if (data.LastEvaluatedKey !== undefined) { ④
        console.log('page with --next=' + data.LastEvaluatedKey.uid.S);
      }
    }
  });
}
① The maximum number of items returned
② The named parameter next contains the last evaluated key.
③ Invokes the scan operation on DynamoDB
④ Checks whether the last item has been reached
Use the following command to fetch all users:
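Based on the user-ls branch shown above, the invocation presumably looks like this (a sketch; the optional --limit and --next flags can be appended):

```shell
node index.js user-ls
```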
The scan operation reads all items in the table. This example didn’t filter any data, but you can use FilterExpression as well. Note that you shouldn’t use the scan operation too often—it’s flexible but not efficient.
By default, reading data from DynamoDB is eventually consistent. That means it’s possible that if you create an item (version 1), update that item (version 2), and then read that item, you may get back version 1; if you wait and fetch the item again, you’ll see version 2. Figure 12.5 shows this process. This behavior occurs because the item is persisted on multiple machines in the background. Depending on which machine answers your request, the machine may not have the latest version of the item.
You can prevent eventually consistent reads by adding "ConsistentRead": true to the DynamoDB request to get strongly consistent reads. Strongly consistent reads are supported by the getItem, query, and scan operations. But a strongly consistent read is more expensive—it takes longer and consumes more read capacity—than an eventually consistent read. Reads from a global secondary index are always eventually consistent because the synchronization between table and index happens asynchronously.
Typically, NoSQL databases do not support transactions with atomicity, consistency, isolation, and durability (ACID) guarantees. However, DynamoDB comes with the ability to bundle multiple read or write requests into a transaction. The relevant API methods are called TransactWriteItems and TransactGetItems—you can group either write or read requests, but not a mix of the two. Whenever possible, you should avoid using transactions because they are more expensive from a cost and latency perspective. Check out http://mng.bz/E0ol if you are interested in the details of DynamoDB transactions.
John stumbled across a fancy to-do application on the web. Therefore, he decides to remove his user from nodetodo.
Like the getItem operation, the deleteItem operation requires that you specify the primary key of the item you want to delete. Depending on whether your table uses a partition key or a partition key and sort key, you must specify one or two attributes. The next listing shows the implementation of the user-rm command.
if (input['user-rm'] === true) {
  const params = {
    Key: {
      uid: {S: input['<uid>']} ①
    },
    TableName: 'todo-user' ②
  };
  db.deleteItem(params, (err) => { ③
    if (err) {
      console.error('error', err);
    } else {
      console.log('user removed');
    }
  });
}
① Identifies an item by partition key
② Specifies the table to delete from
③ Invokes the deleteItem operation on DynamoDB
Use the following command to delete John from the DynamoDB table:
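Based on the user-rm branch above, and assuming John’s uid is john, the invocation presumably looks like this (a sketch):

```shell
node index.js user-rm john
```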
But John wants to delete not only his user but also his tasks. Luckily, removing a task works similarly. The only differences are that the item is identified by a partition key and sort key, and the table name changes. The code for the task-rm command is shown here.
if (input['task-rm'] === true) {
  const params = {
    Key: {
      uid: {S: input['<uid>']},
      tid: {N: input['<tid>']} ①
    },
    TableName: 'todo-task' ②
  };
  db.deleteItem(params, (err) => {
    if (err) {
      console.error('error', err);
    } else {
      console.log('task removed');
    }
  });
}
① Identifies an item by partition key and sort key
② Specifies the table to delete from
You’re now able to create, read, and delete items in DynamoDB. The only operation you’re missing is updating an item.
Emma is still a big fan of nodetodo and uses the application constantly. She just bought some milk and wants to mark the task as done.
You can update an item with the updateItem operation. You must identify the item you want to update by its primary key; you can also provide an UpdateExpression to specify the updates you want to perform. Use one or a combination of the following update actions:
Use SET to override or create a new attribute. Examples: SET attr1 = :attr1val, SET attr1 = attr2 + :attr2val, SET attr1 = :attr1val, attr2 = :attr2val.
Use REMOVE to remove an attribute. Examples: REMOVE attr1, REMOVE attr1, attr2.
To implement this feature, you need to update the task item, as shown next.
if (input['task-done'] === true) {
  const yyyymmdd = moment().format('YYYYMMDD');
  const params = {
    Key: {
      uid: {S: input['<uid>']}, ①
      tid: {N: input['<tid>']}
    },
    UpdateExpression: 'SET completed = :yyyymmdd', ②
    ExpressionAttributeValues: {
      ':yyyymmdd': {N: yyyymmdd} ③
    },
    TableName: 'todo-task'
  };
  db.updateItem(params, (err) => { ④
    if (err) {
      console.error('error', err);
    } else {
      console.log('task completed');
    }
  });
}
① Identifies the item by a partition and sort key
② Defines which attributes should be updated
③ Attribute values must be passed this way.
④ Invokes the updateItem operation on DynamoDB
For example, the following command closes Emma’s task to buy milk. Please note: the <tid> will differ when following the example yourself. Use node index.js task-ls emma to get the task’s ID:
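Based on the task-done branch in the listing above, the invocation presumably follows this pattern (a sketch; keep the <tid> placeholder until you know the real ID):

```shell
node index.js task-done emma <tid>
```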
We’d like to recap an important aspect of DynamoDB, the primary key. A primary key is unique within a table and identifies an item. You can use a single attribute as the primary key. DynamoDB calls this a partition key. You need an item’s partition key to look up that item. Also, when updating or deleting an item, DynamoDB requires the partition key. You can also use two attributes as the primary key. In this case, one of the attributes is the partition key, and the other is called the sort key.
A partition key uses a single attribute of an item to create a hash-based index. If you want to look up an item based on its partition key, you need to know the exact partition key. For example, a user table could use the user’s email address as a partition key. The user could then be retrieved if you know the partition key—the email address, in this case.
When you use both a partition key and a sort key, you’re using two attributes of an item to create a more powerful index. To look up an item, you need to know its exact partition key, but you don’t need to know the sort key. You can even have multiple items with the same partition key: they will be sorted according to their sort key.
The partition key can be queried only using exact matches (=). The sort key can be queried using the =, >, <, >=, <=, and BETWEEN x AND y operators. For example, you can query the sort key of a partition key from a certain starting point. You cannot query only the sort key—you must always specify the partition key. A message table could use a partition key and sort key as its primary key; the partition key could be the user’s email, and the sort key could be a timestamp. You could then look up all of a user’s messages that are newer or older than a specific timestamp, and the items would be sorted according to the timestamp.
As the developers of nodetodo, we want to gain some insight into the way our users use the application. Therefore, we are looking for a flexible approach to querying the data stored in DynamoDB on the fly.
Because SQL is such a widely used language, hardly any database system can avoid offering an SQL interface, even if, as is often the case with NoSQL databases, only a fraction of the language is supported. In the following examples, you will learn how to use PartiQL, a query language designed to provide unified access to all kinds of data stores. DynamoDB supports PartiQL via the Management Console, NoSQL Workbench, AWS CLI, and DynamoDB APIs. Be aware that DynamoDB supports only a small subset of the PartiQL language. The following query lists all tasks stored in table todo-task:
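A hedged reconstruction of that query, matching the annotations that follow:

```shell
$ aws dynamodb execute-statement ①
➥ --statement "SELECT * FROM \"todo-task\"" ②
```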
① The command execute-statement supports PartiQL statements as well.
② A simple SELECT statement to fetch all attributes of all items from table todo-task. The escaped " is required because the table name includes a hyphen.
PartiQL allows you to query an index as well. The following statement fetches all tasks with category equal to shopping from the index category-index:
$ aws dynamodb execute-statement --statement
➥ "SELECT * FROM \"todo-task\".\"category-index\"
➥ WHERE category = 'shopping'"
Please note that combining multiple queries (JOIN) is not supported by DynamoDB. However, DynamoDB supports modifying data with the help of PartiQL. The following command updates Emma’s phone number:
$ aws dynamodb execute-statement --statement
➥ "UPDATE \"todo-user\" SET phone='+33333333' WHERE uid='emma'"
Be warned: an UPDATE or DELETE statement must include a WHERE clause that identifies a single item by its partition key or partition and sort keys. Therefore, it is not possible to update or delete more than one item per query.
Want to learn more about PartiQL for DynamoDB? Check out the official documentation: http://mng.bz/N5g2.
In our opinion, using PartiQL is confusing because it pretends to provide a flexible SQL language but is in fact very limited. We prefer using the DynamoDB APIs and the SDK, which are much more descriptive.
Imagine a team of developers is working on a new app using DynamoDB. During development, each developer needs an isolated database so as not to corrupt the other team members’ data. They also want to write unit tests to make sure their app is working. To address their needs, you could create a unique set of DynamoDB tables with a CloudFormation stack for each developer. Or you could use a local DynamoDB for offline development. AWS provides a local implementation of DynamoDB, which is available for download at http://mng.bz/71qm.
Don’t run DynamoDB Local in production! It’s made for development purposes only. It offers the same API as DynamoDB, but the implementation behind that API is entirely different.
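Once downloaded, DynamoDB Local runs as a Java process, and you point your client at it via the endpoint URL (a sketch; port 8000 is the default):

```shell
# Start DynamoDB Local on its default port (requires Java).
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb

# In a second terminal: point the AWS CLI at the local endpoint.
aws dynamodb list-tables --endpoint-url http://localhost:8000
```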
That’s it! You’ve implemented all of nodetodo’s features.
DynamoDB doesn’t require administration like a traditional relational database, because it’s a managed service and AWS takes care of that; instead, you only have a few things to do.
With DynamoDB, you don’t need to worry about installation, updates, machines, storage, or backups. Here’s why:
DynamoDB isn’t software you can download. Instead, it’s a NoSQL database as a service. Therefore, you can’t install DynamoDB like you would MySQL or MongoDB. This also means you don’t have to update your database; the software is maintained by AWS.
DynamoDB runs on a fleet of machines operated by AWS. AWS takes care of the OS and all security-related questions. From a security perspective, it’s your job to restrict access to your data with the help of IAM.
DynamoDB replicates your data among multiple machines and across multiple data centers. There is no need for a backup from a durability point of view. However, you should configure snapshots to be able to recover from accidental data deletion.
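For example, point-in-time recovery can be enabled per table with a single CLI call (a sketch; table name taken from our example):

```shell
aws dynamodb update-continuous-backups --table-name todo-task \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
```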
Now you know some administrative tasks that are no longer necessary if you use DynamoDB. But you still have things to consider when using DynamoDB in production: monitoring capacity usage, provisioning read and write capacity (section 12.11), and creating backups of your tables.
As discussed at the beginning of the chapter, the difference between a typical relational database and a NoSQL database is that a NoSQL database can be scaled by adding new nodes. Horizontal scaling enables increasing the read and write capacity of a database enormously. That’s the case for DynamoDB as well. However, a DynamoDB table has two different read/write capacity modes:
On-demand mode adapts the read and write capacity automatically.
Provisioned mode requires you to configure the read and write capacity upfront. Additionally, you have the option to enable autoscaling to adapt the provisioned capacity based on the current load automatically.
At first glance, the first option sounds much better because it is totally maintenance free. Table 12.1 compares the costs between on-demand and provisioned modes. All of the following examples are based on costs for us-east-1 and assume you are reading and writing items smaller than 1 KB.
That’s a significant difference. Accessing an on-demand table costs about seven times as much as using a table in provisioned capacity mode. But the example is not representative, because it assumes that the provisioned throughput of 10 writes and 100 reads per second is utilized 100% of the time, 24/7. This is hardly true for any application. Most applications have large variations in read and write throughput. Therefore, in many cases, the opposite is correct: on-demand is more cost-efficient than provisioned. For example, we operate a chatbot called marbot on AWS. When we switched from provisioned to on-demand capacity in December 2018, we reduced our monthly costs for DynamoDB by 90% (see http://mng.bz/DD69 for details).
As a rule of thumb, the more spikes in your traffic, the more likely it is that on-demand is the best option for you. Let’s illustrate this with another example. Assume your application is used only during business hours from 9 a.m. to 5 p.m. The baseline throughput is 100 reads and 10 writes per second. However, a batch job runs between 4 p.m. and 5 p.m., which requires 1,000 reads and 100 writes per second.
With provisioned capacity mode, you need to provision for the peak load. That’s 1,000 reads and 100 writes per second. That’s why the monthly cost for a table in provisioned capacity mode is $24.75 higher than when using on-demand capacity in this example, as shown here:
// Provisioned capacity mode
$0.00065 per write capacity unit (1 write/sec) * 100 units
  * 24 hours * 30 days = $46.80 per month
$0.00013 per read capacity unit (2 eventually consistent reads/sec) * 500 units
  * 24 hours * 30 days = $46.80 per month
Reads/writes with provisioned capacity mode per month: $93.60

// On-demand capacity mode
(7 hours * 60 min * 60 sec * 10 writes)
  + (1 hour * 60 min * 60 sec * 100 writes) = 612,000 writes/day
(7 hours * 60 min * 60 sec * 100 reads)
  + (1 hour * 60 min * 60 sec * 1,000 reads) = 6,120,000 reads/day
$0.00000125 per write * 612,000 writes/day * 30 days = $22.95
$0.00000025 per read * 6,120,000 reads/day * 30 days = $45.90
Reads/writes with on-demand mode per month: $68.85
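The same calculation can be reproduced with a few lines of code (prices are the us-east-1 rates used in this example and may change):

```javascript
// Reproduces the cost comparison above: 8 business hours per day,
// 7 of them at 10 writes/100 reads per second, 1 batch hour at
// 100 writes/1,000 reads per second, 30 days per month.
const HOURS = 24, DAYS = 30;

// Provisioned mode: pay per provisioned capacity unit-hour, 24/7.
// 1 WCU = 1 write/sec; 1 RCU = 2 eventually consistent reads/sec.
const provisioned =
  (0.00065 * 100) * HOURS * DAYS +  // 100 WCUs for 100 writes/sec
  (0.00013 * 500) * HOURS * DAYS;   // 500 RCUs for 1,000 reads/sec

// On-demand mode: pay per request.
const writesPerDay = 7 * 3600 * 10 + 1 * 3600 * 100;  // 612,000
const readsPerDay = 7 * 3600 * 100 + 1 * 3600 * 1000; // 6,120,000
const onDemand =
  0.00000125 * writesPerDay * DAYS +
  0.00000025 * readsPerDay * DAYS;

console.log(provisioned.toFixed(2)); // 93.60
console.log(onDemand.toFixed(2));    // 68.85
```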
By the way, you are not only paying for accessing your data but also for storing it. By default, you pay $0.25 per GB per month. The first 25 GB are free. You only pay per use—there is no need to provision storage upfront.
Does your workload require storing huge amounts of data that are accessed rarely? If so, you should have a look at the Standard-Infrequent Access table class, which reduces the cost of storing data to $0.10 per GB per month but increases the costs for reading and writing data by about 25%. Please visit https://aws.amazon.com/dynamodb/pricing/ for more details on DynamoDB pricing.
When working with provisioned capacity mode, you need to configure the provisioned read and write capacity separately. To understand capacity units, let’s start by experimenting with the CLI, as shown here:
$ aws dynamodb get-item --table-name todo-user
➥ --key '{"uid": {"S": "emma"}}'
➥ --return-consumed-capacity TOTAL ①
➥ --query "ConsumedCapacity"
{
  "CapacityUnits": 0.5, ②
  "TableName": "todo-user"
}
$ aws dynamodb get-item --table-name todo-user
➥ --key '{"uid": {"S": "emma"}}'
➥ --consistent-read --return-consumed-capacity TOTAL ③
➥ --query "ConsumedCapacity"
{
  "CapacityUnits": 1.0, ④
  "TableName": "todo-user"
}
① Tells DynamoDB to return the used capacity units
② An eventually consistent getItem requires 0.5 capacity units.
③ A strongly consistent getItem ...
④ ... needs twice as many capacity units.
More abstract rules for throughput consumption follow:
An eventually consistent read takes half the capacity of a strongly consistent read.
A strongly consistent getItem requires one read capacity unit if the item isn’t larger than 4 KB. If the item is larger than 4 KB, you need additional read capacity units. You can calculate the required read capacity units using roundUP(itemSize / 4).
A strongly consistent query requires one read capacity unit per 4 KB of item size. This means if your query returns 10 items, and each item is 2 KB, the item size is 20 KB and you need five read capacity units. This is in contrast to 10 getItem operations, for which you would need 10 read capacity units.
A write operation needs one write capacity unit per 1 KB of item size. If your item is larger than 1 KB, you can calculate the required write capacity units using roundUP(itemSize).
If capacity units aren’t your favorite unit, you can use the AWS Pricing Calculator at https://calculator.aws to calculate your capacity needs by providing details of your read and write workload.
The provisioned throughput of a table or a global secondary index is defined per second. If you provision five read capacity units per second with ReadCapacityUnits=5, you can make five strongly consistent getItem requests per second for that table, as long as the items aren’t larger than 4 KB. If you make more requests than are provisioned, DynamoDB will first throttle your requests. If you make many more requests than are provisioned, DynamoDB will reject your requests.
Increasing the provisioned throughput is possible whenever you like, but you can only decrease the throughput of a table four to 23 times a day (a day in UTC time). Therefore, you might need to overprovision the throughput of a table during some times of the day.
It is possible but not necessary to update the provisioned capacity manually. By using Application Auto Scaling, you can increase or decrease the capacity of your table or global secondary indices based on a CloudWatch metric automatically. See http://mng.bz/lR7M to learn more.
DynamoDB does not run in your VPC. It is accessible via an API. You need internet connectivity to reach the DynamoDB API. This means you can’t access DynamoDB from a private subnet by default, because a private subnet has no route to the internet via an internet gateway. Instead, a NAT gateway is used (see section 5.5 for more details). Keep in mind that an application using DynamoDB can create a lot of traffic, and your NAT gateway is limited to 10 Gbps of bandwidth. A better approach is to set up a VPC endpoint for DynamoDB and use that to access DynamoDB from private subnets without needing a NAT gateway at all. You can read more about VPC endpoints in the AWS documentation at http://mng.bz/51qz.
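Creating a gateway endpoint for DynamoDB is a single call (a sketch; the VPC and route table IDs below are placeholders you need to replace, and the service name depends on your region):

```shell
aws ec2 create-vpc-endpoint --vpc-id vpc-12345678 \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids rtb-12345678
```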
In chapter 10, you learned about the Relational Database Service (RDS). For a better understanding, let’s compare the two different database systems. Table 12.2 compares DynamoDB and the RDS. Keep in mind that this is like comparing apples and oranges: the only thing DynamoDB and RDS have in common is that both are called databases. RDS provides relational databases that are very flexible when it comes to ingesting or querying data. However, scaling RDS is a challenge. Also, RDS is priced per database instance hour. In contrast, DynamoDB scales horizontally with little or no effort. Also, DynamoDB offers a pay-per-request pricing model, which is interesting for low-volume workloads with traffic spikes. Use RDS if your application requires complex SQL queries or you don’t want to invest time into mastering a new technology. Otherwise, you can consider migrating your application to DynamoDB.
Besides DynamoDB, a few other NoSQL options are available on AWS as shown in table 12.3. Make sure you understand the requirements of your workload and learn about the details of a NoSQL database before deciding on an option.
DynamoDB is a NoSQL database service that removes all the operational burdens from you, scales well, and can be used in many ways as the storage backend of your applications.
Looking up data in DynamoDB is based on keys. To query an item, you need to know the primary key, called the partition key.
When using a combination of partition key and sort key, many items can use the same partition key as long as their sort keys differ. This approach also allows you to query all items with the same partition key, ordered by the sort key.
Adding a global secondary index allows you to query efficiently on an additional attribute.
The scan operation searches through all items of a table, which is flexible but not efficient, and it shouldn’t be used too extensively.
Enforcing strongly consistent reads avoids running into eventual consistency problems with stale data. But reading from a global secondary index is always eventually consistent.
DynamoDB comes with two capacity modes: on-demand and provisioned. The on-demand capacity mode works best for spiky workloads.