AWS Knowledge Series: DynamoDB — Part 1

Sanjay Dandekar
13 min read · Aug 31, 2020

DynamoDB is a fully managed NoSQL database and one of the key components of modern serverless design on AWS. It is highly scalable and offers very low latency for both read and write operations. In this article we will look at the following topics related to DynamoDB and its usage.

  • Data modelling
  • Auto deletion of “expired” / “old” records from DB
  • Triggers

DynamoDB is most suitable for data that does not change often but is read many times, i.e. written once (or maybe a few times) and read multiple times.

Data Modelling in DynamoDB

If you are coming from an RDBMS background and are trying to figure out how to use DynamoDB for your use case, this is where you will struggle the most. To explain the approach one has to take while modelling for DynamoDB, keep the following two things in mind:

  • Storage is cheap — You can store the same data in multiple ways and it will not cost much.
  • Compute is expensive — If you have to make multiple DynamoDB calls to get the data you want, it will be expensive. So store the data in a way that lets you get to it without multiple round trips.

One important aspect to keep in mind is that you design a DynamoDB table for the access patterns you expect to implement in your application. Unlike an RDBMS, where you can add or modify indexes on the fly, you cannot change a DynamoDB table's primary key after it is created. So the first step while designing a DynamoDB model is to look at your use cases and your datasets and list out the access patterns. Based on those access patterns you will design the DynamoDB schema and the various indexes. If in the future your use cases or access patterns change, you will have to re-create the DynamoDB model and migrate the data to the newly designed schema. This is such a critical aspect that it is worth spending most of your time thinking about all your access patterns.

To understand the entire process, let us start with a simple RDBMS model shown below:

Based on the above model let us say that we have following access patterns for our use cases:

  • Fetch a single customer
  • Fetch a single product
  • Fetch a single order
  • Fetch all orders of a given customer
  • Fetch all products in a given order
  • Fetch all orders for a given product

Let us see how we can design a DynamoDB table that can store all of the data held in the four tables above and provide double-digit millisecond latency for reads and writes even with millions of records in the table.

One table or four tables?

So the first question that arises is: how many tables should we have in DynamoDB? The answer lies in the following statement, found in almost all AWS YouTube videos and developer documentation related to DynamoDB.

“Most well-designed applications require only one table”

So the short answer is: one table. Let us see if we can achieve this for the use case described above. Remember, unlike an RDBMS, in DynamoDB you cannot do joins across tables.

Primary Key in DynamoDB table

Like RDBMS tables, DynamoDB tables also have the concept of a “primary key”. In an RDBMS, multiple columns can be part of the primary key (a composite primary key). In DynamoDB, the primary key has two components:

Hash Key — This is the mandatory part of the primary key. When a record is inserted into DynamoDB, the value you provide as the hash key decides which underlying partition it is stored in. In a table whose primary key is just the hash key, every record must have a unique hash key — two records cannot share the same hash key.

Range Key — This is the optional part of the primary key. Records that share a hash key are stored and retrieved in the order defined by the range key. Since this part is used for sorting, two records can have the same range key (under different hash keys); it is the hash key plus range key combination that must be unique.

When defining a DynamoDB table, it is mandatory to define its hash key column. If your table has both a hash key and a range key, you must specify both values to access a single item. So operations like get item, put item, update item and delete item require both the hash key and the range key to uniquely identify the item. For a table with a composite key, only the Query API lets you search using just the hash key.
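
A minimal sketch (Python with boto3) of this difference, assuming a hypothetical table named ORDERS with hash key CUSTOMER_ID and range key ORDER_DATE; the names and values are illustrative only:

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical composite-key table: hash key CUSTOMER_ID, range key ORDER_DATE.
table = boto3.resource("dynamodb").Table("ORDERS")

# GetItem (and put / update / delete) must specify BOTH parts of the primary key.
item = table.get_item(
    Key={"CUSTOMER_ID": "1", "ORDER_DATE": "2020-08-01"}
).get("Item")

# Query only needs the hash key and returns every item that shares it,
# sorted by the range key.
orders = table.query(
    KeyConditionExpression=Key("CUSTOMER_ID").eq("1")
)["Items"]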

With this in mind let's start designing the DynamoDB table:

Let us say we have following data in the CUSTOMER and PRODUCT tables:

CUSTOMER Table Data:

PRODUCT Table Data:

To accommodate the above data, the simplest DynamoDB structure that one can come up with is as follows:

ECOMMERCEDB — DynamoDB table with one record for customer added:

Now if we try to add the first product entry to the above table, it will fail because hash keys have to be unique for each record — we cannot use the same hash key for a Customer and a Product record. The simplest solution in such cases is to use a “prefix” to qualify the record type. Let us use “CUSTOMER#” as the prefix for customer data and “PRODUCT#” for product data. With this, the data for products and customers can fit into a single table as follows:

With a unique prefix we were able to qualify the records, and now we can store both customers and products in a single DynamoDB table.
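
A minimal sketch (Python with boto3) of writing both record types with prefixed hash keys; ITEM_ID matches the table definition shown later, while the remaining attribute names and values are illustrative:

import boto3

table = boto3.resource("dynamodb").Table("ECOMMERCEDB")

# Customer record: hash key qualified with the CUSTOMER# prefix.
table.put_item(Item={
    "ITEM_ID": "CUSTOMER#1",
    "email": "customer1@example.com",  # illustrative attribute
})

# Product record: hash key qualified with the PRODUCT# prefix.
table.put_item(Item={
    "ITEM_ID": "PRODUCT#1",
    "price": 100,  # illustrative attribute
})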

With the above in place, and given that each customer and product has a unique hash key, we have met the following access patterns:

  • Fetch a single customer
  • Fetch a single product

One question you may have is: what is stored in the “price” column for a customer record, or in the “email” column for a product record? The answer is that there is no column (or, more accurately, attribute) named “price” on the customer row and no “email” on the product row. DynamoDB is schemaless — each record can have its own structure. The only requirement is that every record has a hash key (and a range key, if the table uses a composite primary key).

Now let’s try to add the ORDER table data to the above DynamoDB table:

Let’s apply the previous trick and see what we get:

With the above, we have met one more access pattern:

  • Fetch a single order

However, if we have to find all orders of a customer, there is no efficient way of doing it with the above design. So now is the time to introduce another DynamoDB concept called the “Global Secondary Index”. A global secondary index (GSI) is an index with a partition key and a sort key that can be different from those on the base table. GSI queries can span all of the data in the base table, across all partitions. So if we want to find all orders for a given customer, we can create a GSI with the customer ID as its hash key. This allows us to use the Query API on the GSI, specifying the GSI hash key — the customer ID — to search for all order records across all partitions of the table. Unlike the primary key, the hash key of a GSI need not be unique. With this information, we now have the following DynamoDB table:

With the above DynamoDB table design we have now met the following access pattern as well:

  • Fetch all orders of a given customer
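
A minimal sketch (Python with boto3) of that query, assuming the GSI name OtherIdIndex and attribute OTHER_ID from the Terraform definition further below, and assuming order records store the customer reference as CUSTOMER#<id>:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ECOMMERCEDB")

# All orders of customer 1: query the GSI by its hash key; no table scan needed.
response = table.query(
    IndexName="OtherIdIndex",
    KeyConditionExpression=Key("OTHER_ID").eq("CUSTOMER#1"),
)
customer_orders = response["Items"]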

Before we can address the two remaining access patterns, we first have to import the ORDER_DETAIL data into the DynamoDB table:

ORDER_DETAIL Table Data:

The DynamoDB table with the prefix applied:

The ORDER_DETAIL data is stored, but there is no efficient way to find all the products that are part of an order, or all the orders for a given product. We already have one GSI — let us see if we can use it to achieve these access patterns.

Let us first address finding all products that are part of an order. The modified DynamoDB table looks as follows:

See how we applied the prefix concept to the hash key of the GSI. Now we can get all products in a given order by querying the GSI. However, this still does not give us a way to find all orders for a given product. So what can we do? Remember what I said previously — storage is cheap, and it is fine to store more data if that lets us access it without extra computation or multiple round trips.

So one way of doing this is as follows:

YES — the data is duplicated, but that is exactly the point of “storage is cheap”. With the above design it is now possible to find all the orders for a given product. An alternate way of doing this is to introduce another GSI, as follows:

The following Terraform definitions describe a DynamoDB table that can store the data of all four RDBMS tables and provide millisecond read/write latency at internet scale.

Option 1 — Single GSI, Multiple records for ORDER-PRODUCT combination

resource "aws_dynamodb_table" "ecommerce_db" {
  name         = "ECOMMERCEDB"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "ITEM_ID"

  attribute {
    name = "ITEM_ID"
    type = "S"
  }

  attribute {
    name = "OTHER_ID"
    type = "S"
  }

  global_secondary_index {
    name            = "OtherIdIndex"
    hash_key        = "OTHER_ID"
    projection_type = "ALL"
  }
}

Option 2 — Two GSIs

resource "aws_dynamodb_table" "ecommerce_db" {
  name         = "ECOMMERCEDB"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "ITEM_ID"

  attribute {
    name = "ITEM_ID"
    type = "S"
  }

  attribute {
    name = "OTHER_ID"
    type = "S"
  }

  attribute {
    name = "OTHER_ID2"
    type = "S"
  }

  global_secondary_index {
    name            = "OtherIdIndex"
    hash_key        = "OTHER_ID"
    projection_type = "ALL"
  }

  global_secondary_index {
    name            = "OtherId2Index"
    hash_key        = "OTHER_ID2"
    projection_type = "ALL"
  }
}

As you can see, we were able to use just one DynamoDB table and still achieve all the access patterns we wanted. One can follow a similar approach to design a DynamoDB table that provides the same functionality as a set of RDBMS tables, with the added flexibility that we can extend the schema of the ORDER or CUSTOMER object whenever we want!

The following shows how all query patterns can be realized:
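
As an illustration, here is a minimal sketch (Python with boto3) of how each access pattern could map to a single GetItem or Query call. It assumes the key conventions used above: CUSTOMER# / PRODUCT# / ORDER# prefixes in ITEM_ID, the customer or order reference stored in OTHER_ID, and (for option 2) the product reference stored in OTHER_ID2:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ECOMMERCEDB")

# Fetch a single customer / product / order: GetItem on the prefixed hash key.
customer = table.get_item(Key={"ITEM_ID": "CUSTOMER#1"}).get("Item")
product = table.get_item(Key={"ITEM_ID": "PRODUCT#1"}).get("Item")
order = table.get_item(Key={"ITEM_ID": "ORDER#1"}).get("Item")

# Fetch all orders of a given customer: query the GSI by the customer reference.
customer_orders = table.query(
    IndexName="OtherIdIndex",
    KeyConditionExpression=Key("OTHER_ID").eq("CUSTOMER#1"),
)["Items"]

# Fetch all products in a given order: query the GSI by the order reference.
order_products = table.query(
    IndexName="OtherIdIndex",
    KeyConditionExpression=Key("OTHER_ID").eq("ORDER#1"),
)["Items"]

# Fetch all orders for a given product (option 2): query the second GSI.
product_orders = table.query(
    IndexName="OtherId2Index",
    KeyConditionExpression=Key("OTHER_ID2").eq("PRODUCT#1"),
)["Items"]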

Auto deletion of “expired” / “old” records from DB

As mentioned earlier, DynamoDB is suitable for data that is seldom modified but read many times. With such a pattern, the data in DynamoDB keeps growing over time. For instance, in our example the order data will keep growing, and it is highly unlikely that a user wants to access very old orders, so keeping that data in DynamoDB is unnecessary. How does one get rid of older data? One way is to run a batch job that looks at the “age” of the data and deletes old records, but this is not advisable for the following reasons:

  • You will incur cost for executing the regular batch job
  • There is a cost for reading the GSI that tracks the age of the data, and also for deleting the data from DynamoDB
  • You will have to scan the entire table, and as the table grows this becomes unmanageable

So what is the way out? DynamoDB provides functionality to define an “expiry” time on a record: DynamoDB regularly checks for expired records and deletes them, and — bonus — there is no cost associated with this! How does one do it? It is simple: when you define the table, you designate a “time to live” (TTL) attribute. For records that you want deleted, set this attribute to the Unix epoch timestamp (in seconds, UTC) after which the record is to be treated as expired. Note that deletion is not instantaneous — DynamoDB typically removes expired items within a short window (up to about 48 hours) after they expire.

The following Terraform configuration defines the TTL attribute for ECOMMERCEDB:

resource "aws_dynamodb_table" "ecommerce_db" {
  name         = "ECOMMERCEDB"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "ITEM_ID"

  attribute {
    name = "ITEM_ID"
    type = "S"
  }

  attribute {
    name = "OTHER_ID"
    type = "S"
  }

  global_secondary_index {
    name            = "OtherIdIndex"
    hash_key        = "OTHER_ID"
    projection_type = "ALL"
  }

  ttl {
    attribute_name = "EXPIRE_TS"
    enabled        = true
  }
}

All records that have an “EXPIRE_TS” attribute become eligible for automatic deletion once they expire. For example, the ORDER record shown below will be deleted after 30 Aug 2020, 12:00 PM UTC.

Remember, this is just an attribute of the record and, like any other attribute, it can be modified. So if you want to extend the life of a specific record because you saw activity on it (e.g. someone fetched the order record), you can always push the expiry time out and keep the record around a little longer than originally planned.
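
A minimal sketch (Python with boto3) of extending a record's expiry, assuming EXPIRE_TS holds a Unix epoch timestamp in seconds; the ORDER#1 key is illustrative:

import time
import boto3

table = boto3.resource("dynamodb").Table("ECOMMERCEDB")

# Push the expiry out by 30 days from now (epoch seconds, UTC).
new_expiry = int(time.time()) + 30 * 24 * 60 * 60

table.update_item(
    Key={"ITEM_ID": "ORDER#1"},
    UpdateExpression="SET EXPIRE_TS = :ts",
    ExpressionAttributeValues={":ts": new_expiry},
)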

Triggers

Triggers are a very important concept in the RDBMS world, and the good news is that they are supported in DynamoDB as well. In the case of DynamoDB, the target of a trigger is a Lambda function (or an SNS / SQS endpoint).

Following are the steps involved in configuring triggers:

  • Enable DynamoDB Streams and specify what data you want the trigger function to receive
  • Create a Lambda function and associate it with the table as the trigger handler. Ensure that the role assigned to the Lambda function can perform the GetRecords, GetShardIterator, DescribeStream and ListStreams actions on your stream.

Enabling streams is very straightforward — in the AWS console, select the DynamoDB service and open the DynamoDB table for which you want to enable streams. In the Overview tab, click the “Manage Streams” button to bring up the following dialog:

Select the option that is most appropriate for your use case and click “Enable” to enable streams on the DynamoDB table.

Next, let’s look at the entries required in the IAM role assigned to the Lambda function that will be the target of the trigger.

{
  "Effect": "Allow",
  "Action": [
    "dynamodb:DescribeStream",
    "dynamodb:GetRecords",
    "dynamodb:GetShardIterator",
    "dynamodb:ListStreams"
  ],
  "Resource": [
    "<<ARN of the DynamoDB Stream>>"
  ]
}

Once the IAM role is configured, you can assign an existing Lambda function (or create a new one) as the trigger handler. In the DynamoDB console, select the “Triggers” tab and click the “Create trigger” button to bring up a dialog where you can configure the Lambda function:

Select the Lambda function that will handle the DynamoDB stream as data records are created, modified or deleted in the DynamoDB table. One important attribute to configure is the “batch size”. If it is not configured properly, your trigger will be invoked over and over because the Lambda function keeps failing.

A Lambda function has a maximum execution time of 15 minutes. Let us say the processing you implement in the Lambda function takes at most 10 seconds per record. The worst case is then that the function needs to run for 1000 seconds (100 records, the default batch size, x 10 seconds), which exceeds the 15-minute (900-second) limit of Lambda. So in the worst case, when the DynamoDB trigger sends 100 records, the invocation will fail and the same batch will be retried; for stream event sources a failed batch is, by default, retried until the records expire from the stream (this can be capped with the event source mapping's maximum retry attempts setting), so the function will keep failing on the same oversized batch. When deciding the batch size, keep the maximum expected execution time for a single record in mind, allow some buffer for unexpected conditions, and then choose the batch size for your function.

One interesting aspect to note is that when DynamoDB deletes “expired” records, the triggers are called as well. So if your use case requires it, you can use that opportunity to archive the data being deleted to cheap long-term storage.

The following snippet shows how to configure streams using Terraform:

resource "aws_dynamodb_table" "ecommerce_db" {
  name         = "ECOMMERCEDB"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "ITEM_ID"

  attribute {
    name = "ITEM_ID"
    type = "S"
  }

  attribute {
    name = "OTHER_ID"
    type = "S"
  }

  global_secondary_index {
    name            = "OtherIdIndex"
    hash_key        = "OTHER_ID"
    projection_type = "ALL"
  }

  ttl {
    attribute_name = "EXPIRE_TS"
    enabled        = true
  }

  stream_enabled   = true
  stream_view_type = "NEW_IMAGE"
}

The Terraform to map the Lambda function to the trigger is as follows:

resource "aws_lambda_event_source_mapping" "trigger-handler" {
  # Note: the event source must be the table's stream ARN (e.g.
  # aws_dynamodb_table.ecommerce_db.stream_arn), not the table ARN.
  event_source_arn  = "<<ARN of the DynamoDB Stream>>"
  function_name     = "<<Lambda function name>>"
  batch_size        = 20
  starting_position = "LATEST"
}

The following code shows how you can handle the “INSERT” event in the Lambda function configured as the trigger handler:
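
A minimal sketch of such a handler in Python, assuming the NEW_IMAGE stream view type configured above; the attribute handling is kept deliberately simple:

def lambda_handler(event, context):
    for record in event["Records"]:
        # eventName is INSERT, MODIFY or REMOVE. TTL deletions arrive as REMOVE
        # events whose userIdentity.principalId is "dynamodb.amazonaws.com".
        if record["eventName"] != "INSERT":
            continue

        # With the NEW_IMAGE view, the full item as written is available here,
        # in DynamoDB's typed JSON form (e.g. {"S": "ORDER#1"}).
        new_image = record["dynamodb"]["NewImage"]
        item_id = new_image["ITEM_ID"]["S"]
        print(f"New item inserted: {item_id}")

    return {"status": "ok"}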

That’s it for this article. In the next article, I will talk about the following aspects of DynamoDB:

  • Exponential backoff when doing batch operation
  • Updating just a few attributes of a record
  • Atomic update of numeric attribute of record
  • Accessing DynamoDB service from a Lambda hosted in VPC
