- NoSQL serverless database
- Fully managed, highly available with replication across 3 AZs
- Scales to massive workloads, distributed
- Millions of requests per second, trillions of rows, 100s of TB of storage
- Fast and consistent in performance (low latency retrieval)
- Integrates with IAM for security and administration
- Enables event driven programming
- Low cost
Traditional databases | NoSQL |
--- | --- |
Traditional applications leverage RDBMS databases | Non-relational, distributed databases |
SQL query language | Many different query languages, SQL can be one |
Strong requirements about how the data should be modelled | All the data should be present in one row |
Ability to do joins, aggregations and computations | Does not support joins, can't perform aggregations such as "SUM" |
Vertical scaling | Horizontal scaling |
- Made of tables
- Each table has a primary key
- Each table can have an unlimited number of rows (items)
- Each item can have attributes (columns, but they can be nested)
- Max size of an item is 400KB
- Data types supported:
- Scalar: String, Number, Binary, Boolean, Null
- Document Types: List, Map
- Set Types: String Set, Number Set, Binary Set
- Option I: partition key only (HASH), should be unique
- Partition key must be unique for each item
- Partition key must be "diverse" so data will be distributed
- Option II: partition key + sort key, combination should be unique
- Data is grouped by partition key
- Sort key = range key
Table must have a provisioned read and write capacity
Read Capacity Units (RCU): throughput for reads
Write Capacity Units (WCU): throughput for writes
Option to setup auto-scaling of throughput to meet demand
Throughput can be exceeded temporarily using "burst credits"
If no burst credits are available, we get a ProvisionedThroughputExceededException
We should retry with exponential back-off when we get the exception above
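A minimal sketch of retrying with exponential back-off, assuming a hypothetical `ThrottledError` that stands in for ProvisionedThroughputExceededException and a caller-supplied `operation`. The delay values and the random jitter are illustrative choices, not DynamoDB requirements. The AWS SDKs already retry throttled calls on their own, so this only shows the idea.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for ProvisionedThroughputExceededException."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(); on ThrottledError wait and retry with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # The delay ceiling doubles on every attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter keeps many clients from retrying at the same instant.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable only so the function can be exercised without real waiting.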
One write capacity unit represents one write per second for an item up to 1KB in size
If the items are larger than 1KB, more WCU is consumed
- 10 objects per second each of 2KB
2 * 10 = 20 WCU
- 6 objects per second, 4.5KB each
6 * 5 = 30 WCU (4.5KB should be rounded up to 5KB)
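The WCU arithmetic above can be sketched as a small helper. The function name `required_wcu` is made up for illustration; the rule it encodes (round the item size up to the next 1KB, then multiply by writes per second) is the one stated in these notes.

```python
import math


def required_wcu(writes_per_second, item_size_kb):
    """WCU needed: one unit per write per started 1KB of item size."""
    units_per_write = math.ceil(item_size_kb)  # 4.5KB counts as 5KB
    return writes_per_second * units_per_write


print(required_wcu(10, 2))    # 10 objects/s of 2KB
print(required_wcu(6, 4.5))   # 6 objects/s of 4.5KB
```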
Eventually consistent reads: a read shortly after a write may return stale data, because the write may not have replicated to every node yet
Strongly consistent reads: a read immediately after a write returns the latest data
By default DynamoDB uses eventually consistent reads, but GetItem, Query and Scan accept a ConsistentRead parameter to request strong consistency
One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second for an item up to 4KB in size
If the items are larger than 4KB, more RCU is consumed
- 10 strongly consistent reads per second of 4KB each
10 * 4KB / 4KB = 10 RCU
- 16 eventually consistent reads per second of 12KB records
(16 / 2) * (12 / 4) = 24 RCU
- 10 strongly consistent reads per second, 6KB each
10 * 8KB / 4KB = 20 RCU (we have to round up 6KB to 8KB)
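The RCU examples above follow the same pattern and can be sketched as below. `required_rcu` is a hypothetical helper name; it rounds the item size up to a multiple of 4KB and halves the cost for eventually consistent reads.

```python
import math


def required_rcu(reads_per_second, item_size_kb, strongly_consistent=True):
    """RCU needed: one unit per strongly consistent read per started 4KB."""
    units_per_read = math.ceil(item_size_kb / 4)  # 6KB counts as 8KB, i.e. 2 units
    rcu = reads_per_second * units_per_read
    if not strongly_consistent:
        rcu = rcu / 2  # eventually consistent reads cost half
    return math.ceil(rcu)


print(required_rcu(10, 4))                              # strongly consistent, 4KB
print(required_rcu(16, 12, strongly_consistent=False))  # eventually consistent, 12KB
print(required_rcu(10, 6))                              # strongly consistent, 6KB
```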
- Data is divided in partitions
- Partition keys are hashed to decide which partition the data goes to
- To compute the number of partitions:
- Capacity:
(TOTAL RCU / 3000) + (TOTAL WCU / 1000)
- Size:
TOTAL SIZE / 10GB
- Total number of partitions:
CEILING(MAX(Capacity, Size))
- WCU and RCU are spread evenly between partitions!
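A rough sketch of the partition estimate, using the capacity formula from these notes. The size term (total size divided by 10GB) is an assumption here, and the function names are made up for illustration. The result is an estimate for reasoning about throughput per partition, not something DynamoDB reports.

```python
import math


def estimate_partitions(total_rcu, total_wcu, total_size_gb):
    """Estimate the partition count as CEILING(MAX(capacity, size))."""
    by_capacity = total_rcu / 3000 + total_wcu / 1000
    by_size = total_size_gb / 10  # assumption: roughly 10GB per partition
    return math.ceil(max(by_capacity, by_size))


def per_partition_throughput(total_units, partitions):
    """Provisioned units are spread evenly, so each partition gets an equal share."""
    return total_units / partitions


partitions = estimate_partitions(total_rcu=6000, total_wcu=1000, total_size_gb=5)
print(partitions, per_partition_throughput(6000, partitions))
```

With an even split, a single hot key can only use the share of its own partition, which is why a hot key can be throttled while the table as a whole is under its limit.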
- If we exceed our RCU or WCU we get ProvisionedThroughputExceededException
- Reasons:
- Hot keys
- Hot partition
- Very large items
- Solutions:
- Exponential back-off
- Distribute partition keys as much as possible
- If RCU issue, use DynamoDB Accelerator (DAX)
- PutItem - write data to DynamoDB (create data or full replace data)
- Consumes WCU
- UpdateItem - update data in DynamoDB (partial update of attributes)
- Possibility to use Atomic Counters and increase them
- Conditional Writes:
- Accept a write/update only if some conditions are met
- Helps with concurrent access
- No additional cost
- DeleteItem
- Delete an individual row
- Ability to perform conditional deletes
- DeleteTable
- Deletes the whole table
- Much quicker deletion than DeleteItem
- BatchWriteItem
- Up to 25 PutItem and/or DeleteItem in one call
- Up to 16MB of data written
- Up to 400KB of data per item
- Batching reduces latency by reducing the number of API calls
- Operations are done in parallel by DynamoDB
- It is possible for a batch to fail, we can retry just the failed items
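A minimal sketch of batching with a retry of the failed items only. `send_batch` is a hypothetical callable that sends one batch and returns the requests that were not processed; a real client would wrap BatchWriteItem and read its unprocessed items from the response. The limit of 25 requests per call is the one stated above.

```python
def chunk_requests(requests, batch_size=25):
    """Split a list of write requests into batches of at most batch_size."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


def write_all(requests, send_batch, max_rounds=5):
    """Send all requests in batches, retrying only what came back unprocessed."""
    pending = list(requests)
    for _ in range(max_rounds):
        if not pending:
            break
        unprocessed = []
        for batch in chunk_requests(pending):
            unprocessed.extend(send_batch(batch))
        pending = unprocessed
    return pending  # whatever is still unprocessed after max_rounds
```

In practice the retry rounds would also wait with exponential back-off between attempts.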
- Read based on the primary key
- Primary key = HASH or HASH-RANGE (partition key + sort key)
- Eventually consistent by default, but has an option to have strong consistency
- ProjectionExpression can be specified to include only certain attributes
- Up to 100 items
- Up to 16MB data
- Items are retrieved in parallel
- Can be combined with projection expressions
- PartitionKey value (must be equals ("=") operator)
- SortKey value (operators: =, <, <=, >, >=, Between, begins_with) - optional
- FilterExpression to further filter the data (applied after the items are read and before the results are returned, so it does not reduce the RCU consumed)
- Returns up to 1MB of data or number of items specified by the Limit
- Able to do pagination
- We can query a table, a local secondary index or a global secondary index
- Efficient way to query DynamoDB!
- Scans the entire table and then filter data
- Returns up to 1MB of data - we can use pagination to keep reading
- Consumes a lot of RCU
- Limit impact using Limit / reduce the size of the result in order to reduce costs
- For faster performance we can use parallel scan
- Way more RCU!
- We can use a combination of ProjectionExpression + FilterExpression (no change to RCU consumed)
- Inefficient way to read data from DynamoDB!
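Both Query and Scan return at most 1MB per call, so reading everything means following the pagination key. A minimal sketch, assuming a hypothetical `fetch_page(start_key)` callable that performs one Query or Scan call and returns a response-shaped dict; `Items` and `LastEvaluatedKey` are the response fields DynamoDB uses, and the key would be passed back as the exclusive start key of the next call.

```python
def read_all_pages(fetch_page):
    """Collect items from every page until no LastEvaluatedKey is returned."""
    items = []
    start_key = None  # None means "start from the beginning"
    while True:
        page = fetch_page(start_key)
        items.extend(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return items  # last page reached
```

For a Scan this still reads the whole table, so the RCU cost is unchanged; pagination only splits the work into smaller calls.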
- Alternate range key for the table, local to the hash key
- Up to 5 LSI/table
- The sort key consists of exactly one scalar attribute
- The attribute can be String, Number, Binary
- LSI must be defined at table creation time!
- Used to speed up queries on non-key attributes
- GSI = partition key + optional sort key
- This index can be seen as a new "table"
- We can project attributes on the new "table"
- The partition key and the sort key of the original table are always projected (KEYS_ONLY)
- We can specify extra attributes to be projected (INCLUDE)
- We can use all attributes from main table (ALL)
- We must define a RCU/WCU for the index
- GSI can be created and modified after origin table creation!
- DynamoDB indexes can cause throttling
- GSI: If writes are throttled in case of a GSI, the main table is throttled as well! This can happen even if the WCU on the main table is just fine
- LSI: uses the same WCU and RCU on the main table, can not throttle the main table
- Conditional update/delete: ensures the item hasn't changed before altering it
- This feature gives DynamoDB optimistic locking / optimistic concurrency control
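The idea can be sketched with an in-memory dict standing in for a table. The `version` attribute and the `ConditionalCheckFailed` class are assumptions made for illustration; in DynamoDB the check would be expressed as a condition on the write, evaluated by the service.

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for the error returned when a write condition is not met."""


def conditional_update(table, key, expected_version, changes):
    """Apply changes only if the item still has the version the writer read."""
    item = table[key]
    if item["version"] != expected_version:
        raise ConditionalCheckFailed(
            f"expected version {expected_version}, found {item['version']}"
        )
    item.update(changes)
    item["version"] = expected_version + 1  # bump so stale writers are rejected
    return item


table = {"post-1": {"title": "draft", "version": 1}}
conditional_update(table, "post-1", 1, {"title": "edited"})  # succeeds, version is now 2
```

A second writer that still holds version 1 would get `ConditionalCheckFailed` and has to re-read the item before trying again.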
- It is a seamless caching mechanism for DynamoDB
- It can be activated without doing any code change on the application which uses the database
- Writes go through DAX to DynamoDB
- Microsecond latency for cached reads & queries
- Solves one big problem: the Hot Key problem (too many requests for the same key)
- By default: 5 min TTL for every item in the cache
- Up to 10 nodes in the cluster
- Multi AZ (3 nodes minimum recommended)
- Secure (encryption at rest with KMS, IAM, CloudTrail)
- Changes in a DynamoDB table (create, update, delete) are pushed into a stream
- This stream can be read by Lambda or by EC2 instances
- They can react to this changelog by:
- Doing analytics
- Create derivative tables/view
- Insert data in ElasticSearch
- etc.
- Could implement cross region replication using Streams (nowadays it is a provided feature by DynamoDB)
- Streams have 24 hours of retention (we can not change this)
- We can choose what type of information ends up in stream:
- KEYS_ONLY - only the key attributes of the modified items are pushed to the stream
- NEW_IMAGE - the entire item is pushed to the stream after it was modified
- OLD_IMAGE - the entire item is pushed to the stream before it was modified
- NEW_AND_OLD_IMAGES - both the new and the old versions of the item are pushed to the stream
- DynamoDB streams are made of shards, we don't have to provision them
- Records are not retroactively pushed to the stream after enabling it!
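To make the four view types concrete, here is a simplified sketch of what each one would carry for a single change. The function name is made up, and the items are plain dicts; real stream records wrap these fields in more metadata and use DynamoDB's attribute-value encoding.

```python
VIEW_TYPES = ("KEYS_ONLY", "NEW_IMAGE", "OLD_IMAGE", "NEW_AND_OLD_IMAGES")


def stream_payload(view_type, keys, old_item=None, new_item=None):
    """Return the parts of a change that the given view type would include."""
    if view_type not in VIEW_TYPES:
        raise ValueError(f"unknown view type: {view_type}")
    payload = {"Keys": keys}  # the key attributes are always present
    if view_type in ("NEW_IMAGE", "NEW_AND_OLD_IMAGES") and new_item is not None:
        payload["NewImage"] = new_item  # item as it looks after the change
    if view_type in ("OLD_IMAGE", "NEW_AND_OLD_IMAGES") and old_item is not None:
        payload["OldImage"] = old_item  # item as it looked before the change
    return payload


keys = {"post_id": "post-2"}
old = {"post_id": "post-2", "content": "v1"}
new = {"post_id": "post-2", "content": "v2"}
print(stream_payload("NEW_AND_OLD_IMAGES", keys, old, new))
```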
- We need to define an event source mapping to read from the stream
- We need to ensure the Lambda has the appropriate permissions
- The Lambda is invoked synchronously
- TTL = item is automatically deleted after an expiry date/time
- TTL is provided at no extra cost / no additional WCU/RCU used
- TTL is a background task operated by DynamoDB
- Helps reduce storage and manage the size of the table over time
- Helps adhere to regulatory norms
- TTL is enabled per row (we define a TTL column and add a date there)
- Can be named however we want
- Should be a number (we should use time to epoch conversion for deletion timestamp)
- DynamoDB deletes expired items within 48 hours (it won't happen right after the item expires)
- Deleted items are deleted from the indexes as well (GSI/LSI)
- Deleted items could be recovered if DynamoDB Streams were enabled
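A small sketch of the epoch conversion for the TTL column. The attribute name `expires_at` and the helper names are assumptions; the value is stored as whole seconds since the Unix epoch. Because deletion is delayed, a reader may want to skip items that are already past their expiry, which `is_expired` illustrates.

```python
from datetime import datetime, timedelta, timezone


def ttl_timestamp(created_at, lifetime):
    """Return the expiry time as whole seconds since the Unix epoch."""
    return int((created_at + lifetime).timestamp())


def is_expired(item, now, ttl_attribute="expires_at"):
    """True if the item's TTL value is in the past, even if not yet deleted."""
    return item[ttl_attribute] < int(now.timestamp())


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
session = {
    "session_id": "abc",
    "expires_at": ttl_timestamp(created, timedelta(days=1)),
}
print(session["expires_at"], is_expired(session, created + timedelta(days=2)))
```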
--projection-expression: attributes to retrieve
--filter-expression: filter results
General CLI pagination options:
- Optimization:
- --page-size: full dataset is still received but each API call will request less data (helps avoid timeouts)
- Pagination:
- --max-items: max number of results returned by the CLI. Returns NextToken
- --starting-token: specify the last received NextToken to keep reading data from the next page
CLI Examples:
- Scan:
aws dynamodb scan --table-name blog-posts --projection-expression "post_id, content" --region us-east-2
aws dynamodb scan --table-name blog-posts --projection-expression "post_id, content" --region us-east-2 --filter-expression "post_id = :p" --expression-attribute-values '{ \":p\" : { \"S\" : \"post-2\" }}'
aws dynamodb scan --table-name blog-posts --projection-expression "post_id, content" --region us-east-2 --page-size 10 --max-items 1
aws dynamodb scan --table-name blog-posts --projection-expression "post_id, content" --region us-east-2 --page-size 10 --max-items 1 --starting-token <token>
You can return the consumed RCU by adding the option
--return-consumed-capacity TOTAL
or --return-consumed-capacity INDEXES
to the command. It will return the RCU consumed (WCU will always be 0 in this case, since you are only reading)
- Ability to create/update/delete multiple rows in different tables at the same time
- "All or nothing" type operation
- Write modes: Standard, Transactional
- Read modes: Eventually Consistent, Strongly Consistent, Transactional
- For Transactional it consumes 2x the WCU/RCU
- It is common for DynamoDB to be used to store session state of a web application
- Same functionality can be achieved with ElastiCache as well, differences:
- ElastiCache is in-memory, DynamoDB is serverless (automatic scaling)
- Both are key/value store
- Same functionality can also be achieved with EFS disk storage, differences:
- EFS must be attached to an EC2 instance, DynamoDB can be accessed via REST calls
- EFS can be used with Lambda only if the function is connected to a VPC
- ... EBS & Instance Store disk storage:
- EBS & Instance store can only be used as local caching, they are not shared
- ... S3:
- S3 has higher latency, not a great tool for session state storage
- If the partition key is not distributed enough, or we have a limited number of partition keys, we can add a random suffix (or computed suffix) to the key to achieve a better distribution
- Example: voting system if we have 2 candidates
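A minimal sketch of the suffix idea for the voting example. The key format `candidate#N`, the shard count and the function names are assumptions chosen for illustration. A random suffix spreads writes; a computed suffix (here a CRC32 of a voter id) gives the same shard for the same input. Reading a total then means reading every shard key and summing.

```python
import random
import zlib


def random_sharded_key(base_key, shard_count, rng=random):
    """Append a random suffix so writes spread over shard_count partition keys."""
    return f"{base_key}#{rng.randrange(shard_count)}"


def computed_sharded_key(base_key, discriminator, shard_count):
    """Append a suffix derived from another value, so it can be recomputed later."""
    shard = zlib.crc32(discriminator.encode("utf-8")) % shard_count
    return f"{base_key}#{shard}"


def all_shard_keys(base_key, shard_count):
    """Every key that must be read to reassemble the full result."""
    return [f"{base_key}#{i}" for i in range(shard_count)]


print(random_sharded_key("candidate_A", 10))
print(all_shard_keys("candidate_A", 3))
```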
- Concurrent writes:
- Two users try to update the same record at the same time
- The last update wins
- Conditional writes
- A write can only happen if a specific condition is met
- First write will succeed, second write will fail because the condition won't be met
- Important: data is not overridden
- Atomic writes:
- Batch writes:
- Write an update containing many items at the same time
- Max object size is 400KB
- Pattern:
- Large object is uploaded to S3
- Metadata is stored in DynamoDB
- Reader clients read the metadata, which contains the object path, and read the object from the obtained path
- Index objects in S3:
- Write into S3 bucket -> trigger notification starts a Lambda function -> Lambda updates DynamoDB table
- Table cleanup:
- Option 1: Scan + Delete => very slow, consumes RCU and WCU
- Option 2: Drop table => fast, cheap, efficient
- Copy table:
- Option 1: Use AWS DataPipeline (use EMR): takes the table, puts it into S3, puts it back into DynamoDB table (outdated)
- Option 2: create a back-up, restore it into a new table
- Option 3: Scan + Write
- Security:
- VPC Endpoints to access DynamoDB without internet
- Access fully controlled by IAM
- Encryption at rest using KMS
- Encryption in transit using SSL/TLS
- Backup/Restore:
- Point-in-time restore, like RDS
- No performance impact
- Global Table:
- Multi region, fully replicated, high performance
- Amazon DMS:
- Used to migrate data from Mongo, Oracle, MySQL, S3 to DynamoDB
- Local DynamoDB for development