Kinesis is a managed service that makes it easy to collect and analyze data and video streams in real time. It is considered a managed alternative to Apache Kafka.
- Kinesis Streams: low latency streaming ingest at scale
- Kinesis Analytics: real-time analytics on streams using SQL
- Kinesis Firehose: load streams into S3, Redshift, Elasticsearch, etc.
- Streams are divided into shards (also called partitions).
- Data is ordered within each shard
- Data can be replayed
- Multiple applications can consume the same stream
- Data in Kinesis is immutable
- A stream can contain one or more shards
- Write capacity: 1 MB/s or 1,000 records per second, per shard
- Read capacity: 2 MB/s per shard (e.g., a stream with 10 shards supports 10 MB/s writes and 20 MB/s reads)
- Billing is per shard
- Records are ordered per shard
- PutRecord API: requires a partition key, which is hashed to determine the target shard.
- Partition key should be highly distributed (otherwise a single shard can become a "hot shard" and get overwhelmed)
- PutRecords API accepts batches of records, which reduces cost and increases throughput
- ProvisionedThroughputExceededException is thrown when capacity is exceeded. Solutions: retry with exponential back-off, add more shards, ensure the partition key is highly distributed (see the sketch below)
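A minimal producer sketch (Python with boto3; the stream name and partition keys are hypothetical) showing both the single-record retry pattern and the batched PutRecords call:

```python
import time

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(stream_name, data, partition_key, max_retries=5):
    """Send one record, retrying with exponential back-off on throttling."""
    for attempt in range(max_retries):
        try:
            return kinesis.put_record(
                StreamName=stream_name,
                Data=data,
                PartitionKey=partition_key,  # hashed to pick the target shard
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt)  # exponential back-off before retrying

    raise RuntimeError("still throttled after %d retries" % max_retries)

# Batched variant: PutRecords amortizes per-request overhead, reducing cost.
kinesis.put_records(
    StreamName="my-stream",  # hypothetical stream name
    Records=[
        {"Data": b"event-1", "PartitionKey": "user-42"},
        {"Data": b"event-2", "PartitionKey": "user-7"},
    ],
)
```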
- KCL (Kinesis Client Library) is a Java library that helps consume and process data from Kinesis Streams (see the sketch after this list)
- Each shard should be read by only one KCL instance!
- Progress checkpoints are written to DynamoDB
- Records are read in order at the shard level
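KCL itself is Java and handles shard leases and checkpointing for you; as a rough illustration of the underlying consume loop, here is a sketch reading a single shard with the low-level boto3 API (the stream name is hypothetical):

```python
import time

import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"  # hypothetical

# Pick the first shard. KCL would instead coordinate one worker per shard
# and persist checkpoints to DynamoDB.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:  # records arrive in order within the shard
        print(record["SequenceNumber"], record["Data"])
    iterator = resp["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read limits
```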
- Access control using IAM policies
- Encryption in flight using HTTPS
- Encryption at rest using KMS (see the sketch below)
- Client-side encryption/decryption (must be implemented manually)
- VPC Endpoints available for Kinesis
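As an illustration of at-rest encryption, server-side encryption can be enabled on an existing stream (a sketch; the stream name is hypothetical, and the key shown is the AWS-managed Kinesis key):

```python
import boto3

kinesis = boto3.client("kinesis")

# Enable server-side encryption at rest with a KMS key.
kinesis.start_stream_encryption(
    StreamName="my-stream",      # hypothetical
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",   # AWS-managed key; a customer key also works
)
```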
- Real-time analytics on Kinesis Streams using SQL
- Pay for actual consumption, no need to pre-provision anything
- Can create streams from query results
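Application code in Kinesis Analytics is SQL built around in-application streams and pumps. A rough sketch of such a continuous query (all stream and column names here are hypothetical), held in a Python string as it would be when deploying through the API:

```python
# Hypothetical Kinesis Analytics application code: counts events per user
# over one-minute tumbling windows on the source stream.
APPLICATION_CODE = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("user_id" VARCHAR(32), "event_count" INTEGER);

CREATE OR REPLACE PUMP "MY_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM "user_id", COUNT(*)
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY "user_id",
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '1' MINUTE);
"""
```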
- Used for ETL, can load data into Redshift, S3, Elasticsearch, etc.
- Fully managed, automatic scaling
- Near real time: minimum latency of 60 seconds (data is buffered before delivery)
- Pay for the amount of data which goes through Firehose
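A minimal sketch of writing into a delivery stream with boto3 (the delivery stream name is hypothetical); Firehose buffers the records and delivers them to the configured destination:

```python
import boto3

firehose = boto3.client("firehose")

# Send one record; Firehose buffers it and delivers to the configured
# destination (S3, Redshift, Elasticsearch, ...).
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # hypothetical
    Record={"Data": b'{"event": "click"}\n'},
)
```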
SQS | SNS | Kinesis |
---|---|---|
Consumers pull data | Data is pushed to many subscribers | Consumers pull data |
Data is deleted after being consumed | Up to 10 million subscribers | Can have as many consumers as we want |
Can have as many consumers as we want | Data is not persisted | Can replay data |
No need to provision throughput | Pub/sub model | Used for big data analytics and ETL |
No ordering (except FIFO queues) | Up to 100k topics | Data expires after X days |
Individual message delay capability | No need to provision throughput | Must provision throughput |
 | Integrates with SQS for fan-out architectures | Ordering at the shard level |