Kinesis Data Streams (KDS) are used to collect and process data and with large numbers of shards. We can collect gigabytes of data per second and make it available for processing and analyzing in real-time for multiple consumers.
The classic use cases for Data Streams are:
- Collecting real-time metrics and reporting
- Real-time data analytics
- Accelerated log and data feed intake and processing
For any of the above use cases, it is important to control the assorted factors to ensure smooth functioning and lack of data loss when using multiple shards in KDS. In this post, I’m going to review some of the common challenges and "gotchas" that will help you smoothly implement KDS. If you’re looking for more of a background tutorial on what KDS is and how to use it, check out this in-depth video provided by AWS.
The optimal method of monitoring performance on a KDS at a shard level is to turn on enhanced monitoring for the KDS. This will give you a few more helpful metrics. One of the helpful shard level metrics is the Iterator Age in Milliseconds explained below.
IteratorAgeMilliseconds is the age of the last record in all GetRecords calls made against a KDS, measured over the specified time period in this case milliseconds. Age is the difference between the current time and when the last record of the GetRecords call was received by the stream. The Minimum and Maximum statistics can be used to track the progress of Kinesis consumer applications. A value of zero indicates that the records being read are completely caught up with the stream.
Common UpdateShardCount "Gotchas"
Shard counts are the key scaling dimensions for Data Streams. Shards determine the capacity of throughput in a data stream. They can be scaled up or down depending on the usage of the data stream. However, there are a few default limitations that are unclear for those with little experience in using Kinesis.
When scaling Data Streams, you cannot:
- Scale-up more than ten times per rolling 24-hour period per stream
- Scale-up to more than double your current shard count for a stream
- Scale down below half your current shard count for a stream
These hidden limitations can leave you pondering late into the night on why no data is being pushed to your data streams. Make sure you account for these limitations when designing your architecture to be elastic and available.
Intricacies with Partition Keys
Partition keys are unicode strings, with a maximum length of 256 characters that can be used to group data by shard within streams. A hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards using the hash key ranges. Partition keys help you scale up beyond one shard, so even if data is written to different shards and there is need for order partition keys can help maintain order during transmission Using some information such as a UUID to trace back the producer of the information is suggested best practice for using multiple shards. Partition keys also count against the total write size and are stored in the stream, so be careful with stringent payload sizes.
Using partition keys in the most effective manner:
- If order does not matter, it is most efficient to choose a well-distributed attribute for the partition keys (such as a constant variable to indicate category and some randomness such as a timestamp or random variable.
- If order does matter, ensuring that the data is put sequentially to the same shard to ensure continuity and linearity in streaming.
Following the above advice could help avoid jumbling up the data delivery, but how do we avoid the problem of pushing too much data to a single shard and leaving the rest of the shards cold.
Avoiding Hot and Cold Shards
Choosing an Id that is an identifier such as a client Id can backfire when clients with overactive data production push to the stream. This can lead to a few shards that are over-utilized and other shards that are underutilized.
Aim for a minimum of 10x as many partition keys are required as the number of shards. As mentioned earlier, if ordering is not important some randomized partition keys can be used to eliminate hot and cold shards.
With enhanced monitoring, IteratorAgeMillis metric shows the oldest message age on each shard, making it easier to find hot shards. Enhanced monitoring provides shard level metrics that would otherwise be invisible with regular monitoring, there is a cost associated with turning on enhanced monitoring, and this can be done using the AWS CLI, console or API.
These are some of the common challenges that are faced when using KDS, and paying attention to them during development can be helpful in avoiding the common pitfalls.