At Trek10, we are heavy users of Amazon Kinesis Data Streams for building scalable, reliable, and cost-effective cloud native data architectures. In a blog post from a couple years ago, we discussed some high-level concepts of Kinesis Data Streams. That blog post describes several important concepts to keep in mind when working with Kinesis Data Streams. But to really build enterprise-scale solutions, you need to get much deeper. In this blog series, we will seek to expand upon that baseline knowledge, dive a bit deeper into how Kinesis works, and explore some more complicated Kinesis concepts. At a quick glance, some of the advanced topics we will cover over three parts include:
Enhanced Fan Out
Before we jump into any of these, it makes sense to talk a bit further about what exactly Kinesis is and how it works. It helps to consider Kinesis Data Streams as actual storage. This storage has special properties such as:
Each stream has a well-defined Time-to-Live (Known as RetentionPeriod)
Each record has a timestamp for when it was inserted
Each record has a partition key provided by the producer
The data is partitioned into logical groups called shards based on partition key
Within a shard, the records are ordered based on when they were inserted
There is a mechanism called an ‘Iterator’ which can be pointed at any timestamp and proceed to return all records in order.
Some liberty is taken with the concept of a timestamp, so there is no explicit timestamp value. The concept of a timestamp is handled behind the scenes for us by tracking sequence IDs. Every record within Kinesis has only a Sequence Id, Partition Key, and Data Blob. The Sequence ID is provided to the record by Kinesis. The partition key and data blob are provided by the application producer. Understanding that this is a storage of data and not a simple queue can help vastly in figuring out the differences between Kinesis and other services which get used in similar scenarios.
Although partition keys were discussed in the previous post, they are such a crucial part of what makes Kinesis work. Implementations that do not require your messages to be read in order can disregard the choice of partition keys as nothing more than random values. However, for implementations where the order is paramount to effective delivery, you must carefully consider the choice of key. Because of this importance, we will discuss them more with additional detail.
Each shard for a given stream is provided a hash range. The shard itself will contain all records whose hashed partition values fall within this hash range. To discover the hash ranges of your shards, you can perform the DescribeStream API call. This API call will return to you a payload including the maximum and minimum hash values for all the shards currently provisioned. When you provide a partition key value, that value is passed through a hash function. The resulting hash is then mapped to the correct shard.
This level of detail can help to understand why it is so important to heavily consider your choice in partition keys. Should you choose an uninformed choice of providing a single value for all records partition keys, then all records will always contain the identical hash value. This means there can never be a way for you to efficiently scale your workload and any choice in shard count or parallelization factor would be irrelevant. It also provides interesting possibilities for hot shards, described in the other post, to occur in unexpected ways. Let's use an example to showcase this possibility.
We will start with two shards and a simplified max hash range of [0,9]. That is, all records will be hashed to values between 0 and 9 inclusive. Should we choose a partition key with a discrete number of values, we can have an interesting problem occur. Our first shard will be given the values [0,4] and second shard the values of [5,9]. If our partition key for each record is one of the four following evenly distributed values; foo, bar, biz, and baz, then these values will have unique and consistent hash values. By definition, a well-defined hash function will evenly distribute values across its valid hash range. That means all numbers are equally likely to be the result of a given hash value regardless of the characteristics of your input value. Let’s go ahead and perform our hash function on these values and see where we end up:
At this point, the issue should start to come into focus. Three of the four values have been provided a hash value between the ranges of 0 and 4. This means that assuming all partition keys have identical velocities, the first shard will contain three times the traffic as the second shard.
The distribution of records between shards can be a crucial thing to understand for performance, but most importantly it helps to inform you about efficient scaling techniques. Join us in the next post to dive deeper still into ways in which we can scale our solution.