Wed, 05 Apr 2017
One question we often face at Trek10 as we design serverless AWS architectures: what is the most cost-effective and efficient AWS platform service for ingesting data into a new system? Of course you could always just spin up some EC2 servers and pump your data into them, but at Trek10 we push hard to design new systems as “serverless-ly” as possible, using AWS platform services such as Lambda, DynamoDB, and S3 to their fullest extent.
A variety of use-cases face this kind of challenge. IOT is one of the most obvious (getting data from “things” into the cloud), but there are many others: branch offices pushing data to a central system, a slow or “lazy load” migration from your data center, or even an always-on integration between legacy environments and a new AWS environment.
So back to the challenge. There are multiple AWS services that are tailor-made for data ingestion, and it turns out that each of them can be the most cost-effective and best-suited choice in the right situation. We’ll try to break down the story for you here.
(Two brief caveats: This is not intended to be comprehensive; there are a huge number of possibilities, we just think these are the top few. Also, all pricing is for us-east-1.)
A Breakdown of Options
Kinesis Streams
A real-time streaming data queuing service. Kinesis Streams producer apps push data in, and consumer apps pull the data to process it. AWS Lambda functions can be a consumer, so there is no need to run a server to process and store the data out of Kinesis Streams.
- Pros: Meant for very high volume. Very flexible.
- Cons: Building producer and consumer apps is not trivial; additional costs for running your consumer app (or Lambdas); 1 MB max per PUT
- $10.80/mo for one “shard” (1 MB/s ingress, 2 MB/s egress), $0.014/1M PUT payload units
- PUT payload unit: every 25KB, rounded up
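To make the Streams pricing model concrete, here is a rough Python sketch of the arithmetic. It uses the two prices listed above and makes a few assumptions that are not in this post: a 720-hour month, 1 MB treated as 1000 KB, and a per-shard cap of 1000 records per second (taken from the Kinesis service limits). The function name is only illustrative.

```python
import math

SHARD_MONTHLY_USD = 10.80          # per shard, from the price list above
PUT_UNIT_USD_PER_MILLION = 0.014   # per 1M PUT payload units, from above
PUT_UNIT_KB = 25                   # one payload unit per 25 KB, rounded up
SHARD_INGRESS_KB_PER_SEC = 1000    # 1 MB/s ingress, assuming 1 MB = 1000 KB
SHARD_RECORDS_PER_SEC = 1000       # assumption: Kinesis per-shard record limit
HOURS_PER_MONTH = 720              # assumption: 30-day month


def kinesis_streams_monthly_cost(puts_per_hour, payload_kb):
    """Estimate monthly Streams ingestion cost (shards + PUT payload units)."""
    puts_per_sec = puts_per_hour / 3600
    # Enough shards to satisfy both the bandwidth and the record-rate limit.
    shards = max(
        1,
        math.ceil(puts_per_sec * payload_kb / SHARD_INGRESS_KB_PER_SEC),
        math.ceil(puts_per_sec / SHARD_RECORDS_PER_SEC),
    )
    units_per_put = math.ceil(payload_kb / PUT_UNIT_KB)
    units_per_month = puts_per_hour * HOURS_PER_MONTH * units_per_put
    put_cost = units_per_month / 1_000_000 * PUT_UNIT_USD_PER_MILLION
    return shards * SHARD_MONTHLY_USD + put_cost
```

Under these assumptions, `kinesis_streams_monthly_cost(10_000_000, 5)` comes out at about $252 and `kinesis_streams_monthly_cost(10_000_000, 50)` at about $1703. Consumer costs (your app or Lambdas) are not included.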
Kinesis Firehose
Firehose simplifies the consumer side of Streams: your data is automatically pushed into S3, Redshift, or Elasticsearch by the Firehose service.
- Pros: Very simple to push into S3, Redshift, or AWS Elasticsearch
- Cons: 1 MB max per object. Additional services needed to do anything more complex or disaggregate the data pushed to S3.
- $0.035 per GB ingested PLUS S3 charges (but with buffering & compression this is usually very small as a % of total.)
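As a rough illustration of the Firehose side, the sketch below estimates the monthly ingestion charge from a PUT rate and payload size. The per-GB price is the one listed above. The rounding of each record up to the nearest 5 KB is my understanding of how Firehose bills and is not stated in this post, so treat it as an assumption, along with the 720-hour month and decimal units (1 GB = 1,000,000 KB). S3 charges are left out.

```python
import math

FIREHOSE_USD_PER_GB = 0.035   # ingestion price from the list above
BILLING_INCREMENT_KB = 5      # assumption: each record is billed in 5 KB steps
HOURS_PER_MONTH = 720         # assumption: 30-day month
KB_PER_GB = 1_000_000         # assumption: decimal units


def firehose_monthly_cost(puts_per_hour, payload_kb):
    """Estimate monthly Firehose ingestion cost, excluding S3 charges."""
    # Round each record up to the next billing increment.
    billed_kb = math.ceil(payload_kb / BILLING_INCREMENT_KB) * BILLING_INCREMENT_KB
    gb_per_month = puts_per_hour * HOURS_PER_MONTH * billed_kb / KB_PER_GB
    return gb_per_month * FIREHOSE_USD_PER_GB
```

If the rounding assumption holds, very small records cost as much as 5 KB ones, so batching tiny messages before sending them would lower the bill.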
S3
The granddaddy of AWS services: object storage at scale. Because there is read-after-write consistency, you can use S3 as an “in transit” part of your ingestion pipeline, not just a final resting place for your data. We described an architecture like this in a previous post.
- Pros: 5TB limit for an object; very very simple
- Cons: Additional services needed to do any processing
- Everyone knows about S3 storage costs. We’ll ignore that here, as many of these services may end up archiving the data in S3 anyway. The question here is, how do direct S3 PUTs compare to the other options?
- $0.005 per 1000 PUTs
- One PUT can be up to 5 GB (multi-part uploads allow you to push a total object of up to 5 TB)
AWS IOT Platform
A newer AWS service for enabling IOT applications. At its core, it is an MQTT broker and rules engine which you can use to publish, process, and store data.
- Pros: Great for “constrained” (low-power, low-compute) edge devices with small data. MQTT and the AWS IOT SDK are purpose-built for this use-case.
- Cons: Need to introduce additional services for processing and storage. Messages are billed in increments of 512 bytes of data.
- $5/million message increments
- Total message size can be up to 128KB
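The 512-byte increment is what drives AWS IOT costs, so here is a small sketch of the arithmetic using the price and limits listed above. It assumes a 720-hour month and 1 KB = 1024 bytes; the function name is only illustrative.

```python
import math

INCREMENT_BYTES = 512                 # billing increment from the list above
USD_PER_MILLION_INCREMENTS = 5.00     # price from the list above
MAX_MESSAGE_BYTES = 128 * 1024        # 128 KB limit, assuming 1 KB = 1024 bytes
HOURS_PER_MONTH = 720                 # assumption: 30-day month


def iot_monthly_cost(messages_per_hour, message_bytes):
    """Estimate monthly AWS IOT messaging cost from rate and message size."""
    if message_bytes > MAX_MESSAGE_BYTES:
        raise ValueError("message exceeds the 128 KB limit")
    # Every started 512-byte block counts as one billable increment.
    increments_per_message = max(1, math.ceil(message_bytes / INCREMENT_BYTES))
    increments = messages_per_hour * HOURS_PER_MONTH * increments_per_message
    return increments / 1_000_000 * USD_PER_MILLION_INCREMENTS
```

Note how quickly size matters: a 50 KB message is 100 increments, so 1000 of them per hour cost the same as 100k messages per hour at 512 bytes each.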
Rules of Thumb
We’ve run the numbers on all these options to compare costs at various ingestion profiles, in terms of frequency and size of data. Below are a few general conclusions to help you make sense of all of this:
- If your data producers are power/compute constrained, you’ll probably need to use AWS IOT. If your ingestion costs are too high, consider AWS Greengrass to buffer/process on the edge.
- Under about 100k PUTs/hr @ 50 KB per PUT, Streams, Firehose and S3 are all in the tens to low hundreds of dollars per month, so cost does not need to be a key design factor. Pick the service that best fits your architecture.
- For AWS IOT Service, that “don’t consider cost” cutoff is more like 100k PUTs/hr @ 512 bytes or 1000/hr @ 50KB.
- At high PUT volume with payloads in the low hundreds of KB or less, Kinesis Streams is the clear winner: 10M PUTs/hr @ 5KB is only $255/mo; @ 50KB payload, it is only $1700/mo!
- Firehose is very competitive at most volumes but diverges from Streams in the tens of TBs/mo. Unless you have tens of TBs/mo, select Firehose if the simplicity and processing model suits your need.
- At high PUT volume and low data size (tens of KB), S3 is very uncompetitive. But as you approach the 1 MB limit of Kinesis, S3 costs look similar. 1M PUTs/hr @ 1MB costs $3400/mo with Streams and $3600/mo with S3. So as you approach larger objects, consider S3 if the simplicity suits your needs. And above 1MB, S3 is your only option.
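The Streams versus S3 comparison in the last point can be sketched in a few lines of Python. This is only an approximation of the numbers we ran: it uses the prices listed earlier, counts S3 request charges only, and assumes a 720-hour month, 1 MB = 1000 KB, and a 1000 records per second per-shard limit. The function names are illustrative.

```python
import math

HOURS_PER_MONTH = 720  # assumption: 30-day month


def streams_monthly_cost(puts_per_hour, payload_kb):
    """Shard cost plus PUT payload unit cost for Kinesis Streams."""
    puts_per_sec = puts_per_hour / 3600
    shards = max(
        1,
        math.ceil(puts_per_sec * payload_kb / 1000),  # 1 MB/s ingress per shard
        math.ceil(puts_per_sec / 1000),               # assumed record-rate limit
    )
    units = puts_per_hour * HOURS_PER_MONTH * math.ceil(payload_kb / 25)
    return shards * 10.80 + units / 1_000_000 * 0.014


def s3_put_monthly_cost(puts_per_hour):
    """S3 request cost only; it does not depend on object size."""
    return puts_per_hour * HOURS_PER_MONTH / 1000 * 0.005


# Compare the two at 1M PUTs/hr for a range of payload sizes.
for payload_kb in (10, 100, 500, 1000):
    streams = streams_monthly_cost(1_000_000, payload_kb)
    s3 = s3_put_monthly_cost(1_000_000)
    print(f"{payload_kb:>5} KB  Streams ${streams:>8.2f}  S3 ${s3:>8.2f}")
```

The S3 column stays flat because S3 charges per request, while the Streams column grows with payload size. At 1 MB the two land at roughly $3400 and $3600, which is where the costs start to look similar.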
I hope this is useful. Hit us up on Twitter @trek10inc if you have any questions or ideas of your own about AWS data ingestion options!
And while you’re here, check out AWS Lambda Pricing in Context - A Comparison to EC2.