As the AWS ecosystem matures and adds more and more services that work hand in hand, the number of production serverless architectures will skyrocket. As serverless architectures shift from niche dev tasks to production systems, the burden of adding production-level monitoring will increase. The core of serverless on AWS is Lambda, so this post covers the basics of what needs to be monitored for AWS Lambda.
Specifically, we will cover the default metrics AWS provides, how to build monitors for the different types of Lambda functions, and some common misconceptions about operating serverless.
Here is a list of the default metrics AWS provides. They will be the building blocks for your basic monitors. Any monitoring tool you pick should be able to ingest Lambda’s CloudWatch metrics. In a future blog post, we will cover custom metrics, additional logging, and traceability.
Metric Name: Description
Invocations: The number of times the function was invoked in the given timeframe, counting both successful invocations and errors.
Errors: The number of invocations that failed due to errors in the function, such as unhandled exceptions or timeouts.
DeadLetterErrors: The number of times a Lambda function couldn’t write to its dead letter queue (seriously screwed territory).
Duration: How long the function took to run, in milliseconds.
Throttles: The number of requests that failed due to a concurrency limit (do not count as errors, for async invokes only).
IteratorAge: For stream-based invocations, how long the last record in the most recently processed batch sat on its stream before the Lambda received it.
ConcurrentExecutions: The number of invocations running at the same time (for functions that have concurrency limits set).
UnreservedConcurrentExecutions: The number of invocations running at the same time (for functions that do not have concurrency limits set).
A monitor is an alerting engine/tool that will notify your team when a threshold is crossed or an event is processed. At Trek10, we use Datadog to ingest all kinds of metrics, including the CloudWatch metrics mentioned above, and events. We then use Datadog’s built-in monitoring tool, which every minute compares the metrics to a query we have defined, to send alerts to our on-call team. Similar functionality can be gained using CloudWatch Alarms or another industry-leading monitoring tool.
A general rule of thumb when operationalizing Lambdas is to make sure each Lambda function has its own suite of monitors that were built and scoped for that Lambda function. This is the opposite of the philosophy for traditional infrastructure and its associated metrics, such as CPU. There, the monitor that watches CPU for one host is the same monitor that watches CPU for all hosts, whereas we now want each Lambda to have monitors set up for that function itself. This all raises the question: what type of metrics should you monitor? The answer really depends on the type of Lambda function.
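As a rough illustration of scoping monitors per function, the sketch below builds one small suite of CloudWatch alarm definitions for each Lambda function. The keys follow the parameters of the CloudWatch PutMetricAlarm API (as exposed by boto3's `put_metric_alarm`); the `alarms_for_function` helper, the function names, and the thresholds are hypothetical, and this is not the Datadog setup described above.

```python
def alarms_for_function(function_name, error_threshold=1, period_seconds=60):
    """Build CloudWatch alarm definitions scoped to a single Lambda function.

    Each dict is shaped like the keyword arguments of boto3's
    cloudwatch.put_metric_alarm(); thresholds are illustrative assumptions.
    """
    common = {
        "Namespace": "AWS/Lambda",
        # The FunctionName dimension is what scopes the alarm to one function.
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # An idle function publishes no datapoints, so missing data is
        # treated as healthy for these count-based alarms.
        "TreatMissingData": "notBreaching",
    }
    return [
        {**common, "AlarmName": f"{function_name}-errors",
         "MetricName": "Errors", "Threshold": error_threshold},
        {**common, "AlarmName": f"{function_name}-throttles",
         "MetricName": "Throttles", "Threshold": 1},
    ]


# Hypothetical function names; each one gets its own suite of alarms.
suites = {name: alarms_for_function(name)
          for name in ["keep-it-clean", "orders-api"]}
```

Each definition could then be handed to a CloudWatch client, for example `boto3.client("cloudwatch").put_metric_alarm(**alarm)`, assuming credentials and a notification action are configured separately.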
First Type: Scheduled Lambdas / Cron Job Lambdas
Description: Lambda functions that are run on a set schedule doing a set job. An example of a commonly used “cron job” Lambda is a nightly Lambda that copies EBS snapshot backups from one region to another for DR. A custom example of cron Lambdas we use at Trek10 is a Lambda we nicknamed keep-it-clean that runs every hour, alerting us if a user has forgotten to document their work.
Monitoring Notes: For these cron job Lambdas, we want to monitor for errors on the function and also confirm that an invocation actually happened. Since a cron Lambda usually has one specific job, every time that job isn’t done correctly there is cause for an alert. Using two monitors, we can ensure job success. The first monitor should watch for errors and alert you if the function errors (note that asynchronous invocations are retried twice by default, so you can watch for >= 3 errors). The second monitor should watch invocations and alert if no invocation happens in a specific time frame.
For our keep-it-clean function that runs once every hour, we have a monitor that will alert us if the function hasn’t run in 90 minutes. The runbook would include steps for confirming the CloudWatch trigger is still active, troubleshooting why the function didn’t run, and some set steps for manually checking on what the Lambda was checking, in our case making sure the work done is documented.
keep-it-clean does not retry on failure, so we alert on the first error occurrence. The runbook would include steps for debugging failures due to Lambda configurations and some set steps for manually checking on what the Lambda was checking, in our case documented downtime.
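The two cron monitors can be sketched as a small piece of decision logic. This is only an illustration of the idea, not the Datadog queries we run: the `cron_alerts` helper and its argument names are hypothetical, and the 90-minute window and first-error threshold mirror the keep-it-clean example.

```python
from datetime import datetime, timedelta, timezone


def cron_alerts(last_invocation, error_count, now,
                max_silence=timedelta(minutes=90), error_threshold=1):
    """Return the alerts that should fire for a scheduled function.

    Use error_threshold=3 for asynchronous functions that keep the
    default two retries, so one failed job (3 attempts) triggers once.
    """
    alerts = []
    if error_count >= error_threshold:
        alerts.append("errors")
    # None means no invocation has been recorded at all.
    if last_invocation is None or now - last_invocation > max_silence:
        alerts.append("missed-invocation")
    return alerts


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
healthy = cron_alerts(now - timedelta(minutes=45), 0, now)
stalled = cron_alerts(now - timedelta(minutes=95), 1, now)
print(healthy)  # []
print(stalled)  # ['errors', 'missed-invocation']
```

If you express the missed-invocation check as a CloudWatch Alarm instead, keep in mind that a function that never runs publishes no Invocations datapoints, so the alarm would need to treat missing data as breaching.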
Second Type: External API Driven Lambdas
Description: A Lambda function that serves as a piece of (or the entire) application and is invoked by an external API, often via Amazon API Gateway. An example of this usage pattern would be a Lambda that validates boat ownership records whenever a particular API is invoked.
Monitoring notes: Modern applications usually have some sort of error tolerance, especially if the API is publicly scrapeable and will be hit with bogus input. Instead of monitoring for any failure as we did with cron, we will usually monitor for an error rate that also requires a flat minimum number of errors. This way we only get notified if the application is actually experiencing stability problems. For async functions, we will want to monitor for throttling, as that can be an indicator that the Lambda is causing a bottleneck. Finally, it’s critical to make sure your dead letter queue is working. The key CloudWatch metrics for this type of function include errors, throttles, and dead letter errors.
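A minimal sketch of an error-rate monitor that also requires a flat number of errors might look like the following. The `should_alert` helper and its thresholds (5% and 10 errors) are assumptions chosen for illustration; tune them to your own traffic.

```python
def should_alert(errors, invocations, max_error_rate=0.05, min_errors=10):
    """Alert only when the error rate AND the absolute error count are high.

    The flat minimum keeps a couple of bad requests during a quiet period
    from paging anyone, even though the rate looks alarming.
    """
    if invocations <= 0:
        return False
    return errors >= min_errors and errors / invocations > max_error_rate


quiet_period = should_alert(errors=2, invocations=4)       # 50% rate, only 2 errors
busy_healthy = should_alert(errors=50, invocations=10000)  # 0.5% rate
unstable = should_alert(errors=60, invocations=1000)       # 6% rate, 60 errors
print(quiet_period, busy_healthy, unstable)  # False False True
```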
Third Type: AWS Resource Driven
Description: Instead of being invoked by an API call as in the previous usage type, these functions run by pulling records off of AWS resources, such as Kinesis streams. On top of monitoring for error rate, DLQ errors, and throttles, we will also want to monitor the iterator age.
Monitoring notes: Since these Lambda functions pull information off of AWS resources, an increasing iterator age means the Lambda function is falling behind on processing. Raising the concurrency limits in Lambda or increasing your function’s memory may help the Lambda catch up.
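One simple way to detect an increasing iterator age is to check whether the most recent samples are strictly rising. The sketch below assumes you already have a list of IteratorAge values in milliseconds, oldest first; the `is_falling_behind` helper and the three-sample window are hypothetical choices, and a real monitor may prefer a threshold or a smoothed trend.

```python
def is_falling_behind(iterator_ages_ms, min_points=3):
    """True if the most recent min_points samples are strictly increasing."""
    recent = iterator_ages_ms[-min_points:]
    if len(recent) < min_points:
        return False  # not enough data to call it a trend
    return all(a < b for a, b in zip(recent, recent[1:]))


growing = is_falling_behind([1000, 900, 1200, 2500, 6000])
recovering = is_falling_behind([5000, 3000, 1000])
print(growing, recovering)  # True False
```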
There is a sentiment out there that to monitor Lambda functions, you must have a more experienced ops team than was necessary in the past. I have heard numerous times from developers that their management teams will not allow them to build serverless production applications because it is too expensive to maintain 24/7 operations.
Let’s talk about how things have very much stayed the same.
The setup for monitoring serverless, specifically Lambda, is strikingly similar to the setup for traditional infrastructure. In fact, in some respects serverless architectures are easier to monitor. For example, you do not need constant monitoring on traditional metrics such as CPU, RAM, and disk space; those are either handled entirely by AWS or configured by you. In other ways, Lambda functions seem harder: creating specific first-response runbooks for when Lambda functions fail requires a monumental amount of effort from someone with a complete understanding of the environment. The runbooks for traditional alerts, such as CPU, are quite easy to come up with, as those types of alerts come up across all systems and have generic solutions. Because those runbooks are so easy to create, and are not necessary for most serverless environments, it is often overlooked that the comprehensive error runbooks required by serverless environments are also required by traditional architectures. Because of this oversight, people have a knee-jerk reaction that the setup is harder for serverless ops. In our experience, serverless simply moves your operations team toward supporting and handling more specific tasks, as opposed to the occasional CPU or RAM issue.
Just as there is a misconception about serverless runbooks being more complex, there is a misconception that serverless architectures require a developer on call 24/7. Since AWS manages some aspects of the infrastructure, there is less noise, and thus a higher percentage of alerts require escalation to a developer who is on call. Those on-call alerts, whether caused by an unconsidered edge case or by bad data getting into production, can and will happen in traditional as well as serverless architectures. It is worth noting that since fewer developers are familiar with serverless right now, the pool of people who are ready to take on-call is typically smaller. With time and adoption, that pool will grow, and we at Trek10 are confident that once it hits critical mass, building on-call for serverless will be considered as natural as it is for today’s traditional systems.
Monitoring isn’t always the sexiest of work. And in serverless environments where everything feels fresh and exciting, it is still essential to make sure the less glamorous ops work gets implemented correctly. If you have any questions or suggestions about monitoring Lambda functions, please leave a comment or reach out and I will do my best to update the blog!