AWS Lambda Monitoring 101

Tips to help you ensure your team's Lambda functions are monitored

James Bowyer | Aug 26 2019

As the AWS ecosystem matures and adds more and more services that work hand in hand, the number of production serverless architectures will skyrocket. As serverless architectures shift from niche dev tasks to production systems, the burden of adding production-level monitoring will increase. The core of serverless on AWS is Lambda, so this post covers the basics of what needs to be monitored for AWS Lambda.

Specifically, we will cover:

Default metrics AWS provides
Usage patterns for AWS Lambda and the corresponding monitoring patterns
Does Lambda require more dev support than traditional architectures?

CloudWatch Metrics

Here is a list of the default metrics AWS provides. They will be the building blocks for your basic monitors. Any monitoring tool you pick should be able to ingest Lambda’s CloudWatch metrics. In a future blog post, we will cover custom metrics, additional logging, and traceability.

Metric Name: Description

Invocations: The number of invocations in the given timeframe, counting both successful invokes and errors.

Errors: The number of invocations that resulted in a function error, such as an unhandled exception or a timeout.

DeadLetterErrors: The number of times a Lambda function couldn’t write to its dead letter queue (seriously screwed territory).

Duration: How long the function took to run, in milliseconds.

Throttles: The number of requests that failed due to a concurrency limit (these do not count as errors; for async invokes only).

IteratorAge: For stream-based invocations, the age of the last record in the batch the function most recently received, i.e. how long that record had been on its stream.

ConcurrentExecutions: The number of invocations running at the same time (for functions that have concurrency limits set).

UnreservedConcurrentExecutions: The number of invocations running at the same time (for functions that do not have concurrency limits set).

Monitors:

A monitor is an alerting engine or tool that notifies your team when a threshold is crossed or an event is processed. At Trek10, we use Datadog to ingest all kinds of metrics, including the CloudWatch metrics mentioned above, as well as events. We then use Datadog’s built-in monitoring tool, which every minute compares the metrics against a query we have defined, to send alerts to our on-call team. Similar functionality can be had with CloudWatch alarms or other industry-leading monitoring software.

A general rule of thumb when operationalizing Lambdas is to make sure each Lambda function has its own suite of monitors, built and scoped for that function. This is the opposite of the philosophy for traditional infrastructure and its metrics, such as CPU: there, the monitor that watches CPU for one host is the same monitor that watches CPU for all hosts, whereas here we want each Lambda function to have monitors set up for that function alone. This all raises the question: what type of metrics should you monitor? The answer depends on the type of Lambda function.

Usage Patterns & Monitoring:

First Type: scheduled Lambdas / cron job Lambdas

Description: Lambda functions that are run on a set schedule doing a set job. An example of a commonly used “cron job” Lambda is to have a nightly Lambda run that copies EBS snapshot backups from one region to another for DR. A custom example of cron Lambdas we use at Trek10 is a Lambda we nicknamed keep-it-clean that runs every hour alerting us if a user has forgotten to document their work.

Monitoring Notes: For these cron job Lambdas, we want to monitor for errors on the function and to confirm that an invocation actually happened. Since a cron Lambda usually has one specific job, every time that job isn’t done correctly is cause for an alert. Two monitors are enough to ensure job success. The first should watch for errors and alert you if the function errors (note that asynchronous invocations are retried twice by default, so you can watch for >= 3 errors). The second should watch invocations and alert if no invocation happens within a specific time frame.

Useful Metrics:

Invocations: In our example for the keep-it-clean function that runs once every hour, we have a monitor that will alert us if the service doesn’t run after 90 minutes. The runbook would include steps for confirming the CloudWatch trigger is still active, troubleshooting why the function didn’t run, and some set steps for manually checking on what the Lambda was checking, in our case making sure the work done is documented.

Errors: In our example, the keep-it-clean function does not retry on failure, so we alert on the first error occurrence. The runbook would include steps for debugging failures due to Lambda configuration and some set steps for manually checking on what the Lambda was checking, in our case documented downtime.
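As a rough illustration of the two cron monitors, here is a minimal sketch that builds the settings for a pair of CloudWatch alarms rather than the Datadog monitors we actually use. The helper name `cron_lambda_alarms`, the alarm names, and the 5-minute error window are assumptions made for this example; the dictionary keys follow the parameters of CloudWatch's `put_metric_alarm` call.

```python
def cron_lambda_alarms(function_name, max_gap_seconds=5400, error_threshold=1):
    """Build settings for two alarms on a scheduled Lambda (hypothetical helper).

    max_gap_seconds=5400 mirrors the 90-minute window used for keep-it-clean.
    Use error_threshold=3 for async functions that keep the default two retries.
    """
    common = {
        "Namespace": "AWS/Lambda",
        # Scope both alarms to this one function, not to all functions.
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "EvaluationPeriods": 1,
    }
    missed_run = {
        **common,
        "AlarmName": f"{function_name}-missed-run",
        "MetricName": "Invocations",
        "Period": max_gap_seconds,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        # A function that never ran reports no datapoint at all, so missing
        # data has to be treated as a breach for this alarm to fire.
        "TreatMissingData": "breaching",
    }
    errors = {
        **common,
        "AlarmName": f"{function_name}-errors",
        "MetricName": "Errors",
        "Period": 300,  # assumed 5-minute window
        "Threshold": error_threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }
    return [missed_run, errors]


for alarm in cron_lambda_alarms("keep-it-clean"):
    # With boto3 available, each dict could be passed along as
    # boto3.client("cloudwatch").put_metric_alarm(**alarm)
    print(alarm["AlarmName"], alarm["MetricName"], alarm["Threshold"])
```

Keeping the settings as plain dictionaries makes it easy to review them, or to translate them into whichever monitoring tool you use, before anything is created.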

Second Type: External API Driven Lambdas

Description: A Lambda function that serves as a piece of (or the entire) application and is invoked by an external API, often via Amazon API Gateway. An example of this usage pattern would be a Lambda that validates boat ownership records whenever a particular API is invoked.

Monitoring notes: Modern applications usually have some sort of error tolerance, especially if the API is publicly scrapeable and will be hit with bogus input. Instead of monitoring for any failure as we did for cron Lambdas, we will usually monitor for an error rate combined with a minimum flat number of errors. This way we only get notified if the application is actually experiencing stability problems. For async functions, we will want to monitor for throttling, as that can be an indicator that the Lambda is causing a bottleneck. Finally, it’s critical to make sure your dead letter queue is working. The key CloudWatch metrics for this type of function include errors, throttles, and dead letter errors.

Useful metrics:

Errors: Say we only wanted to get alerted if there were more than 10 errors and only if more than 5% of all requests error out. To satisfy this, we use the equation ((# of errors – 10)/(# of invocations)) > 5%. The runbook for error rate would include confirming that the data sources and dependencies the Lambda relies on aren’t experiencing issues, confirming that the Lambda configuration is correct, and looking in the logs to see whether the failures share any common traits.
Throttles: The runbook for throttling would include looking at raising the function’s configured concurrency, including creating an AWS support case if the account limits need to be raised.
DeadLetterErrors: This metric’s name can be confusing, as what it really tells us is that the Lambda couldn’t write to its dead letter queue. We will want to alert on the sum of the errors. The runbook for a failure to write to a dead letter queue is to look into IAM permission changes or naming changes, as those are often the cause. For production workloads, where each event is critical, having a working DLQ is a must.
Dead Letter Queue Length: While this is not a metric in Lambda itself, if our function writes to a dead letter queue (and API driven Lambda functions should, whether the queue is SNS, SQS, or even just dumping into DynamoDB), it is crucial to measure when items get placed in the queue.
Bonus metric - Lambda invocations: Measuring invocations makes the most sense for scheduled Lambdas, where it confirms that the expected invocations happen, but it can still be a useful metric here. Specifically, by using some form of machine learning-based anomaly detection on the number of invocations, your team can get alerts for several reasons. One such case Trek10 has seen is when a client of our system was down and thus stopped making requests; this alert let us reach out to the client and tell them they were no longer making requests.
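To make the error-rate equation above concrete, here is a small sketch that evaluates it for a single time window. The function name `error_rate_alert` and its parameter names are made up for this example, and the zero-traffic handling is an assumption; in practice the same expression would live in your monitoring tool's query.

```python
def error_rate_alert(errors, invocations, error_floor=10, rate_threshold=0.05):
    """Evaluate ((errors - error_floor) / invocations) > rate_threshold."""
    if invocations <= 0:
        # Assumption: with no traffic in the window there is no rate to alert on.
        return False
    return (errors - error_floor) / invocations > rate_threshold


# A handful of bogus requests against a quiet API stays silent...
print(error_rate_alert(errors=8, invocations=20))     # False
# ...and so does a modest number of errors past the floor...
print(error_rate_alert(errors=12, invocations=100))   # False
# ...while a sustained failure rate alerts.
print(error_rate_alert(errors=20, invocations=100))   # True
```

Note that subtracting the floor before dividing makes the combined query somewhat stricter than checking the two conditions separately: 12 errors in 100 invocations is more than 10 errors and a 12% raw error rate, yet the expression evaluates to 2%, so no alert fires.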

Third Type: AWS Resource Driven

Description: Instead of being invoked by an API call as in the previous usage type, these functions run by pulling from AWS resources, such as Kinesis streams. On top of monitoring for error rate, DLQ errors, and throttles, we will also want to monitor the iterator age.

Monitoring notes: These Lambda functions pull information off of AWS resources, so if the iterator age is increasing, the Lambda function is falling behind on processing. Raising the concurrency limits in Lambda or increasing your function’s memory may help the Lambda catch up.

Useful metrics:

Previously described (under the second type, external API driven Lambdas): error rate, DLQ errors and puts, and throttles.
IteratorAge: As the iterator age increases, you know the Lambda function is falling behind on processing requests. If requests come in batches, this may be okay. But we recommend establishing a baseline for any function that reports an iterator age, so that you at least know which values are within normal bounds.
ConcurrentExecutions: It is essential to monitor the number of concurrent executions relative to the number of allowed concurrent executions. If the function is fanning out and hitting the max count of concurrent executions, it may be worth raising the limit, although it may require some code changes to handle the additional parallel executions. Make a note of the account-level limits that may need to be raised.
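One way to turn the baseline advice into numbers is sketched below. The approach (a nearest-rank percentile of historical IteratorAge values multiplied by a headroom factor) and every name in it are assumptions chosen for illustration, not a prescribed method; the second helper simply expresses concurrent executions as a fraction of the allowed count.

```python
def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile (pct is an integer 0-100) by nearest rank."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(pct / 100 * n), clamped so the rank is at least 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def iterator_age_threshold(history_ms, pct=95, headroom=1.5):
    """Alert threshold: a high percentile of past IteratorAge plus headroom."""
    return nearest_rank_percentile(history_ms, pct) * headroom


def concurrency_utilization(concurrent_executions, allowed):
    """Fraction of the allowed concurrent executions currently in use."""
    if allowed <= 0:
        raise ValueError("allowed must be positive")
    return concurrent_executions / allowed


# Hypothetical history: iterator ages from 1 to 100 seconds, in milliseconds.
history = list(range(1000, 101000, 1000))
print(iterator_age_threshold(history))    # 142500.0
print(concurrency_utilization(80, 100))   # 0.8
```

A function whose requests arrive in batches will show a spiky history, so the percentile and headroom are worth tuning per function rather than sharing one value across all of them.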

Lambda Support Breakdown:

There is a sentiment out there that to monitor Lambda functions, you must have a more experienced ops team than was necessary in the past. I have heard numerous times from developers that their management teams will not allow them to build serverless production applications because it is too expensive to maintain 24/7 operations.

Let’s talk about how things have very much stayed the same.

Setup Work

The setup for monitoring serverless, specifically Lambda, is strikingly similar to that for traditional infrastructure. In some respects, serverless architectures are even easier to monitor. For example, you do not need constant monitoring on traditional metrics such as CPU, RAM, and disk space; those are either handled entirely by AWS or configured by you. In other ways, Lambda functions seem harder: creating specific first-response runbooks for when a Lambda function fails requires a monumental amount of effort from someone with a complete understanding of the environment.

Runbooks for traditional alerts, such as CPU, are quite easy to come up with because those alerts occur across all systems and have generic solutions. Because those runbooks are so easy to create, and are not necessary for most serverless environments, people overlook that the comprehensive error runbooks serverless environments require are also required by traditional architectures. That oversight produces the knee-jerk reaction that setup is harder for serverless ops. From experience, serverless moves your operations teams toward supporting and handling more specific tasks, as opposed to the occasional CPU or RAM issue.

On-Call Burdens

Just as there is a misconception that serverless runbooks are more complex, there is a misconception that serverless architectures require a developer on call 24/7. Since AWS manages some aspects of the infrastructure, there is less noise, and thus a higher percentage of alerts require escalation to a developer who is on call. Those on-call alerts, whether caused by an unconsidered edge case that pops up or by bad data getting into production, can and will happen in traditional as well as serverless architectures. It is worth noting that since fewer developers are familiar with serverless right now, the pool of people who are ready to take on-call is typically smaller. With time and adoption, that pool will grow, and we at Trek10 are confident that once it hits critical mass, building on-call for serverless will be considered as natural as it is for today’s traditional systems.

Monitoring isn’t always the sexiest of work. And in serverless environments where everything feels fresh and exciting, it is still essential to make sure the less glamorous ops work gets implemented correctly. If you have any questions or suggestions about monitoring Lambda functions, please leave a comment or reach out and I will do my best to update the blog!
