Andy Warzon in aws 7 minutes to read

Custom Metrics Deep Dive

This is the third in a series of posts about monitoring your production workloads in AWS. In the first post, we did a high level overview of cloud monitoring and broke it down into six types of metrics you should be monitoring, and in the second we dove deep into CloudWatch. Today, we’ll do another deep dive, this time into custom metrics.

While custom metrics are an afterthought for many when initially operationalizing their systems, we view custom metrics as one of the things that separates the “good” operations from the “great” ones.

  • Doing it right upfront will make your system more reliable and performant and your operational analysis more efficient.
  • A few well-placed custom metrics may identify problems that are otherwise missed (or caught too late) by system metrics.
  • Custom metrics add clarity to system behaviors that are hard to tease out of system metrics.

Let’s take a look at the two key questions for implementing custom metrics: What metrics should you be generating, and how can you generate them? As with past posts, the “how” will cover AWS-centric approaches and bring in some options from Datadog, Trek10’s favorite tool for operational insights.

Metrics? What Metrics?

First things first… what custom metrics should you be creating? Here are four areas you might want to consider:

Business Key Performance Indicators (KPIs)

This gets to the heart of “what matters?” in your application and your business. Identify the activities on your platform that generate business value: maybe user signups, transactions completed, or data rows ingested. Keeping an eye on that high-level goal is a safety net: if there is some gremlin in your system that none of your system metrics are catching, the KPI will show the truth.

Tracking KPIs has the nice side-benefit of making it very easy to build a dashboard for your business leaders. Great fodder for an office monitor!

Points of concern or focus

Assuming you have thorough system metrics (CloudWatch and VM, if applicable) and APM metrics, you can ultimately see just about any underlying behavior in the system. However, it might not always be simple or obvious, and setting appropriate alerts may be impossible.

To simplify matters, add a custom metric to track performance in a specific area of focus. For example, it may be well known that the response time of an external service provider becomes problematic when your application calls it with a certain set of parameters. Add a custom metric in your code for just that scenario and you can immediately understand the behavior and alert on it.
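As a minimal sketch of that idea, the context manager below times one specific call and tags the measurement with the parameter that is suspected of causing trouble. The names emit, timed, call_provider, and the metric name provider.response_ms are all hypothetical, and emit just appends to an in-memory list; in a real system it would forward to CloudWatch or Datadog.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real emit() would forward to your metrics backend.
recorded = []


def emit(name, value, tags=None):
    recorded.append({"name": name, "value": value, "tags": tags or {}})


@contextmanager
def timed(name, **tags):
    """Record the wall-clock duration of the wrapped block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so slow failures are measured too.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        emit(name, elapsed_ms, tags)


def call_provider(report_type):
    # Stand-in for the slow external request; replace with the real client call.
    time.sleep(0.01)
    return {"report_type": report_type}
```

Wrapping the call as `with timed("provider.response_ms", report_type="full_export"): call_provider("full_export")` leaves one data point in recorded, which is the value you would chart and alert on.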

Background Jobs

This one is an easy win. Whenever a background job runs, have it submit a custom metric like JobSuccess=1 after it has completed. Almost every monitoring system has an option to alert on no data, so you simply create alerts for when the metric is zero or missing, and voila, you have a simple and highly reliable job monitoring system. Add in job duration and other metrics and you can easily build a complete background job dashboard.
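One way to wire this up, sketched here under assumptions, is a small decorator that reports duration and JobSuccess=1 only when the job returns normally. The decorator name monitored_job, the emit callable, and the example job nightly_export are illustrative, not part of any library.

```python
import functools
import time


def monitored_job(emit):
    """Wrap a job so it reports its duration and JobSuccess=1 once it completes.

    `emit(name, value)` is a hypothetical callable that forwards one data
    point to whatever metrics backend you use.
    """
    def decorator(job):
        @functools.wraps(job)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = job(*args, **kwargs)  # an exception here skips the metrics below
            emit(f"{job.__name__}.JobDuration", time.monotonic() - start)
            emit(f"{job.__name__}.JobSuccess", 1)
            return result
        return wrapper
    return decorator


sent = []  # stand-in sink for illustration


@monitored_job(lambda name, value: sent.append((name, value)))
def nightly_export():
    return "done"
```

Because a failed or hung job emits nothing, a "no data" alert on the JobSuccess metric covers crashes, timeouts, and jobs that never started.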

Events

Don’t focus just on metrics: Consider what discrete events in your system might be valuable for alerting or overlaying on charts for correlations. Add custom code in your app or your deployment automation to drop those events into your monitoring platform.

An obvious one is deployment events: Datadog and many other monitoring platforms offer direct integration with tools like Jenkins and GitHub. If that is not available to you, add a script in your deployment pipeline to post a custom event. Other meaningful events might be when an ETL job starts and finishes or when regular patching happens.

How To Create Custom Metrics and Events

It’s important that your custom metrics be as lightweight as possible: The last thing you want is large new code bases or new infrastructure to manage. That’s why we strongly recommend using either AWS CloudWatch or a SaaS product, our favorite being Datadog. At Trek10 we tend to put most things into Datadog; we find the code to be simpler and like having everything in one place. But if that’s not an option for you, CloudWatch will do the job just fine. We’ll also discuss a third interesting option: logging your custom metrics.

CloudWatch

You can use CloudWatch to log both metrics and events, though currently events cannot be overlaid on charts & dashboards. (Sidebar… CloudWatch Events is a very powerful service for many other reasons. This post has gotten us thinking about the possibilities with it.)

The approach is straightforward… just use your AWS SDK of choice and the API calls PutMetricData and PutEvents. In our favorite SDK, boto3, that would be put_metric_data (on the CloudWatch client) and put_events (on the CloudWatch Events client). Make sure, of course, that your IAM user/role has the cloudwatch:PutMetricData or events:PutEvents permission.
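Here is a rough sketch of what those two calls can look like in boto3. The namespace MyApp/Custom and the helper names are assumptions made for illustration. The payload builders are kept separate from the sending function so they can be exercised without AWS credentials.

```python
import json
from datetime import datetime, timezone

NAMESPACE = "MyApp/Custom"  # hypothetical namespace


def build_metric_datum(name, value, unit="Count", dimensions=None):
    """Build one entry for the MetricData list of PutMetricData."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
    }
    if dimensions:
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return datum


def build_event_entry(source, detail_type, detail):
    """Build one entry for the Entries list of PutEvents."""
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # Detail is passed as a JSON string
    }


def publish(datum, entry):
    """Send one metric and one event. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builders above work without it

    boto3.client("cloudwatch").put_metric_data(Namespace=NAMESPACE, MetricData=[datum])
    boto3.client("events").put_events(Entries=[entry])
```

A call such as `publish(build_metric_datum("SignupsCompleted", 1), build_event_entry("myapp.etl", "ETL Job Finished", {"job": "nightly"}))` would then send both, assuming the role running it has the two permissions mentioned above.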

When putting metric data, you can include up to 150 values per metric. Note that if you have high-volume metrics, you might want to look at publishing statistic sets instead of every individual data point.
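To make the two options concrete, the sketch below turns a list of raw samples into either a single statistic set or batches of at most 150 values. The helper names and the default unit are assumptions; the resulting dictionaries are meant to be passed as MetricData entries to put_metric_data.

```python
MAX_VALUES_PER_METRIC = 150  # limit on the Values list of a single metric entry


def to_statistic_set(name, samples, unit="Milliseconds"):
    """Collapse raw samples into one pre-aggregated data point."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {
        "MetricName": name,
        "Unit": unit,
        "StatisticValues": {
            "SampleCount": float(len(samples)),
            "Sum": float(sum(samples)),
            "Minimum": float(min(samples)),
            "Maximum": float(max(samples)),
        },
    }


def to_value_batches(name, samples, unit="Milliseconds"):
    """Split raw samples into metric entries of at most 150 values each."""
    return [
        {
            "MetricName": name,
            "Unit": unit,
            "Values": [float(v) for v in samples[i:i + MAX_VALUES_PER_METRIC]],
        }
        for i in range(0, len(samples), MAX_VALUES_PER_METRIC)
    ]
```

The statistic set is one small entry however many samples it summarizes, but it only carries count, sum, minimum, and maximum, so percentiles are generally not available from it. Sending the values in batches keeps the full distribution at the price of larger requests.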

Datadog

There are two main options for custom metrics in Datadog (not including the logging option we discuss below):

If your code is running on a VM with the Datadog agent installed, the simplest option will be to use dogstatsd, Datadog’s adaptation of the popular statsd. This relays metrics to the agent which then ships them off to Datadog alongside the typical system metrics. This is a very efficient and high-throughput approach, and it has the added benefit of not needing to configure additional security since the Datadog agent already has a key. If you are running in the Linux shell there is even a convenient bash one-liner to send metrics to the agent.
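For a sense of how little is involved, here is a dependency-free sketch that formats a dogstatsd datagram and sends it over UDP. It assumes the agent is listening on its default local port, 8125, and the metric names and tags are made up. In practice you would more likely use Datadog's client library than hand-roll this.

```python
import socket


def format_dogstatsd(name, value, metric_type="g", tags=None):
    """Render one datagram in the form name:value|type|#tag1,tag2."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(tags)
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to the local agent (assumed default port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

For example, `send_metric(format_dogstatsd("orders.completed", 1, "c", ["env:prod"]))` increments a counter. Because the transport is UDP, the call does not block on the agent and does not report an error if no agent is listening.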

If you’re running without a Datadog agent (e.g. in a Lambda function), you will need to use the Datadog API to POST a metric. There is a convenient Python library as well as support for other languages, or try going sans SDK with a raw HTTP POST.
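The sans-SDK route might look roughly like the sketch below, which uses only the standard library. It assumes the v1 series endpoint on the default Datadog site and an API key passed in the DD-API-KEY header; the helper names are illustrative, and you should adjust the URL for your own Datadog site.

```python
import json
import time
import urllib.request

# Assumed endpoint: Datadog's v1 series API on the default site.
SERIES_URL = "https://api.datadoghq.com/api/v1/series"


def build_series(metric, value, tags=None, timestamp=None):
    """Build the JSON body for a single gauge data point."""
    ts = int(timestamp if timestamp is not None else time.time())
    return {
        "series": [
            {"metric": metric, "points": [[ts, value]], "type": "gauge", "tags": tags or []}
        ]
    }


def build_request(payload, api_key):
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        SERIES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


def submit(payload, api_key):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_request(payload, api_key), timeout=10) as response:
        return response.status
```

In a Lambda function, keep in mind that this is a synchronous network call that adds to your invocation time, which is one reason the logging approach discussed below can be attractive.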

If you want to dive deeper, here is a great summary from Datadog.

Logging Your Custom Metrics

Structured logging is an interesting third option for pushing custom metrics. Just drop your metrics into your logs in some structured form like METRIC|{NAME}|{VALUE} and then configure your logging platform to extract those metrics. Datadog has a slick feature to automatically parse these metrics from AWS Lambda logs if they come in the form defined here. Perhaps the biggest upside of this is the simplicity of the code. No extra packages at all… one simple line of logging code for each metric. However, the configuration on the logging platform side may be more complex. The other downside to consider is cost… depending on your platform, this may incur additional costs.
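As an illustration of both sides of this approach, the sketch below writes a metric in the generic METRIC|{NAME}|{VALUE} form and shows the kind of parsing a logging platform would need to do to extract it. The function names and the regular expression are assumptions for this example; this is not Datadog's Lambda log format.

```python
import re

# Matches lines such as "METRIC|JobSuccess|1" or "METRIC|latency_ms|12.5".
METRIC_LINE = re.compile(r"^METRIC\|(?P<name>[^|]+)\|(?P<value>-?\d+(?:\.\d+)?)$")


def log_metric(name, value):
    """Application side: one line of logging per metric."""
    line = f"METRIC|{name}|{value}"
    print(line)
    return line


def parse_metric(line):
    """Platform side: return (name, value), or None for ordinary log lines."""
    match = METRIC_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("name"), float(match.group("value"))
```

The application only needs log_metric; parse_metric stands in for the extraction rule you would configure on the logging platform.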

Most logging platforms like Datadog Logs, Splunk, and Sumo Logic will support this, as will the open source option of Elasticsearch. However, if you’re trying the low-cost AWS-native option of CloudWatch Logs, you won’t be able to do this.

That’s a quick summary of a few options for you to consider. We’d love to hear your thoughts on this… what platform are you using for custom metrics and what kind of events are you sending metrics for? Start a conversation at @Trek10Inc.


This is the third in a series of posts about monitoring production workloads in AWS. Related posts include:
1. All The Metrics - A Cloud Monitoring Blueprint
2. CloudWatch Deep Dive
3. Current post…