Monitoring, Ops & DevOps

CloudWatch Metrics: The Unsung Hero of Monitoring Your AWS Environment

What happens when you cross Superman, Batman, and CloudWatch?

James Bowyer | Jun 07 2021

Intro

One of the worst feelings in ops is when an end user alerts the product team of an outage or service disruption. Have you ever had this happen to you because someone introduced a new AWS service and the devops team didn’t know to monitor it? Or, have you ever thought to yourself as you roll out the newest shiniest service released at re:Invent: “How am I going to monitor this?” The answer to both of those questions starts with CloudWatch metrics.

CloudWatch metrics are not only the unsung hero of AWS, but they are also criminally undervalued within the operations of IT workloads everywhere. In this post we are going to highlight what CloudWatch metrics are, the value they provide, how they fit in the devops lifecycle, and a useful tool we have built around them.

What CloudWatch Metrics Are

Let's start with the building blocks of what CloudWatch metrics are (if you already know, feel free to skip to the next paragraph). Basically, each AWS service has a list of metrics that AWS considers relevant to how that specific service is performing. AWS only makes available the metrics it thinks carry some value. For a classic service like EC2, AWS will let you monitor things like CPU. In contrast, for heavily managed services such as SNS, AWS will not let you monitor the CPU of the underlying resources managing the SNS topic; instead, it lets you monitor things like the number of messages not delivered. If you want a little hands-on experience using CloudWatch but you are still getting familiar with the cloud, we would recommend checking out this lab that our friends at A Cloud Guru put together: https://acloudguru.com/hands-on-labs/using-cloudwatch-for-resource-monitoring .
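
To make the EC2 versus SNS contrast concrete, here is a minimal Python sketch of how such metrics could be queried. It assumes a client that behaves like boto3's CloudWatch client (for example `boto3.client("cloudwatch")`) and its `get_metric_statistics` call; the helper names `build_query` and `fetch_datapoints`, the instance ID, and the topic name are hypothetical placeholders, not anything from this post.

```python
from datetime import datetime, timedelta, timezone


def build_query(namespace, metric_name, dimensions, statistic="Average",
                hours=1, period=300, now=None):
    """Build keyword arguments for CloudWatch's GetMetricStatistics call."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,  # seconds covered by each datapoint
        "Statistics": [statistic],
    }


def fetch_datapoints(client, query):
    """Run the query and return datapoints oldest first.

    `client` is assumed to behave like boto3.client("cloudwatch").
    """
    response = client.get_metric_statistics(**query)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# EC2: AWS exposes host-level metrics such as CPU (placeholder instance ID).
ec2_cpu = build_query("AWS/EC2", "CPUUtilization",
                      {"InstanceId": "i-0123456789abcdef0"})

# SNS: there is no CPU to watch, so you look at failed deliveries instead
# (placeholder topic name).
sns_failures = build_query("AWS/SNS", "NumberOfNotificationsFailed",
                           {"TopicName": "orders"}, statistic="Sum")
```

Keeping the query construction separate from the network call means the query shape can be checked without AWS credentials; passing either dictionary to `fetch_datapoints` with a real client would return the datapoints for the last hour.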

Here is where the magic kicks in: the more managed services you use, the less work you will have to do setting up the application; thus, the less work you have to do to set up traditional monitoring for them. Instead of adopting an ever-expanding list of third party services or creating countless homegrown monitoring solutions with AWS, you are guaranteed CloudWatch metrics. Below is a chart with examples demonstrating how you would monitor something homegrown compared to monitoring its AWS managed service counterpart:

Functionality	AWS Managed Service	Something you can monitor in AWS	How to monitor a homegrown solution
Sending emails	AWS Simple Email Service (SES)	All available metrics are listed here, including bounce and complaint rates	To monitor bounce rate you would either need to use a third party mailing service or manually track email bounces in your application
Storing objects	AWS Simple Storage Service (S3)	All available metrics are listed here, including size of bucket and # of items in the bucket	To monitor something such as # of items in the bucket you would need to either keep a small database up to date, continuously pull the number of stored items, or use a third party service.
Routing Requests	AWS Application Load Balancer (ALB)	All available metrics are listed here, including 5xx errors and 4xx errors	On a home grown Load Balancer you would need to use custom plugins (such as the Datadog integration for HAProxy) and a custom solution to store the metrics, or to process the logs to create your own metrics.

When monitoring application-specific functionality, we recommend using Datadog’s API, Agent, and logs to gain full visibility, but that’s a story for a different blog post.

Value They Provide

I know I got ahead of myself and started showing you how awesome CloudWatch Metrics are and the value they provide in the table above. It can’t be overstated that for most AWS services you get a truly solid fundamental indicator of service performance. To see a list of all the AWS services that publish CloudWatch metrics and links to those metrics, check out this AWS documentation.

The best way to see this value is to pick an AWS service and walk through its associated CloudWatch metrics. Say, for example, you wanted to use AWS Web Application Firewall (WAF). Since it’s a managed service, AWS handles executing the WAF rules we have defined, but we may want reporting on what the WAF is doing. Namely, if our application isn’t letting traffic in, then we may want to share info with business stakeholders about how many and what percentage of requests aren’t making it through. Or, we may want to let our internal security team know which rules are blocking the most traffic and give them insight into which types of attacks are currently popular. Once the rules have been defined in WAF, we can head over to the documentation about WAF CloudWatch Metrics. There we can see that AWS provides metrics for the number of requests that did not match any rule, the number of requests blocked by a specific rule, the number of requests counted (i.e. flagged as suspicious) by a certain rule, and the number of requests allowed. Each of those metrics can be broken down by “CloudWatch dimensions”, i.e. the region, rule group, rule, and web ACL the metric was recorded against. Said another way, we can look at blocked requests per region and see which rule and web ACL each request counted towards. Looking at the screenshot below, our security team can conclude that the reputation list is what is blocking a majority of the malicious requests to the application and thus they should invest more time keeping that IP reputation list up to date.

The security team could draw this conclusion by seeing that 17 requests were “allowed”, or deemed to be safe traffic, while 11 requests were flagged as potentially malicious and blocked by the IPReputationlist rule, and no requests were flagged by the XSS or SQL injection rules. CloudWatch metrics give us near-real-time insight into which WAF rules are catching malicious traffic, and we can make actionable recommendations based on these metrics.
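
The kind of stakeholder summary described above (how many requests were blocked, what percentage, and by which rule) can be worked out from those counts in a few lines of Python. This is only an illustrative sketch: the counts are the ones quoted in the text, while the `summarize` helper and the `xss` and `sql_injection` rule keys are hypothetical names chosen for the example.

```python
from collections import Counter

# Request counts quoted in the WAF example above.
allowed = 17
blocked_by_rule = {"IPReputationlist": 11, "xss": 0, "sql_injection": 0}


def summarize(allowed, blocked_by_rule):
    """Summarize WAF traffic: totals, blocked percentage, and the busiest rule."""
    blocked = sum(blocked_by_rule.values())
    total = allowed + blocked
    top_rule, top_count = Counter(blocked_by_rule).most_common(1)[0]
    return {
        "total_requests": total,
        "blocked_requests": blocked,
        "blocked_pct": round(100 * blocked / total, 1) if total else 0.0,
        "top_rule": top_rule,
        "top_rule_share_pct": round(100 * top_count / blocked, 1) if blocked else 0.0,
    }


summary = summarize(allowed, blocked_by_rule)
print(summary)
```

With these numbers, 11 of 28 requests (about 39.3%) were blocked, and every blocked request was caught by the IP reputation list, which is what points the security team at keeping that list current.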

We aren’t saying CloudWatch metrics are the only thing that one needs to monitor for their application in AWS, but they are by far the most important thing to monitor. In fact many environments we manage use only application up/down monitoring, log monitoring, and CloudWatch metric monitoring with a large percentage of the monitors being CloudWatch monitors.

CloudWatch as Part of Your DevOps Lifecycle

It is common to hear things like “Cloud adoption is a journey,” “crawl, walk, run,” and other sayings that reinforce the idea that building applications in AWS is a living process; monitoring these applications is very much a living process as well. A critical part of adopting DevOps is lifecycle management, but the “dev” side has traditionally had the most well-defined lifecycle. Operations used to be well defined inside on-premises data centers, but what it looks like in the cloud has been largely up for debate. As DevOps has continued to evolve, the “ops” side has seen all its well-defined barriers torn down and has turned more and more into the Wild West every day. We are currently at a time when best practices are still not defined and likely some major building blocks still haven’t been released (think of when AWS was just a handful of services). The three biggest questions we are asked are: (1) Is sticking to AWS services when possible a principle for both building and monitoring or just building? (2) Should monitors & dashboards follow the same model of Infrastructure as Code? and (3) How do I know I am monitoring the right CloudWatch metrics?

While each of those questions deserves its own blog post, I will cover how we use CloudWatch metrics through the lens of those questions while explaining some of the pros and cons.

Is sticking to AWS services when possible a principle for both building and monitoring or just building?
- At Trek10 we consider ourselves superfans of AWS, especially when it comes to CloudWatch metrics; however, in the same breath, we are quick to recognize and appreciate when a third party vendor brings value to the table that AWS doesn’t match. Specifically, we have seen value in the SaaS product Datadog for years and have used it actively in our managed service practice. As it stands today, Datadog’s feature richness puts it a cut above the rest in many different areas including their powerful monitoring engine and observability platform. The monitoring can alert off any combination of metric/tag, provide anomaly alerting, and use real time data and logic to enrich the payload that is sent with the alert the monitor fires off. The dashboards are feature rich in many ways (think “super legit”). Finally, the ability to track an incident from logs to metrics to dashboards is seamless. Thus, we have defined it as a best practice internally to handle all of our monitoring via Datadog, including ingesting and alerting off of CloudWatch metrics.
Should monitors & dashboards be Infrastructure as Code?
- At Trek10 we fully recognize the power of IaC and always try to build with automation over any manual alternative. We deploy a suite of monitors and make some types of edits in a codified manner, but we also give our agents the ability to manage the monitors via manual (console) updates. We do this for a variety of reasons, including allowing less technical stakeholders to make changes to monitoring and, most importantly, giving a visual representation of monitoring changes. We have internal tooling that runs regularly to reconcile the differences between the previously stored codified monitor and the new monitor, and stores both as “Monitors, Users, Log configurations, Notebooks, and Dashboards as Code”. The older monitor is stored for historical purposes, and the newly changed monitor is stored for backup and redeployment purposes. Oh! Guess what? Datadog and CloudFormation play fetch together.
How do I know I am monitoring the right CloudWatch Metrics?
- As we saw earlier, the list of CloudWatch metrics is extensive. A shortcut for finding out which metrics and dimensions are available for a given service is simply Googling “<AWS SERVICE NAME> CloudWatch metrics.” The problem is that just because AWS has defined a CloudWatch metric for a service doesn’t mean every application needs to monitor that metric. For example, S3 lets you know how many objects are stored in a bucket and the size of the objects stored in that bucket. While this information could be handy, there are many applications where monitoring the # of items or size of buckets doesn’t provide any actionable insight. On top of that, many applications hosted in AWS are constantly being updated to use the latest, greatest services. As new AWS services are introduced to an account, new CloudWatch metrics will start showing up therein. Without insight into what new metrics are available for a given account, it is impossible to properly address them. In the past we would have recommended scheduled reviews of available metrics, processes for adopting a new service that include reviewing CloudWatch Metrics, and developer training and awareness. Today, we no longer recommend these methods. Gone are the days of a new developer introducing a service and subsequently forgetting to monitor it, technical staff following complicated metric availability reviews every time a new service is adopted, or operations teams conducting any kind of manual review on a scheduled basis. Here are the days of knowing what is in your account AND what is monitored so you can rest easy. We have developed a tool called Trek10 Coverage Advisor that is available via the Datadog Marketplace to anyone reading this article. Trek10 Coverage Advisor will alert you when a CloudWatch metric shows up in your account whose corresponding service doesn’t have any related monitors enabled. See the below section for more info.
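
The reconciliation step mentioned in the Infrastructure as Code answer above (compare the stored, codified monitor with the one currently live, keep the old version for history, and store the new one for redeployment) could look roughly like the Python sketch below. This is not Trek10's internal tooling; the function names, the monitor fields, and the example query are assumptions made for illustration.

```python
from datetime import datetime, timezone


def changed_fields(stored, live):
    """List the fields whose values differ between two monitor definitions."""
    return sorted(k for k in stored.keys() | live.keys()
                  if stored.get(k) != live.get(k))


def reconcile(stored, live, history, now=None):
    """Return the definition to keep as code going forward.

    If the live monitor was edited (for example in the console), the old
    stored definition is appended to `history` and the live one replaces it.
    """
    if stored == live:
        return stored
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    history.append({"replaced_at": timestamp, "definition": stored})
    return live


# Hypothetical monitor: someone lowered the threshold from 90 to 80 by hand.
stored_monitor = {"name": "High CPU",
                  "query": "avg(last_5m):avg:aws.ec2.cpuutilization{env:prod} > 90"}
live_monitor = {"name": "High CPU",
                "query": "avg(last_5m):avg:aws.ec2.cpuutilization{env:prod} > 80"}

history = []
print(changed_fields(stored_monitor, live_monitor))
current = reconcile(stored_monitor, live_monitor, history)
```

Keeping the superseded definition in `history` rather than overwriting it is what allows console edits and codified monitors to coexist: nothing a person changes by hand is lost, and the latest state can still be redeployed.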

To recap: internally, we use and have seen the most value from a combination of AWS CloudWatch and Datadog. Specifically, we use Datadog’s monitoring engine to alert on CloudWatch metrics because of all the benefits it provides, while applying the AWS/cloud best practice of IaC to give our team an easy and efficient way to keep monitors up to date. In the next section of this article, we will outline how to keep up to date with CloudWatch metrics as your application progresses through its lifecycle.

A High Power Tool We Have Built

Revisiting the question we asked at the beginning of the blog: have you ever been surprised by an outage (think of the customer calling to let you know it is down) because you weren’t monitoring the right metric, specifically the right CloudWatch metric? Yeah, we have too. After stubbing our toe a couple of times, our CloudOps team decided to invest in creating a solution that relieves this pain point, and we are excited to announce that this tool is now available to you via the Datadog Marketplace! “Trek10 Coverage Advisor” first looks at the CloudWatch metrics being reported to your Datadog account, and then checks whether there are any monitors in the Datadog account that correspond to each AWS service, in order to deduce a list of services that do not have corresponding monitors. Trek10 Coverage Advisor compares this list of services without monitors against Trek10’s internal database of metrics that are the best indicators of AWS service health and should therefore have corresponding monitors. We created this internal database by having AWS certified engineers read over CloudWatch documentation, use their own first-hand experience, and reach out to SMEs to highlight the CloudWatch metrics that are valuable to monitor. The internal database also includes a short description of why monitoring each metric is important. If we see a metric being reported that is valuable and whose service doesn’t have monitors, we will automatically push an event to your Datadog account to let you know. Additionally, we provide you with a dashboard that shows off some pretty cool information. The dashboard highlights a list of recent recommendations, each of which includes a metric to monitor and its description. Every recommendation indicates three things: AWS made the CloudWatch metric available, the metric exists in Trek10’s internal database of valuable CloudWatch metrics, and your current Datadog account doesn’t have a monitor for that service.
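
The comparison described above boils down to set logic: find the services that report metrics, remove the ones that already have a monitor, and keep only the metrics that appear in a curated list. The Python sketch below illustrates that idea only; it is not the Coverage Advisor implementation. It assumes Datadog-style metric names of the form `aws.<service>.<metric>`, and the function names, sample metrics, monitor query, and descriptions are all hypothetical.

```python
def service_of(metric_name):
    """Return the service part of a name like 'aws.sqs.number_of_messages_sent'."""
    parts = metric_name.split(".")
    return parts[1] if len(parts) >= 3 and parts[0] == "aws" else None


def coverage_gaps(reported_metrics, monitor_queries, valuable_metrics):
    """Recommend valuable metrics whose service has no monitor at all."""
    reported_services = {service_of(m) for m in reported_metrics} - {None}
    monitored = {s for s in reported_services
                 if any(f"aws.{s}." in query for query in monitor_queries)}
    unmonitored = reported_services - monitored
    return [
        {"metric": m, "why": valuable_metrics[m]}
        for m in sorted(set(reported_metrics))
        if m in valuable_metrics and service_of(m) in unmonitored
    ]


# Hypothetical account: EC2 has a monitor, SQS was added recently and has none.
reported = [
    "aws.ec2.cpuutilization",
    "aws.sqs.approximate_age_of_oldest_message",
    "aws.sqs.number_of_messages_sent",
]
monitors = ["avg(last_5m):avg:aws.ec2.cpuutilization{*} > 90"]
valuable = {
    "aws.ec2.cpuutilization": "Sustained high CPU can degrade the workload.",
    "aws.sqs.approximate_age_of_oldest_message":
        "A growing age suggests consumers are falling behind.",
}

recommendations = coverage_gaps(reported, monitors, valuable)
print(recommendations)
```

In this example the only recommendation is the SQS message-age metric: EC2 is already covered by a monitor, and the other SQS metric is not in the curated list, so it is not flagged.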

The dashboard also includes a button for you to “Generate a Recommendation Report.” If you click that button you will be prompted to download a report. This report includes a comprehensive list of recommendations as seen below:

We use a flavor of this tool internally with our CloudOps practice to make sure we are monitoring our clients’ CloudWatch metrics and we are really excited that this tool is now available to the public. So instead of worrying about which CloudWatch metrics to monitor: Put the pager down. Grab a beer, and enjoy the soft summer breeze of CloudWatch in the air.

P.S. Go on over and check out the marketplace for other cool integrations like RapDev’s Microsoft 365 integration. If you don’t see the integration you’re looking for, please contact us and we would love to discuss how to get that monitored for you.

Be on the look out for the following blog posts coming out later this year: