Cloud Emergency Preparedness Kit Pt. 1

The 3 topics cover what needs to be done before, during, and after an incident. Learn important factors that will contribute to great incident response.

Jeremiah Owen | Dec 03 2021

If you are like me, every time you first get on a plane you rehearse in your head all the possible things that could go wrong and how you would respond. After a few flights, those thoughts of preparedness fade, and eventually you forget them altogether... at least until you hit that extraordinarily rough patch of turbulence.

The same thing often happens when a service is built. At first, the anticipated issues may be imagined and maybe even written out along with mitigation techniques, but as time goes on, less attention is paid to what might go wrong and what to do when it does. This can happen for many reasons, one of which is the assumption that shiny new features and products do far more for the bottom line of the business than being prepared for an incident... But do they really?

Cost of an Incident

To calculate the cost of an incident let’s look at a few of the factors. There will be an impact to revenue, the hours the team is spending to fix the incident including all the aftermath, and the lost time due to delays of releasing new features.

For this example I will use a retail site with $1 million in annual revenue (AR) and 15% year-over-year (YoY) growth that depends on new products being released each day. Five engineers at this company respond to the incident, and each is paid $50/hr. The employee overhead multiplier of 1.25 is on the low end, but will work for our calculation. Finally, the incident lasted 24 hours and there was no “aftermath” for the engineers to clean up. Let's look at the cost breakdown:

Revenue lost: $1M in annual revenue means revenue per day is $1,000,000 / 365 ≈ $2,739.73

Engineer time to fix the issue: 5 engineers × 24 hours × $50/hr × 1.25 overhead = $7,500

Growth lost because new features were not released: ($1,000,000 × 15%) / 365 ≈ $410.96

That brings the total loss to $10,650.69 for a single outage.
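If you want to plug in your own numbers, here is a minimal Python sketch of the same arithmetic. The function name and its parameters are my own illustration, not part of any tool, and each component is rounded to cents before summing so the total lines up with the figures above.

```python
def incident_cost(annual_revenue, yoy_growth, engineers, hourly_rate,
                  overhead, duration_hours):
    """Return the cost components of a single outage, in dollars."""
    days = duration_hours / 24
    # Revenue that would have come in while the service was down.
    lost_revenue = round(annual_revenue / 365 * days, 2)
    # Loaded cost of the engineers working the incident.
    engineer_time = round(engineers * hourly_rate * overhead * duration_hours, 2)
    # Growth missed because new releases were delayed by the outage.
    lost_growth = round(annual_revenue * yoy_growth / 365 * days, 2)
    return {
        "lost_revenue": lost_revenue,
        "engineer_time": engineer_time,
        "lost_growth": lost_growth,
        "total": lost_revenue + engineer_time + lost_growth,
    }


# The retail example from this post: $1M AR, 15% YoY growth,
# 5 engineers at $50/hr with a 1.25 overhead multiplier, 24-hour outage.
cost = incident_cost(1_000_000, 0.15, 5, 50, 1.25, 24)
for name, value in cost.items():
    print(f"{name}: ${value:,.2f}")
```

Change the duration or the number of engineers to see how quickly the total climbs for your own service.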

That single outage costs almost 26 times as much as a single day of growth from new features (in our case, new products). Now that you’re feeling that turbulence a little bit, let me jump on the loudspeaker and say:

“This is your pilot speaking. Over the next few blog posts, we will be covering what you can do to prepare for an incident.”

Incident Management

Incident Management is a term largely associated with IT Service Management (ITSM). ITSM is about the people, processes, and technology that are implemented to run a service. I want to cover 3 general topics within incident management to help you better prepare for an incident (or, hopefully, to give you assurance that you are already prepared). The 3 topics cover what needs to be done before, during, and after an incident. Within each of these, I will break down some important factors that contribute to a great incident response plan!

Before the Incident Strikes

Have you ever wondered who first thought to make each airplane seat detachable and inflatable? Airplanes have many built-in safety measures for incidents, and so should our services!

Architecting for Resilience

Proper architecture seems obvious, but it is important to highlight. Architectural decisions can dramatically impact the resilience of a service. The AWS whitepapers are a valuable resource when making these decisions; two good ones to start with are Reliability and Operational Excellence. Another valuable option is to have an AWS Partner conduct a Well-Architected Review of an application or service, which will help identify gaps and oversights that may have crept in as the service was developed and evolved.

The desired Recovery Point Objective and Recovery Time Objective (RPO/RTO) are very important metrics that should be established as early as possible. RTO is the maximum acceptable length of time an application can be down after a disaster occurs. RPO is the maximum acceptable age of the backup data used to recover from a disaster, which is to say how much recent data you can afford to lose. The goal is to establish an RTO and RPO that balance minimizing business impact against the cost of building and running a solution that achieves that RTO/RPO. These metrics will help guide the right decisions, with the right level of resilience, in the service’s architecture.
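To make these two numbers concrete, here is a small sketch that checks a backup setup against its objectives. Everything in it is an assumption for illustration: the function name is made up, worst-case data loss is taken to be one full backup interval, and downtime is approximated by how long a restore takes.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Compare a backup setup against its RPO and RTO targets.

    Assumes the worst-case data loss is one full backup interval and
    the worst-case downtime is roughly the time a restore takes.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": restore_duration <= rto,
    }


# Hypothetical service: nightly backups, 2-hour restore, 4-hour targets.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_duration=timedelta(hours=2),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

In this made-up case the restore is fast enough, but nightly backups could lose up to a day of data, so either the backups need to run more often or the RPO needs to be renegotiated with the business.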

Defining an Incident

One important part of incident management is understanding what an incident is. An incident is any time that a service is impaired, degraded, or interrupted. This could be anything from a 30-second blip that most users wouldn’t even notice all the way to a complete outage that lasts for days. For a small website that gets 100 visitors a day, a 30-second blip is probably not a big deal, but for a stock trading site handling thousands of transactions per second, it could be extremely costly. This is why it is important to define how to rate the severity of an incident for each SERVICE, not each company. Define what constitutes a “Critical” severity incident versus a “High” or “Medium” severity incident. This matters because the response should be very different depending on the severity of the incident. Once this severity definition is established, it needs to be documented, distributed, and reviewed regularly. It is imperative that the entire company knows how and when to prioritize an incident if one occurs.
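One way to picture a per-service severity definition is as a small lookup of thresholds. The sketch below is only an illustration: the service names, the choice of error rate as the signal, and every threshold are hypothetical placeholders that each team would need to agree on for its own service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    """Error-rate thresholds for one service (placeholder values)."""
    critical: float  # fraction of failed requests that means "Critical"
    high: float      # fraction of failed requests that means "High"


# Hypothetical services with very different tolerances.
POLICIES = {
    "marketing-site": SeverityPolicy(critical=0.50, high=0.20),
    "trading-api": SeverityPolicy(critical=0.01, high=0.001),
}


def classify(service, error_rate):
    """Rate an impairment using the policy defined for that service."""
    policy = POLICIES[service]
    if error_rate <= 0:
        return None  # nothing is impaired, so there is no incident
    if error_rate >= policy.critical:
        return "Critical"
    if error_rate >= policy.high:
        return "High"
    return "Medium"


# The same 0.5% error rate rates differently depending on the service.
print(classify("marketing-site", 0.005))  # Medium
print(classify("trading-api", 0.005))     # High
```

Keeping the thresholds in one reviewed place, rather than in people's heads, also makes the regular review of the definition much easier.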

Monitoring for an Outage

All of those gauges in the airplane cockpit are there to help navigate and to alert the crew to danger that could impact the flight. Similarly, we need dashboards, metrics, monitors, and alerts that will go off when an incident is about to happen or has already happened. This part will grow as a service is built. This article has some great information on using Amazon CloudWatch metrics and some tips on how to start creating the metrics you may need. Teams often start with either too few or too many metrics and monitors, and over time they find the metrics that matter. As we improve and grow our infrastructure, more monitors will likely be created. This, among other reasons, is why we should regularly review and improve the metrics we are gathering.

Dashboards are our concept of that cockpit view, where we can see a full landscape of metrics and monitors. This can be extraordinarily helpful in an active incident, because it will be easier to identify all the broken pieces and hopefully identify a root cause faster. Alerts are extremely important because without them an incident may go undetected until a customer complains. Over-alerting can lead to alarm fatigue, and can be just as bad as not having alerts at all! Alerting too late will lead to longer incident recovery times. Finding the balance where the service is not alerting too frequently or too late is a constant adjustment.
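A common way to balance "too noisy" against "too late" is to alert only when several recent datapoints breach a threshold, rather than on a single spike. The plain-Python sketch below illustrates the idea; the function name and the sample latency numbers are invented for this example. CloudWatch alarms offer a comparable "datapoints to alarm" setting, but this code does not call CloudWatch.

```python
def should_alert(datapoints, threshold, breaches_needed, window):
    """Alert when at least `breaches_needed` of the last `window`
    datapoints are above `threshold`."""
    recent = datapoints[-window:]
    breaches = sum(1 for value in recent if value > threshold)
    return breaches >= breaches_needed


# Hypothetical latency samples in milliseconds, one per minute.
brief_spike = [120, 130, 900, 125, 118]
sustained = [120, 650, 700, 810, 790]

print(should_alert(brief_spike, 500, 3, 5))  # False: one blip, no page
print(should_alert(sustained, 500, 3, 5))    # True: 4 of 5 breached
```

Raising `breaches_needed` cuts down on noise but delays the alert, while lowering it does the opposite, which is exactly the adjustment described above.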

Documentation

At the start of every flight, the attendants give an instructional speech that many of us have memorized. Similarly with a service, having up-to-date documentation in place is what separates those who can recover quickly from those who struggle to resolve their incidents. I want to highlight a couple of types of documentation specifically relating to incident management: runbooks and playbooks. Runbooks are step-by-step instructions for doing a specific task, while playbooks offer more general guidance for dealing with a certain situation. Runbooks are important for incidents because once an action has been chosen to fix the incident, having a runbook for that exact action is priceless. The less “guessing” or “trying” we have to do in an incident, the more predictable the outcome will be. Playbooks are important because they bring structure and process into an incident, helping the incident team reach a root cause and resolution much faster. For example, a runbook would be “Increasing Disk Space on Server”, while a playbook would be “Diagnosing Excessive Disk Space Consumption”. I will stress this again: documentation needs to be continually updated and reviewed in order for it to be useful. This article has some great advice on writing runbooks.

Tools Needed for an Incident

Being prepared for an incident is important, and having the right tools makes things even easier during an active one. The most valuable tools in an incident are usually scripts, or automated or manual actions that kick off a series of helpful tasks. A few examples that may apply to any service are:

Tools to turn off or redirect traffic from the degraded service
Tools to help implement an out-of-cycle patch
Automated tools to kick off the procedural elements of managing an outage, such as conference bridges and email templates
Auto-generated context attached to the outage alert, enriching the event with data to help find a root cause faster
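As a rough idea of what the last item could look like, here is a minimal sketch of enriching an alert with context before it reaches the on-call engineer. The alert shape, the deploy records, and the runbook lookup are all hypothetical; a real setup would pull these from your own deployment and documentation systems.

```python
from datetime import datetime, timezone


def enrich_alert(alert, recent_deploys, runbooks):
    """Return a copy of the alert with extra context attached."""
    service = alert["service"]
    return {
        **alert,
        "enriched_at": datetime.now(timezone.utc).isoformat(),
        # Recent deploys to the affected service are a frequent suspect.
        "recent_deploys": [d for d in recent_deploys if d["service"] == service],
        # Point responders straight at the matching runbook, if any.
        "runbook": runbooks.get(alert["name"], "no runbook on file"),
    }


# Hypothetical inputs for illustration only.
alert = {"name": "disk-space-low", "service": "checkout"}
deploys = [
    {"service": "checkout", "version": "hypothetical-v2"},
    {"service": "search", "version": "hypothetical-v7"},
]
runbooks = {"disk-space-low": "Increasing Disk Space on Server"}

enriched = enrich_alert(alert, deploys, runbooks)
print(enriched["runbook"])
print(enriched["recent_deploys"])
```

Even this small amount of context saves responders from hunting for the same information by hand at the start of every incident.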

You Are All Clear for Takeoff

Preparation work for an incident is extremely important and can really help speed up the time to recovery. Once you have prepared everything discussed in this article, run a mock incident, or game day, where you test your incident management processes. This helps iron out issues in those processes before a real incident occurs. It is important to stay mindful of incidents and always be ready to take action when the need arises.

“This is your pilot speaking, you're now cleared for takeoff. I'll see you on the next topic as we figure out what to do during an incident!”

Look for part 2 in this series soon where I’ll discuss what needs to be done during an incident to help prepare your team to handle any incident with ease!