If you are like me, the first few times you get on a plane you rehearse in your head all the possible things that could go wrong and how you would respond. After a few flights those thoughts of preparedness fade, and eventually you forget them entirely... At least until you hit that extraordinarily rough patch of turbulence.
This very thing often happens when a service is built. At first the anticipated issues may be imagined, and maybe even written out along with mitigation techniques, but as time goes on less attention is paid to what might go wrong and what to do when it does. This can happen for many reasons, one of which is the assumption that shiny new features and products do far more for the business's bottom line than being prepared for an incident... But do they really?
Cost of an Incident
To calculate the cost of an incident, let's look at a few of the factors: the impact to revenue, the hours the team spends fixing the incident (including all the aftermath), and the time lost to delayed feature releases.
For this example I will use a retail site with $1 million in annual revenue (AR) and 15% year-over-year (YoY) growth that depends on new products being released each day. Five engineers at this company respond to the incident, and each is paid $50/hr. An employee overhead multiplier of 1.25 is on the low end, but will work for our calculation. Finally, the incident lasts 24 hours and there is no “aftermath” for the engineers to clean up. Let's look at the cost breakdown:
$1M annual revenue means one day of lost revenue is $1M ÷ 365 ≈ $2,739.73
Engineer time to fix the issue is 5 engineers × $50/hr × 1.25 overhead × 24 hours ≈ $7,500
Lost growth from new features not being released is 15% × $1M ÷ 365 ≈ $410.96
For a total loss of $10,650.69 from a single outage.
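The arithmetic above can be sketched in a few lines; all the figures are the example's assumptions, not real benchmarks:

```python
# Incident-cost sketch using the example's assumed figures.
ANNUAL_REVENUE = 1_000_000
YOY_GROWTH = 0.15        # 15% growth, driven by daily releases
ENGINEERS = 5
HOURLY_RATE = 50
OVERHEAD = 1.25          # low-end employee overhead multiplier
OUTAGE_HOURS = 24

revenue_loss = round(ANNUAL_REVENUE / 365 * OUTAGE_HOURS / 24, 2)
engineer_cost = ENGINEERS * HOURLY_RATE * OVERHEAD * OUTAGE_HOURS
lost_growth = round(ANNUAL_REVENUE * YOY_GROWTH / 365 * OUTAGE_HOURS / 24, 2)

total = revenue_loss + engineer_cost + lost_growth
print(f"${total:,.2f}")  # → $10,650.69
```

Plug in your own revenue, team size, and outage duration to get a rough floor on what a single incident costs your business.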
That single outage costs almost 26 times a single day of growth from new features (in our case, products). Now that you're feeling that turbulence a little bit, let me jump on the loudspeaker and say,
“This is your pilot speaking, over the next few blog posts we will be covering what you can do to prepare for an incident”
Incident management is a term largely associated with IT Service Management (ITSM). ITSM is about the people, processes, and technology implemented to run a service. I want to cover three general topics within incident management to help you better prepare (or, hopefully, to assure you that you already are prepared) for an incident to strike. The three topics cover what needs to be done before, during, and after an incident. Within each I will break down some important factors that contribute to a great incident response plan!
Before the Incident Strikes
Have you ever thought about who first decided that each airplane seat cushion should double as a flotation device? Airplanes have many built-in safety measures for incidents, and so should our services!
Architecting for Resilience
Proper architecture seems obvious, but it is important to highlight. Architectural decisions can dramatically impact the resilience of a service, and AWS whitepapers are a valuable resource when making them. Here are a couple of links to get you started: Reliability and Operational Excellence. Another valuable option is to have an AWS Partner conduct a Well-Architected Review of an application or service, which will help identify gaps and oversights that may have crept in as the service developed and evolved.
The desired Recovery Point Objective/Recovery Time Objective (RPO/RTO) is a very important metric that should be established as early as possible. RTO is the maximum acceptable time an application can be down after a disaster occurs. RPO is the maximum acceptable age of the backup data used to restore the application after a disaster. The goal is to establish an RTO and RPO that balance minimizing business impact against the cost of building and running a solution that achieves them. These metrics will help guide the right decisions about the right level of resilience in the service's architecture.
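As an illustrative sketch (the function and names here are my own, not an AWS API), an RPO target translates directly into a maximum allowed age for the most recent backup:

```python
from datetime import datetime, timedelta

def meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """If disaster struck right now, would restoring the most recent
    backup lose no more data than the RPO allows?"""
    return now - last_backup <= rpo

# A 1-hour RPO means a backup taken 30 minutes ago is still acceptable,
# but one taken 2 hours ago is not.
ok = meets_rpo(datetime(2024, 1, 1, 9, 30), datetime(2024, 1, 1, 10, 0),
               timedelta(hours=1))   # → True
stale = meets_rpo(datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 10, 0),
                  timedelta(hours=1))  # → False
```

The same shape of check works for RTO: compare elapsed downtime against the agreed maximum and escalate when you are about to breach it.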
Defining an Incident
One important part of incident management is understanding what an incident is. An incident is any time a service is impaired, degraded, or interrupted. This could be something as small as a 30-second blip that most users wouldn't even notice, all the way to a complete outage that lasts for days. For a small website that gets 100 visitors a day, a 30-second blip is probably not a big deal, but for a stock trading site handling thousands of transactions per second, it could be extremely costly. This is why it is important to define, for each SERVICE, not each company, how to rate the severity of an incident. Define what constitutes a “Critical” severity incident versus a “High” or “Medium” one. This matters because the response should differ greatly based on the severity of the incident. Once this severity definition is established, it needs to be documented, distributed, and reviewed regularly. It is imperative that the entire company knows how and when to prioritize an incident if one occurs.
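To make the per-service idea concrete, here is a hypothetical severity matrix; the service names, thresholds, and labels are all illustrative, not a standard:

```python
# Hypothetical per-service severity thresholds, in seconds of degradation.
# Each list is ordered from most to least severe; values are illustrative.
SEVERITY_RULES = {
    "stock-trading-api": [(30, "Critical"), (5, "High"), (1, "Medium")],
    "marketing-site": [(86400, "Critical"), (3600, "High"), (600, "Medium")],
}

def classify(service: str, outage_seconds: int) -> str:
    """Rate an outage's severity using that service's own thresholds."""
    for threshold, severity in SEVERITY_RULES[service]:
        if outage_seconds >= threshold:
            return severity
    return "Low"

# The same 30-second blip rates very differently per service.
trading = classify("stock-trading-api", 30)  # → Critical
marketing = classify("marketing-site", 30)   # → Low
```

Writing the definition down in a table like this, rather than leaving it to judgment in the moment, is what makes the response predictable.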
Monitoring for an Outage
All of those gauges in the airplane cockpit are there to help navigate and to warn of danger that could impact the flight. Similarly, we need dashboards, metrics, monitors, and alerts that will go off when an incident is about to happen or has happened. This is really something that will grow as a service is built. This article has some great information on using Amazon CloudWatch metrics, along with tips on how to start creating some of the metrics you may need. Teams often start with either too few or too many metrics and monitors, and find the metrics that matter as time progresses. As we improve and grow our infrastructure, more monitors will likely be created. This, among other reasons, is why we should regularly review and improve the metrics we gather.
Dashboards are our version of that cockpit view, where we can see the full landscape of metrics and monitors. They can be extraordinarily helpful in an active incident because they make it easier to spot all the broken pieces and, hopefully, identify a root cause faster. Alerts are extremely important because without them an incident may go undetected until a customer complains. Over-alerting, however, leads to alarm fatigue and can be just as bad as having no alerts at all! Alerting too late leads to longer incident recovery times. Finding the balance where the service alerts neither too frequently nor too late is a constant adjustment.
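One common way to strike that balance, and the idea behind CloudWatch's evaluation periods, is to fire only when several consecutive datapoints breach the threshold. A minimal sketch (the helper is my own, not a CloudWatch API):

```python
def should_alert(datapoints, threshold, evaluation_periods):
    """Fire only when the last `evaluation_periods` datapoints all
    breach the threshold, suppressing one-off blips."""
    if len(datapoints) < evaluation_periods:
        return False
    return all(dp > threshold for dp in datapoints[-evaluation_periods:])

# Three consecutive breaches of a 50-errors/minute threshold fire the alert;
# a spike that recovers does not.
sustained = should_alert([12, 61, 74, 80], 50, 3)  # → True
blip = should_alert([61, 12, 74], 50, 3)           # → False
```

Raising `evaluation_periods` trades alert speed for fewer false alarms, which is exactly the dial you will keep adjusting as the service matures.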
Documentation for an Incident
At the start of every flight the attendants give an instructional speech that many of us have memorized. Similarly with a service, having up-to-date documentation in place is what separates those who recover quickly from those who struggle to resolve their incidents. I want to highlight two types of documentation specifically related to incident management: runbooks and playbooks. Runbooks are step-by-step instructions for performing a specific task, while playbooks are more general guidance for dealing with a certain situation. Runbooks are important for incidents because once an action has been determined to fix the incident, having a runbook for that exact action is priceless. The less “guessing” or “trying” we have to do during an incident, the more predictable the action will be. Playbooks are important because they bring structure and process to an incident, helping the incident team reach a root cause and resolution much faster. For example, a runbook would be “Increasing Disk Space on a Server,” while a playbook would be “Diagnosing Excessive Disk Space Consumption.” I will stress this again: documentation needs to be continually updated and reviewed in order to stay useful. This article has some great advice on writing runbooks.
Tools Needed for an Incident
Being prepared for an incident is important, and having the right tools makes things even easier during an active one. The most valuable tools in an incident are usually scripts, automated or manually triggered actions that kick off a series of helpful tasks. A couple of examples that may fit almost any service:
Tools to turn off or redirect traffic from the degraded service
Tools to help implement an out-of-cycle patch
Automated tools to kick off the procedural elements of managing an outage, like conference bridges and email templates
Auto-generated context attached to the outage alert, enriching the event with data that helps find a root cause faster
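As a sketch of that last item, an enrichment step can bolt context onto the raw alert before it reaches a responder. The data sources here are hypothetical stand-ins for your own deploy history and runbook index:

```python
# Hypothetical stand-ins for real data sources (deploy log, runbook wiki).
RECENT_DEPLOYS = {"checkout": "v2.4.1 deployed 12 minutes ago"}
RUNBOOKS = {"checkout": "https://wiki.example.com/runbooks/checkout"}

def enrich_alert(alert: dict) -> dict:
    """Attach context a responder would otherwise have to hunt down."""
    service = alert["service"]
    return {
        **alert,
        "last_deploy": RECENT_DEPLOYS.get(service, "none in the last 24h"),
        "runbook": RUNBOOKS.get(service, "no runbook found"),
    }

enriched = enrich_alert({"service": "checkout", "alarm": "5xx error spike"})
# enriched now carries the recent deploy and the runbook link alongside
# the original alarm fields.
```

Even this small amount of automation can shave minutes off every incident, because the first question in most outages is "what changed?"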
You Are All Clear for Takeoff
Preparation work for an incident is extremely important and can really speed up time to recovery. Once you have prepared everything discussed in this article, run a mock incident, or game day, where you test the incident management processes. This helps iron out the issues in your incident management before a real incident occurs. It is so important to stay mindful of incidents and always be ready to take action when the time comes.
“This is your pilot speaking, you're now cleared for takeoff. I'll see you on the next topic as we figure out what to do during an incident!”
Look for part 2 in this series soon, where I'll discuss what needs to be done during an incident so your team can handle one with ease!