Serverless Best Practices: Operations - Think FaaS Podcast
And we’re back again, I’m Jared Short at Trek10, and this is ‘Think FaaS’, where we learn about the world of serverless computing in less time than it takes to run an AWS Lambda function. So put five minutes on the clock - it’s time to ‘Think FaaS’.
We are continuing our Serverless Best practices again this week, and we are focusing on operations and operating serverless systems. This is a huge area, with evolving best practices. I’m sure you’ve also heard some of the buzzwords in this area like structured logging, tracing, observability, anomaly detection, etc. There’s a lot to get into, let’s go!
This is probably the most obvious and most common. Apps have errors and you have to watch for them… not exactly groundbreaking. What is a bit different is figuring out the right threshold for alerting. For a low-volume system it may make sense to alert on every error, but for any sufficiently high-volume system you need to do some work to weed out the noise that comes from typical transient errors. Typically some rate of errors under 0.1% is always going to occur. With asynchronous Lambda invocations (for example from an S3 object event), leverage dead letters queues and such and you can usually ignore all transient errors knowing that AWS will usually retry and get success.
You may also want to consider tools that improve your visibility beyond the built-in basics of CloudWatch metrics and Lambda logs in CloudWatch Logs. Error tracking services like Sentry or Rollbar work just as well as in traditional architectures in helping to track errors. When it comes to tracing, though, you’ll need to look at a new generation of tools: AWS X-Ray and IOPipe are two of the more popular options.
Watch the Dials Where Dials Exist
While scaling with AWS platform services is mostly transparent, that’s not 100% the case. There are a few dials in the system; it is important to know where they are and how to monitor them to optimize scalability and costs. Some are obvious and easily visible like DynamoDB Provisioned Throughput (which also has auto-scaling now, by the way) or Kinesis shards, others are slightly more hidden like Lambda Concurrency Limits, and still others like S3 pre-partitioning are completely hidden and can only be monitored by observing symptoms like S3 error rate or PUT latency. Carefully review each part of the system to identify all of the relevant dials.
We had a whole episode on this in this series from Forrest, so consider that security is never a solved problem… the risks just shift. With no long-running VM and often no network to manage, Serverless greatly reduces the attack surface area of many traditional threats. This doesn’t mean security is solved, though: it just allows you to shift your focus to other threat areas. Focus on IAM, tightening down those policies, for Web Applications you still need to follow all your normal OWASP Top 10’s, and finally your application dependencies are a key factor. A small but emerging ecosystem of tools is focusing on analyzing your project’s dependencies to validate that they are both coded securely and that they are not compromised by attackers. Snyk and PureSec are two interesting ones to keep an eye on.
This is sort of the flipside of a great benefit of Serverless: cost is truly usage-based… but cost is truly usage-based. If you get unwanted or unexpected traffic, costs could spike quickly. So it is important to monitor costs on a daily basis so you can quickly detect any cost spikes and block the offending traffic or optimize your application to minimize costs.
All of the AWS platform services are by default running in multiple AWS Availability Zones (AZs, which are one or more data centers with independent power and network within a given AWS Region), so in theory 2-3 AWS data centers would need to go down simultaneously to cause an outage, which is a very uncommon (i.e. much less often than once a year) scenario. But the reality has been that these services in fact have cross-AZ dependencies and have region-wide outages. In the past 15 months there have been multiple outages to services like DynamoDB, S3, and Lambda. So this is a real thing that you need to plan for.
The first step is determining the extent to which you can build for multi-region failover or possibly even multi-region active-active. Build your operational response plan for these outages. While your ops team may not be able to fix AWS’s issue, it still has a key role to play: Identify as early as possible that there is a problem, trace root cause to the AWS services, look for confirmation from AWS that the problem is on their side (usually, AWS Support initially and then with some lag the AWS Status Page), and then effectively communicate to end users, initiate failover plans as appropriate, and monitor status on the AWS side.
Whew, we made it! A timely reminder that only one episode of Think FaaS remains until Trek10, myself and my co-host Forrest will be at ServerlessConf San Francisco. We’ll be putting on a couple sessions of “Think FaaS Live” with a bunch of bright folks throwing out nuggets of gold in rapid fire fashion. Will you be there? Let us know on twitter @trek10inc, @shortjared or @forrestbrazeal. Hope to see you there and on the next episode of Think Faas!