Think FaaS
Serverless Smells
Identifying the warning signs of serverless architecture choices that won’t work well long-term.
Mon, 17 Jun 2019
Hi, it’s Forrest Brazeal with Trek10, and you’re listening to Think FaaS, where we learn about the world of serverless computing in less time than it takes to run a Lambda function. So put fifteen minutes on the clock — it’s time to Think FaaS.
There are lots of ways to assemble an architecture using serverless building blocks, just like you can make just about anything out of a pile of Lego bricks. The problem is that if you want to make a sturdy house, probably not too many combinations of those bricks will work. One of the big challenges that I see in serverless is sifting through all the possible architectures to settle on a good solution.
Let’s suppose that you want to build an API that returns responses synchronously to your users. But that API call also needs to launch an asynchronous process that hits some third-party service, and then notify your user later via a callback. This is such a common pattern that I’ve encountered some form of it literally three different times in the past two weeks. And yet there are so many ways you could do this. Will your API be REST or GraphQL? Will you use some sort of queue to buffer the third-party calls? If so, how will you keep from overwhelming your downstream system? Will you set timers on your queue messages, and how will you keep track of how long those timers should be? Do you need DynamoDB involved somewhere?
There are probably five ways to build this reasonably well and fifty ways to build it kind of wrong. And you may not know which path you’ve chosen until you’re relatively far along.
So today, we’re going to talk about detecting serverless smells. What are the warning signs you can look for to develop an intuition about architectural or development choices that are not going to work well long-term? As usual, we’ll scope our discussion to AWS, but hopefully this will generalize to other clouds as well.
I’m going to try to avoid using the word “antipattern”, because this space simply hasn’t been around long enough yet to codify whether a lot of these things are absolutely bad ideas, particularly given that so many people are on a journey of serverless adoption that requires extensive refactoring. So the smells we identify below may in fact be valid stopping points as you progress, but I want to call out why they may have problems and which direction we should be moving.
In order to identify smells, the first thing we need is a sense of what a good serverless architecture looks like. What are the design goals you are striving for?
For me, the motivating impulse is always to build things that “just work”. Back to the Lego analogy, I want to see the bricks click together. I don’t want to have to reshape them with a Dremel and then slather them with glue. Here’s an example of what I mean. Let’s suppose one of the requirements for the API I mentioned before is that the user contract should be defined in a GraphQL schema. I could find a GraphQL library and build out a bunch of boilerplate logic in a Lambda function to connect queries to resolvers. Or I could probably cut down a lot of that work by using AWS’s managed GraphQL service, AppSync. So far, so obvious. AppSync is a Lego brick here. Now let’s suppose one of my requirements is to provide throttling and rate limiting of user requests to this service. It turns out AppSync doesn’t natively provide that behavior today. The service that does have a lot of features around API keys and usage plans is API Gateway. But if I use API Gateway, I’m back to square one with implementing my own GraphQL server behind the scenes. That seems like a serverless smell.
So what are my options? Well, I could put an API Gateway proxy in front of an AppSync endpoint. Is that a good idea? Maybe! It would give me the usage control features of API Gateway combined with the power of AppSync, but it also might add some extra latency to my API. There’s also the question of how to handle the auth handoff between API Gateway and AppSync. If I use a standard HTTP proxy, I’ll have to maintain and rotate AppSync API keys, maybe with a separate Lambda function that runs on a schedule. However, I also have the option in API Gateway to treat my AppSync backend as an AWS service integration. That means I can authorize my AppSync calls using IAM, no key rotation needed.
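To make that concrete, here’s a minimal CDK sketch of the IAM-authorized service integration idea. Treat it as a sketch under assumptions: the `appsync-api` service name and `subdomain` value are my guesses at how the AppSync data-plane hostname maps into an API Gateway integration URI, and the ARNs and IDs are placeholders, not verified values.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as apigw from 'aws-cdk-lib/aws-apigateway';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class GraphqlProxyStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Role that API Gateway assumes so its calls to AppSync are SigV4-signed
    // (IAM auth: no AppSync API keys to maintain or rotate).
    const integrationRole = new iam.Role(this, 'AppSyncCallerRole', {
      assumedBy: new iam.ServicePrincipal('apigateway.amazonaws.com'),
    });
    integrationRole.addToPolicy(new iam.PolicyStatement({
      actions: ['appsync:GraphQL'],
      // Placeholder ARN: scope this to your actual AppSync API.
      resources: ['arn:aws:appsync:us-east-1:111111111111:apis/APPSYNC_API_ID/*'],
    }));

    const api = new apigw.RestApi(this, 'ProxyApi');

    // AWS service integration straight to the AppSync GraphQL endpoint.
    // 'appsync-api' + subdomain are assumptions about how the data-plane
    // hostname ({api-id}.appsync-api.{region}.amazonaws.com) is addressed.
    api.root.addResource('graphql').addMethod('POST', new apigw.AwsIntegration({
      service: 'appsync-api',
      subdomain: 'APPSYNC_API_ID',
      path: 'graphql',
      integrationHttpMethod: 'POST',
      options: { credentialsRole: integrationRole },
    }), { apiKeyRequired: true }); // lets API Gateway usage plans throttle per key
  }
}
```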
Can you visualize what I’m describing here? We’ve connected two services with very powerful features, AppSync and API Gateway, without requiring any intermediate Lambda functions. We’re getting things like schema introspection, database integrations, and API key throttling out of the box. We’re getting invalid requests handled at no cost to us via API Gateway. Another advantage is that while AppSync doesn’t support custom domains and certificates, API Gateway does. So we can put together an active-active multi-region service with health checks and failover just by adding a bit of Route53 config. The downsides of this architecture: probably a slight performance penalty, a few milliseconds per request, from putting API Gateway in front of AppSync, and of course the cost of running two managed services. But those are the tradeoffs you want to be making with serverless, in exchange for leveraging a lot of features you didn’t have to build yourself.
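That Route53 config really is small. Here’s a rough sketch of one region’s half of it at the CloudFormation layer: a health check plus a latency-based alias record, so traffic goes to the nearest healthy region and an unhealthy region drops out of DNS. Every hostname and zone ID here is a placeholder.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as route53 from 'aws-cdk-lib/aws-route53';
import { Construct } from 'constructs';

export class FailoverDnsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Health check against one region's endpoint (hostname is a placeholder).
    const usEast1Hc = new route53.CfnHealthCheck(this, 'UsEast1Hc', {
      healthCheckConfig: {
        type: 'HTTPS',
        fullyQualifiedDomainName: 'api-us-east-1.example.com',
        resourcePath: '/graphql',
      },
    });

    // Latency-based record for that region. A failing health check pulls the
    // region out of DNS, which is the failover half of active-active.
    new route53.CfnRecordSet(this, 'UsEast1Record', {
      hostedZoneId: 'YOUR_ZONE_ID',
      name: 'api.example.com.',
      type: 'A',
      region: 'us-east-1',
      setIdentifier: 'us-east-1',
      healthCheckId: usEast1Hc.ref,
      aliasTarget: {
        dnsName: 'REGIONAL_API_GATEWAY_DOMAIN', // e.g. d-abc123.execute-api.us-east-1.amazonaws.com
        hostedZoneId: 'API_GATEWAY_ZONE_ID',
        evaluateTargetHealth: true,
      },
    });
    // ...plus a mirror-image health check and record for the second region.
  }
}
```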
The smells happen when you start hacking things together on your own. We try to avoid putting a “server in a function” around here, lifting and shifting an entire app from a VM into Lambda. I get that there are times when it makes sense as an intermediate step or when working with an existing codebase. But it’s suboptimal because it makes your functions build and run slower, you lose granular control over permissions, and you’re likely shipping code that you shouldn’t have to be writing anymore, like a GraphQL server.
The related smell here, one I think is a bit more insidious and doesn’t get talked about as much, is using Lambda as an orchestration server. You have to start thinking of cloud services as the places where you externalize your application’s control flow. That means scaling out concurrent Lambda invocations instead of writing a bunch of threading code. It means using Step Functions and the Amazon States Language instead of a bunch of if statements. It means handling errors via dead letter queues. So you keep your business logic small, simple, and discrete, and represent the rest as a service graph. I know that’s difficult for programmers to accept, and there are lots of varying opinions here. I’m telling you my opinion based on my experience: a black-box Lambda full of orchestration logic can be harder to reason about, harder to debug, and leaves you carrying operational responsibility you could have handed off. In the serverless world, it’s a smell.
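Here’s a small, hypothetical sketch of what that externalized control flow can look like in CDK. The branching, retries, and failure states live in the Step Functions definition; each Lambda function stays a tiny piece of business logic. All the names and inline function bodies are made up for illustration.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import { Construct } from 'constructs';

export class OrderWorkflowStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Two tiny single-purpose functions (inline placeholder code).
    const validateFn = new lambda.Function(this, 'ValidateFn', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromInline(
        'exports.handler = async (e) => ({ ...e, valid: !!e.orderId });'),
    });
    const vendorFn = new lambda.Function(this, 'VendorFn', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromInline(
        'exports.handler = async (e) => e; // call the third party here'),
    });

    const validate = new tasks.LambdaInvoke(this, 'Validate', {
      lambdaFunction: validateFn,
      outputPath: '$.Payload',
    });
    const callVendor = new tasks.LambdaInvoke(this, 'CallVendor', {
      lambdaFunction: vendorFn,
      outputPath: '$.Payload',
    });
    // Retries and branching live in the state machine, not in if statements
    // and try/catch blocks inside one big function.
    callVendor.addRetry({ maxAttempts: 3, backoffRate: 2 });

    const definition = validate.next(
      new sfn.Choice(this, 'IsValid')
        .when(sfn.Condition.booleanEquals('$.valid', true), callVendor)
        .otherwise(new sfn.Fail(this, 'Rejected', { error: 'InvalidInput' })),
    );

    new sfn.StateMachine(this, 'OrderWorkflow', {
      definitionBody: sfn.DefinitionBody.fromChainable(definition),
    });
  }
}
```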
Another smell, I think, is code that is not event-driven. Just because Lambda functions can run up to fifteen minutes now does not mean you should be long-polling until timeout. I’ve seen these batch Lambda functions that run fourteen minutes and fifty seconds, then reinvoke themselves in a loop using Step Functions, burning hours of compute until some job is complete. I get why people do this, if they’re comfortable with the Lambda programming model, but in my opinion, this is a case where that external orchestration is not providing clarity. These are long-running jobs. Use Batch or Fargate. Use Lambda to respond to events or process small, predictably-sized chunks of work.
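For comparison, here’s roughly what handing that same long-running job to Fargate can look like, again as a hedged CDK sketch: the state machine starts a container and simply waits for it to finish. The container image name is a placeholder.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import { Construct } from 'constructs';

export class LongJobStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const cluster = new ecs.Cluster(this, 'JobCluster'); // creates a default VPC
    const taskDef = new ecs.FargateTaskDefinition(this, 'JobDef', {
      cpu: 256,
      memoryLimitMiB: 512,
    });
    taskDef.addContainer('job', {
      image: ecs.ContainerImage.fromRegistry('my-org/batch-job'), // placeholder image
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'job' }),
    });

    // RUN_JOB (".sync") means the state machine waits for the container to
    // finish -- no fourteen-minute Lambda loops re-invoking themselves.
    const runJob = new tasks.EcsRunTask(this, 'RunJob', {
      integrationPattern: sfn.IntegrationPattern.RUN_JOB,
      cluster,
      taskDefinition: taskDef,
      launchTarget: new tasks.EcsFargateLaunchTarget(),
    });

    new sfn.StateMachine(this, 'JobRunner', {
      definitionBody: sfn.DefinitionBody.fromChainable(runJob),
    });
  }
}
```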
I think it’s worth taking an extra minute here to break down a discussion that seems to be pretty hot in the serverless community right now, which is how much code you should try to cram into VTL templates versus Lambda functions. If you don’t know what VTL is, it’s a templating language that ships, to some extent, with both API Gateway and AppSync. It’s really designed to let you tweak the request or response body of events that you pass through to a back-end service. It has conditionals and loops, though, so you’ll see people using it to do some fairly complex logic, sometimes in lieu of a Lambda function altogether. The problem is that it’s hard to write and debug, way worse than Python or JavaScript or whatever you’re running on Lambda. (Though I will point out that, in a backwards way, writing VTL enforces what I would consider to be better hygiene for testing a serverless app, which is running end-to-end tests as opposed to unit tests. That’s a whole separate discussion, which I recently broke down in a blog post over at dev.to.) So the question is, at what point does the jankiness of the templating code outweigh the value of not having to package and ship a Lambda function? The answer is going to be team-dependent. My personal rule of thumb is that you should stop writing VTL when it’s no longer obvious to others what the VTL is doing. I know some people try to avoid putting in any control statements at all, but I think those can make sense in the right context. It’s the deeply-nested loops that get too hard to read.
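To give you a feel for where that line sits, here’s a mapping template small enough to stay obvious, shown as the string you’d attach to an API Gateway integration’s requestTemplates. The GraphQL operation and field names are hypothetical.

```typescript
// A small, still-readable VTL request template: take an "id" from the query
// string and wrap it in a GraphQL query body. Operation/field names are
// hypothetical; $input.params and $util.escapeJavaScript are standard
// API Gateway mapping-template utilities.
export const getOrderRequestTemplate = `
{
  "query": "{ getOrder(id: \\"$util.escapeJavaScript($input.params('id'))\\") { status } }"
}
`;

// Attached via the integration options, e.g.:
//   options: { requestTemplates: { 'application/json': getOrderRequestTemplate } }
```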
The final smell I want to talk about today is more high-level, and it’s around ease of refactoring. The serverless space changes rapidly, and so do the best practices. The two or three missing features of AppSync that required me to plug in API Gateway? I don’t know the AppSync team’s roadmap, but odds are those gaps will be filled at some point in some way. And the cool thing about the design we envisioned is that, once a better option is available, it should be pretty easy for me to pull out API Gateway and plug my DNS records right into the awesome new AppSync custom domains feature. That might not be true if I had built my own custom proxy solution.
Look, the reality is that you’re not going to get all these architecture decisions right. I sure don’t. There’s simply too much to keep track of, and we’re all working with incomplete mental models of what the cloud can do at a given moment. But what you can do is optimize for ease of correction. Don’t entwine your functions and services with each other to the point where you’ve created a distributed monolith. Look up your service dependencies at runtime rather than making hardcoded assumptions, maybe using something like the new Cloud Map service that AWS has released for service discovery. Don’t force your clients to make assumptions about what kind of back end they’re talking to. You’re gonna have plenty of problems no matter what you build. Let them be problems you can fix without tearing everything down.
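As one illustration of that runtime lookup, here’s a sketch using the AWS SDK’s Cloud Map (service discovery) client. The namespace, service, and attribute names are assumptions about how you registered your instances, not anything standard.

```typescript
import {
  ServiceDiscoveryClient,
  DiscoverInstancesCommand,
} from '@aws-sdk/client-servicediscovery';

const client = new ServiceDiscoveryClient({});

// Resolve a dependency by name at runtime instead of hardcoding its URL.
// 'prod.internal', 'vendor-api', and the 'url' attribute are hypothetical --
// they depend on how instances were registered in Cloud Map.
export async function lookupVendorUrl(): Promise<string | undefined> {
  const out = await client.send(new DiscoverInstancesCommand({
    NamespaceName: 'prod.internal',
    ServiceName: 'vendor-api',
  }));
  return out.Instances?.[0]?.Attributes?.['url'];
}
```

Swapping the vendor’s endpoint, or splitting a service in two, then becomes a registry change rather than a redeploy of every caller.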
So, to sum up a lot of these smells, here’s the first thing I tend to look for when designing a serverless app: am I overusing functions? Do I have unnecessary Lambda in my design? How can I leverage service integrations to give myself features for free, because I am lazy and entitled in the best possible way? That gets you back to the Lego brick ideal. And the cool thing about Legos: unless you’re doing something really wrong, they don’t smell like anything at all.
And that’ll do it for today. If you have a question or topic you’d like us to address in a future episode, you can always reach out to Trek10 on Twitter @Trek10inc, or hit me up @forrestbrazeal, and we’ll see you on the next episode of Think FaaS.