
10 Lessons Learned from Start-ups on their AWS Journey

Lessons distilled from our work with clients so that you don't have to learn them the hard way.
Nessan Harpur, Trek10 | Apr 01 2020

In my day-to-day work, I've learned a thing or two that I think is worth passing on, in the hope that someone out there can learn from our collective past. All of these lessons come from start-ups on their journey in the AWS cloud, across industries including Fintech, SaaS, IoT, and E-Commerce.

Meeting Security Requirements with a Multi-Account Strategy

Problem: A fintech start-up was working with major financial institutions. These institutions had extensive security questionnaires, tightly controlled access patterns, and intensive user management practices. The start-up had a single AWS account with multiple users. Because of their own regulatory requirements, the financial institutions demanded a more granular security structure, including the use of MFA, VPN, strict logging requirements, and clearly defined access patterns.

Solution: To meet the demands of the financial institutions, we recommended a multi-account structure. Users were created only in one dedicated IAM account and assumed cross-account roles to work in the other accounts. The working accounts were production, dev, staging, and sandbox. There was also a logging account for CloudTrail and CloudWatch logs. The ‘shared services’ account contained CI/CD. The master-payer account was where the root user lived and where billing was managed. We also enforced MFA and VPN access for each user.
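As a minimal sketch of the cross-account pattern described above, the role in each working account can carry a trust policy that only admits principals from the identity account who signed in with MFA. The account ID, session name, and function names below are placeholders chosen for illustration; the `sts_client` argument is assumed to be a boto3 STS client (or any object exposing the same `assume_role` call).

```python
import json


def mfa_required_trust_policy(identity_account_id: str) -> dict:
    """Trust policy for a role in a working account.

    Principals from the identity account may assume the role,
    but only when their session was authenticated with MFA.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{identity_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
            }
        ],
    }


def assume_working_role(sts_client, role_arn: str, mfa_serial: str, mfa_code: str) -> dict:
    """Exchange identity-account credentials plus an MFA code for
    temporary credentials in a working account."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="cross-account-session",  # hypothetical session name
        SerialNumber=mfa_serial,
        TokenCode=mfa_code,
    )
    return response["Credentials"]


# Placeholder account ID, for illustration only.
policy = mfa_required_trust_policy("111111111111")
print(json.dumps(policy, indent=2))
```

Keeping users in one account and roles in the others means that access to production can be revoked by changing a single trust policy rather than deleting users account by account.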

Business Impact: The start-up was able to meet the requirements of the financial regulators by showing them this more granular permissions structure. This set-up reduced the blast radius of security incidents and availability issues, and it satisfied the financial institutions’ 87-page security questionnaire.

Reducing Spend by Optimizing Compute

Problem: An IoT start-up had a monolithic app that was powering their business. Leadership wanted to optimize the current platform while also preparing it for a migration 12 months down the road. The team was focused on building the new app and lost track of their current computing needs. Several Reserved Instance terms expired, and those instances moved to on-demand billing, drastically increasing costs.

Solution: We set out to optimize for cost and dive into their backup strategy. First, we tackled the EC2 issue. We used AWS Compute Optimizer to determine the right provisioning level for their instances. We then reviewed the EC2 dashboard and looked at the cost analysis in Cost Explorer. We reduced instance sizes where possible and then implemented a Savings Plan. We reviewed both EC2 Instance Savings Plans and Compute Savings Plans. Since this start-up was building a new platform, the savings needed to carry over to the new platform. For that reason, we chose the Compute Savings Plan. We enabled the savings to be shared by setting up AWS Organizations and linking each of the start-up's AWS accounts.

Business Impact: We carried out an analysis of the usage of compute, the cost of compute, and the potential cost-saving measures. We resized instances based on compute usage and implemented a Compute Savings Plan, which reduced the customer’s overall monthly AWS bill by 37.5%.

Building Awareness through Cost Visualization & Analysis

Problem: A SaaS start-up had no insight into their costs. The billing was managed by the CEO, who was inexperienced with AWS. The technical lead had left the company, and no one was managing the AWS account. We were tasked with providing useful cost information to the non-technical stakeholder.

Solution: First, we opened Cost Explorer and configured cost reports across services, accounts, AWS Marketplace, instances, utilization, and coverage. We provided dashboards to the CEO, which helped them understand their cost structure in the cloud. We also implemented AWS Organizations to consolidate billing across multiple accounts. Consolidated billing reduced the bill by 1% by combining storage tiers across multiple accounts and also simplified the billing process. We then presented the CEO with a ‘group by service’ graph, which helped them understand their cost structure at a more granular level.
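The ‘group by service’ view can also be pulled programmatically. The sketch below is one way to do it, assuming `ce_client` is a boto3 Cost Explorer client (or any object with a compatible `get_cost_and_usage` method); the function name is hypothetical, and pagination via `NextPageToken` is left out for brevity.

```python
from collections import defaultdict


def cost_by_service(ce_client, start: str, end: str) -> dict:
    """Return total unblended cost per service, largest first.

    `start` and `end` are ISO dates such as "2020-01-01"; the end date
    is exclusive. Only the first page of results is read in this sketch.
    """
    response = ce_client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    totals = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = group["Metrics"]["UnblendedCost"]["Amount"]  # returned as a string
            totals[service] += float(amount)
    # Sort so the most expensive services appear first in any report.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

With a real client this would be called as `cost_by_service(boto3.client("ce"), "2020-01-01", "2020-04-01")`, and the resulting dictionary can be handed to whatever charting tool the stakeholder prefers.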

Business Impact: The AWS bill for the start-up was simplified by implementing consolidated billing with AWS Organizations across multiple accounts. This also reduced the monthly bill by 1% by combining storage tiers across the accounts. The CEO was provided with a dashboard for their AWS accounts and could determine expenditure at a per-service level.

Leveraging AMIs and Managed Services to Reduce Operational Burden

Problem: An e-commerce start-up decided to move their databases from on-prem to the cloud. They had been updating and patching their SQL servers in-house. When their data center hardware came due for an upgrade, they decided to move to the cloud instead of renewing the data center.

Solution: We led a design session with the start-up to understand their needs for the cloud. We determined that they could leverage managed services and use Amazon Machine Images (AMIs). We built an AMI and configured it with their desired monitoring solution to produce a golden image. For the database, we used Amazon RDS, a managed database service. This removed the time spent on patching and updating SQL Server instances. RDS also provides metrics that can be easily analyzed in Amazon CloudWatch.

Business Impact: This start-up was initially nervous about cloud adoption. We completed a design session to provide them with a roadmap for their cloud journey. We implemented AMIs and RDS, which reduced their time to deployment by 80%, relieved their operational burden, and automatically fed key metrics into their monitoring solution.

Reducing Onboarding Friction and Simplifying User Management with AWS SSO

Problem: A fintech start-up was having user management issues with Active Directory. The technical leadership had committed to developing and deploying on AWS. They wanted their users to be able to access their AWS account, but they did not want to create a second identity pool for these users. The company was growing and experienced issues with onboarding new employees.

Solution: We analyzed different methods of solving their user management problems. We decided to use Active Directory as the identity source for SSO. This allowed the customer's users to sign in with their current identities. AWS SSO vends the credentials to the user’s browser, so there was no need for a second identity pool. The tech team at this start-up could use role permissions via SSO without creating users. The roles provide credentials that last between 1 and 12 hours. After that period, the session expires and the user has to generate new credentials.
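The 1 to 12 hour credential lifetime is set per permission set as an ISO 8601 duration. The following is a small sketch of how that could be expressed, assuming `sso_admin_client` is a boto3 `sso-admin` client; the permission set name and both function names are hypothetical.

```python
def session_duration(hours: int) -> str:
    """Return an ISO 8601 duration such as 'PT8H' for a 1 to 12 hour session."""
    if not 1 <= hours <= 12:
        raise ValueError("session length must be between 1 and 12 hours")
    return f"PT{hours}H"


def create_engineer_permission_set(sso_admin_client, instance_arn: str, hours: int = 8):
    """Create a permission set whose vended credentials expire after `hours`."""
    return sso_admin_client.create_permission_set(
        InstanceArn=instance_arn,
        Name="EngineerAccess",  # hypothetical permission set name
        SessionDuration=session_duration(hours),
    )
```

Shorter sessions limit how long a leaked credential is useful, at the cost of users re-authenticating more often.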

Business Impact: The start-up’s users could log into AWS easily. They did not need to create a second identity pool, which simplified their user management. They had a single source of truth for their employees' identities, and their employees accessed AWS through a user-friendly interface. We reduced the onboarding and offboarding time by ~5 hours per new employee.

Optimizing Data Transfer for Reduced Latency and Cost-Savings

Problem: An IoT start-up was running their app on AWS. The app was built on multiple EC2 instances inside a VPC. There were databases in a private subnet. Snapshots and backups were sent to S3 over the internet, causing latency issues, data transfer costs, and a growing S3 expense that the customer could not explain.

Solution: We recommended a VPC endpoint for S3 to solve some of these problems. We created an endpoint that allowed the data to reach S3 within the AWS network. This reduced latency and removed the cost of data transfer over the internet. The second issue was the growing S3 cost. The daily cost of S3 was increasing by several dollars, which amounts to tens of thousands of dollars over the course of a year. As it happened, the start-up had improperly configured their S3 lifecycle policy with a prefix that was supposed to be deleted. We removed the prefix and set the lifecycle policy up correctly.
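To illustrate both fixes, here is a hedged sketch. The first function creates a gateway endpoint for S3, assuming `ec2_client` is a boto3 EC2 client and that the region and IDs are supplied by the caller. The rest models one plausible reading of the lifecycle problem: a rule whose prefix filter does not match the backup keys never expires them. The rule IDs, prefixes, and object key are invented for illustration and are not taken from the client's configuration.

```python
def create_s3_gateway_endpoint(ec2_client, vpc_id, route_table_ids, region="us-east-1"):
    """Route S3 traffic from the VPC over the AWS network instead of the internet."""
    return ec2_client.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId=vpc_id,
        ServiceName=f"com.amazonaws.{region}.s3",
        RouteTableIds=route_table_ids,
    )


def expiration_rule(rule_id: str, days: int, prefix: str = "") -> dict:
    """Build one lifecycle rule; an empty prefix applies to every object."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Expiration": {"Days": days},
    }


def rule_applies(rule: dict, key: str) -> bool:
    """True when an enabled rule's prefix filter matches the object key."""
    if rule.get("Status") != "Enabled":
        return False
    return key.startswith(rule.get("Filter", {}).get("Prefix", ""))


backup_key = "snapshots/db-2020-03-01.bak"                         # hypothetical key
stale_rule = expiration_rule("expire-backups", 30, prefix="tmp/")   # stray prefix
fixed_rule = expiration_rule("expire-backups", 30)                  # prefix removed

print(rule_applies(stale_rule, backup_key))
print(rule_applies(fixed_rule, backup_key))
```

A corrected rule like `fixed_rule` would be applied with the S3 client's `put_bucket_lifecycle_configuration`, passing `{"Rules": [fixed_rule]}` as the `LifecycleConfiguration`.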

Business Impact: We created an S3 endpoint in the VPC, which reduced latency, removed the cost of data transfer, and improved this start-up’s security. The second item was more severe than the first in terms of impact on cost. By correctly configuring the S3 lifecycle policy, we reduced the client’s AWS bill by ~$40k per year.

Improving Visibility and Reducing Time to Scale with AWS Native Tooling

Problem: An e-commerce company was using multiple monitoring solutions based outside of AWS. The tools were installed at runtime, which resulted in slower deploys and slower scaling. The DevOps leader had very little insight into the metrics of their application, limited ability to predict cost, and was overwhelmed by managing multiple monitoring dashboards.

Solution: We first created Amazon Machine Images (AMIs) of the e-commerce application. We then implemented a golden AMI pipeline that added the necessary packages and agents ahead of runtime. The trade-off with the golden AMI pipeline was that periodic updates were required. We also discovered that the monitoring tools were not feeding into a single dashboard.

We developed a proof of concept using the AWS native tooling of CloudWatch and X-Ray traces. We fed metrics from these and other services into a single destination. This central location provided the DevOps lead with all the monitoring data required. The alert definitions were previously spread across tools that were hard to manage. The metrics provided insight into the application’s performance, allowed the DevOps lead to predict cost, and acted as a single source of truth for the application.

Business Impact: We reduced the time to deploy and scale the application to seconds instead of minutes. The DevOps leader had a dashboard that was a single source of truth, and could more effectively manage alerting and predict costs within their environment. They were also able to reduce the number of third-party tools by using AWS native tooling.

Reducing Downtime and Accelerating Updates with Blue-Green Deployments

Problem: An e-commerce company was launching a new app with significant traffic. The technical team was concerned about the app crashing under high traffic, since each minute of downtime cost between $6,000 and $12,000. They were also concerned that ongoing maintenance would be complicated.

Solution: We implemented Blue-Green deployments to manage the traffic issues better. This technique reduced downtime by running two identical production environments called Blue and Green. Only one of the environments was live at any given time. The technical team developed their new software version and tested it in Blue. Once ready, they switched from Green to Blue, which left the Green deployment idle. This technique simplified rollbacks. The environments were copied to one another to reduce the risk of configuration drift. The process was as follows:

  • Create a clone (Blue) of the existing (Green) environment
  • Deploy the new application version
  • Scale Blue to match the instance number of Green
  • Run a smoke test against Blue to ensure the app is healthy
  • Swap the environment URLs
  • Monitor success via monitoring tools or CloudWatch
  • After 1 hour, once traffic to the old (Green) environment has dropped to 0 requests, delete it.
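The steps above can be sketched as a small orchestration routine. This is an illustration of the sequence only: `InMemoryPlatform` is a hypothetical stand-in that records state in memory, and its method names are assumptions rather than any real deployment API. In practice each method would wrap the calls of whichever service hosts the environments.

```python
class InMemoryPlatform:
    """Hypothetical stand-in for a deployment API; it only tracks state."""

    def __init__(self):
        self.envs = {"green": {"version": "v1", "instances": 4}}
        self.live = "green"

    def clone(self, source, target):
        self.envs[target] = dict(self.envs[source])

    def deploy(self, env, version):
        self.envs[env]["version"] = version

    def instance_count(self, env):
        return self.envs[env]["instances"]

    def scale(self, env, instances):
        self.envs[env]["instances"] = instances

    def smoke_test(self, env):
        return self.envs[env]["version"] is not None

    def swap_urls(self, old, new):
        self.live = new

    def requests_last_hour(self, env):
        return 1000 if env == self.live else 0

    def delete(self, env):
        del self.envs[env]


def blue_green_deploy(platform, version, old="green", new="blue"):
    """Run the cutover steps and return the name of the live environment."""
    platform.clone(source=old, target=new)
    platform.deploy(new, version)
    platform.scale(new, platform.instance_count(old))
    if not platform.smoke_test(new):
        platform.delete(new)  # abandon the release; the old environment keeps serving
        return old
    platform.swap_urls(old, new)
    return new


def retire_old(platform, old):
    """Delete the old environment only once it receives no traffic."""
    if platform.requests_last_hour(old) == 0:
        platform.delete(old)
        return True
    return False


platform = InMemoryPlatform()
live = blue_green_deploy(platform, "v2")
retired = retire_old(platform, "green")
print(live, retired, sorted(platform.envs))
```

Because the old environment is left untouched until `retire_old` runs, a rollback during the first hour is just another URL swap.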

Business Impact: We reduced the application downtime by 30-60 minutes per month, which translated to $180-360k in downtime savings per month. We implemented a process where the environment URLs could be switched quickly, simplified rollbacks, and reduced the risk of configuration drift.
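The savings range can be reproduced from the figures quoted earlier. This short sketch assumes the lower per-minute downtime cost of $6,000, which is the rate that yields the quoted $180-360k; the function name is illustrative.

```python
def downtime_savings(minutes_avoided: int, cost_per_minute: int) -> int:
    """Monthly savings in dollars from avoided downtime."""
    return minutes_avoided * cost_per_minute


COST_PER_MINUTE = 6_000  # lower bound of the quoted per-minute cost (assumption)

low = downtime_savings(30, COST_PER_MINUTE)
high = downtime_savings(60, COST_PER_MINUTE)
print(f"${low:,} to ${high:,} per month")
```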

Using Infrastructure as Code with CI/CD to Meet Launch Deadlines

Problem: A fintech start-up was preparing for launch. An off-shore team had developed their app. For security and compliance reasons, the off-shore team was not allowed to have access to the production environment. The client therefore needed the off-shore team to deploy the infrastructure without doing manual provisioning in the console.

Solution: We worked with the start-up to implement an Infrastructure as Code (IaC) methodology. We defined the infrastructure in CloudFormation templates. We then built a CI/CD pipeline that allowed the off-shore team to deploy to the production account. A ‘commit’ to source control triggered the pipeline, and the CI/CD then deployed to the production account.
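As a rough sketch of what the pipeline's deploy step might look like, the code below builds a minimal CloudFormation template and submits it through a client. The bucket resource, logical ID, and function names are illustrative assumptions, not the start-up's actual infrastructure; `cfn_client` is assumed to be a boto3 CloudFormation client running under the pipeline's own role, so developers never need production credentials.

```python
import json


def bucket_template(logical_id: str = "ArtifactBucket") -> dict:
    """Minimal CloudFormation template with one versioned S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example stack deployed by the CI/CD pipeline",
        "Resources": {
            logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


def deploy_stack(cfn_client, stack_name: str, template: dict):
    """Pipeline step: create the stack from the committed template.

    create_stack fails if the stack already exists; a fuller pipeline
    would fall back to update_stack or use change sets.
    """
    return cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=json.dumps(template),
    )
```

Since the template lives in source control, every infrastructure change is reviewed and logged as a commit before the pipeline applies it.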

Business Impact: The start-up met their launch deadline. The off-shore team contributed and were in compliance. The off-shore team continued to build and deploy infrastructure via CI/CD pipelines. This implementation increased speed to deployment, satisfied compliance needs, and reduced operational overhead by 30% using off-shore resources.

Improving Security Posture by Auditing S3 Objects

Problem: A SaaS start-up processed 1 million video files from consumers every year. The video files were encoded and served back to the customer. The encoded files lived in S3 buckets. The buckets had consistent policies, but the objects had variable permissions. This created a security risk. The consumers accessed the video via S3 URLs, which were publicly accessible.

Solution: We used internal tooling to do an S3 object audit and uncovered the publicly accessible objects. The consumers still needed access to the files served by S3. We implemented pre-signed URLs, which delivered the content to the customers while ensuring a higher level of security for the files.
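The internal audit tooling is not shown here, but the two ideas can be sketched as follows. The first function checks an object ACL (in the shape returned by the S3 `get_object_acl` call) for a grant to the global AllUsers group; the second issues a time-limited link. `s3_client` is assumed to be a boto3 S3 client, and the function names and one-hour default are illustrative choices.

```python
ALL_USERS_URI = "http://acs.amazonaws.com/groups/global/AllUsers"


def is_publicly_readable(acl: dict) -> bool:
    """True when the ACL grants read access to everyone on the internet."""
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("URI") == ALL_USERS_URI and grant.get("Permission") in (
            "READ",
            "FULL_CONTROL",
        ):
            return True
    return False


def presigned_video_url(s3_client, bucket: str, key: str, expires_in: int = 3600) -> str:
    """Return a link to a private object that stops working after `expires_in` seconds."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

With pre-signed URLs the objects themselves can stay private, so a link that leaks is only useful until it expires.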

Business Impact: We added in provisions to the data access patterns, which improved the security posture of this SaaS start-up. We reduced their exposure and kept their customers happy receiving the same service. The SaaS start-up continued growing and serving >1 million video files per year in a safe and secure environment.

I hope the above case studies were useful. If you’d like to chat with me about any of the stories or learn more, feel free to email me or book some time on my calendar.