
Real-World Kubernetes Deployments Part 2 - Cloud Native CI/CD

Experimenting with a Docker container to research the self-healing capabilities of Kubernetes to help manage issues encountered during the pod lifecycle.
Ryan Wendel | Jan 11 2022

Hey, everybody! Welcome back to the second post in my Kubernetes deployment series.

The links to all three posts in this series are:

Real-World Kubernetes Deployments Part 1 - Cloud Native CI/CD

Real-World Kubernetes Deployments Part 2 - Cloud Native CI/CD

Real-World Kubernetes Deployments Part 3 - Cloud Native CI/CD

In my previous post, we investigated various Kubernetes manifest directives that help manage containers throughout the pod lifecycle, looking at how Kubernetes performs a rolling update and verifies pod health status in order to self-heal deployments when issues are encountered.

While that post explored the pod deployment process, we never purposely generated any issues, such as containers crashing. Kubernetes will self-heal deployments when containers produce errors or crash. As such, the second post in this trilogy features a container image I created that lets you do the following:

  • Customize what is returned in the response to a GET request made to the apex route.
  • Pause for a specific amount of time before starting a web server.
  • Crash with a certain frequency on launch.
  • Crash when a certain URL path is accessed.
  • Output health status when a certain URL path is accessed.
  • Cause the health status to periodically show a problem.
  • Resolve and make a request to a Kubernetes service hostname.

All of these features should be pretty self-explanatory aside from the last. I’ve been researching cross-availability-zone traffic within Kubernetes and wanted a way to determine how the Kubernetes proxy handles requests to services. I’ll explore cross-AZ traffic in a later post.

The container files used for this image can be found at the following GitHub repo:

https://github.com/trek10inc/probeserver

With all of that said, let’s look at the features built into this container and how we’ll go about using them!

Customize apex route response content

For the purposes of this blog post series, configuring what the application server includes in its response provides a way to distinguish between different commits being deployed to a cluster. Configuring the apex route’s response simply requires setting an environment variable.
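
The actual probeserver source isn’t reproduced here, but the idea can be sketched with Python’s standard library. The CONTENT variable name matches the manifests later in this post; everything else in this snippet is hypothetical:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

def apex_body() -> bytes:
    # The body for GET / comes from the CONTENT environment variable,
    # falling back to a simple default when it is unset.
    return os.environ.get("CONTENT", '{ "team": "none" }').encode()

class ApexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            body = apex_body()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# Starting the server would then look like:
#   HTTPServer(("", 80), ApexHandler).serve_forever()
```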

Delaying application server startup

Configuring the container to pause before starting an application server is meant to simulate the behavior of a production container taking a few seconds before being able to accept requests. This feature is also configured by setting an environment variable.
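
A minimal sketch of this behavior, assuming the START_WAIT_SECS variable name used in the manifests below (this is not the actual probeserver code):

```python
import os
import time

def startup_delay() -> float:
    # Seconds to pause before the application server starts listening,
    # read from START_WAIT_SECS (0 when unset).
    return float(os.environ.get("START_WAIT_SECS", "0"))

def wait_for_startup() -> None:
    # Called once at container start, before binding the listen socket.
    # This is what keeps a pod Running but not yet Ready for a while.
    time.sleep(startup_delay())
```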

Crash on launch

Configuring the container to crash with a certain frequency is meant to simulate a buggy container that made its way into a production environment. Definitely not unheard of in today’s fast-paced agile development environments. Again, this feature is configured by setting an environment variable.
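
Assuming CRASH_FACTOR holds a percentage, as in the manifests below, the launch-time crash might be sketched like this (hypothetical code, not the real probeserver):

```python
import os
import random
import sys

def should_crash(crash_factor: float) -> bool:
    # Crash when a uniform draw in [0, 100) falls below the factor,
    # so CRASH_FACTOR=40 means roughly a 40% chance per launch.
    return random.uniform(0, 100) < crash_factor

def maybe_crash_on_launch() -> None:
    factor = float(os.environ.get("CRASH_FACTOR", "0"))
    if should_crash(factor):
        # A non-zero exit code makes Kubernetes restart the container.
        sys.exit(1)
```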

Crash on URL path

I wanted a way to crash the container at any point after the application server has started, the thought being that I might eventually get into some chaos testing for future POCs. Accessing the ‘/crash’ endpoint of the application server achieves this result.

Health status output

Kubernetes liveness and readiness probes require some form of health check. To simulate a (healthy) web application’s health check, I configured the container’s application server to return a 200 status code and a brief JSON object when a request is made to the ‘/healthz’ endpoint.

Alter health status output

I wanted a way to simulate a bad health check within each pod. As such, I worked a mechanism similar to the one used to crash the container outright into the health-check endpoint, causing the container’s application server to randomly return 500 status codes. This feature is also configured by setting an environment variable.
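
A sketch of how a percentage-driven unhealthy response could work, using the HEALTH_STATUS_FACTOR variable name that appears in the second manifest below (again hypothetical, not the actual probeserver source):

```python
import os
import random

def health_response(health_factor: float):
    # Return the unhealthy 500/"error" pair with roughly health_factor
    # percent probability per request; otherwise the healthy 200/"ok"
    # pair shown earlier in this post.
    if random.uniform(0, 100) < health_factor:
        return 500, '{ "status": "error" }'
    return 200, '{ "status": "ok" }'

# The factor itself would be read once at startup:
FACTOR = float(os.environ.get("HEALTH_STATUS_FACTOR", "0"))
```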

Resolve and request a Kubernetes service

Again, this feature is meant for researching Kubernetes proxy behavior and its effects on cross-availability zone traffic routing. It won’t be used in this post. Regardless, this feature takes in a hostname, resolves it, makes an HTTP (note: not secure HTTPS) request, and then outputs some information about the container servicing the request, the IP address of the resolved hostname, and the response provided by the remote service. A good way to test this is to run it against “checkip.dyndns.org”.

This ends up looking like the following when run.

$ curl http://127.0.0.1/resolve?service=checkip.dyndns.org
----------------------------------------------------
Hostname = 1bc526ba48ce
Ip Address = 127.0.0.1
Service hostname = checkip.dyndns.org
Service IP = 158.101.44.242
----------------------------------------------------
<html><head><title>Current IP Check</title></head><body>Current IP Address: 174.51.70.14</body></html>
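
A rough sketch of what the ‘/resolve’ handler has to do, using only the standard library (the helper names here are made up for illustration; the real output also includes the container’s own IP address):

```python
import socket
import urllib.request

def resolve(service: str) -> str:
    # DNS-resolve the service hostname to an IPv4 address. Inside a
    # cluster this goes through the pod's configured DNS resolver.
    return socket.gethostbyname(service)

def fetch(service: str) -> bytes:
    # Plain HTTP (not HTTPS), as noted above.
    return urllib.request.urlopen(f"http://{service}/", timeout=5).read()

def report(service: str) -> str:
    # Assemble the banner shown in the sample output above.
    lines = [
        "-" * 52,
        f"Hostname = {socket.gethostname()}",
        f"Service hostname = {service}",
        f"Service IP = {resolve(service)}",
        "-" * 52,
    ]
    return "\n".join(lines)
```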

Putting all of this together with what we learned from the previous post yields the following Kubernetes manifest. Please note that altering the health status returned by the application server is not included in this manifest. To keep things simple(ish), we’ll work through that later.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: prober-foo
  namespace: default
  labels:
    app: prober-foo
    deployment: foo
    env: prod
spec:
  progressDeadlineSeconds: 60
  replicas: 3
  selector:
    matchLabels:
      app: prober-foo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 30%
      maxSurge: 1
  template:
    metadata:
      labels:
        app: prober-foo
        deployment: foo
        env: prod
        version: 1.0.1
    spec:
      containers:
      - image: "public.ecr.aws/i4a3l2a7/probeserver:latest"
        name: prober-foo
        env:
          - name: START_WAIT_SECS
            value: '15'
          - name: CRASH_FACTOR
            value: '40'
          - name: CONTENT
            value: '{ "team": "foo", "version": "1.0.1" }'
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 20
          successThreshold: 1
          failureThreshold: 1
        readinessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 5
          successThreshold: 1
          failureThreshold: 1

You’ll notice that I’ve placed my container image in AWS’ public container registry. It currently lives at “public.ecr.aws/i4a3l2a7/probeserver:latest”. Be advised this may not be the case at some point in the future.

You may also notice that I’ve set the following three environment variables in the deployment manifest.

  • CONTENT - The apex route will return '{ "team": "foo", "version": "1.0.1" }' in its response body.
  • START_WAIT_SECS - The application server will wait 15 seconds before starting.
  • CRASH_FACTOR - The container will crash 40% of the time on launch.

Additionally, we’ll accompany this deployment manifest with a service so we can access the application being launched from listening ports on each of the worker nodes. We’ll use the following to do so.

apiVersion: v1
kind: Service
metadata:
  name: foo-nodeport-svc
  labels:
    app: prober-foo
    deployment: foo
    env: prod
    version: 1.0.1
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
      nodePort: 30080
  selector:
    deployment: foo
    env: prod
  type: NodePort

After applying the manifest to a Kubernetes cluster and periodically checking the deployment’s pod statuses, you will initially see output like the following.

$ kubectl get pods
NAME                          READY   STATUS              RESTARTS   AGE
prober-foo-67b66bdbf7-4bxjb   0/1     ContainerCreating   0          1s
prober-foo-67b66bdbf7-7dnzd   0/1     ContainerCreating   0          1s
prober-foo-67b66bdbf7-hw78w   0/1     ContainerCreating   0          1s

We know the application server won’t start immediately (due to our START_WAIT_SECS environment variable), so we expect to see the pods running but not yet in the ready state. Something like the following should be seen when checking pod statuses.

NAME                          READY   STATUS    RESTARTS   AGE
prober-foo-67b66bdbf7-4bxjb   0/1     Running   0          4s
prober-foo-67b66bdbf7-7dnzd   0/1     Running   0          4s
prober-foo-67b66bdbf7-hw78w   0/1     Running   0          4s

Looking closely at the “RESTARTS” column, you will see some of the pods restarting. This is where the CRASH_FACTOR environment variable causes the container to fail and Kubernetes restarts it.

NAME                          READY   STATUS    RESTARTS     AGE
prober-foo-67b66bdbf7-4bxjb   0/1     Running   0            19s
prober-foo-67b66bdbf7-7dnzd   0/1     Running   0            19s
prober-foo-67b66bdbf7-hw78w   0/1     Running   1 (1s ago)   19s

Checking pod status again, you may see some of the pods in a crash loop back-off (the “CrashLoopBackOff” status). This is where Kubernetes applies an exponential back-off delay when restarting failed containers. As the container won’t fail to launch 100% of the time, each pod should eventually achieve a ready state.
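
Kubernetes documents this back-off as starting at roughly ten seconds, doubling after each restart, and capping at five minutes (it resets once a container runs cleanly for ten minutes). A quick sketch of the resulting delays:

```python
def backoff_delays(restarts: int, base: float = 10.0, cap: float = 300.0):
    # Approximate restart delays the kubelet applies to a crashing
    # container: base * 2^n per restart, capped at five minutes.
    return [min(base * 2 ** i, cap) for i in range(restarts)]
```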

NAME                          READY   STATUS             RESTARTS      AGE
prober-foo-67b66bdbf7-4bxjb   0/1     CrashLoopBackOff   2 (4s ago)    70s
prober-foo-67b66bdbf7-7dnzd   1/1     Running            1 (53s ago)   70s
prober-foo-67b66bdbf7-hw78w   0/1     Running            2 (36s ago)   70s

And after a few short minutes, you should see all of the pods successfully launched and in the “ready” state.

NAME                          READY   STATUS    RESTARTS       AGE
prober-foo-67b66bdbf7-4bxjb   1/1     Running   3 (33s ago)    1m44s
prober-foo-67b66bdbf7-7dnzd   1/1     Running   1 (2m8s ago)   1m44s
prober-foo-67b66bdbf7-hw78w   1/1     Running   3 (38s ago)    1m44s

Accessing the application’s apex route via its service port on each worker node yields the following.

$ curl http://192.168.0.231:30080
{ "team": "foo", "version": "1.0.1" }

$ curl http://192.168.0.232:30080
{ "team": "foo", "version": "1.0.1" }

Accessing the application’s “/healthz” route also provides what we expected.

$ curl http://192.168.0.231:30080/healthz
{ "status": "ok" }

$ curl http://192.168.0.232:30080/healthz
{ "status": "ok" }

Now that we’ve verified the deployment was successful and operating correctly, let’s crash one of the pods and watch it restart.

$ curl http://192.168.0.232:30080/crash
curl: (52) Empty reply from server

$ kubectl get pods
NAME                          READY   STATUS     RESTARTS       AGE
prober-foo-67b66bdbf7-4bxjb   1/1     Running    3 (33s ago)    5m30s
prober-foo-67b66bdbf7-7dnzd   1/1     Running    1 (2m8s ago)   5m30s
prober-foo-67b66bdbf7-hw78w   0/1     Completed  4 (1s ago)     5m30s

Checking events shows us that the readiness probe failed for the “prober-foo-67b66bdbf7-hw78w” pod, prompting Kubernetes to pull the container image in order to restart it.

$ kubectl get events --sort-by=.metadata.creationTimestamp | grep prober-foo | grep -e Unhealthy -e Pulled
16s         Warning   Unhealthy                pod/prober-foo-67b66bdbf7-hw78w    Readiness probe failed: Get "http://10.44.0.3:80/healthz": dial tcp 10.44.0.3:80: connect: connection refused
22s         Normal    Pulled                   pod/prober-foo-67b66bdbf7-hw78w    Successfully pulled image "public.ecr.aws/i4a3l2a7/probeserver:latest" in 781.431535ms

Waiting about a minute shows us that the crashed pod has recovered. Note the incremented value in the RESTARTS column.

NAME                          READY   STATUS    RESTARTS       AGE
prober-foo-67b66bdbf7-4bxjb   1/1     Running   3 (33s ago)    6m27s
prober-foo-67b66bdbf7-7dnzd   1/1     Running   1 (2m8s ago)   6m27s
prober-foo-67b66bdbf7-hw78w   1/1     Running   4 (58s ago)    6m27s

Great success! Everything went according to plan. We witnessed the following behavior exhibited by the test container during this short exercise.

  • The application server took a few seconds to start up.
  • A few of the containers crashed on launch.
  • The customized content returned in the response for the apex route is what we expected.
  • The hard-coded content returned in the response for the health-check route is what we expected.
  • We were able to arbitrarily crash a pod and witness Kubernetes recover it.

As promised, we’ll now look at altering the health status returned by the application server, verifying that Kubernetes will self-heal pods failing liveness probes by restarting them.

We’ll use the deployment manifest below for this exercise. The following changes were introduced into the original manifest:

  • The CRASH_FACTOR environment variable was removed and replaced with HEALTH_STATUS_FACTOR. The health status will return a 500 status code 40% of the time.
  • The CONTENT environment variable was updated to reflect that this version of the app is now set to 1.0.2.
  • The liveness probe had the “periodSeconds” directive added to it to force the health check to execute every 3 seconds.
  • The liveness probe had its “failureThreshold” directive set to 2 to reduce the number of container restarts caused by the alteration of health statuses introduced by the HEALTH_STATUS_FACTOR environment variable.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: prober-foo
  namespace: default
  labels:
    app: prober-foo
    deployment: foo
    env: prod
spec:
  progressDeadlineSeconds: 60
  replicas: 3
  selector:
    matchLabels:
      app: prober-foo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 30%
      maxSurge: 1
  template:
    metadata:
      labels:
        app: prober-foo
        deployment: foo
        env: prod
        version: 1.0.2
    spec:
      containers:
      - image: "public.ecr.aws/i4a3l2a7/probeserver:latest"
        name: prober-foo
        env:
          - name: START_WAIT_SECS
            value: '15'
          - name: HEALTH_STATUS_FACTOR
            value: '40'
          - name: CONTENT
            value: '{ "team": "foo", "version": "1.0.2" }'
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 20
          successThreshold: 1
          failureThreshold: 2
          periodSeconds: 3
        readinessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 5
          successThreshold: 1
          failureThreshold: 1
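
As a back-of-the-envelope check on that failureThreshold choice: if each probe fails independently with the configured 40% chance, a restart now requires two consecutive failures, which is considerably less likely than a single failure:

```python
def consecutive_failure_chance(p: float, threshold: int) -> float:
    # Probability that `threshold` consecutive liveness probes all
    # fail, when each probe fails independently with probability p.
    return p ** threshold

# HEALTH_STATUS_FACTOR=40 with failureThreshold=2 means any given
# pair of back-to-back probes forces a restart with probability:
restart_chance = consecutive_failure_chance(0.40, 2)  # ≈ 0.16
```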

Applying this updated manifest will force a rolling update for the deployment. Once the update has been completed we will look at pod status again to see if we witness any restarts.

Sure enough, we’re seeing all of the pods periodically restarting.

$ kubectl get pods
NAME                          READY   STATUS    RESTARTS      AGE
prober-foo-764cf9f454-5pqcx   1/1     Running   1 (88s ago)   2m43s
prober-foo-764cf9f454-8fsnc   0/1     Running   3 (3s ago)    3m3s
prober-foo-764cf9f454-8gjl4   0/1     Running   2 (10s ago)   2m13s

Making requests to the application’s “/healthz” endpoint confirms that it is not always returning healthy status codes.

$ while true; do curl http://192.168.0.231:30080/healthz; sleep 1; done
{ "status": "ok" }
{ "status": "ok" }
{ "status": "error" }
{ "status": "ok" }
{ "status": "ok" }
{ "status": "error" }
{ "status": "ok" }
{ "status": "ok" }
{ "status": "error" }

Searching through events shows us what was happening under the hood.

$ kubectl get events --sort-by=.metadata.creationTimestamp | grep prober-foo | grep -e Unhealthy -e Killing
2m24s       Warning   Unhealthy                pod/prober-foo-764cf9f454-5pqcx    Liveness probe failed: HTTP probe failed with statuscode: 500
3m3s        Warning   Unhealthy                pod/prober-foo-764cf9f454-8fsnc    Liveness probe failed: HTTP probe failed with statuscode: 500
3m7s        Normal    Killing                  pod/prober-foo-764cf9f454-8fsnc    Container prober-foo failed liveness probe, will be restarted
3m3s        Normal    Killing                  pod/prober-foo-764cf9f454-5pqcx    Container prober-foo failed liveness probe, will be restarted

Kubernetes saw liveness probes fail and then restarted the unhealthy pods. Exactly what we were looking to see happen!

That about wraps up this blog post. I was able to create a container that lets me experiment with the Kubernetes pod lifecycle and understand how Kubernetes manages it. I’m pretty happy with the outcome and hope that you’ll find value in it for your POCs.

So once again, thanks for hanging out with me for a bit! Stay tuned for the next installment of this series, where we’ll work on creating an AWS cloud-native CI/CD pipeline to facilitate deployments to an EKS cluster.

Author
Ryan Wendel