
Real-World Kubernetes Deployments Part 2 - Cloud Native CI/CD

Experimenting with a Docker container to research how Kubernetes' self-healing capabilities help manage issues encountered during the pod lifecycle.
Ryan Wendel | Jan 11 2022

Hey, everybody! Welcome back to the second post in my Kubernetes deployment series.

The links to all three posts in this series are:

Real-World Kubernetes Deployments Part 1 - Cloud Native CI/CD

Real-World Kubernetes Deployments Part 2 - Cloud Native CI/CD

Real-World Kubernetes Deployments Part 3 - Cloud Native CI/CD

In my previous post, we investigated various Kubernetes manifest directives that help manage containers throughout the pod lifecycle, and we looked at how Kubernetes manages a rolling update and verifies pod health in order to self-heal deployments when issues are encountered.

While that post explored the pod deployment process, we never purposely generated any issues, such as containers crashing. Kubernetes will self-heal deployments when containers produce errors or crash, so this second post in the trilogy features a container image I created that does the following:

  • Allows you to customize what is returned in response to a GET request to the apex route.
  • Pauses for a configurable amount of time before starting its web server.
  • Crashes with a configurable frequency on launch.
  • Crashes when a certain URL path is accessed.
  • Outputs its health status when a certain URL path is accessed.
  • Can be configured to periodically report a problem in its health status.
  • Resolves and makes a request to a Kubernetes service hostname.

All of these features should be pretty self-explanatory aside from the last one. I’ve been researching cross-availability-zone traffic within Kubernetes and wanted a way to determine how the Kubernetes proxy handles requests to services. I’ll explore cross-AZ traffic in another post at a later date.

The container files used for this image can be found at the following GitHub repo:

https://github.com/trek10inc/probeserver

With all of that said, let’s look at the features built into this container and how we’ll go about using them!

Customize apex route response content

For the purposes of this blog post series, customizing what the application server includes in its response provides a way to distinguish between different commits being deployed to a cluster. Configuring the apex route's response simply requires setting an environment variable.
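
If you want to try this out locally before bringing Kubernetes into the picture, a minimal sketch looks like the following. It assumes Docker is installed and that the container's web server listens on port 80 (as the manifests later in this post suggest); the host port 8080 is arbitrary.

$ docker run --rm -d -p 8080:80 \
    -e CONTENT='{ "team": "foo", "version": "1.0.1" }' \
    public.ecr.aws/i4a3l2a7/probeserver:latest

$ # give the application server a moment to start, then hit the apex route
$ curl http://127.0.0.1:8080
{ "team": "foo", "version": "1.0.1" }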

Delaying application server startup

Configuring the container to pause before starting its application server is meant to simulate a production container taking a few seconds before it is able to accept requests. This feature is also configured by setting an environment variable.
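
As a rough local sketch (the environment variable name comes from the manifests later in this post; the host port and 15-second value are arbitrary), the server simply won't answer until the delay has elapsed.

$ docker run --rm -d -p 8080:80 -e START_WAIT_SECS=15 \
    public.ecr.aws/i4a3l2a7/probeserver:latest

$ curl http://127.0.0.1:8080/healthz    # connection refused for roughly the first 15 seconds
$ sleep 15; curl http://127.0.0.1:8080/healthz
{ "status": "ok" }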

Crash on launch

Configuring the container to crash with a certain frequency is meant to simulate a buggy container that made its way into a production environment. Definitely not unheard of in today’s fast-paced agile development environments. Again, this feature is configured by setting an environment variable.
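
A hedged way to observe this locally is to launch the container detached and then check whether it survived. With CRASH_FACTOR=40, roughly 40% of launches should show an "Exited" status; the exact exit codes are an implementation detail I won't assume here, and the one-second startup delay is just an arbitrary choice.

$ docker run -d --name probe-crash-test \
    -e CRASH_FACTOR=40 -e START_WAIT_SECS=1 \
    public.ecr.aws/i4a3l2a7/probeserver:latest

$ # "Up ..." means it survived launch; "Exited ..." means it crashed
$ sleep 5; docker ps -a --filter name=probe-crash-test --format '{{.Status}}'

$ docker rm -f probe-crash-test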

Crash on URL path

I wanted a way to crash the container at any point after the application server has started, the thought being that at some point I might want to get into some chaos testing for future POCs. Accessing the ‘/crash’ endpoint of the application server will achieve this result.

Health status output

Kubernetes liveness and readiness probes require some form of health check. To simulate a web application’s (healthy) health check, I configured the container’s application server to return a 200 status code and a brief JSON object when a request is made to the ‘/healthz’ endpoint.

Alter health status output

I wanted a way to simulate a bad health check status within each pod. As such, I worked a mechanism similar to the one used to crash the container outright into the health check endpoint, causing the container’s application server to randomly return 500 status codes. This feature is also configured by setting an environment variable.
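
A quick local sketch of this behavior (the host port and one-second startup delay are arbitrary) is to poll the health check endpoint and watch the status codes flip between 200 and 500. The exact sequence will obviously vary from run to run.

$ docker run --rm -d -p 8080:80 \
    -e HEALTH_STATUS_FACTOR=40 -e START_WAIT_SECS=1 \
    public.ecr.aws/i4a3l2a7/probeserver:latest

$ for i in $(seq 1 5); do curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/healthz; sleep 1; done
200
500
200
200
500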

Resolve and request a Kubernetes service

Again, this feature is meant for researching Kubernetes proxy behavior and its effects on cross-availability-zone traffic routing. It won’t be used in this post. Regardless, this feature takes in a hostname, resolves it, makes an HTTP (not HTTPS) request, and then outputs some information about the container servicing the request, the IP address of the resolved hostname, and the response provided by the remote service. A good way to test this is to run it against “checkip.dyndns.org”.

This ends up looking like the following when run.

$ curl http://127.0.0.1/resolve?service=checkip.dyndns.org
----------------------------------------------------
Hostname = 1bc526ba48ce
Ip Address = 127.0.0.1
Service hostname = checkip.dyndns.org
Service IP = 158.101.44.242
----------------------------------------------------
<html><head><title>Current IP Check</title></head><body>Current IP Address: 174.51.70.14</body></html>

Putting all of this together with what we learned in the previous post, in the form of a Kubernetes manifest, looks like the following. Please note that altering the health status returned by the application server is not included in this manifest. To keep things simple(ish), we’ll work through that later.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: prober-foo
  namespace: default
  labels:
    app: prober-foo
    deployment: foo
    env: prod
spec:
  progressDeadlineSeconds: 60
  replicas: 3
  selector:
    matchLabels:
      app: prober-foo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 30%
      maxSurge: 1
  template:
    metadata:
      labels:
        app: prober-foo
        deployment: foo
        env: prod
        version: 1.0.1
    spec:
      containers:
      - image: "public.ecr.aws/i4a3l2a7/probeserver:latest"
        name: prober-foo
        env:
          - name: START_WAIT_SECS
            value: '15'
          - name: CRASH_FACTOR
            value: '40'
          - name: CONTENT
            value: '{ "team": "foo", "version": "1.0.1" }'
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 20
          successThreshold: 1
          failureThreshold: 1
        readinessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 5
          successThreshold: 1
          failureThreshold: 1

You’ll notice that I’ve placed my container image in AWS’ public container registry. It currently lives at “public.ecr.aws/i4a3l2a7/probeserver:latest”. Be advised this may not be the case at some point in the future.
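
If you'd rather not depend on that image remaining publicly available, one option is to mirror it into a registry you control. This is just a sketch; "<your-registry>" is a placeholder, and public ECR images can typically be pulled without authenticating.

$ docker pull public.ecr.aws/i4a3l2a7/probeserver:latest
$ docker tag public.ecr.aws/i4a3l2a7/probeserver:latest <your-registry>/probeserver:latest
$ docker push <your-registry>/probeserver:latest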

You may also notice that I’ve set the following three environment variables in the deployment manifest.

  • CONTENT - The apex route will return '{ "team": "foo", "version": "1.0.1" }' in its response body.
  • START_WAIT_SECS - The application server will wait 15 seconds before starting.
  • CRASH_FACTOR - The container will crash 40% of the time on launch.

Additionally, we’ll accompany this deployment manifest with a service so we can access the application via listening ports on each of the worker nodes. We’ll use the following to do so.

apiVersion: v1
kind: Service
metadata:
  name: foo-nodeport-svc
  labels:
    app: prober-foo
    deployment: foo
    env: prod
    version: 1.0.1
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
      nodePort: 30080
  selector:
    deployment: foo
    env: prod
  type: NodePort
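
For reference, applying both manifests might look like the following. The file names here are just examples; save the YAML above under whatever names you prefer.

$ kubectl apply -f prober-foo-deployment.yaml
deployment.apps/prober-foo created

$ kubectl apply -f foo-nodeport-svc.yaml
service/foo-nodeport-svc created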

After applying the manifests to a Kubernetes cluster and periodically capturing the deployment's pod statuses, you will initially see output like the following.

$ kubectl get pods
NAME                          READY   STATUS              RESTARTS   AGE
prober-foo-67b66bdbf7-4bxjb   0/1     ContainerCreating   0          1s
prober-foo-67b66bdbf7-7dnzd   0/1     ContainerCreating   0          1s
prober-foo-67b66bdbf7-hw78w   0/1     ContainerCreating   0          1s

We know the application server won’t be started immediately (due to our START_WAIT_SECS environment variable), so we expect to see the pods running but not yet in the ready state. Something like the following should be seen when checking pod statuses.

NAME                          READY   STATUS    RESTARTS   AGE
prober-foo-67b66bdbf7-4bxjb   0/1     Running   0          4s
prober-foo-67b66bdbf7-7dnzd   0/1     Running   0          4s
prober-foo-67b66bdbf7-hw78w   0/1     Running   0          4s

Checking pod statuses once again and looking closely at the “RESTARTS” column, you will see some of the pods restarting. This is where the CRASH_FACTOR environment variable causes the container to fail and Kubernetes restarts it.

NAME                          READY   STATUS    RESTARTS     AGE
prober-foo-67b66bdbf7-4bxjb   0/1     Running   0            19s
prober-foo-67b66bdbf7-7dnzd   0/1     Running   0            19s
prober-foo-67b66bdbf7-hw78w   0/1     Running   1 (1s ago)   19s

Checking pod status again, you may see some of the pods in a CrashLoopBackOff state. This is where Kubernetes applies an exponential back-off delay when restarting failed containers. Since we know the container won’t fail to launch 100% of the time, each pod should eventually achieve a ready state.

NAME                          READY   STATUS             RESTARTS      AGE
prober-foo-67b66bdbf7-4bxjb   0/1     CrashLoopBackOff   2 (4s ago)    70s
prober-foo-67b66bdbf7-7dnzd   1/1     Running            1 (53s ago)   70s
prober-foo-67b66bdbf7-hw78w   0/1     Running            2 (36s ago)   70s
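
If you want to see the back-off behavior directly, the same event-filtering approach used elsewhere in this post works here as well (no output shown, since the timing and restart counts will vary from run to run).

$ kubectl get events --sort-by=.metadata.creationTimestamp | grep prober-foo | grep BackOff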

And after a few short minutes, you should see all of the pods successfully launched and in the “ready” state.

NAME                          READY   STATUS    RESTARTS       AGE
prober-foo-67b66bdbf7-4bxjb   1/1     Running   3 (33s ago)    1m44s
prober-foo-67b66bdbf7-7dnzd   1/1     Running   1 (2m8s ago)   1m44s
prober-foo-67b66bdbf7-hw78w   1/1     Running   3 (38s ago)    1m44s
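
The IP addresses used in the next few requests (192.168.0.231 and 192.168.0.232 in my lab) belong to the worker nodes. One way to list them on your own cluster is:

$ kubectl get nodes -o wide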

Accessing the application’s apex route via its service port on each worker node yields the following.

$ curl http://192.168.0.231:30080
{ "team": "foo", "version": "1.0.1" }

$ curl http://192.168.0.232:30080
{ "team": "foo", "version": "1.0.1" }

Accessing the application’s “/healthz” route also provides what we expected.

$ curl http://192.168.0.231:30080/healthz
{ "status": "ok" }

$ curl http://192.168.0.232:30080/healthz
{ "status": "ok" }

Now that we’ve verified the deployment was successful and operating correctly, let’s crash one of the pods and watch it restart.

$ curl http://192.168.0.232:30080/crash
curl: (52) Empty reply from server

$ kubectl get pods
NAME                          READY   STATUS     RESTARTS       AGE
prober-foo-67b66bdbf7-4bxjb   1/1     Running    3 (33s ago)    5m30s
prober-foo-67b66bdbf7-7dnzd   1/1     Running    1 (2m8s ago)   5m30s
prober-foo-67b66bdbf7-hw78w   0/1     Completed  4 (1s ago)     5m30s

Checking events shows us that the readiness probe failed for the “prober-foo-67b66bdbf7-hw78w” pod, prompting Kubernetes to pull the container image in order to restart it.

$ kubectl get events --sort-by=.metadata.creationTimestamp | grep prober-foo | grep -e Unhealthy -e Pulled
16s         Warning   Unhealthy                pod/prober-foo-67b66bdbf7-hw78w    Readiness probe failed: Get "http://10.44.0.3:80/healthz": dial tcp 10.44.0.3:80: connect: connection refused
22s         Normal    Pulled                   pod/prober-foo-67b66bdbf7-hw78w    Successfully pulled image "public.ecr.aws/i4a3l2a7/probeserver:latest" in 781.431535ms

Waiting about a minute shows us that the crashed pod has recovered. Note the incremented value in the RESTARTS column.

NAME                          READY   STATUS    RESTARTS       AGE
prober-foo-67b66bdbf7-4bxjb   1/1     Running   3 (33s ago)    6m27s
prober-foo-67b66bdbf7-7dnzd   1/1     Running   1 (2m8s ago)   6m27s
prober-foo-67b66bdbf7-hw78w   1/1     Running   4 (58s ago)    6m27s

Great success! Everything went according to plan. We witnessed the following behavior exhibited by the test container during this short exercise.

  • The application server took a few seconds to start up.
  • A few of the containers crashed on launch.
  • The customized content returned in the response for the apex route is what we expected.
  • The hard-coded content returned in the response for the health-check route is what we expected.
  • We were able to arbitrarily crash a pod and witness Kubernetes recover it.

So as promised, we’ll also look at altering the health status being returned by the application server. We’ll be looking to verify that Kubernetes will self-heal pods failing liveness probes by restarting them.

The deployment manifest below will be utilized for this exercise. The following changes were introduced relative to the original deployment manifest:

  • The CRASH_FACTOR environment variable was removed and replaced with HEALTH_STATUS_FACTOR, which causes the health check to return a 500 status code 40% of the time.
  • The CONTENT environment variable was updated to reflect that the app version is now 1.0.2.
  • The liveness probe had the “periodSeconds” directive added to force the health check to execute every 3 seconds.
  • The liveness probe had its “failureThreshold” directive set to 2 to reduce the number of container restarts caused by the altered health statuses introduced by the HEALTH_STATUS_FACTOR environment variable.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: prober-foo
  namespace: default
  labels:
    app: prober-foo
    deployment: foo
    env: prod
spec:
  progressDeadlineSeconds: 60
  replicas: 3
  selector:
    matchLabels:
      app: prober-foo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 30%
      maxSurge: 1
  template:
    metadata:
      labels:
        app: prober-foo
        deployment: foo
        env: prod
        version: 1.0.2
    spec:
      containers:
      - image: "public.ecr.aws/i4a3l2a7/probeserver:latest"
        name: prober-foo
        env:
          - name: START_WAIT_SECS
            value: '15'
          - name: HEALTH_STATUS_FACTOR
            value: '40'
          - name: CONTENT
            value: '{ "team": "foo", "version": "1.0.2" }'
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 20
          successThreshold: 1
          failureThreshold: 2
          periodSeconds: 3
        readinessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 5
          successThreshold: 1
          failureThreshold: 1

Applying this updated manifest will force a rolling update of the deployment. Once the update has completed, we will look at pod status again to see if we witness any restarts.
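
One way to wait for the rollout to finish is “kubectl rollout status”; the output below is abbreviated, and with pods randomly failing health checks it may take a little while to settle.

$ kubectl rollout status deployment/prober-foo
Waiting for deployment "prober-foo" rollout to finish: 1 out of 3 new replicas have been updated...
deployment "prober-foo" successfully rolled out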

Sure enough, we’re seeing all of the pods periodically restarting.

$ kubectl get pods
NAME                          READY   STATUS    RESTARTS      AGE
prober-foo-764cf9f454-5pqcx   1/1     Running   1 (88s ago)   2m43s
prober-foo-764cf9f454-8fsnc   0/1     Running   3 (3s ago)    3m3s
prober-foo-764cf9f454-8gjl4   0/1     Running   2 (10s ago)   2m13s

Making requests to the application’s “/healthz” endpoint shows that it is not always returning healthy status codes.

$ while true; do curl http://192.168.0.231:30080/healthz; sleep 1; done
{ "status": "ok" }
{ "status": "ok" }
{ "status": "error" }
{ "status": "ok" }
{ "status": "ok" }
{ "status": "error" }
{ "status": "ok" }
{ "status": "ok" }
{ "status": "error" }

Searching through events shows us what was happening under the hood.

$ kubectl get events --sort-by=.metadata.creationTimestamp | grep prober-foo | grep -e Unhealthy -e Killing
2m24s       Warning   Unhealthy                pod/prober-foo-764cf9f454-5pqcx    Liveness probe failed: HTTP probe failed with statuscode: 500
3m3s        Warning   Unhealthy                pod/prober-foo-764cf9f454-8fsnc    Liveness probe failed: HTTP probe failed with statuscode: 500
3m7s        Normal    Killing                  pod/prober-foo-764cf9f454-8fsnc    Container prober-foo failed liveness probe, will be restarted
3m3s        Normal    Killing                  pod/prober-foo-764cf9f454-5pqcx    Container prober-foo failed liveness probe, will be restarted

Kubernetes saw liveness probes fail and then restarted the unhealthy pods. Exactly what we were looking to see happen!

That about wraps up the intended purpose of this blog post. I successfully created a container that lets me experiment with the Kubernetes pod lifecycle and understand how Kubernetes manages it. I’m pretty happy with the outcome and hope that you’ll find value in it for your POCs.

So once again, thanks for hanging out with me for a bit! Stay tuned for the next installment of this series, where we’ll work on creating an AWS cloud-native CI/CD pipeline to facilitate deployments to an EKS cluster.

Author
Ryan Wendel