Trek10 AWS Cloud Studio
I've been working on a few Kubernetes engagements recently, in particular application deployment processes for Amazon Elastic Kubernetes Service (EKS) clusters. The question came up: how could we build a simple, cloud-native mechanism to deploy applications on EKS clusters? I have built a fairly involved proof of concept of one approach to this challenge and want to share it with others who may be interested.
The entire proof-of-concept solution involves the following tasks:
As working through these three tasks would make for an extremely long read, I’ll only be focusing on the first in this blog post. I’ll go into the remaining two (maybe broken up into three) tasks in a series of follow-up blog posts over the coming month or so.
The links to all three posts in this series are:
Real-World Kubernetes Deployments Part 1 - Cloud Native CI/CD
Real-World Kubernetes Deployments Part 2 - Cloud Native CI/CD
Real-World Kubernetes Deployments Part 3 - Cloud Native CI/CD
The Kubernetes ecosystem is a remarkably powerful toolset and, as one could easily expect, provides a robust mechanism for rolling out container updates and other supporting objects required by a given application deployment.
With this said, a good place to start this journey would be to understand how Kubernetes probes workload containers throughout the pod lifecycle. In the context of a deployment, this entails understanding the "livenessProbe", "readinessProbe", and "startupProbe" manifest directives. Granted, the liveness probe isn't strictly necessary for deployment purposes, but it's good to understand, so we'll go over it anyway.
Some great references for these probes come straight from the Kubernetes docs.
I'll summarize the highlights to help you work through this post quickly.
The startup probe type was created to deal with legacy applications that might require additional startup time on their first initialization. You will, more than likely, opt for a readiness probe over a startup probe and, as such, we'll not focus on the startup probe in this post.
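For completeness, here is a sketch of what a startup probe might look like. The endpoint and the numbers below are illustrative placeholders for a hypothetical slow-starting legacy container, not values from the proof of concept:

```yaml
# Give a slow-starting legacy app up to 30 * 5 = 150 seconds to come up.
# Liveness and readiness checks are held off until the startup probe succeeds.
startupProbe:
  httpGet:
    path: /healthz   # hypothetical health endpoint
    port: 80
  failureThreshold: 30
  periodSeconds: 5
```

Once the startup probe succeeds, the liveness and readiness probes take over for the remainder of the container's life.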
Probes feature handlers that allow you to customize how you want to determine the status of your container. The following three handlers are afforded by Kubernetes: "exec" (run a command inside the container), "httpGet" (make an HTTP request against a container endpoint), and "tcpSocket" (attempt to open a TCP connection to a container port).
We'll only be working with the HTTP handler for this blog post as my proof-of-concept dealt with containers running web applications.
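Before moving on, here's a quick sketch of the two handlers we won't be using. The command and port below are placeholders for illustration only:

```yaml
# exec handler: the probe succeeds if the command exits with status 0.
livenessProbe:
  exec:
    command: [ "cat", "/tmp/healthy" ]   # placeholder command
# tcpSocket handler: the probe succeeds if a TCP connection can be opened.
readinessProbe:
  tcpSocket:
    port: 5432   # placeholder port
```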
Let's take a look at examples of a liveness and readiness probe as they might be used in the wild to test a container's HTTP health check endpoint.
livenessProbe:
  httpGet:
    path: /healthz
    port: 80
  initialDelaySeconds: 20
  successThreshold: 1
  failureThreshold: 1
readinessProbe:
  httpGet:
    path: /healthz
    port: 80
  initialDelaySeconds: 5
  successThreshold: 1
  failureThreshold: 1
Right away we see some interesting, extra directives. Of particular interest are the initial delay and success/failure thresholds. These are important to set in order to help prevent false positives and to codify what success and failure levels we'll accept from our probes (respectively).
Looking back at our example probes, we see that when our pod/container starts we: wait 5 seconds before the first readiness check, wait 20 seconds before the first liveness check, and treat a single probe success or failure as conclusive (both thresholds are set to "1").
Now that we have a basic understanding of how to help Kubernetes understand the health of our container workloads, we'll want to look at how we go about deploying new versions of containers, which has us first looking at the "Deployment" Kubernetes object.
Straight from the docs:
“A Deployment provides declarative updates for Pods and ReplicaSets. You describe a desired state in a Deployment, and the Deployment Controller changes the actual state to the desired state at a controlled rate.”
Keeping to the more important aspects of a deployment, we'll focus on the "strategy" and "progressDeadlineSeconds" manifest directives.
We'll start with the "strategy" directive as this controls how we'll instruct Kubernetes to provision and terminate the containers that comprise our deployments.
The following two types of strategies are at our disposal: "Recreate", which terminates all of a deployment's existing pods before creating new ones, and "RollingUpdate", which incrementally replaces old pods with new ones.
Seeing that this post is discussing “real-world Kubernetes deployments”, we're only going to focus on the latter. Application uptime is paramount in today's online business environment and terminating an entire deployment's pods/containers prior to replacing them is not something that would likely be viewed as acceptable.
The "RollingUpdate" strategy utilizes two directives of its own, "maxSurge" and "maxUnavailable", to control how a deployment's update process is enacted.
In short, maxSurge is how many new pods we allow Kubernetes to create over the original number specified in the deployment manifest, while maxUnavailable is the number of pods we are willing to forgo when serving requests from the deployment during the update process.
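To make the arithmetic concrete, here is the strategy block we'll use later in this post, annotated with how Kubernetes interprets it for a five-replica deployment (note that when percentages are used, maxUnavailable is rounded down and maxSurge is rounded up):

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 20%   # 20% of 5 replicas = 1 pod; at least 4 pods stay available
    maxSurge: 1           # at most 5 + 1 = 6 pods may exist during the rollout
```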
And lastly, we need to touch on the "progressDeadlineSeconds" directive. Again, taking info provided to us in the Kubernetes docs:
It is important to note that this directive is given in terms of deployment "progress", not the overall length of time a deployment spans from start to finish. More specifically, progress is defined as any time the deployment creates or deletes a pod/container. When progress is made, the timeout clock is reset to zero.
In short, this directive allows you to tell Kubernetes to give up and abandon a rollout with a progress deadline different from the default 10 minutes.
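As a sketch, overriding the deadline is a single spec-level field; the 120-second value below is purely illustrative (the manifest later in this post uses 60):

```yaml
spec:
  # Report the rollout as failed if no progress is made for 120 seconds.
  progressDeadlineSeconds: 120
```

When the deadline is exceeded, Kubernetes sets the deployment's "Progressing" condition to false with the reason "ProgressDeadlineExceeded"; it does not automatically roll the deployment back.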
Putting all of these directives together, we end up with a deployment manifest that looks like the following.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-foo
  namespace: default
  labels:
    app: example-foo
    deployment: foo
spec:
  progressDeadlineSeconds: 60
  replicas: 5
  selector:
    matchLabels:
      app: example-foo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 20%
      maxSurge: 1
  template:
    metadata:
      labels:
        app: example-foo
        deployment: foo
    spec:
      containers:
      - name: example-foo
        image: public.ecr.aws/nginx/nginx:latest
        command: [ "/bin/sh", "-c" ]
        args:
        - sleep 15;
          echo "1.0.0" > /usr/share/nginx/html/index.html;
          echo healthy > /usr/share/nginx/html/healthz;
          nginx -g "daemon off;";
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 20
          successThreshold: 1
          failureThreshold: 1
        readinessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 5
          successThreshold: 1
          failureThreshold: 1
This manifest provides us with a deployment that will utilize every aforementioned Kubernetes pod lifecycle directive. Some of the more important aspects of this deployment include: five replicas; a 15-second sleep in each container's startup command to simulate application boot time; a readiness probe that begins checking /healthz after 5 seconds; a rolling update strategy allowing at most one surge pod and 20% unavailability; and a 60-second progress deadline.
In short, this deployment will simulate the creation of an application container fleet where each container requires a few seconds of boot time before being able to accept requests. It will enable us to examine how Kubernetes conducts and controls rollouts given the values we specify for the pod lifecycle directives.
In order to begin using this deployment to examine rollout behavior, we’ll want to provision a NodePort service so we can access the web services being provided by each container. The following service manifest should do the trick.
apiVersion: v1
kind: Service
metadata:
  name: foo-nodeport-svc
  labels:
    deployment: foo
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
    nodePort: 30080
  selector:
    deployment: foo
  type: NodePort
We’ll write both of these manifests to separate files and then apply both to a cluster using kubectl. Once applied, we’ll examine the information associated with the deployment, its pods, and the service. You should see something like the following:
$ kubectl apply -f deployments.yaml
deployment.apps/example-foo created
$ kubectl apply -f services.yaml
service/foo-nodeport-svc created
$ kubectl get deployment.apps/example-foo
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
example-foo   3/5     5            3           24s
$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
example-foo-6574bff886-tdd8h   1/1     Running   0          31s
example-foo-6574bff886-25gpw   0/1     Running   0          31s
example-foo-6574bff886-4m44j   1/1     Running   0          31s
example-foo-6574bff886-fb7s4   0/1     Running   0          31s
example-foo-6574bff886-txxqp   1/1     Running   0          31s
$ kubectl get service/foo-nodeport-svc
NAME               TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
foo-nodeport-svc   NodePort   10.105.128.167   <none>        80:30080/TCP   39s
You will quickly notice that not all pods are in the “ready” state. This is the result of having added the sleep statement to the webserver pod. After about 45 seconds you should see all 5 pods servicing requests.
The Kubernetes cluster I utilize for testing purposes has two worker nodes, 192.168.0.231 and 192.168.0.232. You will see these hosts throughout the remainder of this post.
Making requests to the exposed services will show that the webservers in our pods have all successfully started and are servicing requests made to /index.html and /healthz.
$ curl -S http://192.168.0.231:30080/healthz
healthy
$ curl -S http://192.168.0.232:30080/healthz
healthy
$ curl -S http://192.168.0.231:30080/index.html
1.0.0
$ curl -S http://192.168.0.232:30080/index.html
1.0.0
Now that we’ve performed our initial deployment, we’ll alter the deployment in such a way as to force Kubernetes to replace all of the pods in the deployment. We’ll accomplish this by incrementing the version number displayed in the response provided by index.html. As in, we’ll change the line in the deployment manifest that writes the version number to /usr/share/nginx/html/index.html:
command: [ "/bin/sh", "-c" ]
args:
- sleep 15;
  echo "1.0.0" > /usr/share/nginx/html/index.html;
  echo healthy > /usr/share/nginx/html/healthz;
  nginx -g "daemon off;";
To look like:
command: [ "/bin/sh", "-c" ]
args:
- sleep 15;
  echo "1.0.1" > /usr/share/nginx/html/index.html;
  echo healthy > /usr/share/nginx/html/healthz;
  nginx -g "daemon off;";
Applying the updated deployment file and examining the output of the following commands over time will illustrate how Kubernetes is managing the rollout of our deployment.
kubectl get deployment.apps/example-foo
kubectl get pods
Right after applying the updated manifest we see the following output.
$ kubectl get deployment.apps/example-foo
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
example-foo   5/5     1            5           111s
$ kubectl get pods
NAME                           READY   STATUS              RESTARTS   AGE
example-foo-59bc65584c-v27q2   0/1     ContainerCreating   0          1s
example-foo-6574bff886-25gpw   1/1     Running             0          111s
example-foo-6574bff886-4m44j   1/1     Running             0          111s
example-foo-6574bff886-fb7s4   1/1     Running             0          111s
example-foo-6574bff886-tdd8h   1/1     Running             0          111s
example-foo-6574bff886-txxqp   1/1     Running             0          111s
$ curl http://192.168.0.231:30080
1.0.0
$ curl http://192.168.0.232:30080
1.0.0
As expected, only a single container was initially provisioned as our maxSurge value was set to “1”. Additionally, the version being displayed in the response for index.html has not changed yet.
Waiting a little more than twenty seconds, we then see the following.
$ kubectl get deployment.apps/example-foo
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
example-foo   5/5     2            5           2m14s
$ kubectl get pods
NAME                           READY   STATUS              RESTARTS   AGE
example-foo-59bc65584c-84xxc   0/1     ContainerCreating   0          0s
example-foo-59bc65584c-v27q2   1/1     Running             0          24s
example-foo-6574bff886-25gpw   1/1     Running             0          2m14s
example-foo-6574bff886-4m44j   1/1     Running             0          2m14s
example-foo-6574bff886-fb7s4   1/1     Running             0          2m14s
example-foo-6574bff886-tdd8h   1/1     Terminating         0          2m14s
example-foo-6574bff886-txxqp   1/1     Running             0          2m14s
$ curl http://192.168.0.231:30080
1.0.0
$ curl http://192.168.0.232:30080
1.0.0
At this point we see a second new pod being provisioned and a single old pod being terminated. Just what we’d expect by setting maxUnavailable to “20%” with five pods in a deployment. Again, we have yet to see the new version pop up in a response.
A little over a minute in and we see that a total of four new pods have been provisioned, the new version number is being reflected in some responses, and Kubernetes is continuing to terminate old pods.
$ kubectl get deployment.apps/example-foo
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
example-foo   5/5     4            5           3m1s
$ kubectl get pods
NAME                           READY   STATUS              RESTARTS   AGE
example-foo-59bc65584c-84xxc   1/1     Running             0          48s
example-foo-59bc65584c-9fdp2   1/1     Running             0          23s
example-foo-59bc65584c-v27q2   1/1     Running             0          72s
example-foo-59bc65584c-v59mm   0/1     ContainerCreating   0          2s
example-foo-6574bff886-25gpw   1/1     Running             0          3m2s
example-foo-6574bff886-4m44j   1/1     Terminating         0          3m2s
example-foo-6574bff886-fb7s4   1/1     Terminating         0          3m2s
example-foo-6574bff886-txxqp   1/1     Running             0          3m2s
$ curl http://192.168.0.231:30080
1.0.1
$ curl http://192.168.0.232:30080
1.0.0
Not once during the deployment have we seen a request fail. Granted, we’ve seen differences in the versions (something characteristic of a rolling update) but, so far, we have not experienced any downtime during this deployment.
After a total of about 2.5 minutes we see the deployment completing successfully.
$ kubectl get deployment.apps/example-foo
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
example-foo   5/5     5            5           4m15s
$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
example-foo-59bc65584c-84xxc   1/1     Running   0          2m2s
example-foo-59bc65584c-9fdp2   1/1     Running   0          97s
example-foo-59bc65584c-gt6q9   1/1     Running   0          53s
example-foo-59bc65584c-v27q2   1/1     Running   0          2m26s
example-foo-59bc65584c-v59mm   1/1     Running   0          76s
$ curl http://192.168.0.231:30080
1.0.1
$ curl http://192.168.0.232:30080
1.0.1
The final result: all of the updated containers/pods have been provisioned, every older pod has been terminated, and all requests to index.html are returning the new version number. At this point we can confidently state that our deployment ran successfully while encountering zero downtime!
Some final thoughts to consider: setting pod lifecycle directives to optimal values won’t always be easy and will, undoubtedly, involve some tweaking, experimentation, and metrics-driven debugging when substantial codebase changes are made and new deployments are rolled out. Rare will be the occasion where you’ll get to “set it and forget it”.
Additionally, responsibility for setting these values needs to be hashed out between the teams involved in the deployment process. A collaboration between developers and Kubernetes admins will, most likely, produce an optimal result. Removing any diffusion of responsibility will go a long way in ensuring a smooth user experience during deployments.
Thanks for hanging out with me for a bit. Stay tuned for the next installment of this series where I’ll work through a similar deployment process with a container I put together that can produce some of the hiccups you may encounter during a deployment.