Sungjoon Ahn

Kubernetes and Docker setup for SPDK

Overview

Circuit Blvd is building storage systems that can scale at data centers. We rely on Docker and Kubernetes (k8s) to manage the end-to-end software pipeline from development to deployment. In this technote, we present an example of how one can leverage Docker and Kubernetes to containerize and orchestrate applications and storage system software.


A mini cluster

We set up a mini Kubernetes cluster with two physical machines and one virtual machine. The first physical machine runs as the Kubernetes master node, orchestrating the application and SPDK pods (in Kubernetes jargon, a pod is the unit of deployment and wraps one or more containers). It also hosts a private Docker registry to store and serve Docker images. The second physical machine is a worker node and performs two major tasks. The first is to run the SPDK vhost app container and expose vhost blk targets to application containers. The second is to run video server application containers inside a QEMU virtual machine. The virtual machine is itself another Kubernetes worker node, so the master node can deploy containers to it as well. The following diagram summarizes the mini cluster setup.

Fig 1. Mini cluster for 24 instances of video servers

One of the features of Kubernetes is that the number of vhost blk targets and the number of video service pods in our example can be specified as parameters, along with other settings. We set NUMINST to 24, which sets up 24 SPDK malloc drives as vhost blk targets.


Installation and setup of Kubernetes

We run this setup on Ubuntu, but other Linux environments should also work with minor modifications. After installing docker.io, Kubernetes can be installed with the commands below.

master:~$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
master:~$ echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | \
 sudo tee /etc/apt/sources.list.d/kubernetes.list 
master:~$ sudo apt update   
master:~$ sudo apt install -y kubelet kubeadm kubernetes-cni  
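
Kubernetes requires swap to be disabled on every node, since the kubelet refuses to start with swap enabled. A minimal sketch, assuming a standard swap entry in /etc/fstab:

master:~$ sudo swapoff -a
master:~$ sudo sed -i '/ swap / s/^/#/' /etc/fstab
master:~$ sudo reboot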

With swap disabled and the machine rebooted, we can set up the Kubernetes master. Replace master's_ip_address with your own address, and save the line starting with "kubeadm join" from the output; it is needed for the worker nodes later.

master:~$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16 \
          --apiserver-advertise-address=master's_ip_address
master:~$ mkdir -p $HOME/.kube
master:~$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config 
master:~$ sudo chown $(id -u):$(id -g) $HOME/.kube/config    
master:~$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr  -d '\n')"  

Let's have the worker physical machine join the Kubernetes mini cluster. Do not forget to have the virtual machine join the mini cluster as well when it is started for the first time. Install Kubernetes on worker1 and worker2 just as on the master node, then run the following command on both worker nodes. Once configured, Kubernetes comes back up automatically across machine reboots. Both token and sha256_hash are taken from the master's kubeadm init output.

worker1:~$  sudo kubeadm join master's_ip_address:6443 --token token --discovery-token-ca-cert-hash sha256_hash 

After labeling the master/worker1/worker2 nodes respectively, 'kubectl get nodes' should display information as shown below. Note that master, worker1, and worker2 under the NAME column are the Linux host names.

master:~$ kubectl label nodes master type1=master
master:~$ kubectl label nodes worker1 type1=worker1
master:~$ kubectl label nodes worker2 type1=worker2
master:~$ kubectl get nodes   
NAME      STATUS   ROLES    AGE     VERSION
master    Ready    master   3d6h    v1.14.1
worker1   Ready    <none>   3d5h    v1.14.1
worker2   Ready    <none>   2d23h   v1.14.1
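
These labels are what the YAML files can use to pin each workload to the right machine. A sketch of a nodeSelector fragment referencing them (illustrative, not the verbatim contents of our YAML files):

spec:
  nodeSelector:
    type1: worker1      # e.g. run the SPDK vhost pod on the worker1 physical machine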

Build and run SPDK and video server containers

The SPDK vhost container needs root privileges, but the video server containers do not. The privileged attribute under securityContext controls whether a pod runs in privileged mode. The video server containers run on mounted volumes whose access is governed by the usual Linux access control mechanisms; we used chown to make the file systems available to these containers.
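
A minimal sketch of the relevant container fragment (privileged is the standard Kubernetes securityContext attribute; the surrounding structure is illustrative rather than the verbatim cb-vhost-p0.yaml):

containers:
- name: cb-vhost-p0
  image: master's_ip_address:5000/cb-vhost-p0
  securityContext:
    privileged: true        # needed by the SPDK vhost app; video server containers omit this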


Once the source code is ready, we need to build, tag, and push the container image to our registry. Download the code from our public GitHub repository. Repeat the commands below for app-vs-p2 as well.

master:~/github/cb-k8s/cb-vhost-p0 $ docker build -t cb-vhost-p0 .
master:~/github/cb-k8s/cb-vhost-p0 $ docker tag cb-vhost-p0 \
    master's_ip_address:5000/cb-vhost-p0
master:~/github/cb-k8s/cb-vhost-p0 $ docker push \
    master's_ip_address:5000/cb-vhost-p0

Now let's deploy containers to our mini cluster. First, ensure that the worker1 physical node is ready to run SPDK applications by invoking the SPDK setup.sh script with proper hugepage parameters, as sketched below. Inside the cb-vhost-p0.yaml file, we define the parameters for the vhost app and the vhost blk targets. Then deploy the SPDK container and verify its status with the Kubernetes commands that follow. The last 'kubectl logs' command takes the pod's id from the previous 'kubectl get pods' output. The log should show the expected output (24 SPDK malloc drives set up correctly as vhost blk targets) before moving on to the next steps.
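
A sketch of the hugepage step (HUGEMEM is the setup.sh variable for the total hugepage memory in MB; the value and the repository path here are illustrative, size them to your configuration):

worker1:~/github/spdk$ sudo HUGEMEM=8192 ./scripts/setup.sh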

master:~$ kubectl apply -f cb-vhost-p0.yaml
master:~$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
cb-vhost-p0-5f4f6c7649-dpnsq   1/1     Running   0          3h33m   
master:~$ kubectl logs pods/cb-vhost-p0-5f4f6c7649-dpnsq

Once the SPDK vhost blk targets are ready via the cb-vhost-p0 pod, we launch the virtual machine worker2. Most of the command line has been explained in previous technotes. Note the options that allocate 12 CPUs and 8G of memory, and that we expose the physical machine's /tmp/mm directory to the VM.

worker1:~/work/qemu$ sudo ~/github/qemu-nvme/bin/qemu-system-x86_64 \
--enable-kvm -cpu host -smp 12 -m 8G \ 
-drive file=/home/cbuser/work/qemu/u1.qcow2,if=none,id=disk \
-device ide-hd,drive=disk,bootindex=0 \
-object memory-backend-file,id=mem0,size=8G,mem-path=/dev/hugepages,share=on \
-numa node,memdev=mem0 \
-virtfs local,path=/tmp/mm,mount_tag=host0,security_model=passthrough,id=host0 \
-net nic,macaddr=DE:AD:BE:EF:01:41 \
-net tap,ifname=tap0,script=q_br_up.sh,downscript=q_br_down.sh -vnc :2 \
-chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.0 \
-device vhost-user-blk-pci,chardev=spdk_vhost_blk0,num-queues=4 \
...(truncated)...
-chardev socket,id=spdk_vhost_blk23,path=/var/tmp/vhost.23 \
-device vhost-user-blk-pci,chardev=spdk_vhost_blk23,num-queues=4

Once logged into the VM, we execute the following commands to make the video source repository available, and then build EXT4 file systems on the vhost blk targets and mount them as /mnt/vhosts/vd[a-x]. The shell scripts are available on GitHub under the scripts/ directory.

worker2:~$ ./mmmount.sh; ./vhostsmnt.sh
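
Roughly, the two scripts do the following (a sketch of the idea, not the verbatim script contents; host0 is the 9p mount tag passed to QEMU above):

worker2:~$ sudo mount -t 9p -o trans=virtio host0 /tmp/mm
worker2:~$ for d in /dev/vd[a-x]; do sudo mkfs.ext4 -q $d; \
               sudo mkdir -p /mnt/vhosts/${d##*/}; sudo mount $d /mnt/vhosts/${d##*/}; done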

Finally, we launch the video application containers as a Kubernetes StatefulSet on this virtual machine. What defines app-vs-p2 as a StatefulSet is the 'kind: StatefulSet' line in its YAML file; note that cb-vhost-p0 is defined as 'kind: Deployment'. If everything has worked so far, you should be able to see nginx's IP address and the port numbers where the videos can be played, as shown below.

master:~$ kubectl apply -f app-vs-p2.yaml
master:~$ ./pp.sh 
25 pod entries...                                                     
1: app-vs-p2-0  1/1   Running  0  4h3m 1 ==> 10.12.90.141:8080 rtmp:1935
2: app-vs-p2-1  1/1   Running  0  4h3m 1 ==> 10.12.90.141:8081 rtmp:1936
...(truncated)...
24: app-vs-p2-9 1/1   Running  0  4h3m 1 ==> 10.12.90.141:8089 rtmp:1944
25: cb-vhost-p0-5f4f6c7649-dpnsq   1/1    Running  0  4h6m 1 ==> VHOST_CONFIG: virtio is now ready for processing.

Under the hood

One can easily understand how the different pieces work together by walking through the code. Here, we summarize a few key points of the implementation.


Resource allocation and management - For our mini cluster, we define resource allocation caps for CPU cores and memory. Kubernetes allows specifying resource requests and limits within YAML files. In our example, worker1 has 24 CPU cores across two physical sockets. We allocate six cores (0-5) to the SPDK vhost container. In addition to the YAML specification, one must also specify the CPU mask parameter on the SPDK vhost command line. For the app-vs-p2 video server pods, we cap CPU at 500m, which is 50% of one CPU core. Similarly, memory allocations can be controlled with YAML definitions. You can see the CPU allocations on the worker1 physical machine in the 'htop' output below. Cores 1-6 are 100% utilized by the cb-vhost-p0 pod, cores 13-24 are used by the app-vs-p2 pods, where each of the 24 pods uses at most 0.5 core, and cores 7-12 are set aside for the machine's remaining tasks.

Fig 2. A htop screen capture
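
As a sketch of how these caps might appear in YAML (field names follow the standard Kubernetes resource model; the structure and the memory value are illustrative, not the verbatim contents of our files):

# Fragment for an app-vs-p2 container (the cb-vhost-p0 container uses cpu: "6" instead)
resources:
  limits:
    cpu: "500m"        # at most 50% of one CPU core per video server pod
    memory: "512Mi"    # illustrative; memory caps are declared the same way

Remember that the YAML limit only caps the container; the SPDK vhost command line still needs a matching CPU mask (e.g. -m 0x3F for cores 0-5).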

Kubernetes StatefulSets - Ordinary Kubernetes deployments do not distinguish between replicated pods. One feature we need from Kubernetes is the ability to tell each video server pod (container) apart, because each of the NUMINST vhost blk targets needs to be served by exactly one video server pod. As one can see in the app-vs-p2.yaml file, MY_POD_ID is defined to pass a unique StatefulSet pod ID to each container. For example, app-vs-p2-0 will have 0 as its identifier. Likewise, app-vs-p2-3 will have 3 as its identifier and serve videos off /dev/vdd, whose files are copied from /tmp/mm/mp4-4.
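
A sketch of the StatefulSet skeleton and one possible way MY_POD_ID can be populated (here via the Kubernetes downward API from the pod name; the image name and the exact mechanism in app-vs-p2.yaml may differ):

apiVersion: apps/v1
kind: StatefulSet                  # vs. 'kind: Deployment' for cb-vhost-p0
metadata:
  name: app-vs-p2
spec:
  serviceName: app-vs-p2
  replicas: 24                     # NUMINST pods: app-vs-p2-0 ... app-vs-p2-23
  selector:
    matchLabels:
      app: app-vs-p2
  template:
    metadata:
      labels:
        app: app-vs-p2
    spec:
      containers:
      - name: app-vs-p2
        image: master's_ip_address:5000/app-vs-p2
        env:
        - name: MY_POD_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name   # e.g. app-vs-p2-3; the startup script keeps the trailing ordinal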


Passing information and host resource sharing for containers - One can pass parametric information to each Kubernetes pod (container) through Linux environment variables; one example is RPC_IP from cb-vhost-p0.yaml. In addition, host directories can be shared with containers by mounting volumes. For example, our movie repository /tmp/mm on the worker1 physical machine contains all the video and thumbnail files, which are copied to /dev/vd[a-x] only by the individual pods. The /tmp/mm directory is shared with the virtual machine worker2 via a 9p mount, and this path is in turn shared with each app-vs-p2 container through volume mounts defined in the app-vs-p2 YAML file.
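
A sketch of the two mechanisms side by side, an environment variable plus a mounted host directory (RPC_IP and the /tmp/mm path come from the text above; the container name, value, and surrounding structure are illustrative):

spec:
  containers:
  - name: example
    env:
    - name: RPC_IP                  # parametric information, as in cb-vhost-p0.yaml
      value: "master's_ip_address"  # illustrative value
    volumeMounts:
    - name: movies
      mountPath: /tmp/mm            # where the container sees the movie repo
  volumes:
  - name: movies
    hostPath:
      path: /tmp/mm                 # the 9p-mounted repo on the worker2 VM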


Conclusion

We showed how applications can be deployed with Docker and Kubernetes and consume data off SPDK-enabled storage services. The example is limited to a single physical machine, but it runs with good performance: each video server pod launches within seconds and all 24 of them run without slowdowns. When Circuit Blvd's OCSSD development hardware becomes available in the second half of 2019, one will be able to build storage applications and services with I/O isolation and QoS guarantees on top of Docker and Kubernetes. This approach scales well inside data centers, as it provides high efficiency in software development, quality assurance, and deployment.


Software versions

Sticking to the following software versions will ensure that all the commands in this article work in your environment:

  • Linux OS: Ubuntu 18.10 Desktop for both physical machine and qemu virtual machine

  • Linux kernel: 5.0.9

  • SPDK: v19.04-57-g00a6c491d

  • Kubernetes: v1.14.1

  • Docker: 18.06.1-ce, build e68fc7a

Questions?

Contact info@circuitblvd.com for additional information.


Appendix - Docker registry setup

First, run the command below on the machine where the registry will run. In our example, we used the Kubernetes master node.

master:~$ docker run -d -p 5000:5000 --restart=always \                  
--name cbregistry registry:2 

For all the nodes that access the registry, add the following information.

node:~$ cat /etc/docker/daemon.json                                              
{                                                                
    "insecure-registries" : ["registry_ip_address:5000"]                                        }                                                                      node:~$ sudo service docker restart 
