diff --git a/contrib/mesos/docs/ha.md b/contrib/mesos/docs/ha.md
index e4debcbc145fd..d737099bb4116 100644
--- a/contrib/mesos/docs/ha.md
+++ b/contrib/mesos/docs/ha.md
@@ -1,4 +1,32 @@
-## High Availability
+# High Availability
+
+Kubernetes on Mesos will eventually support two HA modes:
+
+* [Hot-standby](#hot-standby) (*work-in-progress*)
+* [Cold-standby](#cold-standby)
+
+Hot-standby mode is still work-in-progress, as the controller manager is not
+yet HA-aware (the work is tracked [here][2]). Nevertheless, we describe below
+how hot-standby mode is intended to work. Until this work is done, it is
+recommended to use cold-standby mode for HA. In hot-standby mode, all master
+components (apiserver, controller manager, and scheduler) actively run on
+every master node. Additional logic is added to the controller manager and the
+scheduler to coordinate their access to the etcd backend and to deal with the
+concurrency issues that arise when cluster state is modified. Since the
+apiserver does not modify cluster state, multiple apiservers can run
+concurrently without coordination. When the leader (i.e., the node whose
+scheduler is active) crashes, the other master nodes detect the failure after
+some time and then elect a new leader.
+
+In cold-standby mode, as in hot-standby mode, an apiserver actively runs on
+every master node. However, only one scheduler and one controller manager run
+at any instant. This is coordinated by a small external program called
+`podmaster` that uses etcd to perform leader election; only on the leader node
+does the `podmaster` start the scheduler and controller manager. Cold-standby
+mode is how standalone Kubernetes supports HA, and more information can be
+found [here][1].
+
+## Hot-standby
 
 ### Scheduler
 
@@ -66,5 +94,432 @@ The command line parameters that affect the hash calculation are listed below.
 - `--proxy-*`
 - `--static-pods-config`
 
+## Cold-standby
+
+Setting up Kubernetes on Mesos in cold-standby mode is similar to setting up
+standalone Kubernetes in HA mode, as described in [Kubernetes HA][1]. However,
+special attention is needed when setting up the K8sm scheduler, so that when
+the currently active scheduler crashes or dies, a new one can be instantiated
+to take over the work. More precisely, the new scheduler needs to be
+compatible with the executors that were started by the now-dead scheduler.
+
+### Environment Variables
+
+We will set up the K8sm master on 2 nodes in HA mode. The same steps can be
+extended to more master nodes in order to tolerate more concurrent failures.
+We first define a few environment variables that describe the testbed
+environment.
+
+```
+MESOS_IP=192.168.0.1
+MESOS_PORT=5050
+
+ETCD_IP=192.168.0.2
+ETCD_PORT=4001
+
+K8S_1_IP=192.168.0.3
+K8S_2_IP=192.168.0.4
+K8S_APISERVER_PORT=8080
+K8S_SCHEDULER_PORT=10251
+
+NGINX_IP=192.168.0.5
+NGINX_APISERVER_PORT=80
+NGINX_SCHEDULER_PORT=81
+```
+
+Other than the 2 K8sm master nodes (`192.168.0.3` and `192.168.0.4`), we also
+define a Mesos master at `192.168.0.1`, an etcd server at `192.168.0.2`, and
+an Nginx server at `192.168.0.5` that load balances between the 2 K8sm master
+nodes.
+
+### K8sm Container Image
+
+We use podmaster to coordinate leader election amongst the K8sm masters.
+Podmaster needs to run in a container (preferably in a pod), and on the leader
+node the podmaster will start the scheduler and the controller manager, each
+in its own pod. The podmaster image is pre-built and can be obtained from
+`gcr.io/google_containers/podmaster`.
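+For example, the image can be pre-fetched on each master node ahead of time
+(the `1.1` tag matches the pod specification used later in this document):
+
+```shell
+$ docker pull gcr.io/google_containers/podmaster:1.1
+```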
+
+An official image that contains the `km` binary to start the apiserver,
+scheduler, and controller manager is not yet available, but one can be built
+fairly easily.
+
+```shell
+$ cat <<EOF >Dockerfile
+FROM ubuntu
+MAINTAINER Hai Huang
+RUN mkdir -p /opt/kubernetes
+COPY kubernetes/_output/dockerized/bin/linux/amd64/ /opt/kubernetes
+ENTRYPOINT ["/opt/kubernetes/km"]
+EOF
+$ cat <<'EOF' >build.sh
+#!/bin/bash
+K8SM_IMAGE_NAME=haih/k8sm
+git clone https://github.com/mesosphere/kubernetes
+cd kubernetes
+git checkout release-v0.7-v1.1
+KUBERNETES_CONTRIB=mesos build/run.sh hack/build-go.sh
+cd ..
+sudo docker build -t $K8SM_IMAGE_NAME --no-cache .
+EOF
+$ chmod 755 build.sh
+$ ./build.sh
+```
+
+Make sure the Docker engine is running locally, as we will compile Kubernetes
+inside a Docker container. One can change the image name and the Kubernetes
+release to compile by modifying the script. After the script has finished
+running, there should be a local Docker image called `haih/k8sm` (use
+`docker images` to check).
+
+Optionally, we can also push the image to Docker Hub (i.e., `docker push
+$K8SM_IMAGE_NAME`) so we do not have to compile it on every K8sm master node.
+
+**IMPORTANT:** The Mesosphere team currently maintains the stable K8sm release
+in a separate [fork][3]. At the time of this writing, the latest stable
+release is `release-v0.7-v1.1`.
+
+### Configure ETCD
+
+We assume there is an etcd server at `$ETCD_IP`. Ideally, this would be a
+cluster of etcd servers running in HA mode, backed by redundant persistent
+storage. For testing purposes, one can spin up a single etcd instance in a
+Docker container on the etcd server.
+
+```shell
+$ docker run -d --hostname $(uname -n) --name etcd \
+  -p ${ETCD_PORT}:${ETCD_PORT} \
+  quay.io/coreos/etcd:v2.0.12 \
+  --listen-client-urls http://0.0.0.0:${ETCD_PORT} \
+  --advertise-client-urls http://${ETCD_IP}:${ETCD_PORT}
+```
+
+### Configure Podmaster
+
+Since we plan to run all K8sm components and the podmaster in pods, we can use
+the `kubelet` to bootstrap these pods by specifying a manifests directory.
+
+```shell
+$ mkdir -p /etc/kubernetes/manifests/
+$ mkdir -p /srv/kubernetes/manifests/
+```
+
+Once started, the kubelet periodically checks the manifests directory to see
+whether it needs to start or stop pods: pods are started by putting their pod
+specification yaml files into the manifests directory, and subsequently
+stopped by removing these yaml files.
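+
+Since the podmaster relies on etcd for leader election, it is worth verifying
+first that etcd is reachable from both master nodes; a quick smoke test
+against etcd's v2 HTTP API might look like this:
+
+```shell
+# Both should return a response (a version string and a JSON key listing).
+$ curl http://${ETCD_IP}:${ETCD_PORT}/version
+$ curl http://${ETCD_IP}:${ETCD_PORT}/v2/keys
+```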
+
+The podmaster pod specification is put into `/etc/kubernetes/manifests/` on
+every master node.
+
+```shell
+$ cat <<EOF > /etc/kubernetes/manifests/podmaster.yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: kube-podmaster
+  namespace: kube-system
+spec:
+  hostNetwork: true
+  containers:
+  - name: scheduler-elector
+    image: gcr.io/google_containers/podmaster:1.1
+    command:
+    - /podmaster
+    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
+    - --key=scheduler
+    - --whoami=${MY_IP}
+    - --source-file=/src/manifests/scheduler.yaml
+    - --dest-file=/dst/manifests/scheduler.yaml
+    volumeMounts:
+    - mountPath: /src/manifests
+      name: manifest-src
+      readOnly: true
+    - mountPath: /dst/manifests
+      name: manifest-dst
+  - name: controller-manager-elector
+    image: gcr.io/google_containers/podmaster:1.1
+    command:
+    - /podmaster
+    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
+    - --key=controller
+    - --whoami=${MY_IP}
+    - --source-file=/src/manifests/controller-mgr.yaml
+    - --dest-file=/dst/manifests/controller-mgr.yaml
+    terminationMessagePath: /dev/termination-log
+    volumeMounts:
+    - mountPath: /src/manifests
+      name: manifest-src
+      readOnly: true
+    - mountPath: /dst/manifests
+      name: manifest-dst
+  volumes:
+  - hostPath:
+      path: /srv/kubernetes/manifests
+    name: manifest-src
+  - hostPath:
+      path: /etc/kubernetes/manifests
+    name: manifest-dst
+EOF
+```
+
+One must change `$MY_IP` to either `$K8S_1_IP` or `$K8S_2_IP`, depending on
+which master node the podmaster is being set up on. The podmasters compete
+with each other for leadership, and the winner copies the scheduler and
+controller manager pod specification yaml files from
+`/srv/kubernetes/manifests/` to `/etc/kubernetes/manifests/`. When the kubelet
+detects these new yaml files, it starts the corresponding pods.
+
+### Configure Scheduler
+
+The scheduler pod specification is put into `/srv/kubernetes/manifests/`.
+
+```shell
+$ cat <<EOF > /srv/kubernetes/manifests/scheduler.yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: kube-scheduler
+  namespace: kube-system
+spec:
+  hostNetwork: true
+  containers:
+  - name: kube-scheduler
+    image: haih/k8sm:latest
+    imagePullPolicy: IfNotPresent
+    command:
+    - /opt/kubernetes/km
+    - scheduler
+    - --address=${MY_IP}
+    - --advertised-address=${NGINX_IP}:${NGINX_SCHEDULER_PORT}
+    - --mesos-master=${MESOS_IP}:${MESOS_PORT}
+    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
+    - --api-servers=${NGINX_IP}:${NGINX_APISERVER_PORT}
+    - --v=10
+EOF
+```
+
+Again, one must change `$MY_IP` to either `$K8S_1_IP` or `$K8S_2_IP`,
+depending on which master node is currently being worked on. Even though we
+have not set up Nginx yet, we can still specify `--api-servers` and
+`--advertised-address` using Nginx's address and ports (just make sure Nginx
+is running before starting the scheduler). Having `--api-servers` point to
+Nginx allows executors to maintain connectivity to one of the apiservers even
+when one or more apiservers are down, as Nginx automatically re-routes
+requests to a working apiserver.
+
+It is critically important to point `--advertised-address` at Nginx so that
+all schedulers are assigned the same executor ID. If we instead set
+`--advertised-address=${K8S_1_IP}` on the first K8s master and
+`--advertised-address=${K8S_2_IP}` on the second, they would generate
+different executor IDs, and during a fail-over the new scheduler would not be
+able to use the executors started by the failed scheduler. In that case, one
+would see error messages like the following in the scheduler log:
+
+> Declining incompatible offer...
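+
+Once the cluster is up (see [Start Kubelet](#start-kubelet) below), one can
+check which master node currently runs the active scheduler. This sketch
+assumes the scheduler serves the standard `/healthz` endpoint on
+`$K8S_SCHEDULER_PORT`, the same port the Nginx `schedulers` upstream forwards
+to:
+
+```shell
+# Only the current leader is expected to respond; the other node should
+# refuse the connection, since no scheduler pod runs there.
+$ curl http://${K8S_1_IP}:${K8S_SCHEDULER_PORT}/healthz
+$ curl http://${K8S_2_IP}:${K8S_SCHEDULER_PORT}/healthz
+```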
+
+### Configure Controller Manager
+
+The controller manager pod specification is also put into
+`/srv/kubernetes/manifests/`.
+
+```shell
+$ cat <<EOF > /srv/kubernetes/manifests/controller-mgr.yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: kube-controller-manager
+  namespace: kube-system
+spec:
+  hostNetwork: true
+  containers:
+  - name: kube-controller-manager
+    image: haih/k8sm:latest
+    imagePullPolicy: IfNotPresent
+    command:
+    - /opt/kubernetes/km
+    - controller-manager
+    - --master=http://${NGINX_IP}:${NGINX_APISERVER_PORT}
+    - --cloud-provider=mesos
+    - --cloud-config=/etc/kubernetes/mesos-cloud.conf
+    volumeMounts:
+    - mountPath: /etc/kubernetes
+      name: kubernetes-config
+      readOnly: true
+  volumes:
+  - hostPath:
+      path: /etc/kubernetes
+    name: kubernetes-config
+EOF
+```
+
+The controller manager also needs a Mesos configuration file as one of its
+parameters; this configuration file is written to
+`/etc/kubernetes/mesos-cloud.conf`.
+
+```shell
+$ cat <<EOF >/etc/kubernetes/mesos-cloud.conf
+[mesos-cloud]
+  mesos-master = ${MESOS_IP}:${MESOS_PORT}
+EOF
+```
+
+### Configure Apiserver
+
+The apiserver runs on every master node, so its specification file is put into
+`/etc/kubernetes/manifests/`.
+
+```shell
+$ cat <<EOF > /etc/kubernetes/manifests/apiserver.yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: kube-apiserver
+  namespace: kube-system
+spec:
+  hostNetwork: true
+  containers:
+  - name: kube-apiserver
+    image: haih/k8sm:latest
+    imagePullPolicy: IfNotPresent
+    command:
+    - /opt/kubernetes/km
+    - apiserver
+    - --insecure-bind-address=0.0.0.0
+    - --etcd-servers=http://${ETCD_IP}:${ETCD_PORT}
+    - --allow-privileged=true
+    - --service-cluster-ip-range=10.10.10.0/24
+    - --insecure-port=${K8S_APISERVER_PORT}
+    - --cloud-provider=mesos
+    - --cloud-config=/etc/kubernetes/mesos-cloud.conf
+    - --advertise-address=${MY_IP}
+    ports:
+    - containerPort: ${K8S_APISERVER_PORT}
+      hostPort: ${K8S_APISERVER_PORT}
+      name: local
+    volumeMounts:
+    - mountPath: /etc/kubernetes
+      name: kubernetes-config
+      readOnly: true
+  volumes:
+  - hostPath:
+      path: /etc/kubernetes
+    name: kubernetes-config
+EOF
+```
+
+Again, one must change `$MY_IP` to either `$K8S_1_IP` or `$K8S_2_IP`,
+depending on which master node is currently being worked on.
+
+To summarize the setup so far: the apiserver and podmaster pod specification
+files are put into `/etc/kubernetes/manifests/` so that they run on every
+master node, while the scheduler and controller manager pod specification
+files are put into `/srv/kubernetes/manifests/`, from where they are copied
+into `/etc/kubernetes/manifests/` by the podmaster if and only if that node's
+podmaster was elected the leader.
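+
+The resulting directory layout can be sketched as follows (on the current
+leader, `/etc/kubernetes/manifests/` additionally contains the two files
+copied over by the podmaster):
+
+```shell
+$ ls /etc/kubernetes/manifests/    # on every master node
+apiserver.yaml  podmaster.yaml
+$ ls /srv/kubernetes/manifests/    # source files, copied only on the leader
+controller-mgr.yaml  scheduler.yaml
+```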
+
+### Configure Nginx
+
+Nginx needs to be configured to load balance both the apiservers and the
+schedulers. For testing purposes, one can start Nginx in a Docker container.
+
+```shell
+$ cat <<EOF >nginx.conf
+events {
+  worker_connections 4096;  ## Default: 1024
+}
+
+http {
+  upstream apiservers {
+    server ${K8S_1_IP}:${K8S_APISERVER_PORT};
+    server ${K8S_2_IP}:${K8S_APISERVER_PORT};
+  }
+
+  upstream schedulers {
+    server ${K8S_1_IP}:${K8S_SCHEDULER_PORT};
+    server ${K8S_2_IP}:${K8S_SCHEDULER_PORT};
+  }
+
+  server {
+    listen ${NGINX_APISERVER_PORT};
+    location / {
+      proxy_pass http://apiservers;
+      proxy_next_upstream error timeout invalid_header http_500;
+      proxy_connect_timeout 2;
+      proxy_buffering off;
+      proxy_read_timeout 12h;
+      proxy_send_timeout 12h;
+    }
+  }
+
+  server {
+    listen ${NGINX_SCHEDULER_PORT};
+    location / {
+      proxy_pass http://schedulers;
+      proxy_next_upstream error timeout invalid_header http_500;
+      proxy_connect_timeout 2;
+      proxy_buffering off;
+      proxy_read_timeout 12h;
+      proxy_send_timeout 12h;
+    }
+  }
+}
+EOF
+$ docker run \
+  -p $NGINX_APISERVER_PORT:$NGINX_APISERVER_PORT \
+  -p $NGINX_SCHEDULER_PORT:$NGINX_SCHEDULER_PORT \
+  --name nginx \
+  -v `pwd`/nginx.conf:/etc/nginx/nginx.conf:ro \
+  -d nginx:latest
+```
+
+For simplicity, configuring Nginx to support HTTP over TLS/SPDY is outside the
+scope of this document. However, keep in mind that without TLS/SPDY properly
+configured, some `kubectl` commands might not work properly. This problem is
+documented [here][4].
+
+### Start Kubelet
+
+To start everything up, we start the kubelet on both K8s master nodes so that
+they can start the apiserver and podmaster pods. On the leader node, the
+podmaster will subsequently start the scheduler and controller manager.
+
+```shell
+$ mkdir -p /var/log/kubernetes
+$ kubelet \
+  --api_servers=http://127.0.0.1:${K8S_APISERVER_PORT} \
+  --register-node=false \
+  --allow-privileged=true \
+  --config=/etc/kubernetes/manifests \
+  1>/var/log/kubernetes/kubelet.log 2>&1 &
+```
+
+### Verification
+
+On each of the K8s master nodes, one can run `docker ps` to verify that an
+apiserver pod and a podmaster pod are running. On exactly one of the K8s
+master nodes, a controller manager pod and a scheduler pod should be running
+as well.
+
+One should also verify that user pods can be created in the K8sm cluster:
+
+```shell
+$ export KUBERNETES_MASTER=http://${NGINX_IP}:${NGINX_APISERVER_PORT}
+$ kubectl create -f <pod-spec.yaml>
+$ kubectl get pods
+```
+
+The pod should be shown in the `Running` state after a short amount of time.
+
+### Tuning
+
+During a fail-over, cold-standby mode takes some time before a new scheduler
+is started to take over the work of the failed one. However, one can tune
+various parameters to shorten this time.
+
+The podmaster's `--sleep` and `--ttl-secs` parameters can both be tuned for
+faster failure detection. However, `--ttl-secs` probably should not be set too
+small, in order to minimize false positives.
+
+The kubelet's `--file-check-frequency` parameter controls how frequently it
+checks the manifests directory. It defaults to 20 seconds.
+
+[1]: http://kubernetes.io/v1.0/docs/admin/high-availability.html
+[2]: https://github.com/mesosphere/kubernetes-mesos/issues/457
+[3]: https://github.com/mesosphere/kubernetes
+[4]: https://github.com/kubernetes/kubernetes/blob/master/contrib/mesos/docs/issues.md#kubectl
 
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/contrib/mesos/docs/ha.md?pixel)]()