Commit

Merge pull request kubernetes#17589 from mesosphere/jdef_better_known_issues

Auto commit by PR queue bot
k8s-merge-robot committed Nov 25, 2015
2 parents ff3f240 + e5b8095 commit a873a43
Showing 2 changed files with 185 additions and 33 deletions.
12 changes: 8 additions & 4 deletions contrib/mesos/docs/ha.md
@@ -55,12 +55,16 @@ The command line parameters that affect the hash calculation are listed below.
- `--allow-privileged`
- `--api-servers`
- `--auth-path`
- `--cluster-*`
- `--executor-*`
- `--kubelet-*`
- `--km-path`
- `--mesos-cgroup-prefix`
- `--mesos-launch-grace-period`
- `--minion-*`
- `--profiling`
- `--proxy-*`
- `--static-pods-config`


[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/contrib/mesos/docs/ha.md?pixel)]()
206 changes: 177 additions & 29 deletions contrib/mesos/docs/issues.md
@@ -1,34 +1,118 @@
## Known Issues

This page identifies significant known issues with the Kubernetes-Mesos distribution.

* [General Known Issues](#general-known-issues)
* [DCOS Package Known Issues](#dcos-package-known-issues), in addition to the above.

## General Known Issues

These known issues apply to all builds of Kubernetes-Mesos.

### Upgrades

Upgrading your Kubernetes-Mesos cluster is currently unsupported.
One known problem exists with respect to expressing executor (kubelet and kube-proxy) process configuration via command line flags.
It is **strongly** recommended that all of the Kubernetes-Mesos executors are destroyed before upgrading the Kubernetes-Mesos scheduler component:
- destroy all daemon controllers running in the cluster, across all namespaces
- destroy all replication controllers running in the cluster, across all namespaces
- destroy all pods running in the cluster, across all namespaces
- invoke the "kamikaze" debug endpoint on the scheduler (e.g. `curl http://10.2.0.5:10251/debug/actions/kamikaze`) to terminate all executors

Not following the above steps prior to upgrading the scheduler can result in a cluster wherein pods will never again be scheduled upon one or more nodes.
This issue is being tracked here: https://github.com/mesosphere/kubernetes-mesos/issues/572.

### Port Specifications

In order for pods (replicated or otherwise) to be scheduled on the cluster, it is strongly recommended that:
* `pod.spec.containers[x].ports[y].hostPort` be left unspecified (or zero), or else
* `pod.spec.containers[x].ports[y].hostPort` fall within the range of `ports` resources declared on Mesos slaves
  - double-check the resource declarations for your Mesos slaves; the default for `ports` is typically `[31000-32000]`

Mesos slave host `ports` are resources that are managed by the Mesos resource/offers ecosystem; slave host ports are consumed by launched tasks.
Kubernetes pod container specifications identify two types of ports, "container ports" and "host ports":
- container ports are allocated from the network namespace of the pod, which is independent from that of the host, whereas;
- host ports are allocated from the network namespace of the host.

**Notable on Kubernetes-Mesos**
- Mesos slaves must be configured to offer host `ports` resources in order for pods to use them. Most Mesos package distributions, by default, configure a `ports` resource range for each slave.
- The scheduler recognizes the declared *host ports* of each container in a pod/task and for each such host port, attempts to allocate it from the offered port resources listed in Mesos offers.
- If no host port is declared for a given port spec, then the scheduler may map that port spec's container port to any host port from the offered ports ranges.
- Any *host ports* explicitly declared in the pod container specification must fall within that range of `ports` offered by slaves in the cluster.
Ports declared outside that range (other than zero) will never match resource offers received by the scheduler, and so pod specifications that declare such ports will never be executed as tasks on the cluster.
- A missing pod container host port declaration or a host port set to zero will, by default, result in the allocation of a host port from a resource offer.
- If a pod is the target of a Kubernetes service selector then the related target container ports must be declared in the pod spec.
- In vanilla Kubernetes, host ports with the value zero are ignored.
To obtain the same behavior with the Kubernetes-Mesos scheduler, pods must be assigned a label of `k8s.mesosphere.io/portMapping` with the value `fixed`
(see [#527](https://github.com/mesosphere/kubernetes-mesos/issues/527) and the sketch below).
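
Below is a minimal, hedged pod sketch (names and image are hypothetical, not taken from this repository) that combines the recommendations above: an explicitly declared host port falling within the typical slave `ports` range, and the `k8s.mesosphere.io/portMapping: fixed` label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: port-example                  # hypothetical name
  labels:
    # Optional: with this label, unset/zero hostPorts are ignored (vanilla
    # Kubernetes behavior) instead of being auto-assigned from resource offers.
    k8s.mesosphere.io/portMapping: fixed
spec:
  containers:
  - name: web
    image: nginx                      # hypothetical image
    ports:
    - containerPort: 80
      hostPort: 31080                 # must fall within the slaves' offered `ports` range, e.g. [31000-32000]
    - containerPort: 8080             # no hostPort: left unmapped because of the label above;
                                      # without the label it would be assigned a host port from a resource offer
```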

### Pods

#### Pod Updates

Once a task has been launched for a given pod, Kubernetes-Mesos is blind to any updates applied to the pod state (other than forced or graceful deletion).

#### Pod Placement

The initial plan was to implement pod placement (aka scheduling "constraints") using rules similar to those found in Marathon.
Upon further consideration it has been decided that a greater alignment between the stock Kubernetes scheduler and Kubernetes-Mesos scheduler would benefit both projects, as well as end-users.
Currently there is limited support for pod placement using the Kubernetes-Mesos [scheduler](scheduler.md).
This issue is being tracked here: https://github.com/mesosphere/kubernetes-mesos/issues/338

**Note:** An upcoming changeset will update the scheduler with initial support for multiple Mesos roles
(see [#482](https://github.com/mesosphere/kubernetes-mesos/issues/482)).

#### Static Pods

Static pods are supported by the scheduler.
The path to a directory containing pod definitions can be set via the `--static-pods-config` flag.
Static pods are subject to the following restrictions:

- Static pods *are read only once* by the scheduler on startup.
Only newly started executors will get the latest static pod specs from the defined static pod directory.
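
For reference, here is a minimal static pod definition, sketched with hypothetical names and image (not taken from this repository), of the kind that could be placed in the directory passed via `--static-pods-config`:

```yaml
# Hypothetical file: <static-pods-dir>/nginx-static.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-static                  # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx                      # hypothetical image
    resources:
      limits:                         # explicit cpu/memory limits
        cpu: 100m
        memory: 64Mi
```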

#### Orphan Pods

The default `executor_shutdown_grace_period` of a Mesos slave is 3 seconds.
When the executor is shut down it forcefully terminates the Docker containers that it manages.
However, if terminating the Docker containers takes longer than the `executor_shutdown_grace_period` then some containers may not get a termination signal at all.
A consequence of this is that some pod containers, previously managed by the framework's executor, will remain running on the slave indefinitely.

There are two work-arounds to this problem:
* Restart the framework and it should terminate the orphaned tasks.
* Adjust the value of `executor_shutdown_grace_period` to something greater than 3 seconds.

### Services

#### Port Specifications

In order for Endpoints (and therefore Services) to be fully operational, it is strongly recommended that:
- service ports explicitly define a `name`
- service ports explicitly define a `targetPort`

For example:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis-master
  labels:
    app: redis
    role: master
    tier: backend
spec:
  ports:
  # the port that this service should serve on
  - port: 6379
    targetPort: 6379
    name: k8sm-works-best-with-a-name-here
  selector:
    app: redis
    role: master
    tier: backend
```
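
A pod backing this service must declare the container port targeted by `targetPort` and carry labels matching the service selector; a hedged sketch (hypothetical names and image) follows:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: redis-master-pod              # hypothetical name
  labels:
    app: redis
    role: master
    tier: backend
spec:
  containers:
  - name: master
    image: redis                      # hypothetical image
    ports:
    - containerPort: 6379             # declared so the service endpoint can target it
```
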
#### Endpoints
At the time of this writing both Kubernetes and Mesos are using IPv4 addressing, albeit under different assumptions.
Mesos clusters configured with Docker typically use default Docker networking, which is host-private.
@@ -37,27 +121,91 @@ Kubernetes clusters assume a custom Docker networking configuration that assigns
Kubernetes service endpoints terminate, by default, at a backing pod's IPv4 address using the container-port selected for in the service specification (PodIP:ContainerPort).
This is problematic when default Docker networking has been configured, such as in the case of typical Mesos clusters, because a pod's host-private IPv4 address is not intended to be reachable outside of its host.
The Kubernetes-Mesos project has implemented a work-around:
service endpoints are terminated at HostIP:HostPort, where the HostIP is the IP address of the Mesos slave and the HostPort is the host port declared in the pod container port specification.
Host ports that are not defined, or else defined as zero, will automatically be assigned a (host) port resource from a resource offer.
To disable the work-around and revert to vanilla Kubernetes service endpoint termination:
- execute the k8sm controller-manager with `-host_port_endpoints=false`;

Then the usual Kubernetes network assumptions must be fulfilled for Kubernetes to work with Mesos, i.e. each container must get a cluster-wide routable IP (compare [Kubernetes Networking documentation](../../../docs/design/networking.md#container-to-container)).

The need for this work-around may be obviated down the road by:
- Future support for IPv6 addressing in Docker and Kubernetes
- Native IP-per-container support via Mesos with a custom Kubernetes network plugin

### Scheduling

Statements in this section regarding the "scheduler" pertain specifically to the Kubernetes-Mesos scheduler, unless otherwise noted.

Some factors that influence when pods are scheduled by k8s-mesos:
- availability of a resource offer that "fits" the pod (Mesos master/slave);
- scheduler *backoff* (to avoid busy-looping) during pod scheduling (k8s-mesos scheduler)

The scheduler attempts to mitigate the second item by cancelling the backoff period if an offer arrives that fits a pod-in-waiting.
However, there is nothing that the scheduler can do if there are no resources available in the cluster.

That said, the current scheduling algorithm is naive: it makes **no attempt to pack multiple pods into a single offer**.
This means that each pod launch requires an independent offer.
In a small cluster resource offers do not arrive very frequently.
In a large cluster with a "decent" amount of free resources the arrival rate of offers is expected to be much higher.

The slave on each host periodically advertises its available resources to the Mesos master.
In a single-node cluster, only a single slave process is advertising resources to the master.
The master passes those resources along to the scheduler as offers, at some interval and level of 'fairness' determined by Mesos.
The scheduler pairs each resource offer with a pod that needs to be placed in the cluster.
Once paired, a task is launched to instantiate the pod.
The used resources are marked as consumed, the remaining resources are "returned" to the cluster, and the scheduler waits for the next resource offer from the master... and the cycle repeats.
This likely limits the scheduling throughput observable in a single-node cluster.

The team plans to conduct benchmarks on the scheduling algorithm to establish some baselines, and is thinking about ways to increase scheduling throughput, including scheduling multiple pods per offer.

#### Runtime Configuration

- mesos: `--offer_timeout` : Duration of time before an offer is rescinded from a framework.
This helps fairness when running frameworks that hold on to offers, or frameworks that accidentally drop offers.
([via](http://mesos.apache.org/documentation/latest/configuration/))
- k8s-mesos: `--scheduler-config` : An ini-style configuration file with low-level scheduler settings.
See `offer-ttl`, `initial-pod-backoff`, and `max-pod-backoff`.
([via](https://github.com/kubernetes/kubernetes/blob/master/contrib/mesos/pkg/scheduler/config/config.go))

What is not configurable, but perhaps should be, are the Mesos "filters" that the scheduler includes when declining offers that are not matched to pods within the configured `offer-ttl` (see https://github.com/apache/mesos/blob/0.25.0/include/mesos/mesos.proto#L1165): the current `refuse_seconds` value is hard-coded to 5s.
That parameter should probably be exposed via the scheduler fine-tuning mechanism.

#### Backoff

If no matching resource offer can be found for a pod then that pod is put into a backoff queue.
Once the backoff period expires the pod is re-added to the scheduling queue.
The backoff period may be truncated by the arrival of an offer with matching resources.
This is an event-based design and there is no polling.

#### Debugging

Good insight can be gained by collecting all of the relevant logs into a single tool (Splunk, or an ELK stack) so that it is trivial to search for a task-id or pod-id during cluster debugging sessions.

The scheduler also offers `/debug` API endpoints that may be useful:
- on-demand explicit reconciliation: `/debug/actions/requestExplicit`
- on-demand implicit reconciliation: `/debug/actions/requestImplicit`
- kamikaze (terminate all "empty" executors that aren't running pods): `/debug/actions/kamikaze`
- pods to be scheduled: `/debug/scheduler/podqueue`
- pod registry changes waiting to be processed: `/debug/scheduler/podstore`
- the scheduler's internal task registry state: `/debug/registry/tasks`
- scheduler metrics are available at `/metrics`

## DCOS Package Known Issues

All of the issues in the above section also apply to the Kubernetes-Mesos DCOS package builds.
The issues listed in this section apply specifically to the Kubernetes-Mesos DCOS package available from https://github.com/mesosphere/multiverse.

### Kubectl

The following `kubectl` and `dcos kubectl` commands are not yet supported:

- exec (see [#356](https://github.com/mesosphere/kubernetes-mesos/issues/356))
- logs (see [#587](https://github.com/mesosphere/kubernetes-mesos/issues/587))
- port-forward
- proxy


[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/contrib/mesos/docs/issues.md?pixel)]()
