Add and update docs.
thockin committed Oct 16, 2014
1 parent e907011 commit d258eca
Showing 8 changed files with 1,110 additions and 4 deletions.
4 changes: 2 additions & 2 deletions DESIGN.md
@@ -96,9 +96,9 @@ There are 4 ways that a container manifest can be provided to the Kubelet:

### Kubernetes Proxy

-Each node also runs a simple network proxy. This reflects `services` as defined in the Kubernetes API on each node and can do simple TCP stream forwarding or round robin TCP forwarding across a set of backends.
+Each node also runs a simple network proxy. This reflects `services` (see [here](docs/services.md) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends.

-Service endpoints are currently found through [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) environment variables specifying ports opened by the service proxy. Currently the user must select a unique port to expose the service on on the proxy, as well as the container's port to target.
+Service endpoints are currently found through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes {FOO}_SERVICE_HOST and {FOO}_SERVICE_PORT variables are supported). These variables resolve to ports managed by the service proxy.

## The Kubernetes Control Plane

2 changes: 1 addition & 1 deletion docs/labels.md
@@ -40,7 +40,7 @@ key1 exists
LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter: `?labels=key1%3Dvalue1,key2%3Dvalue2,...`. We may extend such filtering to DELETE operations in the future.

Kubernetes also currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s:
-- `service`: A service is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods.
+- `service`: A [service](services.md) is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods.
- `replicationController`: A [replication controller](replication-controller.md) ensures that a specified number of pod "replicas" are running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more.

The set of pods that a `service` targets is defined with a label selector. Similarly, the population of pods that a `replicationController` is monitoring is also defined with a label selector.
2 changes: 1 addition & 1 deletion docs/networking.md
@@ -82,7 +82,7 @@ We want to be able to assign IP addresses externally from Docker ([Docker issue

In addition to enabling self-registration with 3rd-party discovery mechanisms, we'd like to setup DDNS automatically ([Issue #146](https://github.com/GoogleCloudPlatform/kubernetes/issues/146)). hostname, $HOSTNAME, etc. should return a name for the pod ([Issue #298](https://github.com/GoogleCloudPlatform/kubernetes/issues/298)), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter ([Docker issue #2267](https://github.com/dotcloud/docker/issues/2267)), so that we don't need to keep /etc/hosts files up to date dynamically.

-Service endpoints are currently found through [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) environment variables specifying ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time. Regardless, we're considering moving away from the current approach to an approach more akin to our approach for individual pods: allocate an IP address per service and automatically register the service in DDNS -- L3 load balancing, essentially. Using a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints.
+[Service](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and Kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_PORT) are supported, and resolve to ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet. While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. Clients should instead use the [service portal IP](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service portal IP in DNS, and for that to become the preferred resolution protocol.

We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services ([Issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260)), and other types of groups (worker pools, etc.). Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks ([Issue #140](https://github.com/GoogleCloudPlatform/kubernetes/issues/140)) for join/leave events would probably make this even easier.

151 changes: 151 additions & 0 deletions docs/services.md
@@ -0,0 +1,151 @@
# Services in Kubernetes

## Overview

Kubernetes [`Pods`](pods.md) are ephemeral. They can come and go over time, especially when
driven by things like [ReplicationControllers](replication-controller.md).
While each `pod` gets its own IP address, those IP addresses cannot be relied
upon to be stable over time. This leads to a problem: if some set of `pods`
(let's call them backends) provides functionality to other `pods` (let's call
them frontends) inside the Kubernetes cluster, how do those frontends find the
backends?

Enter `services`.

A Kubernetes `service` is an abstraction which defines a logical set of `pods` and
a policy by which to access them - sometimes called a micro-service. The goal
of `services` is to provide a bridge for non-Kubernetes-native applications to
access backends without the need to write code that is specific to Kubernetes.
A `service` offers clients an IP and port pair which, when accessed, redirects
to the appropriate backends. The set of `pods` targeted is determined by a label
selector.

As an example, consider an image-processing backend which is running with 3 live
replicas. Those replicas are fungible - frontends do not care which backend
they use. While the actual `pods` that comprise the set may change, the
frontend client(s) do not need to know that. The `service` abstraction
enables this decoupling.

## Defining a service

A `service` in Kubernetes is a REST object, similar to a `pod`. Like a `pod`, a
`service` definition can be POSTed to the apiserver to create a new instance.
For example, suppose you have a set of `pods` that each expose port 9376 and
carry a label "app=MyApp".

```json
{
"id": "myapp",
"selector": {
"app": "MyApp"
},
"containerPort": 9376,
"protocol": "TCP",
"port": 8765
}
```

This specification will create a new `service` named "myapp" which resolves to
TCP port 9376 on any `pod` with the "app=MyApp" label. To access this
`service`, a client can simply connect to $MYAPP_SERVICE_HOST on port
$MYAPP_SERVICE_PORT.
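
To make this concrete, here is a minimal client sketch in Go. It is not part of
Kubernetes itself; the portal IP in the comments is made up, and it assumes the
client runs in a `pod` created after the "myapp" `service` (see the ordering
note below).

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Injected by the master when the pod is scheduled; values are illustrative.
	host := os.Getenv("MYAPP_SERVICE_HOST") // e.g. the portal IP, 10.0.0.11
	port := os.Getenv("MYAPP_SERVICE_PORT") // e.g. 8765

	conn, err := net.Dial("tcp", net.JoinHostPort(host, port))
	if err != nil {
		fmt.Fprintf(os.Stderr, "could not reach myapp: %v\n", err)
		os.Exit(1)
	}
	defer conn.Close()
	fmt.Println("connected to myapp via", conn.RemoteAddr())
}
```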

## How do they work?

Each node in a Kubernetes cluster runs a `service proxy`. This application
watches the Kubernetes master for the addition and removal of `service`
objects and `endpoints` (pods that satisfy a service's label selector), and
maintains a mapping of `service` to list of `endpoints`. It opens a port on the
local node for each `service` and forwards traffic to backends (ostensibly
according to a policy, but the only policy supported for now is round-robin).
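
Conceptually, the round-robin policy just cycles through the current endpoint
list for a `service`. The sketch below is illustrative only - it is not the
proxy's actual code, and the endpoint addresses are invented:

```go
package main

import (
	"fmt"
	"sync"
)

// roundRobin cycles through the endpoints currently known for one service.
type roundRobin struct {
	mu        sync.Mutex
	endpoints []string // "IP:port" of pods matching the service's label selector
	next      int
}

// pick returns the next backend, or false if the service has no endpoints.
func (r *roundRobin) pick() (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.endpoints) == 0 {
		return "", false
	}
	ep := r.endpoints[r.next%len(r.endpoints)]
	r.next++
	return ep, true
}

func main() {
	// Hypothetical pod addresses backing the "myapp" example above.
	rr := &roundRobin{endpoints: []string{"10.244.1.5:9376", "10.244.2.7:9376", "10.244.3.9:9376"}}
	for i := 0; i < 4; i++ {
		backend, _ := rr.pick()
		fmt.Println("forwarding to", backend)
	}
}
```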

When a `pod` is scheduled, the master adds a set of environment variables for
each active `service`. We support both
[Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/)
variables and simpler {SVCNAME}_SERVICE_HOST and {SVCNAME}_SERVICE_PORT
variables. This does imply an ordering requirement - any `service` that a `pod`
wants to access must be created before the `pod` itself, or else the environment
variables will not be populated. This restriction will be removed once DNS for
`services` is supported.
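
For the "myapp" example above, the injected environment would look roughly like
the following. Treat the exact names and values as illustrative - the simple
variables are the {SVCNAME}_SERVICE_HOST/PORT pair described here, and the
remaining names follow the Docker links convention:

```
MYAPP_SERVICE_HOST=10.0.0.11
MYAPP_SERVICE_PORT=8765
MYAPP_PORT=tcp://10.0.0.11:8765
MYAPP_PORT_8765_TCP=tcp://10.0.0.11:8765
MYAPP_PORT_8765_TCP_PROTO=tcp
MYAPP_PORT_8765_TCP_ADDR=10.0.0.11
MYAPP_PORT_8765_TCP_PORT=8765
```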

A `service`, through its label selector, can resolve to 0 or more `endpoints`.
Over the life of a `service`, the set of `pods` which comprise that `service` can
grow, shrink, or turn over completely. Clients will only see issues if they are
actively using a backend when that backend is removed from the `service` (and even
then, open connections will persist for some protocols).

![Services overview diagram](services_overview.png)

## The gory details

The previous information should be sufficient for many people who just want to
use `services`. However, there is a lot going on behind the scenes that may be
worth understanding.

### Avoiding collisions

One of the primary philosophies of Kubernetes is that users should not be
exposed to situations that could cause their actions to fail through no fault
of their own. In this situation, we are looking at network ports - users
should not have to choose a port number if that choice might collide with
another user. That is an isolation failure.

In order to allow users to choose a port number for their `services`, we must
ensure that no two `services` can collide. We do that by allocating each
`service` its own IP address.

### IPs and Portals

Unlike `pod` IP addresses, which actually route to a fixed destination,
`service` IPs are not actually answered by a single host. Instead, we use
`iptables` (packet processing logic in Linux) to define "virtual" IP addresses
which are transparently redirected as needed. We call the tuple of the
`service` IP and the `service` port the `portal`. When clients connect to the
`portal`, their traffic is automatically transported to an appropriate
endpoint. The environment variables for `services` are actually populated in
terms of the portal IP and port. We will be adding DNS support for
`services`, too.

As an example, consider the image processing application described above.
When the backend `service` is created, the Kubernetes master assigns a portal
IP address, for example 10.0.0.1. Assuming the `service` port is 1234, the
portal is 10.0.0.1:1234. The master stores that information, which is then
observed by all of the `service proxy` instances in the cluster. When a proxy
sees a new portal, it opens a new random port, establishes an iptables redirect
from the portal to this new port, and starts accepting connections on it.
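
The redirect itself can be pictured as an `iptables` NAT rule of roughly the
following shape (illustrative only - the proxy's local port 40000 is made up,
and the real rules are managed by the proxy rather than typed by hand):

```
# Send traffic addressed to the portal 10.0.0.1:1234 to the local service
# proxy, which is listening on a randomly chosen port (here 40000).
iptables -t nat -A PREROUTING -p tcp -d 10.0.0.1 --dport 1234 -j REDIRECT --to-ports 40000
```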

When a client connects to `MYAPP_SERVICE_HOST` on the portal port (whether
they know the port statically or look it up as MYAPP_SERVICE_PORT), the
iptables rule kicks in, and redirects the packets to the `service proxy`'s own
port. The `service proxy` chooses a backend, and starts proxying traffic from
the client to the backend.

The net result is that users can choose any `service` port they want without
risk of collision. Clients can simply connect to an IP and port, without
being aware of which `pods` they are accessing.

![Services detailed diagram](services_detail.png)

## Shortcomings

Part of the `service` specification is a `createExternalLoadBalancer` flag,
which tells the master to make an external load balancer that points to the
service. In order to do this today, the service proxy must answer on a known
(i.e. not random) port. In this case, the service port is promoted to the
proxy port. This means that it is still possible for users to collide with
each other's services or with other pods. We expect most `services` will not
set this flag, mitigating the exposure.
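
For reference, the flag sits alongside the other fields of the `service`
specification. A hedged example, reusing the "myapp" definition from above (the
field placement is assumed to mirror the earlier example):

```json
{
  "id": "myapp",
  "selector": {
    "app": "MyApp"
  },
  "containerPort": 9376,
  "protocol": "TCP",
  "port": 8765,
  "createExternalLoadBalancer": true
}
```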

We expect that using iptables for portals will work at small scale, but will
not scale to large clusters with thousands of services. See [the original
design proposal for
portals](https://github.com/GoogleCloudPlatform/kubernetes/issues/1107) for
more details.

## Future work

In the future we envision that the proxy policy can become more nuanced than
simple round-robin balancing, for example master-elected or sharded. We also
envision that some `services` will have "real" load balancers, in which case the
portal will simply transport the packets there.
Binary file added docs/services_detail.png
