Merge pull request kubernetes#481 from bgrant0607/master
Documentation additions and refactoring.
lavalamp committed Jul 16, 2014
2 parents 02b06a2 + 37fb72d commit 0e1636e
Showing 4 changed files with 182 additions and 36 deletions.
69 changes: 33 additions & 36 deletions DESIGN.md
@@ -18,54 +18,57 @@

## Overview

Kubernetes builds on top of [Docker](http://www.docker.io) to construct a clustered container scheduling service. Kubernetes enables users to ask a cluster to run a set of containers. The system will automatically pick worker nodes to run those containers on, which we think of more as "scheduling" than "orchestration". Kubernetes also provides ways for containers to find and communicate with each other and ways to manage both tightly coupled and loosely coupled sets of cooperating containers.
Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers.

Kubernetes uses [Docker](http://www.docker.io) to package, instantiate, and run containerized applications.

Is Kubernetes, then, a Docker "orchestration" system? Yes and no.

Kubernetes establishes robust declarative primitives for maintaining the desired state requested by the user. We see these primitives as the main value added by Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers, require active controllers, not just imperative orchestration.

Kubernetes is primarily targeted at applications comprised of multiple containers, such as elastic, distributed micro-services. It is also designed to facilitate migration of non-containerized application stacks to Kubernetes. It therefore includes abstractions for grouping containers in both loosely coupled and tightly coupled formations, and provides ways for containers to find and communicate with each other in relatively familiar ways.

Kubernetes enables users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on. While Kubernetes's scheduler is currently very simple, we expect it to grow in sophistication over time. Scheduling is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on. Workload-specific requirements will be exposed through the API as necessary.

Architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction.

Kubernetes is intended to run on multiple cloud providers, as well as on physical hosts.

A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones.

Kubernetes is not currently suitable for use by multiple users -- see [Cluster Security](#cluster-security), below.

## Key Concepts

While Docker itself works with individual containers, Kubernetes provides higher-level organizational constructs in support of common cluster-level usage patterns, currently focused on service applications.
While Docker itself works with individual containers, Kubernetes provides higher-level organizational constructs in support of common cluster-level usage patterns, currently focused on service applications, though these could be expanded to batch and test workloads in the future.

### Pods

A `pod` (as in a pod of whales or pea pod) is a relatively tightly coupled group of containers that are scheduled onto the same physical node. In addition to defining the containers that run in the pod, a pod specifies a set of shared storage volumes, and its containers all use the same network namespace/IP (and port space). Pods facilitate data sharing and IPC among their constituents, serve as units of scheduling, deployment, and horizontal scaling/replication, and share fate. In the future, they may share resources ([LPC2013](http://www.linuxplumbersconf.org/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf)).
A _pod_ (as in a pod of whales or pea pod) is a relatively tightly coupled group of containers that are scheduled onto the same host. It models an application-specific "virtual host" in a containerized environment. Pods serve as units of scheduling, deployment, and horizontal scaling/replication, share fate, and share some resources, such as storage volumes and IP addresses.

While pods can be used for vertical integration of application stacks, their primary motivation is to support co-located, co-managed helper programs, such as:
- content management systems, file and data loaders, local cache managers, etc.
- log and checkpoint backup, compression, rotation, snapshotting, etc.
- data change watchers, log tailers, logging and monitoring adapters, event publishers, etc.
- proxies, bridges, and adapters
- controllers, managers, configurators, and updaters
[More details on pods](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/pods.md).
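
A minimal, illustrative sketch of such a pod, written as a JSON manifest in roughly the v1beta1 style (the field names, images, ports, and labels below are assumptions for illustration, not an authoritative schema): an application container and a log-rotation helper that share a volume, an IP, and fate.

```json
{
  "id": "web-pod",
  "kind": "Pod",
  "apiVersion": "v1beta1",
  "labels": { "service": "web", "environment": "production" },
  "desiredState": {
    "manifest": {
      "version": "v1beta1",
      "id": "web-pod",
      "volumes": [{ "name": "app-logs" }],
      "containers": [
        {
          "name": "app",
          "image": "example.com/web-app",
          "ports": [{ "containerPort": 80, "hostPort": 8080 }],
          "volumeMounts": [{ "name": "app-logs", "mountPath": "/var/log/app" }]
        },
        {
          "name": "log-rotator",
          "image": "example.com/log-rotator",
          "volumeMounts": [{ "name": "app-logs", "mountPath": "/var/log/app" }]
        }
      ]
    }
  }
}
```

Both containers are scheduled onto the same host, mount the same volume, and share the pod's network namespace, so the helper can read the application's logs without any cross-host plumbing.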

### Labels

Loosely coupled cooperating pods are organized using key/value labels.

Each pod can have a set of key/value labels set on it, with at most one label with a particular key.
Loosely coupled cooperating pods are organized using key/value _labels_.

Individual labels are used to specify identifying metadata, and to convey the semantic purposes/roles of pods of containers. Examples of typical pod label keys include `environment` (e.g., with values `dev`, `qa`, or `production`), `service`, `tier` (e.g., with values `frontend` or `backend`), `partition`, and `track` (e.g., with values `daily` or `weekly`), but you are free to develop your own conventions.
Individual labels are used to specify identifying metadata, and to convey the semantic purposes/roles of pods of containers. Examples of typical pod label keys include `service`, `environment` (e.g., with values `dev`, `qa`, or `production`), `tier` (e.g., with values `frontend` or `backend`), and `track` (e.g., with values `daily` or `weekly`), but you are free to develop your own conventions.
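
For example, the labels on a single production front-end pod following the conventions above might look like the following (the specific keys and values are illustrative, not mandated by the system):

```json
{
  "service": "guestbook",
  "environment": "production",
  "tier": "frontend",
  "track": "stable"
}
```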

Via a "label selector" the user can identify a set of `pods`. The label selector is the core grouping primitive in Kubernetes. It could be used to identify service replicas or shards, worker pool members, or peers in a distributed application.
Via a _label selector_ the user can identify a set of pods. The label selector is the core grouping primitive in Kubernetes. It could be used to identify service replicas or shards, worker pool members, or peers in a distributed application.

Kubernetes currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s:
- `service`: A service is a configuration unit for the proxies that run on every worker node. It is named and points to one or more Pods.
- `service`: A service is a configuration unit for the [proxies](#kubernetes-proxy) that run on every worker node. It is named and points to one or more pods.
- `replicationController`: A replication controller takes a template and ensures that there is a specified number of "replicas" of that template running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more.

The set of pods that a `service` targets is defined with a label selector. Similarly, the population of pods that a `replicationController` is monitoring is also defined with a label selector.

Pods may be removed from these sets by changing their labels. This flexibility may be used to remove pods from service for debugging, data recovery, etc.

For management convenience and consistency, `services` and `replicationControllers` may themselves have labels and would generally carry the labels their corresponding pods have in common.

Sets identified by labels and label selectors could be overlapping (think Venn diagrams). For instance, a service might point to all pods with `tier in (frontend), environment in (prod)`. Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a `replicationController` (with `replicas` set to 9) for the bulk of the replicas with labels `tier=frontend, environment=prod, track=stable` and another `replicationController` (with `replicas` set to 1) for the canary with labels `tier=frontend, environment=prod, track=canary`. Now the service is covering both the canary and non-canary pods. But you can mess with the `replicationControllers` separately to test things out, monitor the results, etc.
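
A hedged sketch of that canary setup, again as roughly v1beta1-style JSON (the object shapes, ids, and images are illustrative assumptions): one service selecting only on the shared labels, and two replication controllers that differ only in `track`, replica count, and image.

```json
[
  {
    "id": "frontend",
    "kind": "Service",
    "apiVersion": "v1beta1",
    "port": 9998,
    "selector": { "tier": "frontend", "environment": "prod" }
  },
  {
    "id": "frontend-stable",
    "kind": "ReplicationController",
    "apiVersion": "v1beta1",
    "desiredState": {
      "replicas": 9,
      "replicaSelector": { "tier": "frontend", "environment": "prod", "track": "stable" },
      "podTemplate": {
        "labels": { "tier": "frontend", "environment": "prod", "track": "stable" },
        "desiredState": {
          "manifest": {
            "version": "v1beta1",
            "id": "frontend-stable",
            "containers": [{ "name": "php", "image": "example.com/frontend:stable" }]
          }
        }
      }
    }
  },
  {
    "id": "frontend-canary",
    "kind": "ReplicationController",
    "apiVersion": "v1beta1",
    "desiredState": {
      "replicas": 1,
      "replicaSelector": { "tier": "frontend", "environment": "prod", "track": "canary" },
      "podTemplate": {
        "labels": { "tier": "frontend", "environment": "prod", "track": "canary" },
        "desiredState": {
          "manifest": {
            "version": "v1beta1",
            "id": "frontend-canary",
            "containers": [{ "name": "php", "image": "example.com/frontend:canary" }]
          }
        }
      }
    }
  }
]
```

Because the service's selector omits `track`, it spans both controllers, while scaling or rolling back the canary only touches the `frontend-canary` controller.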

Note that the superset described in the previous example is also heterogeneous. In long-lived, highly available, horizontally scaled, distributed, continuously evolving service applications, heterogeneity is inevitable, due to canaries, incremental rollouts, live reconfiguration, simultaneous updates and auto-scaling, hardware upgrades, and so on.

Pods may belong to multiple sets simultaneously, which enables representation of service substructure and/or superstructure. In particular, labels are intended to facilitate the creation of non-hierarchical, multi-dimensional deployment structures. They are useful for a variety of management purposes (e.g., configuration, deployment) and for application introspection and analysis (e.g., logging, monitoring, alerting, analytics). Without the ability to form sets by intersecting labels, many implicitly related, overlapping flat sets would need to be created, for each subset and/or superset desired, which would lose semantic information and be difficult to keep consistent. Purely hierarchically nested sets wouldn't readily support slicing sets across different dimensions.

Since labels can be set at pod creation time, no separate set add/remove operations are necessary, which makes them easier to use than manual set management. Additionally, since labels are directly attached to pods and label selectors are fairly simple, it's easy for users, clients, and tools to determine which sets a pod belongs to. OTOH, with sets formed by just explicitly enumerating members, one would (conceptually) need to search all sets to determine which ones a pod belonged to.
[More details on labels](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/labels.md).

## The Kubernetes Node

When looking at the architecture of the system, we'll break it down to services that run on the worker node and services that play a "master" role.
When looking at the architecture of the system, we'll break it down to services that run on the worker node and services that comprise the cluster-level control plane.

The Kubernetes node has the services necessary to run Docker containers and be managed from the master systems.

@@ -89,9 +92,11 @@ There are 4 ways that a container manifest can be provided to the Kubelet:

Each node also runs a simple network proxy. This reflects `services` as defined in the Kubernetes API on each node and can do simple TCP stream forwarding or round robin TCP forwarding across a set of backends.

## The Kubernetes Master
Service endpoints are currently found through [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) environment variables specifying ports opened by the service proxy. The user must select a unique port on which the proxy exposes the service, as well as the container's port to target.
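
As an illustrative sketch (the field names, ids, and port numbers are assumptions), a service definition in this scheme carries both the cluster-unique proxy port and the target container port:

```json
{
  "id": "redis-master",
  "kind": "Service",
  "apiVersion": "v1beta1",
  "port": 10000,
  "containerPort": 6379,
  "selector": { "service": "redis", "role": "master" }
}
```

Clients on any node can then connect to port 10000 on their local proxy, which forwards to port 6379 in the selected pods; the port is also advertised to other containers through the Docker-links-compatible environment variables mentioned above.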

## The Kubernetes Control Plane

The Kubernetes master is split into a set of components. These work together to provide a unified view of the cluster.
The Kubernetes control plane is split into a set of components, but they all run on a single _master_ node. These work together to provide a unified view of the cluster.

### etcd

@@ -112,14 +117,6 @@ Beyond just servicing REST operations, validating them and storing them in `etcd

The `replicationController` type described above isn't strictly necessary for Kubernetes to be useful. It is really a service that is layered on top of the simple `pod` API. To enforce this layering, the logic for the replicationController is actually broken out into another server. This server watches `etcd` for changes to `replicationController` objects and then uses the public Kubernetes API to implement the replication algorithm.

## Network Model

Kubernetes expands the default Docker networking model. The goal is to have each `pod` have an IP in a shared networking namespace that has full communication with other physical computers and containers across the network. In this way, it becomes much less necessary to map ports.

For the Google Compute Engine cluster configuration scripts, [advanced routing](https://developers.google.com/compute/docs/networking#routing) is set up so that each VM has an extra 256 IP addresses that get routed to it. This is in addition to the 'main' IP address assigned to the VM that is NAT-ed for Internet access. The networking bridge (called `cbr0` to differentiate it from `docker0`) is set up outside of Docker proper and only does NAT for egress network traffic that isn't aimed at the virtual network.

Ports mapped in from the 'main IP' (and hence the internet if the right firewall rules are set up) are proxied in user mode by Docker. In the future, this should be done with `iptables` by either the Kubelet or Docker: [Issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15).

## Release Process

Right now "building" or "releasing" Kubernetes consists of some scripts (in `release/`) to create a `tar` of the necessary data and then uploading it to Google Cloud Storage. In the future we will generate Docker images for the bulk of the above described components: [Issue #19](https://github.com/GoogleCloudPlatform/kubernetes/issues/19).
@@ -135,7 +132,7 @@ The heavy lifting of configuring the VMs is done by [SaltStack](http://www.salts
The bootstrapping works like this:

1. The `kube-up.sh` script uses the GCE [`startup-script`](https://developers.google.com/compute/docs/howtos/startupscript) mechanism for both the master node and the minion nodes.
* For the minion, this simply configures and installs SaltStack. The network range that this minion is assigned is baked into the startup-script for that minion.
* For the minion, this simply configures and installs SaltStack. The network range that this minion is assigned is baked into the startup-script for that minion (see [the networking doc](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/networking.md) for more details).
* For the master, the release files are downloaded from GCS and unpacked. Various parts (specifically the SaltStack configuration) are installed in the right places.
2. SaltStack then installs the necessary servers on each node.
* All go code is currently downloaded to each machine and compiled at install time.
29 changes: 29 additions & 0 deletions docs/labels.md
@@ -0,0 +1,29 @@
# Labels

Service deployments and batch processing pipelines are often multi-dimensional entities (e.g., multiple release tracks, multiple tiers, multiple micro-services per tier). Management often requires cross-cutting operations, which break encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by the infrastructure rather than by users.

Therefore, loosely coupled cooperating pods are organized using key/value _labels_.

Individual labels are used to specify identifying metadata, and to convey the semantic purposes/roles of pods of containers. Examples of typical pod label keys include `service`, `environment` (e.g., with values `dev`, `qa`, or `production`), `tier` (e.g., with values `frontend` or `backend`), and `track` (e.g., with values `daily` or `weekly`), but you are free to develop your own conventions.

Each pod can have a set of key/value labels set on it, with at most one label with a particular key.

Via a "label selector" the user can identify a set of `pods`. The label selector is the core grouping primitive in Kubernetes. It could be used to identify service replicas or shards, worker pool members, or peers in a distributed application.

Kubernetes currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s:
- `service`: A service is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods.
- `replicationController`: A replication controller takes a template and ensures that there is a specified number of "replicas" of that template running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more.

The set of pods that a `service` targets is defined with a label selector. Similarly, the population of pods that a `replicationController` is monitoring is also defined with a label selector.

Pods may be removed from these sets by changing their labels. This flexibility may be used to remove pods from service for debugging, data recovery, etc.

For management convenience and consistency, `services` and `replicationControllers` may themselves have labels and would generally carry the labels their corresponding pods have in common.

Sets identified by labels and label selectors could be overlapping (think Venn diagrams). For instance, a service might point to all pods with `tier in (frontend), environment in (prod)`. Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a `replicationController` (with `replicas` set to 9) for the bulk of the replicas with labels `tier=frontend, environment=prod, track=stable` and another `replicationController` (with `replicas` set to 1) for the canary with labels `tier=frontend, environment=prod, track=canary`. Now the service is covering both the canary and non-canary pods. But you can mess with the `replicationControllers` separately to test things out, monitor the results, etc.

Note that the superset described in the previous example is also heterogeneous. In long-lived, highly available, horizontally scaled, distributed, continuously evolving service applications, heterogeneity is inevitable, due to canaries, incremental rollouts, live reconfiguration, simultaneous updates and auto-scaling, hardware upgrades, and so on.

Pods may belong to multiple sets simultaneously, which enables representation of service substructure and/or superstructure. In particular, labels are intended to facilitate the creation of non-hierarchical, multi-dimensional deployment structures. They are useful for a variety of management purposes (e.g., configuration, deployment) and for application introspection and analysis (e.g., logging, monitoring, alerting, analytics). Without the ability to form sets by intersecting labels, many implicitly related, overlapping flat sets would need to be created, for each subset and/or superset desired, which would lose semantic information and be difficult to keep consistent. Purely hierarchically nested sets wouldn't readily support slicing sets across different dimensions.

Since labels can be set at pod creation time, no separate set add/remove operations are necessary, which makes them easier to use than manual set management. Additionally, since labels are directly attached to pods and label selectors are fairly simple, it's easy for users, clients, and tools to determine which sets a pod belongs to. OTOH, with sets formed by just explicitly enumerating members, one would (conceptually) need to search all sets to determine which ones a pod belonged to.