Update user guide and scheduler documentation to describe node affinity.

Register the image locality priority function, which the original PR that introduced it did not do. Change the zone and region labels to beta.
lavalamp · Feb 19, 2016 · 053f1c6
1 parent 5fe856c
commit 053f1c6
Showing 6 changed files with 107 additions and 15 deletions.
diff --git a/docs/devel/scheduler_algorithm.md b/docs/devel/scheduler_algorithm.md
@@ -44,9 +44,8 @@ The purpose of filtering the nodes is to filter out the nodes that do not meet c
 - `NoVolumeZoneConflict`: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.
 - `PodFitsResources`: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check [QoS proposal](../proposals/resource-qos.md).
 - `PodFitsHostPorts`: Check if any HostPort required by the Pod is already occupied on the node.
-- `PodFitsHost`: Filter out all nodes except the one specified in the PodSpec's NodeName field.
-- `PodSelectorMatches`: Check if the labels of the node match the labels specified in the Pod's `nodeSelector` field ([Here](../user-guide/node-selection/) is an example of how to use `nodeSelector` field).
-- `CheckNodeLabelPresence`: Check if all the specified labels exist on a node or not, regardless of the value.
+- `HostName`: Filter out all nodes except the one specified in the PodSpec's NodeName field.
+- `MatchNodeSelector`: Check if the labels of the node match the labels specified in the Pod's `nodeSelector` field and, as of Kubernetes v1.2, also match the `scheduler.alpha.kubernetes.io/affinity` pod annotation if present. See [here](../user-guide/node-selection/) for more details on both.
 - `MaxEBSVolumeCount`: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume -- see [Amazon's documentation](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html#linux-specific-volume-limits)).  The maximum value can be controlled by setting the `KUBE_MAX_PD_VOLS` environment variable.
 - `MaxGCEPDVolumeCount`: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows -- see [GCE's documentation](https://cloud.google.com/compute/docs/disks/persistent-disks#limits_for_predefined_machine_types)).  The maximum value can be controlled by setting the `KUBE_MAX_PD_VOLS` environment variable.
 
@@ -63,11 +62,11 @@ After the scores of all nodes are calculated, the node with highest score is cho
 Currently, Kubernetes scheduler provides some practical priority functions, including:
 
 - `LeastRequestedPriority`: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption.
-- `CalculateNodeLabelPriority`: Prefer nodes that have the specified label.
 - `BalancedResourceAllocation`: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
-- `CalculateSpreadPriority`: Spread Pods by minimizing the number of Pods belonging to the same service on the same node.  If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
+- `SelectorSpreadPriority`: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node.  If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
 - `CalculateAntiAffinityPriority`: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
 - `ImageLocalityPriority`: Nodes are prioritized based on locality of images requested by a pod. Nodes with larger size of already-installed packages required by the pod will be preferred over nodes with no already-installed packages required by the pod or a small total size of already-installed packages required by the pod.
+- `NodeAffinityPriority`: (Kubernetes v1.2) Implements `preferredDuringSchedulingIgnoredDuringExecution` node affinity; see [here](../user-guide/node-selection/) for more details.
 
 The details of the above priority functions can be found in [plugin/pkg/scheduler/algorithm/priorities](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithm/priorities/). Kubernetes uses some, but not all, of these priority functions by default. You can see which ones are used by default in [plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go). As with predicates, you can combine the above priority functions and assign each a weight factor (a positive number) as you see fit (see [scheduler.md](scheduler.md) for how to customize).
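
The `LeastRequestedPriority` formula quoted in this hunk can be illustrated with a small Go sketch. It is not the scheduler's actual code: the `Resources` type and the function names are hypothetical, and the result is returned as a fraction between 0 and 1 rather than whatever integer scale the real priority function uses.

```go
package main

import "fmt"

// Resources holds CPU (millicores) and memory (bytes) quantities.
// The type and its field names are hypothetical, chosen for this sketch.
type Resources struct {
	MilliCPU int64
	Memory   int64
}

// freeFraction returns (capacity - requested) / capacity, or 0 when the
// node has no capacity or the requests meet or exceed it.
func freeFraction(capacity, requested int64) float64 {
	if capacity <= 0 || requested >= capacity {
		return 0
	}
	return float64(capacity-requested) / float64(capacity)
}

// leastRequestedScore applies the formula from the documentation:
// (capacity - requests of Pods already on the node - request of the new Pod) / capacity,
// computed separately for CPU and memory and then averaged, since the two
// are described as equally weighted.
func leastRequestedScore(capacity, alreadyRequested, pod Resources) float64 {
	cpu := freeFraction(capacity.MilliCPU, alreadyRequested.MilliCPU+pod.MilliCPU)
	mem := freeFraction(capacity.Memory, alreadyRequested.Memory+pod.Memory)
	return (cpu + mem) / 2
}

func main() {
	capacity := Resources{MilliCPU: 4000, Memory: 8 << 30}
	used := Resources{MilliCPU: 1000, Memory: 2 << 30}
	pod := Resources{MilliCPU: 1000, Memory: 2 << 30}
	// Half of the CPU and half of the memory would remain free.
	fmt.Printf("%.2f\n", leastRequestedScore(capacity, used, pod)) // prints 0.50
}
```

A node with a higher value would be preferred, which is what gives this priority function its spreading effect.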
 

diff --git a/docs/user-guide/labels.md b/docs/user-guide/labels.md
@@ -48,6 +48,7 @@ Documentation for other releases can be found at
     - [Set references in API objects](#set-references-in-api-objects)
       - [Service and ReplicationController](#service-and-replicationcontroller)
       - [Job and other new resources](#job-and-other-new-resources)
+      - [Selecting sets of nodes](#selecting-sets-of-nodes)
 
 <!-- END MUNGE: GENERATED_TOC -->
 
@@ -211,6 +212,11 @@ selector:
 
 `matchLabels` is a map of `{key,value}` pairs. A single `{key,value}` in the `matchLabels` map is equivalent to an element of `matchExpressions`, whose `key` field is "key", the `operator` is "In", and the `values` array contains only "value". `matchExpressions` is a list of pod selector requirements. Valid operators include In, NotIn, Exists, and DoesNotExist. The values set must be non-empty in the case of In and NotIn. All of the requirements, from both `matchLabels` and `matchExpressions` are ANDed together -- they must all be satisfied in order to match.
 
+#### Selecting sets of nodes
+
+One use case for selecting over labels is to constrain the set of nodes onto which a pod can schedule.
+See the documentation on [node selection](node-selection/README.md) for more information.
+
 <!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
 [![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/user-guide/labels.md?pixel)]()
 <!-- END MUNGE: GENERATED_ANALYTICS -->
diff --git a/docs/user-guide/node-selection/README.md b/docs/user-guide/node-selection/README.md
@@ -56,7 +56,7 @@ You can verify that it worked by re-running `kubectl get nodes` and checking tha
 
 Take whatever pod config file you want to run, and add a nodeSelector section to it, like this. For example, if this is my pod config:
 
-<pre>
+```yaml
 apiVersion: v1
 kind: Pod
 metadata:
@@ -67,11 +67,11 @@ spec:
   containers:
   - name: nginx
     image: nginx
-</pre>
+```
 
 Then add a nodeSelector like so:
 
-<pre>
+```yaml
 apiVersion: v1
 kind: Pod
 metadata:
@@ -82,13 +82,95 @@ spec:
   containers:
   - name: nginx
     image: nginx
-    imagePullPolicy: IfNotPresent
-  <b>nodeSelector:
-    disktype: ssd</b>
-</pre>
+  nodeSelector:
+    disktype: ssd
+```
 
 When you then run `kubectl create -f pod.yaml`, the pod will get scheduled on the node that you attached the label to! You can verify that it worked by running `kubectl get pods -o wide` and looking at the "NODE" that the pod was assigned to.
 
+#### Alpha feature in Kubernetes v1.2: Node Affinity
+
+During the first half of 2016 we are rolling out a new mechanism, called *affinity*, for controlling which nodes your pods will be scheduled onto.
+Like `nodeSelector`, affinity is based on labels, but it allows you to write much more expressive rules.
+`nodeSelector` will continue to work during the transition, but will eventually be deprecated.
+
+Kubernetes v1.2 offers an alpha version of the first piece of the affinity mechanism, called [node affinity](../../design/nodeaffinity.md).
+There are currently two types of node affinity, called `requiredDuringSchedulingIgnoredDuringExecution` and
+`preferredDuringSchedulingIgnoredDuringExecution`. You can think of them as "hard" and "soft" respectively,
+in the sense that the former specifies rules that *must* be met for a pod to schedule onto a node (just like
+`nodeSelector` but using a more expressive syntax), while the latter specifies *preferences* that the scheduler
+will try to enforce but will not guarantee. The "IgnoredDuringExecution" part of the names means that, similar
+to how `nodeSelector` works, if labels on a node change at runtime such that the rules on a pod are no longer
+met, the pod will still continue to run on the node. In the future we plan to offer
+`requiredDuringSchedulingRequiredDuringExecution` which will be just like `requiredDuringSchedulingIgnoredDuringExecution`
+except that it will evict pods from nodes that cease to satisfy the pods' node affinity requirements.
+
+Node affinity is currently expressed using an annotation on the Pod. In v1.3 it will use a field, and we will
+also introduce the second piece of the affinity mechanism, called [pod affinity](../../design/podaffinity.md),
+which allows you to control whether a pod schedules onto a particular node based on which other pods are
+running on the node, rather than the labels on the node.
+
+Here's an example of a pod that uses node affinity:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: with-labels
+  annotations:
+    scheduler.alpha.kubernetes.io/affinity: >
+      {
+        "nodeAffinity": {
+          "requiredDuringSchedulingIgnoredDuringExecution": {
+            "nodeSelectorTerms": [
+              {
+                "matchExpressions": [
+                  {
+                    "key": "kubernetes.io/e2e-az-name",
+                    "operator": "In",
+                    "values": ["e2e-az1", "e2e-az2"]
+                  }
+                ]
+              }
+            ]
+          },
+          "preferredDuringSchedulingIgnoredDuringExecution": [
+            {
+              "weight": 10,
+              "preference": {"matchExpressions": [
+                {
+                  "key": "foo",
+                  "operator": "In", "values": ["bar"]
+                }
+              ]}
+            }
+          ]
+        }
+      }
+    another-annotation-key: another-annotation-value
+spec:
+  containers:
+  - name: with-labels
+    image: gcr.io/google_containers/pause:2.0
+```
+
+This node affinity rule says the pod can only be placed on a node with a label whose key is
+`kubernetes.io/e2e-az-name` and whose value is either `e2e-az1` or `e2e-az2`. In addition,
+among nodes that meet those criteria, nodes with a label whose key is `foo` and whose
+value is `bar` should be preferred.
+
+If you specify both `nodeSelector` and `nodeAffinity`, *both* must be satisfied for the pod
+to be scheduled onto a candidate node.
+
+### Built-in node labels
+
+In addition to labels you [attach yourself](#step-one-attach-label-to-the-node), nodes come pre-populated
+with a standard set of labels. As of Kubernetes v1.2 these labels are:
+* `kubernetes.io/hostname`
+* `failure-domain.beta.kubernetes.io/zone`
+* `failure-domain.beta.kubernetes.io/region`
+* `beta.kubernetes.io/instance-type`
+
 ### Conclusion
 
 While this example only covered one node, you can attach labels to as many nodes as you want. Then when you schedule a pod with a nodeSelector, it can be scheduled on any of the nodes that satisfy that nodeSelector. Be careful that it will match at least one node, however, because if it doesn't the pod won't be scheduled at all.
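
The "hard" rule in the node affinity example from this file, combined with `nodeSelector`, can be sketched in Go as follows. This is a simplified illustration, not the scheduler's `MatchNodeSelector` predicate. All type and function names are hypothetical, and only the four operators listed in labels.md are handled. The sketch assumes that the expressions inside one `nodeSelectorTerms` entry are ANDed while the entries themselves are ORed.

```go
package main

import "fmt"

// Requirement mirrors one entry of "matchExpressions" (hypothetical type).
type Requirement struct {
	Key      string
	Operator string // "In", "NotIn", "Exists", or "DoesNotExist"
	Values   []string
}

// Term mirrors one entry of "nodeSelectorTerms"; all of its requirements must hold.
type Term []Requirement

func contains(values []string, v string) bool {
	for _, candidate := range values {
		if candidate == v {
			return true
		}
	}
	return false
}

// matches reports whether a node's labels satisfy a single requirement.
func (r Requirement) matches(labels map[string]string) bool {
	value, present := labels[r.Key]
	switch r.Operator {
	case "In":
		return present && contains(r.Values, value)
	case "NotIn":
		return !present || !contains(r.Values, value)
	case "Exists":
		return present
	case "DoesNotExist":
		return !present
	default:
		return false // unknown operators never match in this sketch
	}
}

func termMatches(t Term, labels map[string]string) bool {
	for _, r := range t {
		if !r.matches(labels) {
			return false
		}
	}
	return true
}

// fits requires every nodeSelector pair to be present on the node AND,
// when required terms are given, at least one term to match.
func fits(labels, nodeSelector map[string]string, required []Term) bool {
	for k, want := range nodeSelector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	if len(required) == 0 {
		return true // no node affinity specified
	}
	for _, t := range required {
		if termMatches(t, labels) {
			return true
		}
	}
	return false
}

func main() {
	required := []Term{{{Key: "kubernetes.io/e2e-az-name", Operator: "In", Values: []string{"e2e-az1", "e2e-az2"}}}}
	selector := map[string]string{"disktype": "ssd"}
	node := map[string]string{"kubernetes.io/e2e-az-name": "e2e-az1", "disktype": "ssd"}
	fmt.Println(fits(node, selector, required)) // prints true
}
```

A node in a different zone, or one without the `disktype: ssd` label, would be rejected, because both the `nodeSelector` and the affinity rule must be satisfied.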

diff --git a/pkg/api/unversioned/well_known_labels.go b/pkg/api/unversioned/well_known_labels.go
@@ -16,6 +16,7 @@ limitations under the License.
 
 package unversioned
 
-const LabelZoneFailureDomain = "failure-domain.alpha.kubernetes.io/zone"
-const LabelZoneRegion = "failure-domain.alpha.kubernetes.io/region"
+const LabelHostname = "kubernetes.io/hostname"
+const LabelZoneFailureDomain = "failure-domain.beta.kubernetes.io/zone"
+const LabelZoneRegion = "failure-domain.beta.kubernetes.io/region"
 const LabelInstanceType = "beta.kubernetes.io/instance-type"
diff --git a/pkg/kubelet/kubelet.go b/pkg/kubelet/kubelet.go
@@ -970,7 +970,7 @@ func (kl *Kubelet) initialNodeStatus() (*api.Node, error) {
 	node := &api.Node{
 		ObjectMeta: api.ObjectMeta{
 			Name:   kl.nodeName,
-			Labels: map[string]string{"kubernetes.io/hostname": kl.hostname},
+			Labels: map[string]string{unversioned.LabelHostname: kl.hostname},
 		},
 		Spec: api.NodeSpec{
 			Unschedulable: !kl.registerSchedulable,

diff --git a/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go b/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go
@@ -76,6 +76,10 @@ func init() {
 	// PodFitsPorts has been replaced by PodFitsHostPorts for better user understanding.
 	// For backwards compatibility with 1.0, PodFitsPorts is regitered as well.
 	factory.RegisterFitPredicate("PodFitsPorts", predicates.PodFitsHostPorts)
+	// ImageLocalityPriority prioritizes nodes based on locality of images requested by a pod. Nodes with larger size
+	// of already-installed packages required by the pod will be preferred over nodes with no already-installed
+	// packages required by the pod or a small total size of already-installed packages required by the pod.
+	factory.RegisterPriorityFunction("ImageLocalityPriority", priorities.ImageLocalityPriority, 1)
 }
 
 func defaultPredicates() sets.String {
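
The registration above gives `ImageLocalityPriority` a weight of 1. As a rough illustration of what such a weight means, the Go sketch below combines per-priority scores into a weighted sum and picks the highest-scoring node. The types, the function names, and the 0 to 10 score range are assumptions made for this example and are not taken from the scheduler source. Ties here go to the first node, which may differ from the real scheduler's behavior.

```go
package main

import "fmt"

// weightedPriority pairs a scoring function with its weight.
// Hypothetical type: it only loosely mirrors what RegisterPriorityFunction records.
type weightedPriority struct {
	name   string
	weight int
	score  func(node string) int // assumed to return a value in 0..10
}

// totalScore sums weight * score over all priorities for one node.
func totalScore(node string, priorities []weightedPriority) int {
	total := 0
	for _, p := range priorities {
		total += p.weight * p.score(node)
	}
	return total
}

// pickNode returns the node with the highest total score, or "" if
// there are no candidate nodes. The first node wins ties.
func pickNode(nodes []string, priorities []weightedPriority) string {
	best, bestScore := "", 0
	for i, n := range nodes {
		if s := totalScore(n, priorities); i == 0 || s > bestScore {
			best, bestScore = n, s
		}
	}
	return best
}

func main() {
	// Made-up scores for two nodes, used only to exercise the sketch.
	imageLocality := map[string]int{"node-a": 0, "node-b": 10}
	leastRequested := map[string]int{"node-a": 7, "node-b": 5}
	priorities := []weightedPriority{
		{"ImageLocalityPriority", 1, func(n string) int { return imageLocality[n] }},
		{"LeastRequestedPriority", 1, func(n string) int { return leastRequested[n] }},
	}
	fmt.Println(pickNode([]string{"node-a", "node-b"}, priorities)) // prints node-b
}
```

Raising one priority's weight relative to the others makes its score count for more in the total, which is the knob the documentation refers to when it mentions weight factors.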