
HPA: Consider unready pods separately #33593

Merged

Conversation

DirectXMan12
Contributor

@DirectXMan12 DirectXMan12 commented Sep 27, 2016

Release note:

The Horizontal Pod Autoscaler now takes the readiness of pods into account when calculating desired replicas.

Currently, the HPA considers unready pods the same as ready pods when
looking at their CPU and custom metric usage. However, pods frequently
use extra CPU during initialization, so we want to consider them
separately.

This commit causes the HPA to consider unready pods as having 0 CPU
usage when scaling up, and to ignore them when scaling down. If, when
scaling up, factoring the unready pods in at 0 CPU would cause a
downscale instead, we simply choose not to scale. Otherwise, we scale
up by the reduced amount calculated by factoring the pods in at zero
CPU usage.

Similarly, if we are missing metrics for any pods, those pods will be
considered as having 0% CPU when scaling up, and 100% CPU when
scaling down. As with the unready pods calculation, this cannot change
the direction of the scale.

The effect is that unready pods cause the autoscaler to be a bit more
conservative -- large increases in CPU usage can still cause scales,
even with unready pods in the mix, but will not cause the scale factors
to be as large, in anticipation of the new pods later becoming ready and
handling load.
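
As a rough sketch of that policy (hypothetical helper and variable names, not the controller's actual code), the scale-up path re-spreads the observed usage of the ready pods across ready plus unready pods, while the scale-down path simply drops the unready pods:

// adjustedUsageRatio is a sketch of the policy described above, assuming
// per-pod CPU utilization expressed as a percent of the requested CPU.
func adjustedUsageRatio(readyUtilization []int, unreadyPods int, targetUtilization int) float64 {
	sum := 0
	for _, u := range readyUtilization {
		sum += u
	}
	ready := len(readyUtilization)

	// usage ratio computed from ready pods only
	ratio := float64(sum) / float64(ready*targetUtilization)
	if ratio <= 1.0 {
		// scaling down (or within tolerance): unready pods are simply ignored
		return ratio
	}

	// scaling up: treat unready pods as using 0 CPU by spreading the same
	// total usage across ready + unready pods
	adjusted := float64(sum) / float64((ready+unreadyPods)*targetUtilization)
	if adjusted < 1.0 {
		// the correction may dampen a scale-up, but never turn it into a scale-down
		return 1.0
	}
	return adjusted
}

The missing-metrics rule described above follows the same shape, counting absent pods at 0% of the target on the scale-up side and 100% on the scale-down side.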



@DirectXMan12
Contributor Author

DirectXMan12 commented Sep 27, 2016

cc @kubernetes/autoscaling @fgrzadkowski as per our discussion in last week's SIG autoscaling meeting, this should help put us on the path to reducing and/or eliminating the HPA forbidden windows.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Sep 27, 2016
@DirectXMan12
Contributor Author

looks like the GCI GKE test is hitting #33388

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2016
@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 966a092 to 0cfc463 Compare September 30, 2016 19:10
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2016
@fgrzadkowski
Contributor

FYI @mwielgus @piosz

// UnreadyPodsCount is the total number of running pods not factored into the computation of the returned metric value (due to being unready)
UnreadyPodsCount int

// OldestTimestamp is the time of generation of the olded of the utilization reports for the pods
Contributor

s/olded/oldest

Contributor Author

✔️

GetCPUUtilization(namespace string, selector labels.Selector) (*int, time.Time, error)
// (e.g. 70 means that an average pod uses 70% of the requested CPU), as well as associated information about
// the computation.
GetCPUUtilization(namespace string, selector labels.Selector) (*int, *UtilizationInfo, error)
Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

nit: I would move average cpu utilization to UtilizationInfo

@@ -42,16 +42,28 @@ const (

var heapsterQueryStart = -5 * time.Minute

// UtilizationInfo contains extra metadata about the returned metric values
type UtilizationInfo struct {
// ReadyPodsCount is the total number of pods factored into the computation of the returned metric value
Contributor

nit: move average cpu utilization here

Contributor Author

@DirectXMan12 DirectXMan12 Oct 4, 2016

ack, will do. Ah, I remember why I didn't do that -- you have two different types for CPU vs. custom metric utilization (int and float, respectively), so it makes sense to keep them separate, unless we just want to use float for both (which wouldn't be horrible).

Contributor Author

@jszczepkowski WDYT about using float for both vs just leaving it the way it is?

Contributor

Yes, let's use floats for both.

newUsageRatio := float64(newUtilization) / float64(targetUtilization)

// simply don't scale if the new usage ratio would mean a downscale or no scale
if newUsageRatio < 0 || math.Abs(1.0-newUsageRatio) <= tolerance {
Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

why are you checking here if newUsageRatio is smaller than 0? it doesn't seem to be possible

Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

maybe we should check if newUsageRatio is smaller than 1.0? please add a unit test for this case

Contributor Author

whoops, yeah, that should be < 1.0. Typo
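
For reference, the corrected guard discussed here would read roughly as follows (same variables as the snippet above):

// simply don't scale if correcting for unready pods would mean a downscale, or if we're within tolerance
if newUsageRatio < 1.0 || math.Abs(1.0-newUsageRatio) <= tolerance {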

}

return currentReplicas, &utilization, timestamp, nil
// for unready pods, in the case of scale up, check to see if treating those pods as having a mock
// CPU utilization would change things
Contributor

nit: I would explicitly mention here that the mock utilization is for now equal to 0

}

func (h *HeapsterMetricsClient) getCpuUtilizationForPods(namespace string, selector labels.Selector, podNames map[string]struct{}) (int64, time.Time, error) {
func (h *HeapsterMetricsClient) getCpuUtilizationForPods(namespace string, selector labels.Selector, podNames map[string]struct{}, unreadyPods map[string]struct{}) (int64, time.Time, error) {
Contributor

wouldn't it be better to take only the set of ready pods here? it seems that unready pods are of no use

Contributor Author

we need to have them present for things like the "unexpected metric" check.

@@ -191,50 +217,71 @@ func (h *HeapsterMetricsClient) getCpuUtilizationForPods(namespace string, selec
}
Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

The condition above (line 201: if len(metrics.Items) != len(podNames)) is extremely strange. I would only check in this method if metrics for all ready pods are known. I wouldn't verify unready and pending pods at all. Although this condition was not introduced by this PR, I think this PR is a good opportunity to clean it up.

Contributor Author

not sure which line you're talking about here. It might have gotten changed across a rebase?


metricSpec := getHeapsterCustomMetricDefinition(customMetricName)

podList, err := h.client.Core().Pods(namespace).List(api.ListOptions{LabelSelector: selector})

if err != nil {
return nil, time.Time{}, fmt.Errorf("failed to get pod list: %v", err)
return nil, nil, fmt.Errorf("failed to get pod list: %v", err)
}
podNames := []string{}
Contributor

nit: consider using pkg/util/sets/string.go

Contributor Author

a list is better than a set here

Contributor Author

(because of how we want to use it in getCustomMetricForPods, basically -- it saves us converting to a list for use in Join and the rest of the fetch logic later).

@@ -205,7 +225,7 @@ func (a *HorizontalController) computeReplicasForCustomMetrics(hpa *autoscaling.
a.eventRecorder.Event(hpa, api.EventTypeWarning, "InvalidSelector", errMsg)
Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

So, the computeReplicasForCustomMetrics method will not assume 0 for not-ready pods during scale-up, but will rather take the average from ready pods only? This is a difference between scaling based on CPU and on custom metrics. I think it would be more correct if we also assumed 0 for not-ready pods here, but I don't have a strong opinion. Anyway, it definitely should be documented.

Contributor Author

whoops, that was not intentional ;-). I'll fix it so both behave the same.

}
tc.runTest(t)
}

Contributor

please, add a case when there are unready pods and scale down is triggered

Contributor Author

✔️

@jszczepkowski
Contributor

@DirectXMan12

I don't see any changes that apply the review comments. Have you forgotten to git push?

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 0cfc463 to aee456a Compare October 10, 2016 21:04
@jszczepkowski
Contributor

This PR implements: #30471 (comment)


// GetCustomMetric returns the average value of the given custom metrics from the
// pods picked using the namespace and selector passed as arguments.
GetCustomMetric(customMetricName string, namespace string, selector labels.Selector) (*float64, time.Time, error)
GetCustomMetric(customMetricName string, namespace string, selector labels.Selector) (*UtilizationInfo, error)
Contributor

nit: please extend the comment (similar as for GetCPUUtilization)

Contributor Author

✔️

for _, pod := range podList.Items {
	if pod.Status.Phase == api.PodPending {
		// Skip pending pods.
		continue
	}
	podNames = append(podNames, pod.Name)
	if !api.IsPodReady(&pod) {
Contributor

does it check readiness probe?

Contributor Author

api.IsPodReady is a helper method that looks for the pod readiness condition (which is what the Kubelet updates to indicate the state of the last readiness probe, AFAICT). It's the same method used by the endpoints controller to determine readiness.
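
Roughly, that helper amounts to the following check (a paraphrased sketch, not the exact upstream implementation):

// isPodReady reports whether the pod's PodReady condition is True,
// i.e. the kubelet last saw the readiness probe succeed.
func isPodReady(pod *api.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == api.PodReady {
			return cond.Status == api.ConditionTrue
		}
	}
	return false
}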

} else {

count++
} else if !unreadyPods.Has(m.Name) {
Contributor

@jszczepkowski jszczepkowski Oct 11, 2016

Shouldn't we just ignore such pods? Currently, stats from pods that are not running and not ready block the HPA.

Contributor Author

yep, probably a good idea. Just have to fiddle some things around (see above).

// not missing metrics from ready pods (if we only compare to the number
// of ready pods, we could incorrectly assume we have metrics for all ready
// pods, when in reality we have metrics for a mix of ready and unready pods)
if len(metrics.Items) != readyPods.Len()+unreadyPods.Len() {
Contributor

What if there is a metric for a pod that is not running and not ready, but one of the ready pods has no metric? We will not go into this if, and will not report the metric as missing. This seems to be a bug.

Contributor

Maybe we can execute the code below unconditionally (without the if)?

Contributor Author

I think we only have a problem here if we don't error out on pods which are unknown or not running (currently, we don't enter the if, but we catch it below as an "unexpected pod"). It would be nice to not block on pending pods, so I'll do some fiddling.

@@ -280,6 +316,29 @@ func TestCPUAllPending(t *testing.T) {
tc.runTest(t)
}

func TestCPUAllUnready(t *testing.T) {
Contributor

I would add one more case: pod not running (e.g.: unknown) but with stats.

tc.runTest(t)
}

func TestScaleUpUnreadyNoScaleWouldScaleDown(t *testing.T) {
Contributor

I don't see any difference between TestScaleUpUnreadyNoScale and TestScaleUpUnreadyNoScaleWouldScaleDown. Am I missing something?

Contributor Author

One of them has numbers that would end up causing an adjusted scale ratio of 1.0. The other has numbers that would cause an adjusted scale ratio of < 1.0, and thus tests that when we adjust the scale ratio, we never go below 1.0 (i.e. a scale up will always be a scale up, or no action -- you can never turn a scale up into a scale down).

Contributor

Can you point me to a difference in the definitions? They seem to be the same...

Contributor Author

heh, looks like I mistyped or something there. You're right, there's no difference currently. Good catch.

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch 2 times, most recently from 796eeaa to 4971068 Compare October 11, 2016 17:05
@jszczepkowski
Contributor

@DirectXMan12 @mwielgus
The current HPA behavior (assuming the CPU usage of not-running pods is the same as that of running pods) is troublesome. We observed instability of HPA decisions caused by this. We should raise the priority of this PR.

// keep the reported utilization to be whatever we retrieved from the the ready pods
}

return int32(math.Ceil(usageRatio * float64(currentReplicas))), &utilization, utilizationInfo.OldestTimestamp, nil
Contributor

This condition is incorrect. currentReplicas contains pending pods; however, pending pods are not included in UnreadyPodsCount. So a pending pod will be treated as a pod consuming newUsageRatio CPU, which is wrong. We should treat them as not consuming CPU.

Contributor Author

@DirectXMan12 DirectXMan12 Oct 14, 2016

if we want to get really technical here, pending pods and unready pods should probably be counted the same, but non-pending non-running pods should be different (succeeded, failed, etc), because they'll never transition back to running. I'll see if I can accurately reflect that

Contributor Author

@DirectXMan12 DirectXMan12 Oct 14, 2016

hmm... actually, there's no guarantee pending pods will start -- something could be pending with "RunContainerError", for instance. Perhaps we just say desired = usageRatio * (ready + unready) and then make sure desired >= current?

Contributor Author

(or just assume that all pending pods will eventually start)

Contributor Author

Ok, so, if we split pods into running|ready ("ready") and everything else ("unready"), then I think this calculation becomes ok in the scale-up case. In the scale-down case, we should always base off of running pods (otherwise, we'll kill unready/pending pods only, and then we'll have to scale down again next time).

If we consider the example from #34821 (comment), and assume:

  • target utilization = 100%
  • current utilization = 300%
  • pod 1 is ready, pods 2-10 are unready (pending or otherwise)

Then we get: newUtilization = (300 * 1) / (1 + 9) = 300 / 10 = 30 yielding a new usage ratio of 30 / 100 = 0.3, which gets adjusted to 1.0 (we never scale down when correcting for pending pods), leaving the desired replica count at 10. This is slightly more conservative than simply taking the original usage ratio (3.0) and multiplying by running pods (1, giving a desired replica count of 3), but I think it's ok to be a bit more conservative when doing predictive work. Once the pending pods become ready, the HPA may remove some if they're not doing enough work.
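
Plugging those numbers into a quick sketch of the arithmetic (hypothetical variable names; math is the standard library package):

readyPods, unreadyPods := 1, 9
currentUtilization, targetUtilization := 300, 100 // percent of request

newUtilization := currentUtilization * readyPods / (readyPods + unreadyPods) // (300 * 1) / 10 = 30
newUsageRatio := float64(newUtilization) / float64(targetUtilization)        // 0.3
if newUsageRatio < 1.0 {
	newUsageRatio = 1.0 // never scale down while correcting for unready/pending pods
}
desiredReplicas := int(math.Ceil(newUsageRatio * float64(readyPods+unreadyPods))) // 10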

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 4971068 to e06e55d Compare October 17, 2016 20:28
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2016
@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch 2 times, most recently from 3178632 to 0b3ec15 Compare October 18, 2016 18:22
@DirectXMan12
Contributor Author

@jszczepkowski I've addressed all your comments, I believe. PTAL

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2016
@DirectXMan12
Contributor Author

@k8s-bot gci gke e2e test this

@k8s-ci-robot
Contributor

Jenkins GCE Node e2e failed for commit 73c6fdcacae44a9be373c75f7cdb6f0d4a7aae49. Full PR test history.

The magic incantation to run this job again is @k8s-bot node e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 73c6fdc to 858b282 Compare October 31, 2016 15:47
@DirectXMan12
Contributor Author

had to push a new copy due to some clientset naming changes.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 31, 2016
@k8s-ci-robot
Contributor

Jenkins verification failed for commit 858b282841a60817e07d7ed4ffdd3370f7d084db. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 858b282 to e0010d6 Compare October 31, 2016 16:41
@DirectXMan12
Contributor Author

@jszczepkowski can you re-add LGTM?

@derekwaynecarr
Member

/lgtm

@derekwaynecarr derekwaynecarr self-assigned this Nov 1, 2016
@derekwaynecarr derekwaynecarr added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2016
@jszczepkowski jszczepkowski removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 2, 2016
@jszczepkowski
Contributor

One more thing before merge:
@DirectXMan12 can you update the PR description in #33593 (comment) so that it matches #30471 (comment)? The current description misses the running predicate.

@DirectXMan12
Contributor Author

@jszczepkowski PR description updated

@jszczepkowski jszczepkowski added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 2, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2016
@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from e0010d6 to 2c66d47 Compare November 8, 2016 05:57
Currently, the HPA considers unready pods the same as ready pods when
looking at their CPU and custom metric usage.  However, pods frequently
use extra CPU during initialization, so we want to consider them
separately.

This commit causes the HPA to consider unready pods as having 0 CPU
usage when scaling up, and to ignore them when scaling down.  If, when
scaling up, factoring the unready pods in at 0 CPU would cause a
downscale instead, we simply choose not to scale.  Otherwise, we scale
up by the reduced amount calculated by factoring the pods in at zero
CPU usage.

The effect is that unready pods cause the autoscaler to be a bit more
conservative -- large increases in CPU usage can still cause scales,
even with unready pods in the mix, but will not cause the scale factors
to be as large, in anticipation of the new pods later becoming ready and
handling load.

Similarly, if there are pods for which no metrics have been retrieved,
these pods are treated as having 100% of the requested metric when
scaling down, and 0% when scaling up.  As above, this cannot change the
direction of the scale.

This commit also changes the HPA to ignore superfluous metrics -- as
long as metrics for all ready pods are present, the HPA will make scaling
decisions.  Currently, this only works for CPU.  For custom metrics, we
cannot identify which metrics go to which pods if we get superfluous
metrics, so we abort the scale.
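
A rough sketch of that superfluous-metrics check on the CPU path (hypothetical names, using the sets helper mentioned earlier in the review):

// accept the sample set as long as every ready pod is covered; samples for
// pods we no longer track are simply ignored rather than aborting the scale
missingPods := sets.NewString()
for podName := range readyPods {
	if _, found := metricsByPod[podName]; !found {
		missingPods.Insert(podName)
	}
}
if missingPods.Len() > 0 {
	// fall back to the 0%-on-scale-up / 100%-on-scale-down treatment described above
}
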
@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 8, 2016
@jszczepkowski jszczepkowski added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2016
@jszczepkowski jszczepkowski added this to the v1.5 milestone Nov 8, 2016
@k8s-github-robot

Automatic merge from submit-queue

Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
8 participants