
HPA: Consider unready pods separately #33593

Merged

Conversation

DirectXMan12
Contributor

@DirectXMan12 DirectXMan12 commented Sep 27, 2016

Release note:

The Horizontal Pod Autoscaler now takes the readiness of pods into account when calculating desired replicas.

Currently, the HPA considers unready pods the same as ready pods when
looking at their CPU and custom metric usage. However, pods frequently
use extra CPU during initialization, so we want to consider them
separately.

This commit causes the HPA to consider unready pods as having 0 CPU
usage when scaling up, and to ignore them when scaling down. If, when
scaling up, factoring the unready pods in at 0 CPU would cause a
downscale instead, we simply choose not to scale. Otherwise, we scale
up by the reduced amount calculated by factoring the pods in at zero
CPU usage.

Similarly, if we are missing metrics for any pods, those pods will be
considered as having 0% CPU when scaling up, and 100% CPU when
scaling down. As with the unready pods calculation, this cannot change
the direction of the scale.

The effect is that unready pods cause the autoscaler to be a bit more
conservative -- large increases in CPU usage can still cause scales,
even with unready pods in the mix, but will not cause the scale factors
to be as large, in anticipation of the new pods later becoming ready and
handling load.
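
As a rough sketch of that policy (hypothetical helper and variable names, not the controller's actual code), the scale-up path re-spreads the observed usage of the ready pods across ready plus unready pods, while the scale-down path simply drops the unready pods:

// adjustedUsageRatio is a sketch of the policy described above, assuming
// per-pod CPU utilization expressed as a percent of the requested CPU.
func adjustedUsageRatio(readyUtilization []int, unreadyPods int, targetUtilization int) float64 {
	sum := 0
	for _, u := range readyUtilization {
		sum += u
	}
	ready := len(readyUtilization)

	// usage ratio computed from ready pods only
	ratio := float64(sum) / float64(ready*targetUtilization)
	if ratio <= 1.0 {
		// scaling down (or within tolerance): unready pods are simply ignored
		return ratio
	}

	// scaling up: treat unready pods as using 0 CPU by spreading the same
	// total usage across ready + unready pods
	adjusted := float64(sum) / float64((ready+unreadyPods)*targetUtilization)
	if adjusted < 1.0 {
		// the correction may dampen a scale-up, but never turn it into a scale-down
		return 1.0
	}
	return adjusted
}

The missing-metrics rule described above follows the same shape, counting absent pods at 0% of the target on the scale-up side and 100% on the scale-down side.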



@DirectXMan12
Contributor Author

DirectXMan12 commented Sep 27, 2016

cc @kubernetes/autoscaling @fgrzadkowski as per our discussion in last week's SIG autoscaling meeting, this should help put us on the path to reducing and/or eliminating the HPA forbidden windows.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Sep 27, 2016
@DirectXMan12
Contributor Author

looks like the GCI GKE test is hitting #33388

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 29, 2016
@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 966a092 to 0cfc463 Compare September 30, 2016 19:10
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2016
@fgrzadkowski
Contributor

FYI @mwielgus @piosz

// UnreadyPodsCount is the total number of running pods not factored into the computation of the returned metric value (due to being unready)
UnreadyPodsCount int

// OldestTimestamp is the time of generation of the olded of the utilization reports for the pods
Contributor

s/olded/oldest

Contributor Author

✔️

GetCPUUtilization(namespace string, selector labels.Selector) (*int, time.Time, error)
// (e.g. 70 means that an average pod uses 70% of the requested CPU), as well as associated information about
// the computation.
GetCPUUtilization(namespace string, selector labels.Selector) (*int, *UtilizationInfo, error)
Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

nit: I would move average cpu utilization to UtilizationInfo

@@ -42,16 +42,28 @@ const (

var heapsterQueryStart = -5 * time.Minute

// UtilizationInfo contains extra metadata about the returned metric values
type UtilizationInfo struct {
// ReadyPodsCount is the total number of pods factored into the computation of the returned metric value
Contributor

nit: move average cpu utilization here

Contributor Author

@DirectXMan12 DirectXMan12 Oct 4, 2016

ack, will do. Ah, I remember why I didn't do that -- you have two different types for CPU vs. custom metric utilization (int and float, respectively), so it makes sense to keep them separate, unless we just want to use float for both (which wouldn't be horrible).

Contributor Author

@jszczepkowski WDYT about using float for both vs just leaving it the way it is?

Contributor

Yes, let's use floats for both.

newUsageRatio := float64(newUtilization) / float64(targetUtilization)

// simply don't scale if the new usage ratio would mean a downscale or no scale
if newUsageRatio < 0 || math.Abs(1.0-newUsageRatio) <= tolerance {
Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

why are you checking here if newUsageRatio is smaller than 0? it doesn't seem to be possible

Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

maybe we should check if newUsageRatio is smaller than 1.0? please add a unit test for this case

Contributor Author

whoops, yeah, that should be < 1.0. Typo
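
For reference, the corrected guard discussed here would read roughly as follows (same variables as the snippet above):

// simply don't scale if correcting for unready pods would mean a downscale, or if we're within tolerance
if newUsageRatio < 1.0 || math.Abs(1.0-newUsageRatio) <= tolerance {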

}

return currentReplicas, &utilization, timestamp, nil
// for unready pods, in the case of scale up, check to see if treating those pods as having a mock
// CPU utilization would change things
Contributor

nit: I would explicitly mention here that the mock utilization is for now equal to 0

}

func (h *HeapsterMetricsClient) getCpuUtilizationForPods(namespace string, selector labels.Selector, podNames map[string]struct{}) (int64, time.Time, error) {
func (h *HeapsterMetricsClient) getCpuUtilizationForPods(namespace string, selector labels.Selector, podNames map[string]struct{}, unreadyPods map[string]struct{}) (int64, time.Time, error) {
Contributor

wouldn't it be better to take only the set of ready pods here? it seems that unready pods are of no use

Contributor Author

we need to have them present for things like the "unexpected metric" check.

@@ -191,50 +217,71 @@ func (h *HeapsterMetricsClient) getCpuUtilizationForPods(namespace string, selec
}
Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

The condition above (line 201: if len(metrics.Items) != len(podNames)) is extremely strange. I would only check in this method if metrics for all ready pods are known. I wouldn't verify unready and pending pods at all. Although this condition was not introduced by this PR, I think this PR is a good opportunity to clean it up.

Contributor Author

not sure which line you're talking about here. It might have gotten changed across a rebase?


metricSpec := getHeapsterCustomMetricDefinition(customMetricName)

podList, err := h.client.Core().Pods(namespace).List(api.ListOptions{LabelSelector: selector})

if err != nil {
return nil, time.Time{}, fmt.Errorf("failed to get pod list: %v", err)
return nil, nil, fmt.Errorf("failed to get pod list: %v", err)
}
podNames := []string{}
Contributor

nit: consider using pkg/util/sets/string.go

Contributor Author

a list is better than a set here

Contributor Author

(because of how we want to use it in getCustomMetricForPods, basically -- it saves us converting to a list for use in Join and the rest of the fetch logic later).

@@ -205,7 +225,7 @@ func (a *HorizontalController) computeReplicasForCustomMetrics(hpa *autoscaling.
a.eventRecorder.Event(hpa, api.EventTypeWarning, "InvalidSelector", errMsg)
Contributor

@jszczepkowski jszczepkowski Oct 4, 2016

So, the computeReplicasForCustomMetrics method will not assume 0 for not-ready pods during scale-up, but will rather take the average from ready pods only? This is a difference between scaling based on CPU and on custom metrics. I think it would be more correct if we also assumed 0 for not-ready pods here, but I don't have a strong opinion. Anyway, it definitely should be documented.

Contributor Author

whoops, that was not intentional ;-). I'll fix it so both behave the same.

}
tc.runTest(t)
}

Contributor

please, add a case when there are unready pods and scale down is triggered

Contributor Author

✔️

@jszczepkowski
Contributor

@DirectXMan12

I don't see any changes that apply the review comments. Have you forgotten to git push?

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 0cfc463 to aee456a Compare October 10, 2016 21:04
@jszczepkowski
Contributor

This PR implements: #30471 (comment)


// GetCustomMetric returns the average value of the given custom metrics from the
// pods picked using the namespace and selector passed as arguments.
GetCustomMetric(customMetricName string, namespace string, selector labels.Selector) (*float64, time.Time, error)
GetCustomMetric(customMetricName string, namespace string, selector labels.Selector) (*UtilizationInfo, error)
Contributor

nit: please extend the comment (similar as for GetCPUUtilization)

Contributor Author

✔️

for _, pod := range podList.Items {
	if pod.Status.Phase == api.PodPending {
		// Skip pending pods.
		continue
	}
	podNames = append(podNames, pod.Name)
	if !api.IsPodReady(&pod) {
Contributor

does it check readiness probe?

Contributor Author

api.IsPodReady is a helper method that looks for the pod readiness condition (which is what the Kubelet updates to indicate the state of the last readiness probe, AFAICT). It's the same method used by the endpoints controller to determine readiness.
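
Roughly, that helper amounts to the following check (a paraphrased sketch, not the exact upstream implementation):

// isPodReady reports whether the pod's PodReady condition is True,
// i.e. the kubelet last saw the readiness probe succeed.
func isPodReady(pod *api.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == api.PodReady {
			return cond.Status == api.ConditionTrue
		}
	}
	return false
}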

} else {

count++
} else if !unreadyPods.Has(m.Name) {
Contributor

@jszczepkowski jszczepkowski Oct 11, 2016

Shouldn't we just ignore such pods? Currently, stats from pods that are not running and not ready block the HPA.

Contributor Author

yep, probably a good idea. Just have to fiddle some things around (see above).

// not missing metrics from ready pods (if we only compare to the number
// of ready pods, we could incorrectly assume we have metrics for all ready
// pods, when in reality we have metrics for a mix of ready and unready pods)
if len(metrics.Items) != readyPods.Len()+unreadyPods.Len() {
Contributor

What if there is a metric for a pod that is not running and not ready, but one of the ready pods has no metric? We will not go into this if, and will not report the metric as missing. This seems to be a bug.

Contributor

Maybe we can execute the code below unconditionally (without the if)?

Contributor Author

I think we only have a problem here if we don't error out on pods which are unknown or not running (currently, we don't enter the if, but we catch it below as an "unexpected pod"). It would be nice to not block on pending pods, so I'll do some fiddling.

@@ -280,6 +316,29 @@ func TestCPUAllPending(t *testing.T) {
tc.runTest(t)
}

func TestCPUAllUnready(t *testing.T) {
Contributor

I would add one more case: pod not running (e.g.: unknown) but with stats.

tc.runTest(t)
}

func TestScaleUpUnreadyNoScaleWouldScaleDown(t *testing.T) {
Contributor

I don't see any difference between TestScaleUpUnreadyNoScale and TestScaleUpUnreadyNoScaleWouldScaleDown. Am I missing something?

Contributor Author

One of them has numbers that would end up causing an adjusted scale ratio of 1.0. The other has numbers that would cause an adjusted scale ratio of < 1.0, and thus tests that when we adjust the scale ratio, we never go below 1.0 (i.e. a scale up will always be a scale up, or no action -- you can never turn a scale up into a scale down).

Contributor

Can you point me to a difference in the definitions? They seem to be the same...

Contributor Author

heh, looks like I mistyped or something there. You're right, there's no difference currently. Good catch.

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch 2 times, most recently from 796eeaa to 4971068 Compare October 11, 2016 17:05
@jszczepkowski
Contributor

@DirectXMan12 @mwielgus
The current HPA behavior (assuming the CPU usage of not-running pods is the same as that of running pods) is troublesome. We observed instability of HPA decisions caused by this. We should raise the priority of this PR.

// keep the reported utilization to be whatever we retrieved from the the ready pods
}

return int32(math.Ceil(usageRatio * float64(currentReplicas))), &utilization, utilizationInfo.OldestTimestamp, nil
Contributor

This condition is incorrect. currentReplicas contains pending pods; however, pending pods are not included in UnreadyPodsCount. So a pending pod will be treated as a pod consuming newUsageRatio CPU, which is wrong. We should treat them as not consuming CPU.

Contributor Author

@DirectXMan12 DirectXMan12 Oct 14, 2016

if we want to get really technical here, pending pods and unready pods should probably be counted the same, but non-pending non-running pods should be different (succeeded, failed, etc), because they'll never transition back to running. I'll see if I can accurately reflect that

Contributor Author

@DirectXMan12 DirectXMan12 Oct 14, 2016

hmm... actually, there's no guarantee pending pods will start -- something could be pending with "RunContainerError", for instance. Perhaps we just say desired = usageRatio * (ready + unready) and then make sure desired >= current?

Contributor Author

(or just assume that all pending pods will eventually start)

Contributor Author

Ok, so, if we split pods into running|ready ("ready") and everything else ("unready"), then I think this calculation becomes ok in the scale-up case. In the scale-down case, we should always base off of running pods (otherwise, we'll kill unready/pending pods only, and then we'll have to scale down again next time).

If we consider the example from #34821 (comment), and assume:

  • target utilization = 100%
  • current utilization = 300%
  • pod 1 is ready, pods 2-10 are unready (pending or otherwise)

Then we get: newUtilization = (300 * 1) / (1 + 9) = 300 / 10 = 30 yielding a new usage ratio of 30 / 100 = 0.3, which gets adjusted to 1.0 (we never scale down when correcting for pending pods), leaving the desired replica count at 10. This is slightly more conservative than simply taking the original usage ratio (3.0) and multiplying by running pods (1, giving a desired replica count of 3), but I think it's ok to be a bit more conservative when doing predictive work. Once the pending pods become ready, the HPA may remove some if they're not doing enough work.
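
Plugging those numbers into a quick sketch of the arithmetic (hypothetical variable names; math is the standard library package):

readyPods, unreadyPods := 1, 9
currentUtilization, targetUtilization := 300, 100 // percent of request

newUtilization := currentUtilization * readyPods / (readyPods + unreadyPods) // (300 * 1) / 10 = 30
newUsageRatio := float64(newUtilization) / float64(targetUtilization)        // 0.3
if newUsageRatio < 1.0 {
	newUsageRatio = 1.0 // never scale down while correcting for unready/pending pods
}
desiredReplicas := int(math.Ceil(newUsageRatio * float64(readyPods+unreadyPods))) // 10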

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 4971068 to e06e55d Compare October 17, 2016 20:28
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2016
@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch 2 times, most recently from 3178632 to 0b3ec15 Compare October 18, 2016 18:22
@DirectXMan12
Contributor Author

@jszczepkowski I've addressed all your comments, I believe. PTAL

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2016
@DirectXMan12
Contributor Author

@k8s-bot gci gke e2e test this

@k8s-ci-robot
Contributor

Jenkins GCE Node e2e failed for commit 73c6fdcacae44a9be373c75f7cdb6f0d4a7aae49. Full PR test history.

The magic incantation to run this job again is @k8s-bot node e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 73c6fdc to 858b282 Compare October 31, 2016 15:47
@DirectXMan12
Contributor Author

had to push a new copy due to some clientset naming changes.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 31, 2016
@k8s-ci-robot
Contributor

Jenkins verification failed for commit 858b282841a60817e07d7ed4ffdd3370f7d084db. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from 858b282 to e0010d6 Compare October 31, 2016 16:41
@DirectXMan12
Contributor Author

@jszczepkowski can you re-add LGTM?

@derekwaynecarr
Member

/lgtm

@derekwaynecarr derekwaynecarr self-assigned this Nov 1, 2016
@derekwaynecarr derekwaynecarr added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2016
@jszczepkowski jszczepkowski removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 2, 2016
@jszczepkowski
Contributor

One more thing before merge:
@DirectXMan12 can you update the PR description in #33593 (comment) so that it matches #30471 (comment)? The current description misses the running predicate.

@DirectXMan12
Contributor Author

@jszczepkowski PR description updated

@jszczepkowski jszczepkowski added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 2, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2016
@DirectXMan12 DirectXMan12 force-pushed the feature/hpa-pod-readiness branch from e0010d6 to 2c66d47 Compare November 8, 2016 05:57
Currently, the HPA considers unready pods the same as ready pods when
looking at their CPU and custom metric usage.  However, pods frequently
use extra CPU during initialization, so we want to consider them
separately.

This commit causes the HPA to consider unready pods as having 0 CPU
usage when scaling up, and to ignore them when scaling down.  If, when
scaling up, factoring the unready pods in at 0 CPU would cause a
downscale instead, we simply choose not to scale.  Otherwise, we scale
up by the reduced amount calculated by factoring the pods in at zero
CPU usage.

The effect is that unready pods cause the autoscaler to be a bit more
conservative -- large increases in CPU usage can still cause scales,
even with unready pods in the mix, but will not cause the scale factors
to be as large, in anticipation of the new pods later becoming ready and
handling load.

Similarly, if there are pods for which no metrics have been retrieved,
these pods are treated as having 100% of the requested metric when
scaling down, and 0% when scaling up.  As above, this cannot change the
direction of the scale.

This commit also changes the HPA to ignore superfluous metrics -- as
long as metrics for all ready pods are present, the HPA will make scaling
decisions.  Currently, this only works for CPU.  For custom metrics, we
cannot identify which metrics go to which pods if we get superfluous
metrics, so we abort the scale.
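
A rough sketch of that superfluous-metrics check on the CPU path (hypothetical names, using the sets helper mentioned earlier in the review):

// accept the sample set as long as every ready pod is covered; samples for
// pods we no longer track are simply ignored rather than aborting the scale
missingPods := sets.NewString()
for podName := range readyPods {
	if _, found := metricsByPod[podName]; !found {
		missingPods.Insert(podName)
	}
}
if missingPods.Len() > 0 {
	// fall back to the 0%-on-scale-up / 100%-on-scale-down treatment described above
}
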
@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 8, 2016
@jszczepkowski jszczepkowski added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2016
@jszczepkowski jszczepkowski added this to the v1.5 milestone Nov 8, 2016
@k8s-github-robot

Automatic merge from submit-queue

Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
8 participants