HPA: Consider unready pods separately #33593
Conversation
cc @kubernetes/autoscaling @fgrzadkowski as per our discussion in last week's SIG autoscaling meeting, this should help put us on the path to reducing and/or eliminating the HPA forbidden windows.
looks like the GCI GKE test is hitting #33388
(force-pushed from 966a092 to 0cfc463)
// UnreadyPodsCount is the total number of running pods not factored into the computation of the returned metric value (due to being unready)
UnreadyPodsCount int

// OldestTimestamp is the time of generation of the olded of the utilization reports for the pods
s/olded/oldest
✔️
GetCPUUtilization(namespace string, selector labels.Selector) (*int, time.Time, error)
// (e.g. 70 means that an average pod uses 70% of the requested CPU), as well as associated information about
// the computation.
GetCPUUtilization(namespace string, selector labels.Selector) (*int, *UtilizationInfo, error)
nit: I would move average cpu utilization to UtilizationInfo
@@ -42,16 +42,28 @@ const (

var heapsterQueryStart = -5 * time.Minute

// UtilizationInfo contains extra metadata about the returned metric values
type UtilizationInfo struct {
	// ReadyPodsCount is the total number of pods factored into the computation of the returned metric value
nit: move average cpu utilization here
ack, will do. Ah, I remember why I didn't do that -- you have two different types for CPU vs custom metrics utilization (int and float, respectively), so it makes sense to keep that separate, unless we just want to use float for both (which wouldn't be horrible).
@jszczepkowski WDYT about using float for both vs just leaving it the way it is?
Yes, let's use floats for both.
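For illustration, a minimal sketch of what the struct might look like once both CPU and custom-metric utilization share a float type. The Utilization field and its placement are assumptions; only the count and timestamp fields appear in the diff:

```go
package metrics

import "time"

// UtilizationInfo is a sketch of the metadata struct discussed above; the
// Utilization field is the hypothetical float64 shared by CPU and custom
// metrics, while the other fields mirror those quoted in the diff.
type UtilizationInfo struct {
	// Utilization is the average metric value across the ready pods
	// (for CPU, 70 means an average pod uses 70% of its requested CPU).
	Utilization float64
	// ReadyPodsCount is the total number of pods factored into the computation.
	ReadyPodsCount int
	// UnreadyPodsCount is the total number of running pods excluded from the
	// computation because they were not ready.
	UnreadyPodsCount int
	// OldestTimestamp is the generation time of the oldest utilization report.
	OldestTimestamp time.Time
}
```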
newUsageRatio := float64(newUtilization) / float64(targetUtilization)

// simply don't scale if the new usage ratio would mean a downscale or no scale
if newUsageRatio < 0 || math.Abs(1.0-newUsageRatio) <= tolerance {
why are you checking here if newUsageRatio is smaller than 0? it doesn't seem to be possible
maybe we should check if newUsageRatio is smaller than 1.0? please add a unit test for this case
whoops, yeah, that should be < 1.0. Typo.
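For reference, a small sketch of the corrected guard with the < 1.0 comparison (the helper name and signature here are illustrative, not the controller's actual code):

```go
package metrics

import "math"

// shouldSkipScaleUp reports whether the HPA should skip the scale-up because
// recomputing usage with unready pods counted at 0 CPU would mean a downscale
// (ratio < 1.0) or land within the tolerance band around 1.0.
func shouldSkipScaleUp(newUtilization, targetUtilization int32, tolerance float64) bool {
	newUsageRatio := float64(newUtilization) / float64(targetUtilization)
	return newUsageRatio < 1.0 || math.Abs(1.0-newUsageRatio) <= tolerance
}
```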
}

return currentReplicas, &utilization, timestamp, nil
// for unready pods, in the case of scale up, check to see if treating those pods as having a mock
// CPU utilization would change things
nit: I would explicitly mention here that the mock utilization is, for now, equal to 0
}

func (h *HeapsterMetricsClient) getCpuUtilizationForPods(namespace string, selector labels.Selector, podNames map[string]struct{}) (int64, time.Time, error) {
func (h *HeapsterMetricsClient) getCpuUtilizationForPods(namespace string, selector labels.Selector, podNames map[string]struct{}, unreadyPods map[string]struct{}) (int64, time.Time, error) {
wouldn't it be better to take only the set of ready pods here? it seems that unready pods are of no use
we need to have them present for things like the "unexpected metric" check.
@@ -191,50 +217,71 @@ func (h *HeapsterMetricsClient) getCpuUtilizationForPods(namespace string, selec
}
The condition above (line 201: if len(metrics.Items) != len(podNames)) is extremely strange. I would only check in this method if metrics for all ready pods are known. I wouldn't verify unready and pending pods at all. Although this condition was not introduced by this PR, I think this PR is a good opportunity to clean it up.
not sure which line you're talking about here. It might have gotten changed across a rebase?
metricSpec := getHeapsterCustomMetricDefinition(customMetricName)

podList, err := h.client.Core().Pods(namespace).List(api.ListOptions{LabelSelector: selector})

if err != nil {
	return nil, time.Time{}, fmt.Errorf("failed to get pod list: %v", err)
	return nil, nil, fmt.Errorf("failed to get pod list: %v", err)
}
podNames := []string{}
nit: consider using pkg/util/sets/string.go
a list is better than a set here
(because of how we want to use it in getCustomMetricForPods, basically -- it saves us converting to a list for use in Join and the rest of the fetch logic later).
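Roughly, the list is handy because the Heapster pod-list endpoint takes a comma-joined string of pod names. A sketch under that assumption -- the URL layout shown is illustrative, not the client's exact path:

```go
package metrics

import (
	"fmt"
	"strings"
)

// buildPodListPath shows why podNames is kept as a []string: the names can be
// comma-joined directly for the Heapster pod-list query, with no set-to-slice
// conversion. The path shape here is only illustrative.
func buildPodListPath(namespace, metricName string, podNames []string) string {
	return fmt.Sprintf("/api/v1/model/namespaces/%s/pod-list/%s/metrics/%s",
		namespace, strings.Join(podNames, ","), metricName)
}
```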
@@ -205,7 +225,7 @@ func (a *HorizontalController) computeReplicasForCustomMetrics(hpa *autoscaling.
	a.eventRecorder.Event(hpa, api.EventTypeWarning, "InvalidSelector", errMsg)
So, the computeReplicasForCustomMetrics method will not assume 0 for not-ready pods during scale-up, but will rather take the average from ready pods only? This is a difference between scaling based on CPU and on custom metrics. I think it would be more correct if we also assume 0 for not-ready pods here, but I don't have a strong opinion. Anyway, it definitely should be documented.
whoops, that was not intentional ;-). I'll fix it so both behave the same.
}
tc.runTest(t)
}
please add a case where there are unready pods and scale down is triggered
✔️
I don't see any changes that apply the review comments. Have you forgotten to git push?
(force-pushed from 0cfc463 to aee456a)
This PR implements: #30471 (comment)
// GetCustomMetric returns the average value of the given custom metrics from the
// pods picked using the namespace and selector passed as arguments.
GetCustomMetric(customMetricName string, namespace string, selector labels.Selector) (*float64, time.Time, error)
GetCustomMetric(customMetricName string, namespace string, selector labels.Selector) (*UtilizationInfo, error)
nit: please extend the comment (similar as for GetCPUUtilization)
✔️
for _, pod := range podList.Items {
	if pod.Status.Phase == api.PodPending {
		// Skip pending pods.
		continue
	}
	podNames = append(podNames, pod.Name)
	if !api.IsPodReady(&pod) {
does it check the readiness probe?
api.IsPodReady is a helper method that looks for the pod readiness condition (which is what the Kubelet updates to indicate the state of the last readiness probe, AFAICT). It's the same method used by the endpoints controller to determine readiness.
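For anyone following along, a simplified sketch of what such a readiness check does -- scan the pod's status conditions for PodReady -- not the actual implementation in the api package:

```go
package metrics

import "k8s.io/kubernetes/pkg/api"

// isPodReady reports whether the pod's PodReady condition is currently True,
// which is the condition the Kubelet updates based on the readiness probe.
// This is a simplified sketch of the behaviour described above.
func isPodReady(pod *api.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == api.PodReady {
			return cond.Status == api.ConditionTrue
		}
	}
	return false
}
```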
} else {

	count++
} else if !unreadyPods.Has(m.Name) {
Shouldn't we just ignore such pods? Currently, stats from a pod that is not-running & not-ready block the HPA.
yep, probably a good idea. Just have to fiddle some things around (see above).
// not missing metrics from ready pods (if we only compare to the number
// of ready pods, we could incorrectly assume we have metrics for all ready
// pods, when in reality we have metrics for a mix of ready and unready pods)
if len(metrics.Items) != readyPods.Len()+unreadyPods.Len() {
What if there is a metric for a pod that is not-running & not-ready, but one of the ready pods has no metric? We would not go into this if and would not report the missing metric. This seems to be a bug.
Maybe we can execute the code below unconditionally (without the if)?
I think we only have a problem here if we don't error out on pods which are unknown or not running (currently, we don't enter the if, but we catch it below as an "unexpected pod"). It would be nice to not block on pending pods, so I'll do some fiddling.
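A rough sketch of the check being discussed -- verify that every ready pod has a metric and ignore superfluous metrics from unready or unknown pods (names and types here are illustrative):

```go
package metrics

import "fmt"

// checkReadyPodMetrics errors only when a ready pod is missing a metric;
// extra metrics reported for unready or unknown pods are simply ignored.
func checkReadyPodMetrics(readyPods map[string]struct{}, metricsByPod map[string]float64) error {
	for name := range readyPods {
		if _, ok := metricsByPod[name]; !ok {
			return fmt.Errorf("missing metric for ready pod %q", name)
		}
	}
	return nil
}
```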
@@ -280,6 +316,29 @@ func TestCPUAllPending(t *testing.T) {
	tc.runTest(t)
}

func TestCPUAllUnready(t *testing.T) {
I would add one more case: pod not running (e.g.: unknown) but with stats.
	tc.runTest(t)
}

func TestScaleUpUnreadyNoScaleWouldScaleDown(t *testing.T) {
I don't see any difference between TestScaleUpUnreadyNoScale and TestScaleUpUnreadyNoScaleWouldScaleDown. Am I missing something?
One of them has numbers that would end up causing an adjusted scale ratio of 1.0. The other has numbers that would cause an adjusted scale ratio of < 1.0, and thus tests that when we adjust the scale ratio, we never go below 1.0 (i.e. a scale up will always be a scale up, or no action -- you can never turn a scale up into a scale down).
Can you point me to a difference in the definitions? They seem to be the same...
heh, looks like I mistyped or something there. You're right, there's no difference currently. Good catch.
(force-pushed from 796eeaa to 4971068)
@DirectXMan12 @mwielgus
// keep the reported utilization to be whatever we retrieved from the ready pods
}

return int32(math.Ceil(usageRatio * float64(currentReplicas))), &utilization, utilizationInfo.OldestTimestamp, nil
This condition is incorrect. currentReplicas contains pending pods, however pending pods are not included in UnreadyPodsCount. So, a pending pod will be treated as a pod consuming newUsageRatio CPU, which is wrong. We should treat them as not consuming CPU.
if we want to get really technical here, pending pods and unready pods should probably be counted the same, but non-pending non-running pods should be different (succeeded, failed, etc), because they'll never transition back to running. I'll see if I can accurately reflect that
hmm... actually, there's no guarantee pending pods will start -- something could be pending with "RunContainerError", for instance. Perhaps we just say desired = usageRatio * (ready + unready) and then make sure desired >= current?
(or just assume that all pending pods will eventually start)
Ok, so, if we split pods into running|ready ("ready") and everything else ("unready"), then I think this calculation becomes ok in the scale up case. In the scale down case, we should always base off of running pods (otherwise, we'll kill unready/pending pods only, and the we'll have to scale down again next time).
If we consider the example from #34821 (comment), and assume:
- target utilization = 100%
- current utilization = 300%
- pod 1 is ready, pods 2-10 are unready (pending or otherwise)
Then we get newUtilization = (300 * 1) / (1 + 9) = 300 / 10 = 30, yielding a new usage ratio of 30 / 100 = 0.3, which gets adjusted to 1.0 (we never scale down when correcting for pending pods), leaving the desired replica count at 10. This is slightly more conservative than simply taking the original usage ratio (3.0) and multiplying by running pods (1, giving a desired replica count of 3), but I think it's ok to be a bit more conservative when doing predictive work. Once the pending pods become ready, the HPA may remove some if they're not doing enough work.
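The arithmetic from that example, as a tiny sketch (the function name and constants are illustrative):

```go
package metrics

// adjustedRatioExample reproduces the worked example above: 300% utilization
// averaged over 1 ready pod, a 100% target, and 9 unready pods counted at
// 0 CPU gives newUtilization = (300*1)/(1+9) = 30 and a usage ratio of 0.3,
// which is then clamped to 1.0 so the desired replica count stays at 10.
func adjustedRatioExample() (newUtilization int, newUsageRatio float64) {
	const (
		readyUtilization  = 300 // percent, averaged over ready pods only
		targetUtilization = 100 // percent
		readyPods         = 1
		unreadyPods       = 9
	)
	newUtilization = readyUtilization * readyPods / (readyPods + unreadyPods)
	newUsageRatio = float64(newUtilization) / float64(targetUtilization)
	return newUtilization, newUsageRatio
}
```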
(force-pushed from 4971068 to e06e55d)
(force-pushed from 3178632 to 0b3ec15)
@jszczepkowski I've addressed all your comments, I believe. PTAL
@k8s-bot gci gke e2e test this
Jenkins GCE Node e2e failed for commit 73c6fdcacae44a9be373c75f7cdb6f0d4a7aae49. Full PR test history.
(force-pushed from 73c6fdc to 858b282)
had to push a new copy due to some clientset naming changes.
Jenkins verification failed for commit 858b282841a60817e07d7ed4ffdd3370f7d084db. Full PR test history.
(force-pushed from 858b282 to e0010d6)
@jszczepkowski can you re-add LGTM?
/lgtm
One more thing, before merge:
@jszczepkowski PR description updated
(force-pushed from e0010d6 to 2c66d47)
Currently, the HPA considers unready pods the same as ready pods when looking at their CPU and custom metric usage. However, pods frequently use extra CPU during initialization, so we want to consider them separately.

This commit causes the HPA to consider unready pods as having 0 CPU usage when scaling up, and ignores them when scaling down. If, when scaling up, factoring the unready pods as having 0 CPU would cause a downscale instead, we simply choose not to scale. Otherwise, we simply scale up at the reduced amount calculated by factoring the pods in at zero CPU usage.

The effect is that unready pods cause the autoscaler to be a bit more conservative -- large increases in CPU usage can still cause scales, even with unready pods in the mix, but will not cause the scale factors to be as large, in anticipation of the new pods later becoming ready and handling load.

Similarly, if there are pods for which no metrics have been retrieved, these pods are treated as having 100% of the requested metric when scaling down, and 0% when scaling up. As above, this cannot change the direction of the scale.

This commit also changes the HPA to ignore superfluous metrics -- as long as metrics for all ready pods are present, the HPA will make scaling decisions. Currently, this only works for CPU. For custom metrics, we cannot identify which metrics go to which pods if we get superfluous metrics, so we abort the scale.
Automatic merge from submit-queue
Release note:
Currently, the HPA considers unready pods the same as ready pods when
looking at their CPU and custom metric usage. However, pods frequently
use extra CPU during initialization, so we want to consider them
separately.
This commit causes the HPA to consider unready pods as having 0 CPU
usage when scaling up, and ignores them when scaling down. If, when
scaling up, factoring the unready pods as having 0 CPU would cause a
downscale instead, we simply choose not to scale. Otherwise, we simply
scale up at the reduced amount calculated by factoring the pods in at
zero CPU usage.
Similarly, if we are missing metrics for any pods, those pods will be
considered as having 0% CPU when scaling up, and 100% CPU when
scaling down. As with the unready pods calculation, this cannot change
the direction of the scale.
The effect is that unready pods cause the autoscaler to be a bit more
conservative -- large increases in CPU usage can still cause scales,
even with unready pods in the mix, but will not cause the scale factors
to be as large, in anticipation of the new pods later becoming ready and
handling load.
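To summarize the behaviour described in this note, here is a condensed sketch of the decision logic -- unready pods counted at 0% on scale-up, ignored on scale-down, and the adjustment never allowed to flip the direction of the scale. All names are illustrative; this is a sketch, not the controller's actual code:

```go
package metrics

import "math"

// desiredReplicas sketches the scaling decision described above.
// readyUtilization is the average utilization (in %) over ready pods only;
// unreadyPods are running-but-unready pods, treated as 0% CPU when the raw
// ratio says "scale up" and ignored when it says "scale down".
func desiredReplicas(readyUtilization, targetUtilization, readyPods, unreadyPods, currentReplicas int32, tolerance float64) int32 {
	usageRatio := float64(readyUtilization) / float64(targetUtilization)
	if math.Abs(1.0-usageRatio) <= tolerance {
		return currentReplicas // within tolerance: leave the replica count alone
	}
	if usageRatio > 1.0 && unreadyPods > 0 {
		// Scale up: recompute the utilization with unready pods counted at 0% CPU.
		newUtilization := readyUtilization * readyPods / (readyPods + unreadyPods)
		newRatio := float64(newUtilization) / float64(targetUtilization)
		if newRatio < 1.0 || math.Abs(1.0-newRatio) <= tolerance {
			return currentReplicas // adjustment would mean a downscale: don't scale
		}
		return int32(math.Ceil(newRatio * float64(currentReplicas)))
	}
	// Scale down (or scale up with no unready pods): use the raw ratio.
	return int32(math.Ceil(usageRatio * float64(currentReplicas)))
}
```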