Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPA uses wrong count to calculate target replcias #34821

Closed
mwielgus opened this issue Oct 14, 2016 · 7 comments
Closed

HPA uses wrong count to calculate target replcias #34821

mwielgus opened this issue Oct 14, 2016 · 7 comments
Assignees
Labels
area/controller-manager priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@mwielgus
Copy link
Contributor

The current formula looks like:

currentReplicas := scale.Status.Replicas
[...] 
usageRatio := float64(utilization) / float64(targetUtilization)
    if math.Abs(1.0-usageRatio) > tolerance {
        return int32(math.Ceil(usageRatio * float64(currentReplicas))), &utilization, timestamp, nil
    }

If currentReplicas doesn't match to the number of replicas used to calculate utilization then the result can be way bigger than necessary. For example if there are 10 pods created in apiserver (= current replicas) but only 1 is scheduled and it is running at its last legs (300% of the target utilization) then the target will grow to 30 pods.
20 (30-10) pods get created and are not scheduled and will increase the replicas even further.

Solly you did some fixes in this area: #33593.
Will this PR fix the above scenario?

cc: @DirectXMan12 @jszczepkowski @fgrzadkowski @davidopp

@mwielgus mwielgus added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Oct 14, 2016
@jszczepkowski
Copy link
Contributor

If this is urgent and we need to fix it on 1.4, I can prepare a small, separate PR with the fix. PR #33593 is huge and will not be cherry-picked to 1.4.

@DirectXMan12
Copy link
Contributor

DirectXMan12 commented Oct 14, 2016

#33593 currently experiences this as well, I think, but I'm trying to figure out the right way to solve it. The problem, as I see it, is this: while we can be reasonably certain that an unready pod will probably become ready eventually (certain-ish, at least), a pod in a pending state may effectively be permafailed (e.g. if you don't specify a command, it looks like the pod will report pending, unready, with one of the containers reporting that the runtime indicated that no command was specified). So, we need to figure out what we're going to guess about pending pods. I need to do a bit more digging about pod lifecycle.

Talked to @decarr about lifecycle a bit. think it's safe to lump "phase: running, status: unready" together with "phase: pending", and write total number = running|ready + running|unready + pending.

@davidopp
Copy link
Member

cc/ @lavalamp for the HPA problem we saw today

@derekwaynecarr
Copy link
Member

@DirectXMan12 and I chatted on this today, I think we should split pods into two states based on the pod condition Ready. I would ignore cpu usage associated with pods whose Ready condition was not true.

@fgrzadkowski fgrzadkowski added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Oct 17, 2016
jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue Oct 17, 2016
HPA: fixed wrong count for target replicas calculations (kubernetes#34821).
jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue Oct 18, 2016
HPA: fixed wrong count for target replicas calculations (kubernetes#34821).
jszczepkowski added a commit to jszczepkowski/kubernetes that referenced this issue Oct 18, 2016
HPA: fixed wrong count for target replicas calculations (kubernetes#34821).
k8s-github-robot pushed a commit that referenced this issue Oct 18, 2016
Automatic merge from submit-queue

HPA: fixed wrong count for target replicas calculations.

```release-note
HPA: fixed wrong count for target replicas calculations (#34821).
```

HPA: fixed wrong count for target replicas calculations (#34821).
jessfraz pushed a commit to jessfraz/kubernetes that referenced this issue Oct 18, 2016
HPA: fixed wrong count for target replicas calculations (kubernetes#34821).
rootfs pushed a commit to rootfs/kubernetes that referenced this issue Oct 19, 2016
HPA: fixed wrong count for target replicas calculations (kubernetes#34821).
@jayunit100
Copy link
Member

is this sorted now ?

@DirectXMan12
Copy link
Contributor

DirectXMan12 commented Nov 15, 2016

it should have been fixed as part of #33593

@jszczepkowski
Copy link
Contributor

Fixed by #33593.

shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this issue Dec 1, 2016
HPA: fixed wrong count for target replicas calculations (kubernetes#34821).
tallclair pushed a commit to tallclair/kubernetes that referenced this issue Aug 3, 2020
HPA: fixed wrong count for target replicas calculations (kubernetes#34821).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller-manager priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.
Projects
None yet
Development

No branches or pull requests

8 participants