
Increase supported pod deletion rate #96752

Closed
mborsz opened this issue Nov 20, 2020 · 27 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@mborsz
Member

mborsz commented Nov 20, 2020

What would you like to be added:

Currently the garbage collector issues two API calls per single object deletion. Given that controller performance is limited by the --kube-api-qps flag (e.g. set to 100 in our scalability tests), while we can create pods at a churn of 100 pods/s, they are deleted at only 50 pods/s.

Why is this needed:

  • Supporting high pod churn workloads
  • Making scalability tests faster (while we are able to create pods at 100 pods/s, which takes ~30m in our tests, deleting them happens at 50 pods/s (even less, as the garbage collector also removes object types other than pods), so it takes over an hour).

How to do that:
Ref:

// TODO: It's only necessary to talk to the API server if this is a
// "virtual" node. The local graph could lag behind the real status, but in
// practice, the difference is small.
latest, err := gc.getObject(item.identity)
switch {
case errors.IsNotFound(err):
    // the GraphBuilder can add "virtual" node for an owner that doesn't
    // exist yet, so we need to enqueue a virtual Delete event to remove
    // the virtual node from GraphBuilder.uidToNode.
    klog.V(5).Infof("item %v not found, generating a virtual delete event", item.identity)
    gc.dependencyGraphBuilder.enqueueVirtualDeleteEvent(item.identity)
    // since we're manually inserting a delete event to remove this node,
    // we don't need to keep tracking it as a virtual node and requeueing in attemptToDelete
    item.markObserved()
    return nil
case err != nil:
    return err
}

If I understand this code correctly, the additional GET (not using the cache) has two purposes:

  • handling objects that have not yet been observed by the cache (the API call returns the live object)
  • detecting stale virtual nodes (the API call returns 404, and the virtual node is removed from the graph)

We can change the logic in attemptToDeleteItem to:

  • if the object is present in cache, use it
  • if it's not, fall back to an API call

This way, in most cases we will skip the API call while still supporting both purposes: attemptToDeleteItem will correctly handle objects not yet observed by the cache (the API call will return the object) and will still make it possible to get rid of stale virtual objects (the API call will return 404 and we will remove the virtual object from the graph).
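
A rough sketch of option 1 (gc.cacheLookup is a hypothetical cache accessor; this is not the actual kubernetes/kubernetes code, just an illustration of the shape, reusing the identifiers from the fragment above):

// Consult the shared informer cache first, and only fall back to a live GET
// on a cache miss. On a miss, the object is either not yet observed or the
// node is a stale "virtual" node; only then pay for the API call to decide.
latest, found, err := gc.cacheLookup(item.identity) // hypothetical helper
if err != nil {
    return err
}
if !found {
    latest, err = gc.getObject(item.identity)
    switch {
    case errors.IsNotFound(err):
        // Stale virtual node: remove it from the graph, as in the code above.
        gc.dependencyGraphBuilder.enqueueVirtualDeleteEvent(item.identity)
        item.markObserved()
        return nil
    case err != nil:
        return err
    }
}
// ... continue using "latest" as before ...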

Alternatively, we can try something like the following (see the sketch after this list):

  • in attemptToDeleteItem, simply use the cached object; if it doesn't exist, drop the object from the queue (when we observe the virtual object via watch, it will be added back)
  • add a dedicated worker queue for removing stale virtual objects: if we observe a new virtual object, add it to that queue; each queue loop will:
    • check if the node is still virtual (not observed by the cache): if not, drop it from the queue
    • otherwise, make an API call to validate whether it still exists on the apiserver:
      • if it does, requeue the object with some backoff
      • if not, drop the object from the graph
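
A rough sketch of that dedicated loop, using client-go's workqueue (virtualNode, stillVirtual, existsOnAPIServer and removeFromGraph are hypothetical stand-ins for GC internals, not existing code):

// Dedicated loop for confirming or dropping "virtual" nodes, kept separate
// from attemptToDelete. queue is a k8s.io/client-go/util/workqueue
// rate-limiting queue.
func runVirtualNodeLoop(ctx context.Context, queue workqueue.RateLimitingInterface) {
    for {
        item, shutdown := queue.Get()
        if shutdown {
            return
        }
        node := item.(*virtualNode)
        switch {
        case !stillVirtual(node):
            // The real object has been observed via watch in the meantime:
            // nothing left to verify.
            queue.Forget(item)
        case existsOnAPIServer(ctx, node):
            // Still virtual but confirmed to exist: re-check later with backoff.
            queue.AddRateLimited(item)
        default:
            // The live lookup returned 404: drop the stale virtual node.
            removeFromGraph(node)
            queue.Forget(item)
        }
        queue.Done(item)
    }
}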

The second solution seems a little clearer to me (it decouples the two purposes of the attemptToDeleteItem call) and it avoids unnecessary API calls in case of watch delay (e.g. attemptToDeleteItem being called on a pod that has already been deleted while the watch doesn't reflect this fact yet).

@liggitt @caesarxuchao @wojtek-t WDYT?

@mborsz mborsz added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 20, 2020
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 20, 2020
@k8s-ci-robot
Contributor

@mborsz: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mborsz mborsz changed the title Increase pod deletion rate Increase supported pod deletion rate Nov 20, 2020
@mborsz
Member Author

mborsz commented Nov 20, 2020

/sig scalability

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 20, 2020
@wojtek-t
Member

Re "why":

  • high-churn workloads are something we should be focused on
  • I wouldn't explicitly mention speeding up tests as a goal in itself; we could probably come up with other ways to solve that too.
    But the first one is a compelling enough reason and I fully support it.

Re solution itself:

  • option 1: what if the object present in the cache is stale?
    We currently don't set any preconditions on ResourceVersion, so we would have to start doing that, then retry on failures, etc.
    [All of that seems doable, but I'm just trying to point out that it's not necessarily as easy as it seems.]

  • option 2:
    (a) it has the same problem as above
    (b) you're suggesting an additional periodic polling for virtual objects, which doesn't scale if we have many of those

@mborsz
Member Author

mborsz commented Nov 23, 2020

Re option 1 comment:

  • I think this is an issue also today: it is possible that the object has been modified between the GET and DELETE calls. I agree that moving to the cache may make this happen more often.
  • Is there any way to DELETE an API object conditionally (i.e. delete it only if it has a given resourceVersion)?

Re comment 2b: In fact we currently do periodic polling for virtual objects (here). I just suggest moving this logic to a separate place.

@wojtek-t
Member

Yes - you can set Preconditions in DeleteOptions:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L506
and you can set RV in Preconditions:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L623
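
For illustration, a minimal sketch of a conditional delete using those fields (deletePodIfUnchanged and its arguments are made up for this example; the client-go/apimachinery calls themselves are real):

// metav1 is k8s.io/apimachinery/pkg/apis/meta/v1,
// c is a k8s.io/client-go/kubernetes.Interface.
func deletePodIfUnchanged(ctx context.Context, c kubernetes.Interface, ns, name, observedRV string) error {
    // The DELETE only succeeds if the pod still has the ResourceVersion we
    // observed; otherwise the API server rejects it with a Conflict.
    return c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
        Preconditions: &metav1.Preconditions{ResourceVersion: &observedRV},
    })
}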

Re comment 2b: In fact we currently do periodic polling for virtual objects (here). I just suggest moving this logic to a separate place.

But unless I'm misreading it (sorry, didn't have time to look deep enough), this is happening on deletions.
The way I understood your proposal is to poll them periodically for their whole lifetime?

@mborsz
Member Author

mborsz commented Nov 23, 2020

I think it makes sense to start setting the RV precondition anyway (so that if a given object changes between the GET and the DELETE, we will know whether the DELETE decision is still up to date).

But unless I'm misreading it (sorry, didn't have time to look deep enough), this is happening on deletions.
The way I understood your proposal is to poll them periodically for their whole lifetime?

My proposal is to "do whatever we are doing now, but in a separate loop". This way we avoid making an API call on the standard code path only to support virtual objects. My understanding is that we are requeueing all virtual objects, not only on deletions, but maybe I'm wrong.

@liggitt
Member

liggitt commented Nov 23, 2020

I think it makes sense to start setting the RV precondition anyway (so that if a given object changes between the GET and the DELETE, we will know whether the DELETE decision is still up to date).

That will help in not taking action on stale data, but introduces the possibility that a frequently written object will never be garbage collected.

My understanding is that we are requeueing all virtual objects, not only on deletions

Correct, virtual objects are already polled until they are observed present (via an informer event) or verified absent (via a 404 on the live lookup) and a virtual delete event is queued.

Having just spent a lot of time in the GC code, I'm not sure a third queue will simplify things.

@wojtek-t
Member

I think it makes sense to start setting the RV precondition anyway (so that if a given object changes between the GET and the DELETE, we will know whether the DELETE decision is still up to date).

That will help in not taking action on stale data, but introduces the possibility that a frequently written object will never be garbage collected.

@liggitt - aren't we afraid that we delete something that we shouldn't delete in this case? I would assume that if something is supposed to be GC-ed it shouldn't be frequently changing (or at least it should eventually stop being updated frequently).
Setting preconditions here would make sense to me.

@mborsz
Member Author

mborsz commented Nov 23, 2020

I think it makes sense to start setting the RV precondition anyway (so that if a given object changes between the GET and the DELETE, we will know whether the DELETE decision is still up to date).

That will help in not taking action on stale data, but introduces the possibility that a frequently written object will never be garbage collected.

Good point. Frequent object changes may cause GC starvation. This can be resolved by proper "precondition failed" error handling (i.e. retrying with a GET API call).
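
A hedged sketch of that retry path (deleteWithRVRetry and the three-attempt limit are made up for illustration; apierrors is k8s.io/apimachinery/pkg/api/errors):

func deleteWithRVRetry(ctx context.Context, c kubernetes.Interface, ns, name string) error {
    for attempt := 0; attempt < 3; attempt++ {
        pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            return nil // already gone
        }
        if err != nil {
            return err
        }
        // Re-evaluate ownerReferences on the fresh object here before deleting.
        rv := pod.ResourceVersion
        err = c.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
            Preconditions: &metav1.Preconditions{ResourceVersion: &rv},
        })
        if !apierrors.IsConflict(err) {
            return err // success (nil) or a non-conflict error
        }
        // Conflict: the object changed between GET and DELETE; retry.
    }
    return fmt.Errorf("gave up deleting %s/%s after repeated precondition failures", ns, name)
}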

@mborsz
Member Author

mborsz commented Nov 23, 2020

Another idea: maybe we don't actually need to get rid of the GET call -- it's cheap and not a problem in itself. The only problem with the GET call is that it consumes a token from the rate limiter. Maybe we can somehow change the qps limit for the garbage collector only, allowing it e.g. 200 qps while other controllers still use 100 qps?
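
A sketch of that idea (rest.Config's QPS and Burst fields are real client-go knobs; newGCClient and the specific numbers are made up):

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// newGCClient builds a client for the garbage collector with a higher rate
// limit than the shared --kube-api-qps setting used by other controllers.
func newGCClient(base *rest.Config) (kubernetes.Interface, error) {
    cfg := rest.CopyConfig(base)
    cfg.QPS = 200   // e.g. double the 100 qps used elsewhere
    cfg.Burst = 400 // allow short bursts above the sustained rate
    return kubernetes.NewForConfig(cfg)
}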

@wojtek-t
Member

So that will eventually go away with Priority & Fairness, because we want to get rid of qps limits once it proves to be stable.
[Also note that GET requests are cheaper than mutating requests, but they are still not super cheap (e.g. they are linearized in etcd).]

@liggitt
Member

liggitt commented Nov 23, 2020

@liggitt - aren't we afraid that we delete something that we shouldn't delete in this case?

Yes, but only if the updates are modifying ownerReferences

I would assume that if something is supposed to be GC-ed it shouldn't be frequently changing (or at least it should eventually stop being updated frequently).

I don't think that's necessarily true. Imagine a controller updating some status on an object for as long as it exists, but which depends on the object going away when its owner does.

@wojtek-t
Member

Yes, but only if the updates are modifying ownerReferences

Sure - that is exactly the point I had in mind.
So doesn't that by itself mean we should protect against this case?

I don't think that's necessarily true. Imagine a controller updating some status on an object for as long as it exists, but which depends on the object going away when its owner does.

Sure - but I'm assuming that it shouldn't be updated more frequently than every 10s or so (so far, by frequently I meant more than 1 qps). And with longer delays between updates, we should eventually GC it (the latency of watch + GC processing time should be smaller than that).

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2021
@mborsz
Member Author

mborsz commented Feb 22, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2021
@zorro786

zorro786 commented Apr 7, 2021

@mborsz I am interested in increasing the deletion rate too and am trying to understand your issue here. Does it mean that with default settings at the controller manager (--kube-api-qps=30), at most 15 pods per second will go from RUNNING->TERMINATING? Or is this rate about going from TERMINATING to removal from the API server?

@mborsz
Member Author

mborsz commented Apr 7, 2021

@mborsz I am interested in increasing the deletion rate too and am trying to understand your issue here. Does it mean that with default settings at the controller manager (--kube-api-qps=30), at most 15 pods per second will go from RUNNING->TERMINATING? Or is this rate about going from TERMINATING to removal from the API server?

In that case, only 15 pods per second will be marked to be deleted (by setting deletionTimestamp != 0; the pod phase will not change IIUC). Then kubelets will remove the objects from the API server once they have terminated the processes running on the node.

@zorro786

zorro786 commented Apr 7, 2021

In that case, only 15 pods per second will be marked to be deleted (by setting deletionTimestamp != 0; the pod phase will not change IIUC). Then kubelets will remove the objects from the API server once they have terminated the processes running on the node.

I see, okay. I was interested in increasing the rate at which pod status changes from RUNNING -> TERMINATING after deletion. If I delete, say, 1000 pods at once, their state changes from RUNNING -> TERMINATING in small batches.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 6, 2021
@wojtek-t
Member

wojtek-t commented Jul 6, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 6, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 4, 2021
@wojtek-t
Member

wojtek-t commented Oct 4, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 4, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2022
@wojtek-t
Member

wojtek-t commented Jan 3, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 3, 2022
@mborsz
Member Author

mborsz commented Apr 4, 2022

Fixed in #104966

/close

@k8s-ci-robot
Contributor

@mborsz: Closing this issue.

In response to this:

Fixed in #104966

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
