Simplify taint manager workqueue keys #65350
Conversation
Force-pushed from 029d893 to cc8e67e
/assign @wojtek-t
/retest
/lgtm Thanks!
/retest Review the full test history for this PR.
/retest
/test pull-kubernetes-e2e-kops-aws
I am planning to cherry-pick that one back to 1.8
/retest Review the full test history for this PR.
/retest
/retest Review the full test history for this PR.
1 similar comment
/retest Review the full test history for this PR.
Force-pushed from cc8e67e to d2d3149
/lgtm
/retest Review the full test history for this PR.
The failures look related, which is concerning, because it seems to mean something depends on level-driven processing of the node/pod updates. I will keep digging as I have time.
Force-pushed from a1f0aa9 to 6ff04a0
/milestone v1.13
/retest
Yes - those seem to be real failures, not flakes. So something in this PR is genuinely concerning.
@wojtek-t @liggitt, based on the comment above, I am assuming this issue is serious enough to block taint-based eviction from going to beta in 1.13. @Huang-Wei's comment indicated this PR is a "nice to have" performance enhancement. Considering the scalability regressions we saw with taint nodes in 1.12, I am already concerned by the signals here. Please let us know how critical this fix is for the feature to go to beta in 1.13. Thanks
Force-pushed from 6ff04a0 to 3f8c033
Force-pushed from 3f8c033 to 9503c64
/retest
/test pull-kubernetes-e2e-gke
@AishSundar - the issues with this PR have been resolved and tests are green. This should help with the taint/toleration scalability issues.
/hold cancel
Thanks @liggitt!!
Thanks @liggitt and @Huang-Wei
Great. /lgtm
/test pull-kubernetes-e2e-kops-aws
/retest
1 similar comment
/retest
@@ -211,8 +198,8 @@ func (tc *NoExecuteTaintManager) Run(stopCh <-chan struct{}) {
 	glog.V(0).Infof("Starting NoExecuteTaintManager")

 	for i := 0; i < UpdateWorkerSize; i++ {
-		tc.nodeUpdateChannels = append(tc.nodeUpdateChannels, make(chan *nodeUpdateItem, NodeUpdateChannelSize))
-		tc.podUpdateChannels = append(tc.podUpdateChannels, make(chan *podUpdateItem, podUpdateChannelSize))
+		tc.nodeUpdateChannels = append(tc.nodeUpdateChannels, make(chan nodeUpdateItem, NodeUpdateChannelSize))
Do we still need the channel group? The performance of TaintNodeByCondition is acceptable with one channel.
I would say it's acceptable "for now".
I think that once we are able to bump the default QPS limits in large clusters, it may turn out to be too slow. But yeah - it's mostly guessing...
TaintNodeByCondition is distributing a single update queue among multiple workers.
This has two update queues to coordinate, so the exact solution will look different. I think we can do something simpler than this two-layer async approach, but we need to do it in a way that still lets us fan out across workers and gives node updates priority over pod update events.
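For context on the fan-out point above, here is a rough Go sketch (not this PR's actual code) of a per-worker loop that reads from indexed node and pod update channels and always drains pending node updates before handling a pod update. The type and handler names (taintManagerSketch, handleNodeUpdate, handlePodUpdate) and the item fields are illustrative assumptions.

```go
package taintsketch

// nodeUpdateItem and podUpdateItem are illustrative, name-only update keys.
type nodeUpdateItem struct{ nodeName string }
type podUpdateItem struct{ podName, podNamespace, nodeName string }

// taintManagerSketch stands in for the real taint manager; it owns one
// node-update channel and one pod-update channel per worker.
type taintManagerSketch struct {
	nodeUpdateChannels []chan nodeUpdateItem
	podUpdateChannels  []chan podUpdateItem
}

// worker fans out over the i-th channel pair and gives node updates priority:
// before handling a pod update it drains any queued node updates.
func (tc *taintManagerSketch) worker(i int, stopCh <-chan struct{}) {
	for {
		select {
		case <-stopCh:
			return
		case nodeUpdate := <-tc.nodeUpdateChannels[i]:
			tc.handleNodeUpdate(nodeUpdate)
		case podUpdate := <-tc.podUpdateChannels[i]:
			// Node-level changes (e.g. a new NoExecute taint) are processed
			// before the pod update that arrived alongside them.
		priority:
			for {
				select {
				case nodeUpdate := <-tc.nodeUpdateChannels[i]:
					tc.handleNodeUpdate(nodeUpdate)
				default:
					break priority
				}
			}
			tc.handlePodUpdate(podUpdate)
		}
	}
}

// Placeholder handlers; the real logic would evict pods that do not
// tolerate the node's NoExecute taints.
func (tc *taintManagerSketch) handleNodeUpdate(nodeUpdateItem) {}
func (tc *taintManagerSketch) handlePodUpdate(podUpdateItem)   {}
```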
This greatly reduces the memory footprint of the workqueue keys and allows the queue to dedupe rapid events from the same objects, as other controllers do (see the sketch below).
Builds on #65339
Fixes #65325
/sig scheduling
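As an illustration of the dedupe claim above, here is a minimal hypothetical sketch using client-go's workqueue package (not code from this PR); the nodeUpdateItem shape is an assumed example of a small, name-based key.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// nodeUpdateItem is a hypothetical, name-only queue key: small and comparable,
// so the workqueue can use it directly for deduplication.
type nodeUpdateItem struct {
	nodeName string
}

func main() {
	// A basic FIFO workqueue from client-go: identical items that have not
	// yet been processed collapse into a single pending entry.
	queue := workqueue.New()
	defer queue.ShutDown()

	// Three rapid events for two nodes...
	queue.Add(nodeUpdateItem{nodeName: "node-1"})
	queue.Add(nodeUpdateItem{nodeName: "node-1"}) // deduped with the first add
	queue.Add(nodeUpdateItem{nodeName: "node-2"})

	// ...leave only two entries to work through.
	fmt.Println(queue.Len()) // prints 2
}
```

Because the queue keys are small comparable structs, repeated events for the same node collapse into one pending entry instead of accumulating copies of full API objects.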