Simplify taint manager workqueue keys #65350
Conversation
Force-pushed from 029d893 to cc8e67e
/assign @wojtek-t
/retest
/lgtm Thanks!
/retest Review the full test history for this PR.
/retest
/test pull-kubernetes-e2e-kops-aws
I am planning to cherry-pick that one back to 1.8
/retest Review the full test history for this PR.
/retest
/retest Review the full test history for this PR.
1 similar comment
/retest Review the full test history for this PR.
Force-pushed from cc8e67e to d2d3149
/lgtm
/retest Review the full test history for this PR.
The failures look related, which is concerning, because it seems to mean something depends on level-driven processing of the node/pod updates. I will keep digging as I have time.
Force-pushed from a1f0aa9 to 6ff04a0
/milestone v1.13
/retest
Yes - those seem to be real failures, not flakes. So something in this PR is genuinely concerning.
@wojtek-t @liggitt, based on the comment above, I am assuming this issue is serious enough to block taint-based eviction from going to beta in 1.13. @Huang-Wei's comment indicated this PR is a "nice to have" performance enhancement. Considering the scalability regressions we saw with taint nodes in 1.12, I am already concerned by the signals here. Please let us know how critical this fix is for the feature to go to beta in 1.13. Thanks
Force-pushed from 6ff04a0 to 3f8c033
Force-pushed from 3f8c033 to 9503c64
/retest
/test pull-kubernetes-e2e-gke
@AishSundar - the issues with this PR have been resolved and tests are green. This should help with the taint/toleration scalability issues.
/hold cancel
Thanks @liggitt!!
Thanks @liggitt and @Huang-Wei
Great. /lgtm
/test pull-kubernetes-e2e-kops-aws
/retest
1 similar comment
/retest
@@ -211,8 +198,8 @@ func (tc *NoExecuteTaintManager) Run(stopCh <-chan struct{}) {
 	glog.V(0).Infof("Starting NoExecuteTaintManager")

 	for i := 0; i < UpdateWorkerSize; i++ {
-		tc.nodeUpdateChannels = append(tc.nodeUpdateChannels, make(chan *nodeUpdateItem, NodeUpdateChannelSize))
-		tc.podUpdateChannels = append(tc.podUpdateChannels, make(chan *podUpdateItem, podUpdateChannelSize))
+		tc.nodeUpdateChannels = append(tc.nodeUpdateChannels, make(chan nodeUpdateItem, NodeUpdateChannelSize))
Do we still need the channel group? The performance of TaintNodeByCondition is acceptable with one channel.
I would say it's acceptable "for now".
I think that once we are able to bump the default QPS limits in large clusters, it may turn out to be too slow. But yeah - it's mostly guessing...
TaintNodeByCondition is distributing a single update queue among multiple workers.
This has two update queues to coordinate, so the exact solution will look different. I think we can do something simpler than this two-layer async approach, but we need to do it in a way that still lets us fan out across workers and gives node updates priority over pod update events.
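For context on the fan-out point above, here is a rough Go sketch (not this PR's actual code) of a per-worker loop that reads from indexed node and pod update channels and always drains pending node updates before handling a pod update. The type and handler names (taintManagerSketch, handleNodeUpdate, handlePodUpdate) and the item fields are illustrative assumptions.

```go
package taintsketch

// nodeUpdateItem and podUpdateItem are illustrative, name-only update keys.
type nodeUpdateItem struct{ nodeName string }
type podUpdateItem struct{ podName, podNamespace, nodeName string }

// taintManagerSketch stands in for the real taint manager; it owns one
// node-update channel and one pod-update channel per worker.
type taintManagerSketch struct {
	nodeUpdateChannels []chan nodeUpdateItem
	podUpdateChannels  []chan podUpdateItem
}

// worker fans out over the i-th channel pair and gives node updates priority:
// before handling a pod update it drains any queued node updates.
func (tc *taintManagerSketch) worker(i int, stopCh <-chan struct{}) {
	for {
		select {
		case <-stopCh:
			return
		case nodeUpdate := <-tc.nodeUpdateChannels[i]:
			tc.handleNodeUpdate(nodeUpdate)
		case podUpdate := <-tc.podUpdateChannels[i]:
			// Node-level changes (e.g. a new NoExecute taint) are processed
			// before the pod update that arrived alongside them.
		priority:
			for {
				select {
				case nodeUpdate := <-tc.nodeUpdateChannels[i]:
					tc.handleNodeUpdate(nodeUpdate)
				default:
					break priority
				}
			}
			tc.handlePodUpdate(podUpdate)
		}
	}
}

// Placeholder handlers; the real logic would evict pods that do not
// tolerate the node's NoExecute taints.
func (tc *taintManagerSketch) handleNodeUpdate(nodeUpdateItem) {}
func (tc *taintManagerSketch) handlePodUpdate(podUpdateItem)   {}
```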
This greatly reduces the memory footprint of the workqueue keys and allows the queue to dedupe rapid events from the same objects, as other controllers do (see the sketch below).
Builds on #65339
Fixes #65325
/sig scheduling
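As an illustration of the dedupe claim above, here is a minimal hypothetical sketch using client-go's workqueue package (not code from this PR); the nodeUpdateItem shape is an assumed example of a small, name-based key.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// nodeUpdateItem is a hypothetical, name-only queue key: small and comparable,
// so the workqueue can use it directly for deduplication.
type nodeUpdateItem struct {
	nodeName string
}

func main() {
	// A basic FIFO workqueue from client-go: identical items that have not
	// yet been processed collapse into a single pending entry.
	queue := workqueue.New()
	defer queue.ShutDown()

	// Three rapid events for two nodes...
	queue.Add(nodeUpdateItem{nodeName: "node-1"})
	queue.Add(nodeUpdateItem{nodeName: "node-1"}) // deduped with the first add
	queue.Add(nodeUpdateItem{nodeName: "node-2"})

	// ...leave only two entries to work through.
	fmt.Println(queue.Len()) // prints 2
}
```

Because the queue keys are small comparable structs, repeated events for the same node collapse into one pending entry instead of accumulating copies of full API objects.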