sched: make CycleState's Read()/Write()/Delete() thread-safe #101542
Conversation
@Huang-Wei: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Huang-Wei. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/hold
I don't expect the introduced read lock to add significant overhead. Will run perf tests to verify.
Updated the test result. /cc @alculquicondor @ahg-g
Thank you for pushing this forward @Huang-Wei! 👏
That is very significant.
To make sure it's not an outlier, I will rerun the tests.
Reran SchedulingBasic and SchedulingNodeAffinity 20 times and updated the test result. @alculquicondor
The results are quite noisy, but I am not concerned about performance; rlock should be very lightweight (assuming no contention).
```go
// race-free via cycleState#Lock()/Unlock().
cycleState.RLock()
defer cycleState.RUnlock()

c, err := cycleState.Read(preFilterStateKey)
```
Why not add the lock in the `Read` function?
I was wondering the same, but I think the issue here is that adding the rlock inside Read syncs access to the map only, not the value.
But this is not about locking the value.
True. Adding RLock to `Read` but no lock in `Write` would be strange, though.
> Why not add the lock in the `Read` function?

An enforced read lock inside `Read` is inappropriate, as users should be able to choose what kind of lock to take around `Read`: either a simple read lock, as the in-tree plugins do, or a write lock, as the original issue described.
@ahg-g I think we're on the same page about how to efficiently use CycleState. What we're talking about here is how to deal with misuse of CycleState, from the perspective of defensive programming.
Because I want to remove all Lock-related functions: they are confusing and will likely be misused and cause problems.
This is fair. So @alculquicondor @ahg-g, what's your take on these options:

1. Keep four functions (Read/SafeRead/Write/SafeWrite) and remove Lock/Unlock. In that case, should sequential phases (PreFilter/PreScore) use Write or SafeWrite?
2. Keep two functions (Read/Write) and remove Lock/Unlock, i.e., only offer the locking version of read/write.
I say, let's start with option 2 and evaluate performance. Please make sure preemption performance is not affected too much; I think we would lock for each Pod that is added/removed.
Sounds good; we also need to lock inside `Delete` (not sure if anyone uses it).
SG, will update and run the perf test.
Updated the PR and testing result.
Note that the value of `scheduler_e2e_scheduling` varies a lot due to unpredictable binding duration. In my test, I tried my best to exclude the obvious outliers, but it's still challenging for this metric.
Force-pushed from 0644c7f to 9c45e8a:
- add internal locking to CycleState's Read()/Write()/Delete() functions
- remove Lock() and Unlock() functions
Awesome, I'm happy with the results and the code change. I'll leave the LGTM to @ahg-g
I guess I need to add some "Action Required" notes to the release notes, as we removed the Lock()/Unlock() functions.
/lgtm
/hold cancel
/retest
@Huang-Wei @ahg-g Hello, will this PR be backported to v1.19? If so, when will this be done?
I have found the cherry-pick guide 😂. With its help, I opened a cherry-pick PR, as well as PRs for the release-1.20 and release-1.21 branches.
/remove-kind bug
Unfortunately, I don't think this qualifies for backport: it's not a user-facing bug.
@lining2020x I don't quite think it qualifies for backporting. One reason, as @alculquicondor mentioned, is that it won't impact end-users; the other is that out-of-tree plugin developers still have workarounds: e.g., inject a top-level state object as a placeholder, and compose the locking logic inside that top-level state object in your plugin.
@Huang-Wei I think this is a developer-friendly enhancement to the scheduler framework. According to the test results above, the added read-write locks did not have an obvious negative performance impact. Although I think it is worthwhile, I will close my cherry-pick PRs.
What type of PR is this?
/kind bug
/sig scheduling
What this PR does / why we need it:
This is a followup of the TBD item in #95887 (comment).
The scheduler used and exposed a map called `CycleState` to store transient data within one scheduling cycle. This struct is usually keyed with a particular plugin name plus phase (PreFilter or PreScore), and valued with the concrete data. Without this PR, this struct was written lock-free (behavior 1) and read lock-free (behavior 2).

Behavior 2 should be adjusted to acquire a read lock before using the map, as we shouldn't make that assumption for out-of-tree plugins; that is how the race condition occurred in #95887 (although in-tree plugins don't write data to `CycleState`).

As for behavior 1: although in-tree plugins call Write() only in sequential phases and hence can be lock-free, to eventually remove the misleading Lock()/Unlock() functions from CycleState we decided to make no assumption about which phase it can/should be used in; in other words, we add a lock inside Write() as well.
Note that the racing pattern this PR solves is multiple plugins running in parallel and competing for `CycleState`. Another racing pattern is the same plugin running in parallel (on different nodes), which can be resolved by the plugin's implementation itself (check #96777 for more details).

Which issue(s) this PR fixes:
Fixes #95887
Special notes for your reviewer:
I didn't see an obvious performance regression according to the bench diff:
SchedulingBasic/5000Nodes
Preemption/5000Nodes
SchedulingNodeAffinity/5000Nodes
SchedulingPodAffinity/5000Nodes
SchedulingPodAntiAffinity/5000Nodes
TopologySpreading/5000Nodes
Does this PR introduce a user-facing change?