-
Notifications
You must be signed in to change notification settings - Fork 39.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cpumanager: rollback state if updateContainerCPUSet failed #67430
Conversation
/kind bug |
/assign @ConnorDoyle |
Hmm, looking at static policy, if |
Ah sorry, I spoke too soon. Indeed the container will not start if |
/ok-to-test Thanks for adding this! Please add some tests. |
/sig node |
/retest |
@ConnorDoyle |
lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests look fine to me, just a couple of small comments.
|
||
if !cpus.IsEmpty() { | ||
err = m.updateContainerCPUSet(containerID, cpus) | ||
if err != nil { | ||
glog.Errorf("[cpumanager] AddContainer error: %v", err) | ||
return err | ||
m.policy.RemoveContainer(m.state, containerID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If RemoveContainer also fails here then we possibly leaked some cores? Could we add some warnings in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add some error message here
@@ -175,18 +175,17 @@ func (m *manager) AddContainer(p *v1.Pod, c *v1.Container, containerID string) e | |||
return err | |||
} | |||
cpus := m.state.GetCPUSetOrDefault(containerID) | |||
m.Unlock() | |||
defer m.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous code avoided holding the lock while doing the I/O involved in updateContainerCPUSet (this involves a gRPC call to the CRI.) Could we release and re-acquire the lock if necessary before the RemoveContainer call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update it.
@ConnorDoyle |
} | ||
} else { | ||
glog.V(5).Infof("[cpumanager] update container resources is skipped due to cpu set is empty") | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we talk about this change? Doesn't this break with the none
policy? Can you refresh my memory on why this is in the diff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't change it, just did some refactor with it. Please review the whole function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, sorry.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: choury, ConnorDoyle The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/milestone v1.12 |
@ConnorDoyle: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-kubernetes-e2e-kops-aws |
/test all [submit-queue is verifying that this PR is safe to merge] |
/retest |
Automatic merge from submit-queue (batch tested with PRs 67430, 67550). If you want to cherry-pick this change to another branch, please follow the instructions here. |
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #63018
If
updateContainerCPUSet
failed, the container will start failed. We should rollback the state to avoid CPU leak.Special notes for your reviewer:
Release note: