Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpumanager: rollback state if updateContainerCPUSet failed #67430

Merged
merged 1 commit into from
Aug 22, 2018

Conversation

choury
Copy link
Contributor

@choury choury commented Aug 15, 2018

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #63018

If updateContainerCPUSet failed, the container will start failed. We should rollback the state to avoid CPU leak.
Special notes for your reviewer:

Release note:

cpumanager: rollback state if updateContainerCPUSet failed

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 15, 2018
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 15, 2018
@choury
Copy link
Contributor Author

choury commented Aug 15, 2018

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 15, 2018
@choury
Copy link
Contributor Author

choury commented Aug 15, 2018

/assign @ConnorDoyle
/assign @balajismaniam

@ipuustin
Copy link
Contributor

Hmm, looking at static policy, if updateContainerCPUSet() fails during AddContainer(), and we remove the container from state by calling RemoveContainer(), the reconciliation loop will try call AddContainer() again for it soon afterwards. I'm not sure if this is what we want. The current behavior is that the reconciliation loop will just call updateContainerCPUSet() again for the container.

@ipuustin
Copy link
Contributor

Ah sorry, I spoke too soon. Indeed the container will not start if AddContainer() called from PreStartHook() fails. I was thinking of the other case where it is called from the reconciliation. This patch makes sense to me.

@ConnorDoyle
Copy link
Contributor

/ok-to-test

Thanks for adding this! Please add some tests.

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 15, 2018
@ConnorDoyle
Copy link
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Aug 15, 2018
@balajismaniam
Copy link
Contributor

/retest

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 16, 2018
@choury
Copy link
Contributor Author

choury commented Aug 16, 2018

@ConnorDoyle
tests added

@rphillips
Copy link
Member

lgtm

Copy link
Contributor

@ConnorDoyle ConnorDoyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests look fine to me, just a couple of small comments.


if !cpus.IsEmpty() {
err = m.updateContainerCPUSet(containerID, cpus)
if err != nil {
glog.Errorf("[cpumanager] AddContainer error: %v", err)
return err
m.policy.RemoveContainer(m.state, containerID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If RemoveContainer also fails here then we possibly leaked some cores? Could we add some warnings in that case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add some error message here

@@ -175,18 +175,17 @@ func (m *manager) AddContainer(p *v1.Pod, c *v1.Container, containerID string) e
return err
}
cpus := m.state.GetCPUSetOrDefault(containerID)
m.Unlock()
defer m.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous code avoided holding the lock while doing the I/O involved in updateContainerCPUSet (this involves a gRPC call to the CRI.) Could we release and re-acquire the lock if necessary before the RemoveContainer call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will update it.

@choury
Copy link
Contributor Author

choury commented Aug 20, 2018

@ConnorDoyle
Any other idea?

}
} else {
glog.V(5).Infof("[cpumanager] update container resources is skipped due to cpu set is empty")
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we talk about this change? Doesn't this break with the none policy? Can you refresh my memory on why this is in the diff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change it, just did some refactor with it. Please review the whole function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, sorry.

@ConnorDoyle
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 22, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: choury, ConnorDoyle

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 22, 2018
@ConnorDoyle
Copy link
Contributor

/milestone v1.12

@k8s-ci-robot
Copy link
Contributor

@ConnorDoyle: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

/milestone v1.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@choury
Copy link
Contributor Author

choury commented Aug 22, 2018

/test pull-kubernetes-e2e-kops-aws

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@choury
Copy link
Contributor Author

choury commented Aug 22, 2018

/retest

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 67430, 67550). If you want to cherry-pick this change to another branch, please follow the instructions here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Internal PreStartContainer hook failed: not enough cpus available to satisfy request
7 participants