Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPU manager phase 1 #49186

Closed
wants to merge 37 commits into from
Closed

Conversation

ConnorDoyle
Copy link
Contributor

@ConnorDoyle ConnorDoyle commented Jul 19, 2017

What this PR does / why we need it:

E2e tests are in progress and will be added in a subsequent PR.
User / operator docs for kubernetes/kubernetes.github.io are merged.

Release note:

TODO

Plan to break this into smaller PRs...


Starting the following three pods in order on an 8-core single socket system:

cpuset-visualizer-exclusive-1

apiVersion: v1
kind: Pod
metadata:
  name: cpuset-visualizer-exclusive-1
spec:
  containers:
  - image: quay.io/connordoyle/cpuset-visualizer
    name: cpuset-visualizer-exclusive-1
    resources:
      requests:
        cpu: 1
        memory: "256M"
      limits:
        cpu: 1
        memory: "256M"

cpuset-visualizer-exclusive-3

apiVersion: v1
kind: Pod
metadata:
  name: cpuset-visualizer-exclusive-3
spec:
  containers:
  - image: quay.io/connordoyle/cpuset-visualizer
    name: cpuset-visualizer-exclusive-3
    resources:
      requests:
        cpu: 3
        memory: "256M"
      limits:
        cpu: 3
        memory: "256M"

cpuset-visualizer-exclusive-4

apiVersion: v1
kind: Pod
metadata:
  name: cpuset-visualizer-exclusive-4
spec:
  containers:
  - image: quay.io/connordoyle/cpuset-visualizer
    name: cpuset-visualizer-exclusive-4
    resources:
      requests:
        cpu: 3
        memory: "256M"
      limits:
        cpu: 3
        memory: "256M"

Gave the following allocations. Note that cpu 0 was not used for pods in this case, per kube-reserved and system-reserved as described in the proposal.

topo-order

cc @sjenning @flyingcougar @nqn @teferi @vishh

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jul 19, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ConnorDoyle
We suggest the following additional approvers: gmarek, vishh

Assign the PR to them by writing /assign @gmarek @vishh in a comment when ready.

No associated issue. Update pull-request body to add a reference to an issue, or get approval with /approve no-issue

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jul 19, 2017
// ID of the container to update.
string container_id = 1;
// Resource configuration specific to Linux containers.
LinuxContainerResources linux = 15;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: would make more sense for this to be 2.

@@ -248,6 +251,9 @@ func (m *kubeGenericRuntimeManager) generateLinuxContainerConfig(container *v1.C
cpuShares = milliCPUToShares(cpuRequest.MilliValue())
}
lc.Resources.CpuShares = cpuShares
// SETH set initial cpuset here maybe?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: kill comment?

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2017
return cpuset.NewCPUSet(), fmt.Errorf("not enough cpus available to satisfy request")
}

// TODO(CD): Acquire CPUs topologically instead of sequentially...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: delete these lines

@@ -1241,6 +1257,9 @@ func (kl *Kubelet) initializeModules() error {
// Start resource analyzer
kl.resourceAnalyzer.Start()

// Step 9: Start the CPU manager
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove Step 9 prefix

@@ -145,6 +145,7 @@ func GetHollowKubeletConfig(
c.VolumeStatsAggPeriod.Duration = time.Minute
c.CgroupRoot = ""
c.ContainerRuntime = kubetypes.DockerContainerRuntime
c.CPUManagerPolicy = "noop"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: use const from cpumanager package instead of this raw string

glog.Infof("[cpumanager] detected CPU topology: %v", topo)
newPolicy = NewStaticPolicy(topo)
default:
glog.Warningf("[cpumanager] Invalid policy, fallback to default policy - 'noop'")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: interpolate PolicyNoop const in this log output

@ConnorDoyle
Copy link
Contributor Author

/assign @derekwaynecarr @vishh
/unassign @pmorie @euank

@k8s-ci-robot k8s-ci-robot assigned derekwaynecarr and vishh and unassigned pmorie and euank Jul 19, 2017
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2017
@ConnorDoyle
Copy link
Contributor Author

@teferi and @nqn could you please help turn the CLA gate green? There's a wiki entry on how it works here: https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ

@sjenning
Copy link
Contributor

/assign

@sjenning
Copy link
Contributor

@ConnorDoyle thanks for opening this PR 👍 We do need to start moving this forward and getting review if we are going to make 1.8

I did want to clean up the commit history on our dev branch before opening a PR. I would like to maintain credit though for each author. Maybe we can squash by author and rework the commit messages so that we don't have "wip" commits and whatnot.

@sjenning
Copy link
Contributor

Reviewer note: the first commit in this PR is the commit for #46105. I think we might want to review the CRI change separately in that PR, which needs rebasing if we want to keep it open.

Thoughts?

@ConnorDoyle
Copy link
Contributor Author

ConnorDoyle commented Jul 21, 2017 via email

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2017
@ConnorDoyle
Copy link
Contributor Author

Rebased

@@ -116,8 +115,12 @@ func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandb
m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
return grpc.ErrorDesc(err), ErrCreateContainer
}
err = m.cpuManager.RegisterContainer(pod, container, containerID)
if err != nil {
m.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedToStartContainer, "Failed to register container with CPU manager: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we introduce a new events constant here say events.FailedToRegisterContainer or something ?
since this is not a failure to start the container but a registration failure ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, haven't examined the event classifications closely enough to have an opinion about this. However, we are bailing out of startContainer on account of the registration failure so it doesn't seem too far off.

@@ -810,6 +813,7 @@ func (m *kubeGenericRuntimeManager) removeContainer(containerID string) error {
if err := m.removeContainerLog(containerID); err != nil {
return err
}
m.cpuManager.UnregisterContainer(containerID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we do in case we get an error while unregistering the container ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the error is discarded but that probably is not the right thing to do. In case unregistration fails we probably want to fail removeContainer so that it gets called again. Otherwise we could leak/strand CPUs.

CC @sjenning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been updated so that errors are surfaced to the caller of killContainer and removeContainer.

- Typos
- Type naming.
- File rename.
- Added alpha warning to --cpu-manager-policy config description.
@ConnorDoyle
Copy link
Contributor Author

ConnorDoyle commented Aug 24, 2017

Addressed some of the review comments. The following TODOs are not done yet but are hidden by the GitHub UI:

- Require kubereserved.cpu + systemreserved.cpu > 0.
- Updated shared pool and exclusive allocation accordingly.
@k8s-github-robot k8s-github-robot removed the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Aug 25, 2017
@vishh
Copy link
Contributor

vishh commented Aug 25, 2017 via email

k8s-github-robot pushed a commit that referenced this pull request Aug 25, 2017
Automatic merge from submit-queue (batch tested with PRs 50033, 49988, 51132, 49674, 51207)

Add cpuset helper library.

Blocker for CPU manager #49186 (1 of 6)

@sjenning @derekwaynecarr 

```release-note
NONE
```
@k8s-github-robot
Copy link

@ConnorDoyle PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 29, 2017
k8s-github-robot pushed a commit that referenced this pull request Aug 30, 2017
Automatic merge from submit-queue (batch tested with PRs 51439, 51361, 51140, 51539, 51585)

CPU manager interfaces.

Please review / merge #51132 first.
Blocker for CPU manager #49186 (3 of 6)

@sjenning @derekwaynecarr
@ConnorDoyle
Copy link
Contributor Author

Addressed the vast majority of review comments and broke this down into smaller PRs. Those are all linked from the top PR description comment here. Continuing reviews together with @derekwaynecarr there. Thanks everyone for reviews on this monster PR -- closing.

k8s-github-robot pushed a commit that referenced this pull request Sep 1, 2017
…epolicy

Automatic merge from submit-queue (batch tested with PRs 49971, 51357, 51616, 51649, 51372)

CPU manager wiring and `none` policy

Blocker for CPU manager #49186 (4 of 6)

* Previous PR in this series: #51140
* Next PR in this series: #51180

cc @balajismaniam @derekwaynecarr @sjenning 

**Release note**:

```release-note
NONE
```

TODO:
- [X] In-memory CPU manager state
- [x] Kubelet config value
- [x] Feature gate
- [X] None policy
- [X] Unit tests
- [X] CPU manager instantiation
- [x] Calls into CPU manager from Kubelet container runtime
k8s-github-robot pushed a commit that referenced this pull request Sep 5, 2017
Automatic merge from submit-queue (batch tested with PRs 51180, 51893)

CPU manager static policy

Blocker for CPU manager #49186 (5 of 6)

* Previous PR in this series: #51357
* Next PR in this series: #51041

cc @derekwaynecarr @sjenning @flyingcougar @balajismaniam 

Attempting to be fairly accurate with main authorship at least at a file level -- please let me know if anyone has a better idea on how to improve this.

For posterity, here are the Kubelet flags to run the static policy (assuming `/kube-reserved` is a cgroup that exists for all required controllers)

`--feature-gates=CPUManager=true --cpu-manager-policy=static --cpu-manager-reconcile-period=5s --enforce-node-allocatable=pods,kube-reserved --kube-reserved-cgroup=/kube-reserved --kube-reserved=cpu=500m`

**Release note**:

```release-note
NONE
```
@ConnorDoyle ConnorDoyle deleted the cpu-manager branch September 6, 2017 23:40
k8s-github-robot pushed a commit that referenced this pull request Sep 12, 2017
Automatic merge from submit-queue

Node e2e tests for the CPU Manager. 

**What this PR does / why we need it**:
- Adds node e2e tests for the CPU Manager implementation in #49186.

**Special notes for your reviewer**: 
- Previous PR in this series: #51180
- Only `test/e2e_node/cpu_manager_test.go` must be reviewed as a part of this PR (i.e., the last commit). Rest of the comments belong in #51357 and #51180.
- The tests have been on run on `n1-standard-n4` and `n1-standard-n2` instances on GCE. 

To run this node e2e test, use the following command:
```sh
make test-e2e-node TEST_ARGS='--feature-gates=DynamicKubeletConfig=true' FOCUS="CPU Manager" SKIP="" PARALLELISM=1
```

CC @ConnorDoyle @sjenning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants