Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP: in-place update of pod resources #686

Merged
merged 22 commits into from
Oct 3, 2019

Conversation

vinaykul
Copy link
Member

This PR moves Karol's initial proposal for in-place update of pod resources from k/community to k/enhancements as required by the new process.

This KEP intends to build upon the ideas in proposal for live and in-place vertical scaling and Vertical Resources Scaling in Kubernetes.

This PR also updates the owning-sig to sig-autoscaling, and adds initial set of reviewers @bsalamat , @derekwaynecarr , @dchen1107 from sig-scheduling and sig-node where bulk of the anticipated code changes will happen.

The original pull request by Karol, and associated discussion is here.

CC: @kgolab @bskiba @schylek @bsalamat @dchen1107 @derekwaynecarr @karthickrajamani @YoungjaeLee @resouer

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jan 12, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/pm size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 12, 2019
@vinaykul
Copy link
Member Author

@k8s-ci-robot fixed CLA email address to match commit email.
/check-cla
/assign @DirectXMan12

@vinaykul
Copy link
Member Author

/check-cla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 12, 2019
@thockin thockin self-assigned this Jan 14, 2019
@thockin
Copy link
Member

thockin commented Jan 14, 2019

Is this moving a (partially) consensus-reached KEP to the new repo ir is this the request for deeper review? If the latter, please adjust the title to reflect that this IS the KEP.

@vinaykul
Copy link
Member Author

Is this moving a (partially) consensus-reached KEP to the new repo ir is this the request for deeper review? If the latter, please adjust the title to reflect that this IS the KEP.

@thockin Yes this just moves the earlier draft KEP skeleton by @kgolab from k/community to k/enhancements in order to get the ball rolling per new process. The only change I made is to set the owning sig to autoscaling as I think it is the main driver for this feature.

I'll append my previous comments on this once I bring this up for an initial look with sig-node and sig-scheduling and get their official buy-off. This will evolve further as we merge design ideas and fill in details with community consensus.

Thanks to the above:
* PodSpec.Container.ResourceRequirements becomes purely a declaration,
denoting **desired** state of the Pod,
* PodStatus.ContainerStatus.ResourceAllocated (new object) denotes **actual**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a corresponding CRI change for review? That shouldn't block merging this KEP draft but it is going to be important for implementers as it would need to be reviewed for Linux+Windows compatibility and runtime compatibility (dockershim/kata/hyper-v)

cc @feiskyer

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very confident we have reached agreement that it's the direction we will go. If yes, a CRI change should be included here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PatrickLang In our implementation, is was sufficient to make changes in kubelet to detect a resources-only container spec update, and call UpdateContainerResources CRI API without any changes to the CRI itself. We have tested it with docker, we are yet to try kata.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinaykul Kata does not update "container" resource for now :-) Also, it's related to how CRI shim is implemented, in containerd shimv2 work, I remembered we didn't handle this update at least month ago.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While if we decide to go with the current narrative in this KEP, CRI do need to be updated (new filed: ResourceAllocated) and CRI shim & crictl maintainers should be notified about the incompatible change of meaning of LinuxContainerResources .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinaykul Kata does not update "container" resource for now :-) Also, it's related to how CRI shim is implemented, in containerd shimv2 work, I remembered we didn't handle this update at least month ago.

Ah that's good to know. I last toyed with Kata at GA, and they were working on getting cpu/mem update working.

I tried out krt1.4.1 earlier today, and found OCI mostly works. CPU / memory increase & decrease reflects in the cgroup inside kata container and enforced, but VSZ/RSS isn't lowered when memory is lowered, and Get doesn't reflect the actual usage.

I'll try k8s-crio-kata tomorrow or friday to see how well crio-oci translation works and identify gaps. It probably won't work if containerd shim doesnt handle it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While if we decide to go with the current narrative in this KEP, CRI do need to be updated (new filed: ResourceAllocated) and CRI shim & crictl maintainers should be notified about the incompatible change of meaning of LinuxContainerResources .

@resouer kata-runtime 1.4.1 seems to handle updating cpu/memory via CRI-O (below example)
image

Regarding CRI, kubelet would merely switch from using PodSpec.Container.Container.ResourceRequirements to PodStatus.Container.ResourceAllocated to get the limits when invoking the CRI API in this function for e.g: https://github.com/Huawei-PaaS/kubernetes/blob/vert-scaling-cp-review/pkg/kubelet/kuberuntime/kuberuntime_container.go#L619

Did I miss something in thinking that CRI update is not necessary?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I just check the kata-runtime's api and now every CRI shim could support container level resource adjustment. In that case, no CRI change is required, we can simply just use ResourceAllocated to generate containerResources.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about "every CRI"? I'll try to look on that deeper during next days, but almost for sure that will break https://github.com/Mirantis/virtlet/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about "every CRI"? I'll try to look on that deeper during next days, but almost for sure that will break https://github.com/Mirantis/virtlet/

@PatrickLang Over the past week, I experimented with WinServer 2019 (for another project planning that I'm working on), and got the chance to try a windows cross-compile kubelet of my implementation and take a closer look at how to set the updated limits. Windows does create the container with specified limits (perhaps using information from ContainerConfig.WindowsContainerConfig.WindowsContainerResources struct).

For cleanliness, I do see that we should update the CRI API to specify ContainerResources instead of LinuxContainerResources (which would have pointers to LinuxContainerResources or WindowsContainerResources similar to ContainerConfig). Do you think containerID + WindowsContainerResources is sufficient for Windows to successfully update the limits?

@jellonek I've not looked at virtlet. If you have had the chance, can you please check if its CRI shim is able to use LinuxContainerResources?

@DirectXMan12
Copy link
Contributor

/assign @mwielgus
/unassign @DirectXMan12

@derekwaynecarr
Copy link
Member

/assign

@kgolab
Copy link
Contributor

kgolab commented Sep 23, 2019

@vinaykul , thanks a lot for driving this forward and also simplifying the initial design.

Overall the KEP looks good to me already from the autoscaling perspective.
I've got a few smaller comments which could as well be addressed later, after the KEP is merged.
@mwielgus , not sure if you need something more for a formal approval.

@vinaykul
Copy link
Member Author

@vinaykul , thanks a lot for driving this forward and also simplifying the initial design.

Overall the KEP looks good to me already from the autoscaling perspective.
I've got a few smaller comments which could as well be addressed later, after the KEP is merged.
@mwielgus , not sure if you need something more for a formal approval.

@kgolab Thanks for the review, please see my responses above. The goal has been to keep things simple for Kubelet, and so we removed proposed changes that were not critical to the base resize operation. If possible, could you please join tomorrow 10 am PST sig-node weekly meeting to review the above concerns?

In today's sig-autoscaling meeting, I discussed this we will list @mwielgus as final approver in the KEP, he plans to approve it after your LGTM.

@dchen1107
Copy link
Member

/approve

@vinaykul I approved your KEP from SIG Node to unblock the ongoing API review since we are converging on the high level design. There are some small implementation details can be addressed later. Thanks for taking this project to SIG Node last couple of months, and leading the design conversation at our weekly meeting. Thanks for your patience.

@vinaykul
Copy link
Member Author

/assign @derekwaynecarr

@ahg-g
Copy link
Member

ahg-g commented Sep 28, 2019

/approve

Approving for sig-scheduling. Our main concern is the race that could happen between the scheduler and an in-place update. We agreed that we don't have enough data points to conclude how disruptive that is going to be, and so we will re-evaluate after alpha.

To provide fine-grained user control, PodSpec.Containers is extended with
ResizePolicy map (new object) for each resource type (CPU, memory):
* NoRestart - the default value; resize Container without restarting it,
* RestartContainer - restart the Container in-place to apply new resource
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see a case for a third policy here, which as a strawman I will call SignalContainerWINCH. This would allow the container to attempt to adjust its language runtime to conform to the new limits - e.g. a programmer determines that calling runtime.GOMAXPROCS(math.Ceil(numCPUs) + 1) results in less scheduler thrashing.

However, such a signal would only be useful if pods are able to interrogate the system for their own resource limits. This is perhaps best left to future enhancements to in-place update and should not block 1.17 implementation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On linux you can always read /sys/fs/cgroup can't you ?

@mwielgus
Copy link
Contributor

mwielgus commented Oct 3, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 3, 2019
@mwielgus
Copy link
Contributor

mwielgus commented Oct 3, 2019

/approve

Approving for sig-autoscaling.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, dashpole, dchen1107, mwielgus, vinaykul

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.