In-Place Pod Vertical Scaling feature #92127
Conversation
Welcome @vinaykul!
Hi @vinaykul. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/unassign
After discussions with @sig-cli in today's meeting, the kubectl describe changes in this PR are best deferred until we get this feature into beta - they are a nice-to-have, not critical to the feature. I'll update the PR to back out the kubectl changes, along with moving pod_resize.go from test/e2e/common to test/e2e/node.
This is a giant PR, so I only started with the API things, and I apologize if some of these have already been covered - it's hard to catch up on the whole historical state. I started re-reading the KEP and there are things there that are unclear to me - is it reasonable to send comments on that?
Unfortunately - commenting on a committed KEP is hard. I'll open a PR with <<[UNRESOLVED]>> blocks for discussion. Is that fair?
@@ -1966,6 +1966,33 @@ const (
    PullIfNotPresent PullPolicy = "IfNotPresent"
)

// ContainerResizePolicy specifies user guidance on how container resource resize should be handled.
// Only one of the following container resize policies may be specified.
// If none of the following policies is specified, it defaults to NoRestart.
The remarks about default behavior are out of place here; they should go on the field, not the type.
Also, is that the safe default?
By safe, do you mean 'won't have any undesirable results'?
Assume 90% of users use the default. Are we going to behave correctly for most of them most of the time? It seems like "Restart" is the more conservative default, and "NoRestart" is more idealistic.
type ContainerResizePolicy string

// These are the valid container resize policies:
// NoRestart policy tells Kubelet to call UpdateContainerResources CRI API to resize
This needs to be clearer and comments should be grouped with the constants. E.g. here's my interpretation:
const (
// NoRestart indicates that this container does not require a restart
// to have a specific resource (e.g. memory) resized.
NoRestart ContainerResizePolicy = "NoRestart"
// Restart indicates that this container requires a restart when a
// specific resource (e.g. memory) is resized. For example, Java apps
// which use the `-XmxN` flag would need to be restarted when the
// memory resource is resized.
Restart ContainerResizePolicy = "Restart"
)
Note I renamed "RestartContainer" since "container" is redundant...
Do we need things like RestartOnGrow and RestartOnShrink ? Should this be RestartAlways?
These could be useful, especially RestartOnShrink, since reclaiming memory may not be easily accomplished. @derekwaynecarr Should we add these now? I hope it is OK to update a merged KEP after-the-fact.
We can always add them later, as long as we leave room for them. I don't like designing to hypotheticals, so maybe best to wait until we know we need them?
// ContainerResizePolicy specifies user guidance on how container resource resize should be handled.
// Only one of the following container resize policies may be specified.
// If none of the following policies is specified, it defaults to NoRestart.
type ContainerResizePolicy string
Is this really "Policy" or "Requirement" ?
Naming wise, I feel like this should be ResourceResizeRequirement or ResourceResizePolicy, and the struct below should be ContainerResizePolicy
Thanks, I agree this looks cleaner. I'll update this to ResourceResizePolicy and ContainerResizePolicy, with values of Restart and NoRestart. RestartContainer got left that way because we had a third RestartPod option at one point, which we dropped because we couldn't find a good documented use-case or reason to keep it in the IBM proposal (where the resize policies came from in the first place).
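For concreteness, a minimal sketch of the renamed string type as proposed in this thread (these names come from the comments above, not necessarily the final merged API):

    type ResourceResizePolicy string

    const (
        // NoRestart resizes the resource in place without restarting the container.
        NoRestart ResourceResizePolicy = "NoRestart"
        // Restart restarts the container to apply the new resource value.
        Restart ResourceResizePolicy = "Restart"
    )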
)

// ResizePolicy represents the resource resize policy for a single container.
type ResizePolicy struct {
ResourceList is one of the only places we use map in the API, but it fits well. There's a value to consistency - why is this better as a struct than a map?
@thockin I wanted this to be a map (KEP discussion) for simplicity (both for the user and for the code), despite the API conventions recommending lists of named subobjects over maps. cpu and memory are the only key choices here, vs. the container object where the user can choose the name.
We discussed this during the API live review at KubeCon San Diego 2019, and we leaned towards staying with the API conventions as it makes future extensibility simple.
@liggitt Do you think we can revisit this? Can we make an exception here and use maps even though at this point we don't have any reason for the values to take parameters? I haven't really spent a lot of time seeing APIs evolve, but if you feel there are strong reasons for the values to be objects instead of strings (my preference is strings because they are simple), we can do that. Are we looking at something like ContainerState?
If we make the "value" a struct, then a list probably makes more sense.
If we make the "value" a single value, I have a slight lean towards the map, for congruence with the resources field.
Then we should leave it as a list - it is possible future iterations may want to qualify the behavior a bit more, such as the number of failed update retry attempts before restarting.
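To make the two shapes concrete, here is a sketch of both forms weighed above, reusing the ResourceResizePolicy string type from the earlier sketch (type names here are illustrative):

    type ResourceName string

    // Map form: congruent with the resources field, but each value can
    // only ever be a single string.
    type ResizePolicyMap map[ResourceName]ResourceResizePolicy

    // List form: a named subobject per resource, which leaves room to
    // qualify the behavior later (e.g. a retry count before restarting).
    type ContainerResizePolicy struct {
        ResourceName ResourceName
        Policy       ResourceResizePolicy
    }
    type ResizePolicyList []ContainerResizePolicy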
@@ -2053,6 +2080,12 @@ type Container struct {
    // Compute resource requirements.
    // +optional
    Resources ResourceRequirements
    // Node compute resources allocated to the container.
    // +optional
    ResourcesAllocated ResourceList
Why is this spec and not status?
This field was added only so that we have a source-of-truth of actual node resource allocations for the containers and the pod, should the status field be erased or lost - API conventions require status to be reconstructible.
The status field now has a new Resources field (of type ResourceRequirements) to reflect the actual config reported by the runtime via the extended CRI.
Wouldn't it be reconstructible from the Kubelet? Being in spec requires all the admission control mess, which is a really strong smell that something is off, IMO.
Short answer - no.
For point-in-time data, there is no way to discover the memory request today - it is not stored anywhere. We will be able to reconstruct current CPU requests, CPU limits & memory limits from the extended ContainerStatus CRI API once the runtime supports the CRI extension implemented in this PR - but not today.
We had discussed the alternative of checkpointing this locally on the node via atomic file-writes but Dawn mentioned they had run into issues with that kind of approach in Borg, and felt it was safest to have PodSpec be the single source-of-truth.
We had a long discussion on this in sig-node meetings before finally settling on adding ResourcesAllocated to PodSpec, and bringing it up during the Live API review last year. For gory details with kubelet restart scenarios, please see kubernetes/enhancements#686 (comment)
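Putting the thread together, a rough sketch of where the fields live, given the existing core types (paraphrased from the excerpts in this review, not the exact merged definitions):

    // In the pod spec:
    type Container struct {
        // Desired resources, written by the user or a controller such as VPA.
        Resources ResourceRequirements
        // Resources actually allocated on the node; written only by the node
        // so allocations can be reconstructed after a kubelet restart even if
        // status is erased or lost.
        ResourcesAllocated ResourceList
    }

    // In the pod status:
    type ContainerStatus struct {
        // Actual configuration reported by the runtime via the extended CRI.
        Resources ResourceRequirements
    }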
// InPlacePodVerticalScaling
if utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) {
    nodePolicyRules = append(nodePolicyRules, rbacv1helpers.NewRule("patch").Groups(legacyGroup).Resources("pods").RuleOrDie())
cc @tallclair on node update permission on pod specs
/assign tallclair
if a.GetResource().GroupResource() == corev1.Resource("pods") {
    deltaUsage = quota.Subtract(deltaUsage, prevUsage)
}
this leaves non-pod resources with deltaUsage uncomputed... need a test case that would have caught this
remind me what protects against making requests that 1) reduce pod resource usage and 2) are constructed to fail to persist (e.g. a conflicting resourceVersion) and artificially lowering consumed quota?
Subtract - 🤦♂️ .. will fix and add test.
Regarding gaming of resource quota: we decided against using max(Resources.Requests, ResourcesAllocated) as we don't realistically expect a downward resize to stay unenforced by the kubelet for an extended period of time. Kubelet will always accept a downward resize, typically within the watch latency.
Please review the rationale noted in the KEP discussion for requests:
Similarly for limits, IIRC we expected CRI to converge to the new desired limits shortly after the watch latency.
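A sketch of the corrected flow: compute the delta for every resource first, and only credit back prior usage for pod updates. This assumes an evaluator with a Usage method as in the quota package, and the variable names follow the excerpt above:

    // Compute the delta for the incoming object unconditionally...
    deltaUsage, err := evaluator.Usage(inputObject)
    if err != nil {
        return err
    }
    // ...then subtract previously charged usage only for pod updates, so
    // non-pod resources no longer end up with an uncomputed delta.
    if a.GetResource().GroupResource() == corev1.Resource("pods") {
        deltaUsage = quota.Subtract(deltaUsage, prevUsage)
    }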
if !ok {
    return apierrors.NewBadRequest("Resource was marked with kind Pod but was unable to be converted")
}
if len(pod.Spec.Containers) != len(oldPod.Spec.Containers) {
slight preference for only doing the resource preservation when the length is the same, and letting a length change continue on and get the normal validation error about not changing pod content
if len(pod.Spec.Containers) == len(oldPod.Spec.Containers) {
... preserve/populate values
}
if !ok {
    return apierrors.NewBadRequest("Resource was marked with kind Pod but was unable to be converted")
}
if len(pod.Spec.Containers) != len(oldPod.Spec.Containers) {
what about initcontainers and ephemeralcontainers?
This feature is not supported for init containers; we called that out as a non-goal. At the time, I was unaware of ephemeral containers, so they are not covered in the KEP. But the documentation suggested that setting resources isn't allowed for ECs, so I didn't worry about them here. I haven't tested it, but I think normal validation will error out if a user tries to set EC resources.
if !utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) {
    return nil
}
prefer injecting the feature gates over taking a direct dependency on the global feature gate. as an example, see:
kubernetes/plugin/pkg/admission/runtimeclass/admission.go
Lines 82 to 87 in c53ce48
// InspectFeatureGates allows setting bools without taking a dep on a global variable
func (r *RuntimeClass) InspectFeatureGates(featureGates featuregate.FeatureGate) {
    r.runtimeClassEnabled = featureGates.Enabled(features.RuntimeClass)
    r.podOverheadEnabled = featureGates.Enabled(features.PodOverhead)
    r.inspectedFeatures = true
}
kubernetes/plugin/pkg/admission/runtimeclass/admission.go
Lines 99 to 106 in c53ce48
// ValidateInitialization implements the WantsExternalKubeInformerFactory interface.
func (r *RuntimeClass) ValidateInitialization() error {
    if !r.inspectedFeatures {
        return fmt.Errorf("InspectFeatureGates was not called")
    }
    if !r.runtimeClassEnabled {
        return nil
    }
kubernetes/plugin/pkg/admission/runtimeclass/admission.go
Lines 116 to 120 in c53ce48
// Admit makes an admission decision based on the request attributes
func (r *RuntimeClass) Admit(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) error {
    if !r.runtimeClassEnabled {
        return nil
    }
if !utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) && !inPlacePodVerticalScalingInUse(oldPodSpec) {
    // Drop ResourcesAllocated and ResizePolicy fields. Don't drop updates to Resources field because
    // template spec Resources field is mutable for certain controllers. Let ValidatePodUpdate handle it.
    for i := range podSpec.Containers {
also initContainers/ephemeralContainers? should this use the VisitContainers helper?
Let me try out init and ephemeral containers and be sure about them. I think initContainers would need this.
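For example, the drop logic could go through the helper so init and ephemeral containers are covered as well - a sketch per the suggestion above, not the final code:

    // Drop the new fields from every container type when the feature gate
    // is off and the old spec wasn't already using them.
    VisitContainers(podSpec, AllContainers, func(c *api.Container, containerType ContainerType) bool {
        c.ResourcesAllocated = nil
        c.ResizePolicy = nil
        return true // keep visiting
    })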
}
var inUse bool
VisitContainers(podSpec, AllContainers, func(c *api.Container, containerType ContainerType) bool {
    if (c.ResourcesAllocated != nil && len(c.ResourcesAllocated) > 0) || len(c.ResizePolicy) > 0 {
can simplify to:

-    if (c.ResourcesAllocated != nil && len(c.ResourcesAllocated) > 0) || len(c.ResizePolicy) > 0 {
+    if len(c.ResourcesAllocated) > 0 || len(c.ResizePolicy) > 0 {
case "resourcesAllocated.cpu": | ||
return convertResourceCPUToString(container.ResourcesAllocated.Cpu(), divisor) | ||
case "resourcesAllocated.memory": | ||
return convertResourceMemoryToString(container.ResourcesAllocated.Memory(), divisor) | ||
case "resourcesAllocated.ephemeral-storage": | ||
return convertResourceEphemeralStorageToString(container.ResourcesAllocated.StorageEphemeral(), divisor) |
this extends allowed values for the downward API in ways old kubelets won't understand, right?
It looks like an older kubelet should fail validation if users try to expose these fields in their containers (a check I need to update to allow these fields).
The end result is that the pod won't be created, so we won't get to the point where the kubelet gets bad info - at least for 1.18 this is true. I didn't try 1.17, but the same validation code exists there. Is this an issue? This seems like the right behavior.
Are you concerned about the case where the apiserver has been upgraded, but the kubelet has yet to be?
In that event, the old kubelet will not recognize these new fields that the apiserver lets through - and that results in a container start error. So not great UX - "sometimes the pod starts, and sometimes it won't" - depending on whether the pod lands on an updated kubelet or not, until the upgrade process completes.
Is there a way we can handle this?
/assign @tallclair
Notes from today's SIG-scheduling weekly meeting discussion on this feature:
/cc @ahg-g
@vinaykul: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm closing this PR and will start a new one. For posterity: we decided that the previously approved API changes are not what we really wanted, and we are revisiting the KEP as per the discussion in PR kubernetes/enhancements#1883. I'm in the process of refactoring this PR with API changes per that discussion.
@alejandrox1 I'll create a separate issue and a PR that fixes LookForStringInPodExec instead of adding LookForStringInPodExecToContainer, as discussed in #92127 (comment). The usage of LookForStringInPodExec is still very low and this is a zero-functionality change, so it should be a breeze to review. The e2e tests that consume this will come with the main PR for this feature.
What type of PR is this?
What this PR does / why we need it:
This PR implements the API changes defined in the In-Place Pod Vertical Scaling KEP, the CRI API changes defined in the Kubelet CRI KEP, and the core implementation of the In-Place Pod Vertical Scaling feature, which allows users or entities such as VPA to request a pod resize without restarting containers.
To summarize changes from the two KEPs referenced above, this PR brings the following:
Adds new fields named ResourcesAllocated and ResizePolicy to the Pod's Container type, and a Resources field to the ContainerStatus type. It also adds a new admission controller that limits the ability to change the ResourcesAllocated field to the system-node account (see the sketch after this list). This API change enables users to scale a Pod's CPU and memory resources up or down.
The CRI change enables runtimes to report currently applied resource limits for a container, and adds support for Windows containers.
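As a sketch of the admission restriction mentioned in the first item (the stand-in types, index-based matching, and user-name check are illustrative, not the merged implementation - the real check would match containers by name):

    package main

    import (
        "fmt"
        "reflect"
        "strings"
    )

    // Minimal stand-ins for the API types, just for this sketch.
    type Container struct {
        Name               string
        ResourcesAllocated map[string]string // resource name -> quantity
    }
    type Pod struct{ Containers []Container }

    // validateResourcesAllocatedUpdate rejects changes to ResourcesAllocated
    // unless the request comes from a node account.
    func validateResourcesAllocatedUpdate(userName string, newPod, oldPod *Pod) error {
        if strings.HasPrefix(userName, "system:node:") {
            return nil // nodes may update allocations
        }
        for i := range newPod.Containers {
            if !reflect.DeepEqual(newPod.Containers[i].ResourcesAllocated,
                oldPod.Containers[i].ResourcesAllocated) {
                return fmt.Errorf("only the node may change ResourcesAllocated for container %q",
                    newPod.Containers[i].Name)
            }
        }
        return nil
    }

    func main() {
        old := &Pod{Containers: []Container{{Name: "app", ResourcesAllocated: map[string]string{"cpu": "1"}}}}
        upd := &Pod{Containers: []Container{{Name: "app", ResourcesAllocated: map[string]string{"cpu": "2"}}}}
        fmt.Println(validateResourcesAllocatedUpdate("alice", upd, old))          // error
        fmt.Println(validateResourcesAllocatedUpdate("system:node:n1", upd, old)) // nil
    }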
The API change code had an initial review by @thockin, and the core implementation has been reviewed by @dashpole and revised. See: vinaykul#1
We are now at a point where we can use further review from the community.
Alpha Graduation Criteria Status
Which issue(s) this PR fixes:
Fixes #9043
Special notes for your reviewer:
Does this PR introduce a user-facing change?: Yes
The resources can be patched via API or via kubectl as illustrated below:
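For example, a minimal client-go sketch of such a patch, assuming a recent client-go version (namespace, pod name, container name, and resource values are hypothetical):

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            panic(err)
        }

        // Strategic-merge patch raising the CPU request/limit of container "app".
        patch := []byte(`{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"500m"},"limits":{"cpu":"1"}}}]}}`)

        pod, err := clientset.CoreV1().Pods("default").Patch(context.TODO(),
            "my-pod", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
        if err != nil {
            panic(err)
        }
        fmt.Println("patched:", pod.Name)
    }

The same patch body can be applied from the command line with kubectl patch pod my-pod -p '<json>'.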
kubectl describe pod now displays resources allocated to the pod's containers as shown below:
Additionally, kubectl describe node displays total node resource allocations and CPU and memory resource allocations for all pods on the node as below:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: