
In-Place Pod Vertical Scaling feature #92127

Closed

Conversation

@vinaykul (Member) commented Jun 14, 2020

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespace from that line:

/kind api-change
/kind bug
/kind cleanup
/kind deprecation
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:
This PR implements the API changes defined in the In-Place Pod Vertical Scaling KEP, the CRI API changes defined in the Kubelet CRI KEP, and the core implementation of the In-Place Pod Vertical Scaling feature, which allows users or entities such as VPA to request a pod resize without restarting containers.

To summarize changes from the two KEPs referenced above, this PR brings the following:

  • Adds new fields named ResourcesAllocated and ResizePolicy to the Pod's Container type, and a Resources field to the ContainerStatus type. It also adds a new admission controller that restricts changes to the ResourcesAllocated field to the system node account. This API change enables users to scale a Pod's CPU and memory resources up or down (see the sketch after this list).

  • The CRI change enables runtimes to report currently applied resource limits for a container, and adds support for Windows containers.
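
For orientation, below is a minimal, self-contained Go sketch of the additions summarized above. The types here are simplified stand-ins for the real core/v1 types (string quantities instead of resource.Quantity, only the relevant fields shown); the new field names and value kinds follow this PR's description, everything else is illustrative.

// A minimal sketch of the API additions described in this PR, using simplified stand-in types.
package main

import "fmt"

type ResourceName string

// ResourceList stands in for v1.ResourceList (values are resource.Quantity in the real API).
type ResourceList map[ResourceName]string

// ResourceRequirements stands in for v1.ResourceRequirements.
type ResourceRequirements struct {
	Limits   ResourceList
	Requests ResourceList
}

// ContainerResizePolicy values, as described in this PR.
type ContainerResizePolicy string

const (
	NoRestart        ContainerResizePolicy = "NoRestart"
	RestartContainer ContainerResizePolicy = "RestartContainer"
)

// ResizePolicy pairs a resource name with the restart behavior needed to resize it.
type ResizePolicy struct {
	ResourceName ResourceName
	Policy       ContainerResizePolicy
}

// Container shows only the fields relevant to this PR.
type Container struct {
	Name               string
	Resources          ResourceRequirements // desired requests/limits (existing field)
	ResourcesAllocated ResourceList         // new: node resources actually allocated to the container
	ResizePolicy       []ResizePolicy       // new: per-resource resize guidance
}

// ContainerStatus shows only the field added by this PR.
type ContainerStatus struct {
	Name      string
	Resources ResourceRequirements // new: limits/requests currently applied, as reported by the runtime
}

func main() {
	c := Container{
		Name: "stress",
		Resources: ResourceRequirements{
			Limits:   ResourceList{"cpu": "1", "memory": "1Gi"},
			Requests: ResourceList{"cpu": "1", "memory": "1Gi"},
		},
		ResourcesAllocated: ResourceList{"cpu": "1", "memory": "1Gi"},
		ResizePolicy: []ResizePolicy{
			{ResourceName: "cpu", Policy: NoRestart},
			{ResourceName: "memory", Policy: RestartContainer},
		},
	}
	fmt.Printf("%+v\n", c)
}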

The API change code had an initial review by @thockin, and the core implementation has been reviewed by @dashpole and revised. See: vinaykul#1

We are now at a point where we can use further review from the community.

Alpha Graduation Criteria Status

  • API change and core functionality implemented.
  • Resource quota, LimitRanger, and scheduler changes, as well as support for kubectl describe, are in.
  • Resize policy has been implemented.
  • Unit tests for the core feature changes are in place.
  • E2E tests: just started working on them.
  • Manually tested at least the following E2E scenarios using kubectl patch:
    • Single-container (guaranteed, burstable) Pod, increase resources: CPU + memory, CPU only, memory only.
    • Single-container (guaranteed, burstable) Pod, decrease resources: CPU + memory, CPU only, memory only.
    • Three-container (guaranteed, burstable) Pod: increase CPU of c1, decrease CPU of c2, no CPU change for c3; decrease memory of c1, no memory change for c2, increase memory of c3.
    • Restart kubelet while a pod resize is in progress.
    • Inject random failures into the dockershim response to the UpdateContainerResources CRI API.
    • Dockershim does not return Resources information in the ContainerStatus API.
    • Three containers (guaranteed) with the RestartContainer resize policy for c2 memory: ensure that c2 is restarted only when its memory is resized, regardless of any c2 CPU resize.

Which issue(s) this PR fixes:

Fixes #9043

Special notes for your reviewer:

  1. The commit named 'InPlace Pod Vertical Scaling feature - API change' implements the API change for In-Place Pod Vertical Scaling.
  2. The commit named 'InPlace Pod Vertical Scaling feature - Kubelet CRI change' implements the CRI changes to allow runtimes to report resource limits.
  3. The commits named 'InPlace Pod Vertical Scaling feature - core implementation' contain the core kubelet changes that implement the feature, the scheduler changes to use the ResourcesAllocated field, and the apiserver changes to support resource quota, limit ranger, and kubectl describe output. Please look closely at the switch-over to ResourcesAllocated and at the resource-quota and limit-ranger changes - I have not had a lot of play-time with those.

Does this PR introduce a user-facing change?: Yes

1. A new field named ResourcesAllocated (type v1.ResourceList) is added to the Container struct.
2. A new field named ResizePolicy is added to the Container struct.
3. A new field named Resources (type v1.ResourceRequirements) is added to the ContainerStatus struct.
4. A new admission controller named 'PodResourceAllocation' is introduced to restrict modification of the ResourcesAllocated field.

For example:
root@skibum:~/vInPlacePodVerticalScaling# cat ~/YML/1pod.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: 1pod
spec:
  containers:
  - name: stress
    image: skiibum/ubuntu-stress:18.10
    resources:
      limits:
        cpu: "1"
        memory: "1Gi"
      requests:
        cpu: "1"
        memory: "1Gi"
    resizePolicy:
    - resourceName: cpu
      policy: NoRestart
    - resourceName: memory
      policy: RestartContainer

The resources can be patched via API or via kubectl as illustrated below:

kubectl patch pod 1pod --patch '{"spec":{"containers":[{"name":"stress", "resources":{"requests":{"cpu":"1500m"}, "limits":{"cpu":"1500m"}}}]}}'

kubectl describe pod now displays resources allocated to the pod's containers as shown below:

root@skibum:~/vInPlacePodVerticalScaling# ./cluster/kubectl.sh describe pod 1pod
Name:         1pod
Namespace:    default
Priority:     0
Node:         127.0.0.1/127.0.0.1
Start Time:   Tue, 30 Jun 2020 14:58:02 -0700
Labels:       <none>
Annotations:  <none>
Status:       Running
IP:           172.18.0.3
IPs:
  IP:  172.18.0.3
Containers:
  stress:
    Container ID:   docker://481892996beb94ee0e0f90de3911f683419ea9d832f765ae762ce893af16feec
    Image:          skiibum/ubuntu-stress:18.10
    Image ID:       docker-pullable://skiibum/ubuntu-stress@sha256:161c50ff4ccc79744c29ea0f337652ce2854ce827914160c7043646b1f8988ba
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 30 Jun 2020 14:58:03 -0700
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1500m
      memory:  1Gi
    Requests:
      cpu:     1500m
      memory:  1Gi
    Allocations:
      cpu:        1500m
      memory:     1Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lvrwt (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-lvrwt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lvrwt
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From                Message
  ----    ------     ----  ----                -------
  Normal  Scheduled  73s   default-scheduler   Successfully assigned default/1pod to 127.0.0.1
  Normal  Pulled     73s   kubelet, 127.0.0.1  Container image "skiibum/ubuntu-stress:18.10" already present on machine
  Normal  Created    73s   kubelet, 127.0.0.1  Created container stress
  Normal  Started    73s   kubelet, 127.0.0.1  Started container stress
root@skibum:~/vInPlacePodVerticalScaling#

Additionally, kubectl describe node now displays total node resource allocations, as well as CPU and memory allocations for each pod on the node, as shown below:

root@skibum:~/vInPlacePodVerticalScaling# ./cluster/kubectl.sh describe node 127.0.0.1
Name:               127.0.0.1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=127.0.0.1
                    kubernetes.io/os=linux
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 30 Jun 2020 14:52:23 -0700
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  127.0.0.1
  AcquireTime:     <unset>
  RenewTime:       Tue, 30 Jun 2020 15:04:13 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 30 Jun 2020 15:02:35 -0700   Tue, 30 Jun 2020 14:52:23 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 30 Jun 2020 15:02:35 -0700   Tue, 30 Jun 2020 14:52:23 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 30 Jun 2020 15:02:35 -0700   Tue, 30 Jun 2020 14:52:23 -0700   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 30 Jun 2020 15:02:35 -0700   Tue, 30 Jun 2020 14:52:33 -0700   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  127.0.0.1
  Hostname:    127.0.0.1
Capacity:
  cpu:                16
  ephemeral-storage:  3807464952Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32935964Ki
  pods:               110
Allocatable:
  cpu:                16
  ephemeral-storage:  3508959693954
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32833564Ki
  pods:               110
System Info:
  Machine ID:                 91a8ee87a44815729b015c5a5af38563
  System UUID:                49434D53-0200-2500-9072-002590726C8A
  Boot ID:                    5f7a62a7-8d6d-49a5-950b-b89a5a7561c0
  Kernel Version:             4.4.0-184-generic
  OS Image:                   Ubuntu 16.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.8
  Kubelet Version:            v1.19.0-beta.2.529+426d0bbcb13489-dirty
  Kube-Proxy Version:         v1.19.0-beta.2.529+426d0bbcb13489-dirty
Non-terminated Pods:          (2 in total)
  Namespace                   Name                         CPU Requests  CPU Limits  CPU Allocations  Memory Requests  Memory Limits  Memory Allocations  AGE
  ---------                   ----                         ------------  ----------  ---------------  ---------------  -------------  ------------------  ---
  default                     1pod                         1500m (9%)    1500m (9%)  1500m (9%)       1Gi (3%)         1Gi (3%)       1Gi (3%)            6m13s
  kube-system                 kube-dns-5949897dd5-cjqdl    260m (1%)     0 (0%)      260m (1%)        110Mi (0%)       170Mi (0%)     110Mi (0%)          11m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits       Allocations
  --------           --------     ------       -----------
  cpu                1760m (11%)  1500m (9%)   1760m (11%)
  memory             1134Mi (3%)  1194Mi (3%)  1134Mi (3%)
  ephemeral-storage  0 (0%)       0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
Events:
  Type    Reason                   Age                From                   Message
  ----    ------                   ----               ----                   -------
  Normal  Starting                 11m                kubelet, 127.0.0.1     Starting kubelet.
  Normal  NodeHasSufficientMemory  11m (x2 over 11m)  kubelet, 127.0.0.1     Node 127.0.0.1 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    11m (x2 over 11m)  kubelet, 127.0.0.1     Node 127.0.0.1 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     11m (x2 over 11m)  kubelet, 127.0.0.1     Node 127.0.0.1 status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  11m                kubelet, 127.0.0.1     Updated Node Allocatable limit across pods
  Normal  Starting                 11m                kube-proxy, 127.0.0.1  Starting kube-proxy.
  Normal  NodeReady                11m                kubelet, 127.0.0.1     Node 127.0.0.1 status is now: NodeReady
root@skibum:~/vInPlacePodVerticalScaling#

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

In-Place Update of Pod Resources KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20181106-in-place-update-of-pod-resources.md

Container Resources CRI API Changes for Pod Vertical Scaling KEP:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 14, 2020
@k8s-ci-robot (Contributor)

Welcome @vinaykul!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 14, 2020
@k8s-ci-robot (Contributor)

Hi @vinaykul. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/apiserver area/kubeadm area/kubectl area/kubelet kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 14, 2020
@vinaykul (Member Author)

@neolit123 (Member)

/unassign

@vinaykul (Member Author) commented Jul 1, 2020

After discussions with @sig-cli in today's meeting, the kubectl describe changes in this PR are best deferred until we get this feature into beta - they are a nice-to-have, not critical to the feature. I'll update the PR to back out the kubectl changes, along with moving pod_resize.go from test/e2e/common to test/e2e/node.

@thockin (Member) left a comment

This is a giant PR, so I only started with the API things, and I apologize if some of this has already been covered - it's hard to catch up on the whole historical state. I started re-reading the KEP and there are things there that are unclear to me - is it reasonable to send comments on that?

Unfortunately - commenting on a committed KEP is hard. I'll open a PR with <<[UNRESOLVED]>> blocks for discussion. Is that fair?

@@ -1966,6 +1966,33 @@ const (
PullIfNotPresent PullPolicy = "IfNotPresent"
)

// ContainerResizePolicy specifies user guidance on how container resource resize should be handled.
// Only one of the following container resize policies may be specified.
// If none of the following policies is specified, it defaults to NoRestart.
Member

The remarks about default behavior are out of place here; they should go on the field, not the type.

Member

Also, is that the safe default?

Member Author

By safe, do you mean 'won't have any undesirable results'?

Member

Assume 90% of users use the default. Are we going to behave correctly for most of them most of the time? It seems like "Restart" is the more conservative default, and "NoRestart" is more idealistic.

type ContainerResizePolicy string

// These are the valid container resize policies:
// NoRestart policy tells Kubelet to call UpdateContainerResources CRI API to resize
@thockin (Member) Jul 1, 2020

This needs to be clearer and comments should be grouped with the constants. E.g. here's my interpretation:

const (
	// NoRestart indicates that this container does not require a restart
	// to have a specific resource (e.g. memory) resized.
	NoRestart ContainerResizePolicy = "NoRestart"
	// Restart indicates that this container requires a restart when a
	// specific resource (e.g. memory) is resized.  For example, Java apps
	// which use the `-XmxN` flag would need to be restarted when the
	// memory resource is resized.
	Restart ContainerResizePolicy = "Restart"
)

Note I renamed "RestartContainer" since "container" is redundant...

Member

Do we need things like RestartOnGrow and RestartOnShrink? Should this be RestartAlways?

Member Author

These could be useful, especially RestartOnShrink, since reclaiming memory may not be easily accomplished. @derekwaynecarr Should we add these now? I hope it is OK to update a merged KEP after the fact.

Member

We can always add them later, as long as we leave room for them. I don't like designing to hypotheticals, so maybe best to wait until we know we need them?

// ContainerResizePolicy specifies user guidance on how container resource resize should be handled.
// Only one of the following container resize policies may be specified.
// If none of the following policies is specified, it defaults to NoRestart.
type ContainerResizePolicy string
Member

Is this really "Policy" or "Requirement"?

Member

Naming-wise, I feel like this should be ResourceResizeRequirement or ResourceResizePolicy, and the struct below should be ContainerResizePolicy.

Member Author

Thanks, I agree this looks cleaner. I'll update this to ResourceResizePolicy and ContainerResizePolicy, with values of Restart and NoRestart. RestartContainer got left that way because we had a third RestartPod option at one point, which we dropped because we couldn't find a good documented use case or reason to keep it in the IBM proposal (where the resize policies originally came from).

)

// ResizePolicy represents the resource resize policy for a single container.
type ResizePolicy struct {
Member

ResourceList is one of the only places we use a map in the API, but it fits well. There's value in consistency - why is this better as a struct than a map?

Member Author

@thockin I wanted this to be a map (KEP discussion) for simplicity (both for the user and for the code), despite the API conventions recommending lists of named subobjects over maps. cpu and memory are the only key choices here, unlike the container object where the user can choose the name.

We discussed this during API live review in 2019 KubeCon San Diego, and we leaned towards staying with the API conventions as it makes future extensibility simple.

@liggitt Do you think we can revisit this? Can we make an exception here and use a map even though at this point we don't have any reason for the values to take parameters? I haven't really spent a lot of time watching APIs evolve, but if you feel there are strong reasons for the value to be an object instead of a string (my preference, because it is simple), we can do that. Are we looking at something like ContainerState?

Member

If we make the "value" a struct, then a list probably makes more sense.

If we make the "value" a single value, I have a slight lean towards the map, for congruence with the resources field

Member Author

Then we should leave it as a list - it is possible future iterations may want to qualify the behavior a bit more, such as the number of failed update retry attempts before restarting.
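
To make the trade-off concrete, here is a small hypothetical Go sketch (not code from this PR) contrasting the two shapes discussed in this thread; the type and field names, including the commented-out retry qualifier, are invented for illustration.

// Hypothetical sketch contrasting the map and list shapes discussed above.
package main

import "fmt"

type ResourceName string
type ContainerResizePolicy string

// Map form: compact and congruent with ResourceList, but the value can only ever be a single string.
type resizePolicyMap map[ResourceName]ContainerResizePolicy

// List form (what the PR keeps): each entry can grow extra qualifiers later
// without breaking the schema, e.g. a retry threshold before restarting.
type resizePolicyEntry struct {
	ResourceName ResourceName
	Policy       ContainerResizePolicy
	// MaxUpdateRetries int // hypothetical future qualifier mentioned above
}

func main() {
	asMap := resizePolicyMap{"cpu": "NoRestart", "memory": "RestartContainer"}
	asList := []resizePolicyEntry{
		{ResourceName: "cpu", Policy: "NoRestart"},
		{ResourceName: "memory", Policy: "RestartContainer"},
	}
	fmt.Println(asMap, asList)
}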

@@ -2053,6 +2080,12 @@ type Container struct {
// Compute resource requirements.
// +optional
Resources ResourceRequirements
// Node compute resources allocated to the container.
// +optional
ResourcesAllocated ResourceList
Member

Why is this spec and not status?

Member Author

This field was added only so that we have a source of truth for the actual node resources allocated to the containers and the pod, should the status field be erased or lost - API conventions require status to be reconstructible.

The status field now has a new Resources field (type ResourceRequirements) to reflect the actual configuration reported by the runtime via the extended CRI.

Member

Wouldn't it be reconstructible from the kubelet? Being in spec requires all the admission control mess, which is a really strong smell that something is off, IMO.

@vinaykul (Member Author) Jul 2, 2020

Short answer - no.

As of today, there is no way to discover the memory request - it is not stored anywhere. We will be able to reconstruct current CPU requests, CPU limits, and memory limits from the extended ContainerStatus CRI API once runtimes support the CRI extension implemented in this PR.

We had discussed the alternative of checkpointing this locally on the node via atomic file writes, but Dawn mentioned they had run into issues with that kind of approach in Borg, and felt it was safest to have PodSpec be the single source of truth.

We had a long discussion on this in sig-node meetings before finally settling on adding ResourcesAllocated to PodSpec, and bringing it up during the Live API review last year. For gory details with kubelet restart scenarios, please see kubernetes/enhancements#686 (comment)


// InPlacePodVerticalScaling
if utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) {
nodePolicyRules = append(nodePolicyRules, rbacv1helpers.NewRule("patch").Groups(legacyGroup).Resources("pods").RuleOrDie())
Member

cc @tallclair on node update permission on pod specs

Member Author

/assign tallclair

Comment on lines +513 to +515
if a.GetResource().GroupResource() == corev1.Resource("pods") {
deltaUsage = quota.Subtract(deltaUsage, prevUsage)
}
Member

this leaves non-pod resources with deltaUsage uncomputed... need a test case that would have caught this

Member

remind me what protects against making requests that 1) reduce pod resource usage and 2) are constructed to fail to persist (e.g. a conflicting resourceVersion), thereby artificially lowering consumed quota?

Member Author

Subtract - 🤦‍♂️ ... will fix and add a test.

Regarding gaming of resource quota, we decided against using max(Resources.Requests, ResourcesAllocated) because we don't realistically expect a downward resize to stay unenforced by the kubelet for an extended period of time. The kubelet will always accept a downward resize, typically within the watch latency.

Please review the rationale noted in the KEP discussion for requests:

Similarly for limits, IIRC we expected the CRI to converge to the new desired limits shortly after the watch latency.
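
For context only, here is a tiny hypothetical Go sketch of what the rejected max(Resources.Requests, ResourcesAllocated) accounting would compute; the maxResourceList helper and the simplified integer ResourceList are invented for illustration and are not part of the quota code.

// Illustration of element-wise max over two resource lists (the considered-and-rejected approach).
package main

import "fmt"

type ResourceName string
type ResourceList map[ResourceName]int64 // simplified: cpu in millicores, memory in bytes

// maxResourceList returns the element-wise maximum of two resource lists.
func maxResourceList(a, b ResourceList) ResourceList {
	out := ResourceList{}
	for k, v := range a {
		out[k] = v
	}
	for k, v := range b {
		if cur, ok := out[k]; !ok || v > cur {
			out[k] = v
		}
	}
	return out
}

func main() {
	requests := ResourceList{"cpu": 1000, "memory": 1 << 30}  // desired (possibly resized down)
	allocated := ResourceList{"cpu": 1500, "memory": 1 << 30} // still allocated on the node
	// Charging quota at the max would prevent gaming quota with a pending downward resize.
	fmt.Println(maxResourceList(requests, allocated))
}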

if !ok {
return apierrors.NewBadRequest("Resource was marked with kind Pod but was unable to be converted")
}
if len(pod.Spec.Containers) != len(oldPod.Spec.Containers) {
Member

slight preference for only doing the resource preservation when the length is the same, and letting a length change continue on and get the normal validation error about not changing pod content

if len(pod.Spec.Containers) == len(oldPod.Spec.Containers) {
  ... preserve/populate values
}

if !ok {
return apierrors.NewBadRequest("Resource was marked with kind Pod but was unable to be converted")
}
if len(pod.Spec.Containers) != len(oldPod.Spec.Containers) {
Member

what about initcontainers and ephemeralcontainers?

Member Author

This feature is not supported for init containers; we called that out as a non-goal. At the time, I was unaware of ephemeral containers, so they are not covered in the KEP. But the documentation suggests that setting resources isn't allowed for ephemeral containers, so I didn't worry about them here. I haven't tested it, but I think normal validation will error out if a user tries to set ephemeral container resources.

Comment on lines +62 to +64
if !utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) {
return nil
}
Member

Prefer injecting the feature gates over taking a direct dependency on the global feature gate. As an example, see:

// InspectFeatureGates allows setting bools without taking a dep on a global variable
func (r *RuntimeClass) InspectFeatureGates(featureGates featuregate.FeatureGate) {
	r.runtimeClassEnabled = featureGates.Enabled(features.RuntimeClass)
	r.podOverheadEnabled = featureGates.Enabled(features.PodOverhead)
	r.inspectedFeatures = true
}

// ValidateInitialization implements the WantsExternalKubeInformerFactory interface.
func (r *RuntimeClass) ValidateInitialization() error {
	if !r.inspectedFeatures {
		return fmt.Errorf("InspectFeatureGates was not called")
	}
	if !r.runtimeClassEnabled {
		return nil
	}

// Admit makes an admission decision based on the request attributes
func (r *RuntimeClass) Admit(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) error {
	if !r.runtimeClassEnabled {
		return nil
	}
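
To make the suggestion concrete, here is a hypothetical, self-contained Go sketch of how the PodResourceAllocation admission plugin introduced by this PR could adopt the same pattern; the FeatureGate interface below is a local stub and the method bodies are illustrative, not the plugin's actual implementation.

// Hypothetical sketch of the injected-feature-gate pattern applied to PodResourceAllocation.
package main

import "fmt"

type Feature string

const InPlacePodVerticalScaling Feature = "InPlacePodVerticalScaling"

// FeatureGate is a local stub standing in for featuregate.FeatureGate.
type FeatureGate interface {
	Enabled(f Feature) bool
}

type PodResourceAllocation struct {
	inPlacePodVerticalScalingEnabled bool
	inspectedFeatures                bool
}

// InspectFeatureGates caches the gate value instead of reading a global at admit time.
func (p *PodResourceAllocation) InspectFeatureGates(gates FeatureGate) {
	p.inPlacePodVerticalScalingEnabled = gates.Enabled(InPlacePodVerticalScaling)
	p.inspectedFeatures = true
}

// ValidateInitialization fails fast if InspectFeatureGates was never called.
func (p *PodResourceAllocation) ValidateInitialization() error {
	if !p.inspectedFeatures {
		return fmt.Errorf("InspectFeatureGates was not called")
	}
	return nil
}

// Admit is a no-op when the feature gate is disabled.
func (p *PodResourceAllocation) Admit() error {
	if !p.inPlacePodVerticalScalingEnabled {
		return nil // feature disabled: nothing to enforce
	}
	// ... enforcement of ResourcesAllocated updates would go here ...
	return nil
}

type staticGate map[Feature]bool

func (g staticGate) Enabled(f Feature) bool { return g[f] }

func main() {
	p := &PodResourceAllocation{}
	p.InspectFeatureGates(staticGate{InPlacePodVerticalScaling: true})
	fmt.Println(p.ValidateInitialization(), p.Admit())
}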

if !utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScaling) && !inPlacePodVerticalScalingInUse(oldPodSpec) {
// Drop ResourcesAllocated and ResizePolicy fields. Don't drop updates to Resources field because
// template spec Resources field is mutable for certain controllers. Let ValidatePodUpdate handle it.
for i := range podSpec.Containers {
Member

also initContainers/ephemeralContainers? should this use the VisitContainers helper?

Member Author

Let me try out init and ephemeral containers and be sure about them. I think initContainers would need this.

}
var inUse bool
VisitContainers(podSpec, AllContainers, func(c *api.Container, containerType ContainerType) bool {
if (c.ResourcesAllocated != nil && len(c.ResourcesAllocated) > 0) || len(c.ResizePolicy) > 0 {
Member

can simplify to

Suggested change
if (c.ResourcesAllocated != nil && len(c.ResourcesAllocated) > 0) || len(c.ResizePolicy) > 0 {
if len(c.ResourcesAllocated) > 0 || len(c.ResizePolicy) > 0 {

Comment on lines +250 to +255
case "resourcesAllocated.cpu":
return convertResourceCPUToString(container.ResourcesAllocated.Cpu(), divisor)
case "resourcesAllocated.memory":
return convertResourceMemoryToString(container.ResourcesAllocated.Memory(), divisor)
case "resourcesAllocated.ephemeral-storage":
return convertResourceEphemeralStorageToString(container.ResourcesAllocated.StorageEphemeral(), divisor)
Member

this extends allowed values for the downward API in ways old kubelets won't understand, right?

Member Author

It looks like an older kubelet should fail validation if users try to expose these fields in their containers (a check I need to update to allow these fields).

The end result is that the pod won't be created, so we won't get to the point where the kubelet gets bad info - at least for 1.18 this is true; I didn't try 1.17, but the same validation code exists there. Is this an issue? This seems like the right behavior.

@vinaykul (Member Author) Jul 7, 2020

Are you concerned about the case where apiserver has been upgraded, but kubelet has yet to be?

In that event, the old kubelet will not recognize these new fields that the apiserver lets through, and it results in a container start error. So, not great UX - "sometimes the pod starts, and sometimes it won't" - depending on whether the pod lands on an updated kubelet or not, until the upgrade process completes.

Is there a way we can handle this?

@thockin (Member) commented Jul 1, 2020

@vinaykul PTAL at kubernetes/enhancements#1883

@vinaykul (Member Author) commented Jul 1, 2020

/assign @tallclair

@vinaykul (Member Author) commented Jul 2, 2020

Notes from today's SIG-scheduling weekly meeting discussion on this feature:

  • Scheduler change looks OK, but consider expanding the container.ResourcesAllocated change to InitContainers as well to keep everything uniform.
    • The admission plugin should set ResourcesAllocated for init containers as well in this case (see the sketch after this list).
    • Validation should reject attempts to mutate init container resources.
  • Verify that this works for upgrades.
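
As a concrete reading of the first note, here is a hypothetical Go sketch of admission-time defaulting of ResourcesAllocated from the requests for regular and init containers alike; the simplified types and the defaultResourcesAllocated helper are invented for illustration, not code from this PR.

// Hypothetical defaulting sketch: copy requests into ResourcesAllocated when unset.
package main

import "fmt"

type ResourceList map[string]string

type Container struct {
	Name               string
	Requests           ResourceList // simplified: stands in for Resources.Requests
	ResourcesAllocated ResourceList
}

type PodSpec struct {
	InitContainers []Container
	Containers     []Container
}

// defaultResourcesAllocated fills ResourcesAllocated from Requests for all containers.
func defaultResourcesAllocated(spec *PodSpec) {
	def := func(cs []Container) {
		for i := range cs {
			if cs[i].ResourcesAllocated == nil {
				cs[i].ResourcesAllocated = cs[i].Requests
			}
		}
	}
	def(spec.InitContainers)
	def(spec.Containers)
}

func main() {
	spec := &PodSpec{
		InitContainers: []Container{{Name: "init", Requests: ResourceList{"cpu": "100m"}}},
		Containers:     []Container{{Name: "stress", Requests: ResourceList{"cpu": "1", "memory": "1Gi"}}},
	}
	defaultResourcesAllocated(spec)
	fmt.Printf("%+v\n", spec)
}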

/cc @ahg-g

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and liggitt July 2, 2020 17:24
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 7, 2020
@k8s-ci-robot (Contributor)

@vinaykul: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vinaykul (Member Author)

I'm closing this PR and will start a new one.

For posterity: we decided that the previously approved API changes are not what we really wanted, and we are revisiting the KEP per the discussion in PR kubernetes/enhancements#1883.

I'm in the process of refactoring this PR with API changes per that discussion.

@alejandrox1 I'll create a separate issue and a PR that fixes LookForStringInPodExec instead of adding LookForStringInPodExecToContainer, as discussed in #92127 (comment). The usage of LookForStringInPodExec is still very low and this is a zero-functionality change, so it should be a breeze to review.

The e2e tests that consume this will come with the main PR for this feature.

Labels
area/apiserver area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubectl area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

In-place rolling updates