Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tighten validation on the qosClass field of pod status #127744

Merged
merged 1 commit into from
Nov 1, 2024

Conversation

carlory
Copy link
Member

@carlory carlory commented Sep 30, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

According to this code, the QosClass of pod status is immutable.

This function also shows that QosClass is computed from the resources instead of being provided by users.

However, we can use rest api or "kubectl edit pod --subresource='status'" to change it.

Which issue(s) this PR fixes:

Fixes #126662

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Tighten validation on the qosClass field of pod status. This field is immutable but it would be populated with the old status by kube-apiserver if it is unset in the new status when updating this field via the status subsource.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 30, 2024
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Sep 30, 2024
@carlory
Copy link
Member Author

carlory commented Sep 30, 2024

/cc @kannon92

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Sep 30, 2024
@carlory
Copy link
Member Author

carlory commented Sep 30, 2024

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 30, 2024
@carlory
Copy link
Member Author

carlory commented Sep 30, 2024

/retest

@ffromani
Copy link
Contributor

/triage accepted
/priority backlog

priority set based on the fact this is a very old issue we endured so far
I agree users should not be allowed to set QoSClass. This field is owned and set by the system

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 30, 2024
@liggitt
Copy link
Member

liggitt commented Sep 30, 2024

As long as ComputePodQOS only looks at spec fields, I think this makes sense.

If ComputePodQOS ever starts to look at status fields that can change, I think this might get us into trouble.

@tallclair, what is the plan for ensuring in-place resource changes can or cannot change PodQOS?

@liggitt
Copy link
Member

liggitt commented Sep 30, 2024

/assign @tallclair

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/test sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 15, 2024
@SergeyKanzhelev
Copy link
Member

/hold

even if we decide to go ahead with this, we cannot merge it for at least two releases since old kubelet is resetting the QoS on pod admission failure: #128083

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2024
@thockin
Copy link
Member

thockin commented Oct 29, 2024

Would we want to allow this to merge with a special-case for setting it to "", perhaps also check when we know it is being rejected by kubelet?

@thockin
Copy link
Member

thockin commented Oct 29, 2024

I think the plan, even with in-place updates, is to never change this (even if updates to spec might make it LOOK LIKE a different QoS (which is currently disallowed))

@SergeyKanzhelev
Copy link
Member

Would we want to allow this to merge with a special-case for setting it to "", perhaps also check when we know it is being rejected by kubelet?

it is a very good idea. I will be ok with this. Even if it will disallow some updates, it will make more good than bad by keeping the QoS constant. And we do not need to wait for two releases.

@carlory
Copy link
Member Author

carlory commented Oct 29, 2024

Would we want to allow this to merge with a special-case for setting it to ""

Updated. @SergeyKanzhelev An empty string is allowed when updating a Pod's status, so other changes in the integration tests can be reverted, so I want to confirm whether I should remove these changes from all test files.

@carlory
Copy link
Member Author

carlory commented Oct 29, 2024

@SergeyKanzhelev Another question is that if the pr is merged, do we need to revert the #128083?

@SergeyKanzhelev
Copy link
Member

do we need to revert the #128083?

No, let's keep it. Removing the QoS just doesn't make sense.

@@ -5373,6 +5373,11 @@ func ValidatePodStatusUpdate(newPod, oldPod *core.Pod, opts PodValidationOptions
}
}

// For backward compatibility, we allow the qosClass to be set to an empty string.
if newPod.Status.QOSClass != oldPod.Status.QOSClass && len(newPod.Status.QOSClass) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also allow to set a new QoS class when the old QoS class was empty?

Copy link
Member Author

@carlory carlory Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The QOSClass field will be populated by kube-apiserver before a Pod is created. https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/pod/strategy.go#L89

For backward compatibility, the QOSClass field can be overridden by the user or a controller to an empty string. But I don't know whether there is a case where some users or controllers reset the QOSClass field to another value after the field is dropped by others or themselves. cc @thockin

If we want the QOSClass field not to be changed by others, we may be able to override the field via call ComputePodQOS(pod) or newPod.status.qosClass = oldPod.status.qosclass in the podStatusStrategy's PrepareForUpdate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand it is not something we will do. It also opens a door to actually changing the QoS class in two changes QoS1 -> empty -> QoS2.

However, consider a scenario: two controllers updating the status of a Pod. One is implemented with the reset to empty (similar to what pod rejection is doing), another is implemented with the keeping the QoS (reading a status and then updating some other field, while keeping the QoS). We will end up with the race condition - if first controller executed first, than second one will fail. If second will succeed first, then first will work fine. This situation is not that critical if failed status update is handled correctly. But if it will result in continuous retries, it may cause some problems.

If this example doesn't make sense, and there are other reasons this scenario is not possible, please let me know. I didn't try it myself, but can imagine how it is likely will work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two possible paths, maybe:

  1. If we know that we're compensating for that specific kubelet issue, we could allow changing it to "" ONLY WHEN we are also changing phase to "Failed". There's no coming back from that, right? My kingdom for a stat machine diagram.

  2. If we know that changing from a value to "" is just for that kubelet issue, we could manually "fix up" the operation to carry the old value forward. We do this in a couple places in Service, where we auto-write values to spec. It's kind of awful but it works. pkg/registry/core/service/storage/storage.go patchAllocatedValues()

if needsClusterIP(oldSvc) && needsClusterIP(newSvc) { 
          if newSvc.Spec.ClusterIP == "" { 
              newSvc.Spec.ClusterIP = oldSvc.Spec.ClusterIP
          } 

This is called from beginUpdate which is installed as the store.BeginUpdate hook in NewREST().

If someone tries to change from QoS1 -> QoS2, validation should still fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After I look at the service code, the second way isn't a better choice. According to the BeginUpdateFunc definition, the function should only be used for a "transaction-like" operation. It is obvious that the BeginUpdateFunc is not suitable for this case even though it works. the UpdateStrategy machinisim is more preferred. It allows us to implement resource-specific behavior during updates.

The qosClass of a Pod is computed according to the PodSpec. I think the best way to resolve this problem is to override it with the old pod‘s status whatever its value from being updated Pod is. Changing or dropping the qosClass of a Pod is not intended to be a user-facing operation, and it is not recommended to allow it to be changed. The kubelet didn't really want to drop this field when it rejected a pod. A reject reason, message, and marking it as failed is all that the kubelet wants to do. we can treat the qosClass as a managed field by the kube-apiserver, any user or external controller can not drop and change it by updating the pod status. No error will be returned because the field will be populated by the kube-apiserver.

We know the InPlacePodVerticalScaling feature disallows to change a pod's qosClass via updating the pod's spec settings. so we don't need to re-compute the pod's qosClass with the new spec. we can just override the qosClass of the pod with the old pod's status.

For example,

func (podStatusStrategy) PrepareForUpdate(ctx context.Context, obj, old runtime.Object) {
	newPod := obj.(*api.Pod)
	oldPod := old.(*api.Pod)
	newPod.Spec = oldPod.Spec
	newPod.DeletionTimestamp = nil

	// don't allow the pods/status endpoint to touch owner references since old kubelets corrupt them in a way
	// that breaks garbage collection
	newPod.OwnerReferences = oldPod.OwnerReferences
        // the Pod QoS is immutable and populated at creation time by the kube-apiserver
        newPod.Status.QOSClass = oldPod.Status.QOSClass
}

The benefits of this approach are:

  1. No breaking changes for the old kubelet
  2. No need to validate the qosClass of the pod in the update validation. If we allow the qosClass to be updated from empty to a string, we have to re-compute the qosClass of the pod with the new spec settings. And then compare the value with the qosClass of the pod's status.
  3. Users or external controllers don't worry about the conflict when updating the pod's status.

What do you think? @SergeyKanzhelev @thockin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like all the benefits of the override method. Let's do it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should impact as little as possible. In other words, if we see the QoS going from a real value to "", patch it up, but if we see it change from value A to another value B, that should be a validation error.

I don't know the history of the OwnerReferences doing this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin @SergeyKanzhelev It's only populated when its value is empty. And add a validation to check if it's mutated. Please review it again. Thanks a lot!

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 30, 2024
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

/hold

@SergeyKanzhelev ?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 18396fc296839a13c7df520b729471a7d155d2dd

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 1, 2024
Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/unhold
/approve

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 1, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carlory, SergeyKanzhelev, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 6bc0768 into kubernetes:master Nov 1, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Nov 1, 2024
@carlory carlory deleted the fix-126662 branch November 2, 2024 05:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: API review completed, 1.32
Archived in project
Development

Successfully merging this pull request may close these issues.

QosClass of pod status shouldn't be changeable
7 participants