Tighten validation on the qosClass field of pod status #127744
Conversation
/cc @kannon92
/sig node
/retest
/triage accepted
priority set based on the fact that this is a very old issue we have endured so far
As long as ComputePodQOS only looks at spec fields, I think this makes sense. If ComputePodQOS ever starts to look at status fields that can change, I think this might get us into trouble. @tallclair, what is the plan for ensuring in-place resource changes can or cannot change PodQOS?
/assign @tallclair
/hold
Even if we decide to go ahead with this, we cannot merge it for at least two releases, since the old kubelet resets the QoS on pod admission failure: #128083
Would we want to allow this to merge with a special case for setting it to ""?
I think the plan, even with in-place updates, is to never change this, even if updates to the spec might make it LOOK LIKE a different QoS (which is currently disallowed).
It is a very good idea; I will be OK with this. Even if it disallows some updates, it will do more good than harm by keeping the QoS constant. And we would not need to wait for two releases.
Updated. @SergeyKanzhelev An empty string is allowed when updating a Pod's status, so the other changes in the integration tests can be reverted. I want to confirm whether I should remove these changes from all the test files.
@SergeyKanzhelev Another question: if this PR is merged, do we need to revert #128083?
No, let's keep it. Removing the QoS just doesn't make sense.
@@ -5373,6 +5373,11 @@ func ValidatePodStatusUpdate(newPod, oldPod *core.Pod, opts PodValidationOptions
 		}
 	}

+	// For backward compatibility, we allow the qosClass to be set to an empty string.
+	if newPod.Status.QOSClass != oldPod.Status.QOSClass && len(newPod.Status.QOSClass) > 0 {
Should we also allow setting a new QoS class when the old QoS class was empty?
The QOSClass field is populated by kube-apiserver before a Pod is created: https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/pod/strategy.go#L89
For backward compatibility, the QOSClass field can be overridden to an empty string by the user or a controller. But I don't know whether there is a case where some user or controller resets the QOSClass field to another value after the field has been dropped by others or by themselves. cc @thockin
If we want the QOSClass field not to be changed by others, we could override the field in podStatusStrategy's PrepareForUpdate, either by calling ComputePodQOS(pod) or by setting newPod.Status.QOSClass = oldPod.Status.QOSClass. A sketch of the two alternatives follows.
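A rough sketch of those two alternatives, assuming the existing podStatusStrategy in pkg/registry/core/pod/strategy.go and the qos helper package that strategy.go already uses for creation (illustrative, not the final change):

func (podStatusStrategy) PrepareForUpdate(ctx context.Context, obj, old runtime.Object) {
	newPod := obj.(*api.Pod)
	oldPod := old.(*api.Pod)

	// Alternative 1: recompute from the spec; status updates cannot change
	// the spec, so the result matches the value set at creation time.
	newPod.Status.QOSClass = qos.ComputePodQOS(newPod)

	// Alternative 2: unconditionally carry the old, apiserver-populated
	// value forward, ignoring whatever the client sent.
	// newPod.Status.QOSClass = oldPod.Status.QOSClass
}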
cc @liggitt
Yes, I understand it is not something we will do. It also opens a door to actually changing the QoS class in two steps: QoS1 -> empty -> QoS2.
However, consider a scenario with two controllers updating the status of a Pod. One is implemented with the reset to empty (similar to what pod rejection is doing); the other keeps the QoS (reading the status and then updating some other field while keeping the QoS). We end up with a race condition: if the first controller executes first, the second one will fail; if the second succeeds first, the first will work fine. This situation is not that critical if a failed status update is handled correctly, but if it results in continuous retries, it may cause problems.
If this example doesn't make sense, and there are other reasons this scenario is not possible, please let me know. I didn't try it myself, but I can imagine that this is how it would likely work.
Two possible paths, maybe:
- If we know that we're compensating for that specific kubelet issue, we could allow changing it to "" ONLY WHEN we are also changing phase to "Failed". There's no coming back from that, right? My kingdom for a state machine diagram. (A sketch of this special case follows after this comment.)
- If we know that changing from a value to "" is just for that kubelet issue, we could manually "fix up" the operation to carry the old value forward. We do this in a couple of places in Service, where we auto-write values to spec. It's kind of awful, but it works. From patchAllocatedValues() in pkg/registry/core/service/storage/storage.go:

	if needsClusterIP(oldSvc) && needsClusterIP(newSvc) {
		if newSvc.Spec.ClusterIP == "" {
			newSvc.Spec.ClusterIP = oldSvc.Spec.ClusterIP
		}
	}

This is called from beginUpdate, which is installed as the store.BeginUpdate hook in NewREST().
If someone tries to change from QoS1 -> QoS2, validation should still fail.
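A minimal sketch of that first path inside ValidatePodStatusUpdate, assuming the usual allErrs/field.Path conventions of that function (hypothetical, not the merged change):

	// Tolerate clearing qosClass only when the pod is simultaneously moving
	// to the terminal Failed phase (the old-kubelet rejection behavior);
	// any other change, including QoS1 -> QoS2, is rejected.
	if newPod.Status.QOSClass != oldPod.Status.QOSClass {
		clearedOnFailure := newPod.Status.QOSClass == "" && newPod.Status.Phase == core.PodFailed
		if !clearedOnFailure {
			allErrs = append(allErrs, field.Forbidden(field.NewPath("status", "qosClass"), "field is immutable"))
		}
	}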
After looking at the Service code, I don't think the second way is the better choice. According to the BeginUpdateFunc definition, that hook should only be used for "transaction-like" operations, so it is not really suitable for this case even though it would work. The UpdateStrategy mechanism is preferable: it allows us to implement resource-specific behavior during updates.
The qosClass of a Pod is computed from the PodSpec. I think the best way to resolve this problem is to override it with the old Pod's status, whatever its value in the updated Pod is. Changing or dropping the qosClass of a Pod is not intended to be a user-facing operation, and it is not recommended to allow it to be changed. The kubelet didn't really want to drop this field when it rejected a pod; a rejection reason, a message, and marking the pod as failed is all the kubelet wants to do. We can treat qosClass as a field managed by kube-apiserver: no user or external controller can drop or change it by updating the pod status. No error will be returned, because the field will be repopulated by kube-apiserver.
We know the InPlacePodVerticalScaling feature disallows changing a pod's qosClass via updates to the pod's spec, so we don't need to re-compute the pod's qosClass from the new spec; we can just override it with the old pod's status.
For example,
func (podStatusStrategy) PrepareForUpdate(ctx context.Context, obj, old runtime.Object) {
newPod := obj.(*api.Pod)
oldPod := old.(*api.Pod)
newPod.Spec = oldPod.Spec
newPod.DeletionTimestamp = nil
// don't allow the pods/status endpoint to touch owner references since old kubelets corrupt them in a way
// that breaks garbage collection
newPod.OwnerReferences = oldPod.OwnerReferences
// the Pod QoS is immutable and populated at creation time by the kube-apiserver
newPod.Status.QOSClass = oldPod.Status.QOSClass
}
The benefits of this approach are:
- No breaking change for the old kubelet.
- No need to validate the pod's qosClass in the update validation. If we allowed the qosClass to be updated from empty to a string, we would have to re-compute the pod's qosClass from the new spec settings and then compare that value with the qosClass in the pod's status.
- Users and external controllers don't need to worry about conflicts when updating the pod's status.
What do you think? @SergeyKanzhelev @thockin
I like all the benefits of the override method. Let's do it.
I think we should have as little impact as possible. In other words, if we see the QoS going from a real value to "", patch it up; but if we see it change from value A to another value B, that should be a validation error.
I don't know the history of OwnerReferences doing this.
@thockin @SergeyKanzhelev It's now only populated when its value is empty, and a validation check rejects any other mutation. Please review it again. Thanks a lot!
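Roughly, the combination looks like this (a paraphrase of the two pieces, not the exact diff):

	// podStatusStrategy.PrepareForUpdate: repopulate qosClass only when the
	// incoming status update cleared it, so old kubelets that drop the field
	// on pod rejection keep working.
	if newPod.Status.QOSClass == "" {
		newPod.Status.QOSClass = oldPod.Status.QOSClass
	}

	// ValidatePodStatusUpdate: any remaining difference is a real
	// QoS1 -> QoS2 mutation attempt and is rejected.
	if newPod.Status.QOSClass != oldPod.Status.QOSClass {
		allErrs = append(allErrs, field.Forbidden(field.NewPath("status", "qosClass"), "field is immutable"))
	}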
LGTM label has been added. Git tree hash: 18396fc296839a13c7df520b729471a7d155d2dd
/lgtm
/unhold
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: carlory, SergeyKanzhelev, thockin
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
According to this code, the qosClass of a pod's status is immutable.
This function also shows that qosClass is computed from the pod's resource requests and limits instead of being provided by users.
However, we can use the REST API or "kubectl edit pod --subresource='status'" to change it.
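For context, a simplified sketch of how the QoS class is derived from the spec (the real helper also handles init containers and some edge cases, so this is illustrative only; qosOf is a hypothetical name):

func qosOf(pod *core.Pod) core.PodQOSClass {
	allGuaranteed, anyRequestsOrLimits := true, false
	for _, c := range pod.Spec.Containers {
		for _, res := range []core.ResourceName{core.ResourceCPU, core.ResourceMemory} {
			req, hasReq := c.Resources.Requests[res]
			lim, hasLim := c.Resources.Limits[res]
			if hasReq || hasLim {
				anyRequestsOrLimits = true
			}
			// Guaranteed requires requests == limits for both CPU and memory
			// in every container.
			if !hasReq || !hasLim || req.Cmp(lim) != 0 {
				allGuaranteed = false
			}
		}
	}
	if !anyRequestsOrLimits {
		return core.PodQOSBestEffort
	}
	if allGuaranteed {
		return core.PodQOSGuaranteed
	}
	return core.PodQOSBurstable
}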
Which issue(s) this PR fixes:
Fixes #126662
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: