kubelet: fix a bug where kubelet wrongly drops the QOSClass field of the Pod's status when it rejects a Pod #128083

Merged
merged 1 commit into from
Oct 23, 2024

Conversation

carlory
Member

@carlory carlory commented Oct 15, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

#127744 tries to tighten validation on the qosClass field of the pod status: updating the QoS class via the status subresource will be forbidden. If a node cannot admit a pod, the kubelet updates the pod status with the reason and message and marks the pod as failed, but it does not carry over the existing qosClass value, so the update is sent with an empty qosClass field. Once #127744 is merged, kube-apiserver will forbid that update and the pod will stay Pending forever.

Considering the upgrade/downgrade scenario, I submit this PR to fix the issue.
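
The idea described above can be sketched in a few lines of Go. This is a minimal illustration, not the kubelet code: `PodStatus`, `PodQOSClass`, and `rejectedStatus` are hypothetical stand-ins for the real `k8s.io/api/core/v1` types and the kubelet's rejection path. The only point is that the status written on rejection carries over the existing QoS class instead of leaving it empty.

```go
package main

import "fmt"

// PodQOSClass and PodStatus are trimmed-down, hypothetical stand-ins for the
// corresponding k8s.io/api/core/v1 types; only the fields needed here exist.
type PodQOSClass string

type PodStatus struct {
	Phase    string
	Reason   string
	Message  string
	QOSClass PodQOSClass
}

// rejectedStatus builds the status reported for a pod that failed admission.
// It is an illustrative helper, not a kubelet function.
func rejectedStatus(current PodStatus, reason, message string) PodStatus {
	return PodStatus{
		Phase:   "Failed",
		Reason:  reason,
		Message: "Pod was rejected: " + message,
		// Keep the QoS class as is: the API server treats it as immutable,
		// so sending an empty value would make the status update invalid.
		QOSClass: current.QOSClass,
	}
}

func main() {
	current := PodStatus{Phase: "Pending", QOSClass: "Burstable"}
	got := rejectedStatus(current, "PodOSNotSupported",
		"Failed to admit pod as the OS field doesn't match node OS")
	fmt.Printf("phase=%s reason=%s qosClass=%s\n", got.Phase, got.Reason, got.QOSClass)
}
```

Building the rejected status from the current one, rather than from an empty struct, keeps the immutable field stable no matter which admission check failed.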

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Oct 15 08:06:42 kind-control-plane kubelet[748]: I1015 08:06:42.562211     748 config.go:292] "Setting pods for source" source="api"
Oct 15 08:06:42 kind-control-plane kubelet[748]: I1015 08:06:42.563598     748 config.go:397] "Receiving a new pod" pod="pod-os-rejection-9071/wrong-pod-os"
Oct 15 08:06:42 kind-control-plane kubelet[748]: I1015 08:06:42.563690     748 kubelet.go:2410] "SyncLoop ADD" source="api" pods=["pod-os-rejection-9071/wrong-pod-os"]
Oct 15 08:06:42 kind-control-plane kubelet[748]: I1015 08:06:42.563844     748 kubelet.go:2309] "Pod admission denied" podUID="d3affda3-8d17-4cc8-b730-56bacfe136fe" pod="pod-os-rejection-9071/wrong-pod-os" reason="PodOSNotSupported" message="Failed to admit pod as the OS field doesn't match node OS"
Oct 15 08:06:42 kind-control-plane kubelet[748]: I1015 08:06:42.563934     748 status_manager.go:227] "Syncing updated statuses"
Oct 15 08:06:42 kind-control-plane kubelet[748]: I1015 08:06:42.564160     748 event.go:389] "Event occurred" object="pod-os-rejection-9071/wrong-pod-os" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="PodOSNotSupported" message="Failed to admit pod as the OS field doesn't match node OS"
Oct 15 08:06:42 kind-control-plane kubelet[748]: I1015 08:06:42.573671     748 status_manager.go:872] "Patch status for pod" pod="pod-os-rejection-9071/wrong-pod-os" podUID="d3affda3-8d17-4cc8-b730-56bacfe136fe" patch=""
Oct 15 08:06:42 kind-control-plane kubelet[748]: I1015 08:06:42.573723     748 status_manager.go:875] "Failed to update status for pod" pod="pod-os-rejection-9071/wrong-pod-os" err="failed to patch status \"{\\\"metadata\\\":{\\\"uid\\\":\\\"d3affda3-8d17-4cc8-b730-56bacfe136fe\\\"},\\\"status\\\":{\\\"message\\\":\\\"Pod was rejected: Failed to admit pod as the OS field doesn't match node OS\\\",\\\"phase\\\":\\\"Failed\\\",\\\"qosClass\\\":null,\\\"reason\\\":\\\"PodOSNotSupported\\\",\\\"startTime\\\":\\\"2024-10-15T08:06:42Z\\\"}}\" for pod \"pod-os-rejection-9071\"/\"wrong-pod-os\": Pod \"wrong-pod-os\" is invalid: status.qosClass: Invalid value: \"\": field is immutable"
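
The `"qosClass":null` entry in the failed patch above comes from diffing the old status (which has a QoS class) against a new status that omits it. The sketch below illustrates that effect with a simplified, flat JSON-merge-patch-style diff. It is an assumption-laden illustration: the kubelet computes its patch differently (with a strategic merge patch over the full object), and `statusPatch` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statusPatch returns a flat merge-patch-style diff between two status maps.
// Added or changed keys carry the new value; keys present only in oldStatus
// carry nil, which encoding/json renders as null (meaning "delete this key").
// Values are assumed to be comparable scalars such as strings.
func statusPatch(oldStatus, newStatus map[string]any) map[string]any {
	patch := map[string]any{}
	for k, v := range newStatus {
		if old, ok := oldStatus[k]; !ok || old != v {
			patch[k] = v
		}
	}
	for k := range oldStatus {
		if _, ok := newStatus[k]; !ok {
			patch[k] = nil
		}
	}
	return patch
}

func main() {
	oldStatus := map[string]any{"phase": "Pending", "qosClass": "Burstable"}
	// The rejected status forgets qosClass, as in the bug described above.
	newStatus := map[string]any{"phase": "Failed", "reason": "PodOSNotSupported"}

	out, err := json.Marshal(statusPatch(oldStatus, newStatus))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// prints {"phase":"Failed","qosClass":null,"reason":"PodOSNotSupported"}
}
```

With the QoS class carried over into the new status, the key would be unchanged and would not appear in the patch at all.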

Does this PR introduce a user-facing change?

kubelet: fix a bug where kubelet wrongly drops the QOSClass field of the Pod's status when it rejects a Pod

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 15, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 15, 2024
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 15, 2024
@carlory
Member Author

carlory commented Oct 15, 2024

/cc @liggitt @kannon92 @ffromani

By the way, I want to confirm whether issue #126662 is still accepted. Could you help clarify this? Thanks.

@carlory carlory changed the title kubelet: fix a bug where kubelet drops the QOSClass field of the Pod's s status when it rejects a Pod kubelet: fix a bug where kubelet wrongly drops the QOSClass field of the Pod's s status when it rejects a Pod Oct 15, 2024
@haircommander
Contributor

/cc

@SergeyKanzhelev
Member

This change is not critical from my point of view. I think we can take it only if we pair it with the e2e test I suggested. We could even make this test a conformance test in the future.

@carlory carlory force-pushed the fix-126662-kubelet branch from a4d823b to f0edb32 Compare October 17, 2024 07:00
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 17, 2024
@carlory carlory force-pushed the fix-126662-kubelet branch from 5b853e4 to 0c1b91f Compare October 22, 2024 02:40
@carlory
Member Author

carlory commented Oct 22, 2024

/test pull-kubernetes-e2e-gce

@SergeyKanzhelev
Member

Commented in #128083 (comment); mentioning it here in case the comment is hard to notice.

@carlory carlory force-pushed the fix-126662-kubelet branch from 0c1b91f to 090039a Compare October 23, 2024 07:00
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 23, 2024
@carlory carlory force-pushed the fix-126662-kubelet branch from 090039a to dbb4729 Compare October 23, 2024 07:05
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 23, 2024
@carlory
Member Author

carlory commented Oct 23, 2024

Test without this patch:

• [FAILED] [4.806 seconds]
[sig-node] PodRejectionStatus Kubelet [It] should reject pod when the node didn't have enough resource [sig-node]
/Users/kiki/workspace/golang/src/k8s.io/kubernetes/test/e2e/common/node/pod_admission.go:74

  [FAILED] Expected
      <v1.PodStatus>:
          message: 'Pod was rejected: Node didn''t have enough resource: cpu, requested: 1000000000000000,
            used: 5050, capacity: 8000'
          phase: Failed
          reason: OutOfcpu
          startTime: "2024-10-23T07:04:31Z"
  to equal
      <v1.PodStatus>:
          message: 'Pod was rejected: Node didn''t have enough resource: cpu, requested: 1000000000000000,
            used: 5050, capacity: 8000'
          phase: Failed
          qosClass: Burstable
          reason: OutOfcpu
          startTime: "2024-10-23T07:04:31Z"
  In [It] at: /Users/kiki/workspace/golang/src/k8s.io/kubernetes/test/e2e/common/node/pod_admission.go:151

Test with this patch:

[ReportAfterSuite] PASSED [0.000 seconds]
------------------------------

Ran 1 of 6603 Specs in 4.252 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 6602 Skipped
You're using deprecated Ginkgo functionality:
=============================================
  --ginkgo.slow-spec-threshold is deprecated --slow-spec-threshold has been deprecated and will be removed in a future version of Ginkgo.  This feature has proved to be more noisy than useful.  You can use --poll-progress-after, instead, to get more actionable feedback about potentially slow specs and understand where they might be getting stuck.

To silence deprecations that can be silenced set the following environment variable:
  ACK_GINKGO_DEPRECATIONS=2.19.0

PASS

Comment on lines 145 to 150
expectedStatus.InitContainerStatuses = nil
expectedStatus.ContainerStatuses = nil
// expectedStatus.QOSClass keep it as is
expectedStatus.EphemeralContainerStatuses = nil
expectedStatus.Resize = ""
expectedStatus.ResourceClaimStatuses = nil
Member

were those not nil before?

Member Author

@carlory carlory Oct 23, 2024

Those are nil. The kube-scheduler sets some fields of the pod status, e.g. conditions. I don't know whether other components set other fields of the pod status. If they do, the test will fail, so I explicitly set all the fields of the pod status, in the same order as in the type definition.

If it is not required, I will remove it.

Member Author

@SergeyKanzhelev should I remove those?

Member

yes please.

Member Author

removed

expectedStatus.NominatedNodeName = ""
expectedStatus.HostIP = ""
expectedStatus.HostIPs = nil
expectedStatus.PodIP = ""
Member

Is PodIP not nil in both cases?

Member Author

same as above.

Member

please remove.

Member Author

removed

@carlory
Member Author

carlory commented Oct 23, 2024

/test pull-kubernetes-e2e-kind-ipv6

…s status when it rejects a Pod

Co-authored-by: Sergey Kanzhelev <S.Kanzhelev@live.com>
@carlory carlory force-pushed the fix-126662-kubelet branch from dbb4729 to c7e384f Compare October 23, 2024 17:01
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 7405edbfbe37c904d3598393ef710ebdc22d970f

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carlory, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2024
@carlory
Member Author

carlory commented Oct 23, 2024

/retest

@k8s-ci-robot k8s-ci-robot merged commit d7e5ff8 into kubernetes:master Oct 23, 2024
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Oct 23, 2024
@carlory carlory deleted the fix-126662-kubelet branch October 24, 2024 02:46
@pacoxu
Member

pacoxu commented Oct 28, 2024

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
5 participants