
Fix crash on kube manager's service-lb-controller after v1.31.0. #128182

Merged
merged 2 commits into from
Oct 21, 2024


carlory
Member

carlory commented Oct 18, 2024

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

If initialization fails, we shouldn't register event handlers. Since #122145 was merged, registering them anyway causes a crash.

There is also a memory leak: the queue has no worker to consume the events.
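A minimal Go sketch of the ordering problem described above, assuming a simplified model: a shared informer that keeps every registered handler for its whole lifetime, and a queue that only a worker would drain. All names here (`Cloud`, `fakeInformer`, `queue`, `initCloud`, `newBuggy`, `newFixed`) are hypothetical stand-ins, not the real service controller or client-go API. The crash introduced by #122145 is not modeled; only the leaked handler and the undrained queue are.

```go
package main

import (
	"errors"
	"fmt"
)

// Cloud is a hypothetical cloud-provider interface; a nil Cloud means
// no cloud provider was configured.
type Cloud interface{ Name() string }

// fakeInformer is a hypothetical stand-in for a shared informer: it keeps
// every registered handler and calls each one on every event, whether or
// not the controller that registered it was ever started.
type fakeInformer struct {
	handlers []func(key string)
}

func (i *fakeInformer) AddEventHandler(h func(key string)) {
	i.handlers = append(i.handlers, h)
}

// Emit delivers one event (identified by key) to all registered handlers.
func (i *fakeInformer) Emit(key string) {
	for _, h := range i.handlers {
		h(key)
	}
}

func (i *fakeInformer) HandlerCount() int { return len(i.handlers) }

// queue is a hypothetical unbounded FIFO standing in for a workqueue.
// Only a worker would drain it; nothing in this sketch does.
type queue struct {
	items []string
}

func (q *queue) Add(key string) { q.items = append(q.items, key) }

func (q *queue) Len() int { return len(q.items) }

// initCloud is a hypothetical init step that fails without a cloud provider.
func initCloud(cloud Cloud) error {
	if cloud == nil {
		return errors.New("no cloud provider configured")
	}
	return nil
}

// newBuggy registers the handler first and runs init second. If init fails,
// the caller never starts a worker, yet the handler keeps filling q.
func newBuggy(cloud Cloud, inf *fakeInformer, q *queue) error {
	inf.AddEventHandler(q.Add)
	return initCloud(cloud)
}

// newFixed runs init first and registers the handler only on success.
func newFixed(cloud Cloud, inf *fakeInformer, q *queue) error {
	if err := initCloud(cloud); err != nil {
		return err
	}
	inf.AddEventHandler(q.Add)
	return nil
}

func main() {
	cases := []struct {
		name string
		ctor func(Cloud, *fakeInformer, *queue) error
	}{
		{"buggy", newBuggy},
		{"fixed", newFixed},
	}
	for _, tc := range cases {
		inf, q := &fakeInformer{}, &queue{}
		err := tc.ctor(nil, inf, q) // nil cloud, so init fails
		for i := 0; i < 3; i++ {
			inf.Emit(fmt.Sprintf("default/svc-%d", i))
		}
		fmt.Printf("%s: initFailed=%t handlers=%d queued=%d\n",
			tc.name, err != nil, inf.HandlerCount(), q.Len())
	}
	// Output:
	// buggy: initFailed=true handlers=1 queued=3
	// fixed: initFailed=true handlers=0 queued=0
}
```

Both constructors report the init error to the caller, but only the handler-first ordering leaves a live handler behind, so every later Service event lands in a queue that no worker consumes.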

Which issue(s) this PR fixes:

Fixes #128121

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixes 1.31 regression that can crash kube-controller-manager's service-lb-controller loop

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


k8s-ci-robot added the release-note, kind/bug, size/XS, kind/regression, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, needs-priority, area/cloudprovider, sig/cloud-provider and sig/network labels and removed the do-not-merge/needs-sig label Oct 18, 2024
k8s-ci-robot requested review from bowei and thockin October 18, 2024 10:31
@carlory
Member Author

carlory commented Oct 18, 2024

@carlory
Member Author

carlory commented Oct 18, 2024

Test result:

(base) ➜  kubernetes git:(fix-128121-1) kind create cluster --name fix-128121-1 --image kindest/node:fix-128121-1
Creating cluster "fix-128121-1" ...
 ✓ Ensuring node image (kindest/node:fix-128121-1) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-fix-128121-1"
You can now use your cluster with:

kubectl cluster-info --context kind-fix-128121-1

Have a nice day! 👋
(base) ➜  kubernetes git:(fix-128121-1) kubectl create -f __testdata/service.yaml
service/example-service created
(base) ➜  kubernetes git:(fix-128121-1) kubectl get pods -A
NAMESPACE            NAME                                                 READY   STATUS    RESTARTS   AGE
kube-system          coredns-7c65d6cfc9-4vjhj                             1/1     Running   0          2m46s
kube-system          coredns-7c65d6cfc9-plg98                             1/1     Running   0          2m46s
kube-system          etcd-fix-128121-1-control-plane                      1/1     Running   0          2m52s
kube-system          kindnet-7579v                                        1/1     Running   0          2m46s
kube-system          kube-apiserver-fix-128121-1-control-plane            1/1     Running   0          2m52s
kube-system          kube-controller-manager-fix-128121-1-control-plane   1/1     Running   0          2m52s
kube-system          kube-proxy-gsdn9                                     1/1     Running   0          2m46s
kube-system          kube-scheduler-fix-128121-1-control-plane            1/1     Running   0          2m52s
local-path-storage   local-path-provisioner-57c5987fd4-rpj4b              1/1     Running   0          2m46s
(base) ➜  kubernetes git:(fix-128121-1) kubectl patch service example-service -p '{"spec":{"externalTrafficPolicy":"Local"}}'
service/example-service patched
(base) ➜  kubernetes git:(fix-128121-1) kubectl get pods -A
NAMESPACE            NAME                                                 READY   STATUS    RESTARTS   AGE
kube-system          coredns-7c65d6cfc9-4vjhj                             1/1     Running   0          3m46s
kube-system          coredns-7c65d6cfc9-plg98                             1/1     Running   0          3m46s
kube-system          etcd-fix-128121-1-control-plane                      1/1     Running   0          3m52s
kube-system          kindnet-7579v                                        1/1     Running   0          3m46s
kube-system          kube-apiserver-fix-128121-1-control-plane            1/1     Running   0          3m52s
kube-system          kube-controller-manager-fix-128121-1-control-plane   1/1     Running   0          3m52s
kube-system          kube-proxy-gsdn9                                     1/1     Running   0          3m46s
kube-system          kube-scheduler-fix-128121-1-control-plane            1/1     Running   0          3m52s
local-path-storage   local-path-provisioner-57c5987fd4-rpj4b              1/1     Running   0          3m46s

k8s-ci-robot added the size/S and sig/api-machinery labels and removed the size/XS label Oct 18, 2024
@aojea
Member

aojea commented Oct 18, 2024

Can we have a unit test, @carlory? I think cmd/kube-controller-manager/app/controllermanager_test.go may have something we can use to check that those controllers are not started if Cloud is nil.

@aojea
Member

aojea commented Oct 18, 2024

The integration job failure seems related; I wonder if we have some code depending on this behavior 🤔

@carlory
Member Author

carlory commented Oct 18, 2024

I think cmd/kube-controller-manager/app/controllermanager_test.go may have something we can use to check that those controllers are not started if Cloud is nil.

I need to add a new unit test; I will do it later.

k8s-ci-robot added the do-not-merge/hold label Oct 21, 2024
Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
k8s-ci-robot removed the lgtm label Oct 21, 2024
@aojea
Member

aojea commented Oct 21, 2024

/lgtm

@thockin can you approve the service controller changes?

k8s-ci-robot added the lgtm label Oct 21, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a37bfd93b61cfaed586c8d0e657e89626c2a2d69

@dims
Member

dims commented Oct 21, 2024

cc @oliviassss

@aojea
Member

aojea commented Oct 21, 2024

/label tide/merge-method-squash

k8s-ci-robot added the tide/merge-method-squash label Oct 21, 2024
@aojea
Member

aojea commented Oct 21, 2024

/hold cancel

The hold was for the additional test.

The test has been modified, and it now fails without the fix.

k8s-ci-robot removed the do-not-merge/hold label Oct 21, 2024
@thockin
Copy link
Member

thockin commented Oct 21, 2024

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, carlory, soltysh, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the approved label Oct 21, 2024
k8s-ci-robot merged commit 442183a into kubernetes:master Oct 21, 2024
14 checks passed
k8s-ci-robot added this to the v1.32 milestone Oct 21, 2024
carlory deleted the fix-128121-1 branch October 21, 2024 17:00
richabanker pushed a commit to richabanker/kubernetes that referenced this pull request Oct 29, 2024
…ernetes#128182)

* Fix crash on kube manager's service-lb-controller after v1.31.0.

* Update cmd/kube-controller-manager/app/controllermanager_test.go

Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>

---------

Co-authored-by: Antonio Ojea <antonio.ojea.garcia@gmail.com>
@fedebongio
Contributor

/triage accepted

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Oct 29, 2024
k8s-ci-robot added a commit that referenced this pull request Nov 12, 2024
…182-upstream-release-1.31

Automated cherry pick of #128182: Fix crash on kube manager's service-lb-controller after v1.31.0.
Labels
approved, area/cloudprovider, cncf-cla: yes, kind/bug, kind/regression, lgtm, needs-priority, release-note, sig/api-machinery, sig/cloud-provider, sig/network, size/M, tide/merge-method-squash, triage/accepted
Development

Successfully merging this pull request may close these issues.

Crash on kube manager's service-lb-controller after v1.31.0
10 participants