
Pods stuck in Terminating state for >3 days; unable to delete via standard methods on GKE #129473

Open
Raymonr opened this issue Jan 3, 2025 · 4 comments
Labels
kind/bug needs-triage sig/apps

Comments

@Raymonr

Raymonr commented Jan 3, 2025

What happened?

We encountered an issue where four pods in our GKE cluster have been stuck in the Terminating state for over three days. The Jobs that created these pods have already been deleted. Despite our attempts to delete the pods using standard methods (e.g., kubectl delete --force --grace-period=0), they remain in this state.

Here is the pod status:

kubectl get pods
NAME                        READY   STATUS        RESTARTS   AGE
allocation-28910880-tzr87   0/2     Terminating   0          3d
allocation-28912320-8mn4z   0/2     Terminating   0          3d
allocation-28913760-4gsxh   0/2     Terminating   0          3d
allocation-28916640-vwdj2   0/2     Terminating   0          3d
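
Each of these pods still has a deletionTimestamp set and still carries the Job tracking finalizer. A quick way to confirm that (a sketch using one of the pod names above; add -n <namespace> if the pods are not in your current namespace):

# Print the deletion timestamp and any finalizers still attached to a stuck pod
kubectl get pod allocation-28910880-tzr87 \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'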

Additional details:

  • Each pod has the finalizer batch.kubernetes.io/job-tracking and a QoS class of Burstable.
  • Attempts to remove the finalizers (a patch sketch follows this list) result in the following error:
    The Pod "allocation-event-sync-28912320-8mn4z" is invalid: spec.initContainers: Forbidden: pod updates may not add or remove containers
  • One of the pods (allocation-28912320-8mn4z) was running on a node (gke-primary-pool-bd44533c-aezp) that has since been drained and deleted. However, the pod still references the deleted node while remaining in the Terminating state.
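
For reference, a finalizer-only removal is a JSON patch along these lines (a sketch rather than the exact command we ran; it assumes batch.kubernetes.io/job-tracking is the only entry in metadata.finalizers and that the pod is in the current namespace):

# Drop the first (and only) finalizer entry from the stuck pod
kubectl patch pod allocation-event-sync-28912320-8mn4z --type=json \
  -p='[{"op": "remove", "path": "/metadata/finalizers/0"}]'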

Attempts to resolve:

  1. Force deletion with kubectl delete --force --grace-period=0 and via kubectl replace.
  2. Deleting via the kubectl proxy API (both deletion and replacement attempts; sketched after this list).
  3. Draining and deleting the node where one of the pods was running.
  4. Attempting to modify the finalizers or QoS class, which results in the error shown above.
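
The proxy-based attempt in step 2 looked roughly like this (a sketch; it assumes the pods live in the default namespace):

# Open a local proxy to the API server
kubectl proxy --port=8001 &

# Force-delete the stuck pod through the raw API with a zero grace period
curl -X DELETE \
  "http://127.0.0.1:8001/api/v1/namespaces/default/pods/allocation-28912320-8mn4z?gracePeriodSeconds=0"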

What did you expect to happen?

The pods should be removed from the cluster once the Job is deleted, or at least be force-deletable using standard methods like kubectl delete.

How can we reproduce it (as minimally and precisely as possible)?

We cannot reproduce this issue at the moment. Creating a Job via the CLI or a YAML file and then deleting it does not result in the same stuck Terminating state. This appears to be a specific edge case that we have not been able to trace back to a clear cause.

Anything else we need to know?

The only notable recent change in the cluster was an upgrade of the nodes from version 1.27 to 1.28, but we believe that happened before the Job was deleted.

Kubernetes version

$ kubectl version
Client Version: v1.28.13-gke.600
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3

Cloud provider

Google cloud

OS version

# On Linux:
$ cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_CRASH_ID=Lakitu
GOOGLE_METRICS_PRODUCT_ID=26
KERNEL_COMMIT_ID=3e0971e1551e88a5a9e615c239f034fd9fd8a423
VERSION=109
VERSION_ID=109
BUILD_ID=17800.309.13
$ uname -a
Linux gke-primary-pool-bd44533c-138r 6.1.100+ #1 SMP PREEMPT_DYNAMIC Sat Aug 10 14:21:56 UTC 2024 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux

Install tools

kubectl, gcloud

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Raymonr Raymonr added the kind/bug label Jan 3, 2025
@k8s-ci-robot k8s-ci-robot added the needs-sig and needs-triage labels Jan 3, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kannon92
Contributor

kannon92 commented Jan 3, 2025

/sig apps

Webhooks have been known to interfere with finalizer removal. I'd check whether you have any pod webhooks installed.

1.28 is out of support, so please try a supported version of Kubernetes (1.29-1.32).

@k8s-ci-robot k8s-ci-robot added the sig/apps label and removed the needs-sig label Jan 3, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Jan 3, 2025
@Raymonr
Author

Raymonr commented Jan 3, 2025

@kannon92 Thank you for your quick response!

  1. Regarding your suggestion about pod webhooks:

    • I searched the entire cluster for pods with names related to "webhook," but no such pods exist in our cluster.
  2. Clarification on webhooks:

    • Do you mean MutatingAdmissionWebhooks or ValidatingAdmissionWebhooks? If so, I inspected all webhook configurations in the cluster and checked for rules applying to apiGroups, apiVersions, operations, or resources that might affect pods. I didn't find any webhook rules targeting these criteria (the check is sketched after this list).
  3. About updating the cluster:

    • I understand your point about upgrading to a supported Kubernetes version (1.29-1.32). While we plan to update the cluster, we are cautious about proceeding because we are unsure whether the upgrade could cause further disruption while the allocation-event-sync pods remain stuck in the Terminating state.
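
For point 2, the inspection was roughly the following (a sketch; I listed the configurations and scanned their rules for anything that could match pods):

# List every admission webhook configuration in the cluster
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations

# Dump the rules and look for entries that could match pods
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o yaml | grep -B2 -A6 'rules:'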

Could you provide further guidance or confirm if there’s another area we should investigate?

@kannon92
Contributor

kannon92 commented Jan 3, 2025

We wrote this guide because this is a pretty common issue: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/#my-pod-stays-terminating
