
Pods stuck in Terminating state for >3 days; unable to delete via standard methods on GKE #129473

Open
Raymonr opened this issue Jan 3, 2025 · 4 comments
Labels
kind/bug needs-triage sig/apps

Comments

@Raymonr

Raymonr commented Jan 3, 2025

What happened?

We encountered an issue where four pods in our GKE cluster have been stuck in the Terminating state for over three days. The Jobs that created these pods have already been deleted. Despite our attempts to delete the pods using standard methods (e.g., kubectl delete --force --grace-period=0), they remain in this state.

Here is the pod status:

kubectl get pods
NAME                        READY   STATUS        RESTARTS   AGE
allocation-28910880-tzr87   0/2     Terminating   0          3d
allocation-28912320-8mn4z   0/2     Terminating   0          3d
allocation-28913760-4gsxh   0/2     Terminating   0          3d
allocation-28916640-vwdj2   0/2     Terminating   0          3d
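
Each of these pods still has a deletionTimestamp set and still carries the Job tracking finalizer. A quick way to confirm that (a sketch using one of the pod names above; add -n <namespace> if the pods are not in your current namespace):

# Print the deletion timestamp and any finalizers still attached to a stuck pod
kubectl get pod allocation-28910880-tzr87 \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'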

Additional details:

  • Each pod has the finalizer batch.kubernetes.io/job-tracking and a QoS class of Burstable.
  • Attempts to remove the finalizers (a patch sketch follows this list) result in the following error:
    The Pod "allocation-event-sync-28912320-8mn4z" is invalid: spec.initContainers: Forbidden: pod updates may not add or remove containers
  • One of the pods (allocation-28912320-8mn4z) was running on a node (gke-primary-pool-bd44533c-aezp) that has since been drained and deleted. However, the pod still references the deleted node while remaining in the Terminating state.
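
For reference, a finalizer-only removal is a JSON patch along these lines (a sketch rather than the exact command we ran; it assumes batch.kubernetes.io/job-tracking is the only entry in metadata.finalizers and that the pod is in the current namespace):

# Drop the first (and only) finalizer entry from the stuck pod
kubectl patch pod allocation-event-sync-28912320-8mn4z --type=json \
  -p='[{"op": "remove", "path": "/metadata/finalizers/0"}]'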

Attempts to resolve:

  1. Force deletion with kubectl delete --force --grace-period=0 and via kubectl replace.
  2. Deleting via the kubectl proxy API (both deletion and replacement attempts; sketched after this list).
  3. Draining and deleting the node where one of the pods was running.
  4. Attempting to modify the finalizers or QoS class, which results in the error shown above.
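
The proxy-based attempt in step 2 looked roughly like this (a sketch; it assumes the pods live in the default namespace):

# Open a local proxy to the API server
kubectl proxy --port=8001 &

# Force-delete the stuck pod through the raw API with a zero grace period
curl -X DELETE \
  "http://127.0.0.1:8001/api/v1/namespaces/default/pods/allocation-28912320-8mn4z?gracePeriodSeconds=0"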

What did you expect to happen?

The pods should be removed from the cluster once the Job is deleted, or at least be force-deletable using standard methods like kubectl delete.

How can we reproduce it (as minimally and precisely as possible)?

We cannot reproduce this issue at the moment. Creating a Job via the CLI or a YAML file and then deleting it does not result in the same stuck Terminating state. This appears to be a specific edge case that we have not been able to trace back to a clear cause.

Anything else we need to know?

The only notable recent change in the cluster was an upgrade of the nodes from version 1.27 to 1.28, but we believe that happened before the Job was deleted.

Kubernetes version

$ kubectl version
Client Version: v1.28.13-gke.600
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3

Cloud provider

Google cloud

OS version

# On Linux:
$ cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_CRASH_ID=Lakitu
GOOGLE_METRICS_PRODUCT_ID=26
KERNEL_COMMIT_ID=3e0971e1551e88a5a9e615c239f034fd9fd8a423
VERSION=109
VERSION_ID=109
BUILD_ID=17800.309.13
$ uname -a
Linux gke-primary-pool-bd44533c-138r 6.1.100+ #1 SMP PREEMPT_DYNAMIC Sat Aug 10 14:21:56 UTC 2024 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux

Install tools

kubectl, gcloud

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Raymonr Raymonr added the kind/bug label Jan 3, 2025
@k8s-ci-robot k8s-ci-robot added the needs-sig and needs-triage labels Jan 3, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kannon92
Contributor

kannon92 commented Jan 3, 2025

/sig apps

Webhooks have been known to interfere with finalizer removal. I'd check whether you have any pod webhooks installed.

1.28 is out of support, so please try a supported version of Kubernetes (1.29-1.32).

@k8s-ci-robot k8s-ci-robot added the sig/apps label and removed the needs-sig label Jan 3, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Jan 3, 2025
@Raymonr
Author

Raymonr commented Jan 3, 2025

@kannon92 Thank you for your quick response!

  1. Regarding your suggestion about pod webhooks:

    • I searched the entire cluster for pods with names related to "webhook," but no such pods exist in our cluster.
  2. Clarification on webhooks:

    • Do you mean MutatingAdmissionWebhooks or ValidatingAdmissionWebhooks? If so, I inspected all webhook configurations in the cluster and checked for rules applying to apiGroups, apiVersions, operations, or resources that might affect pods. I didn't find any webhook rules targeting these criteria (the check is sketched after this list).
  3. About updating the cluster:

    • I understand your point about upgrading to a supported Kubernetes version (1.29-1.32). While we plan to update the cluster, we are cautious about proceeding because we are unsure whether the upgrade could cause further disruption while the allocation-event-sync pods remain stuck in the Terminating state.
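
For point 2, the inspection was roughly the following (a sketch; I listed the configurations and scanned their rules for anything that could match pods):

# List every admission webhook configuration in the cluster
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations

# Dump the rules and look for entries that could match pods
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o yaml | grep -B2 -A6 'rules:'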

Could you provide further guidance or confirm if there’s another area we should investigate?

@kannon92
Contributor

kannon92 commented Jan 3, 2025

We wrote this guide because this is a pretty common issue: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods/#my-pod-stays-terminating
