Pods stuck in Terminating state for >3 days; unable to delete via standard methods on GKE #129473
Labels
kind/bug
Categorizes issue or PR as related to a bug.
needs-triage
Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/apps
Categorizes an issue or PR as relevant to SIG Apps.
What happened?
We encountered an issue where four pods in our GKE cluster have been stuck in the `Terminating` state for over three days. The Jobs that created these pods have already been deleted. Despite our attempts to delete the pods using standard methods (e.g., `kubectl delete --force --grace-period=0`), they remain in this state.
Here is the pod status:
Additional details:
- The pods carry the `batch.kubernetes.io/job-tracking` finalizer and a QoS class of `Burstable`.
- Attempts to update a pod are rejected with: `The Pod "allocation-event-sync-28912320-8mn4z" is invalid: spec.initContainers: Forbidden: pod updates may not add or remove containers`
- One pod (`allocation-28912320-8mn4z`) was running on a node (`gke-primary-pool-bd44533c-aezp`) that has since been drained and deleted. However, the pod still references the deleted node while remaining in the `Terminating` state.
Attempts to resolve:
- Force deletion with `kubectl delete --force --grace-period=0` and via `kubectl replace`.
- Direct calls to the API through `kubectl proxy` (both deletion and replacement attempts).
What did you expect to happen?
The pods should be removed from the cluster once the Job is deleted, or at least be force-deletable using standard methods like kubectl delete.
How can we reproduce it (as minimally and precisely as possible)?
We cannot reproduce this issue at the moment. Creating a Job via the CLI or a YAML file and then deleting it does not leave its pods stuck in the `Terminating` state. This appears to be a specific edge case whose trigger we have not been able to identify.
Anything else we need to know?
The only notable recent change in the cluster is that we upgraded the nodes from version 1.27 to 1.28, but we believe that happened before the Jobs were deleted.
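For anyone triaging, the fields most relevant to a stuck deletion can be summarized from a pod manifest (for example, the output of `kubectl get pod <name> -o json`). The sketch below is illustrative only: it assumes `batch.kubernetes.io/job-tracking` appears on the pods as a finalizer, the function names are hypothetical, and the sample manifest is assembled from values quoted in this report, with a placeholder `deletionTimestamp` because the real one is not included here.

```python
import json


def describe_stuck_pod(pod: dict) -> dict:
    """Summarize the Pod fields that matter when a deletion does not complete."""
    meta = pod.get("metadata", {})
    spec = pod.get("spec", {})
    status = pod.get("status", {})
    return {
        "name": meta.get("name"),
        # deletionTimestamp is set by the API server once deletion is requested.
        "deletion_requested": "deletionTimestamp" in meta,
        # The object is not removed from the API while this list is non-empty.
        "finalizers": list(meta.get("finalizers") or []),
        # Node the pod was bound to; it may refer to a node that no longer exists.
        "node_name": spec.get("nodeName"),
        "qos_class": status.get("qosClass"),
    }


def finalizer_removal_patch() -> str:
    """Build a JSON merge patch that clears metadata.finalizers and nothing else."""
    return json.dumps({"metadata": {"finalizers": None}})


# Hypothetical manifest built from values quoted in this report.
pod = {
    "metadata": {
        "name": "allocation-28912320-8mn4z",
        "deletionTimestamp": "1970-01-01T00:00:00Z",  # placeholder, not the real value
        "finalizers": ["batch.kubernetes.io/job-tracking"],
    },
    "spec": {"nodeName": "gke-primary-pool-bd44533c-aezp"},
    "status": {"qosClass": "Burstable"},
}

print(describe_stuck_pod(pod))
print(finalizer_removal_patch())
```

A merge patch of this shape touches only `metadata.finalizers`, whereas a full replace submits the entire pod spec and is subject to spec validation. It could be sent with `kubectl patch pod <name> --type=merge -p '<patch>'`. This has not been verified against the affected pods.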
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)