'kubectl delete job <jobname>' hangs for ten minutes, then times out #28977
Comments
I've edited your post, fixing the markdown formatting so the code blocks are more readable.
In your case, here's what's happening. Your job is in an unexpected state, because it has the following spec:
Hi Maciej, thank you for looking into this! I apologize for the formatting. I'll take your feedback as an invitation to strengthen my grasp of markdown. :)
I have a cron task that runs:
Previously, this would remove the old job and replace it with a new, identical job. If there is any other information I can provide, I'm happy to dig up whatever you need. Thank you, Dan
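The cron command itself isn't preserved in the thread; as a rough illustration only, the delete-then-recreate pattern being described looks something like the sketch below. The job name and manifest path are placeholders, not the reporter's actual values.

```sh
# Illustrative sketch only -- the actual cron command is not shown in this thread.
# <jobname> and job.yaml are placeholders.
kubectl delete job <jobname>
kubectl create -f job.yaml
```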
I'm using an HA master config (kops/aws), if that's relevant.
Thanks, I'll try to investigate what might be causing this.
I was working with @justinsb from the kubernetes-users Slack channel. He had a couple of ideas, and said it would be OK if I posted them here:
Restarting the controller had no apparent effect on the bug.
Closed by accident. :")
Unable to reproduce this with a 1.3.0/1.3.0 combination and a very similar job.
@foxish I don't know about reproducibility, but I have a live VM that's exhibiting this, so if you want me to poke at the state, I can.
FWIW in my case the job is stuck with exactly the same state that @soltysh described. Here's the snippet from the kubectl log:
```
parallelism: 0, completions: 1, active: 1, succeeded: 1
```
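For anyone trying to spot the same condition: the fields quoted above map onto the Job object's spec and status, so `kubectl get job <jobname> -o yaml` on a stuck job should show something along these lines. This is a reconstructed fragment based on the quoted fields, not the reporter's actual output.

```yaml
# Reconstructed fragment -- not the reporter's actual output.
spec:
  parallelism: 0
  completions: 1
status:
  active: 1      # a pod is still counted as active...
  succeeded: 1   # ...even though the job has already succeeded
```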
Have a stuck job too. Will leave it at that until I'm asked for more details; I don't want to clog up the thread. You can also reach me directly on Kubernetes Slack as joshk.
@joshk0 can you verify your job against the previous comment from Brian, to confirm you're having exactly the same problem? (I didn't find you on Slack.)
Yes, it also had "active": 1 despite having succeeded. Unfortunately, the engineer on our team who noticed this already discovered how to delete it: using the REST API directly. So it is now gone.
@joshua: I tried the API and failed to delete my jobs. What did you send to it?
@dstynchula after authenticating, we sent a DELETE to |
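The exact URL isn't preserved above. In general, a direct DELETE against the batch API for a Job looks something like the sketch below; the master address, namespace, job name, and bearer-token authentication are placeholders and assumptions, not necessarily the exact request that was sent.

```sh
# Placeholder values -- not necessarily the exact request described above.
# Authentication is shown as a bearer token purely for illustration.
curl -X DELETE \
  -H "Authorization: Bearer $TOKEN" \
  https://<master>/apis/batch/v1/namespaces/<namespace>/jobs/<jobname>
```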
Sorry guys, I was out for the last two weeks taking some time off. I'm hoping to pick this up once I get through the tons of email in my inbox.
I have been encountering the same issue. It's probably because k8s is trying to delete resources created by the Job you are deleting. Deleting with `--cascade=false` (the last command below) worked:

```
$ kubectl describe job deployer
Name:           deployer
Namespace:      default
Image(s):       gcr.io/maven-clinic-image-builder/deployer:31e65ed9452c5047c9e79fbf8e1ee0e2f046d3ea,gcr.io/maven-clinic-image-builder/api:31e65ed9452c5047c9e79fbf8e1ee0e2f046d3ea
Selector:       controller-uid=7ad4f8ea-7eac-11e6-8f5a-42010af001cc
Parallelism:    0
Completions:    1
Start Time:     Mon, 19 Sep 2016 17:03:14 -0400
Labels:         controller-uid=7ad4f8ea-7eac-11e6-8f5a-42010af001cc
                job-name=deployer
Pods Statuses:  0 Running / 0 Succeeded / 156 Failed
No volumes.
Events:
  FirstSeen  LastSeen  Count  From              SubobjectPath  Type    Reason            Message
  ---------  --------  -----  ----              -------------  ------  ------            -------
  22h        9m        893    {job-controller}                 Normal  SuccessfulCreate  (events with common reason combined)
  8m         8m        1      {job-controller}                 Normal  SuccessfulDelete  Deleted pod: deployer-23j9s
```

```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6", GitCommit:"ae4550cc9c89a593bcda6678df201db1b208133b", GitTreeState:"clean", BuildDate:"2016-08-26T18:13:23Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6", GitCommit:"ae4550cc9c89a593bcda6678df201db1b208133b", GitTreeState:"clean", BuildDate:"2016-08-26T18:06:06Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
```

```
$ kubectl delete job deployer --cascade=false
job "deployer" deleted
```
@ye right, but I'd still like to fix the root cause of the problem.
@soltysh 👍 @foxish FYI, to reproduce the issue, I believe you can intentionally create a job with at least one Docker container that has a command that will fail, and set
@soltysh Able to reproduce this in about 20 minutes with the following job definition.
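The job definition referenced above isn't preserved in this thread. Based on the description (a container whose command always fails), a job along these lines should exercise the same code path; the name, image, and command are illustrative choices, not the reporter's actual values.

```yaml
# Illustrative sketch only -- not the exact definition referenced above.
apiVersion: batch/v1
kind: Job
metadata:
  name: failing-job            # hypothetical name
spec:
  completions: 1
  template:
    spec:
      restartPolicy: Never     # failed pods are not restarted in place, so new ones accumulate
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "exit 1"]   # always fails
```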
The problem is not with the controller, but with the deletion logic.
Yes, there's no job pruning implemented. @erictune @bgrant0607 I'm currently implementing job pruning downstream in openshift/origin#11345. Would it make sense to have some pruning mechanism in k8s? Or should we rather extend the garbage collection? IIUC it currently handles only pods, am I right?
@soltysh GC can handle all kinds of resources.
@soltysh Job keeps around all dead pods? Job needs to keep the successful pods. It doesn't strictly need to keep around failed ones, though keeping them can be useful for debugging, especially if the logs aren't pushed. I think a reasonable starting point would be to keep the most recent failed pod for any given completion. I wouldn't add a configurable policy at this point. You'll also want to set ownerReferences on the pods so that server-side cascading deletion will kick in when the job is deleted.
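For readers unfamiliar with the mechanism being suggested: an ownerReference on a job-created pod would look roughly like the sketch below. The names and UID are placeholders; this shows the general shape of the field, not code from the job controller.

```yaml
# Sketch of a pod owned by a Job -- names and UID are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: <jobname>-<hash>
  ownerReferences:
  - apiVersion: batch/v1
    kind: Job
    name: <jobname>
    uid: <job-uid>        # must match the owning Job's metadata.uid
    controller: true      # marks the Job as the managing controller
```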
Yes, which is quite painful in the case of ScheduledJobs firing every minute or so. All of your suggestions sound OK; I'll add them to the ScheduledJob feature so that this is covered before migrating it to beta.
User doc on server-side garbage collection: http://kubernetes.io/docs/user-guide/garbage-collection/
@soltysh please loop me in if you want to add ownerReferences for pods created by jobs.
@dstynchula There are no sig labels on this issue. Please add a sig label by: |
/sig cli |
I think with server-side deletion this can be closed now. If you don't agree, please feel free to re-open.
Running
`kubectl delete job <job name>`
or
`kubectl delete -f <job yaml definition>.yaml`
causes kubectl to hang for approximately 10 minutes. (Unless you wait out the 10 minutes, Ctrl+C is required to kill the process.)
After 10 minutes, kubectl reports:
This bug has caused major problems as jobs that have run for months have suddenly ceased to operate normally.
In essence, the kubectl binary just hangs and produces no output unless verbosity is increased.
This occurs in:
Background
I have a scheduled task that performs:
I did this to schedule tasks that need to run periodically and then exit. Rather than having huge quantities of dead containers stacking up, we opted to use jobs instead.
Debug information
Adding verbosity 9:
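The verbose output itself isn't preserved here. For reference, raising kubectl's log verbosity to 9 for the delete looks like the following; the job name is a placeholder.

```sh
kubectl delete job <jobname> --v=9
```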