'kubectl delete job <jobname>' hangs for ten minutes, then times out #28977

Closed
dstynchula opened this issue Jul 14, 2016 · 30 comments
Assignees: soltysh
Labels: area/batch, area/controller-manager, sig/cli

dstynchula commented Jul 14, 2016

Running

kubectl delete job <job name>

or

kubectl delete -f <job yaml definition>.yaml

causes kubectl to hang for approximately 10 minutes.

(Unless you wait out the full 10 minutes, Ctrl+C is required to kill the process.)

After 10 minutes, kubectl reports:

F0714 16:26:29.350802   10368 helpers.go:108] error: timed out waiting for the condition

This bug has caused major problems as jobs that have run for months have suddenly ceased to operate normally.

In essence, the kubectl binary just hangs and produces no output unless verbosity is increased.

This occurs in:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.0", GitCommit:"283137936a498aed572ee22af6774b6fb6e9fd94", GitTreeState:"clean", BuildDate:"2016-07-01T19:26:38Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.0", GitCommit:"283137936a498aed572ee22af6774b6fb6e9fd94", GitTreeState:"clean", BuildDate:"2016-07-01T19:19:19Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

Background

I have a scheduled task that performs:

kubectl delete -f job.yaml
kubectl create -f job.yaml

I did this to schedule tasks that need to run periodically and then exit. Rather than letting large numbers of dead containers stack up, we opted to use jobs instead.

Debug information

Adding verbosity 9:

kubectl delete job sproket_maker-3 --v=9
I0714 15:00:19.009628    7749 loader.go:327] Config loaded from file /home/dstynchula/.kube/config
I0714 15:00:19.010519    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/api
I0714 15:00:19.327875    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/api 200 OK in 317 milliseconds
I0714 15:00:19.327888    7749 round_trippers.go:324] Response Headers:
I0714 15:00:19.327894    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:16 GMT
I0714 15:00:19.327897    7749 round_trippers.go:327]     Content-Length: 135
I0714 15:00:19.327900    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:19.327928    7749 request.go:891] Response Body: {"kind":"APIVersions","versions":["v1"],"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]}
I0714 15:00:19.328086    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/apis
I0714 15:00:19.398573    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis 200 OK in 70 milliseconds
I0714 15:00:19.398585    7749 round_trippers.go:324] Response Headers:
I0714 15:00:19.398588    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:19.398591    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:16 GMT
I0714 15:00:19.398594    7749 round_trippers.go:327]     Content-Length: 1649
I0714 15:00:19.398623    7749 request.go:891] Response Body: {"kind":"APIGroupList","groups":[{"name":"apps","versions":[{"groupVersion":"apps/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"apps/v1alpha1","version":"v1alpha1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"autoscaling","versions":[{"groupVersion":"autoscaling/v1","version":"v1"}],"preferredVersion":{"groupVersion":"autoscaling/v1","version":"v1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"batch","versions":[{"groupVersion":"batch/v1","version":"v1"},{"groupVersion":"batch/v2alpha1","version":"v2alpha1"}],"preferredVersion":{"groupVersion":"batch/v1","version":"v1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"extensions","versions":[{"groupVersion":"extensions/v1beta1","version":"v1beta1"}],"preferredVersion":{"groupVersion":"extensions/v1beta1","version":"v1beta1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"policy","versions":[{"groupVersion":"policy/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"policy/v1alpha1","version":"v1alpha1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"rbac.authorization.k8s.io","versions":[{"groupVersion":"rbac.authorization.k8s.io/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"rbac.authorization.k8s.io/v1alpha1","version":"v1alpha1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]}]}
I0714 15:00:19.399553    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/api
I0714 15:00:19.469695    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/api 200 OK in 70 milliseconds
I0714 15:00:19.469706    7749 round_trippers.go:324] Response Headers:
I0714 15:00:19.469710    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:19.469713    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:16 GMT
I0714 15:00:19.469715    7749 round_trippers.go:327]     Content-Length: 135
I0714 15:00:19.469740    7749 request.go:891] Response Body: {"kind":"APIVersions","versions":["v1"],"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]}
I0714 15:00:19.469849    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/apis
I0714 15:00:19.540326    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis 200 OK in 70 milliseconds
I0714 15:00:19.540337    7749 round_trippers.go:324] Response Headers:
I0714 15:00:19.540340    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:19.540343    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:17 GMT
I0714 15:00:19.540346    7749 round_trippers.go:327]     Content-Length: 1649
I0714 15:00:19.540372    7749 request.go:891] Response Body: {"kind":"APIGroupList","groups":[{"name":"apps","versions":[{"groupVersion":"apps/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"apps/v1alpha1","version":"v1alpha1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"autoscaling","versions":[{"groupVersion":"autoscaling/v1","version":"v1"}],"preferredVersion":{"groupVersion":"autoscaling/v1","version":"v1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"batch","versions":[{"groupVersion":"batch/v1","version":"v1"},{"groupVersion":"batch/v2alpha1","version":"v2alpha1"}],"preferredVersion":{"groupVersion":"batch/v1","version":"v1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"extensions","versions":[{"groupVersion":"extensions/v1beta1","version":"v1beta1"}],"preferredVersion":{"groupVersion":"extensions/v1beta1","version":"v1beta1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"policy","versions":[{"groupVersion":"policy/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"policy/v1alpha1","version":"v1alpha1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"rbac.authorization.k8s.io","versions":[{"groupVersion":"rbac.authorization.k8s.io/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"rbac.authorization.k8s.io/v1alpha1","version":"v1alpha1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]}]}
I0714 15:00:19.541964    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/api
I0714 15:00:19.612162    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/api 200 OK in 70 milliseconds
I0714 15:00:19.612176    7749 round_trippers.go:324] Response Headers:
I0714 15:00:19.612180    7749 round_trippers.go:327]     Content-Length: 135
I0714 15:00:19.612183    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:19.612187    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:17 GMT
I0714 15:00:19.612213    7749 request.go:891] Response Body: {"kind":"APIVersions","versions":["v1"],"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]}
I0714 15:00:19.612330    7749 round_trippers.go:299] curl -k -v -XGET  -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" -H "Accept: application/json, */*" https://api.internal.prod.kubernetes.local/apis
I0714 15:00:19.682754    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis 200 OK in 70 milliseconds
I0714 15:00:19.682773    7749 round_trippers.go:324] Response Headers:
I0714 15:00:19.682780    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:19.682786    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:17 GMT
I0714 15:00:19.682791    7749 round_trippers.go:327]     Content-Length: 1649
I0714 15:00:19.682827    7749 request.go:891] Response Body: {"kind":"APIGroupList","groups":[{"name":"apps","versions":[{"groupVersion":"apps/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"apps/v1alpha1","version":"v1alpha1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"autoscaling","versions":[{"groupVersion":"autoscaling/v1","version":"v1"}],"preferredVersion":{"groupVersion":"autoscaling/v1","version":"v1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"batch","versions":[{"groupVersion":"batch/v1","version":"v1"},{"groupVersion":"batch/v2alpha1","version":"v2alpha1"}],"preferredVersion":{"groupVersion":"batch/v1","version":"v1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"extensions","versions":[{"groupVersion":"extensions/v1beta1","version":"v1beta1"}],"preferredVersion":{"groupVersion":"extensions/v1beta1","version":"v1beta1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"policy","versions":[{"groupVersion":"policy/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"policy/v1alpha1","version":"v1alpha1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]},{"name":"rbac.authorization.k8s.io","versions":[{"groupVersion":"rbac.authorization.k8s.io/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"rbac.authorization.k8s.io/v1alpha1","version":"v1alpha1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"172.30.91.140:443"}]}]}
I0714 15:00:19.683728    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3
I0714 15:00:19.754895    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3 200 OK in 71 milliseconds
I0714 15:00:19.754914    7749 round_trippers.go:324] Response Headers:
I0714 15:00:19.754921    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:19.754926    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:17 GMT
I0714 15:00:19.754933    7749 round_trippers.go:327]     Content-Length: 1448
I0714 15:00:19.754965    7749 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"<image name>","command":["/bin/bash","-c"],"args":["echo '<ip> <hostname>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
I0714 15:00:20.755424    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3
I0714 15:00:20.827376    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3 200 OK in 71 milliseconds
I0714 15:00:20.827392    7749 round_trippers.go:324] Response Headers:
I0714 15:00:20.827396    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:20.827399    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:18 GMT
I0714 15:00:20.827402    7749 round_trippers.go:327]     Content-Length: 1448
I0714 15:00:20.827441    7749 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"company/sproket_maker","command":["/bin/bash","-c"],"args":["echo '<ip> <name>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
I0714 15:00:20.829210    7749 request.go:558] Request Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"company/sproket_maker","command":["/bin/bash","-c"],"args":["echo '<ip> <name>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
I0714 15:00:20.829248    7749 round_trippers.go:299] curl -k -v -XPUT  -H "Accept: application/json, */*" -H "Content-Type: application/json" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3
I0714 15:00:20.900850    7749 round_trippers.go:318] PUT https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3 200 OK in 71 milliseconds
I0714 15:00:20.900874    7749 round_trippers.go:324] Response Headers:
I0714 15:00:20.900877    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:20.900880    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:18 GMT
I0714 15:00:20.900884    7749 round_trippers.go:327]     Content-Length: 1448
I0714 15:00:20.900927    7749 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"company/sproket_maker","command":["/bin/bash","-c"],"args":["echo '<ip> <name>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
I0714 15:00:20.901220    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3
I0714 15:00:20.971869    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3 200 OK in 70 milliseconds
I0714 15:00:20.971886    7749 round_trippers.go:324] Response Headers:
I0714 15:00:20.971889    7749 round_trippers.go:327]     Content-Length: 1448
I0714 15:00:20.971892    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:20.971897    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:18 GMT
I0714 15:00:20.971935    7749 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"company/sproket_maker","command":["/bin/bash","-c"],"args":["echo '<ip> <name>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
I0714 15:00:21.972360    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3
I0714 15:00:22.043372    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3 200 OK in 70 milliseconds
I0714 15:00:22.043389    7749 round_trippers.go:324] Response Headers:
I0714 15:00:22.043395    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:22.043398    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:19 GMT
I0714 15:00:22.043402    7749 round_trippers.go:327]     Content-Length: 1448
I0714 15:00:22.043436    7749 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"company/sproket_maker","command":["/bin/bash","-c"],"args":["echo '<ip> <name>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
I0714 15:00:22.972359    7749 round_trippers.go:299] curl -k -v -XGET  -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" -H "Accept: application/json, */*" https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3
I0714 15:00:23.043554    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3 200 OK in 71 milliseconds
I0714 15:00:23.043570    7749 round_trippers.go:324] Response Headers:
I0714 15:00:23.043574    7749 round_trippers.go:327]     Content-Length: 1448
I0714 15:00:23.043577    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:23.043580    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:20 GMT
I0714 15:00:23.043615    7749 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"company/sproket_maker","command":["/bin/bash","-c"],"args":["echo <ip> <name>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
I0714 15:00:23.972322    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3
I0714 15:00:24.043488    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3 200 OK in 71 milliseconds
I0714 15:00:24.043506    7749 round_trippers.go:324] Response Headers:
I0714 15:00:24.043512    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:21 GMT
I0714 15:00:24.043517    7749 round_trippers.go:327]     Content-Length: 1448
I0714 15:00:24.043522    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:24.043559    7749 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"company/sproket_maker","command":["/bin/bash","-c"],"args":["echo '<ip> <name>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
I0714 15:00:24.972316    7749 round_trippers.go:299] curl -k -v -XGET  -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" -H "Accept: application/json, */*" https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3
I0714 15:00:25.043459    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3 200 OK in 71 milliseconds
I0714 15:00:25.043480    7749 round_trippers.go:324] Response Headers:
I0714 15:00:25.043483    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:25.043487    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:22 GMT
I0714 15:00:25.043489    7749 round_trippers.go:327]     Content-Length: 1448
I0714 15:00:25.043525    7749 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"company/sproket_maker","command":["/bin/bash","-c"],"args":["echo '<ip> <name>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
I0714 15:00:25.972362    7749 round_trippers.go:299] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.3.0 (linux/amd64) kubernetes/2831379" https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3
I0714 15:00:26.049541    7749 round_trippers.go:318] GET https://api.internal.prod.kubernetes.local/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3 200 OK in 77 milliseconds
I0714 15:00:26.049561    7749 round_trippers.go:324] Response Headers:
I0714 15:00:26.049568    7749 round_trippers.go:327]     Content-Type: application/json
I0714 15:00:26.049573    7749 round_trippers.go:327]     Date: Thu, 14 Jul 2016 22:00:23 GMT
I0714 15:00:26.049579    7749 round_trippers.go:327]     Content-Length: 1448
I0714 15:00:26.049615    7749 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"sproket_maker-3","namespace":"datascience","selfLink":"/apis/batch/v1/namespaces/datascience/jobs/sproket_maker-3","uid":"9c8a4148-49a2-11e6-a9db-1284d5eeaed3","resourceVersion":"1345707","creationTimestamp":"2016-07-14T09:09:04Z","labels":{"app":"sproket_maker-3"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"app":"sproket_maker-3"}},"manualSelector":true,"template":{"metadata":{"name":"sproket_maker-3","creationTimestamp":null,"labels":{"app":"sproket_maker-3"}},"spec":{"containers":[{"name":"sproket_maker-3","image":"company/sproket_maker","command":["/bin/bash","-c"],"args":["echo '<ip> <name>' \u003e\u003e /etc/hosts \u0026\u0026 start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 3 --limit 20000"],"ports":[{"containerPort":80,"protocol":"TCP"}],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","nodeSelector":{"nodetype":"datascience-scheduled"},"securityContext":{},"imagePullSecrets":[{"name":"dockerhub"}]}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-14T09:09:09Z","lastTransitionTime":"2016-07-14T09:09:09Z"}],"startTime":"2016-07-14T09:09:04Z","completionTime":"2016-07-14T09:09:09Z","active":1,"succeeded":1}}
dstynchula changed the title from "Kubectl delete job hangs indefinitely, preventing new jobs from being run" to "Kubectl delete job hangs for ten minutes, preventing new jobs from being run" on Jul 14, 2016
dstynchula changed the title to "'kubectl delete job <jobname>' hangs for ten minutes, then times out" on Jul 15, 2016
soltysh self-assigned this on Jul 15, 2016

soltysh commented Jul 15, 2016

I've edited your post to fix the markdown formatting so the code blocks are more readable.


soltysh commented Jul 15, 2016

Here's what's happening in your case. Your job is in an unexpected state: its spec has parallelism:0 and completions:1, but at the same time its status reports active:1 and succeeded:1. When deleting a job, we first scale it down to parallelism:0 and then wait for this condition. In your case we end up checking this line specifically, and plugging in the values above gives 1 - 1 - 1 = -1, which never satisfies the condition, so we keep re-checking it over and over until you hit the timeout. To help me identify the problem, can you provide a reproducer of the state I described at the beginning, in other words: parallelism:0, completions:1, active:1, succeeded:1?
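
For reference, here is a simplified paraphrase (not the verbatim source) of the 1.3-era check kubectl polls on while deleting a job; the client and wait-loop plumbing is omitted:

```go
package main

import "fmt"

// jobHasDesiredParallelism mirrors the shape of the check described above:
// the job is considered scaled once active matches the desired parallelism,
// or once completions - active - succeeded reaches zero.
func jobHasDesiredParallelism(parallelism, completions, active, succeeded int32) bool {
	if active == parallelism {
		return true
	}
	// With the stuck state from this issue (parallelism:0, completions:1,
	// active:1, succeeded:1) this is 1 - 1 - 1 = -1, never 0, so the
	// caller keeps polling until its timeout fires.
	return completions-active-succeeded == 0
}

func main() {
	fmt.Println(jobHasDesiredParallelism(0, 1, 1, 1)) // false, forever
}
```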

dstynchula (Author) commented

Hi Maciej,

Thank you for taking a look into this! I apologize for the formatting. I'll take your feedback as an invitation to strengthen my grasp of markdown. :)
I'm not sure what in the mix of things I'm doing is generating that condition.
Here is the YAML I'm using to create the job:

apiVersion: extensions/v1beta1
kind: Job
metadata:
  name: sproket-maker-0
spec:
  selector:
    matchLabels:
      app: sproket-maker-0
  template:
    metadata:
      name: sproket-maker-0
      labels:
        app: sproket-maker-0
    spec:
      imagePullSecrets:
      - name: "dockerhub"
      containers:
      - name: sproket-maker-0
        image: "company/sproket-maker"
        command: ["/bin/bash", "-c"]
        args: ["echo '<ip_redacted> <name_redacted>' >> /etc/hosts && start.sh python /var/www/sites/cron/company-ml-master/sproket_maker.py --shards 10 --shard-index 0 --limit 20000"]
        ports:
        - containerPort: 80
        imagePullPolicy: Always
      restartPolicy: Never
      nodeSelector:
        nodetype: memory_optimized

I have a cron task that runs:

kubectl delete -f <job_yaml>
kubectl create -f <job_yaml>

Previously, this would remove the old job and replace it with a new, identical one.
The python process inside these containers normally runs for 5-10 minutes then terminates.
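
One hedged way to keep that cron step from stalling while this bug is live (my own sketch, not from the issue): bound the delete with a timeout so a hung kubectl doesn't block the whole run. `timeout` here is from GNU coreutils.

```sh
#!/bin/sh
# Bound the delete; if kubectl hangs on a stuck job, give up after 60s.
timeout 60 kubectl delete -f job.yaml \
  || echo "delete timed out or failed; job may be stuck" >&2
kubectl create -f job.yaml
```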

If there is any other information I can provide, I'm happy to dig up whatever you need.

Thank you,

Dan

dstynchula (Author) commented

I'm using an HA master configuration (kops/AWS), in case that's relevant.


soltysh commented Jul 18, 2016

Thanks, I'll try to investigate what might be causing this.


dstynchula commented Jul 18, 2016

I was working with @justinsb from the kubernetes-users Slack channel. He had a couple of ideas and said it would be OK if I posted them here:


justinsb [9:43 AM] 
My theory is that the job controller isn't 100% correct, HA is exposing it, you've got a split view of jobs & pods

(me)slightlycleverprimate [9:43 AM]  
that sounds in line with the behavior.

justinsb [9:45 AM]  
Yes.  I think that the == test in JobHasDesiredParallelism is too strict (that issue that @soltysh pointed out), but my guess is that the core problem is an issue in the job controller

[9:46]  
Restarting KCM should hopefully get you back to the initial behaviour

[9:46]  
And if not .... more data

[9:48]  
I also think it's odd that active==1; if it continues to happen it is worth checking the status of all the pods

[9:48]  
(because I think the pod has stopped, so active should be 0?)

[9:50]  
If it happens again, if you send me the status of all the (job pods), in particular their Status.Phase and DeletionTimestamp

[9:52]  
Aha ... I think that is it...

[9:52]  

Jul 14 21:07:50 ip-172-30-49-0 kubelet[824]: I0714 21:07:50.647594     824 kubelet.go:2561] SyncLoop (PLEG): "brisque-audit-0-mazs3_datascience(efdcd2a2-4a05-11e6-94b9-0eda3f8ca9bd)", event: &pleg.PodLifecycleEvent{ID:"efdcd2a2-4a05-11e6-94b9-0eda3f8ca9bd", Type:"ContainerDied", Data:"7395e2b214e5425c7eeb78022ef6f0394c7ac0696a2e43645d08d09a35a76534"}
Jul 14 23:18:56 ip-172-30-49-0 kubelet[824]: I0714 23:18:56.370395     824 kubelet.go:2541] SyncLoop (UPDATE, "api"): "brisque-audit-0-mazs3_datascience(efdcd2a2-4a05-11e6-94b9-0eda3f8ca9bd):DeletionTimestamp=2016-07-14T23:18:56Z"

[9:52]  
It took 11 minutes to set the DeletionTimestamp (I think that is what it is saying)

[9:53]  
I bet the jobs are in Phase=PodSucceeded but DeletionTimestamp=nil

[9:53]  
That is considered active

[9:53]  
But also success (!)
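
To gather what justinsb asked for, something like the following should work (a hedged one-liner; the label selector is assumed from the job spec above):

```sh
# Print each job pod's name, Status.Phase, and DeletionTimestamp.
kubectl get pods -l app=sproket_maker-3 -o \
  jsonpath='{range .items[*]}{.metadata.name} {.status.phase} {.metadata.deletionTimestamp}{"\n"}{end}'
```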

dstynchula (Author) commented

Restarting the controller had no apparent effect on the bug.

dstynchula (Author) commented

Closed by accident. :)


foxish commented Jul 21, 2016

Unable to reproduce this with the 1.3.0/1.3.0 combination and a very similar job.
Are there other conditions I might be missing? @dstynchula, are you able to reproduce it reliably?


bfallik commented Jul 21, 2016

@foxish I don't know about reproducibility, but I have a live VM that's exhibiting this, so if you want me to poke at the state I can.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.0+2831379", GitCommit:"283137936a498aed572ee22af6774b6fb6e9fd94", GitTreeState:"not a git tree", BuildDate:"2016-07-05T15:40:23Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.0+coreos.1", GitCommit:"83e9c91279813860f241b68d076d58f9c5871357", GitTreeState:"clean", BuildDate:"2016-07-06T20:04:26Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}


bfallik commented Jul 21, 2016

FWIW in my case the job is stuck with exactly the same state that @soltysh described. Here's the snippet from the kubectl log:

I0721 19:31:15.472801   57675 round_trippers.go:318] GET https://172.17.4.99:443/apis/batch/v1/namespaces/default/jobs/mockworker 200 OK in 2 milliseconds
I0721 19:31:15.472821   57675 round_trippers.go:324] Response Headers:
I0721 19:31:15.472827   57675 round_trippers.go:327]     Date: Thu, 21 Jul 2016 23:31:15 GMT
I0721 19:31:15.472833   57675 round_trippers.go:327]     Content-Length: 1137
I0721 19:31:15.472838   57675 round_trippers.go:327]     Content-Type: application/json
I0721 19:31:15.472881   57675 request.go:891] Response Body: {"kind":"Job","apiVersion":"batch/v1","metadata":{"name":"mockworker","namespace":"default","selfLink":"/apis/batch/v1/namespaces/default/jobs/mockworker","uid":"44d5ce4c-4c71-11e6-8cd8-080027eaf841","resourceVersion":"15447","creationTimestamp":"2016-07-17T22:53:25Z"},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"controller-uid":"44d5ce4c-4c71-11e6-8cd8-080027eaf841"}},"template":{"metadata":{"name":"mockworker","creationTimestamp":null,"labels":{"controller-uid":"44d5ce4c-4c71-11e6-8cd8-080027eaf841","job-name":"mockworker"}},"spec":{"containers":[{"name":"mockworker","image":"busybox","command":["sh","-c","for i in $(seq 20); do echo ${i}; sleep 1; done"],"resources":{},"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"Always"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","securityContext":{}}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-17T22:53:47Z","lastTransitionTime":"2016-07-17T22:53:47Z"}],"startTime":"2016-07-17T22:53:25Z","completionTime":"2016-07-17T22:53:47Z","active":1,"succeeded":1}}

parallelism:0, completions:1, active:1, succeeded:1
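
A quick way to confirm a job is wedged in this state (illustrative; jsonpath fields follow the Job API):

```sh
# Prints "<parallelism> <active> <succeeded>"; "0 1 1" matches the bug.
kubectl get job mockworker \
  -o jsonpath='{.spec.parallelism} {.status.active} {.status.succeeded}'
```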

k8s-github-robot added the area/kubectl and sig/api-machinery labels on Aug 3, 2016

joshk0 commented Aug 12, 2016

I have a stuck job too. I'll leave it at that until I'm asked for more details; I don't want to clog up the thread. You can also reach me directly on the Kubernetes Slack as joshk.


soltysh commented Aug 12, 2016

@joshk0 can you check your job against Brian's previous comment and confirm whether you're hitting exactly the same problem? (I didn't find you on Slack.)


joshk0 commented Aug 16, 2016

Yes, it also had "active": 1 despite having succeeded.

Unfortunately the engineer who noticed this in our team already discovered how to delete it: Using the REST API directly. So, it is now gone.

dstynchula (Author) commented

@joshk0: I tried the API and failed to delete my jobs. What did you send to the API to remove them?


asonawalla commented

@dstynchula after authenticating, we sent a DELETE to /apis/batch/v1/namespaces/{namespace}/jobs/{job-name} for each of the jobs that had missing active pods.
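
For concreteness, that call looks something like this (an illustrative sketch; the server address and bearer token are placeholders, and authentication will vary by cluster):

```sh
# Delete a stuck job via the REST API, bypassing kubectl's client-side reaper.
curl -k -X DELETE \
  -H "Authorization: Bearer $TOKEN" \
  "https://<apiserver>/apis/batch/v1/namespaces/<namespace>/jobs/<job-name>"
```

Note that on 1.3 this removes the job object itself but can leave its pods behind, since cascading deletion was still client-side at the time.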


soltysh commented Aug 31, 2016

Sorry guys, I was out for the last two weeks taking some time off. I'm hoping to pick this up once I get through the tons of email in my inbox.


ye commented Sep 20, 2016

I have been encountering the same issue. It's probably because k8s is trying to delete the resources created by the Job you are deleting, and using the --cascade=false flag helps; see #8598 for more details.

$ kubectl describe job deployer
Name:       deployer
Namespace:  default
Image(s):   gcr.io/maven-clinic-image-builder/deployer:31e65ed9452c5047c9e79fbf8e1ee0e2f046d3ea,gcr.io/maven-clinic-image-builder/api:31e65ed9452c5047c9e79fbf8e1ee0e2f046d3ea
Selector:   controller-uid=7ad4f8ea-7eac-11e6-8f5a-42010af001cc
Parallelism:    0
Completions:    1
Start Time: Mon, 19 Sep 2016 17:03:14 -0400
Labels:     controller-uid=7ad4f8ea-7eac-11e6-8f5a-42010af001cc
        job-name=deployer
Pods Statuses:  0 Running / 0 Succeeded / 156 Failed
No volumes.
Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  22h       9m      893 {job-controller }           Normal      SuccessfulCreate    (events with common reason combined)
  8m        8m      1   {job-controller }           Normal      SuccessfulDelete    Deleted pod: deployer-23j9s
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6", GitCommit:"ae4550cc9c89a593bcda6678df201db1b208133b", GitTreeState:"clean", BuildDate:"2016-08-26T18:13:23Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6", GitCommit:"ae4550cc9c89a593bcda6678df201db1b208133b", GitTreeState:"clean", BuildDate:"2016-08-26T18:06:06Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl delete job deployer --cascade=false
job "deployer" deleted


soltysh commented Sep 21, 2016

@ye right, but I'd still like to fix the root cause of the problem.


ye commented Sep 21, 2016

@soltysh 👍 @foxish FYI, to reproduce the issue, I believe you can intentionally create a job with at least one container whose command fails, set restartPolicy: OnFailure, and let the job sit there for a day or two so that a ton of failed pods pile up. Then try to delete the job and you will hit the aforementioned hang. Turning the verbosity up to 9 or above (--v=9), you will see that k8s is actually trying to kill off all of those failed pods in a cascading fashion. Although --cascade=false will immediately complete the deletion, it probably leaves a ton of orphaned resources that are never garbage collected.


foxish commented Oct 12, 2016

@soltysh Able to reproduce this in about 20 minutes with the following job definition.

apiVersion: batch/v1
kind: Job
metadata:
  name: fail
spec:
  template:
    metadata:
      name: fail
    spec:
      containers:
      - name: fail
        image: alpine
        command: ["fail"]
      restartPolicy: Never
Two observations:

- Despite [Controller Manager] Fix job controller hot loop #32785, the pod creates are being retried every second or so, not backing off exponentially.
- The pods are never cleaned up from the apiserver after they fail and "complete" until kubectl delete is invoked. (Is this by design?)


soltysh commented Oct 15, 2016

Despite #32785, the pod creates are being retried every 1 second or so, and not backing off exponentially.

The problem is not with the controller, but with the deletion logic.

The pods are never cleaned up from the apiserver after they fail and "complete" till kubectl delete is invoked. (is this by design?)

Yes, there's no job pruning implemented.

@erictune @bgrant0607 I'm currently implementing job pruning downstream in openshift/origin#11345. Would it make sense to have some pruning mechanism in k8s? Or should we rather extend the garbage collection? If I understand correctly, it currently handles only pods, am I right?

foxish added the area/controller-manager and team/control-plane labels and removed the area/kubectl and sig/api-machinery labels on Oct 15, 2016
bgrant0607 (Member) commented

cc @caesarxuchao @lavalamp

bgrant0607 (Member) commented

@soltysh Job keeps around all dead pods?

Job needs to keep the successful pods. It doesn't strictly need to keep around failed ones, though keeping them can be useful for debugging, especially if the logs aren't pushed. I think a reasonable starting point would be to keep the most recent failed pod for any given completion. I wouldn't add a configurable policy at this point.

You'll also want to set ownerReferences on the pods so that server-side cascading deletion will kick in when the job is deleted.
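
For illustration, the shape such an ownerReference would take on a job-created pod (values here are placeholders, not from this issue):

```yaml
metadata:
  ownerReferences:
  - apiVersion: batch/v1
    kind: Job
    name: <job-name>   # owning job
    uid: <job-uid>     # owning job's UID
    controller: true
```

With that set, deleting the job lets the server-side garbage collector cascade the deletion to the pods, instead of relying on kubectl's client-side reaper.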


soltysh commented Oct 20, 2016

@soltysh Job keeps around all dead pods?

Yes, which is quite painful in case of ScheduledJobs firing every minute or so.

All of your suggestions sound OK. I'll add them to the ScheduledJob feature so this is covered before migrating SJ to beta.

caesarxuchao (Member) commented

User doc on server-side garbage collection: http://kubernetes.io/docs/user-guide/garbage-collection/

@soltysh please loop me in if you want to add ownerReferences for pods created by jobs.

k8s-github-robot commented

@dstynchula There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here.

k8s-github-robot added the needs-sig label on May 31, 2017
0xmichalis (Contributor) commented

/sig cli

k8s-ci-robot added the sig/cli label on Jun 23, 2017
k8s-github-robot removed the needs-sig label on Jun 23, 2017

soltysh commented Jul 31, 2017

I think with server-side deletion this can be closed now. If you disagree, please feel free to reopen.

soltysh closed this as completed on Jul 31, 2017