e2e flake: Kubelet failed in scalability test #21866

Closed
wojtek-t opened this issue Feb 24, 2016 · 15 comments
Comments

@wojtek-t (Member)

One of our scalability test runs failed with:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:84 Feb 23 13:57:52.186: All nodes should be ready after test, Not ready nodes: [{{ } {e2e-scalability-minion-9rwz /api/v1/nodes/e2e-scalability-minion-9rwz 03080b91-da72-11e5-8b5f-42010af00002 64561 0 2016-02-23 13:11:31 -0800 PST map[kubernetes.io/hostname:e2e-scalability-minion-9rwz master: beta.kubernetes.io/instance-type:n1-standard-2 failure-domain.beta.kubernetes.io/region:us-central1 failure-domain.beta.kubernetes.io/zone:us-central1-f] map[]} {10.245.65.0/24 2699012913551505030 gce://kubernetes-jenkins/us-central1-f/e2e-scalability-minion-9rwz false} {map[cpu:{2.000 DecimalSI} memory:{7864139776.000 BinarySI} pods:{110.000 DecimalSI}] map[cpu:{2.000 DecimalSI} memory:{7864139776.000 BinarySI} pods:{110.000 DecimalSI}] [{OutOfDisk Unknown 2016-02-23 13:47:13 -0800 PST 2016-02-23 13:47:58 -0800 PST NodeStatusUnknown Kubelet stopped posting node status.} {Ready Unknown 2016-02-23 13:47:13 -0800 PST 2016-02-23 13:47:58 -0800 PST NodeStatusUnknown Kubelet stopped posting node status.}] [{InternalIP 10.240.0.72} {ExternalIP 104.154.18.94}] {{10250}} { A9FE5A87-A63A-D4D8-F8D9-1C5061866D89 ef27c059-e5ef-4dd0-9405-13c60d31c654 3.16.0-4-amd64 Debian GNU/Linux 7 (wheezy) docker://1.9.1 v1.2.0-alpha.8.221+2524249a415fb3 v1.2.0-alpha.8.221+2524249a415fb3} [{[gcr.io/google_containers/kube-proxy:4c6e67dae2145726ff3cae91de4a2960] 165640277} {[gcr.io/google_containers/fluentd-elasticsearch:1.14] 562034622} {[gcr.io/google_containers/pause:2.0] 350164} {[gcr.io/google_containers/pause:0.8.0] 241656} {[gcr.io/google_containers/serve_hostname:1.1] 4522409}]}}]

I took a quick look at it and it seems that something strange happened to the Kubelet.
See the end of this file:
https://00e9e64bacf1da6c6807cb09861e2f84245c7b68119c8eddd5-apidata.googleusercontent.com/download/storage/v1_internal/b/kubernetes-jenkins/o/logs%2Fkubernetes-e2e-gce-scalability%2F4805%2Fartifacts%2F104.154.18.94:22-kubelet.log?qk=AD5uMEuGCECRLVur5CciL5wz4yH6zwU4Hpby0bxZyymy7GwmBVd3OokDf1mxIgV7ufBpaDXbvvKChJ5qfzehFhjPed4upFRssuBNJHSTO6BWDIIUFvwo9uwf1qHWJ_ZKNXiSV4s6wkJBsQEZij6jCq17FPvUfEp-2O5C2TWrAzjaig80ZvsWwJFigkjNgEfACn63STUAqAIg5pk32xqydV6Y0E0y7OAruTblsA-WijF8BwDEXlgfkvF9ZIOuO00honN7BmlUPGB-6Z_WstGfZZpxIivnk5D4nLsQkuwOqp7uLpiEXY25-GKHY3Lt6Yl5XoBbbs5WE7gnbo1kTqmOaiaQHOG7bN4MHb_vYbdRsgzl1M-ZfP155VVEeEFJtGYZrcGcwwWHCli7MmsGjA3FOUwM8Scd-Xl1CdF2jLGVYotjp_ObbdZ-7ReroIK706omRyV0QeccWAYLfPerVS8JJytotb-YupFNaufDr1flKgqKsCl2ta-0TQYjWV8vosWcyrifSaGq27yY6y3oMsBv33eOx2ITW1cZDrzCvW3pzqFKJlHsnWf_PumJ1YkfTlXBlBMj8XfdF6at1Igcha0BxjyUaDVhKYaBaYZalJrFPZHS_VRRNCuk3q_IKshqj8RB_7cogFKfssEdEfI_yYL3Y8k85cvK6kVpZkZiVEuQa3Uwwgf6A7yEI-vQLujpcgnRisvdYJ_6Zu3UR-wepcnaXv13QN1Sd6D3RBE5ih4l-xw0fIdKCAUVZc46k-PBSitofoW_tqZgn4AskTd6Z__W1v2OlC8-7PfXNLWFvBuwYihFRo_l38GdqPFRoSBgo0HjZ9tZgxH7xsv_DsZw-tlGbZcvxoKbbK7tSDUyt_takszDwwaJzpD67eSM0IuJxcknOVmWqHty9T3EkbpYtX_JBTQV6eUL-7u1W7jWhvCwYNsluqxM0uPSYvg

For completeness - the failed run is this one:
https://storage.googleapis.com/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/4805/build-log.txt
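
For anyone skimming the dump above: the node is reported as not ready because its Ready condition is Unknown (reason NodeStatusUnknown, i.e. the Kubelet stopped posting node status). This isn't the e2e framework's code, just a minimal standalone sketch of that readiness check over the conditions shown above:

```go
package main

import "fmt"

// NodeCondition mirrors the shape of the conditions in the node dump above:
// a type, a status of "True"/"False"/"Unknown", and a reason.
type NodeCondition struct {
	Type   string
	Status string
	Reason string
}

// isNodeReady reports whether the Ready condition is present and True.
// In the dump above, Ready is Unknown with reason NodeStatusUnknown because
// the kubelet stopped posting node status, so the node counts as not ready.
func isNodeReady(conds []NodeCondition) bool {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	conds := []NodeCondition{
		{Type: "OutOfDisk", Status: "Unknown", Reason: "NodeStatusUnknown"},
		{Type: "Ready", Status: "Unknown", Reason: "NodeStatusUnknown"},
	}
	fmt.Println("node ready:", isNodeReady(conds)) // prints: node ready: false
}
```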

@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. kind/flake Categorizes issue or PR as related to a flaky test. labels Feb 24, 2016
@wojtek-t (Member, Author)

@dchen1107 - can you please take a look or delegate?

@adohe-zz

More complete info:

Feb 23 13:57:52.186: FAIL: All nodes should be ready after test, Not ready nodes: [{{ } {e2e-scalability-minion-9rwz   /api/v1/nodes/e2e-scalability-minion-9rwz 03080b91-da72-11e5-8b5f-42010af00002 64561 0 2016-02-23 13:11:31 -0800 PST <nil> <nil> map[kubernetes.io/hostname:e2e-scalability-minion-9rwz master: beta.kubernetes.io/instance-type:n1-standard-2 failure-domain.beta.kubernetes.io/region:us-central1 failure-domain.beta.kubernetes.io/zone:us-central1-f] map[]} {10.245.65.0/24 2699012913551505030 gce://kubernetes-jenkins/us-central1-f/e2e-scalability-minion-9rwz false} {map[cpu:{2.000 DecimalSI} memory:{7864139776.000 BinarySI} pods:{110.000 DecimalSI}] map[cpu:{2.000 DecimalSI} memory:{7864139776.000 BinarySI} pods:{110.000 DecimalSI}]  [{OutOfDisk Unknown 2016-02-23 13:47:13 -0800 PST 2016-02-23 13:47:58 -0800 PST NodeStatusUnknown Kubelet stopped posting node status.} {Ready Unknown 2016-02-23 13:47:13 -0800 PST 2016-02-23 13:47:58 -0800 PST NodeStatusUnknown Kubelet stopped posting node status.}] [{InternalIP 10.240.0.72} {ExternalIP 104.154.18.94}] {{10250}} { A9FE5A87-A63A-D4D8-F8D9-1C5061866D89 ef27c059-e5ef-4dd0-9405-13c60d31c654 3.16.0-4-amd64 Debian GNU/Linux 7 (wheezy) docker://1.9.1 v1.2.0-alpha.8.221+2524249a415fb3 v1.2.0-alpha.8.221+2524249a415fb3} [{[gcr.io/google_containers/kube-proxy:4c6e67dae2145726ff3cae91de4a2960] 165640277} {[gcr.io/google_containers/fluentd-elasticsearch:1.14] 562034622} {[gcr.io/google_containers/pause:2.0] 350164} {[gcr.io/google_containers/pause:0.8.0] 241656} {[gcr.io/google_containers/serve_hostname:1.1] 4522409}]}}]
STEP: Destroying namespace "e2e-tests-density-1c81d" for this suite.

• Failure in Spec Teardown (AfterEach) [896.986 seconds]
Density
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:482
  [Feature:Performance] should allow starting 30 pods per node [AfterEach]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/density.go:480

  Feb 23 13:57:52.186: All nodes should be ready after test, Not ready nodes: [{{ } {e2e-scalability-minion-9rwz   /api/v1/nodes/e2e-scalability-minion-9rwz 03080b91-da72-11e5-8b5f-42010af00002 64561 0 2016-02-23 13:11:31 -0800 PST <nil> <nil> map[kubernetes.io/hostname:e2e-scalability-minion-9rwz master: beta.kubernetes.io/instance-type:n1-standard-2 failure-domain.beta.kubernetes.io/region:us-central1 failure-domain.beta.kubernetes.io/zone:us-central1-f] map[]} {10.245.65.0/24 2699012913551505030 gce://kubernetes-jenkins/us-central1-f/e2e-scalability-minion-9rwz false} {map[cpu:{2.000 DecimalSI} memory:{7864139776.000 BinarySI} pods:{110.000 DecimalSI}] map[cpu:{2.000 DecimalSI} memory:{7864139776.000 BinarySI} pods:{110.000 DecimalSI}]  [{OutOfDisk Unknown 2016-02-23 13:47:13 -0800 PST 2016-02-23 13:47:58 -0800 PST NodeStatusUnknown Kubelet stopped posting node status.} {Ready Unknown 2016-02-23 13:47:13 -0800 PST 2016-02-23 13:47:58 -0800 PST NodeStatusUnknown Kubelet stopped posting node status.}] [{InternalIP 10.240.0.72} {ExternalIP 104.154.18.94}] {{10250}} { A9FE5A87-A63A-D4D8-F8D9-1C5061866D89 ef27c059-e5ef-4dd0-9405-13c60d31c654 3.16.0-4-amd64 Debian GNU/Linux 7 (wheezy) docker://1.9.1 v1.2.0-alpha.8.221+2524249a415fb3 v1.2.0-alpha.8.221+2524249a415fb3} [{[gcr.io/google_containers/kube-proxy:4c6e67dae2145726ff3cae91de4a2960] 165640277} {[gcr.io/google_containers/fluentd-elasticsearch:1.14] 562034622} {[gcr.io/google_containers/pause:2.0] 350164} {[gcr.io/google_containers/pause:0.8.0] 241656} {[gcr.io/google_containers/serve_hostname:1.1] 4522409}]}}]

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:223

@yujuhong (Contributor)

To save time for others, here are the relevant logs:
https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-e2e-gce-scalability/4805/artifacts/104.154.18.94%3A22-docker.log.log
https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-e2e-gce-scalability/4805/artifacts/104.154.18.94%3A22-kern.log.log
https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-e2e-gce-scalability/4805/artifacts/104.154.18.94%3A22-kube-proxy.log
https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-e2e-gce-scalability/4805/artifacts/104.154.18.94%3A22-kubelet.log
https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-e2e-gce-scalability/4805/artifacts/104.154.18.94%3A22-supervisord-kubelet-stderr.log
https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-e2e-gce-scalability/4805/artifacts/104.154.18.94%3A22-supervisord-kubelet-stdout.log
https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-e2e-gce-scalability/4805/artifacts/104.154.18.94%3A22-supervisord.log

Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.092043] INFO: task kworker/u4:1:40 blocked for more than 120 seconds.
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.099166]       Tainted: G         C    3.16.0-4-amd64 #1
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.104861] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.112822] kworker/u4:1    D ffff880214c53078     0    40      2 0x00000000
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.123175] Workqueue: netns cleanup_net
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.130270]  ffff880214c52c20 0000000000000046 0000000000012f00 ffff880213e97fd8
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.145364]  0000000000012f00 ffff880214c52c20 ffffffff818bbe20 ffff880213e97d78
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.160449]  ffff88020d3aa9a0 ffff880214c52c20 ffff880213e97fd8 ffffffff818bbe40
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.175460] Call Trace:
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.179449]  [<ffffffff81510915>] ? schedule_preempt_disabled+0x25/0x70
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.187593]  [<ffffffff815124a5>] ? __mutex_lock_slowpath+0x1b5/0x1c0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.195528]  [<ffffffff815124cb>] ? mutex_lock+0x1b/0x2a
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.202369]  [<ffffffff8141cd0b>] ? default_device_exit_batch+0x6b/0x140
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.210584]  [<ffffffff810a7ba0>] ? prepare_to_wait_event+0xf0/0xf0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.218495]  [<ffffffff81417470>] ? cleanup_net+0x100/0x1e0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.225707]  [<ffffffff81081762>] ? process_one_work+0x172/0x420
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.233413]  [<ffffffff81081df3>] ? worker_thread+0x113/0x4f0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.240807]  [<ffffffff81510081>] ? __schedule+0x2b1/0x700
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.247815]  [<ffffffff81081ce0>] ? rescuer_thread+0x2d0/0x2d0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.255196]  [<ffffffff8108805d>] ? kthread+0xbd/0xe0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.261741]  [<ffffffff81087fa0>] ? kthread_create_on_node+0x180/0x180
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.269812]  [<ffffffff81513c58>] ? ret_from_fork+0x58/0x90
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.276943]  [<ffffffff81087fa0>] ? kthread_create_on_node+0x180/0x180

The symptom seems consistent with #20096 (comment)

@yujuhong (Contributor)

/cc @bprashanth @dchen1107

@bprashanth (Contributor)

This is #21085 (comment)

@bprashanth (Contributor)

More specifically, you should see:

Feb  3 15:46:06 gce-soak-weekly-minion-yho6 kernel: [108155.756089] unregister_netdevice: waiting for vethe59f135 to become free. Usage count = 1
Feb  3 15:46:14 gce-soak-weekly-minion-yho6 kernel: [108164.672055] unregister_netdevice: waiting for lo to become free. Usage count = 2
Feb  3 15:46:16 gce-soak-weekly-minion-yho6 kernel: [108165.824040] unregister_netdevice: waiting for vethe59f135 to become free. Usage count = 1
Feb  3 15:46:25 gce-soak-weekly-minion-yho6 kernel: [108174.916039] unregister_netdevice: waiting for lo to become free. Usage count = 2

in the kernel log for #20096. I think this is something @dchen1107 just wanted to document in the release notes. Whether we do should hinge on how frequently it happens; if it's a daily occurrence, we should treat it more seriously, IMO.
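
For anyone triaging similar runs, that signature is easy to scan for. A minimal sketch (the kern.log path is an assumption; point it at the kernel log artifact linked above):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Illustrative path; substitute the downloaded kernel log artifact.
	f, err := os.Open("kern.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	hits := 0
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // kernel log lines can be long
	for sc.Scan() {
		line := sc.Text()
		// The netns-leak signature: devices stuck waiting to be freed.
		if strings.Contains(line, "unregister_netdevice: waiting for") {
			hits++
			fmt.Println(line)
		}
	}
	fmt.Printf("%d unregister_netdevice lines found\n", hits)
}
```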

@yujuhong (Contributor)

This is #21085 (comment)

The differences are:

  1. There were no docker errors leading to this.
  2. The node was starting only 30 pause pods with no crashlooping containers.

More specifically, you should see:

Thanks for the clarification.

@bprashanth (Contributor)

Absence of evidence is not evidence of absence. Dawn observed this yesterday on a cluster in the wild. I can say with some confidence that it isn't that kernel bug; a hung task can of course have many causes, including another kernel bug or a buggy disk device driver (less likely with a GCE VM, of course).

Hung tasks are probably exacerbated by our sending of kill signals. We can mitigate this by increasing the hung task timeout, but there's really no way to recover without a reboot, which we don't want to do automatically (if we did, we could easily just push out #20947). I do believe there's a corresponding (non-hairpin) docker bug for this.
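
On "increasing the hung task timeout": that knob only changes when the kernel reports a blocked task; it doesn't unwedge anything. A minimal sketch of bumping it, assuming root on the node and using the sysctl file the kernel log above points at (the 600-second value is purely illustrative):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// The hung-task warning above names this knob ("echo 0 > ..." disables it).
	// Writing a larger value just makes the kernel report blocked tasks less
	// often; it does not recover the stuck kworker.
	const sysctl = "/proc/sys/kernel/hung_task_timeout_secs"
	const timeout = "600" // illustrative, not a recommendation

	if err := os.WriteFile(sysctl, []byte(timeout), 0644); err != nil {
		fmt.Fprintf(os.Stderr, "failed to write %s: %v (needs root)\n", sysctl, err)
		os.Exit(1)
	}
	fmt.Printf("kernel.hung_task_timeout_secs set to %s\n", timeout)
}
```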

@yujuhong (Contributor)

Hung tasks are probably exacerbated by our sending of kill signals. We can mitigate this by increasing the hung task timeout, but there's really no way to recover without a reboot, which we don't want to do automatically (if we did, we could easily just push out #20947). I do believe there's a corresponding (non-hairpin) docker bug for this.

Clarification:

We didn't try killing docker. Our health check (docker version) worked fine during that time. The only thing that got repeatedly killed was the kubelet, due to docker being unresponsive, as the supervisord log below shows (a small sketch for tallying the respawns follows the log).

2016-02-23 21:10:20,664 CRIT Supervisor running as root (no user in config file)
2016-02-23 21:10:20,705 INFO RPC interface 'supervisor' initialized
2016-02-23 21:10:20,706 WARN cElementTree not installed, using slower XML parser for XML-RPC
2016-02-23 21:10:20,706 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2016-02-23 21:10:20,706 INFO daemonizing the supervisord process
2016-02-23 21:10:20,707 INFO supervisord started with pid 2112
2016-02-23 21:11:30,150 WARN received SIGTERM indicating exit request
2016-02-23 21:11:35,199 CRIT Supervisor running as root (no user in config file)
2016-02-23 21:11:35,199 WARN Included extra file "/etc/supervisor/conf.d/kubelet.conf" during parsing
2016-02-23 21:11:35,200 WARN Included extra file "/etc/supervisor/conf.d/docker.conf" during parsing
2016-02-23 21:11:35,220 INFO RPC interface 'supervisor' initialized
2016-02-23 21:11:35,220 WARN cElementTree not installed, using slower XML parser for XML-RPC
2016-02-23 21:11:35,220 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2016-02-23 21:11:35,222 INFO daemonizing the supervisord process
2016-02-23 21:11:35,225 INFO supervisord started with pid 3427
2016-02-23 21:11:36,228 INFO spawned: 'kubelet' with pid 3453
2016-02-23 21:11:36,229 INFO spawned: 'docker' with pid 3454
2016-02-23 21:11:37,287 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:11:37,287 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:12:06,313 INFO exited: docker (exit status 2; expected)
2016-02-23 21:12:07,315 INFO spawned: 'docker' with pid 3762
2016-02-23 21:12:08,327 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:52:21,813 INFO exited: kubelet (exit status 2; expected)
2016-02-23 21:52:22,815 INFO spawned: 'kubelet' with pid 17256
2016-02-23 21:52:23,866 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:53:22,879 INFO exited: kubelet (exit status 2; expected)
2016-02-23 21:53:23,881 INFO spawned: 'kubelet' with pid 17349
2016-02-23 21:53:24,931 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:54:23,940 INFO exited: kubelet (exit status 2; expected)
2016-02-23 21:54:24,942 INFO spawned: 'kubelet' with pid 17442
2016-02-23 21:54:25,990 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:55:24,999 INFO exited: kubelet (exit status 2; expected)
2016-02-23 21:55:26,001 INFO spawned: 'kubelet' with pid 17563
2016-02-23 21:55:27,050 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:56:26,063 INFO exited: kubelet (exit status 2; expected)
2016-02-23 21:56:27,065 INFO spawned: 'kubelet' with pid 17681
2016-02-23 21:56:28,114 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:57:27,122 INFO exited: kubelet (exit status 2; expected)
2016-02-23 21:57:28,125 INFO spawned: 'kubelet' with pid 17822
2016-02-23 21:57:29,178 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:58:28,187 INFO exited: kubelet (exit status 2; expected)
2016-02-23 21:58:29,189 INFO spawned: 'kubelet' with pid 17952
2016-02-23 21:58:30,270 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 21:59:29,246 INFO exited: kubelet (exit status 2; expected)
2016-02-23 21:59:29,326 INFO spawned: 'kubelet' with pid 18040
2016-02-23 21:59:30,376 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 22:00:29,385 INFO exited: kubelet (exit status 2; expected)
2016-02-23 22:00:30,387 INFO spawned: 'kubelet' with pid 18143
2016-02-23 22:00:31,433 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 22:01:30,444 INFO exited: kubelet (exit status 2; expected)
2016-02-23 22:01:31,446 INFO spawned: 'kubelet' with pid 18229
2016-02-23 22:01:32,496 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 22:02:31,505 INFO exited: kubelet (exit status 2; expected)
2016-02-23 22:02:32,507 INFO spawned: 'kubelet' with pid 18396
2016-02-23 22:02:33,560 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-23 22:03:32,569 INFO exited: kubelet (exit status 2; expected)
2016-02-23 22:03:33,572 INFO spawned: 'kubelet' with pid 18482
2016-02-23 22:03:34,620 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
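
The restart pattern is also easy to tally mechanically from the log above. A minimal sketch, assuming the supervisord.log artifact linked earlier has been downloaded locally:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Illustrative path; substitute the downloaded supervisord.log artifact.
	f, err := os.Open("supervisord.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	spawns := map[string]int{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		// Each (re)start is logged as: INFO spawned: '<program>' with pid <n>
		if !strings.Contains(line, "INFO spawned: '") {
			continue
		}
		start := strings.Index(line, "'") + 1
		end := start + strings.Index(line[start:], "'")
		spawns[line[start:end]]++
	}
	for program, n := range spawns {
		fmt.Printf("%s spawned %d times\n", program, n)
	}
}
```

In the excerpt above that comes out to two docker spawns (startup plus one restart at 21:12) versus a kubelet respawn roughly every minute from 21:52 onward.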

@dchen1107 (Member)

@bprashanth is right. This is a different issue, moby/moby#9605, which has been around since the 1.5 release. It is a relatively rare case compared to the unregister_netdevice issue.

@yujuhong the reason the kubelet is killed is that the docker API request hangs, which makes syncLoop take too long and fail the kubelet health check. If we switched back to docker ps from docker version, you would see both docker and the kubelet restart. The only way to deal with this issue is to reboot the node.
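
Not the kubelet's actual health-check code, just a minimal standalone sketch of why the probe command matters here (the 10-second timeout is illustrative): a shallow call like docker version can keep answering while docker ps blocks on the wedged daemon, and it's the deadline, not the command itself, that turns the hang into a failed check.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// checkDocker runs a docker CLI command under a hard deadline. When the
// daemon is wedged the command never returns on its own, so the context
// timeout is what converts the hang into an error.
func checkDocker(timeout time.Duration, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return exec.CommandContext(ctx, "docker", args...).Run()
}

func main() {
	for _, probe := range [][]string{{"version"}, {"ps"}} {
		if err := checkDocker(10*time.Second, probe...); err != nil {
			fmt.Printf("docker %v failed: %v\n", probe, err)
			continue
		}
		fmt.Printf("docker %v ok\n", probe)
	}
}
```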

@yujuhong (Contributor)

@yujuhong the reason the kubelet is killed is that the docker API request hangs, which makes syncLoop take too long and fail the kubelet health check. If we switched back to docker ps from docker version, you would see both docker and the kubelet restart. The only way to deal with this issue is to reboot the node.

I knew that and was just telling @bprashanth that docker wasn't being killed during that time because he wrote "Hung tasks are probably exacerbated by our sending of kill signals" :)

@dchen1107 (Member)

This is not a flake; docker is completely hung due to a deadlock in the kernel:

Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.175460] Call Trace:
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.179449]  [<ffffffff81510915>] ? schedule_preempt_disabled+0x25/0x70
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.187593]  [<ffffffff815124a5>] ? __mutex_lock_slowpath+0x1b5/0x1c0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.195528]  [<ffffffff815124cb>] ? mutex_lock+0x1b/0x2a
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.202369]  [<ffffffff8141cd0b>] ? default_device_exit_batch+0x6b/0x140
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.210584]  [<ffffffff810a7ba0>] ? prepare_to_wait_event+0xf0/0xf0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.218495]  [<ffffffff81417470>] ? cleanup_net+0x100/0x1e0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.225707]  [<ffffffff81081762>] ? process_one_work+0x172/0x420
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.233413]  [<ffffffff81081df3>] ? worker_thread+0x113/0x4f0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.240807]  [<ffffffff81510081>] ? __schedule+0x2b1/0x700
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.247815]  [<ffffffff81081ce0>] ? rescuer_thread+0x2d0/0x2d0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.255196]  [<ffffffff8108805d>] ? kthread+0xbd/0xe0
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.261741]  [<ffffffff81087fa0>] ? kthread_create_on_node+0x180/0x180
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.269812]  [<ffffffff81513c58>] ? ret_from_fork+0x58/0x90
Feb 23 21:50:14 e2e-scalability-minion-9rwz kernel: [ 2400.276943]  [<ffffffff81087fa0>] ? kthread_create_on_node+0x180/0x180

We cannot do much except document it and tell the user to reboot the node, as in previous releases, since we don't have a way to gracefully drain a node yet.

@dchen1107 (Member)

#21000

@wojtek-t (Member, Author)

wojtek-t commented Mar 9, 2016

Hmm - since we can't do anything about it and it's documented, can we close this one? @dchen1107?

@dchen1107 (Member)

Close this one. After adding the workaround for hairpin mode, the docker process hang issue is rare. I am closing this issue, and we are going to document the docker issue as a known release issue.
