
Node E2E is failing on GCI #35935

Closed
Random-Liu opened this issue Oct 31, 2016 · 24 comments
Labels: area/node-e2e, priority/important-soon, sig/node, sig/storage
Milestone: v1.5

@Random-Liu (Member) commented Oct 31, 2016

All node e2e test runs have been failing on GCI since around 13:00 today. I don't know exactly what happened; here are some observations:

  • The SSH pipe breaks during the test. It seems that either the network goes down or the node reboots.
------------------------------
Write failed: Broken pipe

Failure Finished Test Suite on Host tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0
[command [ssh -i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR 104.154.160.25 -- sh -c 'cd /tmp/gcloud-e2e-1994784968 && timeout -k 30s 2700.000000s ./ginkgo --nodes=8 --skip="\[Flaky\]|\[Serial\]" ./e2e_node.test -- --logtostderr --v 4 --node-name=tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0 --report-dir=/tmp/gcloud-e2e-1994784968/results --report-prefix=gci-family-gci-dev-55-8872-18-0 --experimental-mounter-path=/tmp/gcloud-e2e-1994784968/cluster/gce/gci/mounter/mounter --experimental-mounter-rootfs-path=/media/root '] failed with error: exit status 255 and output:
Running Suite: E2eNode Suite
  • The node goes into a bad state before the pipe breaks.
    • Docker seems to hang:
FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "client-container" with RunContainerError: "runContainer: operation timeout: context deadline exceeded"
  • The GCI mounter throws an error:
FailedMount: MountVolume.SetUp failed for volume "kubernetes.io/empty-dir/9b68fb0a-9fa0-11e6-9b5b-42010a800006-restart-count" (spec.Name: "restart-count") pod "9b68fb0a-9fa0-11e6-9b5b-42010a800006" (UID: "9b68fb0a-9fa0-11e6-9b5b-42010a800006") with: mount failed: exit status 254
Mounting command: /tmp/gcloud-e2e-1994784968/cluster/gce/gci/mounter/mounter

This is blocking the submit queue now; marking this P0.

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10759
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761

@vishh @jingxu97
/cc @kubernetes/sig-node

@Random-Liu Random-Liu added the sig/node and area/node-e2e labels Oct 31, 2016
@Random-Liu Random-Liu added the priority/critical-urgent label Oct 31, 2016
@yujuhong (Contributor) commented:

By the way, the logs on the node were not collected because of the connection issues. The best way to debug is probably to run node e2e in one's own cluster.
Related issue: #35914
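A rough sketch of what a manual run looks like, based on the command the Jenkins runner issues (captured in the build logs above). This is only an illustration, not the framework's own tooling: <your-node-ip> and <your-node-name> are placeholders, and the test binaries (ginkgo, e2e_node.test) are assumed to already be staged in the working directory on the node.

# placeholders: <your-node-ip>, <your-node-name>; adjust paths to wherever you staged the binaries
ssh -i ~/.ssh/google_compute_engine -o StrictHostKeyChecking=no <your-node-ip> -- sh -c 'cd /tmp/gcloud-e2e-1994784968 && timeout -k 30s 2700s ./ginkgo --nodes=8 --skip="\[Flaky\]|\[Serial\]" ./e2e_node.test -- --logtostderr --v 4 --node-name=<your-node-name> --report-dir=/tmp/gcloud-e2e-1994784968/results'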

@dchen1107 dchen1107 added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Oct 31, 2016
@dchen1107 (Member) commented Oct 31, 2016

I haven't followed all the gci-mounter related PRs closely, but one can quickly revert #35652 in their branch and re-run node e2e to see whether the regression is caused by the GCI mounter.

EDITED by dchen1107: try to revert #35349, not #35652, in your branch.
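A minimal sketch of that check under the usual merged-PR workflow; the SHA below is a placeholder for whatever git log reports in your branch:

# find the merge commit that brought in the suspect PR, revert it, then re-run node e2e
git log --oneline --merges | grep '#35349'
git revert -m 1 <merge-commit-sha-of-35349>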

@Random-Liu (Member, Author) commented Oct 31, 2016

@yujuhong (Contributor) commented:

> However, the latest node e2e became green. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10762

Node e2e has been flaky since last week. It doesn't always fail.

@jingxu97 (Contributor) commented Nov 1, 2016

I checked five failures, and they are all different. Only one log shows the mount failure with exit code 254 (connection closed?). The other failure messages include:

  1. permission denied (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10760/build-log.txt)
  2. failed with error: exit status 255 (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761/build-log.txt)
  3. Failed to stop services: error stopping "services": wait: no child processes (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10758/build-log.txt)

So it is quite possible that the mount failed for another reason (connection closed, etc.).

@Random-Liu (Member, Author) commented Nov 1, 2016

This failure has been happening since last Friday.

There are two kinds of failures, but I assume they are caused by the same issue:

  1. Write failed: Broken pipe. The test fails before completion because the SSH pipe is broken:
------------------------------
Write failed: Broken pipe

Failure Finished Test Suite on Host tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0
[command [ssh -i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR 104.154.160.25 -- sh -c 'cd /tmp/gcloud-e2e-1994784968 && timeout -k 30s 2700.000000s ./ginkgo --nodes=8 --skip="\[Flaky\]|\[Serial\]" ./e2e_node.test -- --logtostderr --v 4 --node-name=tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0 --report-dir=/tmp/gcloud-e2e-1994784968/results --report-prefix=gci-family-gci-dev-55-8872-18-0 --experimental-mounter-path=/tmp/gcloud-e2e-1994784968/cluster/gce/gci/mounter/mounter --experimental-mounter-rootfs-path=/media/root '] failed with error: exit status 255 and output:
Running Suite: E2eNode Suite

https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10706/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10759/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10729/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10702/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10683/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10681/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10677/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10676/build-log.txt

  2. No test results at all. We only see "Copying test artifacts from xxx"; usually the test results are printed after that, but there is nothing:
I1028 05:07:50.421848   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-gci-dev-55-8872-18-0
I1028 05:07:51.444594   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-e2e-node-containervm-v20160604-image
I1028 05:07:51.738267   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-coreos-alpha-1122-0-0-v20160727
I1028 05:07:52.335168   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-e2e-node-ubuntu-trusty-docker10-v1-image
I1028 05:07:52.929704   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-e2e-node-ubuntu-trusty-docker9-v1-image
I1028 05:11:06.677431   30664 remote.go:297] Copying test artifacts from tmp-node-e2e-6ae5f238-gci-dev-55-8872-18-0
I1028 05:39:20.158098   30664 remote.go:297] Copying test artifacts from tmp-node-e2e-6ae5f238-coreos-alpha-1122-0-0-v20160727
I1028 05:39:20.684135   30664 run_remote.go:566] Deleting instance "tmp-node-e2e-6ae5f238-coreos-alpha-1122-0-0-v20160727"

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>                              START TEST                                >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Start Test Suite on Host tmp-node-e2e-6ae5f238-coreos-alpha-1122-0-0-v20160727
Running Suite: E2eNode Suite
============================
Random Seed: 1477656471 - Will randomize all specs
Will run 116 of 149 specs

I believe this is the same issue as 1). The test only ran for 4 minutes; I believe it also failed in the middle because of a broken SSH pipe.
As for why there are no test results, I think something is wrong with SSH or the network, and the scp just gets stuck forever. So at the end of the test we see the following log, because the test results from GCI are never scp'd back:

Build timed out (after 90 minutes). Marking the build as failed.
Build was aborted
Recording test results
I1028 06:31:24.967016   30664 run_remote.go:566] Deleting instance "tmp-node-e2e-6ae5f238-gci-dev-55-8872-18-0"

https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10665/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10750/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10747/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10712/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10667/build-log.txt

Both 1) and 2) are related to broken SSH or networking.
/cc @kubernetes/sig-node @kubernetes/goog-image
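The scp hang described above is what turns a node problem into a 90-minute build timeout. As an illustration only (not what the test runner actually does), wrapping the artifact copy in a timeout keeps a wedged connection from stalling the whole job; the node IP and remote results path below are placeholders patterned after the command in the logs:

# give up after 5 minutes instead of hanging until the Jenkins build timeout
mkdir -p ./results
timeout 300 scp -i /home/jenkins/.ssh/google_compute_engine -o StrictHostKeyChecking=no \
  <node-ip>:/tmp/gcloud-e2e-1994784968/results/* ./results/ \
  || echo "scp from node timed out or failed; artifacts missing"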

@Random-Liu (Member, Author) commented Nov 1, 2016

> 1. permission denied (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10760/build-log.txt)
> 2. failed with error: exit status 255 (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761/build-log.txt)
> 3. Failed to stop services: error stopping "services": wait: no child processes (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10758/build-log.txt)

1) is another failure, on Ubuntu Trusty; I'll figure it out.
2) and 3) are just this issue.

@Random-Liu (Member, Author) commented Nov 2, 2016

FYI, we disabled the GCI mounter in the CRI validation node e2e test last night, and this failure has not happened since.
https://k8s-testgrid.appspot.com/google-node#kubelet-cri-gce-e2e&width=5
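For reference, "disabling the GCI mounter" presumably amounts to dropping the two --experimental-mounter-* flags from the test invocation captured in the failure logs above, so the kubelet under test falls back to its default mounter. A sketch, with paths and node name copied from those logs (adjust for your own run):

# same invocation as in the failing runs, minus --experimental-mounter-path and --experimental-mounter-rootfs-path
./ginkgo --nodes=8 --skip="\[Flaky\]|\[Serial\]" ./e2e_node.test -- --logtostderr --v 4 --node-name=tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0 --report-dir=/tmp/gcloud-e2e-1994784968/results --report-prefix=gci-family-gci-dev-55-8872-18-0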

@dchen1107 (Member) commented:

@vishh and @jingxu97, can we have a separate Jenkins job for testing the GCI mounter and disable it in the rest of node e2e for now, like the feature/flaky test suites in cluster e2e? The flakiness of the GCI mounter causes a lot of noise and blocks our submit queue.

@vishh (Contributor) commented Nov 2, 2016

@dchen1107 there is a PR under review that reduces the scope of the GCI mounter. I'm hoping it will be done by today. That should reduce the possibility of gci-mounter-related flakes.


@rootfs (Contributor) commented Nov 2, 2016

Exit status 254 is an rkt exit code, not related to mount.
rkt/rkt#3261

@dchen1107 (Member) commented:

re: #35935 (comment)

@vishh I know there is a pending PR to reduce the scope of the GCI mounter, but I am not sure it can get in today since it is still under review. Even if that PR gets in today, I am not sure it will remove all the flakiness from our critical builds unless we move both the NFS and GlusterFS related tests to a different build, which is what I suggested above.

We ran into this kind of issue before, in the 1.4 release with MemoryEviction, and had a long discussion between you and the rest of the team. I thought we reached an agreement in the end, and you even wrote up criteria for this at:
#31676 (comment)

@vishh (Contributor) commented Nov 2, 2016

Hmm. In that case, I'd recommend merging #35724 to disable the GCI mounter on node e2e.


@dchen1107 (Member) commented:

SGTM

@Random-Liu (Member, Author) commented:

> Exit status 254 is an rkt exit code, not related to mount.

@rootfs IIUC, the mounter is using rkt fly.

@yujuhong (Contributor) commented Nov 4, 2016

I added some debugging code in the node e2e framework to ssh into the node and copy back the system logs after the failure (#36181).

Here is one test failure with the "Write failed: Broken pipe" error:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/36225/pull-kubernetes-node-e2e/3518/

In the GCS browser, you will find a tmp-node-e2e-bf3cf870-gci-dev-55-8872-18-0-gcloud-e2e-1757716123-system.log file
http://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/36225/pull-kubernetes-node-e2e/3518/artifacts/

Nov 04 12:26:58 tmp-node-e2e-bf3cf870-gci-dev-55-8872-18-0 systemd-timesyncd[592]: Network configuration changed, trying to establish connection.
Nov 04 12:26:58 tmp-node-e2e-bf3cf870-gci-dev-55-8872-18-0 systemd-timesyncd[592]: Synchronized to time server 169.254.169.254:123 (169.254.169.254).
-- Reboot --
Nov 04 12:28:01 localhost systemd-journal[425]: Runtime journal (/run/log/journal/) is currently using 8.0M.
                                                Maximum allowed usage is set to 184.9M.
                                                Leaving at least 277.4M free (of currently available 1.7G of space).
                                                Enforced usage limit is thus 184.9M.
Nov 04 12:28:01 localhost systemd-journal[425]: Permanent journal (/var/log/journal/) is currently using 32.0M.
                                                Maximum allowed usage is set to 1.0G.
                                                Leaving at least 867.5M free (of currently available 4.2G of space).
                                                Enforced usage limit is thus 1.0G.

The log shows that the node rebooted during the test, confirming my suspicion.

I haven't had time to look further into why the node rebooted.
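For anyone reproducing this by hand rather than through #36181, the same check can be done directly with journalctl; this is only an illustration, with the node IP as a placeholder and the key path copied from the runner's command above:

# pull the node's journal and look for "-- Reboot --" markers
ssh -i /home/jenkins/.ssh/google_compute_engine -o StrictHostKeyChecking=no <node-ip> -- sudo journalctl --no-pager > node-system.log
# or count boots directly; more than one entry means the node restarted mid-test
ssh -i /home/jenkins/.ssh/google_compute_engine -o StrictHostKeyChecking=no <node-ip> -- sudo journalctl --list-boots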

@yujuhong (Contributor) commented Nov 4, 2016

Another one with the broken pipe, but the system logs show:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/36239/pull-kubernetes-node-e2e/3599/

Right before the node went bad, there were quite a lot of mount-related errors:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/36239/pull-kubernetes-node-e2e/3599/artifacts/tmp-node-e2e-6f4712d0-gci-dev-55-8872-18-0-gcloud-e2e-699837479-system.log

Some examples (there are more in the log):

Nov 04 17:59:23 tmp-node-e2e-6f4712d0-gci-dev-55-8872-18-0 systemd[386]: Failed to open directory /usr/lib/systemd/system/var-lib-rkt-pods-run-c07c69c1\x2d507d\x2d424e\x2d8d1b\x2d2b63c6638ba2-stage1-rootfs-opt-stage2-gci\x2dmounter-rootfs-media-root-var-lib-kubelet-pods-461fc2a8\x2da2b8\x2d11e6\x2d972a\x2d42010a80001c-volumes-kubernetes.io\x7edownward\x2dapi-podinfo.mount.requires: File name too long
Nov 04 17:59:25 tmp-node-e2e-6f4712d0-gci-dev-55-8872-18-0 kubelet[1705]: E1104 17:59:24.822761    1705 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/empty-dir/3ac194d8-a2b8-11e6-972a-42010a80001c-test-volume\" (\"3ac194d8-a2b8-11e6-972a-42010a80001c\")" failed. No retries permitted until 2016-11-04 17:59:25.32273376 +0000 UTC (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "kubernetes.io/empty-dir/3ac194d8-a2b8-11e6-972a-42010a80001c-test-volume" (spec.Name: "test-volume") pod "3ac194d8-a2b8-11e6-972a-42010a80001c" (UID: "3ac194d8-a2b8-11e6-972a-42010a80001c") with: mount failed: exit status 254

@dchen1107 dchen1107 added this to the v1.5 milestone Nov 4, 2016
@dchen1107 (Member) commented:

@vishh it looks like the GCI mounter still causes the node e2e failures that block the submit queue here. Can you take a look at this? @yujuhong provided some detailed logs related to the recent failures. Thanks!

@Random-Liu Random-Liu added the priority/important-soon label and removed the priority/critical-urgent label Nov 7, 2016
@yujuhong (Contributor) commented Nov 7, 2016

After disabling gci_mounter in the node e2e suite, the builds have become more stable (aside from some flakes and timeout issues). Demoting to P1.

@dims (Member) commented Nov 16, 2016

This needs to be triaged as a release blocker or not for 1.5. @vishh @jingxu97 @yujuhong

@Random-Liu (Member, Author) commented Nov 17, 2016

@Random-Liu (Member, Author) commented Nov 17, 2016

I'll add periodic system log fetching in the framework to help debug this.
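As an illustration of the idea (not the eventual framework change), periodic fetching can be as simple as polling the node's journal so the logs survive even if the node later reboots or the SSH connection dies; the node IP is a placeholder and the key path is copied from the runner's command above:

# append the last two minutes of the node's journal to a local file, once a minute
while true; do
  ssh -i /home/jenkins/.ssh/google_compute_engine -o StrictHostKeyChecking=no <node-ip> -- \
    sudo journalctl --no-pager --since=-2min >> node-system.log || true
  sleep 60
done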

@dims (Member) commented Nov 18, 2016

@vishh @jingxu97 all issues must be labeled either release-blocking or non-release-blocking by end of day 18 November 2016 PST (or please move this to 1.6). cc @kubernetes/sig-node

@dchen1107 (Member) commented:

Based on node e2e stats, it has only failed once on a GCI node, with no clear signal that it is related to the GCI mounter. I am closing this issue for now.

k8s-github-robot pushed a commit that referenced this issue Nov 23, 2016
Automatic merge from submit-queue

Node E2E: Fix remote log fetching.

For issue #37333.

This will help debug #37333.

Mark v1.5 because this helps debug issue #37333, which was originally #35935. /cc @saad-ali

@yujuhong @dchen1107 @jingxu97 
/cc @kubernetes/sig-node