
Node E2E is failing on GCI #35935

Closed
Random-Liu opened this issue Oct 31, 2016 · 24 comments
Labels: area/node-e2e, priority/important-soon, sig/node, sig/storage
Milestone: v1.5

@Random-Liu (Member) commented Oct 31, 2016

All node e2e test runs have been failing on GCI since around 13:00 today. I don't know exactly what happened; here are some observations:

  • The SSH pipe breaks during the test. It seems that either the network goes down or the node reboots.
------------------------------
Write failed: Broken pipe

Failure Finished Test Suite on Host tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0
[command [ssh -i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR 104.154.160.25 -- sh -c 'cd /tmp/gcloud-e2e-1994784968 && timeout -k 30s 2700.000000s ./ginkgo --nodes=8 --skip="\[Flaky\]|\[Serial\]" ./e2e_node.test -- --logtostderr --v 4 --node-name=tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0 --report-dir=/tmp/gcloud-e2e-1994784968/results --report-prefix=gci-family-gci-dev-55-8872-18-0 --experimental-mounter-path=/tmp/gcloud-e2e-1994784968/cluster/gce/gci/mounter/mounter --experimental-mounter-rootfs-path=/media/root '] failed with error: exit status 255 and output:
Running Suite: E2eNode Suite
  • The node goes into a bad state before the pipe breaks.
    • Docker seems to hang:
FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "client-container" with RunContainerError: "runContainer: operation timeout: context deadline exceeded"
  • The GCI mounter throws an error:
FailedMount: MountVolume.SetUp failed for volume "kubernetes.io/empty-dir/9b68fb0a-9fa0-11e6-9b5b-42010a800006-restart-count" (spec.Name: "restart-count") pod "9b68fb0a-9fa0-11e6-9b5b-42010a800006" (UID: "9b68fb0a-9fa0-11e6-9b5b-42010a800006") with: mount failed: exit status 254
Mounting command: /tmp/gcloud-e2e-1994784968/cluster/gce/gci/mounter/mounter

This is blocking the submit queue now; marking this P0.

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10759
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761

@vishh @jingxu97
/cc @kubernetes/sig-node

@Random-Liu Random-Liu added the sig/node and area/node-e2e labels Oct 31, 2016
@Random-Liu Random-Liu added the priority/critical-urgent label Oct 31, 2016
@yujuhong (Contributor) commented:

By the way, the logs on the node were not collected because of the connection issues. The best way to debug is probably to run node e2e in one's own cluster.
Related issue: #35914
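A rough sketch of what a manual run looks like, based on the command the Jenkins runner issues (captured in the build logs above). This is only an illustration, not the framework's own tooling: <your-node-ip> and <your-node-name> are placeholders, and the test binaries (ginkgo, e2e_node.test) are assumed to already be staged in the working directory on the node.

# placeholders: <your-node-ip>, <your-node-name>; adjust paths to wherever you staged the binaries
ssh -i ~/.ssh/google_compute_engine -o StrictHostKeyChecking=no <your-node-ip> -- sh -c 'cd /tmp/gcloud-e2e-1994784968 && timeout -k 30s 2700s ./ginkgo --nodes=8 --skip="\[Flaky\]|\[Serial\]" ./e2e_node.test -- --logtostderr --v 4 --node-name=<your-node-name> --report-dir=/tmp/gcloud-e2e-1994784968/results'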

@dchen1107 dchen1107 added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Oct 31, 2016
@dchen1107 (Member) commented Oct 31, 2016

I haven't followed all the gci-mounter related PRs closely, but one can quickly revert #35652 in their branch and re-run node e2e to see whether the regression is caused by the GCI mounter.

EDITED by dchen1107: try to revert #35349, not #35652, in your branch.
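A minimal sketch of that check under the usual merged-PR workflow; the SHA below is a placeholder for whatever git log reports in your branch:

# find the merge commit that brought in the suspect PR, revert it, then re-run node e2e
git log --oneline --merges | grep '#35349'
git revert -m 1 <merge-commit-sha-of-35349>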

@Random-Liu (Member, Author) commented Oct 31, 2016

@yujuhong (Contributor) commented:

> However, the latest node e2e became green. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10762

Node e2e has been flaky since last week. It doesn't always fail.

@jingxu97 (Contributor) commented Nov 1, 2016

I checked five failures, and they are all different. Only one log shows the mount failure with exit code 254 (connection closed?). The other failure messages include:

  1. permission denied (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10760/build-log.txt)
  2. failed with error: exit status 255 (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761/build-log.txt)
  3. Failed to stop services: error stopping "services": wait: no child processes (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10758/build-log.txt)

So it is quite possible that the mount failed for another reason (connection closed, etc.).

@Random-Liu (Member, Author) commented Nov 1, 2016

This failure has been happening since last Friday.

There are two kinds of failures, but I assume they are caused by the same issue:

  1. Write failed: Broken pipe. The test fails before completion because the SSH pipe is broken:
------------------------------
Write failed: Broken pipe

Failure Finished Test Suite on Host tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0
[command [ssh -i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -o LogLevel=ERROR 104.154.160.25 -- sh -c 'cd /tmp/gcloud-e2e-1994784968 && timeout -k 30s 2700.000000s ./ginkgo --nodes=8 --skip="\[Flaky\]|\[Serial\]" ./e2e_node.test -- --logtostderr --v 4 --node-name=tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0 --report-dir=/tmp/gcloud-e2e-1994784968/results --report-prefix=gci-family-gci-dev-55-8872-18-0 --experimental-mounter-path=/tmp/gcloud-e2e-1994784968/cluster/gce/gci/mounter/mounter --experimental-mounter-rootfs-path=/media/root '] failed with error: exit status 255 and output:
Running Suite: E2eNode Suite

https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10706/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10759/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10729/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10702/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10683/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10681/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10677/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10676/build-log.txt

  2. No test results at all. We only see "Copying test artifacts from xxx"; usually the test results are printed after that, but there is nothing:
I1028 05:07:50.421848   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-gci-dev-55-8872-18-0
I1028 05:07:51.444594   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-e2e-node-containervm-v20160604-image
I1028 05:07:51.738267   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-coreos-alpha-1122-0-0-v20160727
I1028 05:07:52.335168   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-e2e-node-ubuntu-trusty-docker10-v1-image
I1028 05:07:52.929704   30664 remote.go:290] Starting tests on tmp-node-e2e-6ae5f238-e2e-node-ubuntu-trusty-docker9-v1-image
I1028 05:11:06.677431   30664 remote.go:297] Copying test artifacts from tmp-node-e2e-6ae5f238-gci-dev-55-8872-18-0
I1028 05:39:20.158098   30664 remote.go:297] Copying test artifacts from tmp-node-e2e-6ae5f238-coreos-alpha-1122-0-0-v20160727
I1028 05:39:20.684135   30664 run_remote.go:566] Deleting instance "tmp-node-e2e-6ae5f238-coreos-alpha-1122-0-0-v20160727"

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>                              START TEST                                >
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Start Test Suite on Host tmp-node-e2e-6ae5f238-coreos-alpha-1122-0-0-v20160727
Running Suite: E2eNode Suite
============================
Random Seed: 1477656471 - Will randomize all specs
Will run 116 of 149 specs

I believe this is the same issue as 1). The test only ran for 4 minutes; I believe it also failed in the middle because of a broken SSH pipe.
As for why there are no test results, I think something is wrong with SSH or the network, and the scp just gets stuck forever. So at the end of the test we see the following log, because the test results from GCI are never scp'd back:

Build timed out (after 90 minutes). Marking the build as failed.
Build was aborted
Recording test results
I1028 06:31:24.967016   30664 run_remote.go:566] Deleting instance "tmp-node-e2e-6ae5f238-gci-dev-55-8872-18-0"

https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10665/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10750/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10747/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10712/build-log.txt
https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10667/build-log.txt

Both 1) and 2) are related to broken SSH or networking.
/cc @kubernetes/sig-node @kubernetes/goog-image
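The scp hang described above is what turns a node problem into a 90-minute build timeout. As an illustration only (not what the test runner actually does), wrapping the artifact copy in a timeout keeps a wedged connection from stalling the whole job; the node IP and remote results path below are placeholders patterned after the command in the logs:

# give up after 5 minutes instead of hanging until the Jenkins build timeout
mkdir -p ./results
timeout 300 scp -i /home/jenkins/.ssh/google_compute_engine -o StrictHostKeyChecking=no \
  <node-ip>:/tmp/gcloud-e2e-1994784968/results/* ./results/ \
  || echo "scp from node timed out or failed; artifacts missing"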

@Random-Liu (Member, Author) commented Nov 1, 2016

> 1. permission denied (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10760/build-log.txt)
> 2. failed with error: exit status 255 (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761/build-log.txt)
> 3. Failed to stop services: error stopping "services": wait: no child processes (https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10758/build-log.txt)

1) is another failure, on Ubuntu Trusty; I'll figure it out.
2) and 3) are just this issue.

@Random-Liu (Member, Author) commented Nov 2, 2016

FYI, we disabled the GCI mounter in the CRI validation node e2e test last night, and this failure has not happened since.
https://k8s-testgrid.appspot.com/google-node#kubelet-cri-gce-e2e&width=5
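For reference, "disabling the GCI mounter" presumably amounts to dropping the two --experimental-mounter-* flags from the test invocation captured in the failure logs above, so the kubelet under test falls back to its default mounter. A sketch, with paths and node name copied from those logs (adjust for your own run):

# same invocation as in the failing runs, minus --experimental-mounter-path and --experimental-mounter-rootfs-path
./ginkgo --nodes=8 --skip="\[Flaky\]|\[Serial\]" ./e2e_node.test -- --logtostderr --v 4 --node-name=tmp-node-e2e-6e3a480a-gci-dev-55-8872-18-0 --report-dir=/tmp/gcloud-e2e-1994784968/results --report-prefix=gci-family-gci-dev-55-8872-18-0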

@dchen1107 (Member) commented:

@vishh and @jingxu97, can we have a separate Jenkins job for testing the GCI mounter and disable it in the rest of node e2e for now, like the feature/flaky test suites in cluster e2e? The flakiness of the GCI mounter causes a lot of noise and blocks our submit queue.

@vishh (Contributor) commented Nov 2, 2016

@dchen1107 there is a PR under review that reduces the scope of the GCI mounter. I'm hoping it will be done by today. That should reduce the possibility of gci-mounter-related flakes.


@rootfs (Contributor) commented Nov 2, 2016

Exit status 254 is an rkt exit code, not related to mount.
rkt/rkt#3261

@dchen1107 (Member) commented:

re: #35935 (comment)

@vishh I know there is a pending PR to reduce the scope of the GCI mounter, but I am not sure it can get in today since it is still under review. Even if that PR gets in today, I am not sure it will remove all the flakiness from our critical builds unless we move both the NFS and GlusterFS related tests to a different build, which is what I suggested above.

We ran into this kind of issue before, in the 1.4 release with MemoryEviction, and had a long discussion between you and the rest of the team. I thought we reached an agreement in the end, and you even wrote up criteria for this at:
#31676 (comment)

@vishh (Contributor) commented Nov 2, 2016

Hmm. In that case, I'd recommend merging #35724 to disable the GCI mounter on node e2e.


@dchen1107 (Member) commented:

SGTM

@Random-Liu (Member, Author) commented:

> Exit status 254 is an rkt exit code, not related to mount.

@rootfs IIUC, the mounter is using rkt fly.

@yujuhong (Contributor) commented Nov 4, 2016

I added some debugging code in the node e2e framework to ssh into the node and copy back the system logs after the failure (#36181).

Here is one test failure with the "Write failed: Broken pipe" error:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/36225/pull-kubernetes-node-e2e/3518/

In the GCS browser, you will find a tmp-node-e2e-bf3cf870-gci-dev-55-8872-18-0-gcloud-e2e-1757716123-system.log file
http://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/36225/pull-kubernetes-node-e2e/3518/artifacts/

Nov 04 12:26:58 tmp-node-e2e-bf3cf870-gci-dev-55-8872-18-0 systemd-timesyncd[592]: Network configuration changed, trying to establish connection.
Nov 04 12:26:58 tmp-node-e2e-bf3cf870-gci-dev-55-8872-18-0 systemd-timesyncd[592]: Synchronized to time server 169.254.169.254:123 (169.254.169.254).
-- Reboot --
Nov 04 12:28:01 localhost systemd-journal[425]: Runtime journal (/run/log/journal/) is currently using 8.0M.
                                                Maximum allowed usage is set to 184.9M.
                                                Leaving at least 277.4M free (of currently available 1.7G of space).
                                                Enforced usage limit is thus 184.9M.
Nov 04 12:28:01 localhost systemd-journal[425]: Permanent journal (/var/log/journal/) is currently using 32.0M.
                                                Maximum allowed usage is set to 1.0G.
                                                Leaving at least 867.5M free (of currently available 4.2G of space).
                                                Enforced usage limit is thus 1.0G.

The log shows that the node rebooted during the test, confirming my suspicion.

I haven't had time to look further into why the node rebooted.
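For anyone reproducing this by hand rather than through #36181, the same check can be done directly with journalctl; this is only an illustration, with the node IP as a placeholder and the key path copied from the runner's command above:

# pull the node's journal and look for "-- Reboot --" markers
ssh -i /home/jenkins/.ssh/google_compute_engine -o StrictHostKeyChecking=no <node-ip> -- sudo journalctl --no-pager > node-system.log
# or count boots directly; more than one entry means the node restarted mid-test
ssh -i /home/jenkins/.ssh/google_compute_engine -o StrictHostKeyChecking=no <node-ip> -- sudo journalctl --list-boots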

@yujuhong (Contributor) commented Nov 4, 2016

Another one with the broken pipe, but the system logs show:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/36239/pull-kubernetes-node-e2e/3599/

Right before the node went bad, there were quite a lot of mount-related errors:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/36239/pull-kubernetes-node-e2e/3599/artifacts/tmp-node-e2e-6f4712d0-gci-dev-55-8872-18-0-gcloud-e2e-699837479-system.log

Some examples (there are more in the log):

Nov 04 17:59:23 tmp-node-e2e-6f4712d0-gci-dev-55-8872-18-0 systemd[386]: Failed to open directory /usr/lib/systemd/system/var-lib-rkt-pods-run-c07c69c1\x2d507d\x2d424e\x2d8d1b\x2d2b63c6638ba2-stage1-rootfs-opt-stage2-gci\x2dmounter-rootfs-media-root-var-lib-kubelet-pods-461fc2a8\x2da2b8\x2d11e6\x2d972a\x2d42010a80001c-volumes-kubernetes.io\x7edownward\x2dapi-podinfo.mount.requires: File name too long
Nov 04 17:59:25 tmp-node-e2e-6f4712d0-gci-dev-55-8872-18-0 kubelet[1705]: E1104 17:59:24.822761    1705 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/empty-dir/3ac194d8-a2b8-11e6-972a-42010a80001c-test-volume\" (\"3ac194d8-a2b8-11e6-972a-42010a80001c\")" failed. No retries permitted until 2016-11-04 17:59:25.32273376 +0000 UTC (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume "kubernetes.io/empty-dir/3ac194d8-a2b8-11e6-972a-42010a80001c-test-volume" (spec.Name: "test-volume") pod "3ac194d8-a2b8-11e6-972a-42010a80001c" (UID: "3ac194d8-a2b8-11e6-972a-42010a80001c") with: mount failed: exit status 254

@dchen1107 dchen1107 added this to the v1.5 milestone Nov 4, 2016
@dchen1107 (Member) commented:

@vishh it looks like the GCI mounter still causes the node e2e failures that block the submit queue here. Can you take a look at this? @yujuhong provided some detailed logs related to the recent failures. Thanks!

@Random-Liu Random-Liu added the priority/important-soon label and removed the priority/critical-urgent label Nov 7, 2016
@yujuhong (Contributor) commented Nov 7, 2016

After disabling gci_mounter in the node e2e suite, the builds have become more stable (aside from some flakes and timeout issues). Demoting to P1.

@dims (Member) commented Nov 16, 2016

This needs to be triaged as a release blocker or not for 1.5. @vishh @jingxu97 @yujuhong

@Random-Liu (Member, Author) commented Nov 17, 2016

@Random-Liu (Member, Author) commented Nov 17, 2016

I'll add periodic system log fetching in the framework to help debug this.
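As an illustration of the idea (not the eventual framework change), periodic fetching can be as simple as polling the node's journal so the logs survive even if the node later reboots or the SSH connection dies; the node IP is a placeholder and the key path is copied from the runner's command above:

# append the last two minutes of the node's journal to a local file, once a minute
while true; do
  ssh -i /home/jenkins/.ssh/google_compute_engine -o StrictHostKeyChecking=no <node-ip> -- \
    sudo journalctl --no-pager --since=-2min >> node-system.log || true
  sleep 60
done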

@dims (Member) commented Nov 18, 2016

@vishh @jingxu97 all issues must be labeled either release-blocking or non-release-blocking by end of day 18 November 2016 PST (or please move this to 1.6). cc @kubernetes/sig-node

@dchen1107 (Member) commented:

Based on node e2e stats, it has only failed once on a GCI node, with no clear signal that it is related to the GCI mounter. I am closing this issue for now.

k8s-github-robot pushed a commit that referenced this issue Nov 23, 2016
Automatic merge from submit-queue

Node E2E: Fix remote log fetching.

For issue #37333.

This will help debug #37333.

Mark v1.5 because this helps debug issue #37333, which was originally #35935. /cc @saad-ali

@yujuhong @dchen1107 @jingxu97 
/cc @kubernetes/sig-node