Node E2E is failing on GCI #35935
By the way, the logs on the node were not collected because of the connection issues. The best way to debug is probably to run node e2e in your own cluster.
However, the latest node e2e run became green: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10762
Node e2e has been flaky since last week. It doesn't always fail.
I checked five failures, but they are all different. Only one log shows a mount failed message with exit code 254 (closed connection?). Other failure messages include
So it is very possible that the mount failed for some other reason (connection closed, etc.).
This failure has been happening since last Friday. There are two kinds of failures, but I assume they are caused by the same issue:
1) https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761/build-log.txt
2) I believe this is the same issue as 1). The test only runs for 4 minutes, and I believe it also fails in the middle because of a broken ssh pipe. https://storage.googleapis.com/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10665/build-log.txt
Both 1) and 2) are related to a broken ssh connection or the network.
FYI, we disabled the gci mounter in the CRI validation node e2e test last night, and this failure has not happened again since.
@dchen1107 there is a PR under review that reduces the scope for the gci mounter.
re: #35935 (comment) @vishh I know there is a pending PR to reduce the scope of the GCI mounter, but I am not sure it can get in today since it is still under review. Even if that PR gets in today, I am not sure it will remove all flakiness from our critical builds unless we move both the NFS and GlusterFS related tests to a different build, which is what I suggested above. We ran into this kind of issue before in the 1.4 release with MemoryEviction, and had a long discussion between you and the rest of the team. I thought we reached an agreement at the end, and you even wrote up criteria about this at:
Hmm. In that case, I'd recommend merging
SGTM
@rootfs IIUC, the mounter is using rkt fly.
I added some debugging code in the node e2e framework to ssh into the node and copy back the system logs after the failure (#36181). Here is one test failure with the "Write failed: Broken pipe" error. In the GCS browser, you will find a
From the log, it shows that the node rebooted during the test, confirming my suspicion. I haven't had time to look any further into why the node rebooted.
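For anyone who wants to do something similar locally, here is a rough sketch of what such a log-copying helper could look like. This is not the actual #36181 change; the host, key path, and log file list are placeholders.

```go
package main

import (
	"fmt"
	"os/exec"
)

// fetchSystemLogs copies a few system logs from the test node into destDir
// so they land next to the other test artifacts. Host, key file, and the
// log list are placeholders, not what the framework actually uses.
func fetchSystemLogs(host, keyFile, destDir string) {
	logs := []string{"/var/log/kern.log", "/var/log/kubelet.log", "/var/log/docker.log"}
	for _, l := range logs {
		cmd := exec.Command("scp", "-i", keyFile,
			"-o", "StrictHostKeyChecking=no",
			fmt.Sprintf("%s:%s", host, l), destDir)
		if out, err := cmd.CombinedOutput(); err != nil {
			// Keep going: the node may be mid-reboot and some copies can fail.
			fmt.Printf("failed to copy %s: %v\n%s\n", l, err, out)
		}
	}
}

func main() {
	fetchSystemLogs("core@10.0.0.1", "/path/to/ssh-key", "./artifacts")
}
```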
Another one with the broken pipe, but the system logs show
Right before the node went bad, there were quite a lot of mount-related errors. Some examples (more in the log):
After disabling gci_mounter in the node e2e suite, the builds have become more stable (sans some flakes and timeout issues). Demoting to P1.
Another occurrence: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/11234. Still on GCI.
I'll add periodic system log fetching in the framework to help debug this.
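A hedged sketch of what periodic log fetching could look like, not the actual framework change; the fetch function, interval, and shutdown channel are placeholders:

```go
package main

import (
	"fmt"
	"time"
)

// pollSystemLogs calls fetch every interval until done is closed, so there is
// a recent snapshot of the node logs even if the node later reboots or ssh breaks.
func pollSystemLogs(interval time.Duration, fetch func() error, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := fetch(); err != nil {
				fmt.Printf("periodic log fetch failed: %v\n", err)
			}
		case <-done:
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	// Placeholder fetch function; in the framework this would copy the logs off the node.
	go pollSystemLogs(30*time.Second, func() error { return nil }, done)
	time.Sleep(65 * time.Second)
	close(done)
}
```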
Based on the node e2e stats, it has only failed once on a GCI node, with no clear signal related to the gci-mounter. I am closing this issue for now.
All node e2e test runs have been failing on GCI since around 13:00 today. I don't know exactly what happened; here are some observations:
This is blocking the submit queue now, marking P0.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10759
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubelet-gce-e2e-ci/10761
@vishh @jingxu97
/cc @kubernetes/sig-node