Networking test timing out on Jenkins #5629

Closed
zmerlynn opened this issue Mar 19, 2015 · 39 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@zmerlynn
Member

cc @ixdy @satnam6502

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:180
STEP: Creating service with name nettest in namespace nettest-9970
STEP: Creating a replication controller
STEP: Waiting for connectivity to be verified
Attempt 0/60: service/pod still starting. (error: 'endpoints "nettest" not found')
Attempt 1/60: service/pod still starting. (error: 'endpoints "nettest" not found')
Attempt 2/60: service/pod still starting. (error: 'endpoints "nettest" not found')
Attempt 3/60: test still running
Attempt 4/60: test still running
Attempt 5/60: test still running
Attempt 6/60: test still running
Attempt 7/60: test still running
Attempt 8/60: test still running
Attempt 9/60: test still running
Attempt 10/60: test still running
Attempt 11/60: test still running
Attempt 12/60: test still running
Attempt 13/60: test still running
Attempt 14/60: test still running
Attempt 15/60: test still running
Attempt 16/60: test still running
Attempt 17/60: test still running
Attempt 18/60: test still running
Attempt 19/60: test still running
Attempt 20/60: test still running
Attempt 21/60: test still running
Attempt 22/60: test still running
Attempt 23/60: test still running
Attempt 24/60: test still running
Attempt 25/60: test still running
Attempt 26/60: test still running
Attempt 27/60: test still running
Attempt 28/60: test still running
Attempt 29/60: test still running
Attempt 30/60: test still running
Attempt 31/60: test still running
Attempt 32/60: test still running
Attempt 33/60: test still running
Attempt 34/60: test still running
Attempt 35/60: test still running
Attempt 36/60: test still running
Attempt 37/60: test still running
Attempt 38/60: test still running
Attempt 39/60: test still running
Attempt 40/60: service/pod still starting. (error: 'request [&{Method:GET URL:https://104.197.4.134/api/v1beta1/proxy/services/nettest/status?namespace=nettest-9970 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[] Body:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:104.197.4.134 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil>}] failed (504) 504 Gateway Time-out: <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.2.1</center>
</body>
</html>
')
Attempt 41/60: service/pod still starting. (error: 'request [&{Method:GET URL:https://104.197.4.134/api/v1beta1/proxy/services/nettest/status?namespace=nettest-9970 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[] Body:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:104.197.4.134 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil>}] failed (504) 504 Gateway Time-out: <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.2.1</center>
</body>
</html>
')
Attempt 42/60: service/pod still starting. (error: 'request [&{Method:GET URL:https://104.197.4.134/api/v1beta1/proxy/services/nettest/status?namespace=nettest-9970 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[] Body:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:104.197.4.134 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil>}] failed (504) 504 Gateway Time-out: <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.2.1</center>
</body>
</html>
')
@zmerlynn
Member Author

I think we need to revert the PRs that went in

@zmerlynn
Member Author

(The later commits, not the namespace one.)

@zmerlynn zmerlynn added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/test-infra labels Mar 19, 2015
@zmerlynn
Member Author

Wait, no. It's been timing out since 11am.

@satnam6502
Contributor

Yes, I am noticing it too, and I think there might be a deadlock somewhere which occurs just as the test tries to leave the running state and pass.

@zmerlynn
Member Author

I really wanted to blame you, but the first failure is at git v0.12.0-882-g7269960, which is shortly before your first PR went in. :)

@satnam6502
Contributor

I can't see anything in that which would cause the networking test to break.


@bprashanth
Contributor

FWIW, I had this failing ~an hour ago, but not on rebase.

@zmerlynn
Member Author

My guess is that it's #5555.

@satnam6502
Contributor

The tests seem to work for me:

Running Suite: Kubernetes e2e Suite run 1 of 1
==============================================
Random Seed: 1426726935 - Will randomize all specs
Will run 2 of 30 specs

SSSSSSSSSSSSSSSS
------------------------------
Networking 
  should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:193
[BeforeEach] Networking
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:39
>>> testContext.kubeConfig: /usr/local/google/home/satnam/.kube/.kubeconfig
[It] should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:193
STEP: Creating service with name nettest in namespace nettest-6777
STEP: Creating a webserver pod on each node
INFO: Creating pod nettest-0 on node e2e-test-satnam-minion-rkqu.c.kubernetes-satnam.internal
INFO: Creating pod nettest-1 on node e2e-test-satnam-minion-s7sh.c.kubernetes-satnam.internal
STEP: Wait for the webserver pods to be ready
STEP: waiting up to 5m0s for pod nettest-0 status to be running
INFO: Waiting for pod nettest-0 in namespace nettest-6777 status to be "running" (found "Pending") (406.567337ms)
INFO: Waiting for pod nettest-0 in namespace nettest-6777 status to be "running" (found "Pending") (5.458883651s)
STEP: waiting up to 5m0s for pod nettest-1 status to be running
INFO: Waiting for pod nettest-1 in namespace nettest-6777 status to be "running" (found "Pending") (43.816791ms)
INFO: Waiting for pod nettest-1 in namespace nettest-6777 status to be "running" (found "Pending") (5.089164552s)
STEP: Waiting for connectivity to be verified
Attempt 0/60: test still running
Attempt 1/60: test still running
Attempt 2/60: test still running
Attempt 3/60: test still running
Passed on attempt 4. Cleaning up.
STEP: Cleaning up the webserver pods
STEP: Cleaning up the service

• [SLOW TEST:32.236 seconds]
Networking
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:215
  should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:193
------------------------------
SSS
------------------------------
Networking 
  should provide unchanging URLs
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:214
[BeforeEach] Networking
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:39
>>> testContext.kubeConfig: /usr/local/google/home/satnam/.kube/.kubeconfig
[It] should provide unchanging URLs
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:214
STEP: testing: /validate
STEP: testing: /healthz
•SSSSSSSSS
Ran 2 of 30 Specs in 32.689 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 28 Skipped I0318 18:02:48.681366   31568 driver.go:89] All tests pass

@zmerlynn
Member Author

This test is now passing with gusto. I'm confused.

@zmerlynn zmerlynn reopened this Mar 31, 2015
@zmerlynn
Member Author

This is the same thing @satnam6502 put the logging in for, but it's snuck back in, now with more logging:

Networking 
  should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:203
[BeforeEach] Networking
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:39
>>> testContext.kubeConfig: /var/lib/jenkins/jobs/kubernetes-e2e-gce/workspace/.kube/.kubeconfig
[It] should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:203
STEP: Creating service with name nettest in namespace nettest-409
STEP: Creating a webserver pod on each node
INFO: Creating pod nettest-0 on node e2e-test-jenkins-minion-f0u9.c.kubernetes-jenkins.internal
INFO: Creating pod nettest-1 on node e2e-test-jenkins-minion-t24l.c.kubernetes-jenkins.internal
STEP: Wait for the webserver pods to be ready
STEP: waiting up to 5m0s for pod nettest-0 status to be running
INFO: Waiting for pod nettest-0 in namespace nettest-409 status to be "running" (found "Pending") (5.993637ms)
STEP: waiting up to 5m0s for pod nettest-1 status to be running
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (3.633675ms)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (5.008000484s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (10.012479962s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (15.018019974s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (20.022416932s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (25.027329461s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (30.032123805s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (35.037695148s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (40.042410693s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (45.0471947s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (50.051412829s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (55.056787917s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m0.061367203s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m5.066695731s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m10.071522224s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m15.076154792s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m20.080382614s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m25.085091455s)
STEP: Waiting for connectivity to be verified
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.003636218s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004278343s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 6.570714ms
INFO: Attempt 2/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5.693632ms
INFO: Attempt 3/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.00408193s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004904434s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.005336923s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.005038183s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.005179133s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004361386s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 6.605119ms
INFO: Attempt 10/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5.692527ms
INFO: Attempt 11/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004285752s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 7.01869ms
INFO: Attempt 13/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004841588s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 9.717108ms
INFO: Attempt 15/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004179451s
INFO: About to make a proxy status call
Build timed out (after 90 minutes). Marking the build as aborted.
Build timed out (after 90 minutes). Marking the build as failed.

We've had a couple of aborted Jenkins builds recently due to this.

@j3ffml
Contributor

j3ffml commented Apr 1, 2015

From running the test locally, the failure I see is that the webserver container cannot run.

{
    "kind": "Pod",
    "id": "nettest-0",
    "uid": "6384d3eb-d804-11e4-80af-42010af08eed",
    "creationTimestamp": "2015-04-01T00:16:48Z",
    "selfLink": "/api/v1beta1/pods/nettest-0?namespace=nettest-7301",
    "resourceVersion": 682,
    "apiVersion": "v1beta1",
    "namespace": "nettest-7301",
    "labels": {
        "name": "nettest"
    },
    "desiredState": {
        "manifest": {
            "version": "v1beta2",
            "id": "",
            "volumes": null,
            "containers": [
                {
                    "name": "webserver",
                    "image": "gcr.io/google_containers/nettest:1.1",
                    "ports": [
                        {
                            "containerPort": 8080,
                            "protocol": "TCP"
                        }
                    ],
                    "resources": {},
                    "terminationMessagePath": "/dev/termination-log",
                    "imagePullPolicy": "PullIfNotPresent",
                    "capabilities": {}
                }
            ],
            "restartPolicy": {
                "never": {}
            },
            "dnsPolicy": "ClusterFirst"
        },
        "host": "e2e-test-jeffml-minion-8w7n.c.jeffmltest.internal"
    },
    "currentState": {
        "manifest": {
            "version": "",
            "id": "",
            "volumes": null,
            "containers": null,
            "restartPolicy": {}
        },
        "status": "Waiting",
        "Condition": [
            {
                "kind": "Ready",
                "status": "None"
            }
        ],
        "host": "e2e-test-jeffml-minion-8w7n.c.jeffmltest.internal",
        "hostIP": "104.197.4.9",
        "info": {
            "POD": {
                "state": {
                    "termination": {
                        "exitCode": 2,
                        "startedAt": "2015-04-01T00:16:49Z",
                        "finishedAt": "2015-04-01T00:16:58Z"
                    }
                },
                "ready": false,
                "restartCount": 0,
                "image": "kubernetes/pause:latest",
                "imageID": "docker://6c4579af347b649857e915521132f15a06186d73faa62145e3eeeb6be0e97c27",
                "containerID": "docker://20fb8035f4b0b3deee7cd9bae3e1a8c4e892298c9b72a989c8e4f81a96f48a7f"
            },
            "webserver": {
                "state": {
                    "waiting": {
                        "reason": "Container cannot run"
                    }
                },
                "ready": false,
                "restartCount": 0,
                "image": "gcr.io/google_containers/nettest:1.1",
                "imageID": "docker://a1320a4ec4cd92675d03f4d932c5a40977a864651573d242c0007a8fa5536985",
                "containerID": "docker://2dc9729cfd3f788fbd9234eccbb28cd620c34c9e41e4f5063158ab84e73e341f"
            }
        }
    }
}

This seems to be a catchall failure (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/kubelet/dockertools/docker.go#L618). @dchen1107 may have some idea what this error indicates.

@satnam6502
Contributor

If the pod's state here is "termination" then why does it show up as "Pending" in the test log output?


@dchen1107
Member

@jlowdermilk Which Kubernetes version are you running here? It looks like an old version, since currentState still has the container POD information.

The pod's state is terminated and the container doesn't run because its network container (POD) is dead. I don't know why yet. Could you please run

cat docker.log | grep "20fb8035f4b0b3deee7cd9bae3e1a8c4e892298c9b72a989c8e4f81a96f48a7f"

to see whether anything abnormal shows up there?

I can try to reproduce the issue later when I have VPN.

@j3ffml
Contributor

j3ffml commented Apr 1, 2015

I don't see anything abnormal in docker.log, but I'm also not sure what would count as abnormal. Here is the output.

When trying to repro, I saw the pod stuck in pending until the timeout was reached. While it was pending, this was the output of docker ps -a | grep nettest on a minion:

jeffml@e2e-test-jeffml-minion-8w7n:~$ sudo docker ps -a | grep nettest
3fb45954f92b        gcr.io/google_containers/nettest:1.1                      "-service=nettest -p   About an hour ago                                                      k8s_webserver.cdc9249f_nettest-0_nettest-7539_618da41b-d80a-11e4-8a8d-42010af08eed_b43e3b95
d74f85d6f37e        gcr.io/google_containers/pause:1.0                        "/pause"               About an hour ago   Exited (2) About an hour ago                       k8s_POD.98c00084_nettest-0_nettest-7539_618da41b-d80a-11e4-8a8d-42010af08eed_fa4e3f25

So yeah, the network container has been dead for some time. I see the same thing consistently when retrying the test.

@dchen1107
Member

OK, I realized that the POD container might be killed by the kubelet when it finds that none of the user containers are running and the restart policy is never.

Can you please post the output of 'cat docker.log | grep 2dc9729cfd3f788fbd9234eccbb28cd620c34c9e41e4f5063158ab84e73e341f' here?
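For what it's worth, a tiny hypothetical sketch of the rule being described here (simplified stand-in types and names, not the real kubelet code): with a restart policy of never, once every user container has exited, the infra ("POD") container has nothing left to serve and can be torn down, which would match the Exited pause container seen above.

```go
package main

import "fmt"

// containerStatus is an illustrative stand-in, not a real kubelet type.
type containerStatus struct {
	Name    string
	Running bool
}

// shouldKillInfraContainer sketches the behavior described above: if the pod's
// restart policy is "never" and no user container is still running, the
// network/infra ("POD") container can be killed as well.
func shouldKillInfraContainer(restartPolicy string, containers []containerStatus) bool {
	if restartPolicy != "never" {
		return false
	}
	for _, c := range containers {
		if c.Running {
			return false
		}
	}
	return true
}

func main() {
	statuses := []containerStatus{{Name: "webserver", Running: false}}
	fmt.Println(shouldKillInfraContainer("never", statuses)) // true
}
```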

@j3ffml
Contributor

j3ffml commented Apr 1, 2015

Ah, sorry, I finished working with that e2e cluster and tore it down. I'm spinning up a new one and will see if I can repro the above symptoms.

@dchen1107
Member

Don't worry, I will try to reproduce it later. Thanks!

@satnam6502
Contributor

I will also independently have a look at this, depending on how soon I can neutralize the kids.


@dchen1107
Member

OK, running e2e takes forever, sigh!

It is very easy to reproduce the issue; it just requires a long time. The docker log shows that the container cannot run because of a bad image:

Cannot start container 336ddc1f6a73209d416c1b7b8a789e7139d8f72ea940a0413dee823810bab3fc: exec: "-service=nettest": executable file not found in $PATH
time="2015-04-01T06:39:59Z" level="info" msg="-job start(336ddc1f6a73209d416c1b7b8a789e7139d8f72ea940a0413dee823810bab3fc) = ERR (1)" 

I believe this is a regression introduced by #6100, which changes the way Kubernetes uses ENTRYPOINT. The fix should be straightforward. I can provide it tomorrow morning.
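To make that failure mode concrete, here is a minimal hypothetical sketch (simplified stand-in types; the image entrypoint path is made up for illustration) of why putting the nettest flags in the container command breaks: the command maps to Docker's ENTRYPOINT and replaces the image's own, so Docker tries to exec "-service=nettest" as if it were the binary.

```go
package main

import "fmt"

// containerSpec is a simplified stand-in for the container API object.
type containerSpec struct {
	Image   string
	Command []string // maps to Docker ENTRYPOINT (replaces the image's)
	Args    []string // maps to Docker CMD (appended to the entrypoint)
}

// effectiveInvocation sketches what Docker ends up exec'ing: an explicit
// Command wins over the image entrypoint, so its first element must be an
// executable on $PATH inside the image.
func effectiveInvocation(imageEntrypoint []string, c containerSpec) []string {
	if len(c.Command) > 0 {
		return append(c.Command, c.Args...)
	}
	return append(imageEntrypoint, c.Args...)
}

func main() {
	imageEntrypoint := []string{"/nettest"} // hypothetical entrypoint baked into the image

	broken := containerSpec{
		Image:   "gcr.io/google_containers/nettest:1.1",
		Command: []string{"-service=nettest"}, // the flag becomes the "binary" to exec
	}
	// Prints [-service=nettest]: no such executable exists in the image, hence
	// "executable file not found in $PATH".
	fmt.Println(effectiveInvocation(imageEntrypoint, broken))
}
```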

@dchen1107
Member

cc/ @bgrant0607 :-)

@bgrant0607
Member

FYI @pmorie

@bgrant0607
Member

If #6100 is the culprit, it's a bug. The image's entrypoint is supposed to be used if entrypoint isn't provided.

@pmorie
Member

pmorie commented Apr 1, 2015

@bgrant0607 @dchen1107 @satnam6502 Taking a look to see if it's #6100

@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

#6100 definitely lines up with the beginning of the Networking test failures, which actually started after I reopened this bug, amusingly. This bug has taken on a different meaning, though. I reopened the bug for a totally different issue that we're still seeing, which is a hang while contacting the proxy endpoint. I'm forking that off and re-lowering the priority because we haven't seen another red dot in a little bit.

@zmerlynn zmerlynn added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Apr 1, 2015
@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

Forked to #6294. This bug should stay as the random, very serious issue where the nettest attempts to make a client call and hangs arbitrarily, which is a violation of all reasonable principles for dealing with timeouts.

@pmorie
Member

pmorie commented Apr 1, 2015

@bgrant0607 the code was functioning correctly; in this case the e2e test needed to be changed to set the args, not the command.
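A minimal sketch of that fix, using the same simplified stand-in type as the earlier example (the flag list is illustrative, not the exact set the test passes): with the flags in Args instead of Command, the image's own ENTRYPOINT is kept and the flags are simply appended to it.

```go
package main

import "fmt"

// containerSpec is the same simplified stand-in used above, not the real API type.
type containerSpec struct {
	Image   string
	Command []string // left empty: the image's ENTRYPOINT is used
	Args    []string // appended to the image's entrypoint as arguments
}

func main() {
	fixed := containerSpec{
		Image: "gcr.io/google_containers/nettest:1.1",
		Args:  []string{"-service=nettest"}, // passed as an argument, not exec'd as the binary
	}
	fmt.Printf("%+v\n", fixed)
}
```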


@dchen1107
Member

@bgrant0607 It is not a bug introduced by #6100, since we know it is a breaking change. The fix should be straightforward.

@dchen1107 dchen1107 self-assigned this Apr 1, 2015
@dchen1107
Member

Just found that @pmorie already provided a fix for #6294, which is a dup of this one. Shouldn't we close this?

@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

@dchen1107: Unless you're fixing the hang, no. I forked that off to #6294.

@dchen1107
Member

What do you mean by hang? I thought that was the only issue reported here. Do we have a new issue now, or are you simply pointing to that performance issue where we suspected the watcher?

@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

The hang that this issue was opened for. It most recently occurred in #5629 (comment) and caused Jenkins to time out, losing the build. IIRC, there was another recurrence on gke-ci at a similar time.

It just happened that there were two networking-related failures yesterday, by coincidence. #6294 just happened to land right after we had a recurrence of this hang, so people glommed onto this bug. Which is why I forked it.

@jayunit100
Member

Yup, I'm seeing the same thing; I've run it many, many times.

@lavalamp
Member

lavalamp commented Apr 1, 2015

@pmorie's fix looks good to me, but it's not related to the OP.

The problem in the OP is caused by apiserver crashing (the only way I'm aware of for an nginx instance to give bad gateway errors). Can we get apiserver logs?

@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

@lavalamp: Why is the client not timing out? That seems like a bug, too.
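For reference, a minimal sketch of the kind of client-side guard being asked about (illustrative only, not the e2e client code; the URL is a placeholder): giving the HTTP client its own deadline makes a stuck proxy call return an error after a bounded wait instead of hanging for five minutes.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Client-side deadline: the request fails with a timeout error instead of
	// blocking indefinitely on a slow or stuck proxy endpoint.
	client := &http.Client{Timeout: 30 * time.Second}

	// Placeholder URL standing in for the apiserver proxy status endpoint.
	resp, err := client.Get("https://example.invalid/api/v1beta1/proxy/services/nettest/status")
	if err != nil {
		fmt.Println("proxy status call failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("proxy status call returned:", resp.Status)
}
```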

@lavalamp
Member

lavalamp commented Apr 1, 2015

@zmerlynn it looks like nginx is timing out waiting for apiserver. From our client's perspective, the connection is just fine but apiserver is slow sending data. So the fix is probably to make nginx not wait for apiserver.

Also we should figure out what happened to apiserver, and fix that, too, of course.

@lavalamp
Member

lavalamp commented Apr 1, 2015

Looking at the logs further down in the thread...

It's also possible that apiserver's proxy is incorrectly waiting for something that's not answering; if so that'd be a totally different issue from apiserver crashing and nginx bad gatewaying, even though the symptom would be similar.

@a-robinson
Contributor

Should this issue be closed? It doesn't look like the networking test has had any problems in the last couple of weeks.

@jayunit100
Member

Yup, close this guy. This isn't an issue anymore:

  • It generally runs in 15 seconds or less (on a single node).
  • It's usually less than a minute on clusters, in my experience; the last time we ran it was 40 seconds on a 20-node cluster.
