Networking test timing out on Jenkins #5629

Closed
zmerlynn opened this issue Mar 19, 2015 · 39 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@zmerlynn
Member

cc @ixdy @satnam6502

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:180
STEP: Creating service with name nettest in namespace nettest-9970
STEP: Creating a replication controller
STEP: Waiting for connectivity to be verified
Attempt 0/60: service/pod still starting. (error: 'endpoints "nettest" not found')
Attempt 1/60: service/pod still starting. (error: 'endpoints "nettest" not found')
Attempt 2/60: service/pod still starting. (error: 'endpoints "nettest" not found')
Attempt 3/60: test still running
Attempt 4/60: test still running
Attempt 5/60: test still running
Attempt 6/60: test still running
Attempt 7/60: test still running
Attempt 8/60: test still running
Attempt 9/60: test still running
Attempt 10/60: test still running
Attempt 11/60: test still running
Attempt 12/60: test still running
Attempt 13/60: test still running
Attempt 14/60: test still running
Attempt 15/60: test still running
Attempt 16/60: test still running
Attempt 17/60: test still running
Attempt 18/60: test still running
Attempt 19/60: test still running
Attempt 20/60: test still running
Attempt 21/60: test still running
Attempt 22/60: test still running
Attempt 23/60: test still running
Attempt 24/60: test still running
Attempt 25/60: test still running
Attempt 26/60: test still running
Attempt 27/60: test still running
Attempt 28/60: test still running
Attempt 29/60: test still running
Attempt 30/60: test still running
Attempt 31/60: test still running
Attempt 32/60: test still running
Attempt 33/60: test still running
Attempt 34/60: test still running
Attempt 35/60: test still running
Attempt 36/60: test still running
Attempt 37/60: test still running
Attempt 38/60: test still running
Attempt 39/60: test still running
Attempt 40/60: service/pod still starting. (error: 'request [&{Method:GET URL:https://104.197.4.134/api/v1beta1/proxy/services/nettest/status?namespace=nettest-9970 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[] Body:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:104.197.4.134 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil>}] failed (504) 504 Gateway Time-out: <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.2.1</center>
</body>
</html>
')
Attempt 41/60: service/pod still starting. (error: 'request [&{Method:GET URL:https://104.197.4.134/api/v1beta1/proxy/services/nettest/status?namespace=nettest-9970 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[] Body:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:104.197.4.134 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil>}] failed (504) 504 Gateway Time-out: <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.2.1</center>
</body>
</html>
')
Attempt 42/60: service/pod still starting. (error: 'request [&{Method:GET URL:https://104.197.4.134/api/v1beta1/proxy/services/nettest/status?namespace=nettest-9970 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[] Body:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:104.197.4.134 Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil>}] failed (504) 504 Gateway Time-out: <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.2.1</center>
</body>
</html>
')
@zmerlynn
Member Author

I think we need to revert the PRs that went in

@zmerlynn
Member Author

(The later commits, not the namespace one.)

@zmerlynn zmerlynn added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/test-infra labels Mar 19, 2015
@zmerlynn
Member Author

Wait, no. It's been timing out since 11am.

@satnam6502
Contributor

Yes, I am noticing it too, and I think there might be a deadlock somewhere which occurs just as the test tries to leave the running state and pass.

@zmerlynn
Member Author

I really wanted to blame you, but the first failure is at git v0.12.0-882-g7269960, which is shortly before your first PR went in. :)

@satnam6502
Contributor

I can't see anything in that which would cause the networking test to break.


@bprashanth
Contributor

FWIW, I had this failing ~an hour ago, but not on rebase.

@zmerlynn
Member Author

My guess is that it's #5555.

@satnam6502
Contributor

The tests seem to work for me:

Running Suite: Kubernetes e2e Suite run 1 of 1
==============================================
Random Seed: 1426726935 - Will randomize all specs
Will run 2 of 30 specs

SSSSSSSSSSSSSSSS
------------------------------
Networking 
  should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:193
[BeforeEach] Networking
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:39
>>> testContext.kubeConfig: /usr/local/google/home/satnam/.kube/.kubeconfig
[It] should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:193
STEP: Creating service with name nettest in namespace nettest-6777
STEP: Creating a webserver pod on each node
INFO: Creating pod nettest-0 on node e2e-test-satnam-minion-rkqu.c.kubernetes-satnam.internal
INFO: Creating pod nettest-1 on node e2e-test-satnam-minion-s7sh.c.kubernetes-satnam.internal
STEP: Wait for the webserver pods to be ready
STEP: waiting up to 5m0s for pod nettest-0 status to be running
INFO: Waiting for pod nettest-0 in namespace nettest-6777 status to be "running" (found "Pending") (406.567337ms)
INFO: Waiting for pod nettest-0 in namespace nettest-6777 status to be "running" (found "Pending") (5.458883651s)
STEP: waiting up to 5m0s for pod nettest-1 status to be running
INFO: Waiting for pod nettest-1 in namespace nettest-6777 status to be "running" (found "Pending") (43.816791ms)
INFO: Waiting for pod nettest-1 in namespace nettest-6777 status to be "running" (found "Pending") (5.089164552s)
STEP: Waiting for connectivity to be verified
Attempt 0/60: test still running
Attempt 1/60: test still running
Attempt 2/60: test still running
Attempt 3/60: test still running
Passed on attempt 4. Cleaning up.
STEP: Cleaning up the webserver pods
STEP: Cleaning up the service

• [SLOW TEST:32.236 seconds]
Networking
/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:215
  should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:193
------------------------------
SSS
------------------------------
Networking 
  should provide unchanging URLs
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:214
[BeforeEach] Networking
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:39
>>> testContext.kubeConfig: /usr/local/google/home/satnam/.kube/.kubeconfig
[It] should provide unchanging URLs
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:214
STEP: testing: /validate
STEP: testing: /healthz
•SSSSSSSSS
Ran 2 of 30 Specs in 32.689 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 28 Skipped I0318 18:02:48.681366   31568 driver.go:89] All tests pass

@zmerlynn
Member Author

This test is now passing with gusto. I'm confused.

@zmerlynn zmerlynn reopened this Mar 31, 2015
@zmerlynn
Member Author

This is the same thing @satnam6502 put the logging in for, but it's snuck back in, now with more logging:

Networking 
  should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:203
[BeforeEach] Networking
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:39
>>> testContext.kubeConfig: /var/lib/jenkins/jobs/kubernetes-e2e-gce/workspace/.kube/.kubeconfig
[It] should function for pods
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/networking.go:203
STEP: Creating service with name nettest in namespace nettest-409
STEP: Creating a webserver pod on each node
INFO: Creating pod nettest-0 on node e2e-test-jenkins-minion-f0u9.c.kubernetes-jenkins.internal
INFO: Creating pod nettest-1 on node e2e-test-jenkins-minion-t24l.c.kubernetes-jenkins.internal
STEP: Wait for the webserver pods to be ready
STEP: waiting up to 5m0s for pod nettest-0 status to be running
INFO: Waiting for pod nettest-0 in namespace nettest-409 status to be "running" (found "Pending") (5.993637ms)
STEP: waiting up to 5m0s for pod nettest-1 status to be running
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (3.633675ms)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (5.008000484s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (10.012479962s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (15.018019974s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (20.022416932s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (25.027329461s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (30.032123805s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (35.037695148s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (40.042410693s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (45.0471947s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (50.051412829s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (55.056787917s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m0.061367203s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m5.066695731s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m10.071522224s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m15.076154792s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m20.080382614s)
INFO: Waiting for pod nettest-1 in namespace nettest-409 status to be "running" (found "Pending") (1m25.085091455s)
STEP: Waiting for connectivity to be verified
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.003636218s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004278343s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 6.570714ms
INFO: Attempt 2/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5.693632ms
INFO: Attempt 3/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.00408193s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004904434s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.005336923s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.005038183s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.005179133s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004361386s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 6.605119ms
INFO: Attempt 10/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5.692527ms
INFO: Attempt 11/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004285752s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 7.01869ms
INFO: Attempt 13/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004841588s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 9.717108ms
INFO: Attempt 15/60: test still running
INFO: About to make a proxy status call
INFO: Proxy status call returned in 5m0.004179451s
INFO: About to make a proxy status call
Build timed out (after 90 minutes). Marking the build as aborted.
Build timed out (after 90 minutes). Marking the build as failed.

We've had a couple of aborted Jenkins builds recently due to this.

@j3ffml
Contributor

j3ffml commented Apr 1, 2015

From running the test locally, the failure I see is that the webserver container cannot run.

{
    "kind": "Pod",
    "id": "nettest-0",
    "uid": "6384d3eb-d804-11e4-80af-42010af08eed",
    "creationTimestamp": "2015-04-01T00:16:48Z",
    "selfLink": "/api/v1beta1/pods/nettest-0?namespace=nettest-7301",
    "resourceVersion": 682,
    "apiVersion": "v1beta1",
    "namespace": "nettest-7301",
    "labels": {
        "name": "nettest"
    },
    "desiredState": {
        "manifest": {
            "version": "v1beta2",
            "id": "",
            "volumes": null,
            "containers": [
                {
                    "name": "webserver",
                    "image": "gcr.io/google_containers/nettest:1.1",
                    "ports": [
                        {
                            "containerPort": 8080,
                            "protocol": "TCP"
                        }
                    ],
                    "resources": {},
                    "terminationMessagePath": "/dev/termination-log",
                    "imagePullPolicy": "PullIfNotPresent",
                    "capabilities": {}
                }
            ],
            "restartPolicy": {
                "never": {}
            },
            "dnsPolicy": "ClusterFirst"
        },
        "host": "e2e-test-jeffml-minion-8w7n.c.jeffmltest.internal"
    },
    "currentState": {
        "manifest": {
            "version": "",
            "id": "",
            "volumes": null,
            "containers": null,
            "restartPolicy": {}
        },
        "status": "Waiting",
        "Condition": [
            {
                "kind": "Ready",
                "status": "None"
            }
        ],
        "host": "e2e-test-jeffml-minion-8w7n.c.jeffmltest.internal",
        "hostIP": "104.197.4.9",
        "info": {
            "POD": {
                "state": {
                    "termination": {
                        "exitCode": 2,
                        "startedAt": "2015-04-01T00:16:49Z",
                        "finishedAt": "2015-04-01T00:16:58Z"
                    }
                },
                "ready": false,
                "restartCount": 0,
                "image": "kubernetes/pause:latest",
                "imageID": "docker://6c4579af347b649857e915521132f15a06186d73faa62145e3eeeb6be0e97c27",
                "containerID": "docker://20fb8035f4b0b3deee7cd9bae3e1a8c4e892298c9b72a989c8e4f81a96f48a7f"
            },
            "webserver": {
                "state": {
                    "waiting": {
                        "reason": "Container cannot run"
                    }
                },
                "ready": false,
                "restartCount": 0,
                "image": "gcr.io/google_containers/nettest:1.1",
                "imageID": "docker://a1320a4ec4cd92675d03f4d932c5a40977a864651573d242c0007a8fa5536985",
                "containerID": "docker://2dc9729cfd3f788fbd9234eccbb28cd620c34c9e41e4f5063158ab84e73e341f"
            }
        }
    }
}

This seems to be a catchall failure (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/kubelet/dockertools/docker.go#L618). @dchen1107 may have some idea what this error indicates.

@satnam6502
Contributor

If the pod's state here is "termination" then why does it show up as "Pending" in the test log output?


@dchen1107
Member

@jlowdermilk Which Kubernetes version are you running here? It looks like an old version, since currentState still has the container POD information.

The pod's state is terminated and the container doesn't run because its network container (POD) is dead. I don't know why yet. Could you please run

cat docker.log | grep "20fb8035f4b0b3deee7cd9bae3e1a8c4e892298c9b72a989c8e4f81a96f48a7f"

to see whether anything abnormal shows up there?

I can try to reproduce the issue later when I have VPN.

@j3ffml
Contributor

j3ffml commented Apr 1, 2015

I don't see anything abnormal in docker.log, but I'm also not sure what would count as abnormal. Here is the output.

When trying to repro, I saw the pod stuck in pending until the timeout was reached. While it was pending, this was the output of docker ps -a | grep nettest on a minion:

jeffml@e2e-test-jeffml-minion-8w7n:~$ sudo docker ps -a | grep nettest
3fb45954f92b        gcr.io/google_containers/nettest:1.1                      "-service=nettest -p   About an hour ago                                                      k8s_webserver.cdc9249f_nettest-0_nettest-7539_618da41b-d80a-11e4-8a8d-42010af08eed_b43e3b95
d74f85d6f37e        gcr.io/google_containers/pause:1.0                        "/pause"               About an hour ago   Exited (2) About an hour ago                       k8s_POD.98c00084_nettest-0_nettest-7539_618da41b-d80a-11e4-8a8d-42010af08eed_fa4e3f25

So yeah, the network container has been dead for some time. I see the same thing consistently when retrying the test.

@dchen1107
Member

OK, I realized that the POD container might be killed by the kubelet when it finds that none of the user containers are running and the restart policy is never.

Can you please post the output of 'cat docker.log | grep 2dc9729cfd3f788fbd9234eccbb28cd620c34c9e41e4f5063158ab84e73e341f' here?
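For what it's worth, a tiny hypothetical sketch of the rule being described here (simplified stand-in types and names, not the real kubelet code): with a restart policy of never, once every user container has exited, the infra ("POD") container has nothing left to serve and can be torn down, which would match the Exited pause container seen above.

```go
package main

import "fmt"

// containerStatus is an illustrative stand-in, not a real kubelet type.
type containerStatus struct {
	Name    string
	Running bool
}

// shouldKillInfraContainer sketches the behavior described above: if the pod's
// restart policy is "never" and no user container is still running, the
// network/infra ("POD") container can be killed as well.
func shouldKillInfraContainer(restartPolicy string, containers []containerStatus) bool {
	if restartPolicy != "never" {
		return false
	}
	for _, c := range containers {
		if c.Running {
			return false
		}
	}
	return true
}

func main() {
	statuses := []containerStatus{{Name: "webserver", Running: false}}
	fmt.Println(shouldKillInfraContainer("never", statuses)) // true
}
```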

@j3ffml
Contributor

j3ffml commented Apr 1, 2015

Ah, sorry, I finished working with that e2e cluster and tore it down. I'm spinning up a new one and will see if I can repro the above symptoms.

@dchen1107
Member

Don't worry, I will try to reproduce it later. Thanks!

@satnam6502
Contributor

I will also independently have a look at this, depending on how soon I can neutralize the kids.


@dchen1107
Member

OK, running e2e takes forever, sigh!

It is very easy to reproduce the issue; it just requires a long time. The docker log shows that the container cannot run because of a bad image:

Cannot start container 336ddc1f6a73209d416c1b7b8a789e7139d8f72ea940a0413dee823810bab3fc: exec: "-service=nettest": executable file not found in $PATH
time="2015-04-01T06:39:59Z" level="info" msg="-job start(336ddc1f6a73209d416c1b7b8a789e7139d8f72ea940a0413dee823810bab3fc) = ERR (1)" 

I believe this is a regression introduced by #6100, which changes the way Kubernetes uses ENTRYPOINT. The fix should be straightforward. I can provide it tomorrow morning.
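To make that failure mode concrete, here is a minimal hypothetical sketch (simplified stand-in types; the image entrypoint path is made up for illustration) of why putting the nettest flags in the container command breaks: the command maps to Docker's ENTRYPOINT and replaces the image's own, so Docker tries to exec "-service=nettest" as if it were the binary.

```go
package main

import "fmt"

// containerSpec is a simplified stand-in for the container API object.
type containerSpec struct {
	Image   string
	Command []string // maps to Docker ENTRYPOINT (replaces the image's)
	Args    []string // maps to Docker CMD (appended to the entrypoint)
}

// effectiveInvocation sketches what Docker ends up exec'ing: an explicit
// Command wins over the image entrypoint, so its first element must be an
// executable on $PATH inside the image.
func effectiveInvocation(imageEntrypoint []string, c containerSpec) []string {
	if len(c.Command) > 0 {
		return append(c.Command, c.Args...)
	}
	return append(imageEntrypoint, c.Args...)
}

func main() {
	imageEntrypoint := []string{"/nettest"} // hypothetical entrypoint baked into the image

	broken := containerSpec{
		Image:   "gcr.io/google_containers/nettest:1.1",
		Command: []string{"-service=nettest"}, // the flag becomes the "binary" to exec
	}
	// Prints [-service=nettest]: no such executable exists in the image, hence
	// "executable file not found in $PATH".
	fmt.Println(effectiveInvocation(imageEntrypoint, broken))
}
```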

@dchen1107
Member

cc/ @bgrant0607 :-)

@bgrant0607
Member

FYI @pmorie

@bgrant0607
Member

If #6100 is the culprit, it's a bug. The image's entrypoint is supposed to be used if entrypoint isn't provided.

@pmorie
Member

pmorie commented Apr 1, 2015

@bgrant0607 @dchen1107 @satnam6502 Taking a look to see if it's #6100

@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

#6100 definitely lines up with the beginning of the Networking test failures, which actually started after I reopened this bug, amusingly. This bug has taken on a different meaning, though. I reopened the bug for a totally different issue that we're still seeing, which is a hang while contacting the proxy endpoint. I'm forking that off and re-lowering the priority because we haven't seen another red dot in a little bit.

@zmerlynn zmerlynn added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Apr 1, 2015
@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

Forked to #6294. This bug should stay as the random, very serious issue where the nettest attempts to make a client call and hangs arbitrarily, which is a violation of all reasonable principles for dealing with timeouts.

@pmorie
Member

pmorie commented Apr 1, 2015

@bgrant0607 the code was functioning correctly; in this case the e2e test needed to be changed to set the args, not the command.
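A minimal sketch of that fix, using the same simplified stand-in type as the earlier example (the flag list is illustrative, not the exact set the test passes): with the flags in Args instead of Command, the image's own ENTRYPOINT is kept and the flags are simply appended to it.

```go
package main

import "fmt"

// containerSpec is the same simplified stand-in used above, not the real API type.
type containerSpec struct {
	Image   string
	Command []string // left empty: the image's ENTRYPOINT is used
	Args    []string // appended to the image's entrypoint as arguments
}

func main() {
	fixed := containerSpec{
		Image: "gcr.io/google_containers/nettest:1.1",
		Args:  []string{"-service=nettest"}, // passed as an argument, not exec'd as the binary
	}
	fmt.Printf("%+v\n", fixed)
}
```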


@dchen1107
Member

@bgrant0607 It is not a bug introduced by #6100, since we know it is a breaking change. The fix should be straightforward.

@dchen1107 dchen1107 self-assigned this Apr 1, 2015
@dchen1107
Member

Just found that @pmorie already provided a fix for #6294, which is a dup of this one. Shouldn't we close this?

@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

@dchen1107: Unless you're fixing the hang, no. I forked that off to #6294.

@dchen1107
Member

What do you mean by hang? I thought that was the only issue reported here. Do we have a new issue now, or are you simply pointing to that performance issue where we suspected the watcher?

@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

The hang that this issue was opened for. It most recently occurred in #5629 (comment) and caused Jenkins to time out, losing the build. IIRC, there was another recurrence on gke-ci at a similar time.

It just happened that there were two networking-related failures yesterday, by coincidence. #6294 just happened to land right after we had a recurrence of this hang, so people glommed onto this bug. Which is why I forked it.

@jayunit100
Member

Yup, I'm seeing the same thing; I've run it many, many times.

@lavalamp
Member

lavalamp commented Apr 1, 2015

@pmorie's fix looks good to me, but it's not related to the OP.

The problem in the OP is caused by apiserver crashing (the only way I'm aware of for an nginx instance to give bad gateway errors). Can we get apiserver logs?

@zmerlynn
Member Author

zmerlynn commented Apr 1, 2015

@lavalamp: Why is the client not timing out? That seems like a bug, too.
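For reference, a minimal sketch of the kind of client-side guard being asked about (illustrative only, not the e2e client code; the URL is a placeholder): giving the HTTP client its own deadline makes a stuck proxy call return an error after a bounded wait instead of hanging for five minutes.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Client-side deadline: the request fails with a timeout error instead of
	// blocking indefinitely on a slow or stuck proxy endpoint.
	client := &http.Client{Timeout: 30 * time.Second}

	// Placeholder URL standing in for the apiserver proxy status endpoint.
	resp, err := client.Get("https://example.invalid/api/v1beta1/proxy/services/nettest/status")
	if err != nil {
		fmt.Println("proxy status call failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("proxy status call returned:", resp.Status)
}
```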

@lavalamp
Member

lavalamp commented Apr 1, 2015

@zmerlynn it looks like nginx is timing out waiting for apiserver. From our client's perspective, the connection is just fine but apiserver is slow sending data. So the fix is probably to make nginx not wait for apiserver.

Also we should figure out what happened to apiserver, and fix that, too, of course.

@lavalamp
Member

lavalamp commented Apr 1, 2015

Looking at the logs further down in the thread...

It's also possible that apiserver's proxy is incorrectly waiting for something that's not answering; if so that'd be a totally different issue from apiserver crashing and nginx bad gatewaying, even though the symptom would be similar.

@a-robinson
Contributor

Should this issue be closed? It doesn't look like the networking test has had any problems in the last couple of weeks.

@jayunit100
Member

Yup, close this guy. This isn't an issue anymore:

  • It generally runs in 15 seconds or less (on a single node).
  • It's usually less than a minute on clusters, in my experience; the last time we ran it was 40 seconds on a 20-node cluster.
