Networking test timing out on Jenkins #5629
I think we need to revert the PRs that went in
(The later commits, not the namespace one.)
Wait, no. It's been timing out since 11am.
Yes, I am noticing it too, and I think there might be a deadlock somewhere which occurs just as the test tries to leave the running state to pass.
I really wanted to blame you, but the first failure is at git
I can't see anything in that which would cause the networking test to break.
fwiw i had this failing an ~hour ago but not on rebase
My guess is that it's #5555.
The tests seem to work for me:
This test is now passing with gusto. I'm confused.
This is the same thing @satnam6502 added the logging for, but it's snuck back in, now with more logging:
We've had a couple of aborted Jenkins builds recently due to this.
From running the test locally, the failure I see is that the webserver container cannot run.
This seems to be a catchall failure. @dchen1107 may have some idea what this error indicates.
If the pod's state here is "termination" then why does it show up as S
@jlowdermilk Which kubernetes version are you running here? It looks like an old version, since the `currentstate` still has container POD information. The reason the pod state is terminated and the container doesn't run is that its network container (POD) is dead. I don't know why. Could you please run `cat docker.log | grep "20fb8035f4b0b3deee7cd9bae3e1a8c4e892298c9b72a989c8e4f81a96f48a7f"` to see if there is anything abnormal there? I can try to reproduce the issue later when I have VPN.
Don't see anything abnormal in docker.log, but also not sure what would be abnormal. Here is the output. When trying to repro, I saw the pod stuck in pending until the timeout was reached. While it was pending, this was the output of
So yeah, the network container has been dead for some time. I see the same thing consistently when retrying the test.
Ok, I realized that the POD container might be killed by the kubelet when the kubelet finds that all user containers are not running and the restart policy is Never. Can you please post the output of `cat docker.log | grep 2dc9729cfd3f788fbd9234eccbb28cd620c34c9e41e4f5063158ab84e73e341f` here?
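As a rough illustration of the rule described above, here is a minimal Go sketch with hypothetical names (not the actual kubelet code): once no user containers are running and the restart policy is Never, the network ("POD") infra container has no reason to stay alive and can be cleaned up.

```go
package main

import "fmt"

// shouldKillInfraContainer is a hypothetical helper sketching the behavior
// described above: when no user containers are running and the pod's restart
// policy is Never, the kubelet may tear down the network ("POD") infra
// container as well.
func shouldKillInfraContainer(runningUserContainers int, restartPolicy string) bool {
	return runningUserContainers == 0 && restartPolicy == "Never"
}

func main() {
	// All user containers have exited and the policy is Never, so the infra
	// container can be torn down -- which is why docker shows it as dead.
	fmt.Println(shouldKillInfraContainer(0, "Never")) // true
	// With a restart policy of Always, the infra container would be kept
	// around for restarts.
	fmt.Println(shouldKillInfraContainer(0, "Always")) // false
}
```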
Ah, sorry, I finished working with that e2e cluster and tore it down. I'm spinning up a new one and will see if I can repro the above symptoms. |
Don't worry, I will try to reproduce it later. Thanks!
I will also independently have a look at this, depending on how soon I can
Ok, running e2e takes forever, sigh! It is very easy to reproduce the issue; it just requires a long time. The docker log shows that the container cannot run because of a bad image:
I believe this is a regression introduced by #6100, which changes the way Kubernetes uses ENTRYPOINT. The fix should be straightforward. I can provide it tomorrow morning.
cc/ @bgrant0607 :-)
FYI @pmorie
If #6100 is the culprit, it's a bug. The image's entrypoint is supposed to be used if entrypoint isn't provided.
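A minimal sketch of the intended semantics stated here, using a hypothetical helper and illustrative binary paths (not the real Kubernetes implementation): a command set explicitly on the container spec overrides the image ENTRYPOINT, and when no command is given the image's own ENTRYPOINT must be used.

```go
package main

import "fmt"

// resolveCommand sketches the rule above: the container spec's command wins
// only when it is explicitly set; otherwise the image ENTRYPOINT is used.
func resolveCommand(specCommand, imageEntrypoint []string) []string {
	if len(specCommand) > 0 {
		return specCommand
	}
	return imageEntrypoint
}

func main() {
	// No command in the spec: the image ENTRYPOINT is preserved.
	fmt.Println(resolveCommand(nil, []string{"/webserver"}))
	// Explicit command: it takes precedence over the ENTRYPOINT.
	fmt.Println(resolveCommand([]string{"/bin/sh", "-c", "sleep 3600"}, []string{"/webserver"}))
}
```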
@bgrant0607 @dchen1107 @satnam6502 Taking a look to see if it's #6100
#6100 definitely lines up with the beginning of the Networking test failures, which actually started after I reopened this bug, amusingly. This bug has taken on a different meaning, though. I reopened the bug for a totally different issue that we're still seeing, which is a hang when contacting the proxy endpoint. I'm forking and re-lowering the priority because we haven't seen another red dot in a little bit.
Forked to #6294. This bug should stay as the random, very serious issue where the nettest attempts to make a client call and hangs arbitrarily, which is a violation of all reasonable principles dealing with timeouts.
@bgrant0607 the code was functioning correctly; in this case the e2e test
@bgrant0607 It is not a bug introduced by #6100, since we know it is a breaking change. The fix should be straightforward.
@dchen1107: Unless you're fixing the hang, no. I forked that off to #6294.
What do you mean by hang? I thought that was the only issue being reported here. Do we have a new issue, or are you just pointing to that performance issue where we suspected the watcher?
The hang that this issue was opened for. It most recently occurred in #5629 (comment) and caused Jenkins to time out, losing the build. IIRC, there was another recurrence earlier as well. It just happened that there were two networking-related failures yesterday, by coincidence. #6294 just happened to land right after we had a recurrence of this hang, so people glommed onto this bug. Which is why I forked it.
yup im seeing the same thing, ran many many times
@pmorie's fix looks good to me, but it's not related to the OP. The problem in the OP is caused by apiserver crashing (the only way I'm aware of for an nginx instance to give bad gateway errors); can we get apiserver logs?
@lavalamp: Why is the client not timing out? That seems like a bug, too.
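For context on why the hang is avoidable on the client side, here is a minimal Go sketch (the address below is a placeholder, not the test's real proxy endpoint): giving the HTTP client an overall deadline bounds the whole request, so a stalled proxy or apiserver surfaces as an error instead of wedging the test.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// The Timeout covers the entire request, including a server that accepts
	// the connection but never finishes sending a response.
	client := &http.Client{Timeout: 30 * time.Second}

	// Placeholder address for illustration only; the real test goes through
	// the apiserver's service proxy endpoint.
	resp, err := client.Get("http://localhost:8080/")
	if err != nil {
		fmt.Println("request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```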
@zmerlynn it looks like nginx is timing out waiting for apiserver. From our client's perspective, the connection is just fine but apiserver is slow sending data. So the fix is probably to make nginx not wait for apiserver. Also we should figure out what happened to apiserver, and fix that, too, of course.
Looking at the logs further down in the thread... It's also possible that apiserver's proxy is incorrectly waiting for something that's not answering; if so that'd be a totally different issue from apiserver crashing and nginx bad gatewaying, even though the symptom would be similar.
Should this issue be closed? It doesn't look like the networking test has had any problems in the last couple of weeks.
yup, close this guy. this isn't an issue anymore
cc @ixdy @satnam6502