Pod status not relayed in a timely fashion #6139
The last fast e2e run we have starts at 76b5b79. The first slow run we have starts at 8a7a127, so the culprits are: 243de3a Fix kubectl log for single-container pods
We had one run at 1.5h, one run at 1h, one run at 0.5h. The consistency after this chunk of commits has been ... special.
I'm having trouble reproducing this on my own, but I was able to piece together, I think, what's going on from the event stream. If I had to take a guess, the pod status transition from Pending to Running is very flaky/racy in some cases (possibly if the pod is transient, like a lot of our e2e pods are). This is just a working theory, though, based on some log munging: I'm working off of go/k8s-test/job/kubernetes-e2e-gce/3930/consoleText. I just curl'd it down to use grep. If you grep for But then a similar thing happens in
FWIW, I have no repro, and e2e runs are actually green now. :/ But we've been running the exact same code for the last several hours, so I'm looking for any theory that would explain that, too.
@zmerlynn Do you think reverting "etcd in a pod" helped the run become green?
@ArtfulCoder: Yes, clusters weren't turning up before that.
cc @dchen1107
Logging for frequency: this is still happening; it wasn't just some fluke. In GCE e2e build 3937, in the
Is the list of PRs you listed earlier still considered to be the set that might be causing the problem? If we want to help, should we just 'git checkout' at each of those commits and try running e2e?
Most likely. The problem is, it may be an initial cluster state of some sort. I'm not sure how, but it seems like some of the e2e clusters are born sticky and some aren't. So I haven't yet been able to reproduce it with single test runs: e.g., I tried to run just the /healthz liveness test and it showed very little variance around the ~20s mark, yet it was one of the ones that spiked on 3930.
I am synced to HEAD and running e2e now.
The e2e test passed: Ran 28 of 30 Specs in 1156.955 seconds. I didn't see any podStatus relay issue. Any more specific issues here? While running the e2e test, I also ran 'kubectl get pods' in another terminal, and saw messages like this for many tests:
But the output of kubectl get pods shows the pod running, and later already deleted. Where did you observe the podStatus delay issue? From the e2e test output? Is it possible the issue is in the test itself?
It's very possible the tests are doing something consistently wrong, but
Hm. The event stream is missing a fair amount when I poke at a lot of these builds, making me wonder if it's something as banal as slow pulls. There's something else really fishy going on, too: the
Yeah, I'm having trouble convincing myself this is anything other than a product bug. Here are build breadcrumbs from e2e-gce 3963. The
But ... am I misreading something about this event stream?
That looks an awful lot like it went from assigned to started in about 1s?
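For concreteness, here is a rough Go sketch of the timestamp arithmetic being eyeballed above: given the timestamps of the "scheduled" (assigned) and "started" events for a pod, compute the gap. The event values below are invented for illustration; in the actual debugging session they came from the apiserver's event stream.

```go
package main

import (
	"fmt"
	"time"
)

// event is a pared-down stand-in for the apiserver's event objects; only the
// fields needed for the timing comparison are kept.
type event struct {
	Reason    string
	Timestamp time.Time
}

func main() {
	// Invented timestamps for one pod, in emission order, standing in for the
	// real event stream from the failing build.
	events := []event{
		{"scheduled", time.Date(2015, 3, 30, 15, 41, 2, 0, time.UTC)},
		{"pulled", time.Date(2015, 3, 30, 15, 41, 2, 500000000, time.UTC)},
		{"started", time.Date(2015, 3, 30, 15, 41, 3, 0, time.UTC)},
	}

	byReason := map[string]time.Time{}
	for _, e := range events {
		byReason[e.Reason] = e.Timestamp
	}

	// If assigned -> started is ~1s, the kubelet was fast; a multi-minute wait
	// in the test points at status propagation (or the test) instead.
	fmt.Println("assigned -> started:", byReason["started"].Sub(byReason["scheduled"]))
}
```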
Actually, I have a theory based on what was going on with Jenkins at the time, too. We noticed both some odd memory pressure on Jenkins and this weird issue. It's possible the test code is actually getting sliced here for quite a while. Maybe the pod isn't running because it's already dead? We don't do a great job checking for that. Self-assigning and watching the timing now that Jenkins has more memory.
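To make the "already dead" concern concrete, here is a minimal Go sketch (not the actual e2e helper) of a wait predicate that also treats terminal phases as an answer, instead of polling for "Running" until the timeout when the pod has already raced past it. The function name and the phases-as-strings representation are assumptions for illustration.

```go
package main

import "fmt"

// podReachedRunning is a sketch (not the actual e2e helper) of a wait
// predicate that also treats terminal phases as an answer, so a pod that ran
// and died before we looked doesn't burn the whole timeout.
func podReachedRunning(phase string) (done bool, err error) {
	switch phase {
	case "Running":
		return true, nil
	case "Succeeded", "Failed":
		// The pod raced straight past Running; report that instead of
		// polling for a state it will never be in again.
		return true, fmt.Errorf("pod reached terminal phase %q before it was observed Running", phase)
	default: // "Pending", "Unknown": keep polling
		return false, nil
	}
}

func main() {
	done, err := podReachedRunning("Succeeded")
	fmt.Println(done, err)
}
```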
Please correct the issue title to reflect the real issue here. I don't think there are any issues related to PodStatus. There might be issues in the event mechanism, in the code watching the event stream, or simply in the test framework. Thanks!
Apologies if I'm wrong here, but don't the tests explicitly wait for pods to be "Running" as told by a
@dchen1107: The test isn't doing anything with the event stream. You can see an example loop here: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/test/e2e/util.go#L94 It is literally waiting for the pod's observable status to change. My comments about the event stream are because I was trying to debug it.
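For readers without the source handy, here is a minimal, self-contained Go sketch of the kind of status polling that loop performs; it is not the util.go code itself. getPodPhase is a hypothetical stand-in for the real client call, and the poll/timeout values are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// getPodPhase is a hypothetical stand-in for the real client call that fetches
// the pod's current status from the apiserver; it is stubbed so the sketch runs.
func getPodPhase(name string) (string, error) {
	return "Running", nil // stub: pretend the pod is already running
}

// waitForPodRunning polls the pod's observable status until it reports
// "Running" or the timeout expires. No events are consumed; only status.
func waitForPodRunning(name string, poll, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		phase, err := getPodPhase(name)
		if err == nil && phase == "Running" {
			return nil
		}
		time.Sleep(poll)
	}
	return errors.New("timed out waiting for pod " + name + " to be Running")
}

func main() {
	if err := waitForPodRunning("nginx", 5*time.Second, 5*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```

With a timeout on the order of five minutes (as mentioned later in the thread), any status-relay delay longer than that surfaces as a test failure rather than mere slowness.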
This is continuing even after Jenkins changes, on current code. I don't think it's a test issue.
(go/k8s-test/job/kubernetes-e2e-gke-ci/2872/consoleText is a recent build)
@dchen1107: It's very possible there's something wrong here with the test. But for a moment, treat me as a customer who has a set of tests and is seeing extremely variable performance on our product, while all you're seeing is a set of logs. Now, what are you going to instrument kubelet and/or apiserver with to help you diagnose the problem? Because that's the situation we're in right now. I tried to repro this issue with the following script and didn't get anywhere. Amazingly so, because this script doesn't even wait for add-ons to finish running (because it's running ginkgo-e2e straight), so the timing was actually pretty variable and had some pull-time variance. There's something a little more systemic about the full e2e run, possibly multiple e2e runs interacting non-hermetically, or containers getting purged as we churn through tests. I don't know. I'm about to try a repro using the seed from one of the failing runs instead. However, there's some issue here. Let's not assume it's the tests for now.
OK, I was confused by your earlier event analysis. I am looking into it.
Re-assigning it to myself. I think it is a race- and performance-related issue, but I need more data to validate my hypothesis. Will update soon.
By running e2e tests a couple of times today, I think I have enough information to validate my hypothesis:
Need to investigate why the kubelet and scheduler don't pick up the work in the first place. cc @wojtek-t, who is working on perf-related issues.
cc @satnam6502 @davidopp
By scanning through misc/perf-1.0 issues, this one looks like a dup of #6059.
I'm perplexed about why this is an issue for e2e tests on Jenkins. #6059 uses a higher number of nodes (50) and pods per node (30). The regular e2e uses two nodes and creates a few pods in each test (pods.go). I can't really reproduce this issue on my cluster running pods.go at all...
@yujuhong I don't know why Jenkins fails more often. When I reproduced the issue by running e2e a couple of times, I never actually saw an e2e failure. But from the e2e test output, I observed messages about waiting for a pod to run for a couple of minutes (above comment: #6139 (comment)). If the wait gets a little longer than 5 minutes, the test will fail. That is why I started to aggregate information from various sources for debugging. The only reason I could come up with for more failures on Jenkins is that the Jenkins machines might be slower.
The Jenkins machine is actually a 16-way / 60 GB machine. It's probably
@zmerlynn, hmm... I thought Jenkins uses g1-small GCE instances (for both master and minions), same as my e2e cluster. Is that not true?
It does, yes. I was referring to the machine the tests are getting executed on.
Ah... I see. Agreed that the test-executing machine is probably irrelevant at this point. In fact, I've run
Yeah, I've actually tried to isolate this using --ginkgo.focus as well, and failed. (See #6139 (comment), which is basically a brute-force version of that.) The main differences with Jenkins:
Otherwise, it looks like a pretty normal e2e run.
I'm pretty sure this one is a duplicate of #6059 (as @dchen1107 suggested above). Please take a look at #6059 (comment) as the first step towards an explanation of this issue. Anyway, I suggest closing this issue as a duplicate of #6059 and moving the whole discussion there.
Closing as dupe of #6059.
Jenkins e2es are taking nearly an hour and a half to complete now.
Will follow up with a full list of git hashes that could be causes in just a little bit.