Pod status not relayed in a timely fashion #6139

Closed
zmerlynn opened this issue Mar 28, 2015 · 38 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.

@zmerlynn
Member

Jenkins e2es are taking nearly an hour and a half to complete now.

Will follow up with a full list of git hashes that could be the cause in just a little bit.

@zmerlynn zmerlynn added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 28, 2015
@zmerlynn
Member Author

The last fast e2e run we have starts at 76b5b79. The first slow run we have starts at 8a7a127, so the culprits are:

243de3a Fix kubectl log for single-container pods
68ccb97 Remove the --machines SaltStack configuration on GCE
7750f35 Running etcd 2.0.5 in a pod on master
3bc9336 Merge pull request #6083 from jlowdermilk/fix-log
525bbfd Merge pull request #6103 from zmerlynn/remove_gce_node_names
3ad0e22 Kubelet: allow docker to examine terminated container
f8d8b09 Reduce some tab spacing in kubectl
296fafa Merge pull request #6129 from yujuhong/container_status
179fe87 Merge pull request #4442 from ArtfulCoder/docker_etcd
b9499b8 Merge pull request #6126 from ghodss/reduce-some-kubectl-spacing
848134f Revert "Running etcd 2.0.5 in a pod on master"
8a7a127 Merge pull request #6137 from GoogleCloudPlatform/revert-4442-docker_etcd

@zmerlynn zmerlynn added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 28, 2015
@zmerlynn
Member Author

We had one run at 1.5h, one run at 1h, one run at 0.5h. The consistency after this chunk of commits has been ... special.

@zmerlynn
Member Author

I'm having trouble reproducing this on my own, but I think I was able to piece together what's going on from the event stream. If I had to guess, the pod status transition from Pending to Running is very flaky / racy in some cases (possibly if the pod is transient, like a lot of our e2e pods are). This is just a working theory, though, based on some log munging:

I'm working off of go/k8s-test/job/kubernetes-e2e-gce/3930/consoleText. I just curl'd it down to use grep.

If you grep for liveness-http, you'll see one case of this. The e2e client code is waiting upwards of 4m for the pod to transition out of Pending. Towards the bottom is the cluster event stream for the e2e run (with some elisions):
08:52:38: assigned
08:53:25: pause pulled/created, a second later it's started
08:53:26: liveness-http started

But then a similar thing happens in Pods should contain environment variables for services. We wait nearly 4m for the first pod to be running, but if you look in the event stream for server-envvars-8fd9e975-d567-11e4-924b-42010af067aa, you'll see:
09:32:19: assigned
09:33:23: pause pulled/created/started
09:33:24: server-envvars-8fd9e975-d567-11e4-924b-42010af067aa started
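
For reference, the log munging here is roughly the following (just a sketch; CONSOLE_URL stands in for the actual Jenkins consoleText link):

# Pull the console text down once, then grep it two ways.
curl -s "${CONSOLE_URL}" -o /tmp/console.txt
# Client-side view: how long the e2e test polled for the pod.
grep "Waiting for pod liveness-http" /tmp/console.txt
# Cluster-side view: the scheduler/kubelet events for the same pod.
grep "liveness-http" /tmp/console.txt | grep -E "assigned|pulled|created|started"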

@zmerlynn zmerlynn added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed area/test-infra labels Mar 28, 2015
@zmerlynn zmerlynn changed the title e2e times shot up to nearly an hour and a half Pod status not relayed in a timely fashion Mar 28, 2015
@zmerlynn
Member Author

FWIW, I have no repro and e2e runs are actually green now. :/ But we've been running the exact same code for the last several hours, so I'm still looking for any theory that would explain it.

@ArtfulCoder
Contributor

@zmerlynn Do you think reverting "etcd in a pod" helped the run become green?

@zmerlynn
Member Author

@ArtfulCoder: Yes, clusters weren't turning up before that.

@zmerlynn
Member Author

cc @dchen1107

@zmerlynn
Member Author

Logging for frequency: this is still happening; it wasn't just some fluke. In GCE e2e build 3937, in the Pods should contain environment variables for services test, it took forever for pod client-envvars-843dc797-d58b-11e4-b0bf-42010af067aa to go from Pending to Running. In the previous build it was a matter of seconds.

@davidopp
Member

Is the list of PRs you listed earlier still considered to be the set that might be causing the problem? If we want to help, should we just git checkout each of those commits and try running e2e?

@zmerlynn
Member Author

Most likely. The problem is, it may be an initial cluster state of some sort. I'm not sure how, but it seems like some of the e2e clusters are born sticky and some aren't. So I haven't yet been able to reproduce it with single test runs: e.g., I tried to just run the /healthz liveness test and it showed very little variance around the ~20s mark, yet it was one of the tests that spiked in 3930.
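
(For concreteness, by a "single test run" I mean a focused invocation along these lines, the same one used in the repro script further down:)

hack/ginkgo-e2e.sh --ginkgo.focus=".*healthz.*"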

@dchen1107
Member

I synced to HEAD and am running e2e now.

@dchen1107
Member

e2e test passed:

Ran 28 of 30 Specs in 1156.955 seconds
SUCCESS! -- 28 Passed | 0 Failed | 0 Pending | 2 Skipped I0328 15:39:48.198616 23424 driver.go:94] All tests pass
watchEvents output: |
Project: golden-system-455
Zone: us-central1-b
current-context: "golden-system-455_e2e-test-dawnchen"

I didn't see any podStatus relay issue. Any more specific issues here?

While running the e2e tests, I also ran 'kubectl get pods' in another terminal, and saw messages like this for many tests:

STEP: waiting up to 5m0s for pod pod-update-d7661541-d59a-11e4-a8fa-24be050deb0e status to be running
INFO: Waiting for pod pod-update-d7661541-d59a-11e4-a8fa-24be050deb0e in namespace default status to be "running" (found "Pending") (41.460927ms)

But the output of kubectl get pods shows the pod running, and later already deleted. Where did you observe the podStatus delay issue? From the e2e test output? Is it possible the issue is in the test itself?
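
(Roughly what I was running in the other terminal, as a sketch; the pod-name patterns are just examples:)

while true; do
  date '+%H:%M:%S'
  # Show what the apiserver currently reports for the test pods.
  cluster/kubectl.sh get pods | grep -E 'pod-update|server-envvars' || true
  sleep 5
done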

@zmerlynn
Member Author

It's very possible the tests are doing something consistently wrong, but they've been doing it this way for weeks. Read back to my earlier comments for why I think there was an issue, though: the test clients are reporting one thing (say, 4m to observe a pod go from Pending to Running), but it seems like from the event stream there's no delay here: the pod is assigned, the pause pod is created/started, and the pod is created/started all within seconds. So the event stream off the e2e runs is saying something different. But I might be missing something.

@zmerlynn
Member Author

Jenkins runs from last night showed considerable, very abnormal, variance:
[screenshot of Jenkins build-duration variance, 2015-03-29 9:16 AM]

@zmerlynn
Member Author

Hm. The event stream is missing a fair amount when I poke at a lot of these builds, making me wonder if it's something as banal as slow pulls. There's something else really fishy going on, too: the gke-ci project has relatively solid timing (it's still a little more variable than it should be), running the same code. But it makes me wonder if there's some resource in the e2e network on Jenkins that's affecting docker pulls, or something of that nature.

@zmerlynn
Member Author

Yeah, I'm having trouble convincing myself this is anything other than a product bug. Here are build breadcrumbs from e2e-gce 3963. The Pods should be restarted with a /healthz http liveness probe test says:

STEP: Creating pod liveness-http in namespace e2e-test-4ca0d444-d623-11e4-9c84-42010af067aa
STEP: waiting up to 5m0s for pod liveness-http status to be !pending
INFO: Waiting for pod liveness-http in namespace e2e-test-4ca0d444-d623-11e4-9c84-42010af067aa status to be "!pending" (found "Pending") (4.04483ms)
INFO: Waiting for pod liveness-http in namespace e2e-test-4ca0d444-d623-11e4-9c84-42010af067aa status to be "!pending" (found "Pending") (5.009523466s)
INFO: Waiting for pod liveness-http in namespace e2e-test-4ca0d444-d623-11e4-9c84-42010af067aa status to be "!pending" (found "Pending") (10.016968102s)
INFO: Waiting for pod liveness-http in namespace e2e-test-4ca0d444-d623-11e4-9c84-42010af067aa status to be "!pending" (found "Pending") (15.022880787s)
[... this continues for a while ...]
INFO: Waiting for pod liveness-http in namespace e2e-test-4ca0d444-d623-11e4-9c84-42010af067aa status to be "!pending" (found "Pending") (1m30.134534607s)
INFO: Saw pod liveness-http in namespace e2e-test-4ca0d444-d623-11e4-9c84-42010af067aa out of pending state (found "Running")

But ... am I misreading something about this event stream?

  Sun, 29 Mar 2015 07:08:18 -0700   Sun, 29 Mar 2015 07:08:18 -0700   1         liveness-http   Pod                 scheduled   {scheduler }   Successfully assigned liveness-http to e2e-test-jenkins-minion-6ir1.c.kubernetes-jenkins.internal
  Sun, 29 Mar 2015 07:08:18 -0700   Sun, 29 Mar 2015 07:08:18 -0700   1         liveness-http   Pod       implicitly required container POD   pulled    {kubelet e2e-test-jenkins-minion-6ir1.c.kubernetes-jenkins.internal}   Successfully pulled image "kubernetes/pause:latest"
  Sun, 29 Mar 2015 07:08:18 -0700   Sun, 29 Mar 2015 07:08:18 -0700   1         liveness-http   Pod       implicitly required container POD   created   {kubelet e2e-test-jenkins-minion-6ir1.c.kubernetes-jenkins.internal}   Created with docker id 443ef818d96d4335f20dcc00e0264eebcf8cd1a2e61c0c377d5a3f9ed1e8a669
  Sun, 29 Mar 2015 07:08:18 -0700   Sun, 29 Mar 2015 07:08:18 -0700   1         liveness-http   Pod       implicitly required container POD   started   {kubelet e2e-test-jenkins-minion-6ir1.c.kubernetes-jenkins.internal}   Started with docker id 443ef818d96d4335f20dcc00e0264eebcf8cd1a2e61c0c377d5a3f9ed1e8a669
  Sun, 29 Mar 2015 07:08:19 -0700   Sun, 29 Mar 2015 07:08:19 -0700   1         liveness-http   Pod       spec.containers{liveness}   pulled    {kubelet e2e-test-jenkins-minion-6ir1.c.kubernetes-jenkins.internal}   Successfully pulled image "kubernetes/liveness"
  Sun, 29 Mar 2015 07:08:19 -0700   Sun, 29 Mar 2015 07:08:19 -0700   1         liveness-http   Pod       spec.containers{liveness}   created   {kubelet e2e-test-jenkins-minion-6ir1.c.kubernetes-jenkins.internal}   Created with docker id 1069007cf616c3b2b3a7221844763cdcd49851ed48c3237ff12309a2ec5812a4
  Sun, 29 Mar 2015 07:08:19 -0700   Sun, 29 Mar 2015 07:08:19 -0700   1         liveness-http   Pod       spec.containers{liveness}   started   {kubelet e2e-test-jenkins-minion-6ir1.c.kubernetes-jenkins.internal}   Started with docker id 1069007cf616c3b2b3a7221844763cdcd49851ed48c3237ff12309a2ec5812a4

That looks an awful lot like it went from assigned to started in about 1s?

@zmerlynn
Member Author

Actually, I have a theory based on what was going on with Jenkins at the time, too. We noticed both some odd memory pressure on Jenkins and this weird issue. It's possible the test code was actually getting time-sliced out here for quite a while. Maybe the pod isn't running because it's already dead? We don't do a great job of checking for that. Self-assigning and watching the timing now that Jenkins has more memory.

@zmerlynn zmerlynn self-assigned this Mar 30, 2015
@zmerlynn zmerlynn added priority/backlog Higher priority than priority/awaiting-more-evidence. area/test-infra and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Mar 30, 2015
@dchen1107
Member

Please correct the issue title to reflect the real issue here. I don't think there are any issues related to PodStatus. It might be an issue with the event mechanism, the code watching the event stream, or simply the test framework. Thanks!

@mbforbes
Contributor

Apologies if I'm wrong here, but don't the tests explicitly wait for pods to be "Running" as reported by a get pods call? It seems like this rules out pods being killed (why would they then suddenly report "Running"? The tests pass). Also, the tests don't read the event stream, right?

@zmerlynn
Member Author

@dchen1107: The test isn't doing anything with the event stream. You can see an example loop here: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/test/e2e/util.go#L94 It literally is waiting for the pod's observable status to change.

My comments about the event stream are because I was trying to debug it.
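
For anyone who doesn't want to read the Go, a rough shell analogue of that wait loop looks like this (a sketch only, not the test code; it assumes the last column of kubectl get pods is the status, which may vary by client version):

pod=liveness-http                 # example pod name
deadline=$((SECONDS + 300))       # the test waits up to 5m
while (( SECONDS < deadline )); do
  # Take the pod's row and read its last column as the status.
  status=$(cluster/kubectl.sh get pods | awk -v p="${pod}" '$1 == p {print $NF}')
  echo "waiting for ${pod}: found \"${status:-<none>}\""
  if [[ -n "${status}" && "${status}" != "Pending" ]]; then
    break
  fi
  sleep 5
done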

@zmerlynn zmerlynn added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. area/test-infra labels Mar 30, 2015
@zmerlynn
Member Author

This is continuing even after the Jenkins changes, on current code. I don't think it's a test issue.

@zmerlynn zmerlynn added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 30, 2015
@zmerlynn
Member Author

(go/k8s-test/job/kubernetes-e2e-gke-ci/2872/consoleText is a recent build)

@zmerlynn
Member Author

@dchen1107: It's very possible there's something wrong here with the test. But for a moment, treat me as a customer who has a set of tests and is seeing extremely variable performance on our product, while all you're seeing is a set of logs. Now, what are you going to instrument kubelet and/or apiserver with to help you diagnose the problem? Because that's the situation we're in right now.

I tried to repro this issue with the following script and didn't get anywhere. Amazingly so, because this script doesn't even wait for add-ons to finish running (because it's running ginkgo-e2e directly), so the timing was actually pretty variable and had some pull-time variance. There's something a little more systemic to the full e2e run, possibly something non-hermetic between multiple interacting e2e runs, or containers getting purged as we churn through tests. I don't know. I'm about to try a repro using the seed from one of the failing runs instead.

However, there's some issue here. Let's not assume it's the tests for now.

set -o errexit

# Tear down any existing cluster before starting.
go run hack/e2e.go -down -v

while true; do
  # Bring a cluster up; if creation fails, tear down and retry.
  until go run hack/e2e.go -up -v; do
    echo
    echo CLUSTER CREATION FAILED
    echo
    go run hack/e2e.go -down -v
  done

  # Run only the healthz liveness spec and capture the output.
  hack/ginkgo-e2e.sh --ginkgo.focus=".*healthz.*" |& tee /tmp/e2e.log

  # Extract the whole seconds from the "Ran 1 of N Specs in X.Y seconds" line.
  time=$(grep "Ran 1 of" /tmp/e2e.log | cut -f 7 -d' ' | cut -f1 -d.)

  # Bail out loudly if the focused run took more than two minutes.
  if [[ "${time}" -gt 120 ]]; then
    echo "SO SLOW"
    exit 1
  fi

  go run hack/e2e.go -down -v
done

@dchen1107
Member

OK, I was confused by your earlier event analysis. I am looking into it now.

@dchen1107 dchen1107 assigned dchen1107 and unassigned zmerlynn Mar 30, 2015
@dchen1107
Member

Re-assigning it to myself. I think it is a race- and performance-related issue, but I need more data to validate my hypothesis. I will update soon.

@dchen1107 dchen1107 added this to the v1.0 milestone Mar 30, 2015
@dchen1107
Member

By running the e2e tests a couple of times today, I think I have enough information to validate my hypothesis:

  1. In one e2e run, I saw output like the following:
[It] should contain environment variables for services
  /go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/pods.go:396
STEP: waiting up to 5m0s for pod server-envvars-b6787a58-d720-11e4-b933-24be050deb0e status to be running
INFO: Waiting for pod server-envvars-b6787a58-d720-11e4-b933-24be050deb0e in namespace default status to be "running" (found "Pending") (40.426318ms)
...
INFO: Waiting for pod server-envvars-b6787a58-d720-11e4-b933-24be050deb0e in namespace default status to be "running" (found "Pending") (3m26.80330632s)
INFO: Waiting for pod server-envvars-b6787a58-d720-11e4-b933-24be050deb0e in namespace default status to be "running" (found "Pending") (3m31.859145221s)
INFO: Waiting for pod server-envvars-b6787a58-d720-11e4-b933-24be050deb0e in namespace default status to be "running" (found "Pending") (3m36.901863952s)
  2. kubectl get pods showed pod server-envvars-b6787a58-d720-11e4-b933-24be050deb0e in Pending state initially, without any events. This means the scheduler hadn't picked up the work to schedule it.
  3. After a while, I saw the first event related to the pod printed out, but the kubelet didn't pick up the work.
  4. At stage 3), I examined kubelet.log and found that heapster had started querying the pod's information 3 minutes before the kubelet first started to work on it:
# cat kubelet.log  | grep server-envvars
I0330 21:07:04.774355    4684 server.go:654] GET /stats/default/server-envvars-b6787a58-d720-11e4-b933-24be050deb0e/b68eca83-d720-11e4-9041-42010af01d6f/srv: (43.397µs) 404 [[Go 1.1 package http] 10.245.2.4:59452]
...
I0330 21:10:40.083923    4684 kubelet.go:1099] No Infra Container for "server-envvars-b6787a58-d720-11e4-b933-24be050deb0e_default" found. All containers will be restarted.
I0330 21:10:40.115632    4684 kubelet.go:1269] Creating pod infra container for "server-envvars-b6787a58-d720-11e4-b933-24be050deb0e_default"
I0330 21:10:40.135926    4684 event.go:138] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"server-envvars-b6787a58-d720-11e4-b933-24be050deb0e", UID:"b68eca83-d720-11e4-9041-42010af01d6f", APIVersion:"v1beta1", ResourceVersion:"2884", FieldPath:"implicitly required container POD"}): reason: 'pulled' Successfully pulled image "kubernetes/pause:latest"
...
I0330 21:11:01.230655    4684 event.go:138] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"server-envvars-b6787a58-d720-11e4-b933-24be050deb0e", UID:"b68eca83-d720-11e4-9041-42010af01d6f", APIVersion:"v1beta1", ResourceVersion:"2884", FieldPath:"implicitly required container POD"}): reason: 'killing' Killing 456d23b1ab44488b7028b58ba3027156dec79e1e34123b6850e7cc8c3ae7190e
  5. At the end of stage 4), I dumped kubectl get events and confirmed that the first event from the kubelet came more than 3 minutes after the event from the scheduler. (There are no events issued by kube-apiserver today.)
$ cluster/kubectl.sh get events | grep server-envvars
current-context: "golden-system-455_e2e-test-dawnchen"
Running: cluster/../cluster/gce/../../cluster/../_output/dockerized/bin/linux/amd64/kubectl get events
Mon, 30 Mar 2015 14:07:02 -0700   Mon, 30 Mar 2015 14:07:02 -0700   1         server-envvars-b6787a58-d720-11e4-b933-24be050deb0e              Pod                                                                                   scheduled          {scheduler }                                                           Successfully assigned server-envvars-b6787a58-d720-11e4-b933-24be050deb0e to e2e-test-dawnchen-minion-x38x.c.golden-system-455.internal
Mon, 30 Mar 2015 14:10:40 -0700   Mon, 30 Mar 2015 14:10:40 -0700   1         server-envvars-b6787a58-d720-11e4-b933-24be050deb0e              Pod       implicitly required container POD                                           started            {kubelet e2e-test-dawnchen-minion-x38x.c.golden-system-455.internal}   Started with docker id 456d23b1ab44488b7028b58ba3027156dec79e1e34123b6850e7cc8c3ae7190e
Mon, 30 Mar 2015 14:10:40 -0700   Mon, 30 Mar 2015 14:10:40 -0700   1         server-envvars-b6787a58-d720-11e4-b933-24be050deb0e              Pod       implicitly required container POD                                           created            {kubelet e2e-test-dawnchen-minion-x38x.c.golden-system-455.internal}   Created with docker id 456d23b1ab44488b7028b58ba3027156dec79e1e34123b6850e7cc8c3ae7190e
Mon, 30 Mar 2015 14:10:40 -0700   Mon, 30 Mar 2015 14:10:40 -0700   1         server-envvars-b6787a58-d720-11e4-b933-24be050deb0e              Pod       implicitly required container POD                                           pulled             {kubelet e2e-test-dawnchen-minion-x38x.c.golden-system-455.internal}   Successfully pulled image "kubernetes/pause:latest"
Mon, 30 Mar 2015 14:10:44 -0700   Mon, 30 Mar 2015 14:10:44 -0700   1         server-envvars-b6787a58-d720-11e4-b933-24be050deb0e              Pod       spec.containers{srv}                                                        created            {kubelet e2e-test-dawnchen-minion-x38x.c.golden-system-455.internal}   Created with docker id 8c4fbc802f1d0d03870fa2ba96d49a0ad43261e903dc3384d6cff1117a553738
Mon, 30 Mar 2015 14:10:44 -0700   Mon, 30 Mar 2015 14:10:44 -0700   1         server-envvars-b6787a58-d720-11e4-b933-24be050deb0e              Pod       spec.containers{srv}                                                        started            {kubelet e2e-test-dawnchen-minion-x38x.c.golden-system-455.internal}   Started with docker id 8c4fbc802f1d0d03870fa2ba96d49a0ad43261e903dc3384d6cff1117a553738
Mon, 30 Mar 2015 14:10:44 -0700   Mon, 30 Mar 2015 14:10:44 -0700   1         server-envvars-b6787a58-d720-11e4-b933-24be050deb0e              Pod       spec.containers{srv}                                                        pulled             {kubelet e2e-test-dawnchen-minion-x38x.c.golden-system-455.internal}   Successfully pulled image "kubernetes/serve_hostname"
Mon, 30 Mar 2015 14:11:00 -0700   Mon, 30 Mar 2015 14:11:00 -0700   1         server-envvars-b6787a58-d720-11e4-b933-24be050deb0e              Pod       spec.containers{srv}                                                        killing            {kubelet e2e-test-dawnchen-minion-x38x.c.golden-system-455.internal}   Killing 8c4fbc802f1d0d03870fa2ba96d49a0ad43261e903dc3384d6cff1117a553738
Mon, 30 Mar 2015 14:11:01 -0700   Mon, 30 Mar 2015 14:11:01 -0700   1         server-envvars-b6787a58-d720-11e4-b933-24be050deb0e              Pod       implicitly required container POD                                           killing            {kubelet e2e-test-dawnchen-minion-x38x.c.golden-system-455.internal}   Killing 456d23b1ab44488b7028b58ba3027156dec79e1e34123b6850e7cc8c3ae7190e

We need to investigate why the kubelet and scheduler didn't pick up the work in the first place.
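
A quick way to quantify that gap from outside the cluster (a sketch; the pod name is the one from the run above):

pod=server-envvars-b6787a58-d720-11e4-b933-24be050deb0e
# Compare the scheduler's "scheduled" timestamp with the kubelet's first
# pulled/created/started timestamps for the same pod.
cluster/kubectl.sh get events | grep "${pod}" | grep -E "scheduled|pulled|created|started"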

cc/ @wojtek-t, who is working on perf-related issues.

@dchen1107
Member

cc/ @satnam6502 @davidopp

@dchen1107
Member

By scanning through misc/perf-1.0 issues, this one looks like a dup of #6059

@yujuhong
Contributor

I'm perplexed about why this is an issue for e2e tests on Jenkins. #6059 uses a higher number of nodes (50) and pods per node (30). The regular e2e run uses two nodes and creates a few pods in each test (pods.go). I can't really reproduce this issue on my cluster running pods.go at all...

@dchen1107
Member

@yujuhong I don't know why Jenkins fails more often. When I reproduced the issue by running e2e a couple of times, I never actually saw an e2e failure. But in the e2e test output, I observed messages about waiting a couple of minutes for a pod to run (above comment: #6139 (comment)).

If the wait gets a little longer than 5 minutes, the test will fail. That is why I started aggregating information from various sources for debugging.

The only reason I can come up with for more failures on Jenkins is that the Jenkins machines might be slower.

@zmerlynn
Member Author

The Jenkins machine is actually a 16-way / 60 GB machine. It's probably beefier than your desktop.


@yujuhong
Contributor

@zmerlynn, hmm... I thought Jenkins uses g1-small GCE instances (for both master and minions), same as my e2e cluster. Is that not true?

@zmerlynn
Member Author

It does, yes. I was referring to the machine the tests are getting executed from, which I suppose has been eliminated as a factor at this point.


@yujuhong
Contributor

Ah... I see. Agreed that the test executing machine is probably irrelevant at this point.

In fact, I've run --ginkgo.focus=Pods* more than 6 times today and not once did I observe any delay in getting the "running" pod status. It's interesting that we'd run into a performance issue at such a low load (2 nodes, fewer than 10 pods total in the cluster). It might help to know if we run tests on Jenkins any differently.

@zmerlynn
Member Author

Yeah, I've actually tried to isolate this using --ginkgo.focus as well, and failed. (See #6139 (comment), which is basically brute force of that.)

The main differences with Jenkins:

  • It executes so many more times (of course, if you have your e2es in a tight loop, this argument is moot)
  • Its project is sometimes cluttered. Except that this morning, I upgraded it, and at that point, it was very quiet, so I had a chance to see if anything was really that cluttered. Turns out, we weren't leaking that many GCE resources.
  • It's running concurrent with other e2e runs, but they're on separate networks. (gce, gke, gke-test, gke-ci). This hasn't caused issues in a while.

Otherwise, it looks like a pretty normal e2e run.

@wojtek-t
Member

I'm pretty sure this one is a duplicate of #6059 (as @dchen1107 suggested above).

Please take a look at #6059 (comment) as a first step towards an explanation of this issue.
If my hypothesis is correct, we can hit this problem when there are a lot of writes to etcd that don't modify pods after the watch channel is opened. I think this can happen in e2e tests, but it's not very probable, which is why I think it's not easy to reproduce.

Anyway, I suggest closing this issue as a duplicate of #6059 and moving the whole discussion there.

@zmerlynn
Member Author

Closing as dupe of #6059.
