Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump ectd client and fix initializing IP/Port allocators when etcd is not reachable #9862

Merged
merged 2 commits into from
Jun 19, 2015

Conversation

fgrzadkowski
Copy link
Contributor

This PR reintroduces newer version of go-etcd client library (reverted in #9447). It also adds additional debugging to service allocator code to help debug potential failures that I was not able to reproduce locally.

@wojtek-t @smarterclayton
@saad-ali (currentl oncall)

NOTE 1: If you see other places where logging might be useful please add a comment.
NOTE 2: Please do not merge this PR. I will do this tomorrow morning so that I can closely watch jenkins and revert it once I gather more data for debugging.

@k8s-bot
Copy link

k8s-bot commented Jun 16, 2015

GCE e2e build/test failed for commit cee8c22ea2d794c42a3e369756c1a3d97a0b43c0.

@wojtek-t
Copy link
Member

As discussed offline - it seems that the Jenkins failure is exactly the reproduction of the problem.

@mnemotiv
Copy link

I don't know where it should be put and whether it's related, but it would be very helpful if while using --etcd-config (go-etcd) flag it would output any errors to kube-apiserver, because for now - if it fails - no logs are being shown.

@fgrzadkowski
Copy link
Contributor Author

@k8s-bot test this please

@k8s-bot
Copy link

k8s-bot commented Jun 17, 2015

GCE e2e build/test failed for commit cee8c22ea2d794c42a3e369756c1a3d97a0b43c0.

@fgrzadkowski
Copy link
Contributor Author

@k8s-bot test this please

@k8s-bot
Copy link

k8s-bot commented Jun 17, 2015

GCE e2e build/test failed for commit cee8c22ea2d794c42a3e369756c1a3d97a0b43c0.

@fgrzadkowski
Copy link
Contributor Author

@k8s-bot test this please

@k8s-bot
Copy link

k8s-bot commented Jun 17, 2015

GCE e2e build/test failed for commit b677cef5030af461fd26544748817e352b38adf9.

@k8s-bot
Copy link

k8s-bot commented Jun 17, 2015

GCE e2e build/test failed for commit 985cb1cb5450f3c4c0ccb619fd8924a4bd8193a2.

@k8s-bot
Copy link

k8s-bot commented Jun 18, 2015

GCE e2e build/test failed for commit cbe7207bf59d860161b3ee611e9a679c42bf47a4.

@fgrzadkowski
Copy link
Contributor Author

@k8s-bot test this please

@k8s-bot
Copy link

k8s-bot commented Jun 18, 2015

GCE e2e build/test failed for commit cbe7207bf59d860161b3ee611e9a679c42bf47a4.

@k8s-bot
Copy link

k8s-bot commented Jun 18, 2015

GCE e2e build/test failed for commit 333c40b2b15e95333de5f74054eea4f4e2528365.

@k8s-bot
Copy link

k8s-bot commented Jun 18, 2015

GCE e2e build/test failed for commit c2bf8f5abe8f5c51701becc47c75b3dc018504d3.

…evert-9390-bump_ectd_client"

This reverts commit b68e08f, reversing
changes made to b244974.
@k8s-bot
Copy link

k8s-bot commented Jun 19, 2015

GCE e2e build/test passed for commit b7032521728b4c2d28ad36d546a74812899cc923.

@fgrzadkowski
Copy link
Contributor Author

@k8s-bot test this please

@k8s-bot
Copy link

k8s-bot commented Jun 19, 2015

GCE e2e build/test passed for commit b7032521728b4c2d28ad36d546a74812899cc923.

@fgrzadkowski fgrzadkowski changed the title [Do not merge] Bump ectd client and add more logging to allocator code Bump ectd client and fix initializing IP/Port allocators when etcd is not reachable Jun 19, 2015
@fgrzadkowski
Copy link
Contributor Author

This is ready for review.

Assigning to @smarterclayton since one of the commits fixes a bug in allocator code.

@k8s-bot
Copy link

k8s-bot commented Jun 19, 2015

GCE e2e build/test passed for commit d06461f119904efb4be7aae5573a7bba043d41cf.

@fgrzadkowski
Copy link
Contributor Author

@davidopp
Results from density test with 30 pods per node in 100 node cluster with n1-standard-4 master:

INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:1.605515s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:1.270142s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.5 Latency:529.251ms}
INFO: Top latency metric: {Verb:POST Resource:bindings Quantile:0.99 Latency:444.361ms}
INFO: Top latency metric: {Verb:PUT Resource:pods Quantile:0.99 Latency:426.816ms}

This is way better (~25%) than the usual run for density test.

@k8s-bot
Copy link

k8s-bot commented Jun 19, 2015

GCE e2e build/test passed for commit 6ecba7b364d385f4d6f5e71de79cb75aecc53fed.

@fgrzadkowski
Copy link
Contributor Author

Shippable complains about hack/run-gendocs.sh but I run it locally and it was a no-op :/

@smarterclayton I can't find any of your comments (even those I've already addressed). Is there anything else I should fix with this PR? It'd be great to have it merged today.

@smarterclayton
Copy link
Contributor

Had one comment that got last.

@smarterclayton
Copy link
Contributor

LGTM but I'm not sure what the shippable failure is.

@smarterclayton smarterclayton added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2015
@k8s-bot
Copy link

k8s-bot commented Jun 19, 2015

GCE e2e build/test passed for commit 4c0c7dd.

@satnam6502
Copy link
Contributor

Issue? Risk level? Milestone? Does it need to be merged anyway?

@fgrzadkowski
Copy link
Contributor Author

@satnam6502 This fixes #9444 and will improve performance by reducing cpu usage on master machine. Based on discussion with @davidopp we want to merge this change before v1.0.

I'd say that risk is medium-high due to multiple changes in etcd client library since our last update.

I've run e2e tests multiple times (using jenkins and locally) and it looks good.

satnam6502 added a commit that referenced this pull request Jun 19, 2015
Bump ectd client and fix initializing IP/Port allocators when etcd is not reachable
@satnam6502 satnam6502 merged commit d96a9a1 into kubernetes:master Jun 19, 2015
@davidopp
Copy link
Member

Thanks for merging. I think this merits being included in 1.0 since it substantially improves performance.

@ghost
Copy link

ghost commented Jun 22, 2015

See #10146. There is quite possibly a bug in this PR - it appears to be causing recovery after an etcd failure not to work reliably.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants