Add nethealth container to prepull manifest #27019
Conversation
Force-pushed from 7661cbd to 5b01ed7.
Force-pushed from 5b01ed7 to 8b8e822.
Adds a network health check to debug test flakes caused by slow image pulls from GCR.
@girishkalele I think I didn't make my suggestion clear at #26626. What I really suggested was introducing another container, alongside the image-puller container, to run your nethealth check in the prepull pod, since that pod is one of the first pods running on the test nodes. Your current implementation only pulls the image but doesn't run the test.
I'm not sure why we'd want to prepull the image if we're going to add this as a container in the prepull pod anyway. It's just dead weight, no?
Force-pushed from 8b8e822 to 73663d1.
Force-pushed from 139bb50 to abefe7b.
I misunderstood your comment about adding to the manifest - still new here - I have now added a new container named nethealth-check and removed the nethealth image from the pre-pull image list.

$ kubectl logs --namespace=kube-system e2e-image-puller2 -c nethealth-check
2016/06/10 17:48:18 HTTP HEAD reports content length: 67108864 - running GET
2016/06/10 17:48:19 DOWNLOAD: 67108864 bytes 1103 ms Bandwidth ~ 59416 KiB/sec
2016/06/10 17:48:19 Hash Matches expected value
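For reference, a minimal Go sketch of the kind of probe this log output suggests - a HEAD for the content length, a timed GET, then a hash check. This is not the actual kube-nethealth code; the test URL and expected hash below are placeholders.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"io"
	"log"
	"net/http"
	"time"
)

// Placeholder values for illustration only; the real container points at a
// known test object with a known checksum.
const (
	testURL      = "https://storage.googleapis.com/example-bucket/64MiB-test-blob"
	expectedHash = "0123456789abcdef0123456789abcdef"
)

func main() {
	// HEAD first to report the expected content length.
	head, err := http.Head(testURL)
	if err != nil {
		log.Fatalf("HEAD failed: %v", err)
	}
	head.Body.Close()
	log.Printf("HTTP HEAD reports content length: %d - running GET", head.ContentLength)

	// Timed GET to estimate download bandwidth.
	start := time.Now()
	resp, err := http.Get(testURL)
	if err != nil {
		log.Fatalf("GET failed: %v", err)
	}
	defer resp.Body.Close()

	hasher := md5.New()
	n, err := io.Copy(hasher, resp.Body)
	if err != nil {
		log.Fatalf("read failed: %v", err)
	}
	elapsed := time.Since(start)
	log.Printf("DOWNLOAD: %d bytes %d ms Bandwidth ~ %.0f KiB/sec",
		n, elapsed.Milliseconds(), float64(n)/1024/elapsed.Seconds())

	// Compare the payload hash against the known value.
	if hex.EncodeToString(hasher.Sum(nil)) == expectedHash {
		log.Print("Hash Matches expected value")
	} else {
		log.Print("Hash mismatch")
	}
}
```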
This is what prints it: kubernetes/test/e2e/framework/framework.go line 245 (at cf234ab) and kubernetes/test/e2e/framework/util.go line 660 (at 66149c5).
Where's the code for kube-nethealth? It would be nice if it gathered background stats while the image puller is running, rather than just firing off a single curl request.
cpu: 100m
limits:
cpu: 100m
image: gcr.io/google_containers/kube-nethealth-amd64:1.0
This needs to exit or the pod will keep wasting resources on the node, and tests like the resource tracking test will fail waiting for it to stop. Ideally it should exit when the image puller exits. This is why I'd suggested breaking the for loop in the sh -c above out into an independent script and doing your stats collection in between each image pull.
I'm OK with quick and dirty for now if your container is just going to do some bounded, small number of queries within, e.g., 4 minutes, which is roughly how long the image puller takes.
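For illustration, a minimal Go sketch of that bounded quick-and-dirty approach, assuming a hypothetical probeOnce helper and a placeholder endpoint (not the actual nethealth container code):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// probeOnce stands in for a single nethealth query; the real check downloads
// and hashes a test blob rather than just issuing a HEAD. Hypothetical helper.
func probeOnce(url string) error {
	resp, err := http.Head(url)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	const (
		probes   = 8                // bounded, small number of queries
		interval = 30 * time.Second // 8 probes * 30s is roughly the image puller's runtime
	)
	url := "https://gcr.io/v2/" // placeholder endpoint

	for i := 0; i < probes; i++ {
		if err := probeOnce(url); err != nil {
			log.Printf("probe %d failed: %v", i, err)
		} else {
			log.Printf("probe %d ok", i)
		}
		time.Sleep(interval)
	}
	// Exiting here lets the container stop around the time the image puller
	// finishes, so the pod stops consuming node resources.
}
```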
Yes, this is just the seed for a more full-featured nethealth monitoring container.
The nethealth code is in the node-problem-detector repo: https://github.com/kubernetes/node-problem-detector/blob/master/pkg/util/nethealth/nethealth.go
@@ -128,4 +129,5 @@ func RegisterFlags() {
flag.BoolVar(&TestContext.GatherMetricsAfterTest, "gather-metrics-at-teardown", false, "If set to true framwork will gather metrics from all components after each test.")
flag.StringVar(&TestContext.OutputPrintType, "output-print-type", "hr", "Comma separated list: 'hr' for human readable summaries 'json' for JSON ones.")
flag.BoolVar(&TestContext.DumpLogsOnFailure, "dump-logs-on-failure", true, "If set to true test will dump data about the namespace in which test was running.")
flag.BoolVar(&TestContext.DumpNethealthLogs, "dump-nethealth-logs", true, "If set to true test will dump data about the network health checks from all nodes.") |
do we need a new flag for this?
I added the flag (default is true) to follow the same model as DumpLogsOnFailure - it looks like there is a use case for not printing logs sometimes.
nit: can we reuse the same flag, dump-logs-on-failure? Flag overload is not yet an issue with this binary, but it might be one day, like it is in the kubelet/apiserver (100+ flags, old flags preserved because codepaths have diverged). Node team people will happily tell you what a big problem flag fragmentation is in the internal kubelet equivalent :)
Ok, will do.
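To make the flag-reuse idea concrete, a small standalone sketch of gating both dumps behind the existing dump-logs-on-failure flag; the failure signal and dump steps here are placeholders, not the framework's actual code:

```go
package main

import (
	"flag"
	"log"
)

// Sketch of the flag-reuse suggestion: the existing dump-logs-on-failure flag
// gates both the namespace dump and the nethealth log dump, instead of adding
// a second dump-nethealth-logs flag.
var dumpLogsOnFailure = flag.Bool("dump-logs-on-failure", true,
	"If set to true test will dump data about the namespace in which test was running.")

func main() {
	flag.Parse()

	testFailed := true // placeholder for the framework's real failure signal
	if testFailed && *dumpLogsOnFailure {
		log.Print("dumping namespace info")
		log.Print("dumping nethealth-check container logs") // gated by the same flag
	}
}
```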
Force-pushed from 92d1b0b to 49068d0.
@@ -693,7 +696,18 @@ func LogPodsWithLabels(c *client.Client, ns string, match map[string]string) {
}
Logf("Running kubectl logs on pods with labels %v in %v", match, ns)
for _, pod := range podList.Items {
kubectlLogPod(c, pod)
kubectlLogPod(c, pod, "") |
nit: why not just pipe containerSubstr through into this func? I only see 2 uses of it.
I didn't understand your comment - do you mean iterate through the containers in the pod and call the function for every container?
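For what it's worth, a standalone sketch of what piping containerSubstr through could look like; the types and names are simplified stand-ins, not the framework's actual signatures:

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the framework's pod types; not the real client types.
type Container struct{ Name string }
type Pod struct {
	Name       string
	Containers []Container
}

// kubectlLogPodSketch shows the containerSubstr idea being discussed: an empty
// substring matches every container, a non-empty one restricts the log dump.
func kubectlLogPodSketch(pod Pod, containerSubstr string) {
	for _, c := range pod.Containers {
		if !strings.Contains(c.Name, containerSubstr) {
			continue
		}
		// The real framework would fetch and print this container's logs here.
		fmt.Printf("Logs of %s/%s:\n", pod.Name, c.Name)
	}
}

func main() {
	pod := Pod{
		Name:       "e2e-image-puller",
		Containers: []Container{{Name: "image-puller"}, {Name: "nethealth-check"}},
	}
	kubectlLogPodSketch(pod, "")          // empty substring: log every container
	kubectlLogPodSketch(pod, "nethealth") // only the nethealth-check container
}
```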
Force-pushed from 49068d0 to d3ac6a9.
lgtm, assuming @dchen1107 is ok with this. Applying label.
@girishkalele
Hmm, the last failure looks like exactly the sort of thing your PR is meant to help with:
Did it dump out enough information?
It did dump out information for that node - networking at startup was good.
Image puller logs from that node were full of errors; the image gcr.io/google_containers/porter failed to prepull.
Yeah, that's #27451.
@bprashanth I lost my lgtm label because of the flake... I added it again.
Force-pushed from d3ac6a9 to ee7ca66.
Reapplying lgtm, it was just a rebase, right?
Yes, thanks.
GCE e2e build/test passed for commit ee7ca66.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit ee7ca66.
Automatic merge from submit-queue
No description provided.