Add nethealth container to prepull manifest #27019
Conversation
Force-pushed from 7661cbd to 5b01ed7.
Force-pushed from 5b01ed7 to 8b8e822.
Adds a network health check to debug test flakes caused by slow image pulls from GCR.
@girishkalele I think I didn't make my suggestion clear at #26626. What I really suggested was introducing another container, alongside the image-puller container, to run your nethealth check in the prepull pod, since that pod is one of the first pods running on the test nodes. Your current implementation only pulls the image but doesn't run the test.
I'm not sure why we'd want to prepull the image if we're going to add this as a container in the prepull pod anyway. It's just dead weight, no?
Force-pushed from 8b8e822 to 73663d1.
Force-pushed from 139bb50 to abefe7b.
I misunderstood your comment about adding to the manifest - still new here - I have now added a new container named nethealth-check and removed the nethealth image from the pre-pull image list.

$ kubectl logs --namespace=kube-system e2e-image-puller2 -c nethealth-check
2016/06/10 17:48:18 HTTP HEAD reports content length: 67108864 - running GET
2016/06/10 17:48:19 DOWNLOAD: 67108864 bytes 1103 ms Bandwidth ~ 59416 KiB/sec
2016/06/10 17:48:19 Hash Matches expected value
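For reference, a minimal Go sketch of the kind of probe this log output suggests - a HEAD for the content length, a timed GET, then a hash check. This is not the actual kube-nethealth code; the test URL and expected hash below are placeholders.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"io"
	"log"
	"net/http"
	"time"
)

// Placeholder values for illustration only; the real container points at a
// known test object with a known checksum.
const (
	testURL      = "https://storage.googleapis.com/example-bucket/64MiB-test-blob"
	expectedHash = "0123456789abcdef0123456789abcdef"
)

func main() {
	// HEAD first to report the expected content length.
	head, err := http.Head(testURL)
	if err != nil {
		log.Fatalf("HEAD failed: %v", err)
	}
	head.Body.Close()
	log.Printf("HTTP HEAD reports content length: %d - running GET", head.ContentLength)

	// Timed GET to estimate download bandwidth.
	start := time.Now()
	resp, err := http.Get(testURL)
	if err != nil {
		log.Fatalf("GET failed: %v", err)
	}
	defer resp.Body.Close()

	hasher := md5.New()
	n, err := io.Copy(hasher, resp.Body)
	if err != nil {
		log.Fatalf("read failed: %v", err)
	}
	elapsed := time.Since(start)
	log.Printf("DOWNLOAD: %d bytes %d ms Bandwidth ~ %.0f KiB/sec",
		n, elapsed.Milliseconds(), float64(n)/1024/elapsed.Seconds())

	// Compare the payload hash against the known value.
	if hex.EncodeToString(hasher.Sum(nil)) == expectedHash {
		log.Print("Hash Matches expected value")
	} else {
		log.Print("Hash mismatch")
	}
}
```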
This is what prints it: kubernetes/test/e2e/framework/framework.go line 245 (at cf234ab) and kubernetes/test/e2e/framework/util.go line 660 (at 66149c5).
Where's the code for kube-nethealth? It would be nice if it gathered background stats while the image puller is running, rather than just firing off a single curl request.
cpu: 100m
limits:
cpu: 100m
image: gcr.io/google_containers/kube-nethealth-amd64:1.0
This needs to exit or the pod will keep wasting resources on the node, and tests like the resource tracking test will fail waiting for it to stop. Ideally it should exit when the image puller exits. This is why I'd suggested breaking the for loop in the sh -c above out into an independent script and doing your stats collection in between each image pull.
I'm OK with quick and dirty for now if your container is just going to do some bounded, small number of queries within, e.g., 4 minutes, which is roughly how long the image puller takes.
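For illustration, a minimal Go sketch of that bounded quick-and-dirty approach, assuming a hypothetical probeOnce helper and a placeholder endpoint (not the actual nethealth container code):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// probeOnce stands in for a single nethealth query; the real check downloads
// and hashes a test blob rather than just issuing a HEAD. Hypothetical helper.
func probeOnce(url string) error {
	resp, err := http.Head(url)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	const (
		probes   = 8                // bounded, small number of queries
		interval = 30 * time.Second // 8 probes * 30s is roughly the image puller's runtime
	)
	url := "https://gcr.io/v2/" // placeholder endpoint

	for i := 0; i < probes; i++ {
		if err := probeOnce(url); err != nil {
			log.Printf("probe %d failed: %v", i, err)
		} else {
			log.Printf("probe %d ok", i)
		}
		time.Sleep(interval)
	}
	// Exiting here lets the container stop around the time the image puller
	// finishes, so the pod stops consuming node resources.
}
```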
Yes, this is just the seed for a more full-featured nethealth monitoring container.
The nethealth code is in the node-problem-detector repo: https://github.com/kubernetes/node-problem-detector/blob/master/pkg/util/nethealth/nethealth.go
@@ -128,4 +129,5 @@ func RegisterFlags() {
flag.BoolVar(&TestContext.GatherMetricsAfterTest, "gather-metrics-at-teardown", false, "If set to true framwork will gather metrics from all components after each test.")
flag.StringVar(&TestContext.OutputPrintType, "output-print-type", "hr", "Comma separated list: 'hr' for human readable summaries 'json' for JSON ones.")
flag.BoolVar(&TestContext.DumpLogsOnFailure, "dump-logs-on-failure", true, "If set to true test will dump data about the namespace in which test was running.")
flag.BoolVar(&TestContext.DumpNethealthLogs, "dump-nethealth-logs", true, "If set to true test will dump data about the network health checks from all nodes.") |
do we need a new flag for this?
I added the flag (default is true) to follow the same model as DumpLogsOnFailure - it looks like there is a use case for not printing logs sometimes.
nit: can we reuse the same flag, dump-logs-on-failure? Flag overload is not yet an issue with this binary, but it might be one day, like it is in the kubelet/apiserver (100+ flags, old flags preserved because codepaths have diverged). Node team people will happily tell you what a big problem flag fragmentation is in the internal kubelet equivalent :)
Ok, will do.
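To make the flag-reuse idea concrete, a small standalone sketch of gating both dumps behind the existing dump-logs-on-failure flag; the failure signal and dump steps here are placeholders, not the framework's actual code:

```go
package main

import (
	"flag"
	"log"
)

// Sketch of the flag-reuse suggestion: the existing dump-logs-on-failure flag
// gates both the namespace dump and the nethealth log dump, instead of adding
// a second dump-nethealth-logs flag.
var dumpLogsOnFailure = flag.Bool("dump-logs-on-failure", true,
	"If set to true test will dump data about the namespace in which test was running.")

func main() {
	flag.Parse()

	testFailed := true // placeholder for the framework's real failure signal
	if testFailed && *dumpLogsOnFailure {
		log.Print("dumping namespace info")
		log.Print("dumping nethealth-check container logs") // gated by the same flag
	}
}
```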
Force-pushed from 92d1b0b to 49068d0.
@@ -693,7 +696,18 @@ func LogPodsWithLabels(c *client.Client, ns string, match map[string]string) {
}
Logf("Running kubectl logs on pods with labels %v in %v", match, ns)
for _, pod := range podList.Items {
kubectlLogPod(c, pod)
kubectlLogPod(c, pod, "") |
nit: why not just pipe containerSubstr through into this func? I only see 2 uses of it.
I didn't understand your comment - do you mean iterate through the containers in the pod and call the function for every container?
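For what it's worth, a standalone sketch of what piping containerSubstr through could look like; the types and names are simplified stand-ins, not the framework's actual signatures:

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the framework's pod types; not the real client types.
type Container struct{ Name string }
type Pod struct {
	Name       string
	Containers []Container
}

// kubectlLogPodSketch shows the containerSubstr idea being discussed: an empty
// substring matches every container, a non-empty one restricts the log dump.
func kubectlLogPodSketch(pod Pod, containerSubstr string) {
	for _, c := range pod.Containers {
		if !strings.Contains(c.Name, containerSubstr) {
			continue
		}
		// The real framework would fetch and print this container's logs here.
		fmt.Printf("Logs of %s/%s:\n", pod.Name, c.Name)
	}
}

func main() {
	pod := Pod{
		Name:       "e2e-image-puller",
		Containers: []Container{{Name: "image-puller"}, {Name: "nethealth-check"}},
	}
	kubectlLogPodSketch(pod, "")          // empty substring: log every container
	kubectlLogPodSketch(pod, "nethealth") // only the nethealth-check container
}
```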
Force-pushed from 49068d0 to d3ac6a9.
lgtm, assuming @dchen1107 is ok with this. Applying label.
@girishkalele
Hmm, the last failure looks like exactly the sort of thing your PR is meant to help with:
Did it dump out enough information?
It did dump out information for that node - networking at startup was good.
Image puller logs from that node were full of errors; the image gcr.io/google_containers/porter failed to prepull.
Yeah, that's #27451.
@bprashanth I lost my lgtm label because of the flake... I added it again.
Force-pushed from d3ac6a9 to ee7ca66.
Reapplying lgtm, it was just a rebase, right?
Yes, thanks.
GCE e2e build/test passed for commit ee7ca66.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit ee7ca66.
Automatic merge from submit-queue
No description provided.