
Add nethealth container to prepull manifest #27019

Merged 1 commit on Jun 17, 2016

Conversation

girishkalele

No description provided.

@k8s-github-robot k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note-label-needed labels Jun 8, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2016
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2016
@girishkalele girishkalele added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Jun 9, 2016
@girishkalele girishkalele added this to the v1.3 milestone Jun 9, 2016
@girishkalele
Author

girishkalele commented Jun 9, 2016

Adds a network health check to help debug test flakes caused by slow image pulls from GCR.
#26626

@eparis eparis added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2016
@dchen1107 dchen1107 self-assigned this Jun 10, 2016
@dchen1107 dchen1107 removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2016
@dchen1107
Member

@girishkalele I think I didn't make my suggestion at #26626 clear. What I was really suggesting is adding another container, in parallel with the image-puller container, to run your nethealth check in the prepull pod, since that pod is one of the first pods to run on the test nodes.

Your current implementation only pulls the image; it doesn't run the check.
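For illustration, a minimal sketch of what the suggested second container could look like in the prepull pod manifest (the nethealth-check name, image, and cpu values match what appears later in this PR; the surrounding YAML structure is illustrative):

containers:
  - name: image-puller
    # ... existing image-puller container, unchanged ...
  - name: nethealth-check
    image: gcr.io/google_containers/kube-nethealth-amd64:1.0
    resources:
      requests:
        cpu: 100m
      limits:
        cpu: 100m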

@dchen1107 dchen1107 added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 10, 2016
@bprashanth
Contributor

I'm not sure why we'd want to prepull this image if we're going to add it as a container in the prepull pod anyway. It's just dead weight, no?

@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 10, 2016
@girishkalele girishkalele force-pushed the nethealth2saltbase branch 2 times, most recently from 139bb50 to abefe7b on June 10, 2016 at 17:48
@k8s-github-robot k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 10, 2016
@girishkalele
Author

@dchen1107

I misunderstood your comment about adding to the manifest (still new here). I have now added a new container named nethealth-check and removed the image from the pre-pull image list.
Will the output of this container also show up in the node logs?
I need to specify the container name with -c to view its logs (since the logs command defaults to the image puller container).

$ kubectl logs --namespace=kube-system e2e-image-puller2 -c nethealth-check
2016/06/10 17:48:18 HTTP HEAD reports content length: 67108864 - running GET
2016/06/10 17:48:19 DOWNLOAD: 67108864 bytes 1103 ms Bandwidth ~ 59416 KiB/sec
2016/06/10 17:48:19 Hash Matches expected value

@bprashanth
Contributor

This is what prints it: LogPodsWithLabels(f.Client, api.NamespaceSystem, ImagePullerLabels); it should iterate through all containers (for _, container := range pod.Spec.Containers). You can trigger a test failure and confirm that it prints the output of your container.

Where's the code for kube-nethealth? It would be nice if it gathered background stats while the image puller is running vs just fire and forget a single curl request.

      cpu: 100m
    limits:
      cpu: 100m
  image: gcr.io/google_containers/kube-nethealth-amd64:1.0
Contributor

this needs to exit or the pod will continue wasting resources on the node and tests like the resource tracking test will fail waiting for it to stop. Ideally it should exit when the image puller exits. This is why I'd suggested breaking out the for loop in sh -c above into an independent script, and doing your stats collection in between each image pull.

I'm ok with quick and dirty for now, if your container is just going to do some bounded, small number of queries within eg: 4m, which is roughly how long the image puller will take.
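For illustration, a rough sketch of that suggestion: an image-pull script that takes a quick nethealth sample between pulls and exits when done, so the pod does not linger on the node. PREPULL_IMAGES, the /nethealth path, and the --single-check flag are placeholders, not the manifest's real names.

#!/bin/sh
# Sketch only: pull each test image, sample network health in between, then exit
# so the prepull pod stops consuming node resources.
for image in ${PREPULL_IMAGES}; do
  echo "$(date +%H:%M:%S) pulling ${image}"
  docker pull "${image}" || echo "failed to pull ${image}"
  /nethealth --single-check || echo "nethealth sample failed"
done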

Author

Yes, this is just the seed for a more full featured nethealth monitoring container.

@girishkalele
Author

The nethealth code is in the node-problem-detector repo here:

https://github.com/kubernetes/node-problem-detector/blob/master/pkg/util/nethealth/nethealth.go
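From the log lines quoted above, the check amounts to an HTTP HEAD for the content length, a timed GET, and a hash comparison. A minimal Go sketch of that flow is below; the URL and expected hash are placeholders, not the values the real tool uses.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder test object; the real nethealth check downloads a known object from GCS.
	const url = "https://storage.googleapis.com/example-bucket/64MiB-test-object"
	const expectedHash = "placeholder-sha256-hex" // placeholder for the object's known hash

	head, err := http.Head(url)
	if err != nil {
		log.Fatalf("HEAD failed: %v", err)
	}
	log.Printf("HTTP HEAD reports content length: %d - running GET", head.ContentLength)

	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("GET failed: %v", err)
	}
	defer resp.Body.Close()

	h := sha256.New()
	n, err := io.Copy(h, resp.Body) // stream the body through the hash while downloading
	if err != nil {
		log.Fatalf("download failed: %v", err)
	}
	ms := time.Since(start).Milliseconds()
	if ms == 0 {
		ms = 1
	}
	log.Printf("DOWNLOAD: %d bytes %d ms Bandwidth ~ %d KiB/sec", n, ms, n*1000/ms/1024)

	if hex.EncodeToString(h.Sum(nil)) == expectedHash {
		log.Printf("Hash Matches expected value")
	} else {
		log.Printf("Hash does not match expected value")
	}
}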

@eparis eparis removed their assignment Jun 10, 2016
@@ -128,4 +129,5 @@ func RegisterFlags() {
flag.BoolVar(&TestContext.GatherMetricsAfterTest, "gather-metrics-at-teardown", false, "If set to true framework will gather metrics from all components after each test.")
flag.StringVar(&TestContext.OutputPrintType, "output-print-type", "hr", "Comma separated list: 'hr' for human readable summaries 'json' for JSON ones.")
flag.BoolVar(&TestContext.DumpLogsOnFailure, "dump-logs-on-failure", true, "If set to true test will dump data about the namespace in which test was running.")
flag.BoolVar(&TestContext.DumpNethealthLogs, "dump-nethealth-logs", true, "If set to true test will dump data about the network health checks from all nodes.")
Contributor

do we need a new flag for this?

Author

I added the flag (default is true) to follow the same model as DumpLogsOnFailure - looks like there is a use-case for not printing logs sometimes.

Contributor

nit: can we reuse the same flag dump-logs-on-failure? flag overload is not yet an issue with this binary, but it might be one day like it is in the kubelet/apiserver (100+ flags, old flags are preserved because codepaths have diverged). Node team people will happily tell you what a big problem flag fragmentation is in the internal kubelet equivalent :)

Author

Ok, will do.
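For illustration, a sketch of what reusing the existing flag could look like; the hook name dumpAllLogsIfRequested and the *Framework parameter are assumptions, while DumpLogsOnFailure, LogPodsWithLabels, ImagePullerLabels, and api.NamespaceSystem come from the code quoted in this PR.

// Sketch only: ride on the existing --dump-logs-on-failure flag rather than adding
// a dedicated --dump-nethealth-logs flag.
func dumpAllLogsIfRequested(f *Framework) {
	if !TestContext.DumpLogsOnFailure {
		return
	}
	// The image-puller pod's logs now include the nethealth-check container,
	// so the usual per-pod, per-container dump covers them with no extra flag.
	LogPodsWithLabels(f.Client, api.NamespaceSystem, ImagePullerLabels)
}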

@@ -693,7 +696,18 @@ func LogPodsWithLabels(c *client.Client, ns string, match map[string]string) {
}
Logf("Running kubectl logs on pods with labels %v in %v", match, ns)
for _, pod := range podList.Items {
kubectlLogPod(c, pod)
kubectlLogPod(c, pod, "")
Contributor

nit: why not just pipe through containerSubstr into this func? I only see 2 uses of it.

Author

I didn't understand your comment. Do you mean iterate through the containers in the pod and call the function for every container?
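For illustration, one reading of the nit: pass the substring down from LogPodsWithLabels instead of hard-coding "" at the call site, and let kubectlLogPod filter as it iterates. The helper logContainerOrDie is a placeholder for however the framework actually fetches a container's logs, and the snippet assumes "strings" is imported.

// Sketch only, not the merged code.
func kubectlLogPod(c *client.Client, pod api.Pod, containerNameSubstr string) {
	for _, container := range pod.Spec.Containers {
		if strings.Contains(container.Name, containerNameSubstr) { // "" matches every container
			logContainerOrDie(c, pod.Namespace, pod.Name, container.Name) // placeholder helper
		}
	}
}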

@bprashanth
Contributor

lgtm, assuming @dchen1107 is ok with this; applying label.

@bprashanth bprashanth added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2016
@k8s-github-robot

@girishkalele
You must link to the test flake issue which caused you to request this manual re-test.
Re-test requests should be in the form of: k8s-bot test this issue: #<number>
Here is the list of open test flakes.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2016
@girishkalele
Author

@k8s-bot test this issue: #27335

@bprashanth
Contributor

Hmm, the last failure looks like exactly the sort of thing your PR is meant to help with:

Jun 15 10:34:16.379: INFO: At {2016-06-15 10:29:45 -0700 PDT} - event for proxy-service-kir7r-fedi4: {kubelet e2e-gce-agent-pr-8-0-minion-group-57ju} Failed: Failed to pull image "gcr.io/google_containers/porter:cd5cb5791ebaa8641955f0e8c2a9bed669b1eaab": Error pulling image (cd5cb5791ebaa8641955f0e8c2a9bed669b1eaab) from gcr.io/google_containers/porter, Get https://storage.googleapis.com/artifacts.google-containers.appspot.com/containers/images/0bac9e8a78faf81cff38ba54baaf63c3818dce51c1ea24d3a516ab8c1ca70a72/ancestry?access_token=ya29.CkECA5qsDgPRsetAhNEghwM6zpkv1fOnJiMaKhk00WMPQMeS1TQT-gpkvCqkZ6P1wrYKRQxFFnPCGSMpBYa1GAssyw: dial tcp 74.125.129.128:443: i/o timeout

Did it dump out enough information?

@girishkalele
Author

It did dump out information for that node - networking at startup was good.

STEP: Logs of kube-system/e2e-image-puller-e2e-gce-agent-pr-8-0-minion-group-57ju:nethealth-check on node e2e-gce-agent-pr-8-0-minion-group-57ju
Jun 15 10:27:48.079: INFO: nethealth : STARTLOG
2016/06/15 17:23:06 HTTP HEAD reports content length: 67108864 - running GET
2016/06/15 17:23:07 DOWNLOAD: 67108864 bytes 761 ms Bandwidth ~ 86118 KiB/sec
2016/06/15 17:23:07 Hash Matches expected value

@girishkalele
Author

Image puller logs from that node were full of errors; the image gcr.io/google_containers/porter failed to prepull.

STEP: Logs of kube-system/e2e-image-puller-e2e-gce-agent-pr-8-0-minion-group-57ju:image-puller on node e2e-gce-agent-pr-8-0-minion-group-57ju
Jun 15 10:34:17.494: INFO: image-puller : STARTLOG
17:22:20 pulling gcr.io/google_containers/busybox
17:22:50 pulling gcr.io/google_containers/busybox:1.24
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/busybox
17:22:50 pulling gcr.io/google_containers/dnsutils:e2e
17:22:58 pulling gcr.io/google_containers/eptest:0.1
17:22:59 pulling gcr.io/google_containers/fakegitserver:0.1
17:22:59 pulling gcr.io/google_containers/hostexec:1.2
17:23:05 pulling gcr.io/google_containers/iperf:e2e
17:23:13 pulling gcr.io/google_containers/jessie-dnsutils:e2e
17:23:21 pulling gcr.io/google_containers/liveness:e2e
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/liveness
17:23:22 pulling gcr.io/google_containers/mounttest:0.2
17:23:22 pulling gcr.io/google_containers/mounttest:0.5
17:23:22 pulling gcr.io/google_containers/mounttest:0.6
17:23:28 pulling gcr.io/google_containers/mounttest-user:0.3
17:23:33 pulling gcr.io/google_containers/netexec:1.4
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/netexec
17:23:33 pulling gcr.io/google_containers/netexec:1.5
17:23:54 pulling gcr.io/google_containers/nettest:1.7
17:23:55 pulling gcr.io/google_containers/nettest:1.8
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/nettest
17:23:55 pulling gcr.io/google_containers/nginx:1.7.9
17:24:02 pulling gcr.io/google_containers/nginx-slim:0.5
17:24:08 pulling gcr.io/google_containers/n-way-http:1.0
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/n-way-http
17:24:08 pulling gcr.io/google_containers/pause:2.0
17:24:09 pulling gcr.io/google_containers/pause-amd64:3.0
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/pause-amd64
17:24:09 pulling gcr.io/google_containers/porter:cd5cb5791ebaa8641955f0e8c2a9bed669b1eaab
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/porter
17:24:09 pulling gcr.io/google_containers/portforwardtester:1.0
17:24:09 pulling gcr.io/google_containers/redis:e2e
17:24:29 pulling gcr.io/google_containers/resource_consumer:beta2
17:24:41 pulling gcr.io/google_containers/serve_hostname:v1.4
17:24:42 pulling gcr.io/google_containers/servicelb:0.1
17:25:34 pulling gcr.io/google_containers/test-webserver:e2e
17:25:35 pulling gcr.io/google_containers/ubuntu:14.04
17:25:35 pulling gcr.io/google_containers/update-demo:kitten
17:26:10 pulling gcr.io/google_containers/update-demo:nautilus
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/update-demo
17:26:10 pulling gcr.io/google_containers/volume-ceph:0.1
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/volume-ceph
17:26:10 pulling gcr.io/google_containers/volume-gluster:0.2
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/volume-gluster
17:26:10 pulling gcr.io/google_containers/volume-iscsi:0.1
17:26:23 pulling gcr.io/google_containers/volume-nfs:0.6
Error response from daemon: name invalid: Requested repository does not match bearer token resource: google_containers/volume-nfs
17:26:23 pulling gcr.io/google_containers/volume-rbd:0.1
17:26:54 pulling gcr.io/google_samples/gb-redisslave:v1
Error pulling image (0.1) from gcr.io/google_containers/volume-rbd, Server error: Status 0 while fetching image layer (efda3e1d78611a63aa0c9a5cc8b70b9ad4a5732c6cac43ee30601ad448350fde)
Error pulling image (v1) from gcr.io/google_samples/gb-redisslave, Server error: Status 0 while fetching image layer (9b5f6b31f2b8dbe4ba89f91aa29ed4accf896a9df00a22fe6bdf1933aabefa45)

@bprashanth
Contributor

Yeah that's #27451

@girishkalele girishkalele added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2016
@girishkalele
Author

@bprashanth I lost my lgtm label because of the flake...I added it again.

@bprashanth
Contributor

@k8s-bot test this github issue: #26682

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2016
@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 16, 2016
@bprashanth
Contributor

Reapplying lgtm; it was just a rebase, right?

@bprashanth bprashanth added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 16, 2016
@girishkalele
Author

Yes, thanks.

@k8s-bot

k8s-bot commented Jun 16, 2016

GCE e2e build/test passed for commit ee7ca66.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented Jun 17, 2016

GCE e2e build/test passed for commit ee7ca66.

@k8s-github-robot

Automatic merge from submit-queue

Labels: lgtm, priority/important-soon, release-note-none, size/M