[job failure] gci-gke #55189

spiffxp · 2017-11-07T01:06:54Z

/priority critical-urgent
/priority failing-test
/area platform/gke
@kubernetes/sig-gcp-test-failures

This job has been failing since 2017-11-02. It's on the sig-release-master-blocking dashboard,
and prevents us from cutting [v1.9.0-alpha.3] (kubernetes/sig-release#27). Is there work ongoing to bring this job back to green?

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke

last good: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke/17906
first bad: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke/17907
latest bad as of filing: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke/18057/
suspect changelog: 55e216f...3a15fdb


spiffxp · 2017-11-07T01:12:16Z

/kind bug

spiffxp · 2017-11-07T01:13:12Z

/status approved-for-milestone
because this is a release-blocking job

enisoc · 2017-11-07T17:41:51Z

It seems the master is getting stuck and we time out waiting for it.

W1107 00:08:34.093] 2017/11/07 00:08:34 util.go:155: Running: gcloud container clusters create --quiet --project=gke-up-g1-3-c1-5-up-clu-n --zone=us-central1-f --machine-type=n1-standard-2 --image-type=gci --num-nodes=3 --network=e2e-18057 --cluster-version=1.9.0-alpha.2.266+fdeeed100132cf e2e-18057
W1107 00:08:36.262] Creating cluster e2e-18057...
W1107 00:30:02.919] .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................done.
W1107 00:30:03.726] ERROR: (gcloud.container.clusters.create) Operation [<Operation
W1107 00:30:03.726]  endTime: u'2017-11-07T00:30:00.716555911Z'
W1107 00:30:03.726]  name: u'operation-1510013316239-37703ad9'
W1107 00:30:03.726]  operationType: OperationTypeValueValuesEnum(CREATE_CLUSTER, 1)
W1107 00:30:03.727]  selfLink: u'https://test-container.sandbox.googleapis.com/v1/projects/216093379844/zones/us-central1-f/operations/operation-1510013316239-37703ad9'
W1107 00:30:03.727]  startTime: u'2017-11-07T00:08:36.239134796Z'
W1107 00:30:03.727]  status: StatusValueValuesEnum(DONE, 3)
W1107 00:30:03.727]  statusMessage: u'Timed out waiting for cluster initialization. Cluster API may not be available.'
W1107 00:30:03.727]  targetLink: u'https://test-container.sandbox.googleapis.com/v1/projects/216093379844/zones/us-central1-f/clusters/e2e-18057'
W1107 00:30:03.727]  zone: u'us-central1-f'>] finished with error: Timed out waiting for cluster initialization. Cluster API may not be available.

We need someone with access to the GKE master logs to diagnose further. I believe @yliaog is looking into it.

yliaog · 2017-11-07T17:50:22Z

Yes, I think the culprit is commit 3a15fdb (3a15fdbe7).
PR is #54643

Looking at the diffs, the type of ManifestURLHeader was changed from string to map[string][]string.

jpbetz · 2017-11-07T18:03:44Z

Is this a release blocker? We hit this while validating 1.8.3 (#55244). Should we proceed with the release and ignore this error or should we hold off?

enisoc · 2017-11-07T18:36:32Z

@jpbetz It looks like @yliaog believes the issue here is something that went into master recently, so #55244 is likely a different problem. Sorry if I led you astray by guessing at a connection.

jpbetz · 2017-11-07T18:45:27Z

@enisoc No worries. Following other leads now.

yliaog · 2017-11-08T05:01:06Z

/cc yliaog

yliaog · 2017-11-08T05:02:38Z

I managed to SSH into the master of one failed cluster and got this error message:
/home/kubernetes/bin/configure.sh: line 244: LOAD_IMAGE_COMMAND: unbound variable

A quick GitHub search revealed that #54964 is the culprit. It introduced LOAD_IMAGE_COMMAND. Although it added a default in cluster/gce/config-default.sh (LOAD_IMAGE_COMMAND=${KUBE_LOAD_IMAGE_COMMAND:-docker load -i}), that default is not properly loaded.

Random-Liu · 2017-11-08T18:22:49Z

GitHub has become pretty slow on my side.

I sent out a PR #55331.

As @yliaog said, GKE does not use the open-source cluster/config-default.sh and cluster/config-test.sh, and those files are the only place where we added the default value.

In #55331, we apply a default in cluster/gce/gci/configure.sh, which should fix the issue.
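The failure mode and the kind of fix described here can be sketched in a few lines of shell. This is an illustration only, not the contents of configure.sh or of #55331: the helper name `load_image_command` is hypothetical, and the only details taken from the thread are the variable name and the `docker load -i` default.

```shell
#!/usr/bin/env bash
# Sketch only. With nounset enabled, expanding an unset variable aborts the
# script with "unbound variable", which matches the error seen on the master.
set -o nounset

# Hypothetical helper: prints the image load command, falling back to the
# default quoted in the thread when LOAD_IMAGE_COMMAND is unset or empty.
load_image_command() {
  echo "${LOAD_IMAGE_COMMAND:-docker load -i}"
}

# Without a default, the expansion fails. It runs in a subshell here so the
# failure does not terminate this script.
unset LOAD_IMAGE_COMMAND
if ! (: "${LOAD_IMAGE_COMMAND}") 2>/dev/null; then
  echo "bare expansion failed: variable is unbound"
fi

# With the default applied at the point of use, the script keeps going.
load_image_command
```

Applying the default where the variable is consumed means the script no longer depends on which config file, if any, was sourced beforehand.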

yliaog · 2017-11-08T19:46:17Z

lgtm

spiffxp · 2017-11-09T01:15:56Z

/reopen
I'd like to hold this open until I see the results on testgrid

yliaog · 2017-11-11T07:01:46Z

Found the following error in the e2e test:
W1111 06:06:35.956] zone: u'us-central1-f'>] finished with error: All cluster resources were brought up, but the cluster API is reporting that only 0 nodes out of 3 have registered. Cluster may be unhealthy.

Found these RBAC DENY entries in the API server log:
I1111 05:53:29.316972 5 rbac.go:116] RBAC DENY: user "system:node-problem-detector" groups ["system:authenticated"] cannot "patch" resource "nodes/status" named "gke-xxxxx-4th-default-pool-dcab13b3-191f" cluster-wide

I1111 05:53:35.393883 5 rbac.go:116] RBAC DENY: user "system:kube-scheduler" groups ["system:authenticated"] cannot "list" resource "nodes" cluster-wide

This looks like it was caused by the RBAC change in #53144.
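When an API server log contains many such entries, a small filter helps to see which users are denied which verbs. The following is a rough sketch that assumes the line format shown in the two entries above; the function name `summarize_rbac_denies` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads API server log lines on stdin and prints
# "user verb resource" for every RBAC DENY entry, skipping all other lines.
summarize_rbac_denies() {
  sed -n 's/.*RBAC DENY: user "\([^"]*\)".* cannot "\([^"]*\)" resource "\([^"]*\)".*/\1 \2 \3/p'
}

# The two entries quoted in this comment, fed through the filter.
summarize_rbac_denies <<'EOF'
I1111 05:53:29.316972 5 rbac.go:116] RBAC DENY: user "system:node-problem-detector" groups ["system:authenticated"] cannot "patch" resource "nodes/status" named "gke-xxxxx-4th-default-pool-dcab13b3-191f" cluster-wide
I1111 05:53:35.393883 5 rbac.go:116] RBAC DENY: user "system:kube-scheduler" groups ["system:authenticated"] cannot "list" resource "nodes" cluster-wide
EOF
```

Piping the result through `sort | uniq -c` would give a count per denied user, verb, and resource.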

yliaog · 2017-11-11T07:04:32Z

/cc @mindprince @mikedanese please help take a look

yliaog · 2017-11-11T07:07:04Z

/cc @mikedanese

yliaog · 2017-11-12T04:52:53Z

Found the following error logs from one node in a failed cluster.

Nov 12 03:32:05 gke-oneoff-e2e-default-pool-d85a29f5-1s1k kubelet[23455]: error: failed to run Kubelet: cannot create certificate signing request: certificatesigningrequests.certificates.k8s.io is forbidden: User "kubelet" cannot create certificatesigningrequests.certificates.k8s.io at the cluster scope: Unknown user "kubelet"

spiffxp · 2017-11-15T15:20:12Z

/close
https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke 4 green runs in a row, thanks for the help

krzyzacy · 2017-11-17T18:11:22Z

/reopen

https://k8s-testgrid.appspot.com/google-gke#gci-gke
Something broke around midnight. Nothing in the commit range d20b156...b223955 looks really suspicious.

k8s-ci-robot · 2017-11-17T18:11:22Z

@krzyzacy: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen

https://k8s-testgrid.appspot.com/google-gke#gci-gke
something is busted midnight - from commit range d20b156...b223955 nothing is really suspicious

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abgworrall · 2017-11-17T18:15:21Z

There is a theory that it is related to #55950

enisoc · 2017-11-17T19:34:41Z

It seems related to the attempt to disable docker live restore.

From systemctl status kube-master-configuration:

Enable docker registry mirror at: https://mirror.gcr.io
Extend the docker.service configuration to remove the network checkpiont
Extend the docker.service configuration to set a higher pids limit
Docker command line is updated. Restart docker to pick it up
Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xe" for details.
kube-master-configuration.service: Main process exited, code=exited, status=1/FAILURE
Failed to start Configure kubernetes master.

From systemctl status docker:

unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: live-restore: (from flag: false, from file: false)
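The error above says the same directive reached the daemon twice, once as a command-line flag and once through /etc/docker/daemon.json, and Docker refuses to start in that case even when both values agree. A minimal sketch of a pre-flight check for this situation follows. The helper name `live_restore_conflict` and the sample inputs are assumptions for illustration, and the plain text match on the file is a simplification rather than real JSON parsing.

```shell
#!/usr/bin/env bash
# Sketch only: detects live-restore being set both in a daemon.json-style
# file and in the dockerd flag string.

# Hypothetical helper. $1 is the path to the config file, $2 is the flag
# string. Returns 0 when live-restore appears in both places, 1 otherwise.
live_restore_conflict() {
  local config_file="$1"
  local docker_flags="$2"
  grep -q '"live-restore"' "$config_file" || return 1
  case "$docker_flags" in
    *--live-restore*) return 0 ;;
    *) return 1 ;;
  esac
}

# Illustrative inputs modelled on the error text (both values are false).
config="$(mktemp)"
printf '{ "live-restore": false }\n' > "$config"

if live_restore_conflict "$config" "--live-restore=false"; then
  echo "conflict: live-restore is set as a flag and in the file"
fi
rm -f "$config"
```

Setting the directive in only one of the two places avoids the startup failure.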

enisoc · 2017-11-17T19:45:36Z

I suspect #55639.

rohitagarwal003 · 2017-11-17T19:50:12Z

/cc @yguo0905 @yujuhong

yguo0905 · 2017-11-17T20:00:38Z

We are aware of this issue and are fixing it.

k8s-github-robot · 2017-11-17T20:25:25Z

[MILESTONENOTIFIER] Milestone Issue Current

@spiffxp @yguo0905 @yujuhong

Issue Labels

sig/gcp: Issue will be escalated to these SIGs if needed.
priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
kind/bug: Fixes a bug discovered during the current release.


yujuhong · 2017-11-18T02:10:54Z

Fixed.

spiffxp added this to the v1.9 milestone Nov 7, 2017

spiffxp mentioned this issue Nov 7, 2017

[job failure] gke-device-plugin-gpu #55190

Closed

k8s-github-robot added the milestone/incomplete-labels label Nov 7, 2017

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 7, 2017

k8s-github-robot added milestone/needs-approval and removed milestone/incomplete-labels labels Nov 7, 2017

k8s-ci-robot added the status/approved-for-milestone label Nov 7, 2017

k8s-github-robot removed the milestone/needs-approval label Nov 7, 2017

This was referenced Nov 7, 2017

[job failure] gci-gke-slow #55192

Closed

[job failure] gci-gke-serial #55193

Closed

[job failure] gci-gke-ingress #55195

Closed

enisoc mentioned this issue Nov 7, 2017

release-1.8 GKE test failure "Services should work after restarting apiserver" #55244

Closed

Random-Liu mentioned this issue Nov 8, 2017

Fix GKE failure, set default in configure.sh. #55331

Merged

k8s-github-robot closed this as completed in #55331 Nov 9, 2017

mikedanese self-assigned this Nov 14, 2017

mikedanese mentioned this issue Nov 14, 2017

GKE misc fixes #55624

Merged

spiffxp mentioned this issue Nov 14, 2017

remove gci-gke jobs from sig-release-master-blocking kubernetes/test-infra#5508

Closed

krzyzacy mentioned this issue Nov 14, 2017

Tests failed due to project quota issues kubernetes/test-infra#5509

Closed

mikedanese unassigned mtaufen Nov 15, 2017

k8s-github-robot closed this as completed in #55624 Nov 15, 2017

enisoc reopened this Nov 15, 2017

k8s-github-robot closed this as completed in 3e757c7 Nov 15, 2017

enisoc reopened this Nov 15, 2017

k8s-ci-robot closed this as completed Nov 15, 2017

enisoc reopened this Nov 17, 2017

yujuhong assigned yujuhong and yguo0905 and unassigned mikedanese Nov 17, 2017

yujuhong closed this as completed Nov 18, 2017
