[job failure] gci-gke #55189

Closed
spiffxp opened this issue Nov 7, 2017 · 41 comments
Labels
area/provider/gcp Issues or PRs related to gcp provider kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@spiffxp
Member

spiffxp commented Nov 7, 2017

/priority critical-urgent
/priority failing-test
/area platform/gke
@kubernetes/sig-gcp-test-failures

This job has been failing since 2017-11-02. It's on the sig-release-master-blocking dashboard,
and prevents us from cutting [v1.9.0-alpha.3] (kubernetes/sig-release#27). Is there work ongoing to bring this job back to green?

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/gcp kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. area/provider/gcp Issues or PRs related to gcp provider labels Nov 7, 2017
@spiffxp spiffxp added this to the v1.9 milestone Nov 7, 2017
@spiffxp
Member Author

spiffxp commented Nov 7, 2017

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 7, 2017
@spiffxp
Member Author

spiffxp commented Nov 7, 2017

/status approved-for-milestone
because this is a release-blocking job

@enisoc
Member

enisoc commented Nov 7, 2017

It seems the master is getting stuck and we time out waiting for it.

W1107 00:08:34.093] 2017/11/07 00:08:34 util.go:155: Running: gcloud container clusters create --quiet --project=gke-up-g1-3-c1-5-up-clu-n --zone=us-central1-f --machine-type=n1-standard-2 --image-type=gci --num-nodes=3 --network=e2e-18057 --cluster-version=1.9.0-alpha.2.266+fdeeed100132cf e2e-18057
W1107 00:08:36.262] Creating cluster e2e-18057...
W1107 00:30:02.919] .....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................done.
W1107 00:30:03.726] ERROR: (gcloud.container.clusters.create) Operation [<Operation
W1107 00:30:03.726]  endTime: u'2017-11-07T00:30:00.716555911Z'
W1107 00:30:03.726]  name: u'operation-1510013316239-37703ad9'
W1107 00:30:03.726]  operationType: OperationTypeValueValuesEnum(CREATE_CLUSTER, 1)
W1107 00:30:03.727]  selfLink: u'https://test-container.sandbox.googleapis.com/v1/projects/216093379844/zones/us-central1-f/operations/operation-1510013316239-37703ad9'
W1107 00:30:03.727]  startTime: u'2017-11-07T00:08:36.239134796Z'
W1107 00:30:03.727]  status: StatusValueValuesEnum(DONE, 3)
W1107 00:30:03.727]  statusMessage: u'Timed out waiting for cluster initialization. Cluster API may not be available.'
W1107 00:30:03.727]  targetLink: u'https://test-container.sandbox.googleapis.com/v1/projects/216093379844/zones/us-central1-f/clusters/e2e-18057'
W1107 00:30:03.727]  zone: u'us-central1-f'>] finished with error: Timed out waiting for cluster initialization. Cluster API may not be available.

We need someone with access to the GKE master logs to diagnose further. I believe @yliaog is looking into it.

@yliaog
Contributor

yliaog commented Nov 7, 2017

Yes, I think the culprit is commit 3a15fdb (3a15fdbe7).
The PR is #54643.

Looking at the diff, the type of ManifestURLHeader changed from string to map[string][]string.

@jpbetz
Contributor

jpbetz commented Nov 7, 2017

Is this a release blocker? We hit this while validating 1.8.3 (#55244). Should we proceed with the release and ignore this error or should we hold off?

@enisoc
Member

enisoc commented Nov 7, 2017

@jpbetz It looks like @yliaog believes the issue here is something that went into master recently, so #55244 is likely a different problem. Sorry if I led you astray by guessing at a connection.

@jpbetz
Contributor

jpbetz commented Nov 7, 2017

@enisoc No worries. Following other leads now.

@yliaog
Contributor

yliaog commented Nov 8, 2017

/cc yliaog

@yliaog
Contributor

yliaog commented Nov 8, 2017

I managed to SSH into one failed cluster master and got this error message:
/home/kubernetes/bin/configure.sh: line 244: LOAD_IMAGE_COMMAND: unbound variable

A quick GitHub search revealed that #54964 is the culprit. It introduced LOAD_IMAGE_COMMAND. Although it added a default in cluster/gce/config-default.sh (LOAD_IMAGE_COMMAND=${KUBE_LOAD_IMAGE_COMMAND:-docker load -i}), that default is not loaded on GKE.

@Random-Liu
Member

GitHub has become pretty slow on my side.

I sent out a PR #55331.

As @yliaog said, GKE does not use the open-source cluster/gce/config-default.sh and cluster/gce/config-test.sh, and those are the only files where we added the default value.

In #55331, we apply a default in cluster/gce/gci/configure.sh, which should fix the issue.
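
The failure mode and the kind of fix described above can be sketched roughly as follows. This is a minimal illustration, not the actual change in #55331: the helper name resolve_load_image_command is hypothetical, and the sketch assumes the script runs with nounset enabled, which the "unbound variable" message suggests.

```shell
#!/usr/bin/env bash
# Abort on errors and on any reference to an unset variable. With nounset,
# a bare $LOAD_IMAGE_COMMAND fails when nothing has defined the variable.
set -o errexit
set -o nounset

# Hypothetical helper (not taken from the PR): prints the image-load command,
# falling back to the default when the variable is unset or empty.
resolve_load_image_command() {
  # ${VAR:-default} expands safely under nounset even if VAR is unset.
  echo "${LOAD_IMAGE_COMMAND:-docker load -i}"
}
```

Calling resolve_load_image_command with LOAD_IMAGE_COMMAND unset prints the default, while an exported value takes precedence. Applying the default at the point of use means the script no longer depends on which config file the environment sourced.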

@yliaog
Contributor

yliaog commented Nov 8, 2017

lgtm

@spiffxp
Member Author

spiffxp commented Nov 9, 2017

/reopen
I'd like to hold this open until I see the results on testgrid

@yliaog
Contributor

yliaog commented Nov 11, 2017

Found the following error in the e2e test:
W1111 06:06:35.956] zone: u'us-central1-f'>] finished with error: All cluster resources were brought up, but the cluster API is reporting that only 0 nodes out of 3 have registered. Cluster may be unhealthy.

Found these RBAC DENY entries in the API server log:
I1111 05:53:29.316972 5 rbac.go:116] RBAC DENY: user "system:node-problem-detector" groups ["system:authenticated"] cannot "patch" resource "nodes/status" named "gke-xxxxx-4th-default-pool-dcab13b3-191f" cluster-wide

I1111 05:53:35.393883 5 rbac.go:116] RBAC DENY: user "system:kube-scheduler" groups ["system:authenticated"] cannot "list" resource "nodes" cluster-wide

This looks like it was caused by the RBAC change in #53144.
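
When an API server log contains many such entries, a small filter can reduce them to user, verb, and resource. The sketch below is only an illustration that assumes the log line format shown above; the function name summarize_rbac_denies is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads API server log lines on stdin and prints
# "user verb resource" for every RBAC DENY entry, skipping all other lines.
summarize_rbac_denies() {
  # Capture the quoted user, verb, and resource fields from the line format
  # seen in this thread; -n with the p flag prints only matching lines.
  sed -n 's/.*RBAC DENY: user "\([^"]*\)".* cannot "\([^"]*\)" resource "\([^"]*\)".*/\1 \2 \3/p'
}
```

Piping a saved log through summarize_rbac_denies and then through sort and uniq -c gives a quick count of which identities are being denied which operations.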

@yliaog
Contributor

yliaog commented Nov 11, 2017

/cc @mindprince @mikedanese please help take a look

@yliaog
Contributor

yliaog commented Nov 11, 2017

/cc @mikedanese

@yliaog
Contributor

yliaog commented Nov 12, 2017

Found the following error logs from one node in a failed cluster.

Nov 12 03:32:05 gke-oneoff-e2e-default-pool-d85a29f5-1s1k kubelet[23455]: error: failed to run Kubelet: cannot create certificate signing request: certificatesigningrequests.certificates.k8s.io is forbidden: User "kubelet" cannot create certificatesigningrequests.certificates.k8s.io at the cluster scope: Unknown user "kubelet"

@spiffxp
Member Author

spiffxp commented Nov 15, 2017

/close
https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke 4 green runs in a row, thanks for the help

@krzyzacy
Member

/reopen

https://k8s-testgrid.appspot.com/google-gke#gci-gke
Something broke around midnight. Nothing in the commit range d20b156...b223955 looks really suspicious.

@k8s-ci-robot
Contributor

@krzyzacy: you can't re-open an issue/PR unless you authored it or you are assigned to it.

@abgworrall
Contributor

There is a theory that it is related to #55950

@enisoc enisoc reopened this Nov 17, 2017
@enisoc
Member

enisoc commented Nov 17, 2017

It seems related to the attempt to disable docker live restore.

From systemctl status kube-master-configuration:

Enable docker registry mirror at: https://mirror.gcr.io
Extend the docker.service configuration to remove the network checkpiont
Extend the docker.service configuration to set a higher pids limit
Docker command line is updated. Restart docker to pick it up
Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xe" for details.
kube-master-configuration.service: Main process exited, code=exited, status=1/FAILURE
Failed to start Configure kubernetes master.

From systemctl status docker:

unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: live-restore: (from flag: false, from file: false)
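
The error above says the daemon refuses to start when the same directive is given both as a command-line flag and in /etc/docker/daemon.json, even when the two values agree. A rough pre-flight check for that situation might look like the sketch below. It is an assumption-laden illustration: the function name has_live_restore_conflict and the way the flags are passed in as a string are hypothetical, and it is not the fix that was applied here.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeeds (exit 0) when live-restore is configured both
# in the daemon config file ($1) and in the dockerd flag string ($2).
has_live_restore_conflict() {
  config_file="$1"
  dockerd_flags="$2"
  # In daemon.json the directive appears as a quoted JSON key.
  grep -q '"live-restore"' "$config_file" || return 1
  # On the command line it appears as --live-restore or --live-restore=<bool>.
  case "$dockerd_flags" in
    *--live-restore*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A configuration script could call this before restarting docker and drop one of the two settings if it reports a conflict, instead of letting docker.service fail.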

@enisoc
Member

enisoc commented Nov 17, 2017

I suspect #55639.

@rohitagarwal003
Member

/cc @yguo0905 @yujuhong

@yguo0905
Contributor

We are aware of this issue and are fixing it.

@yujuhong yujuhong assigned yujuhong and yguo0905 and unassigned mikedanese Nov 17, 2017
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Current

@spiffxp @yguo0905 @yujuhong

Issue Labels
  • sig/gcp: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@yujuhong
Contributor

Fixed.
