[job failure] gci-gke #55189
/kind bug
/status approved-for-milestone
It seems the master is getting stuck and we time out waiting for it.
We need someone with access to the GKE master logs to diagnose further. I believe @yliaog is looking into it.
Is this a release blocker? We hit this while validating 1.8.3 (#55244). Should we proceed with the release and ignore this error, or should we hold off?
@enisoc No worries. Following other leads now.
/cc yliaog
I managed to SSH into one failed cluster master and got the error msg:

A quick GitHub search revealed that #54964 is the culprit. It introduced LOAD_IMAGE_COMMAND. Although it added a default in cluster/gce/config-default.sh (LOAD_IMAGE_COMMAND=${KUBE_LOAD_IMAGE_COMMAND:-docker load -i}), the default is not properly loaded.
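For reference, the ${VAR:-default} form in that line is standard bash default expansion, and it only takes effect in a shell that actually evaluates the assignment — consistent with the default in config-default.sh not reaching every caller. A minimal sketch of the pattern (the "ctr images import" value below is just a stand-in override, not anything from this thread):

```shell
#!/bin/bash
# ${KUBE_LOAD_IMAGE_COMMAND:-docker load -i} expands to the value of
# KUBE_LOAD_IMAGE_COMMAND, or to "docker load -i" when it is unset or empty.
unset KUBE_LOAD_IMAGE_COMMAND
LOAD_IMAGE_COMMAND="${KUBE_LOAD_IMAGE_COMMAND:-docker load -i}"
echo "default:  ${LOAD_IMAGE_COMMAND}"        # prints: docker load -i

# With the variable set, the fallback is ignored.
KUBE_LOAD_IMAGE_COMMAND="ctr images import"   # stand-in example value
LOAD_IMAGE_COMMAND="${KUBE_LOAD_IMAGE_COMMAND:-docker load -i}"
echo "override: ${LOAD_IMAGE_COMMAND}"        # prints: ctr images import
```

The key point is that a component which never sources config-default.sh never sees the fallback at all, which matches the "default is not properly loaded" symptom.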
Github become pretty slow on my side. I sent out a PR #55331. As said by @yliaog, GKE is not using In #55331, we apply a default in |
lgtm
/reopen
Found the following errors in the e2e test:

Found the RBAC DENY in the API server log:

I1111 05:53:35.393883 5 rbac.go:116] RBAC DENY: user "system:kube-scheduler" groups ["system:authenticated"] cannot "list" resource "nodes" cluster-wide

Looks like it was caused by the RBAC change in #53144.
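For background on that log line: an RBAC DENY of verb "list" on resource "nodes" cluster-wide means no ClusterRole bound to system:kube-scheduler grants such a rule. A rule of the right shape looks like the sketch below — the name is hypothetical, and the scheduler's real permissions live in the bootstrap policy that #53144 changed:

```yaml
# Illustrative only: the shape of the permission the DENY above reports missing.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-node-reader   # hypothetical name, not from the bootstrap policy
rules:
- apiGroups: [""]             # "" is the core API group, where nodes live
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
```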
/cc @mindprince @mikedanese, please help take a look
/cc @mikedanese
Found the following error logs from one node in a failed cluster:

Nov 12 03:32:05 gke-oneoff-e2e-default-pool-d85a29f5-1s1k kubelet[23455]: error: failed to run Kubelet: cannot crea

/close
/reopen https://k8s-testgrid.appspot.com/google-gke#gci-gke |
@krzyzacy: you can't re-open an issue/PR unless you authored it or you are assigned to it. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
There is a theory that it is related to #55950.
It seems related to the attempt to disable docker live restore. From systemctl status kube-master-configuration:
From systemctl status docker:
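For reference, Docker's live restore behavior is controlled by the documented live-restore key in the daemon configuration (normally /etc/docker/daemon.json); disabling it is a one-line fragment. This is a sketch of the mechanism only, not the actual change under investigation:

```json
{
  "live-restore": false
}
```

The Docker daemon must be reloaded or restarted for the setting to take effect.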
I suspect #55639.
We are aware of this issue and are fixing it.
[MILESTONENOTIFIER] Milestone Issue: Current Issue Labels
Fixed.
/priority critical-urgent
/priority failing-test
/area platform/gke
@kubernetes/sig-gcp-test-failures
This job has been failing since 2017-11-02. It's on the sig-release-master-blocking dashboard,
and prevents us from cutting v1.9.0-alpha.3 (kubernetes/sig-release#27). Is there work ongoing to bring this job back to green?
https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke