
[job failure] gce-master-1.8-downgrade-cluster #56244

Closed
spiffxp opened this issue Nov 22, 2017 · 19 comments
Labels: kind/bug, kind/failing-test, priority/critical-urgent, sig/cluster-lifecycle
Milestone: v1.9

spiffxp (Member) commented Nov 22, 2017

/priority critical-urgent
/priority failing-test
/kind bug
/status approved-for-milestone
@kubernetes/sig-cluster-lifecycle-test-failures

This job has been failing since at least 2017-11-08. It's on the sig-release-master-upgrade dashboard, and prevents us from cutting v1.9.0-beta.1 (kubernetes/sig-release#34). Is there work ongoing to bring this job back to green?

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster

@spiffxp spiffxp added this to the v1.9 milestone Nov 22, 2017
@k8s-ci-robot k8s-ci-robot added status/approved-for-milestone sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/bug Categorizes issue or PR as related to a bug. labels Nov 22, 2017
jberkus commented Nov 28, 2017

Can we have a status update on this issue from the SIG? It has become critical for the 1.9 release. Thanks!

janetkuo (Member) commented:

The test timed out waiting for the node to be recreated after node drain.

W1128 18:07:13.833] 2017/11/28 18:07:13 util.go:155: Running: kubetest --test --test_args=--ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade --v=true --check-version-skew=false
[...skipped...]
I1128 18:17:26.419] node "bootstrap-e2e-minion-group-7xbf" drained
I1128 18:17:26.422] == Recreating instance bootstrap-e2e-minion-group-7xbf. ==
I1128 18:17:27.591] == Waiting for instance bootstrap-e2e-minion-group-7xbf to be recreated. ==
I1128 18:18:24.610] ..................................== FAILED to describe bootstrap-e2e-minion-group-7xbf ==
I1128 18:18:24.611] ERROR: (gcloud.compute.instances.describe) Could not fetch resource:
I1128 18:18:24.611]  - The resource 'projects/e2e-gce-gci-ci-slow-1-5/zones/us-central1-f/instances/bootstrap-e2e-minion-group-7xbf' was not found
I1128 18:18:24.611]   (Will retry.)
I1128 18:18:26.223] Instance bootstrap-e2e-minion-group-7xbf recreated.
I1128 18:18:26.223] == Waiting for new node to be added to k8s.  ==
[!! waited for a long time!!]
I1129 08:58:30.063] ..........................You should now be able to use ssh/scp with your instances.
I1129 08:58:30.110] For example, try running:
I1129 08:58:30.110] 
I1129 08:58:30.111]   $ ssh bootstrap-e2e-master.us-central1-f.e2e-gce-gci-ci-slow-1-5
I1129 08:58:30.112] 
W1129 08:58:30.214] 2017/11/29 08:58:20 util.go:196: Interrupt after 15h0m0s timeout during kubetest --test --test_args=--ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade --v=true --check-version-skew=false. Will terminate in another 15m
W1129 08:58:30.214] 2017/11/29 08:58:21 util.go:177: Killing ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade(-7393) after receiving signal
W1129 08:58:30.214] 2017/11/29 08:58:21 util.go:157: Step './hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade' finished in 14h51m2.443658039s
W1129 08:58:30.217] 2017/11/29 08:58:21 main.go:312: Something went wrong: encountered 1 errors: [error during ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade: signal: killed]
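
The "Waiting for new node to be added to k8s" step then hung for roughly 15 hours until the global kubetest timeout killed the run. As a point of reference, a bounded wait for that step could look like the sketch below (a hypothetical helper, not the actual cluster/gce/upgrade.sh code; it assumes kubectl access to the cluster):

# Hypothetical helper, not the actual cluster/gce/upgrade.sh code: poll the API
# server until the recreated node reports Ready, and give up after a timeout
# instead of hanging until the 15h kubetest limit.
wait_for_node_ready() {
  local node="$1" timeout="${2:-600}" start
  start=$(date +%s)
  until kubectl get node "${node}" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null \
      | grep -q True; do
    if (( $(date +%s) - start > timeout )); then
      echo "Timed out waiting for node ${node} to become Ready" >&2
      return 1
    fi
    sleep 10
  done
}

wait_for_node_ready bootstrap-e2e-minion-group-7xbf 600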

spiffxp (Member, Author) commented Dec 1, 2017

Now tracking against v1.9.0-beta.2 (kubernetes/sig-release#39)

@janetkuo janetkuo assigned yguo0905 and unassigned abgworrall Dec 1, 2017
janetkuo (Member) commented Dec 1, 2017

@yguo0905 is going to take a look

jberkus commented Dec 4, 2017

@yguo0905 status update?

yguo0905 (Contributor) commented Dec 5, 2017

For the failed run https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/168?log#log

Node bootstrap-e2e-minion-group-9sch cannot register with the master because:

Dec 05 03:41:56 bootstrap-e2e-minion-group-9sch configure.sh[1015]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dec 05 03:41:56 bootstrap-e2e-minion-group-9sch configure.sh[1015]:                                  Dload  Upload   Total   Spent    Left  Speed
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch configure.sh[1015]: [158B blob data]
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch configure.sh[1015]: == Downloaded https://storage.googleapis.com/kubernetes-release/network-plugins/cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz (SHA1 = 1d9788b0f5420e1a219aad2cb8681823fc515e7c) ==
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch configure.sh[1015]: /home/kubernetes/bin/configure.sh: line 203: KUBE_MANIFESTS_TAR_URL: unbound variable
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch configure.sh[1015]: /home/kubernetes/bin/configure.sh: line 204: manifests_tar_urls[0]: unbound variable
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch systemd[1]: kube-node-installation.service: Main process exited, code=exited, status=1/FAILURE
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch systemd[1]: Failed to start Download and install k8s binaries and configurations.
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch systemd[1]: kube-node-installation.service: Unit entered failed state.
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch systemd[1]: kube-node-installation.service: Failed with result 'exit-code'.

The node was created from the new instance template bootstrap-e2e-minion-template-v1-8-5-beta-0-60-dcbe09a08ac68d, which does not set KUBE_MANIFESTS_TAR_URL. The instance template was created from:

# TODO(zmerlynn): Get configure-vm script from ${version}. (Must plumb this
# through all create-node-instance-template implementations).
local template_name=$(get-template-name-from-version ${SANITIZED_VERSION})
create-node-instance-template "${template_name}"
# The following is echo'd so that callers can get the template name.
echo "Instance template name: ${template_name}"
echo "== Finished preparing node upgrade (to ${KUBE_VERSION}). ==" >&2

This doesn't seem to be a node issue (i.e., not in scope for sig-node).

@zmerlynn, do you happen to know whether some change caused this issue?

Is this test critical for the 1.9 release?

spiffxp (Member, Author) commented Dec 6, 2017

@yguo0905 historically we have treated failing jobs/tests on the https://k8s-testgrid.appspot.com/sig-release-master-upgrade dashboard as release-blockers, and as the CI Signal Lead for this release I'm treating them the same way (see https://github.com/kubernetes/sig-release/blob/master/release-process-documentation/release-team-guides/ci-signal-playbook.md#code-freeze).

spiffxp (Member, Author) commented Dec 11, 2017

Now tracking against v1.9.0 (kubernetes/sig-release#40)

All automated downgrade jobs are failing; this could really use some attention.

yguo0905 (Contributor) commented:

Could someone from sig-cluster-lifecycle take a look at the issue on #56244 (comment)? Is KUBE_MANIFESTS_TAR_URL expected to be set in the new node pool?

luxas (Member) commented Dec 11, 2017

@spiffxp FWIW, I'm running some basic downgrade tests manually using kubeadm to get some coverage generally, but it really doesn't test everything, only roughly what's in the Conformance tests, which is a low bar, but anyway...

enisoc (Member) commented Dec 11, 2017

@krousey found a problem during manual downgrade testing that is likely also impacting the downgrade e2es. The following cherry-pick allowed him to complete a manual downgrade:

#57056

krousey (Contributor) commented Dec 12, 2017

@enisoc Well... I got through a node downgrade, which is where it was hanging. Now master downgrade doesn't work. I think it's because we changed etcd versions and etcd is refusing to downgrade.

krousey (Contributor) commented Dec 12, 2017

Just ran a test. If we deploy the 1.9 cluster with ETCD_VERSION=3.0.17 (the etcd version of 1.8) then master downgrade succeeds.
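
For anyone trying to reproduce this, the experiment roughly amounts to the sketch below (a hedged sketch assuming the standard GCE kube-up flow; the upgrade.sh flags and the exact 1.8 target version are assumptions, not a verified recipe):

# Hypothetical reproduction of the experiment above; flags and versions are
# assumptions, not a verified recipe.
export KUBERNETES_PROVIDER=gce

# Pin etcd to the 1.8-era version before bringing up the 1.9 cluster, so the
# subsequent master downgrade never asks etcd itself to downgrade.
export ETCD_VERSION=3.0.17

./cluster/kube-up.sh                  # bring up the 1.9 cluster with the old etcd
./cluster/gce/upgrade.sh -M v1.8.5    # then downgrade the master to a 1.8 release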

xiangpengzhao (Contributor) commented:

xref: #57013

enisoc (Member) commented Dec 13, 2017

The downgrade test is now running, but some of the tests are failing:

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/181

Cluster downgrade [sig-apps] daemonset-upgrade

Dec 13 10:27:42.738: expected DaemonSet pod to be running on all nodes, it was not

k8s.io/kubernetes/test/e2e/upgrades/apps.(*DaemonSetUpgradeTest).validateRunningDaemonSet(0x5c865f0, 0xc420d20b40)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/upgrades/apps/daemonsets.go:109 +0x1fb
k8s.io/kubernetes/test/e2e/upgrades/apps.(*DaemonSetUpgradeTest).Test(0x5c865f0, 0xc420d20b40, 0xc42129a0c0, 0x2)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/upgrades/apps/daemonsets.go:96 +0xc4
k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test(0xc420a60480, 0xc420a3e4a0)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:321 +0x1ed
k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test-fm(0xc420a3e4a0)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:390 +0x34
k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do.func1(0xc420a3e4a0, 0xc4201ff6b0)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:89 +0x76
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x1c7

[sig-cluster-lifecycle] Downgrade [Feature:Downgrade] cluster downgrade should maintain a functioning cluster [Feature:ClusterDowngrade]

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:174
Dec 13 10:27:42.738: expected DaemonSet pod to be running on all nodes, it was not
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/upgrades/apps/daemonsets.go:109

[k8s.io] [sig-node] Kubelet [Serial] [Slow] [k8s.io] [sig-node] regular resource usage tracking resource tracking for 100 pods per node

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/node/kubelet_perf.go:271
Dec 13 13:51:38.512: CPU usage exceeding limits:
 node bootstrap-e2e-minion-group-b3wl:
 container "runtime": expected 50th% usage < 0.100; got 0.102
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/node/kubelet_perf.go:189

k8s-github-robot commented:

[MILESTONENOTIFIER] Milestone Issue Current

@spiffxp @yguo0905

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
  • sig/cluster-lifecycle: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

enisoc (Member) commented Dec 13, 2017

The DaemonSet failures appear to be a test flake. The condition the test wants is actually true (there is one Pod on each Node), but the test seems to have the wrong idea of which Nodes exist.

I1213 10:27:42.738] Dec 13 10:27:42.738: INFO: Pod name: ds1-dsghx	 Node Name: bootstrap-e2e-minion-group-w795
I1213 10:27:42.739] Dec 13 10:27:42.738: INFO: Pod name: ds1-g4vct	 Node Name: bootstrap-e2e-minion-group-b3wl
I1213 10:27:42.739] Dec 13 10:27:42.738: INFO: Pod name: ds1-m6m9t	 Node Name: bootstrap-e2e-master
I1213 10:27:42.739] Dec 13 10:27:42.738: INFO: Pod name: ds1-slfln	 Node Name: bootstrap-e2e-minion-group-b02r
I1213 10:27:42.740] Dec 13 10:27:42.738: INFO: nodesToPodCount: map[bootstrap-e2e-minion-group-b02r:1 bootstrap-e2e-minion-group-w795:1 bootstrap-e2e-minion-group-b3wl:1 bootstrap-e2e-master:1]
I1213 10:27:42.741] Dec 13 10:27:42.738: INFO: expected DaemonSet pod to be running on all nodes, it was not
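
To cross-check a run like this by hand, standard kubectl is enough to compare the DaemonSet's Pod placement against the Nodes the apiserver actually reports (the label selector below is a placeholder; use whatever labels the test's ds1 DaemonSet really carries):

# Manual cross-check of "one DaemonSet Pod per Node"; the label selector is a
# placeholder for whatever the test's DaemonSet uses.
kubectl get nodes -o name

kubectl get pods -l daemonset-name=ds1 \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName

# If every Node from the first command appears exactly once in the second, the
# cluster state matches what the test expects and the failure is in the test's
# view of the node list, not in the DaemonSet controller.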

enisoc (Member) commented Dec 13, 2017

I'm also not terribly concerned about the other test showing 0.002 more CPU usage than desired:

expected 50th% usage < 0.100; got 0.102

enisoc (Member) commented Dec 14, 2017

It seems those were indeed flakes. The latest run is fully green:

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/182
