
[job failure] gce-master-1.8-downgrade-cluster #56244

Closed
spiffxp opened this issue Nov 22, 2017 · 19 comments
Labels: kind/bug, kind/failing-test, priority/critical-urgent, sig/cluster-lifecycle
Milestone: v1.9

spiffxp (Member) commented Nov 22, 2017

/priority critical-urgent
/priority failing-test
/kind bug
/status approved-for-milestone
@kubernetes/sig-cluster-lifecycle-test-failures

This job has been failing since at least 2017-11-08. It's on the sig-release-master-upgrade dashboard, and prevents us from cutting v1.9.0-beta.1 (kubernetes/sig-release#34). Is there work ongoing to bring this job back to green?

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster

@spiffxp spiffxp added this to the v1.9 milestone Nov 22, 2017
@k8s-ci-robot k8s-ci-robot added status/approved-for-milestone sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/bug Categorizes issue or PR as related to a bug. labels Nov 22, 2017
jberkus commented Nov 28, 2017

Can we have a status update on this issue from the SIG? It has become critical for the 1.9 release. Thanks!

janetkuo (Member) commented:

The test timed out waiting for the node to be recreated after node drain.

W1128 18:07:13.833] 2017/11/28 18:07:13 util.go:155: Running: kubetest --test --test_args=--ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade --v=true --check-version-skew=false
[...skipped...]
I1128 18:17:26.419] node "bootstrap-e2e-minion-group-7xbf" drained
I1128 18:17:26.422] == Recreating instance bootstrap-e2e-minion-group-7xbf. ==
I1128 18:17:27.591] == Waiting for instance bootstrap-e2e-minion-group-7xbf to be recreated. ==
I1128 18:18:24.610] ..................................== FAILED to describe bootstrap-e2e-minion-group-7xbf ==
I1128 18:18:24.611] ERROR: (gcloud.compute.instances.describe) Could not fetch resource:
I1128 18:18:24.611]  - The resource 'projects/e2e-gce-gci-ci-slow-1-5/zones/us-central1-f/instances/bootstrap-e2e-minion-group-7xbf' was not found
I1128 18:18:24.611]   (Will retry.)
I1128 18:18:26.223] Instance bootstrap-e2e-minion-group-7xbf recreated.
I1128 18:18:26.223] == Waiting for new node to be added to k8s.  ==
[!! waited for a long time!!]
I1129 08:58:30.063] ..........................You should now be able to use ssh/scp with your instances.
I1129 08:58:30.110] For example, try running:
I1129 08:58:30.110] 
I1129 08:58:30.111]   $ ssh bootstrap-e2e-master.us-central1-f.e2e-gce-gci-ci-slow-1-5
I1129 08:58:30.112] 
W1129 08:58:30.214] 2017/11/29 08:58:20 util.go:196: Interrupt after 15h0m0s timeout during kubetest --test --test_args=--ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade --v=true --check-version-skew=false. Will terminate in another 15m
W1129 08:58:30.214] 2017/11/29 08:58:21 util.go:177: Killing ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade(-7393) after receiving signal
W1129 08:58:30.214] 2017/11/29 08:58:21 util.go:157: Step './hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade' finished in 14h51m2.443658039s
W1129 08:58:30.217] 2017/11/29 08:58:21 main.go:312: Something went wrong: encountered 1 errors: [error during ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade: signal: killed]
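
The "Waiting for new node to be added to k8s" step then hung for roughly 15 hours until the global kubetest timeout killed the run. As a point of reference, a bounded wait for that step could look like the sketch below (a hypothetical helper, not the actual cluster/gce/upgrade.sh code; it assumes kubectl access to the cluster):

# Hypothetical helper, not the actual cluster/gce/upgrade.sh code: poll the API
# server until the recreated node reports Ready, and give up after a timeout
# instead of hanging until the 15h kubetest limit.
wait_for_node_ready() {
  local node="$1" timeout="${2:-600}" start
  start=$(date +%s)
  until kubectl get node "${node}" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null \
      | grep -q True; do
    if (( $(date +%s) - start > timeout )); then
      echo "Timed out waiting for node ${node} to become Ready" >&2
      return 1
    fi
    sleep 10
  done
}

wait_for_node_ready bootstrap-e2e-minion-group-7xbf 600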

spiffxp (Member, Author) commented Dec 1, 2017

Now tracking against v1.9.0-beta.2 (kubernetes/sig-release#39)

@janetkuo janetkuo assigned yguo0905 and unassigned abgworrall Dec 1, 2017
janetkuo (Member) commented Dec 1, 2017

@yguo0905 is going to take a look

jberkus commented Dec 4, 2017

@yguo0905 status update?

yguo0905 (Contributor) commented Dec 5, 2017

For the failed run https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/168?log#log

Node bootstrap-e2e-minion-group-9sch cannot register with the master because:

Dec 05 03:41:56 bootstrap-e2e-minion-group-9sch configure.sh[1015]:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Dec 05 03:41:56 bootstrap-e2e-minion-group-9sch configure.sh[1015]:                                  Dload  Upload   Total   Spent    Left  Speed
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch configure.sh[1015]: [158B blob data]
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch configure.sh[1015]: == Downloaded https://storage.googleapis.com/kubernetes-release/network-plugins/cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz (SHA1 = 1d9788b0f5420e1a219aad2cb8681823fc515e7c) ==
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch configure.sh[1015]: /home/kubernetes/bin/configure.sh: line 203: KUBE_MANIFESTS_TAR_URL: unbound variable
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch configure.sh[1015]: /home/kubernetes/bin/configure.sh: line 204: manifests_tar_urls[0]: unbound variable
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch systemd[1]: kube-node-installation.service: Main process exited, code=exited, status=1/FAILURE
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch systemd[1]: Failed to start Download and install k8s binaries and configurations.
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch systemd[1]: kube-node-installation.service: Unit entered failed state.
Dec 05 03:41:57 bootstrap-e2e-minion-group-9sch systemd[1]: kube-node-installation.service: Failed with result 'exit-code'.

The node was created from the new instance template bootstrap-e2e-minion-template-v1-8-5-beta-0-60-dcbe09a08ac68d, which does not set KUBE_MANIFESTS_TAR_URL. The instance template was created from:

# TODO(zmerlynn): Get configure-vm script from ${version}. (Must plumb this
# through all create-node-instance-template implementations).
local template_name=$(get-template-name-from-version ${SANITIZED_VERSION})
create-node-instance-template "${template_name}"
# The following is echo'd so that callers can get the template name.
echo "Instance template name: ${template_name}"
echo "== Finished preparing node upgrade (to ${KUBE_VERSION}). ==" >&2

This doesn't seem to be a node issue (i.e., not in scope for sig-node).

@zmerlynn, do you happen to know whether some change caused this issue?

Is this test critical for the 1.9 release?

spiffxp (Member, Author) commented Dec 6, 2017

@yguo0905 historically we have treated failing jobs/tests on the https://k8s-testgrid.appspot.com/sig-release-master-upgrade dashboard as release-blockers, and as the CI Signal Lead for this release I'm treating them the same way (see https://github.com/kubernetes/sig-release/blob/master/release-process-documentation/release-team-guides/ci-signal-playbook.md#code-freeze).

spiffxp (Member, Author) commented Dec 11, 2017

Now tracking against v1.9.0 (kubernetes/sig-release#40)

All automated downgrade jobs are failing; this could really use some attention.

yguo0905 (Contributor) commented:

Could someone from sig-cluster-lifecycle take a look at the issue on #56244 (comment)? Is KUBE_MANIFESTS_TAR_URL expected to be set in the new node pool?

luxas (Member) commented Dec 11, 2017

@spiffxp FWIW, I'm running some basic downgrade tests manually using kubeadm to get some coverage generally, but it really doesn't test everything, only roughly what's in the Conformance tests, which is a low bar, but anyway...

enisoc (Member) commented Dec 11, 2017

@krousey found a problem during manual downgrade testing that is likely also impacting the downgrade e2es. The following cherry-pick allowed him to complete a manual downgrade:

#57056

krousey (Contributor) commented Dec 12, 2017

@enisoc Well... I got through a node downgrade, which is where it was hanging. Now master downgrade doesn't work. I think it's because we changed etcd versions and etcd is refusing to downgrade.

krousey (Contributor) commented Dec 12, 2017

Just ran a test. If we deploy the 1.9 cluster with ETCD_VERSION=3.0.17 (the etcd version of 1.8) then master downgrade succeeds.
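
For anyone trying to reproduce this, the experiment roughly amounts to the sketch below (a hedged sketch assuming the standard GCE kube-up flow; the upgrade.sh flags and the exact 1.8 target version are assumptions, not a verified recipe):

# Hypothetical reproduction of the experiment above; flags and versions are
# assumptions, not a verified recipe.
export KUBERNETES_PROVIDER=gce

# Pin etcd to the 1.8-era version before bringing up the 1.9 cluster, so the
# subsequent master downgrade never asks etcd itself to downgrade.
export ETCD_VERSION=3.0.17

./cluster/kube-up.sh                  # bring up the 1.9 cluster with the old etcd
./cluster/gce/upgrade.sh -M v1.8.5    # then downgrade the master to a 1.8 release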

xiangpengzhao (Contributor) commented:

xref: #57013

enisoc (Member) commented Dec 13, 2017

The downgrade test is now running, but some of the tests are failing:

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/181

Cluster downgrade [sig-apps] daemonset-upgrade

Dec 13 10:27:42.738: expected DaemonSet pod to be running on all nodes, it was not

k8s.io/kubernetes/test/e2e/upgrades/apps.(*DaemonSetUpgradeTest).validateRunningDaemonSet(0x5c865f0, 0xc420d20b40)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/upgrades/apps/daemonsets.go:109 +0x1fb
k8s.io/kubernetes/test/e2e/upgrades/apps.(*DaemonSetUpgradeTest).Test(0x5c865f0, 0xc420d20b40, 0xc42129a0c0, 0x2)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/upgrades/apps/daemonsets.go:96 +0xc4
k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test(0xc420a60480, 0xc420a3e4a0)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:321 +0x1ed
k8s.io/kubernetes/test/e2e/lifecycle.(*chaosMonkeyAdapter).Test-fm(0xc420a3e4a0)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:390 +0x34
k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do.func1(0xc420a3e4a0, 0xc4201ff6b0)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:89 +0x76
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x1c7

[sig-cluster-lifecycle] Downgrade [Feature:Downgrade] cluster downgrade should maintain a functioning cluster [Feature:ClusterDowngrade]

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/lifecycle/cluster_upgrade.go:174
Dec 13 10:27:42.738: expected DaemonSet pod to be running on all nodes, it was not
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/upgrades/apps/daemonsets.go:109

[k8s.io] [sig-node] Kubelet [Serial] [Slow] [k8s.io] [sig-node] regular resource usage tracking resource tracking for 100 pods per node

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/node/kubelet_perf.go:271
Dec 13 13:51:38.512: CPU usage exceeding limits:
 node bootstrap-e2e-minion-group-b3wl:
 container "runtime": expected 50th% usage < 0.100; got 0.102
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/node/kubelet_perf.go:189

k8s-github-robot commented:

[MILESTONENOTIFIER] Milestone Issue Current

@spiffxp @yguo0905

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
  • sig/cluster-lifecycle: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

enisoc (Member) commented Dec 13, 2017

The DaemonSet failures appear to be a test flake. The condition the test wants is actually true (there is one Pod on each Node), but the test seems to have the wrong idea of which Nodes exist.

I1213 10:27:42.738] Dec 13 10:27:42.738: INFO: Pod name: ds1-dsghx	 Node Name: bootstrap-e2e-minion-group-w795
I1213 10:27:42.739] Dec 13 10:27:42.738: INFO: Pod name: ds1-g4vct	 Node Name: bootstrap-e2e-minion-group-b3wl
I1213 10:27:42.739] Dec 13 10:27:42.738: INFO: Pod name: ds1-m6m9t	 Node Name: bootstrap-e2e-master
I1213 10:27:42.739] Dec 13 10:27:42.738: INFO: Pod name: ds1-slfln	 Node Name: bootstrap-e2e-minion-group-b02r
I1213 10:27:42.740] Dec 13 10:27:42.738: INFO: nodesToPodCount: map[bootstrap-e2e-minion-group-b02r:1 bootstrap-e2e-minion-group-w795:1 bootstrap-e2e-minion-group-b3wl:1 bootstrap-e2e-master:1]
I1213 10:27:42.741] Dec 13 10:27:42.738: INFO: expected DaemonSet pod to be running on all nodes, it was not
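
To cross-check a run like this by hand, standard kubectl is enough to compare the DaemonSet's Pod placement against the Nodes the apiserver actually reports (the label selector below is a placeholder; use whatever labels the test's ds1 DaemonSet really carries):

# Manual cross-check of "one DaemonSet Pod per Node"; the label selector is a
# placeholder for whatever the test's DaemonSet uses.
kubectl get nodes -o name

kubectl get pods -l daemonset-name=ds1 \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName

# If every Node from the first command appears exactly once in the second, the
# cluster state matches what the test expects and the failure is in the test's
# view of the node list, not in the DaemonSet controller.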

enisoc (Member) commented Dec 13, 2017

I'm also not terribly concerned about the other test showing 0.002 more CPU usage than desired:

expected 50th% usage < 0.100; got 0.102

enisoc (Member) commented Dec 14, 2017

It seems those were indeed flakes. The latest run is fully green:

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/182
