CSI Upgrade Tests Failing for hostPath and gcePD #71228
The specific upgrade job these tests are failing on upgrades the master and nodes to 1.13 and then runs the old version's tests, i.e., in this case it runs the 1.11 tests against the 1.13 cluster.
More info on upgrade tests is at https://github.com/kubernetes/community/blob/master/contributors/devel/e2e-tests.md#version-skewed-and-upgrade-testing
Do we also have skew tests that only upgrade the master but keep the nodes on the older version? If so, we need to detect that case too and use the old 0.3 drivers in that scenario.
Am I understanding correctly that there is no overlap in supported CSI plugin APIs between 1.12 and 1.13?
Quoting @saad-ali from #65246 (comment):
So yes, it looks like users will see exactly the same issue as this CI signal if the entire cluster is upgraded to 1.13, but I will let @saad-ali speak to this more. Also, should we communicate this more broadly than just in the release notes? This test failure did manage to raise a few eyebrows during burndown, so I am wondering if this compatibility expectation is clearly understood and communicated.
CSI 1.0 is not backwards compatible with CSI 0.x. Kubernetes v1.13 adds support for CSI 1.0 while dropping support for CSI 0.x, which means it is not backwards compatible with CSI 0.x drivers. A vendor may choose to make a version of a CSI driver that supports both versions. If they do not, the disruptive upgrade path for Kubernetes cluster admins is to remove the old CSI drivers from the cluster before upgrading, upgrade master and nodes to k8s v1.13, wait until the master and all nodes are on 1.13, and then deploy the new CSI 1.0 compatible version of the driver. This approach is disruptive for the duration of the cluster upgrade, because during the upgrade no volumes of that type can be provisioned/attached/mounted/etc. After the cluster/driver upgrade, things should start working again with the existing PV/PVC objects (unless the driver name changed, or the driver had some other change between 0.3 and 1.0 that prevents it from operating on the old PV/PVC/StorageClass objects). A possible non-disruptive upgrade path is being investigated here: #71282

Regardless, in the CSI community it has been well communicated (and in my opinion well understood) that CSI 0.x will have breaking changes from release to release. The intention has been to treat CSI 0.x as a chance to develop and test drivers before arriving at a stable API. As far as I know, that is the case with the CSI drivers I am aware of (https://kubernetes-csi.github.io/docs/Drivers.html) -- none of them are currently production ready.

The intended path forward is to document the incompatibility as a known (and expected) issue for 1.13, and to document possible upgrade strategies. The existing tests have been modified to support the disruptive upgrade (PR #71241). We're looking into also adding a new upgrade test that exercises and validates a possible non-disruptive upgrade strategy (issue #71282). Moving forward, we intend for all CSI 1.x releases to be backwards compatible with 1.0 and to support them as such in Kubernetes.
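For readers wondering what the "driver supports both versions" option could look like, here is a minimal, hypothetical sketch: because the CSI 0.x and 1.0 protos live in different packages (`csi.v0` vs `csi.v1`), both generated gRPC services can be registered on the same server. Only the Identity service is shown; the driver name, socket path, and stub responses are placeholders, and a real driver would also need Controller/Node services plus matching sidecars.

```go
// Sketch only: one driver binary serving both CSI 0.3 and CSI 1.0 Identity
// services on the same socket, so the same DaemonSet can work with kubelets
// before and after the cluster upgrade. Identifiers below marked as examples
// are placeholders, not from any real driver.
package main

import (
	"context"
	"net"

	csiv1 "github.com/container-storage-interface/spec/lib/go/csi"
	csiv0 "github.com/container-storage-interface/spec/lib/go/csi/v0"
	"google.golang.org/grpc"
)

type identityV0 struct{}

func (identityV0) GetPluginInfo(ctx context.Context, req *csiv0.GetPluginInfoRequest) (*csiv0.GetPluginInfoResponse, error) {
	return &csiv0.GetPluginInfoResponse{Name: "example.csi.driver", VendorVersion: "0.3.0"}, nil
}

func (identityV0) GetPluginCapabilities(ctx context.Context, req *csiv0.GetPluginCapabilitiesRequest) (*csiv0.GetPluginCapabilitiesResponse, error) {
	return &csiv0.GetPluginCapabilitiesResponse{}, nil
}

func (identityV0) Probe(ctx context.Context, req *csiv0.ProbeRequest) (*csiv0.ProbeResponse, error) {
	return &csiv0.ProbeResponse{}, nil
}

type identityV1 struct{}

func (identityV1) GetPluginInfo(ctx context.Context, req *csiv1.GetPluginInfoRequest) (*csiv1.GetPluginInfoResponse, error) {
	return &csiv1.GetPluginInfoResponse{Name: "example.csi.driver", VendorVersion: "1.0.0"}, nil
}

func (identityV1) GetPluginCapabilities(ctx context.Context, req *csiv1.GetPluginCapabilitiesRequest) (*csiv1.GetPluginCapabilitiesResponse, error) {
	return &csiv1.GetPluginCapabilitiesResponse{}, nil
}

func (identityV1) Probe(ctx context.Context, req *csiv1.ProbeRequest) (*csiv1.ProbeResponse, error) {
	return &csiv1.ProbeResponse{}, nil
}

func main() {
	lis, err := net.Listen("unix", "/csi/csi.sock") // example socket path
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	// Pre-1.13 components speak csi.v0, 1.13+ components speak csi.v1; the
	// gRPC service names differ, so both can live on the same server.
	csiv0.RegisterIdentityServer(srv, identityV0{})
	csiv1.RegisterIdentityServer(srv, identityV1{})
	if err := srv.Serve(lis); err != nil {
		panic(err)
	}
}
```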
Just to explore how disruptive that would be: aren't nodes required to be drained as part of an upgrade? If so, when new replicas of those workloads attempt to get scheduled to other nodes, any that use CSI volumes will not be able to run because those volumes cannot be attached/mounted, correct?
Correct, it is suggested that pods be drained from a node before that node is upgraded. And if a CSI driver is uninstalled before the upgrade, then pods depending on volumes from that driver will remain in the Pending state (the volumes won't attach/mount) until the upgrade is complete and a compatible version of the driver is installed.
This might be tricky for pods that are managed by higher-level controllers that will automatically restart them.
@AishSundar @liggitt I spoke with @thockin. He is not happy with the backwards compatibility break. He wants us to look into what it will take to support both 0.x and 1.0. I will look into that and update this thread on feasibility. Until this is resolved, consider this a release-blocking issue.
/priority critical-urgent
@saad-ali thanks for the update, will watch this thread.
We discussed this at the release meeting today. The plan of record is to try to get PR #71314 approved and merged by EOD. Once it is merged, we'll monitor for regressions over the long weekend, and on Monday we will assess the impact and next steps. As a backup option, we will continue to investigate a non-disruptive, manual-intervention upgrade option (issue #71282).
/reopen
Reopening until we get green CI. @jberkus ^^
@AishSundar: Reopened this issue.
We're getting a lot of flakes across the longer-running tests today, so at this point we don't know if things are passing or not. |
The CSI tests in question turned green after #71314 was merged. This particular issue can be closed.
The flakes are unrelated to this issue. This issue was about fixing a hard failure.
OK, closing this and opening a new issue for the flakes.
/close
@jberkus: Closing this issue.
Quoted from: #65246 (comment)
CSI v1.0.0 is only compatible with k8s >= 1.13
CSI v0.x is only compatible with k8s <= 1.12
It's probable that we will need to detect which version the k8s cluster is on in the test and deploy the "correct" version of the CSI drivers for that version (a rough sketch of such a check is below).
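A minimal sketch of that detection using client-go follows; the manifest directory names and the 1.13 cutover are illustrative assumptions, not the real e2e framework API. Note that this only inspects the API server's version, so the master/node skew case raised above would still need per-node handling.

```go
// Sketch: choose which set of CSI driver manifests to deploy based on the
// server version the test is running against. Paths are hypothetical.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func driverManifestDir(cs kubernetes.Interface) (string, error) {
	info, err := cs.Discovery().ServerVersion()
	if err != nil {
		return "", err
	}
	v, err := version.ParseGeneric(info.GitVersion)
	if err != nil {
		return "", err
	}
	// Clusters at 1.13 or later only speak CSI 1.0; older ones only CSI 0.x.
	if v.AtLeast(version.MustParseGeneric("1.13.0")) {
		return "manifests/csi-1.0", nil
	}
	return "manifests/csi-0.3", nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	dir, err := driverManifestDir(cs)
	if err != nil {
		panic(err)
	}
	fmt.Println("deploying CSI driver manifests from", dir)
}
```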
/kind failing-test
/assign
/cc @AishSundar @jberkus @saad-ali @msau42
/sig storage