Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI Upgrade Tests Failing for hostPath and gcePD #71228

Closed
davidz627 opened this issue Nov 19, 2018 · 20 comments · Fixed by #71314
Closed

CSI Upgrade Tests Failing for hostPath and gcePD #71228

davidz627 opened this issue Nov 19, 2018 · 20 comments · Fixed by #71314
Assignees
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Milestone

Comments

@davidz627
Copy link
Contributor

Quoted from: #65246 (comment)

Failing job
https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster

Failing Tests

[sig-storage] [Serial] CSI Volumes CSI plugin test using CSI driver: gcePD [Serial] should provision storage
[sig-storage] [Serial] CSI Volumes CSI plugin test using CSI driver: hostPath [Serial] should provision storage
Error
https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster/2629

This is now 1.13 release blocking.

CSI v1.0.0 is only compatible with k8s >1.13
CSI v0.x is only compatible with k8s <1.12

Its probable that we will need to detect what version the k8s cluster is on in the test and deploy the "correct" version of the CSI Drivers for each version

/kind failing-test
/assign
/cc @AishSundar @jberkus @saad-ali @msau42
/sig storage

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Nov 19, 2018
@davidz627 davidz627 changed the title CSI Upgrade Tests Failing CSI Upgrade Tests Failing for hostPath and gcePD Nov 19, 2018
@AishSundar
Copy link
Contributor

The specific upgrade job these tests are failing on, upgrades the master and nodes to 1.13 and run the old version's tests i.e., in this case runs 1.11 tests against the 1.13 cluster

If a test job name ends with upgrade-cluster, it means we first upgrade the cluster (i.e. master and nodes) and then run the old test suite with new kubectl.

More info on upgrade tests at https://github.com/kubernetes/community/blob/master/contributors/devel/e2e-tests.md#version-skewed-and-upgrade-testing

@msau42
Copy link
Member

msau42 commented Nov 19, 2018

Do we also have skew tests that only upgrade master, but keeps the nodes on the older version? If so, we need to detect that case too and use the old 0.3 drivers in that scenario.

@liggitt
Copy link
Member

liggitt commented Nov 20, 2018

am I understanding correctly that there is no overlap in supported CSI plugin APIs between 1.12 and 1.13?

@AishSundar
Copy link
Contributor

quoting @saad-ali from #65246 (comment)

Kubernetes v1.13 speaks CSI v1.0 and no longer understands CSI v0.x. Any CSI drivers deployed on previous versions of Kubernetes speaking CSI v0.x must removed, updated to be CSI 1.0 compatible, and redeployed on the on Kubernetes v1.13.

If only the master is upgraded to 1.13 (and nodes are 1.12 or earlier) older drivers will continue to work. When nodes are upgraded to 1.13, then older drivers will stop working.

So yes looks like users will see this exact same issue as this CI signal if the entire cluster is upgraded to 1.13. But I will let @saad-ali talk to this more. Also should we communicate this more broadly than within Release notes. This test failure did manage to raise a few eyebrows during the burndown, so I am wondering if this compatibility expectation is clearly understood/ communicated.

@saad-ali
Copy link
Member

saad-ali commented Nov 20, 2018

CSI 1.0 is not backwards compatible with CSI 0.x. Kubernetes v1.13 adds support for CSI 1.0 while dropping support for CSI 0.x. Which means it is not backwards compatible with CSI 0.x drivers.

A vendor may choose to make a version of a CSI driver that supports both versions.

If they do not, the disruptive upgrade path for Kubernetes cluster admins is to remove the old CSI drivers from the cluster before upgrading. Upgrade master and node to k8s v1.13, and wait until master and all nodes are upgraded to 1.13, and then deploy the new CSI 1.0 compatible version of the driver.

This approach will be disruptive for the duration of the cluster upgrade as it would mean that during the upgrade no volumes of that type can be provisioned/attached/mounted/etc. After the cluster/driver upgrade things should just start working again with the existing PV/PVCs/objects (unless the driver name changed, or the driver had some other change between 0.3 and 1.0 that prevents it from operating on the old PV/PVC/Storageclass objects).

A possible non-disruptive upgrade path is being investigated here: #71282

Regardless, in the CSI community it has been well communicated (and in my opinion well understood) that CSI 0.x will have breaking changes from release to release. The intention has been to treat CSI 0.x as a chance for developing and testing drivers before we arriving at a stable API. And as far as I know that is the case with the CSI drivers that I am aware of (https://kubernetes-csi.github.io/docs/Drivers.html) -- none of them are currently production ready.

The intended path forward is to document the incompatibility as a known (and expected) issue for 1.13. And document possible upgrade strategies. The existing test have been modified to support disruptive upgrade (PR #71241). We're looking in to also adding a new upgrade test that exercises and validates a possible non-disruptive upgrade strategy (issue #71282).

Moving forward, we intend for all all CSI 1.x release to be backwards compatible to 1.0 and support it as such in Kubernetes.

CC @thockin @liggitt @AishSundar

@liggitt
Copy link
Member

liggitt commented Nov 20, 2018

the disruptive upgrade path for Kubernetes cluster admins is to remove the old CSI drivers from the cluster before upgrading. Upgrade master and node to k8s v1.13, and wait until master and all nodes are upgraded to 1.13, and then deploy the new CSI 1.0 compatible version of the driver.

Just to explore how disruptive that would be, aren't nodes required to be drained as part of upgrade? if so, when new replicas of those workloads attempt to get scheduled to other nodes, any that use CSI volumes will not be able to run because those volumes could not be attached/mounted, correct?

@saad-ali
Copy link
Member

Just to explore how disruptive that would be, aren't nodes required to be drained as part of upgrade? if so, when new replicas of those workloads attempt to get scheduled to other nodes, any that use CSI volumes will not be able to run because those volumes could not be attached/mounted, correct?

Correct, it is suggested that pods be drained from nodes before a node is upgraded. And if a CSI driver is uninstalled before upgrade, then the pods depending on volumes using that driver will remain in pending state (won't attach/mount) until the upgrade is complete and a compatible version of the driver is installed.

@msau42
Copy link
Member

msau42 commented Nov 20, 2018

This might be tricky for pods that are managed by higher level controllers that will automatically restart the pods

@saad-ali saad-ali added this to the v1.13 milestone Nov 20, 2018
@saad-ali
Copy link
Member

@AishSundar @liggitt I spoke with @thockin. He is not happy with the backwards comparability break. He wants us to look in to what it will take to support both 0.x and 1.0. I will look in to that. And update this thread on feasibility. Until this is resolved consider this a release blocking issue.

/priority critical-urgent
/milestone v1.13

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 20, 2018
@AishSundar
Copy link
Contributor

@saad-ali thanks for the update, will watch this thread.

@saad-ali
Copy link
Member

We discussed this at the release meeting today. Plan of record is that we will try to get PR #71314 approved/merged by EOD. Once it is merged, we'll monitor for regressions over the long weekend. On Monday we will assess the impact and next steps. As a backup option we will continue to investigate a non-disruptive manual intervention upgrade option (issue #71282).

@AishSundar
Copy link
Contributor

/reopen

Reopening until we get green CI. @jberkus ^^

@k8s-ci-robot
Copy link
Contributor

@AishSundar: Reopened this issue.

In response to this:

/reopen

Reopening until we get green CI. @jberkus ^^

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 22, 2018
@jberkus
Copy link

jberkus commented Nov 23, 2018

We're getting a lot of flakes across the longer-running tests today, so at this point we don't know if things are passing or not.

@liggitt
Copy link
Member

liggitt commented Nov 23, 2018

the CSI tests in question turned green after #71314 was merged:

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster&include-filter-by-regex=.*CSI.*&width=20

This particular issue can be closed

@jberkus
Copy link

jberkus commented Nov 23, 2018

@liggitt @saad-ali We're still getting various CSI failures in new-master-upgrade-parallel:

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new-parallel&include-filter-by-regex=.*CSI.*&width=20

Are those unrelated to this issue? They seem to be all subpath failures.

@liggitt
Copy link
Member

liggitt commented Nov 23, 2018

Are those unrelated to this issue?

Flakes are unrelated to this issue. This issue was fixing a hard fail.

@jberkus
Copy link

jberkus commented Nov 23, 2018

OK, closing this and opening a new issue for the flakes.

/close

@k8s-ci-robot
Copy link
Contributor

@jberkus: Closing this issue.

In response to this:

OK, closing this and opening a new issue for the flakes.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants