
fix(vmclone): delete snapshot and restore after pvc bound #10888

Merged
merged 2 commits into from
Jan 26, 2024

Conversation

fossedihelm
Contributor

@fossedihelm fossedihelm commented Dec 15, 2023

The clone process involves vmsnapshot and vmrestore.
Currently, both are deleted as soon as the restore is finished. If the source VM has a PVC, the vmrestore process creates another PVC pointing to the VolumeSnapshot previously created by the vmsnapshot.
With the Immediate binding mode, the created PVC(s) are bound immediately.
When the WaitForFirstConsumer binding mode is used, the PVC stays in the "Pending" state until a pod (the virt-launcher pod) is created.
In the latter case, when the cloned VM is started, it gets stuck in "WaitingForVolumeBinding" because the PVC binding fails: the snapshot source has already been deleted. In other words, we cannot delete the snapshot resources until all the PVCs are Bound.

This patch delays the snapshot and restore deletion until we are sure that all the PVCs are bound.
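The gating described above can be sketched as a minimal Go snippet. The `PVC` struct and the function names here are illustrative stand-ins, not the actual KubeVirt controller code (which inspects `corev1.PersistentVolumeClaim` status):

```go
package main

import "fmt"

// Illustrative stand-in for a restored PVC; the real controller inspects
// the Status.Phase of a corev1.PersistentVolumeClaim.
type PVC struct {
	Name  string
	Phase string // "Pending", "Bound", "Lost"
}

// allPVCsBound reports whether every restored PVC has reached "Bound".
func allPVCsBound(pvcs []PVC) bool {
	for _, pvc := range pvcs {
		if pvc.Phase != "Bound" {
			return false
		}
	}
	return true
}

// cleanupCloneResources deletes the vmsnapshot and vmrestore only once all
// PVCs are bound. With WaitForFirstConsumer the PVCs stay "Pending" until
// the virt-launcher pod starts, so cleanup is deferred until then.
func cleanupCloneResources(pvcs []PVC, deleteSnapshot, deleteRestore func()) bool {
	if !allPVCsBound(pvcs) {
		return false // requeue: deleting now would break the pending bind
	}
	deleteSnapshot()
	deleteRestore()
	return true
}

func main() {
	pvcs := []PVC{{Name: "clone-pvc", Phase: "Pending"}}
	noop := func() {}
	fmt.Println(cleanupCloneResources(pvcs, noop, noop)) // false: PVC still Pending
	pvcs[0].Phase = "Bound"
	fmt.Println(cleanupCloneResources(pvcs, noop, noop)) // true: safe to delete
}
```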

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #10832
jira-ticket: https://issues.redhat.com/browse/CNV-32667
Special notes for your reviewer:

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

[Bugfix] Clone VM with WaitForFirstConsumer binding mode PVC now works.

@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Dec 15, 2023
Contributor

@akalenyu akalenyu left a comment

Looks good to me, just a bit concerned about testing

Should we start talking about a periodic with ceph-wffc so we don't miss this kind of stuff in the future? ofc this would be effort outside of this PR and owned by sig-storage

pkg/virt-controller/watch/clone/clone_test.go
@fossedihelm
Contributor Author

Looks good to me, just a bit concerned about testing

Should we start talking about a periodic with ceph-wffc so we don't miss this kind of stuff in the future? ofc this would be effort outside of this PR and owned by sig-storage

Thanks @akalenyu for the review!
I still don't have a strong opinion on this. I think we should probably start thinking about it, especially if there are many places where WFFC behaves differently from the Immediate binding mode.

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Dec 19, 2023
@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 3, 2024
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 3, 2024
@fossedihelm
Contributor Author

/retest

1 similar comment
@acardace
Member

acardace commented Jan 8, 2024

/retest

expectEvent(TargetVMCreated)
})

It("if not all the PVCs are bound, nothing should happen", func() {
Contributor

@fossedihelm Hi! Just want to make sure: if you run this test without the changes in the product code, does it reproduce the issue?

Contributor Author

Correct :)

Contributor

@enp0s3 enp0s3 left a comment

/approve

Thanks @fossedihelm !

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enp0s3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 10, 2024
Member

@xpivarc xpivarc left a comment

Can we get a functional test?

pkg/virt-controller/watch/clone/clone.go
The clone process involves `vmsnapshot` and `vmrestore`.
Currently, both are deleted as soon as the restore is finished.
If the source VM has a PVC, the `vmrestore` process creates another PVC pointing
to the VolumeSnapshot previously created by the `vmsnapshot`.
With the `Immediate` binding mode the created PVC(s) are bound immediately.
When the `WaitForFirstConsumer` binding mode is used, the PVC stays in the "Pending" state
until a pod (the virt-launcher pod) is created.
In the latter case, when the cloned VM is started, it gets stuck in "WaitingForVolumeBinding"
because the related PVC binding fails, since the snapshot source has already been deleted.
In other words, we cannot delete the snapshot resources until all the PVCs are Bound.

This patch delays the snapshot and restore deletion until we are sure that all
the PVCs are bound.

Signed-off-by: fossedihelm <ffossemo@redhat.com>
If a PVC associated with a VM clone uses a storage class
with WFFC binding mode, the vmrestore and vmsnapshot
resources should be deleted only after the target PVCs are
bound. This commit covers this case with an e2e test.

Signed-off-by: fossedihelm <ffossemo@redhat.com>
@fossedihelm
Contributor Author

@xpivarc I added the functional test and rebased. PTAL Thanks!

@kbidarkar
Contributor

/retest

@@ -526,6 +528,10 @@ var _ = Describe("[Serial]VirtualMachineClone Tests", Serial, func() {
vm, err := virtClient.VirtualMachine(vm.Namespace).Create(context.Background(), vm)
Expect(err).ToNot(HaveOccurred())

if !running && libstorage.IsStorageClassBindingModeWaitForFirstConsumer(storageClass) {
Member

@akalenyu @awels
Should we just extend HaveSucceeded, or what is the expected state?

Contributor Author

IMO we should not extend HaveSucceeded, since that would "hide" the WFFC binding mode behavior.
The createVMWithStorageClass function should handle the different behavior between the storage classes.

Contributor

I would prefer the more verbose variant here:
844e69b#diff-847cd06c1bf6ef597ea96894f8b32ecb4d074ad1bc785b39871e11fb6ab71b3dR538

(if statement can be dropped)
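The per-binding-mode expectation debated in this thread can be sketched in plain Go. The function and constant names below are hypothetical, for illustration only; the real test helper works against the Kubernetes StorageClass API:

```go
package main

import "fmt"

// BindingMode mirrors the Kubernetes StorageClass volumeBindingMode values.
type BindingMode string

const (
	Immediate            BindingMode = "Immediate"
	WaitForFirstConsumer BindingMode = "WaitForFirstConsumer"
)

// expectedPVCPhase is a hypothetical helper: it returns the phase a freshly
// created PVC should settle in, given the storage class binding mode and
// whether a consumer pod (the virt-launcher) is running.
func expectedPVCPhase(mode BindingMode, vmRunning bool) string {
	if mode == WaitForFirstConsumer && !vmRunning {
		return "Pending" // no consumer yet, so the PVC stays unbound
	}
	return "Bound"
}

func main() {
	fmt.Println(expectedPVCPhase(WaitForFirstConsumer, false)) // Pending
	fmt.Println(expectedPVCPhase(Immediate, false))            // Bound
}
```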

Comment on lines +790 to +797
Consistently(func(g Gomega) {
    vmSnapshot, err := virtClient.VirtualMachineSnapshot(vmClone.Namespace).Get(context.Background(), *vmSnapshotName, v1.GetOptions{})
    g.Expect(err).ShouldNot(HaveOccurred())
    g.Expect(vmSnapshot).ShouldNot(BeNil())
    vmRestore, err := virtClient.VirtualMachineRestore(vmClone.Namespace).Get(context.Background(), *vmRestoreName, v1.GetOptions{})
    g.Expect(err).ShouldNot(HaveOccurred())
    g.Expect(vmRestore).ShouldNot(BeNil())
}, 30*time.Second).Should(Succeed(), "vmsnapshot and vmrestore should not be deleted until the pvc is bound")
Member

I do wonder if we need this. Isn't it enough to check a phase?

Contributor Author

Here I want to be sure the vmrestore and the vmsnapshot are not deleted until the PVC is bound (by running the VM).
We can check the phase, but then I should do it Consistently IMO. WDYT? Thanks

Comment on lines +802 to +807
Eventually(func(g Gomega) {
    _, err := virtClient.VirtualMachineSnapshot(vmClone.Namespace).Get(context.Background(), *vmSnapshotName, v1.GetOptions{})
    g.Expect(errors.IsNotFound(err)).Should(BeTrue())
    _, err = virtClient.VirtualMachineRestore(vmClone.Namespace).Get(context.Background(), *vmRestoreName, v1.GetOptions{})
    g.Expect(errors.IsNotFound(err)).Should(BeTrue())
}, 1*time.Minute).Should(Succeed(), "vmsnapshot and vmrestore should be deleted once the pvc is bound")
Member

This can be replaced by using BeGone but you do provide the description so I think this is enough.

Contributor Author

I have several follow-up PRs, I will adjust it later if you agree. Thanks

@xpivarc
Member

xpivarc commented Jan 24, 2024

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 24, 2024
@kubevirt-commenter-bot

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-kind-1.27-sriov
/test pull-kubevirt-e2e-k8s-1.27-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.27-sig-network
/test pull-kubevirt-e2e-k8s-1.27-sig-storage
/test pull-kubevirt-e2e-k8s-1.27-sig-compute
/test pull-kubevirt-e2e-k8s-1.27-sig-operator
/test pull-kubevirt-e2e-k8s-1.28-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-storage
/test pull-kubevirt-e2e-k8s-1.28-sig-compute
/test pull-kubevirt-e2e-k8s-1.28-sig-operator

@kbidarkar
Contributor

/retest-required

1 similar comment
@acardace
Member

/retest-required

@fossedihelm
Contributor Author

/cherrypick release-1.1

@kubevirt-bot
Contributor

@fossedihelm: once the present PR merges, I will cherry-pick it on top of release-1.1 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fossedihelm
Contributor Author

/retest-required

2 similar comments
@fossedihelm
Contributor Author

/retest-required

@fossedihelm
Contributor Author

/retest-required

@acardace
Member

/retest

@brianmcarey
Member

/override pull-kubevirt-e2e-k8s-1.29-sig-monitoring

This lane was triggered by prow statusreconciler but is not required for this PR

@kubevirt-bot
Contributor

@brianmcarey: Overrode contexts on behalf of brianmcarey: pull-kubevirt-e2e-k8s-1.29-sig-monitoring

In response to this:

/override pull-kubevirt-e2e-k8s-1.29-sig-monitoring

This lane was triggered by prow statusreconciler but is not required for this PR

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fossedihelm
Contributor Author

/retest-required

@kubevirt-bot kubevirt-bot merged commit 888cbd3 into kubevirt:main Jan 26, 2024
39 of 40 checks passed
@kubevirt-bot
Contributor

@fossedihelm: new pull request created: #11085

In response to this:

/cherrypick release-1.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Successfully merging this pull request may close these issues:

Failed to Clone VM When DataVolume's Type is WaitForFirstConsumer