-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add test to detach a pd whose node was deleted #36009
Conversation
Jenkins unit/integration failed for commit 42191d723cc89bcca6cd79527a1fac8707d68149. Full PR test history. The magic incantation to run this job again is |
Jenkins GCI GCE e2e failed for commit 42191d723cc89bcca6cd79527a1fac8707d68149. Full PR test history. The magic incantation to run this job again is |
Jenkins GKE smoke e2e failed for commit 42191d723cc89bcca6cd79527a1fac8707d68149. Full PR test history. The magic incantation to run this job again is |
Jenkins GCE etcd3 e2e failed for commit 42191d723cc89bcca6cd79527a1fac8707d68149. Full PR test history. The magic incantation to run this job again is |
Jenkins verification failed for commit 571d3cfb15831eb00cd21d7e2b778f5dfde27a94. Full PR test history. The magic incantation to run this job again is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to reuse existing code instead of writing your own resize routine.
Also run the test back to back for a couple hours and to test how stable it is. You can use this shell script (just replace with your test name)--it will repeatedly run the same test until failure.
|
||
By("deleting host0") | ||
|
||
output, err = exec.Command("gcloud", "compute", "instances", "delete", string(host0Name), "--project="+framework.TestContext.CloudConfig.ProjectID, "--zone="+framework.TestContext.CloudConfig.Zone).CombinedOutput() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of writing your own cluster resizing routine, reuse the existing code in e2e/resize_nodes.go
:
By(fmt.Sprintf("decreasing cluster size to %d", initialGroupSize-1))
err = ResizeGroup(group, initialGroupSize-1)
Expect(err).NotTo(HaveOccurred())
err = WaitForGroupSize(group, initialGroupSize-1)
Expect(err).NotTo(HaveOccurred())
err = framework.WaitForClusterSize(c, int(initialGroupSize-1), 10*time.Minute)
Expect(err).NotTo(HaveOccurred())
Maybe stick it in a common method ResizeCluster(newSize int)
.
Also immediately prior to resizing make sure to call defer...
that reverts the cluster to the original size (even if there is a failure). Again you can probably reuse the code from AfterEach
in ResizeCluster(newSize int)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ResizeCluster(newsize int) picks a random node to shut down which may or may not help me with my test.
What I wanted to do was to shut down a specific node on which the volume was mounted on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. There is the potential for things not going quite as planned here since the managed-instance-group recreates the missing node. But I don't see a good way to have the MIG kill only the node that we want instead of a random one.
Pet set recreate should recreate evicted petset,hurf,1 | ||
PetSet Basic PetSet functionality should handle healthy pet restarts during scale,kevin-wangzefeng,1 | ||
PetSet Basic PetSet functionality should provide basic identity,girishkalele,1 | ||
PetSet Deploy clustered applications should creating a working mysql cluster,piosz,1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's up with these other tests being added in the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#36663 is fixing this looks like this file was out of date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM just run this in a loop for a couple hours and see how flaky it is.
|
||
By("deleting host0") | ||
|
||
output, err = exec.Command("gcloud", "compute", "instances", "delete", string(host0Name), "--project="+framework.TestContext.CloudConfig.ProjectID, "--zone="+framework.TestContext.CloudConfig.Zone).CombinedOutput() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. There is the potential for things not going quite as planned here since the managed-instance-group recreates the missing node. But I don't see a good way to have the MIG kill only the node that we want instead of a random one.
Ran this 512 times grep -i "1 passed" gce-pd-out.out | wc -l |
f1f905f
to
6827be6
Compare
Initially the cluster size wasn't able to come back to the original size. I have increased the timeout in ResizeNodes() so that there is enough time for the all nodes to come back. Wasn't able to get to this earlier as there were other higher priority tasks to deal with. |
1eaaa58
to
2545243
Compare
Jenkins Kubemark GCE e2e failed for commit 1eaaa5897f33c5e30ea67f2a6860c51867d8de6c. Full PR test history. The magic incantation to run this job again is |
Jenkins GCI GKE smoke e2e failed for commit 1eaaa5897f33c5e30ea67f2a6860c51867d8de6c. Full PR test history. The magic incantation to run this job again is |
Jenkins GCE Node e2e failed for commit 1eaaa5897f33c5e30ea67f2a6860c51867d8de6c. Full PR test history. The magic incantation to run this job again is |
Jenkins kops AWS e2e failed for commit 1eaaa5897f33c5e30ea67f2a6860c51867d8de6c. Full PR test history. The magic incantation to run this job again is |
/lgtm |
Jenkins GCE e2e failed for commit f67b495. Full PR test history. The magic incantation to run this job again is |
@k8s-bot cvm gce e2e test this |
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue |
What this PR does / why we need it:
A test for the following issue :
If a node with a GCE PD attached is deleted (before the volume is detached), subsequent attempts by the attach/detach controller to detach it should not fail.
Bonus :Added additional code to ensure that the pd can still be attached to a different node.
Edit : Removed it as it was making the test much slower.
#29358
This change is