Improve test reliability #1960
Conversation
I ran the full test suite locally 8 times to test this. All tests passed in each of the 8 runs, with runtimes ranging from 60 to 100 seconds.
Force-pushed from 167e0f4 to 83c18dc
So, somehow the recorded tests take 10m to pass in CI... That's still faster than they were, but feels slow compared to what I'm seeing locally...
I'm not sure the GVK changes are safe to apply only when running tests; they might make tests pass that should fail.
@@ -137,7 +137,7 @@ func (r *AzureDeploymentReconciler) Reconcile(ctx context.Context) (ctrl.Result, error) {
 }

 func (r *AzureDeploymentReconciler) CreateOrUpdate(ctx context.Context) (ctrl.Result, error) {
-	r.logObj("reconciling resource")
+	r.logObj("reconciling resource", r.obj)
😃
Codecov Report
@@ Coverage Diff @@
## main #1960 +/- ##
==========================================
- Coverage 56.98% 56.96% -0.02%
==========================================
Files 404 405 +1
Lines 80870 81028 +158
==========================================
+ Hits 46082 46157 +75
- Misses 28911 28989 +78
- Partials 5877 5882 +5
Continue to review full report at Codecov.
Force-pushed from 6c8fa01 to 8b940a0
@theunrepentantgeek, you said:
I'm not 100% sure I follow. AFAIK, the GVK "additions" in the test client are just there to make the direct-to-APIServer (no cache) client behave the same as the cached client already does. So it makes tests pass that should pass (without it, they fail with the direct client but pass with the cached client due to the behavior difference). Can you expand on what scenarios it might make something pass that wasn't passing before (with the cached client)?
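For readers following along, here is a minimal sketch of what "adding the GVK" in a test client can look like: wrapping the direct client so the objects it returns carry their GroupVersionKind, the way the cache-backed client's results already do. This is not code from this PR; the name gvkFixupClient is made up, and the Get signature assumes a controller-runtime release of roughly the v0.11 era.

```go
// Illustrative only: not this PR's implementation.
package testclient

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/apiutil"
)

// gvkFixupClient wraps a controller-runtime client so that objects
// returned from Get carry their GroupVersionKind, matching what the
// cache-backed client already returns.
type gvkFixupClient struct {
	client.Client
	scheme *runtime.Scheme
}

// Get delegates to the wrapped client, then restores TypeMeta, which
// the direct (uncached) client leaves empty. (Newer controller-runtime
// releases add variadic GetOptions to this signature.)
func (c *gvkFixupClient) Get(ctx context.Context, key client.ObjectKey, obj client.Object) error {
	if err := c.Client.Get(ctx, key, obj); err != nil {
		return err
	}
	gvk, err := apiutil.GVKForObject(obj, c.scheme)
	if err != nil {
		return err
	}
	obj.GetObjectKind().SetGroupVersionKind(gvk)
	return nil
}
```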
* Use a KubeClient in EnvTest that always hits the APIServer, to avoid cache latency and inconsistency issues that can cause races and intermittent test failures. See kubernetes-sigs/controller-runtime#1464 and kubernetes-sigs/controller-runtime#343 for details.
* Write Status after Spec. This ensures that tests waiting for a status update cannot see it so early that they go on to perform a write that conflicts with the controller's Spec write.
* Improve log messages to be clearer (aids in test debugging).
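As a rough illustration of the first bullet above (an assumed setup, not this PR's actual test code): in EnvTest, a client built straight from the rest.Config always reads from the API server, unlike mgr.GetClient(), which serves reads from the manager's cache and can lag behind recent writes.

```go
// Illustrative only: not this PR's actual test setup.
package testsetup

import (
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// newDirectClient starts an EnvTest API server and returns a client
// that always talks to it directly, bypassing any informer cache.
func newDirectClient(scheme *runtime.Scheme) (client.Client, *envtest.Environment, error) {
	testEnv := &envtest.Environment{} // CRD directories etc. omitted for brevity
	cfg, err := testEnv.Start()
	if err != nil {
		return nil, nil, err
	}
	// client.New hits the API server on every read, so tests see their
	// own writes immediately; mgr.GetClient() would read from the
	// (possibly stale) cache instead.
	c, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		return nil, nil, err
	}
	return c, testEnv, nil
}
```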
Prior to performing any deletion actions (either monitoring or deletion), verify that the finalizer is still set. If it's not set, immediately allow the k8s resource to be deleted. Don't take any action on the Azure resource in this instance.
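A minimal sketch of the delete-path guard described above, under assumed names (finalizerName, reconcileDelete, and the finalizer string are illustrative, not the reconciler's real identifiers):

```go
// Illustrative only: hypothetical names, not the reconciler's real code.
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const finalizerName = "serviceoperator.azure.com/finalizer" // assumed value

// reconcileDelete only touches Azure if our finalizer is still present;
// otherwise it lets Kubernetes finish deleting the object untouched.
func reconcileDelete(ctx context.Context, obj client.Object) (ctrl.Result, error) {
	if !controllerutil.ContainsFinalizer(obj, finalizerName) {
		// Nothing to clean up: allow the k8s resource to be deleted and
		// take no action against the Azure resource.
		return ctrl.Result{}, nil
	}

	// Otherwise: start or monitor the Azure-side deletion, and remove the
	// finalizer once it completes (omitted here).
	return ctrl.Result{}, nil
}
```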
Force-pushed from 8b940a0 to add957b
I had a concern that code dependent on having the GVK would pass in test but fail in production because GVK was only reliably present during testing. If the client used during production runs will also reliably provide the GVK then my concerns are moot.
This looks good, but I'm worried that the finalizer change might make the tests more stable at the risk of the occasional lost resource.
-	changed, err := r.AddInitialResourceState(armResource.GetID())
+	err = r.AddInitialResourceState(armResource.GetID())
At this point the resource hasn't been updated back to etcd with the finalizer, right? It seems like there's a chance we'll issue the creation in ARM but the k8s save fails for some reason so we lose track of the ARM resource. Doing another loop seems like it would be safer, but is that the thing we're trying to avoid for test reliability? I don't think it is but I might be misunderstanding.
Nope, you're right. This is probably a longstanding bug we've had. I'll fix this but do it in a different PR as it's pretty unrelated to tests.
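For context on the fix being deferred to a follow-up PR, a hedged sketch of the safer ordering (illustrative only, with assumed names; not the eventual fix): commit the finalizer to etcd first and requeue, so the ARM creation only happens on a reconcile that is guaranteed to have the finalizer persisted.

```go
// Illustrative only: a sketch of the ordering, not the follow-up fix itself.
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const finalizerName = "serviceoperator.azure.com/finalizer" // assumed value

// ensureFinalizerThenCreate persists the finalizer before any ARM call,
// so a failed Kubernetes write can never strand an ARM resource.
func ensureFinalizerThenCreate(ctx context.Context, kubeClient client.Client, obj client.Object) (ctrl.Result, error) {
	if !controllerutil.ContainsFinalizer(obj, finalizerName) {
		controllerutil.AddFinalizer(obj, finalizerName)
		// Commit the finalizer to etcd before touching ARM.
		if err := kubeClient.Update(ctx, obj); err != nil {
			return ctrl.Result{}, err
		}
		// Requeue: the ARM creation happens on a later reconcile that is
		// guaranteed to see the persisted finalizer.
		return ctrl.Result{Requeue: true}, nil
	}

	// Finalizer is known to be persisted; safe to issue the ARM creation
	// (call omitted).
	return ctrl.Result{}, nil
}
```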
Force-pushed from add957b to 92d81dc