Improve test reliability #1960
Conversation
I ran the full test suite locally 8 times to test this. All tests passed in each of the 8 runs, with runtimes ranging from 60 to 100 seconds.
Force-pushed from 167e0f4 to 83c18dc
So, somehow the recorded tests take 10m to pass in CI... That's still faster than they were, but feels slow compared to what I'm seeing locally...
I'm not sure the GVK changes are safe to apply only when running tests; they might make tests pass that should fail.
@@ -137,7 +137,7 @@ func (r *AzureDeploymentReconciler) Reconcile(ctx context.Context) (ctrl.Result, error) {
 }

 func (r *AzureDeploymentReconciler) CreateOrUpdate(ctx context.Context) (ctrl.Result, error) {
-	r.logObj("reconciling resource")
+	r.logObj("reconciling resource", r.obj)
😃
Codecov Report
@@ Coverage Diff @@
## main #1960 +/- ##
==========================================
- Coverage 56.98% 56.96% -0.02%
==========================================
Files 404 405 +1
Lines 80870 81028 +158
==========================================
+ Hits 46082 46157 +75
- Misses 28911 28989 +78
- Partials 5877 5882 +5
Continue to review full report at Codecov.
Force-pushed from 6c8fa01 to 8b940a0
@theunrepentantgeek, you said:
I'm not 100% sure I follow. AFAIK, the GVK "additions" in the test client are just there to make the direct-to-APIServer (no cache) client behave the same as the cached client already does. So it makes tests pass that should pass (without it, they fail with the direct client but pass with the cached client due to the behavior difference). Can you expand on what scenarios it might make something pass that wasn't passing before (with the cached client)?
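For readers following along, here is a minimal sketch of what "adding the GVK" in a test client can look like: wrapping the direct client so the objects it returns carry their GroupVersionKind, the way the cache-backed client's results already do. This is not code from this PR; the name gvkFixupClient is made up, and the Get signature assumes a controller-runtime release of roughly the v0.11 era.

```go
// Illustrative only: not this PR's implementation.
package testclient

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/apiutil"
)

// gvkFixupClient wraps a controller-runtime client so that objects
// returned from Get carry their GroupVersionKind, matching what the
// cache-backed client already returns.
type gvkFixupClient struct {
	client.Client
	scheme *runtime.Scheme
}

// Get delegates to the wrapped client, then restores TypeMeta, which
// the direct (uncached) client leaves empty. (Newer controller-runtime
// releases add variadic GetOptions to this signature.)
func (c *gvkFixupClient) Get(ctx context.Context, key client.ObjectKey, obj client.Object) error {
	if err := c.Client.Get(ctx, key, obj); err != nil {
		return err
	}
	gvk, err := apiutil.GVKForObject(obj, c.scheme)
	if err != nil {
		return err
	}
	obj.GetObjectKind().SetGroupVersionKind(gvk)
	return nil
}
```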
* Use a KubeClient in EnvTest that always hits the APIServer, to avoid cache latency and inconsistency issues that can cause races and intermittent test failures. See kubernetes-sigs/controller-runtime#1464 and kubernetes-sigs/controller-runtime#343 for details.
* Write Status after Spec. This ensures that tests waiting for a status update cannot see it so early that they go on to perform a write that conflicts with the controller's Spec write.
* Improve log messages to be clearer (aids in test debugging).
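As a rough illustration of the first bullet above (an assumed setup, not this PR's actual test code): in EnvTest, a client built straight from the rest.Config always reads from the API server, unlike mgr.GetClient(), which serves reads from the manager's cache and can lag behind recent writes.

```go
// Illustrative only: not this PR's actual test setup.
package testsetup

import (
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// newDirectClient starts an EnvTest API server and returns a client
// that always talks to it directly, bypassing any informer cache.
func newDirectClient(scheme *runtime.Scheme) (client.Client, *envtest.Environment, error) {
	testEnv := &envtest.Environment{} // CRD directories etc. omitted for brevity
	cfg, err := testEnv.Start()
	if err != nil {
		return nil, nil, err
	}
	// client.New hits the API server on every read, so tests see their
	// own writes immediately; mgr.GetClient() would read from the
	// (possibly stale) cache instead.
	c, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		return nil, nil, err
	}
	return c, testEnv, nil
}
```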
Prior to performing any deletion actions (either monitoring or deletion), verify that the finalizer is still set. If it's not set, immediately allow the k8s resource to be deleted. Don't take any action on the Azure resource in this instance.
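A minimal sketch of the delete-path guard described above, under assumed names (finalizerName, reconcileDelete, and the finalizer string are illustrative, not the reconciler's real identifiers):

```go
// Illustrative only: hypothetical names, not the reconciler's real code.
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const finalizerName = "serviceoperator.azure.com/finalizer" // assumed value

// reconcileDelete only touches Azure if our finalizer is still present;
// otherwise it lets Kubernetes finish deleting the object untouched.
func reconcileDelete(ctx context.Context, obj client.Object) (ctrl.Result, error) {
	if !controllerutil.ContainsFinalizer(obj, finalizerName) {
		// Nothing to clean up: allow the k8s resource to be deleted and
		// take no action against the Azure resource.
		return ctrl.Result{}, nil
	}

	// Otherwise: start or monitor the Azure-side deletion, and remove the
	// finalizer once it completes (omitted here).
	return ctrl.Result{}, nil
}
```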
Force-pushed from 8b940a0 to add957b
I had a concern that code dependent on having the GVK would pass in test but fail in production because GVK was only reliably present during testing. If the client used during production runs will also reliably provide the GVK then my concerns are moot.
This looks good, but I'm worried that the finalizer change might make the tests more stable at the risk of the occasional lost resource.
-	changed, err := r.AddInitialResourceState(armResource.GetID())
+	err = r.AddInitialResourceState(armResource.GetID())
At this point the resource hasn't been updated back to etcd with the finalizer, right? It seems like there's a chance we'll issue the creation in ARM but the k8s save fails for some reason so we lose track of the ARM resource. Doing another loop seems like it would be safer, but is that the thing we're trying to avoid for test reliability? I don't think it is but I might be misunderstanding.
Nope, you're right. This is probably a longstanding bug we've had. I'll fix this but do it in a different PR as it's pretty unrelated to tests.
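For context on the fix being deferred to a follow-up PR, a hedged sketch of the safer ordering (illustrative only, with assumed names; not the eventual fix): commit the finalizer to etcd first and requeue, so the ARM creation only happens on a reconcile that is guaranteed to have the finalizer persisted.

```go
// Illustrative only: a sketch of the ordering, not the follow-up fix itself.
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const finalizerName = "serviceoperator.azure.com/finalizer" // assumed value

// ensureFinalizerThenCreate persists the finalizer before any ARM call,
// so a failed Kubernetes write can never strand an ARM resource.
func ensureFinalizerThenCreate(ctx context.Context, kubeClient client.Client, obj client.Object) (ctrl.Result, error) {
	if !controllerutil.ContainsFinalizer(obj, finalizerName) {
		controllerutil.AddFinalizer(obj, finalizerName)
		// Commit the finalizer to etcd before touching ARM.
		if err := kubeClient.Update(ctx, obj); err != nil {
			return ctrl.Result{}, err
		}
		// Requeue: the ARM creation happens on a later reconcile that is
		// guaranteed to see the persisted finalizer.
		return ctrl.Result{Requeue: true}, nil
	}

	// Finalizer is known to be persisted; safe to issue the ARM creation
	// (call omitted).
	return ctrl.Result{}, nil
}
```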
Force-pushed from add957b to 92d81dc