Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve test reliability #1960

Merged
merged 3 commits into from
Nov 8, 2021

Conversation

matthchr
Copy link
Member

@matthchr matthchr commented Nov 5, 2021

How does this PR make you feel:
gif

If applicable:

  • this PR contains documentation
  • this PR contains tests

@matthchr
Copy link
Member Author

matthchr commented Nov 5, 2021

I ran the full test suite locally 8 times to test this. Tests all passed each of the 8 times, with runtime ranging from 60 to 100s.

@matthchr matthchr force-pushed the feature/more-test-improvements branch 2 times, most recently from 167e0f4 to 83c18dc Compare November 5, 2021 22:30
@matthchr
Copy link
Member Author

matthchr commented Nov 5, 2021

So, somehow the recorded tests take 10m to pass in CI... That's still faster than they were but feels slow compared to what I'm seeing locally...
@Porges any idea what spec machines we get in actions?

Copy link
Member

@theunrepentantgeek theunrepentantgeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the GVK changes are safe to have only when running tests; they might make tests run that should fail.

@@ -137,7 +137,7 @@ func (r *AzureDeploymentReconciler) Reconcile(ctx context.Context) (ctrl.Result,
}

func (r *AzureDeploymentReconciler) CreateOrUpdate(ctx context.Context) (ctrl.Result, error) {
r.logObj("reconciling resource")
r.logObj("reconciling resource", r.obj)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😃

v2/internal/reconcilers/azure_generic_arm_reconciler.go Outdated Show resolved Hide resolved
v2/internal/testcommon/direct_client.go Show resolved Hide resolved
@codecov-commenter
Copy link

Codecov Report

Merging #1960 (6c8fa01) into main (05179ef) will decrease coverage by 0.01%.
The diff coverage is 63.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1960      +/-   ##
==========================================
- Coverage   56.98%   56.96%   -0.02%     
==========================================
  Files         404      405       +1     
  Lines       80870    81028     +158     
==========================================
+ Hits        46082    46157      +75     
- Misses      28911    28989      +78     
- Partials     5877     5882       +5     
Impacted Files Coverage Δ
...a1api20210515storage/database_account_types_gen.go 10.34% <ø> (ø)
...alpha1api20211101/namespaces_eventhub_types_gen.go 53.54% <0.00%> (-0.35%) ⬇️
...a1api20201101/virtual_network_gateway_types_gen.go 44.48% <0.00%> (-0.04%) ⬇️
...k/v1alpha1api20201101/virtual_network_types_gen.go 52.20% <0.00%> (-0.41%) ⬇️
...ork/v1alpha1api20201101/load_balancer_types_gen.go 55.98% <35.00%> (-0.25%) ⬇️
v2/internal/testcommon/direct_client.go 47.05% <47.05%> (ø)
.../v1alpha1api20210515/database_account_types_gen.go 56.41% <47.61%> (ø)
v2/tools/generator/internal/astbuilder/builder.go 86.78% <75.00%> (-0.66%) ⬇️
...api20201201/virtual_machine_scale_set_types_gen.go 55.23% <79.16%> (+0.09%) ⬆️
...gen/pipeline/convert_allof_and_oneof_to_objects.go 72.31% <79.41%> (+0.66%) ⬆️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 268b9a2...6c8fa01. Read the comment docs.

@matthchr matthchr force-pushed the feature/more-test-improvements branch from 6c8fa01 to 8b940a0 Compare November 8, 2021 17:39
@matthchr
Copy link
Member Author

matthchr commented Nov 8, 2021

@theunrepentantgeek, you said:

I'm not sure if the GVK changes are safe to have only when running tests; they might make tests run that should fail.

I'm not 100% sure I follow. AFAIK, the GVK "additions" in the test client are just to make the direct to APIServer (no cache) client behave the same as the other client already does. So it makes tests pass that should pass (without it, they fail with the direct client but pass with the cached client due to the behavior difference).

Can you expand upon what scenarios it might make something pass that wasn't before (with the cached client?)

 * Use a KubeClient in EnvTest that always hits APIServer to avoid
   cache latency and inconsistency issues that can cause races and
   intermittent test failures. See
   kubernetes-sigs/controller-runtime#1464
   and
   kubernetes-sigs/controller-runtime#343
   for details.
 * Write Status after Spec. This ensures that tests waiting for a
   status update cannot possibly see it so fast that they go on
   to perform a write that conflicts with the Spec write of the
   controller.
 * Improve log messages to be clearer (aids in test debugging).
Prior to performing any deletion actions (either monitoring or
deletion), verify that the finalizer is still set. If it's not set,
immediately allow the k8s resource to be deleted. Don't take any action
on the Azure resource in this instance.
@matthchr matthchr force-pushed the feature/more-test-improvements branch from 8b940a0 to add957b Compare November 8, 2021 18:29
@theunrepentantgeek
Copy link
Member

Can you expand upon what scenarios it might make something pass that wasn't before (with the cached client?)

I had a concern that code dependent on having the GVK would pass in test but fail in production because GVK was only reliably present during testing. If the client used during production runs will also reliably provide the GVK then my concerns are moot.

Copy link
Member

@babbageclunk babbageclunk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but I'm worried that the finalizer change might make the tests more stable at the risk of the occasional lost resource.

v2/internal/controllers/generic_controller.go Outdated Show resolved Hide resolved

changed, err := r.AddInitialResourceState(armResource.GetID())
err = r.AddInitialResourceState(armResource.GetID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point the resource hasn't been updated back to etcd with the finalizer, right? It seems like there's a chance we'll issue the creation in ARM but the k8s save fails for some reason so we lose track of the ARM resource. Doing another loop seems like it would be safer, but is that the thing we're trying to avoid for test reliability? I don't think it is but I might be misunderstanding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, you're right. This is probably a longstanding bug we've had. I'll fix this but do it in a different PR as it's pretty unrelated to tests.

@matthchr matthchr force-pushed the feature/more-test-improvements branch from add957b to 92d81dc Compare November 8, 2021 22:37
@matthchr matthchr merged commit 617de6b into Azure:main Nov 8, 2021
@matthchr matthchr deleted the feature/more-test-improvements branch November 8, 2021 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants