clusterresolver: switch a couple of tests to e2e style #6394

easwars · 2023-06-23T15:43:10Z

This change helps with an upcoming change where I plan to switch the EDS watch to use the generic xDS client API.

#resource-agnostic-xdsclient-api

RELEASE NOTES: none

zasweq

Some comments.

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

zasweq · 2023-06-23T18:24:22Z

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

+
+// testChildPolicyBalancerBuilder wraps the priority LB policy (which is the
+// child policy for the cluster resolver LB policy), and makes certain events
+// available to the test (subConn state changes and config pushes).


Nit: instead of config pushes (which I assume to come from UpdateClientConnState) can you instead say ParseConfig(). I know it should come 1:1 (ParseConfig than UpdateClientConnState) but I had to fix an issue in cds where we were bypassing this API here: https://github.com/grpc/grpc-go/pull/6361/files#diff-20fc9eddb97bc2324ff2b3ca5bd2cada8c7f4ad23776d547ad08e09087699bb0R376.

Thanks for bringing this up. I had initially thought about overriding UpdateClientConnState because that seems to be the more appropriate thing to do, but decided on getting the config from ParseConfig, purely because of laziness. Now, I switched the code to rely on UpdateClientConnState instead. This seems more correct to me.

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

zasweq · 2023-06-23T18:30:58Z

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

+			case <-ctx.Done():
+				doneCh <- fmt.Errorf("test timeout expired when waiting for child policy to see a READY subConn")
+				return
+			case state := <-testChildPolicy.scStateCh:


Is this not racy? I.e. after spawning goroutine for this check, we're not guaranteed the processory actually hits this select, and the other operations from lines 421 - 428 can execute without this select running in a separate goroutine, and the line 362 hits, and we lose the Update?

Great find. Switched to using a buffer.Unbounded because I want two properties here:

no state updates should be lost (as you pointed out)

the UpdateSubConnState method should not block when writing on the channel because the test goroutine isn't reading anymore. This can be lead to test failures as well with the leakchecker.

I didn't switch the other channel to a buffer.Unbounded because we don't have that issue there (at least with the current usage).

Right, I was actually going to suggest this pattern. In order to not lose state, we need an arbitrary length buffer, and keep reading until READY received. So this change lgtm.

zasweq · 2023-06-23T18:33:59Z

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

+	select {
+	case jsonCfg := <-testChildPolicy.cfgCh:
+		gotCfg := &priority.LBConfig{}
+		if err := json.Unmarshal(jsonCfg, gotCfg); err != nil {
+			t.Fatalf("Failed to unmarshal LB config received by child policy: %v", err)
+		}
+		if diff := cmp.Diff(wantCfg, gotCfg); diff != "" {
+			t.Fatalf("Child policy received unexpected diff in config (-want +got):\n%s", diff)
+		}
+	case <-ctx.Done():
+		t.Fatalf("Timeout when waiting for child policy to receive its configuration")
+	}
+}


Unlike the test above, this check/recv is not in a goroutine. Is there a reason for the divergence? (is it this is balancer.Balancer API vs. balancer.ClientConn in test above) Also, similar question here, I don't get how this isn't racy. Are we guaranteed this select is running by the time we send on cfgCh on line 348? Or can that send come before this is running losing this check and making test flaky?

I don't need a goroutine here because I'm not performing an RPC here. Technically, we don't need to perform an RPC in the previous test as well (to get the subChannel to READY), but I thought it was more appropriate to do the RPC there, because we also want to ensure that a READY subchannel also means successful RPCs.

This is not racy here because we expect only one config update to be pushed to the child policy, and the channel used by the wrapped balancer has a buffer of 1 (I thought I had a buffer of 1, but didn't have it, so I added it now). This means that the writer will not be blocked even if the reading goroutine has not started running.

I meant racy with the send being lost do the default in the case of the channel not having buffer of 1. Now that is has buffer of 1 and we expect 1 config update, this check sgtm. Either it sends before or after it gets to this select, and either case works :).

easwars

Thanks for the detailed review.

easwars · 2023-06-23T18:58:40Z

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

+
+// testChildPolicyBalancerBuilder wraps the priority LB policy (which is the
+// child policy for the cluster resolver LB policy), and makes certain events
+// available to the test (subConn state changes and config pushes).


Thanks for bringing this up. I had initially thought about overriding UpdateClientConnState because that seems to be the more appropriate thing to do, but decided on getting the config from ParseConfig, purely because of laziness. Now, I switched the code to rely on UpdateClientConnState instead. This seems more correct to me.

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

easwars · 2023-06-23T19:15:50Z

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

+			case <-ctx.Done():
+				doneCh <- fmt.Errorf("test timeout expired when waiting for child policy to see a READY subConn")
+				return
+			case state := <-testChildPolicy.scStateCh:


Great find. Switched to using a buffer.Unbounded because I want two properties here:

no state updates should be lost (as you pointed out)

the UpdateSubConnState method should not block when writing on the channel because the test goroutine isn't reading anymore. This can be lead to test failures as well with the leakchecker.

I didn't switch the other channel to a buffer.Unbounded because we don't have that issue there (at least with the current usage).

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

easwars · 2023-06-23T19:20:02Z

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

+	select {
+	case jsonCfg := <-testChildPolicy.cfgCh:
+		gotCfg := &priority.LBConfig{}
+		if err := json.Unmarshal(jsonCfg, gotCfg); err != nil {
+			t.Fatalf("Failed to unmarshal LB config received by child policy: %v", err)
+		}
+		if diff := cmp.Diff(wantCfg, gotCfg); diff != "" {
+			t.Fatalf("Child policy received unexpected diff in config (-want +got):\n%s", diff)
+		}
+	case <-ctx.Done():
+		t.Fatalf("Timeout when waiting for child policy to receive its configuration")
+	}
+}


I don't need a goroutine here because I'm not performing an RPC here. Technically, we don't need to perform an RPC in the previous test as well (to get the subChannel to READY), but I thought it was more appropriate to do the RPC there, because we also want to ensure that a READY subchannel also means successful RPCs.

This is not racy here because we expect only one config update to be pushed to the child policy, and the channel used by the wrapped balancer has a buffer of 1 (I thought I had a buffer of 1, but didn't have it, so I added it now). This means that the writer will not be blocked even if the reading goroutine has not started running.

easwars

Thanks for the detailed review.

easwars · 2023-06-23T20:06:01Z

I saw a weird race reported here as part of accessing the balancer registry. It's not entirely related to this test and can actually happen in any test. Maybe we should rethink our rationale for not having a mutex to protect the map in all of our registries (resolver/balancer etc).

zasweq · 2023-06-23T20:16:58Z

That race is interesting, in my opinion. Thinking about it, the global map is protected because the registry is called on init() time, which has a deterministic happens before relationship wrt init() called when you build a binary, thus protecting the map that way. It's weird though that there's races in tests unless tests are running concurrently, because tests run after init() I think.

zasweq · 2023-06-23T20:19:33Z

xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go

+	// to the child policy. We *only* wait for READY state since we expect to
+	// get to that state before the below RPC succeeds.
+	doneCh := make(chan error, 1)
+	go func() {


Curious, now that it's a buffer.Unbounded and the write doesn't block, you could technically do this after line 434 in main test goroutine right? The sc updates get pushed to buffer in a non blocking way then you read after?

Good point. Done.

zasweq

LGTM.

easwars · 2023-06-23T20:37:00Z

That race is interesting, in my opinion. Thinking about it, the global map is protected because the registry is called on init() time, which has a deterministic happens before relationship wrt init() called when you build a binary, thus protecting the map that way. It's weird though that there's races in tests unless tests are running concurrently, because tests run after init() I think.

There are races in tests because we unregister and register things to the registry from tests (not necessarily from the init function in tests), and other non-test code read from the registry (again in non-init functions). I don't see any strong reason for why we shouldn't protect the map in the registry with a mutex. I've filed #6408 for us to discuss this in more detail (and also to ensure that we don't lose track of it).

zasweq · 2023-06-23T20:44:01Z

Ok, I see why there would be a race. Doesn't the go test guarantee one at a time not in parallel, so within a single test, if you have a write and read async. Issue sgtm since it's not on rpc critical path so no harm with protecting mutex and adding some instructions in init() time and tests.

easwars · 2023-06-23T20:51:11Z

Ok, I see why there would be a race. Doesn't the go test guarantee one at a time not in parallel, so within a single test, if you have a write and read async. Issue sgtm since it's not on rpc critical path so no harm with protecting mutex and adding some instructions in init() time and tests.

In a test we might do the following:

unregister foo from the registry
register a dummy foo for the duration of the test
defer a register of the original foo

In the intervening timeframe (when the test is running), the code might also read from the registry. For example:

grpc-go/xds/internal/balancer/priority/balancer.go

Line 119 in 0b3a81e

bb := balancer.Get(newSubConfig.Config.Name)

Hence the race, even if Go guarantees that it will only run one test at a time.

clusterresolver: switch a couple of tests to e2e style

5408499

easwars requested a review from zasweq June 23, 2023 15:43

easwars assigned zasweq Jun 23, 2023

easwars added the Type: Testing label Jun 23, 2023

easwars added this to the 1.57 Release milestone Jun 23, 2023

zasweq requested changes Jun 23, 2023

View reviewed changes

zasweq assigned easwars and unassigned zasweq Jun 23, 2023

review comments #1

faca94c

easwars commented Jun 23, 2023

View reviewed changes

easwars assigned zasweq and unassigned easwars Jun 23, 2023

zasweq reviewed Jun 23, 2023

View reviewed changes

zasweq approved these changes Jun 23, 2023

View reviewed changes

nix the goroutine

096b953

easwars merged commit 0673105 into grpc:master Jun 23, 2023
1 check passed

easwars deleted the clusterresolver_test_cleanup_part_4 branch June 23, 2023 20:51

github-actions bot locked as resolved and limited conversation to collaborators Jun 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clusterresolver: switch a couple of tests to e2e style #6394

clusterresolver: switch a couple of tests to e2e style #6394

easwars commented Jun 23, 2023 •

edited

Loading

zasweq left a comment

zasweq Jun 23, 2023

easwars Jun 23, 2023

zasweq Jun 23, 2023

easwars Jun 23, 2023

zasweq Jun 23, 2023

zasweq Jun 23, 2023

easwars Jun 23, 2023

zasweq Jun 23, 2023

easwars left a comment

easwars Jun 23, 2023

easwars Jun 23, 2023

easwars Jun 23, 2023

easwars left a comment

easwars commented Jun 23, 2023

zasweq commented Jun 23, 2023

zasweq Jun 23, 2023

easwars Jun 23, 2023

zasweq left a comment

easwars commented Jun 23, 2023

zasweq commented Jun 23, 2023

easwars commented Jun 23, 2023

clusterresolver: switch a couple of tests to e2e style #6394

clusterresolver: switch a couple of tests to e2e style #6394

Conversation

easwars commented Jun 23, 2023 • edited Loading

zasweq left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

easwars left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

easwars left a comment

Choose a reason for hiding this comment

easwars commented Jun 23, 2023

zasweq commented Jun 23, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zasweq left a comment

Choose a reason for hiding this comment

easwars commented Jun 23, 2023

zasweq commented Jun 23, 2023

easwars commented Jun 23, 2023

easwars commented Jun 23, 2023 •

edited

Loading