-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clusterresolver: switch a couple of tests to e2e style #6394
clusterresolver: switch a couple of tests to e2e style #6394
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments.
xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go
Outdated
Show resolved
Hide resolved
|
||
// testChildPolicyBalancerBuilder wraps the priority LB policy (which is the | ||
// child policy for the cluster resolver LB policy), and makes certain events | ||
// available to the test (subConn state changes and config pushes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: instead of config pushes (which I assume to come from UpdateClientConnState) can you instead say ParseConfig(). I know it should come 1:1 (ParseConfig than UpdateClientConnState) but I had to fix an issue in cds where we were bypassing this API here: https://github.com/grpc/grpc-go/pull/6361/files#diff-20fc9eddb97bc2324ff2b3ca5bd2cada8c7f4ad23776d547ad08e09087699bb0R376.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up. I had initially thought about overriding UpdateClientConnState
because that seems to be the more appropriate thing to do, but decided on getting the config from ParseConfig
, purely because of laziness. Now, I switched the code to rely on UpdateClientConnState
instead. This seems more correct to me.
xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go
Outdated
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go
Outdated
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go
Outdated
Show resolved
Hide resolved
case <-ctx.Done(): | ||
doneCh <- fmt.Errorf("test timeout expired when waiting for child policy to see a READY subConn") | ||
return | ||
case state := <-testChildPolicy.scStateCh: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not racy? I.e. after spawning goroutine for this check, we're not guaranteed the processory actually hits this select, and the other operations from lines 421 - 428 can execute without this select running in a separate goroutine, and the line 362 hits, and we lose the Update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great find. Switched to using a buffer.Unbounded
because I want two properties here:
- no state updates should be lost (as you pointed out)
- the
UpdateSubConnState
method should not block when writing on the channel because the test goroutine isn't reading anymore. This can be lead to test failures as well with theleakchecker
.
I didn't switch the other channel to a buffer.Unbounded
because we don't have that issue there (at least with the current usage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I was actually going to suggest this pattern. In order to not lose state, we need an arbitrary length buffer, and keep reading until READY received. So this change lgtm.
select { | ||
case jsonCfg := <-testChildPolicy.cfgCh: | ||
gotCfg := &priority.LBConfig{} | ||
if err := json.Unmarshal(jsonCfg, gotCfg); err != nil { | ||
t.Fatalf("Failed to unmarshal LB config received by child policy: %v", err) | ||
} | ||
if diff := cmp.Diff(wantCfg, gotCfg); diff != "" { | ||
t.Fatalf("Child policy received unexpected diff in config (-want +got):\n%s", diff) | ||
} | ||
case <-ctx.Done(): | ||
t.Fatalf("Timeout when waiting for child policy to receive its configuration") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike the test above, this check/recv is not in a goroutine. Is there a reason for the divergence? (is it this is balancer.Balancer API vs. balancer.ClientConn in test above) Also, similar question here, I don't get how this isn't racy. Are we guaranteed this select is running by the time we send on cfgCh on line 348? Or can that send come before this is running losing this check and making test flaky?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't need a goroutine here because I'm not performing an RPC here. Technically, we don't need to perform an RPC in the previous test as well (to get the subChannel to READY), but I thought it was more appropriate to do the RPC there, because we also want to ensure that a READY subchannel also means successful RPCs.
This is not racy here because we expect only one config update to be pushed to the child policy, and the channel used by the wrapped balancer has a buffer of 1 (I thought I had a buffer of 1, but didn't have it, so I added it now). This means that the writer will not be blocked even if the reading goroutine has not started running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant racy with the send being lost do the default in the case of the channel not having buffer of 1. Now that is has buffer of 1 and we expect 1 config update, this check sgtm. Either it sends before or after it gets to this select, and either case works :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed review.
|
||
// testChildPolicyBalancerBuilder wraps the priority LB policy (which is the | ||
// child policy for the cluster resolver LB policy), and makes certain events | ||
// available to the test (subConn state changes and config pushes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up. I had initially thought about overriding UpdateClientConnState
because that seems to be the more appropriate thing to do, but decided on getting the config from ParseConfig
, purely because of laziness. Now, I switched the code to rely on UpdateClientConnState
instead. This seems more correct to me.
xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go
Outdated
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go
Outdated
Show resolved
Hide resolved
xds/internal/balancer/clusterresolver/e2e_test/balancer_test.go
Outdated
Show resolved
Hide resolved
case <-ctx.Done(): | ||
doneCh <- fmt.Errorf("test timeout expired when waiting for child policy to see a READY subConn") | ||
return | ||
case state := <-testChildPolicy.scStateCh: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great find. Switched to using a buffer.Unbounded
because I want two properties here:
- no state updates should be lost (as you pointed out)
- the
UpdateSubConnState
method should not block when writing on the channel because the test goroutine isn't reading anymore. This can be lead to test failures as well with theleakchecker
.
I didn't switch the other channel to a buffer.Unbounded
because we don't have that issue there (at least with the current usage).
select { | ||
case jsonCfg := <-testChildPolicy.cfgCh: | ||
gotCfg := &priority.LBConfig{} | ||
if err := json.Unmarshal(jsonCfg, gotCfg); err != nil { | ||
t.Fatalf("Failed to unmarshal LB config received by child policy: %v", err) | ||
} | ||
if diff := cmp.Diff(wantCfg, gotCfg); diff != "" { | ||
t.Fatalf("Child policy received unexpected diff in config (-want +got):\n%s", diff) | ||
} | ||
case <-ctx.Done(): | ||
t.Fatalf("Timeout when waiting for child policy to receive its configuration") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't need a goroutine here because I'm not performing an RPC here. Technically, we don't need to perform an RPC in the previous test as well (to get the subChannel to READY), but I thought it was more appropriate to do the RPC there, because we also want to ensure that a READY subchannel also means successful RPCs.
This is not racy here because we expect only one config update to be pushed to the child policy, and the channel used by the wrapped balancer has a buffer of 1 (I thought I had a buffer of 1, but didn't have it, so I added it now). This means that the writer will not be blocked even if the reading goroutine has not started running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed review.
I saw a weird race reported here as part of accessing the balancer registry. It's not entirely related to this test and can actually happen in any test. Maybe we should rethink our rationale for not having a mutex to protect the map in all of our registries (resolver/balancer etc). |
That race is interesting, in my opinion. Thinking about it, the global map is protected because the registry is called on init() time, which has a deterministic happens before relationship wrt init() called when you build a binary, thus protecting the map that way. It's weird though that there's races in tests unless tests are running concurrently, because tests run after init() I think. |
// to the child policy. We *only* wait for READY state since we expect to | ||
// get to that state before the below RPC succeeds. | ||
doneCh := make(chan error, 1) | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, now that it's a buffer.Unbounded and the write doesn't block, you could technically do this after line 434 in main test goroutine right? The sc updates get pushed to buffer in a non blocking way then you read after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There are races in tests because we unregister and register things to the registry from tests (not necessarily from the |
Ok, I see why there would be a race. Doesn't the go test guarantee one at a time not in parallel, so within a single test, if you have a write and read async. Issue sgtm since it's not on rpc critical path so no harm with protecting mutex and adding some instructions in init() time and tests. |
In a test we might do the following:
In the intervening timeframe (when the test is running), the code might also read from the registry. For example:
Hence the race, even if Go guarantees that it will only run one test at a time. |
This change helps with an upcoming change where I plan to switch the EDS watch to use the generic xDS client API.
#resource-agnostic-xdsclient-api
RELEASE NOTES: none