
xds: implement ADS stream flow control mechanism #7458

Merged · 5 commits merged into master from ads_flow_control · Aug 12, 2024

Conversation

Contributor

@easwars easwars commented Jul 30, 2024

This PR adds support for ADS stream flow control. See grpc/grpc#34099 for complete context.

At a high level, we want watch callbacks from the xDS client to include a function that the watcher invokes once it has completely processed the update. The ADS stream, in turn, blocks the next read until the previous update has been completely consumed by all watchers.
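To make this concrete, below is a minimal, standalone sketch of such a flow-control helper. The PR's actual type is ADSFlowControl in the transport package; the names and API below are illustrative placeholders, not the code added by the PR:

```go
package flowcontrol

import "sync"

// adsFlowControl is an illustrative sketch: it counts watchers that have not
// yet finished processing the current update, and lets the ADS stream wait
// until that count drops to zero before issuing the next read.
type adsFlowControl struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending int
}

func newADSFlowControl() *adsFlowControl {
	fc := &adsFlowControl{}
	fc.cond = sync.NewCond(&fc.mu)
	return fc
}

// add registers one more watcher that still has to process the update.
func (fc *adsFlowControl) add() {
	fc.mu.Lock()
	fc.pending++
	fc.mu.Unlock()
}

// onDone is invoked by a watcher once it has completely processed the update.
func (fc *adsFlowControl) onDone() {
	fc.mu.Lock()
	fc.pending--
	if fc.pending == 0 {
		fc.cond.Broadcast()
	}
	fc.mu.Unlock()
}

// wait blocks the stream's read loop until every watcher has reported done.
func (fc *adsFlowControl) wait() {
	fc.mu.Lock()
	for fc.pending > 0 {
		fc.cond.Wait()
	}
	fc.mu.Unlock()
}
```

In this sketch, the stream's receive loop would call wait() after dispatching an update and before the next read, while each watch callback calls onDone() when it finishes.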

Summary of changes:

  • xDS client:
    • watch API:
      • Modify the signature of the callbacks to also include a DoneNotifier.
    • transport:
      • Define a new type ADSFlowControl to implement the flow-control logic. It has methods to increment and decrement the number of pending watchers, and to wait until that number drops to zero.
      • The ADSFlowControl instance for the stream is passed to the authority along with the update.
    • authority:
      • When processing an update, it builds the list of watch callbacks to be invoked (incrementing the pending count as it does so).
      • At the end, it invokes the callbacks in a serializer and makes sure that the OnDone callback will be invoked eventually.
  • Change watchers to invoke the DoneNotifier when they are done with an update.
    • Some watchers process the update asynchronously by pushing it onto a serializer. Here, calls to Schedule have been replaced with calls to ScheduleOr. This includes cds and xds_resolver.
    • Some watchers push the update onto a channel which is read by another component and processed. This involved a little bit more work. See changes in clusterresolver.
    • Some watchers process the updates inline. This is the simplest case, where we can simply defer the call to OnDone (see the sketch after this list). See xds_server.
  • New tests for the newly introduced functionality.
  • Fix a bunch of existing tests.
  • Avoid printing bootstrap config multiple times in tests.
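For the inline-processing case mentioned above (e.g. xds_server), the watcher-side change is essentially a deferred completion call. The sketch below is illustrative; the update type and the exact callback signature are placeholders, not the API introduced by the PR:

```go
package watcher

// resourceUpdate is a placeholder type standing in for an xDS resource update.
type resourceUpdate struct {
	name string
}

// inlineWatcher processes updates synchronously.
type inlineWatcher struct{}

// OnUpdate receives the update along with a completion callback. Deferring
// onDone guarantees that the ADS stream is unblocked as soon as the inline
// processing returns, including on early returns.
func (w *inlineWatcher) OnUpdate(u resourceUpdate, onDone func()) {
	defer onDone()

	// ... handle u synchronously ...
	_ = u.name
}
```

Watchers that hand the update off to a serializer or a channel instead arrange for the completion callback to run once the asynchronous processing finishes (or if the work is dropped), which is what the ScheduleOr and clusterresolver changes above are about.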

#ads-stream-flow-control

Fixes #7382

RELEASE NOTES:

  • TBD

@easwars easwars added the Type: Feature New features or improvements in behavior label Jul 30, 2024
@easwars easwars added this to the 1.66 Release milestone Jul 30, 2024

codecov bot commented Jul 30, 2024

Codecov Report

Attention: Patch coverage is 91.30435% with 14 lines in your changes missing coverage. Please review.

Project coverage is 81.61%. Comparing base (3eb0145) to head (585042b).
Report is 15 commits behind head on master.

| Files | Patch % | Lines missing coverage |
|---|---|---|
| xds/internal/xdsclient/transport/transport.go | 89.74% | 2 Missing and 2 partials ⚠️ |
| .../balancer/clusterresolver/resource_resolver_eds.go | 70.00% | 3 Missing ⚠️ |
| xds/internal/resolver/watch_service.go | 85.00% | 3 Missing ⚠️ |
| xds/internal/testutils/resource_watcher.go | 66.66% | 2 Missing ⚠️ |
| .../balancer/clusterresolver/resource_resolver_dns.go | 75.00% | 1 Missing ⚠️ |
| xds/internal/xdsclient/clientimpl_watchers.go | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7458      +/-   ##
==========================================
+ Coverage   81.57%   81.61%   +0.04%     
==========================================
  Files         354      357       +3     
  Lines       27076    27267     +191     
==========================================
+ Hits        22088    22255     +167     
- Misses       3798     3804       +6     
- Partials     1190     1208      +18     
| Files | Coverage Δ |
|---|---|
| internal/testutils/xds_bootsrap.go | 69.23% <ø> (-2.20%) ⬇️ |
| internal/xds/bootstrap/bootstrap.go | 72.20% <ø> (+4.52%) ⬆️ |
| ...s/internal/balancer/cdsbalancer/cluster_watcher.go | 100.00% <100.00%> (ø) |
| ...ternal/balancer/clusterresolver/clusterresolver.go | 72.12% <100.00%> (+1.63%) ⬆️ |
| ...rnal/balancer/clusterresolver/resource_resolver.go | 93.63% <100.00%> (+0.30%) ⬆️ |
| xds/internal/server/listener_wrapper.go | 78.61% <100.00%> (+0.41%) ⬆️ |
| xds/internal/server/rds_handler.go | 77.94% <100.00%> (+1.01%) ⬆️ |
| xds/internal/xdsclient/authority.go | 86.40% <100.00%> (+0.31%) ⬆️ |
| xds/internal/xdsclient/client_refcounted.go | 86.11% <100.00%> (ø) |
| ...nal/xdsclient/xdsresource/cluster_resource_type.go | 80.00% <100.00%> (ø) |

... and 10 more

... and 28 files with indirect coverage changes

@easwars easwars requested a review from zasweq July 30, 2024 20:20
Contributor

@zasweq zasweq left a comment

Some minor implementation comments; will look at new tests next pass.

Resolved review threads:
  • xds/internal/balancer/clusterresolver/resource_resolver.go (two threads, now outdated)
  • internal/testutils/xds_bootsrap.go
  • internal/xds/bootstrap/bootstrap.go
  • xds/internal/xdsclient/authority.go (two threads)
  • xds/internal/xdsclient/client_refcounted.go
  • xds/internal/xdsclient/transport/transport.go (two threads, now outdated)
@zasweq zasweq assigned easwars and unassigned zasweq Aug 7, 2024
@easwars easwars assigned zasweq and unassigned easwars Aug 7, 2024
Comment on lines +230 to +239
select {
case <-watcher1.updateCh:
case <-ctx.Done():
t.Fatalf("Timed out waiting for update to reach watcher 1")
}
select {
case <-watcher2.updateCh:
case <-ctx.Done():
t.Fatalf("Timed out waiting for update to reach watcher 2")
}
Contributor

Aren't these receives flaky? The sends have a default, and there's no guarantee that this testing goroutine gets to these lines before the OnUpdate callbacks send on the channel, right?

Contributor

Oh, I guess it's fine since it's verifying that something has updated it (but this would still pass if there were numerous updates, since subsequent updates would get dropped). Do we care about this / is it important?

Contributor Author

> Aren't these receives flaky? The sends have a default, and there's no guarantee that this testing goroutine gets to these lines before the OnUpdate callbacks send on the channel, right?

No, they aren't flaky because the channels have a buffer of size 1. So, even if the testing goroutine is not reading from the channels when the update fires, the writes on the channels will go through.
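For context, the pattern being described looks roughly like the following sketch (illustrative only, not the actual test code; updateCh is assumed to be created as make(chan struct{}, 1)):

```go
package example

import (
	"context"
	"testing"
)

// notify is how the watcher callback signals the test: the channel has a
// buffer of 1 and the send is non-blocking, so the first notification is
// retained even if no reader is waiting yet; subsequent ones are dropped.
func notify(updateCh chan struct{}) {
	select {
	case updateCh <- struct{}{}:
	default:
	}
}

// waitForUpdate shows why the receive is not flaky: the buffered value is
// still there even if the receive happens after the callback has fired.
func waitForUpdate(ctx context.Context, t *testing.T, updateCh chan struct{}) {
	t.Helper()
	select {
	case <-updateCh:
	case <-ctx.Done():
		t.Fatal("timed out waiting for update")
	}
}
```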

> Oh, I guess it's fine since it's verifying that something has updated it (but this would still pass if there were numerous updates, since subsequent updates would get dropped). Do we care about this / is it important?

Our concern is not the number of times an update/error callback is invoked; rather, it is about when the next read happens. And if there is a bug in the xDS client that invokes the callbacks multiple times for a single update, that should be caught by other tests.

Here, we ensure that multiple reads on the ADS stream are not happening, because we do have a check for that a few lines below.

Hope this is satisfactory for you.

Contributor

Alright sounds good.

// management server and therefore the watcher's OnResourceDoesNotExist()
// callback is expected to be invoked. Verifies that no further reads are
// attempted until the callback is completely handled by the watcher.
func (s) TestADSFlowControl_ResourceDoesNotExist(t *testing.T) {
Contributor

This seems similar to the test above, albeit with a few different operations. Can we refactor the two into a table-driven test? There seems to be a lot of shared code.

Contributor Author

Yes, there are about 30 lines of common code at the start of these two tests, but writing a table-driven test to cover both scenarios would make the logic inside the table complicated. As you can see, for the error case we simply configure a resource on the management server that is expected to be NACKed by the xDS client. For the resource-not-found case, however, we initially need a resource that is expected to be ACKed by the xDS client, and then we remove it from the management server, triggering the resource-not-found error.

My preference is to keep tests simple to read and maintain over tests that are shorter and share more code but end up being complicated.

Let me know your thoughts. Thanks.

Contributor

Alright, reading the test bodies, I guess the operations are different enough. Refactoring into a shared setup would also be hard across tests, since the listeners and routes vary.

@zasweq
Contributor

zasweq commented Aug 8, 2024

Reviewed tests and responded/resolved the implementation comments.

@zasweq zasweq assigned easwars and unassigned zasweq Aug 8, 2024
@easwars easwars assigned zasweq and unassigned easwars Aug 8, 2024
Contributor

@zasweq zasweq left a comment

LGTM

@zasweq zasweq removed their assignment Aug 9, 2024
@easwars easwars merged commit ced812e into grpc:master Aug 12, 2024
9 of 11 checks passed
@easwars easwars deleted the ads_flow_control branch August 12, 2024 14:33
infovivek2020 pushed a commit to infovivek2020/grpc-go that referenced this pull request Aug 18, 2024