resolver: refactor common code for polling-based resolvers #28979
Conversation
Both Python test failures on macOS might be flakes. Local runs suggest a <1% flake rate, yet two commits ended up with a failed macOS DNSResolverTest, which is quite strange. The Python Channelz issue on Windows might be real: https://screenshot.googleplex.com/ASvCkLJJ9CgYYan. In the past half year, that test failed on only a single day (the test might have been briefly broken then).
Here is the Windows channel tracing response from one of the failed runs. A client channel Channelz node lingers after the test is finished:
// Output fields from ares request.
std::unique_ptr<ServerAddressList> addresses_;
std::unique_ptr<ServerAddressList> balancer_addresses_;
char* service_config_json_ = nullptr;
nit: free service_config_json_ in the dtor rather than in OnResolved?
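A minimal sketch of what that nit suggests, assuming service_config_json_ is allocated with the gpr allocator (typical for char* fields in this codebase):

~AresRequestWrapper() override {
  // Hypothetical: release the C string on destruction rather than in OnResolved().
  gpr_free(service_config_json_);
}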
Done.
std::unique_ptr<ResultHandler> result_handler_;
/// pollset_set to drive the name resolution process
grpc_pollset_set* interested_parties_;
~AresClientChannelDNSResolver() override;

/// whether to request the service config
bool request_service_config_;
nit: while we're here, all of these private fields can be const
Done.
void RequestReresolutionLocked() override;
private:
class AresRequestWrapper : public InternallyRefCounted<AresRequestWrapper> {
A couple of structural comments (I won't block the PR on this, just want to consider):

The thread safety around the combined request initialization and starting in StartRequest is somewhat subtle to reason about, since it relies on the PollingResolver to synchronize its access to the Orphanable returned from StartRequest.

What do you think about changing the OrphanablePtr<Orphanable> StartRequest API to OrphanablePtr<PollingResolver::Request> CreateRequest, where PollingResolver::Request looks like:

class Request : public InternallyRefCounted<Request> {
 public:
  void Start();
};

For c-ares, PollingResolver::Request would be implemented by AresRequestWrapper pretty much as is.

For the native resolver, we could introduce NativeRequestWrapper, which would hold a RefCountedPtr<DNSResolver::Request> request_, as well as a RefCountedPtr<PollingResolver> (getting rid of the manual ref in the native resolver).

We could then go a step further and make PollingResolver a final class, and merge all of the remaining AresClientChannelDNSResolver and NativeClientChannelDNSResolver methods into their corresponding PollingResolver::Request implementations, with PollingResolver e.g. taking a factory function in its ctor for creating PollingResolver::Request.

Getting rid of the AresClientChannelDNSResolver and NativeClientChannelDNSResolver sub-class types might improve readability, since they are pretty thin now anyway.
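For concreteness, one way the proposed pieces could fit together (an illustrative sketch only; everything beyond the names quoted above is an assumption):

class PollingResolver : public Resolver {
 public:
  class Request : public InternallyRefCounted<Request> {
   public:
    virtual void Start() = 0;
  };

 private:
  // Subclasses would construct the request; the base class owns it,
  // starts it, and orphans it on shutdown to cancel in-flight work.
  virtual OrphanablePtr<Request> CreateRequest() = 0;

  void StartResolvingLocked() {
    request_ = CreateRequest();
    request_->Start();
  }

  OrphanablePtr<Request> request_;
};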
> The thread safety around the combined request initialization and starting in StartRequest is somewhat subtle to reason about, since it relies on the PollingResolver to synchronize its access to the Orphanable returned from StartRequest.

That's true, but PollingResolver is explicitly designed to make that guarantee. In any polling resolver, the request is going to complete asynchronously, and when it does, we're going to have to hop back into the WorkSerializer to process the result. PollingResolver handles that so that subclasses don't need to worry about it, so this access will always be safe.

Note that this is not intended to be a public API, so I'm fine with having it depend on the semantics of WorkSerializer.
> What do you think about changing the OrphanablePtr<Orphanable> StartRequest API to OrphanablePtr<PollingResolver::Request> CreateRequest, where PollingResolver::Request looks like: class Request : InternallyRefCounted<Request> { public: Start(); }
I thought about that -- in fact, I had actually implemented something like that in an earlier draft of this PR -- but I decided against it, because I didn't want to force implementations to jump through hoops to avoid diamond dependency problems.

The native DNS resolver is a prime example of that: the iomgr DNS resolution API already returns a DNSResolver::Request object, which itself inherits from InternallyRefCounted<>. The native DNS resolver can't return an object that inherits from both DNSResolver::Request and PollingResolver::Request without hitting a diamond dependency problem, so it would be forced to create a new wrapper object that inherits from PollingResolver::Request and contains the DNSResolver::Request object that does all the real work. That seems unnecessarily complex and inefficient, especially given that PollingResolver doesn't actually care whether the request object is ref-counted or not; all it really cares about is that it's orphanable, so that it can cancel it.
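A compressed illustration of the problem described above, using simplified stand-in classes (hypothetical, not the real types):

// Simplified stand-ins for the two APIs:
class DNSRequest : public InternallyRefCounted<DNSRequest> {
 public:
  void Orphan() override {}  // iomgr-style request
};
class PollingRequest : public InternallyRefCounted<PollingRequest> {
 public:
  void Orphan() override {}  // proposed PollingResolver::Request
};

// Deriving from both doesn't work: each base brings its own
// InternallyRefCounted<> instantiation, so the object would carry two
// independent ref counts and any Ref()/Unref() call would be ambiguous.
// class NativeRequest : public DNSRequest, public PollingRequest {};

// Hence the forced workaround: a wrapper that only forwards to the
// object doing the real work.
class NativeRequestWrapper : public PollingRequest {
 public:
  void Orphan() override { request_.reset(); }  // cancels the inner request
 private:
  OrphanablePtr<DNSRequest> request_;
};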
> For c-ares, PollingResolver::Request would be implemented by AresRequestWrapper pretty much as is.
In the long run, I don't think we really want to have AresRequestWrapper in the first place. If #25108 hadn't been reverted (and I would still like to see it rolled forward at some point), the c-ares resolver would have been able to just drop in its existing AresRequest object here. But if we went with the PollingResolver::Request approach, the c-ares resolver would have had to jump through exactly the same hoops as the native DNS resolver to avoid the diamond dependency problem.
> For the native resolver, we could introduce NativeRequestWrapper, which would hold a RefCountedPtr<DNSResolver::Request> request_, as well as a RefCountedPtr<PollingResolver> (getting rid of the manual ref in the native resolver).
I agree that having a way to avoid dealing with this ref manually would be better. But once we can use C++14, we can do that anyway by making the completion callback a lambda that captures the RefCountedPtr<> using move semantics. (Even with C++11, we could actually use the lambda-capture approach instead of dealing with the refs manually, but that would result in an extra unnecessary ref/unref pair due to the lack of lambda move-capture semantics.)
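A minimal sketch of that C++14 pattern (the callback signature and the OnRequestComplete name are placeholders, not the actual iomgr API):

RefCountedPtr<PollingResolver> self = Ref();  // ref for the pending callback
auto on_resolved = [self = std::move(self)](grpc_error_handle error) {
  // The move-captured ref keeps the resolver alive until the callback
  // runs, with no manual Ref()/Unref() pair.
  self->OnRequestComplete(error);
};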
> We could then go a step further and make PollingResolver a final class, and merge all of the remaining AresClientChannelDNSResolver and NativeClientChannelDNSResolver methods into their corresponding PollingResolver::Request implementations, with PollingResolver e.g. taking a factory function in its ctor for creating PollingResolver::Request. Getting rid of the AresClientChannelDNSResolver and NativeClientChannelDNSResolver sub-class types might improve readability, since they are pretty thin now anyway.
I think that would force the interface for implementations to be a little too narrow, which would make it hard to expand in the future if needed.

Also, it's worth noting that once AJ moves the c-ares code inside the iomgr API, we will have only one client channel DNS resolver implementation, so any small amount of duplication here will go away at that point anyway.
Sounds good, thanks for the context.
this, grpc_error_std_string(error).c_str(), request_.get());
}
have_next_resolution_timer_ = false;
if (error == GRPC_ERROR_NONE && request_ == nullptr) {
Shouldn't it be impossible for request_ to be non-null at this point?

Also, I think we should check shutdown_ here, and not start a new resolution if shutdown has started. Otherwise, I believe we could start a request that might never be explicitly cancelled.
Related to the above changes, let's log the shutdown_ state in the above trace.
Good catch. Fixed this to check shutdown_ instead, and to log it.
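Based on this discussion, the fixed callback plausibly looks something like the following (a sketch, not the verbatim patch; StartResolvingLocked and the ref tag are assumed names):

void PollingResolver::OnNextResolutionLocked(grpc_error_handle error) {
  // The trace now logs shutdown_ alongside the error.
  gpr_log(GPR_INFO,
          "[polling resolver %p] re-resolution timer fired: error=%s shutdown_=%d",
          this, grpc_error_std_string(error).c_str(), shutdown_);
  have_next_resolution_timer_ = false;
  // Start a new resolution only if the timer fired normally and we are
  // not shutting down; otherwise we could start a request that is never
  // explicitly cancelled.
  if (error == GRPC_ERROR_NONE && !shutdown_) {
    StartResolvingLocked();
  }
  Unref(DEBUG_LOCATION, "next_resolution_timer");
}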
}
request_.reset();
if (!shutdown_) {
if (result.service_config.ok() && result.addresses.ok()) {
Shouldn't this be result.service_config.ok() || result.addresses.ok()?

Otherwise, this looks like a behavioral change from the c-ares resolver's current implementation, which resets backoff in the case where we resolved only addresses, without a service config.
Related: I think result.service_config will typically be at the default unknown value, since the c-ares resolver doesn't request the service config by default and so won't fill it in.
The service_config field defaults to a null value (i.e., OK status):

grpc/src/core/lib/resolver/resolver.h, line 57 in fe91338:
absl::StatusOr<RefCountedPtr<ServiceConfig>> service_config = nullptr;

This means that result.service_config.ok() will be false only if the resolver explicitly returns an error for it. I think that if we get an explicit error for either the service config or the addresses, we should consider it a failure for the purposes of backoff.

I don't think this will actually change the behavior of the c-ares resolver, because it does not explicitly return an error for the service config when it is not fetching the service config.
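A small, self-contained illustration of that default, using std::shared_ptr as a stand-in for RefCountedPtr<ServiceConfig>:

#include <cassert>
#include <memory>

#include "absl/status/status.h"
#include "absl/status/statusor.h"

int main() {
  // A StatusOr holding a null pointer is still in the OK state...
  absl::StatusOr<std::shared_ptr<int>> service_config = nullptr;
  assert(service_config.ok());
  // ...and ok() becomes false only on an explicitly returned error.
  service_config = absl::UnavailableError("service config lookup failed");
  assert(!service_config.ok());
}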
Ah, got it. I misread that; it defaults to a null value.
This moves all of the code that handles backoff and cooldown timers into a common base class called PollingResolver. Both the native and c-ares DNS resolvers are changed to inherit from PollingResolver, so that they now share all of that common code.
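For readers skimming the thread, a condensed sketch of the resulting shape (StartRequest and the fields below follow the discussion above; OnRequestComplete and the other details are illustrative assumptions):

class PollingResolver : public Resolver {
 protected:
  // The hook subclasses implement: kick off one resolution attempt and
  // return a handle that can be orphaned to cancel it.
  virtual OrphanablePtr<Orphanable> StartRequest() = 0;

  // Called when a request finishes; the base class handles result
  // delivery, backoff, and cooldown timers in the WorkSerializer.
  void OnRequestComplete(Result result);

 private:
  OrphanablePtr<Orphanable> request_;
  BackOff backoff_;
  bool shutdown_ = false;
  bool have_next_resolution_timer_ = false;
};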