
resolver: refactor common code for polling-based resolvers #28979

Merged
markdroth merged 13 commits into grpc:master from resolver_polling on Mar 10, 2022

Conversation

markdroth (Member)

This moves all of the code to handle backoff and cooldown timers into a common base class called PollingResolver. Both the native and c-ares DNS resolver are changed to inherit from PollingResolver, so that they now share all of that common code.
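
For reference, the shape of the shared base class is roughly the following (a simplified sketch assembled from the review discussion below; the actual class has more members, and names such as OnRequestComplete and OnNextResolutionLocked are approximations):

class PollingResolver : public Resolver {
 public:
  void StartLocked() override;
  void RequestReresolutionLocked() override;
  void ResetBackoffLocked() override;
  void ShutdownLocked() override;

 protected:
  // Implemented by each subclass to kick off a single resolution attempt.
  // The returned object is orphaned if the attempt needs to be cancelled.
  virtual OrphanablePtr<Orphanable> StartRequest() = 0;

  // Called when an attempt completes; handles backoff, the re-resolution
  // cooldown timer, and reporting the result to the ResultHandler.
  void OnRequestComplete(Result result);

 private:
  void OnNextResolutionLocked(grpc_error_handle error);

  OrphanablePtr<Orphanable> request_;
  BackOff backoff_;
  bool have_next_resolution_timer_ = false;
  bool shutdown_ = false;
};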

@markdroth markdroth added the "release notes: no" label Feb 26, 2022
@markdroth markdroth requested a review from apolcyn February 26, 2022 08:40
@lidizheng (Contributor)

Both Python test failures on macOS might be flakes. Local tests suggest a <1% flake rate, yet two commits ended up with a failed macOS DNSResolverTest, which is quite strange.

The Python Channelz issue on Windows might be real: https://screenshot.googleplex.com/ASvCkLJJ9CgYYan. In the past half year, that test has failed on only a single day (possibly a short-lived breakage).

@lidizheng (Contributor)

Here is the channelz tracing response from one of the failed Windows runs. A client channel's Channelz node lingers after the test has finished:

channel {
  ref {
    channel_id: 3
  }
  data {
    state {
      state: SHUTDOWN
    }
    target: "localhost:50701"
    trace {
      num_events_logged: 7
      creation_timestamp {
        seconds: 1646177312
        nanos: 543574228
      }
      events {
        description: "Channel created"
        severity: CT_INFO
        timestamp {
          seconds: 1646177312
          nanos: 543571901
        }
      }
      events {
        description: "Channel state change to CONNECTING"
        severity: CT_INFO
        timestamp {
          seconds: 1646177312
          nanos: 544829032
        }
      }
      events {
        description: "Created new LB policy \"pick_first\""
        severity: CT_INFO
        timestamp {
          seconds: 1646177312
          nanos: 548041343
        }
      }
      events {
        description: "Resolution event: Address list became non-empty, Service config changed"
        severity: CT_INFO
        timestamp {
          seconds: 1646177312
          nanos: 549691319
        }
      }
      events {
        description: "Channel state change to CONNECTING"
        severity: CT_INFO
        timestamp {
          seconds: 1646177312
          nanos: 549681079
        }
      }
      events {
        description: "Channel state change to READY"
        severity: CT_INFO
        timestamp {
          seconds: 1646177312
          nanos: 550193945
        }
      }
      events {
        description: "Channel state change to SHUTDOWN"
        severity: CT_INFO
        timestamp {
          seconds: 1646177312
          nanos: 565222991
        }
      }
    }
    calls_started: 1
    calls_failed: 1
    last_call_started_timestamp {
      seconds: 3292358225
      nanos: 69867566
    }
  }
}
channel {
  ref {
    channel_id: 13
  }
  data {
    target: "localhost:50708"
    trace {
      num_events_logged: 1
      creation_timestamp {
        seconds: 1646177312
        nanos: 584316601
      }
      events {
        description: "Channel created"
        severity: CT_INFO
        timestamp {
          seconds: 1646177312
          nanos: 584315205
        }
      }
    }
  }
}
end: true

// Output fields from ares request.
std::unique_ptr<ServerAddressList> addresses_;
std::unique_ptr<ServerAddressList> balancer_addresses_;
char* service_config_json_ = nullptr;
Contributor

nit: free service_config_json_ in the dtor rather than in OnResolved?
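
For reference, a minimal sketch of what that would look like (assuming these fields live on AresRequestWrapper and service_config_json_ is gpr_malloc-allocated):

~AresRequestWrapper() override {
  // Freeing in the dtor instead of in OnResolved() covers every exit path;
  // gpr_free(nullptr) is a no-op, so no null check is needed.
  gpr_free(service_config_json_);
}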

Member Author

Done.

std::unique_ptr<ResultHandler> result_handler_;
/// pollset_set to drive the name resolution process
grpc_pollset_set* interested_parties_;
~AresClientChannelDNSResolver() override;

/// whether to request the service config
bool request_service_config_;
Contributor

nit: while we're here, all of these private fields can be const
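
For illustration, the change amounts to const-qualifying the members quoted above, with initialization staying in the ctor's member-initializer list:

const std::unique_ptr<ResultHandler> result_handler_;
/// pollset_set to drive the name resolution process
grpc_pollset_set* const interested_parties_;
/// whether to request the service config
const bool request_service_config_;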

Member Author

Done.


void RequestReresolutionLocked() override;
private:
class AresRequestWrapper : public InternallyRefCounted<AresRequestWrapper> {
Contributor

A couple of structural comments (I won't block the PR on this, just something to consider):

The thread safety around the combined request initialization and starting in StartRequest is somewhat subtle to reason about, since it relies on the PollingResolver to synchronize its access to the Orphanable returned from StartRequest.

What do you think about changing the OrphanablePtr<Orphanable> StartRequest API to OrphanablePtr<PollingResolver::Request> CreateRequest, where PollingResolver::Request looks like:

class Request : public InternallyRefCounted<Request> {
 public:
  virtual void Start() = 0;
};

For c-ares, PollingResolver::Request would be implemented by AresRequestWrapper pretty much as is.

For the native resolver, we could introduce NativeRequestWrapper, which would hold a RefCountedPtr<DNSResolver::Request> request_ as well as a RefCountedPtr<PollingResolver> (getting rid of the manual ref in the native resolver).
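
A rough sketch of what that wrapper might look like (illustrative only, not code from the PR):

class NativeRequestWrapper : public PollingResolver::Request {
 public:
  NativeRequestWrapper(RefCountedPtr<PollingResolver> resolver,
                       RefCountedPtr<DNSResolver::Request> request)
      : resolver_(std::move(resolver)), request_(std::move(request)) {}

  void Start() override { /* start request_, reporting back into resolver_ */ }

  void Orphan() override { /* cancel request_ and release both refs */ }

 private:
  RefCountedPtr<PollingResolver> resolver_;  // replaces the manual ref
  RefCountedPtr<DNSResolver::Request> request_;
};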

We could then go a step further and make PollingResolver a final class, and merge all of the remaining AresClientChannelDNSResolver and NativeClientChannelDNSResolver methods into their corresponding PollingResolver::Request implementations, with PollingResolver e.g. taking a factory function in its ctor for creating PollingResolver::Request.

Getting rid of the AresClientChannelDNSResolver and NativeClientChannelDNSResolver subclass types might improve readability, since they are pretty thin now anyway.

Member Author

The thread safety around the combined request initialization and starting in StartRequest is somewhat subtle to reason about, since it relies on the PollingResolver to synchronize its access to the Orphanable returned from StartRequest.

That's true, but PollingResolver is explicitly designed to make that guarantee. In any polling resolver, the request is going to complete asynchronously, and when it does, we're going to have to hop back into the WorkSerializer to process the result. PollingResolver handles that so that subclasses don't need to worry about it, so this access will always be safe.

Note that this is not intended to be a public API, so I'm fine with having it depend on the semantics of WorkSerializer.

What do you think about changing the OrphanablePtr<Orphanable> StartRequest API to OrphanablePtr<PollingResolver::Request> CreateRequest, where PollingResolver::Request looks like:

class Request : public InternallyRefCounted<Request> {
 public:
  virtual void Start() = 0;
};

I thought about that -- in fact, I had actually implemented something like that in an earlier draft of this PR -- but I decided against it, because I didn't want to force implementations to jump through hoops to avoid diamond dependency problems.

The native DNS resolver is a prime example of that: the iomgr DNS resolution API already returns a DNSResolver::Request object, which itself inherits from InternallyRefCounted<>. The native DNS resolver can't return an object that inherits from both DNSResolver::Request and from PollingResolver::Request without hitting a diamond dependency problem, so it would be forced to create a new wrapper object that inherits from PollingResolver::Request and contains the DNSResolver::Request object that does all the real work. That seems unnecessarily complex and inefficient, especially given that PollingResolver doesn't actually care whether the request object is ref-counted or not; all it really cares about is that it's orphanable, so that it can cancel it.
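
To make the diamond concrete (illustrative, not code from the PR): both request types ultimately derive from Orphanable via their InternallyRefCounted<> bases, so a single object inheriting from both would end up with two Orphanable subobjects and an ambiguous Orphan():

// DNSResolver::Request     : public InternallyRefCounted<DNSResolver::Request>
// PollingResolver::Request : public InternallyRefCounted<PollingResolver::Request>
// Both InternallyRefCounted<> instantiations derive from Orphanable.
class NativeRequest : public DNSResolver::Request,
                      public PollingResolver::Request {
  // Ambiguous: two Orphanable bases, two ref-counts, two Orphan() methods.
};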

For c-ares, PollingResolver::Request would be implemented by AresRequestWrapper pretty much as is.

In the long run, I don't think we really want to have AresRequestWrapper in the first place. If #25108 hadn't been reverted (and I would still like to see it rolled forward at some point), the c-ares resolver would have been able to just drop in its existing AresRequest object here. But if we went with the PollingResolver::Request approach, the c-ares resolver would have had to jump through exactly the same hoops as the native DNS resolver to avoid the diamond dependency problem.

For the native resolver, we could introduce NativeRequestWrapper which would hold a RefCountedPtr<DNSResolver::Request> request_;, as well as RefCountedPtr<PollingResolver> (getting rid of the manual ref in the native resolver).

I agree that having a way to avoid dealing with this ref manually would be better. But once we can use C++14, we can do that anyway by making the completion callback a lambda that captures the RefCountedPtr<> using move semantics. (Even with C++11, we could actually use the lambda-capture approach instead of dealing with the refs manually, but that would result in an extra unnecessary ref/unref pair due to the lack of lambda move-capture semantics.)
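
For illustration, the C++14 version would look something like this (hypothetical sketch: the Resolve() call, callback signature, and OnRequestComplete() are placeholders, not the real iomgr API):

auto self = Ref(DEBUG_LOCATION, "dns_request");
request_ = dns_resolver->Resolve(
    name_to_resolve_,
    // The lambda move-captures the ref, so no manual Ref()/Unref() pair and
    // no extra ref/unref is needed.
    [self = std::move(self)](Result result) {
      self->OnRequestComplete(std::move(result));
    });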

We could then go a step further and make PollingResolver a final class, and merge all of the remaining AresClientChannelDNSResolver and NativeClientChannelDNSResolver methods into their corresponding PollingResolver::Request implementations, with PollingResolver e.g. taking a factory function in its ctor for creating PollingResolver::Request.

Getting rid of the AresClientChannelDNSResolver and NativeClientChannelDNSResolver subclass types might improve readability, since they are pretty thin now anyway.

I think that would force the interface for implementations to be a little too narrow, which would make it hard to expand it in the future if needed.

Also, it's worth noting that once AJ moves the c-ares code inside the iomgr API, we will have only one client channel DNS resolver implementation, so any small amount of duplication here will go away at that point anyway.

Contributor

Sounds good, thanks for the context.

this, grpc_error_std_string(error).c_str(), request_.get());
}
have_next_resolution_timer_ = false;
if (error == GRPC_ERROR_NONE && request_ == nullptr) {
Contributor

Shouldn't it be impossible for request_ to be non-null at this point?

Also, I think we should check shutdown_ here and not start a new resolution if shutdown has started. Otherwise, I believe we could start a request that might never be explicitly cancelled.

Contributor

Related to the above changes: let's also log the shutdown_ state in that trace message.

Member Author

Good catch. Fixed this to check shutdown_ instead, and to log it.
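
For reference, the fixed timer callback now looks roughly like this (reconstructed from the diff context above; the trace message and StartResolvingLocked() are approximations):

void PollingResolver::OnNextResolutionLocked(grpc_error_handle error) {
  gpr_log(GPR_INFO,
          "[polling resolver %p] re-resolution timer fired: error=\"%s\", shutdown_=%d",
          this, grpc_error_std_string(error).c_str(), shutdown_);
  have_next_resolution_timer_ = false;
  // Only start a new attempt if the timer fired normally and shutdown has not
  // started; otherwise we could start a request that is never cancelled.
  if (error == GRPC_ERROR_NONE && !shutdown_) {
    StartResolvingLocked();
  }
}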

}
request_.reset();
if (!shutdown_) {
if (result.service_config.ok() && result.addresses.ok()) {
Contributor

shouldn't this be result.service_config.ok() || result.addresses.ok()?

Otherwise, this looks like a behavioral change from the c-ares resolver's current implementation, which resets backoff in the case where we resolved only addresses, without a service config.

Contributor

Related: I think result.service_config will typically be left at the default unknown value, since the c-ares resolver doesn't request the service config by default and so won't fill it in.

Member Author

The service_config field defaults to a null value (i.e., OK status):

absl::StatusOr<RefCountedPtr<ServiceConfig>> service_config = nullptr;

This means that result.service_config.ok() will be false only if the resolver explicitly returns an error for it. I think if we get an explicit error for either service config or addresses, we should consider it a failure for the purposes of backoff.

I don't think this will actually change the behavior of the c-ares resolver, because it does not explicitly return an error for the service config if it is not fetching the service config.
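
Concretely, given that default (a sketch around the check quoted in the diff context above):

// result.service_config defaults to nullptr, which is an OK status, so when a
// resolver does not fetch the service config this condition depends only on
// result.addresses. It is false only if either field carries an explicit
// error, in which case the attempt counts as a failure for backoff purposes.
if (result.service_config.ok() && result.addresses.ok()) {
  backoff_.Reset();  // success: reset backoff (illustrative)
}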

Contributor

Ah, got it; I missed that it defaults to a null value.

@markdroth markdroth merged commit 707aad7 into grpc:master Mar 10, 2022
@markdroth markdroth deleted the resolver_polling branch March 10, 2022 18:45
@copybara-service copybara-service bot added the "imported" label Mar 11, 2022