[Core] Fix core worker client pool leak #41535
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@@ -781,6 +781,8 @@ RAY_CONFIG(int64_t, grpc_client_keepalive_time_ms, 300000)
/// grpc keepalive timeout for client.
RAY_CONFIG(int64_t, grpc_client_keepalive_timeout_ms, 120000)

RAY_CONFIG(int64_t, grpc_client_idle_timeout_ms, 1800000)
This is the grpc default value: 30 minutes.
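As a quick check on the 30-minute figure: the config value 1800000 is in milliseconds. A minimal sketch using std::chrono, where the names kIdleTimeoutMs and IdleTimeoutMinutes are hypothetical and only mirror the value in the diff:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical constant mirroring grpc_client_idle_timeout_ms from the diff.
constexpr int64_t kIdleTimeoutMs = 1800000;

// Converts the millisecond value to whole minutes.
constexpr int64_t IdleTimeoutMinutes() {
  return std::chrono::duration_cast<std::chrono::minutes>(
             std::chrono::milliseconds(kIdleTimeoutMs))
      .count();
}

// Compile-time confirmation that 1800000 ms equals 30 minutes.
static_assert(IdleTimeoutMinutes() == 30, "1800000 ms is 30 minutes");
```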
/// Also see https://grpc.github.io/grpc/core/md_doc_connectivity-semantics-and-api.html
/// for channel connectivity state machine.
bool IsChannelIdleAfterRPCs() const {
  return (channel_->GetState(false) == GRPC_CHANNEL_IDLE) && call_method_invoked_;
GetState is not blocking, right?
Not blocking.
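To illustrate why the check above also needs call_method_invoked_: per the gRPC connectivity semantics, a channel starts out IDLE, so IDLE alone cannot distinguish a brand-new client from one that was used and then went quiet. Passing false to GetState (the try_to_connect argument) only queries the state and does not push the channel out of IDLE. The sketch below is a stand-in that does not use gRPC; ChannelState, MockClient, CallMethod, and SimulateIdleTimeout are hypothetical names for illustration only.

```cpp
#include <cassert>

// Hypothetical stand-in for gRPC's connectivity states.
enum class ChannelState { kIdle, kConnecting, kReady, kTransientFailure, kShutdown };

// Hypothetical mock of a client wrapping a channel (not the Ray class).
class MockClient {
 public:
  // Simulates issuing an RPC: the channel leaves IDLE and becomes READY.
  void CallMethod() {
    call_method_invoked_ = true;
    state_ = ChannelState::kReady;
  }

  // Simulates the idle timeout firing with no RPCs in flight.
  void SimulateIdleTimeout() { state_ = ChannelState::kIdle; }

  // Mirrors the check in the diff: IDLE only counts once an RPC was made.
  bool IsChannelIdleAfterRPCs() const {
    return state_ == ChannelState::kIdle && call_method_invoked_;
  }

 private:
  ChannelState state_ = ChannelState::kIdle;  // Channels start out IDLE.
  bool call_method_invoked_ = false;
};
```

A freshly created MockClient reports false, as does one with a READY channel; only a client that has made a call and then timed out to IDLE reports true.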
auto id = WorkerID::FromBinary(addr_proto.worker_id());
auto it = client_map_.find(id);
if (it != client_map_.end()) {
  return it->second;
  entry = *it->second;
  client_list_.erase(it->second);
Isn't this actually pretty expensive (O(N)) if there are lots of connections?
Why don't we just call RemoveIdleClients every 30 seconds or something instead?
std::list is a doubly linked list, so erasing an element by its iterator is constant time.
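The constant-time point can be sketched with a std::list that holds clients in most-recently-used order plus a map from ID to list iterator. This is a simplified illustration, not the Ray implementation: FakeClient, ClientPoolSketch, and the idle flag are hypothetical, and the sketch uses splice to move a node to the front where the diff erases and re-inserts the entry (both are O(1) on a std::list). Stopping the scan at the first active client is also an assumption made here.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical client; `idle` stands in for IsChannelIdleAfterRPCs().
struct FakeClient {
  std::string id;
  bool idle;
};

class ClientPoolSketch {
 public:
  // Returns the client for `id`, creating it if needed, and moves it to the
  // front of the list so the front is always the most recently used entry.
  FakeClient &GetOrConnect(const std::string &id) {
    auto it = map_.find(id);
    if (it != map_.end()) {
      // splice relinks the node in O(1) and keeps the stored iterator valid.
      list_.splice(list_.begin(), list_, it->second);
    } else {
      list_.push_front(FakeClient{id, false});
      map_[id] = list_.begin();
    }
    return list_.front();
  }

  // Walks from the least recently used end and drops idle clients, stopping
  // at the first client that is still active.
  void RemoveIdleClients() {
    while (!list_.empty() && list_.back().idle) {
      map_.erase(list_.back().id);
      list_.pop_back();
    }
  }

  std::size_t Size() const { return list_.size(); }

 private:
  std::list<FakeClient> list_;
  std::unordered_map<std::string, std::list<FakeClient>::iterator> map_;
};
```

Because only the tail of the list is inspected, each lookup pays a constant cost for cleanup bookkeeping rather than a scan over all connections.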
Hello, am I understanding correctly that this fix is not yet merged into any release? Thanks
@m-harmonic Yes, it will be part of the Ray 2.10 release.
Why are these changes needed?
Currently the core worker client pool doesn't remove clients in most cases (there are one or two places where Disconnect() might be called), and this causes a memory leak. This PR adds a GC inside the core worker client pool to remove IDLE clients (i.e., clients that don't have active connections).
Related issue number
Closes #41260
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.