Reuse worker to scheduler WS connection for scheduling #8918
Comments
This isn't true. We open a single connection between each worker and the scheduler, and that connection is initiated from the worker. It is used bidirectionally for all administrative traffic (e.g. assigning tasks, notifying about results). If a firewall were blocking this, the worker could not even register and would fail during startup. If you are running into an issue, you will have to be more specific. There are a couple of specialized APIs where this is not true, but ordinary scheduling will not open any connections to the worker that are initiated by the scheduler.
Yes, indeed. I wasn't precise in describing the actual problem. I see that a worker initiates a ws connection with the scheduler and that it is used for administrative purposes. However, as far as I can see, there is another connection that is initiated by the scheduler, and that is the one that fails. The system starts up without any errors, but I see errors once the first job is scheduled. Be aware that the IP addresses provided are local IP addresses. The scheduler is public, while the worker and the client are behind a firewall.
Scheduler started:
Worker started
Scheduler received a connection from the worker
Client submits a task and sees the following message in a while
The scheduler sees the following errors:
So in this particular case, the worker expects the scheduler to connect to ws://172.28.0.12:46663, and that will never work if port 46663 remains closed. Am I understanding this correctly?
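One way to check the reachability question above is to attempt a plain TCP connection from the scheduler host to the address the worker advertises. The following is a minimal sketch using only the Python standard library; the helper name `can_connect` is hypothetical and is not part of Dask. It tests only whether the TCP handshake succeeds, not whether a WebSocket upgrade would follow.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`.

    Hypothetical helper for illustration; not a Dask API.
    """
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, unreachable, timed out) if it cannot complete.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on the scheduler host, `can_connect("172.28.0.12", 46663)` would be expected to return `False` if the worker's private address is not routable from there or the port is blocked.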
@fjetter I tried to provide logs that are as detailed as possible. Long story short, the scheduler tries to connect to 172.28.0.12, which is the private IP of the worker. The worker expects connections on
Thank you
It looks to me like the connection is being lost during worker registration, and then the scheduler is trying to resume the connection and failing. So when you say "the system starts up without any error" I don't think that's the case.
Dask is designed so that the scheduler and workers can all open network connections to each other at any time. If you're running in an environment where a firewall blocks such a connection, Dask will not behave as expected. Is it feasible to move the scheduler to the same location as your workers?
I learned that right now the scheduler creates a new connection for every task, which may not be feasible when the worker is behind a firewall. Since WS communication from client to scheduler and from worker to scheduler already works, it could be reused for scheduling tasks. Although I understand that there may be some worker-to-worker communication, it is important that the system works even when a worker is not directly accessible from the scheduler.
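To illustrate the idea being requested, here is a minimal sketch in which the worker dials out to the scheduler and the scheduler then pushes tasks and receives results over that same connection, so the worker never has to accept an inbound connection. This is not Dask's comm layer: it uses plain asyncio TCP streams with newline-delimited JSON instead of WebSockets, and all names (`worker`, `demo`, the task format) are assumptions made for the example.

```python
import asyncio
import json


async def worker(host: str, port: int) -> None:
    """Dial out to the scheduler; never listen for inbound connections."""
    reader, writer = await asyncio.open_connection(host, port)
    while True:
        line = await reader.readline()
        if not line:  # scheduler closed the connection
            break
        task = json.loads(line)
        # Toy "computation": increment the input.
        result = {"key": task["key"], "value": task["x"] + 1}
        writer.write((json.dumps(result) + "\n").encode())
        await writer.drain()
    writer.close()
    await writer.wait_closed()


async def demo() -> list:
    tasks = [{"key": "inc-1", "x": 1}, {"key": "inc-2", "x": 41}]
    results = []

    async def handle(reader, writer):
        # Scheduler side: reuse the worker-initiated connection for scheduling.
        for task in tasks:
            writer.write((json.dumps(task) + "\n").encode())
        await writer.drain()
        for _ in tasks:
            results.append(json.loads(await reader.readline()))
        writer.close()
        await writer.wait_closed()

    # Port 0 lets the OS pick a free port for this local demonstration.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        await worker("127.0.0.1", port)
    return results
```

Calling `asyncio.run(demo())` runs both sides in one process. In a real deployment only the scheduler's port would need to be reachable, which matches the firewall constraint described above; worker-to-worker transfers are outside what this sketch covers.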