Does TfJob controller need to do master election? #263
Comments
I think it is better to follow the design of the Kubernetes controllers rather than the etcd operator, since that lets us keep the controller stateless. @wackxu did a lot of refactoring work and we are close to it. In that architecture, we do not need to keep an in-memory state machine or the map here: https://github.com/tensorflow/k8s/blob/master/pkg/controller/controller.go#L53. And, of course, we do not need master election.
Both tensorflow/k8s and the caicloud TFJob controller follow the sample controller style. The TFJob controller should be stateless and get rid of the in-memory state machine, so we can deprecate master election.
FYI @jlewi https://github.com/caicloud/kubeflow-controller
We open sourced a temporary re-implementation of our internal tool, ml-executor, at caicloud. It is designed to be stateless, although it is simple for now. We hope the project can accelerate the development of tensorflow/k8s :-) We will eventually use tensorflow/k8s to replace this controller.
If we want to make TFJob controller stateless like |
Master election seems a little heavy to me. I think we could use a Deployment to deploy the controller and rely on k8s to handle controller failure for us. BTW /cc @ScorpioCPH @jimexist
If it is |
There is one difference compared to web services: all controller instances watch the apiserver, and by default all of them will execute the event callback handlers. Although we could keep the data consistent with the support of ResourceVersion, I think it is a little weird.
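To make the ResourceVersion point concrete, here is a minimal sketch of optimistic concurrency when two controller instances react to the same event. It does not use client-go; `Store`, `Job`, and `ErrConflict` are hypothetical stand-ins for the apiserver, a TFJob object, and the conflict error, and the integer version is a simplification of the real opaque ResourceVersion string.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict mimics the conflict an apiserver reports for a stale write (hypothetical name).
var ErrConflict = errors.New("conflict: stale resourceVersion")

// Job is a hypothetical stand-in for a TFJob object.
type Job struct {
	ResourceVersion int
	Phase           string
}

// Store is a toy in-memory stand-in for the apiserver.
type Store struct {
	job Job
}

// Get returns a copy of the stored job.
func (s *Store) Get() Job { return s.job }

// Update accepts the write only if it was based on the current version.
func (s *Store) Update(j Job) error {
	if j.ResourceVersion != s.job.ResourceVersion {
		return ErrConflict
	}
	j.ResourceVersion++
	s.job = j
	return nil
}

func main() {
	store := &Store{job: Job{ResourceVersion: 1, Phase: "Creating"}}

	// Two controller instances handle the same event and read the same version.
	a, b := store.Get(), store.Get()
	a.Phase = "Running"
	b.Phase = "Running"

	// The first write wins; the second is based on a stale version and is rejected.
	fmt.Println("instance A:", store.Update(a))
	fmt.Println("instance B:", store.Update(b))
}
```

In this sketch the data stays consistent, but both instances still do the work of handling the event, and the loser has to re-read and retry. That duplicated effort is the "weird" part.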
@gaocegege So I think it is not In consideration of data consistency, I prefer master election. |
Closing this since we have come to an agreement.
The TfJob controller does master election. This code was originally copied from the etcd controller.
I think the etcd operator was using master election to handle operator updates.
What do others think? Is master election the right thing to do?
/cc @zjj2wry @wackxu @gaocegege @DjangoPeng