Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does TfJob controller need to do master election? #263

Closed
jlewi opened this issue Jan 3, 2018 · 9 comments
Closed

Does TfJob controller need to do master election? #263

jlewi opened this issue Jan 3, 2018 · 9 comments

Comments

@jlewi
Copy link
Contributor

jlewi commented Jan 3, 2018

The TfJob controller does master election. This code was originally copied from the etcd controller.

I think the etcd operator was using master election to handle operator update.

What do others think? Is master election the right thing to do?

/cc @zjj2wry @wackxu @gaocegege @DjangoPeng

@gaocegege
Copy link
Member

gaocegege commented Jan 3, 2018

I think it is better to follow the design of Kubernetes controller instead of etcd operator, since we could keep the controller stateless. @wackxu did lots of refactoring work and we are close to it.

In that architecture, we do not need keep a in-memory state machine and the map here https://github.com/tensorflow/k8s/blob/master/pkg/controller/controller.go#L53. And we do not need the master election, of course.

@DjangoPeng
Copy link
Member

Both tensorflow/k8s and caicloud TFJob controller keep the sample controller style. And, TFJob controller should be stateless and get rid of in-memory state machine.

So, we can deprecate the master election.

@gaocegege
Copy link
Member

FYI @jlewi https://github.com/caicloud/kubeflow-controller

Recently Google open sources a new project Kubeflow, which is dedicated to making using ML stacks on Kubernetes easy, fast and extensible.

We have a similar project in CaiCloud and it serves for several years. In that project we took different approaches but similar design decisions to Kubeflow. Then we re-implement our internal tool and open source it to accelerate the development of the community powered controller tensorflow/k8s.

We open sourced a temporary re-implementation of our internal tool, ml-executor, in caicloud. And it is designed to be stateless although it is simple now. We hope the project could accelerate the development of tensorflow/k8s :-) And we will use tensorflow/k8s eventually to replace this controller.

@wackxu
Copy link
Contributor

wackxu commented Jan 3, 2018

If we want to make TFJob controller stateless like Kubernetes controller , I think we need do master election for high availability so that we can running more than one tensorflow controller and when one of them down, it can switch smoothly to one of the cold back.

@gaocegege
Copy link
Member

Master election is a little heavy for me, I think we could use deployments to deploy the controller and rely on k8s to help us handle the controller failure.

BTW /cc @ScorpioCPH @jimexist

@ScorpioCPH
Copy link
Member

If it is stateless, I think we can just scale-up it by load balancer like web service, what's the benefit of built-in master election?

@gaocegege
Copy link
Member

There is one thing different compared to web services: All controller instances watch the apiserver and all of them will execute event call back handler by default. Although we could keep the data consistency with the support of ResourceVersion, I think it is a little weird.

@ScorpioCPH
Copy link
Member

@gaocegege So I think it is not stateless as you described.

In consideration of data consistency, I prefer master election.

@gaocegege
Copy link
Member

Close it since we come to an agreement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants