
TFJob with 1 replica can't use gang-scheduling #922

Closed
@zionwu

Description

When gang scheduling is enabled, I expect all TFJobs to be scheduled by "kube-batch", so that every job follows the same scheduling policy.

However, if I submit a TFJob with 1 replica and set its schedulerName to "kube-batch", the job stays pending. The cause is that TF-operator does not create a PodDisruptionBudget (PDB) for a job with fewer than 2 replicas:

func (jc *JobController) SyncPdb(job metav1.Object, minAvailableReplicas int32) (*v1beta1.PodDisruptionBudget, error) {
	labelJobName := jc.Controller.GetJobNameLabelKey()
	// Non-distributed training is not required gang scheduling
	if minAvailableReplicas < 2 {
		return nil, nil
	}
	.....

Can we remove this check to make the scheduling policy for all jobs consistent?
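
To make the proposal concrete, here is a minimal, standalone sketch of the decision the guard makes today and how it would change if the check were relaxed. The helper `shouldCreatePDB` and the flag `gangForAllJobs` are hypothetical names used only for illustration; they are not part of TF-operator, and treating a job with zero replicas as "no PDB" is an assumption of this sketch.

```go
package main

import "fmt"

// shouldCreatePDB is a hypothetical helper, not TF-operator code.
// With gangForAllJobs == false it mirrors the guard quoted above:
// jobs with fewer than 2 replicas get no PodDisruptionBudget.
// With gangForAllJobs == true, any job with at least 1 replica gets one
// (assumption: a job with 0 replicas still gets none).
func shouldCreatePDB(minAvailableReplicas int32, gangForAllJobs bool) bool {
	if minAvailableReplicas < 1 {
		return false
	}
	if gangForAllJobs {
		return true
	}
	return minAvailableReplicas >= 2
}

func main() {
	for _, replicas := range []int32{1, 2} {
		fmt.Printf("replicas=%d current=%v proposed=%v\n",
			replicas,
			shouldCreatePDB(replicas, false),
			shouldCreatePDB(replicas, true))
	}
}
```

In this sketch, a single-replica job is the only case where the current and proposed behavior differ, which matches the pending job described above.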
