MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? #1617

Closed
@goodpp

Description

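Question: with gang (all-or-nothing) scheduling enabled, the MPIJob shows a Running state even though there are not enough resources to schedule every pod. Is this a bug?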

== MPIJOB
NAME AGE STATE
hvd-tf1-mnist 16m Running

=== POD
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
hvd-tf1-mnist-launcher   0/1     Pending   0          14m
hvd-tf1-mnist-worker-0   1/1     Running   0          14m   10.42.1.172   openpai-212
hvd-tf1-mnist-worker-1   0/1     Pending   0          14m

==== PodGroup Status
status:
  conditions:
  - lastTransitionTime: "2022-06-17T10:21:33Z"
    message: '2/1 tasks in gang unschedulable: pod group is not ready, 1 Running,
      3 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: cd024380-e518-43f0-9c44-3664ebb10429
    type: Unschedulable
  phase: Unknown
  running: 1

==== MPIJOB Status
status:
  conditions:
  - lastTransitionTime: "2022-06-17T10:06:10Z"
    lastUpdateTime: "2022-06-17T10:06:10Z"
    message: MPIJob aios/hvd-tf1-mnist is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-06-17T10:06:11Z"
    lastUpdateTime: "2022-06-17T10:06:11Z"
    message: MPIJob hvd-tf1-mnist is running.
    reason: JobRunning
    status: "True"
    type: Running
  replicaStatuses:
    Launcher: {}
    Worker:
      active: 1
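
To make the mismatch concrete, here is a minimal Python sketch of the check I would expect to hold. It is only an illustration: the numbers are copied from the status dumps above, and the helper names `gang_satisfied` and `reported_running` are hypothetical, not part of mpi-operator or the scheduler.

```python
# Illustrative only: values are copied from the status dumps in this issue.
# The helper names are hypothetical and do not exist in mpi-operator.

def gang_satisfied(running: int, min_available: int) -> bool:
    """Return True when at least min_available pods of the gang are running."""
    return running >= min_available


def reported_running(conditions: list[dict]) -> bool:
    """Return True when the job has a Running condition with status "True"."""
    return any(c["type"] == "Running" and c["status"] == "True" for c in conditions)


# From the PodGroup message: "1 Running, 3 minAvailable".
podgroup = {"running": 1, "min_available": 3}

# From the MPIJob status conditions.
mpijob_conditions = [
    {"type": "Created", "status": "True"},
    {"type": "Running", "status": "True"},
]

# The job claims to be Running while the gang is not satisfied.
mismatch = reported_running(mpijob_conditions) and not gang_satisfied(
    podgroup["running"], podgroup["min_available"]
)
print("mismatch:", mismatch)
```

With the values above the gang is not satisfied (1 of 3), yet the MPIJob carries a Running condition, which is the behavior this issue asks about.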
