
unable to fetch TFJob when I use client.go run tfjob #1612

Closed
@goodpp

Description

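For accuracy, this issue was originally described in Chinese.
When I run a TFJob through client.go, the TFJob and its pod data are cleaned up automatically once the job succeeds. The training-operator log then shows TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found. How can I find out what causes the "unable to fetch TFJob xxx" message?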

PS: The same TFJob finishes normally when I run it with kubectl. The problem only appears when I run it through client.go.
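
One way to narrow this down, assuming the difference lies in the spec that client.go submits rather than in the operator itself, is to compare the cleanup-related fields of the two submitted objects. The sketch below reads a TFJob manifest in JSON form (for example the output of `kubectl get tfjob <name> -n <namespace> -o json`, or the object serialized just before client.go creates it) and prints `spec.runPolicy.cleanPodPolicy` and `spec.runPolicy.ttlSecondsAfterFinished`. The two inline manifests, the helper names `cleanupSettings` and `describe`, and the idea that these fields are responsible are all assumptions for illustration and are not confirmed by the logs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runPolicy mirrors only the two cleanup-related fields under spec.runPolicy
// of a kubeflow.org/v1 TFJob. Pointers distinguish "unset" from a zero value.
type runPolicy struct {
	CleanPodPolicy          *string `json:"cleanPodPolicy"`
	TTLSecondsAfterFinished *int32  `json:"ttlSecondsAfterFinished"`
}

// tfJob is a deliberately partial view of a TFJob manifest.
type tfJob struct {
	Spec struct {
		RunPolicy runPolicy `json:"runPolicy"`
	} `json:"spec"`
}

// cleanupSettings (hypothetical helper) extracts the cleanup-related
// settings from a TFJob manifest given as JSON.
func cleanupSettings(manifest []byte) (runPolicy, error) {
	var job tfJob
	if err := json.Unmarshal(manifest, &job); err != nil {
		return runPolicy{}, fmt.Errorf("parse TFJob manifest: %w", err)
	}
	return job.Spec.RunPolicy, nil
}

// describe (hypothetical helper) renders the settings as one line so that
// two manifests can be compared by eye.
func describe(p runPolicy) string {
	clean, ttl := "<unset>", "<unset>"
	if p.CleanPodPolicy != nil {
		clean = *p.CleanPodPolicy
	}
	if p.TTLSecondsAfterFinished != nil {
		ttl = fmt.Sprintf("%ds", *p.TTLSecondsAfterFinished)
	}
	return fmt.Sprintf("cleanPodPolicy=%s ttlSecondsAfterFinished=%s", clean, ttl)
}

func main() {
	// Both manifests are made-up examples, not the ones from this issue.
	manifests := []struct {
		source string
		json   string
	}{
		{"kubectl", `{"spec":{"runPolicy":{"cleanPodPolicy":"None"}}}`},
		{"client.go", `{"spec":{"runPolicy":{"cleanPodPolicy":"Running","ttlSecondsAfterFinished":5}}}`},
	}
	for _, m := range manifests {
		p, err := cleanupSettings([]byte(m.json))
		if err != nil {
			fmt.Println(m.source, "error:", err)
			continue
		}
		fmt.Printf("%s: %s\n", m.source, describe(p))
	}
}
```

If the two lines differ, the client.go submission is the place to look first. If they match, the deletion is more likely coming from somewhere else, such as the calling program deleting the TFJob after it succeeds.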

k8s version: 1.20
client.go version: 1.21

Relevant logs:
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="Ignoring inactive pod aios/tjob-tf1-ps-demo-10-1-0-0-worker-0 in state Succeeded, deletion time "
time="2022-06-13T06:51:01Z" level=info msg="Pod: aios.tjob-tf1-ps-demo-10-1-0-0-worker-0 exited with code 0" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="TFJob=aios/tjob-tf1-ps-demo-10-1-0-0, ReplicaType=PS expected=1, running=1, failed=0" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="TFJob=aios/tjob-tf1-ps-demo-10-1-0-0, ReplicaType=Worker expected=1, running=1, failed=0" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
2022-06-13T06:51:01.777Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"79919"}, "reason": "ExitedWithCode", "message": "Pod: aios.tjob-tf1-ps-demo-10-1-0-0-worker-0 exited with code 0"}
2022-06-13T06:51:01.778Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"79919"}, "reason": "TFJobSucceeded", "message": "TFJob aios/tjob-tf1-ps-demo-10-1-0-0 successfully completed."}
time="2022-06-13T06:51:01Z" level=info msg="Finished updating TFJobs Status "tjob-tf1-ps-demo-10-1-0-0" (10.619824ms)" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="Controller tjob-tf1-ps-demo-10-1-0-0 deleting pod aios/tjob-tf1-ps-demo-10-1-0-0-worker-1" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
2022-06-13T06:51:01.800Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"80228"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: tjob-tf1-ps-demo-10-1-0-0-worker-1"}
time="2022-06-13T06:51:01Z" level=info msg="Controller tjob-tf1-ps-demo-10-1-0-0 deleting service aios/tjob-tf1-ps-demo-10-1-0-0-worker-1"
2022-06-13T06:51:01.809Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"80228"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: tjob-tf1-ps-demo-10-1-0-0-worker-1"}
time="2022-06-13T06:51:01Z" level=info msg="Controller tjob-tf1-ps-demo-10-1-0-0 deleting pod aios/tjob-tf1-ps-demo-10-1-0-0-ps-0" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
2022-06-13T06:51:01.817Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"80228"}, "reason": "SuccessfulDeletePod", "message": "Deleted pod: tjob-tf1-ps-demo-10-1-0-0-ps-0"}
time="2022-06-13T06:51:01Z" level=info msg="Controller tjob-tf1-ps-demo-10-1-0-0 deleting service aios/tjob-tf1-ps-demo-10-1-0-0-ps-0"
2022-06-13T06:51:01.825Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"aios","name":"tjob-tf1-ps-demo-10-1-0-0","uid":"97074ddb-b857-4e6e-8d2b-765ed4f006de","apiVersion":"kubeflow.org/v1","resourceVersion":"80228"}, "reason": "SuccessfulDeleteService", "message": "Deleted service: tjob-tf1-ps-demo-10-1-0-0-ps-0"}
time="2022-06-13T06:51:01Z" level=info msg="Finished updating TFJobs Status "tjob-tf1-ps-demo-10-1-0-0" (7.262213ms)" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="pod aios/tjob-tf1-ps-demo-10-1-0-0-ps-0 is terminating, skip deleting" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="Finished updating TFJobs Status "tjob-tf1-ps-demo-10-1-0-0" (3.70475ms)" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=warning msg="Reconcile Tensorflow Job error Operation cannot be fulfilled on tfjobs.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0": the object has been modified; please apply your changes to the latest version and try again"
2022-06-13T06:51:01.839Z ERROR controller-runtime.manager.controller.tfjob-controller Reconciler error {"name": "tjob-tf1-ps-demo-10-1-0-0", "namespace": "aios", "error": "Operation cannot be fulfilled on tfjobs.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0": the object has been modified; please apply your changes to the latest version and try again"}
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="pod aios/tjob-tf1-ps-demo-10-1-0-0-ps-0 is terminating, skip deleting" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
time="2022-06-13T06:51:01Z" level=info msg="Reconciling for job tjob-tf1-ps-demo-10-1-0-0"
time="2022-06-13T06:51:01Z" level=info msg="pod aios/tjob-tf1-ps-demo-10-1-0-0-ps-0 is terminating, skip deleting" job=aios.tjob-tf1-ps-demo-10-1-0-0 uid=97074ddb-b857-4e6e-8d2b-765ed4f006de
2022-06-13T06:51:10.096Z INFO TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found {"tfjob": "aios/tjob-tf1-ps-demo-10-1-0-0", "unable to fetch TFJob": "aios/tjob-tf1-ps-demo-10-1-0-0"}
2022-06-13T06:51:10.102Z INFO TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found {"tfjob": "aios/tjob-tf1-ps-demo-10-1-0-0", "unable to fetch TFJob": "aios/tjob-tf1-ps-demo-10-1-0-0"}
2022-06-13T06:51:10.105Z INFO TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found {"tfjob": "aios/tjob-tf1-ps-demo-10-1-0-0", "unable to fetch TFJob": "aios/tjob-tf1-ps-demo-10-1-0-0"}
2022-06-13T06:51:40.214Z INFO TFJob.kubeflow.org "tjob-tf1-ps-demo-10-1-0-0" not found {"tfjob": "aios/tjob-tf1-ps-demo-10-1-0-0", "unable to fetch TFJob": "aios/tjob-tf1-ps-demo-10-1-0-0"}
