-
Notifications
You must be signed in to change notification settings - Fork 438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature][RayJob] Support light-weight job submission #1893
[Feature][RayJob] Support light-weight job submission #1893
Conversation
WithEntrypoint("python /home/ray/jobs/counter.py"). | ||
WithRuntimeEnvYAML(` | ||
env_vars: | ||
counter_name: test_counter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: If the indentation is set to 4 spaces, the YAML string will fail to unmarshal correctly. Maybe we should verify whether runtimeEnvYAML is a valid YAML or not.
@@ -197,10 +198,10 @@ type RayJobInfo struct { | |||
} | |||
|
|||
// RayJobRequest is the request body to submit. | |||
// Reference to https://docs.ray.io/en/latest/cluster/jobs-package-ref.html#jobsubmissionclient. | |||
// Reference to https://docs.ray.io/en/latest/cluster/running-applications/job-submission/rest.html#ray-job-rest-api-spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check the OpenAPI https://docs.ray.io/en/latest/cluster/running-applications/job-submission/api.html#/paths/~1api~1jobs/post for more details. Some types seem to be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OpenAPI doc is currently manually written and might itself have bugs; it might be better to use the Python as the source of truth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And if we notice something is off, we can update the OpenAPI doc)
@@ -222,6 +223,7 @@ type RayJobLogsResponse struct { | |||
} | |||
|
|||
// Note that RayJobInfo and error can't be nil at the same time. | |||
// Please make sure if the Ray job with JobId can't be found. Return a BadRequest error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat unusual, but the function NewNotFound requires schema.GroupResource as a parameter.
func NewNotFound(qualifiedResource schema.GroupResource, name string) *StatusError
@@ -222,6 +223,7 @@ type RayJobLogsResponse struct { | |||
} | |||
|
|||
// Note that RayJobInfo and error can't be nil at the same time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: KubeRay uses many different methods to create errors
. We should try to unify them.
- "k8s.io/apimachinery/pkg/api/errors"
- "github.com/pkg/errors"
fmt.Errorf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API looks reasonable and the code looks good to me! In the accompanying Ray doc PR, we can add a brief explanation of the pros and cons of setting this parameter.
@@ -197,6 +202,15 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request) | |||
rayDashboardClient.InitClient(rayJobInstance.Status.DashboardURL) | |||
jobInfo, err := rayDashboardClient.GetJobInfo(ctx, rayJobInstance.Status.JobId) | |||
if err != nil { | |||
// If the Ray job was not found, GetJobInfo returns a BadRequest error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I would expect it to return a 404 according to the doc (https://docs.ray.io/en/latest/cluster/running-applications/job-submission/api.html#/paths/~1api~1jobs~1%7Bsubmission_id%7D/get), but the doc is not automatically generated and it might be inaccurate. Do you think it's worth changing the behavior in Ray to make it return 404? I don't think any other users have reported it, so maybe it's not important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Ray RESTful API will return 404, but the function GetJobInfo
returns BadRequest here. The reason that I decide not to use NewNotFound
at this moment is that it requires to pass schema.GroupResource
into the function (example). We should figure out a better way for "errors". See #1893 (comment) for more details.
@@ -197,10 +198,10 @@ type RayJobInfo struct { | |||
} | |||
|
|||
// RayJobRequest is the request body to submit. | |||
// Reference to https://docs.ray.io/en/latest/cluster/jobs-package-ref.html#jobsubmissionclient. | |||
// Reference to https://docs.ray.io/en/latest/cluster/running-applications/job-submission/rest.html#ray-job-rest-api-spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OpenAPI doc is currently manually written and might itself have bugs; it might be better to use the Python as the source of truth.
// If LightWeightSubmissionMode is true, KubeRay operator sends a request to the RayCluster to create a | ||
// Ray job. Otherwise, the operator creates a submitter Kubernetes Job to submit the Ray job. | ||
// +kubebuilder:default:=false | ||
LightWeightSubmissionMode bool `json:"lightWeightSubmissionMode,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's cleaner to enumerate the submission type instead of a bool.
type RayJobSpec struct {
...
...
SubmissionMode JobSubmissionMode
}
type JobSubmissionMode string
const (
// Submit job via Kubernetes Job
BatchJob JobSubmissionMode = "batchv1.Job"
// Submit job via HTTP request
HTTP JobSubmissionMode = "HTTP"
...
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it depends on whether we anticipate more modes of job submission going forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Updated.
type JobSubmissionMode string | ||
|
||
const ( | ||
K8sJobMode JobSubmissionMode = "K8sJobMode" // Submit job via Kubernetes Job |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is the "Mode" part neeeed? I think it's implied from the field name. Maybe just K8sJob
and HTTP
is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to add "Mode". Without it, users might find the documentation confusing, as they would have to guess whether "HTTP" refers to the submission mode or the protocol. In addition, "K8sJob" and "K8s Job" can also be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"HTTP" refers to the submission mode or the protocol
Users will rarely see "HTTP" by itself though. They will usually see submissionMode: HTTP
and submissionMode: batchv1.Job
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They will usually see submissionMode: HTTP and submissionMode: batchv1.Job.
This is reasonable in the YAML files and in the KubeRay codebase, but it may not be accurate for the Ray documentation.
Why are these changes needed?
Some users are not willing to allocate additional compute resources for the submitter Kubernetes Job. Hence, this PR implements the light-weight job submission which uses a HTTP request to submit the Ray job.
Related issue number
Closes #1558
Checks