Ray in VDK
Data engineers and analytics engineers are essential to any organization that heavily relies on data. They're responsible for designing, creating, and maintaining the data architecture of a company. This typically includes creating and managing databases (tables and datasets), data pipelines, and ETL (extract, transform, load) processes. They also work with data scientists to ensure that they have the necessary data to perform their analyses.
For instance, an analytics engineer may need to run a Python job to ingest data from a database, REST API, or other service, then a SQL job to perform some data manipulation, then a Spark job to process large amounts of data in parallel, and finally a Ray job for distributed computing. Each of these jobs would typically require a different platform or tool for management, creating a fragmented and complex workflow. This fragmentation leads to inefficiency and increased overhead, as engineers need to switch between platforms and keep track of their jobs and code versions across multiple systems.
By integrating the capability to run Ray jobs, VDK can function as a "single pane of glass" for managing and observing various types of jobs. A unified job management and version control system can significantly improve the efficiency of data engineers and scientists. With this proposal, VDK will provide a platform where users can manage diverse job types, keep track of their deployed code versions, and have a single unified UI for all types of jobs.
This proposal suggests extending the VDK to enable it to handle Ray jobs. Here's how it works:
See the original diagram here.
- Deployment & Versioning
Users deploy the jobs using vdk deploy. This command initiates the job deployment process, including versioning, thereby releasing the users' code.
vdk deploy --name your-ray-job
or, to revert to a previous version:
vdk deploy --name your-ray-job --job-version <prev-version>
- Source Code Tracking
All source code for the different types of jobs will be tracked and maintained in the VDK Source Repository. This repository serves as a single, centralized location for the source code of every job type. It acts as a read-only catalog enabling root cause analysis, reproducibility, and reuse.
- Automatic Ray client initialization and configuration
Depending on the cluster where the data job runs, VDK can automatically initialize and shut down the remote Ray connection with optimal settings and configuration:
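# Sample implementation of a vdk-ray plugin that initializes Ray before a job runs and shuts it down at the end
import ray
from vdk.api.plugin.hook_markers import hookimpl
from vdk.internal.builtin_plugins.run.job_context import JobContext


class RunRayJob:
    @hookimpl(hookwrapper=True)  # hookwrapper is required for a hook that yields
    def run_job(self, context: JobContext):
        ray.init()  # Initialize Ray before the user job starts
        yield  # Hand control back to the user job execution
        ray.shutdown()  # Shut down Ray once the job has finished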
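The sample above calls `ray.init()` with no arguments. One way the plugin might choose connection settings based on where the job runs is sketched below. This is only an illustration: the configuration keys `ray_address` and `ray_namespace` and the helper `resolve_ray_init_kwargs` are hypothetical and not part of VDK or Ray. The only Ray facts assumed are that `ray.init()` accepts `address` and `namespace` keyword arguments and that Ray Client addresses use the `ray://` scheme.

```python
from typing import Any, Dict, Mapping


def resolve_ray_init_kwargs(job_config: Mapping[str, str]) -> Dict[str, Any]:
    """Build keyword arguments for ray.init() from job configuration.

    The configuration keys used here ("ray_address", "ray_namespace")
    are hypothetical and only illustrate the idea.
    """
    kwargs: Dict[str, Any] = {}

    address = job_config.get("ray_address", "").strip()
    if address:
        # Ray Client connections use the ray:// scheme; add it if it was omitted.
        if "://" not in address:
            address = f"ray://{address}"
        kwargs["address"] = address

    namespace = job_config.get("ray_namespace", "").strip()
    if namespace:
        kwargs["namespace"] = namespace

    # With nothing configured, the empty dict leaves ray.init() on its defaults.
    return kwargs


local_kwargs = resolve_ray_init_kwargs({})
remote_kwargs = resolve_ray_init_kwargs(
    {"ray_address": "head.example.internal:10001", "ray_namespace": "etl"}
)
```

Inside the hook, the result could then be passed as `ray.init(**kwargs)`, so the user's job code stays the same whether it runs locally or against a remote cluster.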
- VDK Operations UI Integration
The VDK Operations UI will be updated to provide monitoring for all types of jobs, including Ray jobs, and to distinguish between job types. While the Operations UI provides a general overview and monitoring capabilities, it can redirect users to more specialized UIs, such as the Ray Dashboard, for more detailed insights.
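As a rough sketch of the redirect idea, the Operations UI could look up a specialized UI by job type and fall back to its own view when none exists. Everything here is hypothetical (the `DETAIL_UI_TEMPLATES` table, the `detail_ui_url` helper, and the job type names). The one assumed fact is that the Ray Dashboard listens on port 8265 by default.

```python
from typing import Optional

# Hypothetical mapping from job type to the URL template of a specialized UI.
DETAIL_UI_TEMPLATES = {
    "ray": "http://{host}:8265",  # 8265 is the Ray Dashboard's default port
}


def detail_ui_url(job_type: str, host: str) -> Optional[str]:
    """Return a link to a specialized UI for the job type, or None if there is none."""
    template = DETAIL_UI_TEMPLATES.get(job_type.lower())
    return template.format(host=host) if template else None


ray_link = detail_ui_url("ray", "ray-head")
sql_link = detail_ui_url("sql", "ray-head")
```

A `None` result would mean the job is shown only in the general Operations UI view.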