-
Notifications
You must be signed in to change notification settings - Fork 547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spot] Event callback for adding notification #2106
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this feature @MaoZiming! It would be very useful for the users. Left several comments.
Additional comments:
- We may want to call the event callback when the spot job starts recovering as well, according to the users feedback.
- We may want to think of whether we should run the callback asynchronously, instead of in a blocking way. Otherwise, our control logic may be significantly delayed.
@Michaelvll Incorporated all the comments. The basic skeleton for slack notification should be there. Prob need better UI |
…o spot-event-callback
…o spot-event-callback
…o spot-event-callback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fix @MaoZiming! The slack messages were sent like a magic. It works well for STARTING
/RUNNING
/RECOVERING
/RECOVERED
/SUCCEEDED
, but it seems like not working well with the CANCELLING
/CANCELLED
.
sky/spot/spot_state.py
Outdated
@@ -419,9 +470,13 @@ def set_cancelling(job_id: int): | |||
(SpotStatus.CANCELLING.value, time.time(), job_id)) | |||
if rows.rowcount > 0: | |||
logger.info('Cancelling the job...') | |||
if callback_func is not None: | |||
callback_func(-1, 'CANCELLING', '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weirdly, when I am trying this PR, the STARTING
, RUNNING
, RECOVERING
, RECOVERED
works normally, while this CANCELLING
/CANCELLED
state was not sent to my slack. Do you have an idea why this happens?
Fyi, I did the following:
sky spot launch test.yaml
- manually kill the cluster on the console
- the states were correctly sent
sky spot cancel -a
: the cancelling/cancelled states were not correctly sent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For set_cancelling
outside of SpotController
, I didn't pass in event_callback_func
.
https://github.com/skypilot-org/skypilot/blob/spot-event-callback/sky/spot/controller.py#L456
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to pass the event_callback_func to those as well? Otherwise, sky spot cancel -a
will not send the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to think of a clean way to do this since we need to call event_callback_func
outside of SpotController
and multiprocessing.Process(target=_run_controller...)
.
Do you know if is there an easy way to get task
inside the child process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, as we did for the _cleanup
function, we can retrieve the dag using the day_yaml
as we did in _cleanup
function
skypilot/sky/spot/controller.py
Line 430 in 438f815
_cleanup(job_id, dag_yaml=dag_yaml) |
However, we are not able to get some of the environment variable. We can probably print some log in the event callback saying which variables are not accessible, and we can wait for the user feedback to see if they want to have some of the variables in the CANCEL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Michaelvll Thanks for all the detailed feedback! The suggested changes should all be incorporated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this exciting PR @MaoZiming! The code looks good to me. Several small nits.
sky/spot/controller.py
Outdated
@@ -19,6 +19,7 @@ | |||
from sky.spot import recovery_strategy | |||
from sky.spot import spot_state | |||
from sky.spot import spot_utils | |||
from sky.spot.spot_utils import event_callback_func |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit: import module instead of functions. https://google.github.io/styleguide/pyguide.html#224-decision
sky/spot/controller.py
Outdated
first_task_id = 0 | ||
first_task = dag.tasks[first_task_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
first_task_id = 0 | |
first_task = dag.tasks[first_task_id] | |
first_task = dag.tasks[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, do you think we can use spot_state.get_latest_task_id_status
to get the current task_id, so that the message can include the information of the currently cancelled task?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I think that's better. Thanks!
sky/spot/spot_utils.py
Outdated
cluster_name = generate_spot_cluster_name(task.name, | ||
job_id) if task.name else None | ||
logger.info(f'=== START: event callback for {state!r} ===') | ||
log_path = os.path.join(constants.SKY_LOGS_DIRECTORY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we place it in a folder instead of the root directory of the logs? Probably, under ~/sky_logs/spot_event
?
Thank you so much @Michaelvll! I have incorporated all the suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fix @MaoZiming! I just realized that the current way to pass the callback function can be a bit confusing. Do you think we can do something like the comments below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for awesome work @MaoZiming! LGTM. Please feel free to merge it.
Very nice @MaoZiming! |
Issue #2000: A user would like to have a customizable event callback for the spot job, so that they can implement some notification when the spot job status changes. Example:
Allows the controller to execute a customizable bash script when the spot status changes.
Example:
Register a Slack message web hook from here.
The user will receive a Slack notification when the spot status changes.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh