Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UX] Add in-progress spot jobs to sky status #1691

Merged
merged 47 commits into from
Feb 26, 2023
Merged

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Feb 10, 2023

Fixes #1665.

In this PR, we also show in-progress spot jobs in the sky status output.

We also changed the implementation of fetch_aws.py/fetch_azure.py to get rid of ray, because:

  1. ray has larger overhead
  2. ray will add sigterm handler when being imported (here), which will cause the following error when we use multiprocessing.

TODO:

  • we show sky status after sky launch, and the current change will add overheads to a normal sky launch. We may need to add an additional option to sky status to avoid querying the spot jobs.

Overhead comparison:

  1. python -c 'import time; from sky.clouds.service_catalog.data_fetchers import fetch_aws; start=time.time(); fetch_aws.fetch_availability_zone_mappings(); end = time.time(); print(end-start)' (used in aws_catalog.py)
    new: 2.44 seconds, master: 6.50 seconds
  2. time python -m sky.clouds.service_catalog.data_fetchers.fetch_aws --no-az-mappings --check-all-regions-enabled-for-account
    new: 43s, master: 59s
  3. time python -m sky.clouds.service_catalog.data_fetchers.fetch_azure
    new: 1m9s, master: 1m28s

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • sky status when the controller is stopped. (6.4s)
    • sky status -r when the controller is stopped. (37s and spot job query does not introduce more cost)
      image
    • sky status when the controller is INIT.
      image
    • sky status when the controller is UP.
      image
      image
      image
    • python -m sky.clouds.service_catalog.data_fetchers.fetch_aws --no-az-mappings --check-all-regions-enabled-for-account
    • python -m sky.clouds.service_catalog.data_fetchers.fetch_azure

@Michaelvll Michaelvll marked this pull request as ready for review February 12, 2023 00:46
@concretevitamin
Copy link
Member

Thanks @Michaelvll for this UX improvement! I will try to get to this today or tomorrow. Quick question: many users will have tens of jobs - mix of in progress and finished - in the spot queue. What would the output looks like (will it put normal clusters out of the current terminal page)?

@Michaelvll
Copy link
Collaborator Author

Thanks @Michaelvll for this UX improvement! I will try to get to this today or tomorrow. Quick question: many users will have tens of jobs - mix of in progress and finished - in the spot queue. What would the output looks like (will it put normal clusters out of the current terminal page)?

Good question! It currently shows all the in-progress spot jobs without the finished ones. I can make it only show the first 5 in-progress spot jobs, and show all the spot jobs when -a is specified. Wdyt?

@concretevitamin
Copy link
Member

concretevitamin commented Feb 13, 2023 via email

@Michaelvll
Copy link
Collaborator Author

Another idea is to not show any jobs, but show the total number of in-progress jobs + a hint asking users to check "spot queue". Wdyt?

I think it might be better to show the actual spot jobs for the user's convenience. We can still show the spot queue hint in it. : )

@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Feb 15, 2023

According to the feedback, I reordered hints for autostopped clusters to the end. Also, I removed the managed spot queue printing when the spot controller is STOPPED or DOWN. PTAL @concretevitamin
image

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll, let's go with this design thanks to user feedback. Left some UX comments.


» sst
Clusters
NAME                          LAUNCHED    RESOURCES                            STATUS  AUTOSTOP  COMMAND
sky-spot-controller-8a3968f2  16 hrs ago  1x GCP(n2-standard-4, disk_size=50)  UP      10m       sky spot queue -r

Managed Spot Jobs

* 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh

or with a refresh

» sst -r
Clusters
NAME                          LAUNCHED    RESOURCES                            STATUS   AUTOSTOP  COMMAND
sky-spot-controller-8a3968f2  16 hrs ago  1x GCP(n2-standard-4, disk_size=50)  STOPPED  10m       sky spot queue -r

Managed Spot Jobs

* 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh

Showing the header alone "Managed Spot Jobs" seems surprising (leaving it hanging). Shall we print something like "No in-progress jobs found." immediately under the header?

sky/cli.py Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried it out, works well! Some comments, mostly UX nits.

sky/cli.py Show resolved Hide resolved
sky/clouds/service_catalog/aws_catalog.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/data_fetchers/fetch_aws.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/data_fetchers/fetch_azure.py Outdated Show resolved Hide resolved
sky/utils/cli_utils/status_utils.py Outdated Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Feb 25, 2023

Blocked by #1727

@Michaelvll Michaelvll added duplicate This issue or pull request already exists and removed duplicate This issue or pull request already exists labels Feb 25, 2023
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll for this. Huge long-awaited UX improvement. LGTM.

Running a spot launch with a stopped controller, there seems to be a blank line:

 » sst
Clusters
NAME                          LAUNCHED        RESOURCES                            STATUS  AUTOSTOP  COMMAND
sky-spot-controller-8a3968f2  a few secs ago  1x GCP(n2-standard-4, disk_size=50)  INIT    -         sky spot launch -dy --cloud...

Managed spot jobs
Controller's latest status is detected to be INIT; jobs will not be shown until it becomes UP.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
Comment on lines +1564 to +1565
f'* {job_info}To see all jobs: {colorama.Style.BRIGHT}'
f'sky spot queue{colorama.Style.RESET_ALL}')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, not feeling strongly.

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
Michaelvll and others added 8 commits February 25, 2023 16:25
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -1958,11 +1958,10 @@ def _refresh_cluster_record(

Args:
cluster_name: The name of the cluster.
force_refresh: refresh the cluster status as long as the cluster exists.
If False, the cluster status will only be refreshed if the cluster:
force_refresh: if True, refresh the cluster status even if it may be skipped. Otherwise (the default),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format (line too long)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good catch! Weirdly, our format script and the GitHub CI failed to catch it. Fixed. Thanks!

# controller is INIT/UP. This optimization avoids unnecessary costly
# refresh when the controller is already stopped. This optimization is
# based on the assumption that the user will not start the controller
# manually from the cloud console.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems safe to me after reading all usage in core and cli. Even if users manually start the controller in the console, this indicates it was previously autostopped, which means job have finished, so those usage seem safe.

@Michaelvll Michaelvll merged commit 1ac582a into master Feb 26, 2023
@Michaelvll Michaelvll deleted the spot-queue-in-status branch February 26, 2023 05:54
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 15, 2023
* Show spot queue in sky status

* format

* Handle missed error

* Do not force refresh for spot controller

* use multiprocesssing

* fix deprecation error

* fix

* fix

* Fix spot queue problem

* Add --no-show-spot-queue

* format

* Only show first 5 in-progress jobs

* fix

* reorder

* Do not show the spot queue if the controller is stopped

* format

* fix

* Rearange

* partially address comments

* address comments

* fix

* longer wait for spot job

* fix

* Fix message

* format

* fix message

* revert changes for catalog fetcher

* revert

* partially address

* Use the new fetchers

* address comments

* force_refresh

* Adopt threading instead

* no need to wait

* address comments

* Update sky/backends/backend_utils.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Update sky/cli.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Update sky/cli.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* rename option

* fix argument

* format

* add comment

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[UX] Show in-progress jobs in the sky status
2 participants