Relaunch all failed jobs at once, for a given step #18442

Open
vladvisan opened this issue Jun 25, 2024 · 5 comments

Comments

@vladvisan

image
Screenshot taken from @ahmedhamidawan's GCC presentation since my instance doesn’t have this feature yet.

Related to

Description

  • If multiple jobs fail for the same step (collection), it would be nice to be able to relaunch all the failed jobs for that step at the same time, with one button.
  • If not, if there are for example 50 failed jobs, it can be very tedious.
  • I’m not sure how common such large (>=50) collections are, but they can occur, potentially at even larger sizes (hundreds or thousands of datasets).

Adjacent ideas

  • Handle the cases where there is a mix of successful jobs, failed jobs, running jobs, waiting-to-be-run jobs, ..
  • One talk at GCC mentioned a « multi-select datasets » option when launching a tool, maybe the logic or page could be re-used/pre-populated ?
  • Maybe allow a multiple-choice checkbox of which jobs to re-execute, by default all selected, with a button to turn them off
  • Maybe even a regex to select the jobs to be re-executed, maybe re-using the collection filter operation
  • Should also (as usual) include the « Resume dependencies from this job ? » Additional Option
  • In the screenshot's workflow, one could just relaunch the previous step as it would in turn relaunch all the failed jobs of the last step. But this solution wouldn't work for a typical workflow (and it is inefficient even when it does work).
  • Let the user change the version of the step's tool before executing, but keep the rest of the invocation?
    • This might be necessary to resolve the underlying error. Not always: sometimes an external resource was unavailable, and relaunching the exact same tool/dataset combination works the second time around.
    • Reproducibility needs care. This might require duplicating/forking a history, or creating a new invocation combined with "Skip job execution if equivalent job exists" (#4690)?
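
To make the checkbox and regex ideas above more concrete, here is a minimal Python sketch of the selection logic only. It assumes a simplified, hypothetical job record (`StepJob`) and a hypothetical helper (`select_jobs_to_rerun`); neither is part of Galaxy's code or API, and the state names are illustrative.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


# Hypothetical, simplified record of one job of a mapped-over step.
# Field names are illustrative and do not mirror Galaxy's data model.
@dataclass(frozen=True)
class StepJob:
    job_id: str
    element_identifier: str  # name of the collection element the job ran on
    state: str               # e.g. "ok", "error", "running", "queued"


def select_jobs_to_rerun(
    jobs: List[StepJob],
    states: FrozenSet[str] = frozenset({"error"}),
    identifier_pattern: Optional[str] = None,
) -> List[StepJob]:
    """Return the jobs preselected for a re-run.

    By default every failed job is selected (the "all checked" default);
    an optional regex narrows the selection by element identifier.
    """
    regex = re.compile(identifier_pattern) if identifier_pattern else None
    return [
        job
        for job in jobs
        if job.state in states
        and (regex is None or regex.search(job.element_identifier))
    ]


jobs = [
    StepJob("j1", "sample_01", "ok"),
    StepJob("j2", "sample_02", "error"),
    StepJob("j3", "control_01", "error"),
    StepJob("j4", "sample_03", "running"),
]

# Default: all failed jobs of the step, regardless of name.
all_failed = select_jobs_to_rerun(jobs)
# Narrowed: only failed jobs whose element identifier starts with "sample_".
failed_samples = select_jobs_to_rerun(jobs, identifier_pattern=r"^sample_")

print([job.job_id for job in all_failed])
print([job.job_id for job in failed_samples])
```

Successful, running and queued jobs are left out by default, which is one possible answer to the "mix of states" case; a real implementation would have to decide what to do with jobs that are still running.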

Labels

  • feature-request, area/UI-UX, and maybe area/workflows and area/backend
@mvdbeek
Member

mvdbeek commented Jun 25, 2024

I would say the most common thing to do is to re-run a single job, this is the default behavior now, and I think that should remain that way.

> If not, if there are for example 50 failed jobs, it can be very tedious.

you can select the input collection today and all jobs will re-run. There should probably be a way to switch between those two modes more easily, so you don't need to find the input collection. The information on whether or not the job was part of a mapped over collection is available to the frontend.

> Handle the cases where there is a mix of successful jobs, failed jobs, running jobs, waiting-to-be-run jobs, ..

You can rerun the whole collection and enable the job cache; that would be the equivalent action.

> One talk at GCC mentioned a « multi-select datasets » option when launching a tool, maybe the logic or page could be re-used/pre-populated ?

this is an entirely different thing that will result in a different output structure that is flattened by one level

The rest sounds good and we should do it IMO, thanks for writing up the issue.
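
As a rough illustration of why "rerun the whole collection with the job cache enabled" is equivalent to rerunning only the failed jobs, here is a toy Python model. It is a sketch under stated assumptions, not Galaxy's implementation: `job_key`, `rerun_collection`, the dataset hashes and the `run_job` callback are all hypothetical, and the real job cache may compare jobs differently.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple


def job_key(tool_id: str, tool_version: str, inputs: List[str], params: dict) -> str:
    """Build a deterministic key for an 'equivalent job' (toy definition)."""
    payload = json.dumps(
        {"tool": tool_id, "version": tool_version, "inputs": inputs, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def rerun_collection(
    elements: Dict[str, str],        # element identifier -> dataset content hash
    tool_id: str,
    tool_version: str,
    params: dict,
    cache: Dict[str, str],           # job key -> output of an earlier successful job
    run_job: Callable[[str], str],   # hypothetical: actually executes one job
) -> Tuple[Dict[str, str], List[str]]:
    """Map the tool over every element, reusing cached successful results."""
    outputs: Dict[str, str] = {}
    executed: List[str] = []
    for identifier, dataset_hash in elements.items():
        key = job_key(tool_id, tool_version, [dataset_hash], params)
        if key in cache:
            # An equivalent job already succeeded: reuse its output.
            outputs[identifier] = cache[key]
        else:
            # No successful equivalent job (e.g. it failed earlier): run it.
            outputs[identifier] = run_job(identifier)
            cache[key] = outputs[identifier]
            executed.append(identifier)
    return outputs, executed


elements = {"sample_1": "hash-a", "sample_2": "hash-b", "sample_3": "hash-c"}
params = {"threshold": 5}

# Only the jobs that succeeded in the first run are in the cache;
# the job for sample_2 failed, so it has no entry.
cache = {
    job_key("my_tool", "1.0", ["hash-a"], params): "out-1",
    job_key("my_tool", "1.0", ["hash-c"], params): "out-3",
}

outputs, executed = rerun_collection(
    elements, "my_tool", "1.0", params, cache, run_job=lambda name: f"new-{name}"
)
print(executed)
print(outputs)
```

In this model the whole collection is resubmitted, but only the element without a cached successful job is actually executed, and the output collection keeps its original structure.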

@mvdbeek mvdbeek changed the title [Feature Request] Relaunch all failed jobs at once, for a given step Relaunch all failed jobs at once, for a given step Jun 25, 2024
@vladvisan
Author

Thanks for the feedback.

> I would say the most common thing to do is to re-run a single job, this is the default behavior now, and I think that should remain that way.
Good point.

> you can select the input collection today and all jobs will re-run

  • I must have missed something: I tried to do this, but I could not find a rerun/"recycle" button for the collection, only for the individual datasets.
  • image
  • I also tried manually editing the rerun URL of a dataset, https://usegalaxy.org/tool_runner/rerun?id=xxxx, replacing the id with the collection's id, but I got a "You are not allowed to access this dataset" page.

> You can rerun the whole collection and enable the job cache, that would be the equivalent action
Good point, thanks. I haven't enabled the job cache on my instance yet; I want to test it soon.

> this is an entirely different thing that will result in a different output structure that is flattened by one level
I understand.

@vladvisan
Author

vladvisan commented Jun 26, 2024

Also a separate comment:

"Resume dependencies from this job" even for re-runs of successful jobs?

  • I had assumed this was the case, but I just tested, and this option only appears for failed jobs (whose associated step has downstream steps)
  • At least for some scientists where I work, it is useful to be able to retry part of a workflow from a given step onwards, with slightly different parameters from that step forwards (but with the same datasets/results from before)
  • Although this could seemingly also be achieved (assuming job cache is activated) by re-running the whole workflow, and just changing the parameters of that step

@mvdbeek
Member

mvdbeek commented Jun 26, 2024

  • individual datasets

Yes, that's right: if you click on rerun there, you can replace the single input with the higher-level input (i.e. the collection input). I agree that this should probably be a more direct option in the user interface, but I wanted to point out that you can do this.

@vladvisan
Author

vladvisan commented Jun 27, 2024

UI option
Ah, I see, I was able to select the collection as you indicated, in the re-run screen:

  • image
  • image
  • (t being the name of the collection)

Basic results

  • All the collection's datasets are regenerated (one job launched per dataset), which is nice
  • However, ideally only the failed ones would trigger new jobs
  • I tested this on a collection with a mix of failed and successful dataset jobs, and they were all regenerated/relaunched

Advanced results (resume dependencies)
When I select the “Resume dependencies from this job?” option, the execution refuses to launch, with the following error screen/message (I crossed out the irrelevant information):
image

I tested this on Galaxy version 23.2.2.dev0:

  • first test: all the jobs associated with the collection's datasets were in the failed state
  • second test: some of the jobs associated with the collection's datasets were in the failed state, and the others were in the success state

Both cases led to the above error message.
