Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WDL generation. #6504

Merged
merged 2 commits into from
Jun 30, 2020
Merged

WDL generation. #6504

merged 2 commits into from
Jun 30, 2020

Conversation

cmnbroad
Copy link
Collaborator

@cmnbroad cmnbroad commented Mar 16, 2020

This requires a Barclay upgrade (that is not yet published) since most of the basic WDL generation code is in Barclay. WDL is only generated for tools that are annotated with @RuntimeProperties, and that have input/output files arguments that are annotated with@WorkflowResource.

For each WDL generated, an accompanying JSON input file is generated that contains all the tool's arguments, with the optional arguments initialized to the tool's default values, and the required args initialized to a string that describes the required type. A temporary commit that contains a sample WDL/JSON generated by the WDL gen task in included to make it easier to see the WDL that results.

The only commits in this branch that are directly WDL-gen related are the WDL Gen commit itself, and the sample output commit. The other commits are either related to GATKPathSpecifier migration (not required for WDL gen) or Barclay upgrade migration (required for WDL gen).

@cmnbroad
Copy link
Collaborator Author

cmnbroad commented Mar 17, 2020

@bshifaw This is the PR we discussed in slack. Since we won't actually check the generated files into the repo, I added a temporary commit with some example WDL/JSON files so you can see what currently gets generated, and am looking for comments on the WDL structure, etc. Thanks.

@droazen
Copy link
Contributor

droazen commented Apr 8, 2020

@bshifaw Just a reminder that this branch is awaiting your review. To clarify, we don't need you to review the code, we just need you to review the generated WDLs/JSONs themselves, looking for problems / missing elements / etc., and suggesting any tweaks you think are necessary. Thanks!

@cmnbroad
Copy link
Collaborator Author

cmnbroad commented Apr 8, 2020

@bshifaw Minor tweak to David's comment: you won't be able to run the WDL gen directly from this branch because it requires a barclay upgrade, but I did check a couple of output WDLs in as a temporary commit so you can look at those. If you'd like, I can generate WDL for all the tools and check those in, but I didn't do that because the tools will require some annotation in addition to this branch in order to get WDLs with proper output sections out.

@bshifaw
Copy link

bshifaw commented Apr 8, 2020

Sorry, just seeing this now.
I'm mainly looking at the sample output and it looks great!! I can't find anything wrong with it but here are some suggestions for improvement

  • If its too late now maybe for later it would be nice to add a runtime block for each task so that it would executable in a cloud environment.
  • Why have inputs declared at a workflow level input block (which requires them to be repeated in the call block, and again at the task input block) when they can be declared simply at the task level?
  • If it's an easy fix it would be nice to see the full parameter name instead of abbreviation for input and output (e.g. "I", "O")

@cmnbroad
Copy link
Collaborator Author

Thanks for the review @bshifaw.

If its too late now maybe for later it would be nice to add a runtime block for each task so that it would executable in a cloud environment.

There is limited support for runtime blocks now (you can see one here) but it requires annotating the tool with the proper values. It currently only supports the memory attribute, but we can certainly add support for other attributes if it makes sense for them to have static values. Perhaps we should add docker ? Others ?

Why have inputs declared at a workflow level input block (which requires them to be repeated in the call block, and again at the task input block) when they can be declared simply at the task level?

I chose to make each WDL have both a workflow and a task, and generated an input JSON that contains the default values for each optional arg so that only the required args have to be provided (they're initialized with the type of value required similar to the skeleton that womtool generates). But maybe theres an alternative structure thats better. Can you point me to a WDL that uses the structure you're proposing? I'm not sure I understand how it would be used.

If it's an easy fix it would be nice to see the full parameter name instead of abbreviation for input and output (e.g. "I", "O")

I used the full names wherever possible, but many of the tools have args with names that are WDL reserved works like input and output, so I had to use the short name in those cases. An alternative would be to mangle/annotate them instead, i.e., turn input into something like input_ or some such thing.

@bshifaw
Copy link

bshifaw commented Apr 17, 2020

  1. Runtime Block. Generally diskpace, memory, and docker image would be the main cloud runtime attributes to have for a task. Maybe give them default values, so that users can change them in the json if they would like.
  2. Variable declaration placement. Not absolutely necessary but thought it might help reduce the number of lines in the wdl. Here is an example where the task declares a variable that is not in the call or workflow block: here. Here is a wdl doc about it (link) though its old. Hard to find a direct statement about it in the wdl 1.0 spec but this section hints at it link.
  1. Input naming. Ahh. that makes sense. I often see input_file being used in wdls so that could work.

@cmnbroad cmnbroad force-pushed the cn_wdl_gen_barclay branch from e36a243 to c46d362 Compare April 28, 2020 22:34
Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks pretty great. I still need to look at the Barclay branch.

I had a couple of comments / questions. I primarily commented on the PrintReads wdl with a few other comments interspersed.

@cmnbroad and @bshifaw - In terms of the WDL workflow / task inputs - I think the way it's written up now is good. It would save space to only have task level inputs, but this gives more of the sense that each tool can stand on its own as an entity. It's also auto-generated so it's no more work to do it this way.

.travis.yml Outdated
- CROMWELL_JAR=$HOME/cromwell-36.jar
- WOMTOOL_JAR=$HOME/womtool-36.jar
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to update to a newer cromwell / womtool version? Looks like v47 is used in the gatkWDLGenValidation build task.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I made a separate PR to upgrade to v51, which is merged now, and this is rebased on that.

build.gradle Outdated
def wdlFiles = fileTree(dir: wdlGenFolder)
wdlFiles = wdlFiles.filter { f -> !f.getAbsolutePath().endsWith(".html") && !f.getAbsolutePath().endsWith(".json")}
return wdlFiles.any() { wdlFile ->
//TODO: fix this before submitting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to remove.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,297 @@
version 1.0

# Run PrintReads (GATK Version 4.1.7.0-7-gcd21b9c-SNAPSHOT)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Run PrintReads (GATK Version 4.1.7.0-7-gcd21b9c-SNAPSHOT)
# Run PrintReads (WDL Autogenerated from: GATK Version 4.1.7.0-7-gcd21b9c-SNAPSHOT)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, good idea.


input {
#Docker to use
String dockerImage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about versioning and how it applies to the docker image and GATK executables here.

I think we should have a default value here for the gatk executable so this can run out of the box on a GATK docker image.

Should we have a default for a docker image here? Or should we make a check in the command block below to make sure the GATK is the right version? I'm a little concerned about version mismatches.

Copy link
Collaborator Author

@cmnbroad cmnbroad Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now defaults to broadinstitue/gatk:version.

>>>

runtime {
docker: dockerImage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to give some default parameters for memory and disk?

I typically start with some defaults. Here's something like what I usually start with:

    input {
        Int? mem_gb
        Int? preemptible_attempts
        Int? disk_space_gb
        Int? cpu
        Int? boot_disk_size_gb
    }

    Boolean use_ssd = false

    Float input_files_size_gb = size(reads_file, "GiB") + size(segments_fasta, "GiB") + size(boundaries_file, "GiB")

    # You may have to change the following two parameter values depending on the task requirements
    Int default_ram_mb = 2048
    Int default_disk_space_gb = ceil((input_files_size_gb * 2) + 1024)

    Int default_boot_disk_size_gb = 15

    # Mem is in units of GB but our command and memory runtime values are in MB
    Int machine_mem = if defined(mem_gb) then mem_gb * 1024 else default_ram_mb


     runtime {
         docker: docker_image
         memory: machine_mem + " MB"
         disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD"
         bootDiskSizeGb: select_first([boot_disk_size_gb, default_boot_disk_size_gb])
         preemptible: select_first([preemptible_attempts, 1])
         cpu: select_first([cpu, 1])
     }

Copy link
Collaborator Author

@cmnbroad cmnbroad Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The defaults are set in the JSON input files (based on either the default or override values taken from the RuntimeProperties annotations).

~{true='--tmp_dir ' false='' defined(tmp_dir)}~{sep=' --tmp_dir ' tmp_dir} \
~{true='--use_jdk_deflater ' false='' defined(use_jdk_deflater)}~{sep=' --use_jdk_deflater ' use_jdk_deflater} \
~{true='--use_jdk_inflater ' false='' defined(use_jdk_inflater)}~{sep=' --use_jdk_inflater ' use_jdk_inflater} \
~{true='--verbosity ' false='' defined(verbosity)}~{sep=' --verbosity ' verbosity} \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need another newline here because of the trailing backslash?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, done.

@@ -83,17 +85,20 @@
programGroup = ReadDataManipulationProgramGroup.class
)
@DocumentedFeature
@RuntimeProperties(memoryRequirements = "1GB")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking for something like this. This is great!

@@ -248,29 +247,29 @@ void addToFeatureSources(final int featureQueryLookahead, final FeatureInput<? e
}

/**
* Given a Field known to be of type FeatureInput (or a Collection thereof), retrieves the type
* Given a ArgumentDefinition for an argument known to be of type FeatureInput (or a Collection thereof), retrieves the type
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Given a ArgumentDefinition for an argument known to be of type FeatureInput (or a Collection thereof), retrieves the type
* Given an ArgumentDefinition for an argument known to be of type FeatureInput (or a Collection thereof), retrieves the type

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit is from another branch that was required for the new Barclay version, but thats been merged into master now and this is rebased on that, so this commit is gone.

@@ -58,4 +64,27 @@ protected void addCustomBindings(final DocWorkUnit currentWorkUnit) {
}
}

// @Override
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed since it's commented out?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, done.

<#if group.supercat == supercat>
<li><a data-toggle="collapse" data-parent="#sidenav" href="#${group.id}">${group.name}</a>
<div id="${group.id}"
<?php echo ($group == '${group.name}')? 'class="accordion-body collapse in"'.chr(62) : 'class="accordion-body collapse"'.chr(62);?>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hoo boy. Are we really still using PHP? Not really a question for this code review… but I'm surprised.

@eitanbanks
Copy link
Contributor

Cool stuff!
Some small comments after using this:

  1. Why do you call the task “(toolname)Task” and not “(toolname)”? You might want to ask around for the convention here.
  2. It's weird that --help shows up as one of the command-line options.
  3. Is there a way to disable printing of non-tool-specific arguments? You may even want to consider an option that disables printing out of any arguments that aren't required (since that's what I happened to have wanted to do in my case).

@eitanbanks
Copy link
Contributor

Bug: it should be --output in the command block, not --output_arg

@cmnbroad cmnbroad force-pushed the cn_wdl_gen_barclay branch 2 times, most recently from de42cb2 to 42e2493 Compare June 16, 2020 17:28
@cmnbroad cmnbroad force-pushed the cn_wdl_gen_barclay branch 7 times, most recently from 45b362c to 70e1bbd Compare June 23, 2020 20:50
@cmnbroad cmnbroad marked this pull request as ready for review June 23, 2020 21:02
@cmnbroad
Copy link
Collaborator Author

@jonn-smith Back to you. It took a fair amount of machinery to get a cromwell execution test running on travis with dummy inputs, etc., but its finally working. For the moment, the only tool that is annotated for WDLGen is PrintReads, but as soon as this goes in I'll start annotating others in a separate PR.

@cmnbroad cmnbroad changed the title DRAFT: WDL generation. WDL generation. Jun 23, 2020
@ldgauthier
Copy link
Contributor

Are we morally opposed to tracking the generated WDLs with git? That would be the easiest way to get up-to-date single-tool WDLs into Terra, i.e. add each to Dockstore tracking the GATK repo and import into Terra from Dockstore.

@droazen
Copy link
Contributor

droazen commented Jun 24, 2020

@ldgauthier Yes, we plan to put the generated WDLs in Dockstore (Comms team has also requested this), and update them with each release.

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One or two very minor questions / suggestions.

Feel free to merge after looking at / addressing them.

/**
* Adapter shim for use within GATK to run tools in command line validation mode. Note that this
* class does not have it's own CommandLineProgramProperties annotation, and isn't intended to be
* run directly from the command line.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding in some documentation / naming the class to indicate that it really only validates the command line arguments for a given tool.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/**
* Adds a super-category so that we can custom-order the categories in the doc index
*
* @param docWorkUnit
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc needs updating here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

/**
* Add the named argument {@code argDed}to the property map if applicable.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Add the named argument {@code argDed}to the property map if applicable.
* Add the named argument {@code argDed} to the property map if applicable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@cmnbroad cmnbroad force-pushed the cn_wdl_gen_barclay branch from ce2be87 to 6550649 Compare June 30, 2020 17:21
@gatk-bot
Copy link

Travis reported job failures from build 30848
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk11 30848.14 logs

@cmnbroad cmnbroad merged commit 896e54e into master Jun 30, 2020
@cmnbroad cmnbroad deleted the cn_wdl_gen_barclay branch June 30, 2020 21:42
jonn-smith pushed a commit that referenced this pull request Jul 14, 2020
mwalker174 pushed a commit that referenced this pull request Nov 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants