-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-GVS bits required for GVS [VS-971] #8362
Conversation
@@ -451,6 +451,19 @@ public static StorageAPIAvroReaderAndBigQueryStatistics executeQueryWithStorageA | |||
} | |||
|
|||
public static boolean doRowsExistFor(String projectID, String datasetName, String tableName, String columnName, String value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't unique to this PR, but should we be worried about injection attacks here? It seems like we should be sanitizing the inputs and probably using whatever bigquery query builder exists instead of directly munging strings from the user into our queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably not clear from the way GitHub displays these diffs, but this PR is adding a second version of doRowsExistFor
that makes its last parameter a Long
instead of the String
. So the changes here should be even safer than the baseline. 🙂
The code in this file is often formatting not just column values into String, but also project / dataset / table. All of these are variable in GVS (even table names, thanks to "superpartitioning" to work around BQ partitioning constraints).
At least in the GVS context I'm not particularly concerned about injection here for a few reasons:
- The project and dataset must have been validated to even create the tables in the first place.
- Table and column names are not specified from user input.
- Users would be running GVS via their pet service accounts and wouldn't have the ability to read or write anything that isn't theirs.
- These
doRowsExistFor
are only called to validate sample ids (Longs) and digests. So even if the VCF header had some "little Bobby Tables" in it we would be querying for the digest of that.
Happy to add some docs / warnings to clarify for potential future users for whom these assumptions may not apply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mcovarr I have a few comments. I didn't look at the wdl at all. It should really have a test to show the new functionality also.
if (!g.hasLikelihoods()) { | ||
if (!g.isHomRef() && !roundContributionFromEachGenotype) { | ||
throw new IllegalStateException("Genotype likelihoods cannot be determined from GQ for genotype: " + g + | ||
".\nGenotypes with no PLs should have integer counts using roundContributionFromEachGenotype = true."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message seems confusingly worded to me. Maybe something like "Non-hom-ref genotypes with no PLs are only allowed if roundContributionFromEachGenotype = true.
@@ -62,9 +62,19 @@ public static GenotypeCounts computeDiploidGenotypeCounts(final VariantContext v | |||
continue; | |||
} | |||
|
|||
if (!g.hasLikelihoods() && g.isHomRef()) { | |||
if (!g.hasLikelihoods()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the control flow kind of confusing here now. I might rearrange it into a tree with round vs not round branches instead of pulling out that error case at the top.
Something like this:
if (!g.hasLikelihoods()) {
if (roundContributionFromEachGenotype) {
if (g.isHomRef()) {
genotypeWithTwoRefsCount += 1;
} else if (g.isHet()) {
genotypesWithOneRefCount += 1;
} else {
genotypesWithNoRefsCount += 1;
}
} else {
if (!g.isHomRef()) {
throw new IllegalStateException("Genotype likelihoods cannot be determined from GQ for genotype: " + g +
".\nGenotypes with no PLs should have integer counts using roundContributionFromEachGenotype = true.");
}
if (gq == 0) {
genotypeWithTwoRefsCount += 1.0/3;
genotypesWithOneRefCount += 1.0/3;
genotypesWithNoRefsCount += 1.0/3;
} else {
genotypeWithTwoRefsCount += QualityUtils.qualToProb(gq);
genotypesWithOneRefCount += 1 - QualityUtils.qualToProb(gq);
//assume last likelihood is negligible
}
}
continue;
}
@@ -208,7 +208,7 @@ their names (or descriptions) depend on some threshold. Those filters are not i | |||
public static final String VQSR_FAILURE_PREFIX = "low_VQSLOD_"; | |||
public static final String VQSR_FAILURE_SNP = VQSR_FAILURE_PREFIX + SNP; | |||
public static final String VQSR_FAILURE_INDEL = VQSR_FAILURE_PREFIX + INDEL; | |||
public static final String VQS_SENS_FAILURE_PREFIX = "low_VQS_SENS_"; | |||
public static final String VQS_SENS_FAILURE_PREFIX = "high_VQS_SENS_"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this was confusing enough that it was gotten wrong initially maybe add a comment explaining what this means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's analogous to the low_VQSLOD_ prefix - "Prefix for a site (SNP/INDEL) that failed calibration sensitivity cutoff". In this case, the site would be a failure if the sensitivity is greater than the threshold.
df66177
to
a68c196
Compare
The "WDL test" CI failures are not related to the changes in this PR, please see this sanity check PR which is also currently aflame. |
e14616c
to
585d0c8
Compare
Some updates per the description:
|
18202d7
to
23b750d
Compare
* | ||
* Created by jonn on 4/17/19. | ||
*/ | ||
public final class BigQueryUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of deleting and re-adding these, it might be better to simply move them to their destined package in this PR. That way, we could preserve the git history of changes to these utils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was done so I'm going to dismiss this review.
91de082
to
977a2ef
Compare
977a2ef
to
dc44285
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
A collection of changes in non-GVS packages required to build a working version of GVS against master:
JointVcfFiltering.wdl
.VQS_SENS_FAILURE_PREFIX
VCF header value updated for correctness.gvs
package to make clear these are currently considered to be GVS specific.ExcessHet calculation fixes for the case of no PLs.Removed, no longer required with Annotation changes inExtractTool
.