Closed
Description
I'm using
gatk --java-options "-Xmx10g" CombineGVCFs
on clair3 gvcf output. I encountered this error.
htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 97073593: unparsable vcf record with allele TGTGB
The offending input line is
chr3 16902879 . TGTGB T,<NON_REF> 12.03 PASS F GT:GQ:DP:AD:AF:PL 0/1:12:38:26,11,0:0.2895:15,0,57,990,990,990
If this output is actually compatible with a version of GATK, could you please tell what version? The error was generated using v4.1.8.1.
Thanks.
Activity
zhengzhenxian commentedon Nov 25, 2022
Hi,
We test the GVCF compatibility using GATK version 4.2.6.1. And for the error, it seems that the reference consists of IUPAC base B, not sure your GATK version could parse it.
Coppini commentedon Jan 2, 2023
@aquaskyline @zhengzhenxian
Our group has been facing the same issue with the VCF files we obtain with Clair3, which cause problems when we run QC pipelines on them.
Apparently, while GATK tools seem to handle IUPAC codes in the reference, Clair3 does not.
Our current approach is to check the output VCF from Clair3 and replace any IUPAC degenerate codes in it by
N
, sinceN
is accepted by VCF standards (the other IUPAC degenerate codes are not).This has allowed us to proceed and use the generated VCF in other tools.
I noticed there was an old issue related to this on Clair as well (HKU-BAL/Clair#33).
Would it perhaps be possible to handle this automatically inside Clair3, converting these IUPAC codes to N before generating the output VCF?
Current code we've been using:
Which is basically a hacky
awk
code that replaces any degenerate IUPAC codes in the 4th or 5th fields of the VCF (the two columns containing bases, one for reference and one for alt) by N.I believe if Clair3 took care of that by itself (or at least had an option that enabled this, like a
--convert_iupac_to_n
), this would simplify things a lot for further analysis and to make the output VCF (and gVCF) compatible with other tools.Thanks!
aquaskyline commentedon Jan 3, 2023
Understood the problem and am working on a fix.
dennishendriksen commentedon Feb 20, 2023
Running into the same issue with HTSJDK v3.0.4. The VCF 4.3 specification states:
Personally I would prefer conversion to
N
as mentioned by @Coppini since that matches the reference sequence value in the cases that I encountered.Looking forward to a fix, thanks in advance!
aquaskyline commentedon Mar 6, 2023
in v1.0.0, IUPAC bases are output as
N