Skip to content

gatk compatibility #153

Closed
Closed
@shinlin77

Description

I'm using

gatk --java-options "-Xmx10g" CombineGVCFs

on clair3 gvcf output. I encountered this error.

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 97073593: unparsable vcf record with allele TGTGB

The offending input line is

chr3 16902879 . TGTGB T,<NON_REF> 12.03 PASS F GT:GQ:DP:AD:AF:PL 0/1:12:38:26,11,0:0.2895:15,0,57,990,990,990

If this output is actually compatible with a version of GATK, could you please tell what version? The error was generated using v4.1.8.1.

Thanks.

Activity

zhengzhenxian

zhengzhenxian commented on Nov 25, 2022

@zhengzhenxian
Collaborator

Hi,
We test the GVCF compatibility using GATK version 4.2.6.1. And for the error, it seems that the reference consists of IUPAC base B, not sure your GATK version could parse it.

Coppini

Coppini commented on Jan 2, 2023

@Coppini

@aquaskyline @zhengzhenxian

Our group has been facing the same issue with the VCF files we obtain with Clair3, which cause problems when we run QC pipelines on them.

Apparently, while GATK tools seem to handle IUPAC codes in the reference, Clair3 does not.

Our current approach is to check the output VCF from Clair3 and replace any IUPAC degenerate codes in it by N, since N is accepted by VCF standards (the other IUPAC degenerate codes are not).

This has allowed us to proceed and use the generated VCF in other tools.

I noticed there was an old issue related to this on Clair as well (HKU-BAL/Clair#33).

Would it perhaps be possible to handle this automatically inside Clair3, converting these IUPAC codes to N before generating the output VCF?

Current code we've been using:

awk -F'\t' -v OFS='\t' '/^[^#]/{sub(/[RYSWKMBDHV]/, "N", $4) sub(/[RYSWKMBDHV]/, "N", $5) 1'

Which is basically a hacky awk code that replaces any degenerate IUPAC codes in the 4th or 5th fields of the VCF (the two columns containing bases, one for reference and one for alt) by N.

I believe if Clair3 took care of that by itself (or at least had an option that enabled this, like a --convert_iupac_to_n), this would simplify things a lot for further analysis and to make the output VCF (and gVCF) compatible with other tools.

Thanks!

aquaskyline

aquaskyline commented on Jan 3, 2023

@aquaskyline
Member

Understood the problem and am working on a fix.

dennishendriksen

dennishendriksen commented on Feb 20, 2023

@dennishendriksen

Running into the same issue with HTSJDK v3.0.4. The VCF 4.3 specification states:

If the reference sequence contains IUPAC ambiguity codes not allowed by this specification (such as R = A/G),
the ambiguous reference base must be reduced to a concrete base by using the one that is first alphabetically
(thus R as a reference base is converted to A in VCF.)

Personally I would prefer conversion to N as mentioned by @Coppini since that matches the reference sequence value in the cases that I encountered.

Looking forward to a fix, thanks in advance!

aquaskyline

aquaskyline commented on Mar 6, 2023

@aquaskyline
Member

in v1.0.0, IUPAC bases are output as N

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      gatk compatibility · Issue #153 · HKU-BAL/Clair3