Question

Problems with overlapping variants in a vcf file

3.8 years ago

Buxus ▴ 10

Greetings,

I'm trying to get a sequence for each sample in a multi-sample vcf by combining a reference sequence with the variants from the vcf.

The problem is that a few variants overlap with indels. Some are correctly (I believe) denoted by the * symbol, as per the VCF 4.3 specification; others are not. There are lines where the ALT allele is * but the record does not seem to overlap with any other variant. Below is an example of such an entry. The variant at position 14586 is called as * even though it overlaps neither the previous nor the subsequent variant in this sample.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  CAR10
MT  14559   .   TAAA    TA,T,TAA    4.35685e+06 .   AC=1,0,0
MT  14586   .   AATATATATATATATATATATAT AATATATAT,AATATATATAT,*,AATATATATATATAT,AATATATATATATATAT,AATAT 881243  .   AC=0,0,1,0,0,0
MT  15129   .   C   A   2.69538e+06 .   AC=1

Is this a problem with the vcf file or am I misunderstanding the format?
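One way to check this mechanically is sketched below: for each record that lists * among its ALT alleles, test whether its position falls inside the REF span of an earlier record that carries a deletion allele. The names `Record`, `has_deletion` and `unexplained_star_sites` are hypothetical and exist only for illustration. The check is site-level (it ignores which alleles the sample actually carries), and the records are typed in by hand from the excerpt above.

```python
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    pos: int                # 1-based POS column
    ref: str                # REF allele
    alts: Tuple[str, ...]   # ALT alleles, already split on ","


def has_deletion(rec: Record) -> bool:
    """True if any ALT (other than '*') is shorter than REF."""
    return any(alt != "*" and len(alt) < len(rec.ref) for alt in rec.alts)


def unexplained_star_sites(records: List[Record]) -> List[int]:
    """Positions of records with a '*' ALT that no earlier deletion record spans."""
    flagged = []
    furthest_del_end = 0  # last reference base covered by an earlier deletion record
    for rec in sorted(records, key=lambda r: r.pos):
        if "*" in rec.alts and rec.pos > furthest_del_end:
            flagged.append(rec.pos)
        if has_deletion(rec):
            rec_end = rec.pos + len(rec.ref) - 1
            furthest_del_end = max(furthest_del_end, rec_end)
    return flagged


records = [
    Record(14559, "TAAA", tuple("TA,T,TAA".split(","))),
    Record(
        14586,
        "AATATATATATATATATATATAT",
        tuple(
            "AATATATAT,AATATATATAT,*,AATATATATATATAT,"
            "AATATATATATATATAT,AATAT".split(",")
        ),
    ),
    Record(15129, "C", ("A",)),
]

print(unexplained_star_sites(records))  # [14586]
```

With the three records shown, the deletion at 14559 ends at 14562, so nothing upstream reaches 14586 and that site is flagged.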

I'm using bcftools to do the following:

subset the multi-sample vcf to get a single-sample vcf (bcftools view)
normalise (bcftools norm)
combine the reference fasta and the vcf (bcftools consensus)
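For concreteness, the three steps could look roughly like the commands built below. This is only a sketch: the exact options I ran are not shown here, so the flags, the `bcftools index` step, and every file name are assumptions, and `build_pipeline` is a hypothetical helper that only assembles the command lines without running them.

```python
from typing import List


def build_pipeline(sample: str, vcf: str, ref: str) -> List[List[str]]:
    """Assemble (but do not run) bcftools commands for one sample.

    All output file names are hypothetical and derived from the sample name.
    """
    single = f"{sample}.vcf.gz"
    normed = f"{sample}.norm.vcf.gz"
    fasta = f"{sample}.fa"
    return [
        # 1. subset the multi-sample VCF to one sample
        ["bcftools", "view", "-s", sample, "-Oz", "-o", single, vcf],
        # 2. normalise against the reference
        ["bcftools", "norm", "-f", ref, "-Oz", "-o", normed, single],
        # assumed extra step: consensus needs an indexed VCF
        ["bcftools", "index", normed],
        # 3. apply the sample's variants to the reference
        ["bcftools", "consensus", "-f", ref, "-s", sample, "-o", fasta, normed],
    ]


for cmd in build_pipeline("CAR10", "all_samples.vcf.gz", "reference.fa"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute it.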

bcftools consensus skips the variants that overlap with the previous variant. I believe this is done by comparing the position of a variant with the end position of the previous variant.
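The rule as I understand it can be sketched as follows. This is my reading of the behaviour, not the actual bcftools implementation, and `split_overlapping` is a hypothetical name. The record at 14561 is made up to show a skipped variant; the other three use the positions and REF alleles from the excerpt above.

```python
from typing import List, Tuple

Variant = Tuple[int, str]  # (1-based POS, REF allele)


def split_overlapping(variants: List[Variant]) -> Tuple[List[Variant], List[Variant]]:
    """Split position-sorted variants into (kept, skipped).

    A variant is skipped when its POS is not past the last reference
    base covered by the previously kept variant.
    """
    kept: List[Variant] = []
    skipped: List[Variant] = []
    prev_end = 0
    for pos, ref in variants:
        if pos <= prev_end:
            skipped.append((pos, ref))
            continue
        kept.append((pos, ref))
        prev_end = pos + len(ref) - 1  # end of the REF span on the reference
    return kept, skipped


variants = [
    (14559, "TAAA"),
    (14561, "A"),  # hypothetical record, inside the span 14559-14562
    (14586, "AATATATATATATATATATATAT"),
    (15129, "C"),
]

kept, skipped = split_overlapping(variants)
print([pos for pos, _ in kept])     # [14559, 14586, 15129]
print([pos for pos, _ in skipped])  # [14561]
```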

In the case above, it does not skip the variant at position 14586 but instead writes a "*" symbol into the output fasta file. Should it just use the reference allele instead?

Any help would be greatly appreciated.

bcftools variant vcf • 1.8k views

Updated 11 weeks ago by Ram 44k • written 3.8 years ago by Buxus ▴ 10

Did you figure this out? If you have any insights, please check out my question about a phased trio vcf:

Removing / Excluding / Collapsing Overlapping Indels

11 weeks ago by jon.klonowski ▴ 210