Greetings,
I'm trying to get a sequence for each sample in a multi-sample vcf by combining a reference sequence with the variants from the vcf.
The problem is that a few variants overlap with indels. Some are correctly (I believe) denoted by the * symbol, as per the VCF 4.3 specification; others are not.
There are lines where the ALT allele is * but the record does not seem to overlap with any other variant. Below is an example of such an entry.
The variant at position 14586 is called as * even though it overlaps neither the previous nor the subsequent variant in this sample.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CAR10
MT 14559 . TAAA TA,T,TAA 4.35685e+06 . AC=1,0,0
MT 14586 . AATATATATATATATATATATAT AATATATAT,AATATATATAT,*,AATATATATATATAT,AATATATATATATATAT,AATAT 881243 . AC=0,0,1,0,0,0
MT 15129 . C A 2.69538e+06 . AC=1
Is this a problem with the vcf file or am I misunderstanding the format?
I'm using bcftools to do the following:
- subset the multi-sample vcf to get a single-sample vcf (bcftools view)
- normalise (bcftools norm)
- combine reference fasta and the vcf (bcftools consensus)
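The three steps above could be sketched roughly as the commands built below. This is only an illustration: the file names (multi.vcf.gz, ref.fa, and the per-sample outputs) are hypothetical placeholders, and the options shown are an assumption, not necessarily the exact ones used. The script only prints the commands; it does not run bcftools.

```python
import shlex


def build_pipeline(multi_vcf, ref_fasta, sample):
    """Return the bcftools commands for subset -> normalise -> consensus.

    All file names are hypothetical placeholders chosen for illustration.
    """
    subset = f"{sample}.vcf.gz"
    normed = f"{sample}.norm.vcf.gz"
    return [
        # 1. keep only one sample from the multi-sample VCF
        ["bcftools", "view", "-s", sample, "-Oz", "-o", subset, multi_vcf],
        # 2. left-align and normalise indels against the reference
        ["bcftools", "norm", "-f", ref_fasta, "-Oz", "-o", normed, subset],
        # index the compressed VCF before building the consensus
        ["bcftools", "index", normed],
        # 3. apply the variants to the reference fasta
        ["bcftools", "consensus", "-f", ref_fasta, "-o", f"{sample}.fa", normed],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("multi.vcf.gz", "ref.fa", "CAR10"):
        print(shlex.join(cmd))
```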
bcftools consensus skips variants that overlap with the previous variant. I believe this is done by comparing the position of a variant with the end position of the previous variant.
In the case above it does not skip the variant at position 14586 but instead writes a "*" symbol to the output fasta file. Should it just use the reference instead?
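To make the overlap check concrete, here is a minimal sketch of the comparison as I understand it. It is an assumption about the logic, not the actual bcftools implementation, and the function names are made up for illustration. It treats a record as covering POS through POS + len(REF) - 1, and it tracks the furthest end seen so far instead of only the immediately preceding record.

```python
def ref_end(pos, ref):
    """Last 1-based reference position covered by a record's REF allele."""
    return pos + len(ref) - 1


def find_overlapping(records):
    """Return the POS of every record that starts inside an earlier record.

    `records` is a list of (pos, ref) tuples for one contig, sorted by pos.
    """
    overlapping = []
    furthest_end = 0  # furthest reference position covered so far
    for pos, ref in records:
        if pos <= furthest_end:
            overlapping.append(pos)
        furthest_end = max(furthest_end, ref_end(pos, ref))
    return overlapping


# The three records from the example above (POS and REF only).
records = [
    (14559, "TAAA"),
    (14586, "AATATATATATATATATATATAT"),
    (15129, "C"),
]

print(find_overlapping(records))
```

For the three example records this prints an empty list, which matches my reading that the record at 14586 does not overlap its neighbours.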
Any help would be greatly appreciated.
Did you figure this out? If you have any insights, please check out my question about a phased trio vcf:
Removing / Excluding / Collapsing Overlapping Indels