Patched 0 SNPs -> an old .tbi index file seems to be the culprit. #24

Open
@mikelove

Description

Similar to #19, but in this case, the same chromosome names are used, so that shouldn't be the issue. I'm using the genome that was used to create the VCF files, as provided by Sanger.

Using g2gtools 0.2.7 installed via conda install -c kbchoi g2gtools=0.2.7=py36_0.

If I download these VCF and FASTA files for CAST:

ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.indels.dbSNP142.normed.vcf.gz
ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa

And then start the first two steps to create a transcriptome:

g2gtools vcf2vci -p 12 -i snps/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz \
  -i snps/CAST_EiJ.mgp.v5.indels.dbSNP142.normed.vcf.gz \
  -o snps/CAST_EiJ.vci -s CAST_EiJ -f fasta/GRCm38_68.fa --pass --quality

g2gtools patch -p 12 -i fasta/GRCm38_68.fa -c snps/CAST_EiJ.vci.gz \
  -o fasta/CAST_EiJ.patched.fa 2> fasta/CAST_EiJ.patched.log

The log output says:

==> fasta/CAST_EiJ.patched.log <==
[g2gtools] Patched 0 SNPs total
[g2gtools] Patch complete: 00:00:14.33

And the output FASTA is equivalent to the reference.

If I exclude the indels file, I get:

==> fasta/CAST_EiJ.patched.log <==
[g2gtools] Patched 552,805 SNPs total
[g2gtools] Patch complete: 00:00:18.53

I've tried re-ordering the SNP and indels files, and I've tried with and without the --pass and --quality filters. I've also tried merging the two VCFs into one file (removing the header of the second one).
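
Since the title points at an old .tbi index, a quick check along the following lines might help spot one. This is only a minimal sketch and not part of g2gtools: the function names `stale_index` and `check_indexes` are hypothetical, and it assumes each index sits next to its VCF as `<vcf>.tbi`.

```shell
#!/usr/bin/env bash
# Report tabix indexes (.tbi) that are older than the VCF they index.
# Assumption: the index for FILE.vcf.gz is stored as FILE.vcf.gz.tbi.

# Hypothetical helper: succeeds if the index exists and is older than the VCF.
stale_index() {
  local vcf="$1" tbi="$1.tbi"
  # -ot is true when the left file has an older modification time.
  [ -e "$tbi" ] && [ "$tbi" -ot "$vcf" ]
}

# Hypothetical helper: print one line per stale index among the given VCFs.
check_indexes() {
  local vcf
  for vcf in "$@"; do
    if stale_index "$vcf"; then
      echo "stale: $vcf.tbi"
    fi
  done
}
```

For example, `check_indexes snps/*.vcf.gz` prints one line per suspicious index, which could then be rebuilt with `tabix -f -p vcf <file>`. Modification times are only a heuristic, since downloads and copies do not always preserve them.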