I have a vcf file from which I want to remove SNPs that fall within certain genomic windows (20kb window size). I continue to face a problem with this seemingly easy task. I have checked my code for mistakes several times, but I still don't know what is going on.
Here's what I do:
1. From my original vcf file, I extracted all SNP positions like this:
bcftools query -f '%CHROM\t%POS\n' vcfFile.vcf > vcfFile.SNP.Positions
The first 10 SNPs on chrI (chromosome I) are:
chrI 810
chrI 1619
chrI 1686
chrI 2393
chrI 2674
chrI 2696
chrI 4642
chrI 4655
chrI 4813
chrI 5117
2. I then wanted to remove SNPs that fall within certain genomic windows (bins of 20kb). For this, I created a bed-file ("remove.bed").
This file looks like this (here I show the first ten 20kb-windows):
chrom chromStart chromEnd
chrI 10000011020000
chrI 1140001 1160000
chrI 1160001 1180000
chrI 3440001 3460000
chrI 6460001 6480000
chrI 6860001 6880000
chrI 6880001 6900000
chrI 6900001 6920000
chrI 6920001 6940000
chrI 6940001 6960000
Notably, this bed-file is ordered by chromosome and position in increasing order.
3. I then use vcftools to remove all SNPs from vcfFile.vcf like this:
vcftools \
--vcf vcfFile.vcf \
--exclude-bed remove.bed \
--recode \
--recode-INFO-all \
--out vcfFile_removed
This creates a new vcf file called: "vcfFile_removed.recode.vcf". As expected, this vcf file now has less SNPs than the initial vcf file.
4. However, I then looked at all SNP positions in vcfFile_removed.recode.vcf, by again using bcftools:
bcftools query -f '%CHROM\t%POS\n' vcfFile_removed.recode.vcf > vcfFile_removed.SNP.Positions
The first 10 SNPs on chrI (chromosome I) in this file are:
chrI 25980088
chrI 25980144
chrI 25980172
chrI 25980365
chrI 25980393
chrI 25980422
chrI 25980539
chrI 25980647
chrI 25980652
chrI 25980673
To make sure this has nothing to do with how the vcf file is sorted, I also grepped for "chrI 810" and "chrI 1619" in "vcfFile_removed.SNP.Positions", but these SNP positions were no longer found.
I don't understand why these SNPs were removed because they do not fall in any of the specified bins within the bed file.
Why is the first line of your BED malformed? Also, have you tried sorting your final VCF?