Hello everyone!
I'm trying to solve the following problem:
I have vcf A, annotated with VEP (with simple information such as transcript information and codon)
CHR POS REF ALT FILTER INFO
1 1 A T . CSQ=ENST102048|1/10
And I have a second vcf B, which is the querying VCF
CHR POS REF ALT FILTER INFO
1 2 C G . CSQ=ENST102048|1/10
10 1 C G . CSQ=ENST999999|1/10
The filtered resulting vcf (C) should be (because the alteration is at the same 1/10 codon of transcript ENST102048 in the example)
CHR POS REF ALT FILTER INFO
1 1 A T . CSQ=ENST102048|1/10
As you can infer, it is the same gene, same codon, but different base (pos ref and alt)
How could I filter the VCF A to keep only sites where the transcript ID and codon is the same as in VCF B, even if pos ref and alt are different?
I've never see anyone filtering like this before, so I'm still wonder if it is possible with bcftools or any vcf tools at all.
Those entries are tab delimited, not VCF. You can use simple join functions with these tab delimited files.
This is just a minimal example, but I have the full vcfs annotated like that in a similar fashion. I tought a minimal example would be better to explain what i'm trying to acomplish
That's a really weird question. Why would we have the same transcript id in two different chromosomes?
I think you can achieve what you are describing with bcftools.
Get your sites in to a tsv file. We add 1 to last column to filter for later.
Annotate your other VCF with the newly created table.
First we create a custom header.
Second we use annotate to add the "1"'s we created above in the fifth column. Then, we just filter with bcftools view.
You are right! There could not be different chromossomes for the same transcript! So maybe leave out the chrom right? I know it is weird but it makes sense to look for same codon mutations. I know how to work with the tabular vcf-like file in R, but R is not very memory efficient for very large vcfs, thats why I was asking about a bcftools way of doing that, if possible. I gave that example above which is a vcf without the header (info field is annotated by VEP) I also slight modified my question to better include what I`m trying to do I will try your recomendations, thanks!
Why are you not matching for
transcript id + codon number
?Thats exactly what I'm trying to do! But in my case, the annotation for transcript ID + codon number is in the VEP annotated CSQ field (INFO column)
I thought
1/10
in your examle is the exon rank not codon rank. You can take them out of CSQ field using the split-vep plugin.I agree! But I'm not sure how to use codons + transcript ID as filters. I think bcftools does not accept a list of terms/txt file to filter by (VCF B would be that list, but I could use a combination of transcript ID + codon for a unique identifier). I think I will generate this unique identifier for both files, then turn the query into a bash vector then filter the original VCF using this. I'm not sure how to create this unique identifier in bcftools though.
You can use
bcftools query
to get your identifier and you can usejoin
to merge them.How would I use join to filter, with this unique identifier in mind?
Here is an example you can play around and change according to your task.
https://colab.research.google.com/drive/1JTyy9MNbTPmxOT6FrSV3UmNUWNGmrXSO?usp=sharing