Question

Filtering VCF based on codon-transcript

0

Entering edit mode

8 months ago

avelarbio46 ▴ 30

Hello everyone!

I'm trying to solve the following problem:

I have vcf A, annotated with VEP (with simple information such as transcript information and codon)

CHR     POS    REF    ALT   FILTER      INFO
1       1      A       T     .            CSQ=ENST102048|1/10

And I have a second vcf B, which is the querying VCF

CHR     POS    REF    ALT   FILTER      INFO
1       2      C       G     .          CSQ=ENST102048|1/10
10      1      C       G     .            CSQ=ENST999999|1/10

The filtered resulting vcf (C) should be (because the alteration is at the same 1/10 codon of transcript ENST102048 in the example)

CHR     POS    REF    ALT   FILTER      INFO
1       1      A       T     .            CSQ=ENST102048|1/10

As you can infer, it is the same gene, same codon, but different base (pos ref and alt)

How could I filter the VCF A to keep only sites where the transcript ID and codon is the same as in VCF B, even if pos ref and alt are different?

I've never see anyone filtering like this before, so I'm still wonder if it is possible with bcftools or any vcf tools at all.

vcf bcftools • 1.3k views

ADD COMMENT • link updated 7 months ago by barslmn ★ 2.3k • written 8 months ago by avelarbio46 ▴ 30

0

Entering edit mode

Those entries are tab delimited, not VCF. You can use simple join functions with these tab delimited files.

ADD REPLY • link 8 months ago by Ram 44k

0

Entering edit mode

This is just a minimal example, but I have the full vcfs annotated like that in a similar fashion. I tought a minimal example would be better to explain what i'm trying to acomplish

ADD REPLY • link 7 months ago by avelarbio46 ▴ 30

0

Entering edit mode

How could I filter the VCF A to keep only sites where the transcript ID and codon is the same as in VCF B, even if chr pos ref and alt are different?

That's a really weird question. Why would we have the same transcript id in two different chromosomes?

I think you can achieve what you are describing with bcftools.

Get your sites in to a tsv file. We add 1 to last column to filter for later.

bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t1\n" VCF_B > VCF_B.sites

Annotate your other VCF with the newly created table.

First we create a custom header.

Second we use annotate to add the "1"'s we created above in the fifth column. Then, we just filter with bcftools view.

TMPHDR=$(mktemp)
echo '##INFO=<ID=MYFILTER, Number=1,Type=Integer,Description="my custom filter">' >"$TMPHDR"

bcftools annotate -a  VCF_B.sites -h "$TMPHDR" -c "CHROM,POS,REF,ALT,MYFILTER" VCF_A |
     bcftoolsview -i 'INFO/MYFILTER=1' -Oz -o VCF_C

ADD REPLY • link updated 7 months ago by Ram 44k • written 8 months ago by barslmn ★ 2.3k

0

Entering edit mode

You are right! There could not be different chromossomes for the same transcript! So maybe leave out the chrom right? I know it is weird but it makes sense to look for same codon mutations. I know how to work with the tabular vcf-like file in R, but R is not very memory efficient for very large vcfs, thats why I was asking about a bcftools way of doing that, if possible. I gave that example above which is a vcf without the header (info field is annotated by VEP) I also slight modified my question to better include what I`m trying to do I will try your recomendations, thanks!

ADD REPLY • link 7 months ago by avelarbio46 ▴ 30

0

Entering edit mode

I know it is weird but it makes sense to look for same codon mutations.

Why are you not matching for transcript id + codon number?

ADD REPLY • link 7 months ago by barslmn ★ 2.3k

0

Entering edit mode

Thats exactly what I'm trying to do! But in my case, the annotation for transcript ID + codon number is in the VEP annotated CSQ field (INFO column)

ADD REPLY • link 7 months ago by avelarbio46 ▴ 30

0

Entering edit mode

I thought 1/10 in your examle is the exon rank not codon rank. You can take them out of CSQ field using the split-vep plugin.

        bcftools +split-vep -c Feature,Codons <myfile.vcf>

ADD REPLY • link 7 months ago by barslmn ★ 2.3k

0

Entering edit mode

I agree! But I'm not sure how to use codons + transcript ID as filters. I think bcftools does not accept a list of terms/txt file to filter by (VCF B would be that list, but I could use a combination of transcript ID + codon for a unique identifier). I think I will generate this unique identifier for both files, then turn the query into a bash vector then filter the original VCF using this. I'm not sure how to create this unique identifier in bcftools though.

ADD REPLY • link 7 months ago by avelarbio46 ▴ 30

0

Entering edit mode

You can use bcftools query to get your identifier and you can use join to merge them.

ADD REPLY • link 7 months ago by barslmn ★ 2.3k

0

Entering edit mode

How would I use join to filter, with this unique identifier in mind?

ADD REPLY • link 7 months ago by avelarbio46 ▴ 30

0

Entering edit mode

Here is an example you can play around and change according to your task.

https://colab.research.google.com/drive/1JTyy9MNbTPmxOT6FrSV3UmNUWNGmrXSO?usp=sharing

ADD REPLY • link 7 months ago by barslmn ★ 2.3k