Proper reannotation of gene panel with snpEff and dbNSFP
3 months ago
Lukas ▴ 130

Hi everyone,

I’m struggling with reannotation and could really use some help. Please forgive me if this question seems basic—it's not my intention to waste anyone's time.

I've encountered an issue with an annotated VCF file from a targeted gene panel provided by a colleague. The VCF header names hs38DH.fa as the reference used for variant calling, shows snpEff run with GRCh38.p13, and shows dbNSFP annotations applied with SnpSift. After investigating, I found that the reference is more closely aligned with GRCh38.

I decided to reannotate the variants to see if the selected variants have changed over the past four years. Here’s what I did:


bcftools norm -m-both -o output.vcf input.vcf # split multiallelic sites into separate lines

bcftools norm -m-both -f reference.fa -o output.vcf -Ov input.vcf # same, but also left-align and normalize indels against the reference
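
For readers unfamiliar with what splitting with -m-both does to per-allele annotations, here is a minimal Python sketch. It is a toy illustration and not bcftools code: the function name split_multiallelic, the AF and DP values, and the assumption that AF is declared Number=A are made up for the example; only the chr5 position and alleles come from the data listed further down.

```python
# Toy illustration of splitting one multiallelic VCF record into biallelic
# records. Simplified on purpose; real tools also handle FORMAT fields,
# Number=R/G fields, and indel normalization.

def split_multiallelic(chrom, pos, ref, alts, info, per_alt_keys):
    """Return one (chrom, pos, ref, alt, info) tuple per ALT allele.

    info         : dict mapping INFO key -> raw string value
    per_alt_keys : INFO keys assumed to carry one value per ALT (Number=A)
    """
    alt_list = alts.split(",")
    records = []
    for i, alt in enumerate(alt_list):
        new_info = {}
        for key, value in info.items():
            if key in per_alt_keys:
                values = value.split(",")
                # A per-ALT field must have exactly one value per ALT allele,
                # otherwise there is no safe way to decide which value goes where.
                if len(values) != len(alt_list):
                    raise ValueError(
                        f"wrong number of fields in INFO/{key} at {chrom}:{pos}, "
                        f"expected {len(alt_list)}, found {len(values)}"
                    )
                new_info[key] = values[i]
            else:
                # Site-level fields are copied unchanged to every new record.
                new_info[key] = value
        records.append((chrom, pos, ref, alt, new_info))
    return records


# Hypothetical INFO values for illustration only.
for rec in split_multiallelic(
    "chr5", 141102429, "G", "A,C", {"AF": "0.25,0.10", "DP": "40"}, {"AF"}
):
    print(rec)
```

The point of the sketch is that a per-ALT field with too few or too many values cannot be split unambiguously, which is the kind of situation the error message further down describes.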

I encountered an issue with discrepancies in counts within the dbNSFP parameters. Specifically, there seems to be a mismatch between the counts of ALT and values in various fields of the dbNSFP INFO section. It appears that the provided dbNSFP file may not be properly aligned with the reference, leading to these discrepancies.

As a result, I decided to remove the entire dbNSFP annotation and split the multiallelic sites using bcftools norm -m-both, without normalizing against the reference genome.

Could you please advise on the best course of action? If the annotation was incorrect, should I use genomic coordinates to re-annotate it with the latest genome build without concern, or should I realign it with the original reference sequence and re-annotate from the beginning? Additionally, is normalization still recommended in this scenario?

If anybody has any resources on how to annotate a targeted gene panel, I would be grateful if you could share them.

Thank you for your time.

Essential header:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
#fileDate=20210212
##source=freeBayes v1.3.2-44-gfce9620
##reference=/opt/bwa.kit/hs38DH.fa

##commandline="/opt/freebayes/bin/freebayes --fasta-reference /opt/bwa.kit/hs38DH.fa --bam-list bam.txt --targets /mnt/data/BED/SCH/SCH_ALL.bed --genotype-qualities --no-partial-observations --min-repeat-entropy 1 --min-coverage 8 --min-mapping-quality 1 --min-base-quality 3 --min-alternate-fraction 0.1"

##SnpEffVersion="5.0 (build 2020-10-04 16:02), by Pablo Cingolani"
##SnpEffCmd="SnpEff  -i vcf -o vcf -stats SCH_freebayes_sort_ostE_SnpEff.vcf-effects-stats.html -csvStats SCH_freebayes_sort_ostE_SnpEff.vcf-effects-stats.csv GRCh38.p13.RefSeq SCH_freebayes_sort_ostE.vcf "

##SnpSiftVersion="SnpSift 5.0 (build 2020-10-04 16:02), by Pablo Cingolani"
##SnpSiftCmd="SnpSift Annotate -dbsnp SCH_freebayes_sort_ostE_SnpEff.vcf"


- bcftools norm without reference: total/split/realigned/skipped - 78758/4626/0/0
- bcftools norm with reference: total/split/realigned - 78758/4626/21864

Possible discrepancies:

After running bcftools norm -m-both -o output.vcf input.vcf I get the error below. If I delete the offending field, the same error appears for another dbNSFP annotation, seemingly at random. In my opinion the reason is that the snpEff and dbNSFP annotations do not depend on each other. Still, would deleting fields alter the data in some way, or does the error come from badly formatted data?

Error: wrong number of fields in INFO/dbNSFP_PHRED at chr1:1034085, expected 6, found 4

Differences between references in 2020: because dbNSFP was applied only through SnpSift.jar, it could be outdated, and thus my actual data as well (possibly the dbNSFP build for GRCh37/hg19 was used instead of the one for GRCh38).

Changes in ALT and dbSNP id:

chr3 184106211 ACCACCCAGC ACCCCCCCGC,ACCACCCCGC . HTR3E # from my data; in Variation Viewer, rs897969599 looks like a change similar to mine, but a few bp off

chr5 141102429 G A,C rs13174972 PCDHB3 || in GRCh38.p14 the rsID no longer lists a C allele. I thought that if I reannotated it, I would get an updated version of the C variant.

chr6 154246729 C T rs34427887;829067 OPRM1 stop_gained HIGH || this odd composite identifier is gone; now there is only rs34427887. Again a change between my data and the current patch.




follow-up question: dbNSFP in-silico predictor results counts not-equal to number to alternatives in vcf file

vcf • 1.2k views
ADD COMMENT
0

Unfortunately, dbNSFP reannotation didn't work. I tried deleting every variant that lacks genotypes for all samples, but the problem persists.

So I checked the VCF with vcf-validator, and it definitely has over a thousand positions where the number of dbNSFP values does not match the number of ALT alleles. Interestingly, the vast majority of them have fewer dbNSFP values than ALT alleles.

I am out of options, and I think the VCF I got might have a problem from the calling process.
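
To put a number on the mismatches described above, a small Python sketch like the following could tally them per INFO field. It is only a sketch under assumptions: it trusts the ##INFO header to say which fields are Number=A, it assumes an uncompressed tab-separated VCF, and the toy records and the Number=A declaration for dbNSFP_PHRED are invented for the example (check the real header before relying on it).

```python
import re
from collections import Counter

# Captures the ID and Number of an ##INFO header line.
INFO_DECL = re.compile(r"##INFO=<ID=([^,]+),Number=([^,]+),")


def count_number_a_mismatches(vcf_lines):
    """Per INFO key declared Number=A, count records where the number of
    comma-separated values differs from the number of ALT alleles."""
    per_alt_keys = set()
    mismatches = Counter()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            m = INFO_DECL.match(line)
            if m and m.group(2) == "A":
                per_alt_keys.add(m.group(1))
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and the #CHROM column header
        fields = line.split("\t")
        n_alt = len(fields[4].split(","))
        for entry in fields[7].split(";"):
            key, sep, value = entry.partition("=")
            if sep and key in per_alt_keys and len(value.split(",")) != n_alt:
                mismatches[key] += 1
    return mismatches


# Toy input; values are hypothetical.
toy_vcf = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=dbNSFP_PHRED,Number=A,Type=String,Description="toy">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="toy">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tG\tA,C\t50\tPASS\tDP=30;dbNSFP_PHRED=12.1,3.4",
    "chr1\t200\t.\tT\tA,C,G\t50\tPASS\tDP=30;dbNSFP_PHRED=8.0,9.5",
]
print(count_number_a_mismatches(toy_vcf))
```

On a real file one would pass an open file handle instead of toy_vcf. A per-field tally makes it easier to see whether the problem is spread across all dbNSFP fields or limited to a subset.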

ADD REPLY
0

I added the sample VCF and the error text again. Unfortunately, it looks like a much bigger problem than I thought. It seems that almost every multiallelic variant has a problem in 13 of the 31 dbNSFP prediction tool fields. Because the SnpSift.jar dbNSFP command automatically updates the VCF header, I found that, according to the header, every dbNSFP field really should have as many prediction values as there are alternative alleles. So it looks more like a problem with how that VCF was created than with my downstream analysis. Still, any help proving or disproving this assumption would be appreciated. My supervisor just told me that I should not care about this and should just write something in my master's thesis. I think he is wrong.

ADD REPLY
1

Would you guys use a VCF changed like this for analysis or not?

ADD REPLY
0

You probably don't need to start editing VCF files; maybe start by asking some simpler questions in a new Biostars post. You seem confused about either multiallelic positions or multiple transcripts generating different consequences in your VCF. I think one thing that happens is that beginners get fixated on some confused line of reasoning and, instead of just asking "should I expect to see this?" or "what does it mean when I see this?", start asking questions based on assumptions that don't make sense.

ADD REPLY
0

Thank you for your advice. I added a link to the new post in the question.

You are right, I am a self-taught beginner doing a master's thesis with a supervisor who has no specialization in bioinformatics. Unfortunately, he truly underestimates the data analysis and won't even connect me with other specialists, because he thinks I am making much ado about nothing.

However, the only usable annotations in the VCF I got come directly from dbNSFP and snpEff, which I know nothing about, because creating my own VCF wasn't part of my master's thesis. So when I saw problems in validation, I panicked. Partly because I am inexperienced, but mainly because I wanted to use 2 of those 13 annotations for my data analysis, which is the main reason why I am terrified. If the VCF is wrong, I have been doing analysis on useless data for 4 years.

ADD REPLY
0

Thank you for your advice. I added a new post; the link is in the question. You are right, I am a self-taught beginner in bioinformatics and programming in general. I am now doing my master's thesis, but creating my own VCF wasn't part of my analysis.

So when I saw the change, I was terrified. My supervisor (no bioinformatics specialization) said that I am only panicking. However, I truly need to use 2 of those 13 tools in my analysis. So even the possibility that the annotation might be wrong haunts me (I would have lost 4 years on that).

ADD REPLY
4
3 months ago

ok so these are just GRCh38 with some extra stuff reads can map to

hs38a    hs38 (GRCh38) plus ALT contigs
hs38DH   hs38a plus decoy contigs and HLA genes (recommended for GRCh38 mapping)

dbNSFP doesn't know about these extra things but that shouldn't matter, they are just thrown in to ensure the reads that do map to GRCh38 are not artifactual alignments
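
If you want to see how many of your records sit on those extra contigs, a rough Python sketch could look like this. It assumes UCSC-style names (chr1..chr22, chrX, chrY, chrM) for the primary chromosomes and treats everything else as "extra"; the helper names and the example contig names and positions are illustrative, apart from chr1:1034085, which is taken from the error in the question.

```python
import re

# Primary chromosomes only: chr1..chr22, chrX, chrY, chrM.
# Anything else (ALT, decoy, HLA, unplaced contigs) counts as "extra" here.
PRIMARY_CONTIG = re.compile(r"^chr(?:[1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def is_primary_contig(name):
    """True if the contig name is one of the primary chromosomes."""
    return PRIMARY_CONTIG.match(name) is not None


def partition_by_contig(records):
    """Split (chrom, pos) tuples into (primary, extra) lists."""
    primary, extra = [], []
    for rec in records:
        (primary if is_primary_contig(rec[0]) else extra).append(rec)
    return primary, extra


# Illustrative records; only the first position comes from the question.
records = [
    ("chr1", 1034085),
    ("chr6_GL000250v2_alt", 12345),
    ("chrUn_KN707606v1_decoy", 678),
    ("HLA-A*01:01:01:01", 90),
]
primary, extra = partition_by_contig(records)
print(len(primary), "primary,", len(extra), "extra")
```

For a targeted panel called with a BED file the "extra" list would be expected to be empty or very small, which is one quick way to confirm the extra contigs are not the source of the problem.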

if you have a specific example of a discrepancy then please include it, otherwise I don't see a real issue here

ADD COMMENT
0

I think I was wrong here. Sure, using dbNSFP directly from snpEff might not be best practice (maybe not the newest version), but it is not a mistake. I thought that dbNSFP was directly part of the snpEff annotations, but that is not the case. Still, I am terrified to reannotate those targeted genes. I don't know what they are or whether something was specially added (it's data from schizophrenia patients, quite a special case). So is it OK to build GRCh38.p14 (the newest one; I have a problematic config file for it: by name it's p14, but by genomic assembly ID it's the p13 patch) and use dbNSFP directly as part of SnpSift, or should I download the newest dbNSFP from the authors?

ADD REPLY
2

There is no difference between p14, p13 or any other patch when it comes to the primary sequence assembly that you actually align reads to. See Understanding Used Assembly: Why aren't authors specific about patch version?

SnpEff mentions patches because they are linked to changes in annotation released by Ensembl.

I'm not sure how often dbNSFP gets updated (seems pretty seldom) but I think you need to identify and post actual discrepancies you are seeing in order for anyone to help you.

ADD REPLY
0

Added a few differences. I hope they are understandable. I know that patches are not big changes. I know it would be possible to just cross-reference the variants with dbSNP, ClinVar, VarSome, and UniProt, but that seems like reading off a list or guessing. That was my reasoning for truly reannotating the data: to have something tangible to argue about.

ADD REPLY
0

as strange as it may sound, dbSNP rsid's are not linked to allelic changes, but just chr:pos, so it's natural things would come and go there

so this has nothing to do with imaginary genome patches and everything to do with dbSNP versions

the only version difference that matters is the one you just added, about dbNSFP for GRCh37/hg19 possibly being used instead of the version for GRCh38. That would be a breaking change.
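
one way to keep ID churn separate from real differences is to compare the old and new annotation by chrom/pos/ref/alt and only then look at the ID column. a minimal Python sketch, assuming you have already loaded both annotations into dicts (the function name id_changes and the dict layout are made up here; the OPRM1 record is the one from the question):

```python
def id_changes(old, new):
    """Report ID differences for variants present in both annotations.

    old, new : dict mapping (chrom, pos, ref, alt) -> ID string,
               with multiple IDs separated by ';'
    Returns  : dict mapping the same key -> (ids_removed, ids_added)
    """
    changes = {}
    for key in old.keys() & new.keys():
        old_ids = set(old[key].split(";"))
        new_ids = set(new[key].split(";"))
        if old_ids != new_ids:
            changes[key] = (sorted(old_ids - new_ids), sorted(new_ids - old_ids))
    return changes


# The variant itself is unchanged; only the ID column differs.
old = {("chr6", 154246729, "C", "T"): "rs34427887;829067"}
new = {("chr6", 154246729, "C", "T"): "rs34427887"}
print(id_changes(old, new))
```

if the alleles match and only the IDs moved, that points at the dbSNP version rather than at the calls.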

ADD REPLY
0

So you are saying that I truly don't need to reannotate anything, and should just cross-reference it and say why it may be interesting? That sounds like advertisement, not science. Still, thank you very much for your answer. Unfortunately, you are one of the few people who take my questions about bioinformatics seriously (most of them are from this forum, none from my lab). I truly appreciate it.

ADD REPLY
0

It's fine to reannotate but it's not helpful to start pointing fingers at patch releases when things look slightly different.

If you see more discrepancies that concern you please post them, but dbSNP ones are not a big deal in my opinion

The "Error: wrong number of fields" message is interesting, but we'd need a reprex to examine it

ADD REPLY
0

I added a link to testable data and the errors from bcftools norm -m-both.

I don't know what happened, but it looks like dbNSFP had a problem with only 3 variants in that data.

So should I reannotate it as a whole or just delete the affected variants? I don't know why, but I always thought that dbNSFP just combines data from different variant callers and annotators to give a general overview, not split per transcript. That's why I hadn't even considered bcftools norm and used the OnePerLine script from SnpSift.jar instead.

ADD REPLY
0

I don't think I am going to get the newest data, but I noticed that there are a few changes in dbSNP ID and IMPACT.

ADD REPLY
