Hi everyone,
I’m struggling with reannotation and could really use some help. Please forgive me if this question seems basic; it is not my intention to waste anyone's time.
I've encountered an issue with an annotated VCF file from a targeted gene panel provided by a colleague. The VCF header lists the reference hs38DH.fa for variant calling, uses snpEff with GRCh38.p13, and applies dbNSFP annotations with SnpSift. After investigating, I found that the reference is more closely aligned with GRCh38.
I decided to reannotate the variants to see if the selected variants have changed over the past four years. Here’s what I did:
bcftools norm -m-both -o output.vcf input.vcf # split multiallelic sites into separate lines
bcftools norm -m-both -f reference.fa -o output.vcf -Ov input.vcf # same, but with the reference
I encountered an issue with discrepancies in counts within the dbNSFP parameters. Specifically, there seems to be a mismatch between the counts of ALT and values in various fields of the dbNSFP INFO section. It appears that the provided dbNSFP file may not be properly aligned with the reference, leading to these discrepancies.
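The mismatch described above can be checked per record with a small script. This is only a minimal sketch in Python, assuming the dbNSFP fields are meant to carry one value per ALT allele (Number=A) and that their INFO keys start with dbNSFP_. The function name and the example record are hypothetical and not taken from the actual file.

```python
def alt_count_mismatches(vcf_line, prefix="dbNSFP_"):
    """Return {key: (n_values, n_alt)} for INFO keys whose number of
    comma-separated values differs from the number of ALT alleles."""
    cols = vcf_line.rstrip("\n").split("\t")
    n_alt = len(cols[4].split(","))          # column 5 is ALT
    mismatches = {}
    for entry in cols[7].split(";"):         # column 8 is INFO
        key, sep, value = entry.partition("=")
        if not sep or not key.startswith(prefix):
            continue                         # skip flags and non-dbNSFP keys
        n_values = len(value.split(","))
        if n_values != n_alt:
            mismatches[key] = (n_values, n_alt)
    return mismatches


# Hypothetical record: two ALT alleles, but dbNSFP_PHRED carries one value.
example = "chr1\t100\t.\tG\tA,C\t50\tPASS\tDP=20;dbNSFP_PHRED=12.3;dbNSFP_SIFT_pred=T,D"
print(alt_count_mismatches(example))  # {'dbNSFP_PHRED': (1, 2)}
```

A record that reports nothing here has the same number of values as ALT alleles in every dbNSFP field, which is what splitting with -m-both relies on for per-allele fields.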
As a result, I decided to remove the entire dbNSFP annotation and split the multiallelic sites using bcftools norm -m-both without re-aligning against the reference genome.
Could you please advise on the best course of action? If the annotation was incorrect, should I use genomic coordinates to re-annotate it with the latest genome build without concern, or should I realign it with the original reference sequence and re-annotate from the beginning? Additionally, is normalization still recommended in this scenario?
If anybody has any resources on how to annotate a targeted gene panel, I would be grateful if you could mention them.
Thank you for your time.
Essential header:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20210212
##source=freeBayes v1.3.2-44-gfce9620
##reference=/opt/bwa.kit/hs38DH.fa
##commandline="/opt/freebayes/bin/freebayes --fasta-reference /opt/bwa.kit/hs38DH.fa --bam-list bam.txt --targets /mnt/data/BED/SCH/SCH_ALL.bed --genotype-qualities --no-partial-observations --min-repeat-entropy 1 --min-coverage 8 --min-mapping-quality 1 --min-base-quality 3 --min-alternate-fraction 0.1"
##SnpEffVersion="5.0 (build 2020-10-04 16:02), by Pablo Cingolani"
##SnpEffCmd="SnpEff -i vcf -o vcf -stats SCH_freebayes_sort_ostE_SnpEff.vcf-effects-stats.html -csvStats SCH_freebayes_sort_ostE_SnpEff.vcf-effects-stats.csv GRCh38.p13.RefSeq SCH_freebayes_sort_ostE.vcf "
##SnpSiftVersion="SnpSift 5.0 (build 2020-10-04 16:02), by Pablo Cingolani"
##SnpSiftCmd="SnpSift Annotate -dbsnp SCH_freebayes_sort_ostE_SnpEff.vcf"
- bcftools norm without reference: total/split/realigned/skipped - 78758/4626/0/0
- bcftools norm with reference: total/split/realigned - 78758/4626/21864
Possible discrepancies:
After running bcftools norm -m-both -o output.vcf input.vcf, I get the error shown on the next line. If I delete the field it complains about, the same error appears for another dbNSFP annotation, almost at random. In my opinion, the reason is that the snpEff and dbNSFP annotations do not depend on each other. Still, would deleting fields change the data somehow, or does the error come from badly formatted data?
Error: wrong number of fields in INFO/dbNSFP_PHRED at chr1:1034085, expected 6, found 4
Differences between references in 2020: because dbNSFP was applied only through SnpSift.jar, the database could be outdated, and thus my actual data as well (possibly a dbNSFP build for GRCh37/hg19 was used instead of the version for GRCh38).
Changes in ALT and dbSNP id:
chr3 184106211 ACCACCCAGC ACCCCCCCGC,ACCACCCCGC . HTR3E # from my data; in Variation Viewer, rs897969599 looks like a possible change similar to my data, but a few bp off
chr5 141102429 G A,C rs13174972 PCDHB3 || now in GRCh38.p14 there is no C allele under this rsID. I thought that if I reannotated it, I would get an updated version of the C variant.
chr6 154246729 C T rs34427887;829067 OPRM1 stop_gained HIGH || this odd composite identifier is gone; now there is only rs34427887. Again a change between my data and the current patch.
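One possible reason a record can look "a few bp off" is that freeBayes reports long REF/ALT haplotypes, while databases usually store the trimmed, minimal representation. The sketch below, in Python, shows only the trimming step on the chr3 alleles listed above; it is a simplification and not what bcftools norm does in full (with -f it also left-aligns indels against the reference). The function name is hypothetical, and whether the trimmed position matches rs897969599 has not been verified here.

```python
def trim_allele(pos, ref, alt):
    """Trim bases shared by REF and ALT, first from the right, then from
    the left, keeping at least one base in each allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # each base removed on the left shifts the start position
    return pos, ref, alt


pos, ref = 184106211, "ACCACCCAGC"
for alt in ("ACCCCCCCGC", "ACCACCCCGC"):
    print(trim_allele(pos, ref, alt))
# (184106214, 'ACCCA', 'CCCCC')
# (184106218, 'A', 'C')
```

After splitting and trimming, the second allele becomes a single-base change several bases downstream of the original POS, so comparing coordinates against a database only makes sense on the normalized form.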
Follow-up question: dbNSFP in-silico predictor result counts not equal to the number of alternatives in the VCF file
Unfortunately, the dbNSFP reannotation didn't work. I tried deleting every variant that lacked genotypes for all samples, but the problem still persists.
So I checked the VCF with vcf-validator, and it definitely has over a thousand positions where the number of dbNSFP annotation values does not match the number of ALT alleles. Interestingly, the vast majority of these variants have fewer dbNSFP values than ALT alleles.
I am still out of options, and I think the VCF I received might have a problem from the calling process.
I have added the sample VCF and the error text again. Unfortunately, it looks like a much bigger problem than I thought. It seems that almost every multiallelic variant has a problem in 13 of the 31 dbNSFP prediction tool fields. Because SnpSift.jar dbNSFP automatically updates the VCF header, I found that, according to the header, every dbNSFP field really should have the same number of prediction values as there are alternative alleles. So it looks more like a problem with how that VCF was created than with my downstream analysis. Still, any help to prove or disprove this assumption would be welcome. My supervisor just told me that I should not care about this and should simply write something in my master's thesis. I think he is wrong.
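The header-based reasoning above can be turned into a per-field tally. This is a minimal sketch in Python, assuming the header declares the affected fields as Number=A and writes ID before Number in each ##INFO line (the usual order). The function names and the example lines are hypothetical and do not come from the actual file.

```python
import re
from collections import Counter

# Assumes the conventional key order: ##INFO=<ID=...,Number=...,Type=...>
INFO_RE = re.compile(r"^##INFO=<ID=([^,]+),Number=([^,]+),")


def number_a_fields(lines):
    """Collect INFO IDs that the header declares as one value per ALT allele."""
    fields = set()
    for line in lines:
        match = INFO_RE.match(line)
        if match and match.group(2) == "A":
            fields.add(match.group(1))
    return fields


def tally_mismatches(lines):
    """Count, per Number=A field, the records whose value count != ALT count."""
    lines = list(lines)
    per_allele = number_a_fields(lines)
    tally = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        cols = line.rstrip("\n").split("\t")
        n_alt = len(cols[4].split(","))
        for entry in cols[7].split(";"):
            key, sep, value = entry.partition("=")
            if sep and key in per_allele and len(value.split(",")) != n_alt:
                tally[key] += 1
    return tally


# Hypothetical miniature VCF: one multiallelic record with a single PHRED value.
example_lines = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=dbNSFP_PHRED,Number=A,Type=String,Description="hypothetical">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="hypothetical">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tG\tA,C\t50\tPASS\tDP=20;dbNSFP_PHRED=12.3",
    "chr1\t200\t.\tT\tC\t50\tPASS\tDP=15;dbNSFP_PHRED=3.1",
]
print(dict(tally_mismatches(example_lines)))  # {'dbNSFP_PHRED': 1}
```

On a real file the same function could be fed an open file handle, for example `tally_mismatches(open("input.vcf"))`. If the tally is dominated by multiallelic records and by the same subset of fields, that would support the idea that the problem was introduced when the annotation was written rather than downstream.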
Would you use a VCF modified like this for analysis or not?
You probably don't need to start editing VCF files and maybe should start by asking some simpler questions in a new Biostars post. You seem confused about either multiallelic positions or multiple transcripts generating different consequences in your VCF. I think one thing that happens is that beginners get fixated on some confused line of reasoning and, instead of just asking "should I expect to see this?" or "what does it mean when I see this?", start asking questions based on assumptions that don't make sense.
Thank you for your advice. I added a link to the new post in the question.
You are right, I am a self-taught beginner doing a master's thesis with a supervisor who has no specialization in bioinformatics. Unfortunately, he really underestimates the data analysis and doesn't even connect me with other specialists, because he thinks I am making much ado about nothing.
However, the only usable annotations in the VCF I received come directly from dbNSFP and snpEff, which I know nothing about, because creating my own VCF was not part of my master's thesis. So when I saw problems in validation, I panicked. Partly because I am inexperienced, but mainly because I wanted to use 2 of those 13 annotations in my data analysis, which is the main reason I am terrified. If the VCF is wrong, I have been doing analysis on useless data for 4 years.
Thank you for your advice. I added a new post; the link is in the question. You are right, I am a self-taught beginner in bioinformatics and in programming in general. I am now doing my master's thesis, but creating my own VCF was not part of my analysis.
So when I saw the change, I was terrified. My supervisor (no bioinformatics specialization) said that I am only panicking. However, I really need to use 2 of those 13 tools in my analysis, so even the possibility that the annotation might be wrong haunts me (4 years lost on that).