Skip to content

Analysis Converter for Human who might Abhor Bioinformatics. A simple and useful interface to analysis of WES data for molecular diagnosis

License

Notifications You must be signed in to change notification settings

JC-Delmas/Captain-ACHAB

 
 

Repository files navigation

Captain ACHAB | Analysis Converter for Human who might Abhor Bioinformatics


Overview

Captain ACHAB is a simple and useful interface to analysis of NGS data for molecular diagnosis. No more endless tables with so many columns ! All necessary information is available in one look. Captain ACHAB files are readable with Excel (.xlsx) or web browser (.html) ( or soon with the open-source CuteVariant (.JSON)).

achab

Main features

  • Single window interpretation : All needed informations for variants biological interpretation are accessible (Quality data, genotype, variant localization, general population frequency, in silico splice and missense predictions, clinical human disease association ...) with multi-dimensional display (colors, on hover commentary).

  • Ranked and prioritized variants : Variants are sorted and ranked by the MoBiDiC Prioritization Algorithm with additional ranking rules depending gene knowledge (Gnomad LOEUF ) and phenotypes (HPO and Phenolyzer).

  • Mode of inheritance : Multiples sheets are created depending on assumed mode of inheritance (Autosomal recessive, Autosomal Dominant, Compound Heterozygous, de novo with the --trio option).

  • Selection of gene panels : --candidates option create a specific sheet with your genes of interests.

  • Hyperlinks to MobiDetails : Variants URL links can be opened on MobiDetails.

  • CNV implementation : --cnvGeneList option let you add CNV list from your CNV caller into Captain Achab. We recommand to use MoLLuDiC or MobiCNV for CNV calling with a output ready for Captain Achab.
    CNV

Biological interpretation with Captain Achab

To be notice : More informations are available when mouse is hovering over the cell, if a right top red triangle is displayed.

1. Select your mode of inheritance hypothesis by selecting the right sheet :

Sheets

2. Filter variant depending on general population frequency :

Default frequency threshold is 0.01 in gnomAD_genome_ALL (1%).

gnomAD

3. Analyse variants from top to bottom via the MPA-achab ranking :

According to MPA >= v1.1.0 Ranking : from 1 to 10. Priorization is refined by Achab processing (based on phenotype, genotype, allele balancy, Haploinsufficiency and custom filtering) .

mpa

  • 1 clinvar_pathogenicity : Pathogenic variants reported on ClinVar (score : 10)

ClinVar

  • 2 stop or frameshift_impact : Premature Truncation Codon : nonsense or frameshift (score : 10)
  • 3 splicing_impact (ADA, RF) : Affecting splice variants predictions ranked by algorithm performance robustness and strength (score : 10)
  • 4 splicing_impact (spliceAI high) : Affecting splice variants predictions ranked by algorithm performance robustness and strength (score : 10)
  • 5 missense impact moderate to high impact (6-10)
  • 6 moderate splicing_impact (spliceAI moderate) (score 6)
  • 7 missense_impact moderate : Missense variants scores low impact (score : 2-6)
  • 8 low splicing impact (spliceAI low) (indel) (score : 2)
  • 9 missense_impact low : Missense variants scores low impact (score : 0-2)
  • 10 unknown impact : Exonic variants with not clearly annotated ORFs and splicing variants not predicted pathogenic ; or NULL (no annotation on genes, splice etc...) (score : 0-10) (put in new Hope output file)

Mpa

4. Analysis first variants with good coverage :

DP stand for Depth
AB for Allele Balancy
AD for Allele Depth (ref and alt)
By default, cell is pink if AB is beneath 20% (0.2) and becomes purple when alt AD is under 5.
Genotype

5. Diagnosis analysis :

  • Selection of OMIM morbid gene only
    Filter emptys cells "." in the Phenotypes.refGene column.
    Clinical

  • Intervar automatic and pre-computed advice on ACMG classification.
    Intervar

6. Research analysis :

Select candidate gene by in silico predictions (LOEUF, missense score), tissue specificity and gene function.
Gene

Select candidate gene by Phenolyzer predictions
Phenolyzer

7. HGVS classification for NGS report :

Find variant nomenclature for all transcripts from RefSeq (NM).
Function

How to get a Captain Achab file

Input

Captain ACHAB need a vcf annotated by ANNOVAR with MPA annotations and Phenolyzer predictions. See MoBiDiC Prioritization Algorithm and Phenolyzer.

Use Captain Achab via a Singularity Container from a raw vcf file

We create a Singularity container to use Captain Achab workflow easier. All details are available at the repository Achabilarity.

Get all requirements to get a vcf file ready for Captain Achab

1. Get custom annotations

To get unavailable annotations in ANNOVAR database into our vcf, we strongly recommend to use AddDB_updater to add LOEUF score, OMIM database and up-to-date clinvar data into the gene_fullxref.txt from ANNOVAR available in the example folder.

2. Annovar annotation

A tutorial to install ANNOVAR and more informations are available at : MoBiDiC Prioritization Algorithm

Command line for vcf annotation by ANNOVAR with needed databases.

perl path/to/table_annovar.pl path/to/example.vcf humandb/ -buildver hg19 -out path/to/output/name -remove -protocol refGene,refGene,clinvar_20170905,dbnsfp33a,spidex,dbscsnv11,gnomad_exome,gnomad_genome,intervar_20180118 -operation gx,g,f,f,f,f,f,f,f -nastring . -vcfinput -otherinfo -arg '-splicing 20','-hgvs',,,,,,, -xref example/gene_customfullxref.txt

3. MPA annotation

See installation and more informations about MPA at MoBiDiC Prioritization Algorithm.

git clone https://github.com/mobidic/MPA.git

Command line for annotated vcf by ANNOVAR with MPA scores.

python MPA.py -i name.hg19_multianno.vcf -o name.hg19_multianno_MPA.vcf

4. Phenolyzer annotation

Tutorial to install Phenolyzer is available at Phenolyzer.

Installation (need Bioperl and Graph, easy to install with cpanm)

git clone https://github.com/WGLab/phenolyzer

Create a disease file where you can add your HPO phenotypes (one line per phenotype).

vim disease.txt

Command line to get predictions for Phenolyzer and the out.predicted_gene_scores.

perl disease_annotation.pl disease.txt -f -p -ph -logistic -out disease/out

5. MobiDetails API KEY

In order to activate MobiDetails variant links, you need to have a "MD.apikey" file containing only your personnal MobiDetails API KEY. By default it must be in the same folder as wwwachab.pl, or specify the path/name

6. Library used

Python library : pandas and dependencies (only tested with python 2.7)

Perl library via cpanm : BioPerl, Graph, Switch, Excel::Writer::XLSX

Captain ACHAB Command

https://github.com/mobidic/Captain-ACHAB.git

Command line to use Captain ACHAB

#USAGE : perl wwwachab.pl 
                        --vcf <vcf_file> (mandatory)
                        --outDir <output directory (default = current dir)>
                        --outPrefix <output file prelifx (default = "")>
                        --candidates <file with end-of-line separated gene symbols of interest (it will create more tabs, if "#myPathology" is present in the file, a myPathology tab will be created)>  
                        --phenolyzerFile <phenolyzer output file suffixed by predicted_gene_scores (it will contribute to the final ranking and top50 genes will be added in METADATA tab)>   
                        --popFreqThr <allelic frequency threshold from 0 to 1 default=0.01 (based on gnomad_genome_ALL or on the first field of --gnomadGenome argument)> 
                        --trio (requires case dad and mum option to be filled, but if case dad and mum option are filled, trio mode is automatically activated ) 
                          --case <index_sample_name> (required with trio option)
                          --dad <father_sample_name> (required with trio option)
                          --mum <mother_sample_name>  (required with trio option)
                        --customInfoList  <comma separated list of vcf annotation INFO name (each of them will be added in a new column)> 
                        --filterList <comma separated list of VCF FILTER to output (default= 'PASS', included automatically to the list)>
                        --cnvGeneList <file with gene symbol + annotation (1 tab-separated), involved by parallel CNV calling>
                        --customVCF <VCF format File with custom annotation (if variant matches then INFO field annotations will be added in new column)>
                        --mozaicRate <mozaic rate value from 0 to 1, it will color 0/1 genotype according to this value  (default=0.2 as 20%)>
                        --mozaicDP <ALT variant Depth, number of read supporting ALT, it will give darker color to the 0/1 genotype  (default=5)>
                        --newHope (only popFreqThr filter is applied (no more filterList nor MPA_ranking filtering)
                        --affected <comma separated list of samples affected by phenotype (assuming they support the same genotype) >
                        --favouriteGeneRef <File with transcript references to extract in a new column (1 transcript by line) >
                        --filterCustomVCF <integer value, penalizes variant if its frequency in the customVCF is greater than [value] (default key of info field : found=[value])  >
                        --filterCustomVCFRegex <string pattern used as regex to search for a specific field to filter customVCF (default key of info field: 'found=')  >
                        --addCustomVCFRegex (customVCF field matched with customVCFRegex will be added in a new column - it requires customVCF option -  default regex is 'found=')
                        --pooledSamples <comma separated list of samples that are pooled (it will convert 0/0 genotype into 0/1 if at least 1 read support ALT base and it will flag cell in yellow, e.g. parents pool in trio context)  >
                        --sampleSubset <comma separated list of samples only processed by Achab to the output>
                        --addCaseDepth (case Depth will be added in a new column)
                        --addCaseAB (case Allelic Balance will be added in a new column)
                        --intersectVCF <VCF format File for comparison (if variant matches then 'yes' will be added in a new 'intersectVCF' column) >
                        --poorCoverageFile <poor Coverage File (it will annotate OMIM genes if present in 4th column -requires OMIM genemap2 File- and create an excel file )>
                        --genemap2File <OMIM genemap2 file (it will help to annotate OMIM genes in poor coverage file )>
                        --skipCaseWT (only if trio mode is activated or in duo context (case and dad or mum must ne specified), it will skip variant if case genotype is 0/0 )
                        --gnomadGenome <comma separated list of gnomad genome annotation fields that will be displayed as gnomAD comments. First field of the list will be filtered regarding to popFreqThr argument. (default fields are hard-coded gnomAD_genome_ALL like)  > 
                        --gnomadExome <comma separated list of gnomad exome annotation fields that will be displayed as gnomAD comments. (default fields are hard-coded gnomAD_exome_ALL like) >
                        --hideACMG (ACMG tab will be empty but information will be reported in the gene comment)
                        --MDAPIkey <Path to File containing only MobiDetails API key (default file is MD.apikey in the achab folder, default build is hg19, but vcf header is parsed to check if hg38 and correct url ) >
                        --help (print this command usage)



Troubleshooting

Clean VCF if Platypus caller is used.

Remove line with discording field from Platypus caller.

awk -F "\t|:" '{if($5~/,.+,.+/ && $0 ~ /GT:GOF:GQ:NR:NV:PL/){na=split($5,a,",");nb=split($NF,b,",");if(nb > 8){print $0 }}else{print}}' sample.vcf > sample.clean.vcf

Multiallelic lines

Multiallelic lines from vcf have to be split before annotation.

sort example.vcf 
bcftools-1.3.1/htslib-1.3.1/bgzip -i example.sort.vcf
bcftools-1.3.1/bcftools norm -O v -m - -o example.norm.vcf example.sort.vcf.gz

Clean whitespace in VCF

No whitespace is allowed in VCF ready to Captain Achab. Replace them with underscores.

sed -i 's/ /_/g' example.vcf 

Contig in VCF missing

Contig is necessary in VCF header.

Create db.vcf input for Captain Achab

If you doesn't have database of your variant, you can create a db.vcf with all of your vcfs and Captain ACHAB will tell you how many times the variants are found & in which sample.

Put all your vcf in one folder. Use GATK Combine Variant to merge all VCF.

Create a vcfmerge.sh file :

java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg19.fa -o all.vcf -genotypeMergeOptions UNIQUIFY -minN 2 \ 

Add to the script all VCF :

for i in *.vcf; do echo --variant $i \\ >> vcfmerge.sh ; done

Run vcfmerge.sh. Multiallelic lines from vcf have to be split before sample counting.

Then run the vcf_sample_counter_merger.sh script with positionnal arguments and you'll get a db.vcf. Alternately, you can also give all your normalized vcf files to merge/count as arguments.

./vcf_sample_counter_merger.sh outputFileName.vcf  VCF1.vcf VCF2.vcf VCF3.vcf VCF4.vcf VCF5.vcf ......

Optional :

If needed, you can "clean" the all.vcf with this command :

awk -F "\t" 'BEGIN{OFS="\t"}{if($1!~/^#/){$8="AN=10"}print}' all.vcf > all_clean.vcf

If needed, you may need to bcftools LeftAlign and bcftools Split the all_clean.vcf.


Montpellier Bioinformatique pour le Diagnostic Clinique (MoBiDiC)

CHU de Montpellier

France

MoBiDiC

Visit our website


About

Analysis Converter for Human who might Abhor Bioinformatics. A simple and useful interface to analysis of WES data for molecular diagnosis

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Perl 56.5%
  • HTML 43.0%
  • Shell 0.5%