These scripts were applied for variant calling and processing in the context of NAVIP: https://github.com/bpucker/NAVIP. Some scripts are only included for documentation purposes, while others were written in a generic way to facilitate re-use. Scripts are written in Python v2.7 or Python v3.8.
This script is intended as documentation of the process. It is customized for best performance on the local compute cluster. Re-use would require adjustments to certain parts of the script.
Usage
python GATK_variant_calling.py
Mandatory:
--input_bam_file STR Path to BAM file.
--ref_file STR Path to reference sequence file.
--directory STR Output folder
--piccard STR Full path to piccard tools.
--samtools STR Samtools path.
--gatk STR Path to GATK.
--varcallprepscript STR Path to variant_call_preparation.py.
--varsortscript STR Path to sort_vcf_by_fasta.py.
Optional:
--bam_is_sorted (prevents sorting of bam file).
--input_bam_file
specifies full path to BAM input file.
--ref_file
specifies the full path to the reference genome sequence FASTA file.
--directory
specifies the output folder.
--piccard
specifies the full path to piccard tools.
--samtools
specifies the full path to samtools.
--gatk
specifies the full path to GATK.
--varcallprepscript
specifies the full path to the Python script variant_call_preparation.py (see below).
--varsortscript
specifies the full path to the Python script sort_vcf_by_fasta.py (see below).
This script is used internaly to allow parallel processing of sequences in the reference data set.
This script is intended as documentation of the process. It is customized for best performance on the local compute cluster. Re-use would require adjustments to certain parts of the script.
Usage
python GATK1_BP.py
Mandatory:
--input_bam_file STR Path to BAM file.
--ref_file STR Path to reference sequence file.
--directory STR Output folder
--gold_vcf STR Path to gold standard VCF
--piccard STR Full path to piccard tools.
--samtools STR Samtools path.
--gatk STR Path to GATK.
--varcallprepscript STR Path to variant_call_preparation.py.
Optional:
--bam_is_sorted (prevents sorting of bam file).
--input_bam_file
specifies full path to BAM input file.
--ref_file
specifies the full path to the reference genome sequence FASTA file.
--directory
specifies the output folder.
--gold_vcf
specifies the full path to the gold standard VCF.
--piccard
specifies the full path to piccard tools.
--samtools
specifies the full path to samtools.
--gatk
specifies the full path to GATK.
--varcallprepscript
specifies the full path to the Python script variant_call_preparation.py (see below).
This script is intended as documentation of the process. It is customized for best performance on the local compute cluster. Re-use would require adjustments to certain parts of the script.
Usage
python GATK2_BP.py
Mandatory:
--ref_file STR Path to reference sequence file.
--vcf_dir STR Path to VCF folder
--out_dir STR Path to output folder
--gold_vcf STR Path to gold standard VCF
--piccard STR Full path to piccard tools.
--samtools STR Samtools path.
--gatk STR Path to GATK.
Optional:
--bam_is_sorted (prevents sorting of bam file).
--ref_file
specifies the full path to the reference genome sequence FASTA file.
--vcf_dir
specifies the folder containing the VCF files.
--out_dir
specifies the output folder.
--gold_vcf
specifies the full path to the gold standard VCF.
--piccard
specifies the full path to piccard tools.
--samtools
specifies the full path to samtools.
--gatk
specifies the full path to GATK.
This script combines the content of all VCF files detected in the provided input folder in a single VCF file.
Usage
python VCF_combiner.py
Mandatory:
--in STR Path to VCF input folder.
--out STR Path to output file.
--in
specifies the path to the input VCF folder.
--out
specifies the path to the output VCF file.
This script sorts a given VCF file based on the oder of sequences in a given FASTA file.
Usage
python sort_vcf_by_fasta.py
Mandatory:
--vcf STR Path to input VCF.
--fasta STR Path to input FASTA.
--output STR Path to output VCF.
--vcf
specifies the VCF input file.
--fasta
specifies the FASTA input file.
--output
specifies the VCF output file.
This script validates variants in a given VCF file by comparison against a high quality assembly. This assembly needs to be independent from the reads contributing to the analyzed variants.
WARNING: number of sequences (chromosomes) should not exceed 9!
Usage
python variant_validator.py
Mandatory:
--assembly STR Path to assembly file.
--ref STR Path to reference genome sequence file.
--invcf STR Path to input VCF file.
--flank INT Length of flanking sequences.
--outvcf STR Path to output VCF.
--chr STR Chromosome name.
--outerr STR Path to error output file.
--assembly
specifies the full path to the assembly FASTA file.
--ref
specifies the full path to the reference genome FASTA file.
--invcf
specifies the full path to the input VCF file.
--flank
specifies the size of the flanking sequences of variants to run the validation.
--outvcf
specifies the full path to the output VCF.
--chr
specifies the name of a chromsome to run the validation for one chromosome at a time.
--outerr
specifies the full path to the error output file.
This script splits a given VCF file and allows parallel processing of variants in each sequence.
Usage
python variant_validation_wrapper.py
Mandatory:
--assembly STR Path to assembly file.
--ref STR Path to reference file.
--vcf STR Path to input VCF file.
--flank INT Length of flanking sequences.
--out STR Path to the output folder.
--script STR Path to variant_validator.py
--assembly
specifies the full path to the assembly FASTA file.
--ref
specifies the full path to the reference genome sequence FASTA file.
--vcf
specifies the input VCF file.
--flank
specifies the length of the variant flanking sequence used for validation.
--out
specifies the output folder.
--script
specifies the full path to the script variant_validator.py.
This script calculates statistics and displays the genome-wide distribution of variants.
Usage
python analyze_variant_set.py
Mandatory:
--vcf STR Path to input VCF file.
--fig STR Path to output figure.
--report STR Path to report file.
--vcf
specifies the full path to the input VCF file.
--fig
specifies the full path to the output figure file.
--report
specifies the full path to the report file.
Add a last column (FORMAT) to an existing VCF-like file to meet the VCF requirements.
Usage
python3 correct_VCF_format.py
Mandatory:
--in STR Path to input VCF file.
--out STR Path to output VCF file.
--in
specifies the full path to the input VCF file.
--out
specifies the full path to the output VCF file.
Separate SNVs and InDels from a VCF file by generating two separate new files.
Usage
python3 separate_SNVs_InDels.py
Mandatory:
--in STR Path to input VCF file.
--snvout STR Path to SNV output VCF file.
--indelout STR Path to InDel output VCF file.
--in
specifies the full path to the input VCF file.
--snvout
specifies the full path to the SNV output VCF file.
--indelout
specifies the full path to the InDel output VCF file.
This script compares the stop_gain predictions of SnpEff and NAVIP.
Usage
python3 compare_stop_gain_events.py.py
Mandatory:
--snpeffvcf STR Path to SnpEff output file.
--navipvcf STR Path to NAVIP output file.
--out STR Path to output folder.
--snpeffvcf
specifies the SnpEff output VCF file that is required as input for this script.
--navipvcf
specifies the NAVIP output VCF file that is required as input for this script.
--out
specifies the output folder.
This script performs an analysis of synonymous (aaS) and non-synonymous (aaN) variants in genes with premature stop codons.
Usage
python aa_ns_analysis.py
Mandatory:
--in STR Path to NAVIP output file.
--genes STR Path to genes info file.
--out STR Path to output folder.
--in
specifies the NAVIP output file as input for this script.
--genes
specifies the gene info file that provides the IDs of genes with premature stop codons.
--out
specifies the folder for all output files.
This scripts takes the average expression per gene and compares these values between two groups of genes.
Usage
python3 compare_gene_exp_between_gene_groups.py
Mandatory:
--genes STR Path to genes info file.
--exp STR Path to average expression file.
--out STR Path to output folder.
optional:
--gff STR Path to GFF file.
--genes
specifies the gene info file that provides the IDs of genes with premature stop codons.
--exp
specifies the path to a file with average gene expression. Gene IDs are in the first column, mean values in the second column, and median values in the third column.
--out
specifies the folder for all output files.
--gff
specifies the GFF3 file for background gene IDs.
Baasner, J.-S., Howard, D., Pucker, B.(2019). Influence of neighboring small sequence variants on functional impact prediction. bioRxiv. doi:10.1101/596718 https://doi.org/10.1101/596718