Hi everyone. I tried to apply GATK
pipeline to amplicon-based
mtDNA sequencing data analyse and found some problems want to ask/discuss with you guys.
I sat up my pipeline according this GATK Best Practice. Here is my pipeline:
- QC(fastp)
- mapping
bwa mem -t 4 -R "@RG\tID..." hs_ref_GRCh38.p12_chrMT.fa sample.R1.fastq.gz sample.R2.fastq.gz > sample.sam
- SAM to BAM(samtools)
samtools view -bF 12 sample.sam > sample.bam
- sort BAM(GATK SortSam)
gatk SortSam -I sample.bam -O sample_sorted.bam -SO coordinate
- BAM index(samtools index)
- Validate BAM(GATK ValidateSamFile)
gatk ValidateSamFile -I sample_sorted.bam -M VERBOSE
- Call variant(GATK HaplotypeCaller)
gatk HaplotypeCaller -I samplt_sorted.bam -O sample.g.vcf -R hs_ref_GRCh38.p12_chrMT.fa -ploidy 1 -ERC GVCF
- Consolidate GVCFs(GATK GenomicsDBImport)
gatk GenomicsDBImport --genomicsdb-workspace-path ./workspace -L MT -V sample1.g.vcf -V sample2.g.vcf ...(many samples)
- Joint-Call Cohort(GATK GenotypeGVCFs)
gatk GenotypeGVCFs -O 18R111.vcf -R hs_ref_GRCh38.p12_chrMT.fa -ploidy 1 -V gendb://workspace
Here is something I want to explain first
- No mark/remove duplicates because this is amplicon-based(multi-PCR) sequencing data, most of them are duplicates normally(fastp reported about 96%).
- I set
-ploidy 1
in step 7 and 9 because mtDNA should treated as haploid. However I can't set step 8 tohaploid
because this only for diploid. Second, I find description below in this GATK article: > The one quantity that changes whether the GATK considers the possibility of a heterozygous genotype at all is the ploidy, which describes how many copies of each chromosome each individual in the species carries.
It seems this ploidy setting only works for nuclear genome, means the copies of chromosome. But this not quite right for mtDNA for copy number not sure(At least not 1 copy). Question1: What will this affect my result?
When I check my finally vcf result I find 2 problems, result example:
MT 214 . A G 2386.46 ... GT:AD:DP:GQ:PL 1:0,112:112:99:5445,0 0:136,0:136:99:0,1080...
There problem is AD and DP too low value(Important to me, I need to calculate each variant's frequency.). The real depth value is about 5000, I find many has the same issue(check here). This seems because HaplotypeCaller
not suite for amplicon data. Question2: Have anyone call such mtDNA variant using GATK? How?
Question3: Will Mutect2
work here?
I think about this because mtDNA heterogeneity are like tumor, while tumor has many other features that mtDNA don't like structure variants. Any one think this before? What should be normal tissue data
(-I:normal param in Mutect2) if I apply Mutect2
to mtDNA? Reference sequence?
Thanks everyone!
Note that you shorten steps 2-4, and avoid intermediate files by piping the result of the alignment directly to
samtools sort
, creating a sorted bam file:Thanks, I used
samtools sort
before until I foundgatk SortSam
. I though this two tools do exactly the same thing. You mean my understanding here is wrong?They probably do exactly the same thing, but I'd naively expect samtools to be quicker.
What you are doing is correct, there is just no need to split the alignment, sam to bam and sorting bam across three commands. It's messy, takes longer and makes intermediate files which you'll have to delete. A sam file takes a ton of disk space.
Good advice, indeed I need to delete those files by hand. Thanks.