Hello,
I am trying to provide data to a researcher who has agreed to look at WGS data for a trio. He intends to use seqr for analysis (https://seqr.broadinstitute.org/). For each of the three samples I have a:
- VCF file
- CRAM file
- FASTQ file
Given the significant difference in file size, I provided just the three VCF files (one per sample in the trio). However, he stated:
- Can you please reformat your data as a single joint-called VCF file? I will then process it for analysis. Single VCFs are not appropriate for family comparisons.
Thinking I could use bcftools merge, I provided him the output of:
- bcftools merge sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz > combined_family.vcf
However, he states:
- No. You have to go back and joint call all 3 datasets at the same time using a different tool. That permits the proper phasing of haplotypes for more definitive genotyping. Your yield will go up a lot. GATK has workflows for this.
I am confused by what he is asking for and how I might go about achieving it with the three types of files I have; I do not have a genomics or bioinformatics background.
Could you provide some insight into how I should go about producing what he is asking for? I presume I need to use the three CRAM files to produce a single "joint-called VCF"; is that correct? If so, can I do that with GATK alone? The articles I have read all use BAM files, and my understanding is that to get BAM files I would first have to convert each CRAM to BAM with samtools (the testing company did not provide any BAM files). I am having great difficulty installing samtools, so I would like to produce this joint-called VCF directly from the three CRAM files using GATK alone. Is that possible?
Thank you
Here is the logic of joint calling for germline short variants (SNPs and indels): https://gatk.broadinstitute.org/hc/en-us/articles/360035890431-The-logic-of-joint-calling-for-germline-short-variants
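As a rough sketch of the workflow that article describes, assuming GATK4 is installed and on your PATH: GATK4 can read CRAM files directly as long as you pass, with -R, the same reference FASTA the CRAMs were aligned to, so converting to BAM with samtools should not be needed. The file names below (reference.fasta, sample1.cram, family.joint.vcf.gz, and so on) are placeholders, not your actual files, and the reference is assumed to have its .fai and .dict files next to it and each CRAM its index. The script only prints the commands unless you set DRY_RUN=0.

```shell
#!/bin/sh
set -eu

# Placeholder names (assumptions); replace with your own files.
# REF must be the reference genome the CRAMs were aligned to.
REF="${REF:-reference.fasta}"
SAMPLES="${SAMPLES:-sample1 sample2 sample3}"

# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 to actually run GATK.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

joint_call() {
  gvcf_args=""
  # Step 1: call each sample on its own in GVCF mode, reading the CRAM directly.
  for s in $SAMPLES; do
    run gatk HaplotypeCaller -R "$REF" -I "$s.cram" -O "$s.g.vcf.gz" -ERC GVCF
    gvcf_args="$gvcf_args -V $s.g.vcf.gz"
  done
  # Step 2: combine the three per-sample GVCFs into one multi-sample GVCF.
  # $gvcf_args is left unquoted on purpose so it splits into separate arguments.
  run gatk CombineGVCFs -R "$REF" $gvcf_args -O family.g.vcf.gz
  # Step 3: joint-genotype the combined GVCF to get the single joint-called VCF.
  run gatk GenotypeGVCFs -R "$REF" -V family.g.vcf.gz -O family.joint.vcf.gz
}

joint_call
```

The output of the last step, family.joint.vcf.gz in this sketch, would be the single joint-called VCF the researcher is asking for. This differs from bcftools merge, which only places already-made calls side by side, whereas GenotypeGVCFs genotypes all three samples together.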