How to produce a single joint-called VCF file using as input three WGS samples (VCF,CRAM,FASTQ)
1
0
Entering edit mode
3 months ago
shwivel • 0

Hello,

I am trying to provide data to a researcher who has agreed to look at a trio of WGS data. He intends to use seqr for analysis (https://seqr.broadinstitute.org/). I have for each of the three samples a:

  • VCF file
  • CRAM file
  • FASTQ file

Given the significant difference in file size, I provided just the three VCF files (each a sample in the trio). However, he stated:

  • Can you please reformat your data as a single joint-called VCF file? I will then process it for analysis. Single VCFs are not appropriate for family comparisons.

Thinking I could use bcftools merge, I provided him the output of:

  • bcftools merge sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz > combined_family.vcf

However, he states:

  • No. You have to go back and joint call all 3 datasets at the same time using a different tool. That permits the proper phasing of haplotypes for more definitive genotyping. Your yield will go up a lot. GATK has workflows for this.

I am confused by what he is asking for and how I might go about achieving it with the three types of files I have; I do not have a genomics or bioinformatics background.

Could you provide some insight as to how I should go about producing what he is asking for? I presume I need to use the three CRAM files to produce a single "joint called VCF," is that correct? If so, can I do that with GATK alone? I have read several articles which strictly use BAM files and it is my understanding to get BAM files I would first have to use samtools to convert each CRAM to BAM (I don't have any BAM files provided by the testing company). I am having great difficulty getting samtools to install so I would like to accomplish this task with GATK alone (with the CRAM files, not BAM) if that is possible (to produce this "joint called VCF" from the three CRAM files). Is that possible?

Thank you

VCF WGS • 1.5k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
2
Entering edit mode
3 months ago
gatk HaplotypeCaller -R reference.fasta -I bam1.cram -I bam2.cram-I bam3.cram -O output.vcf.gz
ADD COMMENT
0
Entering edit mode

Thank you! This appears to be what I'm looking for but I receive the following error, which I've shortened as it is enough to get the idea:

A USER ERROR has occurred: Input files reference and reads have incompatible contigs: Dictionary reference is missing contigs found in dictionary reads.  Missing contigs:
 chrUn_KN707606v1_decoy, chrUn_KN707607v1_decoy, chrUn_KN707608v1_decoy, chrUn_KN707609v1_decoy, chrUn_KN707610v1_decoy, chrUn_KN707611v1_decoy, chrUn_KN707612v1_decoy, chrUn_KN707613v1_decoy, chrUn_KN707614v1_decoy, chrUn_KN707615v1_decoy, chrUn_KN707616v1_decoy, chrUn_KN707617v1_decoy, chrUn_KN707618v1_decoy, chrUn_KN707619v1_decoy, chrUn_KN707620v1_decoy, chrUn_KN707621v1_decoy,

The testing company provided, in both the VCF files -- and apparently the CRAM files -- certain "odd chromosome values" which I have no idea the significance of. I expect chr1,chr2,chr3,chr4,chrX,chrY, etc. but the above values also exist in my data and, because the reference fasta apparently does not, this error is returned. My sample data is all GRCh38.

Am I correct in presuming this is benign as the data behind these "missing contigs" may not be relevant anyway? If so, is there any way I can force the command to proceed, and ignore that error?

To be clear as to what reference file I am using, I am using GCA_000001405.15_GRCh38_no_alt_analysis_set.fa from https://hgdownload.soe.ucsc.edu/treehouse/reference/

With both the index file (.fai) and dictionary file (.dict) from there downloaded into the same directory.

I appreciate any insight you can provide. Thanks again

ADD REPLY
0
Entering edit mode

you're not using the fasta that was used to map the bam/cram. It must be the very same file. Ask the fasta from the people who mapped the crams.

ADD REPLY
0
Entering edit mode

In the VCF files there is a line in each file containing the word "reference" and the following filename appears: GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna

I find a file of that exact name here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/

So I downloaded it and extracted it, and also downloaded the .fai index file into the same directory, but I get this error:

A USER ERROR has occurred: Fasta dict file file:///mnt/synology1/reference/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.dict for reference file:///mnt/synology1/reference/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

Unfortunately the FTP NCBI link I pasted above does not have a .dict dictionary file for this reference file, and the help link in the error goes nowhere / is outdated. Is there somewhere I can download a dictionary file for this reference file, or can I create one?

Thanks

ADD REPLY
0
Entering edit mode

I created a dictionary file as follows:

gatk CreateSequenceDictionary R=xxxxx O=xxxxx

And re-ran the command but now receive this error:

15:47:23.552 INFO  HaplotypeCaller - Shutting down engine
[July 13, 2024 at 3:47:23 PM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.28 minutes.
Runtime.totalMemory()=649068544
htsjdk.samtools.SAMException: No sequence dictionary mapping available for header: SAMFileHeader{VN=1.4, SO=coordinate}
        at htsjdk.samtools.SamFileHeaderMerger.getMergedSequenceIndex(SamFileHeaderMerger.java:741)
        at htsjdk.samtools.MergingSamRecordIterator$MergedSequenceDictionaryCoordinateOrderComparator.getReferenceIndex(MergingSamRecordIterator.java:234)
        at htsjdk.samtools.MergingSamRecordIterator$MergedSequenceDictionaryCoordinateOrderComparator.fileOrderCompare(MergingSamRecordIterator.java:214)
        at htsjdk.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:48)
        at htsjdk.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:43)
        at htsjdk.samtools.ComparableSamRecordIterator.compareTo(ComparableSamRecordIterator.java:75)
        at htsjdk.samtools.ComparableSamRecordIterator.compareTo(ComparableSamRecordIterator.java:36)
        at java.base/java.util.PriorityQueue.siftUpComparable(PriorityQueue.java:662)
        at java.base/java.util.PriorityQueue.siftUp(PriorityQueue.java:654)
        at java.base/java.util.PriorityQueue.offer(PriorityQueue.java:345)
        at htsjdk.samtools.MergingSamRecordIterator.addIfNotEmpty(MergingSamRecordIterator.java:161)
        at htsjdk.samtools.MergingSamRecordIterator.<init>(MergingSamRecordIterator.java:94)
        at org.broadinstitute.hellbender.engine.ReadsPathDataSource.prepareIteratorsForTraversal(ReadsPathDataSource.java:429)
        at org.broadinstitute.hellbender.engine.ReadsPathDataSource.iterator(ReadsPathDataSource.java:336)
        at org.broadinstitute.hellbender.engine.MultiIntervalLocalReadShard.iterator(MultiIntervalLocalReadShard.java:134)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.<init>(AssemblyRegionIterator.java:86)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:188)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:173)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1085)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

Do you know what the solution might be at this point?

Sorry for the back-and-forth. Thanks for your time.

ADD REPLY
0
Entering edit mode

The sequence dictionary must be named xxx.dict if the fasta is xxx.fa

ADD REPLY
0
Entering edit mode

I have all of these files in the same directory:

GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna
GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.fai
GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.dict

But get that error. I think the dictionary file is named correctly because if I rename it to something else I get the "Fasta dict file file does not exist" error. Because I get this other error instead I presume it is accessing it, but for some reason does not like it.

ADD REPLY
0
Entering edit mode

Ha, I think the problem comes from a dict INSIDE one of your bam/cram.

check , with samtools view -H that ALL the input CRAM have a dictionary (header with @SQ) and they all have the same dictionary.

ADD REPLY
0
Entering edit mode

Having some trouble with samtools but I used gatk PrintReadsHeader to extract the header of each CRAM.

Indeed there are some differences (and similarities) between them. These samples were processed some time apart by Nebula Genomics. Although they are all GRCh38, there may be some slight differences in how the data was processed.

Will that preclude me from using gatk HaplotypeCaller with the three CRAMs? (because of their apparent differences in processing)

If so, my thought is to create unaligned BAM files from the paired fastq data, like:

gatk FastqToSam F1=sampleA_1.fq.gz F2=sampleA_2.fq.gz O=sampleA.bam SAMPLE_NAME=sampleA
gatk FastqToSam F1=sampleB_1.fq.gz F2=sampleB_2.fq.gz O=sampleB.bam SAMPLE_NAME=sampleB
gatk FastqToSam F1=sampleC_1.fq.gz F2=sampleC_2.fq.gz O=sampleC.bam SAMPLE_NAME=sampleC

And to then use gatk HaplotypeCaller with the resulting three BAM files. The FASTQ data is large so it might take a minute, but would that likely get around the issue with the CRAM headers?

I have a text file of the header of each CRAM file I can provide if helpful.

ADD REPLY
0
Entering edit mode

Will that preclude me from using gatk HaplotypeCaller with the three CRAMs?

yes

ADD REPLY
0
Entering edit mode

if so, my thought is to create unaligned BAM files from the paired fastq data

no, reads must be mapped (with eg bwa) on the reference genome.

ADD REPLY
0
Entering edit mode

you can also try to use bcftools mpileup + bcftools call on the only common regions of your bams to get a join VCF.

ADD REPLY
0
Entering edit mode

I have linked below a folder with the three CRAM header files in text .txt for the child, mother, and father samples. (not any data just the headers, with sample numbers removed)

https://drive.google.com/drive/folders/1NX8TKfkowjBlWXKyaOgPyYE8PXUFEWYZ?usp=sharing

Of note, the "child file" has none of the "decoy" regions of the others. That sample was processed by Nebula much earlier than the other two so perhaps there was a change in methodology between the times the samples were processed.

The regions that exist in all three files, with matching values of SQ/SN/LN/M5 are those listed below.

Can you explain how I would specify this list, using bcftools mpileup + bcftools call, to produce a joint called VCF using only these common regions from each CRAM? Thank you for your help.

SN:chr1
SN:chr1_KI270706v1_random
SN:chr1_KI270707v1_random
SN:chr1_KI270708v1_random
SN:chr1_KI270709v1_random
SN:chr1_KI270710v1_random
SN:chr1_KI270711v1_random
SN:chr1_KI270712v1_random
SN:chr1_KI270713v1_random
SN:chr1_KI270714v1_random
SN:chr10
SN:chr11
SN:chr11_KI270721v1_random
SN:chr12
SN:chr13
SN:chr14
SN:chr14_GL000009v2_random
SN:chr14_GL000194v1_random
SN:chr14_GL000225v1_random
SN:chr14_KI270722v1_random
SN:chr14_KI270723v1_random
SN:chr14_KI270724v1_random
SN:chr14_KI270725v1_random
SN:chr14_KI270726v1_random
SN:chr15
SN:chr15_KI270727v1_random
SN:chr16
SN:chr16_KI270728v1_random
SN:chr17
SN:chr17_GL000205v2_random
SN:chr17_KI270729v1_random
SN:chr17_KI270730v1_random
SN:chr18
SN:chr19
SN:chr2
SN:chr2_KI270715v1_random
SN:chr2_KI270716v1_random
SN:chr20
SN:chr21
SN:chr22
SN:chr22_KI270731v1_random
SN:chr22_KI270732v1_random
SN:chr22_KI270733v1_random
SN:chr22_KI270734v1_random
SN:chr22_KI270735v1_random
SN:chr22_KI270736v1_random
SN:chr22_KI270737v1_random
SN:chr22_KI270738v1_random
SN:chr22_KI270739v1_random
SN:chr3
SN:chr3_GL000221v1_random
SN:chr4
SN:chr4_GL000008v2_random
SN:chr5
SN:chr5_GL000208v1_random
SN:chr6
SN:chr7
SN:chr8
SN:chr9
SN:chr9_KI270717v1_random
SN:chr9_KI270718v1_random
SN:chr9_KI270719v1_random
SN:chr9_KI270720v1_random
SN:chrEBV
SN:chrM
SN:chrUn_GL000195v1
SN:chrUn_GL000213v1
SN:chrUn_GL000214v1
SN:chrUn_GL000216v2
SN:chrUn_GL000218v1
SN:chrUn_GL000219v1
SN:chrUn_GL000220v1
SN:chrUn_GL000224v1
SN:chrUn_GL000226v1
SN:chrX
SN:chrY
SN:chrY_KI270740v1_random
ADD REPLY
0
Entering edit mode

I am currently running the following to test out joint calling on chromosome 1 only:

bcftools mpileup -Ou -I -f /mnt/synology1/nebula-backups/reference/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna  -r chr1 -b /mnt/synology1/nebula-backups/cram/cram_list.txt -d 8000 | bcftools call -vmO z -o /home/shwivel/outputted_variants.vcf.gz

File cram_list.txt has 3 rows with the location of each of my 3 CRAM files. In the same directory as the CRAMs is a corresponding index for each CRAM.

My hope is that the outputted outputted_variants.vcf.gz will be a joint called VCF, based on the three CRAMs, limited to chromosome 1 only and that, if indeed that is true, simply changing:

- -r chr1

to:

-r chr1,chr2,chr3,chr4  #.... etc. (expanded to include every region in common among all 3 CRAM headers) .....

Will end up being the final desired file for use.

But there are a lot of switches for these utilities so I just want to make sure this is what I should be doing and that I needn't configure something more. Am I on the right path?

ADD REPLY
0
Entering edit mode

yeah, it looks ok to me.

ADD REPLY
0
Entering edit mode

I cannot thank you enough sir...

ADD REPLY
0
Entering edit mode

Please use code formatting, not bullet-point formatting for code. Using bullet point formatting for code makes no sense whatsoever. I've fixed your posts as much as I can.

ADD REPLY
0
Entering edit mode

Will do. First time here. I looked for the code icon but did not see one. Most WYIWYG editors use {} or <> for the code icon (and not some microscopic depiction of 1s and 0s).

ADD REPLY
0
Entering edit mode

That's really good feedback. Maybe open an issue here: https://github.com/ialbert/biostar-central/issues ?

ADD REPLY

Login before adding your answer.

Traffic: 1506 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6