Hello, everyone!
I have 4 separate Illumina sequencing runs, one for each of 4 different samples of microbial biofilm from the same location. I assembled contigs from the reads of all 4 samples together (using SPAdes), then mapped the reads from each sample separately. Then I ran MetaBAT's jgi_summarize_bam_contig_depths and got variance-to-mean ratios of coverage depth around 2 (1.9538, 1.8905, 1.9579 and 1.896 for the 4 samples, respectively). This looked quite strange to me, but OK, if we did not sequence deeply enough, it might be so.
Here are the first 4 lines of the depth file (after the header):
contigName contigLen totalAvgDepth al_1.bam al_1.bam-var al_2.bam al_2.bam-var al_3.bam al_3.bam-var al_4.bam al_4.bam-var
NODE_1_length_881468_cov_46.839432 881468 93.7107 19.6243 38.7464 18.0613 33.7377 19.0095 36.5217 37.0157 76.8248
NODE_2_length_803954_cov_9.260631 803954 18.291 3.18857 5.75179 1.54228 2.79595 1.66402 3.13238 11.8961 22.8288
NODE_3_length_757581_cov_10.170569 757581 20.1714 6.60012 12.3694 5.41781 9.69118 6.27614 12.0979 1.87738 3.31236
NODE_4_length_652487_cov_11.289944 652487 22.0799 4.12398 7.45103 3.88173 6.96793 4.42189 8.52849 9.65228 18.577
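For reference, the per-contig ratio can be recomputed from the rows above with a few lines of Python. This is only a minimal sketch: it assumes each sample contributes a mean-depth column followed by a matching "-var" column, as in the header shown, and it does not reproduce the overall per-sample figures quoted above, since those are aggregated over all contigs.

```python
# First rows of the depth file, copied from the post. Real depth files are
# tab-separated; str.split() handles both tabs and spaces.
DEPTH_TEXT = """\
contigName contigLen totalAvgDepth al_1.bam al_1.bam-var al_2.bam al_2.bam-var al_3.bam al_3.bam-var al_4.bam al_4.bam-var
NODE_1_length_881468_cov_46.839432 881468 93.7107 19.6243 38.7464 18.0613 33.7377 19.0095 36.5217 37.0157 76.8248
NODE_2_length_803954_cov_9.260631 803954 18.291 3.18857 5.75179 1.54228 2.79595 1.66402 3.13238 11.8961 22.8288
NODE_3_length_757581_cov_10.170569 757581 20.1714 6.60012 12.3694 5.41781 9.69118 6.27614 12.0979 1.87738 3.31236
NODE_4_length_652487_cov_11.289944 652487 22.0799 4.12398 7.45103 3.88173 6.96793 4.42189 8.52849 9.65228 18.577
"""


def variance_to_mean(depth_text):
    """Return {contig: {sample: variance / mean}} for every sample column.

    Assumes every mean depth is greater than zero.
    """
    lines = depth_text.strip().splitlines()
    header = lines[0].split()
    # Sample columns come in pairs: "<bam>" (mean depth) and "<bam>-var".
    samples = [name for name in header[3:] if not name.endswith("-var")]
    ratios = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        ratios[row["contigName"]] = {
            s: float(row[s + "-var"]) / float(row[s]) for s in samples
        }
    return ratios


ratios = variance_to_mean(DEPTH_TEXT)
for contig, per_sample in ratios.items():
    print(contig, {s: round(r, 3) for s, r in per_sample.items()})
```

For these four contigs the ratios all fall between roughly 1.7 and 2.1.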
I thought that if I merged the reads from all samples and mapped them again, I would get the same mean coverage but a much lower variance-to-mean ratio, since mapped reads originate from random regions. I did this and was even more puzzled, because the variance-to-mean ratio did not improve; it became even bigger (2.2065):
contigName contigLen totalAvgDepth al_merged.bam al_merged.bam-var
NODE_1_length_881468_cov_46.839432 881468 93.7173 93.7173 217.989
NODE_2_length_803954_cov_9.260631 803954 18.2896 18.2896 34.3618
NODE_3_length_757581_cov_10.170569 757581 20.173 20.173 38.5595
NODE_4_length_652487_cov_11.289944 652487 22.0798 22.0798 43.0854
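A quick arithmetic check on the two tables: means always add when reads are pooled, and variances add too if the per-sample depth profiles are uncorrelated along the contig. In that case the merged ratio would be (sum of variances) / (sum of means), which is a weighted average of the per-sample ratios rather than something lower. Any difference between the merged variance and the sum of per-sample variances equals twice the summed covariance between samples. The sketch below simply tabulates this for the four contigs shown. It assumes the tool computes the variance over the same positions in both runs and that reads map to the same places in the merged and separate mappings; the short contig labels are abbreviations used only here.

```python
# Per-sample (mean depths, variances) for al_1..al_4, copied from the first table.
PER_SAMPLE = {
    "NODE_1": ([19.6243, 18.0613, 19.0095, 37.0157], [38.7464, 33.7377, 36.5217, 76.8248]),
    "NODE_2": ([3.18857, 1.54228, 1.66402, 11.8961], [5.75179, 2.79595, 3.13238, 22.8288]),
    "NODE_3": ([6.60012, 5.41781, 6.27614, 1.87738], [12.3694, 9.69118, 12.0979, 3.31236]),
    "NODE_4": ([4.12398, 3.88173, 4.42189, 9.65228], [7.45103, 6.96793, 8.52849, 18.577]),
}
# (mean depth, variance) from the merged mapping (al_merged.bam).
MERGED = {
    "NODE_1": (93.7173, 217.989),
    "NODE_2": (18.2896, 34.3618),
    "NODE_3": (20.173, 38.5595),
    "NODE_4": (22.0798, 43.0854),
}


def compare_with_merged(per_sample, merged):
    """Compare pooled per-sample statistics with the merged mapping."""
    rows = {}
    for contig, (means, variances) in per_sample.items():
        mean_sum = sum(means)
        var_sum = sum(variances)
        merged_mean, merged_var = merged[contig]
        rows[contig] = {
            "mean_sum": mean_sum,
            "var_sum": var_sum,
            # Ratio expected if the samples' depth profiles were uncorrelated.
            "vmr_if_uncorrelated": var_sum / mean_sum,
            "vmr_merged": merged_var / merged_mean,
            # Var(sum) - sum(Var) = 2 * (sum of pairwise covariances).
            "excess_variance": merged_var - var_sum,
        }
    return rows


rows = compare_with_merged(PER_SAMPLE, MERGED)
for contig, r in rows.items():
    print(contig, {k: round(v, 4) for k, v in r.items()})
```

On these rows the excess is clearly positive for NODE_1, small for NODE_3 and NODE_4, and slightly negative for NODE_2.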
I thought this might be caused by non-specific mapping of reads that originate from a different contig to conserved regions, so I set the percentIdentity parameter to 100, so that no read with even a single mismatch is counted in the alignment. The coverage decreased by about one third (that's OK), but again the variance-to-mean ratio got even bigger (2.3438)!
contigName contigLen totalAvgDepth al_merged_100.bam al_merged_100.bam-var
NODE_1_length_881468_cov_46.839432 881468 61.5177 61.5177 180.272
NODE_2_length_803954_cov_9.260631 803954 13.4113 13.4113 24.8087
NODE_3_length_757581_cov_10.170569 757581 14.8889 14.8889 28.014
NODE_4_length_652487_cov_11.289944 652487 15.8534 15.8534 29.7099
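To see the effect contig by contig, the two merged tables can be put side by side. The snippet below is just a sketch over the four rows shown (the short contig labels are abbreviations used only here); it reports the fraction of depth retained at 100% identity and the variance-to-mean ratio before and after.

```python
# (mean depth, variance) per contig, copied from the two merged tables.
MERGED = {
    "NODE_1": (93.7173, 217.989),
    "NODE_2": (18.2896, 34.3618),
    "NODE_3": (20.173, 38.5595),
    "NODE_4": (22.0798, 43.0854),
}
MERGED_100 = {
    "NODE_1": (61.5177, 180.272),
    "NODE_2": (13.4113, 24.8087),
    "NODE_3": (14.8889, 28.014),
    "NODE_4": (15.8534, 29.7099),
}


def identity_filter_effect(before, after):
    """Return retained depth fraction and variance-to-mean ratios per contig."""
    effect = {}
    for contig, (mean_before, var_before) in before.items():
        mean_after, var_after = after[contig]
        effect[contig] = {
            "retained": mean_after / mean_before,
            "vmr_before": var_before / mean_before,
            "vmr_after": var_after / mean_after,
        }
    return effect


effect = identity_filter_effect(MERGED, MERGED_100)
for contig, e in effect.items():
    print(contig, {k: round(v, 3) for k, v in e.items()})
```

For these four contigs the ratio goes up for NODE_1 and slightly down for the other three, so the per-contig picture does not have to match the overall figure.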
And now time for questions:
1) What are the typical values of the variance-to-mean ratio you get when dealing with metagenomic samples? Is it normal that mine are ~2?
2) How would you explain the fact that I got a BIGGER variance-to-mean ratio when I merged the reads from all samples?
3) And what about the next step, when I raised the mapping identity threshold to 100? The conserved regions must have recruited far fewer reads while the variable regions recruited the same number as before, so in theory the variance should have dropped. Nevertheless, it became bigger.
Wish you all the best, Kirill.