Genome Res. 2009 Jul;19(7):1243-53.
doi: 10.1101/gr.092957.109. Epub 2009 May 15.

DNA Sudoku--harnessing high-throughput sequencing for multiplexed specimen analysis

Yaniv Erlich et al. Genome Res. 2009 Jul.

Abstract

Next-generation sequencers have sufficient power to analyze simultaneously DNAs from many different specimens, a practice known as multiplexing. Such schemes rely on the ability to associate each sequence read with the specimen from which it was derived. The current practice of appending molecular barcodes prior to pooling is practical for parallel analysis of up to many dozen samples. Here, we report a strategy that permits simultaneous analysis of tens of thousands of specimens. Our approach relies on the use of combinatorial pooling strategies in which pools rather than individual specimens are assigned barcodes. Thus, the identity of each specimen is encoded within the pooling pattern rather than by its association with a particular sequence tag. Decoding the pattern allows the sequence of an original specimen to be inferred with high confidence. We verified the ability of our encoding and decoding strategies to accurately report the sequence of individual samples within a large number of mixed specimens in two ways. First, we simulated data both from a clone library and from a human population in which a sequence variant associated with cystic fibrosis was present. Second, we actually pooled, sequenced, and decoded identities within two sets of 40,000 bacterial clones comprising approximately 20,000 different artificial microRNAs targeting Arabidopsis or human genes. We achieved greater than 97% accuracy in these trials. The strategies reported here can be applied to a wide variety of biological problems, including the determination of genotypic variation within large populations of individuals.

Figures

Figure 1.
The problem of nonunique genotypes in simple row/column strategies. (A,B) Pooling and arraying of 20 specimens. Letters indicate pools of rows, and Roman numerals indicate pools of columns. (A) The (red) mutant state is unique and appears only in specimen #6 (left). After sequencing the pools (right), this genotype is found in pools II and B, a pattern that can only be associated with specimen #6, demonstrating successful decoding. (B) The mutant state appears in #6 and #20. After decoding, it is ambiguous whether the mutant state is associated with specimens #6 and #20, or #8 and #18, or #6, #8, #18, and #20.
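The ambiguity in panel B can be reproduced with a short sketch. The layout is an assumption inferred from the caption (5 rows A–E by 4 columns I–IV, specimens numbered row by row), and the function names are hypothetical, not taken from the paper.

```python
from itertools import combinations

ROWS = "ABCDE"                     # assumed: 5 row pools
COLS = ["I", "II", "III", "IV"]    # assumed: 4 column pools

def pools_of(specimen):
    """Return the (row, column) pool labels of a 1-based specimen number."""
    r, c = divmod(specimen - 1, len(COLS))
    return ROWS[r], COLS[c]

def observed_pattern(mutants):
    """Row pools and column pools in which the mutant genotype is seen."""
    rows = {pools_of(m)[0] for m in mutants}
    cols = {pools_of(m)[1] for m in mutants}
    return rows, cols

def consistent_sets(pattern, n_specimens=20):
    """All non-empty specimen sets that reproduce the pattern exactly."""
    rows, cols = pattern
    candidates = [s for s in range(1, n_specimens + 1)
                  if pools_of(s)[0] in rows and pools_of(s)[1] in cols]
    found = []
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if observed_pattern(subset) == pattern:
                found.append(set(subset))
    return found

# Panel A: a single mutant decodes uniquely.
print(consistent_sets(observed_pattern({6})))        # [{6}]
# Panel B: two mutants leave several explanations of the same pattern.
print(consistent_sets(observed_pattern({6, 20})))
```

Under this assumed layout, the second call returns {6, 20}, {8, 18}, and {6, 8, 18, 20}, as in the caption, together with the four three-specimen subsets of those candidates, which also produce the same pool pattern.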
Figure 2.
An example of pooling design. (A) The steps involved in pooling 20 specimens with two pooling patterns: (left) n ≡ r₁ (mod 5); (right) n ≡ r₂ (mod 8). Note that each pattern corresponds to a pooling group and to a set of destination wells. For simplicity, the destination wells are labeled 1–5 and 1–8 instead of 0–4 and 0–7. In this example, we choose specimen #6 to be mutant (red; green = wild type) among the 20 specimens. (B) The corresponding pooling matrix. The matrix is 13 × 20 and partitioned into two regions (broken line) that correspond to the two pooling patterns. The staircase pattern in each region (highlighted in gray) is typical of the matrices our pooling scheme creates. The weight of the matrix is 2, the maximal intersection between any two column vectors is 1, and the maximal compression rate is 4.
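As a rough illustration of panel B, the sketch below builds a 13 × 20 binary pooling matrix from the two moduli and checks the weight and the maximal column intersection stated in the caption. The function names are hypothetical, and the row order (residues 0 to m − 1 within each block) is an assumption that may differ from the figure's well labels.

```python
from itertools import combinations

def pooling_matrix(n_specimens, moduli):
    """Rows are pools (one block per modulus); columns are specimens 1..n."""
    matrix = []
    for m in moduli:
        for r in range(m):
            # Specimen n is placed in the pool whose residue matches n mod m.
            matrix.append([1 if n % m == r else 0
                           for n in range(1, n_specimens + 1)])
    return matrix

def column(matrix, j):
    """Extract column j (0-based) as a list."""
    return [row[j] for row in matrix]

def column_weights(matrix):
    """Number of pools each specimen is placed in."""
    return [sum(column(matrix, j)) for j in range(len(matrix[0]))]

def max_intersection(matrix):
    """Largest number of pools shared by any two distinct specimens."""
    cols = [column(matrix, j) for j in range(len(matrix[0]))]
    return max(sum(a & b for a, b in zip(c1, c2))
               for c1, c2 in combinations(cols, 2))

M = pooling_matrix(20, (5, 8))
print(len(M), len(M[0]))          # 13 20
print(set(column_weights(M)))     # {2}
print(max_intersection(M))        # 1
```

Two specimens share both pools only if they agree modulo 5 and modulo 8, that is, modulo 40, which cannot happen for distinct specimens numbered 1 to 20; this is why the maximal intersection is 1.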
Figure 3.
Pattern consistency decoder. (A) Successful decoding. (Red) Specimen #6 is the only mutant in the library (top panel). After pooling and sequencing, the only data available to the decoder is the following pattern: the mutant is in the first pool of the first pooling window and in the sixth pool of the second pooling window (middle panel; see corresponding red rows in the matrix). Summing along the columns of the pattern (bottom panel) creates a histogram that represents the number of windows in which a specimen was found. Notice that the scores of the histogram range from 0 to the weight of the matrix. The pattern consistency decoder assigns the mutant state only to specimens that appeared in all windows. Since only specimen #6 has a score of 2 in the histogram, meaning it appeared in all possible windows, it is associated with the mutant state, and the decoder reports the correct result. (B) Failure in the decoding. This pooling matrix has a decoding robustness of d = 1, whereas the example contains d0 = 2 mutants. Thus, in the example, correct decoding is not guaranteed. Specimens #6 and #8 are mutants (top), and the pattern that the decoder encounters is indicated in the matrix (middle). Associating specimen #16 with the mutant state is also consistent with this pattern, so it receives a score of 2 in the histogram and is wrongly reported as a mutant.
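The two panels can be traced with a minimal sketch of a pattern consistency decoder for the 20-specimen, (mod 5, mod 8) example. The names and the well-labeling formula are assumptions chosen to match the 1-based well labels in the captions; this is an illustration, not the authors' implementation.

```python
MODULI = (5, 8)      # the two pooling windows of the example
N_SPECIMENS = 20

def well(n, m):
    """1-based destination well of specimen n in the window with modulus m."""
    return (n - 1) % m + 1

def pool_pattern(mutants):
    """Per window, the set of wells in which the mutant state is observed."""
    return [{well(n, m) for n in mutants} for m in MODULI]

def decode(pattern):
    """Score each specimen by the number of windows in which its well is
    mutant; call it mutant only if it appears in every window."""
    scores = {
        n: sum(well(n, m) in wells for m, wells in zip(MODULI, pattern))
        for n in range(1, N_SPECIMENS + 1)
    }
    called = sorted(n for n, s in scores.items() if s == len(MODULI))
    return called, scores

# Panel A: one mutant, pattern is well 1 (window 1) and well 6 (window 2).
print(pool_pattern({6}))               # [{1}, {6}]
print(decode(pool_pattern({6}))[0])    # [6]

# Panel B: two mutants exceed the guaranteed robustness, so #16 is
# falsely called in addition to the true mutants.
print(decode(pool_pattern({6, 8}))[0])  # [6, 8, 16]
```

Specimen #16 shares well 1 of the first window with #6 and well 8 of the second window with #8, so the decoder cannot distinguish it from a true mutant.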
Figure 4.
Simulations of decoder performance in different CRT pooling designs with 40,320 specimens. (A,B) The effect of increasing the weight with 384 barcodes. (C,D) The effect of increasing the number of barcodes with five pooling windows. (A,C) The probability of correct decoding as a function of d0 with different pooling designs. (Gray line) 99%. (B,D) The expected number of ambiguous specimens due to decoding errors as a function of d0 and different pooling designs.
Figure 5.
Performance of the minimal discrepancy decoder. (A, red line) The distribution of quality scores for the Arabidopsis library is bimodal, with a transition around a quality score of 5. We speculate that this corresponds to two decoding regimes: noisy on the left and error-free on the right. (B) The correlation between filtering above different quality scores and the correct decoding rate. The size of the bubbles corresponds to the number of specimens passing the threshold. The increase in the correct decoding rate with the quality threshold validates the predictive value of the quality scores. (Red bubble) The strong drop in the correct decoding rate below a quality score of 5 confirms our speculation about the two decoding regimes. (C) The quality score distribution of the human library. The noisy area, left of a quality score of 5, contains fewer specimens than observed in the Arabidopsis library. The shift of the distribution to the right suggests better decoding performance, presumably owing to higher sequencing depth. After retaining only specimens with a quality score above 5, the correct decoding rate was estimated to be 98.2% for 36,000 specimens, which indicates the robustness of the quality score. Decoding all specimens regardless of the quality score gave a correct decoding rate of 97.3%.
