Nature. 2008 Nov 6;456(7218):53-9.
doi: 10.1038/nature07517.

Accurate whole human genome sequencing using reversible terminator chemistry

David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith, John Milton, Clive G Brown, Kevin P Hall, Dirk J Evers, Colin L Barnes, Helen R Bignell, Jonathan M Boutell, Jason Bryant, Richard J Carter, R Keira Cheetham, Anthony J Cox, Darren J Ellis, Michael R Flatbush, Niall A Gormley, Sean J Humphray, Leslie J Irving, Mirian S Karbelashvili, Scott M Kirk, Heng Li, Xiaohai Liu, Klaus S Maisinger, Lisa J Murray, Bojan Obradovic, Tobias Ost, Michael L Parkinson, Mark R Pratt, Isabelle M J Rasolonjatovo, Mark T Reed, Roberto Rigatti, Chiara Rodighiero, Mark T Ross, Andrea Sabot, Subramanian V Sankar, Aylwyn Scally, Gary P Schroth, Mark E Smith, Vincent P Smith, Anastassia Spiridou, Peta E Torrance, Svilen S Tzonev, Eric H Vermaas, Klaudia Walter, Xiaolin Wu, Lu Zhang, Mohammed D Alam, Carole Anastasi, Ify C Aniebo, David M D Bailey, Iain R Bancarz, Saibal Banerjee, Selena G Barbour, Primo A Baybayan, Vincent A Benoit, Kevin F Benson, Claire Bevis, Phillip J Black, Asha Boodhun, Joe S Brennan, John A Bridgham, Rob C Brown, Andrew A Brown, Dale H Buermann, Abass A Bundu, James C Burrows, Nigel P Carter, Nestor Castillo, Maria Chiara E Catenazzi, Simon Chang, R Neil Cooley, Natasha R Crake, Olubunmi O Dada, Konstantinos D Diakoumakos, Belen Dominguez-Fernandez, David J Earnshaw, Ugonna C Egbujor, David W Elmore, Sergey S Etchin, Mark R Ewan, Milan Fedurco, Louise J Fraser, Karin V Fuentes Fajardo, W Scott Furey, David George, Kimberley J Gietzen, Colin P Goddard, George S Golda, Philip A Granieri, David E Green, David L Gustafson, Nancy F Hansen, Kevin Harnish, Christian D Haudenschild, Narinder I Heyer, Matthew M Hims, Johnny T Ho, Adrian M Horgan, Katya Hoschler, Steve Hurwitz, Denis V Ivanov, Maria Q Johnson, Terena James, T A Huw Jones, Gyoung-Dong Kang, Tzvetana H Kerelska, Alan D Kersey, Irina Khrebtukova, Alex P Kindwall, Zoya Kingsbury, Paula I Kokko-Gonzales, Anil Kumar, Marc A Laurent, Cynthia T Lawley, Sarah E Lee, Xavier Lee, Arnold K Liao, Jennifer A Loch, Mitch Lok, Shujun Luo, Radhika M Mammen, John W Martin, Patrick G McCauley, Paul McNitt, Parul Mehta, Keith W Moon, Joe W Mullens, Taksina Newington, Zemin Ning, Bee Ling Ng, Sonia M Novo, Michael J O'Neill, Mark A Osborne, Andrew Osnowski, Omead Ostadan, Lambros L Paraschos, Lea Pickering, Andrew C Pike, Alger C Pike, D Chris Pinkard, Daniel P Pliskin, Joe Podhasky, Victor J Quijano, Come Raczy, Vicki H Rae, Stephen R Rawlings, Ana Chiva Rodriguez, Phyllida M Roe, John Rogers, Maria C Rogert Bacigalupo, Nikolai Romanov, Anthony Romieu, Rithy K Roth, Natalie J Rourke, Silke T Ruediger, Eli Rusman, Raquel M Sanches-Kuiper, Martin R Schenker, Josefina M Seoane, Richard J Shaw, Mitch K Shiver, Steven W Short, Ning L Sizto, Johannes P Sluis, Melanie A Smith, Jean Ernest Sohna Sohna, Eric J Spence, Kim Stevens, Neil Sutton, Lukasz Szajkowski, Carolyn L Tregidgo, Gerardo Turcatti, Stephanie Vandevondele, Yuli Verhovsky, Selene M Virk, Suzanne Wakelin, Gregory C Walcott, Jingwen Wang, Graham J Worsley, Juying Yan, Ling Yau, Mike Zuerlein, Jane Rogers, James C Mullikin, Matthew E Hurles, Nick J McCooke, John S West, Frank L Oaks, Peter L Lundberg, David Klenerman, Richard Durbin, Anthony J Smith
David R Bentley et al. Nature. 2008.

Abstract

DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.
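The relationship between read length, read count and average depth in the abstract can be made concrete with a little arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the 3,000,000,000 bp genome size is a round assumed figure, not a value taken from the abstract. It also ignores reads that fail to map or are duplicates, so a real experiment would need more raw reads than this estimate.

```python
def reads_for_depth(genome_size_bp: int, read_length_bp: int, target_depth: int) -> int:
    """Smallest number of reads whose combined length reaches the target depth."""
    total_bases_needed = genome_size_bp * target_depth
    # Ceiling division using integers only.
    return -(-total_bases_needed // read_length_bp)


def average_depth(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp


# ASSUMPTION: round illustrative genome size, not a figure from the abstract.
GENOME_SIZE_BP = 3_000_000_000

# 35-base reads and 30x depth are the values quoted in the abstract.
reads_needed = reads_for_depth(GENOME_SIZE_BP, read_length_bp=35, target_depth=30)
pairs_needed = -(-reads_needed // 2)  # each pair contributes two reads
print(f"reads: {reads_needed:,}  pairs: {pairs_needed:,}")
```

Under these assumptions the estimate is on the order of billions of reads, which is consistent with the abstract's emphasis on generating several billion bases per experiment.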


Figures

Figure 1. Sample preparation
a. DNA fragments are generated, e.g. by random shearing, and joined to a pair of oligonucleotides in a forked adapter configuration. The ligated products are amplified using two oligonucleotide primers, resulting in double-stranded blunt-ended material with a different adapter sequence on either end.
b. Formation of clonal single-molecule array. DNA fragments prepared as in a are denatured and single strands are annealed to complementary oligonucleotides on the flowcell surface (hatched in the figure). A new strand (dotted) is copied from the original strand in an extension reaction that is primed from the 3’ end of the surface-bound oligonucleotide, and the original strand is then removed by denaturation. The adapter sequence at the 3’ end of each copied strand is annealed to a new surface-bound complementary oligonucleotide, forming a bridge and generating a new site for synthesis of a second strand (shown dotted). Multiple cycles of annealing, extension and denaturation in isothermal conditions result in growth of clusters, each ~1 micron in physical diameter. This follows the basic method outlined in an earlier reference.
c. The DNA in each cluster is linearised by cleavage within one adapter sequence (gap marked by an asterisk) and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1; the sequencing product is shown dotted). To perform paired-read sequencing, the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesised (shown dotted), and the opposite strand is then cleaved (gap marked by an asterisk) to provide the template for the second read (read 2).
d. Long-range paired-end sample preparation. To sequence the ends of a long (e.g. >1 kb) DNA fragment, the ends of each fragment are tagged by incorporation of biotinylated (B) nucleotide and then circularised, forming a junction between the two ends. Circularised DNA is randomly fragmented and the biotinylated junction fragments are recovered and used as starting material in the standard sample preparation procedure illustrated in a above. The orientation of the sequence reads relative to the DNA fragment is tracked in the figure by magenta arrows. When aligned to the reference sequence, these reads are oriented with their 5’ ends towards each other (in contrast to the short insert paired reads produced as shown in a–c). See fig S17a for examples of both.
Turquoise and blue lines represent oligonucleotides and red lines represent genomic DNA. Note that all surface-bound oligonucleotides are attached to the flowcell by their 5’ ends. Dotted lines indicate newly synthesized strands during cluster formation or sequencing. See supplementary methods for details.
Figure 2. X chromosome data
a. Distribution of mapped read depth in the X chromosome dataset, sampled at every 50th position along the chromosome and displayed as a histogram (‘all’). An equivalent analysis of mapped read depth for the unique subset of these positions is also shown (‘unique only’). The solid line represents a Poisson distribution with the same mean. b. Distribution of X chromosome uniquely mapped reads as a function of GC content. Note that the x axis is % GC content and is scaled by percentile of unique sequence. The solid line is average mapped depth of unique sequence; the grey region is the central 80% of the data (10th to 90th centiles); the dashed lines are 10th and 90th centiles of a Poisson distribution with the same mean as the data.
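The comparison described in this caption (observed depth against a Poisson distribution with the same mean, plus its 10th and 90th centiles) can be sketched as follows. This is a minimal illustration using only the standard library; the function names and the toy depth values are hypothetical and are not taken from the paper's analysis.

```python
import math


def sample_every(depths, step=50):
    """Keep the depth at every `step`-th position, mirroring the caption's sampling."""
    return depths[::step]


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable, computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def poisson_percentile(q, mean):
    """Smallest depth k such that P(X <= k) >= q, for 0 < q < 1 and mean > 0."""
    if not 0.0 < q < 1.0 or mean <= 0.0:
        raise ValueError("q must be in (0, 1) and mean must be positive")
    cumulative, k = 0.0, 0
    while True:
        p = poisson_pmf(k, mean)
        cumulative += p
        # Second condition guards against an endless loop if rounding error
        # keeps the running total just below q far out in the tail.
        if cumulative >= q or (p == 0.0 and k > mean):
            return k
        k += 1


def poisson_central_80(depths):
    """10th and 90th centiles of a Poisson with the same mean as `depths`."""
    mean = sum(depths) / len(depths)
    return poisson_percentile(0.10, mean), poisson_percentile(0.90, mean)


# Toy depths (invented for illustration); their mean is 4.
toy_depths = [3, 4, 5, 4]
print(poisson_central_80(toy_depths))
```

If the observed 10th to 90th centile band is wider than the Poisson band, depth varies more than random sampling alone would predict, which is the kind of departure the grey region and dashed lines in panel b are meant to make visible.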
Figure 3. SNPs identified in the human genome sequence of NA18507
a. Number of SNPs detected by class and % in dbSNP (release 128). Results from ELAND and MAQ alignments are reported separately. b. Comparison of the SNPs detected in each analysis reveals extensive overlap. The % of NA18507 SNP calls that match previous entries in dbSNP is lower than that of our X chromosome study (see fig S6). We expect this because individual NA07340 (from the X study) was also previously used for discovery and submission of SNPs to dbSNP during the HapMap project, in contrast to NA18507.
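The two quantities reported in this figure, the overlap between two call sets and the percentage of calls already present in a database, reduce to simple set operations. The sketch below is hypothetical: the function name, the (chromosome, position, allele) tuple representation and the toy sites are all invented for illustration and do not reflect the paper's data or software.

```python
def summarize_calls(calls_a, calls_b, known_sites):
    """Compare two SNP call sets with each other and with a set of known sites."""
    set_a, set_b, known = set(calls_a), set(calls_b), set(known_sites)

    def pct_known(calls):
        # Percentage of calls that match a previously known site.
        return 100.0 * len(calls & known) / len(calls) if calls else 0.0

    return {
        "shared": len(set_a & set_b),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        "pct_known_a": pct_known(set_a),
        "pct_known_b": pct_known(set_b),
    }


# Toy sites (invented): each call is (chromosome, position, alternate allele).
calls_from_aligner_a = [("chrX", 100, "A"), ("chrX", 200, "G"), ("chrX", 300, "T")]
calls_from_aligner_b = [("chrX", 200, "G"), ("chrX", 300, "T"), ("chrX", 400, "C")]
known = [("chrX", 200, "G"), ("chrX", 400, "C")]

print(summarize_calls(calls_from_aligner_a, calls_from_aligner_b, known))
```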
Figure 4. Homozygous complex rearrangement detected by anomalous paired reads. The rearrangement involves an inversion of 369 bp (blue-turquoise bar in the schematic) flanked by deletions (red bars) of 1206 and 164 bp, respectively, at the left- and right-hand breakpoints.
a. Summary tracks in the Resembl browser, denoting scale, simulated alignability of reads to reference (blue plot), actual aligned depth of coverage by NA18507 reads (green plot), density of anomalous reads indicating structural variants (red plot; peaks denote ‘hotspots’), density of singleton reads (pink plot). b. Anomalous long insert read pairs (orange lines denote DNA fragment, blocks at either end denote each read); the data indicate loss of ~1.3 kb in NA18507 relative to the reference. c. Anomalous short insert pairs of two types (red and pink) indicate an inverted sequence flanked by two deletions. d. Normal short insert read pair alignments (each green line denotes the extent of the reference that is covered by the short fragment, including the two reads). e. The schematic depicts the arrangement of normal and anomalous read pairs relative to the rearrangement. Top line: structure of NA18507; second line: structure of reference sequence. Green bars denote sequence that is collinear in the reference and NA18507. The turquoise-blue bar illustrates the inverted segment. Red bars indicate the sequences present in the reference but absent in NA18507. Arrows denote orientation of reads when aligned to the reference. Note that the display in a–d is a composite of screen shots of the same window, overlapped for display purposes in this figure.
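The idea of an "anomalous" read pair can be illustrated with a small classifier. This is a sketch under stated assumptions, not the method used in the paper: the class, the function, the category labels and the span thresholds are all hypothetical. It assumes short insert pairs normally align with the left read on the forward strand and the right read on the reverse strand, within an expected span on the reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    left_start: int      # leftmost reference coordinate covered by the pair
    right_end: int       # rightmost reference coordinate covered by the pair
    left_forward: bool   # True if the left read aligns to the forward strand
    right_forward: bool  # True if the right read aligns to the forward strand


def classify_short_insert_pair(pair, min_span, max_span):
    """Label a pair by comparing its orientation and span with expectations."""
    span = pair.right_end - pair.left_start + 1
    if pair.left_forward == pair.right_forward:
        # Both reads on the same strand: one end may lie in inverted sequence.
        return "inversion-like"
    if not pair.left_forward:
        # Reads point away from each other, which short inserts should not do.
        return "unexpected-orientation"
    if span > max_span:
        # Reads map too far apart: the sample may lack sequence that the
        # reference contains between them.
        return "deletion-like"
    if span < min_span:
        # Reads map too close together: the sample may carry extra sequence.
        return "insertion-like"
    return "normal"


# ASSUMPTION: an expected span of 150-250 bp, chosen only for illustration.
pairs = [
    AlignedPair(1000, 1199, True, False),
    AlignedPair(1000, 2569, True, False),
    AlignedPair(1000, 1199, True, True),
]
for p in pairs:
    print(classify_short_insert_pair(p, min_span=150, max_span=250))
```

For a deletion-like pair, the excess of the observed span over the expected span gives a rough estimate of the deleted length, which is the reasoning behind reading a loss of sequence from the long insert pairs in panel b.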
Figure 5. Effect of sequence depth on coverage and accuracy of human genome sequencing. ELAND alignments were used for this analysis
a. Accumulation of sequence-based SNP calls, including all SNPs (squares), heterozygous SNPs (triangles) and homozygous SNPs (circles) with increasing input read depth. b. Decrease in genotype positions not covered by sequence (squares), heterozygote undercalls in sequence data relative to genotype data (triangles) and discordant SNP calls compared to genotypes (circles) with increasing input read depth. Vertical dotted lines indicate various input read depths (10x, 15x, 30x haploid genome).
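One intuition for why heterozygote undercalls fall as depth rises is simple binomial sampling: with few reads over a site, one of the two alleles may be missed by chance. The sketch below assumes each read samples either allele with probability 0.5 and that a caller needs a minimum number of reads per allele. Both the model and the function name are illustrative assumptions, not the calling procedure used in the paper, and the depth here is the number of reads covering one site rather than the genome-wide input depth on the figure's axis.

```python
from math import comb


def prob_both_alleles_seen(depth, min_reads_per_allele=1):
    """Probability that each allele of a heterozygous site is seen at least
    `min_reads_per_allele` times among `depth` reads (fair binomial sampling)."""
    if depth < 0 or min_reads_per_allele < 0:
        raise ValueError("depth and min_reads_per_allele must be non-negative")
    lowest = min_reads_per_allele
    highest = depth - min_reads_per_allele
    if highest < lowest:
        return 0.0
    # Count the read assignments in which allele A is seen k times and
    # allele B is seen depth - k times, with both counts above the minimum.
    favourable = sum(comb(depth, k) for k in range(lowest, highest + 1))
    return favourable / 2 ** depth


for site_depth in (10, 15, 30):
    print(site_depth, prob_both_alleles_seen(site_depth, min_reads_per_allele=2))
```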



