Science. 2022 Apr;376(6588):eabj6965.
doi: 10.1126/science.abj6965. Epub 2022 Apr 1.

Segmental duplications and their variation in a complete human genome

Mitchell R Vollger et al. Science. 2022 Apr.

Abstract

Despite their importance in disease and evolution, highly identical segmental duplications (SDs) are among the last regions of the human reference genome (GRCh38) to be fully sequenced. Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis reveals patterns of structural heterozygosity and evolutionary differences in SD organization between humans and other primates.

Figures

Fig. 0. More complete segmental duplication content improves genotyping.
a) ~10-fold increase in the number of large (>10 kbp) acrocentric segmental duplications (red) in T2T-CHM13 (right) compared to GRCh38 (left). b) Read-depth genotyping of short-read Illumina whole-genome sequence from a human diversity panel (n=268) better matches T2T-CHM13 (red) when compared to GRCh38 (blue) irrespective of human population group considered.
Fig. 1. Segmental duplication (SD) content of the T2T-CHM13 genome.
A) The pattern of previously unresolved or structurally variant intrachromosomal duplications in T2T-CHM13 (red) compared to known duplications in GRCh38 (blue-gray). These predict hotspots of genomic instability (gold) flanked by large (>10 kbp), high-identity (>95%) interspersed (>50 kbp) SDs. B) Circos plot highlighting previously unresolved interchromosomal SDs (red) shows the preponderance of previously unresolved SDs mapping to pericentromeric and acrocentric regions. C) A histogram comparing SD content in different human reference genomes. The sum of bases in pairwise SD alignments stratified by their percent identity for the Celera (yellow, Sanger-based), GRCh38 (blue-gray, BAC-based), and T2T-CHM13 (red, long-read) assemblies. D) The 30 genic duplicons (ancestral repeat units) with the greatest copy number difference between GRCh38 and T2T-CHM13 as determined by DupMasker (table S2). All of the 30 largest differences are increased in T2T-CHM13.
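The stratification described for panel C can be sketched as a simple binning step. This is a minimal illustration, not the authors' pipeline: the function name `sum_bases_by_identity`, the 1% bin width, the 90% lower identity cutoff, and the example alignments are all assumptions made for the sketch.

```python
from collections import defaultdict


def sum_bases_by_identity(alignments, bin_width=1.0, min_identity=90.0):
    """Sum aligned bases per percent-identity bin.

    alignments: iterable of (aligned_bases, percent_identity) pairs.
    Returns {bin_lower_edge: total_bases}, ordered by bin.
    The 90% cutoff and 1% bins are assumptions, not values from the caption.
    """
    top_bin = int((100.0 - min_identity) / bin_width) - 1
    totals = defaultdict(int)
    for aligned_bases, identity in alignments:
        if identity < min_identity:
            continue  # below the assumed identity cutoff
        # Alignments at exactly 100% identity are folded into the top bin.
        index = min(int((identity - min_identity) // bin_width), top_bin)
        totals[min_identity + index * bin_width] += aligned_bases
    return dict(sorted(totals.items()))


# Hypothetical pairwise alignments: (aligned bases, percent identity).
example_alignments = [
    (12000, 99.5),
    (5000, 99.1),
    (30000, 95.2),
    (800, 89.0),    # dropped: below the assumed cutoff
    (2000, 100.0),  # counted in the 99-100% bin
]
histogram = sum_bases_by_identity(example_alignments)
```

Running the same function once per assembly would give one histogram series per reference, which is the comparison the panel draws.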
Fig. 2. Validation of previously unresolved SDs in T2T-CHM13 and heteromorphic variation.
A) Ideogram (top) depicts large SD regions (light red boxes) present in T2T-CHM13 but absent from the current reference human genome (GRCh38). An expanded view of the duplication (red) and satellite organization (blue) is depicted below, showing the location of fosmid FISH probes (e.g., C15) and SD organization compared to ancestral duplicon segments (multi-colored bars) (see inset). B,C) FISH signals (red) shown on extracted metaphase for two probes and three human cell lines. Probe K20 shows a fixed signal (except for one heterozygous signal), and G6 is heteromorphic among humans (see table S4 and fig. S4 for a complete description of all nine probes). D) Inversion polymorphism (green bar) between T2T-CHM13 and GRCh38 in the pericentromeric chromosome 1q region. The inversion (green bar) identified by Strand-seq (32) is confirmed in the assembly; however, the sequence-resolved assembly shows a more complex structure including two inversions (red) and one reordered segment (blue) mapping near the NOTCH2NL human-specific duplications.
Fig. 3. SD single-nucleotide and copy number variation.
A) Sequence divergence (% in 10 kbp bins) based on syntenic alignments between GRCh38 and T2T-CHM13 for SDs (red) and unique genomic regions (black). SD regions show significantly more divergence when compared to unique sequence (black) and chromosome X (blue) but less than the MHC regions (green). B) Copy number of SD regions that are previously unresolved or structurally different in T2T-CHM13 compared to GRCh38 based on 268 human genomes from the Simons Genome Diversity Project (SGDP). The histogram shows the number of Mbp where more samples support the copy number of the given assembly [T2T-CHM13 (red), GRCh38 (blue), neither (green), or both equally (equal copy number)]. C) Empirical cumulative distribution showing how many samples genotype correctly with either GRCh38 or T2T-CHM13 as a function of the allowed difference between sample and reference copy number. The inset shows the area under the curve (AUC) calculation for both references allowing a maximum copy number difference of 30. The green curve shows an in silico reference made using the median copy number of the SGDP samples at each site. D) Genic copy number variation. Copy number variation in nine gene families is shown (generated with SGDP), and the distribution is colored according to which reference better reflects the median copy number; GRCh38 generally underestimates copy number (vertical lines) and Africans (orange) tend to show higher copy number than non-Africans (blue); circle size indicates the number of samples.
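The cumulative curve and AUC comparison in panel C can be sketched for a single site as follows. This is an illustration under stated assumptions rather than the authors' method: the function names, the trapezoidal AUC with unit spacing, the small `max_diff` used in the example, and the sample copy numbers are all hypothetical; only the maximum difference of 30 and the median-based in silico reference come from the caption.

```python
import statistics


def fraction_within(sample_cns, reference_cn, max_diff=30):
    """Fraction of samples whose copy number is within d of the reference,
    for each allowed difference d = 0, 1, ..., max_diff."""
    diffs = [abs(cn - reference_cn) for cn in sample_cns]
    n = len(diffs)
    return [sum(d <= allowed for d in diffs) / n
            for allowed in range(max_diff + 1)]


def area_under_curve(curve):
    """Trapezoidal area under a curve sampled at unit spacing (assumed)."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))


# Hypothetical copy numbers for five samples at one site.
samples = [2, 4, 4, 6, 10]
reference_a = 4                          # hypothetical reference copy number
reference_b = 2                          # hypothetical alternative reference
median_ref = statistics.median(samples)  # in silico median reference

auc_a = area_under_curve(fraction_within(samples, reference_a, max_diff=4))
auc_b = area_under_curve(fraction_within(samples, reference_b, max_diff=4))
auc_median = area_under_curve(fraction_within(samples, median_ref, max_diff=4))
```

A larger AUC means more samples fall within a small copy number difference of that reference, which is how the inset ranks the references.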
Fig. 4. Human-specific expansion of TBC1D3 compared to nonhuman primates.
A) Regions of homology between human T2T-CHM13’s chromosome 17 (top) and a HiFi assembly of the chimpanzee genome (bottom). Red blocks represent regions of human-specific expansion, including TBC1D3 duplications. Colored arrows above and below the homologous sequence represent unique ancestral units (duplicons) identified by DupMasker. Inset plots for both expansion sites are included below with the gene models identified by Liftoff (94). B) Copy number (diploid) estimates from an Illumina read-depth analysis of SGDP, ancient hominids, and nonhuman primates for a TBC1D3 paralog (table S14). Copy number estimates include pseudogenes (5) not included in the phylogeny, explaining the higher counts observed. The T2T-CHM13 copy number and GRCh38 copy number are represented by the red and blue lines, respectively. C) Phylogeny of TBC1D3 copies at these two expansion sites as well as nonhuman primate copies. Single asterisks at nodes indicate bootstrap values greater than or equal to 70%, while double asterisks indicate 100%. The data illustrate a human-specific expansion, as well as several independent expansions in the macaque, gorilla, and orangutan. Using macaque sequence as an outgroup, we estimate the human-specific expansion to be ~2.3 million years ago (MYA). D) Variation in human haplotypes across the first TBC1D3 expansion site: a graph representation (rGFA, left) of the locus where colors indicate the source genome for the sequence, and on the right the path for each haplotype-resolved assembly through the graph. The top row for each haplotype, composed of large polygons, represents an alignment comparing the haplotype-resolved sequence (horizontal) against the graph (vertical), and color represents the source haplotype for the vertical sequence. For example, a single large red triangle indicates there is a one-to-one alignment between CHM13 and the haplotype. Structural variants can be identified from discontinuities in height (deletion), changes between colors (insertion), or changes in the direction of the polygon (inversion). The gene of interest (red arrow) and other genic content in the region (black arrow) are shown below. Colored bars show ancestral duplication segments (duplicons) that compose the larger duplication blocks.
Fig. 5. Genic variation in previously unresolved SD regions of T2T-CHM13.
A) Ideogram showing the previously unresolved or non-syntenic gene models (open reading frames [ORFs] with >200 bp of coding sequence and multiple exons) in the T2T-CHM13 assembly as predicted by Liftoff. Previously unresolved genes mapping to SDs (red) are indicated with an asterisk if predicted to be an expansion in the gene family relative to GRCh38 (25). Arrows indicate inverted regions. Most unique genes mapping to non-syntenic regions (black) are the result of an inversion (arrow). B) Percent improvement in mapping of CHM13 Iso-Seq reads in candidate duplicated genes (red) mapping to non-syntenic regions of the T2T-CHM13 assembly. Positive values identify Iso-Seq reads aligning better to T2T-CHM13 than GRCh38. C) Gene models of LPA with ORF generated from haplotype-resolved HiFi assemblies. The double-exon repeat in these gene models encodes the Kringle IV subtype 2 domain of the LPA protein. Highlighted in red are haplotypes with reduced Kringle IV subtype 2 repeats predicted to increase risk of cardiovascular disease. D) Amino acid variation in the Kringle IV subtype 2 repeat in the paternal haplotype of HG01325 identifies a previously unknown set of amino acid substitutions including rare variants: Ser42Leu in the active site, Ser24Tyr, and Tyr49Cys.
Fig. 6. SD methylation and gene transcription.
A) Methylated (red) or unmethylated (blue-gray) SD blocks in the CHM13 genome based on processing ONT data. The histogram shows the distribution of average methylation across these regions. B) Median methylation signal of SD (red) and unique (blue-gray) genes stratified by their Iso-Seq expression levels in CHM13. The filled intervals represent the 25th and 75th percentiles of the observed data. Vertical lines indicate the position of the transcription start site (TSS) and the transcription termination site (TTS). C) Methylation signal across the recently duplicated NPIPA gene family in CHM13, showing increased methylation in transcriptionally active copies. Black points are individual methylation calls, and the red line is a rolling mean across 10 methylation sites. The labels in gray show the number of CHM13 Iso-Seq transcripts and the gene name.
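The rolling mean described for panel C can be sketched as below. The caption does not say whether the window is centered or trailing, so this sketch assumes a trailing window that emits one value for each full window; the function name `rolling_mean` and the example calls are hypothetical, and methylation calls are assumed to be numeric values such as fractions between 0 and 1.

```python
def rolling_mean(values, window=10):
    """Trailing rolling mean over `window` consecutive sites.

    Returns one mean per full window, so the output has
    len(values) - window + 1 entries (empty if too few sites).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    means = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the site leaving the window
        if i >= window - 1:
            means.append(running / window)
    return means


# Hypothetical methylation calls at six consecutive sites,
# smoothed with a window of 3 to keep the example short.
calls = [1, 1, 1, 0, 0, 0]
smoothed = rolling_mean(calls, window=3)
```

With the caption's window of 10, the same call would be `rolling_mean(calls, window=10)` on the per-site methylation values for one gene copy.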

References

    1. Ohno, Wolf, Atkin, Evolution from fish to mammals by gene duplication. Hereditas. 59, 169–187 (1968).
    2. Ohno, Evolution by Gene Duplication (Springer Science & Business Media, 1970).
    3. Bailey, Yavor, Massa, Trask, Eichler, Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 1005–1017 (2001).
    4. Cooper, Coe, Girirajan, Rosenfeld, Vu, et al., A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846 (2011).
    5. Dennis, Nuttle, Sudmant, Antonacci, Graves, et al., Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell. 149, 912–922 (2012).
