Segmental duplications and their variation in a complete human genome

Science. 2022 Apr;376(6588):eabj6965.

doi: 10.1126/science.abj6965. Epub 2022 Apr 1.

Mitchell R Vollger¹, Xavi Guitart¹, Philip C Dishuck¹, Ludovica Mercuri², William T Harvey¹, Ariel Gershman³, Mark Diekhans⁴, Arvis Sulovari¹, Katherine M Munson¹, Alexandra P Lewis¹, Kendra Hoekzema¹, David Porubsky¹, Ruiyang Li¹, Sergey Nurk⁵, Sergey Koren⁵, Karen H Miga⁴, Adam M Phillippy⁵, Winston Timp³, Mario Ventura², Evan E Eichler¹,⁶

Affiliations

¹ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
² Department of Biology, University of Bari, Aldo Moro, Bari 70125, Italy.
³ Department of Molecular Biology and Genetics, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.
⁴ UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA.
⁵ Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
⁶ Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.

PMID: 35357917
PMCID: PMC8979283
DOI: 10.1126/science.abj6965

Mitchell R Vollger et al. Science. 2022 Apr.

Abstract

Despite their importance in disease and evolution, highly identical segmental duplications (SDs) are among the last regions of the human reference genome (GRCh38) to be fully sequenced. Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis reveals patterns of structural heterozygosity and evolutionary differences in SD organization between humans and other primates.
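The percentages and base-pair totals quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is a back-of-the-envelope check, not part of the authors' analysis. The function name `implied_total_mbp` is hypothetical, and because the published percentage is rounded to one decimal place, the implied genome length is only approximate.

```python
def implied_total_mbp(part_mbp: float, percent: float) -> float:
    """Return the total length (Mbp) implied by a part and its percentage share."""
    if percent <= 0:
        raise ValueError("percent must be positive")
    return part_mbp / (percent / 100.0)


# Figures taken from the abstract.
SD_MBP = 218.0          # SD content reported for T2T-CHM13, in Mbp
SD_PERCENT_T2T = 7.0    # genome-wide SD estimate with T2T-CHM13, in percent
SD_PERCENT_PRIOR = 5.4  # earlier genome-wide SD estimate, in percent

# Genome length implied by "7.0% = 218 Mbp" (approximate, since 7.0% is rounded).
genome_mbp = implied_total_mbp(SD_MBP, SD_PERCENT_T2T)

# Increase in the SD estimate, in percentage points.
point_increase = round(SD_PERCENT_T2T - SD_PERCENT_PRIOR, 1)

print(f"Implied genome length: {genome_mbp:.0f} Mbp")
print(f"Increase in SD estimate: {point_increase} percentage points")
```

The implied length is on the order of 3.1 billion base pairs, which matches the expected scale of a complete human genome assembly.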

Figures

**Fig. 0. More complete segmental duplication content improves genotyping.**
A) ~10-fold increase in the number of large (>10 kbp) acrocentric segmental duplications (red) in T2T-CHM13 (right) compared to GRCh38 (left). B) Read-depth genotyping of short-read Illumina whole-genome sequence from a human diversity panel (n = 268) better matches T2T-CHM13 (red) than GRCh38 (blue), irrespective of the human population group considered.

**Fig. 1. Segmental duplication (SD) content of the T2T-CHM13 genome.**
A) The pattern of previously unresolved or structurally variant intrachromosomal duplications in T2T-CHM13 (red) compared to known duplications in GRCh38 (blue-gray). These predict hotspots of genomic instability (gold) flanked by large (>10 kbp), high-identity (>95%) interspersed (>50 kbp) SDs. B) Circos plot highlighting previously unresolved interchromosomal SDs (red) shows the preponderance of previously unresolved SDs mapping to pericentromeric and acrocentric regions. C) A histogram comparing SD content in different human reference genomes. The sum of bases in pairwise SD alignments stratified by their percent identity for the Celera (yellow, Sanger-based), GRCh38 (blue-gray, BAC-based), and T2T-CHM13 (red, long-read) assemblies. D) The 30 genic duplicons (ancestral repeat units) with the greatest copy number difference between GRCh38 and T2T-CHM13 as determined by DupMasker (table S2). All of the 30 largest differences are increased in T2T-CHM13.

**Fig. 2. Validation of previously unresolved SDs in T2T-CHM13 and heteromorphic variation.**
A) Ideogram (top) depicts large SD regions (light red boxes) present in T2T-CHM13 but absent from the current reference human genome (GRCh38). An expanded view of the duplication (red) and satellite organization (blue) is depicted below, showing the location of fosmid FISH probes (e.g., C15) and SD organization compared to ancestral duplicon segments (multi-colored bars) (see inset). B, C) FISH signals (red) shown on extracted metaphase for two probes and three human cell lines. Probe K20 shows a fixed signal (except for one heterozygous signal), and G6 is heteromorphic among humans (see table S4 and fig. S4 for a complete description of all nine probes). D) Inversion polymorphism (green bar) between T2T-CHM13 and GRCh38 in the pericentromeric chromosome 1q region. The inversion (green bar) identified by Strand-seq (32) is confirmed in the assembly; however, the sequence-resolved assembly shows a more complex structure including two inversions (red) and one reordered segment (blue) mapping near the *NOTCH2NL* human-specific duplications.

**Fig. 3. SD single-nucleotide and copy number variation.**
A) Sequence divergence (% in 10 kbp bins) based on syntenic alignments between GRCh38 and T2T-CHM13 for SDs (red) and unique genomic regions (black). SD regions show significantly more divergence when compared to unique sequence (black) and chromosome X (blue) but less than the MHC regions (green). B) Copy number of SD regions that are previously unresolved or structurally different in T2T-CHM13 compared to GRCh38 based on 268 human genomes from the Simons Genome Diversity Project (SGDP). The histogram shows the number of Mbp where more samples support the copy number of the given assembly [T2T-CHM13 (red), GRCh38 (blue), neither (green), or both equally (equal copy number)]. C) Empirical cumulative distribution showing how many samples genotype correctly with either GRCh38 or T2T-CHM13 as a function of the allowed difference between sample and reference copy number. The inset shows the area under the curve (AUC) calculation for both references allowing a maximum copy number difference of 30. The green curve shows an in silico reference made using the median copy number of the SGDP samples at each site. D) Genic copy number variation. Copy number variation in nine gene families is shown (generated with SGDP), and the distribution is colored according to which reference better reflects the median copy number; GRCh38 generally underestimates copy number (vertical lines), and Africans (orange) tend to show higher copy number than non-Africans (blue); circle size indicates the number of samples.

**Fig. 4. Human-specific expansion of TBC1D3 compared to nonhuman primates.**
A) Regions of homology between human T2T-CHM13’s chromosome 17 (top) and a HiFi assembly of the chimpanzee genome (bottom). Red blocks represent regions of human-specific expansion, including *TBC1D3* duplications. Colored arrows above and below the homologous sequence represent unique ancestral units (duplicons) identified by DupMasker. Inset plots for both expansion sites are included below with the gene models identified by Liftoff (94). B) Copy number (diploid) estimates from an Illumina read-depth analysis of SGDP, ancient hominids, and nonhuman primates for a *TBC1D3* paralog (table S14). Copy number estimates include pseudogenes (5) not included in the phylogeny, explaining the higher counts observed. The T2T-CHM13 copy number and GRCh38 copy number are represented by the red and blue lines, respectively. C) Phylogeny of *TBC1D3* copies at these two expansion sites as well as nonhuman primate copies. Single asterisks at nodes indicate bootstrap values greater than or equal to 70%, while double asterisks indicate 100%. The data illustrate a human-specific expansion, as well as several independent expansions in the macaque, gorilla, and orangutan. Using macaque sequence as an outgroup, we estimate the human-specific expansion to be ~2.3 million years ago (MYA). D) Variation in human haplotypes across the first *TBC1D3* expansion site: a graph representation (rGFA, left) of the locus where colors indicate the source genome for the sequence, and on the right the path for each haplotype-resolved assembly through the graph. The top row for each haplotype composed of large polygons represents an alignment comparing the haplotype-resolved sequence (horizontal) against the graph (vertical), and color represents the source haplotype for the vertical sequence. For example, a single large red triangle indicates there is a one-to-one alignment between CHM13 and the haplotype. Structural variants can be identified from discontinuities in height (deletion), changes between colors (insertion), or changes in the direction of the polygon (inversion). Below is shown the gene of interest (red arrow) and other genic content in the region (black arrow). Colored bars show ancestral duplication segments (duplicons) that compose the larger duplication blocks.

**Fig. 5. Genic variation in previously unresolved SD regions of T2T-CHM13.**
A) Ideogram showing the previously unresolved or non-syntenic gene models (open reading frames [ORFs] with >200 bp of coding sequence and multiple exons) in the T2T-CHM13 assembly as predicted by Liftoff. Previously unresolved genes mapping to SDs (red) are indicated with an asterisk if predicted to be an expansion in the gene family relative to GRCh38 (25). Arrows indicate inverted regions. Most unique genes mapping to non-syntenic regions (black) are the result of an inversion (arrow). B) Percent improvement in mapping of CHM13 Iso-Seq reads in candidate duplicated genes (red) mapping to non-syntenic regions of the T2T-CHM13 assembly. Positive values identify Iso-Seq reads aligning better to T2T-CHM13 than GRCh38. C) Gene models of *LPA* with ORF generated from haplotype-resolved HiFi assemblies. The double-exon repeat in these gene models encodes the Kringle IV subtype 2 domain of the LPA protein. Highlighted in red are haplotypes with reduced Kringle IV subtype 2 repeats predicted to increase risk of cardiovascular disease. D) Amino acid variation in the Kringle IV subtype 2 repeat in the paternal haplotype of HG01325 identifies a previously unknown set of amino acid substitutions including rare variants: Ser42Leu in the active site, Ser24Tyr, and Tyr49Cys.

**Fig. 6. SD methylation and gene transcription.**
A) Methylated (red) or unmethylated (blue-gray) SD blocks in the CHM13 genome based on processing ONT data. The histogram shows the distribution of average methylation across these regions. B) Median methylation signal of SD (red) and unique (blue-gray) genes stratified by their Iso-Seq expression levels in CHM13. The filled intervals represent the 25th and 75th percentiles of the observed data. Vertical lines indicate the position of the transcription start site (TSS) and the transcription termination site (TTS). C) Methylation signal across the recently duplicated *NPIPA* gene family in CHM13, showing increased methylation in transcriptionally active copies. Black points are individual methylation calls, and the red line is a rolling mean across 10 methylation sites. The labels in gray show the number of CHM13 Iso-Seq transcripts and the gene name.
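The Fig. 6C caption describes a rolling mean across 10 methylation sites. The following is a minimal sketch of that kind of smoothing, not the authors' code: the function name `rolling_mean`, the choice to emit values only for full windows, and the toy methylation calls are all assumptions made for illustration.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int = 10) -> List[float]:
    """Mean of each run of `window` consecutive values (full windows only)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    means: List[float] = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just slid out of the window.
            running_sum -= values[i - window]
        if i >= window - 1:
            means.append(running_sum / window)
    return means


# Hypothetical per-site methylation calls (1 = methylated, 0 = unmethylated).
toy_calls = [0, 1] * 10
smoothed = rolling_mean(toy_calls, window=10)
print(len(smoothed), smoothed[:3])
```

Emitting a value only once a full window is available means the smoothed track is shorter than the input by `window - 1` sites; a centered or padded window would be an equally reasonable choice, and the caption does not say which was used.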

Comment in

The final pieces of the human genome.
Attwaters M. Nat Rev Genet. 2022 Jun;23(6):321. doi: 10.1038/s41576-022-00494-5. PMID: 35488041.

Cited by

Benchmarking challenging small variants with linked and long reads.
Wagner J, Olson ND, Harris L, Khan Z, Farek J, Mahmoud M, Stankovic A, Kovacevic V, Yoo B, Miller N, Rosenfeld JA, Ni B, Zarate S, Kirsche M, Aganezov S, Schatz MC, Narzisi G, Byrska-Bishop M, Clarke W, Evani US, Markello C, Shafin K, Zhou X, Sidow A, Bansal V, Ebert P, Marschall T, Lansdorp P, Hanlon V, Mattsson CA, Barrio AM, Fiddes IT, Xiao C, Fungtammasan A, Chin CS, Wenger AM, Rowell WJ, Sedlazeck FJ, Carroll A, Salit M, Zook JM. Cell Genom. 2022 May;2(5):100128. doi: 10.1016/j.xgen.2022.100128. PMID: 36452119.
Long-read sequencing and genome assembly of natural history collection samples and challenging specimens.
Bein B, Chrysostomakis I, Arantes LS, Brown T, Gerheim C, Schell T, Schneider C, Leushkin E, Chen Z, Sigwart J, Gonzalez V, Wong NLWS, Santos FR, Blom MPK, Mayer F, Mazzoni CJ, Böhne A, Winkler S, Greve C, Hiller M. bioRxiv [Preprint]. 2024 Sep 27:2024.03.04.583385. doi: 10.1101/2024.03.04.583385. PMID: 39386456. Preprint.
A cell type-aware framework for nominating non-coding variants in Mendelian regulatory disorders.
Lee AS, Ayers LJ, Kosicki M, Chan WM, Fozo LN, Pratt BM, Collins TE, Zhao B, Rose MF, Sanchis-Juan A, Fu JM, Wong I, Zhao X, Tenney AP, Lee C, Laricchia KM, Barry BJ, Bradford VR, Lek M, MacArthur DG, Lee EA, Talkowski ME, Brand H, Pennacchio LA, Engle EC. medRxiv [Preprint]. 2023 Dec 27:2023.12.22.23300468. doi: 10.1101/2023.12.22.23300468. Update in: Nat Commun. 2024 Sep 27;15(1):8268. doi: 10.1038/s41467-024-52463-7. PMID: 38234731. Preprint.
A survey of algorithms for the detection of genomic structural variants from long-read sequencing data.
Ahsan MU, Liu Q, Perdomo JE, Fang L, Wang K. Nat Methods. 2023 Aug;20(8):1143-1158. doi: 10.1038/s41592-023-01932-w. Epub 2023 Jun 29. PMID: 37386186. Review.
A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing.
Prodanov T, Bansal V. Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i279-i287. doi: 10.1093/bioinformatics/btad268. PMID: 37387146.

References

1. Ohno, Wolf, Atkin, Evolution from fish to mammals by gene duplication. Hereditas. 59, 169–187 (1968).
2. Ohno, Evolution by Gene Duplication (Springer Science & Business Media, 1970).
3. Bailey, Yavor, Massa, Trask, Eichler, Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 1005–1017 (2001).
4. Cooper, Coe, Girirajan, Rosenfeld, Vu, et al., A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846 (2011).
5. Dennis, Nuttle, Sudmant, Antonacci, Graves, et al., Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell. 149, 912–922 (2012).

