Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium

Christopher S Carlson¹, Michael A Eberle, Mark J Rieder, Qian Yi, Leonid Kruglyak, Deborah A Nickerson

¹ Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA. csc47@u.washington.edu

Am J Hum Genet. 2004 Jan;74(1):106-20. Epub 2003 Dec 15.

PMID: 14681826
PMCID: PMC1181897
DOI: 10.1086/381000

Abstract

Common genetic polymorphisms may explain a portion of the heritable risk for common diseases. Within candidate genes, the number of common polymorphisms is finite, but direct assay of all existing common polymorphisms is inefficient, because genotypes at many of these sites are strongly correlated. Thus, it is not necessary to assay all common variants if the patterns of allelic association between common variants can be described. We have developed an algorithm to select the maximally informative set of common single-nucleotide polymorphisms (tagSNPs) to assay in candidate-gene association studies, such that all known common polymorphisms either are directly assayed or exceed a threshold level of association with a tagSNP. The algorithm is based on the r² linkage disequilibrium (LD) statistic, because r² is directly related to statistical power to detect disease associations with unassayed sites. We show that, at a relatively stringent r² threshold (r²>0.8), the LD-selected tagSNPs resolve >80% of all haplotypes across a set of 100 candidate genes, regardless of recombination, and tag specific haplotypes and clades of related haplotypes in nonrecombinant regions. Thus, if the patterns of common variation are described for a candidate gene, analysis of the tagSNP set can comprehensively interrogate for main effects from common functional variation. We demonstrate that, although common variation tends to be shared between populations, tagSNPs should be selected separately for populations with different ancestries.
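
The selection rule described in the abstract (every common SNP is either assayed directly or exceeds an r² threshold with a tagSNP) can be illustrated with a greedy binning pass. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `ld_select`, the tie-breaking rule (lowest index wins), and the toy r² matrix are all hypothetical, and the input is assumed to be a precomputed symmetric matrix of pairwise r² values for SNPs that already pass the allele-frequency filter.

```python
from typing import Dict, List, Sequence


def ld_select(r2: Sequence[Sequence[float]], threshold: float) -> List[Dict[str, List[int]]]:
    """Greedily partition SNP indices into bins using a pairwise r^2 matrix."""
    remaining = set(range(len(r2)))
    bins = []
    while remaining:
        # For each unbinned SNP, find the unbinned SNPs it exceeds the threshold with.
        partners = {
            i: {j for j in remaining if j != i and r2[i][j] > threshold}
            for i in remaining
        }
        # Seed the bin with the SNP covering the most others (assumed tie-break: lowest index).
        seed = max(sorted(remaining), key=lambda i: len(partners[i]))
        members = partners[seed] | {seed}
        # A member can serve as tagSNP if it exceeds the threshold with every other member.
        tags = sorted(i for i in members if members - {i} <= partners[i])
        bins.append({"members": sorted(members), "tags": tags})
        remaining -= members
    return bins


# Toy matrix (made-up values): SNPs 0-2 are correlated, as are SNPs 3-4.
toy_r2 = [
    [1.00, 0.90, 0.85, 0.10, 0.00],
    [0.90, 1.00, 0.60, 0.10, 0.00],
    [0.85, 0.60, 1.00, 0.20, 0.10],
    [0.10, 0.10, 0.20, 1.00, 0.95],
    [0.00, 0.00, 0.10, 0.95, 1.00],
]
bins_strict = ld_select(toy_r2, threshold=0.8)
bins_loose = ld_select(toy_r2, threshold=0.5)
```

In this toy case both thresholds give the same two bins, but at the stricter threshold only SNP 0 can tag the first bin, whereas at 0.5 any of SNPs 0, 1, or 2 would do. Assaying one tagSNP per bin then covers every SNP in the matrix.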

Figures

**Figure 1**
Common variation and LD in European Americans at *BDKRB2.* At the *BDKRB2* gene, 22 SNPs with *MAF*>10% were described for the European American samples (A). Patterns of genotype at each SNP are shown as a visual genotype plot, in which each column represents a site and each row represents a sample. Genotype is color coded, as shown, with SNPs presented in the order they were identified across the gene. Patterns of genotype are clearly similar for many SNPs (e.g., sites 10922 and 12574) but not necessarily for adjacent SNPs. The same data are shown in panel B, with the order of SNPs rearranged such that each SNP is adjacent to SNPs with similar patterns of genotype. Among the 22 SNPs, the LD-based SNP-selection algorithm identified five bins of tagSNPs at an r² threshold of 0.5. tagSNP bins are boxed (B). The LD statistic r² describes the similarity of pattern between pairs of polymorphic sites: pairwise r² between SNPs is shown for the same order of SNPs as in panel B, and bins of SNPs with similar patterns are visible as reddish triangles above the diagonal (C).
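
For readers who want to see how the pairwise r² in panel C is obtained, the standard definition is r² = D² / (p_A(1−p_A) p_B(1−p_B)), where D = p_AB − p_A·p_B. The sketch below assumes phased haplotypes coded as 0/1 alleles per chromosome; with unphased genotype data, as in the visual genotype plot, haplotype frequencies would first have to be estimated. The function name `pairwise_r2` and the example allele vectors are hypothetical.

```python
from typing import Sequence


def pairwise_r2(site_a: Sequence[int], site_b: Sequence[int]) -> float:
    """r^2 between two biallelic sites, given 0/1 alleles on the same phased chromosomes."""
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("sites must cover the same non-empty set of chromosomes")
    p_a = sum(site_a) / n  # frequency of allele 1 at site A
    p_b = sum(site_b) / n  # frequency of allele 1 at site B
    p_ab = sum(a & b for a, b in zip(site_a, site_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


# Made-up examples: identical patterns, mirrored patterns, and independent patterns.
same = pairwise_r2([1, 1, 0, 0, 1, 0, 0, 0], [1, 1, 0, 0, 1, 0, 0, 0])
mirrored = pairwise_r2([1, 1, 0, 0, 1, 0, 0, 0], [0, 0, 1, 1, 0, 1, 1, 1])
independent = pairwise_r2([1, 1, 0, 0], [1, 0, 1, 0])
```

Identical and mirrored patterns both give r² = 1, since r² does not depend on which allele is labeled 1, while the independent pair gives r² = 0.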

**Figure 2**
tagSNPs per gene, with threshold r²>0.5 and *MAF*>10%. The complete genomic region of 100 genes was resequenced in 24 unrelated African American and 23 unrelated European American samples. Within each population, tagSNPs were selected from all SNPs with *MAF*>10% at an r² threshold of 0.5. A, The number of tagSNPs selected in each gene under these parameters, plotted against the size of the genomic region for each gene. Although there is a clear trend toward more tagSNPs in larger genes, there is considerable variance in the required tagSNP density in both populations. B, The number of tagSNPs selected in each gene, plotted against nucleotide diversity (π) per base pair. Thus, variance in tagSNP density between genes reflects both variation in nucleotide diversity and variation in the average extent of LD within genes. Within each gene, a greater number of tagSNPs is generally required in the African American population, reflecting both greater nucleotide diversity and shorter range LD, relative to the European American population.

**Figure 3**
Total tagSNP bins in 100 genes, versus threshold r². At each r² threshold, tagSNP bins were identified for 100 genes within African American (“AA tagSNPs”) and European American (“EA tagSNPs”) populations. As expected, more tagSNP bins were identified in African American samples than in European American samples. To measure the effects of population stratification on the LD-select algorithm, tagSNPs were also selected from merged African American and European American populations (“Merged tagSNPs”). The minimal set of tagSNPs relevant to both populations was also assembled at each r² threshold as the union of the tagSNP sets selected in each subpopulation (“Optimal tagSNPs”); this set was larger than the tagSNP set in either subpopulation alone but considerably smaller than the sum of the population-specific site sets, reflecting the fact that many (but not all) tagSNPs were useful in both populations.

**Figure 4**
The relationship between LD-selected tagSNPs and haplotypes. For each gene, haplotypes were inferred computationally. Results are shown as the fraction of haplotypes resolved using only LD-selected tagSNPs, relative to haplotypes resolved using all common SNPs. Results are shown across a range of r² values in each population (A). The effective number of haplotypes weights the number of haplotypes by frequency, with common haplotypes more heavily weighted. For each gene, the fraction of effective haplotypes resolved using only LD-selected tagSNPs, relative to effective haplotypes resolved using all common SNPs, is shown across a range of r² values in each population (B). For r² thresholds >0.5, >80% of effective haplotypes were resolved, demonstrating that, at adequately stringent r² thresholds, LD-selected tagSNPs efficiently resolve common haplotypes.
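
The caption does not give a formula for the effective number of haplotypes. A common frequency-weighted definition, assumed here, is n_e = 1 / Σ pᵢ², where pᵢ is the frequency of haplotype i. The sketch below uses that assumed definition to compute the fraction of effective haplotypes still distinguishable when only a subset of SNP columns (the tagSNPs) is kept. Function names and the toy haplotypes are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def effective_haplotypes(haplotypes: Sequence[Tuple[int, ...]]) -> float:
    """Assumed definition: 1 / sum of squared haplotype frequencies."""
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


def fraction_resolved(haplotypes: Sequence[Tuple[int, ...]], tag_columns: Sequence[int]) -> float:
    """Effective haplotypes seen through the tag columns, relative to all columns."""
    # Haplotypes that differ only at untyped sites collapse into one after projection.
    projected = [tuple(h[i] for i in tag_columns) for h in haplotypes]
    return effective_haplotypes(projected) / effective_haplotypes(haplotypes)


# Four equally frequent toy haplotypes over three SNPs (made-up data).
toy_haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
one_tag = fraction_resolved(toy_haps, tag_columns=[0])      # columns 0 and 1 are redundant
two_tags = fraction_resolved(toy_haps, tag_columns=[0, 2])  # one tag per distinct pattern
```

With a single tag, the four haplotypes collapse into two groups and half of the effective haplotypes are resolved. Adding the second, uncorrelated SNP resolves all of them.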

**Figure 5**
tagSNP bins and the evolutionary relationships between haplotypes. A hypothetical nonrecombinant region with five existing haplotypes is shown, with each row (*A–E*) representing a haplotype and each column (*1–7*) representing an SNP with a unique pattern of alleles. The common allele is shown as blue and the rare allele as yellow. There are five possible patterns (*1–5*) that are haplotype specific, and two (*6* and *7*) that are specific to clades of related haplotypes. LD-based tagSNP selection at an adequately stringent r² threshold would identify all seven patterns in this hypothetical region. Thus, directly testing LD-selected tagSNPs can identify disease associations with either specific haplotypes or with clades of related haplotypes.

Cited by

Common single nucleotide polymorphisms in genes related to immune function and risk of papillary thyroid cancer.
Brenner AV, Neta G, Sturgis EM, Pfeiffer RM, Hutchinson A, Yeager M, Xu L, Zhou C, Wheeler W, Tucker MA, Chanock SJ, Sigurdson AJ. PLoS One. 2013;8(3):e57243. doi: 10.1371/journal.pone.0057243. Epub 2013 Mar 8. PMID: 23520464. Free PMC article. Clinical Trial.
CUBN as a novel locus for end-stage renal disease: insights from renal transplantation.
Reznichenko A, Snieder H, van den Born J, de Borst MH, Damman J, van Dijk MC, van Goor H, Hepkema BG, Hillebrands JL, Leuvenink HG, Niesing J, Bakker SJ, Seelen M, Navis G; REGaTTA (REnal GeneTics TrAnsplantation) Groningen group. PLoS One. 2012;7(5):e36512. doi: 10.1371/journal.pone.0036512. Epub 2012 May 4. PMID: 22574174. Free PMC article.
GStream: improving SNP and CNV coverage on genome-wide association studies.
Alonso A, Marsal S, Tortosa R, Canela-Xandri O, Julià A. PLoS One. 2013 Jul 3;8(7):e68822. doi: 10.1371/journal.pone.0068822. Print 2013. PMID: 23844243. Free PMC article.
Genetic variants and cell-free hemoglobin processing in sickle cell nephropathy.
Saraf SL, Zhang X, Shah B, Kanias T, Gudehithlu KP, Kittles R, Machado RF, Arruda JA, Gladwin MT, Singh AK, Gordeuk VR. Haematologica. 2015 Oct;100(10):1275-84. doi: 10.3324/haematol.2015.124875. Epub 2015 Jul 23. PMID: 26206798. Free PMC article.
Candidate gene studies in hypodontia suggest role for FGF3.
Vieira AR, D'Souza RN, Mues G, Deeley K, Hsin HY, Küchler EC, Meira R, Patir A, Tannure PN, Lips A, Costa MC, Granjeiro JM, Seymen F, Modesto A. Eur Arch Paediatr Dent. 2013 Dec;14(6):405-10. doi: 10.1007/s40368-013-0010-2. Epub 2013 Apr 3. PMID: 23549991. Free PMC article.
