Am J Hum Genet. 2004 Jan;74(1):106-20.
doi: 10.1086/381000. Epub 2003 Dec 15.

Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium

Christopher S Carlson et al. Am J Hum Genet. 2004 Jan.

Abstract

Common genetic polymorphisms may explain a portion of the heritable risk for common diseases. Within candidate genes, the number of common polymorphisms is finite, but direct assay of all existing common polymorphisms is inefficient, because genotypes at many of these sites are strongly correlated. Thus, it is not necessary to assay all common variants if the patterns of allelic association between common variants can be described. We have developed an algorithm to select the maximally informative set of common single-nucleotide polymorphisms (tagSNPs) to assay in candidate-gene association studies, such that all known common polymorphisms either are directly assayed or exceed a threshold level of association with a tagSNP. The algorithm is based on the r² linkage disequilibrium (LD) statistic, because r² is directly related to statistical power to detect disease associations with unassayed sites. We show that, at a relatively stringent r² threshold (r² > 0.8), the LD-selected tagSNPs resolve >80% of all haplotypes across a set of 100 candidate genes, regardless of recombination, and tag specific haplotypes and clades of related haplotypes in nonrecombinant regions. Thus, if the patterns of common variation are described for a candidate gene, analysis of the tagSNP set can comprehensively interrogate for main effects from common functional variation. We demonstrate that, although common variation tends to be shared between populations, tagSNPs should be selected separately for populations with different ancestries.
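The selection rule described in the abstract (every common SNP is either assayed directly or exceeds an r² threshold with a tagSNP) can be illustrated with a greedy binning sketch. The abstract does not spell out the procedure, so the greedy strategy, the function names `make_r2` and `ld_select`, and the toy r² values below are all assumptions made for illustration, not the authors' implementation.

```python
def make_r2(snps, pairs, default=0.1):
    """Build a symmetric r^2 lookup from a dict of pairwise values (toy helper)."""
    r2 = {a: {b: (1.0 if a == b else default) for b in snps} for a in snps}
    for (a, b), value in pairs.items():
        r2[a][b] = r2[b][a] = value
    return r2


def ld_select(snps, r2, threshold=0.8):
    """Greedy binning sketch: returns a list of (tag_snps, bin_members) tuples.

    Assumption: at each step the SNP exceeding the threshold with the most
    still-unbinned SNPs seeds a bin; any member exceeding the threshold with
    every other member of that bin can serve as its tagSNP.
    """
    def linked(a, b):
        return a == b or r2[a][b] > threshold

    unbinned = list(snps)
    bins = []
    while unbinned:
        # Seed: the SNP linked to the largest number of unbinned SNPs.
        seed = max(unbinned, key=lambda s: sum(linked(s, t) for t in unbinned))
        members = [t for t in unbinned if linked(seed, t)]
        # Interchangeable tags: linked to all other members of the bin.
        tags = [s for s in members if all(linked(s, t) for t in members)]
        bins.append((tags, members))
        unbinned = [t for t in unbinned if t not in members]
    return bins


# Toy data (hypothetical): five SNPs with two groups of correlated sites.
snps = ["A", "B", "C", "D", "E"]
r2 = make_r2(snps, {("A", "B"): 0.9, ("A", "C"): 0.85,
                    ("B", "C"): 0.6, ("D", "E"): 0.95})
bins = ld_select(snps, r2, threshold=0.8)
print(bins)
```

In this toy case, SNP A tags a bin containing A, B and C (B and C are not tags, because their mutual r² of 0.6 is below the threshold), while D and E form a second bin in which either site could be assayed.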

Figures

Figure 1
Common variation and LD in European Americans at BDKRB2. At the BDKRB2 gene, 22 SNPs with MAF > 10% were described for the European American samples (A). Patterns of genotype at each SNP are shown as a visual genotype plot, in which each column represents a site and each row represents a sample. Genotype is color coded, as shown, with SNPs presented in the order they were identified across the gene. Patterns of genotype are clearly similar for many SNPs (e.g., sites 10922 and 12574) but not necessarily for adjacent SNPs. The same data are shown in panel B, with the order of SNPs rearranged such that each SNP is adjacent to SNPs with similar patterns of genotype. Among the 22 SNPs, the LD-based SNP-selection algorithm identified five bins of tagSNPs at an r² threshold of 0.5. tagSNP bins are boxed (B). The LD statistic r² describes the similarity of pattern between pairs of polymorphic sites: pairwise r² between SNPs is shown for the same order of SNPs as in panel B, and bins of SNPs with similar patterns are visible as reddish triangles above the diagonal (C).
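As a rough illustration of what the r² statistic measures, the sketch below computes it for two biallelic sites using the standard definition r² = D² / (p_A(1 − p_A) p_B(1 − p_B)), where D = p_AB − p_A·p_B. It assumes phased haplotypes coded 0/1, which is a simplification: the figure is drawn from unphased genotypes, and the function name `r_squared` is hypothetical.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites, given phased 0/1 alleles per haplotype."""
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    p_a = sum(site_a) / n                                   # frequency of allele 1 at site A
    p_b = sum(site_b) / n                                   # frequency of allele 1 at site B
    p_ab = sum(x * y for x, y in zip(site_a, site_b)) / n   # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                    # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic site")
    return d * d / denom


# Identical (or exactly mirrored) allele patterns give r^2 = 1;
# statistically independent patterns give r^2 = 0.
print(r_squared([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
print(r_squared([1, 1, 0, 0], [0, 0, 1, 1]))  # 1.0
print(r_squared([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0
```

Sites with matching patterns in the visual genotype plot correspond to pairs with r² near 1, which is why they fall into the same bin.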
Figure 2
tagSNPs per gene, with threshold r² > 0.5 and MAF > 10%. The complete genomic region of 100 genes was resequenced in 24 unrelated African American and 23 unrelated European American samples. Within each population, tagSNPs were selected from all SNPs with MAF > 10% at an r² threshold of 0.5. A, The number of tagSNPs selected in each gene under these parameters, plotted against the size of the genomic region for each gene. Although there is a clear trend toward more tagSNPs in larger genes, there is considerable variance in the required tagSNP density in both populations. B, The number of tagSNPs selected in each gene, plotted against nucleotide diversity (π) per base pair. Thus, variance in tagSNP density between genes reflects both variation in nucleotide diversity and variation in the average extent of LD within genes. Within each gene, a greater number of tagSNPs is generally required in the African American population, reflecting both greater nucleotide diversity and shorter range LD, relative to the European American population.
Figure 3
Total tagSNP bins in 100 genes, versus threshold r². At each r² threshold, tagSNP bins were identified for 100 genes within African American (“AA tagSNPs”) and European American (“EA tagSNPs”) populations. As expected, more tagSNP bins were identified in African American samples than in European American samples. To measure the effects of population stratification on the LD-select algorithm, tagSNPs were also selected from merged African American and European American populations (“Merged tagSNPs”). The minimal set of tagSNPs relevant to both populations was also assembled at each r² threshold as the union of the tagSNP sets selected in each subpopulation (“Optimal tagSNPs”); this set was larger than the tagSNP set in either subpopulation alone but considerably smaller than the sum of the population-specific site sets, reflecting the fact that many (but not all) tagSNPs were useful in both populations.
Figure 4
The relationship between LD-selected tagSNPs and haplotypes. For each gene, haplotypes were inferred computationally. Results are shown as the fraction of haplotypes resolved using only LD-selected tagSNPs, relative to haplotypes resolved using all common SNPs. Results are shown across a range of r² values in each population (A). The effective number of haplotypes weights the number of haplotypes by frequency, with common haplotypes more heavily weighted. For each gene, the fraction of effective haplotypes resolved using only LD-selected tagSNPs, relative to effective haplotypes resolved by use of all common SNPs, is shown across a range of r² values in each population (B). For r² thresholds >0.5, >80% of effective haplotypes were resolved, demonstrating how, at adequately stringent r² thresholds, LD-selected tagSNPs efficiently resolve common haplotypes.
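The caption does not give a formula for the effective number of haplotypes. One common frequency-weighted definition is the inverse of the summed squared haplotype frequencies, n_e = 1 / Σ pᵢ², and the sketch below assumes that definition. The function names and the toy haplotypes are hypothetical; the "fraction resolved" is taken here as the effective number of haplotypes seen at the tagSNP columns alone, divided by the effective number seen at all SNPs.

```python
from collections import Counter


def effective_number(haplotypes):
    """Assumed definition: 1 / sum(p_i^2) over distinct haplotype frequencies."""
    counts = Counter(haplotypes)
    n = sum(counts.values())
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


def fraction_resolved(haplotypes, tag_columns):
    """Effective haplotypes at the tag columns, relative to all columns."""
    projected = [tuple(h[i] for i in tag_columns) for h in haplotypes]
    return effective_number(projected) / effective_number(haplotypes)


# Toy sample (hypothetical): four chromosomes typed at three SNPs.
haps = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 1)]

print(effective_number(haps))            # three haplotypes at 0.5, 0.25, 0.25
print(fraction_resolved(haps, [0]))      # one tag merges the two rarer haplotypes
print(fraction_resolved(haps, [0, 2]))   # two tags separate all three haplotypes
```

Under this definition, tagging only the first SNP recovers three quarters of the effective haplotypes in the toy sample, and adding the third SNP recovers all of them; the second SNP is redundant with the first.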
Figure 5
tagSNP bins and the evolutionary relationships between haplotypes. A hypothetical nonrecombinant region with five existing haplotypes is shown, with each row (A–E) representing a haplotype and each column (1–7) representing an SNP with a unique pattern of alleles. The common allele is shown as blue and the rare allele as yellow. There are five possible patterns (1–5) that are haplotype specific, and two (6 and 7) that are specific to clades of related haplotypes. LD-based tagSNP selection at an adequately stringent r² threshold would identify all seven patterns in this hypothetical region. Thus, directly testing LD-selected tagSNPs can identify disease associations with either specific haplotypes or with clades of related haplotypes.
