Proc Natl Acad Sci U S A. 2012 Oct 2;109(40):16161-6.
doi: 10.1073/pnas.1212755109. Epub 2012 Sep 17.

Statistical inference of the generation probability of T-cell receptors from sequence repertoires

Anand Murugan et al. Proc Natl Acad Sci U S A. 2012.

Abstract

Stochastic rearrangement of germline V-, D-, and J-genes to create variable coding sequence for certain cell surface receptors is at the origin of immune system diversity. This process, known as "VDJ recombination", is implemented via a series of stochastic molecular events involving gene choices and random nucleotide insertions between, and deletions from, genes. We use large sequence repertoires of the variable CDR3 region of human CD4+ T-cell receptor beta chains to infer the statistical properties of these basic biochemical events. Because any given CDR3 sequence can be produced in multiple ways, the probability distribution of hidden recombination events cannot be inferred directly from the observed sequences; we therefore develop a maximum likelihood inference method to achieve this end. To separate the properties of the molecular rearrangement mechanism from the effects of selection, we focus on nonproductive CDR3 sequences in T-cell DNA. We infer the joint distribution of the various generative events that occur when a new T-cell receptor gene is created. We find a rich picture of correlation (and absence thereof), providing insight into the molecular mechanisms involved. The generative event statistics are consistent between individuals, suggesting a universal biochemical process. Our probabilistic model predicts the generation probability of any specific CDR3 sequence by the primitive recombination process, allowing us to quantify the potential diversity of the T-cell repertoire and to understand why some sequences are shared between individuals. We argue that the use of formal statistical inference methods, of the kind presented in this paper, will be essential for quantitative understanding of the generation and evolution of diversity in the adaptive immune system.
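The central point of the abstract, that one observed sequence can arise from several hidden recombination scenarios and so its generation probability is a sum over them, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' model: the two germline segments in `GENES`, the cap `MAX_INS`, uniform insertion composition, and a factorized scenario probability P(gene) x P(deletions) x P(insertion length). The update in `em_step` is a generic expectation-maximization step, one common way to carry out maximum likelihood inference with hidden variables.

```python
import math
from collections import defaultdict

# Hypothetical toy germline segments (not real TCR genes).
GENES = {"V1": "ACGT", "V2": "ACGA"}
MAX_INS = 4  # assumed cap on the number of inserted nucleotides


def scenarios(seq):
    """List every (gene, n_deleted, inserted) scenario that yields `seq`."""
    found = []
    for name, germline in GENES.items():
        for n_del in range(len(germline) + 1):
            kept = germline[: len(germline) - n_del]
            if seq.startswith(kept):
                found.append((name, n_del, seq[len(kept):]))
    return found


def uniform_params():
    """Starting guess: uniform over genes, deletions, and insertion lengths."""
    max_del = max(len(g) for g in GENES.values())
    p_gene = {g: 1 / len(GENES) for g in GENES}
    p_del = {d: 1 / (max_del + 1) for d in range(max_del + 1)}
    p_ins = {n: 1 / (MAX_INS + 1) for n in range(MAX_INS + 1)}
    return p_gene, p_del, p_ins


def scenario_prob(scenario, params):
    """Probability of one hidden scenario under the factorized toy model."""
    gene, n_del, inserted = scenario
    p_gene, p_del, p_ins = params
    return (p_gene.get(gene, 0.0) * p_del.get(n_del, 0.0)
            * p_ins.get(len(inserted), 0.0) * 0.25 ** len(inserted))


def p_gen(seq, params):
    """Generation probability: sum over all scenarios consistent with `seq`."""
    return sum(scenario_prob(s, params) for s in scenarios(seq))


def log_likelihood(seqs, params):
    return sum(math.log(p_gen(s, params)) for s in seqs)


def em_step(seqs, params):
    """One EM update: posterior-weighted counts, then renormalization."""
    counts = [defaultdict(float) for _ in range(3)]
    for seq in seqs:
        candidates = scenarios(seq)
        weights = [scenario_prob(s, params) for s in candidates]
        total = sum(weights)
        if total == 0:
            continue  # sequence cannot be produced under current parameters
        for (gene, n_del, inserted), w in zip(candidates, weights):
            posterior = w / total
            counts[0][gene] += posterior
            counts[1][n_del] += posterior
            counts[2][len(inserted)] += posterior
    return tuple({k: v / sum(c.values()) for k, v in c.items()} for c in counts)
```

In this toy, the read `"ACG"` is compatible with eight scenarios (four deletion lengths for each of the two segments), which is why counting only a single best alignment would misestimate the underlying event statistics.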

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
A 60 bp CDR3 read (gray box) can be aligned to different genes [nomenclature follows IMGT conventions (24)] with different deletions (white), insertions (yellow), and P-nucleotides (red). (A) Alignment to specific V-, D-, and J-genes with insVD = 13, insDJ = 6, delV = 5, delJ = 6, del5D = 6, del3D = -2 (in other words, pal3D = 2). (B) Alignment of the same read to different V- and D-genes, and with insVD = 15, insDJ = 9, delV = 7, del5D = 9, del3D = 3 (no P-nucleotides). Note that the alignment to the V-gene is not maximal in this case. A few heavily penalized mismatches are allowed (in the V-gene in this example) in order to accommodate a small sequencing error rate. The location of the sequencing primer is indicated: It is chosen to uniquely identify the start of the CDR3 read within each J-gene.
Fig. 2.
(A) Data-derived correlations between sequence features: Each entry is the mutual information I(X,Y) of a feature pair over the naïve nonproductive repertoire. The outlined elements are correlations expected from the form of P_recomb(E): Red identifies a direct effect of a factor in Eq. 1 (e.g., DJ) and green indirect effects (e.g., DJ↔delJ). The top-left half of the matrix shows results from the MLE, while the bottom-right half corresponds to a deterministic maximum-alignment based identification of recombination events. (B) Probability distribution of the number of VD insertions conditioned on the number of DJ insertions for MLE (Upper) and deterministic (Lower) analysis. Each curve corresponds to a different value of insDJ, ranging from 0 (blue) to 10. The curves collapse for MLE, indicating independence.
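As a rough illustration of the quantity plotted in panel A, the mutual information of a feature pair can be estimated from paired observations with a simple plug-in formula. This is a minimal sketch, assuming unweighted (x, y) samples and ignoring finite-sample bias; the function name `mutual_information` is hypothetical, and the paper's own estimator may differ.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X,Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    marginal_y = Counter(y for _, y in pairs)
    # Sum p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over observed pairs.
    return sum(
        (count / n) * math.log2(count * n / (marginal_x[x] * marginal_y[y]))
        for (x, y), count in joint.items()
    )
```

Independent features give a value of zero, which corresponds to the collapsing curves in panel B, while fully dependent binary features give one bit.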
Fig. 3.
Statistics of VD and DJ insertions. (A) Insertion length profiles: maximum likelihood estimate (deterministic estimate) displayed as solid (dashed) lines; error bars show variation across the nine individuals. The distribution tail is accurately exponential. The deterministic estimate greatly overestimates the frequency of zero insertions. Inset: mononucleotide utilization bias. (B) Dinucleotide utilization in insertions; the bias in DJ insertions is very accurately the reverse complement of the VD insertion bias. (C) Higher-order nucleotide bias in VD (blue) and DJ (red) insertions is completely accounted for by dinucleotide statistics.
Fig. 4.
(A) Gene-specific deletion profiles for selected V (red) and J (green) genes: The profiles vary widely from gene to gene but are nearly identical across individuals (all nine are plotted; one in gray from an individual with significantly smaller sample size). The blue curves in all panels show the predictions of a simple model for the sequence context dependence of deletion probabilities using a position weight matrix (PWM), fit to the V deletion profiles (see SI Appendix for details). The model ignores P-nucleotide generation and lacks any effects of distance from the gene end but performs reasonably well (r^2 = 0.7). (B) Sequence logo of the context dependence of deletion probability, from the PWM fit to the V deletion profiles. Only positions within 3 nucleotides of the deletion site have strong effects on the probability. (C) Cumulative deletion profiles for V-genes and J-genes. Error bars indicate variation across individuals.
Fig. 5.
(A) Entropy decomposition. Top bars: Sequence entropy is smaller than recombination entropy by 5 bits because of convergent recombination; Bottom bars: Recombination event entropy decomposed into contributions from gene choice, insertions, and deletions. (B) Statistics of the 21 CDR3 sequences shared between pairs of individuals: actual (red) vs. expected on the basis of the inferred P_gen(σ) (blue). (C) Histogram of P_gen(σ) for all sequences (blue) and for the 21 shared sequences (red, kernel density estimate); [formula] for the full repertoire is indicated by the vertical green line.
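The gap between recombination entropy and sequence entropy in panel A can be illustrated with a toy calculation: when several scenarios converge on the same sequence, the distribution over sequences is more concentrated than the distribution over scenarios. The four scenarios and their probabilities in `TOY_SCENARIOS` are invented for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict

# Hypothetical scenarios: label -> (resulting sequence, scenario probability).
TOY_SCENARIOS = {
    "s1": ("ACG", 0.25),
    "s2": ("ACG", 0.25),  # converges on the same sequence as s1
    "s3": ("ACT", 0.25),
    "s4": ("AGT", 0.25),
}


def entropy_bits(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sequence_distribution(scenario_table):
    """Sum scenario probabilities that lead to the same sequence."""
    p_seq = defaultdict(float)
    for seq, p in scenario_table.values():
        p_seq[seq] += p
    return dict(p_seq)


recombination_entropy = entropy_bits(p for _, p in TOY_SCENARIOS.values())
sequence_entropy = entropy_bits(sequence_distribution(TOY_SCENARIOS).values())
```

Here the recombination entropy is 2 bits and the sequence entropy is 1.5 bits; the 0.5-bit difference plays the role of the 5-bit gap attributed to convergent recombination in the caption.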

References

    1. Murphy KP, Travers P, Walport M, Janeway C. Janeway’s Immunobiology. New York: Garland; 2008.
    2. Freeman JD, Warren RL, Webb JR, Nelson BH, Holt RA. Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome Res. 2009;19:1817–1824.
    3. Weinstein JA, et al. High-throughput sequencing of the zebrafish antibody repertoire. Science. 2009;324:807–810.
    4. Robins HS, et al. Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood. 2009;114:4099–4107.
    5. Robins HS, et al. Overlap and effective size of the human CD8+ T-cell repertoire. Sci Transl Med. 2010;2:47ra64.