Bioinformatics. 2019 Sep 1;35(17):2974-2981.
doi: 10.1093/bioinformatics/btz035.

OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs

Zachary Sethna et al. Bioinformatics.

Abstract

Motivation: High-throughput sequencing of large immune repertoires has enabled the development of methods to predict the probability that V(D)J recombination generates a T- or B-cell receptor with any specific nucleotide sequence. These generation probabilities are highly heterogeneous, ranging over 20 orders of magnitude in real repertoires. Since the function of a receptor depends on its protein sequence, it is important to be able to predict this probability of generation at the amino acid level. However, brute-force summation over all the nucleotide sequences with the correct amino acid translation is computationally intractable. This paper presents a solution to this problem.
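To give a feel for the size of the brute-force sum, the following minimal sketch counts how many nucleotide sequences translate to a given amino acid sequence by multiplying the codon degeneracy of each residue. It assumes the standard genetic code; the function name and the example CDR3 are hypothetical and are not taken from OLGA.

```python
from itertools import product

# Standard genetic code with codons enumerated in TCAG order.
# (Assumption of this sketch; OLGA's own tables are not reproduced here.)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (and '*' for stop) to the list of codons encoding it.
CODONS_FOR = {}
for index, codon in enumerate(product(BASES, repeat=3)):
    CODONS_FOR.setdefault(AMINO_ACIDS[index], []).append("".join(codon))


def count_coding_sequences(aa_seq):
    """Return the number of distinct nucleotide sequences translating to aa_seq."""
    total = 1
    for aa in aa_seq:
        total *= len(CODONS_FOR[aa])
    return total


if __name__ == "__main__":
    example = "CASSLGQAYEQYF"  # hypothetical CDR3-like sequence, for illustration
    print(example, count_coding_sequences(example))
```

The count is a product of per-residue degeneracies, so it grows exponentially with CDR3 length, which is why direct enumeration does not scale.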

Results: We use dynamic programming to construct an efficient and flexible algorithm, called OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid sequences), for calculating the probability of generating a given CDR3 amino acid sequence or motif, with or without V/J restriction, as a result of V(D)J recombination in B or T cells. We apply it to databases of epitope-specific T-cell receptors to evaluate the probability that a typical human subject will possess T cells responsive to specific disease-associated epitopes. The model predictions show excellent agreement with published data. We suggest that OLGA may be a useful tool to guide vaccine design.
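The general idea of replacing enumeration with dynamic programming can be illustrated on a toy problem. The sketch below is not OLGA's algorithm and does not model V(D)J recombination: it assumes a first-order Markov model of nucleotides with made-up parameters (START, TRANS) and sums the probability of every nucleotide sequence that encodes a given amino acid sequence, once by enumeration and once by a codon-by-codon recursion. All names are hypothetical.

```python
from itertools import product
import math

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_FOR = {}
for index, codon in enumerate(product(BASES, repeat=3)):
    CODONS_FOR.setdefault(AMINO_ACIDS[index], []).append("".join(codon))

# Toy first-order Markov model of nucleotides (invented parameters, NOT OLGA's).
START = {"T": 0.1, "C": 0.2, "A": 0.3, "G": 0.4}
TRANS = {a: {b: (0.4 if a == b else 0.2) for b in BASES} for a in BASES}


def nt_probability(nt_seq):
    """Probability of one non-empty nucleotide sequence under the toy model."""
    p = START[nt_seq[0]]
    for prev, cur in zip(nt_seq, nt_seq[1:]):
        p *= TRANS[prev][cur]
    return p


def pgen_brute_force(aa_seq):
    """Sum over every codon choice; cost grows exponentially with len(aa_seq)."""
    total = 0.0
    for codons in product(*(CODONS_FOR[aa] for aa in aa_seq)):
        total += nt_probability("".join(codons))
    return total


def pgen_dynamic(aa_seq):
    """Same sum for a non-empty aa_seq, computed in time linear in its length."""
    # weight[b]: summed probability of all nucleotide prefixes that encode the
    # amino acids processed so far and end in base b.
    weight = None
    for aa in aa_seq:
        new_weight = dict.fromkeys(BASES, 0.0)
        for c in CODONS_FOR[aa]:
            # Probability of the two transitions inside the codon.
            inner = TRANS[c[0]][c[1]] * TRANS[c[1]][c[2]]
            if weight is None:
                entry = START[c[0]]  # first codon starts the chain
            else:
                entry = sum(weight[prev] * TRANS[prev][c[0]] for prev in BASES)
            new_weight[c[2]] += entry * inner
        weight = new_weight
    return sum(weight.values())


if __name__ == "__main__":
    seq = "CASSL"  # short enough for the brute-force sum to finish quickly
    print(pgen_brute_force(seq), pgen_dynamic(seq))
```

Only the last base of the prefix needs to be remembered here because the toy model has first-order memory; a realistic recombination model requires a richer state, which this sketch does not attempt.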

Availability and implementation: Source code is available at https://github.com/zsethna/OLGA.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Partitioning a CDR3 sequence: boxes correspond to nucleotides and are indexed by integers. Each group of three boxes (identified by heavier boundary lines) corresponds to an amino acid. The nucleotide positions x1, …, x4 identify the boundaries between different elements of the partition. The V, M, D(D), N and J(D) matrices define cumulated weights corresponding to each of the five elements.
Fig. 2.
Monte Carlo estimate of the generation probability of amino acid CDR3 sequences, P_gen^aa, versus OLGA’s predictions (mouse TRB). The horizontal lines at the lower left of the plot represent CDR3s that were generated once, twice, etc., in the MC sample. The one- and two-sigma curves display the deviations from exact equality between simulated and computed P_gen to be expected on the basis of Poisson statistics.
Fig. 3.
Distributions of probabilities of recombination events (P_gen^rec), nucleotide CDR3 sequences (P_gen^nt) and CDR3 amino acid sequences (P_gen^aa) in different contexts. Each curve is determined by Monte Carlo sampling of 10^6 productive sequences for the indicated locus and computing their generation probabilities at the three different levels. Entropies in bits (S) are, up to a ln(2)/ln(10) factor, the negative of the mean of each distribution, indicated by dotted lines.
Fig. 4.
Generation probabilities of human CDR3s that respond to hepatitis C and influenza A epitopes. P_gen^aa of sequences that respond to an epitope are plotted as circles (color encodes density of the points). The fraction of the repertoire specific to each epitope (P_gen^func as defined in Eq. 7) is obtained as the sum of the P_gen^aa for each of the corresponding sequences (values plotted as triangles). (Color version of this figure is available at Bioinformatics online.)
Fig. 5.
Distributions of TRB generation probabilities P_gen^aa for sequences in the VDJdb database that bind to any epitopes of six different viruses (colored curves). For comparison, we plot (black curve) the same distribution for the unsorted TRB repertoire of a typical healthy subject; the 2σ variance represents biological variability across multiple individuals [data from Emerson et al. (2017)]. (Color version of this figure is available at Bioinformatics online.)
Fig. 6.
Mean occurrence frequencies across a collection of 658 human samples of all CDR3 sequences in the VDJdb database, plotted against their computed P_gen^aa (dots, colored by their density in the plot). Also, the net occurrence frequency in the VDJdb database of epitope-related collections of sequences, plotted against their computed P_gen^func (triangles, colored to identify the virus the epitope belongs to). (Color version of this figure is available at Bioinformatics online.)
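Two relations stated in the captions of Fig. 3 and Fig. 4 can be written out as a short sketch: the entropy in bits as the negative mean of log10 of the generation probability, rescaled by ln(10)/ln(2), and the epitope-level probability as a sum of amino acid generation probabilities. The function names are hypothetical, and the sketch assumes the inputs are log10 probabilities of Monte Carlo samples and per-sequence probabilities, respectively.

```python
import math


def entropy_bits(log10_pgens):
    """Entropy estimate in bits from log10 generation probabilities of samples."""
    mean_log10 = sum(log10_pgens) / len(log10_pgens)
    # Convert from base 10 to base 2 and flip the sign.
    return -mean_log10 * math.log(10) / math.log(2)


def pgen_func(pgen_aa_values):
    """Sum of amino acid generation probabilities for one epitope's sequences."""
    return math.fsum(pgen_aa_values)
```

As a sanity check, samples drawn from a uniform distribution over eight outcomes all have probability 1/8, and the estimate is then 3 bits.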

References

    1. Becattini S. et al. (2015) Functional heterogeneity of human memory CD4+ T cell clones primed by pathogens or vaccines. Science, 347, 400–406.
    2. Dash P. et al. (2017) Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature, 547, 89–93.
    3. DeWitt W.S. et al. (2016) A public database of memory and naive B-cell receptor sequences. PLoS One, 11, e0160853.
    4. DeWitt W.S. et al. (2018) Human T cell receptor occurrence patterns encode immune history, genetic background, and receptor specificity. eLife, 7, e38358.
    5. Dupic T. et al. (2018) Genesis of the αβ T-cell receptor. arXiv: 1806.11030.
