repgenHMM: a dynamic programming tool to infer the rules of immune receptor generation from sequence data

Bioinformatics. 2016 Jul 1;32(13):1943-51. doi: 10.1093/bioinformatics/btw112. Epub 2016 Feb 26.

Yuval Elhanati¹, Quentin Marcou¹, Thierry Mora², Aleksandra M Walczak¹

Affiliations

¹ Laboratoire de physique théorique, CNRS, UPMC and Ecole normale supérieure, Paris, France.
² Laboratoire de physique statistique, CNRS, UPMC and Ecole normale supérieure, Paris, France.

PMID: 27153709
PMCID: PMC4920122
DOI: 10.1093/bioinformatics/btw112

Yuval Elhanati et al. Bioinformatics. 2016.

Abstract

Motivation: The diversity of the immune repertoire is initially generated by random rearrangements of the receptor gene during early T and B cell development. Rearrangement scenarios are composed of random events (choices of gene templates, base pair deletions and insertions) described by probability distributions. Not all scenarios are equally likely, and the same receptor sequence may be obtained in several different ways. Quantifying the distribution of these rearrangements is an essential baseline for studying immune system diversity. Inferring the properties of the distributions from receptor sequences is a computationally hard problem, requiring the enumeration of every possible scenario for every sampled receptor sequence.
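
To make the enumeration problem concrete, here is a minimal Python sketch of a toy rearrangement model with one V and one J template. All templates, probability tables and function names are hypothetical and chosen for illustration only; they are not taken from the paper or from the repgenHMM code. The sketch lists every scenario (V deletions, J deletions, insertion) compatible with a sequence and sums their probabilities.

```python
from itertools import product

# Toy, made-up parameters (assumptions, not values from the paper).
V_TEMPLATE = "ACGT"
J_TEMPLATE = "TTGA"
P_DEL_V = {0: 0.5, 1: 0.3, 2: 0.2}    # bases trimmed from the 3' end of V
P_DEL_J = {0: 0.6, 1: 0.4}            # bases trimmed from the 5' end of J
P_INS_LEN = {0: 0.5, 1: 0.3, 2: 0.2}  # number of inserted nucleotides
P_NUC = 0.25                           # uniform usage of inserted nucleotides


def scenarios_for(seq):
    """Yield (del_v, del_j, insert, probability) for each scenario producing seq."""
    for del_v, del_j in product(P_DEL_V, P_DEL_J):
        v_part = V_TEMPLATE[: len(V_TEMPLATE) - del_v]
        j_part = J_TEMPLATE[del_j:]
        n_ins = len(seq) - len(v_part) - len(j_part)
        if n_ins not in P_INS_LEN:
            continue  # insertion length not allowed by the toy model
        if not (seq.startswith(v_part) and seq.endswith(j_part)):
            continue  # trimmed templates do not match the sequence ends
        insert = seq[len(v_part): len(seq) - len(j_part)]
        prob = (P_DEL_V[del_v] * P_DEL_J[del_j]
                * P_INS_LEN[n_ins] * P_NUC ** n_ins)
        yield del_v, del_j, insert, prob


def generation_probability(seq):
    """Sum the probabilities of all scenarios that generate seq."""
    return sum(prob for *_, prob in scenarios_for(seq))
```

In this toy model, `list(scenarios_for("ACGTTTGA"))` contains five distinct scenarios for the same sequence, which illustrates why the generation probability is a sum over scenarios rather than the probability of a single best alignment. With many genes and longer junctions, this explicit enumeration grows quickly, which is the cost a dynamic programming approach is meant to avoid.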

Results: We present a hidden Markov model (HMM), which accounts for all plausible scenarios that can generate the receptor sequences. We developed and implemented a method based on the Baum-Welch algorithm that can efficiently infer the parameters for the different events of the rearrangement process. We tested our software tool on sequence data for both the alpha and beta chains of the T cell receptor. To test the validity of our algorithm, we also generated synthetic sequences produced by a known model, and confirmed that its parameters could be accurately recovered from the sequences. The inferred model can be used to generate synthetic sequences, to calculate the probability of generation of any receptor sequence, as well as the theoretical diversity of the repertoire. We estimate this diversity to be [Formula: see text] for human T cells. The model gives a baseline to investigate the selection and dynamics of immune repertoires.
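
The Baum-Welch algorithm relies on forward and backward recursions that sum over all hidden paths without listing them. The Python sketch below shows only the forward recursion, on a generic two-state HMM, and checks it against brute-force path enumeration. The states, transition and emission tables are hypothetical; this is not the repgenHMM state space, which uses template, insertion and ghost states.

```python
from itertools import product


def forward_likelihood(obs, states, start, trans, emit):
    """Return P(obs) using the forward recursion (dynamic programming)."""
    # alpha[s] = probability of the symbols seen so far, ending in state s
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def brute_force_likelihood(obs, states, start, trans, emit):
    """Return P(obs) by enumerating every hidden path (exponential cost)."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for prev, cur, symbol in zip(path, path[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][symbol]
        total += p
    return total


# Hypothetical toy model: "T" mostly emits a template base (here "A"),
# "I" emits uniformly, loosely mimicking template and insertion states.
STATES = ("T", "I")
START = {"T": 0.9, "I": 0.1}
TRANS = {"T": {"T": 0.7, "I": 0.3}, "I": {"T": 0.4, "I": 0.6}}
EMIT = {
    "T": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
```

Both functions return the same likelihood, but the forward recursion costs on the order of (sequence length) × (number of states)² operations, whereas the enumeration grows exponentially with sequence length.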

Availability and implementation: Source code and sample sequence files are available at https://bitbucket.org/yuvalel/repgenhmm/downloads

Contact: elhanati@lpt.ens.fr or tmora@lps.ens.fr or awalczak@lpt.ens.fr.

Figures

**Fig. 1.**
(a) Schematic description of the rearrangement process for the alpha chains. Random V and J genes are chosen from the genome. A random number of nucleotides are trimmed from their facing ends. These ends are then joined with an insertion segment of variable length and composition. (b) Markov model for this rearrangement process, when the V and J gene choices are known. By following one path along the arrows, the model produces a rearranged receptor gene. Each state denoted by a circle emits a nucleotide. V and J states each emit one nucleotide from the chosen template, up to an error rate. Emissions from the I states are drawn from a specified distribution. The states represented by squares are nonemitting ghost states. The arrows represent the allowed transitions; some of them are marked on the diagram with $M_{SS'}$. The probabilities of the transitions and emissions are the parameters of the HMM, as described in the main text.
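
The generative process in this caption can be sketched in Python as follows. This is an illustrative simplification under stated assumptions: the deletion and insertion counts are drawn up front rather than through explicit ghost-state transitions, an erroneous template emission is assumed to pick uniformly among the three other bases, and the function name and all parameters are hypothetical.

```python
import random


def sample_alpha_like(v_template, j_template, p_del_v, p_del_j,
                      p_ins_len, p_ins_nuc, error_rate, rng):
    """Emit V bases, then inserted bases, then J bases, one per emitting state."""

    def draw(dist):
        # Draw one key of a {value: probability} dictionary.
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]

    def emit_template(base):
        # Emit the template base, or another base with probability error_rate
        # (uniform choice among the other three bases is an assumption).
        if rng.random() < error_rate:
            return rng.choice([b for b in "ACGT" if b != base])
        return base

    del_v, del_j = draw(p_del_v), draw(p_del_j)
    n_ins = draw(p_ins_len)
    v_part = v_template[: len(v_template) - del_v]  # trim the 3' end of V
    j_part = j_template[del_j:]                     # trim the 5' end of J

    seq = [emit_template(b) for b in v_part]         # V states
    seq += [draw(p_ins_nuc) for _ in range(n_ins)]   # I states
    seq += [emit_template(b) for b in j_part]        # J states
    return "".join(seq)
```

For example, `sample_alpha_like("ACGT", "TTGA", {0: 0.5, 1: 0.5}, {0: 1.0}, {0: 0.4, 1: 0.6}, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, 0.01, random.Random(1))` draws one synthetic junction from the toy parameters passed in.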

**Fig. 2.**
Subdiagram of the Markov model for the beta chain, focusing on the D gene. Each row corresponds to a different pattern of deletions $(\mathrm{delDl}, \mathrm{delDr})$ for the left and right ends of the D segment. State $D_{d}^{(\mathrm{delDl}, \mathrm{delDr})}$ corresponds to the $d$th base in the D gene, when $l$ bases are deleted from the left and $r$ from the right. Each row is entered from the ghost state $G_2$ with probability $P_{\mathrm{delD}}(l, r) = P(\mathrm{delDl}, \mathrm{delDr})$, and then proceeds deterministically until $G_3$.

**Fig. 3.**
TCR alpha chain rearrangement distribution inferred from sequence data taken from Zvyagin *et al.* (2014). (a) The log-likelihood of the data given the model saturates as a function of the number of iterations of the Expectation–Maximization algorithm. (b) Shannon entropy of rearrangements (top row) and sequences (middle row). The sequence entropy is lower than the total recombination entropy because of convergent rearrangements. The rearrangement entropy is the sum of entropies of its elementary events (bottom row). (c) Distribution of the number of inserted nucleotides (solid curve). For comparison, the same distribution obtained by the MiXCR software is represented by a dashed line. (d) Distributions of the number of deletions for both V and J genes, averaged over genes. (e) Joint distribution for V and J usage, $P(V, J)$. Genes are ordered by position along the genome. (f) The covariance $P(V, J) - P(V)P(J)$ clearly shows strong correlations for genes that are either close to the separation between the V and J segments, or far from it.
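
The two summary quantities named in this caption, Shannon entropy and the covariance $P(V, J) - P(V)P(J)$, can be computed from a probability table as in the Python sketch below. The function names and the dictionary layout are assumptions made for illustration; they do not reflect the output format of the repgenHMM software.

```python
import math


def shannon_entropy_bits(dist):
    """Shannon entropy in bits of a {outcome: probability} dictionary."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginals(joint):
    """Return (P(V), P(J)) from a joint table {(v, j): P(v, j)}."""
    p_v, p_j = {}, {}
    for (v, j), p in joint.items():
        p_v[v] = p_v.get(v, 0.0) + p
        p_j[j] = p_j.get(j, 0.0) + p
    return p_v, p_j


def vj_covariance(joint):
    """Return {(v, j): P(v, j) - P(v) * P(j)} for every V, J pair."""
    p_v, p_j = marginals(joint)
    return {
        (v, j): joint.get((v, j), 0.0) - p_v[v] * p_j[j]
        for v in p_v
        for j in p_j
    }
```

When V and J choices are independent, every covariance entry is zero and the joint entropy equals the sum of the two marginal entropies; nonzero entries signal the kind of V-J correlation shown in panel (f).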

**Fig. 4.**
Performance of the algorithm on synthetic data. Sequences generated using a known model were given as an input to the inference algorithm. The results of the inference are compared to the true model used for generation, for (a) the distribution of the number of insertions (inset: usage of inserted nucleotides) and (b) V, J gene usage. The error bars, which correspond to sample noise, are smaller than symbol size for (a). In (a) we also report the distribution of insertions obtained using MiXCR.

**Fig. 5.**
TCR beta chain rearrangement distribution inferred from sequence data previously analyzed in Murugan *et al.* (2012). (a) Distribution of the number of insertions at both VD and DJ junctions, and comparison with the distribution of insertions in the alpha chain from Figure 3c. Inset: The nucleotide usage is identical for VD and DJ insertions when considered on opposite strands. (b) Distribution of the number of deletions on both the V and J genes, averaged over different genes.

Cited by

Benchmarking data-driven filtering for denoising of TCRpMHC single-cell data.
Montemurro A, Povlsen HR, Jessen LE, Nielsen M. Sci Rep. 2023 Sep 26;13(1):16147. doi: 10.1038/s41598-023-43048-3. PMID: 37752190.
Antibody repertoire sequencing analysis.
Zhang S, Yang T, Liu X, Yang J, Zheng X. Acta Biochim Biophys Sin (Shanghai). 2022 May 25;54(6):864-873. doi: 10.3724/abbs.2022062. PMID: 35713313. Review.
Computational Strategies for Dissecting the High-Dimensional Complexity of Adaptive Immune Repertoires.
Miho E, Yermanos A, Weber CR, Berger CT, Reddy ST, Greiff V. Front Immunol. 2018 Feb 21;9:224. doi: 10.3389/fimmu.2018.00224. eCollection 2018. PMID: 29515569. Review.
Likelihood-Based Inference of B Cell Clonal Families.
Ralph DK, Matsen FA 4th. PLoS Comput Biol. 2016 Oct 17;12(10):e1005086. doi: 10.1371/journal.pcbi.1005086. eCollection 2016 Oct. PMID: 27749910.
OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs.
Sethna Z, Elhanati Y, Callan CG, Walczak AM, Mora T. Bioinformatics. 2019 Sep 1;35(17):2974-2981. doi: 10.1093/bioinformatics/btz035. PMID: 30657870.

References

1. Bishop C.M. (2006) Pattern Recognition and Machine Learning. New York, USA: Springer.
2. Bolotin D.A. et al. (2012) Next generation sequencing for TCR repertoire profiling: platform-specific features and correction algorithms. Eur. J. Immunol., 42, 3073–3083.
3. Bolotin D.A. et al. (2015) MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods, 12, 380–381.
4. Bonissone S., Pevzner P. (2015) Immunoglobulin classification using the colored antibody graph. In: Przytycka T.M. (ed.) Research in Computational Molecular Biology SE - 7, volume 9029 of Lecture Notes in Computer Science. Switzerland: Springer International Publishing, pp. 44–59.
5. Brochet X. et al. (2008) IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res., 36, 503–508.
