MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics

Nat Methods. 2017 May;14(5):513-520. doi: 10.1038/nmeth.4256. Epub 2017 Apr 10.

Andy T Kong¹,², Felipe V Leprevost², Dmitry M Avtonomov², Dattatreya Mellacheruvu², Alexey I Nesvizhskii¹,²

Affiliations

¹ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA.
² Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA.

PMID: 28394336
PMCID: PMC5409104
DOI: 10.1038/nmeth.4256

Andy T Kong et al. Nat Methods. 2017 May.

Abstract

There is a need to better understand and handle the 'dark matter' of proteomics: the vast diversity of post-translational and chemical modifications that are unaccounted for in a typical mass spectrometry-based analysis and thus remain unidentified. We present a fragment-ion indexing method, and its implementation in the peptide identification tool MSFragger, that enables a more than 100-fold improvement in speed over most existing proteome database search tools. Using several large proteomic data sets, we demonstrate how MSFragger empowers the open database search concept for comprehensive identification of peptides and all their modified forms, uncovering dramatic differences in modification rates across experimental samples and conditions. We further illustrate its utility using protein-RNA cross-linked peptide data and using affinity purification experiments where we observe, on average, a 300% increase in the number of identified spectra for enriched proteins. We also discuss the benefits of open searching for improved false discovery rate estimation in proteomics.

Conflict of interest statement

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.

Figures

**Figure 1. Database search strategies and the MSFragger algorithm**
(a) Conventional database search involves in-silico digestion of a protein database into candidate peptides, from which theoretical spectra are sequentially generated and compared against experimental spectra one at a time. (b) MSFragger digests a protein database and generates a non-redundant set of peptides that are arranged in a peptide index. This index is then used as a reference to generate a fragment index that allows for rapid retrieval of the theoretical spectra that contain a fragment of a query mass. This fragment index is then used for the efficient and simultaneous scoring of an experimental spectrum against all candidate spectra. (c) Mass binning and precursor mass ordering within the fragment index allow for rapid retrieval of the candidate spectra that match a given experimental fragment ion. Scores of candidate peptides corresponding to retrieved spectra are incremented. (d) Processing of all experimental fragment ions results in the identification of all matching fragments between the experimental spectrum and all candidate theoretical spectra, decomposing spectrum-to-spectra matching into fragment-to-spectra matching. Matched fragments can then be used to compute a similarity score.
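The indexing idea in panels (b) to (d) can be illustrated with a toy sketch. This is not MSFragger's implementation: the bin width, the one-bin matching tolerance, the use of a simple shared-fragment count as the score, the restriction to singly charged b- and y-ions, and every function name below are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Monoisotopic residue masses (Da) for a handful of amino acids.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259}
WATER, PROTON = 18.010565, 1.007276
BIN_WIDTH = 0.02  # hypothetical fragment bin width in Da


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE[a] for a in seq) + WATER


def fragment_mzs(seq):
    """Singly charged fragment m/z values, ordered b1, y(n-1), b2, y(n-2), ..."""
    masses = [RESIDUE[a] for a in seq]
    total, prefix, out = sum(masses), 0.0, []
    for m in masses[:-1]:
        prefix += m
        out.append(prefix + PROTON)                  # b ion
        out.append(total - prefix + WATER + PROTON)  # complementary y ion
    return out


def build_index(peptides):
    """Peptide index sorted by precursor mass, plus fragment bin -> peptide ids."""
    ordered = sorted(set(peptides), key=peptide_mass)
    masses = [peptide_mass(p) for p in ordered]
    index = defaultdict(list)
    for pid, pep in enumerate(ordered):  # ids rise with precursor mass,
        for mz in fragment_mzs(pep):     # so every bin stays sorted by mass
            index[int(mz / BIN_WIDTH)].append(pid)
    return ordered, masses, index


def search(peaks, precursor_mass, ordered, masses, index, window):
    """Count matched fragments for every candidate inside the precursor window."""
    lo = bisect_left(masses, precursor_mass - window)
    hi = bisect_right(masses, precursor_mass + window)
    counts = defaultdict(int)
    for mz in peaks:
        b = int(mz / BIN_WIDTH)
        for nb in (b - 1, b, b + 1):  # tolerate one bin on either side
            ids = index.get(nb, [])
            for pid in ids[bisect_left(ids, lo):bisect_left(ids, hi)]:
                counts[pid] += 1
    return sorted(((n, ordered[p]) for p, n in counts.items()), reverse=True)


ordered, masses, index = build_index(["PEAK", "GASPK", "VLEK"])

# Narrow-window query: an unmodified PEAK spectrum.
narrow = search(fragment_mzs("PEAK"), peptide_mass("PEAK"),
                ordered, masses, index, window=0.01)

# Open query: a +15.9949 Da shift on the N-terminal residue moves every
# b ion (even positions in the list) but leaves the y ions untouched.
shift = 15.9949
shifted = [mz + shift if i % 2 == 0 else mz
           for i, mz in enumerate(fragment_mzs("PEAK"))]
open_hits = search(shifted, peptide_mass("PEAK") + shift,
                   ordered, masses, index, window=500.0)
print(narrow[0], open_hits[0])
```

Because each bin stores peptide ids in precursor-mass order, restricting a lookup to the precursor window is a pair of binary searches rather than a scan, which is what keeps the wide windows of open searching affordable in this sketch.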

**Figure 2. Peptide identifications across traditional narrow window and open searches demonstrate underestimation of false discovery rates**
(a) Peptides passing a 1% FDR filter in both narrow window and open searches are compared. Common peptides are of high confidence. Peptides found only in open search are also of high confidence, suggesting that many peptides are found only in their modified forms. High FDR was observed for peptides unique to narrow window search. (b) The profile of peptides that were found only in modified forms is similar to that of all modified peptides. PSMs from peptides that were found only in narrow window search were mapped to their higher scoring matches in open search and generated a profile devoid of modifications that can be easily represented as some series of amino acid insertions and deletions. These PSMs may be false positive events that arise from modifications unaccounted for in narrow window search. (c) PSMs supporting peptides found only in narrow window search are commonly matched to peptides with greater counts in open search, giving greater confidence to their open assignment. (d) Peptides suspected to be false positives as a result of modifications unaccounted for in narrow window search are plotted across peptide confidences. Their numbers exceed the number of decoys and they are prevalent in ranges of high peptide confidence, suggesting that they are not well estimated by the target-decoy strategy and cannot be eliminated using any scoring threshold. (e) Confirmation of target-decoy violation by examining PSMs with common modifications in narrow window search.
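For readers unfamiliar with the target-decoy strategy referred to in this caption, the following is a minimal sketch of the usual estimate, in which the FDR at a score threshold is approximated as the number of decoy matches divided by the number of target matches at or above that threshold. The function name, the input layout, and the absence of tie handling are assumptions for illustration; the paper's exact filtering procedure is not described on this page.

```python
def filter_at_fdr(psms, max_fdr=0.01):
    """Find the most permissive score cutoff whose decoy/target ratio stays
    within max_fdr.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    Returns (cutoff_score, n_targets_accepted), or (None, 0) if no cutoff works.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = (None, 0)
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among everything scoring at least `score`.
        if targets and decoys / targets <= max_fdr:
            best = (score, targets)
    return best


# Toy data: 100 confident targets, then decoys interleaved with weak targets.
toy = [(s, False) for s in range(101, 201)]
toy += [(100, True), (99, False), (98, True), (97, False)]
cutoff, accepted = filter_at_fdr(toy)
print(cutoff, accepted)
```

The estimate only counts errors that behave like decoys. A spectrum from a modified peptide that is confidently assigned to the wrong unmodified target sequence is invisible to it, which is the kind of underestimation panels (d) and (e) describe.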

**Figure 3. Analysis of large-scale shotgun proteomics experiments reveals differences in modification profiles**
Mass difference features are identified with high mass accuracy and aligned across multiple experiments. Features are characterized by their localization rates and amino acid propensities. (a) Common modifications are present across different experiments with vastly different modification rates. Modifications are sometimes localized to amino acids that are unaccounted for in traditional workflows. (b) Large numbers of abundant features were found to be unique to particular experiments. Localization information assisted in characterizing these unknown modifications. (c) Highly abundant mass features were observed in which the mass difference could not be effectively localized.

**Figure 4. Open searching detects modified peptides containing labile modifications**
Spectral similarity scores for each mass bin were computed to capture the spectral similarity between a modified peptide and its unmodified counterpart. Most modifications, such as phosphorylation, have average similarities between 0.4 and 0.6. Modifications that are localized to the peptide C-terminus disrupt the intense y-ion series and have lower similarity scores. Few mass bins contain low similarity scores, as these modified peptides would otherwise be impossible to detect using open searching. Interestingly, there exists a population of mass bins that have similarity scores exceeding that of carbon-13 (which leaves a largely unaltered spectrum). These modifications may represent labile modifications that are lost during peptide fragmentation.
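The caption does not say which similarity measure was used. As a hedged illustration, the sketch below uses cosine similarity between binned peak lists, a common choice; the 1 Da bin width and the function names are assumptions.

```python
import math
from collections import defaultdict


def binned(peaks, width=1.0):
    """Sum intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / width)] += intensity
    return vec


def cosine_similarity(spec_a, spec_b, width=1.0):
    """Cosine of the angle between two binned spectra (0 = disjoint, 1 = identical shape)."""
    va, vb = binned(spec_a, width), binned(spec_b, width)
    dot = sum(v * vb.get(k, 0.0) for k, v in va.items())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Unmodified spectrum versus a modified one in which half of the fragments
# carry an 80 Da shift and so fall into different bins.
unmodified = [(100.2, 1.0), (200.2, 1.0), (300.2, 1.0), (400.2, 1.0)]
modified = [(100.2, 1.0), (200.2, 1.0), (380.2, 1.0), (480.2, 1.0)]
print(cosine_similarity(unmodified, modified))
```

Under this measure a modification that shifts roughly half of the fragment ions gives a similarity near 0.5, while a modification lost before or during fragmentation leaves the fragments unshifted and the similarity close to 1.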

**Figure 5. Application of MSFragger to diverse proteomics experiments**
(a) The speed of MSFragger allows for reasonable analysis times even when the SILAC labels are specified as variable modifications in conjunction with open searching. In this comparison between a panel of breast tissues and a heavy-labeled super-SILAC mix, we observe differences in their modification profiles, with certain modifications unique to the super-SILAC mix. (b) Low sample complexities in affinity purification mass spectrometry experiments allow lower abundance modified peptides to be more effectively sampled. On average, across a dataset consisting of 2594 bait proteins, the number of bait PSMs identified in open search was 3.88 times that of narrow window search. (c) Open searching of an RNA-protein crosslinking dataset using MSFragger successfully identifies RNA crosslinked peptides. 134 of the 189 originally reported crosslinked peptides were recovered. Shorter crosslinked peptides are unlikely to have sufficient non-shifted fragment ions for detection in open searching and account for the majority of peptides not recovered.

Cited by

AI-Assisted Processing Pipeline to Boost Protein Isoform Detection.
The M, Picciani M, Jensen C, Gabriel W, Kuster B, Wilhelm M. Methods Mol Biol. 2024;2836:157-181. doi: 10.1007/978-1-0716-4007-4_10. PMID: 38995541
OVOL2 sustains postnatal thymic epithelial cell identity.
Zhong X, Peddada N, Wang J, Moresco JJ, Zhan X, Shelton JM, SoRelle JA, Keller K, Lazaro DR, Moresco EMY, Choi JH, Beutler B. Nat Commun. 2023 Nov 27;14(1):7786. doi: 10.1038/s41467-023-43456-z. PMID: 38012144
Protein degradation by human 20S proteasomes elucidates the interplay between peptide hydrolysis and splicing.
Soh WT, Roetschke HP, Cormican JA, Teo BF, Chiam NC, Raabe M, Pflanz R, Henneberg F, Becker S, Chari A, Liu H, Urlaub H, Liepe J, Mishto M. Nat Commun. 2024 Feb 7;15(1):1147. doi: 10.1038/s41467-024-45339-3. PMID: 38326304
Web of venom: exploration of big data resources in animal toxin research.
Zancolli G, von Reumont BM, Anderluh G, Caliskan F, Chiusano ML, Fröhlich J, Hapeshi E, Hempel BF, Ikonomopoulou MP, Jungo F, Marchot P, de Farias TM, Modica MV, Moran Y, Nalbantsoy A, Procházka J, Tarallo A, Tonello F, Vitorino R, Zammit ML, Antunes A. Gigascience. 2024 Jan 2;13:giae054. doi: 10.1093/gigascience/giae054. PMID: 39250076
Study on Tissue Homogenization Buffer Composition for Brain Mass Spectrometry-Based Proteomics.
Karpiński AA, Torres Elguera JC, Sanner A, Konopka W, Kaczmarek L, Winter D, Konopka A, Bulska E. Biomedicines. 2022 Oct 2;10(10):2466. doi: 10.3390/biomedicines10102466. PMID: 36289728
