MS-GF+ makes progress towards a universal database search tool for proteomics

doi:10.1038/ncomms6277

Nat Commun. 2014 Oct 31;5:5277.

Sangtae Kim¹, Pavel A Pevzner¹

Affiliation

¹ Department of Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA.

PMID: 25358478
PMCID: PMC5036525
DOI: 10.1038/ncomms6277

Abstract

Mass spectrometry (MS) instruments and experimental protocols are rapidly advancing, but the software tools to analyse tandem mass spectra are lagging behind. We present a database search tool MS-GF+ that is sensitive (it identifies more peptides than most other database search tools) and universal (it works well for diverse types of spectra, different configurations of MS instruments and different experimental protocols). We benchmark MS-GF+ using diverse spectral data sets: (i) spectra of varying fragmentation methods; (ii) spectra of multiple enzyme digests; (iii) spectra of phosphorylated peptides; and (iv) spectra of peptides with unusual fragmentation propensities produced by a novel alpha-lytic protease. For all these data sets, MS-GF+ significantly increases the number of identified peptides compared with commonly used methods for peptide identifications. We emphasize that although MS-GF+ is not specifically designed for any particular experimental set-up, it improves on the performance of tools specifically designed for these applications (for example, specialized tools for phosphoproteomics).

Conflict of interest statement

Contributions: S.K. and P.P. designed the algorithms and the experiments and wrote the manuscript. S.K. implemented the algorithms and performed the data analysis. The authors declare no conflict of interest.

Figures

**Figure 1**
Various spectral types. Spectral types are represented as paths in a graph whose layers are the possible choices of the fragmentation method (Fragmentation), the instrument measuring product ion m/z (Instrument), the protocol used to prepare the sample (Protocol), and the enzyme used to digest proteins (Enzyme). ‘Low’ in Instrument indicates low-resolution instruments (e.g. linear ion-trap), ‘High’ indicates high-resolution instruments (e.g. Orbitrap), and ‘TOF’ indicates time-of-flight instruments. ‘Phosphorylation’ and ‘Ubiquitination’ in Protocol indicate that spectra are generated from phosphopeptides and ubiquitinated peptides, respectively. A path in the graph represents a spectral type. For example, the green path (CID, Low, Phosphorylation, Trypsin) represents low-precision CID spectra of trypsin digests generated from a sample enriched for phosphopeptides. The blue, red, green, and magenta paths represent spectral types of the datasets used in recent studies by Frese et al. [20], Swaney et al. [1], Huttlin et al. [21], and Starita et al. [22], respectively. Different combinations of analysis tools were used for different studies. Frese et al. used an in-house tool for peak filtering, de-isotoping, and charge deconvolution, Mascot for database search, Percolator for re-scoring, and RockerBox [58] for peptide-level FDR control. Swaney et al. used an in-house tool for peak filtering, OMSSA [27] for database search, and an in-house tool for both peptide- and protein-level FDR control. Huttlin et al. used an in-house tool for re-calibrating peak masses, SEQUEST for database search, an in-house tool for re-scoring, and peptide- and protein-level FDR control. Starita et al. used the Trans-Proteomics Pipeline [45] along with SEQUEST for database search. The same datasets were analyzed by MS-GF+ without any additional tools, using scoring parameters trained separately for each spectral type.
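The idea of a spectral type as a path through four layers can be sketched as a small data structure. This is an illustrative sketch only: the class name `SpectralType`, the `LAYERS` table, and the helper `all_spectral_types` are hypothetical, and the option lists contain only the values named in the captions on this page, so the actual figure may include more choices per layer.

```python
from itertools import product
from typing import NamedTuple


class SpectralType(NamedTuple):
    """One path through the graph: one choice per layer (hypothetical name)."""
    fragmentation: str
    instrument: str
    protocol: str
    enzyme: str


# Assumption: only the options mentioned in the captions are listed here.
LAYERS = {
    "fragmentation": ["CID", "ETD", "HCD"],
    "instrument": ["Low", "High", "TOF"],
    "protocol": ["Standard", "Phosphorylation", "Ubiquitination"],
    "enzyme": ["Trypsin", "αLP"],
}


def all_spectral_types(layers=LAYERS):
    """Enumerate every path, i.e. every combination of one option per layer."""
    return [SpectralType(*choice) for choice in product(*layers.values())]


# The green path from the caption.
green_path = SpectralType("CID", "Low", "Phosphorylation", "Trypsin")
```

With these assumed option lists the graph has 3 × 3 × 3 × 2 paths, and `green_path in all_spectral_types()` holds. A tool that trains scoring parameters per spectral type could key its parameter sets on such a tuple.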

**Figure 2**
Benchmarking MS-GF+ against Mascot+Percolator. Percent increases in the number of identified PSMs for MS-GF+ compared to Mascot+Percolator for all 19 datasets. Each bar represents a spectral dataset of a specified spectral type. For (CID, Low, Standard, Trypsin) and (ETD, Low, Standard, Trypsin), there are two corresponding datasets, one from human and the other from yeast. We distinguish them by adding ‘*’ to the yeast datasets. For the (CID, Low, Phosphorylation, Trypsin) and (CID, Low, Ubiquitination, Trypsin) datasets, the numbers of phosphorylated and ubiquitinated PSMs were counted instead of the number of all identified PSMs. For the (ETD, Low, Standard, αLP) dataset, Mascot+Percolator identified no PSMs.
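The percent increase plotted in this figure is a simple ratio, but it is undefined when the baseline tool identifies nothing, as for the αLP dataset. A minimal sketch follows, assuming the usual definition of percent increase; the function name `percent_increase` is hypothetical and the counts in the usage note are made up for illustration, not taken from the paper.

```python
from typing import Optional


def percent_increase(new_count: int, baseline_count: int) -> Optional[float]:
    """Percent increase of new_count over baseline_count.

    Returns None when the baseline is zero, because the ratio is undefined
    (for example, when the baseline tool identifies no PSMs at all).
    """
    if baseline_count == 0:
        return None
    return 100.0 * (new_count - baseline_count) / baseline_count
```

For example, with illustrative counts, `percent_increase(1200, 1000)` gives `20.0`, while `percent_increase(5, 0)` gives `None` rather than raising a division error.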

**Figure 3**
Comparison of MS-GF+ and other tools for diverse spectral types. The numbers of identified PSMs (a–c) or peptides (d) at 1% FDR are shown. Numbers above bars represent the percentage increase in the number of identifications for MS-GF+ compared to other tools. (a) Results for the human datasets with varying fragmentation methods and instruments. MS-GF+, Mascot+Percolator, and Mascot results are shown along with the results in [20]. Percolator greatly increased the number of identifications compared to Mascot alone, but MS-GF+ outperformed Mascot+Percolator for all the datasets. (b) Increase in the number of identifications due to the availability of high-precision product ion peaks. For the three human datasets representing HH spectra, MS-GF+, Mascot+Percolator, and Mascot were run using search parameters for HL spectra. The results of these searches (denoted by HL) are compared with the numbers of identifications for the regular searches (denoted by HH). HH searches identified more PSMs than HL searches for every tool and every dataset. The difference was larger for CID and HCD than for ETD spectra. (c) Results for the yeast datasets with varying fragmentation methods and enzymes. MS-GF+ and Mascot+Percolator results are shown. MS-GF+ outperformed Mascot+Percolator for all these datasets. (d) Comparison of MS-GF+ and the results in [1], which used OMSSA along with in-house post-processing tools for the yeast datasets. The numbers of (unique) peptides at 1% peptide-level FDR are shown. In [1], only identified peptides matched to proteins identified at 1% protein-level FDR were counted, while for MS-GF+, identified peptides were counted regardless of their matched proteins.

**Figure 4**
Constructing a Directed Acyclic Graph (DAG) in the case of two “amino acids” with real masses 2.012 and 2.996. Assume that only singly charged b-ions with a real offset of 1.008 contribute to the scoring. The spectrum S is converted into *S′* by shifting each peak by 1.008 to the left. Each arrowed line in *S′* represents a pair of peaks separated by approximately 2 Da (blue) or 3 Da (red) that forms a duo (solid) or does not form a duo (dashed) for a fragment mass tolerance of 0.01 Da. A DAG G is constructed from *S′*. The number in the vertex represents its label. The color of the edge represents its label (0 for dashed grey and 1 for solid black).
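The shifting and pairing steps in this caption can be sketched in a few lines. This is one reading of the caption, not the MS-GF+ implementation: it assumes that "approximately 2 Da or 3 Da" means the gap rounds to the nominal mass, and that a pair is a duo when its gap is within the 0.01 Da tolerance of the real amino acid mass. The names `shift_spectrum` and `classify_pairs`, the amino acid labels "A" and "B", and the example peak list are all hypothetical.

```python
from itertools import combinations

B_ION_OFFSET = 1.008                           # real offset from the caption
AMINO_ACID_MASSES = {"A": 2.012, "B": 2.996}   # labels "A"/"B" are assumed
TOLERANCE = 0.01                               # fragment mass tolerance (Da)


def shift_spectrum(peaks, offset=B_ION_OFFSET):
    """Convert S into S' by moving every peak `offset` to the left."""
    return sorted(mz - offset for mz in peaks)


def classify_pairs(shifted, masses=AMINO_ACID_MASSES, tol=TOLERANCE):
    """List peak pairs whose gap is nominally one amino acid mass.

    Each entry is (low, high, amino_acid, is_duo); is_duo is True when the
    gap matches the real mass within the tolerance (assumed definition).
    """
    pairs = []
    for low, high in combinations(shifted, 2):
        gap = high - low
        for name, real_mass in masses.items():
            if round(gap) == round(real_mass):
                pairs.append((low, high, name, abs(gap - real_mass) <= tol))
    return pairs


# Made-up spectrum: two clean gaps and one gap that is off by about 0.07 Da.
example_peaks = [1.008, 3.020, 6.016, 8.100]
example_pairs = classify_pairs(shift_spectrum(example_peaks))
```

In this toy spectrum the first two pairs are duos (solid arrows in the figure's terms) and the last pair is nominally 2 Da apart but outside the tolerance (a dashed arrow).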

**Figure 5**
Illustration of the MS-GF+ Directed Acyclic Graph (DAG) scoring. The peptide ABAA is converted into its Boolean string P = 010010101 and the spectrum S is converted into a labeled DAG G as described in the text. The number in the vertex represents its label. The color of the edge represents its label (0 for grey and 1 for black). The vertex i is colored depending on the peptide character i (white for 0 and black for 1). We also color vertex 0 as black. The procedure to compute Score(*P, G*) is illustrated. All edges are partitioned into 8 classes depending on *s_i,j*, *p_i*, and *p_j*. For example, there are 5 edges with *s_i,j* = *p_i* = *p_j* = 0.
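The Boolean string in this caption can be reproduced under a simple assumption: with nominal masses A = 2 and B = 3 (the rounded masses from Figure 4), position i of the string is 1 exactly when i is a prefix mass of the peptide. The sketch below also groups labeled edges into the classes (*s_i,j*, *p_i*, *p_j*) mentioned in the caption. The function names and the toy edge list are hypothetical, and the actual Score(*P, G*) computation is described in the paper's text and is not reproduced here.

```python
from collections import Counter

NOMINAL_MASSES = {"A": 2, "B": 3}  # assumed rounding of 2.012 and 2.996


def peptide_to_boolean_string(peptide, nominal_masses=NOMINAL_MASSES):
    """Return a string with '1' at every prefix mass (1-indexed), else '0'."""
    total = sum(nominal_masses[aa] for aa in peptide)
    bits = ["0"] * total
    prefix = 0
    for aa in peptide:
        prefix += nominal_masses[aa]
        bits[prefix - 1] = "1"
    return "".join(bits)


def edge_classes(boolean_string, labeled_edges):
    """Count edges (i, j, s_ij) by class (s_ij, p_i, p_j).

    Vertex 0 is treated as 1, following the caption's note that vertex 0
    is colored black.
    """
    p = "1" + boolean_string
    return Counter((s, int(p[i]), int(p[j])) for i, j, s in labeled_edges)


boolean_abaa = peptide_to_boolean_string("ABAA")

# Toy edges (i, j, label); not the edges drawn in the figure.
toy_classes = edge_classes(boolean_abaa, [(0, 2, 1), (2, 5, 1), (0, 3, 0)])
```

For ABAA the prefix masses are 2, 5, 7, and 9, which yields `010010101`, the string given in the caption.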


Cited by

Short-term acidification promotes diverse iron acquisition and conservation mechanisms in upwelling-associated phytoplankton.
Lampe RH, Coale TH, Forsch KO, Jabre LJ, Kekuewa S, Bertrand EM, Horák A, Oborník M, Rabines AJ, Rowland E, Zheng H, Andersson AJ, Barbeau KA, Allen AE. Nat Commun. 2023 Nov 8;14(1):7215. doi: 10.1038/s41467-023-42949-1. PMID: 37940668.
Comprehensive Overview of Bottom-Up Proteomics using Mass Spectrometry.
Jiang Y, Rex DAB, Schuster D, Neely BA, Rosano GL, Volkmar N, Momenzadeh A, Peters-Clarke TM, Egbert SB, Kreimer S, Doud EH, Crook OM, Yadav AK, Vanuopadath M, Mayta ML, Duboff AG, Riley NM, Moritz RL, Meyer JG. ArXiv [Preprint]. 2023 Nov 13:arXiv:2311.07791v1. Update in: ACS Meas Sci Au. 2024 Jun 04;4(4):338-417. doi: 10.1021/acsmeasuresciau.3c00068. PMID: 38013887.
A pan-cancer transcriptome analysis of exitron splicing identifies novel cancer driver genes and neoepitopes.
Wang TY, Liu Q, Ren Y, Alam SK, Wang L, Zhu Z, Hoeppner LH, Dehm SM, Cao Q, Yang R. Mol Cell. 2021 May 20;81(10):2246-2260.e12. doi: 10.1016/j.molcel.2021.03.028. Epub 2021 Apr 15. PMID: 33861991.
Remodeling of the human skeletal muscle proteome found after long-term endurance training but not after strength training.
Emanuelsson EB, Arif M, Reitzner SM, Perez S, Lindholm ME, Mardinoglu A, Daub C, Sundberg CJ, Chapman MA. iScience. 2023 Dec 5;27(1):108638. doi: 10.1016/j.isci.2023.108638. eCollection 2024 Jan 19. PMID: 38213622.
Identification of modified peptides using localization-aware open search.
Yu F, Teo GC, Kong AT, Haynes SE, Avtonomov DM, Geiszler DJ, Nesvizhskii AI. Nat Commun. 2020 Aug 13;11(1):4065. doi: 10.1038/s41467-020-17921-y. PMID: 32792501.


References

1. Swaney DL, Wenger CD, Coon JJ. Value of using multiple proteases for large-scale mass spectrometry-based proteomics. J Proteome Res. 2010;9:1323–9.
2. Eng J, McCormack A, Yates J. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994;5:976–89.
3. Perkins D, Pappin D, Creasy D, Cottrell J. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20:3551–67.
4. Cox J, et al. Andromeda: A peptide search engine integrated into the MaxQuant environment. J Proteome Res. 2011;10:1794–805.
5. Wenger CD, Coon JJ. A proteomics search algorithm specifically designed for high-resolution tandem mass spectra. J Proteome Res. 2013;12:1377–86.


