Nat Methods. 2017 May;14(5):513-520.
doi: 10.1038/nmeth.4256. Epub 2017 Apr 10.

MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics

Andy T Kong et al. Nat Methods. 2017 May.

Abstract

There is a need to better understand and handle the 'dark matter' of proteomics: the vast diversity of post-translational and chemical modifications that are unaccounted for in a typical mass spectrometry-based analysis and thus remain unidentified. We present a fragment-ion indexing method, and its implementation in the peptide identification tool MSFragger, that enables a more than 100-fold improvement in speed over most existing proteome database search tools. Using several large proteomic data sets, we demonstrate how MSFragger empowers the open database search concept for comprehensive identification of peptides and all their modified forms, uncovering dramatic differences in modification rates across experimental samples and conditions. We further illustrate its utility using protein-RNA cross-linked peptide data and using affinity purification experiments where we observe, on average, a 300% increase in the number of identified spectra for enriched proteins. We also discuss the benefits of open searching for improved false discovery rate estimation in proteomics.

Conflict of interest statement

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.

Figures

Figure 1. Database search strategies and the MSFragger algorithm
(a) Conventional database search involves in-silico digestion of a protein database into candidate peptides, from which theoretical spectra are sequentially generated and compared against experimental spectra one at a time. (b) MSFragger digests a protein database and generates a non-redundant set of peptides that are arranged in a peptide index. This index is then used as a reference to generate a fragment index that allows for rapid retrieval of theoretical spectra that contain a fragment of a query mass. This fragment index is then used for the efficient and simultaneous scoring of an experimental spectrum against all candidate spectra. (c) Mass binning and precursor mass ordering within the fragment index allow for rapid retrieval of candidate spectra that match a given experimental fragment ion. Scores of candidate peptides corresponding to retrieved spectra are incremented. (d) Processing of all experimental fragment ions results in the identification of all matching fragments between the experimental spectrum and all candidate theoretical spectra, decomposing spectrum-to-spectra matching into fragment-to-spectra matching. Matched fragments can then be used to compute a similarity score.
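The indexing idea in panels (b) to (d) can be illustrated with a small Python sketch. This is not MSFragger's implementation: the bin width, the function names, the restriction to singly charged b- and y-ions, and the use of a plain matched-fragment count as the score are all assumptions made for illustration. Only the overall structure follows the caption: peptides ordered by precursor mass, fragments grouped into mass bins, and candidate scores incremented per matched experimental fragment.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Standard monoisotopic residue masses (Da) for a handful of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728
BIN_WIDTH = 0.02  # hypothetical fragment bin width in Da


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[a] for a in seq) + WATER


def fragment_mzs(seq):
    """Singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[a] for a in seq]
    total = sum(masses)
    prefix = 0.0
    out = []
    for m in masses[:-1]:
        prefix += m
        out.append(prefix + PROTON)                  # b ion
        out.append(total - prefix + WATER + PROTON)  # y ion
    return out


def build_index(peptides):
    """Peptide index ordered by precursor mass, plus a binned fragment index."""
    peps = sorted(set(peptides), key=peptide_mass)
    masses = [peptide_mass(p) for p in peps]
    index = defaultdict(list)
    for pid, pep in enumerate(peps):
        for mz in fragment_mzs(pep):
            # Peptide ids are appended in increasing order, so every bin
            # stays sorted by precursor mass.
            index[int(mz / BIN_WIDTH)].append(pid)
    return peps, masses, index


def score_spectrum(spectrum_mzs, precursor_mass, tol, peps, masses, index):
    """Count matched fragments for every candidate within the precursor window."""
    lo = bisect_left(masses, precursor_mass - tol)
    hi = bisect_right(masses, precursor_mass + tol)
    counts = defaultdict(int)
    for mz in spectrum_mzs:
        ids = index.get(int(mz / BIN_WIDTH), [])
        # Binary search within the bin restricts retrieval to ids in [lo, hi).
        for pid in ids[bisect_left(ids, lo):bisect_left(ids, hi)]:
            counts[pid] += 1
    return {peps[pid]: n for pid, n in counts.items()}


peps, masses, index = build_index(["GASP", "VLKR", "GAVL", "PSAG"])
query = fragment_mzs("VLKR")
narrow = score_spectrum(query, peptide_mass("VLKR"), 0.01, peps, masses, index)
wide = score_spectrum(query, peptide_mass("VLKR"), 500.0, peps, masses, index)
```

Widening `tol` from a narrow window to hundreds of daltons changes only the `[lo, hi)` range, which is why one index can serve both narrow window and open searches. The sketch looks up only the exact bin of each experimental peak; a practical implementation would also need to handle matches that fall across bin edges.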
Figure 2. Peptide identifications across traditional narrow window and open searches demonstrate false discovery rate underestimation
(a) Peptides passing a 1% FDR filter in both narrow window and open searches are compared. Common peptides are of high confidence. Peptides found only in open search are also of high confidence, suggesting that many peptides are found only in their modified forms. High FDR was observed for peptides unique to narrow window search. (b) The profile of peptides that were found only in modified forms is similar to that of all modified peptides. PSMs from peptides that were found only in narrow window search were mapped to their higher scoring matches in open search and generated a profile devoid of modifications that can be easily represented as some series of amino acid insertions and deletions. These PSMs may be false positive events that arise due to unaccounted-for modifications in narrow window search. (c) PSMs supporting peptides found only in narrow window search are commonly matched to peptides with greater counts in open search, giving greater confidence to their open search assignments. (d) Peptides suspected to be false positives as a result of unaccounted-for modifications in narrow window search are plotted across peptide confidences. Their numbers exceed the number of decoys and are prevalent in ranges of high peptide confidence, suggesting that they are not well estimated by the target-decoy strategy and cannot be eliminated using any scoring threshold. (e) Confirmation of target-decoy violation by examining PSMs with common modifications in narrow window search.
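For readers unfamiliar with the target-decoy strategy referred to in this caption, the following Python sketch shows one common way a score threshold is chosen at a given FDR. It assumes the simple estimate FDR = decoys / targets among PSMs at or above the threshold; the record does not state which estimator the authors used, and the function name and input layout are hypothetical.

```python
def filter_at_fdr(psms, max_fdr=0.01):
    """Return target PSMs accepted at the given estimated FDR.

    psms is a list of (score, is_decoy) pairs. The FDR at each rank is
    estimated as decoys / targets among all PSMs scoring at least as well
    (an assumed, commonly used estimator).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs to accept
    for rank, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the deepest rank at which the estimate is still acceptable.
        if targets and decoys / targets <= max_fdr:
            cutoff = rank
    return [p for p in ranked[:cutoff] if not p[1]]


# Toy data: 35 high-scoring targets with one decoy among them,
# then three decoys followed by ten low-scoring targets.
psms = [(100 - i, False) for i in range(35)]
psms += [(75.5, True), (65.5, True), (65.4, True), (65.3, True)]
psms += [(65 - i, False) for i in range(10)]
accepted = filter_at_fdr(psms, max_fdr=0.05)
```

The estimate counts only decoy hits, so it is blind to the error class described in panel (d): a spectrum from a modified peptide that is assigned to a wrong unmodified target sequence is counted as a target, not as a decoy.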
Figure 3. Analysis of large-scale shotgun proteomics experiments reveals differences in modification profiles
Mass difference features are identified with high mass accuracy and aligned across multiple experiments. Features are characterized by their localization rates and amino acid propensities. (a) Common modifications are present across different experiments with vastly different modification rates. Modifications are sometimes localized to amino acids that are unaccounted for in traditional workflows. (b) Large numbers of abundant features were found to be unique to particular experiments. Localization information assisted in characterizing these unknown modifications. (c) Highly abundant mass features were observed in which the mass difference could not be effectively localized.
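A mass difference feature can be thought of as a peak in the histogram of precursor mass differences (observed mass minus the theoretical mass of the matched unmodified peptide). The Python sketch below shows that idea under stated assumptions: the fixed bin width, the minimum count, and the function name are hypothetical, and the record does not describe the authors' actual feature detection or alignment procedure.

```python
from collections import Counter


def mass_difference_features(psms, bin_width=0.01, min_count=2):
    """Group PSMs by precursor mass difference and report populated bins.

    psms is an iterable of (observed_mass, theoretical_mass) pairs in Da.
    Returns (approximate_mass_difference, psm_count) pairs, most abundant first.
    """
    bins = Counter(round((obs - theo) / bin_width) for obs, theo in psms)
    features = [(b * bin_width, n) for b, n in bins.items() if n >= min_count]
    return sorted(features, key=lambda f: -f[1])


# Toy PSMs: three shifted by about +79.9663 Da (the mass of phosphorylation),
# two by about +15.9949 Da (oxidation), and one unmodified match.
psms = [
    (1000.0 + 79.9663, 1000.0),
    (1500.5 + 79.9663, 1500.5),
    (2200.25 + 79.9663, 2200.25),
    (1200.0 + 15.9949, 1200.0),
    (1800.75 + 15.9949, 1800.75),
    (1300.0, 1300.0),
]
features = mass_difference_features(psms)
```

Comparing the per-feature counts between experiments, normalized by the total number of PSMs, gives a rough version of the modification rates contrasted in panel (a).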
Figure 4. Open searching detects modified peptides containing labile modifications
Spectral similarity scores for each mass bin were computed to capture the spectral similarity between a modified peptide and its unmodified counterpart. Most modifications, such as phosphorylation, have average similarities between 0.4 and 0.6. Modifications that are localized to the peptide C-terminus disrupt the intense y-ion series and have lower similarity scores. Few mass bins contain low similarity scores, as these modified peptides would otherwise be impossible to detect using open searching. Interestingly, there exists a population of mass bins that have similarity scores exceeding that of carbon-13 (which leaves a largely unaltered spectrum). These modifications may represent labile modifications that are lost during peptide fragmentation.
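The caption does not say which similarity measure was used. As an illustration only, the Python sketch below assumes a cosine similarity between binned peak lists, a common choice for comparing spectra; the bin width and function names are hypothetical.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz / bin_width)] += intensity
    return vec


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity of two peak lists given as (m/z, intensity) pairs."""
    a, b = binned(peaks_a, bin_width), binned(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0


# Toy example: the two most intense peaks shift by about 80 Da in the
# modified spectrum, and only the weakest peak stays in place.
unmodified = [(175.1, 10.0), (303.2, 20.0), (416.3, 30.0)]
modified = [(175.1, 10.0), (383.2, 20.0), (496.3, 30.0)]
sim_shifted = cosine_similarity(unmodified, modified)
sim_same = cosine_similarity(unmodified, unmodified)
```

In this toy case most of the intensity moves with the modification, so the similarity is low. A labile modification that is lost during fragmentation would leave the fragment peaks in place and give a similarity close to that of the unmodified spectrum, which matches the interpretation offered in the caption.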
Figure 5. Application of MSFragger to diverse proteomics experiments
(a) The speed of MSFragger allows for reasonable analysis times even when the SILAC labels are specified as variable modifications in conjunction with open searching. In this comparison between a panel of breast tissues and a heavy labeled super-SILAC mix, we observe differences in their modification profiles, with certain modifications unique to the super-SILAC mix. (b) Low sample complexities in affinity purification mass spectrometry experiments allow lower abundance modified peptides to be more effectively sampled. On average, across a dataset consisting of 2594 bait proteins, the number of bait PSMs identified in open search was 3.88 times that of narrow window search. (c) Open searching of an RNA-protein crosslinking dataset using MSFragger successfully identifies RNA crosslinked peptides. 134 of the 189 originally reported crosslinked peptides were recovered. Shorter crosslinked peptides are unlikely to have sufficient non-shifted fragment ions for detection in open searching and account for the majority of peptides not recovered.
