J Proteomics. 2016 Jul 20;144:23-32.
doi: 10.1016/j.jprot.2016.05.032. Epub 2016 May 31.

A multi-model statistical approach for proteomic spectral count quantitation

Owen E Branson et al. J Proteomics. 2016.

Abstract

The rapid development of mass spectrometry (MS) technologies has solidified shotgun proteomics as the most powerful analytical platform for large-scale proteome interrogation. The ability to map and determine differential expression profiles of the entire proteome is the ultimate goal of shotgun proteomics. Label-free quantitation has proven to be a valid approach for discovery shotgun proteomics, especially when sample is limited. Label-free spectral count quantitation is an approach analogous to RNA sequencing, whereby count data are used to determine differential expression. Here we show that statistical approaches developed to evaluate differential expression in RNA sequencing experiments can be applied to detect differential protein expression in label-free discovery proteomics. This approach, termed MultiSpec, utilizes open-source statistical platforms, namely edgeR, DESeq and baySeq, to statistically select protein candidates for further investigation. Furthermore, to remove bias associated with a single statistical approach, a single ranked list of differentially expressed proteins is assembled by comparing edgeR and DESeq q-values directly with the false discovery rate (FDR) calculated by baySeq. This statistical approach is then extended to spectral count data derived from multiple proteomic pipelines. The individual statistical results from multiple proteomic pipelines are integrated and cross-validated by collapsing protein groups.

Biological significance: Spectral count data from shotgun proteomics experiments are semi-quantitative and semi-random, yet provide a robust way to estimate protein concentration. Tag-count approaches are routinely used to analyze RNA sequencing data sets. This approach, termed MultiSpec, utilizes multiple tag-count based statistical tests to determine differential protein expression from spectral counts. The statistical results from these tag-count approaches are combined to reach a final MultiSpec q-value that is used to re-rank protein candidates. This re-ranking procedure removes bias associated with a single approach in order to better understand the true proteomic differences driving the biology in question. The MultiSpec approach can be extended to multiple proteomic pipelines. In such an instance, MultiSpec statistical results are integrated by collapsing protein groups across proteomic pipelines to provide a single ranked list of differentially expressed proteins. This mechanism is seamlessly integrated with the statistical analysis and provides the means to cross-validate protein inferences from multiple proteomic pipelines.

Keywords: DESeq; Proteomics; Spectral Counting; baySeq; edgeR; q-Value.

Figures

Figure 1. Label-free spectral counts derived from MassMatrix, MyriMatch and Proteome Discoverer proteomic pipelines robustly estimate true fold changes
(a) Plot of the average log2-fold-changes (UPSB:UPSA) derived from TMM-normalized spectral counts. The vertical dashed lines indicate the true spike ratios for each cassette. The UniProtID is aligned and color-coded by cassette. If the protein was not identified in one condition, the corresponding log2-fold-change was not calculated. (b) Plot of TMM-normalized counts from cassette 5. The expected spike ratio for cassette 5 is 4:1 (UPSA:UPSB). UPSA is the baseline, so the expected log2-fold-change (UPSB:UPSA) for cassette 5 is −2.0.
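
The cassette 5 arithmetic can be checked with a short sketch. The function name and the choice to return `None` for an unidentified protein are illustrative assumptions, not part of the published method.

```python
import math
from typing import Optional


def log2_fold_change(count_b: float, count_a: float) -> Optional[float]:
    """Return log2(B/A), with A as the baseline.

    Returns None when either count is zero, mirroring the caption's note that
    no fold change is calculated if a protein is missing in one condition.
    (Returning None is an assumption made for this sketch.)
    """
    if count_a <= 0 or count_b <= 0:
        return None
    return math.log2(count_b / count_a)


# Cassette 5 is spiked 4:1 (UPSA:UPSB), so the UPSB:UPSA ratio is 1:4.
expected_cassette_5 = log2_fold_change(1.0, 4.0)
print(expected_cassette_5)  # -2.0
```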
Figure 2. Schematic representation of spectral count data analysis performed on EAE/Sham lysates
This dataset consisted of 18 raw files. Three technical replicates were collected for each of the six biological replicates. Prior to MultiSpec analysis, the technical replicates were merged into one file. The number depicted at the top of the data funnel represents the total number of homologous protein groups identified by each search engine and protein grouping approach. The specifics for each approach are highlighted to the right of each respective graphic. The boxed number at the bottom of the data funnel depicts the protein groups remaining after a ten-total-count filter is applied. The MultiSpec statistical approach was applied to identify differential protein expression. The median q-value/FDR was chosen as a representation of statistical significance. The number of differentially expressed proteins is highlighted in the lower pie-shaped region. Protein identifications are then collapsed and differential expression is validated across search engines. A minimum q-value is chosen to rank candidates across proteomic pipelines. In total, 192 proteins were identified as differentially expressed by the MultiSpec analysis of the EAE/Sham dataset.
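
The three numeric steps in this caption (count filter, median within a pipeline, minimum across pipelines) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and example values are hypothetical, and treating the filter as "total count of at least ten" is an assumption, since the caption does not say whether the cutoff is inclusive.

```python
from statistics import median

MIN_TOTAL_COUNT = 10  # ten-total-count filter; inclusive cutoff is an assumption


def passes_count_filter(counts_per_sample):
    """Keep a protein group only if its summed spectral counts reach the cutoff."""
    return sum(counts_per_sample) >= MIN_TOTAL_COUNT


def pipeline_q_value(edger_q, deseq_q, bayseq_fdr):
    """Median of the three statistics, used as the per-pipeline MultiSpec value."""
    return median([edger_q, deseq_q, bayseq_fdr])


def cross_pipeline_q_value(q_by_pipeline):
    """Minimum q-value over the pipelines in which the protein group was found."""
    return min(q_by_pipeline.values())


# Hypothetical values for one protein group, for illustration only.
q_by_pipeline = {
    "MassMatrix": pipeline_q_value(0.01, 0.20, 0.04),
    "MyriMatch": pipeline_q_value(0.03, 0.02, 0.30),
}
final_q = cross_pipeline_q_value(q_by_pipeline)
```

Taking the median within a pipeline means a single extreme value from one statistical method cannot decide the outcome on its own, which matches the stated aim of removing single-method bias.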
Figure 3. MultiSpec Differential Expression Analysis of spectral counts derived from the MassMatrix analysis of EAE/Sham lysates
(a) The Origin/Rank Plot highlights the relationships between the tag-count approaches used by MultiSpec. This two-part figure describes the q-value/FDR as a function of the final MultiSpec q-value/rank. The top portion indicates which statistical platform (edgeR, DESeq or baySeq) the MultiSpec q-value is derived from, while the bottom portion tracks the q-values from each individual statistical approach. (b) Venn diagram of the results derived from MultiSpec. The proteins identified as statistically significant by the MultiSpec approach must have a significant q-value/FDR in at least two of the three statistical approaches. The proteins that met this criterion are highlighted in the gray shaded oval. Proteins that were identified solely by one statistical approach are considered ‘orphans’ and are not significant in the MultiSpec approach. (c) The Orphan Plot allows a quick evaluation of the statistical outputs of orphan candidates. In the edgeR Orphan plot, the edgeR q-value is plotted on the x-axis while the DESeq q-values (red) and the baySeq FDR (blue) are plotted on the y-axes.
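
The two-of-three rule and the notion of an orphan can be illustrated with a small sketch. The function name, labels and example values are hypothetical, and applying a 0.05 cutoff to each individual method is an assumption made here for illustration.

```python
from statistics import median

THRESHOLD = 0.05  # assumed per-method significance cutoff


def classify(stats, threshold=THRESHOLD):
    """Classify a protein from its edgeR q, DESeq q and baySeq FDR values.

    'significant'     : at least two of the three methods pass the cutoff
    'orphan'          : exactly one method passes
    'not significant' : no method passes
    """
    hits = sorted(name for name, value in stats.items() if value <= threshold)
    if len(hits) >= 2:
        return "significant", hits
    if len(hits) == 1:
        return "orphan", hits
    return "not significant", hits


# Hypothetical protein called only by edgeR: an orphan.
example = {"edgeR": 0.01, "DESeq": 0.20, "baySeq": 0.40}
label, hits = classify(example)

# With three values, "at least two pass" is the same as "the median passes".
median_passes = median(example.values()) <= THRESHOLD
```

For three values, the median is at or below a cutoff exactly when at least two of the values are, so this rule is consistent with using the median q-value/FDR as the measure of significance.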
Figure 4. Cross-Pipeline Validation of MultiSpec Differential Expression Analyses
(a) For each protein log2-fold-change depicted in panel (b), a gold point indicates whether the protein was identified by the indicated proteomic pipeline. (b) MultiSpec fold-change plot of the median log2-fold-change of all proteins found to meet the MultiSpec q-value threshold (≤ 0.05), sorted from smallest to largest. Proteins identified by three, two or one pipeline(s) are shown as green triangles, blue circles or red squares, respectively. (c) Venn diagram of significant protein groups across pipelines prior to cross-pipeline validation. A total of 1322 unique protein groups were identified. (d) Venn diagram of significant protein groups across pipelines after cross-pipeline validation. A total of 287 redundant protein groups were collapsed.
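
The ordering used in panel (b) can be sketched as below. This is an illustrative reading of the caption only: the data layout, function name and protein labels are hypothetical, and the fold-change values are invented for the example.

```python
from statistics import median


def summarize_fold_changes(log2fc_by_protein):
    """Return (protein, median log2FC, number of pipelines), sorted by median.

    `log2fc_by_protein` maps a protein to a dict of per-pipeline log2 fold
    changes; a pipeline is absent if it did not identify the protein.
    """
    rows = [
        (protein, median(by_pipeline.values()), len(by_pipeline))
        for protein, by_pipeline in log2fc_by_protein.items()
    ]
    return sorted(rows, key=lambda row: row[1])  # smallest to largest


# Hypothetical proteins and values, for illustration only.
example = {
    "P1": {"MassMatrix": 1.0, "MyriMatch": 2.0, "Proteome Discoverer": 3.0},
    "P2": {"MassMatrix": -1.5},
    "P3": {"MyriMatch": 0.5, "Proteome Discoverer": 1.5},
}
summary = summarize_fold_changes(example)
```

The third element of each row is the pipeline count that the caption encodes with marker shape and color (three, two or one).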
