Patterns (N Y). 2021 May 7;2(5):100245.
doi: 10.1016/j.patter.2021.100245. eCollection 2021 May 14.

A community effort to identify and correct mislabeled samples in proteogenomic studies

Seungyeul Yoo et al. Patterns (N Y).

Abstract

Sample mislabeling or misannotation has been a long-standing problem in scientific research, and it is particularly prevalent in large-scale, multi-omic studies because of the complexity of multi-omic workflows. There is an urgent need for quality controls that automatically screen for and correct sample mislabels or misannotations in multi-omic studies. Here, we describe the crowdsourced precisionFDA NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge, which provides a framework for systematic benchmarking and evaluation of mislabel identification and correction methods for integrative proteogenomic studies. The challenge received a large number of submissions from domestic and international data scientists, with highly variable performance observed across the submitted methods. Post-challenge collaboration between the top-performing teams and the challenge organizers has produced open-source software, COSMO, with demonstrated high accuracy and robustness in mislabeling identification and correction in simulated and real multi-omic datasets.

Keywords: DSML 3: Development/Pre-production: Data science output has been rolled out/validated across multiple domains/problems.

Conflict of interest statement

S.Y. and J.Z. are employees of Sema4, a for-profit organization that promotes healthcare through information-driven insights. R.P., H.F., and H.C. are employees of Sentieon Inc. A.C. is an employee of Bionamic AB. The other authors declare no competing interests.

Figures

Figure 1
Overview of pFDA-NCI-CPTAC Challenge design and post-challenge development. The challenge consisted of two sub-challenges structured sequentially. In the first sub-challenge, participants were presented with clinical and proteomic data for the same set of samples and asked to detect samples with unmatched clinical and proteomic data. In the second sub-challenge, participants were further provided with RNA-seq data for the same samples as in the first sub-challenge and were requested to detect the mislabeled samples, identify the problematic data types, and correct the errors. F1 scores were used for performance evaluation. In the end, the top-performing teams worked together to develop and implement an automated sample-labeling check algorithm named COSMO (COrrection of Sample Mislabeling by Omics).
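Because the caption names the F1 score as the evaluation metric, a minimal sketch of how such a score could be computed for a mislabel-detection submission may help. It uses the standard definition (the harmonic mean of precision and recall) over sets of sample IDs. The function name, the sample IDs, and the set-based framing are illustrative assumptions; the challenge's exact scoring rules, in particular how corrections were scored in the second sub-challenge, are not given here.

```python
def f1_score(true_mislabeled, predicted_mislabeled):
    """Standard F1 for a predicted set of mislabeled sample IDs.

    Hypothetical helper for illustration; not the challenge's scoring code.
    """
    truth = set(true_mislabeled)
    predicted = set(predicted_mislabeled)
    true_positives = len(truth & predicted)
    if true_positives == 0:
        # Covers empty predictions and predictions with no correct hit.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Made-up sample IDs: four truly mislabeled samples, three flagged.
truth = {"S03", "S07", "S12", "S15"}
flagged = {"S03", "S07", "S09"}
score = f1_score(truth, flagged)  # precision 2/3, recall 1/2
```

F1 penalizes both missed mislabels and false alarms, which suits this task: flagging every sample would give perfect recall but poor precision.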
Figure 2
Summary of challenge results. (A) Global participants in the challenge, suggesting high interest in the challenge problems. (B) Performance evaluation of 149 submissions (columns) from 52 unique submitters (rows) for sub-challenge 1. The F1 score with 95% confidence interval was evaluated for each submission and averaged for unique submitters. (C) Evaluation of sub-challenge 2. In total, 57 submissions (columns) from 31 unique submitters (rows) were evaluated in terms of average F1 score. Performance varied widely across submissions in both sub-challenges, and even within the same team, suggesting the importance of standardized methods. (D and E) Association between team performances in sub-challenge 1 and missing data imputation methods (D) and feature-selection methods (E). Metric used: average percentile rank. (F) Evaluation of the robustness of the top three methods from sub-challenge 2 using 50 colon cancer simulated datasets with fixed types and number of errors. P values were calculated using two-sided paired Student’s t test.
Figure 3
COSMO and its performance on independent test datasets. (A) Simulated datasets mimicking real cases of sample mislabeling, generated from the colon cancer dataset with different types and numbers of sample-labeling errors. (B) Performance with different sources of clinical attribute predictions. P values were calculated using two-sided paired Student’s t test. (C) Overall schematic of COSMO to detect and correct mislabeled samples in clinical or omic data. (D) Simulated datasets mimicking real cases of sample mislabeling, generated from CPTAC kidney cancer datasets with different types and numbers of sample-labeling errors. (E) Performance of COSMO in the 50 simulated datasets from (D).
Figure 4
Application of COSMO in real datasets. (A) CPTAC LUAD: four pairs of proteomic samples reciprocally matched each other between RNA-seq-Proteomics and Proteomics-CNV, but no label swapping was observed in RNA-seq-CNV. (B) CPTAC CCRCC: three samples in proteomics were shifted in RNA-seq-Proteomics and Proteomics-CNV matching, while samples between RNA-seq and CNV were matched well. (C) TCGA BRCA: eight pairs of microarray samples were swapped in RNA-seq-Microarray and Microarray-CNV matching. (D) Battle et al.: two RNA-seq samples were swapped based on alignment among RNA-seq, proteomic, and Riboseq data. A potentially duplicated protein sample was observed. (E) Chick et al.: nine pairs of samples were swapped between RNA-seq and protein data. Comparison with the clinical annotation of sample gender suggested that the swaps occurred in the proteomic data.
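The reciprocal matching described in panel (A) can be illustrated with a small sketch. It is not COSMO's algorithm, which is not described here. It assumes a hypothetical square similarity matrix in which entry `[i][j]` is the correlation between sample `i` in one data type (for example RNA-seq) and sample `j` in another (for example proteomics), with both data types listed in the same labeled order. The function names and the numbers are made up for illustration.

```python
def best_matches(similarity):
    """For each row sample, return the index of its most similar column sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]


def reciprocal_swaps(similarity):
    """Return index pairs (i, j), i < j, whose best matches point at each other.

    With correct labels every sample's best match is itself (the diagonal).
    A swapped pair shows up as row i matching column j and row j matching column i.
    """
    best = best_matches(similarity)
    return [(i, j) for i, j in enumerate(best) if i < j and best[j] == i]


# Hypothetical 4-sample similarity matrix; samples 1 and 2 look swapped.
similarity = [
    [0.9, 0.1, 0.2, 0.1],
    [0.1, 0.2, 0.8, 0.1],
    [0.2, 0.7, 0.1, 0.2],
    [0.1, 0.1, 0.2, 0.9],
]
swaps = reciprocal_swaps(similarity)
```

A single pairwise comparison cannot say which data type carries the error. As in panel (A), a third data type resolves this: the data type that is discordant in both of its pairings (there, proteomics) is the one implicated.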
Figure 5
Biological impact of error corrections using COSMO in CPTAC LUAD dataset. (A) Number of differentially expressed proteins (DEPs) between male and female tumors before and after error correction. (B) Comparison of t test FDRs (−log10) of 8,528 proteins between male and female tumors. (C) HALLMARK pathways (Fisher's exact test [FET] FDR < 0.05) significantly associated with gender DEPs before and after COSMO. Unique DEPs after COSMO were also used for functional enrichment test. (D) Number of DEPs in immune-hot and immune-cold tumors. (E) Comparison of t test FDRs of 8,528 proteins between immune-hot and immune-cold tumors. (F) HALLMARK pathways significantly associated with upregulated proteins in immune-hot sub-type tumors. (G) Correlation strengths of 8,366 gene-protein pairs before and after correction. Pearson correlation p values were adjusted with the Benjamini-Hochberg procedure (FDR) and then log10 transformed. (H) Difference in correlation strengths of 269 gene-protein pairs significant only after error correction.
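The FDR values in this caption come from Benjamini-Hochberg adjustment followed by a log10 transform. A minimal pure-Python sketch of that standard procedure follows, assuming a plain list of p values; the input numbers are invented, and the −log10 form is used as in panel (B). This is not the authors' analysis code.

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    # Indices sorted by ascending p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p values for five tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
fdr = benjamini_hochberg(p_values)
neg_log10_fdr = [-math.log10(q) for q in fdr]
```

On the −log10 scale, larger values mean stronger evidence, so a protein that moves up after error correction has become more significant.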
Figure 6
pQTL analysis impacted by error correction. The logarithm of the odds (LOD) score of the OMA1 local pQTL increased from 24 (left) to 31 (right) after correction of the errors.
