Anal Chem. 2020 Aug 4;92(15):10759-10767.
doi: 10.1021/acs.analchem.0c02137. Epub 2020 Jul 23.

Chemical Class Prediction of Unknown Biomolecules Using Ion Mobility-Mass Spectrometry and Machine Learning: Supervised Inference of Feature Taxonomy from Ensemble Randomization

Jaqueline A Picache et al. Anal Chem. 2020.

Abstract

This work presents a machine learning algorithm referred to as the supervised inference of feature taxonomy from ensemble randomization (SIFTER), which supports the identification of features derived from untargeted ion mobility-mass spectrometry (IM-MS) experiments. SIFTER applies random forest machine learning to three analytical measurements derived from IM-MS: collision cross section (CCS), mass-to-charge ratio (m/z), and mass defect (Δm). From these it classifies unknown features into a taxonomy of chemical kingdom, super class, class, and subclass. Each classification is assigned a calculated probability, and alternate classifications are reported with their associated probabilities. After optimization, SIFTER was tested against a set of molecules not used in the training set. The average success rate in classifying all four taxonomy categories correctly was found to be >99%. Analysis of molecular features detected from a complex biological matrix, also not used in the training set, yielded a lower success rate: all four categories were correctly predicted for ∼80% of the compounds. This decline in performance is due in part to incompleteness of the training set across all potential taxonomic categories and in part to a nearest-neighbor bias in the random forest algorithm. Ongoing efforts are focused on improving the class prediction accuracy of SIFTER through expansion of the empirical data sets used for training as well as improvements to the core algorithm.
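As a rough illustration of two ideas in the abstract, the sketch below derives a mass defect from a measured m/z and turns per-tree votes into a ranked list of class probabilities with alternates. It is not the authors' implementation. The mass defect definition (m/z minus its nearest integer), the vote-fraction probabilities, the function names `mass_defect` and `rank_classes`, and the example labels and votes are all assumptions made for illustration.

```python
from collections import Counter


def mass_defect(mz: float) -> float:
    """Assumed definition: measured m/z minus its nearest integer value."""
    return mz - round(mz)


def rank_classes(tree_votes: list[str]) -> list[tuple[str, float]]:
    """Convert per-tree class votes into (label, probability) pairs, best first."""
    counts = Counter(tree_votes)
    total = len(tree_votes)
    # Sort by descending vote share, then by label so ties are deterministic.
    return sorted(
        ((label, n / total) for label, n in counts.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Hypothetical votes from a 10-tree forest for one taxonomy level.
votes = ["Lipids"] * 7 + ["Organic acids"] * 2 + ["Benzenoids"]
ranked = rank_classes(votes)
primary, alternates = ranked[0], ranked[1:]
```

Here `primary` holds the top-ranked label with its vote share, and `alternates` holds the remaining labels in descending order, which mirrors the abstract's description of a primary classification reported alongside alternate classifications. In this scheme, one such ranking would be produced for each of the four taxonomy levels.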

Figures

Figure 1.
SIFTER Algorithm (A) Training set coverage numbers per classification category. (B) Comparison of training set coverage per classification category (white) to the total reported classifications (gray). (C) Overall schematic of SIFTER random forest machine learning workflow.
Figure 2.
SIFTER Performance and Outcomes (A) Receiver operating characteristic curves depicting the sensitivity and specificity of SIFTER per classification category. Associated out-of-bag (OOB) errors are also reported. (B) Potential outcomes of SIFTER, where none to all of the classifications are correct.
Figure 3.
Test Set Performance (A) Classification success per category in test set; n=30. (B) Example where SIFTER classified the molecule 6-Keto-prostaglandin F1α; 4/4 categories were correct and their associated probabilities are shown. (C) Example where SIFTER classified the molecule Carnosine; 2/4 categories were correct and their associated probabilities are shown. For the Carnosine example, note that SIFTER was split 50/50 between the correct and incorrect class categories.
Figure 4.
Summary of SIFTER Performance on Complex Samples (A) Number of correct classifications in the complex sample test set of 100 compounds. (B) Classification success per category in the complex test set of 100 compounds. (C) False discoveries per category, i.e., cases where SIFTER made a prediction when none should have been made.
Figure 5.
Compounds in Complex Sample Case Studies (A) Example where SIFTER classified the molecule Astilbin; 4/4 categories were correct and their associated probabilities are shown. (B) Example where SIFTER classified the molecule cis-Piceid; 2/4 categories were correct and their associated probabilities are shown. (C) Example where SIFTER classified the molecule Ferulic Acid; 1/4 categories were correct and their associated probabilities are shown.
