Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study
- PMID: 21930246
- DOI: 10.1016/j.meegid.2011.08.014
Abstract
Heterogeneity in sample composition is an inherent issue in many gene expression studies and, in many cases, should be taken into account in the downstream analysis to enable correct interpretation of the underlying biological processes. Typical examples are infectious diseases or immunology-related studies using blood samples, where, for example, the proportions of lymphocyte sub-populations are expected to vary between cases and controls. Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, notably in bioinformatics where its ability to extract meaningful information from high-dimensional data such as gene expression microarrays has been demonstrated. Very recently, it has been applied to biomarker discovery and gene expression deconvolution in heterogeneous tissue samples. Being essentially unsupervised, standard NMF methods are not guaranteed to find components corresponding to the cell types of interest in the sample, which may jeopardize the correct estimation of cell proportions. We have investigated the use of prior knowledge, in the form of a set of marker genes, to improve gene expression deconvolution with NMF algorithms. We found that this improves the consistency with which both cell type proportions and cell type gene expression signatures are estimated. The proposed method was tested on a microarray dataset consisting of pure cell types mixed in known proportions. Pearson correlation coefficients between true and estimated cell type proportions improved substantially (typically from about 0.5 to approximately 0.8) with the semi-supervised (marker-guided) versions of commonly used NMF algorithms. Furthermore, known marker genes associated with each cell type were assigned to the correct cell type more frequently for the guided versions.
We conclude that the use of marker genes improves the accuracy of gene expression deconvolution using NMF and suggest modifications to how the marker gene information is used that may lead to further improvements.
Copyright © 2011 Elsevier B.V. All rights reserved.
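The idea summarized in the abstract (constraining NMF with marker genes, then scoring estimated proportions by Pearson correlation against known mixtures) can be illustrated with a small sketch. This is not the authors' algorithm: it assumes the standard Lee and Seung multiplicative updates for the squared-error loss, and it assumes one plausible way of using markers, namely forcing each marker gene to have zero expression in every cell type other than its own and initializing the mixing matrix from mean marker expression. The function name `marker_guided_nmf`, the simulated data, and all parameter values are hypothetical.

```python
import numpy as np


def marker_guided_nmf(X, markers, n_iter=500, seed=0, eps=1e-9):
    """Factor X (genes x samples, nonnegative) as W @ H.

    markers maps cell type index j (0..k-1) to a list of gene row indices.
    ASSUMPTION: a marker gene is expressed only in its own cell type,
    enforced by a 0/1 mask on W. This is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    n_genes = X.shape[0]
    k = len(markers)

    # mask[g, j] == 0 where marker gene g must be silent in cell type j.
    mask = np.ones((n_genes, k))
    for cell_type, genes in markers.items():
        mask[genes, :] = 0.0
        mask[genes, cell_type] = 1.0

    # Random signatures (masked); mixing rows seeded from mean marker expression.
    W = (rng.random((n_genes, k)) + eps) * mask
    H = np.vstack([X[markers[j], :].mean(axis=0) for j in range(k)]) + eps

    for _ in range(n_iter):
        # Multiplicative updates keep entries nonnegative and keep masked
        # zeros in W at exactly zero.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return W, H


# Simulated mixtures of k "pure" cell types in known proportions (toy data).
rng = np.random.default_rng(42)
n_genes, k, n_samples = 60, 3, 20
markers = {j: list(range(5 * j, 5 * j + 5)) for j in range(k)}

W_true = rng.random((n_genes, k))
for j, genes in markers.items():
    W_true[genes, :] = 0.0
    W_true[genes, j] = rng.uniform(5.0, 10.0, size=len(genes))

H_true = rng.dirichlet(np.ones(k), size=n_samples).T  # columns sum to 1
noise = rng.normal(0.0, 0.01, size=(n_genes, n_samples))
X = np.clip(W_true @ H_true + noise, 0.0, None)

W_est, H_est = marker_guided_nmf(X, markers)

# Pearson correlation per cell type; it is invariant to the arbitrary
# scaling between rows of H and columns of W.
corrs = [np.corrcoef(H_true[j], H_est[j])[0, 1] for j in range(k)]
print(np.round(corrs, 3))
```

Because NMF leaves the scale of each component undetermined, the rows of `H_est` are only proportional to cell type proportions; the sketch therefore compares them with the true proportions by correlation rather than by absolute value. Results on such clean simulated data say nothing about performance on real microarray data.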