Comparative Study
PLoS One. 2012;7(5):e36540.
doi: 10.1371/journal.pone.0036540. Epub 2012 May 17.

Algebraic comparison of partial lists in bioinformatics

Giuseppe Jurman et al. PLoS One. 2012.

Abstract

The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling biological phenotype in terms of a classification or regression model. Due to resampling protocols or to a meta-analysis comparison, it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained, instead of just one list. Here we introduce a method, based on permutations, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated by finding and comparing gene profiles on a large prostate cancer dataset, consisting of two cohorts of patients from different countries, for a total of 455 samples.
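
As a rough illustration of the idea in the abstract, the sketch below compares two partial lists of unequal length by averaging a Canberra distance over every way of completing each list to a full ranking of the feature set. This brute-force averaging is an assumption about how a "complete" measure can be computed on a tiny alphabet; it is not the closed-form expression used in the paper, and the function names, the example lists, and the use of the mean pairwise value as a stability indicator are all hypothetical.

```python
from itertools import combinations, permutations
from statistics import mean


def canberra(rank_a, rank_b):
    """Canberra distance between two full rankings (dicts: feature -> 1-based rank)."""
    return sum(abs(rank_a[f] - rank_b[f]) / (rank_a[f] + rank_b[f]) for f in rank_a)


def completions(partial, alphabet):
    """Yield every full ranking that keeps `partial` at the top, in its given order."""
    rest = [f for f in alphabet if f not in partial]
    for tail in permutations(rest):
        full = list(partial) + list(tail)
        yield {f: i + 1 for i, f in enumerate(full)}


def complete_dissimilarity(list_a, list_b, alphabet):
    """Average Canberra distance over all pairs of completions (brute force)."""
    comps_b = list(completions(list_b, alphabet))
    return mean(
        canberra(a, b) for a in completions(list_a, alphabet) for b in comps_b
    )


def stability(lists, alphabet):
    """Mean pairwise dissimilarity of a set of partial lists (lower = more stable)."""
    return mean(
        complete_dissimilarity(a, b, alphabet) for a, b in combinations(lists, 2)
    )


# Hypothetical example: lists of length 3 and 4 on an alphabet of 6 features.
alphabet = list("ABCDEF")
lists = [["A", "B", "C"], ["B", "A", "D", "E"]]
print(stability(lists, alphabet))
```

The enumeration grows factorially with the number of unranked features, so this form is only usable on toy alphabets; it is meant to make the definition concrete, not to scale.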

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Operational steps in computing the Complete Canberra Dissimilarity Measure between two partial lists.
Example on two lists of length 3 and 4 on an alphabet of 6 features, by the closed form Eq. (5) and through the open formula Eq. (3).
Figure 2. Ratio between Core and Complete measures vs. ratio between the length of partial lists and the size of the full feature set for about 7000 pairs of partial lists.
The lists in each pair have the same length and are randomly permuted, with partial list length ranging between 1 and 5000 and full set size ranging between 10 and 100000.
Figure 3. Zoom of the bottom left corner of Figure 2.
Core and Complete measures are proportional when the ratio between the length of partial lists and the size of the full feature set is less than 0.15.
Figure 4. Analysis pipeline for the classifier/feature ranking methods: a 100×5-fold CV is applied separately on the two cohorts, and a set of models is built on an increasing number of important features, ranked by discriminant power for the employed classifier.
At the same time, the stability level of the set of derived lists is computed, and all models are evaluated on an accuracy-stability plot.
Figure 5. MCC and Canberra Core values on the two Setlur datasets computed by using the SRDA, lSVM, and DLDA models.
Each point indicates a model with a fixed number of features, marked above the corresponding 95% Student bootstrap CI line.
Figure 6. Zoom of MCC and Canberra Core values computed by using the SRDA, lSVM, and DLDA models on the two Setlur datasets.
Each point indicates a model with a fixed number of features, marked above the corresponding 95% Student bootstrap CI line.
Figure 7. Boxplot of the DAP2_5229 expression value separately for the two Setlur datasets and the two class labels.
Figure 8. MCC for SRDA and DLDA models on an increasing number of features (from 1 to 30) extracted from the global list on the Setlur data.
Figure 9. Analysis pipeline for the filtering methods: a 90%/10% split is repeated 100 times, and the selected filter method is applied to the training portion.
The stability indicator is then computed for the corresponding set of lists.
Figure 10. Canberra core evaluated on the Setlur dataset on B = 100 repeated filtering experiments on 90% of the data.
Figure 11. Zoom of Fig. 10 on the 80%–100% threshold zone.
Figure 12. Levelplot of the values computed on the lists produced by filtering methods (75% threshold) and SRDA models with Complete Canberra Measure computed on their Borda lists.
Figure 13. Levelplot of the values computed on the lists produced by filtering methods (75% threshold) and SRDA models, with Complete Canberra Measure computed on their whole list sets.
Figure 14. Multidimensional Scaling (MDS) on two components computed on the lists produced by filtering methods (75% threshold) and SRDA models, with Complete Canberra Measure computed on their Borda lists.
Figure 15. Multidimensional Scaling (MDS) on two components computed on the lists produced by filtering methods (75% threshold) and SRDA models, with Complete Canberra Measure computed on their whole lists.
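
The filtering pipeline described in the caption of Figure 9 (a 90%/10% split repeated B = 100 times, with a filter applied to each training portion) can be sketched as follows. The filter used here, the absolute difference of class means, is a hypothetical stand-in chosen for brevity and is not one of the methods evaluated in the paper; the function names and parameters are likewise assumptions. The result is a set of B partial lists, to which a list dissimilarity measure can then be applied to obtain a stability indicator.

```python
import random
from statistics import mean


def filter_rank(samples, labels, k):
    """Keep the top-k features by |mean(class 1) - mean(class 0)| (hypothetical filter).

    Assumes both classes are present in `labels`.
    """
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        pos = [x[j] for x, y in zip(samples, labels) if y == 1]
        neg = [x[j] for x, y in zip(samples, labels) if y == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:k]


def repeated_filter_lists(samples, labels, k, B=100, train_frac=0.9, seed=0):
    """Repeat a random train/test split B times and rank features on each training part."""
    rng = random.Random(seed)
    n = len(samples)
    n_train = int(round(train_frac * n))
    lists = []
    for _ in range(B):
        idx = rng.sample(range(n), n_train)  # training indices, drawn without replacement
        train_x = [samples[i] for i in idx]
        train_y = [labels[i] for i in idx]
        lists.append(filter_rank(train_x, train_y, k))
    return lists
```

Each returned list holds feature indices in decreasing order of filter score, so the B lists can be passed directly to any pairwise comparison of ranked partial lists.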
