Inferring biological tasks using Pareto analysis of high-dimensional data

Hart, Yuval; Sheftel, Hila; Hausser, Jean; Szekely, Pablo; Ben-Moshe, Noa Bossel; Korem, Yael; Tendler, Avichai; Mayo, Avraham E; Alon, Uri

doi:10.1038/nmeth.3254

Brief Communication
Published: 26 January 2015

Yuval Hart¹,
Hila Sheftel¹,
Jean Hausser¹,
Pablo Szekely¹,
Noa Bossel Ben-Moshe²,
Yael Korem¹,
Avichai Tendler¹,
Avraham E Mayo¹ (ORCID: orcid.org/0000-0002-4479-3423) &
Uri Alon¹

Nature Methods volume 12, pages 233–235 (2015)

Abstract

We present the Pareto task inference method (ParTI; http://www.weizmann.ac.il/mcb/UriAlon/download/ParTI) for inferring biological tasks from high-dimensional biological data. Data are described as a polytope, and features maximally enriched closest to the vertices (or archetypes) allow identification of the tasks the vertices represent. We demonstrate that human breast tumors and mouse tissues are well described by tetrahedrons in gene expression space, with specific tumor types and biological functions enriched at each of the vertices, suggesting four key tasks.

**Figure 1: Key cancer features are maximally enriched at points nearest the archetypes.**

**Figure 2: A mouse tissue gene expression data set is well described by a tetrahedron, with archetypes enriched with specific features.**

References

1. Kim, H.D., Shay, T., O'Shea, E.K. & Regev, A. Science 325, 429–432 (2009).
2. Kalisky, T., Blainey, P. & Quake, S.R. Annu. Rev. Genet. 45, 431–445 (2011).
3. Curtis, C. et al. Nature 486, 346–352 (2012).
4. Bendall, S.C. & Nolan, G.P. Nat. Biotechnol. 30, 639–647 (2012).
5. The Cancer Genome Atlas Network. Nature 490, 61–70 (2012).
6. Ringnér, M. Nat. Biotechnol. 26, 303–304 (2008).
7. Van der Maaten, L. & Hinton, G. J. Mach. Learn. Res. 9, 2579–2605 (2008).
8. Hastie, T., Tibshirani, R. & Friedman, J. in The Elements of Statistical Learning 2nd edn. 520–528 (Springer, 2009).
9. Shoval, O. et al. Science 336, 1157–1160 (2012).
10. Sheftel, H., Shoval, O., Mayo, A. & Alon, U. Ecol. Evol. 3, 1471–1483 (2013).
11. Szekely, P., Sheftel, H., Mayo, A. & Alon, U. PLoS Comput. Biol. 9, e1003163 (2013).
12. Mørup, M. & Hansen, L.K. Neurocomputing 80, 54–63 (2012).
13. Li, J. & Bioucas-Dias, J.M. IEEE Int. Geosci. Remote Sens. Symp. 3, 250–253 (2008).
14. Chan, T.-H., Chi, C.-Y., Huang, Y.-M. & Ma, W.-K. IEEE Trans. Signal Process. 57, 4418–4432 (2009).
15. Chan, T.-H., Liou, J.-Y., Ambikapathi, A., Ma, W.-K. & Chi, C.-Y. in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 1237–1240 (IEEE, 2012).
16. Bioucas-Dias, J.M. et al. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 5, 354–379 (2012).
17. Schwartz, R. & Shackney, S.E. BMC Bioinformatics 11, 42 (2010).
18. Tolliver, D., Tsourakakis, C., Subramanian, A., Shackney, S. & Schwartz, R. Bioinformatics 26, i106–i114 (2010).
19. Thøgersen, J.C., Mørup, M., Damkiær, S., Molin, S. & Jelsbak, L. BMC Bioinformatics 14, 279 (2013).
20. Subramanian, A. et al. Proc. Natl. Acad. Sci. USA 102, 15545–15550 (2005).
21. Lehmann, B.D. et al. J. Clin. Invest. 121, 2750–2767 (2011).
22. Lattin, J.E. et al. Immunome Res. 4, 5 (2008).
23. Cutler, A. & Breiman, L. Technometrics 36, 338–347 (1994).
24. Bioucas-Dias, J.M. in Hyperspectral Image Signal Process. Evol. Remote Sens. First Workshop 1–4 (IEEE, 2009).
25. Mann, H.B. & Whitney, D.R. Ann. Math. Stat. 18, 50–60 (1947).
26. Benjamini, Y. & Hochberg, Y. J. R. Stat. Soc. Series B Stat. Methodol. 57, 289–300 (1995).
27. Nishimura, D. Biotech Softw. Internet Rep. 2, 117–120 (2001).
28. Kanehisa, M. & Goto, S. Nucleic Acids Res. 28, 27–30 (2000).
29. Croft, D. et al. Nucleic Acids Res. 39 (suppl. 1), D691–D697 (2011).

Acknowledgements

We thank N. Drayman, B. Towbin, M. Botzman, Y. Liron, M. Adler, G. Aidelberg, D. Rothschild, S. Malihi, O. Szekely and members of the Alon lab for discussions. We acknowledge support by the Human Frontier Science Program, project number RGP0020/2012, European Research Council, project number 249919, and Rising Tide Cancer Research Fund, project number 721176. U.A. receives support as the Abisch-Frenkel Professorial Chair. J.H. acknowledges the support of the Swiss National Science Foundation (PBBSP3_14961) and EMBO (ALTF 1160-2012).

Author information

Yuval Hart, Hila Sheftel, Jean Hausser and Pablo Szekely: These authors contributed equally to this work.

Authors and Affiliations

Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
Yuval Hart, Hila Sheftel, Jean Hausser, Pablo Szekely, Yael Korem, Avichai Tendler, Avraham E Mayo & Uri Alon
Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot, Israel
Noa Bossel Ben-Moshe

Contributions

Y.H., H.S., J.H. and P.S. developed the method and analyzed the data. N.B.B.-M. analyzed the microarray breast cancer data. Y.K., A.T. and A.E.M. consulted on the method and algorithm. U.A. designed the method and research program. Y.H., H.S., J.H. and P.S. wrote the Matlab code, and Y.H., H.S., J.H., P.S. and U.A. wrote the manuscript.

Corresponding author

Correspondence to Uri Alon.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 The best trade-off phenotypes lie on polytopes in trait space.

(A) 2 tasks result in a line. (B) 3 tasks result in a triangle. (C) 4 tasks result in a tetrahedron.

Supplementary Figure 2 Schematic description of Pareto archetype analysis and its relation to clustering analysis.

(A) Clustering works well for data that are divided into discrete groups. (B) Data that uniformly fill a triangle are split by k-means clustering into three clusters, so that each data point is assigned to one of three categories. Close-by points (circled in black) can be assigned to different categories. (C) Archetype analysis of the same data provides a continuous description in which each data point is described by its distances from the archetypes. Thus two nearby points (circled in black) fall into different clusters under clustering algorithms but have similar weights in ParTI. (D) Point density in the dataset affects clustering but not the archetypes of the ParTI method. Shown are two datasets for which clustering yields different clusters whereas the archetype positions remain unchanged.
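The contrast in panels B and C can be illustrated with a small numerical sketch. It assumes a hypothetical 2D triangle of archetypes, uses barycentric coordinates as the continuous weights, and uses nearest-vertex assignment as a simple stand-in for the hard labels a clustering algorithm would produce (it is not k-means itself). The function names and coordinates are illustrative and are not taken from the ParTI package.

```python
import numpy as np

def archetype_weights(point, archetypes):
    """Barycentric weights w (summing to 1) such that point == w @ archetypes."""
    archetypes = np.asarray(archetypes, dtype=float)
    # Stack the coordinate equations with a row of ones that enforces sum(w) == 1.
    a = np.vstack([archetypes.T, np.ones(len(archetypes))])
    b = np.append(np.asarray(point, dtype=float), 1.0)
    return np.linalg.solve(a, b)

def nearest_vertex(point, archetypes):
    """Hard assignment: index of the closest archetype (stand-in for a cluster label)."""
    diffs = np.asarray(archetypes, dtype=float) - np.asarray(point, dtype=float)
    return int(np.argmin(np.linalg.norm(diffs, axis=1)))

# Hypothetical triangle and two nearby points on either side of a label boundary.
archetypes = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
p = np.array([0.49, 0.2])
q = np.array([0.51, 0.2])

w_p = archetype_weights(p, archetypes)
w_q = archetype_weights(q, archetypes)
label_p = nearest_vertex(p, archetypes)
label_q = nearest_vertex(q, archetypes)

print("weights p:", np.round(w_p, 2), "label:", label_p)
print("weights q:", np.round(w_q, 2), "label:", label_q)
```

The two points receive different hard labels even though their weight vectors differ by only 0.02 in each of the first two components, which is the behavior the caption describes.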

Supplementary Figure 3 The breast cancer gene expression data set is well enclosed by a tetrahedron.

(A) Fraction of the total variance explained by the polytope as a function of the number of archetypes. Archetypes in dimension d (= number of archetypes − 1) were calculated using the PCHA algorithm, and the explained variance (EV) was computed for each fit. The effective number of archetypes can be estimated as the point on the EV curve farthest from the line connecting its first and last points. (B) 3D plot of the data and enclosing tetrahedron. The axes are the first three principal components, which explain 30.4% of variance. The colored ellipsoids represent the archetype location and error on the most varying directions. Archetype error bars are obtained by bootstrapping. Each ellipsoid represents 68% confidence level. The inset near each archetype shows the projection of the data on the plane defined by the tetrahedron’s face opposing that archetype.

Supplementary Figure 4 The clusters defined by Curtis et al.³ are located in specific areas of the tetrahedron.

Each panel represents the location of one of the clusters found by Curtis et al. in the tetrahedron calculated by the ParTI method for the breast cancer dataset. The volume colored in purple is defined by the region in the polytope in which a given cluster is most enriched (the convex hull of the 50% locally enriched points). Data points belonging to the relevant cluster are plotted in black.

Supplementary Figure 5 Several features found by Enrichment At Archetype (EAA) are not maximal at the archetype.

Each panel shows the density of a given feature as a function of distance from the archetype, for features detected by using the archetype position (Enrichment At Archetype, EAA) instead of ParTI. As can be seen (red boxes), some features are maximally enriched at some distance from the archetype rather than at the archetype itself, suggesting that they are not associated with the archetype’s biological task according to Pareto optimality theory. The x-axis is the bin number; the y-axis is normalized enrichment (compared to the mean density).
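A minimal sketch of the kind of curve described in this caption is given below. It assumes equal-sized bins of points ordered by Euclidean distance from the archetype, a Boolean feature that is present at least once, and enrichment defined as the per-bin density divided by the mean density. The function name, the bin count and the synthetic data are illustrative assumptions, not the ParTI implementation.

```python
import numpy as np

def enrichment_by_distance(points, archetype, has_feature, n_bins=10):
    """Normalized feature density per distance bin (bin 0 is closest to the archetype)."""
    points = np.asarray(points, dtype=float)
    has_feature = np.asarray(has_feature, dtype=bool)
    dist = np.linalg.norm(points - np.asarray(archetype, dtype=float), axis=1)
    order = np.argsort(dist)                 # indices from nearest to farthest
    bins = np.array_split(order, n_bins)     # roughly equal-sized bins
    mean_density = has_feature.mean()        # assumed non-zero
    return np.array([has_feature[idx].mean() / mean_density for idx in bins])

# Synthetic 1D data: 100 points on [0, 1], with a hypothetical archetype at 0.
points = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
archetype = [0.0]
index = np.arange(100)
feature_near = index < 20                    # present only in the 20 closest points
feature_far = (index >= 20) & (index < 40)   # present only at intermediate distance

near = enrichment_by_distance(points, archetype, feature_near)
far = enrichment_by_distance(points, archetype, feature_far)

print("peak bin (near feature):", int(np.argmax(near)))
print("peak bin (far feature):", int(np.argmax(far)))
```

In this toy example the first feature peaks in the bin closest to the archetype, whereas the second peaks at a distance, which is the pattern highlighted by the red boxes.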

Supplementary Figure 6 The “basal” archetype splits as additional archetypes are added to the analysis.

The tree of archetypes is determined by Euclidean distance between archetypes in different dimensions.
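The caption does not spell out the linking rule, so the sketch below assumes one simple reading: each archetype from the fit with k+1 archetypes is linked to its nearest archetype (by Euclidean distance) from the fit with k archetypes, with both sets expressed in the same coordinate space. A parent that receives two children is reported as splitting. The coordinates and the function name are hypothetical.

```python
from collections import Counter

import numpy as np

def link_to_parents(parents, children):
    """For each child archetype, return the index of the nearest parent archetype."""
    parents = np.asarray(parents, dtype=float)
    children = np.asarray(children, dtype=float)
    # Pairwise Euclidean distances, shape (n_children, n_parents).
    d = np.linalg.norm(children[:, None, :] - parents[None, :, :], axis=2)
    return d.argmin(axis=1).tolist()

# Hypothetical archetype positions from a 3-archetype and a 4-archetype fit.
three = [[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]]
four = [[0.1, 0.0], [3.5, -0.4], [4.2, 0.6], [2.0, 3.1]]

links = link_to_parents(three, four)
split_parents = [p for p, n in Counter(links).items() if n > 1]

print("child -> parent:", links)
print("parents that split:", split_parents)
```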

Supplementary Figure 7 Breast cancer gene expression data profiled by mRNA-seq is well described by a tetrahedron.

The axes are the first three principal components. The colored ellipsoids represent the archetype location and error on the most varying directions. Archetype error bars are obtained by bootstrapping. Each ellipsoid represents 68% confidence level.

Supplementary Figure 8 Three-dimensional plot of the tissue data set and enclosing tetrahedron.

Axes are the first three principal components. Different tissue categories are marked by different coloring – neural (green), macrophage and microglia (purple), secretory glands (red), stem cells (light blue), hematopoietic cells (lymphoid – orange, myeloid – olive green), other homogeneous tissues (blue).

Supplementary Figure 9 Caveats in Pareto archetype analysis.

(A) Coin toss data (binomial process B(0.5,N)) fall on a triangular-shaped region in log-log plots of number of heads versus number of tosses because variance increases with number of tosses. The triangle is unrelated to Pareto theory. No data point is expected to be enriched for any feature. (B) A non-convex distribution of data can result in significant triangle assignment. The Pareto origin of this triangle can be doubted if no feature is enriched near archetype z. (C) A distant outlier from the rest of the dataset can produce a triangle in the PCA, since the first principal component will span the line between the outlier and the data and the second component will condense the rest of the data into a line, thus forming an artificial triangle.

Supplementary Figure 10 The ‘elbow’ method.

We assess the best-fit number of archetypes automatically by plotting the explained variance (EV) versus the number of archetypes: we look for the ‘elbow’ of the plot by seeking the point farthest from the line connecting the first and last EV values. Here we show an example for the mouse tissue dataset. This method should be tested for different maximal numbers of archetypes to assess its robustness.
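The rule described above can be sketched in a few lines. The EV values below are made up for illustration and are not taken from the paper; the function name is likewise hypothetical. Because the distance is measured to one fixed line, the selected point does not depend on how the two axes are scaled.

```python
import numpy as np

def elbow_index(ev):
    """Index of the EV value farthest from the chord joining the first and last values."""
    ev = np.asarray(ev, dtype=float)
    x = np.arange(len(ev), dtype=float)
    # Chord direction from the first to the last point of the curve.
    dx, dy = x[-1] - x[0], ev[-1] - ev[0]
    # Perpendicular distance via the 2D cross product with the chord direction.
    cross = (x - x[0]) * dy - (ev - ev[0]) * dx
    dist = np.abs(cross) / np.hypot(dx, dy)
    return int(np.argmax(dist))

# Illustrative EV curve for fits with 2 to 7 archetypes (hypothetical numbers).
n_archetypes = [2, 3, 4, 5, 6, 7]
ev = [0.30, 0.50, 0.68, 0.71, 0.73, 0.74]

best = n_archetypes[elbow_index(ev)]
print("estimated number of archetypes:", best)
```

Rerunning the same function on truncated curves (for example, only up to 6 archetypes) is one way to carry out the robustness check suggested in the caption.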

Supplementary Figure 11 Different algorithms result in similar positions of the archetypes.

Breast cancer data plotted in the space of the first three PCs; each tissue sample is a black dot. Blue, green, red and yellow circles represent archetype positions found by Sisal, MVSA, MVES and SDVMM, respectively.

Supplementary Figure 12 The position of the archetypes is robust to data sampling (cancer data set).

Shown are the positions of the archetypes found for the bootstrapped datasets using sampling with replacement (blue points) and the archetype positions obtained when removing points from the convex hull of the data (red points).

Supplementary Figure 13 Three-dimensional plot of the data and enclosing tetrahedron of the mouse tissue data set.

The axes are the first three principal components, which explain 61% of variance. The colored ellipsoids represent the archetype location and error on the most varying directions. Archetype error bars are obtained by bootstrapping. Each ellipsoid represents 68% confidence level.

Supplementary Figure 14 Explained variance measures both dimensionality and structure of the data set.

The fraction-of-explained-variance curves of a tetrahedron (red), a 3D cube (green) and a 3D sphere (blue) with 3% noise embedded in a 100-dimensional space. The explained variance was calculated with the PCHA algorithm for 2–10 archetypes.
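A synthetic test set of the kind described here could be generated roughly as follows. The caption does not define "3% noise", so this sketch assumes isotropic Gaussian noise whose standard deviation is 3% of the spread of the noiseless 3D coordinates; the vertex positions, sample size and function name are also assumptions. Fitting archetypes (for example with PCHA) is not shown.

```python
import numpy as np

def noisy_tetrahedron(n_points=500, ambient_dim=100, noise=0.03, seed=0):
    """Points filling a tetrahedron uniformly, embedded in a higher-dimensional space."""
    rng = np.random.default_rng(seed)
    # Vertices of a regular tetrahedron in 3D.
    vertices = np.array(
        [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float
    )
    # Dirichlet(1, 1, 1, 1) weights are uniform on the simplex, so the
    # weighted vertex averages fill the tetrahedron uniformly.
    weights = rng.dirichlet(np.ones(4), size=n_points)
    points_3d = weights @ vertices
    # Random orthonormal 3D frame inside the ambient space (reduced QR).
    basis, _ = np.linalg.qr(rng.standard_normal((ambient_dim, 3)))
    embedded = points_3d @ basis.T
    # Assumed noise model: Gaussian, scaled to a fraction of the data spread.
    scale = noise * points_3d.std()
    return embedded + rng.normal(0.0, scale, size=embedded.shape)

data = noisy_tetrahedron()
clean = noisy_tetrahedron(noise=0.0)
print("data shape:", data.shape)
```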

Supplementary Figure 15 Archetypes in the real data set show many more enriched features than are expected by chance.

Purple circles indicate the number of enriched features at the most-enriched archetype in the shuffled dataset as a function of p-value threshold P_th, averaged over 1000 shuffled datasets. Black circles indicate the mean number of enriched features at a single archetype in the shuffled dataset, averaged over 1000 shuffled datasets. Brown, red, green and blue small circles indicate the corresponding total number of enriched features at the archetypes in the real non-shuffled dataset.

About this article

Cite this article

Hart, Y., Sheftel, H., Hausser, J. et al. Inferring biological tasks using Pareto analysis of high-dimensional data. Nat Methods 12, 233–235 (2015). https://doi.org/10.1038/nmeth.3254

Received: 11 July 2014
Accepted: 17 November 2014
Published: 26 January 2015
Issue Date: March 2015
DOI: https://doi.org/10.1038/nmeth.3254
