Comparative Study
PLoS One. 2016 Oct 20;11(10):e0164542.
doi: 10.1371/journal.pone.0164542. eCollection 2016.

IGSA: Individual Gene Sets Analysis, including Enrichment and Clustering

Lingxiang Wu et al. PLoS One.

Abstract

Gene set analysis has been widely applied in high-throughput biological studies. One weakness of traditional methods is that they neglect the heterogeneity of gene expression across samples, which may lead to the omission of some specific and important gene sets. They also have difficulty reflecting disease severity and providing gene set expression profiles for individuals. We developed a software application called IGSA that performs gene set enrichment and sample clustering. IGSA calculates gene set expression scores for each sample and uses an accumulating clustering strategy in which samples join the set in order of disease progression, from mild to severe. We evaluated the performance of IGSA on gastric, pancreatic and ovarian cancer data sets. We also compared the KEGG pathway enrichment results of IGSA with those of DAVID, GSEA, SPIA and ssGSEA, and analyzed the results of IGSA clustering under different similarity measures. Notably, IGSA proved more sensitive and specific in finding significant pathways, and it can indicate pathway changes associated with disease severity. In addition, IGSA provides a profile of significant gene sets for each sample.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. The workflow of IGSA.
Step one: normalize the gene expression values and calculate an expression score for each gene set in each sample. Step two: identify significant gene sets (e.g., pathways) with Fisher's exact test, counting how many gene set scores in controls and in cases fall above or below the average score and constructing 2x2 contingency tables from those counts. Step three: extract the expression of the significant gene sets, using the list of significant gene sets and the gene set expression scores, and subject the result to IGSA clustering (the similarity measure is SMIC).
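Step two of this workflow can be illustrated with a minimal sketch. The caption does not say which average serves as the threshold, so the code below assumes the mean over all samples; the function names `contingency_table` and `fisher_exact_two_sided` and the example scores are hypothetical and are not taken from the IGSA software. Scores equal to the threshold are counted as not above it.

```python
from math import comb
from statistics import mean


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def contingency_table(control_scores, case_scores, threshold=None):
    """Count scores above / not above the threshold in cases and in controls."""
    if threshold is None:
        # Assumption: the "average score" is the mean over all samples.
        threshold = mean(list(control_scores) + list(case_scores))
    cases_above = sum(s > threshold for s in case_scores)
    controls_above = sum(s > threshold for s in control_scores)
    return (cases_above, len(case_scores) - cases_above,
            controls_above, len(control_scores) - controls_above)


controls = [1.0, 2.0, 1.0, 2.0, 1.0]  # made-up scores for one gene set
cases = [5.0, 6.0, 5.0, 6.0, 5.0]
table = contingency_table(controls, cases)
p_value = fisher_exact_two_sided(*table)
print(table, round(p_value, 4))
```

In practice the test would be repeated for every gene set, and a multiple-testing correction would probably be needed; the caption does not state which one, if any, IGSA applies.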
Fig 2. The workflow of IGSA clustering.
Step one: create an empty seed set and an empty candidate set. Step two: construct a start seed by calculating the average expression value of each significant gene set in the normal samples, add the start seed to the seed set, and add all of the disease samples to the candidate set. Step three: calculate the average similarity of each sample in the candidate set to all samples in the seed set, and move the sample with the highest similarity score from the candidate set to the seed set. Step four: repeat step three until the candidate set is empty.
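The four steps above can be sketched as a short loop. SMIC is not defined in this excerpt, so the sketch substitutes negative Euclidean distance as a stand-in similarity (Euclidean distance is the alternative measure compared in Fig 5). The function names and sample values are hypothetical.

```python
from math import dist


def neg_euclidean(u, v):
    # Stand-in similarity: larger (closer to zero) means more similar.
    return -dist(u, v)


def accumulate_order(normal_samples, disease_samples, similarity=neg_euclidean):
    """Return disease sample names in the order they join the seed set."""
    n_sets = len(normal_samples[0])
    # Step two: the start seed is the per-gene-set mean over the normal samples.
    start_seed = [sum(s[j] for s in normal_samples) / len(normal_samples)
                  for j in range(n_sets)]
    seeds = [start_seed]
    candidates = dict(disease_samples)  # sample name -> gene set score vector
    order = []
    # Steps three and four: move the most similar candidate until none remain.
    while candidates:
        def avg_similarity(name):
            vec = candidates[name]
            return sum(similarity(vec, s) for s in seeds) / len(seeds)

        best = max(candidates, key=avg_similarity)
        order.append((best, avg_similarity(best)))
        seeds.append(candidates.pop(best))
    return order


normal = [[0.0, 0.0], [0.0, 2.0]]  # made-up gene set scores
disease = {"severe": [5.0, 5.0], "mild": [0.0, 1.5], "moderate": [2.0, 2.0]}
order = accumulate_order(normal, disease)
print([name for name, _ in order])
```

Because each newly added sample becomes a seed, later candidates are compared against the whole accumulated set rather than against the normal average alone, which is what lets the ordering drift gradually away from the normal profile.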
Fig 3. Comparison of the seven methods by accuracy (average accuracy across the three cancer-related datasets) and by the proportion of significant pathways supported by papers in the three cancer-related datasets.
IGSA was more robust and sensitive in finding significant pathways than the other methods. Although the accuracy of DAVID and SPIA was slightly higher than that of IGSA, both DAVID and SPIA found only a subset of the significant pathways.
Fig 4. The clustering of samples in hepatitis datasets.
The blue curves show the average similarity scores of the clustered samples. The red curves are loess curves fitted to the similarity scores. The gray vertical lines divide the samples at the inflection points of the red curve. Most samples of the most serious disease (NASH) tended to cluster on the right (class 3). Samples of the less severe disease (steatosis) tended to cluster in the middle (class 2). Most healthy obese samples tended to cluster on the left (class 1). To some extent, the clustering may reveal the disease severity of the samples in the hepatitis datasets.
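Dividing the ordered samples at the inflection points of a smoothed curve could look roughly like the sketch below. It assumes the similarity scores have already been smoothed (the caption uses a loess fit, which is not reimplemented here) and locates inflection points as sign changes in the discrete second difference. The helper names are hypothetical, and the input is a synthetic cubic rather than real data.

```python
def inflection_indices(values):
    """Indices where the discrete second difference of `values` changes sign."""
    second = [values[i + 1] - 2 * values[i] + values[i - 1]
              for i in range(1, len(values) - 1)]
    cuts = []
    prev_sign = 0
    for offset, d in enumerate(second):
        sign = (d > 0) - (d < 0)
        if sign == 0:
            continue  # flat curvature carries no sign information
        if prev_sign != 0 and sign != prev_sign:
            cuts.append(offset + 1)  # convert back to an index into `values`
        prev_sign = sign
    return cuts


def split_at(items, cuts):
    """Split `items` into consecutive classes at the given cut indices."""
    bounds = [0, *cuts, len(items)]
    return [items[a:b] for a, b in zip(bounds, bounds[1:])]


# Synthetic smoothed curve with a single inflection near the middle.
smoothed = [(x - 5) ** 3 for x in range(11)]
cuts = inflection_indices(smoothed)
classes = split_at(list(range(11)), cuts)
print(cuts, classes)
```

On unsmoothed scores the second difference flips sign frequently, so this approach depends on the curve being smooth enough to leave only a few sign changes.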
Fig 5. Comparison of IGSA clustering with different similarity measures.
The blue points represent the survival time of the samples, and the blue lines were generated by linear fitting of those points. The green curve shows the average similarity scores of the clustered samples. The red curves are loess curves fitted to the similarity scores. The gray vertical lines divide the samples at the inflection points of the red curve. (A) IGSA clustering based on SMIC, applied to pathways in the ovarian cancer data set (batch 9). (B) IGSA clustering based on Euclidean distance, applied to pathways in the same data set. Survival time tended to decrease under both methods (A, B). However, the red curve in B was too smooth to divide the samples into different disease classes.
Fig 6. Classification comparison of IGSA, HCBP (hierarchical clustering based on pathways) and HCBG (hierarchical clustering based on genes) in ovarian cancer datasets (TCGA batch 9).
A shows the survival time curves of the three classes obtained by IGSA (p value of 0.0362). B shows the survival time curves of the three classes obtained by HCBP (p value of 0.187). C shows the survival time curves of the three classes obtained by HCBG (p value of 0.240). D shows the survival time curves of two classes (class 1 versus classes 2 and 3) obtained by IGSA (p value of 0.0362). The p values in A and D are below 0.05, whereas those for HCBP and HCBG are not.
Fig 7. Survival analysis of ovarian cancer datasets (TCGA batch 9 and batch 40).
Part A shows the survival time curves of the two classes obtained by IGSA (p value of 0.0778). Part B shows the survival time curves of the two classes obtained by IGSA based on the same significant pathways (13 paper-supported SUPs and 16 paper-supported SDPs; p value of 0.0364).
Fig 8. The clustering of samples and significant pathways in gastric cancer datasets.
The x-axis lists the significant pathways in the gastric cancer data, in the order produced by IGSA clustering. The y-axis lists the cases in the gastric cancer data, in the order produced by IGSA clustering (samples whose pathway expression values were more similar to the average expression values of normal samples are closer to the origin). Each dot marks a pathway whose expression value in a cancer sample was higher than the average level. The color of the dots, from blue to green, represents the potential progression of the cancer from mild to severe.

Grants and funding

The work was supported by the following: Xiujie Chen: National Natural Science Foundation of China [Grant No. 61372188], https://isisn.nsfc.gov.cn/egrantweb/, Provincial Education Department Project of Heilongjiang, China [Grant No. 12541331], Graduate Innovation Foundation of Harbin Medical University, China [Grant No. YJSCX2014-19HYD]; and Denan Zhang: National Natural Science Foundation of China [Grant No. 11401137], https://isisn.nsfc.gov.cn/egrantweb/, China Postdoctoral Science Foundation [Grant No. 2015M58147], http://jj.chinapostdoctor.org.cn/V1/Program2/Default.aspx, Heilongjiang Postdoctoral Science Foundation [Grant No. LBH-Z15177]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.