
Module Usage in Projects

Stephan Reichl edited this page Dec 21, 2024 · 22 revisions

As a concrete example, we will apply the unsupervised_analysis module to the UCI ML hand-written digits dataset (digits) imported from sklearn.

Data

We provide a minimal example of an unsupervised analysis of the UCI ML hand-written digits dataset imported from sklearn:

  • Configuration
    • configuration: config/digits/digits_unsupervised_analysis_config.yaml
    • annotation: config/digits/digits_unsupervised_analysis_annotation.csv
  • Data (automatically generated within the example)
    • dataset (1797 observations, 64 features): data/digits/digits_data.csv
    • metadata (consisting only of the ground truth label "target"): data/digits/digits_labels.csv
  • Results will be generated in the configured results folder results/digits/
  • Performance: on an HPC, a full run split into 92 jobs with up to 32GB of memory per job completed in less than 7 minutes, excluding conda environment installations.
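
Once the data files have been generated, their dimensions can be checked against the numbers listed above. The following is a minimal sketch using only the standard library. The helper name `csv_shape` is hypothetical, and the sketch assumes that the CSV has a header row and that its first column holds the observation index; adjust it if your files are laid out differently.

```python
import csv
import os


def csv_shape(path):
    """Return (rows, columns), not counting the header row or the first (index) column."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Count non-empty data rows only
        n_rows = sum(1 for row in reader if row)
    return n_rows, len(header) - 1


data_path = "data/digits/digits_data.csv"
if os.path.exists(data_path):
    # The list above states 1797 observations and 64 features for this file
    print(csv_shape(data_path))
```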

Code & Configuration

First, we register the configuration file for applying the unsupervised_analysis module to digits in your project's config/config.yaml, using this specific and predefined structure:

#### Datasets and Workflows to include ###
workflows:
    digits:
        unsupervised_analysis: "config/digits/digits_unsupervised_analysis_config.yaml"

Tip

Recommended folder and naming scheme for config files: config/{dataset_name}/{dataset_name}_{module}_config.yaml.
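
The recommended scheme can be written as a small helper, for example to generate paths consistently across datasets. This is a minimal sketch; the function name `module_config_path` and the `config_dir` parameter are hypothetical and not part of the module.

```python
from pathlib import PurePosixPath


def module_config_path(dataset_name, module, config_dir="config"):
    """Build config/{dataset_name}/{dataset_name}_{module}_config.yaml."""
    filename = f"{dataset_name}_{module}_config.yaml"
    # PurePosixPath keeps forward slashes on every platform, matching config.yaml
    return str(PurePosixPath(config_dir) / dataset_name / filename)


print(module_config_path("digits", "unsupervised_analysis"))
# config/digits/digits_unsupervised_analysis_config.yaml
```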

Second, within the main Snakefile (workflow/Snakefile), we have to do three things:

  • load and parse all configurations into a structured dictionary.
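    # Load configs for all workflows and datasets into one dictionary.
    # Assumes `import yaml` near the top of the Snakefile.
    config_wf = dict()

    for ds in config["workflows"]:
        for wf in config["workflows"][ds]:
            with open(config["workflows"][ds][wf], 'r') as stream:
                try:
                    # key format: {dataset}_{workflow}, e.g. digits_unsupervised_analysis
                    config_wf[ds+'_'+wf]=yaml.safe_load(stream)
                except yaml.YAMLError as exc:
                    print(exc)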
  • include the workflow/rules/digits.smk analysis snakefile from the rule subfolder (see last step).
    ##### load rules (one per dataset) #####
    include: os.path.join("rules", "digits.smk")
  • require all outputs from the used module as inputs to the target rule all.
    #### Target Rule ####
    rule all:
        input:
            #### digits Analysis
            rules.digits_unsupervised_analysis_all.input,
            ...
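
Because the loading loop opens every path listed under workflows, a mistyped path only surfaces when the Snakefile is parsed. The following pre-flight check is a minimal sketch under the assumption that the config dictionary has the structure shown in config/config.yaml; the function name `missing_workflow_configs` is hypothetical.

```python
import os


def missing_workflow_configs(config):
    """Map each '{dataset}_{workflow}' key to its config path if that file does not exist."""
    missing = {}
    for ds, workflows in config.get("workflows", {}).items():
        for wf, path in workflows.items():
            if not os.path.isfile(path):
                # Same key format as config_wf in the Snakefile: ds + '_' + wf
                missing[f"{ds}_{wf}"] = path
    return missing


# Example using the structure from config/config.yaml
config = {
    "workflows": {
        "digits": {
            "unsupervised_analysis": "config/digits/digits_unsupervised_analysis_config.yaml"
        }
    }
}

problems = missing_workflow_configs(config)
if problems:
    print("Missing configuration files:", problems)
```

Run from the project root, the check prints nothing when every configured file exists.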

Finally, within the dedicated Snakefile for the analysis of digits (workflow/rules/digits.smk), we first import the digits dataset using a custom rule and script. We then load the specified version of the unsupervised_analysis module from a local copy or directly from GitHub, provide it with the previously prepared configuration, and use a prefix for all (*) loaded rules.

# digits Analysis

### digits - Load data with custom rule and script ####
rule load_digits:
    output:
        data = os.path.join('data','digits','digits_data.csv'),
        labels = os.path.join('data','digits','digits_labels.csv'),
    resources:
        mem_mb=1000,
    threads: 1
    conda:
        "../envs/sklearn.yaml"
    log:
        os.path.join("logs","rules","load_digits.log"),
    script:
        "../scripts/digits/load_digits.py"

### digits - Unsupervised Analysis ####
module digits_unsupervised_analysis:
    snakefile:
        #"/path/to/clone/unsupervised_analysis/workflow/Snakefile"
        github("epigen/unsupervised_analysis", path="workflow/Snakefile", tag="v3.0.1")
    config:
        config_wf["digits_unsupervised_analysis"]

use rule * from digits_unsupervised_analysis as digits_unsupervised_analysis_*

Tip

Recommended naming scheme:

  • Dataset/project names: always in camelCase (avoiding _ is recommended), e.g., ATACtreated.
  • Filename for the analysis/dataset-specific rule file: ./workflow/rules/{dataset_name}.smk.
  • Module name: {dataset_name}_{module}
  • Prefix for the loaded rules: {dataset_name}_{module}_.
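
As an illustration of the scheme, the module name and rule prefix can be derived from the dataset and module names. This is a minimal sketch; the helper `module_names` is hypothetical, and rejecting underscores outright is a stricter reading of the recommendation above.

```python
def module_names(dataset_name, module):
    """Return (module name, rule prefix) following {dataset_name}_{module}."""
    if "_" in dataset_name:
        # The tip recommends dataset names without underscores
        raise ValueError(f"dataset name {dataset_name!r} should not contain '_'")
    name = f"{dataset_name}_{module}"
    return name, f"{name}_"


name, prefix = module_names("digits", "unsupervised_analysis")
print(name)    # digits_unsupervised_analysis
print(prefix)  # digits_unsupervised_analysis_
```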

Results

Below we show selected results to illustrate an unsupervised analysis, mirroring the module's core features.

Dimensionality Reduction

To visualize high-dimensional data we employed three different approaches: Principal Component Analysis (PCA; linear), Uniform/Density-preserving Manifold Approximation and Projection (dens/UMAP; non-linear), and Heatmaps (not shown).

Selected projections of digits (one row per method, one panel per coloring):

  • Colorings: metadata target; feature pixel_0_2; clustering (Leiden ModularityVertexPartition, euclidean with knn=15)
  • PCA: [Figure: PCA of digits colored by metadata target] [Figure: PCA of digits colored by feature pixel_0_2] [Figure: PCA of digits colored by clustering]
  • UMAP (euclidean): [Figure: UMAP of digits colored by metadata target] [Figure: UMAP of digits colored by feature pixel_0_2] [Figure: UMAP of digits colored by clustering]
  • densMAP (euclidean): [Figure: densMAP of digits colored by metadata target] [Figure: densMAP of digits colored by feature pixel_0_2] [Figure: densMAP of digits colored by clustering]

Cluster Analysis

For clustering, i.e., grouping data points by similarity with respect to their features, we support Leiden, a graph-based clustering algorithm, applied directly on the UMAP k-nearest neighbors graph. For the analysis and validation of clustering results we provide clustree, external clustering indices for comparison to metadata (not shown) and internal clustering indices in combination with Multi-Criteria-Decision-Making (MCDM) using TOPSIS to find the "best" clustering.

[Figure: clustree for comparing cluster memberships across clustering results]

[Figure: Clustering results ranked by MCDM TOPSIS from best to worst]
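
To illustrate how TOPSIS turns several internal indices into a single ranking, here is a minimal pure-Python sketch. It is not the module's implementation. The function name `topsis`, the equal weights, the choice of two example indices (silhouette, where higher is better, and Davies-Bouldin, where lower is better), the clustering labels, and all numbers are assumptions made up for illustration.

```python
import math


def topsis(matrix, weights, maximize):
    """Score alternatives (rows) on criteria (columns); a higher score is closer to the ideal."""
    n_criteria = len(weights)
    # 1) Vector-normalize each column, then apply the criterion weight
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n_criteria)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_criteria)] for row in matrix]
    # 2) Ideal and anti-ideal value per criterion, depending on its direction
    columns = list(zip(*weighted))
    ideal = [max(c) if maximize[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if maximize[j] else max(c) for j, c in enumerate(columns)]
    # 3) Relative closeness: d(anti-ideal) / (d(ideal) + d(anti-ideal))
    scores = []
    for row in weighted:
        d_ideal = math.dist(row, ideal)
        d_anti = math.dist(row, anti)
        total = d_ideal + d_anti
        scores.append(d_anti / total if total else 0.0)
    return scores


# Invented index values for three hypothetical clusterings (not results from the digits run)
clusterings = ["clustering_A", "clustering_B", "clustering_C"]
matrix = [
    [0.60, 0.50],  # A: best on both criteria
    [0.40, 0.90],  # B: in between
    [0.20, 1.50],  # C: worst on both criteria
]
weights = [0.5, 0.5]      # equal weights (assumption)
maximize = [True, False]  # silhouette: higher is better; Davies-Bouldin: lower is better

scores = topsis(matrix, weights, maximize)
ranking = sorted(zip(clusterings, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranking:
    print(label, round(score, 3))
```

In this toy input, clustering_A is best on every criterion and therefore coincides with the ideal point, so it ranks first. With real indices, no clustering usually dominates, and the ranking then depends on the chosen criteria and weights.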

Conclusion

To visualize the digits dataset we used different dimensionality reduction methods, of which UMAP and densMAP captured the structure best. We then investigated the clustering results and found that Leiden_euclidean_15_ModularityVertexPartition fit the data best according to internal clustering indices. Here, we showed only a small selection of results. For a detailed description of all features, e.g., 3D interactive visualizations and diagnostics, check out the comprehensive module documentation.