Module Usage in Projects

As a concrete example, we apply the unsupervised_analysis module to the UCI ML hand-written digits dataset (digits), imported from sklearn.

Data

We provide a minimal example of an unsupervised analysis of the UCI ML hand-written digits dataset imported from sklearn:

- Configuration
    - configuration: config/digits/digits_unsupervised_analysis_config.yaml
    - annotation: config/digits/digits_unsupervised_analysis_annotation.csv
- Data (automatically generated within the example)
    - dataset (1797 observations, 64 features): data/digits/digits_data.csv
    - metadata (consisting only of the ground truth label "target"): data/digits/digits_labels.csv
- Results are generated in the configured results folder results/digits/
- Performance: on an HPC, a full run split into 92 jobs with up to 32GB of memory per job took less than 7 minutes, excluding conda environment installations.

Code & Configuration

First, we register the configuration file for applying the unsupervised_analysis module to digits in your project's config/config.yaml, using this predefined structure:

#### Datasets and Workflows to include ###
workflows:
    digits:
        unsupervised_analysis: "config/digits/digits_unsupervised_analysis_config.yaml"

Tip

Recommended folder and naming scheme for config files: config/{dataset_name}/{dataset_name}_{module}_config.yaml.
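
The recommended scheme can be written down as a small path builder. This is only an illustrative sketch: the helper name `module_config_path` is hypothetical and not part of any module.

```python
from pathlib import PurePosixPath


def module_config_path(dataset_name: str, module: str) -> str:
    """Build config/{dataset_name}/{dataset_name}_{module}_config.yaml (hypothetical helper)."""
    filename = f"{dataset_name}_{module}_config.yaml"
    # PurePosixPath keeps forward slashes regardless of the operating system
    return str(PurePosixPath("config") / dataset_name / filename)


print(module_config_path("digits", "unsupervised_analysis"))
```

For digits and unsupervised_analysis this yields the path used in config/config.yaml above.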

Second, within the main Snakefile (workflow/Snakefile) we have to do three things:

load and parse all configurations into a structured dictionary.

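import yaml

# load configs for all workflows and datasets
# keys follow the pattern {dataset_name}_{module}, e.g., digits_unsupervised_analysis
config_wf = dict()

for ds in config["workflows"]:
    for wf in config["workflows"][ds]:
        with open(config["workflows"][ds][wf], 'r') as stream:
            try:
                config_wf[ds+'_'+wf]=yaml.safe_load(stream)
            except yaml.YAMLError as exc:
                # report the parsing error; the entry for this workflow stays unset
                print(exc)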
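
To make the shape of the resulting dictionary explicit, the following sketch reproduces only the key construction of the loop above, without reading any files. The helper name `workflow_config_keys` is hypothetical, and the `config` dictionary simply mirrors the config/config.yaml entry shown earlier.

```python
def workflow_config_keys(config: dict) -> dict:
    """Map each '{dataset}_{workflow}' key to the config file it would be loaded from."""
    return {
        f"{ds}_{wf}": path
        for ds, workflows in config["workflows"].items()
        for wf, path in workflows.items()
    }


# Stand-in for the parsed config/config.yaml of this example
config = {
    "workflows": {
        "digits": {
            "unsupervised_analysis": "config/digits/digits_unsupervised_analysis_config.yaml"
        }
    }
}

for key, path in workflow_config_keys(config).items():
    print(key, "<-", path)
```

The single key produced here, `digits_unsupervised_analysis`, is the one used to hand the parsed configuration to the module in the last step.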

include the workflow/rules/digits.smk analysis snakefile from the rule subfolder (see last step).
```
##### load rules (one per dataset) #####
include: os.path.join("rules", "digits.smk")
```

require all outputs from the used module as inputs to the target rule all.

#### Target Rule ####
rule all:
    input:
        #### digits Analysis
        rules.digits_unsupervised_analysis_all.input,
        ...

Finally, within the dedicated Snakefile for the analysis of digits (workflow/rules/digits.smk), we first import the digits dataset using a custom rule and script. We then load the specified version of the unsupervised_analysis module, either from a local copy or directly from GitHub, provide it with the previously prepared configuration, and use a prefix for all (*) loaded rules.

# digits Analysis

### digits - Load data with custom rule and script ####
rule load_digits:
    output:
        data = os.path.join('data','digits','digits_data.csv'),
        labels = os.path.join('data','digits','digits_labels.csv'),
    resources:
        mem_mb=1000,
    threads: 1
    conda:
        "../envs/sklearn.yaml"
    log:
        os.path.join("logs","rules","load_digits.log"),
    script:
        "../scripts/digits/load_digits.py"

### digits - Unsupervised Analysis ####
module digits_unsupervised_analysis:
    snakefile:
        #"/path/to/clone/unsupervised_analysis/workflow/Snakefile"
        github("epigen/unsupervised_analysis", path="workflow/Snakefile", tag="v3.0.1")
    config:
        config_wf["digits_unsupervised_analysis"]

use rule * from digits_unsupervised_analysis as digits_unsupervised_analysis_*
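
The custom script referenced by the load_digits rule is not shown on this page. As a rough sketch of what such a script could do, the code below writes a feature matrix and its labels to two CSV files. Everything here is an assumption made for illustration: the function names, the `sample_name` index column, and the `sample_{i}` identifiers are hypothetical, and the actual script may be organized differently.

```python
import csv


def write_digits_csv(features, labels, feature_names, data_path, labels_path):
    """Write features and ground-truth labels to two CSV files sharing one sample index."""
    # Hypothetical sample identifiers; the real script may index rows differently
    sample_ids = [f"sample_{i}" for i in range(len(features))]

    with open(data_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_name", *feature_names])
        for sample_id, row in zip(sample_ids, features):
            writer.writerow([sample_id, *row])

    with open(labels_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_name", "target"])
        for sample_id, label in zip(sample_ids, labels):
            writer.writerow([sample_id, label])


def load_digits_from_sklearn():
    """Fetch the digits data as plain lists; requires scikit-learn to be installed."""
    from sklearn.datasets import load_digits

    digits = load_digits()
    return digits.data.tolist(), digits.target.tolist(), list(digits.feature_names)
```

Inside a Snakemake `script:`, the two output paths would be taken from `snakemake.output.data` and `snakemake.output.labels`, matching the named outputs of the rule, and passed to `write_digits_csv` together with the values returned by `load_digits_from_sklearn()`.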

Tip

Recommended naming scheme:

Dataset/project names are always in camelCase (underscores are not recommended), e.g., ATACtreated.
Filename for the analysis/dataset-specific rule file: ./workflow/rules/{dataset_name}.smk.
Module name: {dataset_name}_{module}
Prefix for the loaded rules: {dataset_name}_{module}_.
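
The scheme can be summarized in a few lines of code. This is an illustrative sketch only: the helper `naming_scheme` is hypothetical, and it treats the underscore recommendation as a hard rule purely for demonstration.

```python
def naming_scheme(dataset_name: str, module: str) -> dict:
    """Derive the recommended names for one dataset/module pair (hypothetical helper)."""
    if "_" in dataset_name:
        raise ValueError("dataset names should be camelCase without underscores")
    module_name = f"{dataset_name}_{module}"
    return {
        "rule_file": f"workflow/rules/{dataset_name}.smk",
        "module_name": module_name,
        "rule_prefix": f"{module_name}_",
    }


names = naming_scheme("digits", "unsupervised_analysis")
# With the prefix applied, the module's rule `all` gets the name
# that the target rule in the main Snakefile refers to.
print(names["rule_prefix"] + "all")
```

Keeping underscores out of the dataset name means the first underscore in a module name always separates dataset and module.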

Results

Below we show selected results to illustrate an unsupervised analysis, mirroring the module's core features.

Dimensionality Reduction

To visualize high-dimensional data we employed three different approaches: Principal Component Analysis (PCA; linear), Uniform/Density-preserving Manifold Approximation and Projection (dens/UMAP; non-linear), and Heatmaps (not shown).

| Method | Metadata `target` | Feature `pixel_0_2` | Clustering Leiden `ModularityVertexPartition` `euclidean` with knn=`15` |
| --- | --- | --- | --- |
| PCA | PCA of `digits` colored by metadata target | PCA of `digits` colored by feature pixel_0_2 | PCA of `digits` colored by clustering |
| UMAP `euclidean` | UMAP of `digits` colored by metadata target | UMAP of `digits` colored by feature pixel_0_2 | UMAP of `digits` colored by clustering |
| densMAP `euclidean` | densMAP of `digits` colored by metadata target | densMAP of `digits` colored by feature pixel_0_2 | densMAP of `digits` colored by clustering |

Cluster Analysis

For clustering, i.e., grouping data points by similarity with respect to their features, we support Leiden, a graph-based clustering algorithm, applied directly on the UMAP k-nearest neighbors graph. For the analysis and validation of clustering results we provide clustree, external clustering indices for comparison to metadata (not shown) and internal clustering indices in combination with Multi-Criteria-Decision-Making (MCDM) using TOPSIS to find the "best" clustering.

clustree for comparing cluster memberships across clustering results

Clustering results ranked by MCDM TOPSIS from best to worst
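
To illustrate how TOPSIS turns several internal indices into one ranking, here is a minimal pure-Python sketch of the textbook procedure: vector-normalize each criterion, apply weights, and score each alternative by its relative closeness to the ideal solution. It is not the module's implementation. The three indices (silhouette, Davies-Bouldin, Calinski-Harabasz), the equal weights, the clustering names, and all numeric values are made up for demonstration.

```python
import math


def topsis(matrix, weights, benefit):
    """Return one closeness score per row of `matrix` (higher is better)."""
    n_criteria = len(weights)
    # Vector normalization per criterion (guard against all-zero columns)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n_criteria)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_criteria)] for row in matrix]

    # Ideal and anti-ideal points depend on whether a criterion is maximized
    columns = list(zip(*weighted))
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    anti_ideal = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]

    scores = []
    for row in weighted:
        d_plus = math.dist(row, ideal)        # distance to the ideal point
        d_minus = math.dist(row, anti_ideal)  # distance to the anti-ideal point
        total = d_plus + d_minus
        scores.append(d_minus / total if total else 0.0)
    return scores


# Hypothetical clusterings with made-up index values:
# [silhouette (maximize), Davies-Bouldin (minimize), Calinski-Harabasz (maximize)]
clusterings = ["clustering_A", "clustering_B", "clustering_C"]
matrix = [[0.6, 0.8, 500.0], [0.4, 1.2, 300.0], [0.2, 1.5, 100.0]]
weights = [1 / 3, 1 / 3, 1 / 3]
benefit = [True, False, True]

scores = topsis(matrix, weights, benefit)
ranking = [name for _, name in sorted(zip(scores, clusterings), reverse=True)]
print(ranking)
```

In this toy input `clustering_A` is best on every criterion and `clustering_C` worst, so they receive the extreme scores and `clustering_B` falls in between. With real indices the criteria usually disagree, which is where the weighting becomes relevant.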

Conclusion

To visualize the digits dataset we used different dimensionality reduction methods, of which UMAP and densMAP captured the structure best. We then investigated the clustering results and found that Leiden_euclidean_15_ModularityVertexPartition fit the data best according to internal clustering indices. Here, we showed only a small selection of results. For a detailed description of all features, e.g., 3D interactive visualizations and diagnostics, check out the comprehensive module documentation.
