Module Usage in Projects
As a concrete example, we apply the unsupervised_analysis module to the UCI ML hand-written digits dataset (digits) imported from sklearn. This minimal example consists of the following:
- Configuration
  - configuration: config/digits/digits_unsupervised_analysis_config.yaml
  - annotation: config/digits/digits_unsupervised_analysis_annotation.csv
- Data (automatically generated within the example)
  - dataset (1797 observations, 64 features): data/digits/digits_data.csv
  - metadata (consisting only of the ground truth label "target"): data/digits/digits_labels.csv
- Results will be generated in the configured results folder results/digits/
- Performance: On an HPC, a full run split into 92 jobs with up to 32GB of memory per job completed in less than 7 minutes, excluding conda environment installations.
First, we provide the configuration file for the application of the unsupervised_analysis module to digits, using this specific, predefined structure within your project's config/config.yaml:
    #### Datasets and Workflows to include ###
    workflows:
        digits:
            unsupervised_analysis: "config/digits/digits_unsupervised_analysis_config.yaml"
Tip
Recommended folder and naming scheme for config files: config/{dataset_name}/{dataset_name}_{module}_config.yaml.
Second, within the main Snakefile (workflow/Snakefile) we have to do three things:
- Load and parse all configurations into a structured dictionary.

      import yaml  # needed for yaml.safe_load below, if not already imported

      # load configs for all workflows and datasets
      config_wf = dict()

      for ds in config["workflows"]:
          for wf in config["workflows"][ds]:
              with open(config["workflows"][ds][wf], 'r') as stream:
                  try:
                      config_wf[ds+'_'+wf]=yaml.safe_load(stream)
                  except yaml.YAMLError as exc:
                      print(exc)
- Include the workflow/rules/digits.smk analysis snakefile from the rule subfolder (see last step).

      ##### load rules (one per dataset) #####
      include: os.path.join("rules", "digits.smk")
- Require all outputs from the used module as inputs to the target rule all.

      #### Target Rule ####
      rule all:
          input:
              #### digits Analysis
              rules.digits_unsupervised_analysis_all.input,
              ...
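The key scheme produced by the config-loading step can be sketched in plain Python. This is a minimal illustration, not the project's code: `load_workflow_configs` and `fake_loader` are hypothetical names, and the stand-in loader replaces `yaml.safe_load` so the sketch runs without PyYAML or any files on disk.

```python
from typing import Any, Callable, Dict


def load_workflow_configs(
    workflows: Dict[str, Dict[str, str]],
    loader: Callable[[str], Any],
) -> Dict[str, Any]:
    """Flatten {dataset: {module: path}} into {"<dataset>_<module>": config}."""
    config_wf: Dict[str, Any] = {}
    for ds, modules in workflows.items():
        for wf, path in modules.items():
            # Same key scheme as in the Snakefile: dataset + '_' + module
            config_wf[f"{ds}_{wf}"] = loader(path)
    return config_wf


def fake_loader(path: str) -> dict:
    """Hypothetical stand-in for opening the file and calling yaml.safe_load."""
    return {"loaded_from": path}


# Mirrors the "workflows" entry of config/config.yaml shown earlier
workflows = {
    "digits": {
        "unsupervised_analysis": "config/digits/digits_unsupervised_analysis_config.yaml"
    }
}

config_wf = load_workflow_configs(workflows, fake_loader)
print(sorted(config_wf))  # ['digits_unsupervised_analysis']
```

The resulting key, digits_unsupervised_analysis, is the one later used to hand the parsed configuration to the module.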
Finally, within the dedicated Snakefile for the analysis of digits (workflow/rules/digits.smk), we import the digits dataset using a custom rule and script before loading the specified version of the unsupervised_analysis module from a local copy or directly from GitHub. We provide it with the previously prepared configuration and use a prefix for all (*) loaded rules.
    # digits Analysis

    ### digits - Load data with custom rule and script ####
    rule load_digits:
        output:
            data = os.path.join('data','digits','digits_data.csv'),
            labels = os.path.join('data','digits','digits_labels.csv'),
        resources:
            mem_mb=1000,
        threads: 1
        conda:
            "../envs/sklearn.yaml"
        log:
            os.path.join("logs","rules","load_digits.log"),
        script:
            "../scripts/digits/load_digits.py"

    ### digits - Unsupervised Analysis ####
    module digits_unsupervised_analysis:
        snakefile:
            #"/path/to/clone/unsupervised_analysis/workflow/Snakefile"
            github("epigen/unsupervised_analysis", path="workflow/Snakefile", tag="v3.0.1")
        config:
            config_wf["digits_unsupervised_analysis"]

    use rule * from digits_unsupervised_analysis as digits_unsupervised_analysis_*
Tip
Recommended naming scheme:
- Datasets/projects always in camelCase (no _ recommended), e.g., ATACtreated.
- Filename for the analysis/dataset-specific rule file: ./workflow/rules/{dataset_name}.smk.
- Module name: {dataset_name}_{module}.
- Prefix for the loaded rules: {dataset_name}_{module}_.
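To make the naming scheme concrete, the following sketch derives all recommended names from a dataset name and a module name. The function `recommended_names` is a hypothetical helper written for illustration only; it is not part of the module or of Snakemake.

```python
def recommended_names(dataset_name: str, module: str) -> dict:
    """Derive the recommended file and rule names for one dataset/module pair."""
    module_name = f"{dataset_name}_{module}"
    return {
        # config/{dataset_name}/{dataset_name}_{module}_config.yaml
        "config_file": f"config/{dataset_name}/{module_name}_config.yaml",
        # ./workflow/rules/{dataset_name}.smk
        "rule_file": f"./workflow/rules/{dataset_name}.smk",
        # {dataset_name}_{module}
        "module_name": module_name,
        # {dataset_name}_{module}_
        "rule_prefix": f"{module_name}_",
    }


names = recommended_names("digits", "unsupervised_analysis")
for key, value in names.items():
    print(f"{key}: {value}")
```

For digits this reproduces the names used in the example above, e.g. the module name digits_unsupervised_analysis and the rule prefix digits_unsupervised_analysis_.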
Below we show selected results to illustrate an unsupervised analysis, mirroring the module's core features.
To visualize high-dimensional data we employed three different approaches: Principal Component Analysis (PCA; linear), Uniform/Density-preserving Manifold Approximation and Projection (dens/UMAP; non-linear), and Heatmaps (not shown).
Method | Metadata target | Feature pixel_0_2 | Clustering Leiden ModularityVertexPartition euclidean with knn=15 |
---|---|---|---|
PCA | [image: PCA of digits colored by metadata target] | [image: PCA of digits colored by feature pixel_0_2] | [image: PCA of digits colored by clustering] |
UMAP euclidean | [image: UMAP of digits colored by metadata target] | [image: UMAP of digits colored by feature pixel_0_2] | [image: UMAP of digits colored by clustering] |
densMAP euclidean | [image: densMAP of digits colored by metadata target] | [image: densMAP of digits colored by feature pixel_0_2] | [image: densMAP of digits colored by clustering] |
For clustering, i.e., grouping data points by similarity with respect to their features, we support Leiden, a graph-based clustering algorithm, applied directly to the UMAP k-nearest neighbors graph. For the analysis and validation of clustering results we provide clustree, external clustering indices for comparison to metadata (not shown), and internal clustering indices in combination with Multi-Criteria-Decision-Making (MCDM) using TOPSIS to find the "best" clustering.
[image: clustree for comparing cluster memberships across clustering results]
[image: Clustering results ranked by MCDM TOPSIS from best to worst]
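The idea behind ranking clusterings with TOPSIS can be sketched in a few lines of plain Python. This is a textbook version of the method (vector normalization, weighting, distance to the ideal and anti-ideal solutions) and not necessarily how the module implements it. The clustering names, the two indices, their values, and the equal weights are made up for illustration and are not results from the digits run.

```python
import math
from typing import List, Sequence


def topsis(matrix: Sequence[Sequence[float]],
           weights: Sequence[float],
           benefit: Sequence[bool]) -> List[float]:
    """Return a closeness score in [0, 1] per alternative (higher is better)."""
    n_crit = len(weights)
    # Vector-normalize each criterion (column), guarding against all-zero columns
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    columns = list(zip(*weighted))
    # Ideal solution: best value per criterion; anti-ideal: worst value
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, ideal)
        d_worst = math.dist(row, anti)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores


# Hypothetical internal indices per clustering:
# column 0 = silhouette-like (higher is better), column 1 = Davies-Bouldin-like (lower is better)
names = ["clustering_A", "clustering_B", "clustering_C"]
matrix = [[0.60, 0.8],
          [0.40, 1.2],
          [0.20, 1.5]]

scores = topsis(matrix, weights=[0.5, 0.5], benefit=[True, False])
ranking = [name for _, name in sorted(zip(scores, names), reverse=True)]
print(ranking)  # ['clustering_A', 'clustering_B', 'clustering_C']
```

Because clustering_A is best on both made-up indices, it coincides with the ideal solution and ranks first. With conflicting indices, the weights determine the trade-off.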
To visualize the digits dataset we used different dimensionality reduction methods, of which UMAP and densMAP captured the structure best. Then we investigated clustering results and found Leiden_euclidean_15_ModularityVertexPartition to fit the data best, according to internal clustering indices.
Here, we only showed a small selection of results. For a detailed description of all features, e.g., 3D interactive visualizations and diagnostics, check out the comprehensive module documentation.