release: v2.5.0 #260

Merged
merged 8 commits on Dec 27, 2024
Changes from 1 commit
feat: add model_selection sub-package for hyper-parameters (#257)
* add `model_selection` module

* classification -> inference

* clean-up

* add sklearn licenses

* c info

* add .coveragerc

* add unit tests
eonu authored Dec 27, 2024
commit 4c3ca389a753832da462e03c61b3a81286f09eb9
2 changes: 2 additions & 0 deletions .coveragerc
@@ -0,0 +1,2 @@
[run]
omit = "sequentia/model_selection/_validation.py"
148 changes: 111 additions & 37 deletions README.md
@@ -69,12 +69,15 @@ Some examples of how Sequentia can be used on sequence data include:

### Models

The following models provided by Sequentia all support variable length sequences.

#### [Dynamic Time Warping + k-Nearest Neighbors](https://sequentia.readthedocs.io/en/latest/sections/models/knn/index.html) (via [`dtaidistance`](https://github.com/wannesm/dtaidistance))

Dynamic Time Warping (DTW) is a distance measure that can be applied to two sequences of different length.
When used as a distance measure for the k-Nearest Neighbors (kNN) algorithm, this results in a simple yet
effective inference algorithm.
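
To make the distance concrete, here is a small sketch (not part of this release; it calls `dtaidistance` directly on made-up data rather than going through Sequentia):

```python
import numpy as np
from dtaidistance import dtw

# Two univariate sequences of different lengths (toy data for illustration)
s1 = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
s2 = np.array([0.0, 2.0, 1.0, 0.0])

# DTW distance between the two sequences
print(dtw.distance(s1, s2))
```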

- [x] Classification
- [x] Regression
- [x] Variable length sequences
- [x] Multivariate real-valued observations
- [x] Sakoe–Chiba band global warping constraint
- [x] Dependent and independent feature warping (DTWD/DTWI)
@@ -83,19 +86,28 @@ The following models provided by Sequentia all support variable length sequences

#### [Hidden Markov Models](https://sequentia.readthedocs.io/en/latest/sections/models/hmm/index.html) (via [`hmmlearn`](https://github.com/hmmlearn/hmmlearn))

A Hidden Markov Model (HMM) is a state-based statistical model which represents a sequence as
a series of observations that are emitted from a collection of latent hidden states which form
an underlying Markov chain. Each hidden state has an emission distribution that models its observations.

Expectation-maximization via the Baum-Welch algorithm (or forward-backward algorithm) [[1]](#references) is used to
derive a maximum likelihood estimate of the Markov chain probabilities and emission distribution parameters
based on the provided training sequence data.
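
As a rough sketch of the mechanism described above (this calls `hmmlearn` directly with toy data; it illustrates Baum-Welch fitting and forward-algorithm scoring, not Sequentia's own HMM wrapper):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Two concatenated training sequences (lengths 6 and 4) with a single feature
X = np.random.default_rng(0).normal(size=(10, 1))
lengths = [6, 4]

# Baum-Welch (EM) estimates the transition matrix and Gaussian emission parameters
model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=10)
model.fit(X, lengths)

# The forward algorithm computes the log-likelihood of a sequence under the fitted model
log_likelihood = model.score(X[:6])
```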

- [x] Classification
- [x] Variable length sequences
- [x] Multivariate real-valued observations (modeled with Gaussian mixture emissions)
- [x] Univariate categorical observations (modeled with discrete emissions)
- [x] Linear, left-right and ergodic topologies
- [x] Multi-processed predictions

### Scikit-Learn compatibility

**Sequentia (≥2.0) is compatible with the Scikit-Learn API (≥1.4), enabling rapid development and prototyping of sequential models.**

The integration relies on the use of [metadata routing](https://scikit-learn.org/stable/metadata_routing.html),
which means that in most cases, the only necessary change is to add a `lengths` keyword argument to provide
sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.

### Similar libraries

@@ -134,10 +146,7 @@ The [Free Spoken Digit Dataset](https://sequentia.readthedocs.io/en/latest/secti
- 1500 used for training, 1500 used for testing (split via label stratification)
- 13 features ([MFCCs](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum))
- Only the first feature was used as not all of the above libraries support multivariate sequences
- Sequence length statistics: (min 6, median 17, max 92)

Each result measures the total time taken to complete training and prediction repeated 10 times.

@@ -162,13 +171,13 @@ The latest stable version of Sequentia can be installed with the following comma
pip install sequentia
```

### C libraries

For optimal performance when using any of the k-NN based models, it is important that the correct `dtaidistance` C libraries are accessible.

Please see the [`dtaidistance` installation guide](https://dtaidistance.readthedocs.io/en/latest/usage/installation.html) for troubleshooting if you run into C compilation issues, or if setting `use_c=True` on k-NN based models results in a warning.

You can use the following to check if the appropriate C libraries are available.

```python
from dtaidistance import dtw
@@ -185,26 +194,25 @@ Documentation for the package is available on [Read The Docs](https://sequentia.

## Examples

Demonstration of classifying multivariate sequences into two classes using the `KNNClassifier`.

This example also shows a typical preprocessing workflow, as well as compatibility with
Scikit-Learn for pipelining and hyper-parameter optimization.

---

First, we create some sample multivariate input data consisting of three sequences with two features.

- Sequentia expects sequences to be concatenated and represented as a single NumPy array.
- Sequence lengths are provided separately and used to decode the sequences when needed.

This avoids the need for complex structures such as lists of nested arrays with different lengths,
or a 3D array with wasteful and annoying padding.

```python
import numpy as np

# Sequence data
X = np.array([
    # Sequence 1 - Length 3
    [1.2 , 7.91],
@@ -226,33 +234,99 @@ lengths = np.array([3, 5, 2])

# Sequence classes
y = np.array([0, 1, 1])
```

With this data, we can train a `KNNClassifier` and use it for prediction and scoring.

**Note**: Each of the `fit()`, `predict()` and `score()` methods requires the sequence lengths
to be provided in addition to the sequence data `X` and labels `y`.

```python
from sequentia.models import KNNClassifier

# Initialize and fit the classifier
clf = KNNClassifier(k=1)
clf.fit(X, y, lengths=lengths)

# Make predictions based on the provided sequences
y_pred = clf.predict(X, lengths=lengths)

# Calculate the accuracy of the predictions against the provided labels
acc = clf.score(X, y, lengths=lengths)
```

Alternatively, we can use [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html) to build a more complex preprocessing pipeline:

1. Individually denoise each sequence by applying a [median filter](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/filters.html#sequentia.preprocessing.transforms.median_filter).
2. Individually [standardize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) each sequence by subtracting the mean and dividing by the standard deviation of each feature.
3. Reduce the dimensionality of the data to a single feature by using [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
4. Pass the resulting transformed data into a `KNNClassifier`.

**Note**: Steps 1 and 2 use [`IndependentFunctionTransformer`](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/function_transformer.html#sequentia.preprocessing.transforms.IndependentFunctionTransformer) provided by Sequentia to
apply the specified transformation to each sequence in `X` individually, rather than using
[`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) from Scikit-Learn which would transform the entire `X`
array once, treating it as a single sequence.

```python
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

from sequentia.preprocessing import IndependentFunctionTransformer, median_filter

# Create a preprocessing pipeline that feeds into a KNNClassifier
# 1. Individually denoise each sequence by applying a median filter to each feature
# 2. Individually standardize each sequence by subtracting the mean and dividing by the s.d. of each feature
# 3. Reduce the dimensionality of the data to a single feature by using PCA
# 4. Pass the resulting transformed data into a KNNClassifier
pipeline = Pipeline([
    ('denoise', IndependentFunctionTransformer(median_filter)),
    ('scale', IndependentFunctionTransformer(scale)),
    ('pca', PCA(n_components=1)),
    ('knn', KNNClassifier(k=1))
])

# Fit the pipeline to the data
pipeline.fit(X, y, lengths=lengths)

# Predict classes for the sequences
y_pred = pipeline.predict(X, lengths=lengths)

# Calculate the accuracy of the predictions against the provided labels
acc = pipeline.score(X, y, lengths=lengths)
```

For hyper-parameter optimization, Sequentia provides a `sequentia.model_selection` sub-package
that includes most of the hyper-parameter search and cross-validation methods provided by
[`sklearn.model_selection`](https://scikit-learn.org/stable/api/sklearn.model_selection.html),
but adapted to work with sequences.

For instance, we can perform a grid search with k-fold cross-validation stratifying over labels
in order to find an optimal value for the number of neighbors in `KNNClassifier` for the
above pipeline.

```python
from sequentia.model_selection import StratifiedKFold, GridSearchCV

# Define hyper-parameter search and specify cross-validation method
search = GridSearchCV(
    # Re-use the above pipeline
    estimator=Pipeline([
        ('denoise', IndependentFunctionTransformer(median_filter)),
        ('scale', IndependentFunctionTransformer(scale)),
        ('pca', PCA(n_components=1)),
        ('knn', KNNClassifier(k=1))
    ]),
    # Try a range of values of k
    param_grid={"knn__k": [1, 2, 3, 4, 5]},
    # Specify k-fold cross-validation with label stratification using 4 splits
    cv=StratifiedKFold(n_splits=4),
)

# Perform cross-validation over accuracy and retrieve the best model
search.fit(X, y, lengths=lengths)
clf = search.best_estimator_

# Make predictions using the best model and calculate accuracy
acc = clf.score(X, y, lengths=lengths)
```
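
As a brief follow-up (not part of this diff), the fitted search object should also expose the usual Scikit-Learn attributes for inspecting the search, assuming the wrapped `GridSearchCV` preserves them:

```python
# Best hyper-parameter values found during the search, e.g. {"knn__k": 3}
print(search.best_params_)

# Mean cross-validated accuracy achieved by the best setting
print(search.best_score_)
```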

## Acknowledgments

In earlier versions of the package, an approximate DTW implementation [`fastdtw`](https://github.com/slaypni/fastdtw) was used in hopes of speeding up k-NN predictions, as the authors of the original FastDTW paper [[2]](#references) claim that approximated DTW alignments can be computed in linear memory and time, compared to the O(N<sup>2</sup>) runtime complexity of the usual exact DTW implementation.
@@ -326,7 +400,7 @@ All contributions to this repository are greatly appreciated. Contribution guide

Sequentia is released under the [MIT](https://opensource.org/licenses/MIT) license.

Certain parts of the source code are heavily adapted from [Scikit-Learn](https://scikit-learn.org/).
Such files contain a copy of [their license](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING).

---
8 changes: 3 additions & 5 deletions docs/source/_static/css/toc.css
@@ -1,9 +1,7 @@
/* Adds overflow to the Table of Contents on the side bar */
div.sphinxsidebarwrapper {
  overflow-x: auto;
}

/* Hides any API reference lists in the Table of Contents */
div.sphinxsidebarwrapper a[href="#definitions"] + ul > li > ul {
  display: none;
}
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -42,6 +42,7 @@ Features

sections/models/index
sections/preprocessing/index
sections/model_selection/index
sections/datasets/index
sections/configuration

5 changes: 4 additions & 1 deletion docs/source/sections/configuration.rst
@@ -13,7 +13,10 @@ API Reference
~sequentia.enums.TopologyMode
~sequentia.enums.TransitionMode

|
.. _definitions:

Definitions
^^^^^^^^^^^

.. automodule:: sequentia.enums
   :members:
5 changes: 5 additions & 0 deletions docs/source/sections/datasets/digits.rst
@@ -4,4 +4,9 @@ Digits
API reference
-------------

.. _definitions:

Definitions
^^^^^^^^^^^

.. autofunction:: sequentia.datasets.load_digits
5 changes: 5 additions & 0 deletions docs/source/sections/datasets/gene_families.rst
@@ -4,4 +4,9 @@ Gene Families
API reference
-------------

.. _definitions:

Definitions
^^^^^^^^^^^

.. autofunction:: sequentia.datasets.load_gene_families
5 changes: 4 additions & 1 deletion docs/source/sections/datasets/index.rst
@@ -49,7 +49,10 @@ Properties
~sequentia.datasets.base.SequentialDataset.lengths
~sequentia.datasets.base.SequentialDataset.y

|
.. _definitions:

Definitions
^^^^^^^^^^^

.. autoclass:: sequentia.datasets.base.SequentialDataset
   :members:
20 changes: 20 additions & 0 deletions docs/source/sections/model_selection/index.rst
@@ -0,0 +1,20 @@
Model Selection
===============

.. toctree::
   :titlesonly:

   searching.rst
   splitting.rst

----

For validating models and performing hyper-parameter selection, it is common
to use cross-validation methods such as those in :mod:`sklearn.model_selection`.

Although :mod:`sklearn.model_selection` is partially compatible with Sequentia,
we define our own wrapped versions of certain classes and functions to allow
support for sequences.

- :ref:`searching` defines methods for searching hyper-parameter spaces in different ways, such as :class:`sequentia.model_selection.GridSearchCV`.
- :ref:`splitting` defines methods for partitioning data into training/validation splits for cross-validation, such as :class:`sequentia.model_selection.KFold`.
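
A minimal sketch of the intended usage, combining these wrappers with classes documented
elsewhere in the package (``KNNClassifier``, ``load_digits`` and the dataset's ``X``/``y``/``lengths``
properties are taken from other pages; the exact ``load_digits`` arguments are an assumption):

.. code-block:: python

    from sequentia.datasets import load_digits
    from sequentia.model_selection import GridSearchCV, StratifiedKFold
    from sequentia.models import KNNClassifier

    # Load a small sequence dataset (assumed signature)
    digits = load_digits(digits=[0, 1])

    # Cross-validate the number of neighbors with a label-stratified k-fold split
    search = GridSearchCV(
        estimator=KNNClassifier(k=1),
        param_grid={"k": [1, 3, 5]},
        cv=StratifiedKFold(n_splits=3),
    )
    search.fit(digits.X, digits.y, lengths=digits.lengths)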