release: v2.5.0 #260

Merged
merged 8 commits on Dec 27, 2024
Changes from 1 commit
feat: add model_selection sub-package for hyper-parameters (#257)
* add `model_selection` module

* classification -> inference

* clean-up

* add sklearn licenses

* c info

* add .coveragerc

* add unit tests
eonu authored Dec 27, 2024
commit 4c3ca389a753832da462e03c61b3a81286f09eb9
2 changes: 2 additions & 0 deletions .coveragerc
@@ -0,0 +1,2 @@
[run]
omit = "sequentia/model_selection/_validation.py"
148 changes: 111 additions & 37 deletions README.md
@@ -69,12 +69,15 @@ Some examples of how Sequentia can be used on sequence data include:

### Models

The following models provided by Sequentia all support variable length sequences.

#### [Dynamic Time Warping + k-Nearest Neighbors](https://sequentia.readthedocs.io/en/latest/sections/models/knn/index.html) (via [`dtaidistance`](https://github.com/wannesm/dtaidistance))

Dynamic Time Warping (DTW) is a distance measure that can be applied to two sequences of different length.
When used as a distance measure for the k-Nearest Neighbors (kNN) algorithm, this results in a simple yet
effective inference algorithm.
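
To make the distance concrete, here is a small sketch (not part of this release; it calls `dtaidistance` directly on made-up data rather than going through Sequentia):

```python
import numpy as np
from dtaidistance import dtw

# Two univariate sequences of different lengths (toy data for illustration)
s1 = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
s2 = np.array([0.0, 2.0, 1.0, 0.0])

# DTW distance between the two sequences
print(dtw.distance(s1, s2))
```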

- [x] Classification
- [x] Regression
- [x] Variable length sequences
- [x] Multivariate real-valued observations
- [x] Sakoe–Chiba band global warping constraint
- [x] Dependent and independent feature warping (DTWD/DTWI)
@@ -83,19 +86,28 @@ The following models provided by Sequentia all support variable length sequences

#### [Hidden Markov Models](https://sequentia.readthedocs.io/en/latest/sections/models/hmm/index.html) (via [`hmmlearn`](https://github.com/hmmlearn/hmmlearn))

A Hidden Markov Model (HMM) is a state-based statistical model which represents a sequence as
a series of observations that are emitted from a collection of latent hidden states which form
an underlying Markov chain. Each hidden state has an emission distribution that models its observations.

Expectation-maximization via the Baum-Welch algorithm (or forward-backward algorithm) [[1]](#references) is used to
derive a maximum likelihood estimate of the Markov chain probabilities and emission distribution parameters
based on the provided training sequence data.
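
As a rough sketch of the mechanism described above (this calls `hmmlearn` directly with toy data; it illustrates Baum-Welch fitting and forward-algorithm scoring, not Sequentia's own HMM wrapper):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Two concatenated training sequences (lengths 6 and 4) with a single feature
X = np.random.default_rng(0).normal(size=(10, 1))
lengths = [6, 4]

# Baum-Welch (EM) estimates the transition matrix and Gaussian emission parameters
model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=10)
model.fit(X, lengths)

# The forward algorithm computes the log-likelihood of a sequence under the fitted model
log_likelihood = model.score(X[:6])
```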

- [x] Classification
- [x] Variable length sequences
- [x] Multivariate real-valued observations (modeled with Gaussian mixture emissions)
- [x] Univariate categorical observations (modeled with discrete emissions)
- [x] Linear, left-right and ergodic topologies
- [x] Multi-processed predictions

### Scikit-Learn compatibility

**Sequentia (≥2.0) is compatible with the Scikit-Learn API (≥1.4), enabling rapid development and prototyping of sequential models.**

The integration relies on the use of [metadata routing](https://scikit-learn.org/stable/metadata_routing.html),
which means that in most cases, the only necessary change is to add a `lengths` keyword argument to provide
sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.

### Similar libraries

@@ -134,10 +146,7 @@ The [Free Spoken Digit Dataset](https://sequentia.readthedocs.io/en/latest/secti
- 1500 used for training, 1500 used for testing (split via label stratification)
- 13 features ([MFCCs](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum))
- Only the first feature was used as not all of the above libraries support multivariate sequences
- Sequence length statistics: (min 6, median 17, max 92)

Each result measures the total time taken to complete training and prediction repeated 10 times.

@@ -162,13 +171,13 @@ The latest stable version of Sequentia can be installed with the following comma
pip install sequentia
```

### C libraries

For optimal performance when using any of the k-NN based models, it is important that the correct `dtaidistance` C libraries are accessible.

Please see the [`dtaidistance` installation guide](https://dtaidistance.readthedocs.io/en/latest/usage/installation.html) for troubleshooting if you run into C compilation issues, or if setting `use_c=True` on k-NN based models results in a warning.

You can use the following to check if the appropriate C libraries are available.

```python
from dtaidistance import dtw
@@ -185,26 +194,25 @@ Documentation for the package is available on [Read The Docs](https://sequentia.

## Examples

Demonstration of classifying multivariate sequences into two classes using the `KNNClassifier`.

This example also shows a typical preprocessing workflow, as well as compatibility with
Scikit-Learn for pipelining and hyper-parameter optimization.

---

First, we create some sample multivariate input data consisting of three sequences with two features.

- Sequentia expects sequences to be concatenated and represented as a single NumPy array.
- Sequence lengths are provided separately and used to decode the sequences when needed.

This avoids the need for complex structures such as lists of nested arrays with different lengths,
or a 3D array with wasteful and annoying padding.

```python
import numpy as np

# Sequence data
X = np.array([
    # Sequence 1 - Length 3
    [1.2 , 7.91],
@@ -226,33 +234,99 @@ lengths = np.array([3, 5, 2])

# Sequence classes
y = np.array([0, 1, 1])
```

With this data, we can train a `KNNClassifier` and use it for prediction and scoring.

**Note**: Each of the `fit()`, `predict()` and `score()` methods requires the sequence lengths
to be provided in addition to the sequence data `X` and labels `y`.

```python
from sequentia.models import KNNClassifier

# Initialize and fit the classifier
clf = KNNClassifier(k=1)
clf.fit(X, y, lengths=lengths)

# Make predictions based on the provided sequences
y_pred = clf.predict(X, lengths=lengths)

# Calculate the accuracy of the predictions against the provided labels
acc = clf.score(X, y, lengths=lengths)
```

Alternatively, we can use [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html) to build a more complex preprocessing pipeline:

1. Individually denoise each sequence by applying a [median filter](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/filters.html#sequentia.preprocessing.transforms.median_filter).
2. Individually [standardize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) each sequence by subtracting the mean and dividing by the standard deviation of each feature.
3. Reduce the dimensionality of the data to a single feature by using [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
4. Pass the resulting transformed data into a `KNNClassifier`.

**Note**: Steps 1 and 2 use [`IndependentFunctionTransformer`](https://sequentia.readthedocs.io/en/latest/sections/preprocessing/transforms/function_transformer.html#sequentia.preprocessing.transforms.IndependentFunctionTransformer) provided by Sequentia to
apply the specified transformation to each sequence in `X` individually, rather than using
[`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) from Scikit-Learn which would transform the entire `X`
array once, treating it as a single sequence.

```python
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

from sequentia.preprocessing import IndependentFunctionTransformer, median_filter

# Create a preprocessing pipeline that feeds into a KNNClassifier
# 1. Individually denoise each sequence by applying a median filter to each feature
# 2. Individually standardize each sequence by subtracting the mean and dividing by the s.d. of each feature
# 3. Reduce the dimensionality of the data to a single feature by using PCA
# 4. Pass the resulting transformed data into a KNNClassifier
pipeline = Pipeline([
    ('denoise', IndependentFunctionTransformer(median_filter)),
    ('scale', IndependentFunctionTransformer(scale)),
    ('pca', PCA(n_components=1)),
    ('knn', KNNClassifier(k=1))
])

# Fit the pipeline to the data
pipeline.fit(X, y, lengths=lengths)

# Predict classes for the sequences
y_pred = pipeline.predict(X, lengths=lengths)

# Calculate the accuracy of the predictions against the provided labels
acc = pipeline.score(X, y, lengths=lengths)
```

For hyper-parameter optimization, Sequentia provides a `sequentia.model_selection` sub-package
that includes most of the hyper-parameter search and cross-validation methods provided by
[`sklearn.model_selection`](https://scikit-learn.org/stable/api/sklearn.model_selection.html),
but adapted to work with sequences.

For instance, we can perform a grid search with k-fold cross-validation stratifying over labels
in order to find an optimal value for the number of neighbors in `KNNClassifier` for the
above pipeline.

```python
from sequentia.model_selection import StratifiedKFold, GridSearchCV

# Define hyper-parameter search and specify cross-validation method
search = GridSearchCV(
    # Re-use the above pipeline
    estimator=Pipeline([
        ('denoise', IndependentFunctionTransformer(median_filter)),
        ('scale', IndependentFunctionTransformer(scale)),
        ('pca', PCA(n_components=1)),
        ('knn', KNNClassifier(k=1))
    ]),
    # Try a range of values of k
    param_grid={"knn__k": [1, 2, 3, 4, 5]},
    # Specify k-fold cross-validation with label stratification using 4 splits
    cv=StratifiedKFold(n_splits=4),
)

# Perform cross-validation over accuracy and retrieve the best model
search.fit(X, y, lengths=lengths)
clf = search.best_estimator_

# Make predictions using the best model and calculate accuracy
acc = clf.score(X, y, lengths=lengths)
```
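
As a brief follow-up (not part of this diff), the fitted search object should also expose the usual Scikit-Learn attributes for inspecting the search, assuming the wrapped `GridSearchCV` preserves them:

```python
# Best hyper-parameter values found during the search, e.g. {"knn__k": 3}
print(search.best_params_)

# Mean cross-validated accuracy achieved by the best setting
print(search.best_score_)
```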

## Acknowledgments

In earlier versions of the package, an approximate DTW implementation [`fastdtw`](https://github.com/slaypni/fastdtw) was used in hopes of speeding up k-NN predictions, as the authors of the original FastDTW paper [[2]](#references) claim that approximated DTW alignments can be computed in linear memory and time, compared to the O(N<sup>2</sup>) runtime complexity of the usual exact DTW implementation.
@@ -326,7 +400,7 @@ All contributions to this repository are greatly appreciated. Contribution guide

Sequentia is released under the [MIT](https://opensource.org/licenses/MIT) license.

Certain parts of the source code are heavily adapted from [Scikit-Learn](https://scikit-learn.org/).
Such files contain a copy of [their license](https://github.com/scikit-learn/scikit-learn/blob/main/COPYING).

---
8 changes: 3 additions & 5 deletions docs/source/_static/css/toc.css
@@ -1,9 +1,7 @@
/* Adds overflow to the Table of Contents on the side bar */
div.sphinxsidebarwrapper {
  overflow-x: auto;
}

/* Hides any API reference lists in the Table of Contents */
div.sphinxsidebarwrapper a[href="#definitions"] + ul > li > ul {
  display: none;
}
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -42,6 +42,7 @@ Features

sections/models/index
sections/preprocessing/index
sections/model_selection/index
sections/datasets/index
sections/configuration

5 changes: 4 additions & 1 deletion docs/source/sections/configuration.rst
@@ -13,7 +13,10 @@ API Reference
~sequentia.enums.TopologyMode
~sequentia.enums.TransitionMode

|
.. _definitions:

Definitions
^^^^^^^^^^^

.. automodule:: sequentia.enums
   :members:
5 changes: 5 additions & 0 deletions docs/source/sections/datasets/digits.rst
@@ -4,4 +4,9 @@ Digits
API reference
-------------

.. _definitions:

Definitions
^^^^^^^^^^^

.. autofunction:: sequentia.datasets.load_digits
5 changes: 5 additions & 0 deletions docs/source/sections/datasets/gene_families.rst
@@ -4,4 +4,9 @@ Gene Families
API reference
-------------

.. _definitions:

Definitions
^^^^^^^^^^^

.. autofunction:: sequentia.datasets.load_gene_families
5 changes: 4 additions & 1 deletion docs/source/sections/datasets/index.rst
@@ -49,7 +49,10 @@ Properties
~sequentia.datasets.base.SequentialDataset.lengths
~sequentia.datasets.base.SequentialDataset.y

|
.. _definitions:

Definitions
^^^^^^^^^^^

.. autoclass:: sequentia.datasets.base.SequentialDataset
   :members:
20 changes: 20 additions & 0 deletions docs/source/sections/model_selection/index.rst
@@ -0,0 +1,20 @@
Model Selection
===============

.. toctree::
   :titlesonly:

   searching.rst
   splitting.rst

----

For validating models and performing hyper-parameter selection, it is common
to use cross-validation methods such as those in :mod:`sklearn.model_selection`.

Although :mod:`sklearn.model_selection` is partially compatible with Sequentia,
we define our own wrapped versions of certain classes and functions to allow
support for sequences.

- :ref:`searching` defines methods for searching hyper-parameter spaces in different ways, such as :class:`sequentia.model_selection.GridSearchCV`.
- :ref:`splitting` defines methods for partitioning data into training/validation splits for cross-validation, such as :class:`sequentia.model_selection.KFold`.
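
A minimal sketch of the intended usage, combining these wrappers with classes documented
elsewhere in the package (``KNNClassifier``, ``load_digits`` and the dataset's ``X``/``y``/``lengths``
properties are taken from other pages; the exact ``load_digits`` arguments are an assumption):

.. code-block:: python

    from sequentia.datasets import load_digits
    from sequentia.model_selection import GridSearchCV, StratifiedKFold
    from sequentia.models import KNNClassifier

    # Load a small sequence dataset (assumed signature)
    digits = load_digits(digits=[0, 1])

    # Cross-validate the number of neighbors with a label-stratified k-fold split
    search = GridSearchCV(
        estimator=KNNClassifier(k=1),
        param_grid={"k": [1, 3, 5]},
        cv=StratifiedKFold(n_splits=3),
    )
    search.fit(digits.X, digits.y, lengths=digits.lengths)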