[ENH] Refactored and improved `Catch22` transformer - support for column names, short aliases, refactor to `pd.Series`, `sktime` native parallelization #6002

julnow · 2024-02-25T16:08:04Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Simplifies the code to be in accordance with the current version of sktime, similarly to #5981.
Adds option to add column names in the output, with short names option similarly to DynamicsAndNeuralSystems/pycatch22#18

Does your contribution introduce a new dependency? If yes, which one?

No

What should a reviewer concentrate their feedback on?

[ ] Changed logic from looping over nested_univ series to a single pd.Series approach.
[ ] Private catch22 from _catch22_numba.py functions are now passed through a dict
[ ] _catch22_numba.py now accesses variables passed through a dict, not indirectly, to simplify the logic of passing arguments in the main catch22.py module

Did you add any tests for the change?

No

Any other comments?

PR checklist

For all contributions

I've added myself to the list of contributors with any new badges I've earned :-)
How to: add yourself to the all-contributors file in the sktime root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
See here for full badge reference
Optionally, I've added myself and possibly others to the CODEOWNERS file - do this if you want to become the owner or maintainer of an estimator you added.
See here for further details on the algorithm maintainer role.
The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

julnow · 2024-02-25T16:10:49Z

@yarnabrina @fkiraly I had to reopen it after syncing, as the last one closed automatically.

It's still work in progress, I have to implement proper naming of the columns (which was the main goal).

I also think of passing a dict of variables to _transform_case methods, and creating a wrapper for the original functions which would unpack needed variables from **kwargs passed to each _get_feature_function(feature)(*args) call. What do you think about it @fkiraly ?

julnow · 2024-02-25T16:14:04Z

Also, what about #5990?

fkiraly · 2024-02-25T16:57:40Z

Also, what about #5990

This also applies to Catch22 - it implements an unnecessary loop across time series instances - and could be much simplified.

I've done the same recently with Catch22Wrapper: #5983, the recipe is:

change tags: X_inner_mtype to pd.Series or pd.DataFrame (whichever more convenient), change univariate-only to True
remove all loops over instances and variables
deprecate n_jobs, add the deprecation message, switch use of n_jobs to use set_config for the time of the deprecation

It's up to you whether you do this in one PR, separate PR, or not at all - it's probably easier to separate this out, if you're changing logic as well.

fkiraly · 2024-02-25T16:58:15Z

FYI, linting (code formatting) checks are failing, see here for a guide how to ensure you follow these, and how to set up the checks locally: https://www.sktime.net/en/stable/developer_guide/coding_standards.html

julnow · 2024-02-25T18:19:27Z

FYI, linting (code formatting) checks are failing, see here for a guide how to ensure you follow these, and how to set up the checks locally: https://www.sktime.net/en/stable/developer_guide/coding_standards.html

Yes, I changed the typing from list to List so I hope it works now.

Also, what about #5990

This also applies to Catch22 - it implements an unnecessary loop across time series instances - and could be much simplified.

I've done the same recently with Catch22Wrapper: #5983, the recipe is:

change tags: X_inner_mtype to pd.Series or pd.DataFrame (whichever more convenient), change univariate-only to True

remove all loops over instances and variables

deprecate n_jobs, add the deprecation message, switch use of n_jobs to use set_config for the time of the deprecation

It's up to you whether you do this in one PR, separate PR, or not at all - it's probably easier to separate this out, if you're changing logic as well.

My only concern is that right now I use it in my project for multivariate time series, with a nested_univ dataframe as input. How would you recommend dealing it with it? Should the implementation allow the multivariate dataframe, or it would be better to run this the transform method in a loop and concat columns in the end (in my final implementation of this method in my project, not here)? It wouldn't be very elegant I'm afraid.

Output like in https://www.sktime.net/en/stable/examples/transformation/feature_extraction_with_tsfresh.html (Multivariate time series classification data) would be ideal for me. - InputColumnName_FeatureName (or its index like in your implementation here: #5983)

fkiraly · 2024-02-25T18:59:19Z

My only concern is that right now I use it in my project for multivariate time series, with a nested_univ dataframe as input.

Can you explain how/why this would pose a problem? You can still pass nested dataframes or other sktime compatible containers after the proposed refactor, it's just that the iteration would be handled by the base class boilerplate.

The column naming is also automatic, it is variablename__featurename if the iteration is done by the base class.

fkiraly · 2024-02-26T10:44:22Z

(FYI, linting again - it says it wants double quotes instead of single quotes for strings)

julnow · 2024-02-26T11:42:02Z

My only concern is that right now I use it in my project for multivariate time series, with a nested_univ dataframe as input.

Can you explain how/why this would pose a problem? You can still pass nested dataframes or other sktime compatible containers after the proposed refactor, it's just that the iteration would be handled by the base class boilerplate.

The column naming is also automatic, it is variablename__featurename if the iteration is done by the base class.

Okay great I see it now, thanks!

fkiraly

Excellent! Some comments.

Blocking (for discussion or change):

check_n_jobs is no longer used, and the change seems to affect logic. I would say, let's remove it from this PR, and if you think it is needed, discuss in another PR or issue.

Non-blocking:

why do we need two dictionaries, METHODS_DICT and CATCH24_METHODS_DICT - would it not simplify multiple places? Similarly, for short feature names. Odd that these are/were treated separately.
docstring should mention this also supports catch-24 features (plus mean, std)
why is the catch24 argument needed and mean/std treated completely differently? That is confusing. Why not simply add mean and std as additional features to the large list? Needs to be deprecate, of course.
outlier_norm docstring should mention which two feaures it affects.
the _transform_single_feature method is very unpleasant, as it is private, yet used in random classifiers. Not necessary for this PR, but a question: can you come up with a good way to decouple this, e.g., joint utility dispatcher or similar? It is not good style if estimtors keep calling each others' private methods, and risky. As a warning, if it is not immediately obvious, I would recommend not looking into it as things like this can develop into time consuming rabbit holes...

julnow · 2024-03-04T12:10:03Z

Excellent! Some comments.

Blocking (for discussion or change):

check_n_jobs is no longer used, and the change seems to affect logic. I would say, let's remove it from this PR, and if you think it is needed, discuss in another PR or issue.

Okay, removed

Non-blocking:

why do we need two dictionaries, METHODS_DICT and CATCH24_METHODS_DICT - would it not simplify multiple places? Similarly, for short feature names. Odd that these are/were treated separately.

I think it's due to the fact that std and mean weren't mentioned in the original publication: https://link.springer.com/article/10.1007/s10618-019-00647-x and were added later, so are not really a core part of catch22. That's why I created two seperate dicts, but this could be discussed in the future, but all the other packages (pycatch22, julia one etc.) also seperate the logic of core 22 features and these additional two.

docstring should mention this also supports catch-24 features (plus mean, std)

added

why is the catch24 argument needed and mean/std treated completely differently? That is confusing. Why not simply add mean and std as additional features to the large list? Needs to be deprecate, of course.

Same as above, maybe features should be "catch22" by default or "catch24" instead of "all", but this could be changed later in my opinion.

outlier_norm docstring should mention which two feaures it affects.

Added

the _transform_single_feature method is very unpleasant, as it is private, yet used in random classifiers. Not necessary for this PR, but a question: can you come up with a good way to decouple this, e.g., joint utility dispatcher or similar? It is not good style if estimtors keep calling each others' private methods, and risky. As a warning, if it is not immediately obvious, I would recommend not looking into it as things like this can develop into time consuming rabbit holes...

I spent a long time with this method which is in fact not necessary in the catch22 transformer, but I left it for compatibility with the other module. This should definetely be resolved in the future, but I didn't have time - the other module should be accordingly rewritten.

fkiraly · 2024-03-04T12:26:27Z

so are not really a core part of catch22. That's why I created two seperate dicts, but this could be discussed in the future, but all the other packages (pycatch22, julia one etc.) also seperate the logic of core 22 features and these additional two.

I am aware, there's catch22 and catch24. What I mean, on the code side there doesn't need to be a distinction, they are just 24 features and sometimes people only want 22 of those. Imo there's no reason to internally separate dictionaries etc, there still can be a separation in the user interface, or having only 22 as the default.

fkiraly · 2024-03-04T12:27:14Z

Same as above, maybe features should be "catch22" by default or "catch24" instead of "all", but this could be changed later in my opinion.

Sure - when I say "non-blocking", it means, fine if changed later or not at all.

fkiraly · 2024-03-04T12:28:42Z

I spent a long time with this method which is in fact not necessary in the catch22 transformer ... This should definetely be resolved in the future, but I didn't have time

Agreed - also non-blocking, as said. Just in case you had an idea on how to resolve this quickly, reading this I see that it probably is not that easy to clean this up.

julnow · 2024-03-04T12:58:39Z

I am aware, there's catch22 and catch24. What I mean, on the code side there doesn't need to be a distinction, they are just 24 features and sometimes people only want 22 of those. Imo there's no reason to internally separate dictionaries etc, there still can be a separation in the user interface, or having only 22 as the default.

Okay, do you think it could be done in the next PR in parallel with modifying the selection of features (removing catch24 bool argument)?

fkiraly · 2024-03-04T14:40:12Z

Okay, do you think it could be done in the next PR

Sure - when it's listed "non-blocking", it means it does not prevent from getting this PR merged if not addressed, and can well be addressed in another PR, or not a tall.

sktime/transformations/panel/catch22.py

fkiraly

Looks great to me!

Two small but blocking requests:

parameter sequence, see above
parameter coverage in tests, see above

fkiraly

Thanks - all addressed.

Once (or - if) tests pass, we can merge, from my side.

julnow · 2024-03-19T13:26:00Z

@fkiraly I have one last question, as I cannot find it in documentation -- how should the n_jobs be passed in non-deprecated way?

fkiraly · 2024-03-19T22:59:23Z

the warning and docstring should tell you - via set_config and backend:params. Please raise an issue if you think that information is not easily accessible or obscure, because that would be a problem (and a place for improvement).

PR #6002 is great but it accidentally introduced coupling between the `catch22` module and the `numba` utilities, resulting in imports of the `numba` utilities even if `numba` was not present. This would lead to the confusing warning (`numba` not present) being raised in many unrelated cases. This PR decouples the module from the `numba` utilities. Also fixes accidental renaming of a feature that occurred in #6002, and adds a test to ensure feature names and their order remain the same (compared to 0.27.0 and prior).

julnow added 2 commits February 25, 2024 17:02

simply feature names selection

e251486

implement n_jobs ==-1 to use all cores

b8cdde5

julnow requested review from achieveordie, benHeid, fkiraly and yarnabrina as code owners February 25, 2024 16:08

fkiraly added module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing enhancement Adding new functionality labels Feb 25, 2024

replace list with List in typing

f6184cf

remove unnecesary empty lines

a491c23

julnow added 12 commits February 26, 2024 12:49

fix linting

c5fc5de

Merge branch 'sktime:main' into main

10aa0e2

remove unimplemented function

1f04b23

pass variables through dict

1afcfac

Remove loops

0a13fc2

implement short feature names

7faaec2

fixed linting

c8b14c4

remove case match for compatibility with python 3.9

2e26bb9

fix annotation

e520dfe

pass a dict to _get_feature_function

c987876

fix passing dict

5757ea8

Seperate output col names logic

40706a4

julnow requested a review from fkiraly March 3, 2024 22:58

fkiraly requested changes Mar 3, 2024

View reviewed changes

julnow added 4 commits March 4, 2024 11:49

add catch24 features names

9b23f9a

Update __init__.py

fcf8894

add outlier_norm desc

14d2936

remove trailing space

f47adbc

fkiraly changed the title ~~[ENH] Refactor Catch22 transformer to simplify debugging and use short names in results~~ [ENH] Refactor Catch22 transformer - support for column names, short aliases, refactor to pd.Series, sktime native parallelization Mar 4, 2024

fkiraly linked an issue Mar 4, 2024 that may be closed by this pull request

[ENH] Add options for short names in catch22 #5981

Closed

julnow requested a review from fkiraly March 4, 2024 21:20

fkiraly reviewed Mar 4, 2024

View reviewed changes

sktime/transformations/panel/catch22.py Show resolved Hide resolved

fkiraly reviewed Mar 4, 2024

View reviewed changes

sktime/transformations/panel/catch22.py Outdated Show resolved Hide resolved

fkiraly requested changes Mar 4, 2024

View reviewed changes

julnow added 3 commits March 5, 2024 13:02

unify catch24 feature names

74fcf14

n_jobs arg should remain in its old position

d7a108a

add more tests

3b9b636

fkiraly approved these changes Mar 5, 2024

View reviewed changes

fkiraly merged commit b931ca9 into sktime:main Mar 5, 2024
54 checks passed

fkiraly mentioned this pull request Mar 10, 2024

[MNT] decouple catch22 module from numba utilities #6101

Merged

julnow mentioned this pull request Mar 20, 2024

[DOC] Improve documentation on multithreading and parallelization #6174

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Refactored and improved `Catch22` transformer - support for column names, short aliases, refactor to `pd.Series`, `sktime` native parallelization #6002

[ENH] Refactored and improved `Catch22` transformer - support for column names, short aliases, refactor to `pd.Series`, `sktime` native parallelization #6002

julnow commented Feb 25, 2024 •

edited

Loading

julnow commented Feb 25, 2024

julnow commented Feb 25, 2024

fkiraly commented Feb 25, 2024

fkiraly commented Feb 25, 2024

julnow commented Feb 25, 2024 •

edited

Loading

fkiraly commented Feb 25, 2024 •

edited

Loading

fkiraly commented Feb 26, 2024

julnow commented Feb 26, 2024

fkiraly left a comment

julnow commented Mar 4, 2024

fkiraly commented Mar 4, 2024

fkiraly commented Mar 4, 2024

fkiraly commented Mar 4, 2024

julnow commented Mar 4, 2024 •

edited

Loading

fkiraly commented Mar 4, 2024 •

edited

Loading

fkiraly left a comment

fkiraly left a comment

julnow commented Mar 19, 2024

fkiraly commented Mar 19, 2024

[ENH] Refactored and improved Catch22 transformer - support for column names, short aliases, refactor to pd.Series, sktime native parallelization #6002

[ENH] Refactored and improved Catch22 transformer - support for column names, short aliases, refactor to pd.Series, sktime native parallelization #6002

Conversation

julnow commented Feb 25, 2024 • edited Loading

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

For all contributions

julnow commented Feb 25, 2024

julnow commented Feb 25, 2024

fkiraly commented Feb 25, 2024

fkiraly commented Feb 25, 2024

julnow commented Feb 25, 2024 • edited Loading

fkiraly commented Feb 25, 2024 • edited Loading

fkiraly commented Feb 26, 2024

julnow commented Feb 26, 2024

fkiraly left a comment

Choose a reason for hiding this comment

julnow commented Mar 4, 2024

fkiraly commented Mar 4, 2024

fkiraly commented Mar 4, 2024

fkiraly commented Mar 4, 2024

julnow commented Mar 4, 2024 • edited Loading

fkiraly commented Mar 4, 2024 • edited Loading

fkiraly left a comment

Choose a reason for hiding this comment

fkiraly left a comment

Choose a reason for hiding this comment

julnow commented Mar 19, 2024

fkiraly commented Mar 19, 2024

[ENH] Refactored and improved `Catch22` transformer - support for column names, short aliases, refactor to `pd.Series`, `sktime` native parallelization #6002

[ENH] Refactored and improved `Catch22` transformer - support for column names, short aliases, refactor to `pd.Series`, `sktime` native parallelization #6002

julnow commented Feb 25, 2024 •

edited

Loading

julnow commented Feb 25, 2024 •

edited

Loading

fkiraly commented Feb 25, 2024 •

edited

Loading

julnow commented Mar 4, 2024 •

edited

Loading

fkiraly commented Mar 4, 2024 •

edited

Loading