Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Refactored and improved Catch22 transformer - support for column names, short aliases, refactor to pd.Series, sktime native parallelization #6002

Merged
merged 79 commits into from
Mar 5, 2024

Conversation

julnow
Copy link
Contributor

@julnow julnow commented Feb 25, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Simplifies the code to be in accordance with the current version of sktime, similarly to #5981.
Adds option to add column names in the output, with short names option similarly to DynamicsAndNeuralSystems/pycatch22#18

Does your contribution introduce a new dependency? If yes, which one?

No

What should a reviewer concentrate their feedback on?

[ ] Changed logic from looping over nested_univ series to a single pd.Series approach.
[ ] Private catch22 from _catch22_numba.py functions are now passed through a dict
[ ] _catch22_numba.py now accesses variables passed through a dict, not indirectly, to simplify the logic of passing arguments in the main catch22.py module

Did you add any tests for the change?

No

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the sktime root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • Optionally, I've added myself and possibly others to the CODEOWNERS file - do this if you want to become the owner or maintainer of an estimator you added.
    See here for further details on the algorithm maintainer role.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

@julnow
Copy link
Contributor Author

julnow commented Feb 25, 2024

@yarnabrina @fkiraly I had to reopen it after syncing, as the last one closed automatically.

It's still work in progress, I have to implement proper naming of the columns (which was the main goal).

I also think of passing a dict of variables to _transform_case methods, and creating a wrapper for the original functions which would unpack needed variables from **kwargs passed to each _get_feature_function(feature)(*args) call. What do you think about it @fkiraly ?

@julnow
Copy link
Contributor Author

julnow commented Feb 25, 2024

Also, what about #5990?

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 25, 2024

Also, what about #5990

This also applies to Catch22 - it implements an unnecessary loop across time series instances - and could be much simplified.

I've done the same recently with Catch22Wrapper: #5983, the recipe is:

  • change tags: X_inner_mtype to pd.Series or pd.DataFrame (whichever more convenient), change univariate-only to True
  • remove all loops over instances and variables
  • deprecate n_jobs, add the deprecation message, switch use of n_jobs to use set_config for the time of the deprecation

It's up to you whether you do this in one PR, separate PR, or not at all - it's probably easier to separate this out, if you're changing logic as well.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 25, 2024

FYI, linting (code formatting) checks are failing, see here for a guide how to ensure you follow these, and how to set up the checks locally: https://www.sktime.net/en/stable/developer_guide/coding_standards.html

@fkiraly fkiraly added module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing enhancement Adding new functionality labels Feb 25, 2024
@julnow
Copy link
Contributor Author

julnow commented Feb 25, 2024

FYI, linting (code formatting) checks are failing, see here for a guide how to ensure you follow these, and how to set up the checks locally: https://www.sktime.net/en/stable/developer_guide/coding_standards.html

Yes, I changed the typing from list to List so I hope it works now.

Also, what about #5990

This also applies to Catch22 - it implements an unnecessary loop across time series instances - and could be much simplified.

I've done the same recently with Catch22Wrapper: #5983, the recipe is:

  • change tags: X_inner_mtype to pd.Series or pd.DataFrame (whichever more convenient), change univariate-only to True
  • remove all loops over instances and variables
  • deprecate n_jobs, add the deprecation message, switch use of n_jobs to use set_config for the time of the deprecation

It's up to you whether you do this in one PR, separate PR, or not at all - it's probably easier to separate this out, if you're changing logic as well.

My only concern is that right now I use it in my project for multivariate time series, with a nested_univ dataframe as input. How would you recommend dealing it with it? Should the implementation allow the multivariate dataframe, or it would be better to run this the transform method in a loop and concat columns in the end (in my final implementation of this method in my project, not here)? It wouldn't be very elegant I'm afraid.

Output like in https://www.sktime.net/en/stable/examples/transformation/feature_extraction_with_tsfresh.html (Multivariate time series classification data) would be ideal for me. - InputColumnName_FeatureName (or its index like in your implementation here: #5983)

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 25, 2024

My only concern is that right now I use it in my project for multivariate time series, with a nested_univ dataframe as input.

Can you explain how/why this would pose a problem? You can still pass nested dataframes or other sktime compatible containers after the proposed refactor, it's just that the iteration would be handled by the base class boilerplate.

The column naming is also automatic, it is variablename__featurename if the iteration is done by the base class.

@fkiraly
Copy link
Collaborator

fkiraly commented Feb 26, 2024

(FYI, linting again - it says it wants double quotes instead of single quotes for strings)

@julnow
Copy link
Contributor Author

julnow commented Feb 26, 2024

My only concern is that right now I use it in my project for multivariate time series, with a nested_univ dataframe as input.

Can you explain how/why this would pose a problem? You can still pass nested dataframes or other sktime compatible containers after the proposed refactor, it's just that the iteration would be handled by the base class boilerplate.

The column naming is also automatic, it is variablename__featurename if the iteration is done by the base class.

Okay great I see it now, thanks!

@julnow julnow requested a review from fkiraly March 3, 2024 22:58
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Some comments.

Blocking (for discussion or change):

  • check_n_jobs is no longer used, and the change seems to affect logic. I would say, let's remove it from this PR, and if you think it is needed, discuss in another PR or issue.

Non-blocking:

  • why do we need two dictionaries, METHODS_DICT and CATCH24_METHODS_DICT - would it not simplify multiple places? Similarly, for short feature names. Odd that these are/were treated separately.
  • docstring should mention this also supports catch-24 features (plus mean, std)
  • why is the catch24 argument needed and mean/std treated completely differently? That is confusing. Why not simply add mean and std as additional features to the large list? Needs to be deprecate, of course.
  • outlier_norm docstring should mention which two feaures it affects.
  • the _transform_single_feature method is very unpleasant, as it is private, yet used in random classifiers. Not necessary for this PR, but a question: can you come up with a good way to decouple this, e.g., joint utility dispatcher or similar? It is not good style if estimtors keep calling each others' private methods, and risky. As a warning, if it is not immediately obvious, I would recommend not looking into it as things like this can develop into time consuming rabbit holes...

@julnow
Copy link
Contributor Author

julnow commented Mar 4, 2024

Excellent! Some comments.

Blocking (for discussion or change):

  • check_n_jobs is no longer used, and the change seems to affect logic. I would say, let's remove it from this PR, and if you think it is needed, discuss in another PR or issue.

Okay, removed

Non-blocking:

  • why do we need two dictionaries, METHODS_DICT and CATCH24_METHODS_DICT - would it not simplify multiple places? Similarly, for short feature names. Odd that these are/were treated separately.

I think it's due to the fact that std and mean weren't mentioned in the original publication: https://link.springer.com/article/10.1007/s10618-019-00647-x and were added later, so are not really a core part of catch22. That's why I created two seperate dicts, but this could be discussed in the future, but all the other packages (pycatch22, julia one etc.) also seperate the logic of core 22 features and these additional two.

  • docstring should mention this also supports catch-24 features (plus mean, std)

added

  • why is the catch24 argument needed and mean/std treated completely differently? That is confusing. Why not simply add mean and std as additional features to the large list? Needs to be deprecate, of course.

Same as above, maybe features should be "catch22" by default or "catch24" instead of "all", but this could be changed later in my opinion.

  • outlier_norm docstring should mention which two feaures it affects.

Added

  • the _transform_single_feature method is very unpleasant, as it is private, yet used in random classifiers. Not necessary for this PR, but a question: can you come up with a good way to decouple this, e.g., joint utility dispatcher or similar? It is not good style if estimtors keep calling each others' private methods, and risky. As a warning, if it is not immediately obvious, I would recommend not looking into it as things like this can develop into time consuming rabbit holes...

I spent a long time with this method which is in fact not necessary in the catch22 transformer, but I left it for compatibility with the other module. This should definetely be resolved in the future, but I didn't have time - the other module should be accordingly rewritten.

@fkiraly fkiraly changed the title [ENH] Refactor Catch22 transformer to simplify debugging and use short names in results [ENH] Refactor Catch22 transformer - support for column names, short aliases, refactor to pd.Series, sktime native parallelization Mar 4, 2024
@fkiraly
Copy link
Collaborator

fkiraly commented Mar 4, 2024

so are not really a core part of catch22. That's why I created two seperate dicts, but this could be discussed in the future, but all the other packages (pycatch22, julia one etc.) also seperate the logic of core 22 features and these additional two.

I am aware, there's catch22 and catch24. What I mean, on the code side there doesn't need to be a distinction, they are just 24 features and sometimes people only want 22 of those. Imo there's no reason to internally separate dictionaries etc, there still can be a separation in the user interface, or having only 22 as the default.

@fkiraly
Copy link
Collaborator

fkiraly commented Mar 4, 2024

Same as above, maybe features should be "catch22" by default or "catch24" instead of "all", but this could be changed later in my opinion.

Sure - when I say "non-blocking", it means, fine if changed later or not at all.

@fkiraly
Copy link
Collaborator

fkiraly commented Mar 4, 2024

I spent a long time with this method which is in fact not necessary in the catch22 transformer ... This should definetely be resolved in the future, but I didn't have time

Agreed - also non-blocking, as said. Just in case you had an idea on how to resolve this quickly, reading this I see that it probably is not that easy to clean this up.

@julnow
Copy link
Contributor Author

julnow commented Mar 4, 2024

I am aware, there's catch22 and catch24. What I mean, on the code side there doesn't need to be a distinction, they are just 24 features and sometimes people only want 22 of those. Imo there's no reason to internally separate dictionaries etc, there still can be a separation in the user interface, or having only 22 as the default.

Okay, do you think it could be done in the next PR in parallel with modifying the selection of features (removing catch24 bool argument)?

@fkiraly
Copy link
Collaborator

fkiraly commented Mar 4, 2024

Okay, do you think it could be done in the next PR

Sure - when it's listed "non-blocking", it means it does not prevent from getting this PR merged if not addressed, and can well be addressed in another PR, or not a tall.

@fkiraly fkiraly linked an issue Mar 4, 2024 that may be closed by this pull request
@julnow julnow requested a review from fkiraly March 4, 2024 21:20
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me!

Two small but blocking requests:

  • parameter sequence, see above
  • parameter coverage in tests, see above

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - all addressed.

Once (or - if) tests pass, we can merge, from my side.

@fkiraly fkiraly changed the title [ENH] Refactor Catch22 transformer - support for column names, short aliases, refactor to pd.Series, sktime native parallelization [ENH] Refactored and improved Catch22 transformer - support for column names, short aliases, refactor to pd.Series, sktime native parallelization Mar 5, 2024
@fkiraly fkiraly changed the title [ENH] Refactored and improved Catch22 transformer - support for column names, short aliases, refactor to pd.Series, sktime native parallelization [ENH] Refactored and improved Catch22 transformer - support for column names, short aliases, refactor to pd.Series, sktime native parallelization Mar 5, 2024
@fkiraly fkiraly merged commit b931ca9 into sktime:main Mar 5, 2024
54 checks passed
@julnow
Copy link
Contributor Author

julnow commented Mar 19, 2024

@fkiraly I have one last question, as I cannot find it in documentation -- how should the n_jobs be passed in non-deprecated way?

@fkiraly
Copy link
Collaborator

fkiraly commented Mar 19, 2024

the warning and docstring should tell you - via set_config and backend:params. Please raise an issue if you think that information is not easily accessible or obscure, because that would be a problem (and a place for improvement).

fkiraly added a commit that referenced this pull request Mar 20, 2024
PR #6002 is great but it
accidentally introduced coupling between the `catch22` module and the
`numba` utilities, resulting in imports of the `numba` utilities even if
`numba` was not present.

This would lead to the confusing warning (`numba` not present) being
raised in many unrelated cases.
This PR decouples the module from the `numba` utilities.

Also fixes accidental renaming of a feature that occurred in #6002, and
adds a test to ensure feature names and their order remain the same
(compared to 0.27.0 and prior).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adding new functionality module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] Add options for short names in catch22
3 participants