-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Refactored and improved Catch22
transformer - support for column names, short aliases, refactor to pd.Series
, sktime
native parallelization
#6002
Conversation
@yarnabrina @fkiraly I had to reopen it after syncing, as the last one closed automatically. It's still work in progress, I have to implement proper naming of the columns (which was the main goal). I also think of passing a dict of variables to |
Also, what about #5990? |
This also applies to I've done the same recently with
It's up to you whether you do this in one PR, separate PR, or not at all - it's probably easier to separate this out, if you're changing logic as well. |
FYI, linting (code formatting) checks are failing, see here for a guide how to ensure you follow these, and how to set up the checks locally: https://www.sktime.net/en/stable/developer_guide/coding_standards.html |
Yes, I changed the typing from
My only concern is that right now I use it in my project for multivariate time series, with a Output like in https://www.sktime.net/en/stable/examples/transformation/feature_extraction_with_tsfresh.html (Multivariate time series classification data) would be ideal for me. - InputColumnName_FeatureName (or its index like in your implementation here: #5983) |
Can you explain how/why this would pose a problem? You can still pass nested dataframes or other The column naming is also automatic, it is |
(FYI, linting again - it says it wants double quotes instead of single quotes for strings) |
Okay great I see it now, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! Some comments.
Blocking (for discussion or change):
check_n_jobs
is no longer used, and the change seems to affect logic. I would say, let's remove it from this PR, and if you think it is needed, discuss in another PR or issue.
Non-blocking:
- why do we need two dictionaries,
METHODS_DICT
andCATCH24_METHODS_DICT
- would it not simplify multiple places? Similarly, for short feature names. Odd that these are/were treated separately. - docstring should mention this also supports catch-24 features (plus mean, std)
- why is the
catch24
argument needed and mean/std treated completely differently? That is confusing. Why not simply add mean and std as additional features to the large list? Needs to be deprecate, of course. outlier_norm
docstring should mention which two feaures it affects.- the
_transform_single_feature
method is very unpleasant, as it is private, yet used in random classifiers. Not necessary for this PR, but a question: can you come up with a good way to decouple this, e.g., joint utility dispatcher or similar? It is not good style if estimtors keep calling each others' private methods, and risky. As a warning, if it is not immediately obvious, I would recommend not looking into it as things like this can develop into time consuming rabbit holes...
Okay, removed
I think it's due to the fact that std and mean weren't mentioned in the original publication: https://link.springer.com/article/10.1007/s10618-019-00647-x and were added later, so are not really a core part of catch22. That's why I created two seperate dicts, but this could be discussed in the future, but all the other packages (pycatch22, julia one etc.) also seperate the logic of core 22 features and these additional two.
added
Same as above, maybe features should be "catch22" by default or "catch24" instead of "all", but this could be changed later in my opinion.
Added
I spent a long time with this method which is in fact not necessary in the |
pd.Series
, sktime
native parallelization
I am aware, there's catch22 and catch24. What I mean, on the code side there doesn't need to be a distinction, they are just 24 features and sometimes people only want 22 of those. Imo there's no reason to internally separate dictionaries etc, there still can be a separation in the user interface, or having only 22 as the default. |
Sure - when I say "non-blocking", it means, fine if changed later or not at all. |
Agreed - also non-blocking, as said. Just in case you had an idea on how to resolve this quickly, reading this I see that it probably is not that easy to clean this up. |
Okay, do you think it could be done in the next PR in parallel with modifying the selection of features (removing |
Sure - when it's listed "non-blocking", it means it does not prevent from getting this PR merged if not addressed, and can well be addressed in another PR, or not a tall. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me!
Two small but blocking requests:
- parameter sequence, see above
- parameter coverage in tests, see above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - all addressed.
Once (or - if) tests pass, we can merge, from my side.
pd.Series
, sktime
native parallelizationpd.Series
, sktime
native parallelization
pd.Series
, sktime
native parallelizationCatch22
transformer - support for column names, short aliases, refactor to pd.Series
, sktime
native parallelization
@fkiraly I have one last question, as I cannot find it in documentation -- how should the n_jobs be passed in non-deprecated way? |
the warning and docstring should tell you - via |
PR #6002 is great but it accidentally introduced coupling between the `catch22` module and the `numba` utilities, resulting in imports of the `numba` utilities even if `numba` was not present. This would lead to the confusing warning (`numba` not present) being raised in many unrelated cases. This PR decouples the module from the `numba` utilities. Also fixes accidental renaming of a feature that occurred in #6002, and adds a test to ensure feature names and their order remain the same (compared to 0.27.0 and prior).
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Simplifies the code to be in accordance with the current version of sktime, similarly to #5981.
Adds option to add column names in the output, with short names option similarly to DynamicsAndNeuralSystems/pycatch22#18
Does your contribution introduce a new dependency? If yes, which one?
No
What should a reviewer concentrate their feedback on?
[ ] Changed logic from looping over nested_univ series to a single pd.Series approach.
[ ] Private catch22 from
_catch22_numba.py
functions are now passed through a dict[ ]
_catch22_numba.py
now accesses variables passed through a dict, not indirectly, to simplify the logic of passing arguments in the maincatch22.py
moduleDid you add any tests for the change?
No
Any other comments?
PR checklist
For all contributions
How to: add yourself to the all-contributors file in the
sktime
root directory (not theCONTRIBUTORS.md
). Common badges:code
- fixing a bug, or adding code logic.doc
- writing or improving documentation or docstrings.bug
- reporting or diagnosing a bug (get this pluscode
if you also fixed the bug in the PR).maintenance
- CI, test framework, release.See here for full badge reference
See here for further details on the algorithm maintainer role.