[ENH] design question: how to signpost tagged estimator properties that may depend on components or parameters #6994
Description
In several places (user reports, internal interfaces, the testing framework) we keep encountering tags whose value may depend on hyperparameters, including the properties and hyperparameters of components. Examples:
- whether a distance based time series classifier/clusterer supports multivariate or unequal length data will depend on whether the distance used does
- whether a forecasting pipeline can deal with nans or categorical exogenous data depends on whether some of the components can, and on the sequence of those components (e.g., if the first component fills nans, the rest do not need to be able to handle nans)
Current philosophy has the tag on the respective estimators set to the most general setting, to avoid raising boilerplate errors in ambiguous cases, erring on the side of letting unsupported inputs through (as opposed to incorrectly blocking supported inputs).
However, this causes problems in retrieval-related cases, such as:
- users searching for an estimator and then being confused/frustrated about concrete instances not supporting the capability that is nominally supported by the class (blueprint) tag, see here for an example: [BUG] TimeSeriesDBSCAN cannot handle unequal length time series despite being explicitly able to #6993
- test retrieval of classes and concrete instances being ambiguous about whether the test instances actually have the intended capability. This has been a recurring problem in the introduction of categorical feature support, see here: [ENH] Extending categorical support in X to transformers and pipelines #6924
The problem is compounded by the fact that it may not be possible to conclusively infer the exact capability flag for a given composite, even if all the parameters are fully known, e.g.:
- in the case of `sklearn` compatible estimators, the capability of supporting categorical data is partly but not entirely inspectable, e.g., from third party compatible estimators (like `catboost`).
- in the case of pipelines, the sequentiality implies that the outputs may be valid even if the overall input is invalid for a given component. An example is the "supports na" capability with an imputer at the start, see above. The current solution is a tag specifying whether nan are removed, however this is also imperfect as there is no tag that specifies whether nans are introduced, and that also may furthermore depend on the data.
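To make the pipeline case concrete, here is a minimal sketch of how the nan capability of a sequential pipeline could be inferred from its steps. The `Step` class and the names `handles_nan` and `removes_nan` are hypothetical stand-ins for illustration, not existing tags or classes in the code base. The sketch also assumes that no step introduces nans, which is exactly the gap described above.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical stand-in for a pipeline component and two of its tags."""

    name: str
    handles_nan: bool  # can this step accept input containing nan?
    removes_nan: bool  # is this step's output guaranteed to be nan-free?


def pipeline_handles_nan(steps):
    """Infer whether a sequential pipeline accepts input containing nan.

    Assumption: no step introduces nan. Nothing in this sketch records
    that property, so the result is optimistic for steps that might.
    """
    nan_possible = True  # the pipeline input may contain nan
    for step in steps:
        if nan_possible and not step.handles_nan:
            return False  # nan can reach a step that cannot handle it
        if step.removes_nan:
            nan_possible = False  # downstream steps see nan-free data
    return True


imputer = Step("imputer", handles_nan=True, removes_nan=True)
forecaster = Step("forecaster", handles_nan=False, removes_nan=False)

print(pipeline_handles_nan([imputer, forecaster]))  # True
print(pipeline_handles_nan([forecaster, imputer]))  # False
```

The same two components give different answers depending on their order, so the capability cannot be a fixed property of the pipeline class.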
@Abhay-Lejith or @yarnabrina (please correct me if I am misattributing) had the idea of, instead of using the "most general tag", introducing a value that means "indeterminate", i.e., dependent on components or parameters.
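One possible shape for such a value, as a rough sketch only: a three-valued tag where the class carries "indeterminate" and a concrete instance resolves it from its components. The names `Capability`, `resolve`, and `matches_query` are hypothetical and do not exist in the code base. The rule "any component lacking the capability means the composite lacks it" is an assumption that fits the distance-based case, not the pipeline case.

```python
from enum import Enum


class Capability(Enum):
    """Hypothetical three-valued tag value."""

    YES = "yes"
    NO = "no"
    INDETERMINATE = "indeterminate"


def resolve(class_tag, component_tags):
    """Resolve an instance-level value from the class tag and component tags.

    Assumption: the composite has the capability only if every
    component has it (as for a clusterer and its distance).
    """
    if class_tag is not Capability.INDETERMINATE:
        return class_tag  # the class-level value is already definite
    component_tags = list(component_tags)
    if any(tag is Capability.NO for tag in component_tags):
        return Capability.NO
    if component_tags and all(tag is Capability.YES for tag in component_tags):
        return Capability.YES
    return Capability.INDETERMINATE  # still unknown, e.g. opaque component


def matches_query(tag, strict=True):
    """Decide whether an estimator is returned by a capability search.

    strict=True returns only definite support; strict=False also
    returns estimators whose support depends on parameters.
    """
    if strict:
        return tag is Capability.YES
    return tag is not Capability.NO


# A distance-based clusterer whose distance does not support unequal length.
print(resolve(Capability.INDETERMINATE, [Capability.NO]).value)  # no
print(matches_query(Capability.INDETERMINATE, strict=True))  # False
print(matches_query(Capability.INDETERMINATE, strict=False))  # True
```

With this design, a user search could default to strict matching and avoid the confusion in #6993, while test retrieval could request indeterminate classes explicitly and check the resolved value per instance.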
This issue is to discuss potential designs to address the problem, and/or the concrete suggestion and what it would entail in the code base.