[ENH] design question: how to signpost tagged estimator properties that may depend on components or parameters #6994

Open
@fkiraly

Description

In several places (user frustration, internal interfaces, the testing framework) we keep running into tags whose correct value depends on hyperparameters, including the properties and hyperparameters of components. Examples:

  • whether a distance-based time series classifier/clusterer supports multivariate or unequal length data depends on whether the distance used does
  • whether a forecasting pipeline can deal with nans or categorical exogenous data depends on whether some of the components can, and on the sequence of those components (e.g., if the first component fills nans, the rest do not need to be able to handle nans)

Current philosophy has the tag on the respective estimators set to the most general setting, to avoid raising boilerplate errors in ambiguous cases, erring on the side of letting unsupported inputs through (as opposed to incorrectly blocking supported inputs).
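
To make the mismatch concrete, here is a minimal, self-contained sketch. It does not use the sktime code base: `ToyDistance` and `ToyDistanceClassifier` are hypothetical stand-ins, and the method names only mirror the class-level vs. instance-level tag lookup for illustration. The class-level tag is set to the most general value, while the value that actually holds for an instance depends on the distance component.

```python
class ToyDistance:
    """Hypothetical stand-in for a distance component with its own tags."""

    def __init__(self, multivariate):
        self.tags = {"capability:multivariate": multivariate}


class ToyDistanceClassifier:
    """Hypothetical composite; not an actual sktime estimator."""

    # class-level tag set to the most general value, as in the current philosophy
    _tags = {"capability:multivariate": True}

    def __init__(self, distance):
        self.distance = distance

    @classmethod
    def get_class_tag(cls, name):
        # what a class-level lookup (e.g., in retrieval) would see
        return cls._tags[name]

    def get_tag(self, name):
        # what actually holds for this instance: narrowed by the component
        return self._tags[name] and self.distance.tags[name]


clf = ToyDistanceClassifier(ToyDistance(multivariate=False))
class_level = ToyDistanceClassifier.get_class_tag("capability:multivariate")
instance_level = clf.get_tag("capability:multivariate")
```

Here the class-level lookup reports support for multivariate data, while the instance built with a univariate-only distance does not support it.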

However, this causes problems in retrieval related cases, such as:

The problem is compounded by the fact that it may not be possible to conclusively infer the exact capability flag for a given composite, even if all parameters are fully known, e.g.:

  • in the case of sklearn-compatible estimators, the capability of supporting categorical data is partly but not entirely inspectable, e.g., in the case of third-party compatible estimators (like catboost).
  • in the case of pipelines, the sequentiality implies that the pipeline may run validly even if the overall input would be invalid for one of its components. An example is the "supports na" capability with an imputer at the start, see above. The current solution is a tag specifying whether nans are removed; however, this is also imperfect, as there is no tag that specifies whether nans are introduced, which may furthermore depend on the data.
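
The pipeline case can be sketched as a walk over the steps that tracks whether nans may still be present. This is a toy model under stated assumptions, not sktime code: `Step`, `accepts_nan`, `removes_nan` and `pipeline_accepts_nan` are hypothetical names, and `may_introduce_nan` stands for the flag that does not currently exist as a tag. It also ignores that introducing nans may depend on the data.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical pipeline step with nan-related capability flags."""

    name: str
    accepts_nan: bool
    removes_nan: bool = False
    may_introduce_nan: bool = False  # hypothetical, no such tag exists today


def pipeline_accepts_nan(steps):
    """Return True if input with nans can pass through all steps in order."""
    nan_possible = True  # assume the overall input may contain nans
    for step in steps:
        if nan_possible and not step.accepts_nan:
            return False  # this step would receive nans it cannot handle
        if step.removes_nan:
            nan_possible = False
        if step.may_introduce_nan:
            nan_possible = True
    return True


imputer = Step("imputer", accepts_nan=True, removes_nan=True)
lagger = Step("lagger", accepts_nan=True, may_introduce_nan=True)
forecaster = Step("forecaster", accepts_nan=False)

imputer_first = pipeline_accepts_nan([imputer, forecaster])
imputer_last = pipeline_accepts_nan([forecaster, imputer])
with_lagger = pipeline_accepts_nan([imputer, lagger, forecaster])
```

With the imputer first, the pipeline accepts nans even though the forecaster alone does not. Swapping the order, or inserting a step that may introduce nans after the imputer, makes the result False, which is the case a "removes nan" flag on its own cannot express.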

@Abhay-Lejith or @yarnabrina (please correct me if I am misattributing) had the idea of, instead of using the "most general tag", introducing a value that means "indeterminate", i.e., dependent on components or parameters.
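
As a starting point for discussion, one possible shape of such a value is sketched below. Everything here is an assumption for illustration: the `Capability` enum, the `resolve` function, the simple "all components must support it" rule, and the `strict` retrieval switch are hypothetical and not part of the current code base. In particular, the conjunction rule would not be right for sequential pipelines, as discussed above.

```python
from enum import Enum


class Capability(Enum):
    """Hypothetical three-valued tag value."""

    YES = "yes"
    NO = "no"
    INDETERMINATE = "indeterminate"  # depends on components or parameters


def resolve(class_value, component_values):
    """Resolve an instance-level value from the class value and components.

    Assumes a simple conjunction rule: the composite supports the capability
    only if every component does.
    """
    if class_value is not Capability.INDETERMINATE:
        return class_value
    if any(v is Capability.NO for v in component_values):
        return Capability.NO
    if component_values and all(v is Capability.YES for v in component_values):
        return Capability.YES
    return Capability.INDETERMINATE


def matches_query(value, strict):
    """Decide whether an estimator is returned when filtering by capability.

    strict=True returns only estimators known to support the capability;
    strict=False also returns estimators where this cannot be determined.
    """
    if value is Capability.YES:
        return True
    return value is Capability.INDETERMINATE and not strict


resolved_no = resolve(Capability.INDETERMINATE, [Capability.YES, Capability.NO])
resolved_yes = resolve(Capability.INDETERMINATE, [Capability.YES])
unresolved = resolve(Capability.INDETERMINATE, [Capability.INDETERMINATE])
```

A retrieval utility could then expose the strict/lenient choice to the user, instead of the class-level tag silently committing to the most general value.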

This issue is to discuss potential designs to address the problem, and/or the concrete suggestion and what it would entail in the code base.

Labels: API design (API design & software architecture); module:base-framework (BaseObject, registry, base framework)
