Allow complex filters being set #1401

Open
jluethi opened this issue Apr 16, 2024 · 4 comments · May be fixed by #2145

jluethi (Collaborator) commented Apr 16, 2024

If my understanding is correct, we can currently filter for the following:

  • Inclusion only
  • AND combination of filters
  • A single filter per attribute / type
  • For types: filter for boolean true vs. false
  • For attributes: filter for an exact match of the attribute

Correct me if I'm wrong and we already support more in V2, @tcompa.

Because it comes up in web discussions, I'll try to give a larger perspective on where this could go in the future. This is not meant to block the v2 release.


There are many areas where we could expand the filters to support more complex filtering. Ordered by approximate importance:

Important:

  • Exclude given matches: e.g. process everything except well B07 (because there was some issue with that well)
  • Support OR filters per attribute: filter for wells B03, B05, C04, C07, which allows selecting multiple items

Nice to have:

  • Filter with complex patterns, e.g. select all B wells with a filter like "well = B*"

I don't currently see the value in:

  • OR filters between attributes: look for well B03 or acquisition 1
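
To make these semantics concrete, here is a minimal sketch of how the three proposed filter kinds could be evaluated in plain Python (purely illustrative; not Fractal code):

from fnmatch import fnmatch

wells = ["B03", "B05", "B07", "C04"]

# Exclusion: everything except well B07
kept = [w for w in wells if w != "B07"]

# OR filter per attribute: wells B03, B05, C04, C07
kept = [w for w in wells if w in {"B03", "B05", "C04", "C07"}]

# Complex pattern: all B wells ("well = B*")
kept = [w for w in wells if fnmatch(w, "B*")]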

tcompa (Collaborator) commented Apr 23, 2024

Reminder: the action of merging two filter sets together is now also implemented in fractal-web (ref fractal-analytics-platform/fractal-web#442). When we change the filter structure, we should either change it in both places or offer a dedicated endpoint for the merge operation.

@jluethi added the flexibility label (Support more workflow-execution use cases) on Jul 3, 2024
tcompa (Collaborator) commented Dec 12, 2024

Here is the latest update.

Images subpackage

The current Filters model looks like

from typing import Any, Union

from pydantic import BaseModel, Extra, Field, validator

# valdictkeys is a project-internal helper that validates dict keys


class Filters(BaseModel, extra=Extra.forbid):
    attributes: dict[str, Any] = Field(default_factory=dict)
    types: dict[str, bool] = Field(default_factory=dict)

    # Validators
    _attributes = validator("attributes", allow_reuse=True)(
        valdictkeys("attributes")
    )
    _types = validator("types", allow_reuse=True)(valdictkeys("types"))

    @validator("attributes")
    def validate_attributes(
        cls, v: dict[str, Any]
    ) -> dict[str, Union[int, float, str, bool, None]]:
        for key, value in v.items():
            if not isinstance(value, (int, float, str, bool, type(None))):
                raise ValueError(
                    f"Filters.attributes[{key}] must be a scalar "
                    "(int, float, str, bool, or None). "
                    f"Given {value} ({type(value)})"
                )
        return v

The new one should look like

class Filters(BaseModel, extra=Extra.forbid):
    attributes_include: dict[str, list[Any]] = Field(default_factory=dict)
    attributes_exclude: dict[str, list[Any]] = Field(default_factory=dict)
    types: dict[str, bool] = Field(default_factory=dict)

    # ... + validators

where each list is meant as internally joined by a logical OR.

Validators will now apply to the list elements (a sketch follows this list):

  • Each list element must be of the same scalar type as before (int, float, str, bool, None)
  • Lists can have length 1 (to include or exclude an exact match) or larger
  • Lists cannot have length 0
  • The same key cannot appear in both attributes_include and attributes_exclude; this is a validation error
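
A minimal sketch of these validators (Pydantic v1 style, to match the current model; names and structure are assumptions, not the final implementation):

from typing import Any
from pydantic import BaseModel, Extra, Field, root_validator, validator

SCALARS = (int, float, str, bool, type(None))

class Filters(BaseModel, extra=Extra.forbid):
    attributes_include: dict[str, list[Any]] = Field(default_factory=dict)
    attributes_exclude: dict[str, list[Any]] = Field(default_factory=dict)
    types: dict[str, bool] = Field(default_factory=dict)

    @validator("attributes_include", "attributes_exclude")
    def validate_lists(cls, v, field):
        for key, values in v.items():
            # Lists cannot have length 0
            if len(values) == 0:
                raise ValueError(f"{field.name}[{key}] cannot be empty.")
            for value in values:
                # Each element must be a scalar, as before
                if not isinstance(value, SCALARS):
                    raise ValueError(
                        f"{field.name}[{key}] must contain scalars "
                        "(int, float, str, bool, or None). "
                        f"Given {value} ({type(value)})"
                    )
        return v

    @root_validator
    def no_key_in_both(cls, values):
        # The same key cannot appear in both include and exclude
        overlap = set(values.get("attributes_include", {})) & set(
            values.get("attributes_exclude", {})
        )
        if overlap:
            raise ValueError(
                f"Keys {sorted(overlap)} appear in both attributes_include "
                "and attributes_exclude."
            )
        return values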

The match_filter function should be updated accordingly.
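
A minimal sketch of the updated logic (the image layout with "attributes" and "types" sub-dicts, and the treatment of missing types as False, are assumptions based on the current images subpackage):

def match_filter(image: dict, filters: Filters) -> bool:
    # Type filters: exact boolean match; missing types are treated as False
    for key, expected in filters.types.items():
        if image.get("types", {}).get(key, False) != expected:
            return False
    attributes = image.get("attributes", {})
    # Include filters: the value must match at least one list element (OR)
    for key, values in filters.attributes_include.items():
        if attributes.get(key) not in values:
            return False
    # Exclude filters: the value must match none of the list elements
    for key, values in filters.attributes_exclude.items():
        if attributes.get(key) in values:
            return False
    return True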

Examples:

{"attributes_include": {}, "attributes_exclude": {}, "types": {}}
--> no filtering

{"attributes_include": {keyA: []},  "attributes_exclude": {}, "types": {}}
--> invalid

{"attributes_include": {keyA: ["value1"]},  "attributes_exclude": {}, "types": {}}
--> requires exact match with "value1"

{"attributes_include": {keyA: ["value1", "value2"]},  "attributes_exclude": {}, "types": {}}
--> requires exact match with "value1" OR "value2"

{"attributes_include": {keyA: ["value1"]},  "attributes_exclude": {keyA: ["value2"]}, "types": {}}
--> invalid filters (even though the two filters would be logically compatible)

{"attributes_include": {keyA: ["value1"]},  "attributes_exclude": {keyB: ["value2"]}, "types": {}}
--> valid filters

DB

Current version

    input_filters: dict[
        Literal["attributes", "types"], dict[str, Any]
    ] = Field(
        sa_column=Column(
            JSON,
            nullable=False,
            server_default='{"attributes": {}, "types": {}}',
        )
    )

to be updated accordingly (a sketch follows).
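
A sketch of the updated column, keeping the same SQLModel/SQLAlchemy JSON-column style (imports as in the current snippet; the exact Literal keys and server_default string are assumptions, not the final migration target):

    input_filters: dict[
        Literal["attributes_include", "attributes_exclude", "types"],
        dict[str, Any],
    ] = Field(
        sa_column=Column(
            JSON,
            nullable=False,
            server_default=(
                '{"attributes_include": {}, '
                '"attributes_exclude": {}, "types": {}}'
            ),
        )
    )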

API

Changes are the expected ones (that is, using the new Filters model).
Pay attention to the fact that ORM objects only carry plain Python dicts for their filters, so we should always cast to Filters when appropriate. For instance, _workflow_insert_task currently has a wrong type hint, saying that input_filters is a Filters object while it is actually a dict.
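
For example (a hypothetical snippet; wftask stands for an ORM row):

# ORM rows store plain dicts, so cast explicitly before using filter logic
input_filters = Filters(**wftask.input_filters)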

Data migration for current DB values

Existing relevant db columns are

  • DatasetV2.filters
  • WorkflowTaskV2.input_filters

A data-migration script is required; it will make the following changes (a sketch follows the list):

  • Loop over all DatasetV2 rows
  • If filters = {"attributes": {}, "types": WHATEVER}, replace it with filters = {"attributes_include": {}, "attributes_exclude": {}, "types": WHATEVER}
  • If filters = {"attributes": {"key1": "value1"}, "types": WHATEVER}, replace it with filters = {"attributes_include": {"key1": ["value1"]}, "attributes_exclude": {}, "types": WHATEVER}
  • Repeat the same for WorkflowTaskV2 rows
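
A minimal sketch of the migration (the session/query structure is an assumption; only convert_filters reflects the rules above):

def convert_filters(old: dict) -> dict:
    # Wrap each scalar attribute value into a one-element include list
    return {
        "attributes_include": {
            key: [value] for key, value in old.get("attributes", {}).items()
        },
        "attributes_exclude": {},
        "types": old.get("types", {}),
    }

# Applied within the migration session, e.g.:
# for dataset in db.query(DatasetV2).all():
#     dataset.filters = convert_filters(dataset.filters)
# for wftask in db.query(WorkflowTaskV2).all():
#     wftask.input_filters = convert_filters(wftask.input_filters)
# db.commit()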

Backwards-compatibility for importing old workflows/datasets from JSON

If a workflow was exported in the past and has some filters set, these will be in the old format.

This model

class WorkflowTaskImportV2(BaseModel, extra=Extra.forbid):
    ...
    input_filters: Optional[Filters] = None

should become something like

class WorkflowTaskImportV2(BaseModel, extra=Extra.forbid):
    ...
    input_filters: Optional[Union[Filters, FiltersLegacy]] = None

We should double-check that:

  1. When the input value is a valid Filters object, nothing special occurs
  2. When it is a FiltersLegacy object, it is automatically replaced by the corresponding Filters object

The same applies to DatasetImportV2.
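
A minimal sketch of the legacy-to-new conversion (FiltersLegacy mirrors the old model; the validator name is an assumption):

from typing import Any, Optional, Union
from pydantic import BaseModel, Extra, Field, validator

class FiltersLegacy(BaseModel, extra=Extra.forbid):
    attributes: dict[str, Any] = Field(default_factory=dict)
    types: dict[str, bool] = Field(default_factory=dict)

class WorkflowTaskImportV2(BaseModel, extra=Extra.forbid):
    input_filters: Optional[Union[Filters, FiltersLegacy]] = None

    @validator("input_filters")
    def coerce_legacy_filters(cls, v):
        # Replace a FiltersLegacy object with the corresponding Filters one
        if isinstance(v, FiltersLegacy):
            return Filters(
                attributes_include={
                    key: [value] for key, value in v.attributes.items()
                },
                attributes_exclude={},
                types=v.types,
            )
        return v

Note that Pydantic v1 tries Union members left to right: a payload with the old "attributes" key fails Filters (extra=forbid) and parses as FiltersLegacy, after which the validator converts it.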

@ychiucco linked a pull request (#2145) on Dec 16, 2024 that will close this issue
jluethi (Collaborator, Author) commented Dec 18, 2024

Let's not invest too much into handling older attribute filters.

Suggested behavior for workflow import: if any attribute filters are set in a workflow, show a warning upon import that they are dropped.

Suggested behavior for dataset import: TBD. It could raise an informative NotImplementedError.

tcompa (Collaborator) commented Dec 18, 2024

Two additions:

  1. WorkflowTask filters must not include any attribute filters (only type filters)
  2. The job will gain some attribute filters (the ones that were originally associated with the wftask), but not type filters
