Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add isNull condition for payload filtering #1617

Merged
merged 17 commits into from
Apr 1, 2023

Conversation

ibrahim-akrab
Copy link
Contributor

@ibrahim-akrab ibrahim-akrab commented Mar 28, 2023

All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you lint your code locally using cargo fmt command prior to submission?
  3. Have you checked your code using cargo clippy command?

This PR will solves issue #1609 with some rough edges.

TODO:

  • check the cardinality estimation correctness
  • unit tests
  • OpenAPI integration tests
  • grpc implementation

/claim #1609

@ibrahim-akrab ibrahim-akrab marked this pull request as ready for review March 29, 2023 00:14
@ibrahim-akrab
Copy link
Contributor Author

Alright, now I think it should be ready (except for the cardinality estimation since I need some help to make sure it is right)

I found that IsEmpty condition only gets matched if the key is not present at all in the payload. It doesn't get matched when it is [] as it was mentioned in the original issue. I checked that against the docker image without my changes as well. I think this is in part because of

pub fn get_value_from_json_map<'a>(
path: &str,
value: &'a serde_json::Map<String, Value>,
) -> MultiValue<&'a Value> {

Since it treats serde_json::value::Value::Array as a Single value. So this can be fixed but I am not sure that it doesn't have regressions elsewhere.
Another approach which I think will be less error prone is fixing
pub fn check_is_empty_condition(is_empty: &IsEmptyCondition, payload: &Payload) -> bool {
to take the previous fact into consideration.

I am willing to implement either fix of them. Just waiting for your opinion and approval.

@ibrahim-akrab
Copy link
Contributor Author

I went ahead and added a fix for the is_empty condition of "key":[] mentioned above. Now it behaves according to the issue description. It can always be reverted if it wasn't meant to behave that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a specific implementation for the MultiValue<&Value> used in the is_empty and is_null conditions. I couldn't change the main is_empty() method since it's being used elsewhere and the new specialized methods couldn't have the same name since the compiler won't attempt to resolve the ambiguity in case they overlap.

@ibrahim-akrab
Copy link
Contributor Author

Hey @agourlay, could you please review this and let me know of any required changes.

@generall generall requested a review from agourlay March 29, 2023 14:37
Copy link
Member

@timvisee timvisee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a very basic review. Someone else will have to go over this in more detail.

lib/segment/src/payload_storage/query_checker.rs Outdated Show resolved Hide resolved
Copy link
Member

@generall generall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/qdrant/qdrant/blob/master/docs/DEVELOPMENT.md#api-changes

REST: run /tools/generate_openapi_models.sh to generate specs
gRPC: generate docs ./tools/generate_grpc_docs.sh

This

@@ -49,7 +49,7 @@ def basic_collection_setup(collection_name='test_collection', on_disk_payload=Fa
{
"id": 1,
"vector": [0.05, 0.61, 0.76, 0.74],
"payload": {"city": "Berlin"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not change tests that are already working to test new functionality. Create new fixtures if needed

Copy link
Contributor Author

@ibrahim-akrab ibrahim-akrab Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that tests shouldn't be changed. However I thought the points in the basic_collection_setup() was too standard and didn't account for "key":[] which is why it was thought to be working while it didn't. Also if ,for example, for some strange reason the "values_count" condition was changed such that it mistakenly count/miscount the "key": null or "key": [], the old tests will pass.
That's why I thought adding some diversity and changing the values accordingly is a good idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any values_count changes is most likely related to this recent fix #1502

I agree that changing existing payloads is not optimal for us.

I understand where you are coming from, if you want to add diversity to the existing data, I'd be ok with adding new points to the basic_collection_setup at the cost of tracking down potential failing count assertions.

Use your best judgement :)

Copy link
Contributor Author

@ibrahim-akrab ibrahim-akrab Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values_count change is just an example. What I meant to say is without having the test data representing all cases, some change may break another case and never fail a test since they're not existing in the same collection.

I already reverted the test changes to their previous state and added a new collection with the diverse payloads as per @generall's request. However, I am not happy with this solution for the reasons mentioned above. I'd rather add to the old test points.

"filter": {
"should": [
{
"is_null": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is strange, that this passes validation without an updated openapi.json. Apparently jsonschema.validate is not able to process this example correctly.

That is something we need to keep in mind. FYI @agourlay

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to reproduce the issue locally which cause malformed requests to not fail.

I even have a lead for a fix, will report asap 🤞

lib/segment/src/index/struct_payload_index.rs Outdated Show resolved Hide resolved
@@ -234,6 +240,8 @@ mod tests {
"rating": vec![3, 7, 9, 9],
"color": "red",
"has_delivery": true,
"parts": [],
"packaging": null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a test for a field that is [null]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the filtering supposed to behave in that case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess [null] is equivalent to an empty array, so not null.
WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current implementation, it will not be matched with neither is_empty nor is_null.
If you want it to be matched with is_empty that's fine. However, what about [null, [null]], is that even a valid payload? if so, should it be equivalent to empty as well?

Copy link
Contributor Author

@ibrahim-akrab ibrahim-akrab Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another concern , wouldn't it be confusing for "key":[null] to get matched as empty and not as null with the word "null" in it? I don't see it making sense in the documentation 😕

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forget about my comment, I don't think it is necessary to test this.
Sorry for derailing your research.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's alright, I added it anyway to test for any future changes regarding this case.

lib/segment/src/types.rs Outdated Show resolved Hide resolved
lib/segment/src/types.rs Outdated Show resolved Hide resolved
Copy link
Member

@generall generall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@generall
Copy link
Member

@ibrahim-akrab , thanks for the contribution! Please do not forget to /claim #1609 the bounty before I merge this

@ibrahim-akrab
Copy link
Contributor Author

@generall, happy to help and thanks for the bounty 🤑

Copy link
Member

@agourlay agourlay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution 👍

If you are still feeling motivated you could add a short doc about it in the Filtering section.

@ibrahim-akrab
Copy link
Contributor Author

@agourlay Just submitted a PR there. Thanks for the suggestion 👍

@generall generall merged commit f85ae02 into qdrant:dev Apr 1, 2023
@ibrahim-akrab ibrahim-akrab deleted the handle_null branch April 2, 2023 08:46
generall pushed a commit that referenced this pull request Apr 11, 2023
* add minimal working is_null filter

* add is_null condition to grpc api (backward compatible)

* add unit tests is_null and is_empty conditions

* add is_null to  points.proto file

* add some failing OpenAPI tests

* fix a failing test due to change in collection data

* refactor MultiValue's check for is_null

* fix is_empty condition not picking up "key":[]

* remove duplicate OpenAPI integration test

* reuse same variable in condition checker tests

* update grpc docs

* fix is_null cardinality estimation to match is_empty

* update openapi specs

* remove unused debug statements

* add new test points to original test_collection

* fix failing tests according to newly added points

* add the `"key":[null]` test_case
@generall generall mentioned this pull request Apr 19, 2023
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants