-
Notifications
You must be signed in to change notification settings - Fork 689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPRangeQuery to search for IPs which are members of a subnet. #1546
Conversation
update from upstream v1.0.14
upgrade to blevesearch/bleve v2
upgrade to v2.0.1
…operly formatter CIDR or IP
OK, so the first thing that I would ask is that you not take any immediate action on my comments. Often times it seems people assume if they respond to my feedback immediately we'll merge it and be done. In practice, I will also be asking others to review this as well, and sometimes their feedback alters my opinion. So making changes at this stage is premature. |
First the good news, I think this PR has basically the right form/shape to make its way into the code base. However, there is one possibly big issue for us to think about. As this is currently implemented, there is basically no real new capability for end users. It is in essence a convenience wrapper around indexing terms in a particular format, and using term and term range queries to match them. To be clear, that is not a bad thing necessarily. The boolean field type does the exact same thing (indexes 'T' or 'F' and has a searcher that knows about this encoding). And the date/geo types are really just wrappers around the technique used by the numeric field type. So, then what is the problem with this. Well, the implementation uses term range queries to implement CIDR netmask queries, but this is not very efficient. If the range expands to a large number of addresses, it just doesn't work. I don't know exactly how this will be used in practice, but since ipv6 addresses are 128bit, net masks can quite easily expand to very large numbers of addresses. So, while there is nothing wrong with this as an initial implementation, I'm not sure it will be useful in practice. What other alternatives are there? Well, again the good news is that we already have one option. The numeric indexing format we have today addresses this exact problem. When you do a numeric range search from -∞ to ∞, we do not perform a disjunction query over 2^64 terms. Instead we use a technique to index multiple terms for each number, not just a single term. These terms have bits shifted off, so that multiple numeric values map to the same term. The trade-off here is additional space in the index allows for more efficient querying. I'm pretty sure a similar technique could be used here. The actual implementation we use for numeric fields has some hard-coded limit at 64-bits. The problem is a bit deeper here because the current technique trades space for query speed, but it can really bloat indexes. One of the major items we expect to focus on this year is improving the numeric index (most likely using approach similar to lucene's bkd-tree). Obviously supporting > 64bits is one of our requirements already, so hopefully this field type will eventually map to that implementation. That leaves us with 3 choices that I can see:
In a perfect world I would wait for 3, because if we do anything else, we have to break compatibility for some users or have some plan to migrate things in the future. But, arguably we've been waiting for 3 for too long now, and it's time to do something. We have to have some compatibility story for numeric/date/geo anyway, so this isn't really contributing to making that worse. That leaves option 1 or 2. Again, I would prefer 2, but I also get that it is more work that seems like it would be thrown away in the near future. So I could be persuaded to go ahead with option 1, but I would like to hear more about your use case, and how well this will work for it. I have some minor nits with naming and code formatting in some places, but wanted to start with this big top-level item first. I'm adding @abhinavdangeti and @sreekanth-cb to review as well. |
Hi Marty, A previous version of this PR used a FieldDictPrefix query to narrow down the search. This required some post-filtering if the mask length was not an exact number of bytes. But would this be a better approach? |
So, I don't think benchmarks will necessarily show the problem, as it depends on data loaded into the index, which for tests/benchmarks is usually small. Consider this example: User does an IPRangeQuery for bleve/search/searcher/search_disjunction.go Line 111 in 2522dc2
You can increase the limit, but at some point you will either run out of memory or find the query takes too long to execute. |
I would argue that returning an In our specific use case, our IP addresses correspond to actual physical computers, and the IPRangeQuery would usually be a CIDR corresponding to a specific network. We would not expect these searches to match a particularly large number of IPs and setting Our very old implementation represented addresses as numeric fields and this did bloat our indexes to So for these reasons, option (1) would be the best solution for our use case. |
Hello again @mschoch + team... |
@amnonbc I have provided my review and asked for @abhinavdangeti and @sreekanth-cb to review this, and so far they have not. They do not work for me, I cannot make them do it. As this PR changes the project's API (in an additive way), it must meet the highest standard of review, as once we add something, the expectation is that we'll support it going forward. So, our default posture for changes like this is to not accept them without a consensus to include them. In this case, as we've already discussed, neither the index nor the searching requires any new capability here. If you are in a hurry to use this approach, you can do it already using existing text fields and term range searches. I expect that when they have time, they will provide their review, and we'll decide how to move the project forward with this at that time. |
No rush. We are happy there is a chance it might be merged and more than happy to support that if it happens |
Hello @mschoch, @amnonbc. Apologies on the delay in getting to this. I believe this PR does add value (even if it's just more of a wrapper, we do this already with some others). So, for the moment - I don't object the current approach. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new files added here need the copyright stub.
Otherwise, I'm ok with getting this change in right now.
@@ -0,0 +1,67 @@ | |||
// Copyright (c) 2014 Couchbase, Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2021*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool...
@abhinavdangeti
Which copyright should I copy>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amnonbc This is the bit (from search/searcher/search_ip_range.go), that's missing in some files ..
// Copyright (c) 2021 Couchbase, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
- document/field_ip.go
- document/field_ip_test.go
- search/query/ip_range.go
- search/searcher/search_ip_range_test.go
- test/ip_field_test.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get a unit test failure when running against 1.17.
This will need a bit of work and some coffee to figure out.
I'll do it tomorrow. (It is nearly 11pm in London now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I had some coffee and fixed the breaking test.
The standard library in Go 1.17 has become stricter on parsing IP addresses, and was rejected IP input which
contained gratuitions leading zeros.
Because of golang/go@d3e3d03 which rejects leading zeros in dot-decimal notation.
Is it better use “IP” (all upper cases) instead of “Ip”? |
|
* MB-40604: Support IPRange query Related: #1546 * Update IPs used in TestIPRangeQuery
+ IP field support introduced with: blevesearch/bleve#1546 + Also, requires: blevesearch/bleve#1735 Change-Id: I76cc275ebc7817ecec0521789587c1b68eeafdd9 Reviewed-on: https://review.couchbase.org/c/cbft/+/179981 Well-Formed: Build Bot <build@couchbase.com> Tested-by: Abhi Dangeti <abhinav@couchbase.com> Reviewed-by: <thejas.orkombu@couchbase.com>
* MB-40604: Support IPRange query Related: #1546 * Update IPs used in TestIPRangeQuery
+ Uses `7.2-couchbase` branch of blevesearch/bleve which includes the following commit over v2.3.3: - blevesearch/bleve#1735 + IP field and RangeQuery support added with: - blevesearch/bleve#1546 - blevesearch/bleve#1735 Change-Id: I76cc275ebc7817ecec0521789587c1b68eeafdd9 Reviewed-on: https://review.couchbase.org/c/cbft/+/185273 Well-Formed: Restriction Checker Tested-by: Abhi Dangeti <abhinav@couchbase.com> Reviewed-by: Abhi Dangeti <abhinav@couchbase.com>
I have closed #1536
and squashed the history, to make it easier to review.
The review is under 600 lines of code, about half of which are tests.
This PR adds an IP field, and an IPRangeQuery which can be used to match all ip's in a network.
bleve.NewIPRangeQuery("1.2.3.0/24")
will return all documents whose IP fields contain IPs within the 1.2.3.X network.
The IP data is normalised as a big endian ipv6 format, which is a slice of 16 bytes,
and this is what is stored in the indexes.
We support searching for a specific IP, and searching for membership of a network.
The search expressions and fields are always converted to ipv6.
If we are searching for a specific IP, we just hand the work over to a TermSearcher.
For network membership, we calculate the minimum and maximum IP addresses in
the network and using indexReader.FieldDictRange() and return the indexes in this range.
The tests in https://github.com/blevesearch/bleve/blob/8bcceaf6405e9fbb1a5e08ca94db4ae2eef858da/test/ip_field_test.go
illustrate how the IP Range query can be used.