-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mega Change #28
base: master
Are you sure you want to change the base?
Mega Change #28
Conversation
Okay, there is a lot here. Basically, I moved a lot into fuzzywuzzy_compatible implemented on top of various other new root crate level primitives (which, for the most part, have gracefully handled everything fuzzywuzzy wants to do). A lot got moved into fuzzywuzzy_compatible and re-exported into I'll probably come back and edit this comment tomorrow with a more coherent/comprehensive summary. Still stuff left to do:
|
tests are passing locally on nightly, its just intersperse giving us grief. |
Let's see, what else... Processor trait vs normalizers? A couple good normalizers plus grapheme segmentation is probably the best default, w/ codepoints segmentation an occasional perf boost. I'm still not happy I can't easily represent the token set stuff and wonder if I can't unify that somehow. Are there any other complex normalization sequences that would be useful to have default impls? |
pub fn uwratio(s1: &str, s2: &str, full_process: bool) -> u8 { | ||
// trivial check omitted because this is a shallow delegator to wratio which checks. | ||
wratio(s1, s2, false, full_process) | ||
// TODO: document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Documentation needed] 🤖
/// let choices = vec!["foo", "bar", "baz"]; | ||
/// assert_eq!(extract_without_order_full(query, &choices, default_processor, default_scorer, 0.try_into().unwrap()), vec![Match{ item: &"foo", score: 0.try_into().unwrap() }, Match{ item: &"bar", score: 100.try_into().unwrap() }, Match{ item: &"baz", score: 67.try_into().unwrap() }, ]); | ||
/// ``` | ||
pub fn extract_without_order_full<'a, 'b, A: 'a, B, C, P, S>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pedantic nitpicking: is _full
a common suffix for functions providing more behavior controls in the signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know! It would be good to find some prior art on this before 1.0 but its probably fine for this PR. If you think we can do better, make an issue so we remember to check it out again.
/// // assert_eq!(replace_non_letters_non_numbers_with_whitespace("abcØØØकिमपि"), "abcØØØक मप "); | ||
/// assert_eq!(replace_non_letters_non_numbers_with_whitespace("abcØØØकिमपि"), "abcØØØकिमपि"); | ||
/// ``` | ||
pub fn replace_non_letters_non_numbers_with_whitespace(s: &str) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More nitpicking:
subst_wspace_for_non_alphanumeric
replace_non_alphanumeric_with_whitespace
/// assert_eq!(asciionly("abcØØØकिमपि"), "abc"); | ||
/// assert_eq!(asciionly("ØØØकिमपि"), ""); | ||
/// ``` | ||
pub fn asciionly(s: &str) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More nitpicking:
only_ascii
ascii_only
keep_ascii
} | ||
} | ||
|
||
// TODO: document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO docs
This normalizer splits by whitespace and unstably sorts the input.
convert::{From, TryFrom}, | ||
}; | ||
|
||
// TODO: turn primitives.rs into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make a branch and do this, if needed. IntelliJ Rust has a nice feature for transforming a single file into a module.
} | ||
|
||
impl TryFrom<u8> for Score { | ||
// TODO: pick a better 'error' type... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exactly the comment I leave whenever I'm too lazy to define an Enum and derive thiserror::Error
😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea! I honestly don't know what, if any, error ecosystem we should plug into. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wisdom for a while has been "anyhow" for binaries, "thiserror" for libraries. The macros from thiserror are simply and non-magic enough that I feel comfortable using them normally.
/// baseline query. | ||
// TODO: link wratio | ||
/// For example, wratio might be used to compare against strings. | ||
pub trait Scorer<A, B> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Scorer<Q, C>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except wratio
doesn't even fit this trait. With the talk in #6, it might be best to do away with traits here and just use S: Fn(A, B) -> Score
in the relevant signatures in process.rs
. Then we can maybe collect simple_ratio
, best_partial_ratio
, and whatever else into their own scorer.rs
module.
part of the issue is that the extract methods are unnecessarily generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be best to do away with traits here and just use S: Fn(A, B) -> Score in the relevant signatures
👍 The less complexity and code, the better :)
part of the issue is that the extract methods are unnecessarily generic.
What would you change about their signatures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've thought about this some more and think this should get the same treatment as trait Normalizer
and trait Segmenter
. The stuff in fuzzywuzzy_compatible::process
should be able to be adapted to the new trait.
// Basically takes the output of segmentation on two items...
trait Scorer<T>{
fn score<T: Eq>(&self, a: &[T], b: &[T]) -> Score;
}
// Then lets implement the most basic `Scorer`, `simple_ratio`...
// ... which happens to need it's a separate trait for reasons ...
trait Aligner {
fn align<T: Eq>(&self, a: &[T], b: &[T]) ->Vec<(usize, usize, usize)>;
}
pub struct SimpleRatioScorer<A: Aligner>(A);
impl <T: Eq, A: Aligner> Scorer<T> for SimpleRatioScorer<A> {
fn score<T: Eq>(&self, a: &[T], b: &[T]) -> Score;
let matches: usize = self.0.align(a, b).iter().map(|&(_, _, s)| s).sum();
let sumlength = (a.len() + b.len()) as f32;
if sumlength.is_normal() {
(100.0 * (2.0 * matches as f32) / sumlength).round().try_into().unwrap()
} else {
100.try_into().unwrap()
}
}
}
pub struct BasicMatchingBlocks;
impl Aligner for BasicMatchingBlocks {
fn align<T: Eq>(&self, a: &[T], b: &[T]) ->Vec<(usize, usize, usize)> {
get_matching_blocks(a, b)
}
}
pub struct SmithWaterman;
impl Aligner for SmithWaterman{
fn align<T: Eq>(&self, a: &[T], b: &[T]) ->Vec<(usize, usize, usize)> {
return magic();
}
}
pub struct BestPartialScorer<A: Aligner, S: Scorer>(A, S);
impl <T: Eq, A: Aligner, S: Scorer> Scorer<T> for BestPartialScorer<A, S> {
fn score<T: Eq>(&self, a: &[T], b: &[T]) -> Score;
let blocks = self.0.align(a, b);
let mut max = 0.try_into().unwrap();
for (i, j, _) in blocks {
let long_start = if j > i { j - i } else { 0 };
let long_end = std::cmp::min(long_start + shorter.len(), longer.len());
let long_substr = &longer[long_start..long_end];
let r = self.1.score(&shorter, long_substr).try_into().unwrap();
if r > 99 {
return 100.try_into().unwrap();
} else if r > max {
max = r;
}
}
max
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also means that token_set_ratio and such can implement the trait and have similar building blocks.
Welp, I think this wasn't a problem in python because the python |
Not sure whether it helps, but here is my C++ implementation of get_matching_blocks. |
I haven't looked at it in a few days but the problem was one of (I can't remember which offhand) debug_assert!(low1 <= high1);
debug_assert!(low2 <= high2);
debug_assert!(high1 <= shorter.len());
debug_assert!(high2 <= longer.len());
debug_assert!(high1 - low1 <= high2 - low2); was failing. All of these ought to be maintained for each call to I think your implementation suffers the same issue with if ((spos + length) < a_high && (dpos + length) < b_high) {
queue.emplace_back(spos + length, a_high, dpos + length, b_high);
} I have not looked at this since I noted it last week. |
I will test this in my implementation. However this is directly taken from the Python implementation (and appears to be the same in your implementation as well). I would personally assume the bug is somewhere in find_longest_match, since there your using quite a different implementation. RapidFuzz uses pretty much the same implementation as Difflib in there, but since I do not use the auto junk feature I was able to get rid of the hashmap RapidFuzz if (length) {
if (a_low < spos && b_low < dpos) {
queue.emplace_back(a_low, spos, b_low, dpos);
}
if ((spos + length) < a_high && (dpos + length) < b_high) {
queue.emplace_back(spos + length, a_high, dpos + length, b_high);
}
matching_blocks_pass1.emplace_back(spos, dpos, length);
} fuzzywuzzy-rsfuzzywuzzy-rs/src/primitives.rs Lines 207 to 215 in f787942
difflib if k: # if k is 0, there was no matching block
matching_blocks.append(x)
if alo < i and blo < j:
queue.append((alo, i, blo, j))
if i+k < ahi and j+k < bhi:
queue.append((i+k, ahi, j+k, bhi)) |
fixes #24
fixes #22
fixes #20
fixes #7 (mostly - once you have extractWithoutOrder, the rest are basically just 'get top N' which can be done by callers)
#6 might or might not make it depending on the wratio cleanups.