Mega Change #28

logannc · 2021-03-27T03:58:45Z

fixes #24
fixes #22
fixes #20
fixes #7 (mostly - once you have extractWithoutOrder, the rest are basically just 'get top N' which can be done by callers)

#6 might or might not make it depending on the wratio cleanups.

logannc · 2021-03-27T04:16:39Z

Okay, there is a lot here. Basically, I moved a lot into fuzzywuzzy_compatible implemented on top of various other new root crate level primitives (which, for the most part, have gracefully handled everything fuzzywuzzy wants to do). A lot got moved into fuzzywuzzy_compatible and re-exported into fuzz for now. If we improve the impls with non-compatible versions, we just stop exporting them. The apis should be the same.

I'll probably come back and edit this comment tomorrow with a more coherent/comprehensive summary.

Still stuff left to do:

wratio and the token_set methods need a little love. they don't fit with normalizers/segmenters so might just do minimal cleanup
documentation missing
primitives module probably needs splitting because there is a bunch of unrelated junk littered there
fuzzywuzzy_compatible should probably be python_compatible because we're also named fuzzywuzzy?
need to figure out how to replace intersperse
performance benchmarks/regression tests?

logannc · 2021-03-27T04:17:05Z

tests are passing locally on nightly, its just intersperse giving us grief.

logannc · 2021-03-27T05:15:38Z

Let's see, what else...

Processor trait vs normalizers?
Scorer vs calcs?
Redefine misc utils in terms of primitives (i.e., full_process)
Better defaults and docs on normalizes

A couple good normalizers plus grapheme segmentation is probably the best default, w/ codepoints segmentation an occasional perf boost.

I'm still not happy I can't easily represent the token set stuff and wonder if I can't unify that somehow.

Are there any other complex normalization sequences that would be useful to have default impls?

seanpianka · 2021-03-28T01:10:41Z

src/fuzz.rs

-pub fn uwratio(s1: &str, s2: &str, full_process: bool) -> u8 {
-    // trivial check omitted because this is a shallow delegator to wratio which checks.
-    wratio(s1, s2, false, full_process)
+// TODO: document


[Documentation needed] 🤖

seanpianka · 2021-03-28T01:14:40Z

src/fuzzywuzzy_compatible/process.rs

+/// let choices = vec!["foo", "bar", "baz"];
+/// assert_eq!(extract_without_order_full(query, &choices, default_processor, default_scorer, 0.try_into().unwrap()), vec![Match{ item: &"foo", score: 0.try_into().unwrap() }, Match{ item: &"bar", score: 100.try_into().unwrap() }, Match{ item: &"baz", score: 67.try_into().unwrap() }, ]);
+/// ```
+pub fn extract_without_order_full<'a, 'b, A: 'a, B, C, P, S>(


Pedantic nitpicking: is _full a common suffix for functions providing more behavior controls in the signature?

I don't know! It would be good to find some prior art on this before 1.0 but its probably fine for this PR. If you think we can do better, make an issue so we remember to check it out again.

seanpianka · 2021-03-28T01:17:26Z

src/fuzzywuzzy_compatible/string_processing.rs

+/// // assert_eq!(replace_non_letters_non_numbers_with_whitespace("abcØØØकिमपि"), "abcØØØक मप ");
+/// assert_eq!(replace_non_letters_non_numbers_with_whitespace("abcØØØकिमपि"), "abcØØØकिमपि");
+/// ```
+pub fn replace_non_letters_non_numbers_with_whitespace(s: &str) -> String {


More nitpicking:

subst_wspace_for_non_alphanumeric

replace_non_alphanumeric_with_whitespace

seanpianka · 2021-03-28T01:19:05Z

src/fuzzywuzzy_compatible/utils.rs

+/// assert_eq!(asciionly("abcØØØकिमपि"), "abc");
+/// assert_eq!(asciionly("ØØØकिमपि"), "");
+/// ```
+pub fn asciionly(s: &str) -> String {


More nitpicking:

only_ascii

ascii_only

keep_ascii

seanpianka · 2021-03-28T01:22:20Z

src/normalization.rs

+    }
+}
+
+// TODO: document


TODO docs

This normalizer splits by whitespace and unstably sorts the input.

seanpianka · 2021-03-28T01:23:20Z

src/primitives.rs

+    convert::{From, TryFrom},
+};
+
+// TODO: turn primitives.rs into


I can make a branch and do this, if needed. IntelliJ Rust has a nice feature for transforming a single file into a module.

seanpianka · 2021-03-28T01:24:38Z

src/primitives.rs

+}
+
+impl TryFrom<u8> for Score {
+    // TODO: pick a better 'error' type...


This is exactly the comment I leave whenever I'm too lazy to define an Enum and derive thiserror::Error 😆

yea! I honestly don't know what, if any, error ecosystem we should plug into. Thoughts?

The wisdom for a while has been "anyhow" for binaries, "thiserror" for libraries. The macros from thiserror are simply and non-magic enough that I feel comfortable using them normally.

seanpianka · 2021-03-28T01:25:38Z

src/primitives.rs

+/// baseline query.
+// TODO: link wratio
+/// For example, wratio might be used to compare against strings.
+pub trait Scorer<A, B> {


Nitpick: Scorer<Q, C>?

except wratio doesn't even fit this trait. With the talk in #6, it might be best to do away with traits here and just use S: Fn(A, B) -> Score in the relevant signatures in process.rs. Then we can maybe collect simple_ratio, best_partial_ratio, and whatever else into their own scorer.rs module.

part of the issue is that the extract methods are unnecessarily generic.

it might be best to do away with traits here and just use S: Fn(A, B) -> Score in the relevant signatures

👍 The less complexity and code, the better :)

part of the issue is that the extract methods are unnecessarily generic.

What would you change about their signatures?

I've thought about this some more and think this should get the same treatment as trait Normalizer and trait Segmenter. The stuff in fuzzywuzzy_compatible::process should be able to be adapted to the new trait.

// Basically takes the output of segmentation on two items... trait Scorer<T>{ fn score<T: Eq>(&self, a: &[T], b: &[T]) -> Score; } // Then lets implement the most basic `Scorer`, `simple_ratio`... // ... which happens to need it's a separate trait for reasons ... trait Aligner { fn align<T: Eq>(&self, a: &[T], b: &[T]) ->Vec<(usize, usize, usize)>; } pub struct SimpleRatioScorer<A: Aligner>(A); impl <T: Eq, A: Aligner> Scorer<T> for SimpleRatioScorer<A> { fn score<T: Eq>(&self, a: &[T], b: &[T]) -> Score; let matches: usize = self.0.align(a, b).iter().map(|&(_, _, s)| s).sum(); let sumlength = (a.len() + b.len()) as f32; if sumlength.is_normal() { (100.0 * (2.0 * matches as f32) / sumlength).round().try_into().unwrap() } else { 100.try_into().unwrap() } } } pub struct BasicMatchingBlocks; impl Aligner for BasicMatchingBlocks { fn align<T: Eq>(&self, a: &[T], b: &[T]) ->Vec<(usize, usize, usize)> { get_matching_blocks(a, b) } } pub struct SmithWaterman; impl Aligner for SmithWaterman{ fn align<T: Eq>(&self, a: &[T], b: &[T]) ->Vec<(usize, usize, usize)> { return magic(); } } pub struct BestPartialScorer<A: Aligner, S: Scorer>(A, S); impl <T: Eq, A: Aligner, S: Scorer> Scorer<T> for BestPartialScorer<A, S> { fn score<T: Eq>(&self, a: &[T], b: &[T]) -> Score; let blocks = self.0.align(a, b); let mut max = 0.try_into().unwrap(); for (i, j, _) in blocks { let long_start = if j > i { j - i } else { 0 }; let long_end = std::cmp::min(long_start + shorter.len(), longer.len()); let long_substr = &longer[long_start..long_end]; let r = self.1.score(&shorter, long_substr).try_into().unwrap(); if r > 99 { return 100.try_into().unwrap(); } else if r > max { max = r; } } max } }

also means that token_set_ratio and such can implement the trait and have similar building blocks.

logannc · 2021-04-12T03:26:48Z

Welp, get_matching_blocks has a bug. for the inputs "frodo baggin" and "samwise g" (after codepoint segmentation), it generates a bad item in the queue which causes debug assertions to panic.

I think this wasn't a problem in python because the python find_longest_match does a weird iterative solution different from ours.

maxbachmann · 2021-04-19T22:46:30Z

I think this wasn't a problem in python because the python find_longest_match does a weird iterative solution different from ours.

Not sure whether it helps, but here is my C++ implementation of get_matching_blocks.
https://github.com/maxbachmann/rapidfuzz-cpp/blob/a494e5557087203f54e76e49815c68324114b9e5/rapidfuzz/details/matching_blocks.hpp#L58
It is essentially the Python implementation without all autojunk parts + hash maps to improve the performance

logannc · 2021-04-20T01:00:35Z

I haven't looked at it in a few days but the problem was one of (I can't remember which offhand)

debug_assert!(low1 <= high1);
debug_assert!(low2 <= high2);
debug_assert!(high1 <= shorter.len());
debug_assert!(high2 <= longer.len());
debug_assert!(high1 - low1 <= high2 - low2);

was failing. All of these ought to be maintained for each call to find_longest_match() under the current implementation, I think.

I think your implementation suffers the same issue with

if ((spos + length) < a_high && (dpos + length) < b_high) {
    queue.emplace_back(spos + length, a_high, dpos + length, b_high);
}

I have not looked at this since I noted it last week.

maxbachmann · 2021-04-20T06:50:26Z

I will test this in my implementation. However this is directly taken from the Python implementation (and appears to be the same in your implementation as well). I would personally assume the bug is somewhere in find_longest_match, since there your using quite a different implementation. RapidFuzz uses pretty much the same implementation as Difflib in there, but since I do not use the auto junk feature I was able to get rid of the hashmap b2j

RapidFuzz

https://github.com/maxbachmann/rapidfuzz-cpp/blob/a494e5557087203f54e76e49815c68324114b9e5/rapidfuzz/details/matching_blocks.hpp#L120-L128

      if (length) {
        if (a_low < spos && b_low < dpos) {
          queue.emplace_back(a_low, spos, b_low, dpos);
        }
        if ((spos + length) < a_high && (dpos + length) < b_high) {
          queue.emplace_back(spos + length, a_high, dpos + length, b_high);
        }
        matching_blocks_pass1.emplace_back(spos, dpos, length);
      }

fuzzywuzzy-rs

fuzzywuzzy-rs/src/primitives.rs

Lines 207 to 215 in f787942

    
           if k != 0 { 
        
               matching_blocks.push((i, j, k)); 
        
               if low1 < i && low2 < j { 
        
                   queue.push((low1, i, low2, j)); 
        
               } 
        
               if i + k < high1 && j + k < high2 { 
        
                   queue.push((i + k, high1, j + k, high2)); 
        
               } 
        
           }

difflib

https://github.com/python/cpython/blob/fa03efda3dc6ad118788bebc61079cd481c0b24c/Lib/difflib.py#L490-L495

            if k:   # if k is 0, there was no matching block
                matching_blocks.append(x)
                if alo < i and blo < j:
                    queue.append((alo, i, blo, j))
                if i+k < ahi and j+k < bhi:
                    queue.append((i+k, ahi, j+k, bhi))

logannc added 4 commits March 26, 2021 12:34

initial compat - just committing so i can swap machines...

3197a7b

megachanges, need to amend and split into sensible history.

965c2af

notes

1a7c51d

fixes

09991c5

logannc added the wip work in progress label Mar 27, 2021

logannc self-assigned this Mar 27, 2021

fix notes on wratio

f787942

seanpianka mentioned this pull request Mar 28, 2021

process::extract_without_order must avoid re-invoking processor methods #6

Open

seanpianka reviewed Mar 28, 2021

View reviewed changes

src/normalization.rs

}

}

// TODO: document

Copy link

Collaborator

seanpianka Mar 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO docs

This normalizer splits by whitespace and unstably sorts the input.

seanpianka reviewed Mar 28, 2021

View reviewed changes

logannc mentioned this pull request Mar 29, 2021

use 'thiserror' #29

Open

seanpianka added this to the 1.0 milestone May 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mega Change #28

Mega Change #28

logannc commented Mar 27, 2021 •

edited

Loading

logannc commented Mar 27, 2021

logannc commented Mar 27, 2021

logannc commented Mar 27, 2021

seanpianka Mar 28, 2021

seanpianka Mar 28, 2021

logannc Mar 28, 2021

seanpianka Mar 28, 2021

seanpianka Mar 28, 2021

seanpianka Mar 28, 2021

seanpianka Mar 28, 2021

seanpianka Mar 28, 2021

logannc Mar 28, 2021

seanpianka Mar 28, 2021

seanpianka Mar 28, 2021

logannc Mar 28, 2021

seanpianka Mar 28, 2021

logannc Apr 17, 2021

logannc Apr 17, 2021

logannc commented Apr 12, 2021

maxbachmann commented Apr 19, 2021 •

edited

Loading

logannc commented Apr 20, 2021

maxbachmann commented Apr 20, 2021 •

edited

Loading

Mega Change #28

Are you sure you want to change the base?

Mega Change #28

Conversation

logannc commented Mar 27, 2021 • edited Loading

logannc commented Mar 27, 2021

logannc commented Mar 27, 2021

logannc commented Mar 27, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

logannc commented Apr 12, 2021

maxbachmann commented Apr 19, 2021 • edited Loading

logannc commented Apr 20, 2021

maxbachmann commented Apr 20, 2021 • edited Loading

RapidFuzz

fuzzywuzzy-rs

difflib

logannc commented Mar 27, 2021 •

edited

Loading

maxbachmann commented Apr 19, 2021 •

edited

Loading

maxbachmann commented Apr 20, 2021 •

edited

Loading