-
Notifications
You must be signed in to change notification settings - Fork 757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[classlib] add fuzzy array comparisons #4468
[classlib] add fuzzy array comparisons #4468
Conversation
976cc09
to
4f69842
Compare
Cool!
… On 27 Jun 2019, at 14:41, James Surgenor ***@***.***> wrote:
Purpose and Motivation
To add fuzzy string comparisons.
Initial motivation came from the work on choosing an audio device for Windows, I started thinking if we could work out the most similar string for a device name to make guessing a little smarter, accounting for typos etc.
The core of this is fuzzy string comparison, so I added it to the language.
It's very easy to take these new methods, and write a quick check for nearest string, comparing each word to find the most relevant match (or if agreed, add the methods below for this).
For this to apply to the device selection, we'd need a way to return an array of device name strings on all platforms, but fuzzy string comparison itself could prove useful to the language even without this, so here it is.
This PR adds the following instance methods to the String class:
editDistance
Alternatively known as Levenshtein Distance, this is the minimum amount of changes required to convert one string into another.
similarity
This uses the editDistance to calculate the percentage similarity (scaled between 0-1) of the two strings (0 is completely different, 1 is exactly the same)
isSimilar
This uses the similarity, and the given delta to return a bool if the two strings are at least as similar as the delta value.
Examples
"hello".editDistance("hallo"); // 1
"word".similarity("wodr"); // 0.5
// isSimliar(string="", delta=0.5)
"word".isSimilar("wards"); // true, within 50% limit
"word".isSimilar("wards", 0.75); // false, outside 75% limit
Additional work
If possible, I'd also like to add a numSimilarWords and nearestString methods, as follows:
numSimilarWords { | otherString="", delta=0.5 |
var count = 0;
var thisWords = this.split($ );
var otherWords = otherString.split($ );
thisWords.do({ | thisWord |
// is this word similar to a word in the other string?
otherWords.collect({ |otherWord|
if(thisWord.isSimilar(otherWord, delta)) {
count = count + 1;
};
});
});
^count;
}
nearestString { | strings=#[], delta=0.5 |
var lookupInd = 0;
var maxCount = 0;
strings.do({ | string, ind |
var count = this.numSimilarWords(string, delta);
if(count > maxCount) {
maxCount = count;
lookupInd = ind;
};
});
^strings[lookupInd];
}
Which would allow:
(
"awesoem".nearestString([
"Some Device That's Not Great",
"Some Good Device That's Not What We Want",
"Another Amazing Device Name",
"My Awesome Device Name"
]); // "My Awesome Device Name"
)
But I'm not 100% sold on the interface
Types of changes
Documentation
New feature
To-do list
Code is tested
All tests are passing
Updated documentation
This PR is ready for review
You can view, comment on, or merge this pull request online at:
#4468
Commit Summary
[classlib] add fuzzy string comparisons to String
[unittest] add unit tests for fuzzy string comparisons
[docs] add help for fuzzy string comparisons
File Changes
M HelpSource/Classes/String.schelp (53)
M SCClassLibrary/Common/Collections/String.sc (61)
M testsuite/classlibrary/TestString.sc (86)
Patch Links:
https://github.com/supercollider/supercollider/pull/4468.patch
https://github.com/supercollider/supercollider/pull/4468.diff
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
d59fe1f
to
62767fe
Compare
This is great @jrsurge! Just to make this clear first (so the following makes more sense), in our current implementation Strings are just arrays of bytes. Multi-byte character codes from UTF-8 are considered as multiple separate bytes, and so the edit distance between "你好" and "" is 4, not 2. With that in mind, a couple thoughts: I think the idea of "words" is not well-defined enough for us to use it in the core library. Depending on your purpose and language, what delimits a word could change drastically. Here, you're just splitting by space; what about newlines and other whitespace? Is "my.device" considered one word or two? What about "cat's" vs "cat‘s", "cat-dog" vs "cat--dog" vs "cat–dog" (en-dash)? Should you include surrounding punctuation when you split? I would expect the words in "this... is(a dog)" to be [this, is, a, dog], while under the current implementation it's [this..., is(a, dog)]. Then there is also a bit of inherent ethnocentrism in that this method works best if your string is all characters from the ASCII set and in a language where words are delimited by spaces. Some of these points can't be fully addressed until we have the ability to iterate through strings as sequences of Unicode code points. Even then, I'd be skeptical, because there just isn't a definition that will work for most or even all situations when it comes down to the corner cases. I would rather not have our core library take a position on that, OR offer a watered down version of it whose name reflects the more salient biases/assumptions it makes. So, I'm a no on I could see a case for a less opinionated version of Note that this will still choose the right device in your example:
I'm also a little skeptical on the utility of "abc".similarity("def") > 0.7
"abc".isSimilar("def", 0.7) You lose a bit of semantic value when you hide the comparison behind the function interface, and the name I think we should supply these for the most abstract class possible, Finally, Levenshtein Distance is IMO worth making a C++ function and primitive. This could be super useful in the sclang compiler, and in IDE code, so it'd be nice to have a single (fast) implementation. In addition, this is a function I can see people wanting to potentially call on very large strings; it'd be nice if this weren't the bottleneck in case people want to, for instance, compute the edit distance between two small audio files. If that's done, we may need one primitive for SequenceableCollection and a separate one for ArrayedCollection. What do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above
Yup, this is exactly why I didn't include it in the initial PR - they didn't feel general enough.
In this example, that's true. But I noticed it gets massively skewed with mixtures of long and short strings. With a short search string, and an array with longer strings and a short string, checking with the entire string, instead of 'words', becomes problematic. It will almost always choose the shortest string because it involves the least manipulation.
This is fair. I liked it because it felt more concise when checking similarity, but it really is just a wrapper around
Somewhat terrifying, I hadn't really thought about this given the expected use case, but that's super cool.
I thought about this, but on checking the speed in sclang, I was pretty impressed. I didn't really test anything massive though. I am, however, somewhat unsure how to write one. |
Ok, yes I can see that. But, what exactly is the goal here? IIUC, the goal is to make it easy to choose something from a list by typing enough characters to disambiguate the element you want. If that's the case, I have some ideas for other algorithms that might be better.
I definitely think so. Here's a benchmark. Note that we probably can't achieve exactly these speeds in c++, but for raw array types we should be able to come close. For object types it will of course be a lot slower even in C++ because of the need to invoke In sclang, using your implementation, release build: (
Array.geom(12, 4, 2).do { |n|
n.post;
' '.post;
bench { String.fill(n, $a).editDistance(String.fill(n, $b)) }
}
)
/*
4
time to run: 7.4634999918999e-05 seconds.
8
time to run: 0.00019521800004441 seconds.
16
time to run: 0.00066588400000001 seconds.
32
time to run: 0.0023637589999908 seconds.
64
time to run: 0.0062601669999367 seconds.
128
time to run: 0.030553861000044 seconds.
256
time to run: 0.11677848900013 seconds.
512
time to run: 0.443587944 seconds.
1024
time to run: 1.2803825400001 seconds.
2048
time to run: 4.8449122709999 seconds.
4096
time to run: 19.728719459 seconds.
8192
time to run: 77.879591756 seconds.
*/ in C++, using the Rosetta code implementation: [brianheim@BrianMBP lsd]$ for i in `seq 2 16`; do echo $((2 ** $i)); time ./main $((2 ** $i)); done
4
distance: 4
real 0m0.004s
user 0m0.001s
sys 0m0.002s
8
distance: 8
real 0m0.003s
user 0m0.001s
sys 0m0.001s
16
distance: 16
real 0m0.003s
user 0m0.001s
sys 0m0.001s
32
distance: 32
real 0m0.003s
user 0m0.001s
sys 0m0.001s
64
distance: 64
real 0m0.003s
user 0m0.001s
sys 0m0.001s
128
distance: 128
real 0m0.003s
user 0m0.001s
sys 0m0.001s
256
distance: 256
real 0m0.003s
user 0m0.001s
sys 0m0.002s
512
distance: 512
real 0m0.003s
user 0m0.001s
sys 0m0.001s
1024
distance: 1024
real 0m0.005s
user 0m0.003s
sys 0m0.001s
2048
distance: 2048
real 0m0.009s
user 0m0.007s
sys 0m0.002s
4096
distance: 4096
real 0m0.025s
user 0m0.023s
sys 0m0.001s
8192
distance: 8192
real 0m0.088s
user 0m0.086s
sys 0m0.002s
16384
distance: 16384
real 0m0.337s
user 0m0.334s
sys 0m0.002s
32768
distance: 32768
real 0m1.345s
user 0m1.341s
sys 0m0.003s
65536
distance: 65536
real 0m5.373s
user 0m5.366s
sys 0m0.005s Note that 32K in C++ takes about the same time as it does for 1K in sclang. There are also different ways of computing Levenshtein difference that could be useful in different situations (https://en.wikipedia.org/wiki/Levenshtein_distance#Computing_Levenshtein_distance). There's one that uses a reduced memory footprint (O(n) instead of O(n^2)) and also an approximation algorithm. For very large arrays that's probably best. However, there's no guarantee those are going into the library unless someone is motivated to write them, so if we just have this one implementation, it would be great to have one that is as flexible and powerful as possible.
N.B. - James and I have been discussing this in private |
Thanks to Brian for all his time and help with this. For anyone interested, I have a working branch with an initial outline of the primitive version. https://github.com/jrsurge/supercollider/tree/topic/sclang-add-fuzzy-string-comparisons-working |
62767fe
to
04b5ec8
Compare
Based on Brian's suggestions, this is now implemented as a primitive. The codebase now has As is currently stands, all
Any thoughts greatly appreciated. |
04b5ec8
to
fa7a340
Compare
@jrsurge planning to review this in the next day or two. just making sure, are you planning to continue work on this? |
@brianlheim absolutely! Updated it last month to pass CI (there's a slot datatype enum that only exists in 64-bit builds?). Any comments greatly appreciated :) |
James and i discussed this privately earlier today. we agreed on the following design for edit distance:
this means that everything is fast when you have two Arrays, Strings, or RawArrays and want to compare by element identity, which is the same as equality in those cases. i would expect these to be the majority of use cases. a problem here is that (1) it's unwieldy to invoke |
Thanks Brian.
EDIT: I see how this doesn't work for "abc".editDistance([$a, $b, $c]), please ignore! |
fa7a340
to
d2b6e83
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @jrsurge! i looked this over and the implementation seems really solid. i added some new tests, and made a PR against your branch to include them
testsuite/classlibrary/TestArray.sc
Outdated
@@ -186,6 +186,92 @@ TestArray : UnitTest { | |||
} | |||
} | |||
|
|||
// ----- fuzzy comparisons ----------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had pasted this from somewhere else in the file - I'll reformat, but I'm wondering if it is even needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when i write a lot of tests for a function or group of functions i often leave a comment like this, in this case i would say "editDistance and similarity" to be totally clear
testsuite/classlibrary/TestArray.sc
Outdated
var result = [].editDistance([0,1,2,3]); | ||
var expected = 4; | ||
|
||
this.assertEquals(result, expected); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the sake of conserving vertical space i'd be perfectly fine with reducing these method bodies to 2 lines each:
var result = ...;
this.assertEquals(result, 4);
or 3 lines by removing the blank lines. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
happy to reduce to 2 lines, but I think there was a guideline somewhere that specified it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those guidelines are WIP, i think we should revisit them soon.. it's really up to you here, just saying what i would prefer
prLevenshteinDistance { | other, compareFunc | | ||
// This is the same algorithm as the primitive, just in | ||
// sclang to allow equality | ||
var matrix = Array.fill(other.size + 1, { | ind | ind; }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can just be Array.iota(other.size + 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never really knew about iota! Thanks
A change can be: an addition, a deletion, or a substitution. | ||
This is known as the Levenshtein Distance and is implemented in SuperCollider using the Wagner–Fischer algorithm. | ||
|
||
In the most common usage, where both arrays are of the same type, like comparing two strings, a faster primitive will be used to calculate the distance. This primitive will compare teletype::SequenceableCollection::s using strong::identity::. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clarify that comparison is always identity unless otherwise specified, this makes it sound like identity comparison is only done for this specific case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also not entirely accurate -- the primitive is only used if both arrays are of the same type, and that type is a raw array (String, Array, Int8Array, Int16Array, ... in the main library, or a user-defined one). that's kind of a lot to explain, but i think it's important to be technically accurate. i also wouldn't call it "the most common usage", but rather "a common usage".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, looking at the extra notes down below, you might just want to add a small paragraph somewhere entitled "performance" and explain all of this there. at present, it kind of disrupts the flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - docs were written quite quickly, happy to change this stuff. Thanks!
"hello".editDistance([$h, $e, $l, $l, $o]); | ||
:: | ||
|
||
For special cases that require comparisons other than identity, the optional teletype::compareFunc:: can be given to compare elements. This function will be passed two arguments, representing a single element from each array to compare, and it is the responsibility of the user to ensure that this function returns a boolean (teletype::true:: or teletype::false::) as to whether or not the elements are equal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i also would avoid calling this a special case, the case insensitive example you gave below seems quite reasonable actually. you also don't need to "boolean (true or false)", "boolean" on its own is clear enough. "it is the responsibility of the user to ensure that this function returns" is to me an overly formal way to say "this function should return"
if (compareFunc(dataA[indX], dataB[indY])) | ||
matrix[indY + 1] = corner; | ||
else | ||
matrix[indY + 1] = min3(upper, corner, matrix[indY]) + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one of the overloads of std::min
(https://en.cppreference.com/w/cpp/algorithm/min) can be used here:
std::min({upper, corner, matrix[indY]})
size_t indY = 0; | ||
|
||
// initialise matrix | ||
for (auto& val : matrix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we actually have iota
in C++ too :) https://en.cppreference.com/w/cpp/algorithm/iota
this could be std::iota(matrix.begin(), matrix.end(), 0)
if (sizeA == 0) | ||
return sizeB; | ||
if (sizeB == 0) | ||
return sizeA; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be reduced to
if (sizeA == 0 || sizeB == 0)
return 0;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this to be general, we can't assume at this point that sizeA is larger?
move this check after the swap and return sizeA (assuming the same logic as my other comment)?
|
||
size_t sizeC = sizeA; | ||
sizeA = sizeB; | ||
sizeB = sizeC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std library again to the rescue:
std::swap(dataA, dataB);
std::swap(sizeA, sizeB);
matrix.resize(sizeB + 1); | ||
|
||
size_t indX = 0; | ||
size_t indY = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
declare these where they're used, in the for loops. also, would personally prefer to see either fooA
and fooB
OR fooX
and fooY
, not both in the same function
d2b6e83
to
bd85ebe
Compare
this commit adds the following methods to allow fuzzy comparisons: - editDistance - similarity
bd85ebe
to
ff4210c
Compare
This should be ready for review, forgot to update it |
This adds: - SC_Levenshtein.h Header-only, templated functor for calculating Levenshtein distances between arrays. - _ArrayLevenshteinDistance Primitive for sclang to calculate edit distances
Add tests for editDistance and similarity
- SequenceableCollection - String (a note, referencing SequenceableCollection help)
ff4210c
to
1a435a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, thank you so much!!
Purpose and Motivation
To add fuzzy string comparisons.
EDIT: this PR has changed focus to providing more general, fuzzy Array comparisons
Initial motivation came from the work on choosing an audio device for Windows, I started thinking if we could work out the most similar string for a device name to make guessing a little smarter, accounting for typos etc.
The core of this is fuzzy string comparison.
It's very easy to take these new methods, and write a quick check for nearest string, comparing each word to find the most relevant match (or if agreed, add the methods below for this).
Fuzzy comparison would prove useful to the language, including within the interpreter itself.
This PR adds the following instance methods to the SequenceableCollection class:
editDistance
Alternatively known as Levenshtein Distance, this is the minimum amount of changes required to convert one string into another.
similarity
This uses the
editDistance
to calculate the percentage similarity (scaled between 0-1) of the two strings (0 is completely different, 1 is exactly the same)It also adds a primitive to sclang for faster execution, using
SC_LevenshteinDistance.h
Examples
Types of changes
To-do list