[classlib] add fuzzy array comparisons #4468

jrsurge · 2019-06-27T13:41:42Z

Purpose and Motivation

To add fuzzy string comparisons.

EDIT: this PR has changed focus to providing more general, fuzzy Array comparisons

Initial motivation came from the work on choosing an audio device for Windows, I started thinking if we could work out the most similar string for a device name to make guessing a little smarter, accounting for typos etc.

The core of this is fuzzy string comparison.

It's very easy to take these new methods, and write a quick check for nearest string, comparing each word to find the most relevant match (or if agreed, add the methods below for this).

Fuzzy comparison would prove useful to the language, including within the interpreter itself.

This PR adds the following instance methods to the SequenceableCollection class:

editDistance
Alternatively known as Levenshtein Distance, this is the minimum amount of changes required to convert one string into another.
similarity
This uses the editDistance to calculate the percentage similarity (scaled between 0-1) of the two strings (0 is completely different, 1 is exactly the same)

It also adds a primitive to sclang for faster execution, using SC_LevenshteinDistance.h

Examples

"hello".editDistance("hallo"); // 1
"word".similarity("wodr"); // 0.5

Types of changes

Documentation
New feature

To-do list

Code is tested
All tests are passing
Updated documentation
This PR is ready for review

muellmusik · 2019-06-27T15:15:07Z

Cool!

…

On 27 Jun 2019, at 14:41, James Surgenor ***@***.***> wrote: Purpose and Motivation To add fuzzy string comparisons. Initial motivation came from the work on choosing an audio device for Windows, I started thinking if we could work out the most similar string for a device name to make guessing a little smarter, accounting for typos etc. The core of this is fuzzy string comparison, so I added it to the language. It's very easy to take these new methods, and write a quick check for nearest string, comparing each word to find the most relevant match (or if agreed, add the methods below for this). For this to apply to the device selection, we'd need a way to return an array of device name strings on all platforms, but fuzzy string comparison itself could prove useful to the language even without this, so here it is. This PR adds the following instance methods to the String class: editDistance Alternatively known as Levenshtein Distance, this is the minimum amount of changes required to convert one string into another. similarity This uses the editDistance to calculate the percentage similarity (scaled between 0-1) of the two strings (0 is completely different, 1 is exactly the same) isSimilar This uses the similarity, and the given delta to return a bool if the two strings are at least as similar as the delta value. Examples "hello".editDistance("hallo"); // 1 "word".similarity("wodr"); // 0.5 // isSimliar(string="", delta=0.5) "word".isSimilar("wards"); // true, within 50% limit "word".isSimilar("wards", 0.75); // false, outside 75% limit Additional work If possible, I'd also like to add a numSimilarWords and nearestString methods, as follows: numSimilarWords { | otherString="", delta=0.5 | var count = 0; var thisWords = this.split($ ); var otherWords = otherString.split($ ); thisWords.do({ | thisWord | // is this word similar to a word in the other string? otherWords.collect({ |otherWord| if(thisWord.isSimilar(otherWord, delta)) { count = count + 1; }; }); }); ^count; } nearestString { | strings=#[], delta=0.5 | var lookupInd = 0; var maxCount = 0; strings.do({ | string, ind | var count = this.numSimilarWords(string, delta); if(count > maxCount) { maxCount = count; lookupInd = ind; }; }); ^strings[lookupInd]; } Which would allow: ( "awesoem".nearestString([ "Some Device That's Not Great", "Some Good Device That's Not What We Want", "Another Amazing Device Name", "My Awesome Device Name" ]); // "My Awesome Device Name" ) But I'm not 100% sold on the interface Types of changes Documentation New feature To-do list Code is tested All tests are passing Updated documentation This PR is ready for review You can view, comment on, or merge this pull request online at: #4468 Commit Summary [classlib] add fuzzy string comparisons to String [unittest] add unit tests for fuzzy string comparisons [docs] add help for fuzzy string comparisons File Changes M HelpSource/Classes/String.schelp (53) M SCClassLibrary/Common/Collections/String.sc (61) M testsuite/classlibrary/TestString.sc (86) Patch Links: https://github.com/supercollider/supercollider/pull/4468.patch https://github.com/supercollider/supercollider/pull/4468.diff — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

mossheim · 2019-06-27T22:41:36Z

This is great @jrsurge!

Just to make this clear first (so the following makes more sense), in our current implementation Strings are just arrays of bytes. Multi-byte character codes from UTF-8 are considered as multiple separate bytes, and so the edit distance between "你好" and "" is 4, not 2.

With that in mind, a couple thoughts:

I think the idea of "words" is not well-defined enough for us to use it in the core library. Depending on your purpose and language, what delimits a word could change drastically. Here, you're just splitting by space; what about newlines and other whitespace? Is "my.device" considered one word or two? What about "cat's" vs "cat‘s", "cat-dog" vs "cat--dog" vs "cat–dog" (en-dash)? Should you include surrounding punctuation when you split? I would expect the words in "this... is(a dog)" to be [this, is, a, dog], while under the current implementation it's [this..., is(a, dog)]. Then there is also a bit of inherent ethnocentrism in that this method works best if your string is all characters from the ASCII set and in a language where words are delimited by spaces.

Some of these points can't be fully addressed until we have the ability to iterate through strings as sequences of Unicode code points. Even then, I'd be skeptical, because there just isn't a definition that will work for most or even all situations when it comes down to the corner cases. I would rather not have our core library take a position on that, OR offer a watered down version of it whose name reflects the more salient biases/assumptions it makes.

So, I'm a no on numSimilarWords and nearestString. I think these should either (1) be written in the place they're needed with a very clear notice about the assumptions made, or (2) be reworked so they are more generic (although currently I'm skeptical they can be).

I could see a case for a less opinionated version of nearestString that just selects the argument with the lowest edit distance. In fact, that was what I expected when I first saw the name. And this is likely to be a common usage, since the primary usage I see for edit distance is when you need to pick which of the things in a list the user probably meant to type.

Note that this will still choose the right device in your example:

(
var nearestString = { |s, l| l[l.maxIndex { |x| x.similarity(s) }] }; // sorry :<
nearestString.("awesoem", [
	"Some Device That's Not Great",
	"Some Good Device That's Not What We Want",
	"Another Amazing Device Name",
	"My Awesome Device Name"
]) // ->  My Awesome Device Name
)

I'm also a little skeptical on the utility of isSimilar. The first line is clearer to me, and almost the same number of characters:

"abc".similarity("def") > 0.7
"abc".isSimilar("def", 0.7)

You lose a bit of semantic value when you hide the comparison behind the function interface, and the name isSimilar doesn't give much of a hint about what the second argument means.

I think we should supply these for the most abstract class possible, SequenceableCollection. I can copy and paste your exact definitions for editDistance and similarity there, and they work just fine. :)

Finally, Levenshtein Distance is IMO worth making a C++ function and primitive. This could be super useful in the sclang compiler, and in IDE code, so it'd be nice to have a single (fast) implementation. In addition, this is a function I can see people wanting to potentially call on very large strings; it'd be nice if this weren't the bottleneck in case people want to, for instance, compute the edit distance between two small audio files. If that's done, we may need one primitive for SequenceableCollection and a separate one for ArrayedCollection. What do you think?

mossheim

See comment above

jrsurge · 2019-06-27T23:19:02Z

So, I'm a no on numSimilarWords and nearestString. I think these should either (1) be written in the place they're needed with a very clear notice about the assumptions made, or (2) be reworked so they are more generic (although currently I'm skeptical they can be).

Yup, this is exactly why I didn't include it in the initial PR - they didn't feel general enough.

I could see a case for a less opinionated version of nearestString that just selects the argument with the lowest edit distance. In fact, that was what I expected when I first saw the name. And this is likely to be a common usage, since the primary usage I see for edit distance is when you need to pick which of the things in a list the user probably meant to type.

Note that this will still choose the right device in your example:
(
var nearestString = { |s, l| l[l.maxIndex { |x| x.similarity(s) }] }; // sorry :<
nearestString.("awesoem", [
	"Some Device That's Not Great",
	"Some Good Device That's Not What We Want",
	"Another Amazing Device Name",
	"My Awesome Device Name"
]) // ->  My Awesome Device Name
)

In this example, that's true. But I noticed it gets massively skewed with mixtures of long and short strings. With a short search string, and an array with longer strings and a short string, checking with the entire string, instead of 'words', becomes problematic. It will almost always choose the shortest string because it involves the least manipulation.
But, it should be more general, I definitely agree.

I'm also a little skeptical on the utility of isSimilar. The first line is clearer to me, and almost the same number of characters:
"abc".similarity("def") > 0.7
"abc".isSimilar("def", 0.7)
You lose a bit of semantic value when you hide the comparison behind the function interface, and the name isSimilar doesn't give much of a hint about what the second argument means.

This is fair. I liked it because it felt more concise when checking similarity, but it really is just a wrapper around "abc".similarity("def") > 0.7

I think we should supply these for the most abstract class possible, SequenceableCollection. I can copy and paste your exact definitions for editDistance and similarity there, and they work just fine. :)

Somewhat terrifying, I hadn't really thought about this given the expected use case, but that's super cool.

Finally, Levenshtein Distance is IMO worth making a C++ function and primitive. This could be super useful in the sclang compiler, and in IDE code, so it'd be nice to have a single (fast) implementation. In addition, this is a function I can see people wanting to potentially call on very large strings; it'd be nice if this weren't the bottleneck in case people want to, for instance, compute the edit distance between two small audio files. If that's done, we may need one primitive for SequenceableCollection and a separate one for ArrayedCollection. What do you think?

I thought about this, but on checking the speed in sclang, I was pretty impressed. I didn't really test anything massive though.
For massive strings, or moving to SequenceableCollection/ArrayedCollection, this might be more of an issue, so a primitive might be better in the long run.

I am, however, somewhat unsure how to write one.
If that's the direction we'd prefer, I'll go and consult the book?

mossheim · 2019-06-28T00:38:31Z

In this example, that's true. But I noticed it gets massively skewed with mixtures of long and short strings. With a short search string, and an array with longer strings and a short string, checking with the entire string, instead of 'words', becomes problematic. It will almost always choose the shortest string because it involves the least manipulation.
But, it should be more general, I definitely agree.

Ok, yes I can see that. But, what exactly is the goal here? IIUC, the goal is to make it easy to choose something from a list by typing enough characters to disambiguate the element you want. If that's the case, I have some ideas for other algorithms that might be better.

I thought about this, but on checking the speed in sclang, I was pretty impressed. I didn't really test anything massive though.
For massive strings, or moving to SequenceableCollection/ArrayedCollection, this might be more of an issue, so a primitive might be better in the long run.

I definitely think so. Here's a benchmark. Note that we probably can't achieve exactly these speeds in c++, but for raw array types we should be able to come close. For object types it will of course be a lot slower even in C++ because of the need to invoke == on the objects rather than being able to compare the values directly.

In sclang, using your implementation, release build:

(
Array.geom(12, 4, 2).do { |n|
	n.post;
	' '.post;
	bench { String.fill(n, $a).editDistance(String.fill(n, $b)) }
}
)

/*
4
time to run: 7.4634999918999e-05 seconds.
8
time to run: 0.00019521800004441 seconds.
16
time to run: 0.00066588400000001 seconds.
32
time to run: 0.0023637589999908 seconds.
64
time to run: 0.0062601669999367 seconds.
128
time to run: 0.030553861000044 seconds.
256
time to run: 0.11677848900013 seconds.
512
time to run: 0.443587944 seconds.
1024
time to run: 1.2803825400001 seconds.
2048
time to run: 4.8449122709999 seconds.
4096
time to run: 19.728719459 seconds.
8192
time to run: 77.879591756 seconds.
*/

in C++, using the Rosetta code implementation:

[brianheim@BrianMBP lsd]$ for i in `seq 2 16`; do echo $((2 ** $i)); time ./main $((2 ** $i)); done
4
distance: 4

real	0m0.004s
user	0m0.001s
sys	0m0.002s
8
distance: 8

real	0m0.003s
user	0m0.001s
sys	0m0.001s
16
distance: 16

real	0m0.003s
user	0m0.001s
sys	0m0.001s
32
distance: 32

real	0m0.003s
user	0m0.001s
sys	0m0.001s
64
distance: 64

real	0m0.003s
user	0m0.001s
sys	0m0.001s
128
distance: 128

real	0m0.003s
user	0m0.001s
sys	0m0.001s
256
distance: 256

real	0m0.003s
user	0m0.001s
sys	0m0.002s
512
distance: 512

real	0m0.003s
user	0m0.001s
sys	0m0.001s
1024
distance: 1024

real	0m0.005s
user	0m0.003s
sys	0m0.001s
2048
distance: 2048

real	0m0.009s
user	0m0.007s
sys	0m0.002s
4096
distance: 4096

real	0m0.025s
user	0m0.023s
sys	0m0.001s
8192
distance: 8192

real	0m0.088s
user	0m0.086s
sys	0m0.002s
16384
distance: 16384

real	0m0.337s
user	0m0.334s
sys	0m0.002s
32768
distance: 32768

real	0m1.345s
user	0m1.341s
sys	0m0.003s
65536
distance: 65536

real	0m5.373s
user	0m5.366s
sys	0m0.005s

Note that 32K in C++ takes about the same time as it does for 1K in sclang.

There are also different ways of computing Levenshtein difference that could be useful in different situations (https://en.wikipedia.org/wiki/Levenshtein_distance#Computing_Levenshtein_distance). There's one that uses a reduced memory footprint (O(n) instead of O(n^2)) and also an approximation algorithm. For very large arrays that's probably best. However, there's no guarantee those are going into the library unless someone is motivated to write them, so if we just have this one implementation, it would be great to have one that is as flexible and powerful as possible.

I am, however, somewhat unsure how to write one.
If that's the direction we'd prefer, I'll go and consult the book?

N.B. - James and I have been discussing this in private

jrsurge · 2019-06-30T20:59:34Z

Thanks to Brian for all his time and help with this.

For anyone interested, I have a working branch with an initial outline of the primitive version.
Currently, it can't compare objects (any advice on that appreciated), but everything else should work.

https://github.com/jrsurge/supercollider/tree/topic/sclang-add-fuzzy-string-comparisons-working

jrsurge · 2019-07-02T16:35:08Z

Based on Brian's suggestions, this is now implemented as a primitive.

The codebase now has SC_Levenshtein.h which is a header-only templated functor to calculate the distance between two arrays.

As is currently stands, all SequenceableCollections have access to editDistance and similarity.

ArrayedCollections (Array, String, RawArray etc.) use the primitive to calculate editDistance, but all other SequenceableCollections use the sclang-based implementation.

Any thoughts greatly appreciated.

mossheim · 2019-12-29T01:18:01Z

@jrsurge planning to review this in the next day or two. just making sure, are you planning to continue work on this?

jrsurge · 2019-12-29T15:49:56Z

@brianlheim absolutely!

Updated it last month to pass CI (there's a slot datatype enum that only exists in 64-bit builds?).

Any comments greatly appreciated :)

mossheim · 2019-12-29T21:10:33Z

James and i discussed this privately earlier today. we agreed on the following design for edit distance:

a C++ primitive that always checks for identity using SlotEq, and only operates on objects for which classHasIndexableInstances is set. i.e., ArrayedCollection's subclasses.
a sclang implementation on SequenceableCollection taking |other, compareFunc|. compareFunc defaults to identity comparison if nil. compareFunc is used to test for element equality. the current implementation is almost this idea, just missing compareFunc
a function on ArrayedCollection which takes |other, compareFunc| and calls down to the primitive if and only if compareFunc is nil. otherwise, it calls super.editDistance(other, compareFunc), which invokes the SeqColl implementation
if receiver and other are of different raw array types (i.e., String vs Array), fall back to the SeqColl implementation. it should be possible to find the edit distance between "abc" and [$a, $b, $c]. whether to do this before or after invoking the primitive (i.e., treating failure in the primitive as an indication that we should try the sclang impl instead) was a matter of discussion. i don't think the contents of obj_format are accessible in any way in the class library. if that's the case then it would have to be done the latter way (fail in the primitive, fall back to sclang impl), otherwise we are not being as generic about it as we could. (for instance, anyone is welcome to write MyString[char] : RawArray and it should be interoperable with String)

this means that everything is fast when you have two Arrays, Strings, or RawArrays and want to compare by element identity, which is the same as equality in those cases. i would expect these to be the majority of use cases.

a problem here is that (1) it's unwieldy to invoke == inside a primitive and (2) we don't want to penalize users who put homogeneous data in polymorphic arrays, which is typical practice. [1, 2, 3, 4, 5].distance(...) will run as fast as Int32Array[1, 2, 3, 4, 5].distance(...) with this approach, but since the comparison has to be identity-based, users who write ["abc", "def"].editDistance(...) may be surprised. however in my view as long as it's documented as identity-based by default, the default behavior is sane and matches other standard library functions like includes.

jrsurge · 2019-12-31T13:48:24Z

Thanks Brian.

~~I have a question about falling back.~~

~~In the case of two raw arrays of different underlying types, is failure of the primitive the expected behaviour?~~

~~If the edit distance is the number of changes to convert one into the other, then that counts as an entirely different array, not a failure?~~

~~Returning max(this.size, that.size) in those cases would greatly reduce the number of fallback conditions.~~

EDIT: I see how this doesn't work for "abc".editDistance([$a, $b, $c]), please ignore!

mossheim

thanks @jrsurge! i looked this over and the implementation seems really solid. i added some new tests, and made a PR against your branch to include them

mossheim · 2020-01-04T21:55:40Z

testsuite/classlibrary/TestArray.sc

@@ -186,6 +186,92 @@ TestArray : UnitTest {
 		}
 	}

+		// ----- fuzzy comparisons -----------------------------------------------


indentation

I had pasted this from somewhere else in the file - I'll reformat, but I'm wondering if it is even needed?

when i write a lot of tests for a function or group of functions i often leave a comment like this, in this case i would say "editDistance and similarity" to be totally clear

mossheim · 2020-01-04T21:56:21Z

testsuite/classlibrary/TestArray.sc

+		var result = [].editDistance([0,1,2,3]);
+		var expected = 4;
+
+		this.assertEquals(result, expected);


for the sake of conserving vertical space i'd be perfectly fine with reducing these method bodies to 2 lines each:

var result = ...; this.assertEquals(result, 4);

or 3 lines by removing the blank lines. what do you think?

happy to reduce to 2 lines, but I think there was a guideline somewhere that specified it?

those guidelines are WIP, i think we should revisit them soon.. it's really up to you here, just saying what i would prefer

mossheim · 2020-01-04T21:58:36Z

SCClassLibrary/Common/Collections/SequenceableCollection.sc

+	prLevenshteinDistance { | other, compareFunc |
+		// This is the same algorithm as the primitive, just in
+		// sclang to allow equality
+		var matrix = Array.fill(other.size + 1, { | ind | ind; });


can just be Array.iota(other.size + 1)

never really knew about iota! Thanks

mossheim · 2020-01-04T22:24:23Z

HelpSource/Classes/SequenceableCollection.schelp

+A change can be: an addition, a deletion, or a substitution.
+This is known as the Levenshtein Distance and is implemented in SuperCollider using the Wagner–Fischer algorithm.
+
+In the most common usage, where both arrays are of the same type, like comparing two strings, a faster primitive will be used to calculate the distance. This primitive will compare teletype::SequenceableCollection::s using strong::identity::.


clarify that comparison is always identity unless otherwise specified, this makes it sound like identity comparison is only done for this specific case

this is also not entirely accurate -- the primitive is only used if both arrays are of the same type, and that type is a raw array (String, Array, Int8Array, Int16Array, ... in the main library, or a user-defined one). that's kind of a lot to explain, but i think it's important to be technically accurate. i also wouldn't call it "the most common usage", but rather "a common usage".

actually, looking at the extra notes down below, you might just want to add a small paragraph somewhere entitled "performance" and explain all of this there. at present, it kind of disrupts the flow.

Agreed - docs were written quite quickly, happy to change this stuff. Thanks!

mossheim · 2020-01-04T22:27:16Z

HelpSource/Classes/SequenceableCollection.schelp

+"hello".editDistance([$h, $e, $l, $l, $o]);
+::
+
+For special cases that require comparisons other than identity, the optional teletype::compareFunc:: can be given to compare elements. This function will be passed two arguments, representing a single element from each array to compare, and it is the responsibility of the user to ensure that this function returns a boolean (teletype::true:: or teletype::false::) as to whether or not the elements are equal.


i also would avoid calling this a special case, the case insensitive example you gave below seems quite reasonable actually. you also don't need to "boolean (true or false)", "boolean" on its own is clear enough. "it is the responsibility of the user to ensure that this function returns" is to me an overly formal way to say "this function should return"

mossheim · 2020-01-04T23:11:42Z