sclang: Fix 'findRegexp' empty-result case #4241

jamshark70 · 2019-01-10T01:17:33Z

Purpose and Motivation

Previously, an empty search result would cause prString_FindRegexp() to exit early, resulting in stack corruption and an incorrect return value. We remove the early exit, so that this function always goes through the proper stack unwinding at the end.

Types of changes

Bug fix (non-breaking change which fixes an issue)

Checklist

All previous tests are passing
Tests were updated or created to address changes in this PR, and tests are passing
This PR is ready for review

Previously, an empty search result would cause prString_FindRegexp() to exit early, resulting in stack corruption and an incorrect return value. We remove the early exit, so that this function always goes through the proper stack unwinding at the end.

mossheim

thanks!

mossheim · 2019-01-10T02:52:21Z

testsuite/classlibrary/TestString.sc

+		)
+	}
+
+	test_findRegexp_EmptyResult {


s/Empty/empty

mossheim · 2019-01-10T02:58:14Z

testsuite/classlibrary/TestString.sc

+		this.assert(
+			result[0] == [0, 4, 10, 16] and: { result[1][2] == "brown" },
+			"`\"the quick brown fox\".findRegexp(\"[a-zA-Z]+\")` should return 4 results at indices [0, 4, 10, 16], and the third word should be 'brown'"
+		)


don't understand the need for the complexity of this test, it makes it difficult to understand and verify (why 4 matches, why only check the contents of the third match?). it could just as easily be "a".findRegexp("a"). maybe this function needs at most one more unit test to make sure it handles multiple matches, but beyond that it's basically testing boost.regex, which already has its own test suite :)

OK, I've simplified it, though not quite as simple as you propose. My thinking is: We don't have a functional spec for these methods, and nobody has time to write one. (Documentation is no help here, either -- the sum total of the findRegexp documentation is "POSIX regular expression search" -- which is... that's a joke, right? Gosh, OK, let me add another commit here, before merging, to provide proper documentation.) So, if somebody does something to one of these methods in the future that breaks the output format, it would be helpful if the test suite includes an example of the result format (and not only for the most trivial case -- if the example result is only [ [ 0, "a" ] ], it's hard to infer what 0 is... a character position? An index? but [[0, "two"], [4, "words"]] gives future developers more information).

Agreed that the test was too quirky before, no problem to change it.

sure, i don't feel strongly about it. i would probably write a non-match, single match, and multiple match test to cover the sanity check cases.

mossheim

thanks!

jamshark70 · 2019-01-11T01:50:32Z

But, hold on maybe an hour before anybody merges? String.schelp documentation of regexp methods is really appalling. I can't let that stand.

On branch topic/findRegexpStackFix .appveyor.yml~176211d75571fff83d7910a7c8d71e4c1c70d1ab nothing added to commit but untracked files present (use "git add" to track)

jamshark70 · 2019-01-11T02:13:06Z

OK, done... btw, why does findRegexpAt return an array [matching string, matching length]? Seems redundant... is there ever a case where:

x = str.findRegexpAt(regexp, index);
x[0].size != x[1]

?

mossheim · 2019-01-11T03:10:20Z

HelpSource/Classes/String.schelp

@@ -372,16 +372,19 @@ code::
 ::

 method::findRegexp
-POSIX regular expression search.
+POSIX regular expression search. This method searches exhaustively for all possible matches and collects them into an array of pairs, in the format code::[character index, matching string]::. As in the regular expression standard, code::*:: and code::+:: are greedy; the match list will include the largest possible distinct matches. Note, though, that parentheses for grouping will produce a separate result: code::"aaa".findRegexp("(a+)");:: appears to produce duplicated results code::[ [ 0, aaa ], [ 0, aaa ] ]::, but this is because the first match is for the parentheses and the second is for code::a+::.


fairly certain it's actually Perl flavored regex. https://www.boost.org/doc/libs/1_69_0/libs/regex/doc/html/boost_regex/syntax.html

There are three main syntax options available, depending upon how you construct the regular expression object:

Perl (this is the default behavior).

you can double check this but otherwise I don't think a match like "aaa".findRegexp("aa*?") would work. also, this does not search 'exhaustively for all possible matches', see for example "aaa".findRegexp("aa"). https://www.boost.org/doc/libs/1_69_0/libs/regex/doc/html/boost_regex/syntax/leftmost_longest_rule.html is relevant.

it seems the terminology for unescaped parens in this regex flavor is 'marked sub-expression', but i've also seen 'capturing group'. a more illustrative example to show how results are returned for marked sub-expressions might be "_ab_".findRegexp("(a)(b)"). (or just annotate the example below for "foobar".findRegexp("(o*)(bar)");

it would probably be a good idea to link directly to the documentation for this syntax. or copy the documentation here with appropriate attribution. I would include at least https://www.boost.org/doc/libs/1_69_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html, possibly also that leftmost-longest-rule document as cryptic as it is.

fairly certain it's actually Perl flavored regex

Fair enough... "POSIX" was the assertion in the original text.

I don't think a match like "aaa".findRegexp("aa*?") would work

I'll admit to being out of my depth here. I'm not fluent in the differences between varieties of regular expressions, and I don't understand the relevance of this search to POSIX vs Perl-style (and tbh, I'm not really so interested).

also, this does not search 'exhaustively for all possible matches'

I did say: "As in the regular expression standard, code::*:: and code::+:: are greedy; the match list will include the largest possible distinct matches." I guess I should change that to say "the match list will include only the leftmost largest possible..."

Fair enough to annotate that example.

New commit coming soon.

mossheim · 2019-01-11T03:19:16Z

Seems redundant...

it is AFAICT. whoops. no strong opinion, maybe worth deprecating and supplying one that just returns string/nil?

mossheim · 2019-01-16T03:10:17Z

HelpSource/Classes/String.schelp

-POSIX regular expression search.
+Perl regular expression search (see link::Classes/String#Regular expressions::). This method searches exhaustively for matches and collects them into an array of pairs, in the format code::[character index, matching string]::.
+
+"Leftmost largest match": As in the regular expression standard, code::*:: and code::+:: are greedy; if it is possible to have more than one overlapping match for a part of the regular expression, the match list will include only the leftmost and largest of them. In code::"foobar".findRegexp("(o*)(bar)");:: below, code::"o*":: may potentially have three matches: code::"o":: at index 1 (second character), code::"o":: at index 2, and code::"oo":: at index 1. code::findRegexp:: will return only the last of these (code::"oo"::), because it begins in the leftmost-possible matching position, and it is the longest possible match at that position.


there is no 'regular expression standard'. also this explanation is confusing -- the atom o* has 3 matches in foobar but only 2 if it's part of the pattern o*bar. if you're going to go to the trouble of writing out an example, i think you should choose a clearer one to avoid confusing people.

if you're going to go to the trouble of writing out an example, i think you should choose a clearer one to avoid confusing people.

You had suggested that example 😉 ("(or just annotate the example below for "foobar".findRegexp("(o*)(bar)");)")

But I reread your comment -- that suggestion was for marked sub-expressions, which got lost in my explanation.

Rewritten.

mossheim

thanks!

jamshark70 added bug Issues that relate to unexpected/unwanted behavior. Don't use for PRs. comp: sclang sclang C++ implementation (primitives, etc.). for changes to class lib use "comp: class library" labels Jan 10, 2019

jamshark70 mentioned this pull request Jan 10, 2019

[develop] String:findRegexp handles empty results incorrectly #4239

Closed

mossheim suggested changes Jan 10, 2019

View reviewed changes

patrickdupuis added this to the 3.11 milestone Jan 11, 2019

Testsuite: findRegexp tests: Fix naming and simplify

7339cef

mossheim approved these changes Jan 11, 2019

View reviewed changes

Help: Add detail to String regexp method documentation [skip ci]

6c5ba5b

On branch topic/findRegexpStackFix .appveyor.yml~176211d75571fff83d7910a7c8d71e4c1c70d1ab nothing added to commit but untracked files present (use "git add" to track)

mossheim reviewed Jan 11, 2019

View reviewed changes

Help: String regex methods, clarify "exhaustive" and other details

5e8a13f

mossheim suggested changes Jan 16, 2019

View reviewed changes

Help: String:findRegexp: Clarify leftmost-largest and subexpressions

48f8879

mossheim approved these changes Jan 17, 2019

View reviewed changes

mossheim merged commit a9f74f0 into supercollider:develop Jan 20, 2019

joshpar mentioned this pull request Jul 16, 2019

Fix build with boost-1.70 #4491

Closed

mtmccrea mentioned this pull request Jul 30, 2019

Topic/gc fixes post 3.10 #4192

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sclang: Fix 'findRegexp' empty-result case #4241

sclang: Fix 'findRegexp' empty-result case #4241

jamshark70 commented Jan 10, 2019

mossheim left a comment

mossheim Jan 10, 2019

mossheim Jan 10, 2019

jamshark70 Jan 11, 2019

mossheim Jan 11, 2019

mossheim left a comment

jamshark70 commented Jan 11, 2019

jamshark70 commented Jan 11, 2019

mossheim Jan 11, 2019

jamshark70 Jan 12, 2019

mossheim commented Jan 11, 2019

mossheim Jan 16, 2019

jamshark70 Jan 17, 2019

mossheim left a comment

sclang: Fix 'findRegexp' empty-result case #4241

sclang: Fix 'findRegexp' empty-result case #4241

Conversation

jamshark70 commented Jan 10, 2019

Purpose and Motivation

Types of changes

Checklist

mossheim left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mossheim left a comment

Choose a reason for hiding this comment

jamshark70 commented Jan 11, 2019

jamshark70 commented Jan 11, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mossheim commented Jan 11, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mossheim left a comment

Choose a reason for hiding this comment