-
Notifications
You must be signed in to change notification settings - Fork 757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sclang: Fix 'findRegexp' empty-result case #4241
sclang: Fix 'findRegexp' empty-result case #4241
Conversation
Previously, an empty search result would cause prString_FindRegexp() to exit early, resulting in stack corruption and an incorrect return value. We remove the early exit, so that this function always goes through the proper stack unwinding at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
testsuite/classlibrary/TestString.sc
Outdated
) | ||
} | ||
|
||
test_findRegexp_EmptyResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Empty/empty
this.assert( | ||
result[0] == [0, 4, 10, 16] and: { result[1][2] == "brown" }, | ||
"`\"the quick brown fox\".findRegexp(\"[a-zA-Z]+\")` should return 4 results at indices [0, 4, 10, 16], and the third word should be 'brown'" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't understand the need for the complexity of this test, it makes it difficult to understand and verify (why 4 matches, why only check the contents of the third match?). it could just as easily be "a".findRegexp("a")
. maybe this function needs at most one more unit test to make sure it handles multiple matches, but beyond that it's basically testing boost.regex, which already has its own test suite :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've simplified it, though not quite as simple as you propose. My thinking is: We don't have a functional spec for these methods, and nobody has time to write one. (Documentation is no help here, either -- the sum total of the findRegexp
documentation is "POSIX regular expression search" -- which is... that's a joke, right? Gosh, OK, let me add another commit here, before merging, to provide proper documentation.) So, if somebody does something to one of these methods in the future that breaks the output format, it would be helpful if the test suite includes an example of the result format (and not only for the most trivial case -- if the example result is only [ [ 0, "a" ] ]
, it's hard to infer what 0 is... a character position? An index? but [[0, "two"], [4, "words"]]
gives future developers more information).
Agreed that the test was too quirky before, no problem to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, i don't feel strongly about it. i would probably write a non-match, single match, and multiple match test to cover the sanity check cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
But, hold on maybe an hour before anybody merges? String.schelp documentation of regexp methods is really appalling. I can't let that stand. |
On branch topic/findRegexpStackFix .appveyor.yml~176211d75571fff83d7910a7c8d71e4c1c70d1ab nothing added to commit but untracked files present (use "git add" to track)
OK, done... btw, why does
? |
HelpSource/Classes/String.schelp
Outdated
@@ -372,16 +372,19 @@ code:: | |||
:: | |||
|
|||
method::findRegexp | |||
POSIX regular expression search. | |||
POSIX regular expression search. This method searches exhaustively for all possible matches and collects them into an array of pairs, in the format code::[character index, matching string]::. As in the regular expression standard, code::*:: and code::+:: are greedy; the match list will include the largest possible distinct matches. Note, though, that parentheses for grouping will produce a separate result: code::"aaa".findRegexp("(a+)");:: appears to produce duplicated results code::[ [ 0, aaa ], [ 0, aaa ] ]::, but this is because the first match is for the parentheses and the second is for code::a+::. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fairly certain it's actually Perl flavored regex. https://www.boost.org/doc/libs/1_69_0/libs/regex/doc/html/boost_regex/syntax.html
There are three main syntax options available, depending upon how you construct the regular expression object:
- Perl (this is the default behavior).
you can double check this but otherwise I don't think a match like "aaa".findRegexp("aa*?")
would work. also, this does not search 'exhaustively for all possible matches', see for example "aaa".findRegexp("aa")
. https://www.boost.org/doc/libs/1_69_0/libs/regex/doc/html/boost_regex/syntax/leftmost_longest_rule.html is relevant.
it seems the terminology for unescaped parens in this regex flavor is 'marked sub-expression', but i've also seen 'capturing group'. a more illustrative example to show how results are returned for marked sub-expressions might be "_ab_".findRegexp("(a)(b)")
. (or just annotate the example below for "foobar".findRegexp("(o*)(bar)");
it would probably be a good idea to link directly to the documentation for this syntax. or copy the documentation here with appropriate attribution. I would include at least https://www.boost.org/doc/libs/1_69_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html, possibly also that leftmost-longest-rule document as cryptic as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fairly certain it's actually Perl flavored regex
Fair enough... "POSIX" was the assertion in the original text.
I don't think a match like
"aaa".findRegexp("aa*?")
would work
I'll admit to being out of my depth here. I'm not fluent in the differences between varieties of regular expressions, and I don't understand the relevance of this search to POSIX vs Perl-style (and tbh, I'm not really so interested).
also, this does not search 'exhaustively for all possible matches'
I did say: "As in the regular expression standard, code::*:: and code::+:: are greedy; the match list will include the largest possible distinct matches." I guess I should change that to say "the match list will include only the leftmost largest possible..."
Fair enough to annotate that example.
New commit coming soon.
it is AFAICT. whoops. no strong opinion, maybe worth deprecating and supplying one that just returns string/nil? |
HelpSource/Classes/String.schelp
Outdated
POSIX regular expression search. | ||
Perl regular expression search (see link::Classes/String#Regular expressions::). This method searches exhaustively for matches and collects them into an array of pairs, in the format code::[character index, matching string]::. | ||
|
||
"Leftmost largest match": As in the regular expression standard, code::*:: and code::+:: are greedy; if it is possible to have more than one overlapping match for a part of the regular expression, the match list will include only the leftmost and largest of them. In code::"foobar".findRegexp("(o*)(bar)");:: below, code::"o*":: may potentially have three matches: code::"o":: at index 1 (second character), code::"o":: at index 2, and code::"oo":: at index 1. code::findRegexp:: will return only the last of these (code::"oo"::), because it begins in the leftmost-possible matching position, and it is the longest possible match at that position. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no 'regular expression standard'. also this explanation is confusing -- the atom o*
has 3 matches in foobar
but only 2 if it's part of the pattern o*bar
. if you're going to go to the trouble of writing out an example, i think you should choose a clearer one to avoid confusing people.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you're going to go to the trouble of writing out an example, i think you should choose a clearer one to avoid confusing people.
You had suggested that example 😉 ("(or just annotate the example below for "foobar".findRegexp("(o*)(bar)");
)")
But I reread your comment -- that suggestion was for marked sub-expressions, which got lost in my explanation.
Rewritten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
Purpose and Motivation
Fixes #4239.
Previously, an empty search result would cause
prString_FindRegexp()
to exit early, resulting in stack corruption and an incorrect return value. We remove the early exit, so that this function always goes through the proper stack unwinding at the end.Types of changes
Checklist