Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sclang: Fix 'findRegexp' empty-result case #4241

Merged
merged 5 commits into from
Jan 20, 2019

Conversation

jamshark70
Copy link
Contributor

Purpose and Motivation

Fixes #4239.

Previously, an empty search result would cause prString_FindRegexp() to exit early, resulting in stack corruption and an incorrect return value. We remove the early exit, so that this function always goes through the proper stack unwinding at the end.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • All previous tests are passing
  • Tests were updated or created to address changes in this PR, and tests are passing
  • This PR is ready for review

Previously, an empty search result would cause prString_FindRegexp()
to exit early, resulting in stack corruption and an incorrect return
value. We remove the early exit, so that this function always goes
through the proper stack unwinding at the end.
@jamshark70 jamshark70 added bug Issues that relate to unexpected/unwanted behavior. Don't use for PRs. comp: sclang sclang C++ implementation (primitives, etc.). for changes to class lib use "comp: class library" labels Jan 10, 2019
Copy link
Contributor

@mossheim mossheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

)
}

test_findRegexp_EmptyResult {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Empty/empty

this.assert(
result[0] == [0, 4, 10, 16] and: { result[1][2] == "brown" },
"`\"the quick brown fox\".findRegexp(\"[a-zA-Z]+\")` should return 4 results at indices [0, 4, 10, 16], and the third word should be 'brown'"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't understand the need for the complexity of this test, it makes it difficult to understand and verify (why 4 matches, why only check the contents of the third match?). it could just as easily be "a".findRegexp("a"). maybe this function needs at most one more unit test to make sure it handles multiple matches, but beyond that it's basically testing boost.regex, which already has its own test suite :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've simplified it, though not quite as simple as you propose. My thinking is: We don't have a functional spec for these methods, and nobody has time to write one. (Documentation is no help here, either -- the sum total of the findRegexp documentation is "POSIX regular expression search" -- which is... that's a joke, right? Gosh, OK, let me add another commit here, before merging, to provide proper documentation.) So, if somebody does something to one of these methods in the future that breaks the output format, it would be helpful if the test suite includes an example of the result format (and not only for the most trivial case -- if the example result is only [ [ 0, "a" ] ], it's hard to infer what 0 is... a character position? An index? but [[0, "two"], [4, "words"]] gives future developers more information).

Agreed that the test was too quirky before, no problem to change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, i don't feel strongly about it. i would probably write a non-match, single match, and multiple match test to cover the sanity check cases.

@patrickdupuis patrickdupuis added this to the 3.11 milestone Jan 11, 2019
Copy link
Contributor

@mossheim mossheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@jamshark70
Copy link
Contributor Author

But, hold on maybe an hour before anybody merges? String.schelp documentation of regexp methods is really appalling. I can't let that stand.

On branch topic/findRegexpStackFix
	.appveyor.yml~176211d75571fff83d7910a7c8d71e4c1c70d1ab
nothing added to commit but untracked files present (use "git add" to track)
@jamshark70
Copy link
Contributor Author

OK, done... btw, why does findRegexpAt return an array [matching string, matching length]? Seems redundant... is there ever a case where:

x = str.findRegexpAt(regexp, index);
x[0].size != x[1]

?

@@ -372,16 +372,19 @@ code::
::

method::findRegexp
POSIX regular expression search.
POSIX regular expression search. This method searches exhaustively for all possible matches and collects them into an array of pairs, in the format code::[character index, matching string]::. As in the regular expression standard, code::*:: and code::+:: are greedy; the match list will include the largest possible distinct matches. Note, though, that parentheses for grouping will produce a separate result: code::"aaa".findRegexp("(a+)");:: appears to produce duplicated results code::[ [ 0, aaa ], [ 0, aaa ] ]::, but this is because the first match is for the parentheses and the second is for code::a+::.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fairly certain it's actually Perl flavored regex. https://www.boost.org/doc/libs/1_69_0/libs/regex/doc/html/boost_regex/syntax.html

There are three main syntax options available, depending upon how you construct the regular expression object:

  • Perl (this is the default behavior).

you can double check this but otherwise I don't think a match like "aaa".findRegexp("aa*?") would work. also, this does not search 'exhaustively for all possible matches', see for example "aaa".findRegexp("aa"). https://www.boost.org/doc/libs/1_69_0/libs/regex/doc/html/boost_regex/syntax/leftmost_longest_rule.html is relevant.

it seems the terminology for unescaped parens in this regex flavor is 'marked sub-expression', but i've also seen 'capturing group'. a more illustrative example to show how results are returned for marked sub-expressions might be "_ab_".findRegexp("(a)(b)"). (or just annotate the example below for "foobar".findRegexp("(o*)(bar)");

it would probably be a good idea to link directly to the documentation for this syntax. or copy the documentation here with appropriate attribution. I would include at least https://www.boost.org/doc/libs/1_69_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html, possibly also that leftmost-longest-rule document as cryptic as it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fairly certain it's actually Perl flavored regex

Fair enough... "POSIX" was the assertion in the original text.

I don't think a match like "aaa".findRegexp("aa*?") would work

I'll admit to being out of my depth here. I'm not fluent in the differences between varieties of regular expressions, and I don't understand the relevance of this search to POSIX vs Perl-style (and tbh, I'm not really so interested).

also, this does not search 'exhaustively for all possible matches'

I did say: "As in the regular expression standard, code::*:: and code::+:: are greedy; the match list will include the largest possible distinct matches." I guess I should change that to say "the match list will include only the leftmost largest possible..."

Fair enough to annotate that example.

New commit coming soon.

@mossheim
Copy link
Contributor

Seems redundant...

it is AFAICT. whoops. no strong opinion, maybe worth deprecating and supplying one that just returns string/nil?

POSIX regular expression search.
Perl regular expression search (see link::Classes/String#Regular expressions::). This method searches exhaustively for matches and collects them into an array of pairs, in the format code::[character index, matching string]::.

"Leftmost largest match": As in the regular expression standard, code::*:: and code::+:: are greedy; if it is possible to have more than one overlapping match for a part of the regular expression, the match list will include only the leftmost and largest of them. In code::"foobar".findRegexp("(o*)(bar)");:: below, code::"o*":: may potentially have three matches: code::"o":: at index 1 (second character), code::"o":: at index 2, and code::"oo":: at index 1. code::findRegexp:: will return only the last of these (code::"oo"::), because it begins in the leftmost-possible matching position, and it is the longest possible match at that position.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no 'regular expression standard'. also this explanation is confusing -- the atom o* has 3 matches in foobar but only 2 if it's part of the pattern o*bar. if you're going to go to the trouble of writing out an example, i think you should choose a clearer one to avoid confusing people.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you're going to go to the trouble of writing out an example, i think you should choose a clearer one to avoid confusing people.

You had suggested that example 😉 ("(or just annotate the example below for "foobar".findRegexp("(o*)(bar)");)")

But I reread your comment -- that suggestion was for marked sub-expressions, which got lost in my explanation.

Rewritten.

Copy link
Contributor

@mossheim mossheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@mossheim mossheim merged commit a9f74f0 into supercollider:develop Jan 20, 2019
@mtmccrea mtmccrea mentioned this pull request Jul 30, 2019
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issues that relate to unexpected/unwanted behavior. Don't use for PRs. comp: sclang sclang C++ implementation (primitives, etc.). for changes to class lib use "comp: class library"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[develop] String:findRegexp handles empty results incorrectly
3 participants