Add Regex functions using Exprtk and RE2 #1596
Merged
This PR is a first cut at adding regular expression functionality to Perspective's expression API, including (for now) three functions:
- match(string, pattern): returns True if any substring of string matches pattern
- fullmatch(string, pattern): returns True if the whole string matches pattern
- search(string, pattern): returns the first substring that matches the first capturing group in pattern. Because we are using RE2, a regex without a capturing group will be rejected by the type-checker, as RE2 does not return the full match, only the results of the capturing group.

The functions themselves are not extremely complicated, as they mostly defer to RE2 to perform the match. However, for performance and to avoid repeatedly compiling regex objects on the same string for each row, I've added a regex cache that works similarly to a t_vocab, so that all valid compiled regexes are stored for the lifetime of the table. This required a large-scale refactor of the existing code, hence this PR being created before additional functionality is added.

We chose RE2 as the regex implementation for several reasons:
- boost::regex and a custom regex matcher that farmed out functionality to the binding language (JS/Python) using Emscripten/Pybind, and while RE2 was slower than Boost and faster than the binding language solution, it offered the least uncertainty/unknowns in terms of the build.

We need a few extra things to fully flesh this feature out:
- function(..., intern('string literal'), ...) and converts it to function(..., 'string literal', ...). It would be great to figure out a way to auto-intern strings only when we need them, as compared to the current iteration which interns all strings based on a regex that detects single quotes, and then we have to perform a complex replace operation to "un-intern" those strings. This will only get more complicated as our custom function signatures become more complex.
- var pattern := '(.*)'; search("a", pattern) is invalid. This is annoying for writing longer, more complex regexes that are used in multiple functions, and goes back to the problem with intern as stated above.
- search(string, pattern, group_number=1)
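
To make the semantics of the three functions and the compile-once cache concrete, here is a minimal Python sketch. It is not the code in this PR: it uses Python's re module as a stand-in for RE2, and RegexCache, compile_count, and the module-level CACHE are hypothetical names chosen for illustration. The type-checker rejection of a pattern without a capturing group is modelled as a ValueError, which is also an assumption.

```python
import re
from typing import Dict, Optional


class RegexCache:
    """Hypothetical stand-in for the regex cache: compile each pattern once."""

    def __init__(self) -> None:
        self._compiled: Dict[str, re.Pattern] = {}
        self.compile_count = 0  # illustration only: counts real compilations

    def get(self, pattern: str) -> re.Pattern:
        compiled = self._compiled.get(pattern)
        if compiled is None:
            # re.compile raises re.error for an invalid pattern,
            # so only valid patterns are ever stored.
            compiled = re.compile(pattern)
            self._compiled[pattern] = compiled
            self.compile_count += 1
        return compiled


CACHE = RegexCache()


def match(string: str, pattern: str) -> bool:
    """True if any substring of `string` matches `pattern`."""
    return CACHE.get(pattern).search(string) is not None


def fullmatch(string: str, pattern: str) -> bool:
    """True if the whole of `string` matches `pattern`."""
    return CACHE.get(pattern).fullmatch(string) is not None


def search(string: str, pattern: str) -> Optional[str]:
    """First substring matched by the first capturing group, else None."""
    compiled = CACHE.get(pattern)
    if compiled.groups < 1:
        # Assumption: models the rejection of patterns without a group.
        raise ValueError("pattern must contain a capturing group")
    found = compiled.search(string)
    return found.group(1) if found else None


# The same pattern applied to many rows is compiled only once.
rows = ["id=1;", "id=22;", "no id here"]
ids = [search(row, r"id=(\d+)") for row in rows]  # ['1', '22', None]
```

Keying the cache on the pattern string means a column of any length pays the compilation cost once per distinct pattern, which is the behaviour the description attributes to the table-lifetime cache.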