Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Regex functions using Exprtk and RE2 #1596

Merged
merged 2 commits into from
Oct 31, 2021
Merged

Add Regex functions using Exprtk and RE2 #1596

merged 2 commits into from
Oct 31, 2021

Conversation

sc1f
Copy link
Contributor

@sc1f sc1f commented Oct 26, 2021

This PR is a first cut at adding regular expression functionality to Perspective's expression API, including (for now) three functions:

  • match(string, pattern) returns True if any substring of string matches pattern
  • fullmatch(string, pattern) returns True if the whole string matches pattern
  • search(string, pattern) returns the first substring that matches the first capturing group in pattern. Because we are using RE2, a regex without a capturing group will be rejected by the type-checker as RE2 does not return the full match, only the results of the capturing group.

The functions themselves are not extremely complicated, as they mostly defer to RE2 to perform the match. However, for performance and to avoid repeatedly compiling regex objects on the same string for each row, I've added a regex cache that works similarly to a t_vocab, so that all valid compiled regexes are stored for the lifetime of the table. This required a large scale refactor of the existing code, hence this PR being created before additional functionality is added.

We chose RE2 as the regex implementation for several reasons:

  • As opposed to Boost which has various unknowns that pertain to the build (is it already present on the system? is the version present header-only or built? is apache arrow pulling in boost already?), we can easily reason about the RE2 build: it's added as an external dependency and it's linked against in our CMakeLists.
  • The worst-case linear time guarantee means we don't have to build our own checks for "will this regex crash the Perspective engine" and focus on building out functionality and tuning the API
  • We benchmarked RE2 against both boost::regex and a custom regex matcher that farmed out functionality to the binding langauge (JS/Python) using Emscripten/Pybind, and while RE2 was slower than Boost and faster than the binding language solution, it offered the least uncertainty/unknowns in terms of the build.
  • The syntax is a little different, and some features (such as lookbehinds) are unavailable, but RE2 remains widely used in GoLang and Google products that use regex (Google Sheets, data products, etc.) and answers to common syntax questions are available online.

We need a few extra things to fully flesh this feature out:

  • Right now, regexes must be string literals, and we have a complicated regular expression that takes a string of the format function(..., intern('string literal'), ...) and converts it to function(..., 'string literal', ...). It would be great to figure out a way to auto-intern strings only when we need them, as compared to the current iteration which interns all strings based on a regex that detects single quotes, and then we have to perform a complex replace operation to "un-intern" those strings. This will only get more complicated as our custom function signatures become more complex.
  • Because the regex pattern can't be interned, any expression that assigns a pattern to a variable (such as var pattern := '(.*)'; search("a", pattern) is invalid. This is annoying for writing longer, more complex regexes that are used in multiple functions, and goes back to the problem with intern as stated above.
  • More functions: substring, replace, indexof, etc.
  • Accessing arbitrary capture groups - search() only returns the first capture group, but we should be able to return any group with something like search(string, pattern, group_number=1)
  • A consume function, which would consume the input string and write the results to a vector:
var results[3]; // any vector big enough to hold all the groups

// column = string with a weird date format like "05 19 2021"
var success := fullsearch("column", '^([0-9]{2}) ([0-9]{2}) ([0-9]{4})', results)
success ? date(integer(results[2]), integer(results[0]), integer([results[1])) : null

@sc1f sc1f requested a review from texodus October 26, 2021 22:20
@sc1f sc1f added C++ enhancement Feature requests or improvements labels Oct 26, 2021
Copy link
Member

@texodus texodus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks for the PR!

@texodus texodus merged commit 7acc9b0 into master Oct 31, 2021
@texodus texodus deleted the regex_functions branch October 31, 2021 00:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ enhancement Feature requests or improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants