Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normative: Add RegExp v flag with set notation and properties of strings #2418

Merged
merged 1 commit into from
Jun 15, 2023

Conversation

mathiasbynens
Copy link
Member

@mathiasbynens mathiasbynens commented May 28, 2021

Proposal repo: https://github.com/tc39/proposal-regexp-set-notation

Preview link for the spec changes with inline diffs: https://arai-a.github.io/ecma262-compare/?pr=2418

@ljharb ljharb added needs test262 tests The proposal should specify how to test an implementation. Ideally via github.com/tc39/test262 normative change Affects behavior required to correctly evaluate some ECMAScript source text pending stage 4 This proposal has not yet achieved stage 4, but may otherwise be ready to merge. proposal This is related to a specific proposal, and will be closed/merged when the proposal reaches stage 4. labels May 31, 2021
@ljharb ljharb changed the title [Normative] Add RegExp v flag with set notation and properties of strings Normative: Add RegExp v flag with set notation and properties of strings May 31, 2021
@ljharb ljharb force-pushed the master branch 3 times, most recently from 3d0c24c to 7a79833 Compare June 29, 2021 02:21
spec.html Outdated Show resolved Hide resolved
@mathiasbynens mathiasbynens marked this pull request as ready for review November 17, 2021 07:10
@bakkot
Copy link
Contributor

bakkot commented Nov 30, 2021

This PR still has a bunch of "FYI" comments in it, which I assume you don't intend. Is it actually ready for review?

@ljharb
Copy link
Member

ljharb commented Dec 1, 2021

@FrankYFTang we don't use separated html files; can you inline the table?

@markusicu
Copy link
Contributor

This PR still has a bunch of "FYI" comments in it, which I assume you don't intend. Is it actually ready for review?

We thought that it's ok for stage 3 to still have editorial notes and explainers. We intend to eventually turn many of them into permanent NOTEs or whatever is a reasonable format, based on feedback. Ok?

@bakkot
Copy link
Contributor

bakkot commented Dec 2, 2021

Sure, that's ok. So do you just want the normative parts reviewed at the moment? Also, do you intend to address the TODOs before review? Some of them look like they'd be normative.

@markusicu
Copy link
Contributor

So do you just want the normative parts reviewed at the moment? Also, do you intend to address the TODOs before review? Some of them look like they'd be normative.

I believe that we covered nearly everything, but I plan to go through it myself once more with a fine-tooth comb.

I hope that this will not block stage 3 reviewers from reviewing and providing feedback.

@ljharb
Copy link
Member

ljharb commented Dec 2, 2021

(Generally those things are fine in the proposal, but don't go in the spec PR)

@FrankYFTang
Copy link

@FrankYFTang we don't use separated html files; can you inline the table?

See
https://github.com/tc39/ecma262/blob/main/table-binary-unicode-properties.html
https://github.com/tc39/ecma262/blob/main/table-nonbinary-unicode-properties.html
https://github.com/tc39/ecma262/blob/main/table-unicode-general-category-values.html
https://github.com/tc39/ecma262/blob/main/table-unicode-script-values.html

Notice these table are separated html files because for every new verison of Unicode standard @mathiasbynens update them. I am not sure he use some tool to generate them or by hand in the past. I just follow that

See
#1041
#1218

@bakkot
Copy link
Contributor

bakkot commented Dec 7, 2021

What is \q? The proposal readme doesn't say, I don't recall it being presented, and neither the google doc nor this PR give it any semantics.

@markusicu
Copy link
Contributor

markusicu commented Dec 7, 2021

What is \q? The proposal readme doesn't say, I don't recall it being presented, and neither the google doc nor this PR give it any semantics.

We flip-flopped on the string literal syntax, and went back from (string|literal) to \q{string|literal} which is what the Unicode regex spec (UTS 18) recommends. See tc39/proposal-regexp-v-flag#33 (comment)

This is one of a few things that we will note specially in next week's meeting.

We did edit the readme to change example string literals to this syntax.

Copy link
Contributor

@bakkot bakkot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had some initial comments. I haven't done a thorough review yet, this was just a first pass.

spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated
If _UnicodeSets_ is *false*, then a CharSetElement is a character in the sense of the Pattern Semantics above.
</li>
<li>
If _UnicodeSets_ is *true*, then a CharSetElement is either a character in the sense of the Pattern Semantics above, or it is a sequence of characters, that is, a string. This includes the empty String and strings with more than 1 character. A string of length 1 is the same as a single character.
Copy link
Contributor

@bakkot bakkot Dec 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this type can hold sequences of multiple code points, it should be renamed to something other than CharSet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the flags, it's a set of code units, or code points, or code points plus sequences of code points. The concept of "character" under the new flag encompasses single code points but is also sufficiently general for the logical concept of "character" which frequently includes things that are encoded in Unicode with sequences of multiple code points. (Of course not vice versa.)

Also, renaming it would affect parts of the spec that need not be touched for the normative changes here.

There may or may not be a better name for this. If we find one, can we make that change later, on its own (separate from these normative changes)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is confusing enough that I would be quite uncomfortable landing it unless we've though looked hard for better names and failed to find one. I'll raise this at the editor call next week.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Editors are agreed this will need a new name before landing the PR. That doesn't need to block anything until that point, though; it can still get stage 3.

spec.html Show resolved Hide resolved
spec.html Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
@mathiasbynens
Copy link
Member Author

(Note that we're not yet asking for Stage 3 advancement, although we are asking Stage 3 reviewers to start reviewing: tc39/agendas#1093)

@bakkot
Copy link
Contributor

bakkot commented Dec 9, 2021

Incidentally this will need a rebase, mostly due to #2531. I can take care of that for you, if you'd like. Sorry about the conflict, although I do think #2531 will make this easier to follow.

@mathiasbynens
Copy link
Member Author

Incidentally this will need a rebase, mostly due to #2531. I can take care of that for you, if you'd like.

That'd be lovely, if you don't mind! Thanks.

spec.html Show resolved Hide resolved
@bakkot
Copy link
Contributor

bakkot commented Dec 11, 2021

OK, rebased and adopted the conventions of #2531. Not certain I did it perfectly, but it seems to match up pretty well.

jmdyck
jmdyck previously requested changes May 30, 2023
Copy link
Collaborator

@jmdyck jmdyck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to add UnicodeSetsMode to Term, Assertion, QuantifiableAssertion, and ExtendedAtom.

You added UnicodeSetsMode to the LHS occurrences of those symbols, but not to their RHS occurrences.

Not sure about the prefix. ? might work everywhere, but it might be clearer to use ~ when a RHS is guarded by [~UnicodeMode].


On closer examination, QuantifiableAssertion and ExtendedAtom are only used under [~UnicodeMode], so assuming that the combination ~UnicodeMode, +UnicodeSetsMode never occurs, you don't need to add UnicodeSetsMode to those two nonterminals, you can instead just change ?UnicodeSetsMode to ~UnicodeSetsMode in their RHSs.

Here's what I mean: mathiasbynens#26

spec.html Outdated Show resolved Hide resolved
@mathiasbynens mathiasbynens requested a review from jmdyck May 31, 2023 08:01
@mathiasbynens
Copy link
Member Author

I’ve rebased this and the esmeta check now passes (thanks to #3078).

Are there any other blockers preventing this PR from being accepted?

@michaelficarra
Copy link
Member

@mathiasbynens No, just waiting on review from one more editor.

@ljharb ljharb requested a review from a team June 6, 2023 00:28
Copy link
Contributor

@bakkot bakkot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some relatively minor comments, of which only the CharSet one absolutely needs to be addressed before landing. Otherwise looks good to me.

A <dfn>CharSetElement</dfn> is one of the two following entities:
<ul>
<li>
If _rer_.[[UnicodeSets]] is *false*, then a CharSetElement is a character in the sense of the Pattern Semantics above.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this unbound alias, but I guess I don't have a better suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility would be: "If 'v' does/doesn't appear in the RegExp's flags, ..." although then "the RegExp" doesn't have an antecedent.

Maybe "In the context of a RegExp with/without a 'v' flag, ..."

spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
spec.html Show resolved Hide resolved
spec.html Outdated Show resolved Hide resolved
Copy link
Contributor

@syg syg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % question

spec.html Outdated Show resolved Hide resolved
@michaelficarra michaelficarra added editor call to be discussed in the next editor call and removed editor call to be discussed in the next editor call labels Jun 14, 2023
Copy link
Contributor

@syg syg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still lgtm, thanks for the CharSetElement changes.

@michaelficarra michaelficarra added the ready to merge Editors believe this PR needs no further reviews, and is ready to land. label Jun 15, 2023
This enables the use of set notation, string literal syntax, and Unicode properties of strings in regular expressions.

Proposal repo: https://github.com/tc39/proposal-regexp-set-notation

Co-authored-by: Markus Scherer <markus.icu@gmail.com>
@ljharb ljharb dismissed jmdyck’s stale review June 15, 2023 19:37

changes addressed

@ljharb ljharb merged commit 26b2369 into tc39:main Jun 15, 2023
@mathiasbynens mathiasbynens deleted the regexp-v-flag branch June 15, 2023 21:06
@mathiasbynens
Copy link
Member Author

Thanks everyone!

jmdyck added a commit to jmdyck/ecma262 that referenced this pull request Jun 16, 2023
... prompted by tc39#2418 (comment)
from `Atom :: CharacterClass` semantics
to `AtomEscape :: CharacterClassEscape` semantics.
jmdyck added a commit to jmdyck/ecma262 that referenced this pull request Jul 7, 2023
- In CompileAtom, take the wording-changes under
  `Atom :: CharacterClass` that were prompted by
  tc39#2418 (comment)
  and copy them to `AtomEscape :: CharacterClassEscape`.

- In various operations, change more occurrences of
  'element' to 'CharSetElement'.

- In CompileAtom, change 2 occurrences of "which" to "that"
  because the usage is restrictive.
ljharb pushed a commit to jmdyck/ecma262 that referenced this pull request Jul 7, 2023
- In CompileAtom, take the wording-changes under
  `Atom :: CharacterClass` that were prompted by
  tc39#2418 (comment)
  and copy them to `AtomEscape :: CharacterClassEscape`.

- In various operations, change more occurrences of
  'element' to 'CharSetElement'.

- In CompileAtom, change 2 occurrences of "which" to "that"
  because the usage is restrictive.
zhangenming pushed a commit to zhangenming/ecma262 that referenced this pull request Dec 22, 2023
- In CompileAtom, take the wording-changes under
  `Atom :: CharacterClass` that were prompted by
  tc39#2418 (comment)
  and copy them to `AtomEscape :: CharacterClassEscape`.

- In various operations, change more occurrences of
  'element' to 'CharSetElement'.

- In CompileAtom, change 2 occurrences of "which" to "that"
  because the usage is restrictive.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
has stage 4 This PR represents a proposal that has achieved stage 4, and is ready to merge. normative change Affects behavior required to correctly evaluate some ECMAScript source text proposal This is related to a specific proposal, and will be closed/merged when the proposal reaches stage 4. ready to merge Editors believe this PR needs no further reviews, and is ready to land.
Projects
None yet
Development

Successfully merging this pull request may close these issues.