<:Foo> syntax in regexes ambiguous #118
Comments
I think S05 is lacking intended details. No regex engine allows for arbitrary property values for all properties without the associated names, due to the obvious conflicts. Most regex engines allow standalone This discussion is happening simultaneously for an active ECMAScript proposal and the current plan is to only support standalone values for Also, supporting |
A good description of |
Even though the last change fixed the problem where the property code for 'space' was different from the one for 'White_Space', it failed some of the tests, notably ' ' ~~ /<:space>/, which may be caused by this problem: Raku/old-design-docs#118 Long discussion here: https://irclog.perlgeek.de/moarvm/2016-12-27#i_13805707
As part of my Unicode Grant I have to address this. From the perspective of implementing it on MoarVM: we are given a name, let's say "Latin", and look up which property is associated with it. In this case it would be the "Script" property. Currently MoarVM throws all the property values in together and assumes that each value belongs to exactly one property, which does not work in practice. As I re-implement this part of the code I need to decide which property values should be resolvable to property names (which is needed for a regex that does not specify the property you are trying to query). I am going to put together a list of all of the conflicts so we can decide how to prioritize them, or at the very least know where all the overlaps are, which ones we want to prioritize, and which are inconsequential.
# All except <True False T F Yes No Y N> and Script/Block overlaps
L => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type", "Bidi_Class", "Jamo_Short_Name", "Canonical_Combining_Class", "General_Category", "Joining_Type"],
Other => ["Indic_Syllabic_Category", "Grapheme_Cluster_Break", "Word_Break", "Sentence_Break", "General_Category"],
EX => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
Numeric => ["Word_Break", "Line_Break", "Sentence_Break", "Numeric_Type"],
XX => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
CR => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
R => ["Bidi_Class", "Jamo_Short_Name", "Canonical_Combining_Class", "Joining_Type"],
M => ["NFKC_Quick_Check", "Jamo_Short_Name", "General_Category", "NFC_Quick_Check"],
LF => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break", "Sentence_Break"],
Regional_Indicator => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
AL => ["Bidi_Class", "Canonical_Combining_Class", "Line_Break"],
EM => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
NU => ["Word_Break", "Line_Break", "Sentence_Break"],
A => ["East_Asian_Width", "Jamo_Short_Name", "Canonical_Combining_Class"],
E_Base => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
RI => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
B => ["Bidi_Class", "Jamo_Short_Name", "Canonical_Combining_Class"],
ZWJ => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
EB => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
Extend => ["Grapheme_Cluster_Break", "Word_Break", "Sentence_Break"],
None => ["Bidi_Paired_Bracket_Type", "Decomposition_Type", "Numeric_Type"],
S => ["Bidi_Class", "Jamo_Short_Name", "General_Category"],
E_Modifier => ["Grapheme_Cluster_Break", "Word_Break", "Line_Break"],
NA => ["Age", "Hangul_Syllable_Type", "Indic_Positional_Category"],
Format => ["Word_Break", "Sentence_Break", "General_Category"],
C => ["Jamo_Short_Name", "General_Category", "Joining_Type"],
Right => ["Canonical_Combining_Class", "Indic_Positional_Category"],
Unassigned => ["Age", "General_Category"],
Control => ["Grapheme_Cluster_Break", "General_Category"],
Nukta => ["Indic_Syllabic_Category", "Canonical_Combining_Class"],
E => ["Joining_Group", "Jamo_Short_Name"],
Surrogate => ["Line_Break", "General_Category"],
Punctuation => ["Block", "General_Category"],
V => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type"],
Nonspacing_Mark => ["Bidi_Class", "General_Category"],
Number => ["Indic_Syllabic_Category", "General_Category"],
SP => ["Line_Break", "Sentence_Break"],
E_Base_GAZ => ["Grapheme_Cluster_Break", "Word_Break"],
Close_Punctuation => ["Line_Break", "General_Category"],
Unknown => ["Script", "Line_Break"],
GAZ => ["Grapheme_Cluster_Break", "Word_Break"],
LV => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type"],
IS => ["Canonical_Combining_Class", "Line_Break"],
CL => ["Line_Break", "Sentence_Break"],
Open_Punctuation => ["Line_Break", "General_Category"],
Private_Use => ["Block", "General_Category"],
Paragraph_Separator => ["Bidi_Class", "General_Category"],
Pe => ["Joining_Group", "General_Category"],
D => ["Jamo_Short_Name", "Joining_Type"],
Narrow => ["East_Asian_Width", "Decomposition_Type"],
NL => ["Word_Break", "Line_Break"],
Wide => ["East_Asian_Width", "Decomposition_Type"],
Virama => ["Indic_Syllabic_Category", "Canonical_Combining_Class"],
Hebrew_Letter => ["Word_Break", "Line_Break"],
U => ["Jamo_Short_Name", "Joining_Type"],
LE => ["Word_Break", "Sentence_Break"],
Left => ["Canonical_Combining_Class", "Indic_Positional_Category"],
Glue_After_Zwj => ["Grapheme_Cluster_Break", "Word_Break"],
Close => ["Bidi_Paired_Bracket_Type", "Sentence_Break"],
BB => ["Jamo_Short_Name", "Line_Break"],
HL => ["Word_Break", "Line_Break"],
P => ["Jamo_Short_Name", "General_Category"],
Maybe => ["NFKC_Quick_Check", "NFC_Quick_Check"],
EBG => ["Grapheme_Cluster_Break", "Word_Break"],
Combining_Mark => ["Line_Break", "General_Category"],
LVT => ["Grapheme_Cluster_Break", "Hangul_Syllable_Type"],
FO => ["Word_Break", "Sentence_Break"],
H => ["East_Asian_Width", "Jamo_Short_Name"],
Ambiguous => ["East_Asian_Width", "Line_Break"],
Here are all the ones that are Block and Script overlaps:
# Only Script/Block overlap
Malayalam => ["Script", "Block"],
Sundanese => ["Script", "Block"],
Mahajani => ["Script", "Block"],
Pau_Cin_Hau => ["Script", "Block"],
Tibetan => ["Script", "Block"],
Sora_Sompeng => ["Script", "Block"],
Runic => ["Script", "Block"],
Thai => ["Script", "Block"],
Osage => ["Script", "Block"],
Rejang => ["Script", "Block"],
Bassa_Vah => ["Script", "Block"],
Gurmukhi => ["Script", "Block"],
Glagolitic => ["Script", "Block"],
Old_Hungarian => ["Script", "Block"],
Grantha => ["Script", "Block"],
Palmyrene => ["Script", "Block"],
Gothic => ["Script", "Block"],
Lao => ["Script", "Block"],
Nabataean => ["Script", "Block"],
Limbu => ["Script", "Block"],
Old_Persian => ["Script", "Block"],
Phoenician => ["Script", "Block"],
Tai_Le => ["Script", "Block"],
Ol_Chiki => ["Script", "Block"],
Khudawadi => ["Script", "Block"],
Old_Permic => ["Script", "Block"],
Elbasan => ["Script", "Block"],
Duployan => ["Script", "Block"],
Samaritan => ["Script", "Block"],
Syriac => ["Script", "Block"],
Devanagari => ["Script", "Block"],
Greek => ["Script", "Block"],
Lycian => ["Script", "Block"],
Ethiopic => ["Script", "Block"],
Thaana => ["Script", "Block"],
Hatran => ["Script", "Block"],
Siddham => ["Script", "Block"],
Psalter_Pahlavi => ["Script", "Block"],
Kharoshthi => ["Script", "Block"],
Mandaic => ["Script", "Block"],
Newa => ["Script", "Block"],
Kayah_Li => ["Script", "Block"],
Warang_Citi => ["Script", "Block"],
Multani => ["Script", "Block"],
Osmanya => ["Script", "Block"],
Georgian => ["Script", "Block"],
Armenian => ["Script", "Block"],
Sinhala => ["Script", "Block"],
Hiragana => ["Script", "Block"],
Shavian => ["Script", "Block"],
New_Tai_Lue => ["Script", "Block"],
Bamum => ["Script", "Block"],
Cyrillic => ["Script", "Block"],
Old_South_Arabian => ["Script", "Block"],
Myanmar => ["Script", "Block"],
Miao => ["Script", "Block"],
Meroitic_Cursive => ["Script", "Block"],
Tirhuta => ["Script", "Block"],
Coptic => ["Script", "Block"],
Caucasian_Albanian => ["Script", "Block"],
Hanunoo => ["Script", "Block"],
Tamil => ["Script", "Block"],
Avestan => ["Script", "Block"],
Cherokee => ["Script", "Block"],
Inscriptional_Pahlavi => ["Script", "Block"],
Kannada => ["Script", "Block"],
Tifinagh => ["Script", "Block"],
Javanese => ["Script", "Block"],
Inscriptional_Parthian => ["Script", "Block"],
Mro => ["Script", "Block"],
Cham => ["Script", "Block"],
Takri => ["Script", "Block"],
Hangul => ["Script", "Block"],
Old_Turkic => ["Script", "Block"],
Oriya => ["Script", "Block"],
Kaithi => ["Script", "Block"],
Ahom => ["Script", "Block"],
Linear_A => ["Script", "Block"],
Meetei_Mayek => ["Script", "Block"],
Egyptian_Hieroglyphs => ["Script", "Block"],
Ugaritic => ["Script", "Block"],
Buginese => ["Script", "Block"],
Tagalog => ["Script", "Block"],
Anatolian_Hieroglyphs => ["Script", "Block"],
Pahawh_Hmong => ["Script", "Block"],
Tangut => ["Script", "Block"],
Telugu => ["Script", "Block"],
Batak => ["Script", "Block"],
Phags_Pa => ["Script", "Block"],
Vai => ["Script", "Block"],
Mongolian => ["Script", "Block"],
Modi => ["Script", "Block"],
Bhaiksuki => ["Script", "Block"],
Lisu => ["Script", "Block"],
Lydian => ["Script", "Block"],
Brahmi => ["Script", "Block"],
Cuneiform => ["Script", "Block"],
Tai_Viet => ["Script", "Block"],
Syloti_Nagri => ["Script", "Block"],
Chakma => ["Script", "Block"],
Adlam => ["Script", "Block"],
Braille => ["Script", "Block"],
Marchen => ["Script", "Block"],
Deseret => ["Script", "Block"],
Imperial_Aramaic => ["Script", "Block"],
Arabic => ["Script", "Block"],
Khmer => ["Script", "Block"],
Balinese => ["Script", "Block"],
Bengali => ["Script", "Block"],
Bopomofo => ["Script", "Block"],
Tai_Tham => ["Script", "Block"],
Mende_Kikakui => ["Script", "Block"],
Hebrew => ["Script", "Block"],
Meroitic_Hieroglyphs => ["Script", "Block"],
Sharada => ["Script", "Block"],
Khojki => ["Script", "Block"],
Lepcha => ["Script", "Block"],
Saurashtra => ["Script", "Block"],
Tagbanwa => ["Script", "Block"],
Old_Italic => ["Script", "Block"],
Gujarati => ["Script", "Block"],
Carian => ["Script", "Block"],
Old_North_Arabian => ["Script", "Block"],
Ogham => ["Script", "Block"],
Buhid => ["Script", "Block"],
Manichaean => ["Script", "Block"],
Katakana => ["Script", "Block", "Word_Break"],
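Overlap lists like the ones above can be produced by inverting the (property, value) pairs into a value-to-properties index and keeping the values claimed by more than one property. The following is only a rough sketch of that idea, not the script that generated the lists: the function names `build_value_index` and `find_conflicts` are made up for illustration, and `SAMPLE_PAIRS` is a tiny hand-picked subset of the pairs mentioned in this thread rather than a parse of PropertyValueAliases.txt.

```python
from collections import defaultdict

# Hand-picked sample of (property, value) pairs taken from this thread.
# A real run would read every pair from the Unicode database instead.
SAMPLE_PAIRS = [
    ("Script", "Latin"),
    ("Script", "Katakana"),
    ("Block", "Katakana"),
    ("Word_Break", "Katakana"),
    ("Script", "Greek"),
    ("Block", "Greek"),
    ("Script", "Unknown"),
    ("Line_Break", "Unknown"),
]


def build_value_index(pairs):
    """Map each value name to the sorted list of properties that use it."""
    index = defaultdict(set)
    for prop, value in pairs:
        index[value].add(prop)
    return {value: sorted(props) for value, props in index.items()}


def find_conflicts(index):
    """Keep only the value names that belong to more than one property."""
    return {value: props for value, props in index.items() if len(props) > 1}


index = build_value_index(SAMPLE_PAIRS)
conflicts = find_conflicts(index)

for value, props in sorted(conflicts.items()):
    print(f"{value} => {props}")
```

In this sample "Latin" maps to a single property and so resolves without help, while "Katakana", "Greek" and "Unknown" all need some tie-breaking rule.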
All of the property names that conflict with values are Bool properties:
I would like this to be 0th in priority If we set our preferred properties to be From here we can resolve Canonical_Combining_Class, and also we should resolve Leaving us at a hierarchy of
I am open to adding whichever properties people think most important to the ordered priority list as well. The ones with overlap remaining after this point:
Any ideas about adding further to the hierarchy (even if they don't have any overlap presently [Unicode 9.0], it could be introduced later) will be appreciated.
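To make the idea of an ordered priority list concrete, here is a minimal sketch of how a bare value name could be resolved against such a hierarchy. It is an assumption-laden illustration, not MoarVM's code: the `PRIORITY` order below is a placeholder chosen only for the example and is not the order proposed in this thread, and `resolve` is a hypothetical helper. The candidate lists are copied from the overlap tables in this thread.

```python
# Placeholder priority order, for illustration only. It is NOT the
# hierarchy proposed in this discussion.
PRIORITY = ["General_Category", "Script", "Block"]

# Value name -> properties that define it (taken from the overlap lists).
CANDIDATES = {
    "Punctuation": ["Block", "General_Category"],
    "Greek": ["Script", "Block"],
    "Katakana": ["Script", "Block", "Word_Break"],
    "Maybe": ["NFKC_Quick_Check", "NFC_Quick_Check"],
}


def resolve(value, candidates=CANDIDATES, priority=PRIORITY):
    """Return the property a bare value name resolves to.

    Raises KeyError for an unknown value and ValueError when the value
    is ambiguous and none of its properties appear in the priority list.
    """
    props = candidates[value]
    if len(props) == 1:
        return props[0]
    for prop in priority:
        if prop in props:
            return prop
    raise ValueError(
        f"{value!r} is ambiguous among {sorted(props)}; "
        "name the property explicitly"
    )


print(resolve("Punctuation"))  # first match in PRIORITY wins
print(resolve("Katakana"))
```

With this design, a value whose properties are all outside the priority list (such as "Maybe" here) is rejected instead of being resolved arbitrarily, so the user has to name the property explicitly.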
S05 defines <:Foo> as:
The second form is unambiguous. The first, less so. Here's a quote from the Unicode database (in PropertyValueAliases.txt):
Which raises the question of what <:AL> would mean, or <:Sc>. The one that actually tripped me up is <:space>, which can either be an alias for the WSpace property (per PropertyAliases.txt):
Or a property value name from the linebreak property:
The ambiguity is currently resolved by the order in which we make entries into the lookup hash, which is defined by the order in which we generate the C code in ucd2c.pl, which in turn is randomized due to Perl 5 hash order randomization. So you can get spectest failures, regenerate from the exact same Unicode database version and ucd2c.pl, and "get lucky" next time around. I came upon this by getting "unlucky" when doing the Unicode 9 database version bump, but it's been a problem all along.
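The order dependence can be illustrated with a small model of a last-write-wins lookup table. This is only a sketch under stated assumptions: it is written in Python rather than taken from ucd2c.pl, the helper names `build_lookup` and `build_lookup_sorted` are invented, and the two entries stand in for the property alias and the Line_Break value alias that both match "space".

```python
# Two entries that end up under the same lookup key, "space".
# The target strings are simplified stand-ins for the real records.
ENTRIES = [
    ("space", "White_Space"),       # property alias
    ("Space", "Line_Break=Space"),  # value alias of the Line_Break property
]


def build_lookup(entries):
    """Insert entries unconditionally, so the last writer of a key wins."""
    table = {}
    for name, target in entries:
        table[name.lower()] = target
    return table


def build_lookup_sorted(entries):
    """Same as build_lookup, but insert in a fixed, sorted order."""
    ordered = sorted(entries, key=lambda e: (e[0].lower(), e[1]))
    return build_lookup(ordered)


forward = build_lookup(ENTRIES)
backward = build_lookup(list(reversed(ENTRIES)))

# The answer flips with the insertion order...
print(forward["space"], "vs", backward["space"])

# ...unless the order is pinned before inserting.
stable_a = build_lookup_sorted(ENTRIES)
stable_b = build_lookup_sorted(list(reversed(ENTRIES)))
print(stable_a["space"] == stable_b["space"])
```

Sorting only makes the outcome reproducible; it does not decide which meaning of <:space> is the right one. That choice still has to come from an explicit rule.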