
<:Foo> syntax in regexes ambiguous #118

Open
@jnthn

Description

S05 defines <:Foo> as:

Unicode properties are indicated by use of pair notation in place of a normal rule name:

<:Letter>   # a letter
<:!Letter>  # a non-letter

Properties with arguments are passed as the argument to the pair:

<:East_Asian_Width<Narrow>>
<:!Blk<ASCII>>

The second form is unambiguous. The first, less so. Here's a quote from the Unicode database (in PropertyValueAliases.txt):

NOTE: Property value names are NOT unique across properties. For example:

AL means Arabic Letter for the Bidi_Class property, and
AL means Above_Left for the Canonical_Combining_Class property, and
AL means Alphabetic for the Line_Break property.

In addition, some property names may be the same as some property value names.
For example:

sc means the Script property, and
Sc means the General_Category property value Currency_Symbol (Sc)

The combination of property value and property name is, however, unique.

Which raises the question of what <:AL> would mean, or <:Sc>. The one that actually tripped me up is <:space>, which can either be an alias for the WSpace property (per PropertyAliases.txt):

WSpace                   ; White_Space                 ; space

Or a property value name from the Line_Break property (per PropertyValueAliases.txt):

lb ; SP                               ; Space
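To make the collisions concrete, here is a minimal Python sketch that groups the names quoted above by their case-folded form and reports those with more than one meaning. The table contents come only from the excerpts in this issue; the case-insensitive comparison and the names `find_ambiguous`, `PROPERTY_ALIASES` and `PROPERTY_VALUE_ALIASES` are assumptions made for illustration, not how ucd2c.pl works.

```python
from collections import defaultdict

# Entries taken from the excerpts quoted above; each is (name, meaning).
PROPERTY_ALIASES = [
    ("sc", "property Script"),
    ("space", "property White_Space"),
]
PROPERTY_VALUE_ALIASES = [
    ("AL", "Bidi_Class=Arabic_Letter"),
    ("AL", "Canonical_Combining_Class=Above_Left"),
    ("AL", "Line_Break=Alphabetic"),
    ("Sc", "General_Category=Currency_Symbol"),
    ("Space", "Line_Break=Space"),
]


def find_ambiguous(entries):
    """Group meanings by case-folded name; keep names with several meanings."""
    meanings = defaultdict(list)
    for name, meaning in entries:
        # Assumption: names are compared case-insensitively.
        meanings[name.lower()].append(meaning)
    return {name: found for name, found in meanings.items() if len(found) > 1}


ambiguous = find_ambiguous(PROPERTY_ALIASES + PROPERTY_VALUE_ALIASES)
for name, found in sorted(ambiguous.items()):
    print(name, "->", found)
```

Under these assumptions, `al`, `sc` and `space` each map to more than one meaning, which is exactly the set of names <:Foo> cannot resolve on its own.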

The ambiguity is currently resolved by the order in which we make entries into the lookup hash, which is defined by the order in which we generate the C code in ucd2c.pl, which in turn is randomized due to Perl 5 hash order randomization. So you can get spectest failures, regenerate from the exact same Unicode database version and ucd2c.pl, and "get lucky" next time around. I came upon this by getting "unlucky" when doing the Unicode 9 database version bump, but it's been a problem all along.
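The order dependence can be illustrated with a small Python sketch. It is a model of the failure mode only: the last-write-wins insertion, the seeded shuffle standing in for Perl 5 hash order randomization, and the names `build_lookup` and `resolve_with_random_order` are all assumptions for illustration, not the actual ucd2c.pl logic.

```python
import random

# Two candidate meanings for the same short name, from the excerpts above.
CANDIDATES = [
    ("space", "White_Space"),       # property alias
    ("space", "Line_Break=Space"),  # property value alias
]


def build_lookup(entries):
    """Insert entries in the given order; a later entry overwrites an earlier one."""
    lookup = {}
    for name, meaning in entries:
        lookup[name] = meaning
    return lookup


def resolve_with_random_order(entries, seed):
    """Shuffle the generation order (a stand-in for hash order randomization)."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    return build_lookup(shuffled)["space"]


# Same input data, different generation orders: collect every outcome seen.
outcomes = {resolve_with_random_order(CANDIDATES, seed) for seed in range(50)}
print(outcomes)
```

Both meanings show up across runs even though the input data never changes, which is the "lucky" versus "unlucky" regeneration described above.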
