Switch to fewer allocations in UnicodeSet parsing #3684

skius · 2023-07-14T15:51:35Z

icu_unicodeset_parser uses a bunch of allocating types internally to allow for:

arbitrary-length escapes (\x{61 62 63 64...}), which use Vec<char>
arbitrary-length strings ({abcd...}), which use String

These can/should probably be swapped out for types with "small lives on stack, big lives on heap" semantics.
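To illustrate what "small lives on stack, big lives on heap" could look like, here is a minimal sketch. The type `SmallChars` and its methods are hypothetical names made up for this example, not part of icu_unicodeset_parser; crates such as smallvec provide production-quality versions of the same idea.

```rust
/// Hypothetical small-buffer type (illustration only): up to N chars are
/// stored inline, longer sequences spill into a heap-allocated Vec.
#[derive(Debug)]
pub enum SmallChars<const N: usize> {
    Inline { buf: [char; N], len: usize },
    Heap(Vec<char>),
}

impl<const N: usize> SmallChars<N> {
    pub fn new() -> Self {
        SmallChars::Inline { buf: ['\0'; N], len: 0 }
    }

    pub fn push(&mut self, c: char) {
        match self {
            // Still room in the inline buffer: no allocation needed.
            SmallChars::Inline { buf, len } if *len < N => {
                buf[*len] = c;
                *len += 1;
            }
            // Inline buffer is full: copy everything to the heap once.
            SmallChars::Inline { buf, len } => {
                let mut v = Vec::with_capacity(*len + 1);
                v.extend_from_slice(&buf[..*len]);
                v.push(c);
                *self = SmallChars::Heap(v);
            }
            SmallChars::Heap(v) => v.push(c),
        }
    }

    pub fn as_slice(&self) -> &[char] {
        match self {
            SmallChars::Inline { buf, len } => &buf[..*len],
            SmallChars::Heap(v) => v.as_slice(),
        }
    }

    pub fn is_inline(&self) -> bool {
        matches!(self, SmallChars::Inline { .. })
    }
}
```

With `N` chosen to cover typical escapes, the common case never touches the allocator, while unusually long inputs still work.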

Linking PR: #3670

Discuss/decide: Priority of UnicodeSet parsing efficiency


sffc · 2023-07-20T17:51:20Z

Discuss with:

sffc · 2023-07-20T18:01:08Z

@robertbastian - You could use an arena for example.
@sffc - OK to fix low-hanging fruit. Replace Vec with SmallVec or ShortVec. Write benchmarks. UnicodeSetBuilder is probably not the hottest code path.
@Manishearth - A stack-based thing is the way to go here. Seems like a common type of problem.
@robertbastian - Swapping these things out doesn't require much UnicodeSet expertise, so it is something that could be done later.

Conclusion: Nice to have but not top priority

skius · 2023-08-24T13:55:22Z

Update:

Arbitrary-length escapes (\x{61 62 63 64...}) are now handled without "needlessly" allocating. The code for that is a bit ugly (see discussion in #3728), so it might be improved.
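One way to avoid the intermediate Vec<char> for such escapes is to yield the code points lazily. The sketch below is only an illustration of that idea and is not the code from #3728; the function names are hypothetical, and validation is looser than a real parser would want (for example, `from_str_radix` accepts a leading `+`).

```rust
/// Hypothetical helper: lazily yields the code points of a multi-escape body
/// such as "61 62 63" (the text between `\x{` and `}`).
/// A token that is not valid hex, or not a valid Unicode scalar value,
/// is reported as Err(token).
pub fn multi_escape_chars<'a>(
    body: &'a str,
) -> impl Iterator<Item = Result<char, &'a str>> + 'a {
    body.split_ascii_whitespace().map(|tok| {
        u32::from_str_radix(tok, 16)
            .ok()
            .and_then(char::from_u32)
            .ok_or(tok)
    })
}

/// Feeds every escaped character to `sink` without collecting them first.
/// Stops at the first invalid token and returns it as the error.
pub fn for_each_escaped<'a>(
    body: &'a str,
    mut sink: impl FnMut(char),
) -> Result<(), &'a str> {
    for item in multi_escape_chars(body) {
        sink(item?);
    }
    Ok(())
}
```

The caller decides where each character goes (for instance, straight into a set builder), so no temporary collection is needed.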

The "issue" still stands that arbitrary-length strings ({abcd...}) use String, when some other (partially stack-based) internal representation could potentially be used. The strings end up in a VarZeroVec eventually, so something that integrates nicely with that would be best.
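As a rough sketch of one alternative, along the lines of the arena suggestion from the meeting notes: all parsed strings could be appended to a single shared buffer, with only their ranges recorded. The type `StrArena` below is hypothetical and does not use the zerovec API; how well such a layout would actually map onto VarZeroVec is an assumption that would need checking.

```rust
use std::ops::Range;

/// Hypothetical string arena (illustration only): all strings share one
/// backing buffer, so pushing N strings costs a few buffer growths
/// instead of N separate String allocations.
#[derive(Default)]
pub struct StrArena {
    data: String,
    spans: Vec<Range<usize>>,
}

impl StrArena {
    /// Appends a string and returns its index in the arena.
    pub fn push(&mut self, s: &str) -> usize {
        let start = self.data.len();
        self.data.push_str(s);
        self.spans.push(start..self.data.len());
        self.spans.len() - 1
    }

    /// Returns the string stored at `idx`, if any.
    pub fn get(&self, idx: usize) -> Option<&str> {
        self.spans.get(idx).map(|r| &self.data[r.clone()])
    }

    /// Iterates over all stored strings in insertion order.
    pub fn iter(&self) -> impl Iterator<Item = &str> + '_ {
        self.spans.iter().map(move |r| &self.data[r.clone()])
    }
}
```

The trade-off is that individual strings cannot be removed or edited cheaply, which may be acceptable for a parser that only accumulates strings before building the final set.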

skius added the discuss (Discuss at a future ICU4X-SC meeting) and C-unicode (Component: Props, sets, tries) labels Jul 14, 2023

sffc added the discuss-triaged (The stakeholders for this issue have been identified and it can be discussed out-of-band) label Jul 20, 2023

sffc added this to the Priority Backlog ⟨P3⟩ milestone Jul 20, 2023

This was referenced Jul 23, 2023

Allocation-free hex escape parsing in UnicodeSet parsing #3725

Merged

Avoid allocations when parsing multi-escapes in UnicodeSets #3728

Merged

skius mentioned this issue Aug 29, 2023

Stabilize UnicodeSet parsing #3959

Open
