Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to fewer allocations in UnicodeSet parsing #3684

Open
skius opened this issue Jul 14, 2023 · 3 comments
Open

Switch to fewer allocations in UnicodeSet parsing #3684

skius opened this issue Jul 14, 2023 · 3 comments
Labels
C-unicode Component: Props, sets, tries good first issue Good for newcomers S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt

Comments

@skius
Copy link
Member

skius commented Jul 14, 2023

icu_unicodeset_parser uses a bunch of allocating types internally to allow for

  • arbitrary-length escapes (\x{61 62 63 64...}) use Vec<char>
  • arbitrary-length strings ({abcd...}) use String

These can/should probably be swapped out for types with "small lives on stack, big lives on heap" semantics.

Linking PR: #3670

Discuss/decide: Priority of UnicodeSet parsing efficiency

@skius skius added discuss Discuss at a future ICU4X-SC meeting C-unicode Component: Props, sets, tries labels Jul 14, 2023
@sffc
Copy link
Member

sffc commented Jul 20, 2023

Discuss with:

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jul 20, 2023
@sffc
Copy link
Member

sffc commented Jul 20, 2023

  • @robertbastian - You could use an arena for example.
  • @sffc - OK to fix low-hanging fruilt. Replace Vec with SmallVec or ShortVec. Write benchmarks. UnicodeSetBuilder probably not the most hot code path.
  • @Manishearth - A stack-based thing is the way to go here. Seems like a common type of problem.
  • @robertbastian - Swapping these things out doesn't require a high amount of UnicodeSet expertise so it is something that could be done later

Conclusion: Nice to have but not top priority

@sffc sffc added this to the Priority Backlog ⟨P3⟩ milestone Jul 20, 2023
@sffc sffc added good first issue Good for newcomers T-techdebt Type: ICU4X code health and tech debt S-small Size: One afternoon (small bug fix or enhancement) and removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 20, 2023
@skius
Copy link
Member Author

skius commented Aug 24, 2023

Update:

Arbitrary length escapes (\x{61 62 63 64...}) are now being handled without "needlessly" allocating. The code for that is a bit ugly (see discussion in #3728), so might be improved.

The "issue" still stands that arbitrary length strings ({abcd...}) use String, when potentially some other (partially stack-based) internal representation could be used. The strings end up in a VarZeroVec eventually, so something that integrates nicely with that would be best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries good first issue Good for newcomers S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt
Projects
None yet
Development

No branches or pull requests

2 participants