Allocation-free hex escape parsing in UnicodeSet parsing #3725

skius · 2023-07-23T15:24:31Z

Avoids unnecessary allocations when parsing hex digits by taking a subslice of the source &str instead.

Part of #3684

Depends on #3670

(cc @younies)
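For illustration, the subslice idea can be sketched roughly as follows. This is not the code from the PR: the function name parse_hex_slice, its signature, and the use of byte offsets are assumptions made for the example. It validates the digits in a first pass, then hands a borrowed subslice to u32::from_str_radix, so no String is allocated.

```rust
/// Hypothetical helper (not the PR's API): parses between `min` and `max`
/// ASCII hex digits starting at byte offset `start` of `source`.
/// Returns the parsed value and the byte offset just past the last digit.
fn parse_hex_slice(source: &str, start: usize, min: usize, max: usize) -> Option<(u32, usize)> {
    // First pass: count the leading ASCII hex digits, at most `max` of them.
    let len = source
        .get(start..)?
        .bytes()
        .take(max)
        .take_while(u8::is_ascii_hexdigit)
        .count();
    if len < min {
        return None;
    }
    let end = start + len;
    // Second pass: ASCII hex digits are one UTF-8 byte each, so `start..end`
    // lies on char boundaries. `get` is used so a bad range cannot panic.
    let digits = source.get(start..end)?;
    let value = u32::from_str_radix(digits, 16).ok()?;
    Some((value, end))
}
```

The borrowed digits slice lives only as long as source, which is why the parser has to hold on to the source string in this approach.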

dpulls · 2023-07-24T09:38:04Z

🎉 All dependencies have been resolved!

robertbastian · 2023-07-24T14:03:16Z

experimental/unicodeset_parser/src/parse.rs

+        let first_offset = self.must_peek_index()?;
+        let end_offset = self.validate_hex_digits(min, max)?;
+
+        // safety: validate_hex_digits ensures that chars (including the last one) are ascii hex digits,


nit: this isn't about "safety" but about not panicking

thanks, removed all other mentions in this file as well

robertbastian · 2023-07-24T14:10:47Z

experimental/unicodeset_parser/src/parse.rs

+        // safety: validate_hex_digits ensures that chars (including the last one) are ascii hex digits,
+        // which are all exactly one UTF-8 byte long, so slicing on these offsets always respects char boundaries
+        #[allow(clippy::indexing_slicing)]
+        let hex_source = &self.source[first_offset..=end_offset];


thought: if you make max a const generic, you can keep the digits you've seen on the stack and don't need to hold on to source. Alternatively, you can merge this with validate_hex_digits and calculate the value while iterating, with something like curr = curr * 16 + c.to_digit(16)?

How would the version that keeps digits on the stack work? IIUC, if I want to use a std parsing function I need a UTF-8 slice, but my iterator only gives me chars.

Do you think holding onto source is a big downside? If I understand correctly, even in the polished version of this API (#3550), the user will never have to deal with a struct that holds onto source (USParser::parse takes source and returns the CPILASL).

In any case, if we want to avoid the two-pass approach, I think manually implementing the hex parsing and doing it on the fly as you suggested is ok
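A minimal sketch of the single-pass variant discussed here, assuming a peekable char iterator. The name parse_hex_on_the_fly and its signature are hypothetical and not taken from the PR; the real parser's iterator type may differ. The value is accumulated digit by digit, so neither a second pass nor a slice of source is needed.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical helper (not the PR's API): consumes up to `max` hex digits
/// from `chars` and folds them into a value as it goes.
/// Returns None if fewer than `min` digits were found or the value overflows.
fn parse_hex_on_the_fly(chars: &mut Peekable<Chars<'_>>, min: usize, max: usize) -> Option<u32> {
    let mut value: u32 = 0;
    let mut count = 0;
    while count < max {
        // Peek first, so the first non-hex char stays in the iterator.
        let Some(digit) = chars.peek().and_then(|c| c.to_digit(16)) else {
            break;
        };
        // curr = curr * 16 + digit, with overflow checks instead of wrapping.
        value = value.checked_mul(16)?.checked_add(digit)?;
        chars.next();
        count += 1;
    }
    (count >= min).then_some(value)
}
```

The trade-off is that the hex arithmetic is written by hand rather than delegated to a std parsing function.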

Yeah, the stack-based approach is not great: you'd have a [u8; MAX] buffer that you'd unsafely interpret as a str. Holding on to source seems fine.
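For reference, a rough sketch of what the stack-based variant could look like. The name parse_hex_stack and its signature are hypothetical. Unlike the version described above, this sketch uses the safe core::str::from_utf8 to view the buffer as a str, which re-validates bytes that are already known to be ASCII; that extra check is the cost of avoiding unsafe here.

```rust
/// Hypothetical helper (not the PR's API): copies up to MAX hex digits into
/// a stack buffer, then parses the buffer with a std parsing function.
fn parse_hex_stack<const MAX: usize>(chars: impl Iterator<Item = char>, min: usize) -> Option<u32> {
    let mut buf = [0u8; MAX];
    let mut len = 0;
    // `take(MAX)` guarantees `len` never exceeds the buffer size.
    for c in chars.take(MAX) {
        if !c.is_ascii_hexdigit() {
            break;
        }
        // ASCII hex digits fit in one byte, so this cast is lossless.
        buf[len] = c as u8;
        len += 1;
    }
    if len < min {
        return None;
    }
    // Safe alternative to an unchecked conversion: validates the bytes again.
    let digits = core::str::from_utf8(&buf[..len]).ok()?;
    u32::from_str_radix(digits, 16).ok()
}
```

Called as parse_hex_stack::<6>(...), this needs no reference to source, but it copies every digit and still makes two passes over them.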

skius mentioned this pull request Jul 23, 2023

Avoid allocations when parsing multi-escapes in UnicodeSets #3728

Merged

skius added 4 commits July 24, 2023 13:21

add source to UnicodeSetBuilder

2538e4e

avoid allocations during hex-escape parsing

1616a6c

fmt

081512f

clippy

090eb79

skius force-pushed the unicodeset-escape-no-alloc branch from 0190186 to 090eb79 Compare July 24, 2023 13:24

skius marked this pull request as ready for review July 24, 2023 13:25

skius requested a review from a team as a code owner July 24, 2023 13:25

skius requested a review from robertbastian July 24, 2023 13:25

robertbastian reviewed Jul 24, 2023

remove safety keyword in panic-safety comments

7b37d3d

skius requested a review from robertbastian July 24, 2023 14:45

robertbastian previously approved these changes Jul 24, 2023

add note about two-pass hex parsing

c6fa9ae

skius dismissed robertbastian’s stale review via c6fa9ae July 24, 2023 14:55

robertbastian approved these changes Jul 24, 2023

robertbastian merged commit 86c5c07 into unicode-org:main Jul 25, 2023
