Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allocation-free hex escape parsing in UnicodeSet parsing #3725

Merged
merged 6 commits into from
Jul 25, 2023

Conversation

skius
Copy link
Member

@skius skius commented Jul 23, 2023

Avoids unnecessary allocations when parsing hex digits, instead creates a subslice from the source &str.

Part of #3684

Depends on #3670

(cc @younies)

@dpulls
Copy link

dpulls bot commented Jul 24, 2023

🎉 All dependencies have been resolved !

@skius skius force-pushed the unicodeset-escape-no-alloc branch from 0190186 to 090eb79 Compare July 24, 2023 13:24
@skius skius marked this pull request as ready for review July 24, 2023 13:25
@skius skius requested a review from a team as a code owner July 24, 2023 13:25
@skius skius requested a review from robertbastian July 24, 2023 13:25
let first_offset = self.must_peek_index()?;
let end_offset = self.validate_hex_digits(min, max)?;

// safety: validate_hex_digits ensures that chars (including the last one) are ascii hex digits,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this isn't about "safety" but about not panicking

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, removed all other mentions in this file as well

// safety: validate_hex_digits ensures that chars (including the last one) are ascii hex digits,
// which are all exactly one UTF-8 byte long, so slicing on these offsets always respects char boundaries
#[allow(clippy::indexing_slicing)]
let hex_source = &self.source[first_offset..=end_offset];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: if you make max a const-generic, you can keep the digits you've seen on the stack and don't need to hold on to source. Alternatively, you can merge this with validate_hex_digit and calculate the value while iterating with something like curr = curr * 16 + c.to_digit(16)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would the version that keeps digits on the stack work? IIUC, if I want to use a std parsing function I need a UTF-8 slice, but my iterator only gives me chars.

Do you think holding onto source is a big downside? If I understand correctly, even in the #3550 - polished version of this API the user will never have to deal with a struct that holds onto source (USParser::parse takes source and returns the CPILASL)

In any case, if we want to avoid the two-pass approach, I think manually implementing the hex parsing and doing it on the fly as you suggested is ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the stack based approach is not great, you'd have a [u8; MAX] buffer that you'd unsafely interpret as a str. Holding on to source seems fine

@skius skius requested a review from robertbastian July 24, 2023 14:45
robertbastian
robertbastian previously approved these changes Jul 24, 2023
@robertbastian robertbastian merged commit 86c5c07 into unicode-org:main Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants