Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref(sourcebundle): Check UTF-8 validity memory efficiently #890

Merged
merged 6 commits into from
Jan 20, 2025

Conversation

szokeasaurusrex
Copy link
Member

The current check to ensure a sourcebundle is valid UTF-8 reads the entire sourcebundle file into memory. This is inefficient for large files.

This PR introduces a UTF8Reader which wraps any reader. The UTF8Reader ensures that the stream is valid UTF8 as it is being read, while only requiring a small amount of memory (currently 8 KiB) to be allocated as a buffer.

Copy link
Member

@Swatinem Swatinem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there must be some kind of pre-existing crate that does this. I’m not sure we should implement this ourselves TBH

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs Outdated Show resolved Hide resolved
symbolic-debuginfo/src/sourcebundle/utf8_reader.rs Outdated Show resolved Hide resolved
@szokeasaurusrex
Copy link
Member Author

@Swatinem I tried to find a crate that does this before implementing, but I could not find one. I agree, using a crate would be the ideal solution if one exists

Copy link
Contributor

@loewenheim loewenheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work, I just have some nits. I agree that it's a shame to have to implement this ourselves.

Have you verified that if the UTF8Reader errors, nothing is written to the writer by io::copy? Also, this type would be ripe for fuzzing/property-based testing (generate random strings and see if they get read correctly).

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs Outdated Show resolved Hide resolved
symbolic-debuginfo/src/sourcebundle/utf8_reader.rs Outdated Show resolved Hide resolved
symbolic-debuginfo/src/sourcebundle/utf8_reader.rs Outdated Show resolved Hide resolved
symbolic-debuginfo/src/sourcebundle/utf8_reader.rs Outdated Show resolved Hide resolved
symbolic-debuginfo/src/sourcebundle/utf8_reader.rs Outdated Show resolved Hide resolved
@Dav1dde
Copy link
Member

Dav1dde commented Jan 10, 2025

The code doesn't look like it even has to guarantee UTF-8 Validity, isn't it fine if the file contents are just copied as bytes? A reader must always ensure correctness anyways.

@szokeasaurusrex
Copy link
Member Author

The code doesn't look like it even has to guarantee UTF-8 Validity, isn't it fine if the file contents are just copied as bytes? A reader must always ensure correctness anyways.

@Dav1dde, @loewenheim added this in #816. The context is not entirely clear to me from that PR, but if I remember correctly, I think the reason it needed to be added was to have client-side verification that users upload valid source bundles to Sentry. Valid source bundles can only contain UTF-8 encoded data.

@szokeasaurusrex
Copy link
Member Author

I think I have addressed all feedback. Please let me know if I missed something

Copy link
Contributor

@loewenheim loewenheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small typo, otherwise LGTM.

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs Outdated Show resolved Hide resolved
@szokeasaurusrex szokeasaurusrex force-pushed the szokeasaurusrex/utf8-reader branch from 7dd2f8f to 6115047 Compare January 10, 2025 16:25
@szokeasaurusrex
Copy link
Member Author

I have verified with the memory profiler that when using a version of symbolic with this change, sentry-cli's total memory usage during a sourcemap upload is reduced, since we no longer allocate memory for the entire file.

@szokeasaurusrex
Copy link
Member Author

Hey @loewenheim, just saw this:

Have you verified that if the UTF8Reader errors, nothing is written to the writer by io::copy? Also, this type would be ripe for fuzzing/property-based testing (generat e random strings and see if they get read correctly).

Regarding the first question: no, I have not tested this, but I would expect that we would write everything to io::copy up until the error occurs. UTF8Reader performs the validation lazily as it reads from the stream it is wrapping, so whether we write anything to the writer depends on where the first UTF8 violation occurs in the reader stream, and how much io::copy reads at a time. I think the only way to guarantee we don't write anything to the writer in an error scenario (without loading the entire reader into memory) would be to io::copy the stream into a temp file, then copy the temp file to the writer stream. Do you think we need to do this?

As for the fuzzing/property based testing, I am not sure how to do this, but it sounds like a good idea!

@loewenheim
Copy link
Contributor

loewenheim commented Jan 14, 2025

My worry is this: if we write into the writer up to the first read error, how does that interact with add_file_skip_read_failed? Won't that mess up the sourcebundle?

As for proptesting, I've used it a fair bit and would be happy to give you an introduction

@szokeasaurusrex
Copy link
Member Author

@loewenheim, I have addressed your feedback

@szokeasaurusrex szokeasaurusrex force-pushed the szokeasaurusrex/utf8-reader branch from 84304d0 to 094cd96 Compare January 14, 2025 16:59
The current check to ensure a sourcebundle is valid UTF-8 reads the entire sourcebundle file into memory. This is inefficient for large files.

This PR introduces a UTF8Reader which wraps any reader. The UTF8Reader ensures that the stream is valid UTF8 as it is being read, while only requiring a small amount of memory (currently 8 KiB) to be allocated as a buffer.
szokeasaurusrex and others added 2 commits January 20, 2025 10:00
Co-authored-by: Sebastian Zivota <loewenheim@users.noreply.github.com>
@szokeasaurusrex szokeasaurusrex force-pushed the szokeasaurusrex/utf8-reader branch from 89a44ca to c81365a Compare January 20, 2025 09:00
@szokeasaurusrex szokeasaurusrex enabled auto-merge (squash) January 20, 2025 09:03
@szokeasaurusrex szokeasaurusrex merged commit 38c5a16 into master Jan 20, 2025
14 checks passed
@szokeasaurusrex szokeasaurusrex deleted the szokeasaurusrex/utf8-reader branch January 20, 2025 09:06
szokeasaurusrex added a commit to getsentry/sentry-cli that referenced this pull request Jan 20, 2025
Symbolic version `12.13.3` includes [a change](getsentry/symbolic#890), which will reduce, in some cases significantly, the memory usage of sourcemap uploads.

ref #2344
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants