ref(sourcebundle): Check UTF-8 validity memory efficiently #890

szokeasaurusrex · 2025-01-10T13:52:07Z

The current check to ensure a sourcebundle is valid UTF-8 reads the entire sourcebundle file into memory. This is inefficient for large files.

This PR introduces a UTF8Reader which wraps any reader. The UTF8Reader ensures that the stream is valid UTF-8 as it is being read, while only requiring a small amount of memory (currently 8 KiB) to be allocated as a buffer.
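To illustrate the idea of validating a stream while it is being read, here is a minimal sketch. It is not the `UTF8Reader` from this PR: the name `Utf8CheckingReader`, its `pending` field, and the helper `invalid_utf8` are hypothetical, and the sketch validates each chunk handed to the caller rather than managing its own 8 KiB buffer. It relies only on `std::str::from_utf8` and `Utf8Error::error_len`, which returns `None` when the input merely ends in the middle of a character.

```rust
use std::io::{self, Read};

/// Hypothetical wrapper (not the PR's `UTF8Reader`): passes bytes through
/// unchanged and fails as soon as the stream stops being valid UTF-8.
pub struct Utf8CheckingReader<R> {
    inner: R,
    /// Between reads this holds at most the 1-3 bytes of an unfinished
    /// multi-byte sequence carried over from the previous chunk.
    pending: Vec<u8>,
}

impl<R: Read> Utf8CheckingReader<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, pending: Vec::new() }
    }
}

fn invalid_utf8() -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, "stream is not valid UTF-8")
}

impl<R: Read> Read for Utf8CheckingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if buf.is_empty() {
            return Ok(0);
        }
        let n = self.inner.read(buf)?;
        if n == 0 {
            // End of stream: a leftover partial character means truncation.
            return if self.pending.is_empty() { Ok(0) } else { Err(invalid_utf8()) };
        }
        // Validate the carried-over bytes together with the new chunk.
        self.pending.extend_from_slice(&buf[..n]);
        let check = std::str::from_utf8(&self.pending).map(|_| ());
        match check {
            Ok(()) => self.pending.clear(),
            Err(e) => match e.error_len() {
                // A definitely invalid byte sequence was found.
                Some(_) => return Err(invalid_utf8()),
                // The chunk ends mid-character: keep only the unfinished tail.
                None => drop(self.pending.drain(..e.valid_up_to())),
            },
        }
        Ok(n)
    }
}
```

Wrapping any reader, for example `Utf8CheckingReader::new(std::fs::File::open(path)?)`, and passing it to `io::copy` would then surface invalid input as an `InvalidData` error. Note that bytes from earlier, valid chunks have already been handed to the caller by the time the error is returned.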

Swatinem

There must be some kind of pre-existing crate that does this. I’m not sure we should implement this ourselves, TBH.

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs

szokeasaurusrex · 2025-01-10T14:16:35Z

@Swatinem I tried to find a crate that does this before implementing, but I could not find one. I agree, using a crate would be the ideal solution if one exists

loewenheim

Impressive work, I just have some nits. I agree that it's a shame to have to implement this ourselves.

Have you verified that if the UTF8Reader errors, nothing is written to the writer by io::copy? Also, this type would be ripe for fuzzing/property-based testing (generate random strings and see if they get read correctly).

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs

Dav1dde · 2025-01-10T14:34:51Z

The code doesn't look like it even has to guarantee UTF-8 validity. Isn't it fine if the file contents are just copied as bytes? A reader must always ensure correctness anyway.

szokeasaurusrex · 2025-01-10T14:41:22Z

> The code doesn't look like it even has to guarantee UTF-8 validity. Isn't it fine if the file contents are just copied as bytes? A reader must always ensure correctness anyway.

@Dav1dde, @loewenheim added this in #816. The context is not entirely clear to me from that PR, but if I remember correctly, I think the reason it needed to be added was to have client-side verification that users upload valid source bundles to Sentry. Valid source bundles can only contain UTF-8 encoded data.

szokeasaurusrex · 2025-01-10T15:53:33Z

I think I have addressed all feedback. Please let me know if I missed something

loewenheim

One small typo, otherwise LGTM.

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs

szokeasaurusrex · 2025-01-10T16:31:09Z

I have verified with a memory profiler that, when using a version of symbolic with this change, sentry-cli's total memory usage during a sourcemap upload is reduced, since we no longer allocate memory for the entire file.

szokeasaurusrex · 2025-01-13T16:18:03Z

Hey @loewenheim, just saw this:

> Have you verified that if the UTF8Reader errors, nothing is written to the writer by io::copy? Also, this type would be ripe for fuzzing/property-based testing (generate random strings and see if they get read correctly).

Regarding the first question: no, I have not tested this, but I would expect io::copy to write everything to the writer up until the error occurs. UTF8Reader performs the validation lazily as it reads from the stream it is wrapping, so whether anything is written to the writer depends on where the first UTF-8 violation occurs in the reader stream and on how much io::copy reads at a time. I think the only way to guarantee that nothing is written to the writer in an error scenario (without loading the entire reader into memory) would be to io::copy the stream into a temp file, then copy the temp file to the writer stream. Do you think we need to do this?
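The partial-write behavior described here can be checked with a small experiment. The sketch below is an illustration only: `ScriptedReader` and `copy_and_inspect` are hypothetical helpers, not part of symbolic, and the reader simply replays a scripted sequence of chunks and errors to stand in for a validating reader that fails partway through the stream.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// Hypothetical test reader: replays queued chunks and errors in order,
/// then reports end of stream.
struct ScriptedReader {
    steps: VecDeque<io::Result<Vec<u8>>>,
}

impl Read for ScriptedReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match self.steps.pop_front() {
            None => Ok(0),
            Some(Err(e)) => Err(e),
            Some(Ok(mut chunk)) => {
                let n = chunk.len().min(buf.len());
                buf[..n].copy_from_slice(&chunk[..n]);
                if n < chunk.len() {
                    // The caller's buffer was too small: keep the rest queued.
                    chunk.drain(..n);
                    self.steps.push_front(Ok(chunk));
                }
                Ok(n)
            }
        }
    }
}

/// Runs `io::copy` into a fresh `Vec` and returns both the result and
/// whatever reached the writer, so a failed copy can be inspected.
fn copy_and_inspect<R: Read>(mut reader: R) -> (io::Result<u64>, Vec<u8>) {
    let mut written = Vec::new();
    let result = io::copy(&mut reader, &mut written);
    (result, written)
}
```

Feeding it a reader that yields `hello` and then an `InvalidData` error gives an `Err` result while the writer still holds `hello`, which is the situation the question above is about. Staging the copy in a temporary file, as suggested, would keep such partial output away from the final writer.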

As for the fuzzing/property based testing, I am not sure how to do this, but it sounds like a good idea!

loewenheim · 2025-01-14T13:21:36Z

My worry is this: if we write into the writer up to the first read error, how does that interact with add_file_skip_read_failed? Won't that mess up the sourcebundle?

As for proptesting, I've used it a fair bit and would be happy to give you an introduction

szokeasaurusrex · 2025-01-14T16:59:00Z

@loewenheim, I have addressed your feedback

symbolic-debuginfo/src/sourcebundle/mod.rs

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs

Co-authored-by: Sebastian Zivota <loewenheim@users.noreply.github.com>

Symbolic version `12.13.3` includes [a change](getsentry/symbolic#890), which will reduce, in some cases significantly, the memory usage of sourcemap uploads. ref #2344

szokeasaurusrex requested review from loewenheim and Swatinem January 10, 2025 13:52

szokeasaurusrex force-pushed the szokeasaurusrex/utf8-reader branch from 4942f9e to d7e4852 Compare January 10, 2025 13:53

Swatinem reviewed Jan 10, 2025

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs (outdated, resolved)

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs (outdated, resolved)

loewenheim reviewed Jan 10, 2025

szokeasaurusrex force-pushed the szokeasaurusrex/utf8-reader branch from d7e4852 to 7dd2f8f Compare January 10, 2025 15:50

szokeasaurusrex requested review from loewenheim and Swatinem January 10, 2025 15:53

loewenheim approved these changes Jan 10, 2025

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs (outdated, resolved)

szokeasaurusrex force-pushed the szokeasaurusrex/utf8-reader branch from 7dd2f8f to 6115047 Compare January 10, 2025 16:25

szokeasaurusrex force-pushed the szokeasaurusrex/utf8-reader branch from 6115047 to 84304d0 Compare January 14, 2025 16:58

szokeasaurusrex requested a review from loewenheim January 14, 2025 16:58

szokeasaurusrex requested a review from Dav1dde January 14, 2025 16:59

szokeasaurusrex force-pushed the szokeasaurusrex/utf8-reader branch from 84304d0 to 094cd96 Compare January 14, 2025 16:59

loewenheim reviewed Jan 15, 2025

symbolic-debuginfo/src/sourcebundle/mod.rs (outdated, resolved)

szokeasaurusrex mentioned this pull request Jan 15, 2025

Reduce memory usage of sourcemap uploads getsentry/sentry-cli#2344 (open, 3 tasks)

szokeasaurusrex requested a review from loewenheim January 16, 2025 11:00

loewenheim reviewed Jan 16, 2025

symbolic-debuginfo/src/sourcebundle/utf8_reader.rs (outdated, resolved)

szokeasaurusrex requested a review from loewenheim January 16, 2025 12:57

szokeasaurusrex added 3 commits January 20, 2025 10:00

ref: Utf8Reader no longer is buffering

4f6f0c5

ref: match instead of if

7e78aa4

szokeasaurusrex and others added 2 commits January 20, 2025 10:00

Update symbolic-debuginfo/src/sourcebundle/utf8_reader.rs

946bbf7

Co-authored-by: Sebastian Zivota <loewenheim@users.noreply.github.com>

proptest utf8-reader

c81365a

szokeasaurusrex force-pushed the szokeasaurusrex/utf8-reader branch from 89a44ca to c81365a Compare January 20, 2025 09:00

Update Changelog

811dd44

szokeasaurusrex enabled auto-merge (squash) January 20, 2025 09:03

szokeasaurusrex merged commit 38c5a16 into master Jan 20, 2025
14 checks passed

szokeasaurusrex deleted the szokeasaurusrex/utf8-reader branch January 20, 2025 09:06

szokeasaurusrex mentioned this pull request Jan 20, 2025

build: Bump symbolic to 12.13.3 getsentry/sentry-cli#2346 (merged)
