-
-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ref(sourcebundle): Check UTF-8 validity memory efficiently #890
Conversation
4942f9e
to
d7e4852
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there must be some kind of pre-existing crate that does this. I’m not sure we should implement this ourselves TBH
@Swatinem I tried to find a crate that does this before implementing, but I could not find one. I agree, using a crate would be the ideal solution if one exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive work, I just have some nits. I agree that it's a shame to have to implement this ourselves.
Have you verified that if the UTF8Reader
errors, nothing is written to the writer by io::copy
? Also, this type would be ripe for fuzzing/property-based testing (generate random strings and see if they get read correctly).
The code doesn't look like it even has to guarantee UTF-8 Validity, isn't it fine if the file contents are just copied as bytes? A reader must always ensure correctness anyways. |
@Dav1dde, @loewenheim added this in #816. The context is not entirely clear to me from that PR, but if I remember correctly, I think the reason it needed to be added was to have client-side verification that users upload valid source bundles to Sentry. Valid source bundles can only contain UTF-8 encoded data. |
d7e4852
to
7dd2f8f
Compare
I think I have addressed all feedback. Please let me know if I missed something |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small typo, otherwise LGTM.
7dd2f8f
to
6115047
Compare
I have verified with the memory profiler that when using a version of |
Hey @loewenheim, just saw this:
Regarding the first question: no, I have not tested this, but I would expect that we would write everything to As for the fuzzing/property based testing, I am not sure how to do this, but it sounds like a good idea! |
My worry is this: if we write into the writer up to the first read error, how does that interact with As for proptesting, I've used it a fair bit and would be happy to give you an introduction |
6115047
to
84304d0
Compare
@loewenheim, I have addressed your feedback |
84304d0
to
094cd96
Compare
The current check to ensure a sourcebundle is valid UTF-8 reads the entire sourcebundle file into memory. This is inefficient for large files. This PR introduces a UTF8Reader which wraps any reader. The UTF8Reader ensures that the stream is valid UTF8 as it is being read, while only requiring a small amount of memory (currently 8 KiB) to be allocated as a buffer.
Co-authored-by: Sebastian Zivota <loewenheim@users.noreply.github.com>
89a44ca
to
c81365a
Compare
Symbolic version `12.13.3` includes [a change](getsentry/symbolic#890), which will reduce, in some cases significantly, the memory usage of sourcemap uploads. ref #2344
The current check to ensure a sourcebundle is valid UTF-8 reads the entire sourcebundle file into memory. This is inefficient for large files.
This PR introduces a UTF8Reader which wraps any reader. The UTF8Reader ensures that the stream is valid UTF8 as it is being read, while only requiring a small amount of memory (currently 8 KiB) to be allocated as a buffer.