-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spec for the data section string literals feature #76139
Add spec for the data section string literals feature #76139
Conversation
utf8-string-literal-encoding
feature flagusing System.Text; | ||
|
||
[CompilerGenerated] | ||
internal static class <S>2D8BD7D9BB5F85BA643F0110D50CB506A1FE439E769A22503193EA6046BB87F7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to avoid having one type per string?
We can have all the members for a given string (the data field, the string
field (lazily initialized) and possibly a helper method to handle the lazy initialization) in <PrivateImplementationDetails>
? Those three members can include the hash in their name to avoid conflicts. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, by caching strings in a static field, are we creating a problem for the GC? Would using WeakReference
solve that problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have all the members for a given string (the data field, the
string
field (lazily initialized) and possibly a helper method to handle the lazy initialization) in<PrivateImplementationDetails>
? Those three members can include the hash in their name to avoid conflicts.
That's a possible alternative, I can mention it.
As @davidwrighton said elsewhere, the static constructor approach emits a single mov instruction and no branches and calls.
Would it be possible to avoid having one type per string?
There is also the alternative (mentioned in section "Configuration/emit alternatives") to group more than one string per one class.
Also, by caching strings in a static field, are we creating a problem for the GC?
It's not any different than current string
literal behavior I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I just caught up on a Teams discussion you forwarded me today. It looks like the trade-off was discussed and the runtime team is not too worried about the proliferation of types, they could do some optimization and we're engaged on benchmarking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[by caching strings in a static field, are we creating a problem for the GC?] It's not any different than current string literal behavior I think.
Let's confirm with runtime folks. I think we're changing the current string literal behavior.
Today, if I use M("hello")
we do a ldstr
with a metadata token to the string. This allocates for the string (unless the string is re-used/interned then we can reference existing allocated string) and later on that can be garbage collected.
With a static field, I don't think it can be collected unless the type is unloaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you already have a follow-up tracked below in "GC" section. Thanks
The utf8 string literal encoding emit strategy emits `ldsfld` of a field in a generated class instead. | ||
|
||
For every string literal, a unique internal static class is generated which | ||
- has name composed of `<S>` followed by a hex-encoded SHA-256 hash of the string, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Insead of SHA-256 let's use a non-crypto hash like XXH128. Every use of a crypto hash requires future risk as we have to go back and change the implementations when SHA-256 is considered broken. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thanks. Note that I chose SHA-256 because that's being used for other similar situations when emitting <PrivateImplementationDetails>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is mostly an accident of history. Back then it was more acceptable to use a cryto hash and hashes like XXH128 weren't as readily available. Going forward when possible we should avoid crypto hashes unless they're specifically for crypto purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (commit 9)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with review pass (iteration 9)
@jjonescz Consider updating PR's title. The "feature flag" is just part of the feature, it is not the feature. |
The synthesized types are not part of ref assemblies. That makes them smaller | ||
and incremental compilation is faster because it does not need to recompile dependent projects when only string literal contents change. | ||
|
||
This is automatically implemented, because during metadata-only compilation, method bodies are not emitted, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this been confirmed? This sounds like stating a fact. However, Jared mentioned that some IL emit artifacts are possibly making their way into ref assemblies. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have confirmed this. With metadata-only (e.g., using EmitOptions.Default.WithEmitMetadataOnly(true)
), the MethodCompiler
is not involved at all, instead a SynthesizedMetadataCompiler
is used.
Done with review pass (commit 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (commit 11)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Thanks (iteration 11)
Would be interesting to extend this to encoding constant strings being passed to The protobuf compiler for example generates such strings. And BTW if I replace |
Could you please elaborate what would be an advantage for doing this? Also, how does this align with the specific goal that we are trying to achieve here, which is not a goal of supporting different encodings for string literals? |
|
||
The utf8 string literal encoding emit strategy emits `ldsfld` of a field in a generated class instead. | ||
|
||
For every string literal, a unique internal static class is generated which: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be "For every unique string literal, an internal static class ..."?
> In practice, there might not be many generated types, it depends on the kind of the program | ||
> (whether it has lots of short strings or a few large strings) and how [the threshold](#configuration) is configured. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the release build of Microsoft.CodeAnalysis.CSharp.dll has roughly 200 strings for file paths, presumably from calls to methods such as ExceptionUtilities.Unreachable()
that use [CallerFilePath]
.
Given that, won't many assemblies contain a fair number of strings at 100 characters from paths alone? If so, let's consider removing this NOTE
.
The threshold could be determined automatically with some objective, for example, | ||
use the utf8 encoding emit strategy for the lowest number of string literals necessary to avoid overflowing the UserString heap. | ||
|
||
The set of string literals is not know up front in the compiler, it is discovered lazily (and in parallel) by the emit layer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be "not known up front ..."?
Implementation: #76036
Link to rendered markdown: https://github.com/jjonescz/roslyn/blob/DataSectionStringLiterals-spec/docs/features/string-literals-data-section.md