Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spec for the data section string literals feature #76139

Merged
merged 15 commits into from
Jan 15, 2025

Conversation

jjonescz
Copy link
Member

@jjonescz jjonescz commented Nov 28, 2024

@dotnet-issue-labeler dotnet-issue-labeler bot added Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels Nov 28, 2024
@jjonescz jjonescz added Documentation Area-Compilers and removed Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels Nov 28, 2024
@jjonescz jjonescz requested review from AlekseyTs, a team and jaredpar November 28, 2024 12:07
@jjonescz jjonescz changed the title Add spec for the utf8-string-literal-encoding feature flag Add spec for the data section string literals feature flag Nov 28, 2024
using System.Text;

[CompilerGenerated]
internal static class <S>2D8BD7D9BB5F85BA643F0110D50CB506A1FE439E769A22503193EA6046BB87F7
Copy link
Member

@jcouv jcouv Dec 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to avoid having one type per string?
We can have all the members for a given string (the data field, the string field (lazily initialized) and possibly a helper method to handle the lazy initialization) in <PrivateImplementationDetails>? Those three members can include the hash in their name to avoid conflicts. #Closed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, by caching strings in a static field, are we creating a problem for the GC? Would using WeakReference solve that problem?

Copy link
Member Author

@jjonescz jjonescz Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have all the members for a given string (the data field, the string field (lazily initialized) and possibly a helper method to handle the lazy initialization) in <PrivateImplementationDetails>? Those three members can include the hash in their name to avoid conflicts.

That's a possible alternative, I can mention it.

As @davidwrighton said elsewhere, the static constructor approach emits a single mov instruction and no branches and calls.

Would it be possible to avoid having one type per string?

There is also the alternative (mentioned in section "Configuration/emit alternatives") to group more than one string per one class.

Also, by caching strings in a static field, are we creating a problem for the GC?

It's not any different than current string literal behavior I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I just caught up on a Teams discussion you forwarded me today. It looks like the trade-off was discussed and the runtime team is not too worried about the proliferation of types, they could do some optimization and we're engaged on benchmarking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[by caching strings in a static field, are we creating a problem for the GC?] It's not any different than current string literal behavior I think.

Let's confirm with runtime folks. I think we're changing the current string literal behavior.
Today, if I use M("hello") we do a ldstr with a metadata token to the string. This allocates for the string (unless the string is re-used/interned then we can reference existing allocated string) and later on that can be garbage collected.
With a static field, I don't think it can be collected unless the type is unloaded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you already have a follow-up tracked below in "GC" section. Thanks

@jcouv jcouv self-assigned this Dec 1, 2024
docs/features/string-literals-data-section.md Show resolved Hide resolved
docs/features/string-literals-data-section.md Outdated Show resolved Hide resolved
docs/features/string-literals-data-section.md Outdated Show resolved Hide resolved
docs/features/string-literals-data-section.md Outdated Show resolved Hide resolved
docs/features/string-literals-data-section.md Outdated Show resolved Hide resolved
The utf8 string literal encoding emit strategy emits `ldsfld` of a field in a generated class instead.

For every string literal, a unique internal static class is generated which
- has name composed of `<S>` followed by a hex-encoded SHA-256 hash of the string,
Copy link
Member

@jaredpar jaredpar Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insead of SHA-256 let's use a non-crypto hash like XXH128. Every use of a crypto hash requires future risk as we have to go back and change the implementations when SHA-256 is considered broken. #Closed

Copy link
Member Author

@jjonescz jjonescz Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks. Note that I chose SHA-256 because that's being used for other similar situations when emitting <PrivateImplementationDetails>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is mostly an accident of history. Back then it was more acceptable to use a cryto hash and hashes like XXH128 weren't as readily available. Going forward when possible we should avoid crypto hashes unless they're specifically for crypto purposes.

docs/features/string-literals-data-section.md Outdated Show resolved Hide resolved
@jjonescz jjonescz requested a review from jcouv December 4, 2024 20:26
Copy link
Contributor

@AlekseyTs AlekseyTs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (commit 9)

Copy link
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with review pass (iteration 9)

@jjonescz jjonescz requested a review from AlekseyTs December 13, 2024 12:35
@AlekseyTs
Copy link
Contributor

@jjonescz Consider updating PR's title. The "feature flag" is just part of the feature, it is not the feature.

The synthesized types are not part of ref assemblies. That makes them smaller
and incremental compilation is faster because it does not need to recompile dependent projects when only string literal contents change.

This is automatically implemented, because during metadata-only compilation, method bodies are not emitted,
Copy link
Contributor

@AlekseyTs AlekseyTs Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been confirmed? This sounds like stating a fact. However, Jared mentioned that some IL emit artifacts are possibly making their way into ref assemblies. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have confirmed this. With metadata-only (e.g., using EmitOptions.Default.WithEmitMetadataOnly(true)), the MethodCompiler is not involved at all, instead a SynthesizedMetadataCompiler is used.

@AlekseyTs
Copy link
Contributor

Done with review pass (commit 10)

@jjonescz jjonescz changed the title Add spec for the data section string literals feature flag Add spec for the data section string literals feature Dec 14, 2024
Copy link
Contributor

@AlekseyTs AlekseyTs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (commit 11)

Copy link
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks (iteration 11)

@teo-tsirpanis
Copy link
Contributor

Would be interesting to extend this to encoding constant strings being passed to Convert.FromBase64String, as raw binary data.

The protobuf compiler for example generates such strings. And BTW if I replace string.Concat with +s the compiler gets stuck.

@AlekseyTs
Copy link
Contributor

AlekseyTs commented Dec 26, 2024

@teo-tsirpanis

Would be interesting to extend this to encoding constant strings being passed to Convert.FromBase64String, as raw binary data.

Could you please elaborate what would be an advantage for doing this? Also, how does this align with the specific goal that we are trying to achieve here, which is not a goal of supporting different encodings for string literals?


The utf8 string literal encoding emit strategy emits `ldsfld` of a field in a generated class instead.

For every string literal, a unique internal static class is generated which:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "For every unique string literal, an internal static class ..."?

Comment on lines 149 to 150
> In practice, there might not be many generated types, it depends on the kind of the program
> (whether it has lots of short strings or a few large strings) and how [the threshold](#configuration) is configured.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the release build of Microsoft.CodeAnalysis.CSharp.dll has roughly 200 strings for file paths, presumably from calls to methods such as ExceptionUtilities.Unreachable() that use [CallerFilePath].

Given that, won't many assemblies contain a fair number of strings at 100 characters from paths alone? If so, let's consider removing this NOTE.

The threshold could be determined automatically with some objective, for example,
use the utf8 encoding emit strategy for the lowest number of string literals necessary to avoid overflowing the UserString heap.

The set of string literals is not know up front in the compiler, it is discovered lazily (and in parallel) by the emit layer.
Copy link
Member

@cston cston Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "not known up front ..."?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants