Add spec for the data section string literals feature #76139

jjonescz · 2024-11-28T11:33:51Z

Implementation: #76036
Link to rendered markdown: https://github.com/jjonescz/roslyn/blob/DataSectionStringLiterals-spec/docs/features/string-literals-data-section.md

jcouv · 2024-12-01T01:10:44Z

docs/features/string-literals-data-section.md

+using System.Text;
+
+[CompilerGenerated]
+internal static class <S>2D8BD7D9BB5F85BA643F0110D50CB506A1FE439E769A22503193EA6046BB87F7


Would it be possible to avoid having one type per string?
We can have all the members for a given string (the data field, the string field (lazily initialized) and possibly a helper method to handle the lazy initialization) in <PrivateImplementationDetails>? Those three members can include the hash in their name to avoid conflicts. #Closed

Also, by caching strings in a static field, are we creating a problem for the GC? Would using WeakReference solve that problem?

> We can have all the members for a given string (the data field, the string field (lazily initialized) and possibly a helper method to handle the lazy initialization) in <PrivateImplementationDetails>? Those three members can include the hash in their name to avoid conflicts.

That's a possible alternative, I can mention it.

As @davidwrighton said elsewhere, the static constructor approach emits a single mov instruction, with no branches or calls.
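The two shapes being compared in this thread can be sketched in plain C#. This is only an illustration, not the compiler's actual output: the class and member names (`S_2D8BD7D9`, `PrivateImplementationDetailsSketch`, `Get_2D8BD7D9`) are hypothetical stand-ins for the synthesized names, and a `u8` literal stands in for the bytes that would live in the data section.

```csharp
#nullable enable
using System;
using System.Text;

// Hypothetical stand-in for one synthesized per-string class (static constructor approach).
internal static class S_2D8BD7D9
{
    // UTF-8 bytes of the literal; assumed here to model the data-section blob.
    private static ReadOnlySpan<byte> Data => "Hello."u8;

    // Initialized once by the type initializer; every later use is a static field load.
    internal static readonly string s = Encoding.UTF8.GetString(Data);
}

// Hypothetical stand-in for the alternative: all members in one shared class,
// with the hash in each member name and a helper doing the lazy initialization.
internal static class PrivateImplementationDetailsSketch
{
    private static ReadOnlySpan<byte> Data_2D8BD7D9 => "Hello."u8;
    private static string? s_2D8BD7D9;

    // Each call checks the cache first; a race may decode twice, which is benign here.
    internal static string Get_2D8BD7D9() =>
        s_2D8BD7D9 ??= Encoding.UTF8.GetString(Data_2D8BD7D9);
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(S_2D8BD7D9.s);
        Console.WriteLine(PrivateImplementationDetailsSketch.Get_2D8BD7D9());
    }
}
```

In the second shape every use site goes through a null check and a call, which is the cost the static constructor approach is said to avoid. In both shapes the decoded string stays reachable from a static field.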

> Would it be possible to avoid having one type per string?

There is also the alternative (mentioned in section "Configuration/emit alternatives") to group more than one string per one class.

> Also, by caching strings in a static field, are we creating a problem for the GC?

It's not any different than current string literal behavior I think.

Thanks. I just caught up on a Teams discussion you forwarded me today. It looks like the trade-off was discussed and the runtime team is not too worried about the proliferation of types; they could do some optimization, and we're engaged on benchmarking.

> [by caching strings in a static field, are we creating a problem for the GC?] It's not any different than current string literal behavior I think.

Let's confirm with runtime folks. I think we're changing the current string literal behavior.
Today, if I use M("hello"), we emit an ldstr with a metadata token for the string. This allocates the string (unless the string is already interned, in which case we can reference the existing allocation), and later on it can be garbage collected.
With a static field, I don't think it can be collected unless the type is unloaded.

I see you already have a follow-up tracked below in "GC" section. Thanks

jaredpar · 2024-12-02T17:09:14Z

docs/features/string-literals-data-section.md

+The utf8 string literal encoding emit strategy emits `ldsfld` of a field in a generated class instead.
+
+For every string literal, a unique internal static class is generated which
+- has name composed of `<S>` followed by a hex-encoded SHA-256 hash of the string,


Instead of SHA-256, let's use a non-crypto hash like XXH128. Every use of a crypto hash carries future risk, as we have to go back and change the implementations when SHA-256 is considered broken. #Closed

Sounds good, thanks. Note that I chose SHA-256 because that's being used for other similar situations when emitting <PrivateImplementationDetails>.

I think that is mostly an accident of history. Back then it was more acceptable to use a crypto hash, and hashes like XXH128 weren't as readily available. Going forward, when possible, we should avoid crypto hashes unless they're specifically for crypto purposes.
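The naming scheme under discussion (`<S>` followed by a hex-encoded hash of the string) can be sketched as follows. This is an illustration under stated assumptions, not the compiler's code: `TypeNameFor` is a hypothetical helper, and hashing the literal's UTF-8 bytes is an assumption, since the thread does not say which byte representation is hashed. The hash function is passed in so that SHA-256 can be swapped for a non-cryptographic hash.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class NameSketch
{
    // Hypothetical helper: builds "<S>" + hex(hash(bytes)).
    // Assumption: the hash input is the UTF-8 encoding of the literal.
    public static string TypeNameFor(string literal, Func<byte[], byte[]> hash)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(literal);
        return "<S>" + Convert.ToHexString(hash(utf8));
    }

    public static void Main()
    {
        // SHA-256 yields 32 bytes, i.e. 64 hex characters after the 3-character prefix.
        string name = TypeNameFor("Hello.", bytes => SHA256.HashData(bytes));
        Console.WriteLine(name);
        Console.WriteLine(name.Length); // 3 + 64 = 67
    }
}
```

A 128-bit hash such as XXH128 would halve the hex suffix to 32 characters. Assuming the System.IO.Hashing package is referenced, passing `bytes => System.IO.Hashing.XxHash128.Hash(bytes)` as the second argument would be one way to try that.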

AlekseyTs

LGTM (commit 9)

jcouv

Done with review pass (iteration 9)

AlekseyTs · 2024-12-13T19:03:05Z

@jjonescz Consider updating the PR's title. The "feature flag" is just part of the feature, it is not the feature.

AlekseyTs · 2024-12-13T19:10:08Z

docs/features/string-literals-data-section.md

+The synthesized types are not part of ref assemblies. That makes them smaller
+and incremental compilation is faster because it does not need to recompile dependent projects when only string literal contents change.
+
+This is automatically implemented, because during metadata-only compilation, method bodies are not emitted,


Has this been confirmed? This sounds like stating a fact. However, Jared mentioned that some IL emit artifacts are possibly making their way into ref assemblies. #Closed

Yes, I have confirmed this. With metadata-only (e.g., using EmitOptions.Default.WithEmitMetadataOnly(true)), the MethodCompiler is not involved at all, instead a SynthesizedMetadataCompiler is used.
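A minimal sketch of the kind of metadata-only emit referred to above, using the public Roslyn API. It assumes a project that references the Microsoft.CodeAnalysis.CSharp package. The source text and assembly name are made up for illustration, and the sketch does not turn on the data section feature itself; it only shows the emit mode in which method bodies are not produced.

```csharp
using System;
using System.IO;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Emit;

// Illustrative source containing one string literal inside a method body.
SyntaxTree tree = CSharpSyntaxTree.ParseText(
    "class C { string M() => \"hello\"; }");

CSharpCompilation compilation = CSharpCompilation.Create(
    "Sample",
    new[] { tree },
    new[] { MetadataReference.CreateFromFile(typeof(object).Assembly.Location) },
    new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));

using var peStream = new MemoryStream();

// Metadata-only emit: declarations are written, method bodies are not.
EmitResult result = compilation.Emit(
    peStream,
    options: EmitOptions.Default.WithEmitMetadataOnly(true));

Console.WriteLine(result.Success);
Console.WriteLine(peStream.Length);
```

Inspecting the resulting image with a metadata reader is one way to check which synthesized types, if any, made it into the output.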

AlekseyTs · 2024-12-13T19:12:36Z

Done with review pass (commit 10)

AlekseyTs

LGTM (commit 11)

jcouv

LGTM Thanks (iteration 11)

teo-tsirpanis · 2024-12-26T23:05:54Z

Would be interesting to extend this to encoding constant strings being passed to Convert.FromBase64String, as raw binary data.

The protobuf compiler for example generates such strings. And BTW if I replace string.Concat with +s the compiler gets stuck.

AlekseyTs · 2024-12-26T23:13:19Z

@teo-tsirpanis

> Would be interesting to extend this to encoding constant strings being passed to Convert.FromBase64String, as raw binary data.

Could you please elaborate on what the advantage of doing this would be? Also, how does this align with the specific goal that we are trying to achieve here, which is not a goal of supporting different encodings for string literals?

cston · 2025-01-14T05:12:55Z

docs/features/string-literals-data-section.md

+
+The utf8 string literal encoding emit strategy emits `ldsfld` of a field in a generated class instead.
+
+For every string literal, a unique internal static class is generated which:


Should this be "For every unique string literal, an internal static class ..."?

cston · 2025-01-14T06:42:15Z

docs/features/string-literals-data-section.md

+> In practice, there might not be many generated types, it depends on the kind of the program
+> (whether it has lots of short strings or a few large strings) and how [the threshold](#configuration) is configured.


It looks like the release build of Microsoft.CodeAnalysis.CSharp.dll has roughly 200 strings for file paths, presumably from calls to methods such as ExceptionUtilities.Unreachable() that use [CallerFilePath].

Given that, won't many assemblies contain a fair number of strings at 100 characters from paths alone? If so, let's consider removing this NOTE.

cston · 2025-01-14T06:45:05Z

docs/features/string-literals-data-section.md

+The threshold could be determined automatically with some objective, for example,
+use the utf8 encoding emit strategy for the lowest number of string literals necessary to avoid overflowing the UserString heap.
+
+The set of string literals is not know up front in the compiler, it is discovered lazily (and in parallel) by the emit layer.


Should this be "not known up front ..."?

jjonescz added 3 commits November 28, 2024 10:29

Copy feature spec

9058e0f

Add "literal" to the feature flag

5b4c80d

Add more details

ec93409

dotnet-issue-labeler bot added Area-Compilers untriaged labels Nov 28, 2024

jjonescz added Documentation Area-Compilers and removed Area-Compilers untriaged labels Nov 28, 2024

jjonescz mentioned this pull request Nov 28, 2024

Emit opted-in string literals into data section as UTF8 #76036

Merged

jjonescz added 2 commits November 28, 2024 12:36

Convert indentation to spaces

5aeecdb

Rename feature flag

991b785

jjonescz requested review from AlekseyTs, a team and jaredpar November 28, 2024 12:07

jjonescz changed the title ~~Add spec for the utf8-string-literal-encoding feature flag~~ Add spec for the data section string literals feature flag Nov 28, 2024

jcouv reviewed Nov 30, 2024

jcouv reviewed Dec 1, 2024

jcouv self-assigned this Dec 1, 2024

jaredpar reviewed Dec 2, 2024

jjonescz added the Feature - String Literals in Data Section as UTF8 label Dec 3, 2024

jjonescz mentioned this pull request Dec 3, 2024

Test plan for "string literals in data section as utf8" #76234

Closed

Improve

eaefe41

jjonescz requested a review from jcouv December 4, 2024 20:26

AlekseyTs reviewed Dec 4, 2024

AlekseyTs approved these changes Dec 6, 2024

jcouv reviewed Dec 9, 2024

Update after feature review

bcd4c03

jjonescz requested a review from AlekseyTs December 13, 2024 12:35

AlekseyTs reviewed Dec 13, 2024

jjonescz changed the title ~~Add spec for the data section string literals feature flag~~ Add spec for the data section string literals feature Dec 14, 2024

Make the shared helper private

51f9b02

AlekseyTs approved these changes Dec 16, 2024

jcouv approved these changes Dec 17, 2024

jjonescz added 2 commits December 27, 2024 16:54

Do not detect XXH128 collisions

eb7b374

Clarify feature flag values

6f8d5bc

cston reviewed Jan 14, 2025

jjonescz added 2 commits January 14, 2025 10:18

Improve wording

8b011c5

Remove note

1af62e9

cston approved these changes Jan 14, 2025

jjonescz merged commit f617688 into dotnet:main Jan 15, 2025
5 checks passed

jjonescz deleted the DataSectionStringLiterals-spec branch January 15, 2025 16:59

dotnet-policy-service bot added this to the Next milestone Jan 15, 2025

This was referenced Jan 22, 2025

[Automated] PRs inserted in VS build main-35721.127 #76844

Closed

[Automated] PRs inserted in VS build feature.debugger.shadowDebug-35722.156 #76874

Closed

[Automated] PRs inserted in VS build feature.debugger.main-35722.139 #76881

Closed

dibarbet modified the milestones: Next, 17.14 P1 Jan 28, 2025

dotnet-bot mentioned this pull request Jan 30, 2025

[Automated] PRs inserted in VS build feature.d18initial-10229.00 #76967

Closed
