Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BuiltinByteString literals aren't UTF8 encoded. #6655

Merged
merged 2 commits into from
Nov 18, 2024

Conversation

Unisay
Copy link
Contributor

@Unisay Unisay commented Nov 12, 2024

  • BuiltinByteString literals are utf-8 decoded in the plugin.
  • Haskell.IsString BuiltinByteString instance delegates to the Haskell ByteString instance.
  • Unit tests to verify that bytes in range from 0 to 255 (\x00 - \xFF) found in string literals are passed to the BuiltinByteString without the UTF8 conversion.

Closes #6505

@Unisay Unisay self-assigned this Nov 12, 2024
@@ -699,13 +732,21 @@ compileExpr e = traceCompilation 2 ("Compiling expr:" GHC.<+> GHC.ppr e) $ do
_ -> throwPlain $ CompilationError "No info for Pair builtin"

-- TODO: Maybe share this to avoid repeated lookups. Probably cheap, though.
(stringTyName, sbsName) <- case (Map.lookup ''Builtins.BuiltinString nameInfo, Map.lookup 'Builtins.stringToBuiltinString nameInfo) of
(stringTyName, sbsName) <-
case
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap long line

| let name = GHC.getName n
, name == GHC.unpackCStringName || name == GHC.unpackCStringUtf8Name ->
stringExprContent expr
data StringExprContentAs = AsBytes | AsText
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stringExprContent function is used to extract bytes for the BuiltinByteString as well as for the BuiltinString literals;
I have added this datatype as its parameter to distinguish between these scenarios as we want to utf-8 decode BuiltinByteString literals only.

This isn't a full UTF-8 decoder: it only decodes the subset of UTF-8 that
is expected to be found in bytestring literals: 0x00 - 0xFF
-}
fromUtf8 :: [Word8] -> Maybe [Word8]
Copy link
Contributor Author

@Unisay Unisay Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using Data.Text.Encoding.decodeUtf8 and it didn't work:

Cannot decode byte '\xc0': Data.Text.Encoding: Invalid UTF-8 stream

Maybe the implementation in GHC is not compatible with the one in Text (Currently GHC ships with at least five UTF-8 implementations)
It turned out that decoding rules are not super complex so I have implemented them below.

Copy link
Contributor Author

@Unisay Unisay Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a unit test to prove it works correctly on bytes 00 - FF

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably break my forehead against the wall of trying to reuse GHC stuff, but you really did the right thing here. Great work!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is surprising. An example in the comment would be helpful.

Also, why not fail fast here? That seems better than returning a Nothing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why not fail fast here? That seems better than returning a Nothing.

the fromUtf8 function is used in the

stringExprContent :: StringExprContentAs -> GHC.CoreExpr -> Maybe BS.ByteString

which is not monadic and doesn support our throwPlain and throwSd functions.

stringExprContent is used in pattern positions (via ViewPatterns) so failing fast there is associated with the much larger code rewrite.

(bsTyName, sbbsName) <- case (Map.lookup ''Builtins.BuiltinByteString nameInfo, Map.lookup 'Builtins.stringToBuiltinByteString nameInfo) of
(Just t1, Just t2) -> pure (GHC.getName t1, GHC.getName t2)
_ -> throwPlain $ CompilationError "No info for ByteString builtin"
(builtinByteStringTyName, sbbsName) <-
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap another long line

Just tc ->
if
| GHC.getName tc == builtinByteStringTyName ->
case stringExprContent AsBytes (strip content) of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting content as bytes undoes the UTF8 encoding for BuiltinByteString

Just bs ->
pure $ PIR.Constant annMayInline $ PLC.someValue bs
| GHC.getName tc == stringTyName ->
case stringExprContent AsText (strip content) of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracting content as text leaves original UTF8 encoding for BuiltinString

throwPlain $
CompilationError $
"Text literal with invalid UTF-8 content: " <> (T.pack $ show err)
(strip -> GHC.Var n) `GHC.App` (strip -> stringExprContent AsText -> Just bs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapped long line

@@ -91,21 +91,23 @@ map2 =
\n ->
let m1 =
Data.AssocMap.unsafeFromList
[ (n PlutusTx.+ 1, "one")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that this test relies on UTF-8 encoding for BuiltinByteStrings is accidental

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-worked the test to use Integers instead of BuiltinByteStrings.

@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 0f08f16 to 439f7e5 Compare November 13, 2024 10:54
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 439f7e5 to b4f1bb2 Compare November 13, 2024 10:57
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 8c7865f to 7cd0c62 Compare November 13, 2024 11:20
@Unisay Unisay requested review from zliu41 and a team November 13, 2024 11:21
@Unisay Unisay marked this pull request as ready for review November 13, 2024 11:21
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 7cd0c62 to 03011f0 Compare November 13, 2024 11:23
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 03011f0 to b718e0b Compare November 13, 2024 12:20
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from b718e0b to 5ff83c0 Compare November 13, 2024 13:29
Copy link
Contributor

@effectfully effectfully left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff!

@@ -27,3 +27,4 @@
- ignore: {name: Use first, within: [UntypedPlutusCore.Evaluation.Machine.Cek]}
- ignore: {name: Redundant if, within: [PlutusLedgerApi.V1.Value, PlutusLedgerApi.V1.Data.Value]}
- ignore: {name: Replace case with maybe, within: [PlutusLedgerApi.V1.Value, PlutusLedgerApi.V1.Data.Value]}
- ignore: {name: Use bimap, within: [PlutusTx.Builtins.HasOpaque]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no no no no no, not this fuckery again. If we make it mandatory or it is already mandatory, I will not comply.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like ignoring HLint warnings in a place that's far away from the actual warning, but this is better for me than not doing anything, as HLint warnings interfere with the legit warnings and break my workflow.
What's your take on HLint warnings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you instruct HLint to use your own config file instead of the one in the repo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your take on HLint warnings?

I hate HLint and don't use it. I do not mind you using it of course.

Can you instruct HLint to use your own config file instead of the one in the repo?

Apparently there are two in the repo. One for CI and one for I don't really know what.

This isn't a full UTF-8 decoder: it only decodes the subset of UTF-8 that
is expected to be found in bytestring literals: 0x00 - 0xFF
-}
fromUtf8 :: [Word8] -> Maybe [Word8]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably break my forehead against the wall of trying to reuse GHC stuff, but you really did the right thing here. Great work!

"Use of fromString @BuiltinString with inscrutable content:"
GHC.<+> GHC.ppr content
Just bs ->
case TE.decodeUtf8' bs of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, decodeUtf8' didn't work for bytes, but works for text? What if text is bytes though? Or am I misunderstanding this comment:

I tried using Data.Text.Encoding.decodeUtf8 and it didn't work: the error was saying about invalid utf-8 bytes. Maybe the implementation in GHC is not compatible with the one in Text (Currently GHC ships with at least five UTF-8 implementations)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that I don't have unit test with the bytes that break Text.decodeUtf8, m.b. it was \xFF or \x00 that Text choke on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was \xC0:

Cannot decode byte '\xc0': Data.Text.Encoding: Invalid UTF-8 stream

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, are we fine with not allowing all bytes to be used in Plinth? What if some other toolchain produced those bytes and the user wanted to replicate the behavior in Plinth? Quite a stretch I guess.

plutus-tx-plugin/test/ByteStringLiterals/Spec.hs Outdated Show resolved Hide resolved
@Unisay Unisay force-pushed the yura/builtin-byte-string-literals branch from 5ff83c0 to 65de008 Compare November 18, 2024 12:39
@Unisay Unisay enabled auto-merge (squash) November 18, 2024 12:39
@Unisay Unisay merged commit 2498986 into master Nov 18, 2024
8 checks passed
@Unisay Unisay deleted the yura/builtin-byte-string-literals branch November 18, 2024 13:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix BuiltinByteString literal construction
4 participants