-
Notifications
You must be signed in to change notification settings - Fork 483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BuiltinByteString literals aren't UTF8 encoded. #6655
Conversation
...lc/evaluation/builtin/semantics/bls12_381_G1_scalarMul/mul-neg-one/mul-neg-one.uplc.expected
Outdated
Show resolved
Hide resolved
@@ -699,13 +732,21 @@ compileExpr e = traceCompilation 2 ("Compiling expr:" GHC.<+> GHC.ppr e) $ do | |||
_ -> throwPlain $ CompilationError "No info for Pair builtin" | |||
|
|||
-- TODO: Maybe share this to avoid repeated lookups. Probably cheap, though. | |||
(stringTyName, sbsName) <- case (Map.lookup ''Builtins.BuiltinString nameInfo, Map.lookup 'Builtins.stringToBuiltinString nameInfo) of | |||
(stringTyName, sbsName) <- | |||
case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap long line
...t-cases/uplc/evaluation/builtin/semantics/replicateByte/case-07/case-07.uplc.budget.expected
Outdated
Show resolved
Hide resolved
| let name = GHC.getName n | ||
, name == GHC.unpackCStringName || name == GHC.unpackCStringUtf8Name -> | ||
stringExprContent expr | ||
data StringExprContentAs = AsBytes | AsText |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stringExprContent
function is used to extract bytes for the BuiltinByteString
as well as for the BuiltinString
literals;
I have added this datatype as its parameter to distinguish between these scenarios as we want to utf-8 decode BuiltinByteString
literals only.
This isn't a full UTF-8 decoder: it only decodes the subset of UTF-8 that | ||
is expected to be found in bytestring literals: 0x00 - 0xFF | ||
-} | ||
fromUtf8 :: [Word8] -> Maybe [Word8] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using Data.Text.Encoding.decodeUtf8
and it didn't work:
Cannot decode byte '\xc0': Data.Text.Encoding: Invalid UTF-8 stream
Maybe the implementation in GHC is not compatible with the one in Text (Currently GHC ships with at least five UTF-8 implementations)
It turned out that decoding rules are not super complex so I have implemented them below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a unit test to prove it works correctly on bytes 00 - FF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably break my forehead against the wall of trying to reuse GHC stuff, but you really did the right thing here. Great work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is surprising. An example in the comment would be helpful.
Also, why not fail fast here? That seems better than returning a Nothing
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why not fail fast here? That seems better than returning a Nothing.
the fromUtf8
function is used in the
stringExprContent :: StringExprContentAs -> GHC.CoreExpr -> Maybe BS.ByteString
which is not monadic and doesn support our throwPlain
and throwSd
functions.
stringExprContent
is used in pattern positions (via ViewPatterns
) so failing fast there is associated with the much larger code rewrite.
(bsTyName, sbbsName) <- case (Map.lookup ''Builtins.BuiltinByteString nameInfo, Map.lookup 'Builtins.stringToBuiltinByteString nameInfo) of | ||
(Just t1, Just t2) -> pure (GHC.getName t1, GHC.getName t2) | ||
_ -> throwPlain $ CompilationError "No info for ByteString builtin" | ||
(builtinByteStringTyName, sbbsName) <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap another long line
Just tc -> | ||
if | ||
| GHC.getName tc == builtinByteStringTyName -> | ||
case stringExprContent AsBytes (strip content) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracting content as bytes undoes the UTF8 encoding for BuiltinByteString
Just bs -> | ||
pure $ PIR.Constant annMayInline $ PLC.someValue bs | ||
| GHC.getName tc == stringTyName -> | ||
case stringExprContent AsText (strip content) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracting content as text leaves original UTF8 encoding for BuiltinString
throwPlain $ | ||
CompilationError $ | ||
"Text literal with invalid UTF-8 content: " <> (T.pack $ show err) | ||
(strip -> GHC.Var n) `GHC.App` (strip -> stringExprContent AsText -> Just bs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrapped long line
@@ -91,21 +91,23 @@ map2 = | |||
\n -> | |||
let m1 = | |||
Data.AssocMap.unsafeFromList | |||
[ (n PlutusTx.+ 1, "one") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that this test relies on UTF-8 encoding for BuiltinByteStrings is accidental
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've re-worked the test to use Integer
s instead of BuiltinByteString
s.
0f08f16
to
439f7e5
Compare
439f7e5
to
b4f1bb2
Compare
8c7865f
to
7cd0c62
Compare
7cd0c62
to
03011f0
Compare
03011f0
to
b718e0b
Compare
b718e0b
to
5ff83c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff!
@@ -27,3 +27,4 @@ | |||
- ignore: {name: Use first, within: [UntypedPlutusCore.Evaluation.Machine.Cek]} | |||
- ignore: {name: Redundant if, within: [PlutusLedgerApi.V1.Value, PlutusLedgerApi.V1.Data.Value]} | |||
- ignore: {name: Replace case with maybe, within: [PlutusLedgerApi.V1.Value, PlutusLedgerApi.V1.Data.Value]} | |||
- ignore: {name: Use bimap, within: [PlutusTx.Builtins.HasOpaque]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no no no no no, not this fuckery again. If we make it mandatory or it is already mandatory, I will not comply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like ignoring HLint warnings in a place that's far away from the actual warning, but this is better for me than not doing anything, as HLint warnings interfere with the legit warnings and break my workflow.
What's your take on HLint warnings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you instruct HLint to use your own config file instead of the one in the repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's your take on HLint warnings?
I hate HLint and don't use it. I do not mind you using it of course.
Can you instruct HLint to use your own config file instead of the one in the repo?
Apparently there are two in the repo. One for CI and one for I don't really know what.
This isn't a full UTF-8 decoder: it only decodes the subset of UTF-8 that | ||
is expected to be found in bytestring literals: 0x00 - 0xFF | ||
-} | ||
fromUtf8 :: [Word8] -> Maybe [Word8] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably break my forehead against the wall of trying to reuse GHC stuff, but you really did the right thing here. Great work!
"Use of fromString @BuiltinString with inscrutable content:" | ||
GHC.<+> GHC.ppr content | ||
Just bs -> | ||
case TE.decodeUtf8' bs of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, decodeUtf8'
didn't work for bytes, but works for text? What if text is bytes though? Or am I misunderstanding this comment:
I tried using Data.Text.Encoding.decodeUtf8 and it didn't work: the error was saying about invalid utf-8 bytes. Maybe the implementation in GHC is not compatible with the one in Text (Currently GHC ships with at least five UTF-8 implementations)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that I don't have unit test with the bytes that break Text.decodeUtf8
, m.b. it was \xFF
or \x00
that Text
choke on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was \xC0
:
Cannot decode byte '\xc0': Data.Text.Encoding: Invalid UTF-8 stream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, are we fine with not allowing all bytes to be used in Plinth? What if some other toolchain produced those bytes and the user wanted to replicate the behavior in Plinth? Quite a stretch I guess.
5ff83c0
to
65de008
Compare
BuiltinByteString
literals are utf-8 decoded in the plugin.Haskell.IsString BuiltinByteString
instance delegates to the HaskellByteString
instance.BuiltinByteString
without the UTF8 conversion.Closes #6505