-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bip-0173: test vectors, HRP, and casing requirements updates #587
Conversation
In addition to the test vector updates, I made some updates, in a second commit, regarding casing requirements. Encoding:
For presentation, or QR code use, a simple uppercasing can be run externally on the encoding result.
Decoding:
Tangentially related: sipa/bech32#25 |
bip-0173.mediawiki
Outdated
If an uppercase version of the encoding result is desired, (e.g.- for presentation purposes, or QR code use), | ||
then an uppercasing procedure can be performed external to the encoding process. | ||
|
||
Decoders MUST accept uppercase, lowercase, and mixed-case strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decoders must not accept mixed-case strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a typo. I think decoders should (and thus "must") accept mixed-case. Because of the existing requirement to use the lowercase form to determine a character's value for checksum purposes, there is no difference between upper and lower case characters in any position of the bech32 string. That applies to the HRP because it is also used to compute the checksum. So, on a micro level, I don't see a reason why a decoder should fail the input. On a macro level, I can see where you might be trying to discourage mixed case. But, the other changes I made also push in that direction (in fact fixing this: segwit_addr.bech32_encode('A', [])
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rejecting mixed case is very intentional. Doing so massively reduces the chance of accepting arbitrary text as valid Bech32.
bip-0173.mediawiki
Outdated
* <tt>A1G7SGD8</tt>: The lowercase form is used when determining a character's value for checksum purposes. | ||
|
||
The following are invalid Bech32 strings (with reason for invalidity): | ||
* <tt>pzry9x0s0muk</tt>: No separator character |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section can be merged with the 'invalid Bech32 checksum' section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how it was to start. I moved it to a new section because I thought it was misleading in the current section because "No separator character" can not affect checksum validity because the separator char is not used to compute the checksum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but that's to for several of the existing examples in that section (like HRP character out of range, or length exceeded.
I think they can be combined into a single section with examples of invalid Bech32 strings.
I don't see it that way. The input is always being canonicalized to lower anyways. Am I misunderstanding you somehow? I was working on a java reference impl encoder/decoder and it works. I'll put it up shortly. |
@mruddy Just because it's canonicalized to lowercase for the purpose of checksum computation doesn't mean that it's valid. There is are extra requirements for validity besides checksum correctness (including maximum length, range of hrp characters, and no mixed case). |
There is a PR for a Java implementation already: sipa/bech32#19 I welcome review! |
Here's some java code. Maybe this will clarify: https://github.com/mruddy/bech32/blob/master/src/org/bitcoin/bips/bip173/bech32/Bech32.java#L110 https://github.com/mruddy/bech32/blob/master/src/org/bitcoin/bips/bip173/bech32/Bech32.java#L177 |
Yes, I saw that PR and thought that I could do better. This is my attempt https://github.com/mruddy/bech32/ :) |
Agreed, but the case of a-Z and A-Z is interchangeable due to the lowercase canonicalization. That's what I mean by mixed-case. |
I still don't understand why you want to add "Decoder must accept mixed case". I absolutely think it should not. |
Because from a user standpoint, if they happen to transcribe and get the case wrong somehow, it should still be accepted since the result will be the same. |
If they transcribe it wrong, the software should complain and they should go back and fix it. Trying to correct it for them is a footgun. If they can accidentally misread and transcribe as the wrong case, they could equally have transcribed as just the wrong character - and you wouldn't want to correct that for them either (even though error correction algorithms exist). Effectively, the consistent case is part of the consistency check of the address, together with the checksum. |
Well, "wrong" was not a great word to use. I should have said "different" because that's all it is -- not wrong, because they are equal. It's less code to be more user friendly and get the same result. Maybe I won't be able to persuade you on this point. But, how about the rest of the changes? Once we have general agreement, I'll update accordingly. |
Big concept ACK on the rest of the changes. I've had other reports that the language was unclear wrt the behaviour of the encoder and uppercase input, and wanted to improve it. This seems like a good solution. I do want to verify the extra test cases, though. |
Cool, thanks. Also, I was thinking of accepting mixed case not so much as trying to correct users as much as it was unifying code paths. BTW, the test cases should be testable with the examples in my first comment. I also have those test cases in my java. |
FYI, I just added a 3rd commit to capture what we discussed. I'll squash it all when we're all set. |
bip-0173.mediawiki
Outdated
@@ -264,6 +267,9 @@ P2PKH addresses can be used. | |||
|
|||
The following strings have a valid Bech32 checksum. | |||
* <tt>A12UEL5L</tt> | |||
* <tt>a12uel5l</tt> | |||
* <tt>10a06t8</tt> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a requirement that the HRP is non-empty, so while you're technically right that this has a valid checksum, it's not testable... any correct decoder should fail with that as input.
So I think it's maybe better to rename the section to "The following strings are valid Bech32" (and the next section to "The following string are not valid Bech32 (with reason for invalidity)", and have these two example with empty HRP in that section (with a note that technically their checksum is valid).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the BIP contrary to that. My interpretation was that no such requirement that the HRP is non-empty is in the BIP. The BIP even says that the HRP's validity is application specific (which builds on top of generic Bech32). That's why this works:
>>> segwit_addr.bech32_create_checksum('', [])
[15, 29, 15, 26, 11, 7]
>>> segwit_addr.bech32_verify_checksum('', [] + [15, 29, 15, 26, 11, 7])
True
It seems odd that we'd encode something that can't be decoded. I think the reference decoder is implementing a phantom requirement right now. If we add a requirement that HRP be non-empty, I think that's OK, just other things need to be updated also. Am I making sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right - it seems the non-empty HRP isn't mentioned in the BIP anywhere, except in the test vectors.
I prefer modifying it so that it does require that, and is uniformly implemented in decoders and encoders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i'll update to that effect.
btw, the non-checksum data portion can be empty right now. do you want to keep that as it is now? i think so, just checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sounds good. Thanks!
I've updated and squashed everything back into one commit. Besides this BIP update, the reference code will need some updates. Just wanted to point out that a couple of the things to fix, 1) mixed case encoding output and 2) using the upper case form of the HRP for checksum creation, are identified by this test:
|
Added some info about conversion to US-ASCII and unmappable characters. |
ACK 96d39e8, I've verified all the test vectors. Thank you! |
@sipa thanks for reviewing! One more thing. I think we should definitely use UTF-8 instead of US-ASCII. I've added a commit and will squash it upon approval. Doing this will avoid the unmappable character issue and make using this easier for many people using higher level languages that use unicode strings. UTF-8 is compatible with ASCII because the mapping for ASCII uncompatible code points goes to multibyte, with each byte being out of ASCII range (which we'll already reject). This is easy to see in the table at the top of the description at https://en.wikipedia.org/wiki/UTF-8.
|
@mruddy Yes, that's why it's restricted to the range of printable characters in the lower 7 bits, to avoid issues with character sets. You also can't say "UTF-8 code points" (as UTF-8 is a way of encoding code points as bytes). You could say Unicode code points (or Unicode characters), though. I think it's unnecessary though. The text does not say "The HRP must consist of Unicode characters, converted to US-ASCII" - that would be prone to having character conversion issues. It says "must consist of ASCII characters". If you're using it with other characters, regardless of how you encoding them, it's already used incorrectly. |
Agree, good catch. The reason I bring this up is because of languages like Java (but not just Java) where the String type is UTF-16. So, that's what a user of the library is going to have and then it's going to be sent in and the library is going to have to encode to what Bech32 needs. |
The API should just take in a String, and then check whether all characters are in the ASCII range. UTF-8 shouldn't be involved at any point. |
Oh, I suppose :) I dropped the UTF-8 commit and it's back to the one that you ACK'd, 96d39e8. Thanks! |
a12uel5l
: vector added to demonstrate that it and its uppercase version,A12UEL5L
are both valid and treated as equal.segwit_addr.bech32_encode('a', [])
10a06t8
: vector added to demonstrate that a combined empty HRP and empty data encode to a valid Bech32 string.segwit_addr.bech32_encode('', [])
1qzzfhee
: vector added to demonstrate that an empty HRP encodes to a valid Bech32 string.segwit_addr.bech32_encode('', [0])
pzry9x0s0muk
: vector moved to a new section as it is misleading in the current section because "No separator character" can not affect checksum validity because the separator char is not used to compute the checksum.1pzry9x0s0muk
: vector removed as incorrect because "Empty HRP" is Bech32 valid -- HRP validity is application specific.segwit_addr.bech32_encode('', [1,2,3,4,5,6])
A1g7sgd8
andA1G7SGD8
: vectors added to verify that implementations adhere to BIP requirement: "The lowercase form is used when determining a character's value for checksum purposes.". Note:10a06t8
: test vector added to demonstrate that while a valid Bech32 string, it is segwit application specific invalid.