-
Notifications
You must be signed in to change notification settings - Fork 743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve parsing of numbers in Kotlin lexer #1509
Conversation
goodhoko
commented
Apr 25, 2020
- Adds support for underscore separator.
- Add support for binary literals.
- Removes invalid support for lowercase "l" suffix.
- Adds support for unsigned literals.
8382068
to
dfd565d
Compare
@goodhoko Thanks for the PR! I realise it introduces a bit of duplication, but I thought from a maintainability perspective, it's a little bit easier to parse (as a human) the new rules if they're separated out. This is also more consistent with most other lexers and has the bonus benefit of allowing more targeted tokens. |
@pyrmont Cool! Thanks. I added one more commit that replaces the generi Num rule with more specific Num::Float. This is ready to me merged from my side now. .) |
@goodhoko I think we need to put the |
@pyrmont Exactly because the For example. |
Whoops—you're completely right. What do you want to do? You can either make a float rule that will ensure there's at least a decimal point or an exponent or we can leave it as-is and just tokenise most integers as floats. |
@pyrmont ah good catch! I didn't notice that the Float regexp would match integers too. Since correctness is probably not the ultimate goal of Rouge I propose to revert back to b01aa87 and just accept we won't parse reals as floats but rather as general But pursuing correctness is possible. Here's the correct regexp for reals: ((((([\d][\d_]*[\d]|[\d])?\.([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))?)|(([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))))[fF])|(([\d][\d_]*[\d]|[\d])[fF]))|(((([\d][\d_]*[\d]|[\d])?\.([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))?)|(([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))))) Constructed recursively from the Grammar. Matches only Doubles or Floats. To reduce the overwhelming complexity, we could break the regexp into parts (just as on the grammar page) and then use these parts in the rule definitions. For example decDigit = %r"[0-9]"
decDigitOrSeparator = %r"#{decDigit}|_"
decDigits = %r"(#{decDigit}(#{decDigitOrSeparator})*#{decDigit})|#{decDigit}"
doubleExponent = %r"[eE][+-]?(#{decDigits})"
doubleLiteral = %r"((#{decDigits})?\.#{decDigits}(#{doubleExponent})?)|#{decDigits}#{doubleExponent}"
floatLiteral = %r"(#{doubleLiteral}[fF])|(#{decDigits}[fF])"
...
rule %r"(#{floatLiteral})|(#{doubleLiteral})", Num::Float
# and so on for other Num literals I'd gladly do that, I'm just not sure if this level of granularity is wanted here in Rouge. .) |
The Rust lexer has arguably a decent compromise: rouge/lib/rouge/lexers/rust.rb Lines 164 to 175 in 2e9c3cd
We could do that here, too? Otherwise I don't mind going back to b01aa87. |
@pyrmont cool. Gimme some time I'll try to rewrite this into something sane. .) |
- Adds support for underscore separator. - Add support for binary literals. - Removes invalid support for lowercase "l" suffix. - Adds support for unsigned literals. - Adds support for dot-starting real literals.
4b0cde4
to
9ee947c
Compare
@pyrmont I reworked this, please check it out. The rules for numbers are now correct - they reflect the grammar with no shortcomings. I also moved the number rules above the |
This commit improves the handling of numbers in the Kotlin lexer. Specifically, it: 1. adds support for underscore separators; 2. adds support for binary literals; 3. removes invalid support for the lowercase "l" suffix; and 4. adds support for unsigned literals.