Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some updates to the PHP lexer #1397

Merged
merged 15 commits into from
Apr 7, 2020
Merged

some updates to the PHP lexer #1397

merged 15 commits into from
Apr 7, 2020

Conversation

julp
Copy link
Contributor

@julp julp commented Jan 18, 2020

Hi,

I would like to submit some updates and improvements to the PHP lexer:

  • fix case insensitivity of (accordingly to the language they are case insensitive but currently are not):
    • <?php
    • as in use list
    • keywords (when checked in @keywords)
    • function/method names (when checked in @builtins)
  • fix heredoc: the syntax with " around the starting label was not recognized as is
  • updates from the language:
    • support _ in (binary, decimal, hexadecimal, ...) numbers (7.4.0)
    • support for binary numbers (0b...) (5.4.0)
    • Unicode codepoints escape syntax (\u{...}) (7.0.0)
    • add some missing keywords:
      • 7.4.0: fn
      • type declarations (PHP 7, including void and nullable types from PHP 7.1)
      • 7.0.0: class (anonymous classes), yield from
      • 5.4.0: callable, insteadof, trait, __TRAIT__
      • 5.3.0: goto, __NAMESPACE__, __DIR__
      • others: instanceof, __CLASS__, __FUNCTION__, __METHOD__, __halt_compiler
      • self even if it's not really a reserved word

I haven't deleted them but I am in favor of the removal of the current builtin/predefined constants (E_* and PHP_*) and class stdClass from @keywords:

  • they are not keywords
  • these constants are case sensitive (which is not the case of the keywords in PHP as said above)

Let me know if you agree and if I missed or forgot something.

Also, is there a reason for TRUE/FALSE/NULL to be in @keywords? I mean shouldn't rule %r/(true|false|null)\b/, Keyword::Constant be made case insensitive instead?

Thanks.

@pyrmont pyrmont self-assigned this Jan 18, 2020
@pyrmont pyrmont added the needs-review The PR needs to be reviewed label Jan 18, 2020
Copy link
Contributor

@pyrmont pyrmont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the submission! Some comments within.

lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/php.rb Show resolved Hide resolved
lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
@pyrmont pyrmont added author-action The PR has been reviewed but action by the author is needed needs-review The PR needs to be reviewed and removed needs-review The PR needs to be reviewed author-action The PR has been reviewed but action by the author is needed labels Jan 22, 2020
@pyrmont
Copy link
Contributor

pyrmont commented Feb 5, 2020

@julp Sorry for the delay in getting back to you. I've resolved the completed issues and left comments on the ones that remain outstanding. Thanks for all your work on this!

@pyrmont pyrmont added author-action The PR has been reviewed but action by the author is needed and removed needs-review The PR needs to be reviewed labels Feb 5, 2020
@julp
Copy link
Contributor Author

julp commented Feb 5, 2020

No problem ;)

@pyrmont pyrmont added needs-review The PR needs to be reviewed and removed author-action The PR has been reviewed but action by the author is needed labels Feb 11, 2020
@pyrmont
Copy link
Contributor

pyrmont commented Apr 3, 2020

@julp Sorry again for the delay. Added two comments!

@pyrmont pyrmont added author-action The PR has been reviewed but action by the author is needed and removed needs-review The PR needs to be reviewed labels Apr 3, 2020
julp added 9 commits April 4, 2020 16:19
  + `<?php`
  + keywords
  + function/method names
* updates to the language:
  + support `_` in (binary, decimal, hexadecimal, ...) numbers (7.4.0)
  + support for binary numbers (`0b...`) (5.4.0)
  + Unicode codepoints escape syntax (`\u{...}`) (7.0.0)
  + add some missing keywords:
    * `fn` keyword (7.4.0)
    * type declarations (PHP 7, including `void` and nullable types from PHP 7.1)
    * 7.0.0: `class` (anonymous classes), `yield from`
    * 5.4.0: `callable`, `insteadof`, `trait`, `__TRAIT__`
    * 5.3.0: `goto`, `__NAMESPACE__`, `__DIR__`
    * others: casts, `instanceof`, `__CLASS__`, `__FUNCTION__`, `__METHOD__`, `__halt_compiler`
    * `self` even if it's not really a reserved word
we already had a rule to handle their lower case form,
just make it case insensitive.
* remove E_* and PHP_* constants: these are not keywords and are case sensitive
(keywords are not)
* `not` doesn't exist
* neither `this`
* `empty` should be treated in the same way as `isset` and `unset` (currently functions)
* `virtual` is a function, not a keyword
* `php_user_filter` is a class, not a keyword
@julp
Copy link
Contributor Author

julp commented Apr 4, 2020

Don't worry.

Huge thanks to you for taking care of rouge and the guidance/help!

Copy link
Contributor

@pyrmont pyrmont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these!

Would you be able to add something to the visual sample so that we can see the effects of some of the changes? I'm thinking especially of things like the number rules. Something also using yield would also be good, I think.

lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/php.rb Outdated Show resolved Hide resolved
@pyrmont
Copy link
Contributor

pyrmont commented Apr 5, 2020

@julp I realise this is still in 'draft' but I'm ready to merge this if there's nothing you have left to add :)

@julp
Copy link
Contributor Author

julp commented Apr 5, 2020

I messed up, I didn't see your commit.

Anyway, I think It's good now except if you see something else ;)

Next PR: fixing the use statement.

@pyrmont pyrmont marked this pull request as ready for review April 5, 2020 21:05
@pyrmont
Copy link
Contributor

pyrmont commented Apr 5, 2020

@julp This is the commit message I want to use:

This commit adds a number of fixes to the PHP lexer to both fix
outstanding issues and to add support for features introduced in
versions of PHP up to 7.4.0.

In particular, this commit:

- makes the following terms match case-insensitively: `php`, `as`,
  `use`, the keywords defined in `@keywords` and the keywords defined in
  `@builtins`;
- fixes an issue with heredoc syntax where `"` around the starting label
  would lex incorrectly;
- adds support for:
  - binary numbers;
  - use of `_` as a separator in numbers;
  - Unicode codepoint escape syntax; and
  - type declaration;
- adds the following keywords: `fn`, `class`, `yield`, `callable`,
  `insteadof`, `trait`, `__TRAIT__`, `goto`, `__NAMESPACE__`, `__DIR__`,
  `instanceof`, `__CLASS__`, `__FUNCTION__`, `__METHOD__`,
   `__halt_compiler`, `self`; and
- simplifies rules for `E_*` and `PHP_*` constants.

Could you correct anything you think is wrong?

@julp
Copy link
Contributor Author

julp commented Apr 5, 2020

php

I think people will misunderstand that point, it will be more obvious written as <?php (except if writing <? is an issue) or saying "open tag" maybe?

yield

To be precise: yield was already implemented but not in its form yield from

There are few removals from keywords (I just write them here if needed as a reminder):

  • this, removed: not a keyword
  • php_user_filter, removed: it's a class
  • stdClass moved to be known as a classname
  • empty: it's one but seems like a function and is in @builtins (like isset and unset)
  • virtual, removed: it's a function (also in @builtins)
  • __sleep and __wakeup, removed: these are magic method names (they are deprecated and all the others weren't listed here - __get, __set, __isset and so on)

I think the list is exhaustive and well summarized 👍

EDIT:

type declaration

Partial typehinting, limited to (a few) builtin types. I don't think we can do better, handling nullable instances (?ClassName) is easy but not non-nullable instances (ClassName): it's impossible at a lexer level to state if the token is the name of a class, a constant or something else. Maybe it could be achieved with additionnal states on function declarations but it will be really complex (way more than the use statement: you also have to handle potential default values, arguments packing, etc and the type of the return value if any).


And before merging, let me check if rule %r/(void|\??(int|float|bool|string|iterable|self|callable))\b/i, Keyword::Type has to be moved before rule %r/[~!%^&*+=\|:.<>\/?@-]+/, Operator.

typehinting should be placed before the one whith `?` and which
recognizes, among others, operators like `?`, `??`, `??=`, ...

In clear, `?int` is actually tokenized as:
`['Operator', '?'], ['Keyword.Type', 'int']`
Instead of:
`['Keyword.Type', '?int']`
@pyrmont pyrmont merged commit 8777663 into rouge-ruby:master Apr 7, 2020
@pyrmont
Copy link
Contributor

pyrmont commented Apr 7, 2020

@julp Thanks for all your help on this one! I think there are a lot of good fixes in there. The next version of Rouge will come out on Tuesday 14 April and this will be part of that release 🎉

@pyrmont pyrmont removed the author-action The PR has been reviewed but action by the author is needed label Apr 7, 2020
mattt pushed a commit to NSHipster/rouge that referenced this pull request May 21, 2020
…y#1397)

This commit adds a number of fixes to the PHP lexer to both fix
outstanding issues and to add support for features introduced in
versions of PHP up to 7.4.0.

In particular, this commit:

- makes the following terms match case-insensitively: `<?php`, `as`,
  `use`, the strings in `@keywords` and the strings in `@builtins`;
  
- fixes an issue with heredoc syntax where `"` around the starting label
  would lex incorrectly;

- adds support for:
  - binary numbers;
  - use of `_` as a separator in numbers;
  - `yield from` keyword;
  - Unicode codepoint escape syntax; and
  - partial type hinting;

- adds the following words to `@keywords`: `__CLASS__`, `__DIR__`,
  `__FUNCTION__`, `__halt_compiler`, `__METHOD__`, `__NAMESPACE__`,
  `__TRAIT__`, `callable`, `class`, `fn`, `goto`, `instanceof`,
  `insteadof`, `self`, `trait`;

- removes the following words from `@keywords`: `__sleep`, `__wakeup`,
  `empty`, `php_user_filter`, `stdClass`, `this`, `virtual`; and

- simplifies rules for `E_*` and `PHP_*` constants.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants