Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite PHP lexer to support use statements, function declarations and type declarations #1489

Merged
merged 32 commits into from
Aug 11, 2020

Conversation

pyrmont
Copy link
Contributor

@pyrmont pyrmont commented Apr 8, 2020

PHP's use keyword can be used to (1) import namespaces into the current scope and (2) include traits in a class. While the PHP lexer currently supports use in some contexts, it doesn't allow for the importing of multiple namespaces at once, nor does it tokenise a given input correctly when use is used inside a class.

This PR improves the support for use. It uses a simple conditional to determine whether identifiers that are captured by the rules in the :uselist state are keywords, namespaces or function names.

It fixes #1353 and it fixes #1361.

@pyrmont pyrmont added the needs-review The PR needs to be reviewed label Apr 8, 2020
@pyrmont pyrmont self-assigned this Apr 8, 2020
@pyrmont
Copy link
Contributor Author

pyrmont commented Apr 8, 2020

@julp I'm sorry to ask but could I ask you to have a look at this? You mentioned in 826b516 that the commit fixed #1353 and #1361 but I don't think that's the case.

To better tokenise the identifiers in the :uselist state, I have assumed that (by convention at least) trait names are capitalised whereas function names always begin with lowercase. Please let me know if you see anything amiss.

@pyrmont
Copy link
Contributor Author

pyrmont commented Apr 8, 2020

@HypeMC Sorry it's taken so long to get to this. Here's what it looks like now:

traits

Let me know if you have any thoughts.

@pyrmont pyrmont changed the title Support traits in PHP lexer Improve support for use keyword in PHP lexer Apr 8, 2020
@julp
Copy link
Contributor

julp commented Apr 8, 2020

This is the question I was about to ask: for PHP, should we highlight names based on conventions?

  • class/trait names are pascal case (MyClass)
  • function names are snake case (mb_convert_case)
  • method names are camel case (isEnabled)
  • constant names are upper snake case (FOO_BAR)

(with a potential namespace in front of all of these - \My\Namespace\my_function, Foo\Bar\MyClass)

Also note that use can also be used to import variables from the parent scope into an anonymous function (function sort_on($a, $b) use($key) { ... })

I think we shouldn't bother to handle use statements (namespace related) with those states. It does not worth it, it's too complex and unmaintainable. There is so many edge cases to deal with.

use /*Foo, */ Bar;
<?php use Foo ?>
use /*not*/some\name\{
    function some_fn,
    #const Foo\BAR,
    SomeClass,
};

Also, what do you think about this approach?

@pyrmont
Copy link
Contributor Author

pyrmont commented Apr 9, 2020

@julp I got really excited by your idea and used it to further simplify the lexer. I made a couple of changes to how your code worked to make the tokens more specific (especially regarding functions) but it's largely the same concept.

I did have to fiddle with one of the tests. We previously had it so that the class keyword in code beginning an anonymous class (Class {) was tokenised as Keyword rather than Keyword::Declaration (which is how it's tokenised everywhere else). That doesn't make any sense to me. class is still functioning to declare a class regardless of whether the class is anonymous or not so I think the token should be the same.

What do you think now?

@pyrmont
Copy link
Contributor Author

pyrmont commented Apr 9, 2020

@julp I've probably become a bit carried away. I ended up replacing nsid with a new pattern (id_with_ns_and_paren) that does kind of what you expect from the name. Because it matches an opening parenthesis, the rule that matches that pattern is able to apply the Name::Function token to identifiers throughout the code that it can guess are functions. It seems to be working well on the sample.

Eager to hear your thoughts.

@julp
Copy link
Contributor

julp commented Apr 9, 2020

Awesome, good job, it's way better!

Could we not lookup for a builtin if we are lexing one of those statements? See the log method (public function log(string $msg) { in spec/visual/samples/php), it is highlighted as a builtin instead of a function (elsif self.builtins.include? name => elsif !@next_token and self.builtins.include? name?)

I just saw that I've forgotten to add object, added by PHP 7.4, to typehinting (%r/(void|\??(int|float|bool|string|iterable|self|callable|object))\b/i).

Some more points I've missed:

  • fn is kind of another keyword for function
  • for typehinting, builtin types excluded (the rule above), a ? at the beginning of a name means it's a class (eg ?mysqli, ?\Some\Class, ?An\Other\Class) but only on a function/fn declaration and class properties because ?FOO can also be found on a ternary $condition?FOO:BAR or a coalesce $exist??DEFAULT but people usually put spaces around operators

@pyrmont
Copy link
Contributor Author

pyrmont commented Apr 10, 2020

@julp I kind of feel like using a stack is the wrong way to go about it but I couldn't think of a better way so I brought the concept into this PR. Rather than refer to it as a 'stack', I called it a 'magic token' and changed the API a little bit. I also kept the name-guessing heuristic (and kept the broader identifier pattern that I had introduced last time to make that possible).

What do you think? One difference is that your code has this rule about closing parentheses. I didn't see the point of it so have left it out but let me know if there's something that I'm missing.

@julp
Copy link
Contributor

julp commented Apr 10, 2020

A state machine would be a better fit (independant from lexer states to handle only relevant tokens).

The more you want to handle, the more complex it becomes. At start, it seemed easy but you add one case then another then... (it's PHP, not Ruby or Erlang/Elixir sadly) Based only on conventions, the result is not so bad, it seems pretty accurate to me and probably is the best compromise accuracy/simplicity.

One difference is that your code has this rule about closing parentheses. I didn't see the point of it so have left it out but let me know if there's something that I'm missing.

For something like function connect(string $password): mysqli;, it's to switch on Name::Class for mysqli (instead to get a Name::Constant)

For fn, => should act like { or ; for function: fn and function are quite equivalent but the syntax differs :/ One more case to manage specifically.

To avoid a conflict on typehinting with nullable types (?classname or ?builtintype) and the ternary and coalesce operators, it would require a stack on scopes (class, function, ... - push on the corresponding { and pop on }) to choose the good one but it's too difficult to implement.

class Foo extends AbstractFoo implements Bar {
    private ?PDO $dbh = NULL;
}

$a = fn (int $a, int $b): int => FOO + $a <=> $b;
echo $a(1, 2);

echo self::class;
echo FOO;

echo $condition?FOO:BAR;
echo $condition??DEFAULT;

Someone else tried the code and/or have any thoughts?

@pyrmont
Copy link
Contributor Author

pyrmont commented Apr 12, 2020

@julp Thanks again for all the help so far. I'm still having a think about the best way to do things. I agree that a state machine is largely inevitable but some of my hesitancy comes from the fact that Rouge already provides a state machine with the states in the lexer. Having an additional state machine feels duplicative and inconsistent with the way the other lexers are implemented.

@julp
Copy link
Contributor

julp commented Apr 13, 2020

Sure, I understand.

What ever we could do, we will never get something 100% accurate. I think we have to live with that and keep things as easy and simple as possible in order to facilitate maintenance and future evolvements inherent to the language.

Thanks for your time and support as well.

@pyrmont pyrmont force-pushed the bugfix.php-traits branch from d0b0b5f to d33ef44 Compare July 7, 2020 07:01
@pyrmont
Copy link
Contributor Author

pyrmont commented Jul 7, 2020

@julp I know it's been a few months since the last update but I hadn't completely forgotten about this PR (although you could have been forgiven for thinking otherwise!).

I've made some rather large changes to the structure but I think the result is easier to understand and easier to extend. The main idea was to move from using a :template state and a :php state to using a :root state and a :php state. This mean that when ?> is hit, the state of the lexer can be reset by simply calling reset_stack. Once I had that in place, I put a series of general rules in :php that I then vary by going into various :in_ states when hitting particular keywords (e.g. use goes into :in_use, trait goes into :in_declaration, etc).

I also revised the PHP visual sample; ripping out the copypasta and trying to group examples together so that they are more logically arranged.

II don't know if you've got time at the moment but if you have, could you have a look and let me know what you think?

@pyrmont pyrmont changed the title Improve support for use keyword in PHP lexer Rewrite PHP lexer to support use statements, function declarations and type declarations Jul 7, 2020
@pyrmont
Copy link
Contributor Author

pyrmont commented Jul 7, 2020

Here's what it looks like at the moment:

Screen Shot 2020-07-07 at 6 33 11 pm

@julp
Copy link
Contributor

julp commented Jul 10, 2020

Hi,

you did a great job however PHP is not a mature language with some kind of final/established syntax like many others. PHP 8.0.0 is already in alpha and some major changes are coming: union-types, attributes (a replacement for annotations as comments), pattern matching (an alternative to switch/case - seems not yet present in alpha releases), a mixed type, ... (sorry, I feel your pain)

I'm a little worried, with such a technique, on how difficult it might become to maintain the lexer (it begins to do what a parser should in my opinion). For now, you don't have so many intermediate states so it seems manageable but with PHP, expect regular changes (= always more cases to handle).

@pyrmont
Copy link
Contributor Author

pyrmont commented Jul 10, 2020

@julp Well, we also have a JavaScript lexer so this isn't a completely unknown problem :P

I did see some of these proposals and structured this lexer partly with extensibility in mind. The reason that I mixed in the :return state to the end of all the :in_ states is that it means there won't really be any error tokens if unknown syntax is encountered. At worst, a state will be left prematurely and that'll screw up the highlighting after that point but we have issues like that all the time and we just fix it as and when the need arises.

If the lexer looks like it's highlighting PHP correctly (for now), I think we can merge it in and then handle changes to the syntax in due course.

@julp
Copy link
Contributor

julp commented Jul 12, 2020

Hi,

I haven't tested it on invalid scripts but the result seems really good.

I found one issue: the leading \ for FQDN names (all of them - constants, classes, functions, ...) appears as error

echo collection_radio_buttons('date', new \DatePeriod(new \DateTime, \DateInterval::createFromDateString('1 day'), 7), format_datetime_as_Ymd(date_create('tomorrow')), 'format_datetime_as_Ymd', 'l');

On the above example DatePeriod is also higlighted as a function (instead of class) but I think this is due to this leading \ issue because with this error token we consequently fallback in recover/reset mode (the :return state).

I also noted that on the rule %r/\??#{id}/, Keyword::Type, we could find a namespaced classname (?Foo\Bar) so I think id_with_ns would be more accurate than just id.

Anyone else gave it a shot?

@pyrmont pyrmont merged commit ba2d158 into rouge-ruby:master Aug 11, 2020
@pyrmont pyrmont removed the needs-review The PR needs to be reviewed label Aug 11, 2020
@pyrmont
Copy link
Contributor Author

pyrmont commented Aug 11, 2020

@julp At this point, I think it's easiest to generate feedback by putting it out into the world and seeing the response. I'm including it in the next version of Rouge I'm prepping right now. Thanks for all your help! I couldn't have done what I did without your advice and support :)

mattt pushed a commit to NSHipster/rouge that referenced this pull request May 19, 2021
…d type declarations (rouge-ruby#1489)

This commit involves a large-scale rewrite of the PHP lexer with the aim
of improving the level of semantic detail. Partly this is achieved by
the use of various heuristics that, while not always correct, do hold
for the majority of cases. In particular, the lexer has improved
support for `use` statements, function declarations and type
declarations.

Co-authored-by: julp <julp@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PHP traits not always highlighted properly PHP multiple use statements combined not highlighted properly
2 participants