Add support for Nim (with tests) #633
macOS CI doesn't like it
There are still a few bugs, mostly related to EOL/DEDENT compositions.
Previously, non-capturing groups were used. That worked, but it was overkill and probably complicated the parser a bit.
The scanner is now a bottom-up scanner, in the hope that this makes adding more rules easier.
This is no longer a problem with the switch to indentation relations.
This reduces the number of syntax nodes drastically.
This commit removes newline from the "tokens that can appear anywhere" and replaces it with a strict layout definition. This prevents erroneous code like ``` type X = int ``` from being parsed as a single type definition.
The statement -> expression -> ... tree makes the generated CST hard to navigate while providing little useful information. This commit hides those nodes. With them hidden, we can now separate call syntax into two without making the CST look weird.
This fits the definition of how command call is evaluated perfectly!
Follow up to 5976424
This commit adds new grammar for the following blocks: `block`, `if`, `when`, `case`, `try`, `for`, `while`.
Tuple expression with one entry has to have a trailing comma
Covers basic statements like: `import`, `export`, `discard`, `return`, etc.
This prevents a separate syntax node from being created for each character; characters are instead grouped whenever possible.
Previously, the lexer's layout tokens were composed of both newlines and whitespace. This required disambiguation at the lexer level, and sequences like DEDENT -> INDENT_EQ required lexing tricks such as expecting the other token to be requested precisely after DEDENT. Lines leading into a comment also required special tokens like SPACES_BEFORE_COMMENT.

This commit decouples newlines from layout tokens and uses the grammar to specify them. This brings several advantages:

* Comments can be handled transparently, since extra tokens are scanned automatically between newline -> layout.
* The parser can catch ambiguous definitions at compile time and require explicit disambiguation.
* The loosened tokens can be used as extra tokens in the grammar to track indentation across all lines. This improves the accuracy of layout tokens and allows some helpers to be removed.

Alongside this, long string content has been replicated in the grammar itself and only the quote handling portion is offloaded.
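To make the layout idea concrete, here is a minimal Python sketch of deriving layout tokens from the column of the first code token on each line, with blank and comment-only lines skipped. It is not the scanner from this PR: the function name `layout_tokens` and the `INDENT` token are assumptions for illustration, and only `DEDENT` and `INDENT_EQ` are names taken from the commit message.

```python
def layout_tokens(source: str) -> list[tuple[int, str]]:
    """Return (line_number, token) pairs for every line that carries code.

    Hypothetical helper: it models layout tracking over whole lines and is
    not how the real external scanner is structured.
    """
    stack = [0]   # columns of the currently open indentation levels
    tokens = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" ")
        # Blank and comment-only lines never influence layout.
        if not stripped or stripped.startswith("#"):
            continue
        column = len(line) - len(stripped)
        if column > stack[-1]:
            # Deeper than the current level: open a new layout.
            stack.append(column)
            tokens.append((lineno, "INDENT"))
        else:
            # Close every level deeper than this line, then continue the
            # level we landed on.
            while column < stack[-1]:
                stack.pop()
                tokens.append((lineno, "DEDENT"))
            tokens.append((lineno, "INDENT_EQ"))
    return tokens


if __name__ == "__main__":
    sample = "if x:\n    # note\n    a\n\n    b\nc\n"
    for lineno, token in layout_tokens(sample):
        print(lineno, token)
```

Because the comment line and the blank line are skipped before any column is measured, no special token is needed for them in this model, which is the simplification the commit message describes.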
A complete rewrite of the grammar, with almost no parts used from the original. The parser can now parse ~97% of all files within the nim-lang/Nim repository. However, the parser is extremely large: 132MiB. Some features are currently disabled to prevent the parser from exceeding tree-sitter's own limit.
Put infix operators as inline within the grammar

This allows us to remove 11 external scanning nodes. Tie breaking against command call & prefix is now done solely on the prefix operator. Symbolic operators are now exposed as nodes to allow syntax queries to match against them.

Due to size explosion, Unicode operators are disabled.

Included are some fixes for scanner flags reset on new line.
improve field accuracy and introduce new fields

This pull includes two changes:

* Make sure fields pinpoint the node of interest.
* Improve field consistency within the grammar and introduce more fields for other complex structures.

This pull contains breaking changes for a few field names.
scanner: allow end on EOF and use column for flag invalidation

Allowing layout_end on EOF lets the parser close and isolate grammar portions where layout_end is typically not allowed (i.e. in parentheses), which enables better error recovery. The empty termination hack does not seem to contribute anymore with this change and was removed.

With this change, we also remove the use of the synchronize node for fixing flags after a newline. Instead, they are cleared based on column position at the start of the lexer. This should reduce errors from node reuse, but is more or less a hunch since it is hard to test incremental parsing.

Fixes alaviss/tree-sitter-nim#64.
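As a rough illustration of the EOF part of this change, the sketch below closes every still-open layout when input runs out, emitting one `layout_end` per level. The function name `close_layouts_at_eof` and the list-of-columns representation are assumptions made for the example; the real scanner's data structures are not shown in this PR page.

```python
def close_layouts_at_eof(open_layouts: list[int]) -> list[str]:
    """Emit one layout_end per open layout above the top level.

    `open_layouts` is a hypothetical stack of indentation columns, with
    the top-level layout (column 0) always at the bottom.
    """
    tokens = []
    while len(open_layouts) > 1:
        open_layouts.pop()           # this level can no longer continue
        tokens.append("layout_end")  # so the parser may close it now
    return tokens


if __name__ == "__main__":
    # Input ended while two nested blocks were still open.
    print(close_layouts_at_eof([0, 2, 4]))
```

The point of the sketch is only that the closing tokens are produced regardless of the surrounding context (for example, an unclosed parenthesis), so the parser gets a chance to finish the inner blocks before it reports the error.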
make type declaration RHS optional

This syntax quirk is utilized by `system.nim`.

Fixes alaviss/tree-sitter-nim#66.
bump version to 0.4.0
In certain scenarios, the parser might crash due to an OOB in tree-sitter during get_column() at EOF. Since tree-sitter still hasn't released a new version with the fix, we will have to solve it here ourselves. Ref tree-sitter/tree-sitter#2563
scanner: add workaround for column at EOF
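The commit does not spell out the workaround here, so the following is only one plausible shape for it: check for end of input before asking for the column. The `Lexer` class is a hypothetical Python stand-in that models the crash as an exception; `safe_column` and its `fallback` parameter are likewise assumptions. In the C scanner API the corresponding members are `lexer->eof` and `lexer->get_column`.

```python
class Lexer:
    """Hypothetical stand-in for the host lexer, not tree-sitter's API."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0

    def eof(self) -> bool:
        return self.pos >= len(self.text)

    def get_column(self) -> int:
        if self.eof():
            # Models the out-of-bounds access described in the commit.
            raise IndexError("column queried at EOF")
        line_start = self.text.rfind("\n", 0, self.pos) + 1
        return self.pos - line_start


def safe_column(lexer: Lexer, fallback: int = 0) -> int:
    """Return the current column, or `fallback` when at end of input."""
    if lexer.eof():
        return fallback
    return lexer.get_column()


if __name__ == "__main__":
    lexer = Lexer("ab\n  c")
    lexer.pos = 5              # on the 'c'
    print(safe_column(lexer))
    lexer.pos = 6              # past the last character
    print(safe_column(lexer))
```

Which fallback value is appropriate depends on how the scanner uses the column afterwards; 0 is used here purely as a placeholder.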
This should align these with their when/case conditional counterparts.
add missing consequence field to object declaration
add missing alternative field to object variants
Since the scanner no longer consumes input when emitting layout tokens, we can recover the scanner state by rescanning the input instead of storing it. This allows the removal of some state invalidation schemes and avoids get_column overhead. This change has some implications for error recovery, as layout termination no longer changes the scanner state. For the regressed cases, it appears that making the body of a section optional was enough to make tree-sitter produce better recoveries.
This would make indentation queries for this case a lot easier to write.
This drops the parser size by 6MiB. An unfortunate consequence is that `else` no longer terminates a case expression, but that feature shouldn't be used by anyone.
tree-sitter regexes are not the same as JS regexes, and "useless" escapes are often required.
This one trick dropped state count/large state count from 20478/10711 to 20193/10568.
This is 1:1 against the way the compiler parses these. The infix parsing grammar had to be duplicated due to the unique position that this matches, which is quite ugly. Unfortunately this syntax is widely used and thus parser support must be provided.
Also added metadata to show Sponsor button on GitHub.
This update required both actions to be updated at the same time.
changes staged for 0.5.0

Notable changes:

- Reduced the amount of states used for tracking layout
- Support for concept without a body
- Support for type(x) expressions at the top level

Shortlog:

Leorize (9):

- remove flag and indentation tracking across scans
- allow concept body to be omitted
- grammar: share if alternatives between if and case
- eslint: disable useless-escape rule
- grammar: factor out for loop body
- support old type(x) expression in statement lists
- update readme for the current project status
- ci: bump upload/download artifact version
- bump version to 0.5.0

Checked on a few popular repositories on GitHub like nitter and jester.
Thanks for the PR! I'm afraid I can't accept this in the current state, the parser is just too big (parser.c is 66MiB). Difftastic already has problems with the git repo being too big, and this parser is bigger than the largest parsers currently included. I'd like to support Nim, but I need a smaller file. Say something smaller than 30MiB.
@Wilfred, if you're open to enabling the You could probably make it seamless for users by bundling a list of known grammars (with file extensions and such) and just store URLs where the corresponding WASM files can be downloaded from. It's a very new Tree-sitter feature, developed for the Zed editor's new extension system, but it works pretty well, and I think it might be well-suited for your use case, and solve the problem of needing to bundle a large set of languages. I know this is off-topic; I just thought I'd mention it here, since this PR was linked from a HN thread.
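For illustration, a bundled list of known grammars could look roughly like the Python sketch below: each entry maps file extensions to a URL from which a WASM build would be fetched. Everything here is hypothetical, including the `GrammarEntry` type, the `grammar_for` helper, and the `example.invalid` URLs; it is not Difftastic's or Zed's actual format.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional


@dataclass(frozen=True)
class GrammarEntry:
    name: str
    extensions: tuple[str, ...]
    wasm_url: str  # placeholder location, not a real download link


# Hypothetical bundled registry; a real list would be far longer.
REGISTRY = (
    GrammarEntry(
        name="nim",
        extensions=(".nim", ".nims", ".nimble"),
        wasm_url="https://example.invalid/tree-sitter-nim.wasm",
    ),
    GrammarEntry(
        name="python",
        extensions=(".py",),
        wasm_url="https://example.invalid/tree-sitter-python.wasm",
    ),
)


def grammar_for(path: str) -> Optional[GrammarEntry]:
    """Look up the grammar entry for a file by its extension."""
    suffix = PurePath(path).suffix.lower()
    for entry in REGISTRY:
        if suffix in entry.extensions:
            return entry
    return None


if __name__ == "__main__":
    entry = grammar_for("src/main.nim")
    print(entry.name if entry else "no grammar known")
```

With a table like this, only the small registry ships with the tool, and the large parser builds are downloaded on demand.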
@maxbrunsfeld ooh, I am very interested in this! The I need (I imagine Zed also needs to associate file extensions with languages, just like difftastic, so maybe you have a solution for that metadata too?)
Yeah, in Zed, extensions are specified via a combination of:
I'm guessing Difftastic would want a slightly different packaging format, because you don't need all of the stuff Zed uses, but I think a similar approach would probably work. For now, these WASM files would need to be hosted somewhere. The WASM mode of compiling parsers isn't widely used yet, but down the road, I'd love to start standardizing on ways that Tree-sitter grammars store the WASM builds and queries. Maybe just GitHub release assets.
Nim (formerly nimrod) is a compiled systems language with type inference, macros, and memory safety. It's becoming more common. Nitter for example uses it.
This patch adds tree-sitter support from https://github.com/alaviss/tree-sitter-nim , which uses the MPL-2 license. If this isn't acceptable, I can try to find another implementation, e.g. alaviss/tree-sitter-nim#11 mentions https://github.com/aMOPel/tree-sitter-nim.