Merge decaffinated master into rebrand #29

Merged
merged 46 commits into from
Sep 4, 2022
46 commits
477b075
➡️ Migrate all language packages
icecream17 Jun 25, 2022
3399c06
lint
icecream17 Jun 26, 2022
18164b3
Bumped all languages to recent tree-sitter
mauricioszabo Jun 26, 2022
7299966
Bump Atom's tree-sitter
mauricioszabo Jun 26, 2022
ca062aa
remove all references to atom or atom community
Aug 7, 2022
9ccd895
Change the Discord links
Aug 7, 2022
bb658d9
Decaffeinate.
fabianfiorotto Aug 5, 2022
f6331c3
Merge remote-tracking branch 'origin/master' into bump-tree-sitter
mauricioszabo Aug 12, 2022
a61fd8c
Merge remote-tracking branch 'origin/master' into bump-tree-sitter
mauricioszabo Aug 12, 2022
3953211
Simpler way of running tests
mauricioszabo Aug 12, 2022
fd5028a
Fixed package-lock name
mauricioszabo Aug 13, 2022
63d1f26
Fixed tokenization for method invocation
mauricioszabo Aug 13, 2022
96c087f
Disabled an outdated test, and one that is a bug on tree-sitter
mauricioszabo Aug 13, 2022
59c0795
Adding some development notes
mauricioszabo Aug 14, 2022
23f08de
Fixed typo
mauricioszabo Aug 14, 2022
56ef773
Code-fying doc
mauricioszabo Aug 14, 2022
61b9df4
Merge remote-tracking branch 'origin/master' into bump-tree-sitter
mauricioszabo Aug 16, 2022
9eba102
Javascript fix
mauricioszabo Aug 16, 2022
ca3f822
Merge remote-tracking branch 'origin/master' into bump-tree-sitter
mauricioszabo Aug 16, 2022
e2d7a00
Update autocomplete-html
mauricioszabo Aug 16, 2022
16ad343
Lockfile update
mauricioszabo Aug 16, 2022
6e24438
Merge pull request #11 from stech11845/stech11845-patch-1
mauricioszabo Aug 18, 2022
6e42472
Merge pull request #14 from pulsar-edit/bump-tree-sitter
mauricioszabo Aug 18, 2022
e7f345d
Merge branch 'master' into decaffeinate
fabianfiorotto Aug 19, 2022
95bbdef
Fix some errors.
fabianfiorotto Aug 23, 2022
d56d37d
Merge pull request #13 from fabianfiorotto/decaffeinate
mauricioszabo Aug 23, 2022
9faae4f
Merge remote-tracking branch 'origin/master' into rebrand
Spiker985 Aug 23, 2022
c6c175d
Update docs/dev/README.md
confused-Techie Aug 29, 2022
02a42e5
Update packages/language-c/CONTRIBUTING.md
confused-Techie Aug 29, 2022
4b44a9c
Update packages/language-git/CONTRIBUTING.md
confused-Techie Aug 29, 2022
53afd13
Update packages/language-git/README.md
confused-Techie Aug 29, 2022
e53d789
Update packages/language-git/README.md
confused-Techie Aug 29, 2022
b083210
Update packages/language-git/README.md
confused-Techie Aug 29, 2022
ce27885
Update packages/language-c/README.md
confused-Techie Aug 29, 2022
34d8d80
Update packages/language-c/README.md
confused-Techie Aug 29, 2022
917d0b0
Update packages/language-clojure/README.md
confused-Techie Aug 29, 2022
26d12ca
Update packages/language-clojure/README.md
confused-Techie Aug 29, 2022
1143f35
Update packages/language-coffee-script/README.md
confused-Techie Aug 29, 2022
85a2c77
Update packages/language-coffee-script/README.md
confused-Techie Aug 29, 2022
b9d0df2
Update packages/language-csharp/README.md
confused-Techie Aug 29, 2022
32c267f
Update packages/language-csharp/README.md
confused-Techie Aug 29, 2022
a65ad52
Update packages/language-css/CONTRIBUTING.md
confused-Techie Aug 29, 2022
75c1a18
Update packages/language-css/README.md
confused-Techie Aug 29, 2022
f19b1a3
Update packages/language-css/README.md
confused-Techie Aug 29, 2022
9f3e1f0
Update packages/language-gfm/README.md
confused-Techie Aug 29, 2022
b0cf0d2
Update packages/language-gfm/CONTRIBUTING.md
confused-Techie Aug 29, 2022
Adding some development notes
mauricioszabo committed Aug 14, 2022
commit 59c0795c8a25fd23b03297a524ce0a28b1aa7bba
5 changes: 5 additions & 0 deletions docs/dev/README.md
@@ -0,0 +1,5 @@
# Development README

In this directory, we can document what we have found out about how things work, and how we want to handle them in the future.

- [Tree Sitter](tree-sitter.md), the tokenizer for the Atom Text Editor
confused-Techie marked this conversation as resolved.
103 changes: 103 additions & 0 deletions docs/dev/tree-sitter.md
@@ -0,0 +1,103 @@
# Tree Sitter in Pulsar

Tree-sitter is a parser that uses native modules. The idea is that each language's grammar generates an AST of the source code, and Pulsar then tokenizes that AST in the editor using rules defined in CSON files (which somewhat resemble CSS selectors).

## Debugging a Grammar

From inside Pulsar, it is possible to require Tree-Sitter and try to parse some code with a grammar. To do this, run this code in DevTools:

```js
const Parser = require('tree-sitter');
const Java = require('tree-sitter-java');

const parser = new Parser();
parser.setLanguage(Java);

// Parse a small Java fragment; `tree` holds the resulting syntax tree.
const tree = parser.parse(`
class A {
  void func() {
    obj.func2("arg");
  }
}
`);
console.log(tree.rootNode.toString());
```

This will create a parser, set its language to Java, and try to parse the source code that we sent. This specific fragment of code will print:

```
(program
  (class_declaration
    name: (identifier)
    body: (class_body
      (method_declaration
        type: (void_type)
        name: (identifier)
        parameters: (formal_parameters)
        body: (block
          (expression_statement
            (method_invocation
              object: (identifier)
              name: (identifier)
              arguments: (argument_list
                (string_literal)))))))))
```

I pretty-printed the output manually. Basically, it says that the "root node" is a `program` that contains a `class_declaration`, which in turn holds the class's name, then its body, and so on.
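The indentation can also be done by a small helper instead of by hand. The sketch below assumes that `toString()` returns the whole S-expression on a single line with balanced parentheses and only node types and field names inside; `prettyPrintSexp` is a hypothetical name, not something provided by Tree-Sitter or Pulsar.

```js
// Re-indents a one-line S-expression such as the string returned by
// `tree.rootNode.toString()`. `prettyPrintSexp` is a hypothetical helper,
// not part of tree-sitter or Pulsar.
function prettyPrintSexp(sexp, indent = '  ') {
  // Tokens: an opening paren (with an optional "field: " prefix),
  // a closing paren, or a bare node type name.
  const tokens = sexp.match(/(?:\w+:\s*)?\(|\)|[^\s()]+/g) || [];
  let depth = 0;
  let out = '';
  for (const token of tokens) {
    if (token.endsWith('(')) {
      // Every node after the first starts on its own, indented line.
      if (out.length > 0) out += '\n' + indent.repeat(depth);
      out += token.replace(/:\s*\(/, ': (');
      depth += 1;
    } else if (token === ')') {
      depth -= 1;
      out += ')';
    } else {
      out += token;
    }
  }
  return out;
}

// Sample input written by hand for this sketch.
const sample =
  '(method_invocation object: (identifier) name: (identifier) ' +
  'arguments: (argument_list (string_literal)))';
console.log(prettyPrintSexp(sample));
```

In DevTools, the same helper could be applied to the parse result with `console.log(prettyPrintSexp(tree.rootNode.toString()))`.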

## Modern tree-sitter

If you look at the AST above, you'll see that there are things inside parentheses and things like `name: ` and `body: `. The latter are what Tree-Sitter now calls "field names", and Pulsar is not yet using them anywhere. This is problematic for multiple reasons, but the main one is that tokenization goes wrong: for example, in the code above, we want to tokenize `obj.func2("arg")` by marking `func2` as a function that's being called, but the AST for that fragment is:

```
(method_invocation
  object: (identifier)
  name: (identifier)
  arguments: (argument_list (string_literal)))
```

What disambiguates the method name from other things is the field name: `obj` has the field name `object`, and `func2` has the field name `name`. As Pulsar is not reading field names, the closest match we can get is:

```cson
'method_invocation > identifier': 'entity.name.function'
```

But unfortunately, this does not solve the issue - both `obj` and `func2` are tokenized as functions in this case.

### Fixing this

`src/tree-sitter-language-mode.js` is where the syntax tree is walked to generate tokens. It basically has methods like `seek`, `_moveDown`, etc. that `.push` some token into `containingNodeTypes` and other local fields. Later, these are tokenized via `_currentScopeId`, which basically tries to match the rule we're in against the `this.languageLayer.grammar.scopeMap` data structure.

This data structure is defined in `src/syntax-scope-map.js`, and contains an `anonymousScopeTable` (that is, AFAIK, a list of words that are always tokenized the same way; think of the language's "keywords") and a `namedScopeTable` (which, surprisingly, does not handle the "field name" even though it has `name` in it). This is basically a "leaf first" structure. So, tokenizing `obj.func("a string")`, we would get:

1. `method_invocation`, which gets `push`ed into `containingNodeTypes`, then we "move down"
1. `identifier` (for `obj`), which also gets `push`ed into `containingNodeTypes`
1. We check the token ID, then "move right"
1. `identifier` (for `func`) replaces the sibling's `identifier` that was pushed before; we check the token ID and "move right" again
1. Repeat the process from the beginning, but for `argument_list` instead of `method_invocation` (replace the sibling's `identifier` with `argument_list`, then move down to push `string_literal`)
1. Finally "move up", `pop`ping the `string_literal`, then `argument_list`, and finally `method_invocation`, and continue walking the rest of the AST

To get the token ID, we walk through the data structure, checking things as we go. So for example, in this case, after `push`ing things for `obj`, we have inside `containingNodeTypes`: `['method_invocation', 'identifier']`. We have this same structure for `func`.
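The walk described above can be sketched with plain objects standing in for Tree-Sitter nodes. This is a simplified model rather than the code in `src/tree-sitter-language-mode.js`: the `methodInvocation` object and the `collectNodeTypePaths` function are hypothetical, and anonymous nodes such as `.` and the parentheses are left out.

```js
// Hypothetical plain-object stand-in for the nodes of `obj.func("a string")`.
const methodInvocation = {
  type: 'method_invocation',
  children: [
    { type: 'identifier', text: 'obj', children: [] },
    { type: 'identifier', text: 'func', children: [] },
    {
      type: 'argument_list',
      children: [{ type: 'string_literal', text: '"a string"', children: [] }],
    },
  ],
};

// Records the stack of node types that is active when each leaf is reached.
function collectNodeTypePaths(node, containingNodeTypes = [], paths = []) {
  containingNodeTypes.push(node.type); // entering a node ("move down" / "move right")
  if (node.children.length === 0) {
    // A leaf: this is the point where the token ID would be looked up.
    paths.push({ text: node.text, containingNodeTypes: [...containingNodeTypes] });
  }
  for (const child of node.children) {
    collectNodeTypePaths(child, containingNodeTypes, paths);
  }
  containingNodeTypes.pop(); // leaving the node ("move up" or replaced by a sibling)
  return paths;
}

for (const { text, containingNodeTypes } of collectNodeTypePaths(methodInvocation)) {
  console.log(text, containingNodeTypes);
}
```

Both `obj` and `func` end up with `['method_invocation', 'identifier']`, because nothing in the stack records which field of the parent each child occupies.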

If we look at the `scopeMap` structure, inside `namedScopeTable`, we'll see something like:

```js
identifier: {
  parents: {
    method_invocation: {
      result: ["entity.name.function", ...]
    }
  }
}
```

And this is how tokenization is done. It is also how the bug appears: both `func` and `obj` have the same `containingNodeTypes`.
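A minimal sketch of such a leaf-first lookup makes the collision visible. It is not the real `get` method in `src/syntax-scope-map.js`: `lookupScope` is a hypothetical function, the table is reduced to a single rule, and `result` is a plain string here to keep the example short.

```js
// Simplified, hypothetical stand-in for `namedScopeTable`.
const namedScopeTable = {
  identifier: {
    parents: {
      method_invocation: { result: 'entity.name.function' },
    },
  },
};

// Starts at the leaf (last entry) and follows `parents` towards the root,
// keeping the result of the deepest rule that matched.
function lookupScope(table, containingNodeTypes) {
  let i = containingNodeTypes.length - 1;
  let entry = table[containingNodeTypes[i]];
  let result;
  while (entry) {
    if (entry.result !== undefined) result = entry.result;
    i -= 1;
    entry = i >= 0 && entry.parents ? entry.parents[containingNodeTypes[i]] : undefined;
  }
  return result;
}

const pathForObj = ['method_invocation', 'identifier'];
const pathForFunc = ['method_invocation', 'identifier'];
console.log(lookupScope(namedScopeTable, pathForObj));  // entity.name.function
console.log(lookupScope(namedScopeTable, pathForFunc)); // entity.name.function
```

Since the two paths are identical arrays of node types, no lookup that only sees them can return different scopes for `obj` and `func`.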


### Possible solution

Make `src/syntax-scope-map.js` aware of "named fields" (we can do that by checking the `cursor.currentFieldName` or by `push`ing the `this.treeCursor.currentFieldName`), then match things correctly.

We will also need to decide on a syntax for field names in the CSON file, and parse that syntax into the `namedScopeTable`.

Finally, we'll need to change the `get` method of `SyntaxScopeMap` so that it matches correctly and returns tokenization filtered by the field name.
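One way this could look is sketched below, purely as an illustration. Everything in it is an assumption: the `fieldName:type` key format, the `{ type, fieldName }` path entries, and the `keysFor`, `pick`, and `lookupScopeWithFields` helpers are hypothetical, and no CSON syntax has been decided yet.

```js
// Hypothetical field-aware table: the leaf key is "fieldName:type".
const fieldAwareTable = {
  'name:identifier': {
    parents: {
      method_invocation: { result: 'entity.name.function' },
    },
  },
};

// Keys to try for one path entry, most specific first.
function keysFor({ type, fieldName }) {
  return fieldName ? [`${fieldName}:${type}`, type] : [type];
}

// Returns the first table entry matching any of the candidate keys.
function pick(table, pathEntry) {
  for (const key of keysFor(pathEntry)) {
    if (table[key]) return table[key];
  }
  return undefined;
}

// Leaf-first lookup over entries that carry both node type and field name.
function lookupScopeWithFields(table, path) {
  let i = path.length - 1;
  let entry = pick(table, path[i]);
  let result;
  while (entry) {
    if (entry.result !== undefined) result = entry.result;
    i -= 1;
    entry = i >= 0 && entry.parents ? pick(entry.parents, path[i]) : undefined;
  }
  return result;
}

// The field names are the ones shown in the AST for `obj.func("a string")`.
const objPath = [
  { type: 'method_invocation', fieldName: null },
  { type: 'identifier', fieldName: 'object' },
];
const funcPath = [
  { type: 'method_invocation', fieldName: null },
  { type: 'identifier', fieldName: 'name' },
];
console.log(lookupScopeWithFields(fieldAwareTable, objPath));  // undefined
console.log(lookupScopeWithFields(fieldAwareTable, funcPath)); // entity.name.function
```

Falling back to the plain node type when no field-qualified key exists would keep existing rules working. The sketch picks the first matching key greedily and never backtracks, which a real implementation might need to reconsider.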
2 changes: 1 addition & 1 deletion package.json
@@ -7,7 +7,7 @@
"id": "pulsar",
"name": "Pulsar",
"urlWeb": "https://atom.io/",
-"urlGH": "https://github.com/pulsar-edit",
+"urlGH": "https://github.com/pulsar-edit"
},
"main": "./src/main-process/main.js",
"repository": {