Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON rewrite #1029

Merged
merged 9 commits into from
Jul 19, 2019
Merged

JSON rewrite #1029

merged 9 commits into from
Jul 19, 2019

Conversation

miparnisari
Copy link
Contributor

image

@miparnisari
Copy link
Contributor Author

Fixes #819

@ffes
Copy link

ffes commented Jan 24, 2019

I hope one of the members of this project finds the time to review this

@pyrmont pyrmont added the needs-review The PR needs to be reviewed label May 29, 2019
@pyrmont
Copy link
Contributor

pyrmont commented May 29, 2019

@miparnisari I'm sorry this has been outstanding for so long :( I'm working my way through the backlog and hope to get to this soon!

lib/rouge/lexers/json.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/json.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/json.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/json.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/json.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/json.rb Outdated Show resolved Hide resolved
@pyrmont pyrmont added author-action The PR has been reviewed but action by the author is needed and removed needs-review The PR needs to be reviewed labels May 29, 2019
@jneen
Copy link
Member

jneen commented May 29, 2019

For the json lexer in particular, performance on large inputs is important. Would you mind testing with larger data? I believe if you search the issues for "json" you will find a few issues that prompted us to make the lexer extremely simple.

Also @pyrmont for avoiding future performance issues I wonder if it might be useful to have some kind of max_length option for the lex - where if the input is longer than the max length, it will give up and highlight the rest as Text. For use cases like Gitlab where they can't control the code this sort of thing could be very helpful.

@pyrmont
Copy link
Contributor

pyrmont commented May 30, 2019

I had a look through the issues and while I think I found the ones that @jneen is referring to (see #41, #114, #244), they don't actually include an example of the JSON being tested.

I made one using Mockaroo and have put it in a gist. It's not as large as the 200KB one referenced in #41 but you could always just double it if a larger file is desired.

I'm personally curious what the speed difference is with newer versions of Ruby and will try to give it a spin a bit later.

@pyrmont
Copy link
Contributor

pyrmont commented May 30, 2019

@jneen Yeah, a max_length option sounds like an idea worth serious consideration. Will have a bit of a think about this, too.

@jneen
Copy link
Member

jneen commented May 30, 2019

Now that I look closely, I think #525 is relevant (the actual issue there was with the JSON lexer)

@jneen
Copy link
Member

jneen commented May 30, 2019

the offending case was a string that contained a large number of backslashes: https://github.com/jneen/rouge/files/410135/bad_note.txt

@miparnisari
Copy link
Contributor Author

I'm personally curious what the speed difference is with newer versions of Ruby and will try to give it a spin a bit later.

Can you please edit the README to include instructions on how to measure performance of two lexers? Or even in the PR template.

@pyrmont
Copy link
Contributor

pyrmont commented May 30, 2019

@miparnisari wrote:

Can you please edit the README to include instructions on how to measure performance of two lexers? Or even in the PR template.

I don't think generally we'd expect submissions to include speedtests; this felt like a special case because of: (1) the prevalence of JSON; and (2) the fact JSON files might realistically be quite large.

That said, providing some 'official' support for benchmarking doesn't sound like a bad idea. I was planning on using the Shootout library mentioned in #114 and will try to see how simple it can be made to test different versions of Rouge (it's really designed to benchmark Rouge against different syntax highlighting libraries which isn't quite what I'm concerned with). To compare versions of Rouge, I was anticipating switching between git branches and then running the tests.

Oh, and as a bonus, I realised the 200KB JSON file is in there.

@miparnisari
Copy link
Contributor Author

a max_length option sounds like an idea worth serious consideration. Will have a bit of a think about this, too.

Agree. I know my company has this logic in place - if the file is too big, it bails out and renders it as plain text.

I don't think generally we'd expect submissions to include speedtests;

Agree. It could be an optional input of the PR.

For the json lexer in particular, performance on large inputs is important. Would you mind testing with larger data?

I tried the 200 kb file. When I use rackup, both this and the existing lexer take around 1.5 sec. (I hope this is correct, I looked at the last number in the request log)

@pyrmont pyrmont added needs-review The PR needs to be reviewed and removed author-action The PR has been reviewed but action by the author is needed labels May 30, 2019
@pyrmont
Copy link
Contributor

pyrmont commented May 30, 2019

I got Shootout working and ran it against both this PR and the current master.

                      master               PR #1029
JSON (217 kB)
=> terminal           1209 kB/s            1039 kB/s
=> html               1454 kB/s            1203 kB/s

There's definitely a slow down: terminal is 14% slower and HTML is 17% slower. That's not great but I'm inclined to think this is acceptable given the fact this has been requested multiple times and does seem like an overall usability win.

What do you think @jneen?

@pyrmont pyrmont added maintainer-action The PR has been reviewed but action by a maintainer is required and removed needs-review The PR needs to be reviewed labels May 30, 2019
@jneen
Copy link
Member

jneen commented May 31, 2019

The code for rouge.jneen.net (in this same org) contains a class that can be used to run different versions of rouge in the same runtime, which might be helpful.

@pyrmont
Copy link
Contributor

pyrmont commented Jul 9, 2019

@miparnisari Thanks for keeping this up to date. I plan to have a look tonight at the case @jneen linked to earlier. I never did actually check that.

@pyrmont pyrmont added needs-review The PR needs to be reviewed and removed maintainer-action The PR has been reviewed but action by a maintainer is required labels Jul 9, 2019
@pyrmont
Copy link
Contributor

pyrmont commented Jul 10, 2019

@miparnisari Sorry, I didn't mean to submit two commits in quick succession but as I was writing a message to explain why the code didn't support highlighting what I'm calling 'bare' key-value pairs for performance reasons, I realised there was a way to do it that didn't involve excess backtracking.

In summary, these commits makes several changes to the proposed lexer:

  • First, they change the tokens used in JavaScript objects. It was proposed that keys be tokenised as Str::Double and that string values be tokenised as Name::Tag. The token Name::Label seems a more appropriate token for keys while Str::Double should be used for string values.

  • Second, they remove the tokenisation of escape sequences inside object keys. While this tokenisation was correct insofar as a JSON parser would parse the string, it produces an odd result where Name::Label token can be interrupted by Str::Escape tokens.

  • Third, they add support for 'bare' key-value pairs, that is text of the form <string>: <value>. This preserves backwards compatibility with how the JSON lexer previously worked. These 'bare' key-value pairs are tokenised roughly the same way as they would be within an object. The difference is that escaped characters are not tokenised differently to other characters in a string. This seems an acceptable trade-off to me given that bare strings are not correct JSON in the first place and so how they should be tokenised outside of an object is not really defined.

Hopefully that all makes sense. I'll try to add some more examples tonight to the visual sample to demonstrate how this works.

lib/rouge/lexers/json.rb Outdated Show resolved Hide resolved
@pyrmont
Copy link
Contributor

pyrmont commented Jul 11, 2019

@miparnisari My changes aren't working for subsequent bare key-value pairs. Need to have more of a think about this one.

miparnisari and others added 7 commits July 12, 2019 16:50
This commit makes several changes to the proposed lexer.

First, it changes the tokens used in JavaScript objects. It was proposed
that keys be tokenised as `Str::Double` and that string values be
tokenised as `Name::Tag`. The token `Name::Label` seems a more
appropriate token for keys while `Str::Double` should be used for
string values.

Second, it remoeves the tokenisation of escape sequences inside object
keys. While this is correct insofar as the keys are merely strings, it
fits awkwardly with the tokenisation of keys with a non-string token.
This commit tokenises all characters within the key as `Name::Label`.

Third, it adds support for 'bare' key-value pairs, that is text of the
form <string>: <value>. These 'bare' key-value pairs are not tokenised
the same way as they would be within objects. Rather, keys are tokenised
as ordinary strings. This preserves backwards compatability with how the
JSON lexer previously worked.
This commit provides support for tokenising bare keys as `Name::Label`
without introducing backtracking for non-key strings.
@pyrmont
Copy link
Contributor

pyrmont commented Jul 12, 2019

@miparnisari Sorry for overwriting your repository. It was the only way I could see to get the branch caught up to the latest version. I needed to do that because the HOCON lexer (which did not exist at the time you created this branch) subclasses the JSON lexer. Since we're pretty radically changing how it works, that broke that lexer.

It's all fixed now and looks to be making all those keys look nice and pretty :)

spec/lexers/json_spec.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/hocon.rb Outdated Show resolved Hide resolved
lib/rouge/lexers/hocon.rb Show resolved Hide resolved
@pyrmont pyrmont added author-action The PR has been reviewed but action by the author is needed and removed needs-review The PR needs to be reviewed labels Jul 18, 2019
@pyrmont pyrmont merged commit 028f392 into rouge-ruby:master Jul 19, 2019
@pyrmont pyrmont removed the author-action The PR has been reviewed but action by the author is needed label Jul 19, 2019
@MaximeKjaer MaximeKjaer mentioned this pull request Nov 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants