Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dupe and escape span values only if necessary #1244

Merged
merged 2 commits into from
Jul 3, 2019

Conversation

ashmaroli
Copy link
Contributor

@ashmaroli ashmaroli commented Jul 3, 2019

Rationale

String.gsub creates a copy of every string passed to it irrespective of whether a substitution occurs.
This can lead to unnecessary memory consumption.

Summary

Use a private helper to substitute values conditionally.
Invoke String.gsub only if the string matches our pattern beforehand. Otherwise, return the given string itself.

Impact

Local tests show a reduction in values reported by the profile_memory task.

Report diff of Travis logs
--- master https://travis-ci.org/rouge-ruby/rouge/jobs/553590198
+++ PR     https://travis-ci.org/rouge-ruby/rouge/jobs/553607013

- Total allocated: 12.66 MB (123308 objects)
- Total retained:  4.51 MB (31707 objects)
+ Total allocated: 12.56 MB (120831 objects)
+ Total retained:  4.51 MB (31709 objects)

Basic benchmarks show that the proposed route is faster for strings that don't need escaping, but equally slower for strings that do need escaping.

Benchmark script
# frozen_string_literal: true

require 'benchmark/ips'

TABLE_FOR_ESCAPE_HTML = {
  '&' => '&',
  '<' => '&lt;',
  '>' => '&gt;',
}

def escape(value)
  value.gsub(/[&<>]/, TABLE_FOR_ESCAPE_HTML)
end

def cond_escape(value)
  escape_regex = /[&<>]/
  return value unless value =~ escape_regex

  value.gsub(escape_regex, TABLE_FOR_ESCAPE_HTML)
end

[
  ['plain', 'The quick brown badgers jumped over the lazy dog'],
  ['specl', 'The quick brown <fox> jumped & over the lazy dog']
].each do |key, value|
  Benchmark.ips do |x|
    x.report("esc #{key}") { escape(value) }
    x.report("cond_esc #{key}") { cond_escape(value) }
    x.compare!
  end
end
Benchmark results
Warming up --------------------------------------
           esc plain    50.502k i/100ms
      cond_esc plain    52.509k i/100ms
Calculating -------------------------------------
           esc plain    944.667k (A± 1.3%) i/s -      4.747M in   5.026118s
      cond_esc plain      1.056M (A± 0.4%) i/s -      5.303M in   5.024585s

Comparison:
      cond_esc plain:  1055512.1 i/s
           esc plain:   944666.6 i/s - 1.12x  slower

Warming up --------------------------------------
           esc specl    13.415k i/100ms
      cond_esc specl    12.547k i/100ms
Calculating -------------------------------------
           esc specl    169.639k (A± 0.7%) i/s -    858.560k in   5.061389s
      cond_esc specl    154.341k (A± 1.8%) i/s -    777.914k in   5.042006s

Comparison:
           esc specl:   169638.6 i/s
      cond_esc specl:   154341.2 i/s - 1.10x  slower

@jneen
Copy link
Member

jneen commented Jul 3, 2019

Would you mind benchmarking on a very long comment that contains escape characters?

@ashmaroli
Copy link
Contributor Author

Would you mind benchmarking on a very long comment

If you provide a sample string, I shall use that.

@jneen
Copy link
Member

jneen commented Jul 3, 2019

You can find some fairly long comments in spec/visual/samples, maybe add a & inside the comment to the end of it? My concern is that for large single tokens we will scan the whole thing twice. It's possible this is very low impact, but I want to make sure.

@ashmaroli
Copy link
Contributor Author

@jneen Okay, updated the above benchmark script with the following:

require 'yaml'

# set up a long string text

PLAINTEXT = YAML.load(<<-TEXT)
comment: >
  @param [Proc] callback a block that will be evaluated in the context of the lexer
  if `re` matches. This block has access to a number of lexer methods, including
  `RegexLexer#push`, `RegexLexer#pop!`, `RegexLexer#token` et `RegexLexer#delegate`.
  The first argument can be used to access the match groups.
  -endcomment-
TEXT

RAWTEXT = YAML.load(<<-TEXT)
comment: >
  @param [Proc] callback a block that will be evaluated in the context of the lexer
  if `re` matches. This block has access to a number of lexer methods, including
  #<RegexLexer#push>, #<RegexLexer#pop!>, #<RegexLexer#token>, & #<RegexLexer#delegate>.
  The first argument can be used to access the match groups.
  <endcomment>
TEXT

[
  ['plain', PLAINTEXT['comment']],
  ['specl', RAWTEXT['comment']]
].each do |key, value|
  Benchmark.ips do |x|
    ...

The result are similar to previous ones:

Warming up --------------------------------------
           esc plain    39.150k i/100ms
      cond_esc plain    41.220k i/100ms
Calculating -------------------------------------
           esc plain    603.987k (A± 0.7%) i/s -      3.054M in   5.056187s
      cond_esc plain    666.529k (A± 0.5%) i/s -      3.339M in   5.009407s

Comparison:
      cond_esc plain:   666529.2 i/s
           esc plain:   603986.5 i/s - 1.10x  slower

Warming up --------------------------------------
           esc specl     6.219k i/100ms
      cond_esc specl     5.802k i/100ms
Calculating -------------------------------------
           esc specl     68.860k (A± 0.6%) i/s -    348.264k in   5.057804s
      cond_esc specl     64.418k (A± 0.6%) i/s -    324.912k in   5.044026s

Comparison:
           esc specl:    68859.7 i/s
      cond_esc specl:    64417.6 i/s - 1.07x  slower

@jneen
Copy link
Member

jneen commented Jul 3, 2019

Okay! I was thinking about the occasional pages-long comment that you will see on gitlab, etc, but honestly I think that's rare enough that the tradeoff is worth it. If you could add a comment explaining that it's done that way for performance reasons, then it's +1 to merge from me.

@pyrmont pyrmont merged commit 43615a6 into rouge-ruby:master Jul 3, 2019
@ashmaroli ashmaroli deleted the conditional-escapes branch July 4, 2019 09:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants