Dupe and escape span values only if necessary #1244

ashmaroli · 2019-07-03T06:14:52Z

Rationale

String.gsub creates a copy of every string passed to it irrespective of whether a substitution occurs.
This can lead to unnecessary memory consumption.

Summary

Use a private helper to substitute values conditionally.
Invoke String.gsub only if the string matches our pattern beforehand. Otherwise, return the given string itself.

Impact

Local tests show a reduction in values reported by the profile_memory task.

Report diff of Travis logs

--- master https://travis-ci.org/rouge-ruby/rouge/jobs/553590198
+++ PR     https://travis-ci.org/rouge-ruby/rouge/jobs/553607013

- Total allocated: 12.66 MB (123308 objects)
- Total retained:  4.51 MB (31707 objects)
+ Total allocated: 12.56 MB (120831 objects)
+ Total retained:  4.51 MB (31709 objects)

Basic benchmarks show that the proposed route is faster for strings that don't need escaping, but equally slower for strings that do need escaping.

Benchmark script

# frozen_string_literal: true

require 'benchmark/ips'

TABLE_FOR_ESCAPE_HTML = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
}

def escape(value)
  value.gsub(/[&<>]/, TABLE_FOR_ESCAPE_HTML)
end

def cond_escape(value)
  escape_regex = /[&<>]/
  return value unless value =~ escape_regex

  value.gsub(escape_regex, TABLE_FOR_ESCAPE_HTML)
end

[
  ['plain', 'The quick brown badgers jumped over the lazy dog'],
  ['specl', 'The quick brown <fox> jumped & over the lazy dog']
].each do |key, value|
  Benchmark.ips do |x|
    x.report("esc #{key}") { escape(value) }
    x.report("cond_esc #{key}") { cond_escape(value) }
    x.compare!
  end
end

Benchmark results

Warming up --------------------------------------
           esc plain    50.502k i/100ms
      cond_esc plain    52.509k i/100ms
Calculating -------------------------------------
           esc plain    944.667k (A± 1.3%) i/s -      4.747M in   5.026118s
      cond_esc plain      1.056M (A± 0.4%) i/s -      5.303M in   5.024585s

Comparison:
      cond_esc plain:  1055512.1 i/s
           esc plain:   944666.6 i/s - 1.12x  slower

Warming up --------------------------------------
           esc specl    13.415k i/100ms
      cond_esc specl    12.547k i/100ms
Calculating -------------------------------------
           esc specl    169.639k (A± 0.7%) i/s -    858.560k in   5.061389s
      cond_esc specl    154.341k (A± 1.8%) i/s -    777.914k in   5.042006s

Comparison:
           esc specl:   169638.6 i/s
      cond_esc specl:   154341.2 i/s - 1.10x  slower

jneen · 2019-07-03T07:30:18Z

Would you mind benchmarking on a very long comment that contains escape characters?

ashmaroli · 2019-07-03T07:32:54Z

Would you mind benchmarking on a very long comment

If you provide a sample string, I shall use that.

jneen · 2019-07-03T08:11:39Z

You can find some fairly long comments in spec/visual/samples, maybe add a & inside the comment to the end of it? My concern is that for large single tokens we will scan the whole thing twice. It's possible this is very low impact, but I want to make sure.

ashmaroli · 2019-07-03T08:43:07Z

@jneen Okay, updated the above benchmark script with the following:

require 'yaml'

# set up a long string text

PLAINTEXT = YAML.load(<<-TEXT)
comment: >
  @param [Proc] callback a block that will be evaluated in the context of the lexer
  if `re` matches. This block has access to a number of lexer methods, including
  `RegexLexer#push`, `RegexLexer#pop!`, `RegexLexer#token` et `RegexLexer#delegate`.
  The first argument can be used to access the match groups.
  -endcomment-
TEXT

RAWTEXT = YAML.load(<<-TEXT)
comment: >
  @param [Proc] callback a block that will be evaluated in the context of the lexer
  if `re` matches. This block has access to a number of lexer methods, including
  #<RegexLexer#push>, #<RegexLexer#pop!>, #<RegexLexer#token>, & #<RegexLexer#delegate>.
  The first argument can be used to access the match groups.
  <endcomment>
TEXT

[
  ['plain', PLAINTEXT['comment']],
  ['specl', RAWTEXT['comment']]
].each do |key, value|
  Benchmark.ips do |x|
    ...

The result are similar to previous ones:

Warming up --------------------------------------
           esc plain    39.150k i/100ms
      cond_esc plain    41.220k i/100ms
Calculating -------------------------------------
           esc plain    603.987k (A± 0.7%) i/s -      3.054M in   5.056187s
      cond_esc plain    666.529k (A± 0.5%) i/s -      3.339M in   5.009407s

Comparison:
      cond_esc plain:   666529.2 i/s
           esc plain:   603986.5 i/s - 1.10x  slower

Warming up --------------------------------------
           esc specl     6.219k i/100ms
      cond_esc specl     5.802k i/100ms
Calculating -------------------------------------
           esc specl     68.860k (A± 0.6%) i/s -    348.264k in   5.057804s
      cond_esc specl     64.418k (A± 0.6%) i/s -    324.912k in   5.044026s

Comparison:
           esc specl:    68859.7 i/s
      cond_esc specl:    64417.6 i/s - 1.07x  slower

jneen · 2019-07-03T15:32:56Z

Okay! I was thinking about the occasional pages-long comment that you will see on gitlab, etc, but honestly I think that's rare enough that the tradeoff is worth it. If you could add a comment explaining that it's done that way for performance reasons, then it's +1 to merge from me.

Dupe and escape span values only if necessary

89b09be

Add a comment detailing the intent behind method

3b05244

pyrmont merged commit 43615a6 into rouge-ruby:master Jul 3, 2019

ashmaroli deleted the conditional-escapes branch July 4, 2019 09:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dupe and escape span values only if necessary #1244

Dupe and escape span values only if necessary #1244

ashmaroli commented Jul 3, 2019 •

edited

Loading

jneen commented Jul 3, 2019

ashmaroli commented Jul 3, 2019

jneen commented Jul 3, 2019

ashmaroli commented Jul 3, 2019

jneen commented Jul 3, 2019

Dupe and escape span values only if necessary #1244

Dupe and escape span values only if necessary #1244

Conversation

ashmaroli commented Jul 3, 2019 • edited Loading

Rationale

Summary

Impact

jneen commented Jul 3, 2019

ashmaroli commented Jul 3, 2019

jneen commented Jul 3, 2019

ashmaroli commented Jul 3, 2019

jneen commented Jul 3, 2019

ashmaroli commented Jul 3, 2019 •

edited

Loading