Reduce block overhead #6005

Merged: 14 commits merged into master from reduce_block_overhead on Dec 20, 2019
Conversation

headius (Member) commented on Dec 20, 2019

This PR contains several improvements to block dispatch to reduce overhead.

  • Make Block.type final, to take advantage of final optimizations in some JVM JITs. If it needs to be modified, the Block should be cloned. Most blocks will never need this since they're always normal or always lambda.
  • Move BlockBody.evalType to Block and make it a final field. This eliminates the ThreadLocal in IRBlockBody and avoids constantly updating it back to normal, which was done every call to any IR block in the system.
  • Make escapeBlock final. This is only ever set at construction time and never modified again. Setting it here allows final optimizations to reduce loads when checking for escaped blocks.
  • Reset and recalculate IRScope flags before running AddCallProtocol. This allows ACP to reduce how much framing/scoping is needed when other passes (like DeadCode) have removed code. Previously the flags would be tainted by the earlier calculation and never cleared after optimizations had run.
  • Enable indy-based direct block yielding. This allows a monomorphic yield to be inlined into its caller (see the sketch after this list). This feature was available before behind a property, but was not enabled by default due to the lack of guards against polymorphic blocks.
  • Disable indy-based yielding when a polymorphic block is detected. This can be expanded in the future to a PIC for small-morphic yields, but the long-term solution will be using IR inlining or method splitting to keep yields naturally monomorphic.
  • Minor improvements to Proc#call to reduce how many levels of calls are necessary and to arity-split for reduced argument overhead along some paths.
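
As a rough illustration of the monomorphic/polymorphic distinction the direct-yield guard cares about, here is a minimal Ruby sketch (the method and blocks are invented for illustration, not taken from the PR):

# A yield site stays monomorphic when it only ever sees one block body,
# which is the case the indy-based direct yield can bind and inline.
def each_twice
  yield 1
  yield 2
end

# Monomorphic: the same literal block reaches the yield site every time.
1_000.times { each_twice { |x| x + 1 } }

# Polymorphic: three different block bodies reach the same yield site,
# the situation the PR detects and falls back to the non-indy path for.
[proc { |x| x * 2 }, proc { |x| x.to_s }, proc { |x| -x }].each do |blk|
  each_twice(&blk)
end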

This is a step toward separating the unique lambda logic for blocks from the much more common normal logic. Specifically, this change is intended to work toward eliminating constant checks of block type for e.g. arity checking in argument processing.

One flag was being lost due to zero-arg calls not propagating the procNew flag.

There's still argument boxing happening, but a couple of unnecessary checks disappear in the arity != 1 paths.
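
The "constant checks of block type for e.g. arity checking" mentioned above exist because, at the Ruby level, normal (proc-style) blocks and lambdas treat mismatched arguments differently. A small sketch of the visible difference (standard Ruby behavior, not code from this PR):

# Normal blocks/procs adjust arguments loosely; lambdas enforce arity strictly,
# so argument processing has to know which kind of block it is dispatching to.
normal = proc   { |a, b| [a, b] }
strict = lambda { |a, b| [a, b] }

p normal.call(1)        # => [1, nil]   missing args padded with nil
p normal.call(1, 2, 3)  # => [1, 2]     extra args dropped
begin
  strict.call(1)
rescue ArgumentError => e
  puts e.message        # lambdas raise on arity mismatch
end
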
headius added this to the JRuby 9.2.10.0 milestone on Dec 20, 2019
headius (Member, Author) commented on Dec 20, 2019

Some numbers to go with this. Note that the indy direct yield is still only working for monomorphic blocks, but expanding it to other forms will happen soon.

For a simple benchmark of a times-style loop:

Source:

class Integer
  def times(&block)
    i = 0
    while i < self
      yield i
      i+=1
    end
  end
end
loop {
  t = Time.now
  100_000_000.times {}
  puts Time.now - t
}

JRuby 9.2.9 on Java 8, no indy yield:

2.992584
2.7188019999999997
2.445827
2.457916
2.507064
2.468093
2.469516
2.481936
2.487838
2.457182

JRuby 9.2.9 on Java 8, with indy yield:

1.740728
1.609607
1.59323
1.5898210000000002
1.559908
1.5705019999999998
1.5718400000000001
1.557177
1.58254
1.582277

This PR on Java 8 (indy yield on by default):

0.961075
0.772228
0.454656
0.478141
0.45460700000000004
0.42695
0.430082
0.427406
0.439078
0.42669799999999997

These optimizations also help JITs like Graal optimize the entire times body along with the block.

JRuby 9.2.9 on GraalVM CE with indy yield:

1.747857
1.614819
1.622004
1.571566
1.554442
1.537487
1.547927
1.5644609999999999
1.575542
1.5453059999999998

This PR on GraalVM CE (with indy yield):

0.8623149999999999
0.6013459999999999
0.5804619999999999
0.431682
0.446851
0.433122
0.431614
0.423745
0.437696
0.433704

And because of this optimization, disabling the fixnum cache allows the loop itself to elide allocations.

This PR on GraalVM CE with fixnum caching disabled:

0.13884400000000002
0.162255
0.14796399999999998
0.14643599999999998
0.145918
0.147859
0.146317
0.146055
0.14645899999999998
0.14596399999999998

headius (Member, Author) commented on Dec 20, 2019

Another interesting benchmark is using the native Integer#times, which calls through the Block.yield machinery and does not inline. This shows the improvement due to the other optimizations.
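
The exact source for this run isn't shown, but it is presumably the same harness as above with the Ruby reimplementation of Integer#times removed, so the block is invoked from the native (Java-implemented) Integer#times through Block.yield:

loop {
  t = Time.now
  100_000_000.times {}   # native Integer#times yields to the block from Java
  puts Time.now - t
}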

JRuby 9.2.9 on Java 8, yield from Java:

1.465008
1.546333
1.418008
1.419014
1.4463789999999999
1.431911
1.434404
1.424123
1.4492530000000001
1.442501

This PR on Java 8, yield from Java:

0.55583
0.570769
0.497565
0.49922
0.507973
0.507676
0.514382
0.511301
0.510311
0.503312

enebo merged commit a318333 into master on Dec 20, 2019
headius deleted the reduce_block_overhead branch on Dec 20, 2019