Reduce block overhead #6005

Merged: 14 commits merged into master from reduce_block_overhead on Dec 20, 2019
Conversation

headius (Member) commented on Dec 20, 2019

This PR contains several improvements to block dispatch to reduce overhead.

  • Make Block.type final, to take advantage of final optimizations in some JVM JITs. If it needs to be modified, the Block should be cloned. Most blocks will never need this since they're always normal or always lambda.
  • Move BlockBody.evalType to Block and make it a final field. This eliminates the ThreadLocal in IRBlockBody and avoids constantly updating it back to normal, which was done every call to any IR block in the system.
  • Make escapeBlock final. This is only ever set at construction time and never modified again. Setting it here allows final optimizations to reduce loads when checking for escaped blocks.
  • Reset and recalculate IRScope flags before running AddCallProtocol. This allows ACP to reduce how much framing/scoping is needed when other passes (like DeadCode) have removed code. Previously the flags would be tainted by the earlier calculation and never cleared after optimizations had run.
  • Enable indy-based direct block yielding. This allows a monomorphic yield to be inlined into its caller (see the sketch after this list). This feature was available before behind a property, but was not enabled by default due to the lack of guards against polymorphic blocks.
  • Disable indy-based yielding when a polymorphic block is detected. This can be expanded in the future to a PIC for small-morphic yields, but the long-term solution will be using IR inlining or method splitting to keep yields naturally monomorphic.
  • Minor improvements to Proc#call to reduce how many levels of calls are necessary and to arity-split for reduced argument overhead along some paths.
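
As a rough illustration of the monomorphic/polymorphic distinction the direct-yield guard cares about, here is a minimal Ruby sketch (the method and blocks are invented for illustration, not taken from the PR):

# A yield site stays monomorphic when it only ever sees one block body,
# which is the case the indy-based direct yield can bind and inline.
def each_twice
  yield 1
  yield 2
end

# Monomorphic: the same literal block reaches the yield site every time.
1_000.times { each_twice { |x| x + 1 } }

# Polymorphic: three different block bodies reach the same yield site,
# the situation the PR detects and falls back to the non-indy path for.
[proc { |x| x * 2 }, proc { |x| x.to_s }, proc { |x| -x }].each do |blk|
  each_twice(&blk)
end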

This is a step toward separating the unique lambda logic for blocks from the much more common normal logic. Specifically, this change is intended to work toward eliminating constant checks of block type for e.g. arity checking in argument processing.

One flag was being lost due to zero-arg calls not propagating the procNew flag.

There's still argument boxing happening, but a couple of unnecessary checks disappear in the arity != 1 paths.
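
The "constant checks of block type for e.g. arity checking" mentioned above exist because, at the Ruby level, normal (proc-style) blocks and lambdas treat mismatched arguments differently. A small sketch of the visible difference (standard Ruby behavior, not code from this PR):

# Normal blocks/procs adjust arguments loosely; lambdas enforce arity strictly,
# so argument processing has to know which kind of block it is dispatching to.
normal = proc   { |a, b| [a, b] }
strict = lambda { |a, b| [a, b] }

p normal.call(1)        # => [1, nil]   missing args padded with nil
p normal.call(1, 2, 3)  # => [1, 2]     extra args dropped
begin
  strict.call(1)
rescue ArgumentError => e
  puts e.message        # lambdas raise on arity mismatch
end
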
headius added this to the JRuby 9.2.10.0 milestone on Dec 20, 2019
headius (Member, Author) commented on Dec 20, 2019

Some numbers to go with this. Note that the indy direct yield is still only working for monomorphic blocks, but expanding it to other forms will happen soon.

For a simple benchmark of a times-style loop:

Source:

class Integer
  def times(&block)
    i = 0
    while i < self
      yield i
      i+=1
    end
  end
end
loop {
  t = Time.now
  100_000_000.times {}
  puts Time.now - t
}

JRuby 9.2.9 on Java 8, no indy yield:

2.992584
2.7188019999999997
2.445827
2.457916
2.507064
2.468093
2.469516
2.481936
2.487838
2.457182

JRuby 9.2.9 on Java 8, with indy yield:

1.740728
1.609607
1.59323
1.5898210000000002
1.559908
1.5705019999999998
1.5718400000000001
1.557177
1.58254
1.582277

This PR on Java 8 (indy yield on by default):

0.961075
0.772228
0.454656
0.478141
0.45460700000000004
0.42695
0.430082
0.427406
0.439078
0.42669799999999997

These optimizations also help JITs like Graal optimize the entire times body along with the block.

JRuby 9.2.9 on GraalVM CE with indy yield:

1.747857
1.614819
1.622004
1.571566
1.554442
1.537487
1.547927
1.5644609999999999
1.575542
1.5453059999999998

This PR on GraalVM CE (with indy yield):

0.8623149999999999
0.6013459999999999
0.5804619999999999
0.431682
0.446851
0.433122
0.431614
0.423745
0.437696
0.433704

And because of this optimization, disabling the fixnum cache allows the loop itself to elide allocations.

This PR on GraalVM CE with fixnum caching disabled:

0.13884400000000002
0.162255
0.14796399999999998
0.14643599999999998
0.145918
0.147859
0.146317
0.146055
0.14645899999999998
0.14596399999999998

headius (Member, Author) commented on Dec 20, 2019

Another interesting benchmark is using the native Integer#times, which calls through the Block.yield machinery and does not inline. This shows the improvement due to the other optimizations.
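
The exact source for this run isn't shown, but it is presumably the same harness as above with the Ruby reimplementation of Integer#times removed, so the block is invoked from the native (Java-implemented) Integer#times through Block.yield:

loop {
  t = Time.now
  100_000_000.times {}   # native Integer#times yields to the block from Java
  puts Time.now - t
}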

JRuby 9.2.9 on Java 8, yield from Java:

1.465008
1.546333
1.418008
1.419014
1.4463789999999999
1.431911
1.434404
1.424123
1.4492530000000001
1.442501

This PR on Java 8, yield from Java:

0.55583
0.570769
0.497565
0.49922
0.507973
0.507676
0.514382
0.511301
0.510311
0.503312

enebo merged commit a318333 into master on Dec 20, 2019
headius deleted the reduce_block_overhead branch on Dec 20, 2019