Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up JSON::Validator.validate #255

Closed
wants to merge 9 commits into from

Conversation

mjc
Copy link
Contributor

@mjc mjc commented May 14, 2015

Addressible's URI parsing shows up at the top of the profiler for our use case on JRuby, as well as YAML dumping.

This patch

  1. adds a cache to JSON::Util::URI.normalized_uri
  2. adds a cache for Addressable::URI.parse in JSON::Util::URI.parse
  3. prevents clearing an already cleared URI fragment in JSON::Schema#initialize
  4. uses Hash#hash for calculating uuids instead of serializing the hash.

There are many further gains to be had still, especially if we can find a way to remove the mutex around initializing a schema, as that is proving to be terrible for performance on JRuby in a multi-threaded environment.

I'm pretty much finished with this pass - almost 8x is a pretty nice gain.

benchmark: https://github.com/mjc/json-schema-perf

MRI 2.2.2 results: 2.89x faster from file, 5.55x faster from hash.

JRuby 1.7.20 results: 2.38x faster from file, 7.86x faster from hash.

MRI 2.2.2 before

Calculating -------------------------------------
json-schema from file
                       181.000  i/100ms
json-schema from hash
                        87.000  i/100ms
-------------------------------------------------
json-schema from file
                          1.962k (± 6.3%) i/s -     58.644k
json-schema from hash
                        938.211  (± 6.2%) i/s -     28.101k

Comparison:
json-schema from file:     1961.8 i/s
json-schema from hash:      938.2 i/s - 2.09x slower

MRI 2.2.2 after

json-schema from file
                          5.668k (± 7.4%) i/s -    169.162k
json-schema from hash
                          5.210k (± 7.1%) i/s -    155.696k


Comparison:
json-schema from file:     5668.0 i/s
json-schema from hash:     5210.2 i/s - 1.09x slower

JRuby 1.7.20 before

Calculating -------------------------------------
json-schema from file
                       200.000  i/100ms
json-schema from hash
                       138.000  i/100ms
-------------------------------------------------
json-schema from file
                          2.075k (± 6.1%) i/s -     62.000k
json-schema from hash
                          1.401k (± 5.4%) i/s -     41.952k

Comparison:
json-schema from file:     2074.9 i/s
json-schema from hash:     1401.0 i/s - 1.48x slower

JRuby 1.7.20 after

Calculating -------------------------------------
json-schema from file
                       474.000  i/100ms
json-schema from hash
                         1.002k i/100ms
-------------------------------------------------
json-schema from file
                          4.942k (± 5.7%) i/s -    147.888k
json-schema from hash
                         10.773k (± 6.5%) i/s -    322.644k

Comparison:
json-schema from hash:    10773.1 i/s
json-schema from file:     4942.2 i/s - 2.18x slower

@mjc mjc force-pushed the cache-normalized-uri branch 5 times, most recently from 5daa621 to 1c86b40 Compare May 14, 2015 19:35
Addressible's URI parsing shows up at the top of the profiler for our use case.

This patch adds a cache to JSON::Util::URI.normalized_uri, to trade CPU time for memory.

There are other expensive Addressible methods that should be cached elsewhere.

benchmark: https://github.com/mjc/json-schema-perf

MRI 2.2.2 before: 825.363  (± 6.2%) i/s
MRI 2.2.2 after:  1.436k (± 4.5%) i/s

1.74x faster

JRuby 1.7.20 before: 1.445k (± 2.5%) i/s
JRuby 1.7.20 after: 2.272k (± 6.1%) i/s

1.57x faster
@mjc mjc force-pushed the cache-normalized-uri branch from 1c86b40 to 078f745 Compare May 14, 2015 19:36
@mjc
Copy link
Contributor Author

mjc commented May 18, 2015

I also need to research finding ways to test concurrency and remove the mutex around schema initialization safely.


def self.normalized_uri(uri)
instance.cache ||= {}
instance.cache[uri] ||= instance.normalized_uri(uri)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caching here is totally unsafe in a multithreaded environment. You're going to end up with race conditions. But the side effect of that is (I admit) fairly minor here because the caller will always get the correct result.

However, I've had a long think about whether caching is the right approach (since you raised this pull request) and I think you're probably right to add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iainbeeston right, the only effect of the race condition is a cache miss, and that's OK, given that it's a standard effect of most caching. I can add a comment noting that if desired.

@iainbeeston
Copy link
Contributor

With regards to the mutex, I'm not even sure what it's purpose is. So far as I can tell it ensures that only one validator can be initialised at once. But it's not protecting access to any class variables, and it's not protecting a validator from being used by several threads concurrently and getting into an invalid state.

It's possible the answer is in the git history, though I'm not in front of a computer right now.

@hoxworth Do you remember why the mutex was introduced?

@mjc
Copy link
Contributor Author

mjc commented Jun 1, 2015

I'm pretty much done with this pass if we can't remove the mutex.

@RST-J
Copy link
Contributor

RST-J commented Jun 2, 2015

To me that looks OK and I guess using some more memory is a reasonable trade for increased performance.

# a cache miss
def self.normalized_uri(uri)
instance.normalize_cache ||= {}
instance.normalize_cache[uri] ||= instance.normalized_uri(uri)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last question - should the cache key here be "uri.to_s"? (As it is for the parse cache?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

@iainbeeston
Copy link
Contributor

I've added a couple more questions, but overall this looks great and I'll be happy to merge once I hear back from @mjc. I especially appreciate the benchmark times that accompany the changes (thanks @mjc!)

@mjc
Copy link
Contributor Author

mjc commented Jun 4, 2015

@iainbeeston no problem! Happy to help. Trying to improve performance if you don't know where the bottlenecks are and whether you made a useful difference is silly.

@mjc
Copy link
Contributor Author

mjc commented Jun 4, 2015

Weird, I think maybe the .parse cache just has to go. :(

@iainbeeston
Copy link
Contributor

So it works when the key is a uri object but not when it is a string?

@mjc
Copy link
Contributor Author

mjc commented Jun 5, 2015

It looks like the objects get modified in incompatible ways sometimes? The
error on Travis is proving hard to track down

On Thu, Jun 4, 2015 at 3:19 PM Iain Beeston notifications@github.com
wrote:

So it works when the key is a uri object but not when it is a string?


Reply to this email directly or view it on GitHub
#255 (comment)
.

@mjc
Copy link
Contributor Author

mjc commented Jun 5, 2015

Do you think it'd be possible to use the stdlib's URI stuff?
http://ruby-doc.org/stdlib-2.2.2/libdoc/uri/rdoc/URI.html
On Thu, Jun 4, 2015 at 7:36 PM Michael Cohen michael.joseph.cohen@gmail.com
wrote:

It looks like the objects get modified in incompatible ways sometimes? The
error on Travis is proving hard to track down

On Thu, Jun 4, 2015 at 3:19 PM Iain Beeston notifications@github.com
wrote:

So it works when the key is a uri object but not when it is a string?


Reply to this email directly or view it on GitHub
#255 (comment)
.

@iainbeeston
Copy link
Contributor

Is it possible to freeze the URI object? Would that help?

The gem used to do that but although it's faster it has trouble with
unusual URIs (file URIs I think are an example of that, but generally
addressable is more capable)
On Fri, 5 Jun 2015 at 2:37 am, Michael J. Cohen notifications@github.com
wrote:

Do you think it'd be possible to use the stdlib's URI stuff?
http://ruby-doc.org/stdlib-2.2.2/libdoc/uri/rdoc/URI.html
On Thu, Jun 4, 2015 at 7:36 PM Michael Cohen <
michael.joseph.cohen@gmail.com>
wrote:

It looks like the objects get modified in incompatible ways sometimes?
The
error on Travis is proving hard to track down

On Thu, Jun 4, 2015 at 3:19 PM Iain Beeston notifications@github.com
wrote:

So it works when the key is a uri object but not when it is a string?


Reply to this email directly or view it on GitHub
<
#255 (comment)

.


Reply to this email directly or view it on GitHub
#255 (comment)
.

@iainbeeston
Copy link
Contributor

@RST-J and @mjc Could you please take a look at #285 , which is my attempt to salvage this and make it releasable?

@iainbeeston
Copy link
Contributor

I'm closing this for now. I've merged most of the changes with the exception of using has to generate uuids (which gave a good speed improvement but feels risky to me)

@iainbeeston iainbeeston closed this Jan 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants