Rework SerializableByKey handling to improve performance #6469
Conversation
In #6115 we are considering migrating to …
**Codecov Report**: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main    #6469      +/-   ##
==========================================
- Coverage   97.82%   97.82%   -0.01%
==========================================
  Files        1115     1115
  Lines       97468    97390      -78
==========================================
- Hits        95347    95268      -79
- Misses       2121     2122       +1
```

☔ View full report in Codecov by Sentry.
Review is focused on the Python code, since the JSON files should be fully verified by tests.
LGTM! Happy to see this get further refined.
```diff
 def default(self, o):
     # Object with custom method?
     if hasattr(o, '_json_dict_'):
-        return _json_dict_with_cirq_type(o)
+        json_dict = _json_dict_with_cirq_type(o)
+        if isinstance(o, SerializableByKey):
```
This shouldn't be necessary - `SerializableByKey` wraps `SupportsJSON` without modification, so the `hasattr` check above already confirms that `o` is `SerializableByKey`.

This also suggests that `SerializableByKey` can be entirely replaced by `SupportsJSON`, although that may be more effort than it's worth. I suspect that in early iterations `SerializableByKey` had some functionality, but I removed it before merging without removing the class itself.
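The structural relationship under discussion can be sketched with typing protocols. This is a simplified illustration, not cirq's actual definitions; `Example` is a hypothetical class:

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class SupportsJSON(Protocol):
    """Any object that serializes itself via a _json_dict_ method."""

    def _json_dict_(self) -> Dict[str, Any]: ...


@runtime_checkable
class SerializableByKey(SupportsJSON, Protocol):
    """Marker protocol: adds no members beyond SupportsJSON."""


class Example:
    def _json_dict_(self) -> Dict[str, Any]:
        return {'x': 1}


# Because SerializableByKey adds nothing, a structural check cannot
# distinguish it from SupportsJSON: any type with _json_dict_ passes both.
print(isinstance(Example(), SupportsJSON))       # True
print(isinstance(Example(), SerializableByKey))  # True
```

This is exactly why a pure marker class only makes sense as a nominal (inheritance-based) opt-in, as the reply below argues.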
It's true that `SerializableByKey` doesn't add any methods, but it still serves as a marker for types that want to opt in to this "deduplication" during serialization. One thing that is being added here is that we now require that `SerializableByKey` classes be hashable, which is not necessarily the case for all classes that implement `SupportsJSON`. If we want to change the opt-in mechanism that would be fine with me; I do think the name `SerializableByKey` is a bit misleading (and has been since #3673, when the `_serialization_key_` method was removed), so at the very least we might consider changing the name.
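The hashability caveat is easy to hit in practice: a Python class that defines `__eq__` without `__hash__` is unhashable, yet can still implement `_json_dict_`. A hypothetical illustration (`UnhashablePoint` is not a cirq type):

```python
from typing import Any, Dict


class UnhashablePoint:
    """Implements _json_dict_, but defining __eq__ without __hash__
    sets __hash__ to None, making instances unhashable."""

    def __init__(self, x: int, y: int) -> None:
        self.x, self.y = x, y

    def __eq__(self, other: Any) -> bool:
        return isinstance(other, UnhashablePoint) and (self.x, self.y) == (
            other.x,
            other.y,
        )

    def _json_dict_(self) -> Dict[str, Any]:
        return {'x': self.x, 'y': self.y}


try:
    # A dict-based memoization table needs the object as a key.
    {UnhashablePoint(1, 2): 'key0'}
except TypeError as e:
    print(f'unhashable: {e}')
```

So requiring hashability is a real (if small) new constraint on types that opt in to deduplication.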
Good call. Renaming feels like it would be a headache to me, but given that this changes the serialization format anyway, I guess now would be the time for it.
Happy to review any changes from this here or as a separate PR.
Since the `SerializableByKey` name itself no longer appears in the serialized format, renaming would be a source-only change and would not affect compatibility when deserializing stored data, so it should be low risk. Nevertheless, since naming always invites bikeshedding, I'm inclined to defer that to a future PR, so I'll leave this as-is for now :-)
```python
context_map.update({key: obj})


class _ContextualSerialization(SupportsJSON):
```
[No change needed] From our discussion offline: I think the issue with nested objects I was thinking of was linked to this "context table" implementation I used, which did not allow nested definitions. The new design, which allows a `VAL` inside of a `VAL`, should bypass the issue.
A "contextual serialization" mechanism was added in #3601 and #3673 that allows serializing certain objects once even when they occur multiple times within a larger structure. The only place we use this mechanism in cirq itself is for `FrozenCircuit`, since a given circuit can be included as a subcircuit multiple times within a larger circuit.

On internal benchmarks, the existing contextual serialization mechanism is a major performance bottleneck. In particular, the `has_serializable_by_key` function takes a significant amount of time even when no contextual serialization is needed, because it traverses the entire object being serialized and makes lots of `isinstance` calls. The reason for these pre-checks is that contextual serialization works by changing the outermost serialized object to instead be a `_ContextualSerialization` instance, which includes all `SerializableByKey` instances in an `object_dag` list created from a pre-order traversal. The current implementation also has the downside that during serialization it finds already-seen instances by traversing the `object_dag` list itself, potentially incurring quadratic overhead if the list grows large; this is necessary because `SerializableByKey` instances are not guaranteed to be hashable (though in practice the only existing case, `FrozenCircuit`, is hashable).

This mechanism can be simplified by keeping track of `SerializableByKey` instances as we encounter them and associating them with keys. The first time such an instance is encountered during serialization, it is assigned a key and serialized as a "value": `{"cirq_type": "VAL", "key": key, "val": val}`, where `key` is the assigned key and `val` is the normal serialization of the object. If the same instance is encountered again later, it is serialized as a "reference": `{"cirq_type": "REF", "key": key}`. We require that `SerializableByKey` instances be hashable so that the internal memoization can use a dict instead of a list, avoiding quadratic overhead from repeatedly searching for memoized instances. This would potentially allow us to use this mechanism on many more types without worrying about the runtime impact.

Note that this requires that the traversal order when deserializing be consistent with that used when serializing, so that we never encounter a `REF` before the corresponding `VAL`. AFAICT this will always be the case, because the `json` library respects the order in which object keys appear when deserializing (functions like `json.load` can take an `object_pairs_hook`, which receives key-value pairs as a list, and when these are collected in a dict and passed to `object_hook`, the Python dict implementation also preserves order).

This change gives a significant improvement in serialization performance; on one internal benchmark, the time spent in `json_serialization.to_json` under profiling dropped from 719s to 39s, an 18x speedup.

There's still a bit of work to do to clean up JSON test cases, since in a few places we already had existing `.json_inward` tests and I need to figure out how to preserve those while also moving the existing `.json` tests to be `.json_inward` tests. But otherwise this is ready for review and would be good to discuss the proposed approach.
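The VAL/REF scheme described above can be sketched as a standalone encoder/decoder pair using the stdlib `json` hooks mentioned in the description. This is an illustrative simplification with hypothetical names (`DedupEncoder`, `dedup_hook`, `Shared`), not cirq's actual implementation:

```python
import json
from typing import Any, Dict


class DedupEncoder(json.JSONEncoder):
    """Serialize shared (hashable) objects once as a VAL, thereafter as a REF."""

    def __init__(self, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.memo: Dict[Any, int] = {}  # object -> assigned key

    def default(self, o: Any) -> Any:
        if hasattr(o, '_json_dict_'):
            if o in self.memo:  # already seen: emit a reference
                return {'cirq_type': 'REF', 'key': self.memo[o]}
            key = self.memo[o] = len(self.memo)  # first sight: assign a key
            return {'cirq_type': 'VAL', 'key': key, 'val': o._json_dict_()}
        return super().default(o)


def dedup_hook(d: Dict[str, Any], table: Dict[int, Any]) -> Any:
    """object_hook that resolves REFs. Relies on json preserving key order,
    so each VAL is decoded before any REF that points at it."""
    if d.get('cirq_type') == 'VAL':
        table[d['key']] = d['val']
        return d['val']
    if d.get('cirq_type') == 'REF':
        return table[d['key']]
    return d


class Shared:
    """A hashable object that opts in to deduplication."""

    def __init__(self, x: int) -> None:
        self.x = x

    def _json_dict_(self) -> Dict[str, Any]:
        return {'x': self.x}


s = Shared(42)
text = json.dumps({'a': s, 'b': s}, cls=DedupEncoder)
print(text)  # 'a' holds the VAL, 'b' a REF back to it

table: Dict[int, Any] = {}
decoded = json.loads(text, object_hook=lambda d: dedup_hook(d, table))
print(decoded)
```

Note that the decoder hands back the *same* object for the VAL and every REF to it, so sharing is preserved on the round trip; the dict-based `memo` is why the design requires `SerializableByKey` instances to be hashable.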