recursive and unhashable closure #90
Conversation
Thanks for spending time on this @llllllllll
Codecov Report

```
@@            Coverage Diff             @@
##           master      #90      +/-   ##
==========================================
+ Coverage    79.8%    79.88%    +0.08%
==========================================
  Files           2         2
  Lines         505       522       +17
  Branches      104       109        +5
==========================================
+ Hits          403       417       +14
- Misses         73        75        +2
- Partials       29        30        +1
==========================================
```

Continue to review the full report at Codecov.
Is this related to #87 by any chance?
cloudpickle/cloudpickle.py
Outdated
```python
_cell_set = pythonapi.PyCell_Set
_cell_set.argtypes = (py_object, py_object)
_cell_set.restype = c_int
```
I changed my code a few minutes before this to do this slightly differently since it's best not to modify members in another module. See the updated version here.
(Sorry for the comment repost, GitHub deleted two comments instead of one.)
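For context, here is a minimal, self-contained sketch of the ctypes approach under discussion (CPython only; `make_cell` is an illustrative helper, not cloudpickle's API):

```python
from ctypes import pythonapi, py_object, c_int

# Bind the C API function that mutates a closure cell in place;
# pure-Python code could not reassign cell_contents before Python 3.7.
_cell_set = pythonapi.PyCell_Set
_cell_set.argtypes = (py_object, py_object)
_cell_set.restype = c_int

def make_cell(value):
    # Capture `value` in a closure to obtain a cell object.
    return (lambda: value).__closure__[0]

cell = make_cell(1)
_cell_set(cell, 2)
assert cell.cell_contents == 2
```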
It is! This is the first time I have gone through the cloudpickle repo. I think the chain there is probably faster than doing all of the copying into lists though.
cloudpickle/cloudpickle.py
Outdated
```python
else:
    supports_recursive_closure = True


def compress_closure(closure):
```
This is badly named, since it merely returns the closure length.
The point here is that the closure "compression" is different on different versions of Python; on PyPy this is an identity function. I didn't want to leak the implementation here; the only important thing is that this works when decompress_closure is called on the result.
Ok, I understand better now. Can you add a comment explaining the design choices at the beginning of those changes, so that future readers don't get lost?
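As a reader aid, here is a sketch of the intended contract. The `compress_closure` body follows the diff in this PR; `decompress_closure` and `_make_cell` are my assumptions about the receiving side (and on PyPy both would be identity-style functions):

```python
def _make_cell(value):
    # Capture `value` in a closure to obtain a cell object.
    return (lambda: value).__closure__[0]

def compress_closure(closure):
    # -1 distinguishes __closure__ is None from an empty tuple ().
    return len(closure) if closure is not None else -1

def decompress_closure(cell_count):
    # Rebuild placeholder cells; their real contents are filled later.
    if cell_count == -1:
        return None
    return tuple(_make_cell(None) for _ in range(cell_count))

assert decompress_closure(compress_closure(None)) is None
assert decompress_closure(compress_closure(())) == ()
assert len(decompress_closure(compress_closure((_make_cell(1),)))) == 1
```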
cloudpickle/cloudpickle.py
Outdated
```python
    supports_recursive_closure = True


def compress_closure(closure):
    return len(closure) if closure is not None else -1
```
Why -1? Why not just 0?
I just pushed a test that shows the problem. Using 0 causes the deserialized function to have a `__closure__` of `()` instead of None, which is different behavior.
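The distinction is observable from Python: `__closure__` is None for an ordinary function with no free variables, but a function can be constructed with an explicit empty-tuple closure (a quick standalone check, not cloudpickle code):

```python
import types

def f():
    return 1

assert f.__closure__ is None  # ordinary function: no closure at all

# Explicitly construct a function whose closure is the empty tuple.
g = types.FunctionType(f.__code__, {}, "g", None, ())
assert g.__closure__ == ()    # () is preserved, distinct from None
```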
Does it matter? Why not special-case 0 in `decompress_closure`?
If you have explicitly created a function with a closure of `()`, you want to have this appear on the other side. It seems weird to lose this information if we don't need to.
I'm not sure who would explicitly create such a function (and then depend on the closure being `()`). For the sake of simplicity, I think we should only care about real-world use cases.
I'm with @llllllllll on this. IMO you should not lose information if you don't need to.
cloudpickle/cloudpickle.py
Outdated
```diff
     # create a skeleton function object and memoize it
     save(_make_skel_func)
-    save((code, closure, base_globals))
+    save((code, compress_closure(func.__closure__), base_globals))
```
Why `func.__closure__` instead of just `closure`?
Related to my comment above: I wanted to be able to tell the difference between None and `()` to preserve this across a cloudpickle round trip.
cloudpickle/cloudpickle.py
Outdated
""" Fills in the rest of function data into the skeleton function object | ||
that were created via _make_skel_func(). | ||
""" | ||
func.__globals__.update(globals) | ||
func.__defaults__ = defaults | ||
func.__dict__ = dict | ||
|
||
fill_cells(func.__closure__, closure) |
Does this work only on CPython? I'm a bit lost.
That is correct; this is a no-op function on PyPy, where the original closure is passed to the skeleton function.
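To sketch why the skeleton-then-fill scheme handles recursive closures at all: the skeleton can be memoized before its cells are populated, so a cell that refers back to the function resolves to the skeleton itself. This toy demonstration uses Python 3.8+'s `types.CellType` and writable `cell_contents` for brevity, not this PR's `cell_set`/ctypes machinery:

```python
import types

def outer():
    def fact(n):
        return 1 if n <= 1 else n * fact(n - 1)
    return fact

fact = outer()

# Phase 1: build a skeleton with empty placeholder cells. In the real
# pickler the skeleton is memoized at this point, so references to the
# function inside its own closure resolve to the skeleton.
cells = tuple(types.CellType() for _ in fact.__closure__)
skel = types.FunctionType(fact.__code__, {}, fact.__code__.co_name, None, cells)

# Phase 2: fill the cells -- the recursive cell points at the skeleton.
cells[0].cell_contents = skel

assert skel(5) == 120
```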
cloudpickle/cloudpickle.py
Outdated
```python
    write(pickle.REDUCE)
    self.memoize(func)

    # save the rest of the func data needed by _fill_function
    save(f_globals)
    save(defaults)
    save(dct)
    save(closure)
```
Let me try to understand: this saves the closure twice on PyPy? Once in `compress_closure` above, and once here?
That is correct; we should optimize out the second case here.
Updated to remove this call if it is not needed.
tests/cloudpickle_test.py
Outdated
```python
    supports_recursive_closure,
    reason="Recursive closures shouldn't raise an exception if supported"
)
@pytest.mark.xfail
```
Rather than an xfail, I'd like to see the specific exception tested for inside the test, so that we don't miss it failing for other reasons...
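In other words, something along these lines — the function name and exception here are placeholders for whatever the unsupported path actually raises, not cloudpickle's real API:

```python
import pytest

def pickle_recursive_closure():
    # Placeholder for the code path that cannot serialize the closure.
    raise ValueError("cannot pickle recursive closure")

def test_recursive_closure_raises():
    # Unlike @pytest.mark.xfail, this fails if the test breaks for any
    # reason other than the one expected exception.
    with pytest.raises(ValueError, match="recursive closure"):
        pickle_recursive_closure()
```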
Thank you all for helping with this! This is great!
cloudpickle/cloudpickle.py
Outdated
```diff
@@ -818,19 +873,17 @@ def _reconstruct_closure(values):
     return tuple([_make_cell(v) for v in values])
```
Should `_reconstruct_closure` be removed now?
tests/cloudpickle_test.py
Outdated
```python
        g2 = pickle_depickle(f2(2))
        self.assertEqual(g2(5), 240)

    @pytest.mark.skipif(
```
This test actually should succeed on PyPy, no?
Yes, this is a mistake.
I have a question about cloudpickle compatibility: should users be able to serialize objects in PyPy and send them to CPython, or vice versa? Currently this will not work. I can investigate that if it is a goal of the project.
Ideally, pickles work on all Python versions and implementations which support the given pickle protocol version.
I have an idea that will work without ctypes and, ideally, on PyPy. I am trying to get a MWE now.
Okay, I just pushed a change; tests pass for me locally with PyPy and no more skips!
Impressive! Now we can finally rename
I was trying to think of a better name but decided I should probably just inline those functions since they have a single call site each. I also made some variable names more consistent with
This is excellent work. I'll merge soon if no further comments.
cloudpickle/cloudpickle.py
Outdated
```diff
-    return types.FunctionType(code, base_globals,
-                              None, None, closure)
+    closure = (
+        tuple(_make_cell(None) for _ in range(cell_count))
```
One last idea I had would be to actually create empty cells here so that if there is a bug or issue with the setting of the cells you will get UnboundLocalError instead of a None object. This might make debugging easier in the future. We could make an empty cell with:
```python
def _make_empty_cell():
    if False:
        value = None
    def inner():
        value
    return inner.__closure__[0]
```
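Assuming the helper above (repeated here so the snippet runs standalone), the failure mode is loud: reading an unfilled cell raises an exception rather than silently yielding None:

```python
def _make_empty_cell():
    if False:
        # Never executed, but makes `value` a local of this function so
        # that `inner` closes over it; since it is never assigned, the
        # resulting cell stays empty.
        value = None
    def inner():
        value  # reference `value` to force the closure
    return inner.__closure__[0]

cell = _make_empty_cell()
try:
    cell.cell_contents  # CPython raises ValueError: Cell is empty
except ValueError:
    empty = True
else:
    empty = False
assert empty
```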
That sounds like a good idea.
Pushed.
Last comment: should there be unit tests for cell_set(), _make_empty_cell() and fill_function()?
Yeah, I can write some unit tests when I get home tonight.
This is great! Thank you for this.
Thanks @llllllllll, remind me to ping you on more issues in the future :)
Nice job @llllllllll and @mehrdadn, great work!
This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the current released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using the stock cloudpickle. Some notable changes:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard __transient__ check (cloudpipe/cloudpickle#110)** -- while we don't use this internally, and have no tests or documentation for its use, downstream code may use __transient__; although it has never been part of the API, if we merge this we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.
xref: #89, #86, #62
NOTE: This is based off of the work by @mehrdadn. I had a couple of ideas to improve performance but the core idea is theirs. I apologize for opening another PR, but it seemed easier than trying to communicate my idea without code.
Instead of serializing the empty cell objects, this change serializes just an integer which tells the receiver how many empty cells to create in `_make_skel_func`. This reduces the size of the serialized result; for example, given the function:

With #89:

With this PR:
The other (and much smaller) change is to remove the `PyCell_Set is not None` checks in favor of stub functions. This only affects PyPy, where these identity or no-op functions should have very little cost, and it keeps the more complicated serialization function free of extra branches.

In order to fix the unhashable closure cells and globals, we use `itertools.chain` to walk over the values, with the belief that duplicate values should be very rare and they will already be memoized.

UPDATE:
I am now serializing a -1 for the closure length if the `__closure__` is None. Now we will get a closure of None instead of an empty tuple on the other side.
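To illustrate the unhashability issue the chain-based walk avoids (a toy sketch; `outer`/`inner` are illustrative, not cloudpickle code):

```python
from itertools import chain

def outer():
    data = [1, 2, 3]  # closure cells may hold unhashable values
    def inner():
        return data
    return inner

inner = outer()
cell_values = [c.cell_contents for c in inner.__closure__]

# A set-based dedup pass would blow up on unhashable cell contents...
try:
    set(cell_values)
except TypeError:
    hashable = False
else:
    hashable = True
assert not hashable

# ...whereas chaining the sequences iterates them lazily in one pass and
# leaves deduplication to the pickler's memo.
walked = list(chain(cell_values, inner.__globals__.values()))
assert [1, 2, 3] in walked
```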