
recursive and unhashable closure #90

Merged May 30, 2017 (14 commits)

Conversation

@llllllllll (Contributor) commented May 25, 2017

xref: #89, #86, #62

NOTE: This is based on the work by @mehrdadn. I had a couple of ideas to improve performance, but the core idea is theirs. I apologize for opening another PR, but it seemed easier than trying to communicate my idea without code.

Instead of serializing the empty cell objects, this change serializes just an integer which tells the receiver how many empty cells to create in _make_skel_func. This reduces the size of the serialized result. For example, given the function:

def f():
    def g():
        return g
    return g

With #89:

In [3]: len(dumps(f()))
Out[3]: 366

With this PR:

In [3]: len(dumps(f()))
Out[3]: 321

The other (and much smaller) change is to remove the PyCell_Set is not None checks in favor of stub functions. This only affects PyPy, where these identity or no-op functions should have very little cost, and it keeps the more complicated serialization function free of extra branches.
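
A minimal sketch of that dispatch (the platform check and the cell_set name here are my own illustration, not necessarily the PR's exact code): pick the cell setter once at import time so the serializer itself stays branch-free.

import platform

if platform.python_implementation() == 'CPython':
    from ctypes import pythonapi, py_object, c_int

    _cell_set = pythonapi.PyCell_Set
    _cell_set.argtypes = (py_object, py_object)
    _cell_set.restype = c_int

    def cell_set(cell, value):
        # Mutate an existing cell in place via the CPython C API.
        _cell_set(cell, value)
else:
    def cell_set(cell, value):
        # No-op stub on PyPy, where the real closure is handed to the
        # skeleton function directly and no post-hoc filling is needed.
        pass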

In order to fix the unhashable closure cells and globals, we use itertools.chain to walk over the values, on the assumption that duplicate values should be very rare and that any duplicates will already be memoized by the pickler.
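
As a rough sketch of that idea (the helper name and the choice of which globals to walk are illustrative assumptions):

import itertools

def _values_to_save(func):
    # Walk closure cell contents and referenced globals lazily rather
    # than collecting them into a set (which fails for unhashable
    # values); the pickler's memo already deduplicates repeats.
    closure_values = (cell.cell_contents
                      for cell in (func.__closure__ or ()))
    global_values = (func.__globals__[name]
                     for name in func.__code__.co_names
                     if name in func.__globals__)
    return itertools.chain(closure_values, global_values)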

UPDATE:

I am now serializing a -1 for the closure length when __closure__ is None, so the receiving side gets a closure of None instead of an empty tuple.
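
Putting the pieces together, a minimal sketch of the scheme (compress_closure matches the code discussed below; _make_cell and decompress_closure are illustrative reconstructions, not the PR's verbatim code):

def _make_cell(value):
    # Create a cell object by closing over a local variable.
    return (lambda: value).__closure__[0]

def compress_closure(closure):
    # Only a count crosses the wire; -1 distinguishes a __closure__ of
    # None from an (unusual but legal) empty closure tuple ().
    return len(closure) if closure is not None else -1

def decompress_closure(cell_count):
    # Recreate placeholder cells whose real contents are filled in
    # later, once the closure values themselves have been unpickled.
    if cell_count < 0:
        return None
    return tuple(_make_cell(None) for _ in range(cell_count))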

@mrocklin (Contributor)

Thanks for spending time on this @llllllllll

cc @pcmoritz @pitrou

@llllllllll force-pushed the recursive-closure branch from 9a2d625 to ba23a20 on May 25, 2017 08:31
@codecov-io commented May 25, 2017

Codecov Report

Merging #90 into master will increase coverage by 0.08%.
The diff coverage is 86.66%.


@@            Coverage Diff             @@
##           master      #90      +/-   ##
==========================================
+ Coverage    79.8%   79.88%   +0.08%     
==========================================
  Files           2        2              
  Lines         505      522      +17     
  Branches      104      109       +5     
==========================================
+ Hits          403      417      +14     
- Misses         73       75       +2     
- Partials       29       30       +1
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 79.76% <86.66%> (+0.08%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mehrdadn

Is this related to #87 by any chance?


_cell_set = pythonapi.PyCell_Set
_cell_set.argtypes = (py_object, py_object)
_cell_set.restype = c_int

@mehrdadn:

I changed my code a few minutes before this to do this slightly differently since it's best not to modify members in another module. See the updated version here.
(Sorry for the comment repost, GitHub deleted two comments instead of one.)

@llllllllll (Contributor Author)

It is! This is the first time I have gone through the cloudpickle repo. I think the chain there is probably faster than doing all of the copying into lists, though.

else:
    supports_recursive_closure = True

    def compress_closure(closure):
Member:

This is badly named, since it merely returns the closure length.

Contributor Author:

The point here is that the closure "compression" is different on different implementations of Python: on PyPy this is an identity function. I didn't want to leak the implementation here; the only important thing is that decompress_closure works when called on the result.

Member:

Ok, I understand better now. Can you add a comment explaining the design choices at the beginning of those changes, so that future readers don't get lost?

supports_recursive_closure = True

def compress_closure(closure):
    return len(closure) if closure is not None else -1
Member:

Why -1? Why not just 0?

Contributor Author:

I just pushed a test that shows the problem. Using 0 causes the deserialized function to have a __closure__ of () instead of None, which is different behavior.

Member:

Does it matter? Why not special-case 0 in decompress_closure?

Contributor Author:

If you have explicitly created a function with a closure of () you want to have this appear on the other side. It seems weird to lose this information if we don't need to.

Member:

I'm not sure who would explicitly create such a function (and then depend on the closure being ()). For the sake of simplicity, I think we should only care about real-world use cases.

Another commenter:

I'm with @llllllllll on this. IMO you should not lose information if you don't need to.
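
For readers following along, the distinction under discussion is observable in plain CPython:

import types

def f():
    return 1

# A normally-defined function with no free variables has no closure.
assert f.__closure__ is None

# One explicitly constructed with an empty tuple keeps that tuple.
g = types.FunctionType(f.__code__, {}, 'g', None, ())
assert g.__closure__ == ()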


  # create a skeleton function object and memoize it
  save(_make_skel_func)
- save((code, closure, base_globals))
+ save((code, compress_closure(func.__closure__), base_globals))
Member:

Why func.__closure__ instead of just closure?

Contributor Author:

Related to my comment above: I wanted to be able to tell the difference between None and () to preserve this across a cloudpickle round trip.

""" Fills in the rest of function data into the skeleton function object
that were created via _make_skel_func().
"""
func.__globals__.update(globals)
func.__defaults__ = defaults
func.__dict__ = dict

fill_cells(func.__closure__, closure)
Member:

Does this work only on CPython? I'm a bit lost.

Contributor Author:

That is correct; this is a no-op function on PyPy, where the original closure is passed to the skeleton function.
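
A sketch of the assumed shape, building on the cell_set stubs above:

def fill_cells(cells, values):
    # On CPython this mutates the placeholder cells in place, which is
    # what lets a recursive closure's cell point back at the function
    # that was just rebuilt; on PyPy cell_set is a no-op because the
    # skeleton function already received the real closure.
    for cell, value in zip(cells, values):
        cell_set(cell, value)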

write(pickle.REDUCE)
self.memoize(func)

# save the rest of the func data needed by _fill_function
save(f_globals)
save(defaults)
save(dct)
save(closure)
Member:

Let me try to understand: this saves the closure twice in PyPy? Once in compress_closure above, and once here?

Contributor Author:

That is correct, we should optimize out the second case here.

Contributor Author:

Updated to remove this call when it is not needed.

    supports_recursive_closure,
    reason="Recursive closures shouldn't raise an exception if supported"
)
@pytest.mark.xfail
Member:

Rather than an xfail, I'd like to see the specific exception tested for inside the test, so that we don't miss it failing for other reasons...

@pcmoritz (Contributor)

Thank you all for helping with this! This is great!

@@ -818,19 +873,17 @@ def _reconstruct_closure(values):
    return tuple([_make_cell(v) for v in values])
Member:

Should _reconstruct_closure be removed now?

        g2 = pickle_depickle(f2(2))
        self.assertEqual(g2(5), 240)

    @pytest.mark.skipif(
Member:

This test actually should succeed on PyPy, no?

Contributor Author:

Yes, this is a mistake.

@llllllllll (Contributor Author)

I have a question about cloudpickle compatibility: should users be able to serialize objects in PyPy and send them to CPython, or vice versa? Currently that will not work. I can investigate it if cross-implementation portability is a goal of the project.

@pitrou (Member) commented May 25, 2017

Ideally, pickles work on all Python versions and implementations that support the given pickle protocol version.
In practice, this is an edge case that I don't think is very annoying if it's not portable across implementations.

@llllllllll (Contributor Author)

I have an idea that will work without ctypes and ideally on PyPy. I am trying to get an MWE now.
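
(The thread doesn't spell the trick out. As a point of reference only, not this PR's approach: since CPython 3.7, bpo-30486 makes a cell's contents directly writable, which reduces the whole problem to one line.)

def cell_set(cell, value):
    # cell_contents became assignable in CPython 3.7; earlier versions
    # needed ctypes or code-object tricks like the ones in this PR.
    cell.cell_contents = value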

@llllllllll force-pushed the recursive-closure branch from fb27c76 to cb524a4 on May 25, 2017 19:12
@llllllllll (Contributor Author)

Okay, I just pushed a change; tests pass for me locally with PyPy and there are no more skips!

@pitrou (Member) commented May 25, 2017

Impressive! Now we can finally rename compress_closure and decompress_closure :-)

@llllllllll (Contributor Author)

I was trying to think of a better name, but decided I should probably just inline those functions since they each have a single call site. I also made some variable names more consistent: closure means the iterable of cells, closure_values means the cell_contents, and cell_count means the length of the closure or closure_values.

@pitrou (Member) commented May 26, 2017

This is excellent work. I'll merge soon if no further comments.

    return types.FunctionType(code, base_globals,
                              None, None, closure)
closure = (
    tuple(_make_cell(None) for _ in range(cell_count))
Contributor Author:

One last idea I had would be to actually create empty cells here, so that if there is a bug or issue with the setting of the cells you will get an error instead of a None object. This might make debugging easier in the future. We could make an empty cell with:

def _make_empty_cell():
    if False:
        value = None

    def inner():
        value

    return inner.__closure__[0]
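
As a usage note, the failure mode with such a cell is loud rather than silent:

cell = _make_empty_cell()
try:
    cell.cell_contents
except ValueError as e:
    print(e)  # "Cell is empty" -- much easier to debug than a stray None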

Member:

That sounds like a good idea.

Contributor Author:

Pushed.

@pitrou (Member) commented May 26, 2017

Last comment: should there be unit tests for cell_set(), _make_empty_cell() and fill_function()?

@llllllllll (Contributor Author)

Yeah, I can write some unit tests when I get home tonight.

@llllllllll force-pushed the recursive-closure branch from 951acd4 to 020be10 on May 30, 2017 00:56
@llllllllll force-pushed the recursive-closure branch from 020be10 to cf882c6 on May 30, 2017 01:22
@pitrou (Member) commented May 30, 2017

This is great! Thank you for this.

@mrocklin (Contributor)

Thanks @llllllllll , remind me to ping you on more issues in the future :)

@robertnishihara

Nice job @llllllllll and @mehrdadn, great work!

rgbkrk added a commit to rgbkrk/spark that referenced this pull request Jun 12, 2017
This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 22, 2017
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the current released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using stock cloudpickle.

Some notable changes:
* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard __transient__ check (cloudpipe/cloudpickle#110)** -- while we don't use this internally and have no tests or documentation for its use, downstream code may use __transient__, although it has never been part of the API; if we merge this we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.