
recursive and unhashable closure #90

Merged May 30, 2017 (14 commits)

Conversation

@llllllllll (Contributor) commented May 25, 2017

xref: #89, #86, #62

NOTE: This is based on the work by @mehrdadn. I had a couple of ideas to improve performance, but the core idea is theirs. I apologize for opening another PR, but it seemed easier than trying to communicate my idea without code.

Instead of serializing the empty cell objects, this change serializes just an integer which tells the receiver how many empty cells to create in _make_skel_func. This reduces the size of the serialized result. For example, given the function:

def f():
    def g():
        return g
    return g

With #89:

In [3]: len(dumps(f()))
Out[3]: 366

With this PR:

In [3]: len(dumps(f()))
Out[3]: 321

The other (and much smaller) change is to remove the PyCell_Set is not None checks in favor of stub functions. This only affects PyPy, where these identity or no-op functions should have very little cost, and it keeps the more complicated serialization function free of extra branches.
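
A minimal sketch of that dispatch (the platform check and the cell_set name here are my own illustration, not necessarily the PR's exact code): pick the cell setter once at import time so the serializer itself stays branch-free.

import platform

if platform.python_implementation() == 'CPython':
    from ctypes import pythonapi, py_object, c_int

    _cell_set = pythonapi.PyCell_Set
    _cell_set.argtypes = (py_object, py_object)
    _cell_set.restype = c_int

    def cell_set(cell, value):
        # Mutate an existing cell in place via the CPython C API.
        _cell_set(cell, value)
else:
    def cell_set(cell, value):
        # No-op stub on PyPy, where the real closure is handed to the
        # skeleton function directly and no post-hoc filling is needed.
        pass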

In order to fix the unhashable closure cells and globals, we use itertools.chain to walk over the values, on the assumption that duplicate values should be very rare and that any duplicates will already be memoized by the pickler.
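
As a rough sketch of that idea (the helper name and the choice of which globals to walk are illustrative assumptions):

import itertools

def _values_to_save(func):
    # Walk closure cell contents and referenced globals lazily rather
    # than collecting them into a set (which fails for unhashable
    # values); the pickler's memo already deduplicates repeats.
    closure_values = (cell.cell_contents
                      for cell in (func.__closure__ or ()))
    global_values = (func.__globals__[name]
                     for name in func.__code__.co_names
                     if name in func.__globals__)
    return itertools.chain(closure_values, global_values)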

UPDATE:

I am now serializing a -1 for the closure length when __closure__ is None, so the receiving side gets a closure of None instead of an empty tuple.
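
Putting the pieces together, a minimal sketch of the scheme (compress_closure matches the code discussed below; _make_cell and decompress_closure are illustrative reconstructions, not the PR's verbatim code):

def _make_cell(value):
    # Create a cell object by closing over a local variable.
    return (lambda: value).__closure__[0]

def compress_closure(closure):
    # Only a count crosses the wire; -1 distinguishes a __closure__ of
    # None from an (unusual but legal) empty closure tuple ().
    return len(closure) if closure is not None else -1

def decompress_closure(cell_count):
    # Recreate placeholder cells whose real contents are filled in
    # later, once the closure values themselves have been unpickled.
    if cell_count < 0:
        return None
    return tuple(_make_cell(None) for _ in range(cell_count))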

@mrocklin (Contributor)

Thanks for spending time on this @llllllllll

cc @pcmoritz @pitrou

@llllllllll force-pushed the recursive-closure branch from 9a2d625 to ba23a20 on May 25, 2017 08:31
@codecov-io commented May 25, 2017

Codecov Report

Merging #90 into master will increase coverage by 0.08%.
The diff coverage is 86.66%.


@@            Coverage Diff             @@
##           master      #90      +/-   ##
==========================================
+ Coverage    79.8%   79.88%   +0.08%     
==========================================
  Files           2        2              
  Lines         505      522      +17     
  Branches      104      109       +5     
==========================================
+ Hits          403      417      +14     
- Misses         73       75       +2     
- Partials       29       30       +1
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 79.76% <86.66%> (+0.08%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mehrdadn

Is this related to #87 by any chance?


_cell_set = pythonapi.PyCell_Set
_cell_set.argtypes = (py_object, py_object)
_cell_set.restype = c_int

@mehrdadn:

I changed my code a few minutes before this to do this slightly differently since it's best not to modify members in another module. See the updated version here.
(Sorry for the comment repost, GitHub deleted two comments instead of one.)

@llllllllll (Contributor Author)

It is! This is the first time I have gone through the cloudpickle repo. I think the chain there is probably faster than doing all of the copying into lists, though.

else:
    supports_recursive_closure = True

    def compress_closure(closure):
Member:

This is badly named, since it merely returns the closure length.

Contributor Author:

The point here is that the closure "compression" is different on different implementations of Python: on PyPy this is an identity function. I didn't want to leak the implementation here; the only important thing is that decompress_closure works when called on the result.

Member:

Ok, I understand better now. Can you add a comment explaining the design choices at the beginning of those changes, so that future readers don't get lost?

supports_recursive_closure = True

def compress_closure(closure):
    return len(closure) if closure is not None else -1
Member:

Why -1? Why not just 0?

Contributor Author:

I just pushed a test that shows the problem. Using 0 causes the deserialized function to have a __closure__ of () instead of None, which is different behavior.

Member:

Does it matter? Why not special-case 0 in decompress_closure?

Contributor Author:

If you have explicitly created a function with a closure of () you want to have this appear on the other side. It seems weird to lose this information if we don't need to.

Member:

I'm not sure who would explicitly create such a function (and then depend on the closure being ()). For the sake of simplicity, I think we should only care about real-world use cases.

Another commenter:

I'm with @llllllllll on this. IMO you should not lose information if you don't need to.
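
For readers following along, the distinction under discussion is observable in plain CPython:

import types

def f():
    return 1

# A normally-defined function with no free variables has no closure.
assert f.__closure__ is None

# One explicitly constructed with an empty tuple keeps that tuple.
g = types.FunctionType(f.__code__, {}, 'g', None, ())
assert g.__closure__ == ()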


  # create a skeleton function object and memoize it
  save(_make_skel_func)
- save((code, closure, base_globals))
+ save((code, compress_closure(func.__closure__), base_globals))
Member:

Why func.__closure__ instead of just closure?

Contributor Author:

Related to my comment above: I wanted to be able to tell the difference between None and () to preserve this across a cloudpickle round trip.

""" Fills in the rest of function data into the skeleton function object
that were created via _make_skel_func().
"""
func.__globals__.update(globals)
func.__defaults__ = defaults
func.__dict__ = dict

fill_cells(func.__closure__, closure)
Member:

Does this work only on CPython? I'm a bit lost.

Contributor Author:

That is correct; this is a no-op function on PyPy, where the original closure is passed to the skeleton function.
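
A sketch of the assumed shape, building on the cell_set stubs above:

def fill_cells(cells, values):
    # On CPython this mutates the placeholder cells in place, which is
    # what lets a recursive closure's cell point back at the function
    # that was just rebuilt; on PyPy cell_set is a no-op because the
    # skeleton function already received the real closure.
    for cell, value in zip(cells, values):
        cell_set(cell, value)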

write(pickle.REDUCE)
self.memoize(func)

# save the rest of the func data needed by _fill_function
save(f_globals)
save(defaults)
save(dct)
save(closure)
Member:

Let me try to understand: this saves the closure twice in PyPy? Once in compress_closure above, and once here?

Contributor Author:

That is correct, we should optimize out the second case here.

Contributor Author:

Updated to remove this call when it is not needed.

    supports_recursive_closure,
    reason="Recursive closures shouldn't raise an exception if supported"
)
@pytest.mark.xfail
Member:

Rather than an xfail, I'd like to see the specific exception tested for inside the test, so that we don't miss it failing for other reasons...

@pcmoritz (Contributor)

Thank you all for helping with this! This is great!

@@ -818,19 +873,17 @@ def _reconstruct_closure(values):
    return tuple([_make_cell(v) for v in values])
Member:

Should _reconstruct_closure be removed now?

        g2 = pickle_depickle(f2(2))
        self.assertEqual(g2(5), 240)

    @pytest.mark.skipif(
Member:

This test actually should succeed on PyPy, no?

Contributor Author:

Yes, this is a mistake.

@llllllllll (Contributor Author)

I have a question about cloudpickle compatibility: should users be able to serialize objects in PyPy and send them to CPython, or vice versa? Currently that will not work. I can investigate it if cross-implementation portability is a goal of the project.

@pitrou (Member) commented May 25, 2017

Ideally, pickles work on all Python versions and implementations that support the given pickle protocol version.
In practice, this is an edge case that I don't think is very annoying if it's not portable across implementations.

@llllllllll (Contributor Author)

I have an idea that will work without ctypes and ideally on PyPy. I am trying to get an MWE now.
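
(The thread doesn't spell the trick out. As a point of reference only, not this PR's approach: since CPython 3.7, bpo-30486 makes a cell's contents directly writable, which reduces the whole problem to one line.)

def cell_set(cell, value):
    # cell_contents became assignable in CPython 3.7; earlier versions
    # needed ctypes or code-object tricks like the ones in this PR.
    cell.cell_contents = value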

@llllllllll force-pushed the recursive-closure branch from fb27c76 to cb524a4 on May 25, 2017 19:12
@llllllllll (Contributor Author)

Okay, I just pushed a change; tests pass for me locally with PyPy and there are no more skips!

@pitrou (Member) commented May 25, 2017

Impressive! Now we can finally rename compress_closure and decompress_closure :-)

@llllllllll (Contributor Author)

I was trying to think of a better name, but decided I should probably just inline those functions since they each have a single call site. I also made some variable names more consistent: closure means the iterable of cells, closure_values means the cell_contents, and cell_count means the length of the closure or closure_values.

@pitrou (Member) commented May 26, 2017

This is excellent work. I'll merge soon if no further comments.

    return types.FunctionType(code, base_globals,
                              None, None, closure)
closure = (
    tuple(_make_cell(None) for _ in range(cell_count))
Contributor Author:

One last idea I had would be to actually create empty cells here, so that if there is a bug or issue with the setting of the cells you will get an error instead of a None object. This might make debugging easier in the future. We could make an empty cell with:

def _make_empty_cell():
    if False:
        value = None

    def inner():
        value

    return inner.__closure__[0]
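
As a usage note, the failure mode with such a cell is loud rather than silent:

cell = _make_empty_cell()
try:
    cell.cell_contents
except ValueError as e:
    print(e)  # "Cell is empty" -- much easier to debug than a stray None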

Member:

That sounds like a good idea.

Contributor Author:

Pushed.

@pitrou (Member) commented May 26, 2017

Last comment: should there be unit tests for cell_set(), _make_empty_cell() and fill_function()?

@llllllllll (Contributor Author)

Yeah, I can write some unit tests when I get home tonight.

@llllllllll force-pushed the recursive-closure branch from 951acd4 to 020be10 on May 30, 2017 00:56
@llllllllll force-pushed the recursive-closure branch from 020be10 to cf882c6 on May 30, 2017 01:22
@pitrou (Member) commented May 30, 2017

This is great! Thank you for this.

@mrocklin (Contributor)

Thanks @llllllllll , remind me to ping you on more issues in the future :)

@robertnishihara

Nice job @llllllllll and @mehrdadn, great work!

rgbkrk added a commit to rgbkrk/spark that referenced this pull request Jun 12, 2017
This brings in fixes and upgrades from the [cloudpickle](https://github.com/cloudpipe/cloudpickle) module, notably:

* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 22, 2017
## What changes were proposed in this pull request?

Based on apache#18282 by rgbkrk, this PR attempts to update to the current released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using stock cloudpickle.

Some notable changes:
* Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
* Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
* Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
* Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
* Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
* Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
* Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
* Support method descriptor (cloudpipe/cloudpickle#46)
* No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
* **Remove non-standard __transient__ check (cloudpipe/cloudpickle#110)** -- while we don't use this internally and have no tests or documentation for its use, downstream code may use __transient__, although it has never been part of the API; if we merge this we should include a note about it in the release notes.
* Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
* BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)

## How was this patch tested?

Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.

Author: Holden Karau <holden@us.ibm.com>
Author: Kyle Kelley <rgbkrk@gmail.com>

Closes apache#18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.