-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PyDot objects are not pickleble #217
Comments
i started working on this, but i didnt reach anything useful. the main problem is the very dynamic nature of how Dot methods are generated. it's easy enough to write a at this point i gave up, i think a reorganization of pydot code is necessary to be able to pickle |
with this simple change i am able to serialize objects using
it would be helpful if we can include this in pydot |
This commit wraps up a series of changes aiming to provide full support for pickle serialization of pydot objects. The last few commits ensured that dynamically defined methods are bound to their classes instead of their instances. With that in place, there is no more need to limit serialization of objects dictionaries to only their `obj_dict` attribute. Therefore, this commit removes the custom `__getstate__` and `__setstate__` methods. This makes pickle return to its normal behavior of serializing the entire dictionary, which means that an unpickled object will now have all the same attributes as its original. This resolves errors like this from unpickled objects: AttributeError: 'Dot' object has no attribute 'prog' AttributeError: 'Dot' object has no attribute 'shape_files' Together with the last few commits, issues like pydot#127, pydot#216 and pydot#217 should now be fixed for pickle and similar tools like dill. Background: Git history shows that in the Initial import of 2007 (pydot 0.9.10, commit c0233c6), `__getstate__` and `__setstate__` selected the entire instance's dictionary and from that deleted the convenience methods. In 2010 (between pydot 1.0.2 and 1.0.4, commit b125c5d) this was changed to selecting only the `obj_dict` attribute of the instance's dictionary, which did not include the convenience methods to start with. That change was obviously related to issue pydot#16 that was closed a few days before it. It seems that the idea was to move towards keeping all instance-related data under a single `obj_dict` attribute and not have any other important data in attributes directly on the instance. However, currently, a newly created `Dot` instance contains, besides the `obj_dict`, the attributes `prog` and `shape_files`. Both are set during instantiation (`Dot.__init__()`), and as such will normally not be recreated on unpickling [[1]]. Still, they seem important enough that they should actually get pickled, which is what this commit accomplishes. The `prog` attribute is just a short string, and `shape_files` just a list of file paths, so it seems unlikely there could ever have been any real concerns about pickling them. Instances of `Graph`, `Subgraph`, `Cluster`, `Node` and `Edge` do not have dictionary entries other than `obj_dict`, at least not on instantiation. [1]: https://docs.python.org/3.9/library/pickle.html#pickling-class-instances Some examples sizes for a `Dot` object with 1 named node serialized by dill, an alternative to pickle: - Before the previous commit, when instances still had all those convenience methods bound to them, but `__getstate__` would only pickle `obj_dict`: 351 bytes. - Same, but then what happens when `__getstate__` would not exist or a `__reduce__` method is added that does not call `__getstate__` and the entire object dictionary gets pickled: 22986 bytes. That shows the weight of all those methods that get packed then. - With this commit, now that the previous commit removed all those methods and we can safely start pickling the entire dictionary again: 413 bytes. The extra 62 bytes compared to the first case reflect the object attributes `prog` and `shapefiles` that get pickled now as well. (Measured using `len(dill.dumps(g))`. Dill was used because pickle was not able to pickle pydot objects before. It can now. For the final result, I checked that pickle gave the same result as dill.) Further notes: - The ChangeLog will mention that pickled objects created with the previous code are incompatible with the new code and vice versa. This is not unheard of with pickle in general [[2]]. Class versioning and conversion [[3]] could solve this, but my guess is that most use cases have no need for this. Otherwise please report or provide a patch. [2]: https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html#h_old_pickles_look_like_old_code [3]: https://docs.python.org/3.9/library/pickle.html#what-can-be-pickled-and-unpickled - Because the Python documentation lists `__getstate__` and `__setstate__` only in relation with pickle [[4]], I expect that this change will have not have any side effects on other uses of pydot. [4]: https://docs.python.org/3.9/genindex-_.html - This commit will cause Pylint 2.6.0 to emit 9 errors like: pydot.py:520:19: E1101: Instance of 'Common' has no 'obj_dict' member (no-member) Actually, this reflects an existing situation not caused by this commit. Pylint just did not report it as strongly before, because it saw `obj_dict` get assigned to from `__setstate__`. Before this commit, there was only this 1 warning: pydot.py:431:8: W0201: Attribute 'obj_dict' defined outside __init__ (attribute-defined-outside-init) I considered adding an `__init__()` method to class `Common` just to set `obj_dict` to an empty dictionary, but decided it would not be good to do that just for Pylint. None of the subclasses currently call `super().__init__()` anyway and the `Common` docstring clearly states that it should not be used directly. There seems to be room for unifying the `__init__()` methods of the various classes and some deduplication to `Common.__init__()`. That would bring a meaningful resolution to these Pylint errors. However, such changes deserve to be discussed separately. - Pickle-related tests were updated and extended. Notes: - The main pickle tests are now performed on a `Dot` instance instead of on a `Graph`, because `Dot` is the only class that has both the dynamically generated `create_*`/`write_*` and `get_*`/`set_*` methods. - Regarding the removal of the use of a tuple in the initialization of one of the edges: Support for that was removed in 2016 (pydot 1.2.4, commit de30c14) and it would lead to a `TypeError` now that we start requesting a string representation. If tuple support is ever restored again, it should have its own test anyway.
Hi @sandrotosi, First of all, thank you for maintaining the Debian package for pydot. I use Debian myself, so I have used your package as well. I am currently preparing a pydot 1.4.2 bugfix release and after that will open the master branch for 2.0.0 development. I doubt if we can make a final 2.0.0 release before the next Debian 11 bullseye freeze early 2021 though, because there is still a lot of work to do and I want to combine all the API breaking changes in one release and not have to jump to 3.0.0 within the year again. But we will see, there is still time. About the pickle changes I am proposing (see below), I am afraid those are too breaking/risky for a mere bugfix release 1.4.2, so I am only planning to merge them for 2.0.0. However, as it is now, the changes are still Python 2 compatible and could apply on the 1.x branch. OTOH, I see you already patched the Debian package with the change you proposed above, so maybe you don't mind keeping that in until 2.0.0 comes out.
And that's what I did. Please check out #242. I think it should solve your #216 and #217, as well as the earlier #127. I had a look at your solution (comment above) as well. It works because the tuple returned by
Still, I hope you can also have a look at my proposal in #242. It does away with all those instance-bound methods, |
This commit wraps up a series of changes aiming to provide full support for pickle serialization of pydot objects. The last few commits ensured that dynamically defined methods are bound to their classes instead of their instances. With that in place, there is no more need to limit serialization of objects dictionaries to only their `obj_dict` attribute. Therefore, this commit removes the custom `__getstate__` and `__setstate__` methods. This makes pickle return to its normal behavior of serializing the entire dictionary, which means that an unpickled object will now have all the same attributes as its original. This resolves errors like this from unpickled objects: AttributeError: 'Dot' object has no attribute 'prog' AttributeError: 'Dot' object has no attribute 'shape_files' Together with the last few commits, issues like pydot#127, pydot#216 and pydot#217 should now be fixed for pickle and similar tools like dill. Background: Git history shows that in the Initial import of 2007 (pydot 0.9.10, commit c0233c6), `__getstate__` and `__setstate__` selected the entire instance's dictionary and from that deleted the convenience methods. In 2010 (between pydot 1.0.2 and 1.0.4, commit b125c5d) this was changed to selecting only the `obj_dict` attribute of the instance's dictionary, which did not include the convenience methods to start with. That change was obviously related to issue pydot#16 that was closed a few days before it. It seems that the idea was to move towards keeping all instance-related data under a single `obj_dict` attribute and not have any other important data in attributes directly on the instance. However, currently, a newly created `Dot` instance contains, besides the `obj_dict`, the attributes `prog` and `shape_files`. Both are set during instantiation (`Dot.__init__()`), and as such will normally not be recreated on unpickling [[1]]. Still, they seem important enough that they should actually get pickled, which is what this commit accomplishes. The `prog` attribute is just a short string, and `shape_files` just a list of file paths, so it seems unlikely there could ever have been any real concerns about pickling them. Instances of `Graph`, `Subgraph`, `Cluster`, `Node` and `Edge` do not have dictionary entries other than `obj_dict`, at least not on instantiation. [1]: https://docs.python.org/3.9/library/pickle.html#pickling-class-instances Some examples sizes for a `Dot` object with 1 named node serialized by dill, an alternative to pickle: - Before the previous commit, when instances still had all those convenience methods bound to them, but `__getstate__` would only pickle `obj_dict`: 351 bytes. - Same, but then what happens when `__getstate__` would not exist or a `__reduce__` method is added that does not call `__getstate__` and the entire object dictionary gets pickled: 22986 bytes. That shows the weight of all those methods that get packed then. - With this commit, now that the previous commit removed all those methods and we can safely start pickling the entire dictionary again: 413 bytes. The extra 62 bytes compared to the first case reflect the object attributes `prog` and `shapefiles` that get pickled now as well. (Measured using `len(dill.dumps(g))`. Dill was used because pickle was not able to pickle pydot objects before. It can now. For the final result, I checked that pickle gave the same result as dill.) Further notes: - The ChangeLog will mention that pickled objects created with the previous code are incompatible with the new code and vice versa. This is not unheard of with pickle in general [[2]]. Class versioning and conversion [[3]] could solve this, but my guess is that most use cases have no need for this. Otherwise please report or provide a patch. [2]: https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html#h_old_pickles_look_like_old_code [3]: https://docs.python.org/3.9/library/pickle.html#what-can-be-pickled-and-unpickled - Because the Python documentation lists `__getstate__` and `__setstate__` only in relation with pickle [[4]], I expect that this change will have not have any side effects on other uses of pydot. [4]: https://docs.python.org/3.9/genindex-_.html - This commit will cause Pylint 2.6.0 to emit 9 errors like: pydot.py:520:19: E1101: Instance of 'Common' has no 'obj_dict' member (no-member) Actually, this reflects an existing situation not caused by this commit. Pylint just did not report it as strongly before, because it saw `obj_dict` get assigned to from `__setstate__`. Before this commit, there was only this 1 warning: pydot.py:431:8: W0201: Attribute 'obj_dict' defined outside __init__ (attribute-defined-outside-init) I considered adding an `__init__()` method to class `Common` just to set `obj_dict` to an empty dictionary, but decided it would not be good to do that just for Pylint. None of the subclasses currently call `super().__init__()` anyway and the `Common` docstring clearly states that it should not be used directly. There seems to be room for unifying the `__init__()` methods of the various classes and some deduplication to `Common.__init__()`. That would bring a meaningful resolution to these Pylint errors. However, such changes deserve to be discussed separately. - Pickle-related tests were updated and extended. Notes: - The main pickle tests are now performed on a `Dot` instance instead of on a `Graph`, because `Dot` is the only class that has both the dynamically generated `create_*`/`write_*` and `get_*`/`set_*` methods. - Regarding the removal of the use of a tuple in the initialization of one of the edges: Support for that was removed in 2016 (pydot 1.2.4, commit de30c14) and it would lead to a `TypeError` now that we start requesting a string representation. If tuple support is ever restored again, it should have its own test anyway.
@sandrotosi: This is in reply to your report in #256 that with newly released pydot 1.4.2 you get this error when running the unittests:
This test does not fail on plain pydot 1.4.2 as we distribute it here on GitHub and on PyPI.org, but only after applying the additional patch for the Debian package I reviewed that patch a few months ago (comment of 2020-10-27 above), but did not get any reply. As I wrote back then, full pickle support is planned only for pydot 2.0.0 through PR #242. In the mean time, I see 4 different solutions for you:
I am not sure which option is best for you. It depends on how you weigh the risk of potential regressions against the benefit of having more or less serialization support. Please be aware that I made a conscious choice to not incorporate any of these patches in pydot 1.4.2 yet, because I wanted it to be a bug-fix release with minimal, targeted bug fixes only, to reduce the risk of regressions. For at least the next few weeks, my time will be limited due to other obligations. I hope to still be able to answer a quick question now and then, but doing any further testing or investigation (like I did for this comment) will not be possible in the next few weeks. Hope you understand and that you can find a good solution. Please let me know. |
@sandrotosi Btw, if you choose that 3rd option (changing the test suite to use dill instead of pickle), you may need to add dill to your build dependencies. |
that comes from #216:
any idea on how to address this is helpful
The text was updated successfully, but these errors were encountered: