Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds SERDE for persisters #485

Merged
merged 1 commit into from
Dec 26, 2024
Merged

Adds SERDE for persisters #485

merged 1 commit into from
Dec 26, 2024

Conversation

skrawcz
Copy link
Contributor

@skrawcz skrawcz commented Dec 26, 2024

The persisters can now be pickled and unpickled.
This means that they can be properly serialized across process boundaries.

Changes

  • adds setstate and getstate to persisters implementations

How I tested this

  • locally for postgres
  • via integration tests for others

Notes

  • this assumes the client is fixed, so if someone wants another client type, they would have to override the serde functions.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Important

Add serialization and deserialization support to MongoDB, Redis, and PostgreSQL persisters using __getstate__ and __setstate__.

  • Behavior:
    • Add __getstate__ and __setstate__ methods to MongoDBBasePersister, RedisBasePersister, and PostgreSQLPersister for serialization.
    • Persisters can now be pickled and unpickled, enabling cross-process serialization.
  • Testing:
    • Add tests for serialization and deserialization using pickle in test_b_mongodb.py, test_b_redis.py, and test_postgresql.py.
    • Tests ensure state consistency after serialization and deserialization.
  • Assumptions:
    • Assumes fixed client types for deserialization; custom clients require overriding serde functions.

This description was created by Ellipsis for c5deb2d. It will automatically update as commits are pushed.

The persisters can now be pickled and unpickled.
This means that they can be properly serialized across
process boundaries.
Copy link

github-actions bot commented Dec 26, 2024

A preview of c2cc4bb is uploaded and can be seen here:

https://burr.dagworks.io/pull/485

Changes may take a few minutes to propagate. Since this is a preview of production, content with draft: true will not be rendered. The source is here: https://github.com/DAGWorks-Inc/burr/tree/gh-pages/pull/485/

Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to c2cc4bb in 1 minute and 37 seconds

More details
  • Looked at 198 lines of code in 6 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 drafted comments based on config settings.
1. burr/integrations/persisters/b_mongodb.py:136
  • Draft comment:
    self.client.address is not a valid attribute for MongoClient. Use self.client.HOST and self.client.PORT instead to get the connection details.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
  1. The comment is about the new getstate implementation. 2. Looking at pymongo docs, MongoClient.address is actually a valid attribute that returns a (host, port) tuple. 3. HOST and PORT are not standard attributes of MongoClient. 4. The code correctly uses address[0] and address[1] to get host and port from the tuple.
    Could there be a version of pymongo where address isn't available? Could HOST/PORT be better practice that I'm unaware of?
    The address attribute is well-documented in pymongo and returns exactly what we need. The comment appears to be incorrect about HOST/PORT.
    The comment should be deleted as it is incorrect - self.client.address is a valid and appropriate way to get the connection details.
2. burr/integrations/persisters/b_redis.py:165
  • Draft comment:
    Consider handling the case where connection_pool is not present to ensure complete state serialization.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
3. burr/integrations/persisters/postgresql.py:251
  • Draft comment:
    Consider handling the case where info is not present to ensure complete state serialization.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_JiWlZwz3bmJEL2fu


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Contributor

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, think we can make it easier by abstracting a bit but I think it's fine for now

burr/integrations/persisters/b_mongodb.py Show resolved Hide resolved
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on c5deb2d in 1 minute and 12 seconds

More details
  • Looked at 224 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 drafted comments based on config settings.
1. burr/integrations/persisters/b_mongodb.py:141
  • Draft comment:
    Accessing self.client.address[0] and self.client.address[1] assumes the address attribute is a tuple with at least two elements. Consider adding checks to ensure these attributes exist to avoid potential errors.
  • Reason this comment was not posted:
    Comment did not seem useful.
2. burr/integrations/persisters/b_mongodb.py:158
  • Draft comment:
    The __setstate__ method assumes connection_params always contains uri and port. Consider adding checks to handle missing keys to prevent potential KeyError.
  • Reason this comment was not posted:
    Comment was on unchanged code.
3. burr/integrations/persisters/b_redis.py:171
  • Draft comment:
    The get_connection_params method assumes connection_pool.connection_kwargs contains host, port, db, and password. Consider adding checks to handle missing keys to prevent potential KeyError.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The Redis connection is created with explicit parameters in from_values() and init(). get_connection_params() is only called on existing connections. The Redis client would fail earlier if these parameters were missing. The connection_pool.connection_kwargs should always have these values since they were provided during connection creation.
    I could be wrong about the Redis client implementation - maybe it allows connections without some of these parameters? Also, what if someone subclasses and overrides default_client() with a different client?
    Even if some parameters are optional in Redis, the code explicitly provides all these values during connection creation. If someone overrides default_client(), they should also override get_connection_params() as indicated by the comment on line 179.
    The comment suggests defensive programming that isn't necessary given the class design and usage patterns. The code is correct as is.
4. burr/integrations/persisters/b_redis.py:187
  • Draft comment:
    The __setstate__ method assumes connection_params always contains necessary keys for Redis connection. Consider adding checks to handle missing keys to prevent potential KeyError.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The connection_params dict is created by get_connection_params() which explicitly returns a dict with all required keys. These params come from an existing Redis connection, so they must have been valid to begin with. The suggestion to add checks seems unnecessary since the data structure is controlled internally.
    I could be wrong if there are subclasses that override get_connection_params() with different keys. However, the default_client() method suggests Redis is expected.
    Even with subclasses, they would need to provide compatible connection params since they're using default_client() which expects Redis connection parameters. The architecture enforces this contract.
    The comment should be deleted because the connection parameters are internally controlled and guaranteed to have the required keys through get_connection_params().
5. burr/integrations/persisters/postgresql.py:272
  • Draft comment:
    The __setstate__ method assumes connection_params always contains necessary keys for PostgreSQL connection. Consider adding checks to handle missing keys to prevent potential KeyError.
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_znu5DZSQdhleNbf9


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

burr/integrations/persisters/postgresql.py Show resolved Hide resolved
@skrawcz skrawcz merged commit 3b70072 into main Dec 26, 2024
22 checks passed
@skrawcz skrawcz deleted the persister_serde branch December 26, 2024 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants