[Data] Avoid serializing datasource for Parquet read tasks #41712

bveeramani · 2023-12-07T23:49:27Z

Why are these changes needed?

#41118 added an include_paths parameter to ParquetDatasource. As part of that PR, we pass the self._include_paths attribute to Parquet read tasks:

ray/python/ray/data/datasource/parquet_datasource.py

Lines 385 to 394 in e862ec2

    
            ReadTask(
                lambda f=fragments: _read_fragments(
                    block_udf,
                    to_batches_kwargs,
                    default_read_batch_size_rows,
                    columns,
                    schema,
                    f,
                    self._include_paths,
                ),

Because the lambda references self, the datasource (self) gets serialized with each read task. Normally, this isn't an issue, but if you're working with a large dataset (like in the failing release test), then the datasource is slow to serialize.

This PR fixes the issue by removing the reference to self.
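To illustrate the capture behavior described above, here is a minimal sketch using only the standard library. `ToyDatasource`, `task_capturing_self`, `task_capturing_local`, and `captured_objects` are hypothetical names made up for this example; they are not part of Ray. The sketch inspects each lambda's closure cells to show what would travel with the function when it is serialized.

```python
class ToyDatasource:
    """Hypothetical stand-in for a datasource that holds large metadata."""

    def __init__(self, num_files, include_paths=False):
        # Pretend this list is expensive to serialize.
        self._metadata = [f"file-{i}.parquet" for i in range(num_files)]
        self._include_paths = include_paths

    def task_capturing_self(self):
        # The lambda body mentions `self`, so the whole object is captured.
        return lambda: self._include_paths

    def task_capturing_local(self):
        # Copy the attribute into a local first; only the bool is captured.
        include_paths = self._include_paths
        return lambda: include_paths


def captured_objects(fn):
    """Return the objects a function holds in its closure cells."""
    return [cell.cell_contents for cell in (fn.__closure__ or ())]


ds = ToyDatasource(num_files=100_000, include_paths=True)
bad = captured_objects(ds.task_capturing_self())
good = captured_objects(ds.task_capturing_local())

print(bad[0] is ds)  # the closure holds the datasource itself
print(good)          # the closure holds only the boolean flag
```

Both lambdas return the same value when called; the difference is only in what each one drags along when a closure-aware serializer walks its cells.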

Related issue number

Fixes #41390

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

c21

Thanks for the fix!

c21 · 2023-12-08T01:00:13Z

Is there a way to add a unit test to prevent the issue in the future?

c21 · 2023-12-08T01:03:21Z

This sounds like a release blocker to me. We should cherry-pick it to prevent a regression.

raulchen · 2023-12-08T04:22:08Z

python/ray/data/datasource/parquet_datasource.py

@@ -390,7 +391,7 @@ def get_read_tasks(self, parallelism: int) -> List[ReadTask]:
                        columns,
                        schema,
                        f,
-                        self._include_paths,
+                        include_paths,


We can make the ParquetDatasource class non-serializable to avoid this issue in the future.

bveeramani · 2023-12-08T08:00:30Z

> Is there a way to add a unit test to prevent the issue in the future?

@c21 I can't think of a good way to directly unit test this. If it's okay with you, I'll merge this PR now since it's a release blocker, and if we come up with an effective test, I'll add it in a follow-up PR?

Also, per @raulchen's suggestion, we could make ParquetDatasource non-serializable in the future.
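One way the non-serializable idea could look is sketched below. This is an assumption about a possible approach, not what Ray implements: `NonSerializableDatasource` is a hypothetical class, and the sketch only checks behavior against the standard library `pickle` module. Overriding `__reduce__` to raise makes an accidental capture fail loudly instead of silently slowing down task submission.

```python
import pickle


class NonSerializableDatasource:
    """Hypothetical datasource that refuses to be pickled."""

    def __init__(self, paths):
        self._paths = paths

    def __reduce__(self):
        # pickle consults __reduce__ when serializing an instance, so
        # raising here turns any attempt to pickle the object into an error.
        raise TypeError(
            f"{type(self).__name__} is not serializable; "
            "copy the attributes you need into local variables instead."
        )


ds = NonSerializableDatasource(paths=["a.parquet", "b.parquet"])

try:
    pickle.dumps(ds)
    raised = False
except TypeError as exc:
    raised = True
    message = str(exc)

print(raised)
```

A guard like this could also back a regression test: building read tasks would raise as soon as one of them tried to carry the datasource along.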

Initial commit

5ae675f

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

bveeramani requested review from ericl, scv119, c21, amogkam, scottjlee, raulchen, stephanie-wang and Zandew as code owners December 7, 2023 23:49

bveeramani marked this pull request as draft December 7, 2023 23:49

bveeramani assigned scottjlee Dec 7, 2023

bveeramani marked this pull request as ready for review December 7, 2023 23:54

c21 approved these changes Dec 8, 2023

c21 added the ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) label Dec 8, 2023

scottjlee approved these changes Dec 8, 2023

raulchen approved these changes Dec 8, 2023

c21 added the release-blocker P0 Issue that blocks the release label Dec 8, 2023

raulchen reviewed Dec 8, 2023

bveeramani merged commit 039d67b into ray-project:master Dec 8, 2023
15 of 16 checks passed

bveeramani deleted the fix-release-test branch December 8, 2023 08:01

bveeramani mentioned this pull request Dec 8, 2023

[Cherry-pick][Data] Avoid serializing datasource for Parquet read tasks (#41712) #41725

Merged

8 tasks
