[Data] Avoid serializing datasource for Parquet read tasks (ray-project#41712)

ray-project#41118 added an include_paths parameter to ParquetDatasource. As part of that PR, we pass the self._include_paths attribute to Parquet read tasks.

As a result, the datasource (self) gets serialized with each read task. Normally, this isn't an issue, but if you're working with a large dataset (like in the failing release test), the datasource is slow to serialize.

This PR fixes the issue by removing the reference to self.
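
The capture behavior described above can be checked with a minimal sketch in plain Python. FakeDatasource, its methods, and captured_names are hypothetical stand-ins for illustration, not Ray's classes or API. The sketch only inspects which variables each lambda closes over; it does not run Ray's serializer.

```python
class FakeDatasource:
    """Hypothetical stand-in for a datasource that holds large state."""

    def __init__(self, include_paths: bool):
        self._include_paths = include_paths
        # Stands in for state that is slow to serialize on a large dataset.
        self._large_metadata = list(range(100_000))

    def make_task_capturing_self(self):
        # The lambda reads self._include_paths, so it closes over `self`
        # and keeps the whole datasource object reachable.
        return lambda: self._include_paths

    def make_task_capturing_local(self):
        # Copy the attribute into a local first; the lambda then closes
        # over the local value only, not over `self`.
        include_paths = self._include_paths
        return lambda: include_paths


def captured_names(fn):
    """Return the names of the variables a function closes over."""
    return fn.__code__.co_freevars


ds = FakeDatasource(include_paths=True)
before = ds.make_task_capturing_self()
after = ds.make_task_capturing_local()

print(captured_names(before))  # ('self',)
print(captured_names(after))   # ('include_paths',)
```

Anything that serializes the first lambda together with its closure has to serialize the datasource too, while the second one carries only a boolean.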

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
bveeramani authored Dec 8, 2023
1 parent 507d3d1 commit 039d67b
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions python/ray/data/datasource/parquet_datasource.py
@@ -375,11 +375,12 @@ def get_read_tasks(self, parallelism: int) -> List[ReadTask]:
                 )
             else:
                 default_read_batch_size_rows = PARQUET_READER_ROW_BATCH_SIZE
-            block_udf, to_batches_kwargs, columns, schema = (
+            block_udf, to_batches_kwargs, columns, schema, include_paths = (
                 self._block_udf,
                 self._to_batches_kwargs,
                 self._columns,
                 self._schema,
+                self._include_paths,
             )
             read_tasks.append(
                 ReadTask(
@@ -390,7 +391,7 @@ def get_read_tasks(self, parallelism: int) -> List[ReadTask]:
                         columns,
                         schema,
                         f,
-                        self._include_paths,
+                        include_paths,
                     ),
                     meta,
                 )