[Data] Avoid serializing datasource for Parquet read tasks (ray-project#41712)

ray-project#41118 added an include_paths parameter to ParquetDatasource. As part of that PR, we pass a self._include_paths attribute to Parquet read tasks. As a result, the datasource (self) gets serialized with each read task. Normally, this isn't an issue, but if you're working with a large dataset (like in the failing release test), then the datasource is slow to serialize. This PR fixes the issue by copying the attribute into a local variable, so the read task no longer references self.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
ShuN6211 · Dec 8, 2023 · 039d67b
1 parent 507d3d1
Showing 1 changed file with 3 additions and 2 deletions.
diff --git a/python/ray/data/datasource/parquet_datasource.py b/python/ray/data/datasource/parquet_datasource.py
@@ -375,11 +375,12 @@ def get_read_tasks(self, parallelism: int) -> List[ReadTask]:
                 )
             else:
                 default_read_batch_size_rows = PARQUET_READER_ROW_BATCH_SIZE
-            block_udf, to_batches_kwargs, columns, schema = (
+            block_udf, to_batches_kwargs, columns, schema, include_paths = (
                 self._block_udf,
                 self._to_batches_kwargs,
                 self._columns,
                 self._schema,
+                self._include_paths,
             )
             read_tasks.append(
                 ReadTask(
@@ -390,7 +391,7 @@ def get_read_tasks(self, parallelism: int) -> List[ReadTask]:
                         columns,
                         schema,
                         f,
-                        self._include_paths,
+                        include_paths,
                     ),
                     meta,
                 )