[BUG] Error with dask-expr using categorical dtype #17415
Open
Description
This is the same issue as rapidsai/dask-cuda#1408. Cross-posting here because it is more closely related to cuDF than to dask-cuda.
The following snippet works with DASK_DATAFRAME__QUERY_PLANNING=FALSE but fails with dask-expr:
import dask
import dask.dataframe as dd
import dask_cuda
from dask_cuda import LocalCUDACluster
from distributed import Client

def main(client):
    print(dask_cuda.__version__)
    print(dask.__version__)
    df = dd.from_dict({"qid": [1, 2, 1, 0, 2]}, npartitions=3)
    df.qid.astype("category").cat.as_known().compute()


if __name__ == "__main__":
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            with dask.config.set(
                {"array.backend": "cupy", "dataframe.backend": "cudf"}
            ):
                main(client)
Traceback (most recent call last):
  File "/home/jiamingy/workspace/xgboost_dev/XGBoostUtils/dask-issues/as-cat.py", line 20, in <module>
    main(client)
  File "/home/jiamingy/workspace/xgboost_dev/XGBoostUtils/dask-issues/as-cat.py", line 11, in main
    df.qid.astype("category").cat.as_known().compute()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev_125/lib/python3.11/site-packages/dask_expr/_categorical.py", line 83, in as_known
    return self.set_categories(categories.values)
                               ^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev_125/lib/python3.11/site-packages/cudf/core/index.py", line 1636, in values
    return self._column.values
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jiamingy/.anaconda/envs/xgboost_dev_125/lib/python3.11/site-packages/cudf/core/column/string.py", line 5873, in values
    raise TypeError("String Arrays is not yet implemented in cudf")
TypeError: String Arrays is not yet implemented in cudf
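For anyone comparing the two code paths, the legacy (non-dask-expr) behaviour mentioned above can also be selected from inside the script rather than from the shell. This is a minimal sketch, assuming that dask reads DASK_-prefixed environment variables when it is first imported, so the variable has to be set before any dask import. It only sets the variable; the dask import is left commented out so the block runs without a GPU environment.

```python
import os

# Assumption: dask picks up DASK_* environment variables at import time,
# so this must run before `import dask` / `import dask.dataframe`.
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"

# Import dask only after the variable is set, e.g.:
# import dask.dataframe as dd
```

With this in place at the very top of the reproducer, the script should take the same path as running it with DASK_DATAFRAME__QUERY_PLANNING=FALSE in the shell.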
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of cuDF install: conda
- dask-cuda: 24.12.00a12
- dask: 2024.10.0
- Python: 3.11.10