`CategoricalAccessor.as_known()` produces incorrect dtype with "p2p" shuffle enabled
#11572
Labels: bug
Context: dask/dask-expr#659 recently changed the default shuffle method used by `series.unique()` from "tasks" to "p2p". I don't think there was anything wrong with that particular PR. However, the change exposed the fact that `.cat.as_known()` does not properly preserve the original dtype of the categories when "p2p" is enabled.

I don't think the problem is in "p2p" itself. Rather, the proper metadata seems to be lost before the data is shuffled. For "tasks", the proper metadata is recovered after the intermediate compute step. However, we aren't as lucky for "p2p", because we round-trip the data to PyArrow.