Writing Spatially-Partitioned GeoParquet #789

Open
3 tasks
kylebarron opened this issue Sep 22, 2024 · 0 comments

Comments

@kylebarron
Member

We should have an option to spatially partition data before writing to GeoParquet. For now, we should do it entirely in memory; that restriction can be relaxed later. This would also be a good precursor to a DataFusion-based extension further down the line.

Steps:

  • Sorting:
    • We can use the Sort trait of geo-index (e.g. https://github.com/kylebarron/geo-index/blob/main/src/rtree/sort/str.rs)
    • We may want to relax the bounds on Sort. Right now it requires the total bounds of all input boxes before sorting. But only HilbertSort requires the global input bounding box. STRSort doesn't use it; it only needs the number of items and the node size. It would be nice to update that API so that STRSort doesn't need that input. (Maybe refactor to a different method, outside the trait, and then we can call STRSort::sort() manually without going through the trait)
    • It would also be nice to split the sorting and have a lower level API to sort only the raw boxes and not the higher level nodes, because in GeoParquet we don't have higher-level nodes.
    • Note here that we want to sort across multiple chunks, not just within each input chunk. So we'll presumably want to allocate an arange (one index per row) covering all rows across all chunks.
  • Partitioning:
    • Then we want to handle a "chunked take"/partitioning/rechunking across all those input chunks, given some
  • Writing the bounding box column:
    • We should have a way to write the bounding box column of GeoParquet 1.1. Should we only write this when we're writing WKB geometries? It's unclear: the bounding box column can also be beneficial when the native geometries are very large, since the bbox should be much smaller than a big polygon.
    • Note also that the result of geo-index's Sort should give you the sorted bounding boxes. So you shouldn't need to compute the bboxes again.
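
A minimal, std-only sketch of how the three steps could fit together, assuming bounding boxes stored as `[minx, miny, maxx, maxy]`. It does not use the geo-index or geoarrow APIs: the hand-rolled Hilbert key only stands in for geo-index's `Sort`, and `spatial_partitions`, `Partition`, `hilbert_index`, and the fixed `rows_per_partition` are hypothetical names and parameters chosen for illustration.

```rust
/// Bounding box as [minx, miny, maxx, maxy] (assumed layout for this sketch).
type BBox = [f64; 4];

/// Hilbert curve index of cell (x, y) on an n x n grid, n a power of two.
fn hilbert_index(n: u64, mut x: u64, mut y: u64) -> u64 {
    let mut d = 0;
    let mut s = n / 2;
    while s > 0 {
        let rx = u64::from((x & s) > 0);
        let ry = u64::from((y & s) > 0);
        d += s * s * ((3 * rx) ^ ry);
        // Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0 {
            if rx == 1 {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s /= 2;
    }
    d
}

/// Smallest box covering both inputs.
fn union(a: BBox, b: BBox) -> BBox {
    [a[0].min(b[0]), a[1].min(b[1]), a[2].max(b[2]), a[3].max(b[3])]
}

/// Hypothetical output unit: the (chunk, row) pairs to take, plus their covering box.
struct Partition {
    rows: Vec<(usize, usize)>,
    bbox: BBox,
}

fn spatial_partitions(chunks: &[Vec<BBox>], rows_per_partition: usize) -> Vec<Partition> {
    assert!(rows_per_partition > 0);
    // The "arange": one (chunk, row) entry per row across all input chunks.
    let mut rows: Vec<(usize, usize)> = chunks
        .iter()
        .enumerate()
        .flat_map(|(c, chunk)| (0..chunk.len()).map(move |r| (c, r)))
        .collect();

    // Global bounds; a Hilbert sort needs them to map centers onto the grid.
    let empty = [f64::INFINITY, f64::INFINITY, f64::NEG_INFINITY, f64::NEG_INFINITY];
    let total = rows.iter().fold(empty, |acc, &(c, r)| union(acc, chunks[c][r]));

    const N: u64 = 1 << 16;
    let scale = |v: f64, lo: f64, hi: f64| -> u64 {
        if hi > lo { ((v - lo) / (hi - lo) * (N - 1) as f64) as u64 } else { 0 }
    };

    // Sort globally (not per chunk) by the Hilbert index of each box center.
    rows.sort_by_cached_key(|&(c, r)| {
        let b = chunks[c][r];
        let cx = (b[0] + b[2]) / 2.0;
        let cy = (b[1] + b[3]) / 2.0;
        hilbert_index(N, scale(cx, total[0], total[2]), scale(cy, total[1], total[3]))
    });

    // "Chunked take": cut the sorted index into partitions and reuse the
    // already-known boxes to derive each partition's covering bbox.
    rows.chunks(rows_per_partition)
        .map(|part| Partition {
            rows: part.to_vec(),
            bbox: part.iter().fold(empty, |acc, &(c, r)| union(acc, chunks[c][r])),
        })
        .collect()
}
```

Each `Partition::rows` entry says which row of which input chunk to take, so the actual rechunking of geometry and attribute columns can follow the same index. Swapping the Hilbert key for an STR-style sort would remove the need for the `total` bounds pass, which is the relaxation discussed above.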

cc @DahnJ
