JavaScript GeoArrow Module Proposal #283
Comments
Stupid question but is implementing something like
👋 Hi @ibesora , thanks for chiming in
Yes... but that's why I plan to publish all the above modules as standalone NPM packages. So if you only want the I/O, you can only import Effectively it just makes you choose which sets of functionality you want when adding the dependency. It's a downside of WebAssembly, but I think there are more than enough upsides to still warrant the work.
There are differences of opinion in the community on this topic, but my own opinion is that it's not the best use of engineering effort. Parquet is super complex with extensive data types (e.g. recursive nested lists and structs), varied encodings (e.g. run length encoding, delta encoding), and an array of available compressions. As of when I wrote
With Wasm, I'm able to reuse rock-solid libraries. No one has ever made an issue in parquet-wasm with a Parquet file that failed to read, because the Rust implementation of Parquet is really solid. And on top of that, we get really good performance for free. When I tested against loaders.gl's implementation in April 2022, the Wasm version was 480x faster. That's not intended to be disrespectful to loaders.gl and Ib's efforts... just that this is really, really hard!
Of course you can write a really efficient JS GeoParquet library with enough engineering resources, but I'm trying to bootstrap an ecosystem of GeoArrow with just myself and mostly in free time. And by putting as much as possible in Rust, we can reuse the exact same core code in Wasm and in Python, for free.
Given the presence of Parquet IO (with a fair few differences in priority - e.g. parquet-wasm is obviously not intended to have a Python binding) in both this repo and parquet-wasm, is it still a worthwhile goal to delegate to parquet-wasm? I get the sense that the cross-crate interaction is proving to be too much of an impediment (that or the API surfaces are just too different), or is the current situation one of 'implement separately, unify when the dust settles'?
Thanks for chiming in @H-Plus-Time! This has been on my mind recently, and I really don't have any conclusions, so any suggestions are welcomed. I think the core problem is I wish to have Parquet support that is
How to reuse code across those is unclear, especially with a tangled web of dependencies. At this point parquet-wasm is intricately tied to wasm-bindgen. And its Arrow table object is an arrow-wasm table. In this repo I'm exploring how Parquet works with object store because for Python, remote support for e.g. S3 is crucial. Maybe I was wrong in kylebarron/parquet-wasm#392 (comment) and having an object-store based implementation in JS will be easiest? Or the GeoParquet reader uses the Rust implementation from this repo instead of from parquet-wasm.
Agreed, I wouldn't use it outside the js geoparquet-wasm subcrate (the js dir). Both the wasm and python targeting parts necessarily have their own binding-specific bits, that's honestly the most useful part of parquet-wasm (that and the quasi-ObjectStore).
I wonder about this - am I right in figuring that going from an ArrowTable to a GeoTable (or vice versa) would be relatively low-cost? I can kind of see how one would do from_arrow_wasm_table in the outer GeoTable (sort of, it does look like the build_arrow_schema function requires a builder, though I suppose setting parse_geoparquet_metadata to pub would be sufficient when dealing with an already finalized table). Since most of the arrow-wasm types have bidirectional From impls for their equivalent types, might be able to get away with it without too much extra code.
The streams would be another kettle of fish - I suspect that a more generic version of SharedIO (also a... much better name :| ), with as much of AsyncParquetTable's behaviour shoved into it as possible, would be part of that. Ignoring all the custom IO bits, that one top level reader struct would be quite acceptable to duplicate (since it's impossible to involve traits or generics in wasm-bindgen'd structs) - <50 lines of duplication.
Yeah, I didn't think deeply enough about it at the time - for this proposal to work, the bulk of the IO code really needs to come from neither geoarrow-rs nor parquet-wasm, object-store is the way. I think with a combination of object-store-wasm, and avoiding the extra HEAD request, it should be feasible to pull in parquet-wasm as a dep of geoparquet-wasm. I should have a repo up for that last part today (just as soon as I get off this paper straw of a connection (plane wifi)).
That'll always be O(1). As a note, I do want to rework the GeoTable a bit to relax the geometry restriction and allow it to have either no geometry or multiple geometry columns, which might bring it to be just a
I think it's probably fine for geoparquet-wasm to return the same general arrow-wasm object as parquet-wasm. As long as it contains the extension metadata you'll still be able to see it represents a geometry.
JavaScript GeoArrow Module Proposal
The strength of Arrow is in its interoperability, and therefore I think it's worthwhile to discuss how to ensure all the pieces around GeoArrow in JavaScript fit together really well.
This is a corollary to the Python GeoArrow Module Proposal but focused on GeoArrow interoperability in JavaScript and WebAssembly. I don't know anyone doing GeoArrow-Wasm stuff in C, so this will focus on my efforts in Rust and TypeScript. Unlike in Python, there aren't other people currently working on JavaScript GeoArrow infrastructure, so this is a manifesto to solidify my ideas.
WebAssembly limitations
WebAssembly is sandboxed, which means that Wasm code can only access and modify memory within its own memory space. So Wasm code cannot access JavaScript objects directly.
This also means that two Wasm modules can't share memory. So if you have one Wasm-based NPM library that loads GeoParquet to GeoArrow and another Wasm-based NPM library that implements spatial operations on GeoArrow, there must be a copy from the first module's memory space into JavaScript and then into the second module's memory space.
This means that grouping Wasm functionality together into a single module is more performant, as I/O and operations can be done in a single memory space. This runs up against bundle size: JavaScript bundlers are able to tree-shake JavaScript code, but they can't tree-shake a prebuilt Wasm binary. Instead, the original Rust would have to be recompiled, excluding unwanted functions.
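The isolation described above can be illustrated without compiling any Wasm at all. The following is a minimal sketch, assuming two standalone `WebAssembly.Memory` objects as stand-ins for the linear memories of two separately instantiated modules; the names `readerMemory` and `opsMemory` are hypothetical and do not come from any of the libraries discussed here.

```typescript
// Two independent linear memories, standing in for two separately
// instantiated Wasm modules (hypothetical "reader" and "ops" modules).
const readerMemory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB
const opsMemory = new WebAssembly.Memory({ initial: 1 });

// Pretend the reader module produced four float64 values at byte offset 0.
const produced = new Float64Array(readerMemory.buffer, 0, 4);
produced.set([1.5, 2.5, 3.5, 4.5]);

// Neither memory can address the other, so JS has to copy the bytes
// from the first memory space into the second.
const received = new Float64Array(opsMemory.buffer, 0, 4);
received.set(produced);

// After the copy the two regions are unrelated: a later write on one side
// is not observed on the other.
produced[0] = 99;
console.log(received[0]); // still the copied value
```

With both pieces of functionality compiled into one module, the second step would operate on the first step's output in place, which is the performance argument made above.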
The solution I'm gravitating towards is to have a variety of NPM libraries, described in this document, where I/O or operations are distributed both as their own libraries but also in a "kitchen sink" build, which contains everything at the cost of a larger bundle size. Advanced users can compile custom Wasm binaries from the Rust source, with only the desired functionality.
Goals
Similar goals to the Python module proposal:
- geoarrow-wasm and largely reuse its JS bindings without having to create ones from scratch
- convex_hull should always return a PolygonArray instead of a generic GeometryArray that the user can't "see into" statically.
Data Movement
In contrast to Python, which is able to share the same memory space with native code, data movement between Wasm and JS is not always free, because they occupy two separate memory spaces. JS can see into Wasm memory but not the opposite. This means that data movement from Wasm -> JS can be zero-copy, but JS -> Wasm requires a copy.
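The asymmetry between the two directions can be seen with plain typed arrays. This is a sketch under stated assumptions: the byte offsets are arbitrary, and the `wasmSide` view merely stands in for data that compiled Wasm code would have written; no real module or allocator is involved.

```typescript
const memory = new WebAssembly.Memory({ initial: 1 });

// Wasm -> JS: a typed array constructed over memory.buffer is a view,
// not a copy. `wasmSide` stands in for values written by Wasm code.
const wasmSide = new Float64Array(memory.buffer, 0, 3);
wasmSide.set([10, 20, 30]);
const jsView = new Float64Array(memory.buffer, 0, 3);
wasmSide[1] = 21; // a later write inside Wasm memory...
console.log(jsView[1]); // ...is visible through the JS view

// JS -> Wasm: data that lives on the JS heap has to be copied into
// the Wasm memory space before Wasm code could read it.
const jsOwned = new Float64Array([1, 2, 3]);
const destination = new Float64Array(memory.buffer, 64, 3); // arbitrary offset
destination.set(jsOwned);
jsOwned[0] = -1; // changes to the JS-owned array are not reflected in Wasm memory
console.log(destination[0]);
```

In a real module the destination offset would come from the module's own allocator rather than a hard-coded constant.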
The easiest data movement in JS is to use Arrow IPC buffers to move serialized data between JS and Wasm, but this has a number of drawbacks:
- Data chunks need to be copied into a new ArrayBuffer, a full copy of the dataset, before the copy into/out of Wasm.
- Data chunks in JS memory are references onto the same backing ArrayBuffer (from the original IPC buffer), which means a Data instance can't be transferred to a WebWorker without a copy.
The most performant data movement in JS is to directly view data from Wasm memory and conversely for JS to write array data directly into the Wasm memory space. I've been working on this in arrow-js-ffi and it's a crucial part of Arrow interoperability in Wasm. This solves both of the downsides of Arrow IPC, as it avoids an extra data copy and the Data instances in JS have a backing buffer not shared with any other Data.
Module hierarchy
Here's a quick (messy) picture of the dependency graph. An arrow points to the library it depends on, so here geoarrow-wasm depends on geoarrow-rs.
The most important part is that there are no dependency cycles.
Rust Core (non-Wasm)
geoarrow-rs is the Rust core with all core GeoArrow functionality. All algorithms, core I/O, etc. are implemented in this crate so that as much as possible can be shared among pure-Rust, JS, and Python.
This crate does not on its own have any JS bindings. All JS functionality is exported in separate crates/packages below.
- geoarrow
Arrow-Wasm Core
Shared arrow definitions and FFI functionality to/from Arrow JS.
- arrow-wasm
- arrow crate.
- Table, Vector, Data, DataType.
Computational library
Standalone library for spatial operations on GeoArrow arrays, without any I/O except for Arrow IPC and FFI. The slim compilation feature of geoarrow-wasm.
- geoarrow-wasm
- @geoarrow/geoarrow-wasm-slim
- geoarrow-rs for computational algorithms to wrap for JS
- arrow-wasm for JS bindings for Arrow FFI with Arrow JS
- full compilation feature, described below under "Kitchen Sink"
I/O Wasm libraries
There should exist standalone libraries with a minimal bundle size to read and write various file formats to/from GeoArrow.
parquet-wasm
Standalone library to read and write Parquet files in Wasm.
- parquet-wasm
- parquet-wasm
- arrow-wasm for JS bindings for Arrow FFI with Arrow JS
geoparquet-wasm
Standalone library to read and write GeoParquet files in Wasm.
- geoparquet-wasm
- @geoarrow/geoparquet-wasm
- parquet-wasm for JS bindings to read/write Parquet
- geoarrow-rs to encode/decode WKB geometries to/from GeoArrow
- readGeoParquet: wraps parquet-wasm's readParquet, converting WKB column to GeoArrow before returning an arrow-wasm Table instance
- writeGeoParquet: wraps parquet-wasm's writeParquet, converting GeoArrow in the Table input to WKB before passing on to writeParquet.
- readGeoParquetStream: wraps parquet-wasm's readParquetStream
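
To give a feel for the WKB-to-GeoArrow step that readGeoParquet is described as performing, here is a rough sketch in plain TypeScript. It is not how geoarrow-rs implements the conversion: it handles only 2D WKB Points, and it targets the interleaved xy coordinate buffer that GeoArrow permits for point arrays. The function names `wkbPointsToInterleavedXY` and `encodeWkbPoint` are hypothetical helpers invented for this illustration.

```typescript
/** Decode 2D WKB Point geometries into one interleaved [x0, y0, x1, y1, ...] buffer. */
function wkbPointsToInterleavedXY(wkbs: Uint8Array[]): Float64Array {
  const coords = new Float64Array(wkbs.length * 2);
  wkbs.forEach((wkb, i) => {
    const view = new DataView(wkb.buffer, wkb.byteOffset, wkb.byteLength);
    // Byte 0 is the byte-order flag: 1 = little endian, 0 = big endian.
    const littleEndian = view.getUint8(0) === 1;
    // Bytes 1..4 hold the geometry type; 1 means Point.
    const geometryType = view.getUint32(1, littleEndian);
    if (geometryType !== 1) {
      throw new Error(`expected WKB Point (type 1), got type ${geometryType}`);
    }
    // Bytes 5..20 hold x and y as float64 values.
    coords[2 * i] = view.getFloat64(5, littleEndian);
    coords[2 * i + 1] = view.getFloat64(13, littleEndian);
  });
  return coords;
}

/** Hypothetical helper: encode a 2D point as little-endian WKB (21 bytes). */
function encodeWkbPoint(x: number, y: number): Uint8Array {
  const bytes = new Uint8Array(21);
  const view = new DataView(bytes.buffer);
  view.setUint8(0, 1); // little endian
  view.setUint32(1, 1, true); // geometry type: Point
  view.setFloat64(5, x, true);
  view.setFloat64(13, y, true);
  return bytes;
}

const wkbColumn = [encodeWkbPoint(1, 2), encodeWkbPoint(-3.5, 4.25)];
const pointCoords = wkbPointsToInterleavedXY(wkbColumn);
console.log(Array.from(pointCoords));
```

The point of the conversion is that the result is a single contiguous numeric buffer rather than one byte blob per geometry, which is what lets it be viewed or copied across the Wasm boundary as ordinary Arrow data.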
flatgeobuf-wasm
Standalone library to read and write FlatGeobuf files in Wasm.
- flatgeobuf-wasm
- @geoarrow/flatgeobuf-wasm
- arrow-wasm for JS bindings for Arrow FFI with Arrow JS
- geoarrow-rs to read/write FlatGeobuf to/from GeoArrow
- readFlatGeobuf: parses FlatGeobuf buffer, returning an arrow-wasm Table instance
- writeFlatGeobuf: creates a FlatGeobuf buffer from an arrow-wasm Table instance.
- readFlatGeobufStream: generates an async iterable of arrow-wasm RecordBatch from a remote FlatGeobuf file
The kitchen sink
The full compilation feature of geoarrow-wasm.
- geoarrow-wasm
- @geoarrow/geoarrow-wasm
- arrow-wasm for JS bindings for Arrow FFI with Arrow JS
- geoparquet-wasm for JS bindings for GeoParquet
- flatgeobuf-wasm for JS bindings for FlatGeobuf
- geoarrow-rs for algorithms
Pure JS Interop
This is designed to smoothly interop with pure-JavaScript Arrow libraries.
Arrow JS
The canonical implementation of Arrow in JS. It only supports IPC for data I/O.
Arrow JS FFI
A library to read/write Arrow data across the Wasm boundary. This interops with the core arrow-wasm crate above.
GeoArrow JS
A pure-JavaScript (TypeScript) implementation of GeoArrow. This uses the exact same memory layout as GeoArrow in Rust, so it should be possible to mix and match between pure-JS and wasm-based algorithms without changing data representations.