vineyard-io: IO drivers for vineyard
vineyard-io is a collection of IO drivers for vineyard. It currently supports:
- Local filesystem
- AWS S3
- Aliyun OSS
- Hadoop filesystem
The vineyard-io package leverages filesystem-spec (fsspec) to support other storage sinks and sources in a unified fashion. Other adaptors that work with fsspec can be plugged in as well.
Vineyard has a set of prebuilt IO adaptors that can serve as common routines for various IO operations and can take the place of boilerplate parts in computation tasks.
Vineyard is capable of reading data from and writing data to multiple file systems. Behind the scenes, it leverages fsspec to delegate the workload to various file system implementations.
Specifically, we can specify parameters to be passed to the file system through the storage_options parameter, a dict that passes additional keyword arguments to the file system. For instance, we could combine path = hdfs:///path/to/file with storage_options = {"host": "localhost", "port": 9600} to read from HDFS.
Note that you must encode storage_options with base64 before passing it to the scripts.
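A minimal sketch of the encoding step. It assumes the dict is serialized as JSON before being base64-encoded; the text above only specifies base64, so the JSON step is an assumption. The helper names encode_storage_options and decode_storage_options are hypothetical and not part of vineyard-io.

```python
import base64
import json


def encode_storage_options(options: dict) -> str:
    """Serialize a storage_options dict to JSON, then base64-encode it.

    Assumption: the receiving script decodes base64 and then parses JSON.
    """
    raw = json.dumps(options).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_storage_options(encoded: str) -> dict:
    """Inverse of encode_storage_options, handy for checking a value."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


storage_options = {"host": "localhost", "port": 9600}
encoded = encode_storage_options(storage_options)
print(encoded)  # the string to pass as the <storage_options> argument
```

Encoding the dict as a single ASCII token keeps it safe to pass as one command-line argument, regardless of quotes or spaces in the values.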
Alternatively, we can encode such information into the path, such as: hdfs://<ip>:<port>/path/to/file.
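The two ways of specifying the endpoint carry the same information. The sketch below, which uses only the Python standard library, moves a host and port between a path and a storage_options-style dict. The function names embed_endpoint and extract_endpoint are hypothetical, chosen for illustration only.

```python
from urllib.parse import urlsplit


def embed_endpoint(path: str, host: str, port: int) -> str:
    """Turn e.g. hdfs:///path/to/file into hdfs://<host>:<port>/path/to/file."""
    parts = urlsplit(path)
    if parts.netloc:
        raise ValueError(f"path already names an endpoint: {parts.netloc}")
    return f"{parts.scheme}://{host}:{port}{parts.path}"


def extract_endpoint(path: str) -> dict:
    """Recover the host and port embedded in a path as a dict."""
    parts = urlsplit(path)
    return {"host": parts.hostname, "port": parts.port}


full_path = embed_endpoint("hdfs:///path/to/file", "localhost", 9600)
print(full_path)
print(extract_endpoint(full_path))
```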
To read from multiple files you can pass a glob string or a list of paths, with the caveat that they must all have the same protocol.
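The same-protocol caveat can be checked before handing paths to an adaptor. The following is a sketch under two assumptions: the protocol is whatever precedes "://" in a path, and a path without a protocol is treated as a local file. The helper names protocol_of and check_same_protocol, and the bucket names in the usage line, are hypothetical.

```python
def protocol_of(path: str) -> str:
    """Return the protocol prefix of a path.

    Assumption: paths without "://" refer to the local file system.
    """
    if "://" in path:
        return path.split("://", 1)[0]
    return "file"


def check_same_protocol(paths):
    """Return the shared protocol, or raise if the paths mix protocols."""
    protocols = {protocol_of(p) for p in paths}
    if not protocols:
        raise ValueError("no paths given")
    if len(protocols) > 1:
        raise ValueError(f"mixed protocols: {sorted(protocols)}")
    return protocols.pop()


print(check_same_protocol(["s3://bucket/part-0.csv", "s3://bucket/part-1.csv"]))
```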
The functionality of each adaptor is described as follows:
read_bytes
Usage: vineyard_read_bytes <ipc_socket> <path> <storage_options> <read_options> <proc_num> <proc_index>
Read a file on a local file system, OSS, HDFS, S3, etc. into a ByteStream.

write_bytes
Usage: vineyard_write_bytes <ipc_socket> <path> <stream_id> <storage_options> <write_options> <proc_num> <proc_index>
Write a ByteStream to a file on a local file system, OSS, HDFS, S3, etc.

read_orc
Usage: vineyard_read_orc <ipc_socket> <path/directory> <storage_options> <read_options> <proc_num> <proc_index>
Read an ORC file on a local file system, OSS, HDFS, S3, etc. into a DataframeStream.

write_orc
Usage: vineyard_write_orc <ipc_socket> <path/directory> <storage_options> <read_options> <proc_num> <proc_index>
Write a DataframeStream to an ORC file on a local file system, OSS, HDFS, S3, etc.

read_vineyard_dataframe
Usage: vineyard_read_vineyard_dataframe <ipc_socket> <vineyard_address> <storage_options> <read_options> <proc_num> <proc_index>
Read a DataFrame in vineyard as a DataframeStream.

write_vineyard_dataframe
Usage: vineyard_write_vineyard_dataframe <ipc_socket> <stream_id> <proc_num> <proc_index>
Write a DataframeStream to a DataFrame in vineyard.

serializer
Usage: vineyard_serializer <ipc_socket> <object_id>
Serialize a vineyard object (non-global or global) as a ByteStream or a set of ByteStreams (StreamCollection).

deserializer
Usage: vineyard_deserializer <ipc_socket> <object_id>
Deserialize a ByteStream or a set of ByteStreams (StreamCollection) as a vineyard object.

read_bytes_collection
Usage: vineyard_read_bytes_collection <ipc_socket> <prefix> <storage_options> <proc_num> <proc_index>
Read a directory (on a local file system, OSS, HDFS, S3, etc.) as a ByteStream or a set of ByteStreams (StreamCollection).

write_bytes_collection
Usage: vineyard_write_bytes_collection <ipc_socket> <stream_id> <proc_num> <proc_index>
Write a ByteStream or a set of ByteStreams (StreamCollection) to a directory (on a local file system, OSS, HDFS, S3, etc.).

parse_bytes_to_dataframe
Usage: vineyard_parse_bytes_to_dataframe <ipc_socket> <stream_id> <proc_num> <proc_index>
Parse a ByteStream (in CSV format) as a DataframeStream.

parse_dataframe_to_bytes
Usage: vineyard_parse_dataframe_to_bytes <ipc_socket> <stream_id> <proc_num> <proc_index>
Serialize a DataframeStream to a ByteStream (in CSV format).

dump_dataframe
Usage: vineyard_dump_dataframe <ipc_socket> <stream_id>
Dump the content of a DataframeStream, for debugging purposes.
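As an illustration of how the usage lines above translate into invocations, the sketch below builds one vineyard_read_bytes argument list per worker without running anything. Several points are assumptions rather than documented behavior: that proc_num is the total number of worker processes, that proc_index runs from 0 to proc_num - 1, that read_options is encoded the same way as storage_options, and that the options are serialized as JSON before base64 encoding. The socket path is a placeholder and the helper names are hypothetical.

```python
import base64
import json
import shlex


def encode_options(options: dict) -> str:
    """Base64-encode a dict (the JSON serialization step is an assumption)."""
    return base64.b64encode(json.dumps(options).encode("utf-8")).decode("ascii")


def read_bytes_commands(ipc_socket, path, storage_options, read_options, proc_num):
    """Build one vineyard_read_bytes argument list per worker.

    Assumption: proc_index is zero-based and runs up to proc_num - 1.
    """
    encoded_storage = encode_options(storage_options)
    encoded_read = encode_options(read_options)  # assumed to be encoded alike
    return [
        [
            "vineyard_read_bytes",
            ipc_socket,
            path,
            encoded_storage,
            encoded_read,
            str(proc_num),
            str(proc_index),
        ]
        for proc_index in range(proc_num)
    ]


commands = read_bytes_commands(
    ipc_socket="/tmp/vineyard.sock",  # placeholder socket path
    path="hdfs:///path/to/file",
    storage_options={"host": "localhost", "port": 9600},
    read_options={},
    proc_num=2,
)
for argv in commands:
    print(shlex.join(argv))  # shell-quoted command line, for inspection only
```

Keeping each command as an argument list rather than a single string avoids shell quoting problems if the list is later passed to a process launcher.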