vineyard-io: IO drivers for vineyard

vineyard-io is a collection of IO drivers for vineyard. Currently it supports

  • Local filesystem
  • AWS S3
  • Aliyun OSS
  • Hadoop filesystem

The vineyard-io package leverages filesystem-spec (fsspec) to support other storage sinks and sources in a unified fashion. Other adaptors that work with fsspec can be plugged in as well.

IO Adaptors

Vineyard has a set of prebuilt IO adaptors that serve as common routines for various IO operations and can take the place of boilerplate code in computation tasks.

Vineyard is capable of reading data from and writing data to multiple file systems. Behind the scenes, it leverages fsspec to delegate the workload to various file system implementations.

Specifically, we can specify parameters to be passed to the file system through the storage_options parameter. storage_options is a dict that passes additional keyword arguments to the file system. For instance, we could combine path = hdfs:///path/to/file with storage_options = {"host": "localhost", "port": 9600} to read from HDFS.

Note that you must encode storage_options with base64 before passing it to the scripts.

Alternatively, we can encode such information into the path, such as: hdfs://<ip>:<port>/path/to/file.
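
As a minimal sketch of the encoding step, the snippet below serializes a storage_options dict to JSON and then base64-encodes the result. The JSON step is an assumption (the text above only specifies base64), and encode_options and decode_options are hypothetical helper names, not part of vineyard-io. The last lines show that a host and port written into the path can be recovered with the standard library and carry the same information as the dict.

```python
import base64
import json
from urllib.parse import urlsplit


def encode_options(options: dict) -> str:
    """Serialize a dict to JSON, then base64-encode it (hypothetical helper)."""
    raw = json.dumps(options).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_options(encoded: str) -> dict:
    """Inverse of encode_options, useful for checking the round trip."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


storage_options = {"host": "localhost", "port": 9600}
encoded = encode_options(storage_options)

# The same host and port, written into the path instead of the dict.
parts = urlsplit("hdfs://localhost:9600/path/to/file")
from_path = {"host": parts.hostname, "port": parts.port}
```

The encoded string is what would be placed in the <storage_options> position of the usage lines in this document.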

To read from multiple files you can pass a glob string or a list of paths, with the caveat that they must all have the same protocol.
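
The same-protocol caveat can be checked before any adaptor is invoked. The following is only a sketch: common_protocol is a hypothetical helper, and treating paths without a scheme as local files is an assumption made here, not documented vineyard-io behavior.

```python
from urllib.parse import urlsplit


def common_protocol(paths):
    """Return the single protocol shared by all paths, or raise ValueError.

    Paths without a scheme are treated as local files (an assumption).
    """
    protocols = {urlsplit(p).scheme or "file" for p in paths}
    if len(protocols) != 1:
        raise ValueError(f"paths do not share one protocol: {sorted(protocols)}")
    return protocols.pop()


# A glob string and a plain path on the same protocol are accepted.
ok = common_protocol(["s3://bucket/part-*.csv", "s3://bucket/extra.csv"])
```

Mixing, for example, an s3:// path with an hdfs:// path makes the helper raise ValueError.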

The functionality of each adaptor is described as follows:

  • read_bytes

    Usage: vineyard_read_bytes <ipc_socket> <path> <storage_options> <read_options> <proc_num> <proc_index>

    Read a file on local file systems, OSS, HDFS, S3, etc. into a ByteStream.

  • write_bytes

    Usage: vineyard_write_bytes <ipc_socket> <path> <stream_id> <storage_options> <write_options> <proc_num> <proc_index>

    Write a ByteStream to a file on local file system, OSS, HDFS, S3, etc.

  • read_orc

    Usage: vineyard_read_orc <ipc_socket> <path/directory> <storage_options> <read_options> <proc_num> <proc_index>

    Read an ORC file on local file systems, OSS, HDFS, S3, etc. into a DataframeStream.

  • write_orc

    Usage: vineyard_write_orc <ipc_socket> <path/directory> <storage_options> <read_options> <proc_num> <proc_index>

    Write a DataframeStream to an ORC file on local file system, OSS, HDFS, S3, etc.

  • read_vineyard_dataframe

    Usage: vineyard_read_vineyard_dataframe <ipc_socket> <vineyard_address> <storage_options> <read_options> <proc_num> <proc_index>

    Read a DataFrame in vineyard as a DataframeStream.

  • write_vineyard_dataframe

    Usage: vineyard_write_vineyard_dataframe <ipc_socket> <stream_id> <proc_num> <proc_index>

    Write a DataframeStream to a DataFrame in vineyard.

  • serializer

    Usage: vineyard_serializer <ipc_socket> <object_id>

    Serialize a vineyard object (non-global or global) as a ByteStream or a set of ByteStreams (StreamCollection).

  • deserializer

    Usage: vineyard_deserializer <ipc_socket> <object_id>

    Deserialize a ByteStream or a set of ByteStreams (StreamCollection) as a vineyard object.

  • read_bytes_collection

    Usage: vineyard_read_bytes_collection <ipc_socket> <prefix> <storage_options> <proc_num> <proc_index>

    Read a directory (on local filesystem, OSS, HDFS, S3, etc.) as a ByteStream or a set of ByteStreams (StreamCollection).

  • write_bytes_collection

    Usage: vineyard_write_bytes_collection <ipc_socket> <stream_id> <proc_num> <proc_index>

    Write a ByteStream or a set of ByteStreams (StreamCollection) to a directory (on local filesystem, OSS, HDFS, S3, etc.).

  • parse_bytes_to_dataframe

    Usage: vineyard_parse_bytes_to_dataframe <ipc_socket> <stream_id> <proc_num> <proc_index>

    Parse a ByteStream (in CSV format) as a DataframeStream.

  • parse_dataframe_to_bytes

    Usage: vineyard_parse_dataframe_to_bytes <ipc_socket> <stream_id> <proc_num> <proc_index>

    Serialize a DataframeStream to a ByteStream (in CSV format).

  • dump_dataframe

    Usage: vineyard_dump_dataframe <ipc_socket> <stream_id>

    Dump the content of a DataframeStream, for debugging purposes.
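
To illustrate how the positional arguments in the usage lines fit together, here is a minimal sketch that builds one vineyard_read_bytes argument list per worker. Several details are assumptions rather than documented behavior: that proc_index is zero-based and runs up to proc_num - 1, that read_options is encoded the same way as storage_options, and that the dicts are JSON-serialized before base64 encoding. The socket path and the helper names b64_json and read_bytes_commands are hypothetical.

```python
import base64
import json


def b64_json(options: dict) -> str:
    """JSON-serialize then base64-encode a dict (the JSON step is an assumption)."""
    return base64.b64encode(json.dumps(options).encode("utf-8")).decode("ascii")


def read_bytes_commands(ipc_socket, path, storage_options, read_options, proc_num):
    """Build one vineyard_read_bytes argument list per worker process.

    Argument order follows the usage line above; zero-based proc_index and
    base64-encoded read_options are assumptions of this sketch.
    """
    return [
        [
            "vineyard_read_bytes",
            ipc_socket,
            path,
            b64_json(storage_options),
            b64_json(read_options),
            str(proc_num),
            str(proc_index),
        ]
        for proc_index in range(proc_num)
    ]


commands = read_bytes_commands(
    "/tmp/vineyard.sock",  # hypothetical IPC socket path
    "hdfs:///path/to/file",
    {"host": "localhost", "port": 9600},
    {},
    proc_num=2,
)
```

Each inner list could then be handed to subprocess.run on the corresponding worker. The sketch only constructs the arguments and does not launch anything.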