Add references documentation for the vineyard-graph-loader (v6d-io#1270)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
sighingnow authored Mar 28, 2023
1 parent 5474528 commit 99f0784
Showing 1 changed file with 182 additions and 67 deletions.
249 changes: 182 additions & 67 deletions modules/graph/README.rst
@@ -4,11 +4,26 @@ vineyard-graph
:code:`vineyard-graph` defines the graph data structures that can be shared
among graph computing engines.

* `vineyard-graph-loader <#vineyard-graph-loader>`_

  * `Usage <#usage>`_

    * `Using Command-line Arguments <#using-command-line-arguments>`_
    * `Using a JSON Configuration <#using-a-json-configuration>`_

  * `References <#references>`_

    * `Vertices <#vertices>`_
    * `Edges <#edges>`_
    * `Data Sources <#data-sources>`_
    * `Read Options <#read-options>`_
    * `Global Options <#global-options>`_

vineyard-graph-loader
---------------------

:code:`vineyard-graph-loader` is a graph loader used to load graphs from
the CSV format into vineyard.

Usage
^^^^^
@@ -24,68 +39,168 @@ Usage

    or: ./vineyard-graph-loader [--socket <vineyard-ipc-socket>] --config <config.json>

The program :code:`vineyard-graph-loader` first accepts an optional argument
:code:`--socket <vineyard-ipc-socket>`, which specifies the IPC socket that the
loader will connect to. If the option is not provided, the loader tries to
resolve the IPC socket from the environment variable :code:`VINEYARD_IPC_SOCKET`.
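
The lookup order described above can be sketched in Python. This is only an illustration of the precedence (explicit option first, environment variable second). The helper name :code:`resolve_ipc_socket` is hypothetical and not part of vineyard, and raising an error when neither source is set is an assumption, since the behaviour of the loader in that case is not described here.

```python
import os
from typing import Mapping, Optional


def resolve_ipc_socket(cli_socket: Optional[str],
                       environ: Mapping[str, str] = os.environ) -> str:
    """Return the IPC socket path, preferring --socket over the environment."""
    if cli_socket:
        # An explicit --socket value always wins.
        return cli_socket
    env_socket = environ.get("VINEYARD_IPC_SOCKET")
    if env_socket:
        # Fall back to the environment variable.
        return env_socket
    # Assumption: treat "nothing configured" as an error.
    raise RuntimeError("no IPC socket: pass --socket or set VINEYARD_IPC_SOCKET")
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.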

The graph can be loaded either via command line arguments or a JSON configuration.

Using Command-line Arguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :code:`vineyard-graph-loader` accepts a sequence of command line arguments to
specify the edge files and vertex files, e.g.,

.. code:: bash

    $ ./vineyard-graph-loader 2 "modern_graph/knows.csv#header_row=true&src_label=person&dst_label=person&label=knows&delimiter=|" \
                                "modern_graph/created.csv#header_row=true&src_label=person&dst_label=software&label=created&delimiter=|" \
                              2 "modern_graph/person.csv#header_row=true&label=person&delimiter=|" \
                                "modern_graph/software.csv#header_row=true&label=software&delimiter=|"

Using a JSON Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~

The :code:`vineyard-graph-loader` can also accept a config file (in JSON format)
that specifies the vertex files and edge files to load, as well as global
flags, for example,

.. code:: bash

    $ ./vineyard-graph-loader --config config.json

Here is an example of the :code:`config.json` file for the "modern graph":

.. code:: json

    {
        "vertices": [
            {
                "data_path": "modern_graph/person.csv",
                "label": "person",
                "options": "header_row=true&delimiter=|"
            },
            {
                "data_path": "modern_graph/software.csv",
                "label": "software",
                "options": "header_row=true&delimiter=|"
            }
        ],
        "edges": [
            {
                "data_path": "modern_graph/knows.csv",
                "label": "knows",
                "src_label": "person",
                "dst_label": "person",
                "options": "header_row=true&delimiter=|"
            },
            {
                "data_path": "modern_graph/created.csv",
                "label": "created",
                "src_label": "person",
                "dst_label": "software",
                "options": "header_row=true&delimiter=|"
            }
        ],
        "directed": 1,
        "generate_eid": 1,
        "string_oid": 0,
        "local_vertex_map": 0,
        "print_normalized_schema": 1
    }

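
When the list of files is long, the same configuration can be generated rather than written by hand. The following is a minimal sketch using only the Python standard library. The helpers :code:`vertex_entry` and :code:`edge_entry` are hypothetical names introduced for illustration, and the keys and values simply mirror the "modern graph" example above.

```python
import json

# Read options shared by every file in the "modern graph" example.
READ_OPTIONS = "header_row=true&delimiter=|"


def vertex_entry(label: str) -> dict:
    """Describe one vertex file, following the layout of the example config."""
    return {
        "data_path": f"modern_graph/{label}.csv",
        "label": label,
        "options": READ_OPTIONS,
    }


def edge_entry(label: str, src_label: str, dst_label: str) -> dict:
    """Describe one edge file together with its endpoint vertex labels."""
    return {
        "data_path": f"modern_graph/{label}.csv",
        "label": label,
        "src_label": src_label,
        "dst_label": dst_label,
        "options": READ_OPTIONS,
    }


config = {
    "vertices": [vertex_entry("person"), vertex_entry("software")],
    "edges": [
        edge_entry("knows", "person", "person"),
        edge_entry("created", "person", "software"),
    ],
    "directed": 1,
    "generate_eid": 1,
    "string_oid": 0,
    "local_vertex_map": 0,
    "print_normalized_schema": 1,
}

# Serialize to the text that would be saved as config.json.
config_text = json.dumps(config, indent=2)
```

Writing :code:`config_text` to a file and passing that file to :code:`--config` should then be equivalent to using the hand-written example.
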
References
^^^^^^^^^^

Vertices
~~~~~~~~

Each vertex entry supports the following fields:

- :code:`data_path`: the path of the data source. Environment variables are supported,
  e.g., :code:`$HOME/data/person.csv`. See also `Data Sources <#data-sources>`_.
- :code:`label`: the label of the vertex, e.g., :code:`person`.
- :code:`options`: the options used to read the file, e.g., :code:`header_row=true&delimiter=|`.
  The detailed options are listed in `Read Options <#read-options>`_.

Edges
~~~~~

Each edge entry supports the following fields:

- :code:`data_path`: the path of the data source. Environment variables are supported,
  e.g., :code:`$HOME/data/knows.csv`. See also `Data Sources <#data-sources>`_.
- :code:`label`: the label of the edge, e.g., :code:`knows`.
- :code:`src_label`: the label of the source vertex, e.g., :code:`person`.
- :code:`dst_label`: the label of the destination vertex, e.g., :code:`person`.
- :code:`options`: the options used to read the file, e.g., :code:`header_row=true&delimiter=|`.
  The detailed options are listed in `Read Options <#read-options>`_.
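
Because every edge refers to vertex labels by name, a typo in :code:`src_label` or :code:`dst_label` is easy to make. The sketch below is a sanity check one might run on a configuration before invoking the loader. It is not part of vineyard: :code:`check_config` is a hypothetical helper, and treating :code:`data_path` and the label fields as mandatory while leaving :code:`options` optional is an assumption.

```python
from typing import List

# Assumption: these keys are treated as mandatory; "options" is left optional.
VERTEX_KEYS = ("data_path", "label")
EDGE_KEYS = ("data_path", "label", "src_label", "dst_label")


def check_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    vertex_labels = set()

    for i, vertex in enumerate(config.get("vertices", [])):
        for key in VERTEX_KEYS:
            if key not in vertex:
                problems.append(f"vertices[{i}] is missing '{key}'")
        if "label" in vertex:
            vertex_labels.add(vertex["label"])

    for i, edge in enumerate(config.get("edges", [])):
        for key in EDGE_KEYS:
            if key not in edge:
                problems.append(f"edges[{i}] is missing '{key}'")
        # Each endpoint label should name one of the declared vertex labels.
        for end in ("src_label", "dst_label"):
            if end in edge and edge[end] not in vertex_labels:
                problems.append(
                    f"edges[{i}].{end} refers to unknown vertex label '{edge[end]}'"
                )

    return problems
```

For the "modern graph" configuration the function returns an empty list, while removing the :code:`software` vertex entry would flag the :code:`created` edge.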

Data Sources
~~~~~~~~~~~~

Support for various sources in :code:`data_path` can be achieved in two ways:

- Option 1: use `vineyard.io <https://github.com/v6d-io/v6d/tree/main/modules/io/python/drivers/io/adaptors>`_
  to read the given sources as vineyard streams first, then pass the stream to the loader
  as :code:`data_path` in the form :code:`vineyard://<object_id_string>`.

- Option 2: configure the Arrow dependency used to build the vineyard-graph-loader to support
  S3 and HDFS with `extra cmake flags <https://arrow.apache.org/docs/developers/cpp/building.html#optional-components>`_.

Read Options
~~~~~~~~~~~~

The read options specify how to read the given sources. Multiple options
should be separated by :code:`&` or :code:`#`. The available options are:

- :code:`header_row`: whether the first row of the CSV file is the header row,
  default is :code:`0`.
- :code:`delimiter`: the delimiter of the CSV file, default is :code:`,`.
- :code:`schema`: the columns to read from the CSV file, default is empty, which means
  all columns will be included. The :code:`schema` is a :code:`,`-separated list of column names
  or column indices, e.g., :code:`name,age` or :code:`0,1`.
- :code:`column_types`: the data type of each column, default is empty, which
  means the types will be inferred from the data. The :code:`column_types` is a
  :code:`,`-separated list of data types, e.g., :code:`string,int64`. **If specified, the types
  of ALL columns must be given; partial specification won't work.**
- :code:`include_all_columns`: whether to include all columns in the CSV file,
  default is :code:`0`. **If specified, the columns that exist in the data file
  but are not listed in the** :code:`schema` **option will be read as well.**
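
To make the option syntax concrete, the sketch below splits a location string of the form :code:`path#key=value&key=value` into its path and its options. It only illustrates the separators described above and is not the loader's actual parser; :code:`parse_read_options` and :code:`split_location` are hypothetical names, and this naive version cannot handle a delimiter value that is itself :code:`&` or :code:`#`.

```python
import re
from typing import Dict, Tuple


def parse_read_options(text: str) -> Dict[str, str]:
    """Split 'k1=v1&k2=v2' (or '#'-separated) into a dict of strings."""
    options = {}
    for item in re.split(r"[&#]", text):
        if not item:
            continue  # ignore empty pieces such as a trailing separator
        key, _, value = item.partition("=")
        options[key] = value
    return options


def split_location(location: str) -> Tuple[str, Dict[str, str]]:
    """Separate the file path from the options that follow the first '#'."""
    path, _, option_text = location.partition("#")
    return path, parse_read_options(option_text)
```

Values are kept as strings; interpreting :code:`true` or :code:`1` as booleans is left to the consumer.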

The combination of :code:`schema` and :code:`include_all_columns` is useful when
the desired column order differs from the order in the file, but listing every
column name would be tedious. For example, if the file contains the ID column
as the **third** column and we want to use it as the vertex ID, we can use
:code:`schema=2&include_all_columns=1`: all columns will be read, but the
**third** column in the file will be placed as the **first** column in the result table.
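
The reordering can be simulated with a few lines of Python. This is a sketch of the behaviour described above, not the loader's implementation: :code:`result_columns` is a hypothetical helper, the column names are made up, and appending the remaining columns in their original file order is an assumption.

```python
from typing import List


def result_columns(file_columns: List[str], schema: str,
                   include_all_columns: bool) -> List[str]:
    """Return the column order of the result table for the given options."""
    if not schema:
        # An empty schema means all columns, in file order.
        return list(file_columns)

    selected = []
    for entry in schema.split(","):
        # An entry is either a zero-based column index or a column name.
        name = file_columns[int(entry)] if entry.isdigit() else entry
        selected.append(name)

    if include_all_columns:
        # Assumption: the columns not listed in the schema follow in file order.
        selected += [c for c in file_columns if c not in selected]
    return selected


columns = ["name", "age", "id", "city"]
reordered = result_columns(columns, "2", include_all_columns=True)
```

Here :code:`reordered` is :code:`["id", "name", "age", "city"]`: the third column of the file comes first, followed by the rest.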

Global Options
~~~~~~~~~~~~~~

Global options control how the fragment is constructed from the given vertices
and edges. They are listed as follows:

- :code:`directed`: whether the graph is directed, default is :code:`1`.
- :code:`generate_eid`: whether to generate edge ids, default is :code:`0`. **Generating
  edge ids is usually required in GraphScope GIE.**
- :code:`retain_oid`: whether to retain the original ID in the vertex's property table,
  default is :code:`0`. **Retaining the original ID in the vertex's property table is usually
  required in GraphScope GIE.**
- :code:`oid_type`: the type of the original ID of the vertices, default is :code:`int64_t`.
  Can be :code:`int64_t` or :code:`string`.
- :code:`large_vid`: whether the vertex id is large, default is :code:`1`. If you are
  sure that the number of vertices is fairly small (:code:`< 2^(31-log2(vertex_label_number)-1)`),
  setting :code:`large_vid` to :code:`0` can reduce the memory usage. **Note that**
  :code:`large_vid=0` **isn't compatible with GraphScope GIE.**
- :code:`local_vertex_map`: whether to use the local vertex map, default is :code:`0`.
  Using the local vertex map usually helps reduce the memory usage. **Note that**
  :code:`local_vertex_map=0` **isn't compatible with GraphScope GIE.**
- :code:`print_memory_usage`: whether to print the memory usage of the graph to :code:`STDERR`,
  default is :code:`0`.
- :code:`print_normalized_schema`: whether to print the **normalized** schema of the graph to
  :code:`STDERR`, default is :code:`0`. Here, "normalized" means that the same property
  name is assigned the same property id across different labels, **which is required by GraphScope GIE.**
