vineyard-graph defines the graph data structures that can be shared among graph computing engines.
VINEYARD_GRAPH_MAX_LABEL_ID
The internal vertex id (aka. VID) in vineyard is encoded as fragment id, vertex label id and vertex offset. The option VINEYARD_GRAPH_MAX_LABEL_ID decides the bit field width of the label id in VID. Decreasing this value can help to support a larger number of vertices when using int32_t as VID_T. Defaults to 128, can be 1, 2, 4, 8, 16, 32, 64, or 128.
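The trade-off between label-id width and vertex capacity can be sketched with a little bit arithmetic. The following is only an illustration under stated assumptions: the field order (fragment id in the high bits, then label id, then offset), the 31 usable bits for int32_t, and the 2 fragment bits are all hypothetical choices, and the helper names are not part of vineyard.

```python
def label_bits(max_label_id: int) -> int:
    """Bits needed to store label ids in the range [0, max_label_id)."""
    if max_label_id < 1 or max_label_id & (max_label_id - 1):
        raise ValueError("max_label_id must be a power of two")
    return max_label_id.bit_length() - 1


def offset_bits(vid_bits: int, fragment_bits: int, max_label_id: int) -> int:
    """Bits left for the vertex offset once fragment and label bits are taken."""
    return vid_bits - fragment_bits - label_bits(max_label_id)


def pack_vid(fid, label, offset, *, vid_bits, fragment_bits, max_label_id):
    """Pack the three fields into one integer (assumed layout, see above)."""
    lb = label_bits(max_label_id)
    ob = offset_bits(vid_bits, fragment_bits, max_label_id)
    if not 0 <= fid < (1 << fragment_bits):
        raise ValueError("fragment id out of range")
    if not 0 <= label < max_label_id:
        raise ValueError("label id out of range")
    if not 0 <= offset < (1 << ob):
        raise ValueError("vertex offset out of range")
    return (fid << (lb + ob)) | (label << ob) | offset


if __name__ == "__main__":
    # Assumed: 31 usable bits in a signed 32-bit id, 2 bits for the fragment id.
    for max_label_id in (128, 16, 1):
        bits = offset_bits(31, 2, max_label_id)
        print(f"max_label_id={max_label_id}: {bits} offset bits, "
              f"{1 << bits} vertices per (fragment, label)")
```

Under these assumptions, every halving of the maximum label id frees one more bit for the vertex offset, which doubles the number of vertices addressable per label.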
vineyard-graph-loader is a graph loader used to load graphs from the CSV format into vineyard.
$ ./vineyard-graph-loader
Usage: loading vertices and edges as vineyard graph.
./vineyard-graph-loader [--socket <vineyard-ipc-socket>] \
<e_label_num> <efiles...> <v_label_num> <vfiles...> \
[directed] [generate_eid] [string_oid]
or: ./vineyard-graph-loader [--socket <vineyard-ipc-socket>] --config <config.json>
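When the file list is produced by a script, the positional form in the usage text above can be assembled programmatically. The sketch below is a hypothetical helper, not part of vineyard: it follows the argument order of the usage text and the path#key=value&key=value form used by the examples in this document, and it returns an argument list so that characters such as | and & never pass through a shell.

```python
def source_arg(path, **options):
    """Append read options to a file path as path#key=value&key=value."""
    if not options:
        return path
    return path + "#" + "&".join(f"{key}={value}" for key, value in options.items())


def loader_argv(e_label_num, efiles, v_label_num, vfiles,
                socket=None, binary="./vineyard-graph-loader"):
    """Build the argument list in the order shown by the usage text."""
    argv = [binary]
    if socket is not None:
        argv += ["--socket", socket]
    argv += [str(e_label_num), *efiles, str(v_label_num), *vfiles]
    return argv


if __name__ == "__main__":
    efiles = [
        source_arg("modern_graph/knows.csv", header_row="true", src_label="person",
                   dst_label="person", label="knows", delimiter="|"),
    ]
    vfiles = [
        source_arg("modern_graph/person.csv", header_row="true",
                   label="person", delimiter="|"),
    ]
    # Pass the list to subprocess.run(argv, check=True) to launch the loader.
    print(loader_argv(1, efiles, 1, vfiles))
```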
The program vineyard-graph-loader first accepts an optional argument --socket <vineyard-ipc-socket>, which specifies the IPC socket that the loader will connect to. If the option is not provided, the loader will try to resolve the IPC socket from the environment variable VINEYARD_IPC_SOCKET.
The graph can be loaded either via command line arguments or a JSON configuration.
The vineyard-graph-loader accepts a sequence of command line arguments to specify the edge files and vertex files, e.g.,
$ ./vineyard-graph-loader 2 "modern_graph/knows.csv#header_row=true&src_label=person&dst_label=person&label=knows&delimiter=|" \
"modern_graph/created.csv#header_row=true&src_label=person&dst_label=software&label=created&delimiter=|" \
2 "modern_graph/person.csv#header_row=true&label=person&delimiter=|" \
"modern_graph/software.csv#header_row=true&label=software&delimiter=|"
The vineyard-graph-loader can also accept a config file (in JSON format) that specifies the vertex files and edge files to load, as well as global flags, for example,
$ ./vineyard-graph-loader --config config.json
Here is an example of the config.json file for the "modern graph":
{
"vertices": [
{
"data_path": "modern_graph/person.csv",
// can also be absolute path or path with environment variables
//
// "data_path": "/datasets/modern_graph/person.csv",
// "data_path": "$DATASET/modern_graph/person.csv",
"label": "person",
"options": "header_row=true&delimiter=|"
},
{
"data_path": "modern_graph/software.csv",
"label": "software",
"options": "header_row=true&delimiter=|"
}
],
"edges": [
{
"data_path": "modern_graph/knows.csv",
"label": "knows",
"src_label": "person",
"dst_label": "person",
"options": "header_row=true&delimiter=|"
},
{
"data_path": "modern_graph/created.csv",
"label": "created",
"src_label": "person",
"dst_label": "software",
"options": "header_row=true&delimiter=|"
}
],
"directed": 1,
"generate_eid": 1,
"string_oid": 0,
"local_vertex_map": 0,
"print_normalized_schema": 1
}
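A configuration like the one above can also be generated rather than written by hand. The following minimal sketch uses only the keys that appear in the example; the helper names are hypothetical. Note that Python's json module rejects // comments, so the generated file contains none.

```python
import json

# Read options shared by every file of the "modern graph" example.
OPTIONS = "header_row=true&delimiter=|"


def vertex(path, label):
    """One entry of the "vertices" array."""
    return {"data_path": path, "label": label, "options": OPTIONS}


def edge(path, label, src_label, dst_label):
    """One entry of the "edges" array."""
    return {
        "data_path": path,
        "label": label,
        "src_label": src_label,
        "dst_label": dst_label,
        "options": OPTIONS,
    }


def modern_graph_config():
    """Rebuild the example configuration as a plain dictionary."""
    return {
        "vertices": [
            vertex("modern_graph/person.csv", "person"),
            vertex("modern_graph/software.csv", "software"),
        ],
        "edges": [
            edge("modern_graph/knows.csv", "knows", "person", "person"),
            edge("modern_graph/created.csv", "created", "person", "software"),
        ],
        "directed": 1,
        "generate_eid": 1,
        "string_oid": 0,
        "local_vertex_map": 0,
        "print_normalized_schema": 1,
    }


if __name__ == "__main__":
    with open("config.json", "w") as f:
        json.dump(modern_graph_config(), f, indent=2)
```

The resulting file can then be passed to the loader with --config config.json.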
Each vertex entry can have the following configurations:
- data_path: the path of the given source, environment variables are supported, e.g., $HOME/data/person.csv. See also Data Sources.
- label: the label of the vertex, e.g., person.
- options: the options used to read the file, e.g., header_row=true&delimiter=|. The detailed options are listed in Read Options.
Each edge entry can have the following configurations:
- data_path: the path of the given source, environment variables are supported, e.g., $HOME/data/knows.csv. See also Data Sources.
- label: the label of the edge, e.g., knows.
- src_label: the label of the source vertex, e.g., person.
- dst_label: the label of the destination vertex, e.g., person.
- options: the options used to read the file, e.g., header_row=true&delimiter=|. The detailed options are listed in Read Options.
The data_path can be local files, S3 files, HDFS files, or vineyard streams.
When it comes to local files, it can be a relative path, an absolute path, or a path with environment variables, e.g.,
data/person.csv
/dataset/data/person.csv
$HOME/data/person.csv
When it comes to S3 files and HDFS files, support for these sources in data_path can be achieved in two ways:
- Option 1: use vineyard.io to read the given sources as vineyard streams first, and pass the stream as vineyard://<object_id_string> as data_path to the loader.
- Option 2: configure the arrow dependency that is used to build the vineyard-graph-loader to support S3 and HDFS with extra cmake flags.
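The kinds of data_path described above can be told apart with a few string checks. The sketch below is only an illustration, not the loader's own logic: the helper is hypothetical, it treats any scheme other than vineyard:// as a remote source, and it expands environment variables with Python's os.path.expandvars, which may differ in detail from what the loader does.

```python
import os


def describe_data_path(data_path):
    """Return (kind, resolved_value) for a data_path string."""
    if "://" in data_path:
        scheme, rest = data_path.split("://", 1)
        if scheme == "vineyard":
            # The remainder is the object id string of a vineyard stream.
            return ("vineyard stream", rest)
        return ("remote source", data_path)
    expanded = os.path.expandvars(data_path)
    if os.path.isabs(expanded):
        return ("absolute local path", expanded)
    return ("relative local path", expanded)


if __name__ == "__main__":
    for path in ("data/person.csv", "/dataset/data/person.csv", "$HOME/data/person.csv"):
        print(path, "->", describe_data_path(path))
```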
For edges that have different kinds of (src, dst) pairs, just repeat the "edge" object in the configuration file, e.g.,
{
"vertices": [
...
],
"edges": [
{
"data_path": "person_knows_person.csv",
"label": "knows",
"src_label": "person",
"dst_label": "person",
"options": "header_row=true&delimiter=|"
},
{
"data_path": "person_knows_item.csv",
"label": "knows",
"src_label": "person",
"dst_label": "item",
"options": "header_row=true&delimiter=|"
},
...
],
...
}
The read options specify how to read the given sources. Multiple options should be separated by & or #, and are listed as follows:
- header_row: whether the first row of the CSV file is the header row or not, default is 0.
- delimiter: the delimiter of the CSV file, default is ,.
- schema: the columns to read from the CSV file, default is empty, which indicates that all columns will be included. The schema is a ,-separated list of column names or column indices, e.g., name,age or 0,1.
- column_types: specify the data type of each column, default is empty, which indicates that the types will be inferred from the data. The column_types is a ,-separated list of data types, e.g., string,int64. If specified, the types of ALL columns must be given; partial specification won't work. The supported data types are listed as follows:
  - bool: boolean type.
  - int8_t, int8, byte: signed 8-bit integer type.
  - uint8_t, uint8, char: unsigned 8-bit integer type.
  - int16_t, int16, half: signed 16-bit integer type.
  - uint16_t, uint16: unsigned 16-bit integer type.
  - int32_t, int32, int: signed 32-bit integer type.
  - uint32_t, uint32: unsigned 32-bit integer type.
  - int64_t, int64, long: signed 64-bit integer type.
  - uint64_t, uint64: unsigned 64-bit integer type.
  - float: 32-bit floating point type.
  - double: 64-bit floating point type.
  - string, std::string, str: string type.
- include_all_columns: whether to include all columns in the CSV file or not, default is 0. If specified, the columns that exist in the data file but are not listed in the schema option will be read as well. The combination of schema and include_all_columns is useful when the desired column order differs from the order in the file, but we do not want to list all column names in detail. For example, if the file contains the ID column in the third column but we want to use it as the vertex ID, we could use schema=2&include_all_columns=1: all columns will be read, but the third column in the file will be placed at the first column in the result table.
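The interaction of schema and include_all_columns can be modeled in a few lines. This is a sketch of the behavior as described above, not the loader's implementation; the function name and the sample column names are hypothetical.

```python
def result_columns(file_columns, schema="", include_all_columns=False):
    """Model the column order of the result table for the given options."""
    if not schema:
        # An empty schema means all columns, in file order.
        return list(file_columns)
    selected = []
    for item in schema.split(","):
        item = item.strip()
        # Entries may be column indices or column names.
        name = file_columns[int(item)] if item.isdigit() else item
        if name not in file_columns:
            raise KeyError(f"unknown column: {item}")
        selected.append(name)
    if include_all_columns:
        # Columns not listed in the schema follow, in file order.
        selected += [c for c in file_columns if c not in selected]
    return selected


if __name__ == "__main__":
    columns = ["name", "age", "id"]
    print(result_columns(columns, schema="2", include_all_columns=True))
```

With the sample columns, schema=2 together with include_all_columns=1 moves id to the front and keeps name and age behind it.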
Global options control how the fragment is constructed from the given vertices and edges, and are listed as follows:
- directed: whether the graph is directed or not, default is 1.
- generate_eid: whether to generate edge ids or not, default is 0. Generating edge ids is usually required in GraphScope GIE.
- retain_oid: whether to retain the original ID in the vertex's property table or not, default is 0. Retaining the original ID in the vertex's property table is usually required in GraphScope GIE.
- oid_type: the type of the original ID of the vertices, default is int64_t. Can be int64_t or string.
- large_vid: whether the vertex id is large or not, default is 1. If you are sure that the number of vertices is fairly small (< 2^(31-log2(vertex_label_number)-1)), setting large_vid to 0 can reduce the memory usage. Note that large_vid=0 isn't compatible with GraphScope GIE.
- local_vertex_map: whether to use a local vertex map or not, default is 0. Using a local vertex map is usually helpful to reduce the memory usage. Note that local_vertex_map=0 isn't compatible with GraphScope GIE.
- print_memory_usage: whether to print the memory usage of the graph to STDERR or not, default is 0.
- print_normalized_schema: whether to print the normalized schema of the graph to STDERR or not, default is 0. The word "normalized" means that the same property name has the same property id across different labels, which is required by GraphScope GIE.
- dump: a string that indicates a directory to dump the graph to, e.g., "dump": "/tmp/dump-graph". Default is empty, which indicates no dump.
- dump_dry_run_rounds: if greater than 0, the loader will traverse the graph dump_dry_run_rounds times to measure the edge (CSR) accessing performance. Default is 0.
- use_perfect_hash: whether to use a perfect map when constructing the vertex map, default is 0. Using a perfect map is usually helpful to reduce the memory usage, but it is not recommended when the graph is small.
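The threshold quoted for large_vid, 2^(31-log2(vertex_label_number)-1), can be evaluated with a short helper. This sketch assumes that log2 is rounded up when the number of labels is not a power of two, which the text does not state; the function name is hypothetical.

```python
def small_vid_vertex_bound(vertex_label_number: int) -> int:
    """Evaluate 2^(31 - log2(vertex_label_number) - 1) with log2 rounded up."""
    if vertex_label_number < 1:
        raise ValueError("vertex_label_number must be positive")
    # Integer-exact ceil(log2(n)) for n >= 1.
    label_bits = (vertex_label_number - 1).bit_length()
    return 2 ** (31 - label_bits - 1)


if __name__ == "__main__":
    for labels in (1, 2, 4, 8):
        print(f"{labels} vertex labels: bound {small_vid_vertex_bound(labels)}")
```

If the vertex count stays well below the bound for the graph's number of vertex labels, large_vid=0 may be worth trying, keeping in mind the GraphScope GIE incompatibility noted above.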