vineyard-graph

vineyard-graph defines the graph data structures that can be shared among graph computing engines.

CMake configure options

  • VINEYARD_GRAPH_MAX_LABEL_ID

    The internal vertex ID (a.k.a. VID) in vineyard encodes the fragment ID, the vertex label ID, and the vertex offset. The option VINEYARD_GRAPH_MAX_LABEL_ID determines the bit-field width of the label ID in the VID. Decreasing this value helps support a larger number of vertices when using int32_t as VID_T.

    Defaults to 128; valid values are 1, 2, 4, 8, 16, 32, 64, and 128.

vineyard-graph-loader

vineyard-graph-loader is a graph loader used to load graphs from the CSV format into vineyard.

Usage

$ ./vineyard-graph-loader
Usage: loading vertices and edges as vineyard graph.

       ./vineyard-graph-loader [--socket <vineyard-ipc-socket>] \
                               <e_label_num> <efiles...> <v_label_num> <vfiles...> \
                               [directed] [generate_eid] [string_oid]

   or: ./vineyard-graph-loader [--socket <vineyard-ipc-socket>] --config <config.json>

The program vineyard-graph-loader first accepts an optional argument --socket <vineyard-ipc-socket>, which specifies the IPC socket that the loader will connect to. If the option is not provided, the loader tries to resolve the IPC socket from the environment variable VINEYARD_IPC_SOCKET.

The graph can be loaded either via command line arguments or a JSON configuration.

Using Command-line Arguments

The vineyard-graph-loader accepts a sequence of command line arguments to specify the edge files and vertex files, e.g.,

$ ./vineyard-graph-loader 2 "modern_graph/knows.csv#header_row=true&src_label=person&dst_label=person&label=knows&delimiter=|" \
                            "modern_graph/created.csv#header_row=true&src_label=person&dst_label=software&label=created&delimiter=|" \
                          2 "modern_graph/person.csv#header_row=true&label=person&delimiter=|" \
                            "modern_graph/software.csv#header_row=true&label=software&delimiter=|"
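The argument layout above (edge label count, edge files, vertex label count, vertex files, with each file written as path#options) can also be assembled from a script. The following is a minimal sketch; the helper names file_spec and build_loader_args are hypothetical and not part of vineyard.

```python
import shlex


def file_spec(path, **options):
    """Join a CSV path and its read options as `path#key=value&key=value`."""
    if not options:
        return path
    return path + "#" + "&".join(f"{key}={value}" for key, value in options.items())


def build_loader_args(edge_specs, vertex_specs, loader="./vineyard-graph-loader"):
    """Lay out arguments as <e_label_num> <efiles...> <v_label_num> <vfiles...>."""
    return [
        loader,
        str(len(edge_specs)), *edge_specs,
        str(len(vertex_specs)), *vertex_specs,
    ]


edge_specs = [
    file_spec("modern_graph/knows.csv", header_row="true", src_label="person",
              dst_label="person", label="knows", delimiter="|"),
    file_spec("modern_graph/created.csv", header_row="true", src_label="person",
              dst_label="software", label="created", delimiter="|"),
]
vertex_specs = [
    file_spec("modern_graph/person.csv", header_row="true", label="person", delimiter="|"),
    file_spec("modern_graph/software.csv", header_row="true", label="software", delimiter="|"),
]

args = build_loader_args(edge_specs, vertex_specs)

# shlex.join quotes each argument, so `|`, `&` and `#` are not interpreted by the shell.
print(shlex.join(args))
```

The list in args can be passed as-is to subprocess.run, which avoids shell quoting altogether.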

Using a JSON Configuration

The vineyard-graph-loader can also accept a config file (in JSON format) that specifies the vertex files and edge files to load, as well as global flags. For example,

$ ./vineyard-graph-loader --config config.json

Here is an example of the config.json file for the "modern graph":

{
    "vertices": [
        {
            "data_path": "modern_graph/person.csv",
            // can also be absolute path or path with environment variables
            //
            // "data_path": "/datasets/modern_graph/person.csv",
            // "data_path": "$DATASET/modern_graph/person.csv",
            "label": "person",
            "options": "header_row=true&delimiter=|"
        },
        {
            "data_path": "modern_graph/software.csv",
            "label": "software",
            "options": "header_row=true&delimiter=|"
        }
    ],
    "edges": [
        {
            "data_path": "modern_graph/knows.csv",
            "label": "knows",
            "src_label": "person",
            "dst_label": "person",
            "options": "header_row=true&delimiter=|"
        },
        {
            "data_path": "modern_graph/created.csv",
            "label": "created",
            "src_label": "person",
            "dst_label": "software",
            "options": "header_row=true&delimiter=|"
        }
    ],
    "directed": 1,
    "generate_eid": 1,
    "string_oid": 0,
    "local_vertex_map": 0,
    "print_normalized_schema": 1
}
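When the set of input files is produced by a pipeline, the same configuration can be generated rather than written by hand. The sketch below rebuilds the "modern graph" configuration with Python's json module; the helper names vertex_entry and edge_entry are hypothetical, and only keys that appear in the example above are used.

```python
import json

CSV_OPTIONS = "header_row=true&delimiter=|"


def vertex_entry(data_path, label, options=CSV_OPTIONS):
    """One element of the "vertices" array."""
    return {"data_path": data_path, "label": label, "options": options}


def edge_entry(data_path, label, src_label, dst_label, options=CSV_OPTIONS):
    """One element of the "edges" array."""
    return {
        "data_path": data_path,
        "label": label,
        "src_label": src_label,
        "dst_label": dst_label,
        "options": options,
    }


config = {
    "vertices": [
        vertex_entry("modern_graph/person.csv", "person"),
        vertex_entry("modern_graph/software.csv", "software"),
    ],
    "edges": [
        edge_entry("modern_graph/knows.csv", "knows", "person", "person"),
        edge_entry("modern_graph/created.csv", "created", "person", "software"),
    ],
    # Global flags, with the same values as in the example above.
    "directed": 1,
    "generate_eid": 1,
    "string_oid": 0,
    "local_vertex_map": 0,
    "print_normalized_schema": 1,
}

config_text = json.dumps(config, indent=4)
print(config_text)
```

Saving config_text to a file such as config.json gives an input for the --config option.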

References

Vertices

Each vertex entry can have the following configurations:

  • data_path: the path of the given source. Environment variables are supported, e.g., $HOME/data/person.csv. See also Data Sources.
  • label: the label of the vertex, e.g., person.
  • options: the options used to read the file, e.g., header_row=true&delimiter=|. The detailed options are listed in Read Options.

Edges

Each edge entry can have the following configurations:

  • data_path: the path of the given source. Environment variables are supported, e.g., $HOME/data/knows.csv. See also Data Sources.
  • label: the label of the edge, e.g., knows.
  • src_label: the label of the source vertex, e.g., person.
  • dst_label: the label of the destination vertex, e.g., person.
  • options: the options used to read the file, e.g., header_row=true&delimiter=|. The detailed options are listed in Read Options.

Data Sources

The data_path can be local files, S3 files, HDFS files, or vineyard streams.

When it comes to local files, it can be a relative path, an absolute path, or a path with environment variables, e.g.,

  • data/person.csv
  • /dataset/data/person.csv
  • $HOME/data/person.csv
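A quick pre-flight check can catch a mistyped local path before the loader runs. The sketch below expands environment variables with os.path.expandvars and tests whether the file exists; it is an illustration only, the loader's own expansion rules may differ, and the function name and the DATASET variable are hypothetical.

```python
import os


def resolve_local_path(data_path):
    """Expand $VAR references in a local data_path and report whether the file exists."""
    expanded = os.path.expandvars(data_path)
    return expanded, os.path.isfile(expanded)


# DATASET is set here only for the demonstration.
os.environ["DATASET"] = "/datasets"

for candidate in ["data/person.csv", "/dataset/data/person.csv",
                  "$DATASET/modern_graph/person.csv"]:
    expanded, exists = resolve_local_path(candidate)
    print(f"{candidate} -> {expanded} (exists: {exists})")
```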

When it comes to S3 files and HDFS files, support for these sources in data_path can be achieved in two ways:

  • Option 1: use vineyard.io to read the given sources as vineyard streams first, then pass each stream to the loader by setting data_path to vineyard://<object_id_string>.
  • Option 2: configure the Arrow dependency that is used to build the vineyard-graph-loader to support S3 and HDFS with extra CMake flags.

For edges that connect different kinds of (src, dst) pairs under the same label, repeat the edge object in the configuration file, e.g.,

{
    "vertices": [
       ...
    ],
    "edges": [
        {
            "data_path": "person_knows_person.csv",
            "label": "knows",
            "src_label": "person",
            "dst_label": "person",
            "options": "header_row=true&delimiter=|"
        },
        {
            "data_path": "person_knows_item.csv",
            "label": "knows",
            "src_label": "person",
            "dst_label": "item",
            "options": "header_row=true&delimiter=|"
        },
        ...
    ],
   ...
}

Read Options

The read options specify how to read the given sources. Multiple options are separated by & or #. The available options are:

  • header_row: whether the first row of the CSV file is the header row or not, default is 0.

  • delimiter: the delimiter of the CSV file, default is ,.

  • schema: the columns to read from the CSV file, default is empty, which means that all columns are included. The schema is a ,-separated list of column names or column indices, e.g., name,age or 0,1.

  • column_types: the data type of each column, default is empty, which means that the types are inferred from the data. The column_types is a ,-separated list of data types, e.g., string,int64. If specified, the types of ALL columns must be given; partial specification won't work.

    The supported data types are listed as follows:

    • bool: boolean type.
    • int8_t, int8, byte: signed 8-bit integer type.
    • uint8_t, uint8, char: unsigned 8-bit integer type.
    • int16_t, int16, half: signed 16-bit integer type.
    • uint16_t, uint16: unsigned 16-bit integer type.
    • int32_t, int32, int: signed 32-bit integer type.
    • uint32_t, uint32: unsigned 32-bit integer type.
    • int64_t, int64, long: signed 64-bit integer type.
    • uint64_t, uint64: unsigned 64-bit integer type.
    • float: 32-bit floating point type.
    • double: 64-bit floating point type.
    • string, std::string, str: string type.
  • include_all_columns: whether to include all columns in the CSV file or not, default is 0. If enabled, the columns that exist in the data file but are not listed in the schema option are read as well.

    The combination of schema and include_all_columns is useful when the columns should appear in a different order than in the file, but listing every column name is not desirable. For example, if the ID is stored in the third column of the file and should be used as the vertex ID, set schema=2&include_all_columns=1: all columns are read, and the third column of the file is placed first in the resulting table.
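The interaction between schema and include_all_columns can be illustrated with a small model. This is a sketch of the behavior described above, not the loader's implementation: the function names are hypothetical, column indices are taken as zero-based (as implied by schema=2 selecting the third column), and the remaining columns are assumed to keep their file order.

```python
import re


def parse_read_options(options):
    """Split `key=value` pairs separated by `&` or `#` into a dict."""
    parsed = {}
    for item in re.split(r"[&#]", options):
        if item:
            key, _, value = item.partition("=")
            parsed[key] = value
    return parsed


def select_columns(file_columns, schema="", include_all_columns=False):
    """Return the column order of the resulting table."""
    if not schema:
        # An empty schema means every column is included.
        return list(file_columns)
    selected = []
    for item in schema.split(","):
        # Each schema entry is either a zero-based column index or a column name.
        selected.append(file_columns[int(item)] if item.isdigit() else item)
    if include_all_columns:
        # Assumption: unlisted columns follow in their original file order.
        selected += [name for name in file_columns if name not in selected]
    return selected


options = parse_read_options("header_row=true&schema=2&include_all_columns=1")
columns = select_columns(
    ["name", "age", "id"],
    schema=options.get("schema", ""),
    include_all_columns=options.get("include_all_columns", "0") in ("1", "true"),
)
print(columns)  # ['id', 'name', 'age']
```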

Global Options

Global options control how the fragment is constructed from the given vertices and edges. They are listed as follows:

  • directed: whether the graph is directed or not, default is 1.
  • generate_eid: whether to generate edge IDs or not, default is 0. Generating edge IDs is usually required in GraphScope GIE.
  • retain_oid: whether to retain the original ID in the vertex's property table or not, default is 0. Retaining the original ID in the vertex's property table is usually required in GraphScope GIE.
  • oid_type: the type of the original ID of the vertices, default is int64_t. Can be int64_t or string.
  • large_vid: whether the vertex ID is large or not, default is 1. If you are sure that the number of vertices is fairly small (< 2^(31-log2(vertex_label_number)-1)), setting large_vid to 0 can reduce the memory usage. Note that large_vid=0 isn't compatible with GraphScope GIE.
  • local_vertex_map: whether to use the local vertex map or not, default is 0. Using the local vertex map usually helps reduce the memory usage. Note that local_vertex_map=0 isn't compatible with GraphScope GIE.
  • print_memory_usage: whether to print the memory usage of the graph to STDERR or not, default is 0.
  • print_normalized_schema: whether to print the normalized schema of the graph to STDERR or not, default is 0. The word "normalized" means that the same property name has the same property ID across different labels, which is required by GraphScope GIE.
  • dump: a string that names a directory to dump the graph to, e.g., "dump": "/tmp/dump-graph". Default is empty, which means no dump.
  • dump_dry_run_rounds: if greater than 0, the loader traverses the graph dump_dry_run_rounds times to measure the edge (CSR) access performance. Default is 0.
  • use_perfect_hash: whether to use a perfect hash map when constructing the vertex map, default is 0. Using a perfect hash map usually helps reduce the memory usage, but it is not recommended when the graph is small.
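The bound given for large_vid can be evaluated before choosing the setting. The sketch below computes 2^(31-log2(vertex_label_number)-1) for a given number of vertex labels. Two points are assumptions rather than documented behavior: log2 is rounded up when the label count is not a power of two, and the vertex count is compared against the bound as a single number, since the text does not say whether the limit applies per label or in total. The function names are hypothetical.

```python
def small_vid_vertex_limit(vertex_label_number):
    """Evaluate 2^(31 - log2(vertex_label_number) - 1), rounding log2 up (assumption)."""
    if vertex_label_number < 1:
        raise ValueError("vertex_label_number must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    label_bits = (vertex_label_number - 1).bit_length()
    return 2 ** (31 - label_bits - 1)


def can_disable_large_vid(vertex_count, vertex_label_number):
    """True if vertex_count is strictly below the bound for large_vid=0."""
    return vertex_count < small_vid_vertex_limit(vertex_label_number)


# The "modern graph" example has two vertex labels: person and software.
print(small_vid_vertex_limit(2))          # 536870912
print(can_disable_large_vid(1_000_000, 2))  # True
```

Even when the bound is satisfied, large_vid=0 remains unsuitable for graphs that will be used with GraphScope GIE, as noted above.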