Run-length encoding offsets #12

Open
daniel-j-h opened this issue Jul 21, 2021 · 1 comment
Comments

@daniel-j-h

We are storing two arrays in our CSR graph format:

  1. offsets into the targets array
  2. targets array with target vertices

For example, for the edges

(0 -> 1)
(0 -> 2)
(1 -> 0)
(1 -> 2)
(2 -> 1)

we are storing internally

offsets: [0 2 4 5]
targets: [1 2 0 2 1]
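The layout above can be sketched as follows; `to_csr` is a hypothetical helper for illustration, not something from this repo:

```python
def to_csr(edges, num_vertices):
    """Build CSR offsets/targets arrays from a list of (source, target) edges."""
    # count out edges per vertex, shifted by one for the prefix sum below
    offsets = [0] * (num_vertices + 1)
    for source, _ in edges:
        offsets[source + 1] += 1
    # prefix sum turns per-vertex counts into start offsets
    for i in range(num_vertices):
        offsets[i + 1] += offsets[i]
    # targets sorted by source vertex line up with the offsets
    targets = [target for _, target in sorted(edges)]
    return offsets, targets

edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 1)]
print(to_csr(edges, 3))  # ([0, 2, 4, 5], [1, 2, 0, 2, 1])
```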

The offsets' deltas describe how many out edges a vertex has:

  • offsets of e.g. [0, 1, 2, 3, ...] mean each vertex has a single out edge
  • offsets of e.g. [0, 2, 4, 6, ...] mean each vertex has two out edges
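A minimal sketch of recovering out degrees from the deltas, using the offsets from the example above:

```python
offsets = [0, 2, 4, 5]  # from the CSR example above
# each vertex's out degree is the delta between adjacent offsets
out_degree = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
print(out_degree)  # [2, 2, 1]
```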

One idea now could be the following

  1. construct a graph
  2. sort all vertices by number of out edges
  3. partition into sub-ranges with the same number of out edges (e.g. 1-out edge sub-range, 2-out edge sub-range, etc.)
  4. run-length encode these sub-ranges, so that all we are saving are e.g. the two items (1, n), (2, m) for all 1 and 2 out edge vertices

Think

  1. offsets: [0 1 2 3 4 6 8 10 12]
  2. offsets: [[0, 1, 2, 3], [4, 6, 8, 10, 12]]
  3. offsets: [(1, 4), (2, 4)]
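Steps 3 and 4 can be sketched like this, assuming the offsets array carries the final end offset (n + 1 entries for n vertices, as in the CSR example at the top); `rle_offsets` is a hypothetical helper, not part of the repo:

```python
from itertools import groupby

def rle_offsets(offsets):
    """Run-length encode the deltas of a sorted offsets array
    into (out degree, number of vertices) pairs."""
    deltas = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
    # collapse runs of equal deltas into (out degree, run length) pairs
    return [(d, sum(1 for _ in group)) for d, group in groupby(deltas)]

print(rle_offsets([0, 1, 2, 3, 4, 6, 8, 10, 12]))  # [(1, 4), (2, 4)]
```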

For a query on vertex v we'd then first have to find the sub-range it belongs to, and only then go to the targets array.
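The lookup could look like the following sketch; `neighbors` is a hypothetical helper, and it walks the runs linearly where a real implementation would probably binary search prefix sums over the runs:

```python
def neighbors(v, rle, targets):
    """Look up vertex v's targets, given run-length encoded
    (out degree, vertex count) pairs over the offsets' deltas."""
    first_vertex = 0  # first vertex id covered by the current run
    first_offset = 0  # first targets index covered by the current run
    for degree, count in rle:
        if v < first_vertex + count:
            # all vertices in this run share the same out degree,
            # so the start offset is a simple multiply-add
            start = first_offset + (v - first_vertex) * degree
            return targets[start:start + degree]
        first_vertex += count
        first_offset += count * degree
    raise IndexError(f"vertex {v} out of range")
```

With rle = [(1, 4), (2, 4)] and twelve targets, vertex 5 maps to targets[6:8], matching offsets[5] = 6 in the example above.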

Maybe it makes sense to first compute a histogram of out-edge counts to determine when this encoding pays off. For road networks, for example, it probably makes sense to special-case the 1/2/3 out-edge vertices, just because they are so common: road networks are quite sparse.
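Such a histogram is a one-liner over the deltas; a minimal sketch, again using the small example from above:

```python
from collections import Counter

offsets = [0, 2, 4, 5]  # from the CSR example above
# histogram of out degrees: how many vertices have k out edges
histogram = Counter(offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1))
print(histogram)  # Counter({2: 2, 1: 1})
```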

We'd then have to re-number the vertices in [0, n) such that they follow the out-edge sorting.
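The re-numbering could be sketched like this; `renumber_by_out_degree` is a hypothetical helper that stably sorts vertex ids by out degree and rebuilds the CSR arrays under the new ids:

```python
def renumber_by_out_degree(offsets, targets):
    """Re-number vertices so ids are sorted by out degree,
    returning rebuilt CSR arrays and the old-id -> new-id map."""
    n = len(offsets) - 1
    degree = [offsets[i + 1] - offsets[i] for i in range(n)]
    # stable sort of old vertex ids by their out degree
    order = sorted(range(n), key=lambda v: degree[v])
    new_id = [0] * n
    for new, old in enumerate(order):
        new_id[old] = new
    # rebuild offsets/targets under the new numbering,
    # translating each target through the id map
    new_offsets = [0]
    new_targets = []
    for old in order:
        new_targets.extend(new_id[t] for t in targets[offsets[old]:offsets[old + 1]])
        new_offsets.append(len(new_targets))
    return new_offsets, new_targets, new_id
```

On the example graph above this yields offsets [0, 1, 3, 5], i.e. deltas [1, 2, 2], which run-length encode as [(1, 1), (2, 2)].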

Related

cc @ucyo

@daniel-j-h

The insights in Partitioned Elias-Fano Indexes (#13) sound interesting:

While its space occupancy is competitive with some state-of-the-art methods such as γ-δ-Golomb codes and PForDelta, it fails to exploit the local clustering that inverted lists usually exhibit, namely the presence of long subsequences of close identifiers. In this paper we describe a new representation based on partitioning the list into chunks and encoding both the chunks and their endpoints with Elias-Fano, hence forming a two-level data structure. This partitioning enables the encoding to better adapt to the local statistics of the chunk, thus exploiting clustering and improving compression.

If we re-order vertices by their number of outgoing edges, then we'll get offset array sub-ranges with deltas of

  • all 1s for vertices with a single out edge,
  • all 2s for vertices with two out edges,
  • and so on.

These sub-ranges we could then run-length encode. Or we could just try the partitioned Elias-Fano coding.
