Run-length encoding offsets #12

Open
daniel-j-h opened this issue Jul 21, 2021 · 1 comment
Comments

@daniel-j-h

We are storing two arrays in our CSR graph format:

  1. offsets into the targets array
  2. targets array with target vertices

For example, for the edges

(0 -> 1)
(0 -> 2)
(1 -> 0)
(1 -> 2)
(2 -> 1)

we are storing internally

offsets: [0 2 4 5]
targets: [1 2 0 2 1]
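The layout above can be sketched as follows; `to_csr` is a hypothetical helper for illustration, not something from this repo:

```python
def to_csr(edges, num_vertices):
    """Build CSR offsets/targets arrays from a list of (source, target) edges."""
    # count out edges per vertex, shifted by one for the prefix sum below
    offsets = [0] * (num_vertices + 1)
    for source, _ in edges:
        offsets[source + 1] += 1
    # prefix sum turns per-vertex counts into start offsets
    for i in range(num_vertices):
        offsets[i + 1] += offsets[i]
    # targets sorted by source vertex line up with the offsets
    targets = [target for _, target in sorted(edges)]
    return offsets, targets

edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 1)]
print(to_csr(edges, 3))  # ([0, 2, 4, 5], [1, 2, 0, 2, 1])
```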

The offsets' deltas describe how many out edges a vertex has:

  • offsets of e.g. [0, 1, 2, 3, ...] mean each vertex has a single out edge
  • offsets of e.g. [0, 2, 4, 6, ...] mean each vertex has two out edges
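A minimal sketch of recovering out degrees from the deltas, using the offsets from the example above:

```python
offsets = [0, 2, 4, 5]  # from the CSR example above
# each vertex's out degree is the delta between adjacent offsets
out_degree = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
print(out_degree)  # [2, 2, 1]
```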

One idea now could be the following

  1. construct a graph
  2. sort all vertices by number of out edges
  3. partition into sub-ranges with the same number of out edges (e.g. 1-out edge sub-range, 2-out edge sub-range, etc.)
  4. run-length encode these sub-ranges, so that all we are saving are e.g. the two items (1, n), (2, m) for all 1 and 2 out edge vertices

Think

  1. offsets: [0 1 2 3 4 6 8 10 12]
  2. offsets: [[0, 1, 2, 3], [4, 6, 8, 10, 12]]
  3. offsets: [(1, 4), (2, 4)]
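Steps 3 and 4 can be sketched like this, assuming the offsets array carries the final end offset (n + 1 entries for n vertices, as in the CSR example at the top); `rle_offsets` is a hypothetical helper, not part of the repo:

```python
from itertools import groupby

def rle_offsets(offsets):
    """Run-length encode the deltas of a sorted offsets array
    into (out degree, number of vertices) pairs."""
    deltas = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
    # collapse runs of equal deltas into (out degree, run length) pairs
    return [(d, sum(1 for _ in group)) for d, group in groupby(deltas)]

print(rle_offsets([0, 1, 2, 3, 4, 6, 8, 10, 12]))  # [(1, 4), (2, 4)]
```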

For a query on vertex v we'd then first have to find the sub-range it belongs to, and only then go to the targets array.
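The lookup could look like the following sketch; `neighbors` is a hypothetical helper, and it walks the runs linearly where a real implementation would probably binary search prefix sums over the runs:

```python
def neighbors(v, rle, targets):
    """Look up vertex v's targets, given run-length encoded
    (out degree, vertex count) pairs over the offsets' deltas."""
    first_vertex = 0  # first vertex id covered by the current run
    first_offset = 0  # first targets index covered by the current run
    for degree, count in rle:
        if v < first_vertex + count:
            # all vertices in this run share the same out degree,
            # so the start offset is a simple multiply-add
            start = first_offset + (v - first_vertex) * degree
            return targets[start:start + degree]
        first_vertex += count
        first_offset += count * degree
    raise IndexError(f"vertex {v} out of range")
```

With rle = [(1, 4), (2, 4)] and twelve targets, vertex 5 maps to targets[6:8], matching offsets[5] = 6 in the example above.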

Maybe it makes sense to first compute a histogram of out-edge counts to determine when this encoding pays off. For road networks, for example, it probably makes sense to special-case the 1/2/3 out-edge vertices, just because they are so common: road networks are quite sparse.
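Such a histogram is a one-liner over the deltas; a minimal sketch, again using the small example from above:

```python
from collections import Counter

offsets = [0, 2, 4, 5]  # from the CSR example above
# histogram of out degrees: how many vertices have k out edges
histogram = Counter(offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1))
print(histogram)  # Counter({2: 2, 1: 1})
```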

We'd then have to re-number the vertices in [0, n) such that they follow the out-edge sorting.
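The re-numbering could be sketched like this; `renumber_by_out_degree` is a hypothetical helper that stably sorts vertex ids by out degree and rebuilds the CSR arrays under the new ids:

```python
def renumber_by_out_degree(offsets, targets):
    """Re-number vertices so ids are sorted by out degree,
    returning rebuilt CSR arrays and the old-id -> new-id map."""
    n = len(offsets) - 1
    degree = [offsets[i + 1] - offsets[i] for i in range(n)]
    # stable sort of old vertex ids by their out degree
    order = sorted(range(n), key=lambda v: degree[v])
    new_id = [0] * n
    for new, old in enumerate(order):
        new_id[old] = new
    # rebuild offsets/targets under the new numbering,
    # translating each target through the id map
    new_offsets = [0]
    new_targets = []
    for old in order:
        new_targets.extend(new_id[t] for t in targets[offsets[old]:offsets[old + 1]])
        new_offsets.append(len(new_targets))
    return new_offsets, new_targets, new_id
```

On the example graph above this yields offsets [0, 1, 3, 5], i.e. deltas [1, 2, 2], which run-length encode as [(1, 1), (2, 2)].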

Related

cc @ucyo

@daniel-j-h

The insights in Partitioned Elias-Fano Indexes (#13) sound interesting:

While its space occupancy is competitive with some state-of-the-art methods such as γ-δ-Golomb codes and PForDelta, it fails to exploit the local clustering that inverted lists usually exhibit, namely the presence of long subsequences of close identifiers. In this paper we describe a new representation based on partitioning the list into chunks and encoding both the chunks and their endpoints with Elias-Fano, hence forming a two-level data structure. This partitioning enables the encoding to better adapt to the local statistics of the chunk, thus exploiting clustering and improving compression.

If we re-order vertices by their number of outgoing edges, then we'll get offset array sub-ranges with deltas of

  • all 1s for vertices with a single out edge,
  • all 2s for vertices with two out edges,
  • and so on.

These sub-ranges we could then run-length encode. Or we could just try the partitioned Elias-Fano coding.
