-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a rank-select data structure, see #14 #41
base: main
Are you sure you want to change the base?
Conversation
// A succinct rank-select datastructure over a bitset. | ||
// | ||
// rank1(bits, n): sum of 1s in bits before position n | ||
// | ||
// select1(bits, n): position of nth 1 bit set in bits | ||
// | ||
// We implement a very simple rank-select datastructure | ||
// with room for improvement in terms of space and runtime | ||
// but the benefit is a relatively simple implementation. | ||
// | ||
// For rank: We make use of recent learnings from [1] and | ||
// [3]: We use basic blocks of 512 bits (cache line size) | ||
// and within these basic blocks make use of repeated | ||
// popcnt instructions. This is similar to rank9 from [4] | ||
// but without the 9-bit second level index and with the | ||
// popcnt hardware instruction instead of broadword tricks. | ||
// With a cache line aligned bitvector we should see a max | ||
// of two cache misses: one for the basic block and then one | ||
// for the remainder in the bitvector itself. Potential | ||
// for improvements are listed in [1] and [2] e.g. the | ||
// use of multi level rank inventories. | ||
// | ||
// For select: We make use of recent learnings from [1] and | ||
// [3] where we sample every e.g. 8192'th bit set and from | ||
// there use a scan over the rank inventory to speed up | ||
// the search. The final cache line select works using | ||
// the recent pdep hardware instruction. Note that worst | ||
// case runtime is in O(n) for adversarily bitvector cases | ||
// such as: a very large bit vector but only one bit set at | ||
// the very start and one bit set at the very end. | ||
// Potential for improvements are listed in [1] and [2]. | ||
// In addition one idea could be tuning that 8192 constant | ||
// e.g. based on the bitvector's load factor (how many bits | ||
// are set) to make sure scanning ranks stays reasonable. | ||
// The simple-select datastructure in [3] is especially | ||
// tuned for select (does not rank) and seems to be strong | ||
// in runtime but also needs more space. | ||
// | ||
// | ||
// [1] SPIDER: Improved Succinct Rank and Select Performance | ||
// https://arxiv.org/abs/2405.05214 | ||
// | ||
// [2] Engineering Compact Data Structures for Rank and Select Queries on Bit Vectors | ||
// https://arxiv.org/abs/2206.01149 | ||
// | ||
// [3] A Fast x86 Implementation of Select | ||
// https://arxiv.org/abs/1706.00990 | ||
// | ||
// [4] Broadword Implementation of Rank/Select Queries | ||
// https://vigna.di.unimi.it/ftp/papers/Broadword.pdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impl. details and context
// We can compute rank from two parts: the pre-computed | ||
// ranks for each cache line of size 512, and in case | ||
// we rank inside a cache line adding the remaining | ||
// residual from ranking the bits inside the cache line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rank
// To select the nth set bit on the bitset | ||
// | ||
// 1. Use the samples array to find two positions | ||
// where the nth bit has to be within. This range | ||
// in general is unrestricted and in adversarial | ||
// cases the range might span the whole bitset. | ||
// That is why select is in O(n) but fast in | ||
// practice and works fine for our use cases. | ||
// | ||
// 2. Use the ranks array to search for a block in | ||
// the range where the nth bit must occur. | ||
// | ||
// 3. Within the 512 bit block where the nth bit | ||
// must occur, rely on our highly tuned | ||
// select512 implementations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
select
// Our data structure can be constructed in a single linear scan | ||
// over the bitvector: for every block of 512 bits we store the | ||
// computed rank up to this position in the ranks array. And for | ||
// every 8192th bit set we store its position in the samples array. | ||
// | ||
// By iterating one cache line at a time we can make use of our | ||
// highly tuned rank512 and select512 implementations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
construction
Simple rank-select data structure for ranking (sum of 1 bits) and selecting (position of ith 1).
We will need this for an even tinier graph representation that
for graph traversal we can then get edges as in
[select(start_vertex), select(start_vertex + 1)]
to then first decode the vbytes and then delta-decode it all in one go. I'm interested in how well this will actually work in practice nowadays.The big downside at the moment is we make heavy use of the popcnt instruction that's only available in Haswell x86 processors. For arm we don't have a reasonable implementation at the moment and also no access to machines to test and benchmark.
See the comments inline for more context and implementation details.