Support for faster scans #15
Comments
How fast are scans currently compared to, say, Parquet or Avro?
Succinct is currently not optimized for full scans, but supports queries like random access (i.e., The time taken to extract the data for full scans would depend on the dataset size and the dataset itself. I haven't benchmarked the scan performance against Parquet and Avro, but I would expect Parquet and Avro to be faster for full scans. What would your intended use case be?
The use case would be similar to Elasticsearch or HBase: most access is by
On Fri, Apr 22, 2016 at 6:02 PM, Anurag Khandelwal <notifications@github.com
The full scan performance for Succinct could be slower than HBase, depending on the size of the dataset and the size of the cluster being used. However, with the planned optimization described here, we should be able to support full scans at the rate at which the Snappy codec can decompress data, at the cost of some reduction in the overall compression factor (since the Succinct representation would be supplemented with a scan-efficient compressed representation). Again, I don't have a performance comparison against Avro or Parquet, but the Snappy codec is known to provide very fast decompression (on the order of 500 MB/s per thread, as suggested here). Would that cater to your use case? I'd be happy to assign higher priority to this if it helps!
Faster scans can be supported by keeping a Snappy-compressed representation of the data alongside the Succinct data structures. Operations on the Succinct RDDs / DataFrame that require full scans (e.g., aggregates) can execute efficiently on the alternate representation, whereas search and random access queries are handled by the Succinct data structures. The two representations should remain under the hood, exposing a single unified interface to the Succinct RDDs / DataFrame.
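To make the dual-representation idea concrete, here is a minimal sketch in Python. It is not Succinct's implementation or API: the `DualStore` class and its method names are hypothetical, a plain list stands in for the Succinct data structures, and `zlib` from the standard library stands in for Snappy (which has no standard-library binding). It also assumes records are strings that contain no newline characters.

```python
import zlib


class DualStore:
    """Hypothetical store that keeps two representations of the same records
    behind one interface (illustration only, not Succinct's API)."""

    def __init__(self, records):
        # Query side: a plain list stands in for the Succinct data structures.
        self._records = list(records)
        self._count = len(self._records)
        # Scan side: a single compressed blob. zlib stands in for Snappy here.
        # Assumption: records never contain "\n", so it is a safe delimiter.
        raw = "\n".join(self._records).encode("utf-8")
        self._scan_blob = zlib.compress(raw)

    def get(self, index):
        """Random access, served by the query-side representation."""
        return self._records[index]

    def search(self, needle):
        """Return indices of records containing `needle` (query side)."""
        return [i for i, rec in enumerate(self._records) if needle in rec]

    def scan(self):
        """Full scan, served by decompressing the scan-side blob."""
        if self._count == 0:
            return []
        return zlib.decompress(self._scan_blob).decode("utf-8").split("\n")

    def count_where(self, predicate):
        """An aggregate that needs a full scan, so it uses the scan side."""
        return sum(1 for rec in self.scan() if predicate(rec))


store = DualStore(["alpha,1", "beta,2", "alpha,3"])
print(store.get(1))                                        # random access
print(store.search("alpha"))                               # search
print(store.count_where(lambda r: r.startswith("alpha")))  # scan-based aggregate
```

Callers see only `get`, `search`, `scan`, and `count_where`; which representation answers each call is an internal detail. The trade-off is the one noted in the discussion above: storing the data twice lowers the overall compression factor.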