Python implementation of the online grammar compression algorithm OLCA. POLCA was developed for studying OLCA, so its time and space consumption and the output compressed size are much worse than lcacomp which is a C++ implemenation of OLCA. However, POLCA has an advantage, lcacomp does not have, that it can treat each multi byte character as a single character.
Note: The main drawback of POLCA is that it stores each whole line of input in RAM because python does not provide an interface for reading by characters.
Python 3.5 or higher is required.
See requirements.txt
for other required libraries.
They will be installed by the following command.
$ pip install -r requirements.txt
$ python src/polcarun.py -h
src/polcarun.py: Python implementation of the online grammar compression algorithm OLCA.
Usage:
src/polcarun.py (comp | decomp) [options]
Options:
-i FILE --input FILE input file [default: stdin]
-o FILE --output FILE output file [default: stdout]
-e ENCODING --encoding ENCODING open input/output by ENCODING for comp/decomp [default: binary]
if ENCODING="binary", treat a byte as a single character
otherwise, treat a multi-byte encoded by ENCODING as a single character
An example for a small text.
$ echo "hogehoge" > hoge.txt
$ python src/polcarun.py comp -i hoge.txt -o hoge.enc
$ python src/polcarun.py decomp -i hoge.enc -o hoge.decomp
$ cat hoge.decomp
hogehoge
An example for a highly-repetitive text. Note: The compressed size is very small, but POLCA store whole input (about 17M) in RAM. This is because the input huge data is written in a line.
$ python -c "print('a'*(2**24))" > fuga.txt # 16777217 bytes (about 17M)
$ python src/polcarun.py comp -i fuga.txt -o fuga.enc # 266 bytes
$ python src/polcarun.py comp -i fuga.enc -o fuga.decomp
$ diff fuga.txt fuga.decomp
# no output
An example for a multi-byte text.
$ echo "âêîôû" | python src/polcarun.py comp --encoding=utf8 | python src/polcarun.py decomp --encoding=utf8
âêîôû
# ofcourse, you can treat a byte as a character also for a multi-byte text!
$ echo "âêîôû" | python src/polcarun.py comp | python src/polcarun.py decomp
âêîôû
$ cd data
$ ./download.sh # download corpus
$ cd ../
$ python -m unittest -v src/test_polca.py
download.sh
downloads files for test from The Canterbury Corpus.
MIT © 2017 Keisuke Goto
- Shirou Maruyama, Hiroshi Sakamoto, Masayuki Takeda: An Online Algorithm for Lightweight Grammar-Based Compression. Algorithms 5(2): 214-235 (2012)
- lcacomp: A C++ implementation of OLCA.