This software is the implementation of egsa [1] (http://link.springer.com/chapter/10.1007%2F978-3-642-38905-4_20), an external memory algorithm to construct generalized enhanced suffix arrays.
Given a collection of K strings, egsa outputs the:
- Generalized suffix array
- LCP-array
- Burrows-Wheeler transform.
git clone https://github.com/felipelouza/egsa.git
cd egsa
Compile
make compile
Given a collection of K strings concatenated in a single file INPUT in the directory DIR.
fasta
make run DIR=dataset/fasta/ INPUT=proteins-100.fasta K=100
fastq
make run DIR=dataset/fastq/ INPUT=reads-100.fastq K=100
txt
make run DIR=dataset/txt/ INPUT=input-100.txt K=100
- strings are separated per '\0' (new line)
Examples
One could find examples in dataset folder, at least one for each setting mode.
The generalized enhanced suffix array is stored in the same directory DIR in a file called:
$(INPUT).K.gesa
Temporary files are stored in folders
$(DIR)/partition/
$(DIR)/tmp/
Output the Burrows-Wheeler transform:
make compile BWT=1
One can inform the maximum available internal memory to be used (in MB):
make clean
make compile MEMLIMIT=10
debug
To see a more detailed execution output use:
make compile DEBUG=1
One can check if the output produced by egsa is correct:
make run CHECK=1
##external resources:
We have included the source codes of the following algorithms:
- gsaca-k: generalized suffix array construction algorithm [2]
- lcp-PHI: LCP-array construction algorithm given the suffix array [3]
Please, if you use egsa tool in an academic setting cite the following paper:
@inproceedings{DBLP:conf/cpm/LouzaTC13,
author = {Louza, Felipe A. and Telles, Guilherme P. and Ciferri, Cristina D. A.},
title = {External Memory Generalized Suffix and {LCP} Arrays Construction},
year = {2013},
isbn = {978-3-642-38904-7},
booktitle = {Combinatorial Pattern Matching, 24th Annual Symposium, {CPM} 2013,
Bad Herrenalb, Germany, June 17-19, 2013. Proceedings},
pages = {201--210},
year = {2013},
series = {Lecture Notes in Computer Science},
volume = {7922},
publisher = {Springer}
}
[1] Louza, F. A., Telles, G. P., Ciferri, C. D. A. (2013). External Memory Generalized Suffix and LCP Arrays Construction. In Proc. CPM (pp. 201-210).
[2] Louza, F. A., Gog, S., Telles, G. P. (2016). Induced Suffix Sorting for String Collections. In Proc. DCC (pp. 43-52), https://github.com/felipelouza/gsa-is
[3] Kärkkäinen, J., Manzini, G., & Puglisi, S. J. (2009). Permuted Longest-Common-Prefix Array. In G. Kucherov & E. Ukkonen (Eds.), Proc. CPM (Vol. 5577, pp. 181–192).
##thanks:
Thanks to Fabio Garofalo, Giovanna Rosone and Guilherme P. Telles by helpful suggestions and debugging.