Skip to content

Generalized enhanced suffix array construction in external memory [CPM'13, AMB 2017]

License

Notifications You must be signed in to change notification settings

felipelouza/egsa

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

#egsa tool

This software is the implementation of egsa [1] (http://link.springer.com/chapter/10.1007%2F978-3-642-38905-4_20), an external memory algorithm to construct generalized enhanced suffix arrays.

Given a collection of K strings, egsa outputs the:

  • Generalized suffix array
  • LCP-array
  • Burrows-Wheeler transform.

##install:

git clone https://github.com/felipelouza/egsa.git
cd egsa

Compile:

make compile 

##run:

Given a collection of K strings concatenated in a single file INPUT in the directory DIR.

fasta

make run DIR=dataset/fasta/ INPUT=proteins-100.fasta K=100

fastq

make run DIR=dataset/fastq/ INPUT=reads-100.fastq K=100

txt

make run DIR=dataset/txt/	INPUT=input-100.txt K=100
  • strings are separated per '\0' (new line)

-- Examples:

One could find examples in dataset folder, at least one for each setting mode.

##output:

The generalized enhanced suffix array is stored in the same directory DIR in a file called:

$(INPUT).K.gesa

Temporary files are stored in folders

$(DIR)/partition/
$(DIR)/tmp/

##options:

Output the Burrows-Wheeler transform:

make compile BWT=1

One can inform the maximum available internal memory to be used (in MB):

make clean
make compile MEMLIMIT=10

debug:

To see a more detailed execution output use:

make compile DEBUG=1

One can check if the output produced by egsa is correct:

make run CHECK=1

##external resources:

We have included the source codes of the following algorithms:

  • gsaca-k: generalized suffix array construction algorithm [2]
  • lcp-PHI: LCP-array construction algorithm given the suffix array [3]

##citation:

Please, if you use egsa tool in an academic setting cite the following paper:

@inproceedings{DBLP:conf/cpm/LouzaTC13,
  author = {Louza, Felipe A. and Telles, Guilherme P. and Ciferri, Cristina D. A.},
  title = {External Memory Generalized Suffix and {LCP} Arrays Construction},
  year = {2013},
  isbn = {978-3-642-38904-7},
  booktitle = {Combinatorial Pattern Matching, 24th Annual Symposium, {CPM} 2013,
           Bad Herrenalb, Germany, June 17-19, 2013. Proceedings},
  pages = {201--210},
  year = {2013},
  series = {Lecture Notes in Computer Science},
  volume = {7922},
  publisher = {Springer}
}

##references:

[1] Louza, F. A., Telles, G. P., Ciferri, C. D. A. (2013). External Memory Generalized Suffix and LCP Arrays Construction. In Proc. CPM (pp. 201-210).

[2] Louza, F. A., Gog, S., Telles, G. P. (2016). Induced Suffix Sorting for String Collections. In Proc. DCC (pp. 43-52), https://github.com/felipelouza/gsa-is

[3] Kärkkäinen, J., Manzini, G., & Puglisi, S. J. (2009). Permuted Longest-Common-Prefix Array. In G. Kucherov & E. Ukkonen (Eds.), Proc. CPM (Vol. 5577, pp. 181–192).

##thanks:

Thanks to Fabio Garofalo, Giovanna Rosone and Guilherme P. Telles by helpful suggestions and debugging.