This repository contains a set of algorithms to compute the Lyndon-array (LA) for a string T[1,n].
In particular, we introduce algorithm SACA-K+LA [1] that computes LA together with the suffix array (SA) for T[1,n] in linear time.
SACA-K+LA extends the optimal suffix sorting algorithm SACA-K [2] to also compute LA in linear time using O(\sigma) words of additional space, which is optimal for alphabets of constant size.
Experimental results have shown that our algorithm is competitive in time and space, compared to solutions that compute only LA.
We provide four versions of SACA-K+LA, the last version, option -A 10
, corresponds to the optimal solution.
An ANSI C Compiler (e.g. GNU GCC)
Compilation:
make
Available options:
-A a preferred algorithm to use (default 10)
-d D use the first D documents of the INPUT
-b read INPUT as binary input
-f read INPUT as (default) formatted input (.txt, .fasta or .fastq)
-v verbose output
-o output computed arrays to disk (INPUT.la and INPUT.sa)
-s compute some statistics for LA
-l output lyndon-factors (start positions) to (INPUT.pos)
-L output lyndon-factors (substrings) to (INPUT.lyn)
-c check output (for debug)
-p P print the output arrays LA[1,P] and SA[1,P] (for debug)
-h this help message
Notes:
- Supported extensions are .txt, .fasta and .fastq.
Algorithms:
-A | Algorithm | Output | Running time | Total space | Auxiliary arrays |
---|---|---|---|---|---|
1 | Lyndon_NSV [3] | LA | O(n) | 9n bytes + O(n)-words | ISA+Stack |
2 | GSACA_LA [4] | LA | O(n) | 17n bytes | GSIZE+PREV+ISA |
3 | MAX_LYN [5] | LA | O(n^2) | 5n bytes | |
4 | Lyndon_BWT [6] | LA | O(n) | 9n bytes + O(n)-words | LF+Stack |
5 | BWT_INPLACE_LA | LA + BWT | O(n^2) | 5n bytes | |
6 | GSACA_LA+SA [4] | LA + SA | O(n) | 17n bytes | GSIZE+PREV+ISA |
7 | SACA-K+LA | LA + SA | O(n*avg_lyndon) | 9n bytes | |
8 | SACA-K+LA_17n | LA + SA | O(n) | 17n bytes | PREV+NEXT |
9 | SACA-K+LA_13n | LA + SA | O(n) | 13n bytes | PREV |
10 | SACA-K+LA_9n | LA + SA | O(n) | 9n bytes |
Notes:
-
BWT_INPLACE_LYN is a modification of the BWT-Inplace algorithm by Crochemore et al. [7] to also compute LA.
-
SACA-K+LA [1] is a modification of the suffix sorting algorithm SACAK by Nong [2] to also compute LA.
Run a test:
./lyndon-array dataset/input.txt -A 10 -p 10
Output:
d = 1
N = 7 bytes
sizeof(int) = 4 bytes
## SACAK_LYNDON 9n (linear) ##
TOTAL:
CLOCK = 0.000039 TIME = 0.000000
0.000039
########
i) LA SA suffixes
0) 1 6 banana#
1) 2 5 anana#
2) 1 3 nana#
3) 2 1 ana#
4) 1 0 na#
5) 1 4 a#
6) 1 2 #
########
malloc_count ### exiting, total: 24,478, peak: 21,272, current: 1,144
[1] Louza, F. A., Mantaci, S., Manzini G., Sciortino, M., Telles, G. P.: Inducing the Lyndon Array, SPIRE 2019: 138-151 link
[2] Nong, G., Practical linear-time O(1)-workspace suffix sorting for constant alphabets, ACM Trans. Inform. Syst., vol. 31, no. 3, pp. 1-15, 2013
[3] Christophe Hohlweg, Christophe Reutenauer: Lyndon words, permutations and trees. Theor. Comput. Sci. 307(1): 173-178 (2003)
[4] Uwe Baier: Linear-time Suffix Sorting - A New Approach for Suffix Array Construction. CPM 2016: 23:1-23:12 link
[5] Frantisek Franek, A. S. M. Sohidull Islam, Mohammad Sohel Rahman, William F. Smyth: Algorithms to Compute the Lyndon Array. Stringology 2016: 172-184
[6] Louza, F. A., Smyth W. F., Manzini G., Telles, G. P.: Lyndon Array Construction during Burrows-Wheeler Inversion, J. Discrete Algorithms, vol. 50, pp. 2-9, 2018 link
[7] Maxime Crochemore, Roberto Grossi, Juha Kärkkäinen, Gad M. Landau: Computing the Burrows-Wheeler transform in place and in small space. J. Discrete Algorithms 32: 44-52 (2015) link
Thanks to Uwe Baier for kindly providing the source codes for GSACA_LA and GSACA_LA+SA: link.
Supported by the project Italian MIUR-SIR CMACBioSeq ("Combinatorial methods for analysis and compression of biological sequences") grant n.~RBSI146R5L. P.I. Giovanna Rosone ↩