This repository contains the implementation of algorithm [1] that computes the document-array (DA) for a string collection in linear time using constant additional space.
Our algorithm receives the suffix array (SA) of the concatenated string [2] and reuses its space to store the LF-array. Then SA is reconstructed during the computation of DA.
An ANSI C Compiler (e.g. GNU GCC)
/** @brief computes the document array of string s[0, n-1] given its SA * * @param T input concatenated string, using separators s[i]=1 and with s[n-1]=0 * @param SA Suffix array * @param DA Document array * @param n string length * @param d number of documents * @return -1 if an error occured, otherwise returns 0 */ int document_array_9n(unsigned char* T, int_t* SA, int_da* DA, uint_t n, uint_t d);
Compilation:
make
Available options:
-d D use the first D documents of the INPUT -v verbose output -o output computed arrays to disk (INPUT.da and INPUT.sa) -c check output (for debug) -p P print the output arrays LA[1,P] and SA[1,P] (for debug) -h this help message
Notes:
- Supported extensions are .txt, .fasta and .fastq.
Run a test:
/main dataset/input.txt -p 10 -c
Output:
d = 3 N = 32 bytes sizeof(int) = 4 bytes TOTAL: CLOCK = 0.000059 TIME = 0.000000 0.000059 SA: isSorted!! DA: isDA!! ######## i) DASAsuffixes 0) 331# 1) 06$abacabad$... 2) 115$banaanana... 3) 230$# 4) 05a$abacabad... 5) 229a$# 6) 224aanana$# 7) 219aananaanan... 8) 17abacabad$b... 9) 111abad$banaa... ######## malloc_count ### exiting, total: 25,032, peak: 21,512, current: 1,024
[1] Louza, F. A., A Simple Algorithm for Computing the Document Array. CoRR abs/1812.09094 (2018)
[2] Louza, F. A., Gog, S., Telles, G. P., Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci., vol. 678, pp. 22-39, 2017, Elsevier.
Thanks to Giovanni Manzini, Travis Gagie and Nicola Prezza for helpful discussions