Skip to content

felipelouza/document-array

Repository files navigation

document-array

This repository contains the implementation of algorithm [1] that computes the document-array (DA) for a string collection in linear time using constant additional space.

Introduction

Our algorithm receives the suffix array (SA) of the concatenated string [2] and reuses its space to store the LF-array. Then SA is reconstructed during the computation of DA.

Build requirements

An ANSI C Compiler (e.g. GNU GCC)

API

/** @brief computes the document array of string s[0, n-1] given its SA  *  * @param T input concatenated string, using separators s[i]=1 and with s[n-1]=0  * @param SA Suffix array   * @param DA Document array  * @param n string length  * @param d number of documents  * @return -1 if an error occured, otherwise returns 0  */ int document_array_9n(unsigned char* T, int_t* SA, int_da* DA, uint_t n, uint_t d);

Example

Compilation:

make

Available options:

-d D	use the first D documents of the INPUT -v	verbose output -o	output computed arrays to disk (INPUT.da and INPUT.sa) -c	check output (for debug) -p P	print the output arrays LA[1,P] and SA[1,P] (for debug) -h	this help message

Notes:

  • Supported extensions are .txt, .fasta and .fastq.

Run a test:

/main dataset/input.txt -p 10 -c

Output:

d = 3 N = 32 bytes sizeof(int) = 4 bytes TOTAL: CLOCK = 0.000059 TIME = 0.000000 0.000059 SA: isSorted!! DA: isDA!! ######## i) DASAsuffixes 0) 331# 1) 06$abacabad$... 2) 115$banaanana... 3) 230$# 4) 05a$abacabad... 5) 229a$# 6) 224aanana$# 7) 219aananaanan... 8) 17abacabad$b... 9) 111abad$banaa... ######## malloc_count ### exiting, total: 25,032, peak: 21,512, current: 1,024

References

[1] Louza, F. A., A Simple Algorithm for Computing the Document Array. CoRR abs/1812.09094 (2018)

[2] Louza, F. A., Gog, S., Telles, G. P., Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci., vol. 678, pp. 22-39, 2017, Elsevier.

Thanks

Thanks to Giovanni Manzini, Travis Gagie and Nicola Prezza for helpful discussions

About

A simple algorithm for computing the document array [IPL 2020]

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published