The document introduces two algorithms for constructing a suffix array: SA-IS and SA-DS. SA-IS uses induced sorting of longest common prefix substrings, while SA-DS uses radix sorting of fixed-length substrings. The document provides pseudocode for the algorithms and explains various terms and data structures used, including longest minimal suffixes, L-type and S-type characters, and buckets for sorting.
概要 • Two Efficient Algorithms for Linear Suffix Array ConstrucMon – Authors: Ge Nong, Sen Zhang, and Wai Hong Chan • 1st algorithm: SA-‐IS – Induced SorMng Variable-‐Length LMS-‐Substrings • 2nd algorithm: SA-‐DS – Radix SorMng Fixed-‐Length d-‐CriMcal Substrings
3.
SA-‐IS/SA-‐DS アルゴリズム概要 SA-‐IS(S,SA) Scan S to create t Find all LMS-‐substrings to create P1 Induced-‐sort all the LMS-‐substrings using P1 and B Name each LMS-‐substring to create S1 If each char in S1 is unique: SA1[S1[i]] = i for all i Else SA-‐IS(S1, SA1) Induce SA from SA1 SA-‐DS(S,SA) Scan S to create t Find all the d-‐critical substrings to create P1 Radix sort all the d-‐critical substrings in P1 using B Name each d-‐critical substring to create S1 If each char in S1 is unique: SA1[S1[i]] = i for all i Else SA-‐DS(S1, SA1) Induce SA from SA1
4.
データの説明 (1) • S: 入力文字列 – 長さ n とする – The senMnel $ で終端されていることを仮定. S[i] > $ for all i in [0, n-‐1) • SA: 出力 Suffix Array • t: 長さ n のビット列 – S[i] の L/S-‐type を表す(後述) – t[i] = 1 if S[i] is S-‐type, else 0
データ例 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 6 10 16 B: $:1, i:8, m:2, p:2, s:4, others:0 ($ < i < m < p < s) $ i m p s tmp: {_} {_ _ _ _ _ _ _ _} {_ _} {_ _} {_ _ _ _}
8.
SA-‐IS algorithm piaces • Find all LSM substring to create P1 • Induced-‐sort all the LMS-‐substrings using P1 and B • Name each LMS-‐substring to create S1 • Recursive call SA-‐IS(S1, SA1) • Induce SA from SA1
9.
Induced-‐sort all the LMS-‐substrs • (1) IniMalize tmp where each member is empty • (2) Scan P1 and put to the correct bucket from right to leg • (3) Scan tmp from leg to right and t[tmp[i] – 1] is 0 then put it to the bucket • (4) Scan tmp from right to leg and t[tmp[i] – 1] is 1 then put it to the bucket 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 6 10 16 $ i m p s tmp: {16} { _ _ _ _ _ 10 6 2} { _ _} { _ _} { _ _ _ _} tmp: {16} {15 14 _ _ _ 10 6 2} { 1 0} {13 12} { 9 5 8 4} tmp: {16} {15 14 10 6 2 11 7 3} { 1 0} {13 12} { 9 5 8 4}
10.
0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 6 10 16 $ i m p s tmp: {16} {15 14 10 6 2 11 7 3} { 1 0} {13 12} { 9 5 8 4} LSM-‐substrs in the order of suffix array (tmp) 16 10 6 2 $ iippii$ iissi iissi Rename items where the same LMS-‐substrings indicate the same name 0 1 2 2 Renamed items in the order of S S1: 2 2 1 0 Find the lexicographic names of all substrings
11.
Recursive call of SA-‐IS(S1, SA1) Idx: 0 1 2 3 S1: 2 2 1 0 t: 0 0 0 0 LMS: * P1: 3 0 1 2 tmp: {3} {2} {1 0} All items are unique in the created suffix array (tmp) SA1: 3 2 1 0
12.
Induce SA from SA1 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 6 10 16 SA1 : 3 2 1 0 for i from 4-‐1 to 0: put P1[SA1[i]] in the suffix array (tmp) $ i m p s tmp: {16} { _ _ _ _ _ 10 6 2} { _ _} { _ _} { _ _ _ _} tmp: {16} {15 14 _ _ _ 10 6 2} { 1 0} {13 12} { 9 5 8 4} tmp: {16} {15 14 10 6 2 11 7 3} { 1 0} {13 12} { 9 5 8 4}
13.
SA-‐DS algorithm piaces • Find all the d-‐criMcal substrings to create P1 • Radix sort all the d-‐criMcal substrings in P1 using B
d-‐CriMal substrs 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 4 6 8 10 12 14 16 2: iiss 4: ssii 6: iiss 8: ssii 10: iipp 12: ppii 14: ii$$ 16: $$$$
18.
Radix sort of P1 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 4 6 8 10 12 14 16 2: iiss 14: ii$$ 14: ii$$ 16: $$$$ 16: $$$$ 4: ssii 16: $$$$ 16: $$$$ 14: ii$$ 14: ii$$ 6: iiss 12: ppii 12: ppii 10: iipp 10: iipp 8: ssii 4: ssii 4: ssii 2: iiss 2: iiss 10: iipp 8: ssii 8: ssii 6: iiss 6: iiss 12: ppii 10: iipp 10: iipp 12: ppii 12: ppii 14: ii$$ 2: iiss 2: iiss 4: ssii 4: ssii 16: $$$$ 6: iiss 6: iiss 8: ssii 8: ssii P1’: 16 14 10 2 6 12 4 8
19.
Find lexicographic names 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 4 6 8 10 12 14 16 P1’: 16 14 10 2 6 12 4 8 Name: 0 1 2 3 3 4 5 5 (order of P1’) S1: 3 5 3 5 2 4 1 0 (order of P1)
20.
Recursive call of SA-‐DS 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 4 6 8 10 12 14 16 P1’: 16 14 10 2 6 12 4 8 S1: 3 5 3 5 2 4 1 0 t1: 1 0 1 0 1 0 0 1 P2: 2 4 7 P2’: 7 4 2 S2: 2 1 0 SA2: 2 1 0
21.
評価データ rithms were implementedin C++ and compiled by g++ with the option of -O3. T d from Sanders’s website [19]. For the KA algorithm, we use an improved versio o’s website [20]) from Yuta Mori. The source code of our algorithm IS is given DS1 and DS2 were embodied in less than 150 and 250 effective lines of code, r on request. Table 1: Data Used in the Experiments Data Characters Σ Description bible.txt 4 047 392 63 King James Bible chr22.dna 34 553 758 4 Human chromosome 22 E.coli 4 638 690 4 Escherichia coli genome etext99 105 277 340 146 Texts from Gutenberg project howto 39 422 105 197 Linux Howto files pic 513 216 159 Black and white fax picture sprot34.dat 109 617 186 66 Swissprot V34 protein database world192.txt 2 473 400 94 CIA world fact book alphabet 100 000 26 Repetitions of the alphabet [a-z] random 100 000 64 Randomly selected from 64 characters ace The time for each algorithm is the mean of 3 runs, and the space is the hea emusage command to fire the running of each program. The total time (in se