Suffix Array 構築方法の紹介

SuﬃxArray 構築方法の紹介 Takashi HOSHINO Cybozu Labs, Inc. 2013-‐04-‐19 社内機械学習勉強会資料 20130618 資料完成

概要 •  Two Eﬃcient Algorithms for Linear Suﬃx Array ConstrucMon –  Authors: Ge Nong, Sen Zhang, and Wai Hong Chan •  1st algorithm: SA-‐IS –  Induced SorMng Variable-‐Length LMS-‐Substrings •  2nd algorithm: SA-‐DS –  Radix SorMng Fixed-‐Length d-‐CriMcal Substrings

SA-‐IS/SA-‐DS アルゴリズム概要 SA-‐IS(S,SA) Scan S to create t Find all LMS-‐substrings to create P1 Induced-‐sort all the LMS-‐substrings using P1 and B Name each LMS-‐substring to create S1 If each char in S1 is unique: SA1[S1[i]] = i for all i Else SA-‐IS(S1, SA1) Induce SA from SA1 SA-‐DS(S,SA) Scan S to create t Find all the d-‐critical substrings to create P1 Radix sort all the d-‐critical substrings in P1 using B Name each d-‐critical substring to create S1 If each char in S1 is unique: SA1[S1[i]] = i for all i Else SA-‐DS(S1, SA1) Induce SA from SA1

データの説明 (1) •  S: 入力文字列 –  長さ n とする –  The senMnel $ で終端されていることを仮定． S[i] > $ for all i in [0, n-‐1) •  SA: 出力 Suﬃx Array •  t: 長さ n のビット列 –  S[i] の L/S-‐type を表す(後述) –  t[i] = 1 if S[i] is S-‐type, else 0

データの説明 (2) •  P1: 長さ n1 の整数列 (n1 <= n/2) –  SA-‐IS と SA-‐DS で異なる(後述) •  K: 文字種の数 –  文字が 1 byte とすると K = 256 –  再帰したときは，n1 以下の値 •  B: バケツソート用のデータ –  長さ K + 1 の整数列 –  各整数は [0, n] の範囲

L/S-‐Type, LMS-‐char/substring •  L/S-‐type: –  S の各文字は L-‐type か S-‐type のいずれかに分類できる (後述) –  $ は S-‐type –  S[i] < S[i + 1] à S-‐type –  S[i] > S[i + 1] à L-‐type –  S[i] == S[i + 1] à type of S[i + 1] •  LMS-‐char: –  LMS: Leg-‐Most-‐S –  S[i] が S-‐type で S[i-‐1] が L-‐type のときの S[i] •  LMS-‐substring: S[i..j] –  S[i] と S[j] が LMS-‐char かつ S[i+1..j-‐1] は LMS-‐char を含まない

データ例 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 6 10 16 B: $:1, i:8, m:2, p:2, s:4, others:0 ($ < i < m < p < s) $ i m p s tmp: {_} {_ _ _ _ _ _ _ _} {_ _} {_ _} {_ _ _ _}

SA-‐IS algorithm piaces •  Find all LSM substring to create P1 •  Induced-‐sort all the LMS-‐substrings using P1 and B •  Name each LMS-‐substring to create S1 •  Recursive call SA-‐IS(S1, SA1) •  Induce SA from SA1

Induced-‐sort all the LMS-‐substrs •  (1) IniMalize tmp where each member is empty •  (2) Scan P1 and put to the correct bucket from right to leg •  (3) Scan tmp from leg to right and t[tmp[i] – 1] is 0 then put it to the bucket •  (4) Scan tmp from right to leg and t[tmp[i] – 1] is 1 then put it to the bucket 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 6 10 16 $ i m p s tmp: {16} { _ _ _ _ _ 10 6 2} { _ _} { _ _} { _ _ _ _} tmp: {16} {15 14 _ _ _ 10 6 2} { 1 0} {13 12} { 9 5 8 4} tmp: {16} {15 14 10 6 2 11 7 3} { 1 0} {13 12} { 9 5 8 4}

0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 6 10 16 $ i m p s tmp: {16} {15 14 10 6 2 11 7 3} { 1 0} {13 12} { 9 5 8 4} LSM-‐substrs in the order of suffix array (tmp) 16 10 6 2 $ iippii$ iissi iissi Rename items where the same LMS-‐substrings indicate the same name 0 1 2 2 Renamed items in the order of S S1: 2 2 1 0 Find the lexicographic names of all substrings

Recursive call of SA-‐IS(S1, SA1) Idx: 0 1 2 3 S1: 2 2 1 0 t: 0 0 0 0 LMS: * P1: 3 0 1 2 tmp: {3} {2} {1 0} All items are unique in the created suffix array (tmp) SA1: 3 2 1 0

Induce SA from SA1 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 6 10 16 SA1 : 3 2 1 0 for i from 4-‐1 to 0: put P1[SA1[i]] in the suffix array (tmp) $ i m p s tmp: {16} { _ _ _ _ _ 10 6 2} { _ _} { _ _} { _ _ _ _} tmp: {16} {15 14 _ _ _ 10 6 2} { 1 0} {13 12} { 9 5 8 4} tmp: {16} {15 14 10 6 2 11 7 3} { 1 0} {13 12} { 9 5 8 4}

SA-‐DS algorithm piaces •  Find all the d-‐criMcal substrings to create P1 •  Radix sort all the d-‐criMcal substrings in P1 using B

d-‐CriMcal char/substring •  What is d? –  定数 –  2 <= d •  d-‐CriMcal char: –  S[i] が LMS-‐char à S[i] は d-‐criMcal char –  S[i-‐d] が d-‐criMcal char かつ S[i-‐1] と S[i+1] が LMS-‐char でないとき à S[i] は d-‐criMcal char •  d-‐CriMcal substring: S[i..i+d+1] –  S[i] が d-‐criMcal char –  後ろの長さが足りないものは S[n-‐1] すなわち $ で埋めたものとする –  長さは d + 2 固定

P1 について •  SA-‐IS –  LMS-‐substring の先頭文字の S 内でのインデクス列 –  ただし，S と t を見れば P1 は分かるため，SA-‐IS では明示的に P1 は作らない •  SA-‐DS –  d-‐CriMcal substring の先頭文字の S 内でのインデクス列 –  これを radix sort するので必ず生成

ω/γ-‐waited substrs •  Sω[i] = 2S[i] + t[i] for all i in [0, n) •  ω-‐weighted substring: Sω[i..j] •  γ-‐weighted substring: –  Sγ[i..j] = S[i..j-‐1]Sω[j] •  P1 を radix sort するときに key を w-‐weighted d-‐criMcal substring とする必要あり •  Sω[i..j] の代わりに Sγ[i..j] で足りる

d-‐CriMal substrs 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 4 6 8 10 12 14 16 2: iiss 4: ssii 6: iiss 8: ssii 10: iipp 12: ppii 14: ii$$ 16: $$$$

Radix sort of P1 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 4 6 8 10 12 14 16 2: iiss 14: ii$$ 14: ii$$ 16: $$$$ 16: $$$$ 4: ssii 16: $$$$ 16: $$$$ 14: ii$$ 14: ii$$ 6: iiss 12: ppii 12: ppii 10: iipp 10: iipp 8: ssii 4: ssii 4: ssii 2: iiss 2: iiss 10: iipp 8: ssii 8: ssii 6: iiss 6: iiss 12: ppii 10: iipp 10: iipp 12: ppii 12: ppii 14: ii$$ 2: iiss 2: iiss 4: ssii 4: ssii 16: $$$$ 6: iiss 6: iiss 8: ssii 8: ssii P1’: 16 14 10 2 6 12 4 8

Find lexicographic names 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 4 6 8 10 12 14 16 P1’: 16 14 10 2 6 12 4 8 Name: 0 1 2 3 3 4 5 5 (order of P1’) S1: 3 5 3 5 2 4 1 0 (order of P1)

Recursive call of SA-‐DS 0 1 Idx: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 S: m m i i s s i i s s i i p p i i $ t: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 1 LMS: * * * * P1: 2 4 6 8 10 12 14 16 P1’: 16 14 10 2 6 12 4 8 S1: 3 5 3 5 2 4 1 0 t1: 1 0 1 0 1 0 0 1 P2: 2 4 7 P2’: 7 4 2 S2: 2 1 0 SA2: 2 1 0

評価データ rithms were implemented in C++ and compiled by g++ with the option of -O3. T d from Sanders’s website [19]. For the KA algorithm, we use an improved versio o’s website [20]) from Yuta Mori. The source code of our algorithm IS is given DS1 and DS2 were embodied in less than 150 and 250 effective lines of code, r on request. Table 1: Data Used in the Experiments Data Characters Σ Description bible.txt 4 047 392 63 King James Bible chr22.dna 34 553 758 4 Human chromosome 22 E.coli 4 638 690 4 Escherichia coli genome etext99 105 277 340 146 Texts from Gutenberg project howto 39 422 105 197 Linux Howto files pic 513 216 159 Black and white fax picture sprot34.dat 109 617 186 66 Swissprot V34 protein database world192.txt 2 473 400 94 CIA world fact book alphabet 100 000 26 Repetitions of the alphabet [a-z] random 100 000 64 Randomly selected from 64 characters ace The time for each algorithm is the mean of 3 runs, and the space is the hea emusage command to fire the running of each program. The total time (in se

評価結果構築時間 Table 2: Time Data Time (Seconds) IS DS1 DS2 KS KA bible 2.7 3.11 3.9 8.9 3.62 chr22 24.7 31.5 39.6 92.8 34.1 E.coli 2.8 3.53 4.3 10 3.98 etext 101 123.2 150.4 428.1 149.67 howto 30.4 36.3 44.05 130.4 42.85 pic 0.06 0.09 0.13 0.56 0.29 sprot 94.6 111.59 139.6 356 132.91 world 1.3 1.61 2 4.8 1.84 alphabet 0.00 0.01 0.02 0.15 0.02 random 0.02 0.01 0.01 0.06 0.02 Total 257.58 310.95 384.01 1031.77 369.3 Mean 0.90 1.08 1.34 3.60 1.29 Norm. 1 1.21 1.49 4.01 1.43

評価結果: SAサイズ random 0.02 0.01 0.01 0.06 0.02 Total 257.58 310.95 384.01 1031.77 369.3 Mean 0.90 1.08 1.34 3.60 1.29 Norm. 1 1.21 1.49 4.01 1.43 Table 3: Space Data Space (MBytes) IS DS1 DS2 KS KA bible 20.86 21.50 20.30 90.40 34.45 chr22 178.09 184.44 171.41 819.25 289.97 E.coli 24.29 25.15 23.23 105.93 40.01 etext 542.17 559.55 521.85 2369.92 907.34 howto 203.16 208.08 195.55 932.07 331.54 pic 2.57 2.76 2.79 15.51 3.11 sprot 554.58 560.44 543.26 2591.62 930.06 world 12.70 12.91 12.50 55.24 21.24 alphabet 0.49 0.74 0.75 3.03 0.52 random 0.61 0.74 0.74 2.26 0.88 Total 1539.52 1576.31 1492.37 6985.23 2559.12 Mean 5.37 5.50 5.20 24.36 8.92 Norm. 1.03 1.06 1 4.68 1.72

評価結果: 再帰の深さ Total 1539.52 1576.31 1492.37 6985.23 2559.12 Mean 5.37 5.50 5.20 24.36 8.92 Norm. 1.03 1.06 1 4.68 1.72 Table 4: Recursion Depth and Reduction Ratio Data Depth Ratio IS DS KS KA IS DS KS KA bible 6 6 6 7 .34 .37 .67 .46 chr22 6 10 12 9 .31 .36 .67 .44 E.coli 7 8 7 9 .32 .36 .67 .45 etext 11 12 12 15 .33 .37 .67 .45 howto 9 10 11 13 .32 .36 .67 .45 pic 5 9 10 5 .26 .35 .67 .39 sprot 7 8 9 10 .31 .37 .67 .45 world 6 7 6 7 .32 .37 .67 .45 alphabet 2 10 11 2 .02 .34 .67 .02 random 2 1 2 2 .33 .36 .67 .47 Total 61 81 86 80 2.86 3.61 6.7 4.03 Mean 6.1 8.1 8.6 8.0 .29 .36 .67 .40 Norm. 1 1.33 1.41 1.31 1 1.26 2.34 1.38

結論 •  SA-‐IS で FA

Suffix Array 構築方法の紹介

In this document

More Related Content

What's hot

Similar to Suffix Array 構築方法の紹介

More from Takashi Hoshino

Recently uploaded

Suffix Array 構築方法の紹介