Advanced Algorithms – COMS31900 Pattern Matching part two Suffix Arrays Benjamin Sach
Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n
Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m
Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T.
Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. e.g. 4, 6, 10 4 6 10
Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • Last lecture we saw the that text indexing problem can be solved using a suffix tree e.g. 4, 6, 10 4 6 10 which uses O(n) space (when it’s stored compacted)
Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • Last lecture we saw the that text indexing problem can be solved using a suffix tree • Queries take O(m + occ) time when the alphabet size is constant - occ is the number of occurences (matches) e.g. 4, 6, 10 4 6 10 which uses O(n) space (when it’s stored compacted)
Text indexing T Preprocess a text string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • Last lecture we saw the that text indexing problem can be solved using a suffix tree • Queries take O(m + occ) time when the alphabet size is constant - occ is the number of occurences (matches) • Suffix trees can be constructed in O(n) time (but we only saw how to achieve O(n2) time) e.g. 4, 6, 10 4 6 10 which uses O(n) space (when it’s stored compacted)
The suffix array T b n aaT a sn n
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa snsuffix 1
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs:
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
The suffix array T b n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller) just a fancy name for the order the strings would appear in the dictionary In lexicographical ordering we sort strings based on the first symbol that differs:
The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order If the symbols don’t have a natural order, we use their binary representation in memory b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller) just a fancy name for the order the strings would appear in the dictionary In lexicographical ordering we sort strings based on the first symbol that differs:
The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn
The suffix array T b n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 10 2 3 4 5 6
The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6
The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6
The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6
The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n The suffix array is much smaller than the suffix tree (in terms of constants) 10 2 3 4 5 6
The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n The suffix array is much smaller than the suffix tree (in terms of constants) 10 2 3 4 5 6
From Suffix Trees to Suffix Arrays a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array Suffix Tree 1 0 625 43 n Recall that we can get get the Suffix Array from the Suffix Tree 1 3 5 0 2 4 6 10 2 3 4 5 6 using depth-first search in O(n) time
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 157 100 1 43 112 1396 1485 12 c db cb a c bc c d adb
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find c db cb a c bc c d adb m
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences could start anywhere Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences could start anywhere Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences could start anywhere Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences could start anywhere Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences could start anywhere Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences could start anywhere Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb < c bc db c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb < c bc db c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb < c bc db c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: = c db cb a c bc c d adb c db cb m c db cb occurences must start in here Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: = c db cb a c bc c d adb c db cb m c db cb occurences must start in here we found a match! Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? m Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? O(m) time to compare two strings m Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? O(m) time to compare two strings so O(m log n) time in total m Find an occurence of P using binary search
Searching in the Suffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? O(m) time to compare two strings so O(m log n) time in total This method generalises to O(m log n + occ) time m to find all occ occurences. Find an occurence of P using binary search by continuing the binary search (we will skip the details)
The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences
The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences This can be further improved to O(m + log n + occ) time (using LCP queries which we will see in a future lecture)
The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences Do we really need to build the suffix tree to construct the suffix array? This can be further improved to O(m + log n + occ) time (using LCP queries which we will see in a future lecture)
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T =
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 R1 is split into blocks of length 3
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 R2 is also split into blocks of length 3
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol)
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 3
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 we assume that the bit representation of each symbol uses O(log n) bits. (which is a common and realistic assumption) This can be done by sorting the blocks in O(n) time using radix sort
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70 0 1 6 4 2 5 3 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70 0 1 6 4 2 5 3 7 How do we compute the suffix array for R ? Recursion! (Notice that R has length 2n/3)
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70 0 1 6 4 2 5 3 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : what use1 2 3 4 5 6 70 0 1 6 4 2 5 3 7 is this?!
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : what use1 2 3 4 5 6 70 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 is this?!
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 is this?!
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 is this?!
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 a b d ob a a b b a d d oa b b a d is this?! 1 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d is this?! 1 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d is this?! 1 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ is this?! 1 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ is this?! 1 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ is this?! 1 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ their order is given by the suffix array of R : is this?! 1 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ their order is given by the suffix array of R : is this?! 1 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ their order is given by the suffix array of R : is this?! 1 5 Suffix is smaller than suffix 1 55
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?!
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 ob b a d $ b d ob a a b b a d $ $
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $ Suffix is smaller than suffix 7 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?!
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $ Suffix is smaller than suffix 2 5
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 is this?!
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling,
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, 1 24 111058 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, we have the suffix array of just the suffixes from B1 ∪ B2 1 24 111058 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, we have the suffix array of just the suffixes from B1 ∪ B2 1 24 111058 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, we have the suffix array of just the suffixes from B1 ∪ B2 1 24 111058 7 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 24 111058 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 1 24 111058 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y 1 24 111058 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y suffix 0 is a y 1 24 111058 7 followed by suffix 1
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y 1 24 111058 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 1 24 111058 7
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank:
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank:
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6)
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6)
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6)
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Suffix array for just B0 6 09 3
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Suffix array for just B0 6 09 3 How do we merge these?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . .
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ?1 6
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2 (a, 4) (a, 3)
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2 (a, 4) (a, 3) It takes O(1) time to decide that 1 is smaller
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2 (a, 4) (a, 3) It takes O(1) time to decide that 1 is smaller 1
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 ?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 + 7a=6 4 +a= 5 ?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 + 7a=6 4 +a= 5 (a, 4) (a, 5) ?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 + 7a=6 4 +a= 5 (a, 4) (a, 5) Again, it takes O(1) time to decide that is smaller6 ?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 which is smaller, suffix or4 9 ?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 )4(which is smaller, suffix or4 9 ? is smaller
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 )4(which is smaller, suffix or4 9 ? is smaller
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8which is smaller, suffix or 9 ?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8 +a= +b= 9 10 8 9 which is smaller, suffix or 9 ?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8 +a= +b= 9 10 8 9 Uh oh! how do we compare 9 10to ? which is smaller, suffix or 9 ?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8 Uh oh! how do we compare 9 10to ? +a= +b= 9 8 d a + + 11 10 It still takes O(1) time to decide that is smaller9 which is smaller, suffix or 9 ?
The DC3 method y a b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 Overall this merging phase takes O(n) time (because processing each suffix takes O(1) time)
The DC3 method Theorem The DC3 algorithm constructs a suffix array in O(n) time.
The DC3 method Theorem The DC3 algorithm constructs a suffix array in O(n) time. Proof Suppose T(n) is the running time. We have T(n) = T(2n/3) + O(n)
The DC3 method Theorem The DC3 algorithm constructs a suffix array in O(n) time. Proof Suppose T(n) is the running time. We have T(n) = T(2n/3) + O(n) radix sorting and merging recursion to construct a suffix array of size 2n/3
The DC3 method Theorem The DC3 algorithm constructs a suffix array in O(n) time. Proof Suppose T(n) is the running time. We have T(n) = T(2n/3) + O(n) Solving this recurrence gives T(n) ∈ O(n). radix sorting and merging recursion to construct a suffix array of size 2n/3
The suffix array T b n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences We can construct the suffix array in O(n) time This can be further improved to O(m + log n + occ) time (using LCP queries which we will see in a future lecture)

Pattern Matching Part Two: Suffix Arrays

  • 1.
    Advanced Algorithms –COMS31900 Pattern Matching part two Suffix Arrays Benjamin Sach
  • 2.
    Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n
  • 3.
    Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m
  • 4.
    Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T.
  • 5.
    Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. e.g. 4, 6, 10 4 6 10
  • 6.
    Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • Last lecture we saw the that text indexing problem can be solved using a suffix tree e.g. 4, 6, 10 4 6 10 which uses O(n) space (when it’s stored compacted)
  • 7.
    Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • Last lecture we saw the that text indexing problem can be solved using a suffix tree • Queries take O(m + occ) time when the alphabet size is constant - occ is the number of occurences (matches) e.g. 4, 6, 10 4 6 10 which uses O(n) space (when it’s stored compacted)
  • 8.
    Text indexing T Preprocess atext string T (length n) to answer pattern matching queries. . . ba b c a b a cb a a b a n After preprocessing, a query is a pattern P (length m), P a b a m the output is a list of all matches in T. • Last lecture we saw the that text indexing problem can be solved using a suffix tree • Queries take O(m + occ) time when the alphabet size is constant - occ is the number of occurences (matches) • Suffix trees can be constructed in O(n) time (but we only saw how to achieve O(n2) time) e.g. 4, 6, 10 4 6 10 which uses O(n) space (when it’s stored compacted)
  • 9.
    The suffix array Tb n aaT a sn n
  • 10.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn
  • 11.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa snsuffix 1
  • 12.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn
  • 13.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn
  • 14.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order
  • 15.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs:
  • 16.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a
  • 17.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a
  • 18.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a
  • 19.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<
  • 20.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<
  • 21.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c<
  • 22.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a
  • 23.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a
  • 24.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 25.
    The suffix array Tb n aaT a sn n 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s Sort the suffixes lexicographically 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 26.
    The suffix array Tb n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 27.
    The suffix array Tb n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 28.
    The suffix array Tb n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 29.
    The suffix array Tb n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order In lexicographical ordering we sort strings based on the first symbol that differs: b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller)
  • 30.
    The suffix array Tb n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller) just a fancy name for the order the strings would appear in the dictionary In lexicographical ordering we sort strings based on the first symbol that differs:
  • 31.
    The suffix array Tb n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn • The symbols themselves must have an order throughout we will use alphabetical order If the symbols don’t have a natural order, we use their binary representation in memory b a<a a b c< b c< a (in a ‘tie’, the shorter string is smaller) just a fancy name for the order the strings would appear in the dictionary In lexicographical ordering we sort strings based on the first symbol that differs:
  • 32.
    The suffix array Tb n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn
  • 33.
    The suffix array Tb n aaT a sn n Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 10 2 3 4 5 6
  • 34.
    The suffix array Tb n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6
  • 35.
    The suffix array Tb n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6
  • 36.
    The suffix array Tb n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n 10 2 3 4 5 6
  • 37.
    The suffix array Tb n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n The suffix array is much smaller than the suffix tree (in terms of constants) 10 2 3 4 5 6
  • 38.
    The suffix array Tb n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n The suffix array is much smaller than the suffix tree (in terms of constants) 10 2 3 4 5 6
  • 39.
    From Suffix Treesto Suffix Arrays a s$ nas$ nas$ nas$ s$ na s$ bananas$ 1 3 5 0 2 4 6 T b n aaT a sn n Suffix Array Suffix Tree 1 0 625 43 n Recall that we can get get the Suffix Array from the Suffix Tree 1 3 5 0 2 4 6 10 2 3 4 5 6 using depth-first search in O(n) time
  • 40.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 157 100 1 43 112 1396 1485 12 c db cb a c bc c d adb
  • 41.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find c db cb a c bc c d adb m
  • 42.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m Find an occurence of P using binary search
  • 43.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences could start anywhere Find an occurence of P using binary search
  • 44.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences could start anywhere Find an occurence of P using binary search
  • 45.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences could start anywhere Find an occurence of P using binary search
  • 46.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences could start anywhere Find an occurence of P using binary search
  • 47.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences could start anywhere Find an occurence of P using binary search
  • 48.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences could start anywhere Find an occurence of P using binary search
  • 49.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c d adb> c db cb a c bc c d adb m c db cb occurences must start in here Find an occurence of P using binary search
  • 50.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 51.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 52.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 53.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb < c bc db c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 54.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb < c bc db c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 55.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb < c bc db c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 56.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 57.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m occurences must start in here Find an occurence of P using binary search
  • 58.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: = c db cb a c bc c d adb c db cb m c db cb occurences must start in here Find an occurence of P using binary search
  • 59.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: = c db cb a c bc c d adb c db cb m c db cb occurences must start in here we found a match! Find an occurence of P using binary search
  • 60.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb m Find an occurence of P using binary search
  • 61.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? m Find an occurence of P using binary search
  • 62.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? O(m) time to compare two strings m Find an occurence of P using binary search
  • 63.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? O(m) time to compare two strings so O(m log n) time in total m Find an occurence of P using binary search
  • 64.
    Searching in theSuffix Array 0 1 2 3 5 6 7 8 9 10 11 12 13 14 15 4 b c dba cb a c dbc cb d a c dba cb a c bc c d adb db cb a c bc c d adb d cb a c bc c d adb d c a c bc c d adb c a c bc c d adb a c bc c d adb c bc c d adb c b c d adb c d a d a a c d ad c d adb b c d adb 2 5 6 8 9 12 13 14 d c a c bc c d adb c a c bc c d adb c b c d adb c d a d a c d ad b c dbT a cb a c dbc cb d a n 15 7 100Suffix Array 1 43 11 2 1396 148 5 12 c db cbP 157 100 1 43 112 1396 1485 12 find Key Idea: c db cb a c bc c d adb How long does this take? O(m) time to compare two strings so O(m log n) time in total This method generalises to O(m log n + occ) time m to find all occ occurences. Find an occurence of P using binary search by continuing the binary search (we will skip the details)
  • 65.
    The suffix array Tb n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences
  • 66.
    The suffix array Tb n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences This can be further improved to O(m + log n + occ) time (using LCP queries which we will see in a future lecture)
  • 67.
    The suffix array Tb n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences Do we really need to build the suffix tree to construct the suffix array? This can be further improved to O(m + log n + occ) time (using LCP queries which we will see in a future lecture)
  • 68.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T =
  • 69.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1
  • 70.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1
  • 71.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2
  • 72.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2
  • 73.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 74.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 R1 is split into blocks of length 3
  • 75.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 76.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 77.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 R2 is also split into blocks of length 3
  • 78.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 79.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2
  • 80.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol)
  • 81.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1
  • 82.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2
  • 83.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 3
  • 84.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34
  • 85.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 5
  • 86.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56
  • 87.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7
  • 88.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 we assume that the bit representation of each symbol uses O(log n) bits. (which is a common and realistic assumption) This can be done by sorting the blocks in O(n) time using radix sort
  • 89.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7
  • 90.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7
  • 91.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7
  • 92.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7
  • 93.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70
  • 94.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70 0 1 6 4 2 5 3 7
  • 95.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70 0 1 6 4 2 5 3 7 How do we compute the suffix array for R ? Recursion! (Notice that R has length 2n/3)
  • 96.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : 1 2 3 4 5 6 70 0 1 6 4 2 5 3 7
  • 97.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : what use1 2 3 4 5 6 70 0 1 6 4 2 5 3 7 is this?!
  • 98.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $R1 = b d ob a a b b a d $R2 = $ a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: Introduce a new “filler symbol” $. B2 contains indices with i mod 3 = 2 Number the blocks in lexicographical order ($ is the smallest symbol) 1 2 4 34 56 7 let R = 1 2 4 34 56 7 compute the suffix array of R : what use1 2 3 4 5 6 70 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 is this?!
  • 99.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 is this?!
  • 100.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 is this?!
  • 101.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 a b d ob a a b b a d d oa b b a d is this?! 1 5
  • 102.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d is this?! 1 5
  • 103.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d is this?! 1 5
  • 104.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ is this?! 1 5
  • 105.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ is this?! 1 5
  • 106.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ is this?! 1 5
  • 107.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ their order is given by the suffix array of R : is this?! 1 5
  • 108.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ their order is given by the suffix array of R : is this?! 1 5
  • 109.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R a b d ob a a b b a d d oa b b a d a b d ob a a b b a d $ b d ob a a b b a d $ $ d oa b b a d $ $ their order is given by the suffix array of R : is this?! 1 5 Suffix is smaller than suffix 1 55
  • 110.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?!
  • 111.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7
  • 112.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 ob b a d $ b d ob a a b b a d $ $
  • 113.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $
  • 114.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $
  • 115.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $
  • 116.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 ob b a d7 d oa b b a d $ $ ob b a d $ b d ob a a b b a d $ $ Suffix is smaller than suffix 7 5
  • 117.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?!
  • 118.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2
  • 119.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $
  • 120.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $
  • 121.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $
  • 122.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 Take any two suffixes in B1 ∪ B2 and find them in R their order is given by the suffix array of R : is this?! d oa b b a d5 b d ob a a b b a d2 d oa b b a d $ $ b d ob a a b b a d $ $ Suffix is smaller than suffix 2 5
  • 123.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 compute the suffix array of R : what use 0 1 6 4 2 5 3 7 0 1 2 3 4 5 6 7 is this?!
  • 124.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7
  • 125.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7
  • 126.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling,
  • 127.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, 1 24 111058 7
  • 128.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, we have the suffix array of just the suffixes from B1 ∪ B2 1 24 111058 7
  • 129.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, we have the suffix array of just the suffixes from B1 ∪ B2 1 24 111058 7
  • 130.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 a b d ob a a b b a d $ b d ob a a b b a d $ $ Concatenate R1 and R2 to obtain R: B2 contains indices with i mod 3 = 2 0 1 2 3 4 5 6 7 The suffix array of R : 0 1 6 4 2 5 3 7 after relabelling, we have the suffix array of just the suffixes from B1 ∪ B2 1 24 111058 7 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
  • 131.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
  • 132.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 24 111058 7
  • 133.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7
  • 134.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 1 24 111058 7
  • 135.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y 1 24 111058 7
  • 136.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y suffix 0 is a y 1 24 111058 7 followed by suffix 1
  • 137.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y 1 24 111058 7
  • 138.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 1 24 111058 7
  • 139.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank:
  • 140.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank:
  • 141.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort
  • 142.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6)
  • 143.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 How do we find the ordering of the suffixes from B0? (where i mod 3 = 0) Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6)
  • 144.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6)
  • 145.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Suffix array for just B0 6 09 3
  • 146.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1) 1 24 111058 7 (the ranks are given by the array below) rank: We then sort in O(n) time using radix sort y a b d ob a a b b a d d ob a a b b a d oa b b a d oa d 0 3 6 9 = + 1y = + 4b = + 7a = +a 10 (y, 0) (b, 1) (a, 4) (a, 6) Suffix array for just B0 6 09 3 How do we merge these?
  • 147.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these?
  • 148.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . .
  • 149.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ?1 6
  • 150.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2
  • 151.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2 (a, 4) (a, 3)
  • 152.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2 (a, 4) (a, 3) It takes O(1) time to decide that 1 is smaller
  • 153.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . which is smaller, suffix or ? + 7a= 1 6 6 1 +a= 2 (a, 4) (a, 3) It takes O(1) time to decide that 1 is smaller 1
  • 154.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 ?
  • 155.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 + 7a=6 4 +a= 5 ?
  • 156.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 + 7a=6 4 +a= 5 (a, 4) (a, 5) ?
  • 157.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 which is smaller, suffix or4 6 + 7a=6 4 +a= 5 (a, 4) (a, 5) Again, it takes O(1) time to decide that is smaller6 ?
  • 158.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6
  • 159.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 which is smaller, suffix or4 9 ?
  • 160.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 )4(which is smaller, suffix or4 9 ? is smaller
  • 161.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 )4(which is smaller, suffix or4 9 ? is smaller
  • 162.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8which is smaller, suffix or 9 ?
  • 163.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8 +a= +b= 9 10 8 9 which is smaller, suffix or 9 ?
  • 164.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8 +a= +b= 9 10 8 9 Uh oh! how do we compare 9 10to ? which is smaller, suffix or 9 ?
  • 165.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 8 Uh oh! how do we compare 9 10to ? +a= +b= 9 8 d a + + 11 10 It still takes O(1) time to decide that is smaller9 which is smaller, suffix or 9 ?
  • 166.
    The DC3 method ya b d ob a a b b a d 1 2 3 4 5 6 7 8 9 10 110 T = B1 contains indices with i mod 3 = 1 B2 contains indices with i mod 3 = 2 Suffix array for just B1 ∪ B2 1 2 3 4 5 6 70 1 24 111058 7 Suffix array for just B0 6 09 3 How do we merge these? Merge them like in mergesort. . . 1 6 4 Overall this merging phase takes O(n) time (because processing each suffix takes O(1) time)
  • 167.
    The DC3 method Theorem TheDC3 algorithm constructs a suffix array in O(n) time.
  • 168.
    The DC3 method Theorem TheDC3 algorithm constructs a suffix array in O(n) time. Proof Suppose T(n) is the running time. We have T(n) = T(2n/3) + O(n)
  • 169.
    The DC3 method Theorem TheDC3 algorithm constructs a suffix array in O(n) time. Proof Suppose T(n) is the running time. We have T(n) = T(2n/3) + O(n) radix sorting and merging recursion to construct a suffix array of size 2n/3
  • 170.
    The DC3 method Theorem TheDC3 algorithm constructs a suffix array in O(n) time. Proof Suppose T(n) is the running time. We have T(n) = T(2n/3) + O(n) Solving this recurrence gives T(n) ∈ O(n). radix sorting and merging recursion to construct a suffix array of size 2n/3
  • 171.
    The suffix array Tb n aaT a sn n Suffix Array 1 0 625 4 Sort the suffixes lexicographically 0 b n aaa sn n aa1 a sn 2 n aa sn 4 a sn 5 a s 6 s 3 aa sn 3 n P aa n m an Finding an occurrence of a pattern (length m) takes O(m log n) time Finding all occurrences takes O(m log n + occ) time where occ is the number of occurences We can construct the suffix array in O(n) time This can be further improved to O(m + log n + occ) time (using LCP queries which we will see in a future lecture)