Rabin-Karp String Matching Algorithm -Gajanand Sharma U V C E B A N G A L O R E B A N G
Let a text string T of length n and a pattern string P of length m are given such that m<=n, then the string matching problem is to find all occurrences of P in T. Example- T = “KARNATAKA” and P=“NAT” Applications: • Searching keywords in a file • Searching engines • Database searching Problem Statement…
T : Given text or String e.g. – “JAIPURISCALLEDPARISOFINDIA” |T| : 26, length of T Substring: Ti,j=TiT i+1…Tj e.g. – “PARIS” Subsequence of T: deleting zero or more characters from T e.g. –“RISALL” and “JISINA” Prefix of T: T1,k e.g. –“JAIPUR” is a prefix of T. Suffix of T: Th,|T| e.g. –“INDIA” is a suffix of T. Notations…
The Rabin-Karp string matching algorithm calculates a hash value for the pattern, as well as for each M-character subsequence of given text to be compared. If the hash values for a particular subsequence are unequal, the algorithm will calculate the hash value for next M-character sequence. If the hash values are equal, the algorithm will do a Brute Force comparison between the pattern and the M-character sequence. Therefore there is only one comparison per text subsequence, and Brute Force is only needed when hash values match. Rabin-Karp Algorithm…
String:- LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLM Pattern:-LLLLLM Let Hash Value of “LLLLLL”= 0; And Hash Value of “LLLLLM”= 1; Step-1: LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLM LLLLLM ------------ 0 != 1 (One Comparison) Step-2: LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLM LLLLLM ------------ 0 != 1(One Comparison) :::: :::: :::: :::: :::: :::: :::: :::: :::: :::: :::: :::: Step-N: LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLM LLLLLM --------- 1 = 1 Hash Value is matching so compare elements one by one.- (6+1 Comparisons) Rabin-Karp Example…
Length of pattern = M; Hash(p) = hash value of pattern; Hash(t) = hash value of first M letters in body of text; do if (hash(p) == hash(t)) brute force comparison of pattern and selected section of text hash(t) = hash value of next section of text, one character over while (end of text or brute force comparison == true) Pseudo Code…
Let’s associate one integer with every letter of the alphabet. Hence we can say ‘A’ corresponds to 1, ‘B’ corresponds to 2 , ‘C’ corresponds to 3…… Similarly all other letters are corresponding to their index values. The Hash Value of the String “CAT” will be- 3*100 + 1*10 + 20 = 330 Calculating Hash Value… Index Position
If the hash value matches for two strings then it is called a ‘hit’. It may be possible that two or more different strings produce the same hash value. String 1: “CBJ” hash code=3*100 + 2*10 + 10 = 330 String 2: “CAT” hash code=3*100 + 1*10 + 20 = 330 Hence it is necessary to check whether it is a hit or not? Any hit will have to be tested to verify that it is not spurious and that p[1..m] = T[s+1..s+m] What if two values collide…
Let’s take an m-character sequence as an m-digit number in base b. The text subsequence t[ i .. i + m-1] is mapped to the number as follows- x(i) = t[i].𝑏 𝑚−1 + t[i+1].𝑏 𝑚−2 + t[i+2].𝑏 𝑚−3 +………..t[i+m-1] If m is very large then the hash value will be very large in size, so we can hash the value by taking mod a prime number, say q. h(i)=((t[i] 𝑏 𝑚−1 mod q) +(t[i+1] 𝑏 𝑚−2 mod q) + ...+(t[i+M-1] mod q))mod q Mathematical Function…
• If a large prime number is used to calculate hash function then there a very low possibility of being hashed values equal for two different patterns. • In this case searching takes O(N) time, where N is number of characters in the text body. • In worst case the time complexity may be O(MN), where M is no. of characters in the pattern. This case may occur when the prime no. chosen is very small. Complexity…
Rabin karp string matching algorithm

Rabin karp string matching algorithm

  • 1.
    Rabin-Karp String Matching Algorithm -GajanandSharma U V C E B A N G A L O R E B A N G
  • 2.
    Let a textstring T of length n and a pattern string P of length m are given such that m<=n, then the string matching problem is to find all occurrences of P in T. Example- T = “KARNATAKA” and P=“NAT” Applications: • Searching keywords in a file • Searching engines • Database searching Problem Statement…
  • 3.
    T : Giventext or String e.g. – “JAIPURISCALLEDPARISOFINDIA” |T| : 26, length of T Substring: Ti,j=TiT i+1…Tj e.g. – “PARIS” Subsequence of T: deleting zero or more characters from T e.g. –“RISALL” and “JISINA” Prefix of T: T1,k e.g. –“JAIPUR” is a prefix of T. Suffix of T: Th,|T| e.g. –“INDIA” is a suffix of T. Notations…
  • 4.
    The Rabin-Karp stringmatching algorithm calculates a hash value for the pattern, as well as for each M-character subsequence of given text to be compared. If the hash values for a particular subsequence are unequal, the algorithm will calculate the hash value for next M-character sequence. If the hash values are equal, the algorithm will do a Brute Force comparison between the pattern and the M-character sequence. Therefore there is only one comparison per text subsequence, and Brute Force is only needed when hash values match. Rabin-Karp Algorithm…
  • 5.
    String:- LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLM Pattern:-LLLLLM Let HashValue of “LLLLLL”= 0; And Hash Value of “LLLLLM”= 1; Step-1: LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLM LLLLLM ------------ 0 != 1 (One Comparison) Step-2: LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLM LLLLLM ------------ 0 != 1(One Comparison) :::: :::: :::: :::: :::: :::: :::: :::: :::: :::: :::: :::: Step-N: LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLM LLLLLM --------- 1 = 1 Hash Value is matching so compare elements one by one.- (6+1 Comparisons) Rabin-Karp Example…
  • 6.
    Length of pattern= M; Hash(p) = hash value of pattern; Hash(t) = hash value of first M letters in body of text; do if (hash(p) == hash(t)) brute force comparison of pattern and selected section of text hash(t) = hash value of next section of text, one character over while (end of text or brute force comparison == true) Pseudo Code…
  • 7.
    Let’s associate oneinteger with every letter of the alphabet. Hence we can say ‘A’ corresponds to 1, ‘B’ corresponds to 2 , ‘C’ corresponds to 3…… Similarly all other letters are corresponding to their index values. The Hash Value of the String “CAT” will be- 3*100 + 1*10 + 20 = 330 Calculating Hash Value… Index Position
  • 8.
    If the hashvalue matches for two strings then it is called a ‘hit’. It may be possible that two or more different strings produce the same hash value. String 1: “CBJ” hash code=3*100 + 2*10 + 10 = 330 String 2: “CAT” hash code=3*100 + 1*10 + 20 = 330 Hence it is necessary to check whether it is a hit or not? Any hit will have to be tested to verify that it is not spurious and that p[1..m] = T[s+1..s+m] What if two values collide…
  • 9.
    Let’s take anm-character sequence as an m-digit number in base b. The text subsequence t[ i .. i + m-1] is mapped to the number as follows- x(i) = t[i].𝑏 𝑚−1 + t[i+1].𝑏 𝑚−2 + t[i+2].𝑏 𝑚−3 +………..t[i+m-1] If m is very large then the hash value will be very large in size, so we can hash the value by taking mod a prime number, say q. h(i)=((t[i] 𝑏 𝑚−1 mod q) +(t[i+1] 𝑏 𝑚−2 mod q) + ...+(t[i+M-1] mod q))mod q Mathematical Function…
  • 10.
    • If alarge prime number is used to calculate hash function then there a very low possibility of being hashed values equal for two different patterns. • In this case searching takes O(N) time, where N is number of characters in the text body. • In worst case the time complexity may be O(MN), where M is no. of characters in the pattern. This case may occur when the prime no. chosen is very small. Complexity…