Aho-Corasick Algorithm in Python

Last Updated : 17 Apr, 2024

Given an input text and an array of k words, arr[], find all occurrences of all words in the input text. Let n be the length of text and m be the total number of characters in all words, i.e. m = length(arr[0]) + length(arr[1]) + … + length(arr[k-1]). Here k is the total number of input words.

Examples:

Input: text = "hello worldhello"
arr = ["hello", "world"]
Output: {'hello': [0, 10], 'world': [6]}
Explantion: In the given text "hello worldhello", the pattern "hello" appears at index 0 and 10, and the pattern "world" appears at index 6.

Input: text = "abxabcabcaby"
arr = ["ab", "abc", "aby"]
Output: {'ab': [0, 3], 'abc': [2, 5], 'aby': [9]}

Aho-Corasick Algorithm:

The Aho-Corasick algorithm is a string-searching algorithm that constructs a finite state machine representing all keywords to be searched for. It’s a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the “dictionary”) within an input text. It matches all strings simultaneously.

Step-by-step explanation of the algorithm:

Build Trie (Keyword Tree):

Create a root node.
For each keyword in the given list, add it to the trie.
If a keyword ends at a node, add it to the output list of that node.

Below is the syntax of the above idea:

Python3

def build_trie(patterns): root = AhoCorasickNode(None) # root node of the trie # Iterate over each pattern in the list of patterns for pattern in patterns: node = root # Iterate over each character in the pattern for char in pattern: # If the character is not in the children of the current node, add a new child node if char not in node.children: node.children[char] = AhoCorasickNode(char) # Move to the child node node = node.children[char] # Add the pattern to the output of the current node node.output.append(pattern) return root

Build Failure Links:

Use BFS to traverse the trie.
For each node, set its failure link to the longest suffix of the current keyword that is also a prefix of a keyword in the trie. If no such suffix exists, set the failure link to the root node.

Below is the syntax of the above idea:

Python3

from collections import deque def build_failure_function(root): queue = deque() # Initialize failure function of the root's children to the root itself for node in root.children.values(): node.failure = root queue.append(node) # Breadth-first traversal of the trie to compute the failure function while queue: current_node = queue.popleft() # For each child of the current node for char, child_node in current_node.children.items(): queue.append(child_node) failure_node = current_node.failure # Traverse the failure function until a node is found with a matching child or the root is reached while failure_node and char not in failure_node.children: failure_node = failure_node.failure # Update the failure function of the child node child_node.failure = failure_node.children[char] if failure_node else root # Add the output of the failure node to the output of the current node child_node.output.extend(child_node.failure.output)

Search the Text:

Start at the root node of the trie.
For each character in the text:
- Follow the character along the trie.
- If a keyword is found, record its position in the text.
- If a character leads to a failure link, follow the failure link and continue searching.

Below is the syntax of the above idea:

Python3

def search(text, patterns): root = build_trie(patterns) build_failure_function(root) current_node = root results = {} # Dictionary to store the indices of the found patterns # Iterate over each character in the text for i, char in enumerate(text): # Follow the failure function until a matching child is found or the root is reached while current_node and char not in current_node.children: current_node = current_node.failure # If a matching child is found, move to that child if current_node: current_node = current_node.children[char] # If the current node has any patterns that end at it, store the indices of those patterns for pattern in current_node.output: start_index = i - len(pattern) + 1 if start_index not in results: results[start_index] = [] results[start_index].append(pattern) return results

Implementation of Aho-Corasick Algorithm in Python:

Aho-Corasick Algorithm efficiently finds multiple patterns in a given text. Here's a Python implementation:

Python3

class TrieNode: def __init__(self): # Initialize TrieNode attributes self.children = {} self.output = [] self.fail = None def build_automaton(keywords): # Initialize root node of the trie root = TrieNode() # Build trie for keyword in keywords: node = root # Traverse the trie and create nodes for each character for char in keyword: node = node.children.setdefault(char, TrieNode()) # Add keyword to the output list of the final node node.output.append(keyword) # Build failure links using BFS queue = [] # Start from root's children for node in root.children.values(): queue.append(node) node.fail = root # Breadth-first traversal of the trie while queue: current_node = queue.pop(0) # Traverse each child node for key, next_node in current_node.children.items(): queue.append(next_node) fail_node = current_node.fail # Find the longest proper suffix that is also a prefix while fail_node and key not in fail_node.children: fail_node = fail_node.fail # Set failure link of the current node next_node.fail = fail_node.children[key] if fail_node else root # Add output patterns of failure node to current node's output next_node.output += next_node.fail.output return root def search_text(text, keywords): # Build the Aho-Corasick automaton root = build_automaton(keywords) # Initialize result dictionary result = {keyword: [] for keyword in keywords} current_node = root # Traverse the text for i, char in enumerate(text): # Follow failure links until a match is found while current_node and char not in current_node.children: current_node = current_node.fail if not current_node: current_node = root continue # Move to the next node based on current character current_node = current_node.children[char] # Record matches found at this position for keyword in current_node.output: result[keyword].append(i - len(keyword) + 1) return result # Example 1 text1 = "hello worldhello" arr1 = ["hello", "world"] result1 = search_text(text1, arr1) print(result1) # Example 2 text2 = "abxabcabcaby" arr2 = ["ab", "abc", "aby"] result2 = search_text(text2, arr2) print(result2)

Output

{'hello': [0, 11], 'world': [6]} {'ab': [0, 3, 6, 9], 'abc': [3, 6], 'aby': [9]}

Time Complexity:

Building the automaton: O(m+k)
Searching the text: O(n+z), where z is the total number of occurrences of all keywords in the text.

Auxiliary Space: O (m+k)

Boyer Moore Algorithm in Python

srinam

Aho-Corasick Algorithm in Python

Aho-Corasick Algorithm:

Build Trie (Keyword Tree):

Build Failure Links:

Search the Text:

Implementation of Aho-Corasick Algorithm in Python:

Similar Reads

Thank You!

What kind of Experience do you want to share?