Aho-Corasick Algorithm in Python
Last Updated : 17 Apr, 2024
Given an input text and an array of k words, arr[], find all occurrences of all words in the input text. Let n be the length of text and m be the total number of characters in all words, i.e. m = length(arr[0]) + length(arr[1]) + … + length(arr[k-1]). Here k is the total number of input words.
Examples:
Input: text = "hello worldhello"
arr = ["hello", "world"]
Output: {'hello': [0, 10], 'world': [6]}
Explantion
:
In the given text "hello worldhello", the pattern "hello" appears at index 0 and 10, and the pattern "world" appears at index 6.
Input: text = "abxabcabcaby"
arr = ["ab", "abc", "aby"]
Output: {'ab': [0, 3], 'abc': [2, 5], 'aby': [9]}
The Aho-Corasick algorithm is a string-searching algorithm that constructs a finite state machine representing all keywords to be searched for. It’s a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the “dictionary”) within an input text. It matches all strings simultaneously.
Step-by-step explanation of the algorithm:
Build Trie (Keyword Tree):
- Create a root node.
- For each keyword in the given list, add it to the trie.
- If a keyword ends at a node, add it to the output list of that node.
Below is the syntax of the above idea:
Python3 def build_trie(patterns): root = AhoCorasickNode(None) # root node of the trie # Iterate over each pattern in the list of patterns for pattern in patterns: node = root # Iterate over each character in the pattern for char in pattern: # If the character is not in the children of the current node, add a new child node if char not in node.children: node.children[char] = AhoCorasickNode(char) # Move to the child node node = node.children[char] # Add the pattern to the output of the current node node.output.append(pattern) return root
Build Failure Links:
- Use BFS to traverse the trie.
- For each node, set its failure link to the longest suffix of the current keyword that is also a prefix of a keyword in the trie. If no such suffix exists, set the failure link to the root node.
Below is the syntax of the above idea:
Python3 from collections import deque def build_failure_function(root): queue = deque() # Initialize failure function of the root's children to the root itself for node in root.children.values(): node.failure = root queue.append(node) # Breadth-first traversal of the trie to compute the failure function while queue: current_node = queue.popleft() # For each child of the current node for char, child_node in current_node.children.items(): queue.append(child_node) failure_node = current_node.failure # Traverse the failure function until a node is found with a matching child or the root is reached while failure_node and char not in failure_node.children: failure_node = failure_node.failure # Update the failure function of the child node child_node.failure = failure_node.children[char] if failure_node else root # Add the output of the failure node to the output of the current node child_node.output.extend(child_node.failure.output)
Search the Text:
- Start at the root node of the trie.
- For each character in the text:
- Follow the character along the trie.
- If a keyword is found, record its position in the text.
- If a character leads to a failure link, follow the failure link and continue searching.
Below is the syntax of the above idea:
Python3 def search(text, patterns): root = build_trie(patterns) build_failure_function(root) current_node = root results = {} # Dictionary to store the indices of the found patterns # Iterate over each character in the text for i, char in enumerate(text): # Follow the failure function until a matching child is found or the root is reached while current_node and char not in current_node.children: current_node = current_node.failure # If a matching child is found, move to that child if current_node: current_node = current_node.children[char] # If the current node has any patterns that end at it, store the indices of those patterns for pattern in current_node.output: start_index = i - len(pattern) + 1 if start_index not in results: results[start_index] = [] results[start_index].append(pattern) return results
Implementation of Aho-Corasick Algorithm in Python:
Aho-Corasick Algorithm efficiently finds multiple patterns in a given text. Here's a Python implementation:
Python3 class TrieNode: def __init__(self): # Initialize TrieNode attributes self.children = {} self.output = [] self.fail = None def build_automaton(keywords): # Initialize root node of the trie root = TrieNode() # Build trie for keyword in keywords: node = root # Traverse the trie and create nodes for each character for char in keyword: node = node.children.setdefault(char, TrieNode()) # Add keyword to the output list of the final node node.output.append(keyword) # Build failure links using BFS queue = [] # Start from root's children for node in root.children.values(): queue.append(node) node.fail = root # Breadth-first traversal of the trie while queue: current_node = queue.pop(0) # Traverse each child node for key, next_node in current_node.children.items(): queue.append(next_node) fail_node = current_node.fail # Find the longest proper suffix that is also a prefix while fail_node and key not in fail_node.children: fail_node = fail_node.fail # Set failure link of the current node next_node.fail = fail_node.children[key] if fail_node else root # Add output patterns of failure node to current node's output next_node.output += next_node.fail.output return root def search_text(text, keywords): # Build the Aho-Corasick automaton root = build_automaton(keywords) # Initialize result dictionary result = {keyword: [] for keyword in keywords} current_node = root # Traverse the text for i, char in enumerate(text): # Follow failure links until a match is found while current_node and char not in current_node.children: current_node = current_node.fail if not current_node: current_node = root continue # Move to the next node based on current character current_node = current_node.children[char] # Record matches found at this position for keyword in current_node.output: result[keyword].append(i - len(keyword) + 1) return result # Example 1 text1 = "hello worldhello" arr1 = ["hello", "world"] result1 = search_text(text1, arr1) print(result1) # Example 2 text2 = "abxabcabcaby" arr2 = ["ab", "abc", "aby"] result2 = search_text(text2, arr2) print(result2)
Output{'hello': [0, 11], 'world': [6]} {'ab': [0, 3, 6, 9], 'abc': [3, 6], 'aby': [9]}
Time Complexity:
- Building the automaton: O(m+k)
- Searching the text: O(n+z), where z is the total number of occurrences of all keywords in the text.
Auxiliary Space: O (m+k)
Similar Reads
Backtracking Algorithm in Python Backtracking is a problem-solving algorithmic technique that involves finding a solution incrementally by trying different options and undoing them if they lead to a dead end. The backtracking algorithm is a recursive algorithm that is used to solve problems by making a series of choices, and if a c
4 min read
Aho-Corasick Algorithm for Pattern Searching Given an input text and an array of k words, arr[], find all occurrences of all words in the input text. Let n be the length of text and m be the total number of characters in all words, i.e. m = length(arr[0]) + length(arr[1]) + ... + length(arr[k-1]). Here k is total numbers of input words. Exampl
15+ min read
Boyer Moore Algorithm in Python Boyer-Moore algorithm is an efficient string search algorithm that is particularly useful for large-scale searches. Unlike some other string search algorithms, the Boyer-Moore does not require preprocessing, making it ideal where the sample is relatively large relative to the data being searched. Wh
3 min read
Ford-Fulkerson Algorithm in Python The Ford-Fulkerson algorithm is a widely used algorithm to solve the maximum flow problem in a flow network. The maximum flow problem involves determining the maximum amount of flow that can be sent from a source vertex to a sink vertex in a directed weighted graph, subject to capacity constraints o
4 min read
A* Search Algorithm in Python Given an adjacency list and a heuristic function for a directed graph, implement the A* search algorithm to find the shortest path from a start node to a goal node. Examples: Input:Start Node: AGoal Node: FNodes: A, B, C, D, E, FEdges with weights:(A, B, 1), (A, C, 4),(B, D, 3), (B, E, 5),(C, F, 2),
8 min read
Bellman-Ford algorithm in Python Given a weighted graph with V vertices and E edges, and a source vertex src, find the shortest path from the source vertex to all vertices in the given graph. If a vertex cannot be reached from source vertex, mark its distance as 108.Note: If a graph contains negative weight cycle, return -1.Bellman
3 min read