Strings Part 1: Tries and KMP Lucca Siaudzionis and Jack Spalding-Jamieson 2020/03/05 University of British Columbia
Announcements
• We’re still finalizing A4. It will be out this weekend, and you’ll have a little over two weeks to do it.
Inspiration
Suppose we want to implement a map: string -> int
• where we have N string keys and
• each string has length ≤ M
Inspiration
One solution: build a BST of the strings
• This is essentially what would happen if you were to use map<string, int>
Time complexity: O(M log N) per operation, since each of the O(log N) string comparisons can cost O(M)
Space complexity: O(NM) in the worst case, since all N keys of length up to M are stored
This doesn’t allow partial prefix matches, which might be useful sometimes
• Can we do better?
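As a point of comparison, here is a minimal sketch of the BST approach using std::map. The helper name countWords is hypothetical and only serves the illustration; the point is that a plain lookup matches whole keys only, so asking for a prefix such as "te" finds nothing.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: counts how often each word occurs, using
// std::map (a balanced BST in common implementations) as the
// string -> int map. Each insertion or lookup performs O(log N)
// string comparisons, and each comparison can cost O(M).
std::map<std::string, int> countWords(const std::vector<std::string>& words) {
    std::map<std::string, int> counts;
    for (const std::string& w : words) {
        ++counts[w];  // operator[] creates the entry with value 0 if missing
    }
    return counts;
}
```

With the words tea, ten, tea, the entry for "tea" is 2, while counts.count("te") is 0 because "te" is not a stored key.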
Observation
There are only 26 letters in the alphabet, so there is a lot of repetitive information that would be stored in a BST.
Why don’t we just use the alphabet to form a tree?
A Trie is a Tree!
If we build a tree where every node represents a prefix of a word, we have a Trie.
[Figure: a trie storing the keys to, tea, ted, ten, A, i, in, inn. Source: Wikipedia]
Trie Structure
In the previous example, each node represents a prefix.
• But we shouldn’t store that entire prefix. (Why?)
We expand a prefix with an edge containing a character.
An entire word is also a prefix of itself, so we flag some nodes to indicate that they contain the end of a word.
Trie Structure – Implementation
```cpp
#include <string>
#include <vector>
using namespace std;

struct TrieNode {
    bool isWord;                 // true if a stored word ends at this node
    vector<TrieNode*> children;  // children[c - 'a'] is the edge labelled c
    TrieNode() {
        isWord = false;
        children = vector<TrieNode*>(26, nullptr); // assuming only
    }
    // ...
};

TrieNode* root = new TrieNode(); // fresh new trie
```
Trie Lookup
To find out if a word is present in the trie, we just walk along the path in the tree defined by the word.
• If, at any step, an edge is missing, then the word is not present in the trie.
• If we reach the end node, but it has isWord false, then the word is not present in the trie.
Trie Lookup – Implementation
```cpp
// implementation inside TrieNode
bool find(const string& word) {
    TrieNode* curNode = this;
    for (auto c : word) {
        // a missing edge means no stored word has this prefix
        if (!curNode->children[c - 'a']) return false;
        curNode = curNode->children[c - 'a'];
    }
    // the path exists, but it is a word only if the node is flagged
    return curNode->isWord;
}
```
Trie Insertion
To insert a new word, we walk in the trie along the path defined by that word.
• Every time an edge is missing, we create a new edge and node, attaching it to the current node we are scanning.
• When we reach the end of the word, we mark the final node as a word end (isWord = true).
Trie Insertion – Implementation
```cpp
// implementation inside TrieNode struct
void insert(const string& word) {
    TrieNode* curNode = this;
    for (auto c : word) {
        // if the edge is missing
        if (!curNode->children[c - 'a']) {
            // we create a new node
            curNode->children[c - 'a'] = new TrieNode();
        }
        curNode = curNode->children[c - 'a'];
    }
    // flag the final node so lookups report the word as present
    curNode->isWord = true;
}
```
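For reference, the structure, lookup, and insertion can be assembled into one self-contained sketch. It keeps the slides' assumption that keys contain only the lowercase letters 'a' to 'z' (an uppercase key such as "A" would index outside the 26 child slots). The helper buildTrie is hypothetical, and nodes are never freed, which is common in contest code but not something to copy into long-running programs.

```cpp
#include <string>
#include <vector>

struct TrieNode {
    bool isWord = false;  // true if a stored word ends at this node
    std::vector<TrieNode*> children = std::vector<TrieNode*>(26, nullptr);

    // Walks the path spelled by `word`, creating missing nodes on the way.
    void insert(const std::string& word) {
        TrieNode* cur = this;
        for (char c : word) {
            if (!cur->children[c - 'a']) {
                cur->children[c - 'a'] = new TrieNode();
            }
            cur = cur->children[c - 'a'];
        }
        cur->isWord = true;
    }

    // Returns true only if the whole path exists and ends at a flagged node.
    bool find(const std::string& word) const {
        const TrieNode* cur = this;
        for (char c : word) {
            cur = cur->children[c - 'a'];
            if (!cur) return false;
        }
        return cur->isWord;
    }
};

// Hypothetical helper: builds a trie from a list of lowercase keys.
TrieNode* buildTrie(const std::vector<std::string>& keys) {
    TrieNode* root = new TrieNode();
    for (const std::string& key : keys) root->insert(key);
    return root;
}
```

Built from the keys to, tea, ted, ten, a, i, in, inn, the trie reports "tea" and "in" as present, but not "te" (a prefix that is not a word) or "inns" (which runs off the tree).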
Trie Deletion and Prefix Match
Prefix match is very similar to the Lookup procedure
• You should adapt it according to your problem.
There are a few different ways to implement deletion
• We’ll leave that as an exercise to you. :)
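One possible way to support both operations, sketched here as an assumption rather than as the intended solution to the exercise, is to store in every node the number of words whose path passes through it. Prefix match then reads that counter, and deletion decrements the counters along the path instead of freeing nodes. The names CountingTrieNode, wordsBelow, countWithPrefix, and erase are all hypothetical, and keys are again assumed to be lowercase.

```cpp
#include <string>
#include <vector>

struct CountingTrieNode {
    bool isWord = false;
    int wordsBelow = 0;  // stored words whose path goes through this node
    std::vector<CountingTrieNode*> children =
        std::vector<CountingTrieNode*>(26, nullptr);

    bool find(const std::string& word) const {
        const CountingTrieNode* cur = this;
        for (char c : word) {
            cur = cur->children[c - 'a'];
            if (!cur) return false;
        }
        return cur->isWord;
    }

    void insert(const std::string& word) {
        if (find(word)) return;  // keep counters exact for duplicate inserts
        CountingTrieNode* cur = this;
        ++cur->wordsBelow;
        for (char c : word) {
            if (!cur->children[c - 'a']) {
                cur->children[c - 'a'] = new CountingTrieNode();
            }
            cur = cur->children[c - 'a'];
            ++cur->wordsBelow;
        }
        cur->isWord = true;
    }

    // Lazy deletion: nodes stay allocated, only the flag and counters change.
    bool erase(const std::string& word) {
        if (!find(word)) return false;
        CountingTrieNode* cur = this;
        --cur->wordsBelow;
        for (char c : word) {
            cur = cur->children[c - 'a'];
            --cur->wordsBelow;
        }
        cur->isWord = false;
        return true;
    }

    // Same walk as find, but the answer is the counter, not the flag.
    int countWithPrefix(const std::string& prefix) const {
        const CountingTrieNode* cur = this;
        for (char c : prefix) {
            cur = cur->children[c - 'a'];
            if (!cur) return 0;
        }
        return cur->wordsBelow;
    }
};
```

The trade-off of this design is memory: deleted words leave their nodes behind. Physically pruning empty branches is another valid approach and is still a good exercise.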
Discussion Problem
Two players alternate turns adding a letter to a string. At every turn, the string must be a prefix of some word from a given list. The person who adds the last letter of a word loses. If you go first, can you win?
Discussion Problem – Insight
Perform tree DP on the trie of all words
• State: f(node) = can you win if it is your turn to move at this node?
• f(trie node that is a word end) = true, since your opponent just added the last letter of a word and lost
• f(node) = true if f(child) = false for some child
• f(node) = false if f(child) = true for all children
• Answer: f(root)
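A minimal sketch of this tree DP follows. It makes several assumptions: f(node) is evaluated for the player about to move, so reaching a word-end node counts as a win for that player because the opponent has just completed a word; the game stops at the first completed word; and all words are non-empty and lowercase. The names GameTrie, canWin, and firstPlayerWins are hypothetical.

```cpp
#include <array>
#include <string>
#include <vector>

struct GameTrie {
    struct Node {
        bool isWord = false;
        std::array<int, 26> next;  // child index per letter, -1 if absent
        Node() { next.fill(-1); }
    };
    std::vector<Node> nodes;  // nodes[0] is the root (empty string)

    GameTrie() : nodes(1) {}

    void insert(const std::string& word) {
        int cur = 0;
        for (char c : word) {
            int letter = c - 'a';
            if (nodes[cur].next[letter] == -1) {
                nodes[cur].next[letter] = static_cast<int>(nodes.size());
                nodes.emplace_back();
            }
            cur = nodes[cur].next[letter];
        }
        nodes[cur].isWord = true;
    }

    // f(node): can the player about to move from `node` win?
    bool canWin(int node) const {
        // The opponent just completed a word, so the player to move has won.
        if (nodes[node].isWord) return true;
        // Winning means having some move into a position that loses
        // for the opponent.
        for (int child : nodes[node].next) {
            if (child != -1 && !canWin(child)) return true;
        }
        return false;
    }
};

// Hypothetical entry point: true if the first player can force a win.
bool firstPlayerWins(const std::vector<std::string>& words) {
    GameTrie trie;
    for (const std::string& w : words) trie.insert(w);
    return trie.canWin(0);
}
```

For example, with the single word "a" the first player is forced to complete it and loses, while with the single word "ab" the first player adds "a" and forces the opponent to finish the word.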
Exact String Matching
Given a text string T and a pattern P, find all the occurrences of P in T.
• Let N = length(T) and M = length(P).
Exact String Matching – Brute Force
The brute force is intuitive:
• For every position of T, see if there is a match of P that starts at that position.
• Implementation is a double for-loop.
Time complexity: O(NM)
Can we do better?
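The double loop can be sketched as follows. The function name bruteForceMatch and the choice to return 0-based starting positions (and nothing for an empty pattern) are assumptions made for the illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns every 0-based index i such that P occurs in T starting at i.
// Worst case O(N * M): each of the O(N) starts may compare up to M characters.
std::vector<int> bruteForceMatch(const std::string& T, const std::string& P) {
    std::vector<int> occurrences;
    const std::size_t N = T.size(), M = P.size();
    if (M == 0 || M > N) return occurrences;
    for (std::size_t i = 0; i + M <= N; ++i) {
        std::size_t j = 0;
        // Extend the match from position i until a mismatch or the end of P.
        while (j < M && T[i + j] == P[j]) ++j;
        if (j == M) occurrences.push_back(static_cast<int>(i));
    }
    return occurrences;
}
```

On T = "abababa" and P = "aba" this reports the overlapping occurrences at positions 0, 2, and 4.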
Knuth-Morris-Pratt Algorithm (KMP)
The idea of KMP is to find, for every position of T, the longest prefix of P that ends there.
KMP – Insight
Say that we know for a fact that the longest prefix of P that ends at the (i − 1)-th character of T has length equal to k.
• How do we use this to find the longest prefix of P that ends at position i of T?
KMP – Insight
Assume for now that k < M (i.e. there was no full match of P). If the longest prefix of P ending at i − 1 has length k, there are two cases to analyze:
• If P[k] = T[i], then the longest prefix of P ending at i has length k + 1.
• What if P[k] ≠ T[i]?
KMP – Insight
So, we have P[0..k−1] = T[i−k..i−1], and P[k] ≠ T[i].
Suppose you knew that the longest suffix of P[0..k−1] that is also a prefix of P has length k₂
• i.e., P[0..k₂−1] = P[k−k₂..k−1], and k₂ is the maximum such number < k.
Then, we have two cases:
• If P[k₂] = T[i], the longest prefix of P ending at i has length k₂ + 1.
• But what if P[k₂] ≠ T[i]?
• Then, we find the longest suffix of P[0..k₂−1] that is also a prefix and repeat!
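The value k₂ depends only on P and k, so it can be precomputed for every prefix length. The sketch below does this with the same "extend or fall back" idea applied to P itself. The name buildFail and the convention that fail[k] holds k₂ for the prefix of length k (with fail[0] = fail[1] = 0) are assumptions for this illustration.

```cpp
#include <string>
#include <vector>

// fail[k] = length of the longest suffix of P[0..k-1], shorter than k,
// that is also a prefix of P. The array has M + 1 entries, for k = 0..M.
std::vector<int> buildFail(const std::string& P) {
    const int M = static_cast<int>(P.size());
    std::vector<int> fail(M + 1, 0);
    int k = 0;  // invariant at loop entry: k == fail[len - 1]
    for (int len = 2; len <= M; ++len) {
        // Try to extend the current border with P[len - 1];
        // on a mismatch, fall back to the next shorter border.
        while (k > 0 && P[k] != P[len - 1]) k = fail[k];
        if (P[k] == P[len - 1]) ++k;
        fail[len] = k;
    }
    return fail;
}
```

For P = "abab" the result is 0, 0, 0, 1, 2: for instance, the prefix "aba" of length 3 has "a" as its longest suffix that is also a prefix, so fail[3] = 1. The total work is O(M), because k increases by at most one per iteration and every fallback decreases it.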
KMP – Success and Fail Arrows
Implicitly, what we are doing is building a DFA. Each node represents a current prefix-length of P. There are two arrows leaving each node:
• Success: That means there was a character match, and we increase our longest prefix by 1.
• Fail: The characters we compared were different, so we move to the longest suffix of the current prefix that is also a prefix of P.
• This is equivalent to setting k ← k₂ in the previous slide.
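Putting the two arrows together gives the search itself. This is a sketch under stated assumptions: the names failureArrows and kmpSearch are hypothetical, occurrences are reported as 0-based starting positions, and the case k = M (a full match, set aside earlier) is handled by following the fail arrow of the full pattern before reading the next character.

```cpp
#include <string>
#include <vector>

// Fail arrows: fail[k] is where state k (a matched prefix of length k)
// goes on a mismatch, i.e. the value called k2 on the earlier slide.
std::vector<int> failureArrows(const std::string& P) {
    const int M = static_cast<int>(P.size());
    std::vector<int> fail(M + 1, 0);
    int k = 0;
    for (int len = 2; len <= M; ++len) {
        while (k > 0 && P[k] != P[len - 1]) k = fail[k];
        if (P[k] == P[len - 1]) ++k;
        fail[len] = k;
    }
    return fail;
}

// Returns every 0-based index at which P occurs in T, in O(N + M) time.
std::vector<int> kmpSearch(const std::string& T, const std::string& P) {
    std::vector<int> occurrences;
    const int N = static_cast<int>(T.size());
    const int M = static_cast<int>(P.size());
    if (M == 0) return occurrences;
    const std::vector<int> fail = failureArrows(P);
    int k = 0;  // length of the longest prefix of P ending at the previous position
    for (int i = 0; i < N; ++i) {
        if (k == M) k = fail[k];                         // leave the full-match state
        while (k > 0 && P[k] != T[i]) k = fail[k];       // fail arrows
        if (P[k] == T[i]) ++k;                           // success arrow
        if (k == M) occurrences.push_back(i - M + 1);    // P ends at position i
    }
    return occurrences;
}
```

On T = "abababa" and P = "aba" this finds positions 0, 2, and 4, the same answer as the brute force, without ever moving backwards in T.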