CPSC 490 Finale Heavy-Light Decomposition and Suffix Array Lucca Siaudzionis and Jack Spalding-Jamieson 2020/04/07 University of British Columbia
Announcements
• Congrats on [basically] finishing the course!
• Last reminder that A5 is due Sunday April 19th. It will not be extended past this day, so please work on it and all the upsolvers you want to complete early (before your finals!).
Heavy-Light Decomposition: The Goal
Input: A tree (not necessarily binary), with an integer stored at each vertex. We want to handle a bunch of requests online:
• Updates: Add x to all the vertices on the path between a and b.
• Queries: What is the sum along the path from a to b?
For now, we can assume that b is always the root (it will be easy to generalize our answer).
Output: The answer to each query.
Figure 1: A tree and a highlighted path with sum 9.
Heavy-Light Decomposition: The Worst-Case
Recall why this may be hard: What if we had nodes very deep in the tree? Then the path up may be very long. To add even more complexity, we could also have a broomstick:
Figure 2: A small broomstick.
Heavy-Light Decomposition: Decomposition Plan
Our end goal is to decompose our tree into paths with a special property:
Figure 3: The heavy-light decomposition of some tree.
The special property is that any node has O(log n) distinct paths between it and the root.
Heavy-Light Decomposition: Using Paths
The result is: O(log n) distinct heavy paths between each node and the root.
Additional properties:
• Every edge can be considered to be a "light edge" or a "heavy edge".
• The number of light edges from any vertex to the root is O(log n).
• Every vertex has at most one heavy edge to its children (it is also possible to create a decomposition in which every non-leaf vertex has exactly one).
Why is this useful? A path is a segment! So we can turn each path into a segment tree, and do range queries/updates on the O(log n) segments up to the root from any node. This means our queries will run in O(log^2 n) time generally: O(log n) segments, each costing O(log n) in its segment tree.
In practice, we can use one segment tree for the entire tree, indexed by a DFS order that always descends along the heavy edge first, so that each heavy path occupies a contiguous range.
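The "one segment tree for the entire tree" idea can be sketched as follows. This is a minimal sketch, not the course's code: the names `assign_positions`, `children`, `pos` and `counter` are hypothetical, and it assumes that `children[v][0]` is already the heavy child of every non-leaf `v`.

```cpp
#include <vector>

// Heavy-first DFS numbering (sketch).
// Assumption: children[v] lists only the children of v, with the heavy child first.
// Because the heavy child is visited immediately after its parent, every heavy
// path receives consecutive positions and can live in one shared segment tree.
void assign_positions(int v, const std::vector<std::vector<int>>& children,
                      std::vector<int>& pos, int& counter) {
    pos[v] = counter++;
    for (int u : children[v]) {
        assign_positions(u, children, pos, counter);
    }
}
```

Calling it with `counter = 0` at the root fills `pos` with a permutation of 0..n-1; a range operation on a heavy path from its top `t` down to `v` then becomes a range operation on positions `pos[t]..pos[v]`.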
Heavy-Light Decomposition: How To
The algorithm to compute the heavy-light decomposition is quite simple. For each vertex:
• Recursively compute the size of each child's subtree.
• Compute the sum of the subtree sizes.
• If any subtree has at least half the total number of nodes among them all, extend or create a heavy edge from the current node.
Alternative definition: Create a heavy edge to the child with the largest size, even if it is not more than half of the total.
In our implementation, we will actually use the alternative definition, since it's easier to implement.
Heavy-Light Decomposition: Property Proof Idea
We need to prove something about the heavy-light decomposition:
• The number of light edges from any vertex to the root is O(log n).
More specifically, we show that there are at most log_2 n such light edges. This will then imply that there are O(log n) distinct heavy paths between each node and the root.
Proof idea: Starting at some vertex v, iteratively move up to the parent. Every time we take a light edge, the total number of elements in the subtree must at least double, by the choice of the heavy edge (EXERCISE: check this for both definitions of the heavy edge). This doubling can only happen up to log_2 n times.
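One way to write out the doubling step is sketched below. The notation is introduced here and is not from the slides: s(v) is the number of vertices in the subtree of v, p is the parent of v, and (v, p) is assumed to be a light edge.

```latex
\begin{align*}
\text{Largest-child definition: }
  & p \text{ has a heavy child } h \neq v \text{ with } s(h) \ge s(v), \text{ so} \\
  & s(p) \;\ge\; 1 + s(v) + s(h) \;\ge\; 2\,s(v) + 1 . \\[4pt]
\text{Half definition: }
  & v \text{ holds at most half of the children's total } s(p) - 1, \text{ so} \\
  & s(v) \;\le\; \tfrac{1}{2}\bigl(s(p) - 1\bigr)
    \;\Longrightarrow\; s(p) \;\ge\; 2\,s(v) + 1 . \\[4pt]
\text{After } k \text{ light edges: }
  & n \;\ge\; s(\text{root}) \;\ge\; 2^{k}\, s(v) \;\ge\; 2^{k}
    \;\Longrightarrow\; k \;\le\; \log_2 n .
\end{align*}
```

Heavy edges on the way up never shrink the subtree size, so only the light edges need to be counted.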
Heavy-Light Decomposition: Implementation
The implementation is actually extremely simple. Here we will always set the first child to be the heavy child, by swapping:

    // adj[v] is assumed to list only the children of v (rooted tree)
    void compute_size(vector<int>& size, int v, vector<vector<int>>& adj) {
        size[v] = 1;
        for (int& u : adj[v]) {
            compute_size(size, u, adj);
            size[v] += size[u];
            // make the heaviest child the first child
            if (size[u] >= size[adj[v][0]]) swap(u, adj[v][0]);
        }
    }

    // next[v] stores the first vertex at or above v that is the root
    // or is connected by a light edge to its parent
    // next[root] must be initialized to root
    void hld(int v, vector<vector<int>>& adj, vector<int>& next) {
        for (int u : adj[v]) {
            next[u] = (u == adj[v][0] ? next[v] : u);
            hld(u, adj, next);
        }
    }
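To see what the two routines produce, here is a small self-contained sketch that repeats them and wraps them in a helper. The wrapper `build_heads` is hypothetical (it is not part of the slides), and `adj` is assumed to contain child lists only.

```cpp
#include <utility>
#include <vector>

// Same routine as on the slide: subtree sizes, heavy child swapped to the front.
void compute_size(std::vector<int>& size, int v, std::vector<std::vector<int>>& adj) {
    size[v] = 1;
    for (int& u : adj[v]) {
        compute_size(size, u, adj);
        size[v] += size[u];
        if (size[u] >= size[adj[v][0]]) std::swap(u, adj[v][0]);
    }
}

// Same routine as on the slide: next[u] is the top of the heavy path containing u.
void hld(int v, std::vector<std::vector<int>>& adj, std::vector<int>& next) {
    for (int u : adj[v]) {
        next[u] = (u == adj[v][0] ? next[v] : u);
        hld(u, adj, next);
    }
}

// Hypothetical convenience wrapper: runs both passes and returns next[].
std::vector<int> build_heads(std::vector<std::vector<int>>& adj, int root) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> size(n, 0), next(n, -1);
    compute_size(size, root, adj);
    next[root] = root;  // required initialization
    hld(root, adj, next);
    return next;
}
```

For the child lists `{{1, 2}, {}, {3, 4}, {}, {5}, {}}` rooted at 0, the heavy path is 0-2-4-5, so `build_heads` returns `{0, 1, 0, 3, 0, 0}`: vertices 1 and 3 hang off light edges and start their own paths.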
Heavy-Light Decomposition: Implementation Usage

    // sg initialized to have length n
    // dfs[v] is the position of v in the heavy-first DFS order, computed beforehand
    // applies the operation to the path from v up to the root
    // assume parent[root] == -1 (this is what ends the loop; root itself is not read)
    void query(int v, int root, vector<int>& next, vector<int>& parent,
               segtree& sg, int q, vector<int>& dfs) {
        while (v != -1) {
            int u = next[v];              // top of the heavy path containing v
            sg.query(dfs[u], dfs[v], q);  // add q to the range [dfs[u], dfs[v]]
            v = parent[u];                // cross the light edge above u
        }
    }

This function could be used to handle arbitrary paths in the tree using inverse operations and LCA (we've seen this before in the RQ unit). A better way to do this (one that also works for min/max, which have no inverse) is to walk up from each of the two endpoints and halt at the LCA, which we can do by recording depths and finding the depth of the LCA.
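The "walk up from both endpoints and halt at the LCA" idea can be sketched as below. This is a sketch under stated assumptions rather than the course's implementation: `path_sum`, `head` (the slides' `next`), `pos` (the slides' `dfs`), `depth` and the `range_sum` callback are names chosen here, and `range_sum(l, r)` stands in for whatever range structure is used (for example a segment tree query).

```cpp
#include <functional>
#include <utility>
#include <vector>

// Sum of the values on the path between a and b (both endpoints included).
// head[v]   : top vertex of the heavy path containing v
// parent[v] : parent of v, with parent[root] == -1
// depth[v]  : number of edges between the root and v
// pos[v]    : index of v in the heavy-first DFS order
// range_sum : returns the sum over positions l..r (inclusive)
long long path_sum(int a, int b, const std::vector<int>& head,
                   const std::vector<int>& parent, const std::vector<int>& depth,
                   const std::vector<int>& pos,
                   const std::function<long long(int, int)>& range_sum) {
    long long total = 0;
    // While the endpoints sit on different heavy paths, lift the one whose
    // path top is deeper; it cannot contain the LCA.
    while (head[a] != head[b]) {
        if (depth[head[a]] < depth[head[b]]) std::swap(a, b);
        total += range_sum(pos[head[a]], pos[a]);
        a = parent[head[a]];  // cross the light edge above the path top
    }
    // Both endpoints are now on the same heavy path; the shallower one is the LCA.
    if (depth[a] > depth[b]) std::swap(a, b);
    total += range_sum(pos[a], pos[b]);
    return total;
}
```

Each loop iteration crosses one light edge, so there are O(log n) calls to `range_sum`. Replacing the sum with min or max only requires changing how the partial results are combined.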
Heavy-Light Decomposition: What You Can Use This For
Now that we've turned all trees into a decomposition of segments, you can do many more things:
• Compute min, max, argmin, argmax, product, sum modulo, product modulo a prime, etc. along paths in the tree (anything we could do easily with segment trees).
• Query inclusion in tree paths (store sets in the segment trees) in O(log^3 n) time.
• Combine subtree and path queries (keep track of an Euler tour within the same DFS as HLD).
• Go learn link-cut trees (uses HLD only as part of the proof).
Suffix Arrays
Let's start with a motivating problem.
LCS or LCS?
We have solved the Longest Common Subsequence problem with DP. What about Longest Common Substring?
Longest Common Substring
Very slow method: run Aho Corasick to find all substrings of S1 that appear in S2 → at least O(m^2 + n)
Observation: we only need to match suffixes of S1, because Aho Corasick can tell you the longest prefix match of any suffix, which is good enough.
⇒ All we need to do is to figure out how to build a Suffix Trie of S1 with all the extra arrows for Aho Corasick, and then run S2 through.
The compressed form of the Suffix Trie (the suffix tree) can be built in O(n), but the algorithm is quite complicated. Instead we will build a simpler data structure: a Suffix Array.
Suffix Array – Definition
A Suffix Array is the representation of all the suffixes of a word S sorted in lexicographical order.

    All suffixes (by start index)        Sorted suffixes (start index)
    0: BANANA$                           $        (6)
    1: ANANA$                            A$       (5)
    2: NANA$                             ANA$     (3)
    3: ANA$                      →       ANANA$   (1)
    4: NA$                               BANANA$  (0)
    5: A$                                NA$      (4)
    6: $                                 NANA$    (2)
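The definition translates directly into a brute-force construction, shown here only as a baseline sketch. The name `naive_suffix_array` is chosen for illustration; each comparison may scan O(n) characters, so this is far slower than the construction developed in the following slides.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Brute-force suffix array: sort the start indices by comparing whole suffixes.
// Roughly O(n^2 log n) in the worst case; meant only to illustrate the definition.
std::vector<int> naive_suffix_array(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        // Lexicographic comparison of s[a..] and s[b..] without copying.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}
```

For `"BANANA$"` this returns the start indices 6, 5, 3, 1, 0, 4, 2, matching the sorted column above (in ASCII, `$` sorts before every letter).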
Longest Common Substring with Suffix Array
Main idea: if we have a sorted list of all suffixes of both S1 and S2, then we can just scan through the list and compare adjacent suffixes.
What we will do:
• Construct a Suffix Array of a string in O(n log^2 n)
  • O(n log n) if you use radix sort
  • O(n) algorithms exist, but are more complicated
• At the same time, construct a DP table so that we can find the Longest Common Prefix of any two suffixes in O(log n)
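The scan over adjacent suffixes can be sketched as follows. To stay short, this sketch swaps in a brute-force sort and a character-by-character prefix comparison instead of the O(n log^2 n) construction and the DP table, so it shows the idea and not the intended complexity. The function name and the separator byte `'\x01'` are assumptions made here; the inputs are assumed not to contain that byte.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Longest common substring of s1 and s2 via sorted suffixes of s1 + sep + s2.
// Assumption: neither input contains the separator byte '\x01'.
std::string longest_common_substring(const std::string& s1, const std::string& s2) {
    const std::string t = s1 + '\x01' + s2;
    const int n1 = static_cast<int>(s1.size());
    const int n = static_cast<int>(t.size());
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    // Brute-force suffix sort (stand-in for the faster construction).
    std::sort(sa.begin(), sa.end(), [&t](int a, int b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    int best_len = 0, best_start = 0;
    for (int i = 0; i + 1 < n; ++i) {
        const int a = sa[i], b = sa[i + 1];
        // Only neighbours that start in different input strings are candidates.
        if ((a < n1) == (b < n1)) continue;
        // Naive common-prefix length; it cannot run across the separator because
        // the suffix that starts in s2 does not contain it.
        int len = 0;
        while (a + len < n && b + len < n && t[a + len] == t[b + len]) ++len;
        if (len > best_len) {
            best_len = len;
            best_start = a;
        }
    }
    return t.substr(best_start, best_len);
}
```

Checking only neighbours is enough because the common prefix of two suffixes in sorted order is never longer than the common prefix of any adjacent pair lying between them. For example, `longest_common_substring("BANANA", "ANANAS")` yields `"ANANA"`.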
Suffix Array Construction
To avoid O(n^2) memory, store suffixes by their starting index.
Full comparison of 2 suffixes is slow, possibly O(string length), but comparing only the first character is fast!
⇒ Let's try sorting suffixes by first character
Suffix Array Construction – Rank
• Define rank as the "rank" of a string when sorted by some key, not breaking ties (strings with equal keys get the same rank).
• The rank must be defined such that rank(a) < rank(b) iff a comes before b in the sorting.
Suffix Array Construction – Pass 1
First pass: sort by the first character of the suffix and label with rank

    {B}  0 = BANANA$           {$} -> {0}  6 = $
    {A}  1 = ANANA$            {A} -> {1}  1 = ANANA$
    {N}  2 = NANA$             {A} -> {1}  3 = ANA$
    {A}  3 = ANA$       =>     {A} -> {1}  5 = A$
    {N}  4 = NA$               {B} -> {2}  0 = BANANA$
    {A}  5 = A$                {N} -> {3}  2 = NANA$
    {$}  6 = $                 {N} -> {3}  4 = NA$

Now we know the rank of all suffixes by their first character.
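A minimal sketch of this first pass, assuming plain ASCII input; the function name `first_char_ranks` is chosen here for illustration. Suffixes that start with the same character receive the same rank, so ties are deliberately left unbroken.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Rank of every suffix by its first character only (ties are not broken).
// rank[i] is the number of distinct characters smaller than s[i].
std::vector<int> first_char_ranks(const std::string& s) {
    std::string chars = s;
    std::sort(chars.begin(), chars.end());
    chars.erase(std::unique(chars.begin(), chars.end()), chars.end());
    std::vector<int> rank(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        rank[i] = static_cast<int>(
            std::lower_bound(chars.begin(), chars.end(), s[i]) - chars.begin());
    }
    return rank;
}
```

For `"BANANA$"` the distinct characters in sorted order are `$`, `A`, `B`, `N`, so the ranks indexed by suffix start are 2, 1, 3, 1, 3, 1, 0, which agrees with the table above.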
Suffix Array Construction – Pass 2
Observation: a suffix of a suffix is a suffix, so if two suffixes share their first character, we know their relative rank by their second character!
⇒ Sort again with pair(rank 1, rank 2)

    {0, 0}  6 = $               {0, 0} -> {0}  6 = $
    {1, 3}  1 = ANANA$          {1, 0} -> {1}  5 = A$
    {1, 3}  3 = ANA$            {1, 3} -> {2}  1 = ANANA$
    {1, 0}  5 = A$       =>     {1, 3} -> {2}  3 = ANA$
    {2, 1}  0 = BANANA$         {2, 1} -> {3}  0 = BANANA$
    {3, 1}  2 = NANA$           {3, 1} -> {4}  2 = NANA$
    {3, 1}  4 = NA$             {3, 1} -> {4}  4 = NA$

Now we know the rank of all suffixes by their first 2 characters.
Suffix Array Construction – Pass 3
We know the rank by the first 2 characters, so if two suffixes have the same rank, we know their relative rank by the next 2 characters.
⇒ Sort again with pair(rank of char 1-2, rank of char 3-4)

    {0, 0}  6 = $               {0, 0} -> {0}  6 = $
    {1, 0}  5 = A$              {1, 0} -> {1}  5 = A$
    {2, 2}  1 = ANANA$          {2, 1} -> {2}  3 = ANA$
    {2, 1}  3 = ANA$     =>     {2, 2} -> {3}  1 = ANANA$
    {3, 4}  0 = BANANA$         {3, 4} -> {4}  0 = BANANA$
    {4, 4}  2 = NANA$           {4, 0} -> {5}  4 = NA$
    {4, 0}  4 = NA$             {4, 4} -> {6}  2 = NANA$

Now we know the rank of all suffixes by their first 4 characters. For "BANANA" we are done as all ranks are unique. Otherwise, sort again with pair(rank of char 1-4, rank of char 5-8), etc.
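The passes above can be put together as a prefix-doubling loop. This is a sketch of the O(n log^2 n) variant using `std::sort`, not necessarily the course's reference code. One deliberate difference from the tables: ranks here start at 1 so that 0 can be reserved for "past the end of the string", which lets the sketch work even when the input has no `$` terminator.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Suffix array by prefix doubling: after the pass with a given len, rank[] orders
// all suffixes by their first 2*len characters. Each pass sorts by the pair
// (rank of first len chars, rank of next len chars).
std::vector<int> suffix_array(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    // Initial ranks: character codes shifted by 1, keeping 0 free for "past the end".
    for (int i = 0; i < n; ++i) rank[i] = static_cast<unsigned char>(s[i]) + 1;
    for (int len = 1;; len *= 2) {
        auto key = [&](int i) {
            return std::make_pair(rank[i], i + len < n ? rank[i + len] : 0);
        };
        std::sort(sa.begin(), sa.end(), [&](int a, int b) { return key(a) < key(b); });
        // Relabel with dense ranks 1..k; equal keys keep equal ranks (ties unbroken).
        tmp[sa[0]] = 1;
        for (int i = 1; i < n; ++i) {
            tmp[sa[i]] = tmp[sa[i - 1]] + (key(sa[i - 1]) < key(sa[i]) ? 1 : 0);
        }
        rank = tmp;
        if (rank[sa[n - 1]] == n) break;  // all ranks unique: fully sorted
    }
    return sa;
}
```

There are at most about log_2 n passes and each costs one O(n log n) sort, which gives the O(n log^2 n) bound; replacing `std::sort` with a radix sort on the rank pairs gives O(n log n).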