L2S: Learning to Search
CS 6355: Structured Prediction

Some slides adapted from Daumé and Ross.

Inference: what is inference? An overview of what we have seen before, including combinatorial optimization and different views of inference.


1. Learning to search: General setting

Predicting an output y as a sequence of decisions. General data structures (see the code sketch below):
– State: a partial assignment to (y_1, y_2, …, y_T)
– Initial state: the empty assignment (-, -, …, -)
– Actions: pick an unassigned component y_i and assign a label to it
– Transition model: move from one partial structure to another
– Goal test: whether all components of y are assigned
  • A goal state does not need to be optimal
– Path cost/score function: w^T φ(x, node), or more generally, a neural network that depends on x and the node
  • A node contains the current state and a back pointer to trace back the search path
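To make these data structures concrete, here is a minimal Python sketch. The left-to-right expansion order, the label set, and the Node/State names are illustrative assumptions, not part of the slides.

# A minimal sketch of the generic data structures above (not from the slides):
# a state is a partial assignment, an action fills in one unassigned component.
from dataclasses import dataclass
from typing import Optional, Tuple

LABELS = ["A", "B", "C"]             # illustrative label set

State = Tuple[Optional[str], ...]    # e.g. ("A", None, None)

@dataclass
class Node:
    state: State
    parent: Optional["Node"] = None  # back pointer to trace the search path
    score: float = 0.0               # path score g, e.g. w . phi(x, node)

def initial_state(length: int) -> State:
    """Initial state: the empty assignment (-, -, ..., -)."""
    return tuple([None] * length)

def is_goal(state: State) -> bool:
    """Goal test: every component has been assigned a label."""
    return all(y is not None for y in state)

def successors(node: Node):
    """Transition model: assign each possible label to the first unassigned component."""
    i = node.state.index(None)
    for label in LABELS:
        yield Node(state=node.state[:i] + (label,) + node.state[i + 1:], parent=node)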

2. Example

Suppose each y_i can be one of A, B, or C.

[Figure: outputs y_1, y_2, y_3 with inputs x_1, x_2, x_3, and the search tree growing from (-,-,-) through partial assignments such as (A,-,-), (B,-,-), (C,-,-), (A,A,-), …, down to complete assignments such as (A,A,A), …, (C,C,C).]

– State: triples (y_1, y_2, y_3), each component possibly unknown, e.g. (A,-,-), (-,A,A), (-,-,-), …
– Transition: fill in one of the unknowns
– Start state: (-,-,-)
– End state: all three y's are assigned
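Continuing the sketch above, expanding the start state of this example generates exactly the first level of the search tree:

# Expanding (-,-,-) one step yields (A,-,-), (B,-,-), (C,-,-); expanding
# repeatedly until is_goal() holds reaches complete assignments such as (A,A,A).
root = Node(state=initial_state(3))
for child in successors(root):
    print(child.state)   # ('A', None, None), ('B', None, None), ('C', None, None)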

3. 1st Framework: LaSO: Learning as Search Optimization [Hal Daumé III and Daniel Marcu, ICML 2005]

4. The enqueue function in LaSO

• The goal of learning is to produce an enqueue function that
  – places good hypotheses high on the queue
  – places bad hypotheses low on the queue
• LaSO assumes enqueue scores nodes with two components, g + h
  – g: path component, g = w^T φ(x, node)
  – h: heuristic component (h is given)
• Depending on h and the queue, this recovers familiar strategies: A* if h is admissible, heuristic search if h is not admissible, best-first search if h = 0, beam search if the queue size is limited.
• The goal is to learn w. How?
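A minimal sketch of such an enqueue, assuming a feature function phi(x, node) that returns a NumPy vector and a fixed beam width (both illustrative; LaSO itself does not fix the queue implementation):

import numpy as np

def enqueue(queue, new_nodes, w, x, phi, h=lambda node: 0.0, beam=5):
    """Score each new node by g + h, with g = w . phi(x, node), then keep the best `beam`."""
    new_nodes = list(new_nodes)
    for node in new_nodes:
        node.score = float(w @ phi(x, node)) + h(node)       # g + h
    merged = sorted(queue + new_nodes, key=lambda n: n.score, reverse=True)
    return merged[:beam]   # beam search; drop the cut-off to get best-first search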

5. "y-good" nodes

Assumption: for any given node s and a gold output y, we can tell whether s can or cannot lead to y.

Definition: the node s is y-good if s can lead to y.

Example: y = (y_1, y_2, y_3), where each y_i can be one of A, B, or C, and the true label is (y_1 = A, y_2 = B, y_3 = C).

[Figure: the search tree from (-,-,-); nodes such as (A,-,-) and (-,B,-) are y-good, while nodes such as (C,-,-), (A,A,-), and (C,C,C) are not.]
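For partial assignments like these, the y-good test is just a per-position consistency check; a minimal sketch (names illustrative):

def is_y_good(state, gold):
    """A partial assignment can still lead to gold iff every filled slot matches gold."""
    return all(y is None or y == g for y, g in zip(state, gold))

gold = ("A", "B", "C")
print(is_y_good(("A", None, None), gold))   # True
print(is_y_good((None, "B", None), gold))   # True
print(is_y_good(("C", None, None), gold))   # False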

6. Learning in LaSO

• Search as if in the prediction phase, but when an error is made:
  – update w
  – clear the queue and insert all the correct moves
• Two kinds of errors:
  – Error type 1: no node in the queue is y-good
  – Error type 2: the goal state is not y-good

7. Learning Algorithm in LaSO

Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if error:
            step 1: update w
            step 2: refresh queue
        else if GoalTest(node):
            return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)

8. What should learning do?

[Figure: a fragment of the search tree; the current node is not on a y-good path, while its sibling, node 4, is y-good.]

Suppose we find an error (of either type) at the current node. Then we should have chosen node 4 instead of the current node: node 4 is the y-good sibling of the current node.

9. Learning Algorithm in LaSO (error check filled in)

Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if none of (node + nodes) is y-good,
           or GoalTest(node) and node is not y-good:
            sibs = siblings(node, y)
            w = update(w, x, sibs, {node, nodes})
            nodes = MakeQueue(sibs)
        else if GoalTest(node):
            return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)
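Putting the pieces together, a minimal Python sketch of this training loop on a single example. It reuses the Node / initial_state / is_goal / successors / is_y_good helpers sketched earlier, assumes left-to-right expansion, and simplifies the y-good siblings to the gold prefix at the current depth; it is an illustrative sketch, not the authors' reference implementation.

import numpy as np

def laso_learn_one(x, gold, phi, n_feats, beam=5, epochs=5):
    """LaSO training sketch on one (x, gold) pair with left-to-right expansion."""
    w = np.zeros(n_feats)

    def score(node):
        return float(w @ phi(x, node))            # g component only (h = 0)

    def y_good_node_at_depth(node):
        # With left-to-right expansion, the y-good node at this depth is the gold prefix.
        d = sum(y is not None for y in node.state)
        return Node(state=tuple(gold[:d]) + (None,) * (len(gold) - d))

    for _ in range(epochs):
        nodes = [Node(state=initial_state(len(gold)))]
        while nodes:
            nodes.sort(key=score, reverse=True)
            node = nodes.pop(0)
            error = (not any(is_y_good(n.state, gold) for n in [node] + nodes)
                     or (is_goal(node.state) and not is_y_good(node.state, gold)))
            if error:
                sibs = [y_good_node_at_depth(node)]                 # the "correct moves"
                bad = [node] + nodes
                w = w + (sum(phi(x, n) for n in sibs) / len(sibs)
                         - sum(phi(x, n) for n in bad) / len(bad))  # perceptron-style update
                nodes = sibs                                        # refresh the queue
            elif is_goal(node.state):
                break                                               # reached a y-good goal
            else:
                nodes = sorted(nodes + list(successors(node)),
                               key=score, reverse=True)[:beam]      # beam-limited enqueue
    return w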

10. Parameter Updates

We need to specify w = update(w, x, sibs, nodes). A simple perceptron-style update rule:

    w ← w + Δ,   where   Δ = (1/|sibs|) Σ_{n ∈ sibs} Φ(x, n) − (1/|nodes|) Σ_{n ∈ nodes} Φ(x, n)

That is, move w toward the average features of the y-good siblings and away from the average features of the nodes that caused the error. It comes with the usual perceptron-style mistake bound and generalization bound (see references).
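The same update as a standalone function (a sketch; phi is assumed to return a NumPy feature vector, as in the earlier snippets):

import numpy as np

def perceptron_update(w, x, sibs, nodes, phi):
    """w <- w + mean of phi over y-good siblings - mean of phi over the queued nodes."""
    good = sum(phi(x, n) for n in sibs) / len(sibs)
    bad = sum(phi(x, n) for n in nodes) / len(nodes)
    return w + good - bad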

11. 2nd Framework: SEARN: Search and Learning [Hal Daumé III, John Langford, and Daniel Marcu, 2007]

12. Policy

• A policy is a mapping from a state to an action: for a given node, the policy tells us what action to take.
• A policy defines a search path in the search space.
  – Different policies mean different search paths.
  – A policy can be thought of as the "driver" in the search space.
• A policy may be deterministic, or may contain some randomness. (More on this later.)
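As a type, a policy is just a function from states to actions. A minimal sketch (names and the greedy construction are illustrative; score stands for any learned scoring function):

from typing import Callable, Optional, Tuple

State = Tuple[Optional[str], ...]
Action = str                          # here, the label to assign to the next open slot
Policy = Callable[[State], Action]

def greedy_policy(score) -> Policy:
    """A deterministic policy: choose the label whose resulting state scores highest."""
    def act(state: State) -> Action:
        i = state.index(None)                                              # next open slot
        options = {a: state[:i] + (a,) + state[i + 1:] for a in ("A", "B", "C")}
        return max(options, key=lambda a: score(options[a]))
    return act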

13. Reference Policy and Learned Policy

• We assume we already have a good reference policy π_ref for the training data (x, c), i.e. examples associated with costs for outputs.
• Goal: learn a good policy π for test data, where we do not have access to the cost vector c (imitation learning).
• For example, if the cost vector c is based on Hamming distance, the reference policy is trivial to compute. Why? Just make the right decision at every step.
  – Suppose the gold output is (A, B, C, A) and we are at state (A, C, -, -). The reference policy says the next action is to assign C to the third slot.
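A minimal sketch of that Hamming-cost reference policy (names illustrative, consistent with the earlier snippets):

def hamming_reference_policy(gold):
    """Under Hamming cost, the optimal action at any state is simply the gold label
    for the next open slot, regardless of earlier mistakes."""
    def act(state):
        i = state.index(None)
        return gold[i]
    return act

pi_ref = hamming_reference_policy(("A", "B", "C", "A"))
print(pi_ref(("A", "C", None, None)))   # 'C'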

14. Cost-Sensitive Classification

Suppose we want to learn a classifier h that maps examples to one of L labels.

Standard multiclass classification
• Training data: pairs of an example and its label, (x, y) ∈ X × [L]
• Learning goal: find a classifier with low error, min_h Pr[ h(x) ≠ y ]

Cost-sensitive classification
• Training data: an example paired with a cost vector listing the cost of predicting each label, (x, c) ∈ X × [0, ∞)^L
• Learning goal: find a classifier with low expected cost, min_h E_{(x,c)}[ c_{h(x)} ]

Exercise: how would you design a cost-sensitive learner?

SEARN uses a cost-sensitive learner to learn a policy.
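One possible answer to the exercise, as a hedged sketch: regress the cost of each label and predict the argmin. This reduction to L regression problems is one standard approach, not necessarily the one the slides have in mind; scikit-learn's Ridge is used here purely for illustration.

import numpy as np
from sklearn.linear_model import Ridge

class CostSensitiveClassifier:
    """Fit one regressor per label to predict its cost, then predict the cheapest label."""
    def __init__(self, n_labels):
        self.models = [Ridge() for _ in range(n_labels)]

    def fit(self, X, C):                 # X: (n, d) features, C: (n, L) cost vectors
        for j, model in enumerate(self.models):
            model.fit(X, C[:, j])
        return self

    def predict(self, X):
        costs = np.column_stack([m.predict(X) for m in self.models])
        return costs.argmin(axis=1)      # label with the lowest predicted cost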

15. SEARN at test time

We have already learned a policy. We use it to construct a sequence of decisions y and obtain the final structured output (see the decoding sketch below):
1. Use the learned policy on the initial state (-, …, -) to compute y_1.
2. Use the learned policy on the state (y_1, -, …, -) to compute y_2.
3. Keep going until we get y = (y_1, …, y_n).
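As a loop (a sketch reusing the policy-as-function convention from above):

def decode(policy, length):
    """Run the learned policy from the empty assignment to a full output."""
    state = tuple([None] * length)
    while None in state:
        i = state.index(None)
        state = state[:i] + (policy(state),) + state[i + 1:]
    return state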

16. SEARN at training time

• The core idea in training is to notice that at each decision step we are actually doing cost-sensitive classification.
• Construct cost-sensitive classification examples (s, c), with state s and cost vector c (a sketch follows this slide).
• Learn a cost-sensitive classifier. (This is nothing but a policy.)
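A schematic sketch of how those training examples might be gathered; the action_cost function, which fills in the cost vector, is sketched after the next slide, and all names here are illustrative rather than the paper's exact procedure.

def collect_examples(length, rollin_policy, action_cost, labels=("A", "B", "C")):
    """Roll in with a policy to visit states; at each state record the cost of every action."""
    examples = []
    state = tuple([None] * length)
    while None in state:
        costs = {a: action_cost(state, a) for a in labels}   # cost vector c for this state
        examples.append((state, costs))                      # one cost-sensitive example (s, c)
        i = state.index(None)
        state = state[:i] + (rollin_policy(state),) + state[i + 1:]   # roll-in step
    return examples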

17. Roll-in, Roll-out

• Roll-in: at each state, use some policy to move to a new state.
• Roll-out: what is the cost of deviating from the policy at this step?
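One standard way to answer that question, as a hedged sketch: take the deviating action, roll out to a complete output with some roll-out policy (here, by default, the Hamming reference policy), and measure the cost of the final output. The helper name and defaults are assumptions for illustration.

def make_action_cost(gold, rollout_policy=None):
    """Return cost(state, action): take the action, roll out to a complete output
    (by default by finishing with gold labels), and count the Hamming errors."""
    def cost(state, action):
        i = state.index(None)
        state_ = state[:i] + (action,) + state[i + 1:]
        while None in state_:                           # roll-out to a complete output
            j = state_.index(None)
            a = rollout_policy(state_) if rollout_policy else gold[j]
            state_ = state_[:j] + (a,) + state_[j + 1:]
        return sum(y != g for y, g in zip(state_, gold))
    return cost

gold = ("A", "B", "C", "A")
cost = make_action_cost(gold)
print(cost(("A", "C", None, None), "C"))   # 1 (only the earlier mistake at slot 2 remains)
print(cost(("A", "C", None, None), "B"))   # 2 (adds a new mistake at slot 3)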
