Learning to search: General setting

Predicting an output y as a sequence of decisions.

General data structures:
– State: a partial assignment to (y_1, y_2, …, y_n)
– Initial state: the empty assignment (-, -, …, -)
– Actions: pick a component y_i and assign a label to it
– Transition model: move from one partial structure to another
– Goal test: whether all y components are assigned
  • A goal state does not need to be optimal
– Path cost/score function: w^T Φ(x, node), or more generally a neural network that depends on x and the node
  • A node contains the current state and a back pointer to trace back the search path
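To make these data structures concrete, here is a minimal Python sketch. The names (SearchNode, initial_node, is_goal, successors) and the fixed three-label set are illustrative assumptions, not from the original slides:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

LABELS = ["A", "B", "C"]                  # possible values for each y_i (assumed)

@dataclass(frozen=True)
class SearchNode:
    state: Tuple[Optional[str], ...]       # partial assignment; None = unassigned
    parent: Optional["SearchNode"] = None  # back pointer to trace the search path

def initial_node(n: int) -> SearchNode:
    return SearchNode(state=(None,) * n)   # empty assignment (-, -, ..., -)

def is_goal(node: SearchNode) -> bool:
    return all(y is not None for y in node.state)  # every component assigned

def successors(node: SearchNode):
    """Transition model: fill in the first unassigned component each possible way."""
    i = node.state.index(None)
    for label in LABELS:
        new_state = node.state[:i] + (label,) + node.state[i + 1:]
        yield SearchNode(state=new_state, parent=node)
```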
Example

Suppose each y can be one of A, B, or C, with inputs (x_1, x_2, x_3) and outputs (y_1, y_2, y_3).

• State: triples (y_1, y_2, y_3), each possibly unknown
  – e.g. (A, -, -), (-, A, A), (-, -, -), …
• Start state: (-, -, -)
• Transition: fill in one of the unknowns
• End state: all three y's are assigned

[Figure: the search tree rooted at (-, -, -), branching to (A, -, -), (B, -, -), (C, -, -), then to states such as (A, A, -), …, (C, C, -), and finally to complete assignments (A, A, A), …, (C, C, C).]
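Using the sketch above, expanding the start state reproduces the first layer of this tree:

```python
root = initial_node(3)           # (-, -, -)
for child in successors(root):
    print(child.state)           # ('A', None, None), ('B', None, None), ('C', None, None)
```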
1st Framework: LaSO: Learning as Search Optimization [Hal Daumé III and Daniel Marcu, ICML 2005]
The enqueue function in LaSO

• The goal of learning is to produce an enqueue function that
  – places good hypotheses high on the queue
  – places bad hypotheses low on the queue
• LaSO assumes enqueue scores nodes with two components, g + h
  – g: path component (g = w^T Φ(x, node))
  – h: heuristic component (h is given)
• This recovers familiar algorithms: A* if h is admissible, heuristic search if h is not admissible, best-first search if h = 0, and beam search if the queue size is limited.

The goal is to learn w. How?
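A possible rendering of the scored queue in Python, reusing SearchNode from the earlier sketch. Here phi is a stand-in feature map (hashed position-label indicators, stable within a run), and the enqueue signature is an assumption:

```python
import heapq
import itertools
import numpy as np

_D = 64                       # feature dimension (arbitrary for the sketch)
_tie = itertools.count()      # tie-breaker so the heap never compares nodes

def phi(x, node) -> np.ndarray:
    """Toy feature map Φ(x, node): hashed (position, label) indicators.
    Purely illustrative -- real features would look at x as well."""
    v = np.zeros(_D)
    for i, y in enumerate(node.state):
        if y is not None:
            v[hash((i, y)) % _D] += 1.0
    return v

def enqueue(x, queue, children, w, h=lambda n: 0.0, beam=None):
    """LaSO-style enqueue: score each child by g + h, with g = w^T Φ(x, node).
    h = 0 gives best-first search; truncating to `beam` gives beam search."""
    for child in children:
        g = float(w @ phi(x, child))
        heapq.heappush(queue, (-(g + h(child)), next(_tie), child))
    if beam is not None:
        queue[:] = heapq.nsmallest(beam, queue)  # keep the `beam` best entries
        heapq.heapify(queue)
    return queue
```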
"y-good" node

Assumption: for any given node s and a gold output y, we can tell whether s can or cannot lead to y.

Definition: a node s is y-good if s can lead to y.

Suppose each y can be one of A, B, or C, and the true label is y = (y_1 = A, y_2 = B, y_3 = C).

[Figure: in the search tree from (-, -, -), nodes such as (A, -, -) and (-, B, -) are y-good, while (C, -, -), (A, A, -), …, (C, C, -), and complete outputs other than (A, B, C) are not.]
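In the slot-filling example this test is a one-liner; is_y_good is our name for it, not the paper's:

```python
def is_y_good(node: SearchNode, y: Tuple[str, ...]) -> bool:
    """A node is y-good iff its partial assignment agrees with the gold y
    everywhere it is filled in, so some completion can still reach y."""
    return all(s is None or s == g for s, g in zip(node.state, y))

# e.g. with gold y = ("A", "B", "C"): (A, -, -) is y-good, (C, -, -) is not.
```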
Learning in LaSO

• Search as if in the prediction phase, but when an error is made:
  – update w
  – clear the queue and insert all the correct moves
• Two kinds of errors:
  – Error type 1: nothing on the queue is y-good
  – Error type 2: the goal state reached is not y-good
Learning Algorithm in LaSO (skeleton)

Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if error:
            step 1: update w
            step 2: refresh queue
        else if GoalTest(node):
            return w
        else:
            next = Result(node, Actions(node))
            nodes = enqueue(problem, nodes, next, w)
What should learning do?

[Figure: a search path through y-good nodes 1, 2, and 3; at the last step, node 4 (y-good) and the current node are siblings.]

Say we detect an error (of either type) at the current node. We should have chosen node 4 instead of the current node: node 4 is the y-good sibling of the current node.
Learning Algorithm in LaSO (full)

Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if none of (node + nodes) is y-good,
           or GoalTest(node) and node is not y-good:
            sibs = siblings(node, y)
            w = update(w, x, sibs, {node, nodes})
            nodes = MakeQueue(sibs)
        else if GoalTest(node):
            return w
        else:
            next = Result(node, Actions(node))
            nodes = enqueue(problem, nodes, next, w)
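Putting the pieces together, a hedged Python sketch of this loop for one training example, reusing enqueue, is_y_good, and successors from earlier. y_good_siblings is one reading of "insert all the correct moves", and perceptron_update is sketched after the next slide; none of these names are the authors':

```python
def y_good_siblings(node, y):
    """The y-good siblings of `node` (children of its parent that can
    still reach y); for the root we fall back to the root itself."""
    if node.parent is None:
        return [node]
    return [c for c in successors(node.parent) if is_y_good(c, y)]

def laso_learn_one(x, y, w, beam=5):
    """One pass of the LaSO learning loop over a single example (x, y)."""
    queue = []
    enqueue(x, queue, [initial_node(len(y))], w, beam=beam)
    while queue:
        _, _, node = heapq.heappop(queue)
        rest = [entry[2] for entry in queue]
        error = (not any(is_y_good(n, y) for n in [node] + rest)) or \
                (is_goal(node) and not is_y_good(node, y))
        if error:
            sibs = y_good_siblings(node, y)
            w = perceptron_update(w, x, sibs, [node] + rest)
            queue = []
            enqueue(x, queue, sibs, w, beam=beam)    # refresh the queue
        elif is_goal(node):
            return w
        else:
            enqueue(x, queue, list(successors(node)), w, beam=beam)
    return w
```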
Parameter Updates

We need to specify w = update(w, x, sibs, nodes).

A simple perceptron-style update rule:

    w ← w + Δ,  where
    Δ = (1/|sibs|) Σ_{n ∈ sibs} Φ(x, n) − (1/|nodes|) Σ_{n ∈ nodes} Φ(x, n)

It comes with the usual perceptron-style mistake bound and generalization bound (see references).
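In code the update is a few lines; this sketch matches the rule above, with phi from the earlier block:

```python
def perceptron_update(w, x, sibs, nodes):
    """Move w toward the average features of the y-good siblings and
    away from the average features of what was on the queue."""
    good = np.mean([phi(x, n) for n in sibs], axis=0)
    bad = np.mean([phi(x, n) for n in nodes], axis=0)
    return w + (good - bad)
```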
2nd Framework: SEARN: Search and Learning [Hal Daumé III, John Langford, and Daniel Marcu, 2007]
Policy

• A policy is a mapping from a state to an action
  – For a given node, the policy tells us what action should be taken
• A policy gives a search path in the search space
  – Different policies mean different search paths
  – A policy can be thought of as the "driver" in the search space
• A policy may be deterministic, or may contain some randomness (more on this later)
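As a type, a policy in this setting might look like the following; the signature, taking the input x and the partial state and returning the label for the next unfilled slot, is an assumption we reuse in the later sketches:

```python
from typing import Any, Callable, Optional, Tuple

# A policy maps (input, partial state) to an action: the next label to assign.
Policy = Callable[[Any, Tuple[Optional[str], ...]], str]
```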
Reference Policy and Learned Policy

• We assume we already have a good reference policy π^ref for the training data (x, c)
  – i.e., examples associated with costs for outputs
• Goal: learn a good policy for test data, where we do not have access to the cost vector c (imitation learning)

For example, if we use Hamming distance for the cost vector c, the reference policy is trivial to compute. Why? Just make the right decision at every step: suppose the gold output is (A, B, C, A) and we are at state (A, C, -, -); the reference policy says the next action is to assign C to the third slot.
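A sketch of that trivial reference policy for the slot-filling setting (make_reference_policy is our name, and the returned policy follows the assumed (x, state) signature):

```python
def make_reference_policy(gold):
    """Reference policy under Hamming cost: the optimal action at any state
    is simply the gold label of the next unfilled slot."""
    def pi_ref(x, state):
        return gold[state.index(None)]
    return pi_ref

# e.g. gold = ("A", "B", "C", "A"): at state ("A", "C", None, None),
# make_reference_policy(gold)(x, state) returns "C" for the third slot.
```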
Cost-Sensitive Classification

Suppose we want to learn a classifier h that maps examples to one of K labels.

Standard multiclass classification
• Training data: examples paired with labels, (x, y) ∈ X × [K]
• Learning goal: find a classifier with low error, h = argmin_h Pr[h(x) ≠ y]

Cost-sensitive classification
• Training data: examples paired with cost vectors that list the cost of predicting each label, (x, c) ∈ X × [0, ∞)^K
• Learning goal: find a classifier with low expected cost, h = argmin_h E_{(x,c)}[c_{h(x)}]

Exercise: how would you design a cost-sensitive learner?

SEARN uses a cost-sensitive learner to learn a policy.
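One answer to the exercise, given as a hedged sketch: reduce cost-sensitive classification to K regression problems and predict the argmin-cost label. The class name and the choice of linear regression are assumptions, not SEARN's prescribed learner:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

class CostSensitiveClassifier:
    """Reduction sketch: regress the cost of each of the K labels,
    then predict the label with the lowest predicted cost."""
    def fit(self, X, C):   # X: (n, d) features, C: (n, K) cost vectors
        self.regs = [LinearRegression().fit(X, C[:, k]) for k in range(C.shape[1])]
        return self

    def predict(self, X):
        costs = np.stack([r.predict(X) for r in self.regs], axis=1)
        return costs.argmin(axis=1)   # label with the lowest predicted cost
```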
SEARN at test time

We have already learned a policy. We use it to construct the sequence of decisions y and obtain the final structured output:
1. Use the learned policy on the initial state (-, …, -) to compute y_1
2. Use the learned policy on the state (y_1, -, …, -) to compute y_2
3. Keep going until we get y = (y_1, …, y_n)
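These three steps amount to greedy decoding; a sketch under the assumed policy signature (searn_predict is our name):

```python
def searn_predict(x, policy, n):
    """Greedy decoding with a learned policy: repeatedly ask the policy
    for the next decision until the structure is complete."""
    state = (None,) * n                       # initial state (-, ..., -)
    for i in range(n):
        label = policy(x, state)              # decide the next component
        state = state[:i] + (label,) + state[i + 1:]
    return state
```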
SEARN at training time

• The core idea in training is to notice that at each decision step we are actually doing cost-sensitive classification
• Construct cost-sensitive classification examples (s, c) with state s and cost vector c (see the sketch below)
• Learn a cost-sensitive classifier (this is nothing but a policy)
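A sketch of how those examples might be collected along a single roll-in. features() is an assumed feature map over (x, state), LABELS is the action set from the first sketch, and rollout_cost is sketched after the next slide:

```python
import numpy as np

def searn_collect_examples(x, gold, policy, n):
    """One SEARN roll-in: follow `policy` through the n decisions and, at each
    visited state, record a cost-sensitive example (state features, cost vector)."""
    examples = []
    state = (None,) * n
    for i in range(n):
        # Cost of each possible action at this state, estimated by roll-out.
        costs = [rollout_cost(x, gold, state, i, a, policy, n) for a in LABELS]
        examples.append((features(x, state), np.array(costs)))   # features(): assumed
        state = state[:i] + (policy(x, state),) + state[i + 1:]  # roll in one step
    return examples
```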
Roll-in, Roll-out

• Roll-in: at each state, use some policy to move to a new state, tracing out a search path
• The question for learning: what is the cost of deviating from the policy at this step?
• Roll-out: after deviating with one action, let a policy complete the structure, so that the cost of the deviation can be measured