MAP Inference with MILP
10-418 / 10-618 Machine Learning for Structured Data
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Matt Gormley — Lecture 12, Oct. 7, 2019
Reminders
• Homework 2: BP for Syntax Trees
  – Out: Sat, Sep. 28
  – Due: Sat, Oct. 12 at 11:59pm
• Last chance to switch between 10-418 / 10-618 is October 7th (drop deadline)
• Today's after-class office hours are un-cancelled (i.e. I am having them)
MBR DECODING
Minimum Bayes Risk Decoding
• Suppose we are given a loss function $\ell(\hat{y}, y)$ and are asked for a single tagging
• How should we choose just one from our probability distribution $p(y \mid x)$?
• A minimum Bayes risk (MBR) decoder $h(x)$ returns the variable assignment with minimum expected loss under the model's distribution:

$h_\theta(x) = \operatorname{argmin}_{\hat{y}} \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[\ell(\hat{y}, y)] = \operatorname{argmin}_{\hat{y}} \sum_{y} p_\theta(y \mid x)\, \ell(\hat{y}, y)$
Minimum Bayes Risk Decoding

$h_\theta(x) = \operatorname{argmin}_{\hat{y}} \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[\ell(\hat{y}, y)]$

Consider some example loss functions:

The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments:

$\ell(\hat{y}, y) = \sum_{i=1}^{V} \left(1 - \mathbb{I}(\hat{y}_i, y_i)\right)$

The MBR decoder is:

$\hat{y}_i = h_\theta(x)_i = \operatorname{argmax}_{\hat{y}_i} p_\theta(\hat{y}_i \mid x)$

This decomposes across variables and requires the variable marginals.
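To make the Hamming-loss decoder concrete, here is a minimal sketch (not from the lecture) that brute-forces the variable marginals of a tiny joint distribution over three binary tags and then takes the per-variable argmax, as the slide prescribes. The names `scores`, `p`, and `marginals` and all the numbers are made-up toy values for illustration only.

```python
import itertools
from collections import defaultdict

# Toy unnormalized joint scores for p(y | x) over 3 binary variables (made-up numbers).
scores = {y: 1.0 for y in itertools.product([0, 1], repeat=3)}
scores[(1, 1, 0)] = 6.0   # a single high-scoring assignment
scores[(0, 1, 0)] = 4.0
scores[(0, 1, 1)] = 4.0

Z = sum(scores.values())
p = {y: s / Z for y, s in scores.items()}   # normalized joint distribution

# Variable marginals p(y_i | x), obtained by summing the joint.
marginals = [defaultdict(float) for _ in range(3)]
for y, prob in p.items():
    for i, y_i in enumerate(y):
        marginals[i][y_i] += prob

# Hamming-loss MBR decode: pick the argmax of each marginal independently.
y_mbr = tuple(max(m, key=m.get) for m in marginals)
print("Hamming-loss MBR decode:", y_mbr)   # (0, 1, 0) on this toy distribution
```

On this toy distribution the marginal-wise decode is (0, 1, 0), even though the single most probable joint assignment is (1, 1, 0), which previews the contrast with the 0-1 loss on the next slide.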
Minimum Bayes Risk Decoding

$h_\theta(x) = \operatorname{argmin}_{\hat{y}} \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[\ell(\hat{y}, y)]$

Consider some example loss functions:

The 0-1 loss function returns 0 if the two assignments are identical and 1 otherwise:

$\ell(\hat{y}, y) = 1 - \mathbb{I}(\hat{y}, y)$

The MBR decoder is:

$h_\theta(x) = \operatorname{argmin}_{\hat{y}} \sum_{y} p_\theta(y \mid x)\left(1 - \mathbb{I}(\hat{y}, y)\right) = \operatorname{argmax}_{\hat{y}} p_\theta(\hat{y} \mid x)$

which is exactly the MAP inference problem!
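Continuing the toy sketch above (reusing the assumed `p` dictionary), the 0-1 loss MBR decode is just the argmax of the joint, i.e. the MAP assignment, and on this toy distribution it differs from the Hamming-loss decode — which is exactly why the choice of loss matters.

```python
# 0-1 loss MBR decode: argmax of the joint p(y | x), i.e. the MAP assignment.
y_map = max(p, key=p.get)
print("0-1 loss MBR (MAP) decode:", y_map)   # (1, 1, 0), unlike the Hamming decode (0, 1, 0)
```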
LINEAR PROGRAMMING & INTEGER LINEAR PROGRAMMING
Linear Programming
Whiteboard:
– Example of Linear Program in 2D
– LP Standard Form
– Converting an LP to Standard Form
– LP and its Polytope
– Simplex algorithm (tableau method)
– Interior points algorithm(s)
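The 2D example itself lives on the whiteboard, so here is a hedged stand-in: a small LP in inequality form (made-up coefficients), solved with SciPy's `linprog`. `linprog` minimizes, so the objective is negated to express a maximization.

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2D LP (made-up coefficients):
#   maximize  3*x1 + 2*x2
#   subject to  x1 + x2   <= 4
#               x1 + 3*x2 <= 6
#               x1, x2 >= 0
c = np.array([3.0, 2.0])
A_ub = np.array([[1.0, 1.0],
                 [1.0, 3.0]])
b_ub = np.array([4.0, 6.0])

# linprog minimizes, so negate c to maximize.
res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal x:", res.x)          # a vertex of the feasible polytope
print("optimal value:", -res.fun)   # flip the sign back
```

The optimum lands on a vertex of the feasible polytope, which is the geometric fact the simplex algorithm exploits.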
Integer Linear Programming
Whiteboard:
– Example of an ILP in 2D
– Example of an MILP in 2D
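As a stand-in for the whiteboard ILP, here is a sketch that declares a tiny 2D integer program and hands it to `scipy.optimize.milp` (assuming SciPy ≥ 1.9, where that HiGHS-backed solver is available). All coefficients are made up for illustration; the point is that the LP relaxation's optimum is fractional while the best integer point scores strictly worse.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Toy 2D ILP (made-up coefficients):
#   maximize  x1 + x2
#   subject to  2*x1 + x2 <= 4
#               x1 + 2*x2 <= 4
#               x1, x2 >= 0 and integer
c = np.array([1.0, 1.0])
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

res = milp(
    c=-c,                                      # milp minimizes, so negate to maximize
    constraints=LinearConstraint(A, -np.inf, [4.0, 4.0]),
    integrality=np.ones(2),                    # both variables must be integer
    bounds=Bounds(0, np.inf),
)
print("ILP optimum:", res.x, "value:", -res.fun)
# The LP relaxation has the fractional optimum (4/3, 4/3) with value 8/3,
# while the best integer point only achieves value 2 -- the gap that the
# branch-and-bound machinery below is designed to close.
```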
Background: Nonconvex Global Optimization
Goal: optimize over the blue surface.
Background: Nonconvex Global Optimization
Relaxation: provides an upper bound on the surface.
Background: Nonconvex Global Optimization
Branching: partitions the search space into subspaces, and enables tighter relaxations.
(Here the figure branches on X1 ≤ 0.0 vs. X1 ≥ 0.0.)
Background: Nonconvex Global Optimization
The max of all relaxed solutions for each of the partitions is a global upper bound.
Background: Nonconvex Global Optimization
We can project a relaxed solution onto the feasible region.
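The slides do not spell out the projection operator. One simple choice for an integer program — offered here purely as an assumption, not as the lecture's actual method — is to round each fractional coordinate and keep the rounded point only if it still satisfies the constraints:

```python
import numpy as np

def project_by_rounding(x_relaxed, A_ub, b_ub):
    """Round a fractional LP solution to the nearest integer point and check
    feasibility against A_ub @ x <= b_ub, x >= 0.  Returns None if rounding
    leaves the feasible region.  (A naive projection; real systems do better.)"""
    x_int = np.round(x_relaxed)
    if np.all(A_ub @ x_int <= b_ub + 1e-9) and np.all(x_int >= 0):
        return x_int
    return None

# Example with the toy ILP from earlier (assumed constraint matrix).
A_ub = np.array([[2.0, 1.0], [1.0, 2.0]])
b_ub = np.array([4.0, 4.0])
print(project_by_rounding(np.array([4/3, 4/3]), A_ub, b_ub))  # -> [1. 1.]
```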
Background: Nonconvex Global Optimization
The incumbent is ε-optimal if the relative difference between the global upper bound and the incumbent score is less than ε.
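A common way to operationalize that stopping rule (the exact formula is an assumption, since the slide does not give one) is to compute the relative gap between the bound and the incumbent:

```python
def is_eps_optimal(upper_bound, incumbent, eps=1e-4):
    """Relative-gap test for a maximization problem: stop when the best
    possible improvement over the incumbent is a tiny fraction of its value."""
    gap = (upper_bound - incumbent) / max(abs(incumbent), 1e-12)
    return gap < eps

print(is_eps_optimal(upper_bound=2.0002, incumbent=2.0, eps=1e-3))  # True
```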
How much should we subdivide?
How much should we subdivide?
BRANCH-AND-BOUND
• Method for recursively subdividing the search space
• Subspace order can be determined heuristically (e.g. best-first search with depth-first plunging; see the sketch below)
• Prunes subspaces that can't yield better solutions
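To illustrate the node-ordering idea (a generic sketch, not the specific strategy used in the lecture's experiments), a best-first frontier keyed on each subproblem's relaxation bound pops the most promising subspace first, and any node whose bound cannot beat the incumbent is discarded without further work. The subproblem names and bound values below are made up.

```python
import heapq

# Each entry: (negated relaxation bound, subproblem).  heapq is a min-heap,
# so negating the bound gives best-first order for a maximization problem.
frontier = []
heapq.heappush(frontier, (-2.8, "Q1"))   # bound 2.8 (toy numbers)
heapq.heappush(frontier, (-2.1, "Q2"))   # bound 2.1
incumbent_value = 2.0

while frontier:
    neg_bound, node = heapq.heappop(frontier)
    if -neg_bound <= incumbent_value:
        continue                          # prune: cannot beat the incumbent
    print("exploring", node, "with bound", -neg_bound)
    # ... solve the node's relaxation, possibly update the incumbent, branch ...
```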
Background: Nonconvex Global Optimization
If the subspace upper bound is worse than the current incumbent, we can prune that subspace.
Limitations: Branch-and-Bound for the Viterbi Objective
• The Viterbi Objective
  – Nonconvex
  – NP-hard to solve (Cohen & Smith, 2010)
• Branch-and-bound
  – Kind of tricky to get it right…
  – Curse of dimensionality kicks in quickly
• Nonconvex quadratic optimization by LP-based branch-and-bound usually fails with more than 80 variables (Burer and Vandenbussche, 2009)
• Our smallest (toy) problems have hundreds of variables
• Preview of Experiments
  – We solve 5 sentences, but on 200 sentences, we couldn't run to completion
  – Our (hybrid) global search framework incorporates local search
  – This hybrid approach sometimes finds higher likelihood (and higher accuracy) solutions than pure local search
BRANCH-AND-BOUND INGREDIENTS
– Mathematical Program
– Relaxation
– Projection
– (Branch-and-Bound Search Heuristics)
Background: Nonconvex Global Optimization
We solve the relaxation using the Simplex algorithm.
Background: Nonconvex Global Optimization
We can project a relaxed solution onto the feasible region.
Integer Linear Programming
Whiteboard:
– Branch and bound for an ILP in 2D
Branch and Bound
Algorithm 2.1 Branch-and-bound
Input: Minimization problem instance R.
Output: Optimal solution x⋆ with value c⋆, or conclusion that R has no solution, indicated by c⋆ = ∞.
1. Initialize L := {R}, ĉ := ∞. [init]
2. If L = ∅, stop and return x⋆ = x̂ and c⋆ = ĉ. [abort]
3. Choose Q ∈ L, and set L := L \ {Q}. [select]
4. Solve a relaxation Q_relax of Q. If Q_relax is empty, set č := ∞. Otherwise, let x̌ be an optimal solution of Q_relax and č its objective value. [solve]
5. If č ≥ ĉ, goto Step 2. [bound]
6. If x̌ is feasible for R, set x̂ := x̌, ĉ := č, and goto Step 2. [check]
7. Split Q into subproblems Q = Q_1 ∪ … ∪ Q_k, set L := L ∪ {Q_1, …, Q_k}, and goto Step 2. [branch]
Slide from Achterberg (thesis, 2007)
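To connect the pseudocode to the earlier LP material, here is a minimal Python sketch (my own illustration, not code from the lecture or from Achterberg) that follows Algorithm 2.1 for a pure ILP in minimization form: the relaxation is the LP solved with `scipy.optimize.linprog`, branching splits on a fractional variable (as in the Figure 2.2 slide below), and the per-variable bounds play the role of the subproblems Q. The function name and the toy instance are assumptions.

```python
import math
import numpy as np
from scipy.optimize import linprog

def branch_and_bound_ilp(c, A_ub, b_ub, bounds, tol=1e-6):
    """Algorithm 2.1 specialized to: min c.x  s.t.  A_ub x <= b_ub, x integer,
    with per-variable (lower, upper) bounds defining each subproblem Q."""
    L = [bounds]                       # 1. [init] list of unsolved subproblems
    x_hat, c_hat = None, math.inf      #    incumbent x̂ and its value ĉ

    while L:                           # 2. [abort] when L is empty
        Q = L.pop()                    # 3. [select] (here: depth-first)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=Q)  # 4. [solve] LP relaxation
        if not res.success:            #    infeasible relaxation -> č = ∞
            continue
        x_check, c_check = res.x, res.fun
        if c_check >= c_hat:           # 5. [bound] relaxation can't beat incumbent
            continue
        frac = [i for i, v in enumerate(x_check)
                if abs(v - round(v)) > tol]
        if not frac:                   # 6. [check] integral -> feasible for R
            x_hat, c_hat = np.round(x_check), c_check
            continue
        i = frac[0]                    # 7. [branch] split on a fractional variable
        lo, hi = Q[i]
        left, right = list(Q), list(Q)
        left[i] = (lo, math.floor(x_check[i]))    # Q1: x_i <= floor(value)
        right[i] = (math.ceil(x_check[i]), hi)    # Q2: x_i >= ceil(value)
        L.extend([left, right])

    return x_hat, c_hat

# Toy run: minimize -(x1 + x2) (i.e. maximize x1 + x2) subject to the earlier
# made-up constraints 2*x1 + x2 <= 4 and x1 + 2*x2 <= 4, with 0 <= x1, x2 <= 4.
x_star, c_star = branch_and_bound_ilp(
    c=np.array([-1.0, -1.0]),
    A_ub=np.array([[2.0, 1.0], [1.0, 2.0]]),
    b_ub=np.array([4.0, 4.0]),
    bounds=[(0, 4), (0, 4)],
)
print("optimal integer solution:", x_star, "objective:", c_star)
```

Using a stack (`L.pop()`) gives depth-first selection; swapping it for the best-first heap sketched earlier changes only the [select] step, exactly as the algorithm's modular statement suggests.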
Branch and Bound
[Figure: the branch-and-bound search tree — root node R, solved and unsolved subproblems, the current subproblem Q, new subproblems Q1 … Qk, pruned subproblems, and a feasible solution.]
Slide from Achterberg (thesis, 2007)
Branch and Bound
[Figure 2.2: LP-based branching on a single fractional variable — the relaxed optimum x̌ of Q is split into subproblems Q1 and Q2.]
Slide from Achterberg (thesis, 2007)