Structure and Support Vector Machines SPFLODD October - PowerPoint PPT Presentation

Structure and Support Vector Machines
SPFLODD
October 31, 2013

Outline
• SVMs for structured outputs
  – Declarative view
  – Procedural view

Warning: Math Ahead

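Notation for Linear Models
• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$
• Testing data: $\{(\mathbf{x}_{N+1}, \mathbf{y}_{N+1}), \ldots, (\mathbf{x}_{N+N'}, \mathbf{y}_{N+N'})\}$
• Feature function: $\mathbf{g}$
• Weights: $\mathbf{w}$
• Decoding:
$$\mathrm{decode}(\mathbf{w}, \mathbf{x}) = \arg\max_{\mathbf{y}} \mathbf{w}^\top \mathbf{g}(\mathbf{x}, \mathbf{y})$$
• Learning:
$$\mathrm{learn}\left(\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}\right) = \arg\max_{\mathbf{w}} \Phi\left(\mathbf{w}, \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}\right)$$
• Evaluation:
$$\frac{1}{N'} \sum_{i=1}^{N'} \mathrm{cost}\left(\mathrm{decode}\left(\mathrm{learn}\left(\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}\right), \mathbf{x}_{N+i}\right), \mathbf{y}_{N+i}\right)$$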
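The notation above can be made concrete with a small sketch. It assumes a finite output set that can be enumerated by brute force, which the slides do not require; the label set `LABELS`, the block feature map `g`, the weights, and the toy test data are all hypothetical and chosen only for illustration.

```python
def dot(w, v):
    """Inner product w^T v for two equal-length vectors."""
    return sum(wi * vi for wi, vi in zip(w, v))


def decode(w, x, outputs, g):
    """Return argmax_y w^T g(x, y) by brute force over a finite output set."""
    return max(outputs, key=lambda y: dot(w, g(x, y)))


def evaluate(w, test_data, outputs, g, cost):
    """Average cost of the decoded outputs on held-out (x, y) pairs."""
    total = sum(cost(decode(w, x, outputs, g), y) for x, y in test_data)
    return total / len(test_data)


# Hypothetical toy problem: x is a 2-d point, y is one of three labels.
LABELS = ["a", "b", "c"]


def g(x, y):
    """Block feature map: copy x into the slot that belongs to label y."""
    feats = [0.0] * (len(x) * len(LABELS))
    offset = LABELS.index(y) * len(x)
    for k, xk in enumerate(x):
        feats[offset + k] = xk
    return feats


def zero_one_cost(y_pred, y_true):
    """0-1 cost: every incorrect output is equally bad."""
    return 0.0 if y_pred == y_true else 1.0


# Weights are fixed by hand here; a learner would normally produce them.
w = [1.0, 0.0, 0.0, 1.0, -1.0, -1.0]
test_data = [((2.0, 0.5), "a"), ((0.1, 3.0), "b"), ((1.0, 2.0), "c")]

print(decode(w, (2.0, 0.5), LABELS, g))
print(evaluate(w, test_data, LABELS, g, zero_one_cost))
```

For structured outputs the brute-force `max` would be replaced by a dynamic program or another combinatorial search; the interface stays the same.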

The Ideal Loss Function
• Convex
• Continuous
• Cost-aware

Cost and Margin
• The “margin” is an important concept when we take the linear models point of view.
  – A “large margin” means that the correct output is well-separated from the incorrect outputs.
• Neither log loss nor “perceptron loss” takes into account the cost function, though.
  – In other words, some incorrect outputs are worse than others.

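Multiclass SVM (Crammer and Singer, 2001)
$$\max_{\mathbf{w}} \gamma \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1, \quad \forall i, \forall \mathbf{y}: \; \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}) \ge \begin{cases} \gamma & \text{if } \mathbf{y} \ne \mathbf{y}_i \\ 0 & \text{otherwise} \end{cases}$$
• The above can be understood as a 0-1 cost; let’s generalize a bit:
$$\max_{\mathbf{w}} \gamma \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1, \quad \forall i, \forall \mathbf{y}: \; \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}) \ge \gamma \, \mathrm{cost}(\mathbf{y}, \mathbf{y}_i)$$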
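To see what the cost-scaled constraint asks for, the following sketch computes the largest γ that a given weight vector achieves after rescaling it to unit norm. It is only an illustration: the output set is enumerated by brute force, every incorrect output is assumed to have strictly positive cost, and the labels, feature map `g`, and cost table are hypothetical.

```python
import math


def dot(w, v):
    """Inner product w^T v."""
    return sum(wi * vi for wi, vi in zip(w, v))


def achieved_margin(w, data, outputs, g, cost):
    """Largest gamma such that, for the unit-norm version of w,
    w^T g(x_i, y_i) - w^T g(x_i, y) >= gamma * cost(y, y_i) for all i and y.

    Assumption: cost(y, y_i) > 0 whenever y != y_i, so only those
    outputs bound gamma (the constraint for y = y_i reads 0 >= 0).
    """
    norm = math.sqrt(dot(w, w))
    unit = [wi / norm for wi in w]
    gamma = math.inf
    for x, y_true in data:
        true_score = dot(unit, g(x, y_true))
        for y in outputs:
            c = cost(y, y_true)
            if c > 0:
                gamma = min(gamma, (true_score - dot(unit, g(x, y))) / c)
    return gamma


LABELS = ["a", "b", "c"]  # hypothetical label set


def g(x, y):
    """Block feature map: copy x into the slot that belongs to label y."""
    feats = [0.0] * (len(x) * len(LABELS))
    offset = LABELS.index(y) * len(x)
    for k, xk in enumerate(x):
        feats[offset + k] = xk
    return feats


def cost(y, y_true):
    """Hypothetical cost: confusions involving label 'c' are twice as bad."""
    if y == y_true:
        return 0.0
    return 2.0 if "c" in (y, y_true) else 1.0


w = [3.0, 0.0, 0.0, 4.0, 0.0, 0.0]
data = [((1.0, 0.0), "a"), ((0.0, 1.0), "b")]
gamma = achieved_margin(w, data, LABELS, g, cost)
print(gamma)
```

Because the required separation grows with the cost, the costly confusions with label `c` end up determining γ in this toy setting.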

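Max-Margin Markov Networks
• Starting point: multiclass SVM (Crammer and Singer, 2001)
$$\max_{\mathbf{w}} \gamma \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1, \quad \forall i, \forall \mathbf{y}: \; \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}) \ge \gamma \, \mathrm{cost}(\mathbf{y}, \mathbf{y}_i)$$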

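Max-Margin Markov Networks
• Standard transformation to get rid of explicit mention of γ, plus slack variables in case the constraints cannot be met:
$$\min_{\mathbf{w}} \frac{C}{2} \|\mathbf{w}\|_2^2 + \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i, \forall \mathbf{y}: \; \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}) \ge \mathrm{cost}(\mathbf{y}, \mathbf{y}_i) - \xi_i$$
• Notice:
$$\forall i, \forall \mathbf{y}: \; \xi_i \ge -\mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}_i) + \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}) + \mathrm{cost}(\mathbf{y}, \mathbf{y}_i)$$
$$\forall i: \; \xi_i \ge \max_{\mathbf{y}} \left[ -\mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}_i) + \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}) + \mathrm{cost}(\mathbf{y}, \mathbf{y}_i) \right]$$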

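Max-Margin Markov Networks
• Having solved for the slack variables, we can plug in; we now have an unconstrained problem:
$$\min_{\mathbf{w}} \frac{C}{2} \|\mathbf{w}\|_2^2 + \sum_{i=1}^{N} \left[ -\mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}_i) + \max_{\mathbf{y}} \left( \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}) + \mathrm{cost}(\mathbf{y}, \mathbf{y}_i) \right) \right]$$
• Ratliff, Bagnell, and Zinkevich (2007): subgradient descent (or a stochastic version) is a much, much simpler approach to optimizing this function.
  – And more perceptron-like! For one example $(\mathbf{x}, \mathbf{y})$, the subgradient of the loss term with respect to $w_j$ is:
$$-g_j(\mathbf{x}, \mathbf{y}) + g_j(\mathbf{x}, \mathrm{cost\_augmented\_decode}(\mathbf{w}, \mathbf{x}, \mathbf{y}))$$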

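Structured Hinge Loss
• Small change to the perceptron loss:
$$L(\mathbf{w}, \mathbf{x}, \mathbf{y}) = -\mathbf{w}^\top \mathbf{g}(\mathbf{x}, \mathbf{y}) + \max_{\mathbf{y}'} \left[ \mathbf{w}^\top \mathbf{g}(\mathbf{x}, \mathbf{y}') + \mathrm{cost}(\mathbf{y}', \mathbf{y}) \right]$$
• Resulting subgradient (with respect to $w_j$):
$$-g_j(\mathbf{x}, \mathbf{y}) + g_j(\mathbf{x}, \mathrm{cost\_augmented\_decode}(\mathbf{w}, \mathbf{x}, \mathbf{y}))$$
  – Rather than merely decoding, find a candidate $\mathbf{y}'$ that is both high-scoring and dangerous.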
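The loss and its subgradient can be sketched in a few lines, again assuming a small output set that is enumerated by brute force. The function names, the `reg` parameter (standing in for one example's share of the regularizer), the step size, and the toy data are assumptions made for this illustration, not part of the slides.

```python
def dot(w, v):
    """Inner product w^T v."""
    return sum(wi * vi for wi, vi in zip(w, v))


def cost_augmented_decode(w, x, y, outputs, g, cost):
    """argmax over y' of w^T g(x, y') + cost(y', y), by brute force."""
    return max(outputs, key=lambda yp: dot(w, g(x, yp)) + cost(yp, y))


def hinge_loss_and_subgradient(w, x, y, outputs, g, cost):
    """Structured hinge loss for one example and a subgradient w.r.t. w."""
    y_hat = cost_augmented_decode(w, x, y, outputs, g, cost)
    g_true, g_hat = g(x, y), g(x, y_hat)
    loss = -dot(w, g_true) + dot(w, g_hat) + cost(y_hat, y)
    subgrad = [gh - gt for gt, gh in zip(g_true, g_hat)]
    return loss, subgrad


def sgd_step(w, x, y, outputs, g, cost, step_size, reg=0.0):
    """One stochastic subgradient step; reg * w is the (assumed)
    per-example share of the gradient of the squared-norm regularizer."""
    _, subgrad = hinge_loss_and_subgradient(w, x, y, outputs, g, cost)
    return [wi - step_size * (reg * wi + sj) for wi, sj in zip(w, subgrad)]


LABELS = ["a", "b", "c"]  # hypothetical label set


def g(x, y):
    """Block feature map: copy x into the slot that belongs to label y."""
    feats = [0.0] * (len(x) * len(LABELS))
    offset = LABELS.index(y) * len(x)
    for k, xk in enumerate(x):
        feats[offset + k] = xk
    return feats


def zero_one_cost(y_pred, y_true):
    return 0.0 if y_pred == y_true else 1.0


x, y = (1.0, 2.0), "c"
w0 = [0.0] * 6
loss_before, _ = hinge_loss_and_subgradient(w0, x, y, LABELS, g, zero_one_cost)
w1 = sgd_step(w0, x, y, LABELS, g, zero_one_cost, step_size=0.5)
loss_after, _ = hinge_loss_and_subgradient(w1, x, y, LABELS, g, zero_one_cost)
print(loss_before, loss_after)
```

The only difference from a perceptron step is the call to `cost_augmented_decode` in place of plain decoding.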

Structured Hinge
• Three different lines of work all arrived at this idea, or something very close.
  – Max-margin Markov networks (Taskar, Guestrin, and Koller, 2003)
  – Structural support vector machines (Tsochantaridis, Joachims, Hofmann, and Altun, 2005)
  – Online passive-aggressive algorithms (Crammer, Keshet, Dekel, Shalev-Shwartz, and Singer, 2006)
• Important developments in optimization techniques since then!
  – I’ll highlight what I think is most useful to know.

I’m Taking Liberties
• The M$^3$N view of the world really thinks about outputs as configurations in a Markov network.
• They assume $\mathbf{y}$ corresponds to a set of random variables, each of which gets a label in a finite set.
• Their cost function is Hamming cost: “how many random variables do I predict incorrectly?”
  – This is convenient and makes sense for their applications. But it’s not as general as it could be.

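Cost-Augmented Decoding
$$\mathrm{decode}(\mathbf{w}, \mathbf{x}) = \arg\max_{\mathbf{y}'} \mathbf{w}^\top \mathbf{g}(\mathbf{x}, \mathbf{y}')$$
$$\mathrm{cost\_augmented\_decode}(\mathbf{w}, \mathbf{x}, \mathbf{y}) = \arg\max_{\mathbf{y}'} \left[ \mathbf{w}^\top \mathbf{g}(\mathbf{x}, \mathbf{y}') + \mathrm{cost}(\mathbf{y}', \mathbf{y}) \right]$$
• Efficient decoding is possible when the features factor locally:
$$\mathbf{g}(\mathbf{x}, \mathbf{y}) = \sum_{p} \mathbf{f}(\mathbf{x}, \mathrm{part}_p(\mathbf{y}))$$
• Efficient cost-augmented decoding requires that the cost function break into parts the same way:
$$\mathrm{cost}(\mathbf{y}', \mathbf{y}) = \sum_{p} \mathrm{local\_cost}(\mathrm{part}_p(\mathbf{y}'), \mathbf{y})$$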
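One cost that factors this way is Hamming cost over the positions of a sequence. The sketch below assumes a first-order sequence model whose local scores have already been computed: `emit[t][k]` and `trans[j][k]` are hypothetical score tables standing in for the local terms of $\mathbf{w}^\top \mathbf{f}$. Cost-augmented decoding then amounts to adding 1 to the local score of every label that disagrees with the gold label and running ordinary Viterbi.

```python
def viterbi(emit, trans):
    """Best label sequence under emit[t][k] + trans[j][k] local scores.

    emit[t][k]: score of label k at position t.
    trans[j][k]: score of label j followed by label k.
    Returns (path, score) with labels as integer indices.
    """
    n, num_labels = len(emit), len(emit[0])
    best = [list(emit[0])]
    back = []
    for t in range(1, n):
        row, ptr = [], []
        for k in range(num_labels):
            j = max(range(num_labels), key=lambda j: best[-1][j] + trans[j][k])
            row.append(best[-1][j] + trans[j][k] + emit[t][k])
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    k = max(range(num_labels), key=lambda k: best[-1][k])
    score = best[-1][k]
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return path, score


def cost_augmented_viterbi(emit, trans, gold):
    """Viterbi on scores augmented with Hamming cost against gold labels."""
    augmented = [
        [s + (0.0 if k == gold[t] else 1.0) for k, s in enumerate(row)]
        for t, row in enumerate(emit)
    ]
    return viterbi(augmented, trans)


# Hypothetical local scores for a length-3 sequence with two labels.
emit = [[2.0, 0.0], [0.25, 0.0], [0.0, 0.5]]
trans = [[0.25, 0.0], [0.0, 0.25]]
gold = [0, 0, 1]

plain_path, plain_score = viterbi(emit, trans)
aug_path, aug_score = cost_augmented_viterbi(emit, trans, gold)
print(plain_path, plain_score)
print(aug_path, aug_score)
```

In this toy setting the plain decoder recovers the gold sequence, while the cost-augmented decoder flips the two positions where the model is least confident and keeps the confident first label.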

An Exercise
• If the features are such that we can use the Viterbi algorithm for decoding, what are some cost functions we could use inside an efficient cost-augmented decoding algorithm that’s a very small change to Viterbi?

Max-Margin Markov Networks
• Taskar et al. actually work through a dual version of the problem.
  – Primal and dual are both QPs; exponentially many constraints or variables, respectively.
• Key trick: factored dual.
  – Enables kernelized factors in the MN.
  – Actual algorithm is sequential minimal optimization (SMO) for SVMs, a coordinate ascent method (Platt, 1999).
• The paper includes a generalization bound that is argued to improve over the Collins perceptron.
• Experiments: handwriting recognition, text classification for hyperlinked documents.

Structural SVM
• Tsochantaridis et al. (2005): extends their 2004 paper.
• Slightly different version of the loss function:
$$\min_{\mathbf{w}} \frac{C}{2} \|\mathbf{w}\|_2^2 + \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i, \forall \mathbf{y}: \; \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{w}^\top \mathbf{g}(\mathbf{x}_i, \mathbf{y}) \ge 1 - \frac{\xi_i}{\mathrm{cost}(\mathbf{y}, \mathbf{y}_i)}$$
  – Alternative version of cost-augmented decoding (“slack rescaling” as opposed to Taskar et al.’s “margin rescaling”)
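The difference between the two formulations shows up when each constraint is solved for the slack of a single example. The sketch below does this by brute force over a handful of candidate outputs; the scores and costs are hypothetical, and the slack-rescaled version assumes nonnegative slack and skips the gold output, whose cost is zero.

```python
def margin_rescaled_slack(scores, costs, gold):
    """Smallest slack satisfying score[gold] - score[y] >= cost[y] - slack
    for every candidate y (margin rescaling)."""
    return max(costs[y] - (scores[gold] - scores[y]) for y in scores)


def slack_rescaled_slack(scores, costs, gold):
    """Smallest nonnegative slack satisfying
    score[gold] - score[y] >= 1 - slack / cost[y] for every y != gold
    (slack rescaling), assuming cost[y] > 0 for those y."""
    worst = max(
        costs[y] * (1.0 - (scores[gold] - scores[y]))
        for y in scores
        if y != gold
    )
    return max(0.0, worst)


# Hypothetical model scores w^T g(x, y) and costs against the gold output.
scores = {"gold": 2.0, "near": 1.5, "far": 0.0}
costs = {"gold": 0.0, "near": 1.0, "far": 4.0}

margin_slack = margin_rescaled_slack(scores, costs, "gold")
slack_slack = slack_rescaled_slack(scores, costs, "gold")
print(margin_slack, slack_slack)
```

In this example the high-cost output `far` is already separated from the gold output by more than 1, so it contributes nothing under slack rescaling, whereas margin rescaling still demands a separation equal to its full cost.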

Optimization Algorithms for SSVMs
• Taskar et al. (2003): SMO based on factored dual
• Bartlett et al. (2004) and Collins et al. (2008): exponentiated gradient
• Tsochantaridis et al. (2005): cutting planes (based on dual)
• Taskar et al. (2005): dual extragradient
Easiest to use, in my opinion:
• Ratliff et al. (2006): (stochastic) subgradient descent
• Crammer et al. (2006): online “passive-aggressive” algorithms

“Passive Aggressive” Learners
• Starting point is the perceptron, and the focus is on the step size.
• In NLP, people often use a specific instance called “1-best MIRA” (margin infused relaxation algorithm).
  – Sometimes with regular decoding, sometimes cost-augmented decoding.
• I do not understand the name.
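The slides do not spell out the update, so the following is a hedged sketch of one common form: a perceptron-style step whose size is the loss divided by the squared norm of the feature difference, capped at an aggressiveness constant `C` (in the spirit of the PA-I variant of Crammer et al., 2006). The brute-force decoder, the function name `mira_update`, and the toy data are assumptions made for illustration.

```python
def dot(w, v):
    """Inner product w^T v."""
    return sum(wi * vi for wi, vi in zip(w, v))


def mira_update(w, x, y, outputs, g, cost, C=1.0, cost_augmented=True):
    """One passive-aggressive style update on a single example.

    Passive: if the margin constraint for the decoded output already
    holds, w is returned unchanged. Aggressive: otherwise take the
    smallest step (capped at C) that would satisfy the constraint.
    """
    if cost_augmented:
        y_hat = max(outputs, key=lambda yp: dot(w, g(x, yp)) + cost(yp, y))
    else:
        y_hat = max(outputs, key=lambda yp: dot(w, g(x, yp)))
    delta = [gt - gh for gt, gh in zip(g(x, y), g(x, y_hat))]
    loss = cost(y_hat, y) - dot(w, delta)
    norm_sq = dot(delta, delta)
    if loss <= 0.0 or norm_sq == 0.0:
        return list(w)
    tau = min(C, loss / norm_sq)
    return [wi + tau * di for wi, di in zip(w, delta)]


LABELS = ["a", "b", "c"]  # hypothetical label set


def g(x, y):
    """Block feature map: copy x into the slot that belongs to label y."""
    feats = [0.0] * (len(x) * len(LABELS))
    offset = LABELS.index(y) * len(x)
    for k, xk in enumerate(x):
        feats[offset + k] = xk
    return feats


def zero_one_cost(y_pred, y_true):
    return 0.0 if y_pred == y_true else 1.0


x, y = (1.0, 2.0), "c"
w0 = [0.0] * 6
w1 = mira_update(w0, x, y, LABELS, g, zero_one_cost, C=1.0)

# After the step, the gold output beats the previously decoded output "a"
# by (up to rounding) exactly its cost.
delta_a = [gt - gh for gt, gh in zip(g(x, "c"), g(x, "a"))]
print(dot(w1, delta_a))
```

Setting `cost_augmented=False` gives the variant that uses regular decoding; the step-size rule is unchanged.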
