extended version: Biehl-Part1.pdf
Support Vector Machine (streamlined)
Michael Biehl
Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence
University of Groningen
www.cs.rug.nl/biehl
the storage problem revisited

Solving the perceptron storage problem: re-write the problem ...

consider a given data set $D = \{ \vec{\xi}^\mu, S^\mu \}$ ... find a vector $\vec{w} \in \mathbb{R}^N$ with $\mathrm{sign}(\vec{w}\cdot\vec{\xi}^\mu) = S^\mu$ for all $\mu$

Note:  $\mathrm{sign}(\vec{w}\cdot\vec{\xi}^\mu) = S^\mu \;\Leftrightarrow\; \mathrm{sign}(\vec{w}\cdot\vec{\xi}^\mu S^\mu) = 1 \;\Leftrightarrow\; E^\mu \equiv \vec{w}\cdot\vec{\xi}^\mu S^\mu > 0$   (local potentials $E^\mu$)

equivalent problem: solve a set of linear inequalities (in $\vec{w}$):
find a vector $\vec{w}$ with $E^\mu = \vec{w}\cdot\vec{\xi}^\mu S^\mu \ge c > 0$ for all $\mu$

Note that the actual value of $c$ is irrelevant.
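As a quick illustration of the local potentials, here is a minimal numpy sketch (not part of the original slides; the array names and the toy data are illustrative): it computes $E^\mu = \vec{w}\cdot\vec{\xi}^\mu S^\mu$ for all examples and checks whether a given $\vec{w}$ solves the storage problem.

```python
import numpy as np

def local_potentials(w, xi, S):
    """Local potentials E^mu = w . xi^mu S^mu, one per example (rows of xi)."""
    return (xi @ w) * S

# illustrative toy data: P random examples in N dimensions with labels +/-1
rng = np.random.default_rng(0)
P, N = 20, 10
xi = rng.normal(size=(P, N))
S = np.where(rng.normal(size=P) > 0, 1.0, -1.0)

w = rng.normal(size=N)
E = local_potentials(w, xi, S)
print("storage problem solved:", bool(np.all(E > 0)))
```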
solving equations?

Instead of inequalities, try to solve P equations for N unknowns:

$E^\mu = \sum_{j=1}^{N} w_j \, \xi_j^\mu \, S^\mu = 1$  for all  $\mu = 1, 2, \ldots, P$

(A) if no solution exists, find an approximation by least square deviation:

minimize  $f = \frac{1}{2} \sum_{\mu=1}^{P} (1 - E^\mu)^2$

minimization, e.g. by means of gradient descent, with

$\nabla_{\vec{w}} f = - \sum_{\mu=1}^{P} (1 - E^\mu)\, \vec{\xi}^\mu S^\mu$
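A minimal numpy sketch of approach (A), assuming the examples are stored as rows of an array `xi` with labels `S` in {−1, +1}; the step size and iteration count are arbitrary choices, not values from the lecture:

```python
import numpy as np

def lstsq_gradient_descent(xi, S, eta=0.01, steps=500):
    """Minimize f = 1/2 * sum_mu (1 - E^mu)^2 by batch gradient descent,
    with E^mu = w . xi^mu S^mu and gradient -sum_mu (1 - E^mu) xi^mu S^mu."""
    P, N = xi.shape
    w = np.zeros(N)
    for _ in range(steps):
        E = (xi @ w) * S
        grad = -((1.0 - E) * S) @ xi      # shape (N,)
        w -= eta * grad
    return w
```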
solving equations?

(B) if the system is under-determined → find a unique solution:

minimize  $\frac{1}{2} |\vec{w}|^2$  under the constraints  $\{ E^\mu = 1 \}_{\mu=1}^{P}$

Lagrange function:  $L = \frac{1}{2} |\vec{w}|^2 + \sum_{\mu=1}^{P} \lambda^\mu (1 - E^\mu)$

necessary conditions for the optimum:

$\frac{\partial L}{\partial \lambda^\mu} = (1 - E^\mu) \stackrel{!}{=} 0$

$\nabla_{\vec{w}} L = \vec{w} - \sum_{\mu=1}^{P} \lambda^\mu \vec{\xi}^\mu S^\mu \stackrel{!}{=} 0 \;\;\Rightarrow\;\; \vec{w} = \sum_{\mu=1}^{P} \lambda^\mu \vec{\xi}^\mu S^\mu$

Lagrange parameters ~ embedding strengths $\lambda^\mu$ (rescaled with N)

the solution is a linear combination of the data
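Approach (B) can be sketched with the pseudo-inverse: the minimum-norm solution of the equality constraints $E^\mu = 1$ is $\vec{w} = A^{+}\vec{1}$ with rows $A_{\mu\cdot} = \vec{\xi}^\mu S^\mu$. The snippet below is illustrative only and ignores the 1/N rescaling of the embedding strengths mentioned on the slide:

```python
import numpy as np

def min_norm_solution(xi, S):
    """Minimize 1/2 |w|^2 subject to E^mu = w . xi^mu S^mu = 1 for all mu.
    The minimum-norm solution is w = A^+ 1 with rows A_mu = xi^mu S^mu,
    i.e. a linear combination of the data with embedding strengths lam."""
    A = xi * S[:, None]                       # rows: xi^mu S^mu
    ones = np.ones(A.shape[0])
    w = np.linalg.pinv(A) @ ones              # = A^T (A A^T)^{-1} 1 for independent rows
    lam = np.linalg.pinv(A @ A.T) @ ones      # embedding strengths: w = A^T lam
    return w, lam
```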
solving equations?

eliminate the weights:

$\frac{1}{N}\sum_{j=1}^{N} w_j^2 \;\propto\; \sum_{\mu,\nu} \lambda^\nu C^{\nu\mu} \lambda^\mu$,   $E^\nu = \sum_{\mu} \underbrace{\frac{1}{N}\sum_{k=1}^{N} (\xi_k^\mu S^\mu)(\xi_k^\nu S^\nu)}_{\equiv\, C^{\nu\mu}} \, \lambda^\mu$

simplified problem:  $\max_{\vec{\lambda}} \; L = -\frac{1}{2}\sum_{\mu,\nu} \lambda^\nu C^{\nu\mu}\lambda^\mu + \sum_{\mu} \lambda^\mu$

gradient ascent with:  $\frac{\partial L}{\partial \lambda^\rho} = 1 - \sum_{\mu} C^{\rho\mu}\lambda^\mu = (1 - E^\rho)$

in terms of weights:  $\Delta \vec{w} \propto \sum_{\rho} (1 - E^\rho)\, \vec{\xi}^\rho S^\rho$   ... the same as in (A)!
solving equations?

rename the Lagrange parameters, re-writing the problem:

$\frac{1}{N}\sum_{j=1}^{N} w_j^2 \;\propto\; \sum_{\mu,\nu} x^\nu C^{\nu\mu} x^\mu$,   $E^\nu = \sum_{\mu} \underbrace{\frac{1}{N}\sum_{k=1}^{N} (\xi_k^\mu S^\mu)(\xi_k^\nu S^\nu)}_{\equiv\, C^{\nu\mu}} \, x^\mu$

simplified problem:  $\max_{\vec{x}} \; L = -\frac{1}{2}\sum_{\mu,\nu} x^\nu C^{\nu\mu} x^\mu + \sum_{\mu} x^\mu$

gradient ascent with:  $\frac{\partial L}{\partial x^\rho} = 1 - \sum_{\mu} C^{\rho\mu} x^\mu = (1 - E^\rho)$

in terms of weights:  $\Delta \vec{w} \propto \sum_{\rho} (1 - E^\rho)\, \vec{\xi}^\rho S^\rho$   ... the same as in (A)!
classical algorithm: ADALINE

Adaptive Linear Neuron (Widrow and Hoff, 1960)

Adaline algorithm: for a sequence $\mu(t)$ of examples,

$\vec{w}(t) = \vec{w}(t-1) + \eta \left( 1 - E^{\mu(t)} \right) \vec{\xi}^{\mu(t)} S^{\mu(t)}$,   equivalently   $x^{\mu(t)}(t) = x^{\mu(t)}(t-1) + \eta \left( 1 - E^{\mu(t)} \right)$

iteration of weights / embedding strengths

more general: training of a linear unit with continuous output

minimize  $f = \frac{1}{2}\sum_{\mu=1}^{P} (h^\mu - E^\mu)^2$  with  $h^\mu \in \mathbb{R},\; \mu = 1, 2, \ldots, P$

$f = \frac{1}{2}\sum_{\mu=1}^{P} \left( y^\mu - \vec{w}^\top \vec{\xi}^\mu \right)^2$  with  $y^\mu = h^\mu S^\mu$

gradient based learning for linear regression (MSE)

frequent strategy: regression as a proxy for classification
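A minimal numpy sketch of the Adaline update; the random permutation of the examples in each epoch is just one possible choice for the sequence $\mu(t)$, not prescribed by the slide:

```python
import numpy as np

def adaline(xi, S, eta=0.05, epochs=100, seed=None):
    """Adaline: sequential update w <- w + eta * (1 - E^mu) * xi^mu * S^mu,
    i.e. stochastic gradient descent on the squared error (1 - E^mu)^2 / 2."""
    rng = np.random.default_rng(seed)
    P, N = xi.shape
    w = np.zeros(N)
    for _ in range(epochs):
        for mu in rng.permutation(P):        # one choice of example sequence mu(t)
            E_mu = (w @ xi[mu]) * S[mu]
            w += eta * (1.0 - E_mu) * xi[mu] * S[mu]
    return w
```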
hardware realization: "Science in Action", ca. 1960

youtube video "Science in Action" with Bernard Widrow:
http://www.youtube.com/watch?v=IEFRtz68m-8
Introduction:
• supervised learning, classification, regression
• machine learning "vs." statistical modeling

Early (important!) approaches:
• linear threshold classifier, Rosenblatt's Perceptron
• adaptive linear neuron, Widrow and Hoff's Adaline

From Perceptron to Support Vector Machine:
• large margin classification
• beyond linear separability

Distance-based systems:
• prototypes: K-means and Vector Quantization
• from K-Nearest Neighbors to Learning Vector Quantization
• adaptive distance measures and relevance learning
Optimal stability by quadratic optimization

minimize  $\frac{1}{2}\vec{w}^2$  subject to the inequality constraints  $\left\{ E^\mu = \vec{w}^\top \vec{\xi}^\mu S^\mu \ge 1 \right\}_{\mu=1}^{P}$

Note: the solution $\vec{w}_{max}$ of the problem yields the stability $\kappa_{max} = 1 / |\vec{w}_{max}|$
Notation:

correlation matrix (outputs incorporated) $C$ with elements  $C^{\mu\nu} = \frac{1}{N}\, \vec{\xi}^\mu \cdot \vec{\xi}^\nu \, S^\mu S^\nu$   (C is positive semi-definite)

P-vectors:  embedding strengths $\vec{x} = (x^1, \ldots, x^P)^\top$,   "one-vector" $\vec{1} = (1, 1, \ldots, 1)^\top$

inequalities:  $C\,\vec{x} \ge \vec{1}$   (component-wise)
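In code, the correlation matrix can be built as a Gram matrix, which also makes its positive semi-definiteness evident (a sketch with illustrative names, not part of the lecture):

```python
import numpy as np

def correlation_matrix(xi, S):
    """C^{mu nu} = (1/N) (xi^mu . xi^nu) S^mu S^nu: the P x P Gram matrix
    of the rescaled vectors xi^mu S^mu / sqrt(N), hence positive semi-definite."""
    N = xi.shape[1]
    A = xi * S[:, None] / np.sqrt(N)
    return A @ A.T

# x^T C x = |A^T x|^2 >= 0 for every P-vector x, so C is positive semi-definite.
```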
Optimal stability by quadratic optimization

minimize  $\frac{1}{2}\vec{w}^2$  subject to the inequality constraints  $\left\{ E^\mu = \vec{w}^\top \vec{\xi}^\mu S^\mu \ge 1 \right\}_{\mu=1}^{P}$

Note: the solution $\vec{w}_{max}$ of the problem yields the stability $\kappa_{max} = 1 / |\vec{w}_{max}|$

We can formulate optimal stability completely in terms of embedding strengths:

minimize  $\frac{1}{2}\,\vec{x}^\top C\,\vec{x}$  subject to the linear constraints  $C\,\vec{x} \ge \vec{1}$

This is a special case of a standard problem in Quadratic Programming: minimize a nonlinear function under linear inequality constraints.
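As a sketch of this quadratic program in terms of embedding strengths, one could hand it to a general-purpose solver; here scipy's SLSQP method is used as a stand-in for a dedicated QP routine (the choice of solver and starting point are assumptions, not part of the lecture):

```python
import numpy as np
from scipy.optimize import minimize

def optimal_stability_qp(C):
    """Minimize 1/2 x^T C x subject to the linear inequalities C x >= 1."""
    P = C.shape[0]
    res = minimize(
        fun=lambda x: 0.5 * x @ C @ x,
        x0=np.ones(P),                       # arbitrary starting point
        jac=lambda x: C @ x,                 # gradient of the objective (C symmetric)
        constraints={"type": "ineq", "fun": lambda x: C @ x - 1.0,
                     "jac": lambda x: C},
        method="SLSQP",
    )
    return res.x                             # embedding strengths of maximal stability
```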
Optimization theory: Kuhn–Tucker theorem

see, e.g., R. Fletcher, Practical Methods of Optimization (Wiley, 1987),
or http://wikipedia.org "Karush–Kuhn–Tucker conditions" for a quick start

necessary conditions for a local solution of a general non-linear optimization problem with equality and inequality constraints
Max. stability: Kuhn–Tucker theorem for a special non-linear optimization problem

minimize  $\frac{1}{2}\,\vec{x}^\top C\,\vec{x}$  subject to  $C\,\vec{x} \ge \vec{1}$

Lagrange function:  $L(\vec{x}, \vec{\lambda}) = \frac{1}{2}\,\vec{x}^\top C\,\vec{x} - \vec{\lambda}^\top \left( C\,\vec{x} - \vec{1} \right)$

Any solution can be represented by a Kuhn–Tucker (KT) point with:
• non-negative embedding strengths:  $\vec{x} \ge 0$   (← minover)
• linear separability:  $C\,\vec{x} \ge \vec{1}$
• complementarity:  $x^\mu (1 - E^\mu) = 0$  for all $\mu$

straightforward to show:
→ all KT points yield the same unique perceptron weight vector
→ any local solution is globally optimal
Duality, theory of Lagrange multipliers → equivalent formulation (Wolfe dual):

maximize  $\tilde{f} = -\frac{1}{2}\,\vec{x}^\top C\,\vec{x} + \vec{x}^\top \vec{1}$  subject to  $\vec{x} \ge 0$

(the constraint $\vec{x} \ge 0$ is absent in the Adaline problem)
Duality, theory of Lagrange multipliers → equivalent formulation (Wolfe dual):

maximize  $\tilde{f} = -\frac{1}{2}\,\vec{x}^\top C\,\vec{x} + \vec{x}^\top \vec{1}$  subject to  $\vec{x} \ge 0$

AdaTron algorithm (Adaptive PercepTron) [Anlauf and Biehl, 1989]:
– sequential presentation of examples from $D = \{ \vec{\xi}^\mu, S^\mu \}$
– gradient ascent w.r.t. $\tilde{f}$, projected onto $\vec{x} \ge 0$:

$x^\mu \rightarrow \max\Big\{ 0, \; x^\mu + \underbrace{\eta \left( 1 - [C\,\vec{x}]^\mu \right)}_{\eta\, [\nabla_{\vec{x}} \tilde{f}]^\mu} \Big\}$   with   $0 < \eta < 2$
Duality, theory of Lagrange multipliers → equivalent formulation (Wolfe dual):

maximize  $\tilde{f} = -\frac{1}{2}\,\vec{x}^\top C\,\vec{x} + \vec{x}^\top \vec{1}$  subject to  $\vec{x} \ge 0$

AdaTron algorithm (Adaptive PercepTron) [Anlauf and Biehl, 1989]:
– sequential presentation of examples from $D = \{ \vec{\xi}^\mu, S^\mu \}$
– gradient ascent w.r.t. $\tilde{f}$, projected onto $\vec{x} \ge 0$:

$x^\mu \rightarrow \max\left\{ 0, \; x^\mu + \eta \left( 1 - [C\,\vec{x}]^\mu \right) \right\}$   with   $0 < \eta < 2$

for the proof of convergence one can show:
• for an arbitrary $\vec{x} \ge 0$ and a KT point $\vec{x}^*$:  $\tilde{f}(\vec{x}^*) \ge \tilde{f}(\vec{x})$
• $\tilde{f}(\vec{x})$ is bounded from above in $\vec{x} \ge 0$
• $\tilde{f}(\vec{x})$ increases in every cycle through $D$, unless a KT point has been reached
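A minimal numpy sketch of the AdaTron iteration as stated above: sequential sweeps over the examples, each single-component update projected onto $x^\mu \ge 0$. The stopping rule (a fixed number of epochs) is a simplification; in practice one would monitor convergence to a KT point:

```python
import numpy as np

def adatron(C, eta=1.0, epochs=200):
    """AdaTron: gradient ascent on f~(x) = -1/2 x^T C x + 1^T x,
    with each single-component update projected onto x^mu >= 0."""
    P = C.shape[0]
    x = np.zeros(P)
    for _ in range(epochs):
        for mu in range(P):                  # sequential presentation of the examples
            x[mu] = max(0.0, x[mu] + eta * (1.0 - C[mu] @ x))
    return x
```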
Support Vectors

complementarity condition:  $x^\mu (1 - E^\mu) = 0$  for all $\mu$

i.e. either  $\{ E^\mu = 1,\; x^\mu \ge 0 \}$  or  $\{ E^\mu > 1,\; x^\mu = 0 \}$

examples ... have to be embedded, or ... are stabilized "automatically"

the weights  $\vec{w} \propto \sum_\mu x^\mu \vec{\xi}^\mu S^\mu$  depend (explicitly) only on a subset of $D$: the support vectors

if these support vectors were known in advance, training could be restricted to the subset (unfortunately they are not ...)
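Given converged embedding strengths (e.g. from the AdaTron sketch above), the support vectors and the weight vector can be read off directly; the tolerance below is an arbitrary numerical choice:

```python
import numpy as np

def support_vectors_and_weights(x, xi, S, tol=1e-8):
    """Support vectors: examples with positive embedding strength x^mu.
    Weight vector: w proportional to sum_mu x^mu xi^mu S^mu."""
    sv = np.flatnonzero(x > tol)                      # indices of the support vectors
    w = (x[:, None] * S[:, None] * xi).sum(axis=0)    # linear combination of the data
    return sv, w
```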
learning in version space?

• learning in version space (including max. stability) is only possible if the data set is linearly separable
• even then, it only makes sense if
  – the unknown rule is a linearly separable function
  – the data set is reliable (noise-free)

[figure: lin. separable | nonlin. boundary | noisy data (?)]
Classification beyond linear separability

assume $D$ is not linearly separable: what can we do?
potential reasons: noisy data, more complex problem

• accept an approximation by a linearly separable function → see "pocket algorithm"
• treat the more complex problem with a more powerful architecture → see "committee machine"
• admit disagreements w.r.t. the training data, but keep the basic idea of optimal stability → "large margins with errors":

minimize over $\vec{w}, \vec{\beta}$:   $\frac{1}{2}\vec{w}^2 + \gamma \sum_{\mu=1}^{P} \beta^\mu$   subject to   $E^\mu \ge 1 - \beta^\mu$  and  $\beta^\mu \ge 0$  for all $\mu$

slack variables:  $\beta^\mu = 0 \leftrightarrow E^\mu \ge 1$,   $\beta^\mu > 0 \leftrightarrow E^\mu < 1$   (includes errors with $E^\mu < 0$)
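The soft-margin problem with slack variables can be written down almost verbatim with a convex-optimization library; the sketch below assumes cvxpy is available (the library and the value of $\gamma$ are illustrative choices, not part of the lecture):

```python
import numpy as np
import cvxpy as cp

def soft_margin(xi, S, gamma=1.0):
    """Minimize 1/2 |w|^2 + gamma * sum_mu beta^mu
    subject to E^mu >= 1 - beta^mu and beta^mu >= 0 (slack variables)."""
    P, N = xi.shape
    w = cp.Variable(N)
    beta = cp.Variable(P)
    E = cp.multiply(S, xi @ w)               # local potentials E^mu = w . xi^mu S^mu
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + gamma * cp.sum(beta))
    cp.Problem(objective, [E >= 1 - beta, beta >= 0]).solve()
    return w.value, beta.value
```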