Lecture 5: SVM II. Princeton University COS 495. Instructor: Yingyu Liang.



SLIDE 1

Machine Learning Basics Lecture 5: SVM II

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Review: SVM objective

SLIDE 3

SVM: objective

  • Let $z_j \in \{+1, -1\}$ and $g_{x,c}(y) = x^T y + c$. Margin:

$$\delta = \min_j \frac{z_j \, g_{x,c}(y_j)}{\|x\|}$$

  • Support Vector Machine:

$$\max_{x,c} \delta = \max_{x,c} \min_j \frac{z_j \, g_{x,c}(y_j)}{\|x\|}$$

SLIDE 4

SVM: optimization

  • Optimization (Quadratic Programming):

$$\min_{x,c} \; \frac{1}{2}\|x\|^2 \quad \text{s.t.} \quad z_j (x^T y_j + c) \ge 1, \;\; \forall j$$

  • Solved by the Lagrange multiplier method:

$$\mathcal{L}(x, c, \boldsymbol{\beta}) = \frac{1}{2}\|x\|^2 - \sum_j \beta_j \left[ z_j (x^T y_j + c) - 1 \right]$$

where $\boldsymbol{\beta}$ is the vector of Lagrange multipliers

SLIDE 5

Lagrange multiplier

SLIDE 6

Lagrangian

  • Consider the optimization problem:

$$\min_x \; g(x) \quad \text{s.t.} \quad h_j(x) = 0, \;\; \forall\, 1 \le j \le m$$

  • Lagrangian:

$$\mathcal{L}(x, \boldsymbol{\gamma}) = g(x) + \sum_j \gamma_j h_j(x)$$

where the $\gamma_j$'s are called Lagrange multipliers

SLIDE 7

Lagrangian

  • Consider the optimization problem:

$$\min_x \; g(x) \quad \text{s.t.} \quad h_j(x) = 0, \;\; \forall\, 1 \le j \le m$$

  • Solved by setting the derivatives of the Lagrangian to 0:

$$\frac{\partial \mathcal{L}}{\partial x_j} = 0; \quad \frac{\partial \mathcal{L}}{\partial \gamma_j} = 0$$

SLIDE 8

Generalized Lagrangian

  • Consider the optimization problem:

$$\min_x \; g(x) \quad \text{s.t.} \quad q_j(x) \le 0, \;\; \forall\, 1 \le j \le l, \qquad h_k(x) = 0, \;\; \forall\, 1 \le k \le m$$

  • Generalized Lagrangian:

$$\mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma}) = g(x) + \sum_j \beta_j q_j(x) + \sum_k \gamma_k h_k(x)$$

where the $\beta_j$'s and $\gamma_k$'s are called Lagrange multipliers

SLIDE 9

Generalized Lagrangian

  • Consider the quantity:

$$\iota_Q(x) := \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • Why?

$$\iota_Q(x) = \begin{cases} g(x), & \text{if } x \text{ satisfies all the constraints} \\ +\infty, & \text{if } x \text{ does not satisfy the constraints} \end{cases}$$

  • So minimizing $g(x)$ is the same as minimizing $\iota_Q(x)$:

$$\min_x g(x) = \min_x \iota_Q(x) = \min_x \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

SLIDE 10

Lagrange duality

  • The primal problem:

$$q^* := \min_x g(x) = \min_x \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • The dual problem:

$$u^* := \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • Always true:

$$u^* \le q^*$$

SLIDE 11

Lagrange duality

  • The primal problem:

$$q^* := \min_x g(x) = \min_x \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • The dual problem:

$$u^* := \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • Interesting case: when do we have $u^* = q^*$?

SLIDE 12

Lagrange duality

  • Theorem: under proper conditions, there exist $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ such that

$$u^* = \mathcal{L}(x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*) = q^*$$

Moreover, $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial x_j} = 0, \quad \beta_j q_j(x) = 0, \quad q_j(x) \le 0, \quad h_k(x) = 0, \quad \beta_j \ge 0$$

SLIDE 13

Lagrange duality

  • Theorem: under proper conditions, there exist $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ such that

$$u^* = \mathcal{L}(x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*) = q^*$$

Moreover, $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial x_j} = 0, \quad \underbrace{\beta_j q_j(x) = 0}_{\text{dual complementarity}}, \quad q_j(x) \le 0, \quad h_k(x) = 0, \quad \beta_j \ge 0$$

SLIDE 14

Lagrange duality

  • Theorem: under proper conditions, there exist $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ such that

$$u^* = \mathcal{L}(x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*) = q^*$$

  • Moreover, $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial x_j} = 0, \quad \beta_j q_j(x) = 0, \quad \underbrace{q_j(x) \le 0, \;\; h_k(x) = 0}_{\text{primal constraints}}, \quad \underbrace{\beta_j \ge 0}_{\text{dual constraints}}$$

SLIDE 15

Lagrange duality

  • What are the proper conditions?
  • One set of conditions (the Slater conditions):
    • $g$ and the $q_j$ are convex, the $h_k$ are affine
    • there exists an $x$ satisfying all $q_j(x) < 0$ strictly
  • There exist other sets of sufficient conditions
  • Search for the Karush-Kuhn-Tucker conditions on Wikipedia
SLIDE 16

SVM: optimization

SLIDE 17

SVM: optimization

  • Optimization (Quadratic Programming):

$$\min_{x,c} \; \frac{1}{2}\|x\|^2 \quad \text{s.t.} \quad z_j (x^T y_j + c) \ge 1, \;\; \forall j$$

  • Generalized Lagrangian:

$$\mathcal{L}(x, c, \boldsymbol{\beta}) = \frac{1}{2}\|x\|^2 - \sum_j \beta_j \left[ z_j (x^T y_j + c) - 1 \right]$$

where $\boldsymbol{\beta}$ is the vector of Lagrange multipliers (a small numerical sketch of this QP follows below)
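A minimal sketch of this quadratic program, assuming the cvxpy package is available and reusing the toy X, z from the margin example earlier; the data and variable names are illustrative only.

```python
import cvxpy as cp
import numpy as np

# Toy separable data: rows of X are points y_j, z holds labels in {+1, -1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
z = np.array([+1, +1, -1, -1])

# Hard-margin SVM primal: min (1/2)||x||^2  s.t.  z_j (x^T y_j + c) >= 1.
x = cp.Variable(2)
c = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(x))
constraints = [cp.multiply(z, X @ x + c) >= 1]
cp.Problem(objective, constraints).solve()

print("x =", x.value, " c =", c.value)
```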

SLIDE 18

SVM: optimization

  • KKT conditions:

$$\frac{\partial \mathcal{L}}{\partial x} = 0 \;\Rightarrow\; x = \sum_j \beta_j z_j y_j \quad (1)$$

$$\frac{\partial \mathcal{L}}{\partial c} = 0 \;\Rightarrow\; 0 = \sum_j \beta_j z_j \quad (2)$$

  • Plug into $\mathcal{L}$:

$$\mathcal{L}(x, c, \boldsymbol{\beta}) = \sum_j \beta_j - \frac{1}{2}\sum_{j,k} \beta_j \beta_k z_j z_k \, y_j^T y_k \quad (3)$$

combined with $0 = \sum_j \beta_j z_j$, $\beta_j \ge 0$

SLIDE 19

SVM: optimization

  • Reduces to the dual problem:

$$\max_{\boldsymbol{\beta}} \;\; \sum_j \beta_j - \frac{1}{2}\sum_{j,k} \beta_j \beta_k z_j z_k \, y_j^T y_k \quad \text{s.t.} \quad \sum_j \beta_j z_j = 0, \;\; \beta_j \ge 0$$

  • Since $x = \sum_j \beta_j z_j y_j$, we have $x^T y + c = \sum_j \beta_j z_j \, y_j^T y + c$

  • Both the dual objective and the resulting classifier depend on the data only through inner products
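A hedged sketch with scikit-learn's SVC (assuming scikit-learn is available; the toy data is mine, and a very large C is used to approximate the hard margin). Its dual_coef_ attribute stores $\beta_j z_j$ for the support vectors, so $x$ can be recovered from the dual solution, matching equation (1).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
z = np.array([+1, +1, -1, -1])

# Linear-kernel SVM; a very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, z)

# dual_coef_ holds beta_j * z_j for the support vectors, so
# x = sum_j beta_j z_j y_j can be recovered from the dual solution.
x = clf.dual_coef_ @ clf.support_vectors_
print("x from dual:", x.ravel(), "  c:", clf.intercept_)
print("matches coef_:", np.allclose(x, clf.coef_))
```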

SLIDE 20

Kernel methods

SLIDE 21

Features

[Figure: extract features from an input $y$, e.g. a color histogram over the red, green, and blue channels, giving a feature vector $\varphi(y)$]

SLIDE 22

Features

SLIDE 23

Features

  • A proper feature mapping can turn a non-linear problem into a linear one
  • Using SVM on the feature space $\{\varphi(y_j)\}$: only the inner products $\varphi(y_j)^T \varphi(y_k)$ are needed
  • Conclusion: no need to design $\varphi(\cdot)$ explicitly, only need to design the kernel

$$l(y_j, y_k) = \varphi(y_j)^T \varphi(y_k)$$
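A short sketch of this idea with scikit-learn (assumed available; data and kernel choice are my own illustration): the classifier is trained from a kernel (Gram) matrix alone, never from explicit features $\varphi(y)$.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
z = np.array([+1, +1, -1, -1])

def kernel(A, B):
    # Example kernel l(y, y') = (y^T y' + 1)^2; only inner products are used.
    return (A @ B.T + 1.0) ** 2

# Train from the Gram matrix K[j, k] = l(y_j, y_k), without explicit features.
clf = SVC(kernel="precomputed").fit(kernel(X, X), z)

# Predicting on new points also needs only kernel values against training points.
X_new = np.array([[1.0, 1.0], [-1.5, -1.0]])
print(clf.predict(kernel(X_new, X)))
```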

SLIDE 24

Polynomial kernels

  • Fix a degree $e$ and a constant $d$:

$$l(y, y') = (y^T y' + d)^e$$

  • What is $\varphi(y)$?
  • Expand the expression to read off $\varphi(y)$ (a quick check follows below)
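A quick numpy check (my own illustration) for degree $e = 2$ in two dimensions: expanding $(y^T y' + d)^2$ gives the explicit feature map $\varphi(y) = (y_1^2,\, y_2^2,\, \sqrt{2}\,y_1 y_2,\, \sqrt{2d}\,y_1,\, \sqrt{2d}\,y_2,\, d)$, and the two ways of computing the kernel agree.

```python
import numpy as np

d = 1.0  # the constant in the polynomial kernel

def phi(y):
    # Explicit feature map for the degree-2 polynomial kernel in 2D.
    y1, y2 = y
    return np.array([y1**2, y2**2, np.sqrt(2) * y1 * y2,
                     np.sqrt(2 * d) * y1, np.sqrt(2 * d) * y2, d])

rng = np.random.default_rng(0)
y, y_prime = rng.normal(size=2), rng.normal(size=2)

kernel_value = (y @ y_prime + d) ** 2
explicit_value = phi(y) @ phi(y_prime)
print(kernel_value, explicit_value)        # the two numbers coincide
assert np.isclose(kernel_value, explicit_value)
```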
SLIDE 25

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 26

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 27

Gaussian kernels

  • Fix a bandwidth $\tau$:

$$l(y, y') = \exp\!\left(-\|y - y'\|^2 / 2\tau^2\right)$$

  • Also called radial basis function (RBF) kernels
  • What is $\varphi(y)$? Consider the un-normalized version

$$l'(y, y') = \exp\!\left(y^T y' / \tau^2\right)$$

  • Power series expansion:

$$l'(y, y') = \sum_{j=0}^{+\infty} \frac{(y^T y')^j}{\tau^{2j} \, j!}$$

SLIDE 28

Mercer’s condition for kernels

  • Theorem: $l(y, y')$ has an expansion

$$l(y, y') = \sum_{j=1}^{+\infty} c_j \, \varphi_j(y) \, \varphi_j(y')$$

if and only if for any function $d(y)$,

$$\int\!\!\int d(y)\, d(y')\, l(y, y')\, \mathrm{d}y\, \mathrm{d}y' \ge 0$$

(omitting some technical conditions on $l$ and $d$)

SLIDE 29

Constructing new kernels

  • Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{j=0}^{+\infty} b_j \, l(y, y')^j$ (with $b_j \ge 0$)
  • Example: if $l_1(y, y')$ and $l_2(y, y')$ are kernels, then so is

$$l(y, y') = 2\, l_1(y, y') + 3\, l_2(y, y')$$

  • Example: if $l_1(y, y')$ is a kernel, then so is

$$l(y, y') = \exp(l_1(y, y'))$$
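A small numpy sanity check of the two examples (my own illustration): Gram matrices built from $2 l_1 + 3 l_2$ and from $\exp(l_1)$ stay positive semidefinite on random data.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(15, 4))

K1 = Y @ Y.T                   # linear kernel l1(y, y') = y^T y'
K2 = (Y @ Y.T + 1.0) ** 2      # polynomial kernel l2(y, y') = (y^T y' + 1)^2

def is_psd(K):
    # Positive semidefinite up to numerical tolerance.
    return np.linalg.eigvalsh(K).min() >= -1e-8

print(is_psd(2 * K1 + 3 * K2))   # positive combination of kernels
print(is_psd(np.exp(K1)))        # elementwise exp, a power series in l1
```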

SLIDE 30

Kernels vs. neural networks

SLIDE 31

Features

[Figure: extract features (e.g. a color histogram over the red, green, and blue channels) from an input $y$, then build the hypothesis $z = x^T \varphi(y)$]

SLIDE 32

Features: part of the model

𝑧 = π‘₯π‘ˆπœš 𝑦

build hypothesis

Linear model Nonlinear model

SLIDE 33

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 34

Polynomial kernel SVM as a two-layer neural network

[Figure: network with inputs $y_1, y_2$, a fixed first layer computing $\varphi(y) = (y_1^2,\; y_2^2,\; \sqrt{2}\, y_1 y_2,\; \sqrt{2d}\, y_1,\; \sqrt{2d}\, y_2,\; d)$, and output $z = \mathrm{sign}(x^T \varphi(y) + c)$]

The first layer is fixed. If the first layer is also learned, this becomes a two-layer neural network.
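A minimal numpy sketch of this view (my own illustration): the fixed first layer is the explicit degree-2 feature map, and the second layer is the linear classifier $(x, c)$; the weights below are illustrative placeholders, not trained values.

```python
import numpy as np

d = 1.0  # constant of the degree-2 polynomial kernel

def first_layer(y):
    # Fixed first layer: the explicit feature map of the degree-2 kernel.
    y1, y2 = y
    return np.array([y1**2, y2**2, np.sqrt(2) * y1 * y2,
                     np.sqrt(2 * d) * y1, np.sqrt(2 * d) * y2, d])

def predict(y, x, c):
    # Second layer: linear classifier on top, z = sign(x^T phi(y) + c).
    return np.sign(x @ first_layer(y) + c)

# Illustrative second-layer weights; in an SVM these come from training,
# while a two-layer neural network would also learn the first layer.
x = np.array([1.0, 1.0, -0.5, 0.2, 0.2, -0.3])
c = 0.1
print(predict(np.array([1.0, -2.0]), x, c))
```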