Efficient Multiple Kernel Learning
Lei Tang
Outline
• What is Kernel Learning?
• What's the problem with the existing formulation?
• Two new formulations for large-scale kernel selection
  – SILP formulation (Cutting Planes)
  – More efficient MKL (Steepest Descent)
Linear algorithm: binary classification
• Data: {(x_i, y_i)}, i = 1…n
  – x ∈ R^d = feature vector (e.g., HEART, URINE, DNA, BLOOD, SCAN measurements)
  – y ∈ {−1, +1} = label
• Question: design a classification rule y = f(x) such that, given a new x, it predicts y with minimal probability of error
Linear algorithm: binary classification
• Find a good hyperplane (w, b) ∈ R^{d+1} that classifies this and future data points as well as possible
• Classification rule: f(x) = sign(w · x + b)
Linear algorithm: binary classification
• Intuition (Vapnik, 1965), if linearly separable:
  – Separate the data
  – Place the hyperplane "far" from the data: large margin
• Maximal Margin Classifier
Linear algorithm: binary classification
• If not linearly separable:
  – Allow some errors
  – Still, try to place the hyperplane "far" from each class
SVM: Primal & Dual
f(x) = w · x + b

Primal:
  min_{w,b}  (1/2)||w||^2 + C Σ_{i=1..N} ξ_i
  subject to  y_i (w · x_i + b) ≥ 1 − ξ_i  ∀i
              ξ_i ≥ 0

Dual:
  max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
  subject to  Σ_i α_i y_i = 0,  C ≥ α_i ≥ 0
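A minimal sketch of solving this dual QP with an off-the-shelf convex solver (assuming numpy and a recent cvxpy with psd_wrap; the function name svm_dual and the 1e-6 support-vector tolerance are illustrative choices, not from the slides):

```python
import numpy as np
import cvxpy as cp

def svm_dual(X, y, C=1.0):
    """Solve the soft-margin SVM dual QP with a linear kernel."""
    n = len(y)
    K = X @ X.T                                # Gram matrix: K_ij = x_i . x_j
    P = (y[:, None] * y[None, :]) * K          # P_ij = y_i y_j K_ij
    alpha = cp.Variable(n)
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(P)))
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    a = alpha.value
    w = X.T @ (a * y)                          # recover primal weights: w = sum_i a_i y_i x_i
    on_margin = (a > 1e-6) & (a < C - 1e-6)    # margin SVs satisfy y_i (w . x_i + b) = 1
    b = float(np.mean(y[on_margin] - X[on_margin] @ w))  # assumes at least one margin SV
    return w, b, a
```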
Linear algorithm: binary classification
• Training = convex optimization problem (QP):
  max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
  subject to  Σ_i α_i y_i = 0,  α_i ≥ 0
• The data appear only through inner products x_i^T x_j (implicit embedding); collecting them in the Gram matrix K = X^T X, with D = diag(y), gives the matrix form:
  max_α  e^T α − (1/2) α^T D K D α
  subject to  α^T y = 0,  α ≥ 0
Kernel algorithm: Support Vector Machine (SVM)
• Training = convex optimization problem (QP):
  max_α  e^T α − (1/2) α^T D K D α
  subject to  α^T y = 0,  C ≥ α ≥ 0
• Classification rule, for a new data point x:
  f(x) = sign(w^T x + b) = sign( Σ_{i=1..n_SV} α_i y_i x_i^T x + b )
• Kernel algorithm!
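Since both training and prediction touch the data only through inner products, swapping in any kernel function k(·,·) leaves the algorithm unchanged. A sketch of the resulting decision rule (function and argument names are illustrative):

```python
import numpy as np

def svm_predict(x_new, X_sv, y_sv, a_sv, b, k):
    """Kernelized decision rule: f(x) = sign( sum_i a_i y_i k(x_i, x) + b ),
    summing only over the support vectors (a_i > 0)."""
    kvec = np.array([k(x_i, x_new) for x_i in X_sv])
    return np.sign((a_sv * y_sv) @ kvec + b)
```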
Support Vector Machines (SVM)
• Hand-writing recognition (e.g., USPS)
• Computational biology (e.g., micro-array data)
• Text classification
• Face detection
• Face expression recognition
• Time series prediction (regression)
• Drug discovery (novelty detection)
Different Kernels
• Various kinds of kernel:
  – Linear kernel
  – Gaussian kernel
    K(X, Y) = exp( −||X − Y||^2 / (2σ^2) )
  – Diffusion kernel
  – String kernel
  – ……
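As a concrete example, a sketch of computing the Gaussian kernel matrix from the formula above, vectorized via the expansion ||x − y||^2 = ||x||^2 + ||y||^2 − 2 x·y:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix with K_ij = exp(-||X_i - Y_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clamp tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))
```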
Learning with Multiple Kernels
[Figure: several candidate kernels — which K should we use?]
Learning the optimal Kernel
• Overview of SVM with a single kernel: the optimal dual objective value G(K) is a function of the kernel matrix K
Learning the optimal Kernel
• G(K) gives an upper bound on generalization error: the smaller, the better the guaranteed performance
• Idea: learn a linear mix of base kernels and choose the mix that minimizes G(K)
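A minimal sketch of forming the linear mix K = Σ_k μ_k K_k from precomputed base kernel matrices (the nonnegativity constraint on μ is the standard assumption that keeps K positive semidefinite; the slides do not spell it out):

```python
import numpy as np

def combined_kernel(base_kernels, mu):
    """Linear mix K = sum_k mu_k K_k of base kernel matrices.
    With mu_k >= 0, any conic combination of PSD kernels is again a valid kernel."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0), "nonnegative weights keep K positive semidefinite"
    return sum(m * Kk for m, Kk in zip(mu, base_kernels))
```

The two formulations named in the outline (SILP via cutting planes, and steepest descent) differ in how they search over these mixing weights.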
To be Continued