  1. Output Kernel Learning Methods. Francesco Dinuzzo (1), Cheng Soon Ong (2), Kenji Fukumizu (3). (1) Max Planck Institute for Intelligent Systems, Tübingen, Germany; (2) NICTA, Melbourne; (3) Institute of Statistical Mathematics, Japan.

  2. Part I: Learning multiple tasks and their relationships

  3. Multiple regression: population pharmacokinetics. [Figure: xenobiotic concentration in 27 human subjects after a bolus administration; concentration plotted against time (hours).] The response curves have similar shapes. However, there is macroscopic inter-individual variability.

  4. Multiple regression: population pharmacokinetics. [Figure: concentration vs. time (hours) for Subjects #1 through #6.] Few data points per subject with sparse sampling.

  5. Multiple regression: population pharmacokinetics. [Figure: concentration vs. time (hours) for Subjects #1 through #6.] Can we combine the datasets to better estimate all the curves?

  6. Collaborative filtering and recommender systems. Data: collections of ratings assigned by several users to a set of items. Problem: estimate the preferences of every user for all the items. Preference profiles are different for every user. However, similar users have similar preferences.

  7. Collaborative filtering and recommender systems. Additional information: data about the items; data about the users (e.g. gender, age, occupation); data about the ratings themselves (e.g. timestamp, tags). Can we combine all these data to better estimate individual preferences?

  8. Multi-task learning: dataset structure. [Figure: task index (1-100) plotted against input index x; scatter of observed samples.] Sampling can be very sparse.

  9. Multi-task learning: dataset structure. [Figure: task index plotted against input index x.] Few samples per task...

  10. Multi-task learning: dataset structure. [Figure: task index plotted against input index x.] ... but each sample is shared by many tasks.

  11. Object recognition with structure discovery: (1) build an object classifier with good generalization performance; (2) discover relationships between the different classes.

  12. Organizing the classes in a graph structure. [Figure: graph over object classes (baseball-bat, goose, breadmaker, calculator, monitor, chopsticks, bear, killer-whale, helicopter, boom-box, laptop, microwave, sword, chimp, porcupine, swan, speed-boat, vcr, motorbikes, photocopier, tweezer, gorilla, video-projector, tennis-court, airplanes, mountain-bike, toaster, refrigerator, skunk, comet, cactus, centipede, butterfly, fireworks, golden-gate, buddha, grapes, spaghetti, fern, crab, grasshopper, bowling-ball, galaxy, light-house, eiffel-tower, toad, scorpion, hummingbird, hibiscus, cd, lightning, teepee, minaret, iris, brain, frisbee, rainbow, windmill, skyscraper, mars, steering-wheel, tower-pisa) with edges linking related classes.] How can we generate such a graph automatically?

  13. Part II: Output Kernel Learning

  14. Kernel-based multi-task learning. Multi-task supervised learning: synthesizing multiple functions $f_j : X \to Y$, $j = 1, \ldots, m$, from multiple datasets of input-output pairs $(x_{ij}, y_{ij})$.

  15. Kernel-based multi-task learning. (Multi-task supervised learning as on slide 14.) Multi-task kernels: for every pair of inputs $(x_1, x_2)$ and every pair of task indices $(i, j)$, specify a similarity value $K((x_1, i), (x_2, j))$. Equivalently, specify a matrix-valued function $H$ such that $[H(x_1, x_2)]_{ij} = K((x_1, i), (x_2, j))$.

  16. Decomposable kernels. Decomposable kernels: $K((x_1, i), (x_2, j)) = K_X(x_1, x_2)\, K_Y(i, j)$. Matrix-valued kernel: $H(x_1, x_2) = K_X(x_1, x_2) \cdot L$, with $L_{ij} = K_Y(i, j)$. $K_X$ is the input kernel; $K_Y$ is the output kernel (equivalently, $L \in S^m_+$).
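To make the decomposable construction concrete, here is a minimal sketch (hypothetical NumPy code, not from the slides): it assembles the joint Gram matrix over all (input, task) pairs as the Kronecker product of an input Gram matrix $K_X$ and an output kernel matrix $L$. The Gaussian input kernel and the toy task-similarity matrix are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Input-kernel Gram matrix K_X for a Gaussian (RBF) kernel (an assumed choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def decomposable_gram(K_X, L):
    """Joint Gram over all (input, task) pairs:
    K((x_a, i), (x_b, j)) = K_X(x_a, x_b) * L[i, j],
    laid out as the Kronecker product L kron K_X (task index varying slowest)."""
    return np.kron(L, K_X)

# Toy example: 5 shared inputs, 3 tasks.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
K_X = gaussian_gram(X)
L = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])      # output kernel (PSD), encodes task similarity
K_joint = decomposable_gram(K_X, L)  # shape (15, 15)
print(K_joint.shape)
```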

  17. Kernel-based regularization methods. Kernel-based regularization:
      $$ \min_{f \in \mathcal{H}_L} \left( \sum_{j=1}^{m} \sum_{i=1}^{\ell_j} V\big(y_{ij}, f_j(x_{ij})\big) + \|f\|_{\mathcal{H}_L}^2 \right) $$

  18. Kernel-based regularization methods. (Regularization problem as on slide 17.) Representer theorem:
      $$ f_j(x) = \sum_{k=1}^{m} \sum_{i=1}^{\ell_k} L_{jk}\, c_{ik}\, K_X(x_{ik}, x) $$
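As a small illustration of how the representer-theorem expansion is used at prediction time, the sketch below (hypothetical NumPy code) assumes all tasks share the same $\ell$ training inputs, so the coefficients form a single $\ell \times m$ matrix $C$; stacking the predictions for all tasks at a batch of test points then reduces to the matrix product $K_{\text{test}}\, C\, L$.

```python
import numpy as np

def predict_all_tasks(K_test, C, L):
    """Evaluate the representer-theorem expansion for every task at once.

    K_test : (n_test, n_train) cross Gram matrix K_X(x_test, x_train)
    C      : (n_train, m) coefficient matrix (the c_{ik} from the slides)
    L      : (m, m) symmetric output kernel matrix

    Returns an (n_test, m) matrix whose (t, j) entry is
    f_j(x_t) = sum_k sum_i L[j, k] * C[i, k] * K_X(x_i, x_t).
    """
    return K_test @ C @ L
```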

  19. Kernel-based regularization methods. (Same regularization problem and representer theorem as on slides 17-18.) How to choose the output kernel? Independent single-task learning: $L = I$.

  20. Kernel-based regularization methods. (Same regularization problem and representer theorem as on slides 17-18.) How to choose the output kernel? Independent single-task learning: $L = I$. Pooled single-task learning: $L = \mathbf{1}$ (the all-ones matrix).

  21. Kernel-based regularization methods. (Same regularization problem and representer theorem as on slides 17-18.) How to choose the output kernel? Independent single-task learning: $L = I$. Pooled single-task learning: $L = \mathbf{1}$. Design it using prior knowledge.

  22. Kernel-based regularization methods. (Same regularization problem and representer theorem as on slides 17-18.) How to choose the output kernel? Independent single-task learning: $L = I$. Pooled single-task learning: $L = \mathbf{1}$. Design it using prior knowledge. Learn it from the data.
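To make the effect of the output kernel concrete, here is a small sketch (hypothetical NumPy code) under two simplifying assumptions: squared loss scaled as in the low-rank OKL objective later in the deck, and training inputs shared by all tasks. Under those assumptions, a sufficient condition for optimality of the inner problem is the linear system $K C L + \lambda C = Y$, which the code solves in vectorized form; swapping $L$ between the identity (independent tasks) and the all-ones matrix (pooled tasks) changes the fit accordingly. The toy data and the value of $\lambda$ are assumptions.

```python
import numpy as np

def fit_coefficients(K, Y, L, lam):
    """Solve K C L + lam * C = Y for the coefficient matrix C
    (squared loss, inputs shared by all tasks, symmetric L).

    With column-major vectorization, vec(K C L) = (L kron K) vec(C),
    so C solves the linear system (L kron K + lam * I) vec(C) = vec(Y)."""
    n, m = Y.shape
    A = np.kron(L, K) + lam * np.eye(n * m)
    c = np.linalg.solve(A, Y.reshape(-1, order="F"))  # column-major vec
    return c.reshape((n, m), order="F")

# Toy data: 20 shared inputs, 3 closely related tasks (assumed example).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(20, 1)), axis=0)
K = np.exp(-(X - X.T) ** 2)                           # Gaussian input kernel
Y = np.sin(X) + 0.1 * rng.standard_normal((20, 3))    # 3 similar response curves

lam = 0.1
L_indep = np.eye(3)            # L = I : independent single-task learning
L_pooled = np.ones((3, 3))     # L = 1 : pooled single-task learning
C_indep = fit_coefficients(K, Y, L_indep, lam)
C_pooled = fit_coefficients(K, Y, L_pooled, lam)

F_indep = K @ C_indep @ L_indep
F_pooled = K @ C_pooled @ L_pooled
print(np.linalg.norm(Y - F_indep), np.linalg.norm(Y - F_pooled))
```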

  23. Multiple Kernel Learning. Multiple Kernel Learning (MKL):
      $$ K = \sum_{k=1}^{N} d_k K^k, \qquad d_k \geq 0. $$

  24. Multiple Kernel Learning. (MKL combination as on slide 23.) MKL with decomposable basis kernels:
      $$ K\big((x_1, i), (x_2, j)\big) = \sum_{k=1}^{N} d_k\, K_X^k(x_1, x_2)\, K_Y^k(i, j) $$

  25. Multiple Kernel Learning. (MKL and decomposable basis kernels as on slide 24.) MKL with decomposable basis kernels (common input kernel):
      $$ K\big((x_1, i), (x_2, j)\big) = K_X(x_1, x_2) \left( \sum_{k=1}^{N} d_k\, K_Y^k(i, j) \right) $$

  26. Multiple Kernel Learning. (MKL formulations as on slides 24-25.) Two issues: (1) the maximum number of kernels is limited by memory constraints; (2) specifying the dictionary of basis kernels requires domain knowledge.
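As a rough illustration of the first issue, the sketch below (hypothetical NumPy code; sizes and random PSD Gram matrices are assumptions) assembles the combined multi-task Gram matrix for a dictionary of N decomposable basis kernels. Each basis kernel contributes an $(\ell m) \times (\ell m)$ block, so either all N blocks are kept in memory or the sum is rebuilt whenever the weights $d_k$ change, which is what limits the size of the dictionary.

```python
import numpy as np

def mkl_decomposable_gram(K_X_list, K_Y_list, d):
    """Combined multi-task Gram for MKL with decomposable basis kernels:
    K = sum_k d_k * (K_Y^k kron K_X^k)."""
    ell = K_X_list[0].shape[0]
    m = K_Y_list[0].shape[0]
    K = np.zeros((ell * m, ell * m))
    for dk, KX, KY in zip(d, K_X_list, K_Y_list):
        K += dk * np.kron(KY, KX)   # each basis kernel fills a full (ell*m x ell*m) matrix
    return K

# Toy dictionary: N = 4 basis kernels over 50 inputs and 10 tasks (assumed sizes).
rng = np.random.default_rng(1)
ell, m, N = 50, 10, 4
K_X_list, K_Y_list = [], []
for _ in range(N):
    A = rng.standard_normal((ell, ell))
    B = rng.standard_normal((m, m))
    K_X_list.append(A @ A.T)        # random PSD input Gram (assumption)
    K_Y_list.append(B @ B.T)        # random PSD output kernel (assumption)
K = mkl_decomposable_gram(K_X_list, K_Y_list, np.ones(N))
print(K.shape)                      # (500, 500): grows with ell * m
```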

  27. Output Kernel Learning. Optimization problem:
      $$ \min_{L \in S_+} \left[ \min_{f \in \mathcal{H}_L} \left( \sum_{j=1}^{m} \sum_{i=1}^{\ell_j} V\big(y_{ij}, f_j(x_{ij})\big) + \|f\|_{\mathcal{H}_L}^2 \right) + \Omega(L) \right] $$

  28. Output Kernel Learning. (Optimization problem as on slide 27.) Examples of output-kernel regularizers: squared Frobenius norm $\Omega(L) = \|L\|_F^2$; sparsity-inducing regularizer $\Omega(L) = \|L\|_1$; low-rank-inducing regularizer $\Omega(L) = \mathrm{tr}(L) + I\big(\mathrm{rank}(L) \leq p\big)$.
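For reference, a tiny sketch (hypothetical NumPy code) of the three example regularizers applied to an output kernel matrix; the indicator in the low-rank case is read here as returning infinity once the rank budget p is exceeded, which is one plausible interpretation of the slide's notation.

```python
import numpy as np

def omega_frobenius(L):
    """Squared Frobenius norm: Omega(L) = ||L||_F^2."""
    return np.sum(L ** 2)

def omega_sparse(L):
    """Entrywise l1 norm: Omega(L) = ||L||_1 (sparsity-inducing)."""
    return np.sum(np.abs(L))

def omega_low_rank(L, p, tol=1e-9):
    """Low-rank-inducing regularizer: tr(L) + indicator(rank(L) <= p)."""
    return np.trace(L) if np.linalg.matrix_rank(L, tol=tol) <= p else np.inf
```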

  29. Low-Rank Output Kernel Learning. Low-Rank OKL:
      $$ \min_{L \in S_+^{m,p}} \; \min_{C \in \mathbb{R}^{\ell \times m}} \left( \frac{\|W \odot (Y - KCL)\|_F^2}{2\lambda} + \frac{\mathrm{tr}\big(C^T K C L\big)}{2} + \frac{\mathrm{tr}(L)}{2} \right) $$
      where $S_+^{m,p}$ denotes the positive semidefinite $m \times m$ matrices of rank at most $p$.

  30. Low-Rank Output Kernel Learning. (Same objective as on slide 29.) A non-linear generalization of reduced-rank regression.

  31. Low-Rank Output Kernel Learning. (Same objective as on slide 29.) A non-linear generalization of reduced-rank regression. One of the reformulations only requires storing low-rank matrices.
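One way to see the storage point (a hedged illustration, not necessarily the exact reformulation the slide refers to): once the output kernel is factored as $L = B B^T$ with $B \in \mathbb{R}^{m \times p}$, predictions depend on the coefficient matrix $C$ only through the $\ell \times p$ product $A = C B$, so only the low-rank factors $A$ and $B$ (plus the training inputs) need to be stored. The sketch below is hypothetical NumPy code with assumed sizes.

```python
import numpy as np

def compress(C, B):
    """Collapse the full coefficient matrix against the low-rank factor:
    predictions depend on C only through A = C B (shape ell x p)."""
    return C @ B

def predict_low_rank(K_test, A, B):
    """F_test = K_test @ C @ L with L = B B^T, computed as (K_test @ A) @ B.T
    using only the low-rank factors A (ell x p) and B (m x p)."""
    return (K_test @ A) @ B.T

# Assumed sizes: ell = 200 training inputs, m = 1000 tasks, rank p = 5.
ell, m, p = 200, 1000, 5
rng = np.random.default_rng(0)
C = rng.standard_normal((ell, m))        # stand-in for fitted coefficients
B = rng.standard_normal((m, p))          # low-rank factor of L
A = compress(C, B)

K_test = rng.standard_normal((10, ell))  # stand-in for K_X(x_test, x_train)
F_full = K_test @ C @ (B @ B.T)          # uses the full m x m output kernel
F_low = predict_low_rank(K_test, A, B)   # same predictions, low-rank storage only
print(np.allclose(F_full, F_low))
print("entries in L:", m * m, "| entries in A and B:", (ell + m) * p)
```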
