
Scalable Machine Learning
5. (Generalized) Linear Models
Alex Smola, Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012 (Stat 260, SP 12)
Administrative stuff: solutions will be posted by tomorrow; a new problem set will be ...


  1. Properties
• Ignores 'typical' instances with small error
• Only the upper or the lower bound is active at any time (we cannot violate both bounds simultaneously)
• Quadratic program in 2n variables, can be solved as cheaply as the standard SVM problem
• Robustness with respect to outliers
• The l1 loss yields the same problem without epsilon
• Huber's robust loss yields a similar problem but with an added quadratic penalty on the coefficients
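To make the first bullet concrete, here is a minimal sketch (assuming NumPy; the numbers are illustrative) of the epsilon-insensitive loss, which assigns zero cost to residuals inside the tube:

```python
import numpy as np

def eps_insensitive_loss(y, f, eps=0.1):
    """epsilon-insensitive loss: zero for residuals inside the tube |y - f| <= eps."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

# residuals smaller than eps contribute nothing to the objective
residuals = np.array([-0.5, -0.05, 0.0, 0.08, 0.3])
print(eps_insensitive_loss(residuals, 0.0, eps=0.1))  # [0.4  0.   0.   0.   0.2]
```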

  2. Regression example (figure): sinc(x) ± 0.1 tube and the approximation

  3. Regression example (figure): sinc(x) ± 0.2 tube and the approximation

  4. Regression example (figure): sinc(x) ± 0.5 tube and the approximation

  5. Regression example (figure): support vectors for the three tube widths

  6. Huber's robust loss
$$
l(y, f(x)) = \begin{cases} \tfrac{1}{2}\,(y - f(x))^2 & \text{if } |y - f(x)| < 1 \\ |y - f(x)| - \tfrac{1}{2} & \text{otherwise} \end{cases}
$$
(figure annotations: quadratic near zero, linear in the tails, behaves like a trimmed mean estimator)
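A minimal sketch of this loss (assuming NumPy and a generic threshold delta; delta = 1 recovers the case statement above):

```python
import numpy as np

def huber_loss(y, f, delta=1.0):
    """Quadratic for |y - f| < delta, linear beyond; delta = 1 matches the slide."""
    r = np.abs(y - f)
    return np.where(r < delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

# small residuals are penalized quadratically, outliers only linearly
print(huber_loss(np.array([0.1, 0.5, 3.0]), 0.0))  # [0.005 0.125 2.5]
```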

  7. Novelty Detection

  8. Basic Idea
Data: observations (x_i) generated from some P(x), e.g. network usage patterns, handwritten digits, alarm sensors, factory status.
Task: find unusual events, clean the database, distinguish typical examples.

  9. Applications
Network intrusion detection: detect whether someone is trying to hack the network, downloading tons of MP3s, or doing anything else unusual on the network.
Jet engine failure detection: you can't destroy jet engines just to see how they fail.
Database cleaning: we want to find out whether someone stored bogus information in a database (typos, etc.), mislabelled digits, ugly digits, bad photographs in an electronic album.
Fraud detection: credit cards, telephone bills, medical records.
Self-calibrating alarm devices: car alarms (adjust to where the car is parked), home alarms (furniture, temperature, windows, etc.).

  10. Novelty Detection via Density Estimation
Key idea: novel data is data we do not see frequently; it must lie in low-density regions.
Step 1: Estimate the density. Observations x_1, ..., x_m; density estimate via Parzen windows.
Step 2: Threshold the density. Sort the data according to density and use it for rejection.
Practical implementation: compute
$$ p(x_i) = \frac{1}{m} \sum_j k(x_i, x_j) \quad \text{for all } i $$
and sort according to magnitude. Pick the points with the smallest p(x_i) as novel.
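A small sketch of these two steps, assuming a Gaussian Parzen window and a hand-picked rejection fraction (both illustrative choices, not prescribed by the slides):

```python
import numpy as np

def parzen_novelty(X, bandwidth=1.0, frac=0.05):
    """Score each point by a Gaussian Parzen-window density estimate
    p(x_i) = (1/m) sum_j k(x_i, x_j) and flag the lowest-density fraction as novel."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2 * bandwidth ** 2))
    p = K.mean(axis=1)
    threshold = np.quantile(p, frac)      # reject the points with the smallest p(x_i)
    return p, p <= threshold

rng = np.random.RandomState(0)
X = np.r_[rng.randn(95, 2), 5 + rng.randn(5, 2)]   # 5 far-away points
p, novel = parzen_novelty(X, bandwidth=1.0, frac=0.05)
print(novel.sum(), "points flagged as novel")
```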

  11. Order Statistics of Densities

  12. Typical Data

  13. Outliers

  14. A better way
Problems:
• We do not care about estimating the density properly in regions of high density (waste of capacity).
• We only care about the relative density for thresholding purposes.
• We want to eliminate a certain fraction of observations and tune our estimator specifically for this fraction.
Solution:
• Areas of low density can be approximated as the level set of an auxiliary function. No need to estimate p(x) directly; use a proxy of p(x).
• Specifically: find f(x) such that x is novel if f(x) ≤ c, where c is some constant, i.e. f(x) describes the amount of novelty.

  15. Problems with density estimation
Maximum a posteriori:
$$ \operatorname*{minimize}_{\theta} \; \sum_{i=1}^m \big[ g(\theta) - \langle \phi(x_i), \theta \rangle \big] + \frac{1}{2\sigma^2} \| \theta \|^2 $$
Advantages: convex optimization problem; concentration of measure.
Problems: the normalization g(θ) may be painful to compute; we do not actually need a normalized p(x|θ); no need to perform particularly well in high-density regions.

  16. Thresholding

  17. Optimization Problem
MAP:
$$ \sum_{i=1}^m -\log p(x_i \mid \theta) + \frac{1}{2\sigma^2} \|\theta\|^2 $$
Novelty:
$$ \sum_{i=1}^m \max\!\left( -\log \frac{p(x_i \mid \theta)}{\exp(\rho - g(\theta))}, \, 0 \right) + \frac{1}{2}\|\theta\|^2 \;=\; \sum_{i=1}^m \max\big( \rho - \langle \phi(x_i), \theta \rangle, \, 0 \big) + \frac{1}{2}\|\theta\|^2 $$
Advantages: no normalization g(θ) needed; no need to perform particularly well in high-density regions (the estimator focuses on low-density regions); quadratic program.

  18. Maximum Distance Hyperplane
Idea: find the hyperplane, given by f(x) = ⟨w, x⟩ + b = 0, that has maximum distance from the origin yet is still closer to the origin than the observations.
Hard margin:
$$ \text{minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } \langle w, x_i \rangle \geq 1 $$
Soft margin:
$$ \text{minimize } \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i \quad \text{subject to } \langle w, x_i \rangle \geq 1 - \xi_i, \; \xi_i \geq 0 $$

  19. Optimization Problem
Primal problem:
$$ \text{minimize } \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i \quad \text{subject to } \langle w, x_i \rangle - 1 + \xi_i \geq 0 \text{ and } \xi_i \geq 0 $$
Lagrange function L: subtract the constraints, multiplied by Lagrange multipliers (α_i and η_i), from the primal objective function. The Lagrange function L has a saddlepoint at the optimum.
$$ L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i \big( \langle w, x_i \rangle - 1 + \xi_i \big) - \sum_{i=1}^m \eta_i \xi_i \quad \text{subject to } \alpha_i, \eta_i \geq 0. $$

  20. Dual Problem
Optimality conditions:
$$ \partial_w L = w - \sum_{i=1}^m \alpha_i x_i = 0 \implies w = \sum_{i=1}^m \alpha_i x_i $$
$$ \partial_{\xi_i} L = C - \alpha_i - \eta_i = 0 \implies \alpha_i \in [0, C] $$
Now substitute the optimality conditions back into L.
Dual problem:
$$ \text{minimize } \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^m \alpha_i \quad \text{subject to } \alpha_i \in [0, C] $$
All this is only possible due to the convexity of the primal problem.

  21. Minimum enclosing ball (figure)
• Observations lie on the surface of a ball of radius R
• Find the minimum enclosing ball
• Equivalent to the single-class SVM (margin ρ/‖w‖)

  22. Adaptive thresholds
Problem: depending on C, the number of novel points will vary. We would like to specify the fraction ν beforehand.
Solution: use the hyperplane separating the data from the origin, H := { x | ⟨w, x⟩ = ρ }, where the threshold ρ is adaptive.
Intuition: let the hyperplane shift by shifting ρ; adjust it such that the 'right' number of observations is considered novel; do this automatically.

  23. Optimization Problem
Primal problem:
$$ \text{minimize } \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \xi_i - m\nu\rho \quad \text{where } \langle w, x_i \rangle - \rho + \xi_i \geq 0, \; \xi_i \geq 0 $$
Dual problem:
$$ \text{minimize } \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j \langle x_i, x_j \rangle \quad \text{where } \alpha_i \in [0, 1] \text{ and } \sum_{i=1}^m \alpha_i = \nu m. $$
Similar to the SV classification problem; use standard ...
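For a concrete feel of the ν-formulation, here is a sketch using scikit-learn's OneClassSVM, which implements this one-class problem; the data, kernel width and ν are illustrative choices:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = np.r_[rng.randn(190, 2), 6 + rng.randn(10, 2)]   # mostly typical data, a few outliers

nu = 0.1                                  # desired upper bound on the novel fraction
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
novel = clf.predict(X) == -1              # -1 marks points outside the estimated region

# nu-property: at most (roughly) a fraction nu of the data ends up flagged as novel
print("fraction flagged as novel:", novel.mean())
```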

  24. The ν-property theorem
• Optimization problem:
$$ \operatorname*{minimize}_{w} \; \frac{1}{2}\|w\|^2 + \sum_{i=1}^m \xi_i - m\nu\rho \quad \text{subject to } \langle w, x_i \rangle \geq \rho - \xi_i \text{ and } \xi_i \geq 0 $$
• The solution satisfies:
• At most a fraction ν of the points are novel
• At most a fraction (1 − ν) of the points are not novel
• The fraction of points on the boundary vanishes for large m (for non-pathological kernels)

  25. Proof
• Move the boundary at optimality.
• For a smaller threshold, the m⁻ points on the wrong side of the margin contribute δ(m⁻ − νm) ≤ 0.
• For a larger threshold, the m⁺ points not on the 'good' side of the margin yield δ(m⁺ − νm) ≥ 0.
• Combining the inequalities: m⁻/m ≤ ν ≤ m⁺/m.
• The margin is a set of measure 0.

  26. Toy example

ν, width c       0.5, 0.5     0.5, 0.5     0.1, 0.5     0.5, 0.1
frac. SVs/OLs    0.54, 0.43   0.59, 0.47   0.24, 0.03   0.65, 0.38
margin ρ/‖w‖     0.84         0.70         0.62         0.48

(illustrates the interplay of the threshold and the smoothness requirements)

  27. Novelty detection for OCR
Better estimates since we only optimize in low-density regions. Specifically tuned for a small number of outliers. Only estimates a level set. For ν = 1 we recover the Parzen-windows estimator.

  28. Classification with the ν-trick (figure: changing kernel width and threshold)

  29. Structured Estimation (preview)

  30. Large Margin Condition
• Binary classifier: correct class chosen with a large margin y f(x)
• Multiple classes: a score function f(x, y) per class
• Want the correct class to have a much larger score than any incorrect class:
$$ f(x, y) - f(x, y') \geq 1 \quad \text{for all } y' \neq y $$
• Structured loss function Δ(y, y') (e.g. coal & diamonds)

  31. Large Margin Classifiers
• Large margin without rescaling (convex) (Guestrin, Taskar, Koller):
$$ l(x, y, f) = \sup_{y' \in \mathcal{Y}} \big[ f(x, y') - f(x, y) + \Delta(y, y') \big] $$
• Large margin with rescaling (convex) (Tsochantaridis, Hofmann, Joachims, Altun):
$$ l(x, y, f) = \sup_{y' \in \mathcal{Y}} \big[ f(x, y') - f(x, y) + 1 \big] \, \Delta(y, y') $$
• Both losses majorize the misclassification loss $\Delta\!\big(y, \operatorname{argmax}_{y'} f(x, y')\big)$
• Proof by plugging the argmax into the definition
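A toy sketch of the two convex upper bounds above on a small discrete label set (the scores, Δ and labels are made up for illustration):

```python
import numpy as np

def loss_without_rescaling(scores, y, Delta):
    """sup_{y'} [ f(x, y') - f(x, y) + Delta(y, y') ]"""
    return max(scores[yp] - scores[y] + Delta[y, yp] for yp in range(len(scores)))

def loss_with_rescaling(scores, y, Delta):
    """sup_{y'} [ f(x, y') - f(x, y) + 1 ] * Delta(y, y')"""
    return max((scores[yp] - scores[y] + 1.0) * Delta[y, yp] for yp in range(len(scores)))

scores = np.array([2.0, 1.5, -1.0])   # f(x, y') for y' in {0, 1, 2}
Delta = 1.0 - np.eye(3)               # 0/1 structured loss, just for illustration
y = 0                                 # correct label
print(loss_without_rescaling(scores, y, Delta))  # 0.5: class 1 violates the margin
print(loss_with_rescaling(scores, y, Delta))     # 0.5
```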

  32. Many applications
• Ranking (DCG, NDCG)
• Graph matching (linear assignment)
• ROC and F_β scores
• Sequence annotation (named entities, activity)
• Segmentation
• Natural language translation
• Image annotation / scene understanding
• Caution: this loss is generally not consistent!

  33. Extensions
• Invariances: add prior knowledge (e.g. in OCR); make estimates robust against malicious abuse (e.g. spam filtering)
• Tighter upper bounds: the convex bound can be very loose and overweights noisy data; a structured version of the ramp loss can be shown to be consistent

  34. More Kernel Algorithms

  35. Kernel PCA

  36. Principal Component Analysis
• Gaussian density model:
$$ p(x; \mu, \Sigma) = (2\pi)^{-\frac{d}{2}} |\Sigma|^{-\frac{1}{2}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right) $$
• Estimate the covariance by the empirical average:
$$ \hat{\Sigma} = \frac{1}{m} \sum_{i=1}^m x_i x_i^\top - \hat{\mu}\hat{\mu}^\top \quad \text{where } \hat{\mu} = \frac{1}{m} \sum_{i=1}^m x_i $$
• Good approximation by a low-rank model
• Extract the leading eigenvalues of the covariance
• Data might lie in a subspace
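A minimal sketch of this covariance estimate and the extraction of leading eigenvectors (NumPy; the data is illustrative):

```python
import numpy as np

def pca(X, k):
    """Top-k principal components from the empirical covariance of X (m x d)."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / X.shape[0]    # empirical covariance estimate
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]         # keep the k largest
    return eigvals[order], eigvecs[:, order], mu

rng = np.random.RandomState(0)
# data that lies (almost) in a one-dimensional subspace of R^3
X = rng.randn(200, 1) @ np.array([[1.0, 2.0, 0.5]]) + 0.05 * rng.randn(200, 3)
vals, vecs, mu = pca(X, k=2)
print(vals)   # one dominant eigenvalue, the second close to zero
```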

  37. Principal Component Analysis
• Generative approximation of the data:
$$ x = \sum_i \sigma_i v_i \alpha_i \quad \text{where } \alpha_i \sim \mathcal{N}(0, 1) $$
• Heuristic: a good explanation of the data implies that we have meaningful dimensions of the data.
• Linear feature extraction: g_i(x) = ⟨v_i, x⟩
• PCA is the reconstruction with the smallest l2 error

  38. Good for exploratory data analysis (figure: http://www.plantsciences.ucdavis.edu/gepts/pb143/LEC17/pq0921251003.gif)

  39. Kernel PCA (figure)
Linear PCA in R²: k(x, x') = ⟨x, x'⟩.
Kernel PCA via a feature map Φ into H: e.g. k(x, x') = ⟨x, x'⟩^d.

  40. PCA via inner products
• Eigenvector condition Σv = λv:
$$ \frac{1}{m} \sum_i \bar{x}_i \bar{x}_i^\top v = \lambda v \quad \text{for } \bar{x}_i = x_i - \frac{1}{m} \sum_j x_j, $$
hence $v = \sum_j \alpha_j \bar{x}_j$.
• Using $\frac{1}{m} \sum_i \bar{x}_l^\top \bar{x}_i \bar{x}_i^\top v = \lambda \bar{x}_l^\top v$ yields $\frac{1}{m} \bar{K}\bar{K}\alpha = \lambda \bar{K}\alpha$.
• Kernel PCA: $\frac{1}{m} \bar{K}\alpha = \lambda\alpha$ where $\bar{K}_{ij} = \langle \bar{x}_i, \bar{x}_j \rangle$.
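A compact sketch of this eigenproblem, using the usual double-centering of the kernel matrix; the Gaussian RBF kernel and all parameter values are illustrative choices:

```python
import numpy as np

def kernel_pca(X, k, gamma=0.5):
    """Project the training points onto the top-k kernel principal components."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)                        # Gaussian RBF kernel matrix
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kbar = H @ K @ H                               # centered kernel matrix \bar{K}
    eigvals, alphas = np.linalg.eigh(Kbar / m)     # solve (1/m) \bar{K} alpha = lambda alpha
    order = np.argsort(eigvals)[::-1][:k]
    lam, A = eigvals[order], alphas[:, order]
    A = A / np.sqrt(np.maximum(lam * m, 1e-12))    # rescale so that ||v|| = 1
    return Kbar @ A                                # <v, xbar_i> for every training point

Z = kernel_pca(np.random.randn(100, 2), k=2)
print(Z.shape)   # (100, 2)
```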

  41. Two dimensional feature extraction (figure)
Noisy parabola data; kernel PCA with polynomial kernels of increasing order (order 1 is linear PCA). Panels show the extracted features with their eigenvalues, decaying from 0.709 for the leading component down to 0.000.

  42. Feature extraction (figure: 16 kernel principal components with eigenvalues decaying from 0.251 to 0.002)

  43. Mean Classifier

  44. 'Trivial' classifier (figure: class means c+ and c−, weight vector w, test point x and the vector x − c+)
• Represent each class by its mean in feature space
• Classify along the direction of maximum discrepancy between the classes
• Trivial to 'train'

  45. 'Trivial' classifier (same figure)
• Class means:
$$ \mu_+ = \frac{1}{m_+} \sum_{i: y_i = 1} \phi(x_i) \quad \text{and} \quad \mu_- = \frac{1}{m_-} \sum_{i: y_i = -1} \phi(x_i) $$
• Classifier, similar to a Nadaraya-Watson estimator:
$$ f(x) = \langle \mu_+ - \mu_-, \phi(x) \rangle = \sum_i \frac{y_i}{m_{y_i}} k(x_i, x) $$
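A short sketch of this mean classifier with a Gaussian RBF kernel (the kernel choice, bandwidth and data are illustrative):

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mean_classifier(Xtrain, y, Xtest, gamma=0.5):
    """f(x) = sum_i (y_i / m_{y_i}) k(x_i, x): difference of the two class kernel means."""
    weights = np.where(y == 1, 1.0 / (y == 1).sum(), -1.0 / (y == -1).sum())
    return rbf(Xtest, Xtrain, gamma) @ weights    # sign(f) is the predicted class

rng = np.random.RandomState(0)
X = np.r_[rng.randn(50, 2) - 2, rng.randn(50, 2) + 2]
y = np.r_[-np.ones(50), np.ones(50)]
print(np.sign(mean_classifier(X, y, X))[:5])      # should be -1 for the first class
```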

  46. More kernel methods
• Canonical correlation analysis
• Two-sample test: the mean in feature space is sufficient to fully represent a distribution; compare distributions by computing the distance
• Independence test: compare the joint distribution and the product of the marginals
• Structured feature extraction: find directions of high significance and low function complexity

  47. Conditional Models

  48. Gaussian Processes

  49. Weight & height

  50. Weight & height assume Gaussian correlation

  51. $ p(\text{weight} \mid \text{height}) = \dfrac{p(\text{height}, \text{weight})}{p(\text{height})} \propto p(\text{height}, \text{weight}) $

  52.
$$ p(x_2 \mid x_1) \propto \exp\!\left( -\frac{1}{2} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\top \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{bmatrix}^{-1} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \right) $$
Keep the linear and quadratic terms of the exponent (in x_2).

  53. The gory math
Correlated observations: assume that the random variables $t \in \mathbb{R}^n$, $t' \in \mathbb{R}^{n'}$ are jointly normal with mean $(\mu, \mu')$ and covariance matrix K:
$$ p(t, t') \propto \exp\!\left( -\frac{1}{2} \begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix}^\top \begin{bmatrix} K_{tt} & K_{tt'} \\ K_{tt'}^\top & K_{t't'} \end{bmatrix}^{-1} \begin{bmatrix} t - \mu \\ t' - \mu' \end{bmatrix} \right). $$
Inference: given t, estimate t' via p(t' | t). Translation into machine learning language: we learn t' from t.
Practical solution: since $t' \mid t \sim \mathcal{N}(\tilde{\mu}, \tilde{K})$, we only need to collect all terms in p(t, t') depending on t', by matrix inversion, hence
$$ \tilde{\mu} = \mu' + K_{tt'}^\top \underbrace{K_{tt}^{-1}(t - \mu)}_{\text{independent of } t'} \quad \text{and} \quad \tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1} K_{tt'} $$
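A small numerical sketch of this conditioning formula on a toy 3-dimensional joint Gaussian (all numbers are illustrative):

```python
import numpy as np

def condition_gaussian(mu, mu_p, K_tt, K_ttp, K_tptp, t):
    """Mean and covariance of t' | t for jointly Gaussian (t, t')."""
    A = np.linalg.solve(K_tt, K_ttp)       # K_tt^{-1} K_tt'
    mu_tilde = mu_p + A.T @ (t - mu)       # mu' + K_tt'^T K_tt^{-1} (t - mu)
    K_tilde = K_tptp - K_ttp.T @ A         # K_t't' - K_tt'^T K_tt^{-1} K_tt'
    return mu_tilde, K_tilde

# toy joint covariance over (t in R^2, t' in R^1)
K = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
mu_full = np.zeros(3)
t_obs = np.array([0.2, -0.1])
m, C = condition_gaussian(mu_full[:2], mu_full[2:], K[:2, :2], K[:2, 2:], K[2:, 2:], t_obs)
print(m, C)
```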

  54. Gaussian Process
Key idea: instead of a fixed set of random variables t, t' we assume a stochastic process t: X → R, e.g. X = R^n. Previously we had X = {age, height, weight, ...}.
Definition of a Gaussian process: a stochastic process t: X → R where all finite-dimensional vectors (t(x_1), ..., t(x_m)) are normally distributed.
Parameters of a GP: mean μ(x) := E[t(x)] and covariance function k(x, x') := Cov(t(x), t(x')).
Simplifying assumption: we assume knowledge of k(x, x') and set μ = 0.

  55. Kernels ...
Covariance function: a function of two arguments; leads to a matrix with nonnegative eigenvalues; describes the correlation between pairs of observations.
Kernel: a function of two arguments; leads to a matrix with nonnegative eigenvalues; a similarity measure between pairs of observations.
Lucky guess: we suspect that kernels and covariance functions are the same ...

  56. The connection
Gaussian process on parameters: t ~ N(μ, K) where K_ij = k(x_i, x_j).
Linear model in feature space: t(x) = ⟨Φ(x), w⟩ + μ(x) where w ~ N(0, 1).
The covariance between t(x) and t(x') is then
$$ \mathbf{E}_w\big[ \langle \Phi(x), w \rangle \langle w, \Phi(x') \rangle \big] = \langle \Phi(x), \Phi(x') \rangle = k(x, x') $$
Conclusion: a small weight vector in "feature space", as commonly used in SVMs, amounts to observing t with high p(t). The margin ‖w‖² corresponds to the log prior −log p(t). (We will get back to this later.)

  57. Regression

  58. Joint Gaussian Model
• Random variables (t, t') are drawn from a GP
• Observe a subset t of them
• Predict the rest using
$$ \tilde{\mu} = \mu' + K_{tt'}^\top K_{tt}^{-1} (t - \mu) \quad \text{and} \quad \tilde{K} = K_{t't'} - K_{tt'}^\top K_{tt}^{-1} K_{tt'} $$
• Linear expansion (precompute things)
• Predictive uncertainty is data independent: good for experimental design
• Predictive variance vanishes if K is rank deficient

  59. Some kernels
Observation: any function k leading to a symmetric matrix with nonnegative eigenvalues is a valid covariance function.
Necessary and sufficient condition (Mercer's theorem): k needs to be a nonnegative integral kernel.
Examples of kernels k(x, x'):

Linear              ⟨x, x'⟩
Laplacian RBF       exp(−λ‖x − x'‖)
Gaussian RBF        exp(−λ‖x − x'‖²)
Polynomial          (⟨x, x'⟩ + c)^d,  c ≥ 0, d ∈ N
B-Spline            B_{2n+1}(x − x')
Cond. Expectation   E_c[p(x | c) p(x' | c)]
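A quick sketch of a few of these kernels as plain functions of two vectors (NumPy; parameter values are illustrative):

```python
import numpy as np

def linear(x, xp):
    return x @ xp

def laplacian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - xp))

def gaussian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - xp) ** 2)

def polynomial(x, xp, c=1.0, d=3):
    return (x @ xp + c) ** d

x, xp = np.array([1.0, 0.0]), np.array([0.5, 0.5])
for k in (linear, laplacian_rbf, gaussian_rbf, polynomial):
    print(k.__name__, k(x, xp))
```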

  60. Linear 'GP regression'
Linear kernel k(x, x') = ⟨x, x'⟩, kernel matrix X^⊤X.
Mean and covariance:
$$ \tilde{K} = X'^\top X' - X'^\top X (X^\top X)^{-1} X^\top X' = X'^\top (\mathbf{1} - P_X) X' $$
$$ \tilde{\mu} = X'^\top \big[ X (X^\top X)^{-1} t \big] $$
μ̃ is a linear function of X'.
Problem: the covariance matrix X^⊤X has at most rank n. After n observations (x ∈ R^n) the variance vanishes. This is not realistic. "Flat pancake" or "cigar" distribution.

  61. Degenerate Covariance

  62. Additive Noise
Indirect model: instead of observing t(x) we observe y = t(x) + ξ, where ξ is a nuisance term. This yields
$$ p(Y \mid X) = \int \prod_{i=1}^m p(y_i \mid t_i) \, p(t \mid X) \, dt, $$
where we can now find a maximum a posteriori solution for t by maximizing the integrand (we will use this later).
Additive normal noise: if ξ ~ N(0, σ²) then y is the sum of two Gaussian random variables. Means and variances add up: y ~ N(μ, K + σ²·1).
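Putting the conditioning formula and the additive noise together, a minimal GP regression sketch (the Gaussian RBF kernel, its width and the noise level are illustrative choices):

```python
import numpy as np

def rbf_kernel(A, B, lam=10.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * sq)

def gp_predict(X, y, Xstar, lam=10.0, noise=0.1):
    """Posterior mean and variance of t(Xstar) given noisy observations y = t(X) + xi."""
    K = rbf_kernel(X, X, lam) + noise ** 2 * np.eye(len(X))   # K + sigma^2 * 1
    Ks = rbf_kernel(X, Xstar, lam)                            # K_{t t'}
    Kss = rbf_kernel(Xstar, Xstar, lam)                       # K_{t' t'}
    mean = Ks.T @ np.linalg.solve(K, y)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)

rng = np.random.RandomState(0)
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(4 * X[:, 0]) + 0.1 * rng.randn(20)
mean, var = gp_predict(X, y, np.linspace(0, 1, 5)[:, None])
print(mean, var)
```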

  63. Data
