Robust Sparse Quadratic Discrimination
Jianqing Fan, Princeton University, with Tracy Ke, Han Liu and Lucy Xia
May 2, 2014


  1. Robust Sparse Quadratic Discrimination. Jianqing Fan, Princeton University, with Tracy Ke, Han Liu and Lucy Xia. May 2, 2014. Jianqing Fan (Princeton University) Quadro

  2. Outline: 1 Introduction; 2 Rayleigh Quotient for sparse QDA; 3 Optimization Algorithm; 4 Application to Classification; 5 Theoretical Results; 6 Numerical Studies.

  3. Introduction: High-Dimensional Classification

  4. High-dimensional classification pervades all facets of machine learning and Big Data. Biomedicine: disease classification, predicting clinical outcomes, and biological processes using microarray or proteomics data. Machine learning: document/text classification, image classification. Social networks: community detection.

  5. Classification. Training data: $\{X_{i1}\}_{i=1}^{n_1}$ and $\{X_{i2}\}_{i=1}^{n_2}$ for classes 1 and 2. Aim: classify a new data point $X$ by $I\{f(X) < c\} + 1$. Family of functions $f$: linear, quadratic. Criterion for selecting $f$: logistic, hinge (convex surrogate). [Scatter plot of the two classes omitted.]

  6. A popular approach. Sparse linear classifiers: minimize classification errors (Bickel & Levina, 04; Fan & Fan, 08; Shao et al., 11; Cai & Liu, 11; Fan et al., 12). ⋆ Works well with Gaussian data with equal variance. ⋆ Powerless if the centroids are the same; no interactions considered. [Scatter plot omitted.] What about heteroscedastic variance? Non-Gaussian distributions?

  7. Other popular approaches. Plug-in quadratic discriminant: ⋆ needs $\Sigma_1^{-1}$, $\Sigma_2^{-1}$; ⋆ Gaussianity. Kernel SVM, logistic regression: ⋆ inadequate use of distributions; ⋆ few results; ⋆ interactions. Minimizing classification error directly: ⋆ non-convex; not easily computable.

  8. What's new today? 1 Find a quadratic rule that maximizes the Rayleigh quotient. 2 Non-equal covariance matrices. 3 Fourth cross-moments avoided using elliptical distributions. 4 Uniform estimation of means and variances for heavy tails.

  9. Rayleigh Quotient Optimization

  10. Rayleigh Quotient.
  $ Rq(f) = \frac{[E_1 f(X) - E_2 f(X)]^2}{\pi\,\mathrm{var}_1[f(X)] + (1-\pi)\,\mathrm{var}_2[f(X)]} \;\propto\; \frac{\text{between-class-var}}{\text{within-class-var}} $
  In the "classical" setting, $Rq(f)$ is equivalent to $\mathrm{Err}(f)$. In a "broader" setting, it is a surrogate for the classification error. Of independent scientific interest.

  11. Rayleigh quotient for quadratic loss. Quadratic projection: $Q_{\Omega,\delta}(X) = X^\top \Omega X - 2\delta^\top X$. With $\pi = P(Y=1)$ and $\kappa = \frac{1-\pi}{\pi}$, we have
  $ Rq(Q) \;\propto\; \frac{[D(\Omega,\delta)]^2}{V_1(\Omega,\delta) + \kappa V_2(\Omega,\delta)} = R(\Omega,\delta), $
  where $D(\Omega,\delta) = E_1 Q(X) - E_2 Q(X)$ and $V_k(\Omega,\delta) = \mathrm{var}_k(Q(X))$, $k = 1, 2$. Reduces to ROAD (Fan, Feng, Tong, 12) when linear.
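The quantities above can be estimated directly from two samples by replacing $D$, $V_1$, $V_2$ with their empirical versions. A minimal numpy sketch (function names are ours, for illustration only):

```python
import numpy as np

def quadratic_projection(X, Omega, delta):
    """Row-wise Q_{Omega,delta}(x) = x' Omega x - 2 delta' x."""
    return np.einsum("ij,jk,ik->i", X, Omega, X) - 2.0 * X @ delta

def rayleigh_quotient(X1, X2, Omega, delta, pi):
    """Empirical R(Omega, delta) = D^2 / (V1 + kappa * V2), kappa = (1-pi)/pi."""
    q1 = quadratic_projection(X1, Omega, delta)
    q2 = quadratic_projection(X2, Omega, delta)
    D = q1.mean() - q2.mean()
    kappa = (1.0 - pi) / pi
    return D ** 2 / (q1.var() + kappa * q2.var())
```

Note that the quotient is invariant to rescaling $(\Omega, \delta) \mapsto (c\Omega, c\delta)$, which is exactly the homogeneity used later to fix $D(\Omega,\delta) = 1$.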

  12. Challenge and Solution. Challenge: the variance involves all fourth cross-moments. Solution: consider the elliptical family, $X = \mu + \xi\,\Sigma^{1/2} U$ with $E\xi^2 = d$, written $X \sim \mathcal{E}(\mu, \Sigma, g)$.
  Theorem (Variance of Quadratic Form):
  $ \mathrm{var}(Q(X)) = 2(1+\gamma)\,\mathrm{tr}(\Omega\Sigma\Omega\Sigma) + \gamma\,[\mathrm{tr}(\Omega\Sigma)]^2 + 4(\Omega\mu - \delta)^\top \Sigma\,(\Omega\mu - \delta), $
  quadratic in $\Omega, \delta$, where $\gamma = \frac{E(\xi^4)}{d(d+2)} - 1$ is the kurtosis parameter.
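For Gaussian data $\gamma = 0$, so the theorem reduces to $\mathrm{var}(Q) = 2\,\mathrm{tr}(\Omega\Sigma\Omega\Sigma) + 4(\Omega\mu-\delta)^\top\Sigma(\Omega\mu-\delta)$. A quick Monte Carlo check of this closed form (the particular $\mu$, $\Sigma$, $\Omega$, $\delta$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 4
mu = np.array([1.0, -0.5, 0.0, 2.0])
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)          # a well-conditioned covariance
Omega = np.diag([1.0, 2.0, -1.0, 0.5])   # an arbitrary symmetric Omega
delta = np.array([0.5, 0.0, -1.0, 1.0])

# Closed form for Gaussian data (gamma = 0)
m = Omega @ mu - delta
closed = 2.0 * np.trace(Omega @ Sigma @ Omega @ Sigma) + 4.0 * m @ Sigma @ m

# Monte Carlo estimate of var(Q(X))
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Q = np.einsum("ij,jk,ik->i", X, Omega, X) - 2.0 * X @ delta
mc = Q.var()
```

With 200,000 draws the Monte Carlo variance agrees with the closed form to within a few percent.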

  13. Rayleigh Quotient under the elliptical family. Semiparametric model: two classes $\mathcal{E}(\mu_1, \Sigma_1, g)$ and $\mathcal{E}(\mu_2, \Sigma_2, g)$. $D$, $V_1$ and $V_2$ involve only $\mu_1, \mu_2, \Sigma_1, \Sigma_2$ and $\gamma$. Examples of $\gamma$:
  Gaussian: $\gamma = 0$;
  $t_\nu$: $\gamma = \frac{2}{\nu-4}$ (since the marginal excess kurtosis $6/(\nu-4)$ equals $3\gamma$);
  Contaminated Gaussian $(\omega, \tau)$: $\gamma = \frac{1 + \omega(\tau^4 - 1)}{(1 + \omega(\tau^2 - 1))^2} - 1$;
  Compound Gaussian $U(1,2)$: $\gamma = \frac{1}{6}$.

  14. Sparse quadratic solution. Simplification: using homogeneity,
  $ \mathrm{argmax}_{\Omega,\delta}\, \frac{[D(\Omega,\delta)]^2}{V_1(\Omega,\delta) + \kappa V_2(\Omega,\delta)} \;\propto\; \mathrm{argmin}_{D(\Omega,\delta)=1}\, \underbrace{V_1(\Omega,\delta) + \kappa V_2(\Omega,\delta)}_{V(\Omega,\delta)}. $
  Theorem (Sparsified version: $\Omega \in \mathbb{R}^{d \times d}$, $\delta \in \mathbb{R}^d$):
  $ \mathrm{argmin}_{(\Omega,\delta):\, D(\Omega,\delta)=1}\, V(\Omega,\delta) + \lambda_1 |\Omega|_1 + \lambda_2 |\delta|_1. $
  Applicable to the linear discriminant $\Rightarrow$ ROAD.

  15. Robust Estimation and Optimization Algorithm

  16. Robust Estimation of Mean. Problem: elliptical distributions can have heavy tails. Challenges: ⋆ the sample median $\not\approx$ the mean when the distribution is skewed (e.g. $EX^2$); ⋆ need uniform convergence for exponentially many $\sigma^2_{jj}$. How to estimate the mean with exponential concentration under heavy tails?


  18. Catoni's M-estimator. $\widehat{\mu}_j$ solves
  $ \sum_{i=1}^n h\big(\alpha_{n,d}(x_{ij} - \widehat{\mu}_j)\big) = 0, \qquad \alpha_{n,d} \to 0, $
  with $h$ strictly increasing and $-\log(1 - y + y^2/2) \le h(y) \le \log(1 + y + y^2/2)$;
  $ \alpha_{n,d} = \left\{ \frac{4\log(n \vee d)}{n\,[\,v + 4v\log(n \vee d)/(n - 4\log(n \vee d))\,]} \right\}^{1/2}, \qquad v \ge \max_j \sigma^2_{jj}. $
  Then $\max_j |\widehat{\mu}_j - \mu_j| = O_p\big(\sqrt{\log d / n}\big)$; needs only a bounded 2nd moment. [Plot of Catoni's influence function $h(\cdot)$ omitted.]
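A minimal sketch of a Catoni-type mean estimator for a single sequence (so $\log n$ stands in for $\log(n \vee d)$). We take the widest admissible influence function $h(y) = \mathrm{sign}(y)\log(1+|y|+y^2/2)$ and, purely for illustration, default $v$ to the sample variance, where the theory calls for an upper bound on the variance:

```python
import numpy as np

def catoni_mean(x, v=None):
    """Root of sum_i h(alpha * (x_i - theta)) = 0, found by bisection.

    h(y) = sign(y) * log(1 + |y| + y^2/2), and
    alpha = sqrt( 4 log n / ( n * [v + 4 v log(n) / (n - 4 log(n))] ) ).
    Requires n > 4 log n (roughly n >= 12)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if v is None:
        v = x.var(ddof=1)            # illustrative stand-in for a variance bound
    log_term = 4.0 * np.log(n)
    alpha = np.sqrt(log_term / (n * (v + v * log_term / (n - log_term))))

    def h(y):
        return np.sign(y) * np.log1p(np.abs(y) + 0.5 * y * y)

    def score(theta):                # strictly decreasing in theta
        return h(alpha * (x - theta)).sum()

    lo, hi = x.min(), x.max()        # score(lo) >= 0 >= score(hi)
    for _ in range(100):             # bisection on the monotone score
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Unlike the sample median, this targets the mean itself, yet keeps exponential concentration under only a second-moment condition.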

  19. Robust Estimation of $\Sigma_k$.
  1 $\widehat{\eta}_j = \widehat{EX_j^2}$: Catoni's M-estimator applied to $\{x^2_{1j}, \cdots, x^2_{nj}\}$.
  2 Variance estimation: for a small $\delta_0$, $\widehat{\sigma}^2_j = \widehat{\Sigma}_{jj} = \max\{\widehat{\eta}_j - \widehat{\mu}_j^2,\ \delta_0\}$.
  3 Off-diagonal elements: $\widehat{\Sigma}_{jk} = \widehat{\sigma}_j \widehat{\sigma}_k \sin(\pi \widehat{\tau}_{jk}/2)$ (a robust correlation), where $\widehat{\tau}_{jk}$ is Kendall's tau correlation (Liu et al., 12; Zou & Xue, 12).
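Step 3 can be sketched directly with scipy's Kendall's tau. The marginal scales `sigma` are passed in; in the procedure above they come from the robust variance estimates of steps 1–2, while the test below simply uses sample standard deviations as a stand-in:

```python
import numpy as np
from scipy.stats import kendalltau

def robust_covariance(X, sigma, delta0=1e-4):
    """Assemble Sigma-hat from marginal scales and Kendall's tau:
    Sigma_jk = sigma_j * sigma_k * sin(pi * tau_jk / 2)."""
    n, d = X.shape
    S = np.empty((d, d))
    for j in range(d):
        S[j, j] = max(sigma[j] ** 2, delta0)   # floored diagonal, as in step 2
        for k in range(j + 1, d):
            tau = kendalltau(X[:, j], X[:, k])[0]
            S[j, k] = S[k, j] = sigma[j] * sigma[k] * np.sin(np.pi * tau / 2.0)
    return S
```

The $\sin(\pi\tau/2)$ map is the standard Gaussian/elliptical link between Kendall's tau and the Pearson correlation, which is what makes the rank statistic usable here.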

  20. Projection into a nonnegative-definite matrix. $\widehat{\Sigma}$ is indefinite: use the sup-norm projection
  $ \widetilde{\Sigma} = \mathrm{argmin}_{A \succeq 0}\, |A - \widehat{\Sigma}|_\infty, $
  a convex optimization. Property: $|\widetilde{\Sigma} - \Sigma|_\infty \le 2\,|\widehat{\Sigma} - \Sigma|_\infty$. [Illustration of estimated, projected and true matrices omitted.]

  21. Robust Estimation of $\gamma$. Recall: $\gamma = \frac{E(\xi^4)}{d(d+2)} - 1$ and $E(\xi^4) = E\{[(X-\mu)^\top \Sigma^{-1} (X-\mu)]^2\}$. Intuitive estimator (also estimable for subvectors):
  $ \widehat{\gamma} = \max\left\{ \frac{1}{d(d+2)} \frac{1}{n} \sum_{i=1}^n \big[(X_i - \widehat{\mu})^\top \widehat{\Omega} (X_i - \widehat{\mu})\big]^2 - 1,\ 0 \right\}, $
  where $\widehat{\mu}$ and $\widehat{\Omega}$ are estimators of $\mu$ and $\Sigma^{-1}$ (CLIME; Cai et al., 11). Properties: $|\widehat{\gamma} - \gamma| \le C \max\{ |\widehat{\mu} - \mu|_\infty,\ |\widehat{\Omega} - \Sigma^{-1}|_\infty \}$.
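The estimator is a one-liner once $\widehat{\mu}$ and $\widehat{\Omega}$ are available. A sketch (the paper uses robust/CLIME estimates; the test below plugs in the sample mean and the identity for isotropic Gaussian data, where $\gamma = 0$):

```python
import numpy as np

def estimate_gamma(X, mu_hat, Omega_hat):
    """gamma-hat = max{ mean_i[ ((X_i - mu)' Omega (X_i - mu))^2 ] / (d(d+2)) - 1, 0 }."""
    d = X.shape[1]
    Z = X - mu_hat
    quad = np.einsum("ij,jk,ik->i", Z, Omega_hat, Z)   # row-wise quadratic forms
    return max(np.mean(quad ** 2) / (d * (d + 2)) - 1.0, 0.0)
```

For Gaussian data $E[((X-\mu)^\top\Sigma^{-1}(X-\mu))^2] = d(d+2)$ exactly, so the estimate should hover near zero.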

  22. Linearized Augmented Lagrangian. Target: $\min_{D(\Omega,\delta)=1} V(\Omega,\delta) + \lambda_1|\Omega|_1 + \lambda_2|\delta|_1$. Let
  $ F_\rho(\Omega, \delta, \nu) = V(\Omega,\delta) + \nu\,[D(\Omega,\delta) - 1] + \rho\,[D(\Omega,\delta) - 1]^2, $
  which is quadratic in $\Omega$ and $\delta$. Iterate:
  $ \Omega^{(1)} \Rightarrow \delta^{(1)} \Rightarrow \nu^{(1)} \Rightarrow \Omega^{(2)} \Rightarrow \delta^{(2)} \Rightarrow \nu^{(2)} \Rightarrow \cdots $

  23. Linearized Augmented Lagrangian: Details. Minimize $F_\rho(\Omega, \delta, \nu) + \lambda_1|\Omega|_1 + \lambda_2|\delta|_1$:
  $ \Omega^{(k)} = \mathrm{argmin}_\Omega\, F_\rho(\Omega, \delta^{(k-1)}, \nu^{(k-1)}) + \lambda_1|\Omega|_1 \quad \text{(soft-thresholding)}, $
  $ \delta^{(k)} = \mathrm{argmin}_\delta\, F_\rho(\Omega^{(k)}, \delta, \nu^{(k-1)}) + \lambda_2|\delta|_1 \quad \text{(LASSO)}, $
  $ \nu^{(k)} = \nu^{(k-1)} + 2\rho\,[D(\Omega^{(k)}, \delta^{(k)}) - 1]. $
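The scheme is easiest to see on a vectorized toy problem: minimize $x^\top A x + \lambda |x|_1$ subject to $d^\top x = 1$, i.e. the linear (ROAD-type) specialization of the target above. All names here are ours; the inner argmins are approximated by proximal-gradient (ISTA) steps, which is where the soft-thresholding comes from, and the multiplier update matches the $\nu^{(k)}$ rule above:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def augmented_lagrangian_l1(A, dvec, lam, rho=10.0, n_outer=200, n_inner=50):
    """min_x x'Ax + lam*|x|_1  s.t.  dvec'x = 1, via
    F_rho(x, nu) = x'Ax + nu*(d'x - 1) + rho*(d'x - 1)^2."""
    x, nu = np.zeros(len(dvec)), 0.0
    # Lipschitz constant of grad_x F_rho (Hessian is 2A + 2*rho*dd')
    L = 2.0 * np.linalg.norm(A, 2) + 2.0 * rho * dvec @ dvec
    for _ in range(n_outer):
        for _ in range(n_inner):                 # inner ISTA steps
            r = dvec @ x - 1.0
            grad = 2.0 * A @ x + nu * dvec + 2.0 * rho * r * dvec
            x = soft_threshold(x - grad / L, lam / L)
        nu = nu + 2.0 * rho * (dvec @ x - 1.0)   # multiplier update
    return x, nu
```

With $A = I$, $d = (1, 0)$ the constraint pins $x_1 = 1$ while the $\ell_1$ penalty zeroes out $x_2$, so the iterates should converge to $(1, 0)$.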

  24. Application to Classification

  25. Finding a Threshold. Where to cut the score $Q$?

  26. Finding a Threshold. Back to the approximation.
  ⋆ Classification rule: $I\{Z^\top \Omega Z - 2\delta^\top Z < c\} + 1$.
  ⋆ Reparametrization: $c = t\,M_1(\Omega,\delta) + (1-t)\,M_2(\Omega,\delta)$.
  ⋆ Minimize with respect to $t$ an approximated classification error:
  $ \mathrm{Err}(t) \equiv \pi\,\bar{\Phi}\!\left(\frac{(1-t)\,D(\Omega,\delta)}{\sqrt{V_1(\Omega,\delta)}}\right) + (1-\pi)\,\bar{\Phi}\!\left(\frac{t\,D(\Omega,\delta)}{\sqrt{V_2(\Omega,\delta)}}\right). $
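Since $\mathrm{Err}(t)$ is a smooth one-dimensional function on $[0,1]$, a grid search is enough in practice. A sketch using scipy's survival function for $\bar{\Phi}$ (function and argument names are ours):

```python
import numpy as np
from scipy.stats import norm

def optimal_threshold_t(D, V1, V2, pi, grid=None):
    """Minimize Err(t) = pi * Phi_bar((1-t) D / sqrt(V1))
                       + (1-pi) * Phi_bar(t D / sqrt(V2)) over a grid."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 1001)
    err = pi * norm.sf((1.0 - grid) * D / np.sqrt(V1)) \
        + (1.0 - pi) * norm.sf(grid * D / np.sqrt(V2))
    i = np.argmin(err)
    return grid[i], err[i]
```

In the symmetric case ($\pi = 1/2$, $V_1 = V_2$) the minimizer is $t^* = 1/2$, i.e. the cut sits halfway between the two class means of $Q$.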

  27. Overview of Our Procedure. Raw data → robust M-estimation and Kendall's tau correlation estimation → $\widehat{\mu}_1, \widehat{\mu}_2, \widehat{\Sigma}_1, \widehat{\Sigma}_2, \widehat{\gamma}$ → Rayleigh quotient optimization (a regularized convex program) → $(\widehat{\Omega}, \widehat{\delta})$ → find the threshold $c(t^*)$, where $t^*$ minimizes $\mathrm{Err}(\widehat{\Omega}, \widehat{\delta}, t)$ → quadratic classification rule: $f(\widehat{\Omega}, \widehat{\delta}, c(t^*)) = I(Z^\top \widehat{\Omega} Z - 2 Z^\top \widehat{\delta} < c(t^*))$.

  28. Theoretical Results


