Sparse Shrunk Additive Models
Guodong Liu (University of Pittsburgh), Hong Chen (Huazhong Agricultural University), Heng Huang (University of Pittsburgh)
June 14, 2020
1. Motivation
Deep models have made great progress in learning from large datasets; however, statistical models can perform better on small ones and usually offer better interpretability.
◮ Linear model.
  ◮ The linear assumption is too restrictive.
  ◮ Real applications are often non-linear.
◮ Generalized additive model.
  ◮ Nonparametric extension of linear models.
  ◮ Flexible and adaptive to high-dimensional data.
  ◮ Univariate smooth component functions.
  ◮ Pre-defined group structure information.
2. Contribution
◮ Propose a unified framework that bridges sparse feature selection, sparse sample selection, and feature interaction structure learning.
◮ Provide a generalization bound on the excess risk under mild conditions, which implies that a fast convergence rate can be achieved.
◮ Derive a necessary and sufficient condition characterizing the sparsity of SSAM.
3. Algorithm: Sparse Shrunk Additive Models
◮ Let $\mathcal{X} \subset \mathbb{R}^n$ be an explanatory feature space and let $\mathcal{Y} \subset [-1, 1]$ be the response set. Let $\mathbf{z} := \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m$ be independent copies of a random sample $(x, y)$ following an unknown intrinsic distribution $\rho$ on $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$.
◮ For any given $1 \le k \le n$ and index set $\{1, 2, \dots, n\}$, denote by $d = \binom{n}{k}$ the number of index subsets with $k$ elements. Let $x^{(j)} \in \mathbb{R}^k$ be the restriction of $x$ to the $j$-th subset of $k$ features and denote its corresponding space by $\mathcal{X}^{(j)}$.
◮ Let $K^{(j)} : \mathcal{X}^{(j)} \times \mathcal{X}^{(j)} \to \mathbb{R}$ be a continuous function satisfying $\|K^{(j)}\|_\infty < +\infty$.
◮ For any given $\mathbf{z}$, define the data-dependent hypothesis space as
$$\mathcal{H}_{\mathbf{z}} = \Big\{ f : f(x) = \sum_{j=1}^d f^{(j)}(x^{(j)}),\ f^{(j)} \in \mathcal{H}_{\mathbf{z}}^{(j)} \Big\}, \quad \text{where } \mathcal{H}_{\mathbf{z}}^{(j)} = \Big\{ f^{(j)} = \sum_{i=1}^m \alpha_i^{(j)} K^{(j)}(x_i^{(j)}, \cdot) : \alpha_i^{(j)} \in \mathbb{R} \Big\}.$$
◮ Denote $\|f^{(j)}\|_{\ell_1} = \inf\Big\{ \sum_{t=1}^m |\alpha_t^{(j)}| : f^{(j)} = \sum_{t=1}^m \alpha_t^{(j)} K^{(j)}(x_t^{(j)}, \cdot) \Big\}$, and $\|f\|_{\ell_1} := \sum_{j=1}^d \|f^{(j)}\|_{\ell_1}$ for $f = \sum_{j=1}^d f^{(j)}$. (A construction sketch of the per-subset kernels follows below.)
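As a concrete illustration of this hypothesis space, here is a minimal Python sketch that builds one kernel matrix per size-$k$ feature subset. The Gaussian kernel, the bandwidth `sigma`, and the function name `subset_kernel_matrices` are illustrative assumptions of ours; the slides only require each $K^{(j)}$ to be continuous and bounded.

```python
# A minimal sketch (not the authors' code) of the building blocks of H_z:
# one kernel matrix per size-k feature subset.
from itertools import combinations
import numpy as np

def subset_kernel_matrices(X, k=2, sigma=1.0):
    """Return {subset: K_j} with K_j[i, t] = K^(j)(x_i^(j), x_t^(j))."""
    m, n = X.shape
    kernels = {}
    for subset in combinations(range(n), k):      # the d = C(n, k) index subsets
        Xj = X[:, subset]                         # x^(j): restriction of x to k features
        sq_dists = ((Xj[:, None, :] - Xj[None, :, :]) ** 2).sum(axis=-1)
        kernels[subset] = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return kernels

# Any f in H_z is then parameterized by the d*m coefficients alpha_i^(j):
#   f(x) = sum_j sum_i alpha_i^(j) K^(j)(x_i^(j), x^(j)).
```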
3. Algorithm: Sparse Shrunk Additive Models
Predictor of SSAM:
$$f_{\mathbf{z}} = \sum_{j=1}^d \hat{f}_{\mathbf{z}}^{(j)} = \sum_{j=1}^d \sum_{t=1}^m \hat{\alpha}_t^{(j)} K^{(j)}(x_t^{(j)}, \cdot),$$
where, for $1 \le t \le m$ and $1 \le j \le d$,
$$\{\hat{\alpha}_t^{(j)}\} = \arg\min_{\alpha_t^{(j)} \in \mathbb{R},\, t,\, j}\ \frac{1}{m} \sum_{i=1}^m \Big( y_i - \sum_{j=1}^d \sum_{t=1}^m \alpha_t^{(j)} K^{(j)}(x_t^{(j)}, x_i^{(j)}) \Big)^2 + \lambda \sum_{j=1}^d \sum_{t=1}^m |\alpha_t^{(j)}|. \quad (1)$$
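Problem (1) is an $\ell_1$-penalized least-squares problem over the $d \cdot m$ coefficients $\alpha_t^{(j)}$, so one way to solve it is to stack the $d$ kernel matrices column-wise and run an off-the-shelf Lasso solver. The sketch below does that with scikit-learn; it illustrates the optimization problem and is not the authors' implementation. The mapping `alpha = lam / 2` only reconciles scikit-learn's $\frac{1}{2m}$ scaling of the squared loss with the $\frac{1}{m}$ scaling in (1).

```python
# A hedged sketch: SSAM's fit in (1) as a Lasso over stacked kernel columns.
import numpy as np
from sklearn.linear_model import Lasso

def fit_ssam(kernels, y, lam=0.1):
    subsets = list(kernels)                          # fixed ordering of the d subsets
    Phi = np.hstack([kernels[s] for s in subsets])   # design matrix, shape (m, d*m)
    # sklearn's Lasso minimizes (1/(2m))||y - Phi a||^2 + alpha*||a||_1,
    # so alpha = lam/2 matches the (1/m)-scaled squared loss in (1).
    model = Lasso(alpha=lam / 2.0, fit_intercept=False, max_iter=50000)
    model.fit(Phi, y)
    m = len(y)
    # split the flat coefficient vector back into the per-subset alpha^(j)
    alphas = {s: model.coef_[i * m:(i + 1) * m] for i, s in enumerate(subsets)}
    return alphas
```

Subsets whose coefficient block $\alpha^{(j)}$ is driven entirely to zero by the $\ell_1$ penalty are discarded, which is where the sparse selection of feature interactions comes from.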
3. Algorithm: Sparse Shrunk Additive Models
SSAM from the viewpoint of function approximation:
$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_{\mathbf{z}}} \Big\{ \frac{1}{m} \sum_{i=1}^m (y_i - f(x_i))^2 + \lambda \|f\|_{\ell_1} \Big\}.$$
4. Theoretical Analysis: Assumptions
Assumption 1: Assume that $f_\rho = \sum_{j=1}^d f_\rho^{(j)}$, where for each $j \in \{1, 2, \dots, d\}$, $f_\rho^{(j)} : \mathcal{X}^{(j)} \to \mathbb{R}$ is a function of the form $f_\rho^{(j)} = L_{\tilde{K}^{(j)}}^{r}\big(g_\rho^{(j)}\big)$ with some $r > 0$ and $g_\rho^{(j)} \in L^2_{\rho_{\mathcal{X}^{(j)}}}$.
Assumption 2: For each $j \in \{1, 2, \dots, d\}$, the kernel function $K^{(j)} : \mathcal{X}^{(j)} \times \mathcal{X}^{(j)} \to \mathbb{R}$ is $C^s$ with some $s > 0$ satisfying
$$|K^{(j)}(u, v) - K^{(j)}(u, v')| \le c_s \|v - v'\|_2^{s}, \quad \forall\, u, v, v' \in \mathcal{X}^{(j)},$$
for some positive constant $c_s$.
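For intuition, Assumption 2 with $s = 1$ is an ordinary Lipschitz condition in the second argument, which the Gaussian kernel used later in the experiments satisfies. The snippet below is a purely illustrative numerical check of that inequality; the constant $1/(\sigma\sqrt{e})$ quoted in the comment is the standard Lipschitz constant of the Gaussian kernel, not a quantity stated on the slides.

```python
# Illustrative check of Assumption 2 (s = 1) for a Gaussian kernel.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
K = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

ratios = []
for _ in range(10000):
    u, v, vp = rng.normal(size=(3, 2))   # random points in a k = 2 dimensional X^(j)
    ratios.append(abs(K(u, v) - K(u, vp)) / np.linalg.norm(v - vp))

# The empirical ratio stays below the Gaussian kernel's Lipschitz constant
# 1/(sigma*sqrt(e)) ~= 0.607, so the constant c_s in Assumption 2 is finite.
print(max(ratios))
```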
4. Theoretical Analysis: Theorems
Theorem 1. Let Assumptions 1 and 2 hold. For any $0 < \delta < 1$, with confidence $1 - \delta$, there exists a positive constant $\tilde{c}_1$ independent of $m, \delta$ such that:
(1) If $r \in (0, \frac{1}{2})$ in Assumption 1, setting $\lambda = m^{-\theta_1}$ with $\theta_1 \in (0, \frac{2}{2+p})$,
$$\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le \tilde{c}_1 \log(8/\delta)\, m^{-\gamma_1},$$
where $\gamma_1 = \min\Big\{ 2r\theta_1,\ \frac{1 - \theta_1 + 2r\theta_1}{2},\ \frac{2(1 - p\theta_1)}{2+p} - (2 - 2r)\theta_1 \Big\}$.
(2) If $r \ge \frac{1}{2}$ in Assumption 1, taking $\lambda = m^{-\theta_2}$ with some $\theta_2 \in (0, \frac{2}{2+p})$,
$$\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le \tilde{c}_1 \log(8/\delta)\, m^{-\gamma_2},$$
where $\gamma_2 = \min\Big\{ \theta_2,\ \frac{1}{2},\ \frac{2}{2+p} - \theta_2 \Big\}$.
4. Theoretical Analysis: Remark
◮ Theorem 1 provides an upper bound on the generalization error of SSAM with a Lipschitz continuous kernel.
◮ For $r \in (0, \frac{1}{2})$, as $s \to \infty$ we have $\gamma_1 \to \min\{ 2r\theta_1,\ \frac{1}{2} + (r - \frac{1}{2})\theta_1,\ 1 - 2\theta_1 + 2r\theta_1 \}$.
◮ When $r \to \frac{1}{2}$ and $\theta_1 \to \frac{1}{2}$, the convergence rate $O(m^{-\frac{1}{2}})$ can be reached.
◮ For $r \ge \frac{1}{2}$, taking $\theta_2 = \frac{1}{2+p}$, we obtain the convergence rate $O(m^{-\frac{1}{2+p}})$. (A small numerical check of these exponents follows below.)
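The exponents $\gamma_1$ and $\gamma_2$ are easy to evaluate numerically. The helper below assumes the expressions as reconstructed in Theorem 1 above and simply reproduces the limiting cases discussed in this remark; the function names and example values are ours, not the paper's.

```python
# Illustrative evaluation of the learning-rate exponents of Theorem 1.
def gamma1(r, theta1, p):
    """Exponent for case (1): r in (0, 1/2), theta1 in (0, 2/(2+p))."""
    return min(2 * r * theta1,
               (1 - theta1 + 2 * r * theta1) / 2,
               2 * (1 - p * theta1) / (2 + p) - (2 - 2 * r) * theta1)

def gamma2(theta2, p):
    """Exponent for case (2): r >= 1/2, theta2 in (0, 2/(2+p))."""
    return min(theta2, 0.5, 2 / (2 + p) - theta2)

# Limiting behaviour discussed in the remark:
print(gamma1(r=0.499, theta1=0.499, p=1e-6))   # ~0.5, i.e. rate close to O(m^{-1/2})
print(gamma2(theta2=1 / 3, p=1.0))             # = 1/3 = 1/(2+p) when theta2 = 1/(2+p)
```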
4. Theoretical Analysis: Theorems
Theorem 2. Assume that $f_\rho^{(j)} \in \mathcal{H}^{(j)}$ for each $1 \le j \le d$. Take $\lambda = m^{-\frac{2}{2+3p}}$ in (1). For any $0 < \delta < 1$, with confidence $1 - \delta$ we have
$$\mathcal{E}(\pi(f_{\mathbf{z}})) - \mathcal{E}(f_\rho) \le \tilde{c}_2 \log(1/\delta)\, m^{-\frac{2}{2+3p}},$$
where $\tilde{c}_2$ is a positive constant independent of $m, \delta$.
◮ This result covers the special case where $f_\rho^{(j)} \in \mathcal{H}^{(j)}$.
◮ Under this strong condition on $f_\rho$, the convergence rate can be arbitrarily close to $O(m^{-1})$ as $s \to \infty$.
5. Empirical Evaluation: Synthetic Data Setting
◮ Pairwise interaction setting: $k = 2$, $d = \binom{n}{2}$.
◮ Each kernel on $\mathcal{X}^{(j)}$ is a Gaussian kernel. (A data-generation sketch follows below.)
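The following is a hedged sketch of a pairwise-interaction synthetic setup in this spirit. The particular component functions, noise level, and sample sizes are illustrative assumptions rather than the paper's exact protocol, and it reuses the hypothetical `subset_kernel_matrices` / `fit_ssam` helpers sketched earlier.

```python
# Illustrative pairwise-interaction synthetic data (not the paper's exact setup).
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 10
X = rng.uniform(-1.0, 1.0, size=(m, n))

# Two of the d = C(10, 2) = 45 feature pairs actually drive the response.
signal = (np.sin(X[:, 0] * X[:, 1])              # active pair (0, 1)
          + np.exp(-(X[:, 2] - X[:, 3]) ** 2))   # active pair (2, 3)
y = np.clip(0.5 * signal + 0.1 * rng.normal(size=m), -1.0, 1.0)  # keep Y in [-1, 1]

# With the earlier (hypothetical) helpers, SSAM with k = 2 should recover the
# two active pairs and zero out the remaining 43 candidates:
#   kernels = subset_kernel_matrices(X, k=2)
#   alphas = fit_ssam(kernels, y, lam=0.05)
```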