
Generalization Error Analysis of Quantized Compressive Learning

  1. Generalization Error Analysis of Quantized Compressive Learning
     Xiaoyun Li and Ping Li
     Department of Statistics, Rutgers University; Cognitive Computing Lab, Baidu Research USA
     NeurIPS 2019

  2. Random Projection (RP) Method
     - Data matrix X ∈ R^{n×d}, normalized to unit norm (all samples on the unit sphere).
     - Save storage with k random projections: X_R = X R, where R ∈ R^{d×k} is a random matrix with i.i.d. N(0,1) entries, so X_R ∈ R^{n×k}.
     - J-L lemma: approximate distance preservation ⇒ many applications: clustering, classification, compressed sensing, dimensionality reduction, etc.
     - "Projection + quantization" saves even more storage: apply an entry-wise scalar quantization function Q(·) to get X_Q = Q(X_R). More applications: MaxCut, SimHash, 1-bit compressive sensing, etc. (See the sketch below.)
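
The pipeline on this slide is simple enough to sketch in a few lines. The snippet below is a minimal illustration, not the authors' code: it draws a hypothetical data matrix, projects it with a Gaussian R, and applies a simple entry-wise uniform quantizer as a stand-in for Q. The sizes n, d, k, the clipping range, and the helper name `uniform_quantizer` are arbitrary choices made for the example.

```python
# Minimal sketch of "projection + quantization" (illustrative, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 4862, 256                      # example sizes, chosen arbitrarily
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize rows: samples on the unit sphere

R = rng.standard_normal((d, k))                # i.i.d. N(0, 1) projection matrix
X_R = X @ R                                    # random projections, n x k

def uniform_quantizer(z, bits=3, limit=3.0):
    """Entry-wise b-bit uniform quantizer on [-limit, limit] (a simple stand-in for Q)."""
    levels = 2 ** bits
    step = 2.0 * limit / levels
    centers = -limit + step / 2 + step * np.arange(levels)
    idx = np.clip(np.floor((z + limit) / step), 0, levels - 1).astype(int)
    return centers[idx]

X_Q = uniform_quantizer(X_R, bits=3)           # quantized projections X_Q = Q(X_R)
```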

  3. Compressive Learning + Quantization
     - We can apply learning models to the projected data (X_R, Y), where Y is the response or label ⇒ learning in the projected space S_R. This is called compressive learning.
     - Learning in the projected space has been shown to provide satisfactory performance while substantially reducing the computational cost, especially for high-dimensional data.
     - We go one step further: learning with quantized random projections (X_Q, Y) ⇒ learning in the quantized projected space S_Q. This is called quantized compressive learning.
     - A relatively new topic, but practical in applications with data compression. (See the sketch below.)
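
As a concrete continuation of the previous sketch: once (X_Q, Y) is formed, quantized compressive learning is simply ordinary learning on the quantized features. The snippet below reuses the hypothetical X, X_Q, R, rng, d and `uniform_quantizer` from the sketch after slide 2, invents toy labels, and fits off-the-shelf scikit-learn models as stand-ins for the learners analyzed in the paper.

```python
# Sketch: quantized compressive learning = fit standard models on (X_Q, Y).
# Reuses X, X_Q, R, rng, d and uniform_quantizer from the previous sketch; labels are toy.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

y = (X[:, 0] > 0).astype(int)                               # made-up binary labels

knn_q = KNeighborsClassifier(n_neighbors=1).fit(X_Q, y)     # 1-NN in the quantized space S_Q
lin_q = LogisticRegression(max_iter=1000).fit(X_Q, y)       # linear classifier in S_Q

# A test point goes through the same projection + quantization before prediction:
x_new = rng.standard_normal(d)
x_new /= np.linalg.norm(x_new)
x_q = uniform_quantizer(x_new @ R, bits=3).reshape(1, -1)
print(knn_q.predict(x_q), lin_q.predict(x_q))
```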

  4. Paper Summary
     - We provide generalization error bounds (for a test sample x ∈ X) for three quantized compressive learning models:
       - Nearest neighbor classifier
       - Linear classifier (logistic regression, linear SVM, etc.)
       - Linear regression
     - Applications: we identify the factors that affect the generalization performance of each model, which gives recommendations on the choice of the quantizer Q in practice.
     - Some experiments are conducted to verify the theory.

  5. Background
     - A b-bit quantizer Q_b separates the real line into M = 2^b regions.
     - Distortion: D_{Q_b} = E[(Q_b(X) − X)^2], minimized by the Lloyd-Max (LM) quantizer.
     - Maximal gap of Q on an interval [a, b]: the largest gap between two consecutive borders of Q on [a, b].
     - The inner product between two samples x_1 and x_2 can be estimated by ρ̂_Q(x_1, x_2) = Q(x_1^T R) Q(R^T x_2) / k, which may be biased. The debiased variance of a quantizer Q is the variance of ρ̂_Q after debiasing.
     - Idea: connect the generalization of the three models to these inner product estimates. (See the sketch below.)
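
These background quantities are easy to approximate numerically. The sketch below is a Monte Carlo illustration under assumed Gaussian projections, not the paper's derivation: it fits an approximate Lloyd-Max codebook by 1-D Lloyd iterations, estimates the distortion D_Q, and simulates the estimator ρ̂_Q for a pair with cosine ρ to obtain its multiplicative bias α and an empirical debiased variance. All sizes and helper names are made up for the example.

```python
# Monte Carlo sketch of the background objects (illustrative approximations).
import numpy as np

rng = np.random.default_rng(1)

def lloyd_max_codebook(bits=3, n_samples=200_000, n_iter=50):
    """Approximate Lloyd-Max codebook for N(0,1) via 1-D Lloyd iterations."""
    z = rng.standard_normal(n_samples)
    centers = np.quantile(z, (np.arange(2 ** bits) + 0.5) / 2 ** bits)  # quantile init
    for _ in range(n_iter):
        borders = (centers[:-1] + centers[1:]) / 2     # borders between code words
        idx = np.searchsorted(borders, z)              # assign each sample to a cell
        centers = np.array([z[idx == j].mean() for j in range(2 ** bits)])
    return centers

def quantize(z, centers):
    """Entry-wise scalar quantizer defined by a sorted codebook."""
    borders = (centers[:-1] + centers[1:]) / 2
    return centers[np.searchsorted(borders, z)]

centers = lloyd_max_codebook(bits=3)
z = rng.standard_normal(1_000_000)
D_Q = np.mean((quantize(z, centers) - z) ** 2)         # distortion E[(Q(X) - X)^2]

def rho_hat_Q(rho, k, centers):
    """One draw of the estimator Q(x1^T R) Q(R^T x2) / k at cosine rho."""
    u = rng.standard_normal(k)                                        # entries of R^T x1
    v = rho * u + np.sqrt(1 - rho ** 2) * rng.standard_normal(k)      # entries of R^T x2
    return np.dot(quantize(u, centers), quantize(v, centers)) / k

k, rho = 256, 0.6
est = np.array([rho_hat_Q(rho, k, centers) for _ in range(2000)])
alpha = est.mean() / rho                               # multiplicative bias of the estimator
debiased_var = np.var(est / alpha)                     # debiased variance, roughly xi^2 / k
print(D_Q, alpha, debiased_var)
```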

  6. Quantized Compressive 1-NN Classifier
     - We are interested in the risk of a classifier h: L(h) = E[1{h(x) ≠ y}]. Assume (x, y) ∼ D, with conditional probability η(x) = P(y = 1 | x). The Bayes classifier h*(x) = 1{η(x) > 1/2} attains the minimal risk.
     - The quantized compressive 1-NN classifier is h_Q(x) = y_Q^(1), where (x_Q^(1), y_Q^(1)) is the nearest neighbor of x, with its label, in the quantized space S_Q. (See the sketch below.)
     - Theorem (generalization of the 1-NN classifier). Suppose (x, y) is a test sample and Q is a uniform quantizer with gap Δ between consecutive borders and maximal gap g_Q. Under some technical conditions and with some constants c_1, c_2, with high probability,
         E_{X,Y}[L(h_Q(x))] ≤ 2 L(h*(x)) + c_1 (Δ √(1+ω) + ω √k + c_2 Δ g_Q √k)^{k/(k+1)} (ne)^{−1/(k+1)} √(1−ω).
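
For concreteness, the quantized-space 1-NN rule h_Q defined above can be written directly. The helper below is an illustrative sketch that assumes the hypothetical X_Q, y, R, and quantizer from the earlier sketches are available as arguments.

```python
# Sketch of h_Q(x): predict the label of the nearest neighbor of Q(R^T x) in S_Q.
import numpy as np

def h_Q(x, X_Q, y, R, quantizer):
    x_q = quantizer(x @ R)                        # map the test point into the quantized space
    dists = np.linalg.norm(X_Q - x_q, axis=1)     # Euclidean distances in S_Q
    return y[np.argmin(dists)]                    # label of the 1-nearest neighbor
```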

  7. Quantized Compressive 1-NN Classifier: Asymptotics
     - Theorem (asymptotic error of the 1-NN classifier). Let ρ̂_Q(x_1, x_2) = Q(x_1^T R) Q(R^T x_2) / k be the cosine estimator, and assume that for all x_1, x_2, E[ρ̂_Q(x_1, x_2)] = α ρ_{x_1,x_2} for some α > 0. As k → ∞,
         E_{X,Y,R}[L(h_Q(x))] ≤ E_{X,Y}[L(h_S(x))] + r_k,
         r_k = E[ Σ_{i: x_i ∈ G} Φ( √k (cos(x, x_i) − cos(x, x^(1))) / √(ξ²_{x,x_i} + ξ²_{x,x^(1)} − 2 Corr(ρ̂_Q(x, x_i), ρ̂_Q(x, x^(1))) ξ_{x,x_i} ξ_{x,x^(1)}) ) ],
       where ξ²_{x,y}/k is the debiased variance of ρ̂_Q(x, y), G = X \ {x^(1)}, L(h_S(x)) is the risk of the data-space NN classifier, and Φ(·) is the CDF of N(0,1).
     - Let x^(1) be the nearest neighbor of a test sample x. Under mild conditions, a smaller debiased variance around ρ = cos(x, x^(1)) leads to a smaller generalization error.

  8. Quantized Compressive Linear Classifier with (0,1)-Loss
     - A linear classifier H separates the space by a hyperplane: H(x) = 1{h^T x > 0}. ERM classifiers: Ĥ(x) = 1{ĥ^T x > 0} and Ĥ_Q(x) = 1{ĥ_Q^T Q(R^T x) > 0}.
     - Theorem (generalization of the linear classifier). Under some technical conditions, with probability at least 1 − 2δ,
         Pr[Ĥ_Q(x) ≠ y] ≤ L̂^{(0,1)}(S, ĥ) + (1/(δn)) Σ_{i=1}^n f_{k,Q}(ρ_i) + C_{k,n,δ},
       where f_{k,Q}(ρ_i) = Φ(−√k |ρ_i| / ξ_{ρ_i}), ρ_i is the cosine between training sample x_i and the ERM classifier ĥ in the data space, and ξ²_{ρ_i}/k is the debiased variance of ρ̂_Q = Q(x_1^T R) Q(R^T x_2) / k at ρ_i.
     - A small debiased variance around ρ = 0 lowers the bound. (See the sketch below.)
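
The middle term of this bound is directly computable once the cosines ρ_i and the debiased-variance scales ξ_{ρ_i} are known. The snippet below is a sketch of that evaluation, written against the bound as reconstructed on this slide, with made-up inputs; SciPy's normal CDF plays the role of Φ.

```python
# Sketch: evaluate (1/(delta*n)) * sum_i Phi(-sqrt(k)|rho_i| / xi_i) from the bound above.
import numpy as np
from scipy.stats import norm

def bound_middle_term(rho, xi, k, delta):
    """rho, xi: per-training-sample cosines and debiased-variance scales (length-n arrays)."""
    f = norm.cdf(-np.sqrt(k) * np.abs(rho) / xi)   # f_{k,Q}(rho_i)
    return f.sum() / (delta * len(rho))

# toy inputs, for illustration only
rho = np.array([0.05, -0.20, 0.40])
xi = np.array([1.0, 0.9, 0.8])
print(bound_middle_term(rho, xi, k=256, delta=0.05))
```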

  9. Quantized Compressive Least Squares (QCLS) Regression
     - Fixed design: Y = X β + ε, with the x_i fixed and ε i.i.d. N(0, γ).
     - Population risks: L(β) = (1/n) E_Y[‖Y − X β‖²], L_Q(β_Q) = (1/n) E_{Y,R}[‖Y − Q(XR) β_Q‖²].
     - Empirical risks (given R): L̂(β) = (1/n) ‖Y − X β‖², L̂_Q(β_Q) = (1/n) ‖Y − (1/√k) Q(XR) β_Q‖².
     - Theorem (generalization of QCLS). Let β̂* = argmin_{β ∈ R^d} L̂(β) and β̂*_Q = argmin_{β_Q ∈ R^k} L̂_Q(β_Q). Let Σ = X^T X / k with k < n, and let D_Q be the distortion of Q. Then
         E_{Y,R}[L_Q(β̂*_Q)] − L(β*) ≤ γ k / n + (1/k) ‖β*‖²_Ω,   (1)
       where Ω = [ξ² / (1 − D_Q)² − 1] Σ + (1 + D_Q)/(1 − D_Q) I_d, and ‖w‖_Ω = √(w^T Ω w) is the Mahalanobis norm.
     - A smaller distortion lowers the error bound. (See the sketch below.)
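
A self-contained toy version of QCLS is easy to write down: generate a fixed design, project, quantize (here with a 1-bit sign quantizer purely for simplicity), rescale by 1/√k as in L̂_Q, and solve the least-squares problem. Everything in the snippet (sizes, noise level, the choice of sign as Q) is an illustrative assumption.

```python
# Toy sketch of quantized compressive least squares (QCLS) with a 1-bit quantizer.
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 1000, 200, 64                                  # made-up sizes with k < n
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # unit-norm rows
beta = rng.standard_normal(d)                            # synthetic coefficients
Y = X @ beta + np.sqrt(0.5) * rng.standard_normal(n)     # fixed design: Y = X beta + eps

R = rng.standard_normal((d, k))
Z = np.sign(X @ R) / np.sqrt(k)                          # Q(XR)/sqrt(k) with Q = sign (1 bit)
beta_Q, *_ = np.linalg.lstsq(Z, Y, rcond=None)           # minimize ||Y - Z beta_Q||^2

print("in-sample MSE:", np.mean((Y - Z @ beta_Q) ** 2))
```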

  10. Implications
     - 1-NN classification: in most applications, choose the quantizer whose inner product estimator ρ̂_Q = Q(R^T x)^T Q(R^T y) / k has small debiased variance in the high-similarity region. ⇒ Normalizing the quantized random projections (X_Q) may help; see Xiaoyun Li and Ping Li, "Random Projections with Asymmetric Quantization", NeurIPS 2019.
     - Linear classification: choose the quantizer whose inner product estimator ρ̂_Q = Q(R^T x)^T Q(R^T y) / k has small debiased variance around ρ = 0. ⇒ First choice: Lloyd-Max quantizer.
     - Linear regression: choose the quantizer with small distortion D_Q. ⇒ First choice: Lloyd-Max quantizer. (See the sketch below.)
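
To illustrate the last recommendation, the sketch below compares the Monte Carlo distortion D_Q of the approximate Lloyd-Max codebook against the uniform quantizer at the same bit budget, reusing the hypothetical `lloyd_max_codebook`, `quantize` (slide 5 sketch) and `uniform_quantizer` (slide 2 sketch) helpers; with matched bits, the Lloyd-Max codebook should show the smaller distortion on Gaussian data.

```python
# Sketch: distortion comparison supporting the "small D_Q" recommendation.
# Reuses lloyd_max_codebook, quantize (slide 5 sketch) and uniform_quantizer (slide 2 sketch).
import numpy as np

z = rng.standard_normal(1_000_000)
for bits in (1, 3):
    lm = quantize(z, lloyd_max_codebook(bits=bits))
    un = uniform_quantizer(z, bits=bits)
    print(f"b={bits}:  LM D_Q = {np.mean((lm - z) ** 2):.4f}"
          f"   uniform D_Q = {np.mean((un - z) ** 2):.4f}")
```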

  11. Experiments
     Dataset      # samples   # features   # classes   Mean 1-NN ρ
     BASEHOCK     1993        4862         2           0.6
     orlraws10P   100         10304        10          0.9
     [Figure 1: Empirical debiased variance of three quantizers (LM b=1, LM b=3, Uniform b=3) and the full-precision baseline, plotted against ρ. Mean 1-NN ρ is the estimated cos(x, x^(1)) from the training set.]

  12. Quantized Compressive 1-NN Classification
     - Claim: a smaller debiased variance around ρ = cos(x, x^(1)) is better.
     [Figure 2: Quantized compressive 1-NN classification. Test accuracy vs. number of projections (2^6 to 2^12) on BASEHOCK and orlraws10P, for the full-precision, LM b=1, LM b=3, and Uniform b=3 settings.]
     - The target ρ is around 0.6 for BASEHOCK, where the 1-bit quantizer has the largest debiased variance, and around 0.9 for orlraws10P, where the 1-bit quantizer has the smallest debiased variance.
     - The 1-bit quantizer may generalize better than using more bits!

  13. Quantized Compressive Linear SVM
     - Claim: a smaller debiased variance at ρ = 0 is better.
     [Figure 3: Quantized compressive linear SVM. Test accuracy vs. number of projections (2^6 to 2^12) on BASEHOCK and orlraws10P, for the full-precision, LM b=1, LM b=3, and Uniform b=3 settings.]
     - At ρ = 0, the red quantizer has a much larger debiased variance than the others ⇒ lowest test accuracy on both datasets.

  14. Quantized Compressive Linear Regression
     - Claim: a smaller distortion is better.
     [Figure 4: Test MSE of QCLS vs. number of projections (200 to 1000). Blue: uniform quantizers. Red: Lloyd-Max (LM) quantizers.]
     - The LM quantizer always outperforms the uniform quantizer.
     - The order of the test errors agrees with the order of the distortions.
