

  1. Analysis of Distributed Learning Algorithms
Ding-Xuan Zhou
City University of Hong Kong
E-mail: mazhou@cityu.edu.hk
Supported in part by the Research Grants Council of Hong Kong
November 5, 2016

  2. Outline of the Talk
I. Distributed learning with big data
II. Least squares regression and regularization
III. Distributed learning with regularization schemes
IV. Optimal rates for regularization
V. Other distributed learning algorithms
VI. Further topics

  3. I. Distributed learning with big data
Big data leads to scientific challenges: storage bottleneck, algorithmic scalability, ...
Distributed learning: based on a divide-and-conquer approach.
A distributed learning algorithm consists of three steps:
(1) partitioning the data into disjoint subsets;
(2) applying a learning algorithm, implemented on an individual machine or processor, to each data subset to produce an individual output;
(3) synthesizing a global output by taking some average of the individual outputs.
Advantages: reducing the memory and computing costs of handling big data.
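A minimal sketch of these three steps (my own illustration, not from the talk), in Python with a generic `local_learner` standing in for the base algorithm:

```python
import numpy as np

def distributed_learn(X, y, m, local_learner):
    """Divide-and-conquer: (1) partition, (2) learn locally, (3) average."""
    N = len(X)
    # (1) partition the data into m disjoint subsets
    subsets = np.array_split(np.random.permutation(N), m)
    # (2) apply the base learning algorithm to each subset D_j
    local_outputs = [local_learner(X[idx], y[idx]) for idx in subsets]
    # (3) synthesize a global output as the average of the individual outputs
    return lambda x_new: np.mean([f(x_new) for f in local_outputs], axis=0)

# purely illustrative base learner: predict the local sample mean of y
local_mean = lambda Xj, yj: (lambda x_new: np.full(len(x_new), yj.mean()))

X = np.random.rand(1000, 1)
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(1000)
f_bar = distributed_learn(X, y, m=10, local_learner=local_mean)
print(f_bar(np.random.rand(5, 1)))
```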

  4. If we divide a sample $D=\{(x_i,y_i)\}_{i=1}^N$ of input-output pairs into disjoint subsets $\{D_j\}_{j=1}^m$, applying a learning algorithm to the much smaller data subset $D_j$ gives an output $f_{D_j}$, and the global output might be $\overline{f}_D=\frac{1}{m}\sum_{j=1}^m f_{D_j}$.
The distributed learning method has been observed to be very successful in many practical applications. This raises a challenging theoretical question: if we had a "big machine" which could implement the same learning algorithm on the whole data set $D$ to produce an output $f_D$, could $\overline{f}_D$ be as efficient as $f_D$?
Recent work: Zhou-Chawla-Jin-Williams, Zhang-Duchi-Wainwright, Shamir-Srebro, ...

  5. II. Least squares regression and regularization
II.1. Model for the least squares regression. Learn $f: X \to Y$ from a random sample $D=\{(x_i,y_i)\}_{i=1}^N$.
Take $X$ to be a compact metric space and $Y=\mathbb{R}$; $y \approx f(x)$.
Due to noise or other uncertainty, we assume an (unknown) probability measure $\rho$ on $Z=X\times Y$ governs the sampling.
Marginal distribution $\rho_X$ on $X$: $x=\{x_i\}_{i=1}^N$ is drawn according to $\rho_X$; conditional distribution $\rho(\cdot|x)$ at $x\in X$.
Learning the regression function: $f_\rho(x)=\int_Y y\,d\rho(y|x)$, so that $y_i \approx f_\rho(x_i)$.
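As a toy numeric illustration (my own example, not from the talk): if $y=f^*(x)+\text{noise}$ with zero-mean noise, then $f_\rho(x)=\int_Y y\,d\rho(y|x)$ is just $f^*(x)$, and averaging repeated draws of $y$ at a fixed $x$ recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)   # plays the role of the regression function f_rho

x0 = 0.3                                   # a fixed input point
y_samples = f_star(x0) + 0.2 * rng.standard_normal(100_000)  # draws from rho(.|x0)
print(f_star(x0), y_samples.mean())        # the conditional mean recovers f_rho(x0)
```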

  6. II.2. Error decomposition and ERM
The least squares generalization error $\mathcal{E}^{ls}(f)=\int_Z (f(x)-y)^2\,d\rho$ is minimized by $f_\rho$:
$$\mathcal{E}^{ls}(f)-\mathcal{E}^{ls}(f_\rho)=\|f-f_\rho\|^2_{L^2_{\rho_X}}=:\|f-f_\rho\|^2_\rho\ \ge 0.$$
Classical approach of Empirical Risk Minimization (ERM): let $H$ be a compact subset of $C(X)$ called the hypothesis space (model selection). The ERM algorithm is given by
$$f_D=\arg\min_{f\in H}\mathcal{E}^{ls}_D(f), \qquad \mathcal{E}^{ls}_D(f)=\frac{1}{N}\sum_{i=1}^N (f(x_i)-y_i)^2.$$
Target function $f_H$: best approximation of $f_\rho$ in $H$,
$$f_H=\arg\min_{f\in H}\mathcal{E}^{ls}(f)=\arg\inf_{f\in H}\int_Z (f(x)-y)^2\,d\rho.$$
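A hedged sketch of ERM over a concrete finite-dimensional hypothesis space (the trigonometric basis and the synthetic data are my own choices, purely for illustration):

```python
import numpy as np

def erm_fit(x, y, degree=5):
    """ERM for least squares over H = span{1, cos(k*pi*x), sin(k*pi*x), k <= degree}."""
    def features(t):
        cols = [np.ones_like(t)]
        for k in range(1, degree + 1):
            cols += [np.cos(k * np.pi * t), np.sin(k * np.pi * t)]
        return np.column_stack(cols)
    # minimizing (1/N) * sum_i (f(x_i) - y_i)^2 over the span is ordinary least squares
    coef, *_ = np.linalg.lstsq(features(x), y, rcond=None)
    return lambda t: features(t) @ coef

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
f_D = erm_fit(x, y)
print(np.mean((f_D(x) - y) ** 2))          # empirical risk of the ERM output f_D
```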

  7. II.3. Approximation error
Analysis: $\|f_D-f_\rho\|^2_{L^2_{\rho_X}}=\int_X (f_D(x)-f_\rho(x))^2\,d\rho_X$ is bounded by
$$2\sup_{f\in H}\left|\mathcal{E}^{ls}_D(f)-\mathcal{E}^{ls}(f)\right|+\left\{\mathcal{E}^{ls}(f_H)-\mathcal{E}^{ls}(f_\rho)\right\}.$$
Approximation error:
$$\mathcal{E}^{ls}(f_H)-\mathcal{E}^{ls}(f_\rho)=\|f_H-f_\rho\|^2_{L^2_{\rho_X}}=\inf_{f\in H}\int_X (f(x)-f_\rho(x))^2\,d\rho_X,$$
so $f_H\approx f_\rho$ when $H$ is rich.
Theorem 1 (Smale-Zhou, Anal. Appl. 2003). Let $B$ be a Hilbert space (such as a Sobolev space or a reproducing kernel Hilbert space). If $B\subset L^2_{\rho_X}$ is dense and $\theta>0$, then
$$\inf_{\|f\|_B\le R}\|f-f_\rho\|_{L^2_{\rho_X}}=O(R^{-\theta})$$
if and only if $f_\rho$ lies in the interpolation space $(B, L^2_{\rho_X})_{\frac{\theta}{1+\theta},\infty}$.

  8. II.4. Examples of hypothesis spaces
Sobolev spaces: if $X\subset\mathbb{R}^n$, $\rho_X$ is the normalized Lebesgue measure, and $B$ is the Sobolev space $H^s$ with $s>n/2$, then $(H^s, L^2_{\rho_X})_{\frac{\theta}{1+\theta},\infty}$ is the Besov space $B^{\frac{\theta s}{1+\theta}}_{2,\infty}$, and $H^{\frac{\theta s}{1+\theta}}\subset B^{\frac{\theta s}{1+\theta}}_{2,\infty}\subset H^{\frac{\theta s}{1+\theta}-\epsilon}$ for any $\epsilon>0$.
Range of a power of the integral operator: if $K: X\times X\to\mathbb{R}$ is a Mercer kernel (continuous, symmetric and positive semidefinite), then the integral operator $L_K$ on $L^2_{\rho_X}$ is defined by
$$L_K(f)(x)=\int_X K(x,y)f(y)\,d\rho_X(y), \qquad x\in X.$$
The $r$-th power $L_K^r$ is well defined for any $r\ge 0$; its range for $r=1/2$ gives the RKHS $H_K=L_K^{1/2}(L^2_{\rho_X})$, and for $0<r\le 1/2$,
$$L_K^r(L^2_{\rho_X})\subset (H_K, L^2_{\rho_X})_{2r,\infty} \quad\text{and}\quad (H_K, L^2_{\rho_X})_{2r,\infty}\subset L_K^{r-\epsilon}(L^2_{\rho_X})$$
for any $\epsilon>0$ when the support of $\rho_X$ is $X$. So we may assume $f_\rho=L_K^r(g_\rho)$ for some $r>0$, $g_\rho\in L^2_{\rho_X}$.
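The operator $L_K$ and its eigenvalues $\lambda_i$ are population quantities, but they can be approximated from a sample: the eigenvalues of the normalized Gram matrix $\frac{1}{n}\big(K(x_i,x_j)\big)_{i,j}$ estimate those of $L_K$. A small sketch, assuming a Gaussian kernel (the kernel and bandwidth are my choices, not specified in the talk):

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=0.2):
    """Mercer kernel K(x, z) = exp(-|x - z|^2 / (2 sigma^2)); sigma is an arbitrary choice."""
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(500, 1))       # a sample drawn from rho_X
G = gaussian_kernel(X, X) / len(X)         # (1/n) K(x_i, x_j): empirical version of L_K
empirical_eigs = np.sort(np.linalg.eigvalsh(G))[::-1]
print(empirical_eigs[:10])                 # approximations of lambda_1, lambda_2, ...
```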

  9. II.5. Least squares regularization
$$f_{D,\lambda}:=\arg\min_{f\in H_K}\left\{\frac{1}{N}\sum_{i=1}^N (f(x_i)-y_i)^2+\lambda\|f\|_K^2\right\}, \qquad \lambda>0.$$
A large literature in learning theory: books by Vapnik, Schölkopf-Smola, Wahba, Anthony-Bartlett, Shawe-Taylor-Cristianini, Steinwart-Christmann, Cucker-Zhou, ...; many papers: Cucker-Smale, Zhang, De Vito-Caponnetto-Rosasco, Smale-Zhou, Lin-Zeng-Fang-Xu, Yao, Chen-Xu, Shi-Feng-Zhou, Wu-Ying-Zhou, ...
Ingredients of the error analysis: regularity of $f_\rho$; complexity of $H_K$ (covering numbers, decay of the eigenvalues $\{\lambda_i\}$ of $L_K$, effective dimension, ...); decay of $y$ ($|y|\le M$, exponential decay, moment decaying condition, $E[|y|^q]<\infty$ for some $q>2$, $\sigma^2_\rho\in L^p_{\rho_X}$ for the conditional variance $\sigma^2_\rho(x)=\int_Y (y-f_\rho(x))^2\,d\rho(y|x)$, ...).
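By the representer theorem this minimizer is kernel ridge regression: $f_{D,\lambda}(x)=\sum_i \alpha_i K(x,x_i)$ with $\alpha=(\mathbf{K}+\lambda N I)^{-1}y$, where $\mathbf{K}=(K(x_i,x_j))_{i,j}$. A minimal sketch, again with an illustrative Gaussian kernel and synthetic data of my own choosing:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=0.2):
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krr_fit(X, y, lam, kernel=gaussian_kernel):
    """f_{D,lam} = argmin_{f in H_K} (1/N) sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2."""
    N = len(X)
    K = kernel(X, X)
    alpha = np.linalg.solve(K + lam * N * np.eye(N), y)   # representer coefficients
    return lambda X_new: kernel(X_new, X) @ alpha

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(300)
f_D_lam = krr_fit(X, y, lam=1e-3)
print(np.mean((f_D_lam(X) - y) ** 2))      # training error of the regularized estimator
```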

  10. III. Distributed learning with regularization schemes
Joint work with S. B. Lin and X. Guo (under major revision for JMLR).
Distributed learning with the data being a disjoint union $D=\cup_{j=1}^m D_j$:
$$\overline{f}_{D,\lambda}=\sum_{j=1}^m \frac{|D_j|}{|D|}\, f_{D_j,\lambda}.$$
Define the effective dimension to measure the complexity of $H_K$ with respect to $\rho_X$ as
$$\mathcal{N}(\lambda)=\mathrm{Tr}\left((L_K+\lambda I)^{-1}L_K\right)=\sum_i \frac{\lambda_i}{\lambda_i+\lambda}, \qquad \lambda>0.$$
Note that $\lambda_i=O(i^{-2\alpha})$ implies $\mathcal{N}(\lambda)=O(\lambda^{-\frac{1}{2\alpha}})$.
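A sketch of the weighted combination and of the effective dimension computed from (approximate) eigenvalues; the kernel, data, and helper functions are illustrative assumptions mirroring the previous sketch:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=0.2):
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krr_fit(X, y, lam):
    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
    return lambda X_new: gaussian_kernel(X_new, X) @ alpha

def distributed_krr(X, y, m, lam):
    """bar f_{D,lam} = sum_j (|D_j| / |D|) * f_{D_j,lam} over a disjoint partition."""
    N = len(X)
    parts = np.array_split(np.random.permutation(N), m)
    local_fits = [(len(idx), krr_fit(X[idx], y[idx], lam)) for idx in parts]
    return lambda X_new: sum((n_j / N) * f_j(X_new) for n_j, f_j in local_fits)

def effective_dimension(eigvals, lam):
    """N(lam) = Tr((L_K + lam I)^{-1} L_K) = sum_i lambda_i / (lambda_i + lam)."""
    return np.sum(eigvals / (eigvals + lam))

# eigenvalue decay lambda_i = i^{-2 alpha} makes N(lam) of order lam^{-1/(2 alpha)}
alpha, lam = 1.0, 1e-3
eigvals = np.arange(1, 10 ** 6, dtype=float) ** (-2 * alpha)
print(effective_dimension(eigvals, lam), lam ** (-1 / (2 * alpha)))
```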

  11. III.1. Error analysis for distributed learning
Theorem 2. Assume $|y|\le M$ and $f_\rho=L_K^r(g_\rho)$ for some $0\le r\le\frac{1}{2}$ and $g_\rho\in H_K$. If $\mathcal{N}(\lambda)=O(\lambda^{-\frac{1}{2\alpha}})$ for some $\alpha>0$, $|D_j|=\frac{N}{m}$ for $j=1,\ldots,m$, and $m\le N^{\min\left\{\frac{12\alpha r+1}{5(4\alpha r+2\alpha+1)},\ \frac{4\alpha r}{4\alpha r+2\alpha+1}\right\}}$, then by taking $\lambda=N^{-\frac{2\alpha}{2\alpha+4\alpha r+1}}$ we have
$$E\left[\left\|\overline{f}_{D,\lambda}-f_\rho\right\|_\rho\right]=O\left(N^{-\frac{\alpha+2\alpha r}{2\alpha+4\alpha r+1}}\right).$$
If $f_\rho\in H_K$ and $m\le N^{\frac{2\alpha}{4+6\alpha}}$, the choice $\lambda=\left(\frac{m}{N}\right)^{\frac{1}{2\alpha+1}}$ yields
$$E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_\rho\right]=O\left(N^{-\frac{\alpha}{2\alpha+1}}\,m^{-\frac{1}{4\alpha+2}}\right)
\quad\text{and}\quad
E\left[\left\|\overline{f}_{D,\lambda}-f_{D,\lambda}\right\|_K\right]=O\left(\frac{1}{\sqrt{m}}\right).$$

  12. III.2. Previous work
Zhang-Duchi-Wainwright (2015): if the normalized eigenfunctions $\{\varphi_i\}_i$ of $L_K$ on $L^2_{\rho_X}$ satisfy
$$\|\varphi_i\|^{2k}_{L^{2k}_{\rho_X}}=E\left[|\varphi_i(x)|^{2k}\right]\le A^{2k}, \qquad i=1,2,\ldots,$$
for some constants $k>2$ and $A<\infty$, $f_\rho\in H_K$ and $\lambda_i=O(i^{-2\alpha})$ for some $\alpha>1/2$, then $E\left[\|\overline{f}_{D,\lambda}-f_\rho\|^2_\rho\right]=O\left(N^{-\frac{2\alpha}{2\alpha+1}}\right)$ when $\lambda=N^{-\frac{2\alpha}{2\alpha+1}}$ and $m=O\left(\left(N^{\frac{2(k-4)\alpha-k}{2\alpha+1}}/(A^{4k}\log^k N)\right)^{\frac{1}{k-2}}\right)$.
An example of a $C^\infty$ Mercer kernel without uniform boundedness of the eigenfunctions: Zhou (2002).
Advantages of our analysis:
(1) general results without any eigenfunction assumption;
(2) error estimates in the $H_K$ metric (Smale-Zhou 2007);
(3) a novel second order decomposition applicable to other algorithms.

  13. IV. Optimal rates for regularization: by-product
Caponnetto-De Vito (2007): if $\lambda_i\approx i^{-2\alpha}$ with some $\alpha>1/2$, then with $\lambda=\left(\frac{\log N}{N}\right)^{\frac{2\alpha}{2\alpha+1}}$,
$$\lim_{\tau\to\infty}\ \limsup_{N\to\infty}\ \mathrm{Prob}\left\{\left\|f_{D,\lambda_N}-f_\rho\right\|^2_\rho\le\tau\left(\frac{\log N}{N}\right)^{\frac{2\alpha}{2\alpha+1}}\right\}=1.$$
Steinwart-Hush-Scovel (2009): if $\lambda_i=O\left(i^{-2\alpha}\right)$ with some $\alpha>1/2$, and for some constant $C>0$ the pair $(K,\rho_X)$ satisfies
$$\|f\|_\infty\le C\,\|f\|_K^{\frac{1}{2\alpha}}\,\|f\|_\rho^{1-\frac{1}{2\alpha}}, \qquad \forall f\in H_K,$$
then with $\lambda=N^{-\frac{2\alpha}{2\alpha+1}}$,
$$E\left[\left\|\pi_M f_{D,\lambda}-f_\rho\right\|^2_\rho\right]=O\left(N^{-\frac{2\alpha}{2\alpha+1}}\right).$$
Here $\pi_M$ is the projection onto the interval $[-M,M]$.
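The projection $\pi_M$ amounts to clipping the predictions to $[-M,M]$ (harmless here, since $|y|\le M$ forces $|f_\rho|\le M$); a one-line sketch:

```python
import numpy as np

def pi_M(predictions, M):
    """Project values onto [-M, M]; valid since |y| <= M implies |f_rho| <= M."""
    return np.clip(predictions, -M, M)

print(pi_M(np.array([-3.0, 0.5, 2.7]), M=1.0))   # -> [-1.   0.5  1. ]
```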
