
Large-Scale Machine Learning
Shan-Hung Wu (shwu@cs.nthu.edu.tw)
Department of Computer Science, National Tsing Hua University, Taiwan


  1. Outline
    1 When ML Meets Big Data
    2 Advantages of Deep Learning
        Representation Learning
        Exponential Gain of Expressiveness
        Memory and GPU Friendliness
        Online & Transfer Learning
    3 Learning Theory Revisited
        Generalizability and Over-Parametrization
        Wide-and-Deep NN is a Gaussian Process before Training*
        Gradient Descent is an Affine Transformation*
        Wide-and-Deep NN is a Gaussian Process after Training*

  2-4. Curse of Dimensionality
    Most classic nonlinear ML models find θ by assuming function smoothness: if x ∼ x^(i) ∈ X, then f(x; w) ∼ f(x^(i); w)
    E.g., non-parametric methods predict the label ŷ of x by simply interpolating the labels of the examples x^(i) close to x:
      ŷ = Σ_i α_i y^(i) k(x^(i), x) + b, where k(x^(i), x) = exp(−γ ‖x^(i) − x‖²)
    If f is assumed smooth within each bin of the input space, we need exponentially more examples to get a good interpolation as the dimension D increases
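As a concrete (if simplified) illustration of this kind of smoothness-based interpolation, here is a minimal NumPy sketch that predicts ŷ with RBF-kernel weights. It uses Nadaraya-Watson-style normalized weights instead of learned α_i and b, which is an illustrative simplification rather than the method on the slide; the bandwidth γ and toy data are also arbitrary.

```python
import numpy as np

def rbf_kernel(X, x, gamma=1.0):
    """k(x_i, x) = exp(-gamma * ||x_i - x||^2) for every row x_i of X."""
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def kernel_interpolate(X_train, y_train, x, gamma=1.0):
    """Predict y_hat by smoothing the labels of training examples near x.
    Here the alpha_i are normalized kernel weights and b = 0; classic
    kernel machines would instead learn alpha_i and b from the data."""
    k = rbf_kernel(X_train, x, gamma)
    return k @ y_train / (k.sum() + 1e-12)

# Toy usage: 1-D regression from a handful of examples.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(20, 1))
y_train = np.sin(X_train[:, 0])
print(kernel_interpolate(X_train, y_train, np.array([0.5]), gamma=2.0))
```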

  5-7. Exponential Gains from Depth I
    Functions representable with a deep rectifier NN require an exponential number of hidden units in a shallow NN [13]
    In deep learning, a deep factor is defined by "reusing" the shallow ones:
      Face = 0.3 [corner] + 0.7 [circle]
    With a shallow structure, a deep factor needs to be replaced by exponentially many factors:
      Face = 0.3 [0.5 [vertical] + 0.5 [horizontal]] + 0.7 [ ... ]

  8-9. Exponential Gains from Depth II
    Another example: an NN with absolute-value rectification units
    Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value)
    A single fold in a deep layer creates an exponentially large number of piecewise-linear regions in the input space
    No need to see examples in every linear region of the input space
    This exponential gain counters the exponential challenge posed by the curse of dimensionality

  10. Outline (repeated; next subsection: Memory and GPU Friendliness)

  11-14. Stochastic Gradient Descent
    Gradient Descent (GD):
      w^(0) ← a random vector;
      Repeat until convergence {
        w^(t+1) ← w^(t) − η ∇_w C_N(w^(t); X);
      }
    Needs to scan the entire dataset to take one descent step (many I/Os)
    (Mini-Batched) Stochastic Gradient Descent (SGD):
      w^(0) ← a random vector;
      Repeat until convergence {
        Randomly partition the training set X into mini-batches {X^(j)}_j;
        For each mini-batch X^(j): w^(t+1) ← w^(t) − η ∇_w C(w^(t); X^(j));
      }
    No I/O if the next mini-batch can be prefetched
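A minimal NumPy sketch of the two procedures above, applied to linear least squares; the quadratic cost, learning rate, epoch count, and batch size are illustrative choices, not part of the slides.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of the mean-squared-error cost C(w; X, y) = 1/(2n) ||Xw - y||^2."""
    return X.T @ (X @ w - y) / len(y)

def gd(X, y, eta=0.1, epochs=100):
    w = np.random.randn(X.shape[1])          # w_0: a random vector
    for _ in range(epochs):                  # each step touches the whole dataset
        w -= eta * grad(w, X, y)
    return w

def sgd(X, y, eta=0.1, epochs=100, batch_size=32):
    w = np.random.randn(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        perm = np.random.permutation(n)      # randomly partition X into mini-batches
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            w -= eta * grad(w, X[idx], y[idx])   # one cheap step per mini-batch
    return w

# Toy usage
X = np.random.randn(1000, 5)
y = X @ np.array([1., -2., 0.5, 0., 3.])
print(gd(X, y)[:3], sgd(X, y)[:3])
```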

  15-16. GD vs. SGD
    Is SGD really a better algorithm?

  17. Yes, If You Have Big Data
    Performance is limited by training time

  18. Asymptotic Analysis [4]
                                    GD                      SGD
    Time per iteration              N                       1
    #Iterations to opt. error ρ     log(1/ρ)                1/ρ
    Time to opt. error ρ            N log(1/ρ)              1/ρ
    Time to excess error ε          (1/ε^(1/α)) log(1/ε)    1/ε
    where α ∈ [1/2, 1]

  19-20. Parallelizing SGD
    Data parallelism: every core/GPU trains the full model given partitioned data
    Model parallelism: every core/GPU trains a partitioned model given full data
    The effectiveness depends on the application and the available hardware
      E.g., CPU/GPU speed, communication latency, bandwidth, etc.
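Below is a single-process simulation sketch of synchronous data-parallel SGD: each simulated worker computes the gradient on its shard of the mini-batch, and the shard gradients are averaged before one shared update (the role an all-reduce plays on real multi-GPU hardware). The linear model, MSE cost, and worker count are illustrative assumptions, not the slides' implementation.

```python
import numpy as np

def shard_grad(w, X_shard, y_shard):
    """Each worker's local gradient on its partition of the mini-batch (MSE cost)."""
    return X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

def data_parallel_sgd(X, y, n_workers=4, eta=0.1, epochs=50, batch_size=64):
    w = np.random.randn(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            shards = np.array_split(idx, n_workers)        # partition the batch across workers
            grads = [shard_grad(w, X[s], y[s]) for s in shards if len(s)]
            w -= eta * np.mean(grads, axis=0)              # "all-reduce": average, then update
    return w

X = np.random.randn(2000, 5)
y = X @ np.array([1., -2., 0.5, 0., 3.])
print(data_parallel_sgd(X, y)[:3])
```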

  21. Outline (repeated; next subsection: Online & Transfer Learning)

  22-24. Online Learning
    So far, we have assumed that the training data X arrive all at once
    What if data arrive sequentially?
    Online learning: update the model when new data arrive
    This is already supported by SGD
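A minimal sketch of the online setting with SGD: one update per example as it arrives from a stream, with no dataset stored. The stream generator, squared-error cost, and learning rate are illustrative.

```python
import numpy as np

def online_sgd(stream, dim, eta=0.01):
    """Update w with one SGD step per arriving (x, y) pair; no dataset is kept."""
    w = np.zeros(dim)
    for x, y in stream:
        w -= eta * (x @ w - y) * x           # gradient of 1/2 (x.w - y)^2
    return w

def toy_stream(n=10000, dim=5, seed=0):
    rng = np.random.default_rng(seed)
    w_true = rng.standard_normal(dim)
    for _ in range(n):
        x = rng.standard_normal(dim)
        yield x, x @ w_true

print(online_sgd(toy_stream(), dim=5)[:3])
```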

  25-27. Multi-Task and Transfer Learning
    Multi-task learning: learn a single model for multiple tasks
      Via shared layers
    Transfer learning: reuse the knowledge learned from one task to help another
      Via pretrained layers (whose weights may be further updated with a smaller learning rate)
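A minimal NumPy sketch of the transfer recipe above: copy pretrained hidden-layer weights into a new model, then fine-tune them with a much smaller learning rate than the freshly initialized output layer. The two-layer architecture, learning rates, and random data are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came from a model trained on a source task.
W1_pretrained = rng.standard_normal((20, 64)) * 0.1
b1_pretrained = np.zeros(64)

# New task: reuse the pretrained layer, add a fresh output layer.
W1, b1 = W1_pretrained.copy(), b1_pretrained.copy()
W2, b2 = rng.standard_normal((64, 1)) * 0.1, np.zeros(1)

lr_pretrained, lr_new = 1e-4, 1e-2     # smaller learning rate for transferred weights

def forward(X):
    h = np.maximum(X @ W1 + b1, 0.0)   # pretrained ReLU layer
    return h, h @ W2 + b2

def train_step(X, y):
    global W1, b1, W2, b2
    h, y_hat = forward(X)
    err = (y_hat - y) / len(y)         # gradient of the MSE cost w.r.t. y_hat
    dW2, db2 = h.T @ err, err.sum(0)
    dh = err @ W2.T * (h > 0)
    dW1, db1 = X.T @ dh, dh.sum(0)
    W2 -= lr_new * dW2;        b2 -= lr_new * db2
    W1 -= lr_pretrained * dW1; b1 -= lr_pretrained * db1

X = rng.standard_normal((128, 20))
y = rng.standard_normal((128, 1))
train_step(X, y)
```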

  28. Outline (repeated; next section: Learning Theory Revisited)

  29-34. Learning Theory
    How do we learn a function f_N from N examples X that is close to the true function f*?
    Empirical risk: C_N[f] = (1/N) Σ_{i=1}^N loss(f(x^(i)), y^(i))
    Expected risk: C[f] = ∫ loss(f(x), y) dP(x, y)
    Let f* = argmin_f C[f] be the true function (our goal)
    Since we seek a function in a model (hypothesis space) F, the best we can get is f*_F = argmin_{f∈F} C[f]
    But our objective only minimizes the errors on limited examples, so we only have f_N = argmin_{f∈F} C_N[f]
    The excess error E = C[f_N] − C[f*]:
      E = (C[f*_F] − C[f*]) + (C[f_N] − C[f*_F])
        =       E_app       +       E_est

  35-39. Excess Error
    Wait, we may not have enough training time, so we stop the iterations early and obtain f̃_N, where C_N[f̃_N] ≤ C_N[f_N] + ρ
    The excess error becomes E = C[f̃_N] − C[f*]:
      E = (C[f*_F] − C[f*]) + (C[f_N] − C[f*_F]) + (C[f̃_N] − C[f_N])
        =       E_app       +       E_est        +       E_opt
    Approximation error E_app: reduced by choosing a larger model
    Estimation error E_est: reduced by
      1. increasing N, or
      2. choosing a smaller model [5, 12, 15]
    Optimization error E_opt: reduced by
      1. running the optimization algorithm longer (i.e., with a smaller ρ), or
      2. choosing a more efficient optimization algorithm

  40-47. Minimizing Excess Error
      E = (C[f*_F] − C[f*]) + (C[f_N] − C[f*_F]) + (C[f̃_N] − C[f_N])
        =       E_app       +       E_est        +       E_opt
    Small-scale ML tasks:
      Mainly constrained by N
      Computing time is not an issue, so E_opt can be made insignificant by choosing a small ρ
      The size of the hypothesis space is important for balancing the trade-off between E_app and E_est
    Large-scale ML tasks:
      Mainly constrained by time (significant E_opt), so SGD is preferred
      N is large, so E_est can be reduced
      A large model is preferred to reduce E_app

  48-49. Big Data + Big Models
    Examples: the COTS HPC unsupervised convolutional network [6] and GoogLeNet [14]
    With domain-specific architectures such as convolutional NNs (CNNs) and recurrent NNs (RNNs)

  50. Outline (repeated; next subsection: Generalizability and Over-Parametrization)

  51-53. Over-Parametrized NNs
    Let D^(l) be the output dimension ("width") of a layer f^(l)(·; θ^(l)) of an NN
      Input/output dimensions: (x, y) ∈ R^{D^(0)} × R^{D^(L)}
      D = min(D^(0), ..., D^(L)) is the network width
    From the statistical learning theory point of view, the larger the D, the worse the generalizability
    However, as D grows, the generalizability actually increases [20]; i.e., over-parametrization leads to better performance
    Why such a paradox?

  54-55. Wide-and-Deep NNs as Gaussian Processes
    Recent studies [10, 9, 11] show that a wide NN of any depth can be approximated by a Gaussian process (GP)
      Either before, during, or after training
    Recall that a GP is a non-parametric model whose complexity depends only on the size of the training set |X| and the hyperparameters of the kernel function k(·, ·):
      [ y_N ]       [ m_N ]   [ K_N,N  K_N,M ]
      [ y_M ]  ∼ N( [ m_M ] , [ K_M,N  K_M,M ] )
    with Bayesian inference for test points X′:
      P(y_M | X′, X) = N( K_M,N K_N,N^(−1) y_N , K_M,M − K_M,N K_N,N^(−1) K_N,M )
    Therefore, wide-and-deep NNs do not overfit as one might expect
      D, once it becomes large, does not reflect the true model complexity
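To make the GP prediction above concrete, here is a minimal NumPy sketch of the posterior predictive N(K_M,N K_N,N^(−1) y_N, K_M,M − K_M,N K_N,N^(−1) K_N,M), assuming a zero prior mean as on the slide. The RBF kernel, jitter term, and toy data are illustrative choices, not part of the slides.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Kernel matrix K with K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def gp_posterior(X_N, y_N, X_M, gamma=0.5, jitter=1e-8):
    """P(y_M | X', X) = N(K_MN K_NN^-1 y_N, K_MM - K_MN K_NN^-1 K_NM)."""
    K_NN = rbf(X_N, X_N, gamma) + jitter * np.eye(len(X_N))  # jitter for stability
    K_MN = rbf(X_M, X_N, gamma)
    K_MM = rbf(X_M, X_M, gamma)
    alpha = np.linalg.solve(K_NN, y_N)
    mean = K_MN @ alpha
    cov = K_MM - K_MN @ np.linalg.solve(K_NN, K_MN.T)
    return mean, cov

X_N = np.random.randn(30, 2)
y_N = np.sin(X_N[:, 0])
X_M = np.random.randn(5, 2)
mean, cov = gp_posterior(X_N, y_N, X_M)
print(mean, np.diag(cov))
```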

  56. Outline (repeated; next subsection: Wide-and-Deep NN is a Gaussian Process before Training*)

  57-58. Example: NN for Regression
    For simplicity, we consider an L-layer NN f(·; θ) for the regression problem:
      f(x; θ) = a^(L), where a^(l) = φ^(l)(W^(l)⊤ a^(l−1) + b^(l)), for l = 1, ..., L,
    where
      the activation functions φ^(1)(·) = ··· = φ^(L−1)(·) ≡ φ(·), and φ^(L)(·) is an identity function
      a^(0) = x, and ŷ = a^(L) = z^(L) ∈ R is the mean of a Gaussian
      θ^(l) = vec(W^(l), b^(l)) and θ = vec(θ^(1), ..., θ^(L))
    Let ŷ_N = [f(x^(1); θ), ..., f(x^(N); θ)]⊤ ∈ R^N be the predictions for the points in the training set X = {(x^(i), y^(i))}_{i=1}^N = {X_N ∈ R^{N×D^(0)}, y_N ∈ R^N}
    Maximum-likelihood estimation:
      argmax_θ P(X | θ) = argmin_θ C(ŷ_N, y_N) = argmin_θ (1/2) ‖ŷ_N − y_N‖²

  59-60. Weight Initialization and Normalization
      a^(l) = φ^(l)(W^(l)⊤ a^(l−1) + b^(l))
    Common initialization: W^(l)_{i,j} ∼ N(0, σ_w²) and b^(l)_i ∼ N(0, σ_b²)
    To normalize the forward and backward gradient signals w.r.t. the layer width, we can define an equivalent NN:
      a^(l) = φ^(l)(W^(l)⊤ a^(l−1) + b^(l)),
    where W^(l)_{i,j} = (σ_w / √D^(l−1)) ω^(l)_{i,j}, b^(l)_i = σ_b β^(l)_i, and ω^(l)_{i,j}, β^(l)_i ∼ N(0, 1)

  61-63. Distribution of ŷ
    Given an x, what is the distribution of its prediction ŷ?
    Recall that
      ŷ = z^(L) = w^(L)⊤ a^(L−1) + b^(L) = (σ_w / √D^(L−1)) Σ_j ω^(L)_j φ(z^(L−1)_j) + σ_b β^(L)
    Since the ω^(L)_j's and β^(L) are Gaussian random variables with zero means, their sum ŷ is also a zero-mean Gaussian
    Now consider the predictions ŷ_N = [ŷ(x^(1)), ..., ŷ(x^(N))]⊤ ∈ R^N for N points; we have
      [ŷ(x^(1)), ..., ŷ(x^(N))]⊤ = (σ_w / √D^(L−1)) Σ_j ω^(L)_j [φ(z^(L−1)_j(x^(1))), ..., φ(z^(L−1)_j(x^(N)))]⊤ + σ_b β^(L) 1_N
    As D^(L−1) → ∞, by the multidimensional Central Limit Theorem, ŷ_N is a multivariate Gaussian with mean 0_N and covariance Σ
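A small Monte Carlo check of this claim, written as a sketch rather than part of the slides: sample many random initializations of a one-hidden-layer ReLU network under the normalized parametrization above and inspect the empirical mean and covariance of [ŷ(x^(1)), ŷ(x^(2))]; as the width grows they approach 0 and the corresponding NN-GP kernel matrix. The width, σ_w, σ_b, sample count, and the two test points are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D0, D1 = 3, 4096                 # input dim and hidden width (large)
sw, sb = 1.0, 0.1                # sigma_w, sigma_b
x1, x2 = rng.standard_normal(D0), rng.standard_normal(D0)
X = np.stack([x1, x2])           # two inputs

samples = []
for _ in range(2000):            # 2000 independent random initializations
    W1 = rng.standard_normal((D0, D1)) * sw / np.sqrt(D0)
    b1 = rng.standard_normal(D1) * sb
    w2 = rng.standard_normal(D1) * sw / np.sqrt(D1)
    b2 = rng.standard_normal() * sb
    a1 = np.maximum(X @ W1 + b1, 0.0)        # phi = ReLU
    samples.append(a1 @ w2 + b2)             # y_hat at the two inputs

samples = np.array(samples)                   # shape (2000, 2)
print("empirical mean:", samples.mean(0))     # approx [0, 0]
print("empirical cov:\n", np.cov(samples.T))  # approx the 2x2 NN-GP kernel matrix
```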

  64-65. Wide-and-Deep NN as a Gaussian Process
    The covariance Σ completely describes the behavior of our NN ŷ(·) = f(·) over the N points
    Furthermore, we will show that Σ can be described by a deterministic kernel function k^(L)(·, ·), independent of any particular initialization, such that
      Σ = [ k^(L)(x^(1), x^(1))  ···  k^(L)(x^(1), x^(N)) ]
          [         ⋮             ⋱           ⋮          ]  ≡  K^(L)_N,N
          [ k^(L)(x^(N), x^(1))  ···  k^(L)(x^(N), x^(N)) ]
    This implies that the NN corresponds to a GP, called the NN-GP:
      [ ŷ_N ]       [ 0_N ]   [ K^(L)_N,N  K^(L)_N,M ]
      [ ŷ_M ]  ∼ N( [ 0_M ] , [ K^(L)_M,N  K^(L)_M,M ] )
    What is k^(L)(·, ·)?

  66-67. Deriving k^(1)(·, ·)
    We use induction to show that z^(1)_i(·), z^(2)_i(·), ..., z^(L)(·) = ŷ(·) are GPs, governed by kernels k^(1)(·, ·), ..., k^(L)(·, ·) that are independent of i, respectively
    Consider z^(1)_i(x) = (σ_w / √D^(0)) Σ_j ω^(1)_{j,i} x_j + σ_b β^(1)_i, a zero-mean Gaussian
    As D^(0) → ∞, we have [z^(1)_i(x^(1)), ..., z^(1)_i(x^(N))]⊤ ∼ N(0_N, K^(1)_N,N) by the multidimensional Central Limit Theorem, where
      k^(1)(x, x′) = Cov[z^(1)_i(x), z^(1)_i(x′)] = E_{ω^(1)_{:,i}, β^(1)_i}[z^(1)_i(x) z^(1)_i(x′)]
                   = (σ_w²/D^(0)) Σ_{j,k} E[ω^(1)_{j,i} ω^(1)_{k,i}] x_j x′_k
                     + (σ_w σ_b/√D^(0)) E[β^(1)_i Σ_j ω^(1)_{j,i} x_j] + (σ_w σ_b/√D^(0)) E[β^(1)_i Σ_j ω^(1)_{j,i} x′_j] + σ_b² E[β^(1)_i β^(1)_i]
                   = (σ_w²/D^(0)) Σ_j E[ω^(1)_{j,i} ω^(1)_{j,i}] x_j x′_j + σ_b² E[β^(1)_i β^(1)_i]
                   = (σ_w²/D^(0)) x⊤x′ + σ_b²,
      which is independent of i
    Note that z^(1)_i(·) and z^(1)_j(·) are independent of each other, ∀ i ≠ j

  68-69. Deriving k^(l)(·, ·)
    Given that D^(0) → ∞, ..., D^(l−2) → ∞ and
      [z^(l−1)_i(x^(1)), ..., z^(l−1)_i(x^(N))]⊤ ∼ N(0_N, K^(l−1)_N,N)
      z^(l−1)_i(·) and z^(l−1)_j(·) are independent of each other, ∀ i ≠ j
    Consider z^(l)_i(x) = (σ_w / √D^(l−1)) Σ_j ω^(l)_{j,i} φ(z^(l−1)_j(x)) + σ_b β^(l)_i, a zero-mean Gaussian
    As D^(l−1) → ∞, we have [z^(l)_i(x^(1)), ..., z^(l)_i(x^(N))]⊤ ∼ N(0_N, K^(l)_N,N) by the multidimensional Central Limit Theorem, where
      k^(l)(x, x′) = Cov[z^(l)_i(x), z^(l)_i(x′)] = E_{ω^(l)_{:,i}, β^(l)_i, z^(l−1)(·)}[z^(l)_i(x) z^(l)_i(x′)]
                   = (σ_w²/D^(l−1)) Σ_{j,k} E[ω^(l)_{j,i} ω^(l)_{k,i} φ(z^(l−1)_j(x)) φ(z^(l−1)_k(x′))] + σ_b² E[β^(l)_i β^(l)_i]
                     + (σ_w σ_b/√D^(l−1)) ( E[β^(l)_i Σ_j ω^(l)_{j,i} φ(z^(l−1)_j(x))] + E[β^(l)_i Σ_j ω^(l)_{j,i} φ(z^(l−1)_j(x′))] )
                   = (σ_w²/D^(l−1)) Σ_j E[ω^(l)_{j,i} ω^(l)_{j,i}] E[φ(z^(l−1)_j(x)) φ(z^(l−1)_j(x′))] + σ_b² E[β^(l)_i β^(l)_i]
                   = σ_w² E[φ(z^(l−1)_i(x)) φ(z^(l−1)_i(x′))] + σ_b²,
      where (z^(l−1)_i(x), z^(l−1)_i(x′)) ∼ N(0_2, K^(l−1)_2,2) and
      K^(l−1)_2,2 = [ k^(l−1)(x, x)    k^(l−1)(x, x′)  ]
                    [ k^(l−1)(x′, x)   k^(l−1)(x′, x′) ]

  70. Evaluating K^(l)
    For certain activation functions φ(·), such as tanh and ReLU, k^(l)(x, x′) has a closed form [10]
    For other φ(·)'s, Markov Chain Monte Carlo (MCMC) sampling is required to evaluate k^(l)(x, x′)
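The slides leave the evaluation of k^(l) to closed forms or sampling; below is a minimal sketch that evaluates the recursion k^(1)(x, x′) = (σ_w²/D^(0)) x⊤x′ + σ_b² and k^(l)(x, x′) = σ_w² E[φ(u)φ(v)] + σ_b², with (u, v) ∼ N(0_2, K^(l−1)_2,2), using plain Monte Carlo draws from the bivariate Gaussian (a simple special case of sampling-based evaluation). The ReLU activation, sample count, and σ values are illustrative.

```python
import numpy as np

def nngp_kernel(x, xp, L, sw=1.0, sb=0.1, phi=lambda z: np.maximum(z, 0.0),
                n_samples=200000, seed=0):
    """Monte Carlo evaluation of k^(L)(x, x') for the NN-GP kernel recursion."""
    rng = np.random.default_rng(seed)
    D0 = len(x)
    # Base case k^(1) on the pairs (x, x), (x, x'), (x', x').
    kxx  = sw**2 / D0 * (x @ x)   + sb**2
    kxxp = sw**2 / D0 * (x @ xp)  + sb**2
    kpp  = sw**2 / D0 * (xp @ xp) + sb**2
    for _ in range(2, L + 1):
        K = np.array([[kxx, kxxp], [kxxp, kpp]])          # K^(l-1)_{2,2}
        u, v = rng.multivariate_normal(np.zeros(2), K, n_samples).T
        kxx  = sw**2 * np.mean(phi(u) * phi(u)) + sb**2   # E[phi(u)^2], u ~ N(0, kxx)
        kpp  = sw**2 * np.mean(phi(v) * phi(v)) + sb**2
        kxxp = sw**2 * np.mean(phi(u) * phi(v)) + sb**2   # E[phi(u)phi(v)] under the joint
    return kxxp

x, xp = np.array([1.0, -0.5, 2.0]), np.array([0.3, 0.8, -1.0])
print(nngp_kernel(x, xp, L=3))
```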

  71. Outline (repeated; next subsection: Gradient Descent is an Affine Transformation*)
