

  1. A new perspective on machine learning
     H. N. Mhaskar, Claremont Graduate University
     ICERM Scientific Machine Learning, January 28, 2019

  2. Outline
     ◮ My understanding of the machine learning problem and its traditional solution.
     ◮ What bothers me about this.
     ◮ My own efforts to remedy the problems:
       ◮ Diffusion geometry based approach
       ◮ Application to diabetic sugar level prediction
       ◮ Problems
       ◮ Hermite polynomial based approach
       ◮ Applications

  3. Problem of machine learning
     Given data (training data) of the form $\{(x_j, y_j)\}_{j=1}^{M}$, where $y_j \in \mathbb{R}$ and the $x_j$'s lie in some Euclidean space $\mathbb{R}^q$, find a function $P$ on a suitable domain
     ◮ that models the data well;
     ◮ in particular, $P(x_j) \approx y_j$.

  4. 1. Traditional paradigm

  5. Basic set up
     ◮ $\{(x_j, y_j)\}$ are i.i.d. samples from an unknown probability distribution $\mu$.
     ◮ $f(x) = \mathbb{E}_\mu(y \mid x)$ is the target function.
     ◮ $\mu^*$ = marginal distribution of $x$; $X$ = support of $\mu^*$.
     ◮ $V_n \subset V_{n+1} \subset \cdots$ = classes of models, $V_n$ with complexity $n$ (typically, the number of parameters).

  6. Traditional methodology
     ◮ Assume $f \in W_\gamma$ (smoothness class, prior, RKHS).
     ◮ Estimate $E_n(f) = \inf_{P \in V_n} \|f - P\|_{L^2(\mu^*)} = \|f - P^*\|_{L^2(\mu^*)}$. Decide upon the right value of $n$.
     ◮ Find
       $$P^{\#} = \arg\min_{P \in V_n} \Bigl[ \mathrm{Loss}\bigl(\{y_\ell - P(x_\ell)\}\bigr) + \lambda \|P\|_{W_\gamma} \Bigr].$$
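To make the last step concrete, here is a minimal sketch in Python of the traditional recipe, using polynomials of degree less than $n$ as a stand-in for $V_n$ and a plain ridge penalty as a stand-in for $\lambda \|P\|_{W_\gamma}$; the model class, the penalty, the value of lam, and the toy target are illustrative assumptions, not choices made in the talk.

```python
# Minimal sketch of the traditional paradigm: pick a model class V_n,
# then minimize a data-fit loss plus a regularization penalty.
# Polynomials and a ridge penalty are illustrative stand-ins.
import numpy as np

def fit_regularized(x, y, n, lam):
    """Least squares over V_n = polynomials of degree < n, with an L2
    penalty on the coefficients standing in for lambda * ||P||_{W_gamma}."""
    V = np.vander(x, N=n, increasing=True)      # design matrix for V_n
    A = V.T @ V + lam * np.eye(n)               # normal equations + penalty
    coeffs = np.linalg.solve(A, V.T @ y)
    return np.polynomial.Polynomial(coeffs)

# Toy usage: noisy samples of f(x) = |x| on [-1, 1].
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = np.abs(x) + 0.05 * rng.standard_normal(200)
P_sharp = fit_regularized(x, y, n=8, lam=1e-3)  # the fitted model P#
```

The later slides argue that this loop of choosing a class, a penalty, and an optimizer can be bypassed altogether.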

  7. Generalization error
     $$\underbrace{\int_{X \times \mathbb{R}} |y - P^{\#}(x)|^2 \, d\mu(y, x)}_{\text{generalization error}}
       = \underbrace{\int_{X \times \mathbb{R}} |y - f(x)|^2 \, d\mu(y, x)}_{\text{variance}}
       + \underbrace{\|f - P^*\|^2_{L^2(\mu^*)}}_{\text{approximation error}}
       + \underbrace{\|f - P^{\#}\|^2_{L^2(\mu^*)} - \|f - P^*\|^2_{L^2(\mu^*)}}_{\text{sampling error}}$$
     Only the approximation error and the sampling error can be controlled; the variance term is intrinsic to the noise in the data.

  8. Observations on the paradigm
     ◮ Too complicated.
     ◮ Bounds on the approximation error are often obtained by explicit constructions, yet the approach makes no use of these constructions.
     ◮ Measuring errors in $L^2$ with function values makes sense only if $f$ is in some RKHS. So the method is not universal.

  9. Observations on the paradigm
     Good is better than best
     [Figure: two log-plots of the absolute error.] On the left, the log-plot of the absolute error between the function $x \mapsto |\cos x|^{1/4}$ and its Fourier projection. On the right, the corresponding plot for the trigonometric polynomial obtained by our summability operator. This is based on 128 equidistant samples; the order of the trigonometric polynomials is 31 in each case. The numbers on the x-axis are in multiples of $\pi$; the actual absolute errors are $10^y$.
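A minimal sketch of how such a comparison can be set up, assuming a cosine-series reconstruction and a smoothstep low-pass filter h; both are illustrative choices, since the talk does not specify the filter used for the plots.

```python
# Compare a truncated Fourier projection with a filtered (summability)
# trigonometric polynomial, both built from 128 equidistant samples of
# f(x) = |cos x|^(1/4). The smoothstep filter h is an illustrative choice.
import numpy as np

M, n = 128, 31
t = 2 * np.pi * np.arange(M) / M
samples = np.abs(np.cos(t)) ** 0.25
c = np.fft.rfft(samples) / M                 # approximate Fourier coefficients

def h(u):
    """Low-pass filter: 1 on [0, 1/2], 0 on [1, inf), smooth in between."""
    s = np.clip((u - 0.5) / 0.5, 0.0, 1.0)
    return 1.0 - s * s * (3.0 - 2.0 * s)

x = np.linspace(0.0, 2.0 * np.pi, 1000)
k = np.arange(n + 1)
basis = np.cos(np.outer(x, k))               # f is even, so cosines suffice
coeff = np.where(k == 0, 1.0, 2.0) * c[: n + 1].real
fourier_proj = basis @ coeff                 # plain truncation: "best" L2 approx
summability = basis @ (h(k / n) * coeff)     # filtered sum: "good" approx

target = np.abs(np.cos(x)) ** 0.25
err_proj = np.abs(target - fourier_proj)
err_filt = np.abs(target - summability)
```

Away from the singularities of $|\cos x|^{1/4}$ the filtered sum converges far faster than the plain projection, which is the "good is better than best" phenomenon the plots illustrate.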

  10. Observations on the paradigm
     ◮ The choices of penalty functional, loss functional, kernels, etc. are often ad hoc, and assume a prior on the target function.
     ◮ Performance guarantees on new and unseen data are often not easy to obtain, sometimes impossible.
     ◮ The optimization algorithms might not converge, or converge too slowly.
     ◮ The paradigm does not work in the context of deep learning.

  11. Curse of dimensionality
     The number of parameters required to get a generalization error of $\epsilon$ is at least a constant times $\epsilon^{-q/\gamma}$, where $\gamma$ = smoothness of $f$ and $q$ = number of input variables. [Donoho 2000; DeVore, Howard, Micchelli 1989]
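As a plug-in illustration (numbers chosen only for this example): with smoothness $\gamma = 2$ and $q = 10$ input variables, reaching $\epsilon = 10^{-2}$ already requires on the order of $(10^{-2})^{-10/2} = 10^{10}$ parameters.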

  12. Blessing of compositionality
     Approximate $F(x_1, \dots, x_4) = f(f_1(x_1, x_2), f_2(x_3, x_4))$ by
     $$Q(x_1, \dots, x_4) = P(P_1(x_1, x_2), P_2(x_3, x_4)).$$
     Only functions of 2 variables are involved at each stage. [Mh., Poggio, 2016]
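As an illustrative example (not one from the talk): $F(x_1, \dots, x_4) = (x_1 x_2 + x_3 x_4)^2$ has exactly this form with $f_1(x_1, x_2) = x_1 x_2$, $f_2(x_3, x_4) = x_3 x_4$, and $f(u, v) = (u + v)^2$; every node of the composition depends on only two variables, so the lower bound of the previous slide applies with $q = 2$ at each stage rather than $q = 4$ overall.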

  13. How to measure generalization error
     $$\int |f(f_1(x_1, x_2), f_2(x_3, x_4)) - P(P_1(x_1, x_2), P_2(x_3, x_4))|^2 \, d\mu(x_1, x_2, x_3, x_4)$$
     $\mu$ ignores compositionality.
     $$\int |f(f_1, f_2) - P(P_1, P_2)|^2 \, d\nu \quad (?)$$
     The distributions of $(f_1, f_2)$ and $(P_1, P_2)$ are different. Must have a different notion of generalization error. [Mh., Poggio, 2016]

  14. A new look
     Given data (training data) of the form $\{(x_j, y_j)\}_{j=1}^{M}$, where $y_j \in \mathbb{R}$ and the $x_j$'s lie in some Euclidean space $\mathbb{R}^{q+1}$.
     ◮ Assume that there is an underlying target function $f : X \to \mathbb{R}$ such that $y_j = f(x_j) + \epsilon_j$.
     ◮ No priors, just continuity.
     ◮ Use approximation theory to construct the approximation $P$.

  15. Objectives
     ◮ Universal approximation with no prior assumptions.
     ◮ Generalization error defined pointwise, adjusting itself to the local smoothness.
     ◮ Optimization is substantially easier.
     ◮ Can be adapted to deep learning easily.

  16. Problem
     Classical approximation theory results are not adequate:
     ◮ the data must be distributed densely on a known domain (cube, sphere, etc.);
     ◮ the points $x_j$ need to be chosen judiciously, e.g., Driscoll-Healy points on the sphere or quadrature nodes on the cube.

  17. 2. Diffusion geometry based construction

  18. Set up
     Data $\{x_j\}$: an i.i.d. sample from a distribution $\mu^*$ supported on a smooth compact manifold $X$ (unknown).
     $\{\phi_k\}$: a system of eigenfunctions of a suitable PDE with eigenvalues $\{\lambda_k\}$.
     The $\phi_k$'s and $\lambda_k$'s are computed approximately from a "graph Laplacian". [Lafon 2004; Singer 2006; Belkin, Niyogi 2008]
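A minimal sketch of one way such approximate eigenpairs can be computed from data, using a Gaussian affinity graph and an unnormalized graph Laplacian; the kernel width eps, the dense eigensolver, and this particular normalization are assumptions made for illustration, not the specific constructions cited above.

```python
# Approximate (lambda_k, phi_k) from a point cloud via a graph Laplacian,
# in the spirit of diffusion maps. Illustrative, unnormalized construction.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def graph_laplacian_eigs(X, eps, num_eigs):
    """X: (M, q) array of data points x_j.
    Returns the num_eigs smallest eigenvalues and the corresponding
    eigenvectors, i.e. the phi_k sampled at the data points."""
    D2 = cdist(X, X, "sqeuclidean")
    W = np.exp(-D2 / eps)                    # Gaussian affinities
    L = np.diag(W.sum(axis=1)) - W           # unnormalized graph Laplacian
    lam, phi = eigh(L)                       # eigenvalues in ascending order
    return lam[:num_eigs], phi[:, :num_eigs]
```

The eigenvectors play the role of the $\phi_k$'s evaluated at the sample points; values off the sample can be extended, e.g., by Nyström-type formulas.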

  19. Set up
     Data $\{x_j\}$: an i.i.d. sample from a distribution $\mu^*$ supported on a smooth compact manifold $X$ (unknown).
     $\{\phi_k\}$: a system of eigenfunctions of a suitable PDE with eigenvalues $\{\lambda_k\}$.
     $\Pi_n = \mathrm{span}\{\phi_k : \lambda_k < n\}$.
     $$\|f\| = \sup_{x \in X} |f(x)|, \qquad \|f\|_\gamma = \|f\| + \sup_{n \ge 1} n^{\gamma} \, \mathrm{dist}(f, \Pi_n).$$
     $\gamma$ is the smoothness of $f$.

  20. Construction
     $h$: a smooth low-pass filter (even, $= 1$ on $[0, 1/2]$, $= 0$ on $[1, \infty)$).
     $$\Phi_n(x, y) = \sum_{k :\, \lambda_k < n} h\!\left(\frac{\lambda_k}{n}\right) \phi_k(x) \phi_k(y).$$
     Fact [Filbir-Mhaskar, 2010, 2011]: If the $x_j$'s are sufficiently dense, then there exist $w_j$ such that for all $P \in \Pi_n$,
     $$\sum_j w_j P(x_j) = \int_X P(x) \, d\mu^*(x), \qquad \sum_j |w_j P(x_j)| \le c \int_X |P(x)| \, d\mu^*(x).$$
     (Marcinkiewicz-Zygmund (MZ) quadrature)
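A minimal sketch of the kernel, assuming the approximate eigenpairs from the graph-Laplacian sketch above and the same illustrative filter h; the array layout of the eigenfunction values is an assumption of this sketch.

```python
# Localized kernel Phi_n(x, y) = sum_{lambda_k < n} h(lambda_k/n) phi_k(x) phi_k(y),
# evaluated on arrays of points via the (approximate) eigenpairs.
import numpy as np

def h(u):
    """Low-pass filter: 1 on [0, 1/2], 0 on [1, inf), smooth in between."""
    s = np.clip((u - 0.5) / 0.5, 0.0, 1.0)
    return 1.0 - s * s * (3.0 - 2.0 * s)

def Phi_n(phi_x, phi_y, lam, n):
    """phi_x: (num_x, K) eigenfunction values at the points x,
    phi_y: (num_y, K) values at the points y, lam: (K,) eigenvalues.
    Returns the (num_x, num_y) kernel matrix."""
    weights = h(lam / n)                     # vanishes once lambda_k >= n
    return (phi_x * weights) @ phi_y.T
```

In practice the MZ weights $w_j$ can be obtained by (approximately) enforcing the exactness conditions on $\Pi_n$ as a least-squares problem; for roughly uniform i.i.d. samples, $w_j \approx 1/M$ is a crude stand-in.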

  21. Algorithm
     ◮ Find $w_j$'s depending only on the $x_j$'s, and construct [Ehler, Filbir, Mhaskar, 2012]
       $$P(x) = \sum_{j=1}^{M} w_j y_j \Phi_n(x, x_j)
              = \underbrace{\sum_{j=1}^{M} w_j f(x_j) \Phi_n(x, x_j)}_{\sigma_n(f)(x)}
              + \underbrace{\sum_{j=1}^{M} w_j \epsilon_j \Phi_n(x, x_j)}_{\text{noise part}}$$
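Continuing the sketch, the estimator itself is a single linear operation on the data; Phi_n and h are from the previous sketch, and the uniform weights are a placeholder for the MZ quadrature weights, not the weights used in the cited work.

```python
# P(x) = sum_j w_j y_j Phi_n(x, x_j): no training loop, no optimization.
import numpy as np

def estimate(phi_eval, phi_data, lam, y, n, w=None):
    """phi_eval: (num_eval, K) eigenfunction values at the evaluation points,
    phi_data: (M, K) values at the training points x_j,
    y: (M,) noisy observations y_j. Returns P at the evaluation points."""
    M = len(y)
    if w is None:
        w = np.full(M, 1.0 / M)              # placeholder quadrature weights
    K = Phi_n(phi_eval, phi_data, lam, n)    # (num_eval, M) kernel matrix
    return K @ (w * y)
```

The only tuning parameter is the spectral cutoff $n$, chosen from $M$ as in the theorem on the next slide.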

  22. Theorem
     $f \in W_\gamma$ if and only if $\|f - \sigma_n(f)\| = O(n^{-\gamma})$. [Maggioni, Mhaskar, 2008]
     If $f \in W_\gamma$ and $P$ is the noisy version of $\sigma_n(f)$, then with high probability, for $n \sim (M / \log M)^{1/(2q + 2\gamma)}$,
     $$\|f - P\| \le c \, n^{-\gamma}.$$
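As a plug-in illustration with numbers chosen only for this example: $M = 10^4$ samples on a manifold of dimension $q = 2$ with smoothness $\gamma = 1$ give $n \sim (10^4 / \log 10^4)^{1/6} \approx 3$, so only a handful of spectral bands enter the construction.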

  23. 3. An application [Mhaskar, Pereverzyev, van der Walt, 2017]

  24. Continuous blood glucose monitoring
     Source: http://www.dexcom.com/seven-plus
     Problem: Estimate the future blood glucose level based on the past few readings and the direction in which it is going – up or down.

  25. PRED-EGA grid
     ◮ Numerical accuracy is not as critical as classification errors.
     ◮ Depending on whether the blood sugar is low, normal, or high, the predictions are classified as accurate, benign (wrong but with no serious consequences), or outright errors.

  26. Deep diffusion network
     Given sugar levels $s(t_0), s(t_1), \dots$ at times $t_0, t_1, \dots$ for different patients, we form a data set $P = \{(x_j, y_j)\}$ with $x_j = (s(t_0), \dots, s(t_6))$ and $y_j = s(t_{12})$ (30-minute prediction), and a training set $C \subset P$.
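A minimal sketch of this windowing step for a single patient's series, assuming readings every 5 minutes (so that $t_{12}$ is 30 minutes after the last reading $s(t_6)$ in the window); the function name, the sliding-window indexing, and the sampling-rate assumption are illustrative.

```python
# Form (x_j, y_j) pairs from one patient's glucose series:
# x_j = seven consecutive readings, y_j = the reading 12 steps after the start.
import numpy as np

def make_windows(s, past=7, horizon=12):
    """s: 1-D array of consecutive glucose readings.
    Returns X of shape (N, past) and y of shape (N,)."""
    X, y = [], []
    for j in range(len(s) - horizon):        # last start with a valid target
        X.append(s[j : j + past])
        y.append(s[j + horizon])
    return np.array(X), np.array(y)

# Pairs from all patients are pooled into the data set P, and a subset C
# serves as training data, as on the slide.
```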
