A new perspective on machine learning
H. N. Mhaskar, Claremont Graduate University
ICERM Scientific Machine Learning, January 28, 2019
Outline
◮ My understanding of the machine learning problem and its traditional solution.
◮ What bothers me about this.
◮ My own efforts to remedy the problems:
◮ Diffusion geometry based approach
◮ Application to diabetic sugar level prediction
◮ Problems
◮ Hermite polynomial based approach
◮ Applications
Problem of machine learning
Given data (training data) of the form $\{(x_j, y_j)\}_{j=1}^M$, where $y_j \in \mathbb{R}$ and the $x_j$'s are in some Euclidean space $\mathbb{R}^q$, find a function $P$ on a suitable domain
◮ that models the data well;
◮ in particular, $P(x_j) \approx y_j$.
1. Traditional paradigm
Basic set up
$\{(x_j, y_j)\}$ are i.i.d. samples from an unknown probability distribution $\mu$.
$f(x) = \mathbb{E}_\mu(y \mid x)$: the target function.
$\mu^*$: the marginal distribution of $x$; $X$ = support of $\mu^*$.
$V_n \subset V_{n+1} \subset \cdots$: classes of models, $V_n$ with complexity $n$ (typically, the number of parameters).
Traditional methodology
◮ Assume $f \in W_\gamma$ (smoothness class, prior, RKHS).
◮ Estimate $E_n(f) = \inf_{P \in V_n} \|f - P\|_{L^2(\mu^*)} = \|f - P^*\|_{L^2(\mu^*)}$. Decide upon the right value of $n$.
◮ Find
$$P^\# = \arg\min_{P \in V_n} \left\{ \mathrm{Loss}(\{y_\ell - P(x_\ell)\}) + \lambda \|P\|_{W_\gamma} \right\}.$$
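As a concrete instance of this regularized fit, the sketch below uses kernel ridge regression: $V_n$ is a ball in an RBF-kernel RKHS and the penalty is the RKHS norm. The kernel, its width, $\lambda$, and the synthetic data are illustrative assumptions, not choices made in the talk.

```python
import numpy as np

def rbf_kernel(X, Y, width=1.0):
    # Gaussian kernel matrix between the rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * width ** 2))

def fit_p_sharp(X, y, lam=1e-3, width=1.0):
    # Kernel ridge regression: minimize squared loss + lam * RKHS norm.
    # The minimizer has the form P#(x) = sum_j alpha_j k(x, x_j).
    K = rbf_kernel(X, X, width)
    alpha = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
    return lambda Xnew: rbf_kernel(Xnew, X, width) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.cos(3 * X[:, 0]) + 0.1 * rng.standard_normal(100)
p_sharp = fit_p_sharp(X, y)
print(p_sharp(np.array([[0.0]])))  # should be close to f(0) = cos(0) = 1
```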
Generalization error
$$\underbrace{\int_{X \times \mathbb{R}} |y - P^\#(x)|^2 \, d\mu(y, x)}_{\text{generalization error}} = \underbrace{\int_{X \times \mathbb{R}} |y - f(x)|^2 \, d\mu(y, x)}_{\text{variance}} + \underbrace{\|f - P^*\|^2_{L^2(\mu^*)}}_{\text{approximation error}} + \underbrace{\|f - P^\#\|^2_{L^2(\mu^*)} - \|f - P^*\|^2_{L^2(\mu^*)}}_{\text{sampling error}}$$
Only the approximation error and the sampling error can be controlled.
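A small numerical illustration of this decomposition, under stated assumptions: $f(x) = |x|$ (deliberately outside the model class), $V_n$ = polynomials of degree $n$, and $P^*$ approximated by a least-squares fit on a very large noiseless sample standing in for $\mu^*$. All of these choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.abs(x)                 # target, outside the model class
sigma, n, M = 0.1, 3, 50                # noise level, degree, sample size

x_train = rng.uniform(-1, 1, M)
y_train = f(x_train) + sigma * rng.standard_normal(M)
p_sharp = np.polynomial.Polynomial.fit(x_train, y_train, n)

x_big = rng.uniform(-1, 1, 200_000)     # large sample standing in for mu*
p_star = np.polynomial.Polynomial.fit(x_big, f(x_big), n)  # near-best L2 fit

variance = sigma ** 2                   # E|y - f(x)|^2
approx = np.mean((f(x_big) - p_star(x_big)) ** 2)
sampling = np.mean((f(x_big) - p_sharp(x_big)) ** 2) - approx
print(variance, approx, sampling)       # the three pieces of the total error
```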
Observations on the paradigm
◮ Too complicated.
◮ Bounds on the approximation error are often obtained by explicit constructions, yet the approach makes no use of these constructions.
◮ Measuring errors in $L^2$ using function values makes sense only if $f$ is in some RKHS. So, the method is not universal.
Observations on the paradigm
Good is better than best.
[Figure: two log-plots. Left: the absolute error between the function $x \mapsto |\cos x|^{1/4}$ and its Fourier projection. Right: the corresponding plot for the trigonometric polynomial obtained by our summability operator. Both are based on 128 equidistant samples, and the order of the trigonometric polynomials is 31 in each case. The numbers on the $x$ axis are in multiples of $\pi$; the actual absolute errors are $10^y$.]
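A minimal sketch reproducing the flavor of this experiment. The summability operator is realized by damping the computed Fourier coefficients with a smooth cutoff $h$; the particular $C^\infty$ cutoff below is an illustrative assumption, as the slide does not spell out the filter.

```python
import numpy as np

def g(u):
    return np.where(u > 0, np.exp(-1.0 / np.maximum(u, 1e-12)), 0.0)

def h(t):
    # Smooth cutoff: = 1 on [0, 1/2], = 0 on [1, inf), C-infinity between.
    u = np.clip(2 * np.abs(t) - 1, 0.0, 1.0)
    return g(1 - u) / (g(u) + g(1 - u))

M, n = 128, 32
x = 2 * np.pi * np.arange(M) / M
f = np.abs(np.cos(x)) ** 0.25
c = np.fft.rfft(f) / M                   # Fourier coefficients from samples

k = np.arange(len(c))
raw = (k <= 31).astype(float)            # plain projection, order 31
damped = h(k / n)                        # summability weights, also order 31

xx = np.linspace(0, 2 * np.pi, 2000)
def synth(w):
    # Evaluate the weighted trigonometric sum on a fine grid.
    vals = np.real(w[0] * c[0]) * np.ones_like(xx)
    for j in range(1, len(c)):
        vals += 2 * np.real(w[j] * c[j] * np.exp(1j * j * xx))
    return vals

target = np.abs(np.cos(xx)) ** 0.25
away = (np.abs(xx - np.pi / 2) > 0.3) & (np.abs(xx - 3 * np.pi / 2) > 0.3)
for name, w in [("projection", raw), ("summability", damped)]:
    err = np.abs(synth(w) - target)
    # Max error overall, and away from the singularities at pi/2, 3pi/2:
    # the damped sum is dramatically better away from the bad points.
    print(name, err.max(), err[away].max())
```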
Observations on the paradigm
◮ The choices of penalty functional, loss functional, kernels, etc. are often ad hoc and assume a prior on the target function.
◮ Performance guarantees on new and unseen data are often not easy to obtain, and sometimes impossible.
◮ The optimization algorithms might not converge, or converge too slowly.
◮ The paradigm does not work in the context of deep learning.
Curse of dimensionality
The number of parameters required to get a generalization error of $\epsilon$ is at least a constant times $\epsilon^{-q/\gamma}$, where $\gamma$ is the smoothness of $f$ and $q$ is the number of input variables.¹ (For example, with $q = 100$ and $\gamma = 2$, guaranteeing $\epsilon = 0.1$ requires on the order of $10^{50}$ parameters.)

¹ Donoho, 2000; DeVore, Howard, Micchelli, 1989
Blessing of compositionality
Approximate $F(x_1, \ldots, x_4) = f(f_1(x_1, x_2), f_2(x_3, x_4))$ by²
$$Q(x_1, \ldots, x_4) = P(P_1(x_1, x_2), P_2(x_3, x_4)).$$
Only functions of 2 variables are involved at each stage.

² Mh., Poggio, 2016
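A minimal sketch of this idea, assuming (as a simplification) that the constituent functions $f, f_1, f_2$ can each be sampled during training. Each node of the tree is fit separately by a low-degree bivariate polynomial; the particular constituents and the degree are illustrative assumptions.

```python
import numpy as np

def features(u, v, deg=4):
    # Tensor-product monomial features in two variables.
    return np.stack([u**i * v**j for i in range(deg + 1)
                     for j in range(deg + 1)], axis=-1)

def fit2d(fun, deg=4, m=400, rng=np.random.default_rng(1)):
    # Least-squares fit of a 2-variable function on [-1, 1]^2.
    u, v = rng.uniform(-1, 1, m), rng.uniform(-1, 1, m)
    coef, *_ = np.linalg.lstsq(features(u, v, deg), fun(u, v), rcond=None)
    return lambda a, b: features(a, b, deg) @ coef

f1 = lambda a, b: np.sin(a * b)     # illustrative constituent functions
f2 = lambda a, b: np.cos(a + b)
f  = lambda a, b: a * np.exp(b / 2)

P1, P2, P = fit2d(f1), fit2d(f2), fit2d(f)

x = np.random.default_rng(2).uniform(-1, 1, (5, 4))
F = f(f1(x[:, 0], x[:, 1]), f2(x[:, 2], x[:, 3]))
Q = P(P1(x[:, 0], x[:, 1]), P2(x[:, 2], x[:, 3]))
print(np.max(np.abs(F - Q)))  # modest error from three 2-variable fits
```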
How to measure generalization error
$$\int |f(f_1(x_1, x_2), f_2(x_3, x_4)) - P(P_1(x_1, x_2), P_2(x_3, x_4))|^2 \, d\mu(x_1, x_2, x_3, x_4)?$$
But $\mu$ ignores compositionality.
$$\int |f(f_1, f_2) - P(P_1, P_2)|^2 \, d\nu \; (?)$$
But the distributions of $(f_1, f_2)$ and $(P_1, P_2)$ are different. We must have a different notion of generalization error.³

³ Mh., Poggio, 2016
A new look
Given data (training data) of the form $\{(x_j, y_j)\}_{j=1}^M$, where $y_j \in \mathbb{R}$ and the $x_j$'s are in some Euclidean space $\mathbb{R}^{q+1}$.
◮ Assume that there is an underlying target function $f : X \to \mathbb{R}$ such that $y_j = f(x_j) + \epsilon_j$.
◮ No priors, just continuity.
◮ Use approximation theory to construct the approximation $P$.
Objectives
◮ Universal approximation, with no prior assumptions.
◮ Generalization error defined pointwise, adjusting itself to the local smoothness.
◮ Optimization is substantially easier.
◮ Can be adapted to deep learning easily.
Problem
Classical approximation theory results are not adequate:
◮ they assume data distributed densely on a known domain (cube, sphere, etc.);
◮ the points $x_j$ need to be chosen judiciously, e.g., Driscoll-Healy points on the sphere or quadrature nodes on the cube.
2. Diffusion geometry based construction
Set up
Data $\{x_j\}$: an i.i.d. sample from a distribution $\mu^*$ on a smooth compact manifold $X$ (unknown).
$\{\phi_k\}$: a system of eigenfunctions of a suitable PDE, with eigenvalues $\{\lambda_k\}$.
The $\phi_k$'s and $\lambda_k$'s are computed approximately from a "graph Laplacian".⁴

⁴ Lafon, 2004; Singer, 2006; Belkin, Niyogi, 2008
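A minimal sketch of such a graph-Laplacian computation, in the diffusion-maps style of the cited works. The Gaussian affinity, the bandwidth eps, and the normalization are illustrative choices.

```python
import numpy as np

def graph_laplacian_eigs(X, n_eigs=6, eps=0.05):
    # Gaussian affinities between data points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps)
    # Density normalization (removes the sampling distribution mu*),
    # then a symmetric normalization so that eigh applies.
    d = W.sum(axis=1)
    W1 = W / np.outer(d, d)
    d1 = W1.sum(axis=1)
    A = W1 / np.sqrt(np.outer(d1, d1))
    vals, vecs = np.linalg.eigh(A)
    mu, phi = vals[::-1][:n_eigs], vecs[:, ::-1][:, :n_eigs]
    lam = (1.0 - mu) / eps   # approximate Laplace-Beltrami eigenvalues
    return lam, phi          # phi[:, k] ~ phi_k sampled at the data

# Sanity check on a known manifold: the unit circle, whose
# Laplace-Beltrami spectrum is 0, 1, 1, 4, 4, 9, ...
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
Xc = np.stack([np.cos(theta), np.sin(theta)], axis=1)
lam, phi = graph_laplacian_eigs(Xc)
print(lam / lam[1])  # ratios ~ 0, 1, 1, 4, 4 up to discretization error
```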
Set up
Data $\{x_j\}$: an i.i.d. sample from a distribution $\mu^*$ on a smooth compact manifold $X$ (unknown).
$\{\phi_k\}$: a system of eigenfunctions of a suitable PDE, with eigenvalues $\{\lambda_k\}$.
$\Pi_n = \mathrm{span}\{\phi_k : \lambda_k < n\}$.
$$\|f\| = \sup_{x \in X} |f(x)|, \qquad \|f\|_\gamma = \|f\| + \sup_{n \geq 1} n^\gamma \, \mathrm{dist}(f, \Pi_n).$$
$\gamma$ is the smoothness of $f$.
Construction
$h$: a smooth low pass filter (even, $= 1$ on $[0, 1/2]$, $= 0$ on $[1, \infty)$).
$$\Phi_n(x, y) = \sum_{k : \lambda_k < n} h\left(\frac{\lambda_k}{n}\right) \phi_k(x) \phi_k(y).$$
Fact:⁴ If the $x_j$'s are sufficiently dense, then there exist $w_j$ such that for all $P \in \Pi_n$,
$$\sum_j w_j P(x_j) = \int_X P(x) \, d\mu^*(x), \qquad \sum_j |w_j P(x_j)| \leq c \int_X |P(x)| \, d\mu^*(x).$$
(Marcinkiewicz-Zygmund (MZ) quadrature)

⁴ Filbir, Mhaskar, 2010, 2011
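A minimal sketch of $\Phi_n$ evaluated on the data points, from eigenpairs such as those produced by the graph-Laplacian sketch above; the smooth cutoff $h$ is the same illustrative construction used earlier, not a filter prescribed by the talk.

```python
import numpy as np

def g(u):
    return np.where(u > 0, np.exp(-1.0 / np.maximum(u, 1e-12)), 0.0)

def h(t):
    # Smooth low pass filter: = 1 on [0, 1/2], = 0 on [1, inf).
    u = np.clip(2 * np.abs(t) - 1, 0.0, 1.0)
    return g(1 - u) / (g(u) + g(1 - u))

def phi_n_matrix(lam, phi, n):
    # Phi_n(x_i, x_j) = sum_k h(lam_k / n) phi_k(x_i) phi_k(x_j),
    # with phi[:, k] = phi_k sampled at the data points.
    return (phi * h(lam / n)) @ phi.T
```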
Algorithm
◮ Find $w_j$'s depending only on the $x_j$'s and construct⁵
$$P(x) = \sum_{j=1}^M w_j y_j \Phi_n(x, x_j) = \underbrace{\sum_{j=1}^M w_j f(x_j) \Phi_n(x, x_j)}_{\sigma_n(f)(x)} + \underbrace{\sum_{j=1}^M w_j \epsilon_j \Phi_n(x, x_j)}_{\text{noise part}}.$$

⁵ Ehler, Filbir, Mhaskar, 2012
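A minimal sketch of this step. The kernel matrix K could come from the Phi_n construction above; the uniform weights $w_j = 1/M$ are a crude Monte-Carlo stand-in for the MZ quadrature weights, which this sketch does not compute.

```python
import numpy as np

def estimate(y, K, w=None):
    # P at the data points: P(x_i) = sum_j w_j y_j Phi_n(x_i, x_j),
    # where K[i, j] = Phi_n(x_i, x_j).
    if w is None:
        w = np.full(len(y), 1.0 / len(y))  # stand-in for MZ weights
    return K @ (w * y)

rng = np.random.default_rng(0)
K = 5.0 * np.eye(5)      # toy kernel matrix for illustration
y = rng.standard_normal(5)
print(estimate(y, K))    # equals y here, since K/M is the identity
```

Note that, in contrast with the traditional paradigm, no optimization is involved: the approximant is given by one explicit linear formula in the data.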
Theorem
$f \in W_\gamma$ if and only if $\|f - \sigma_n(f)\| = O(n^{-\gamma})$.⁶
If $f \in W_\gamma$ and $P$ is the noisy version of $\sigma_n(f)$, then with high probability, taking $n \sim (M / \log M)^{1/(2q + 2\gamma)}$,
$$\|f - P\| \leq c \, n^{-\gamma}.$$

⁶ Maggioni, Mhaskar, 2008
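To get a feel for the prescribed model order, a quick computation with assumed (illustrative) values $M = 10^5$, $q = 2$, $\gamma = 2$:

```python
import math

# n ~ (M / log M)^(1 / (2q + 2*gamma)) and the error scale n^(-gamma).
M, q, gamma = 10**5, 2, 2
n = (M / math.log(M)) ** (1 / (2 * q + 2 * gamma))
print(n, n ** -gamma)  # ~ 3.1 and ~ 0.10
```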
3. An application⁷

⁷ Mhaskar, Pereverzyev, van der Walt, 2017
Continuous blood glucose monitoring
Source: http://www.dexcom.com/seven-plus
Problem: Estimate the future blood glucose level based on the past few readings, and the direction in which it is going – up or down.
PRED-EGA grid
◮ Numerical accuracy is not as critical as classification errors.
◮ Depending upon whether the blood sugar is low, normal, or high, the results are classified as accurate, wrong but with no serious consequences (benign), or outright errors.
Deep diffusion network
Given sugar levels $s(t_0), s(t_1), \ldots$ at times $t_0, t_1, \ldots$ for different patients, we form a data set $P = \{(x_j, y_j)\}$ with $x_j = (s(t_0), \ldots, s(t_6))$ and $y_j = s(t_{12})$ (30 minute prediction), and a training set $C \subset P$.
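A minimal sketch of this windowing, assuming a 5-minute sampling interval (typical of CGM devices, though not stated on the slide), so that 6 steps past the window's end is a 30-minute horizon.

```python
import numpy as np

def make_windows(s, in_len=7, horizon=6):
    # x_j = (s(t_j), ..., s(t_{j+6})), y_j = s(t_{j+12}).
    X, y = [], []
    for j in range(len(s) - in_len - horizon + 1):
        X.append(s[j:j + in_len])
        y.append(s[j + in_len - 1 + horizon])
    return np.array(X), np.array(y)

s = np.array([110, 112, 115, 120, 126, 131, 135, 138, 140, 141,
              140, 138, 135, 131], dtype=float)  # illustrative trace
X, y = make_windows(s)
print(X.shape, y.shape)  # labels are the readings 30 minutes ahead
```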