J. Vega. Asociación EURATOM/CIEMAT para Fusión. jesus.vega@ciemat.es
Concepts
Classification
Regression
Advanced methods
Technology: it gives stereotyped solutions to stereotyped problems
Basic science: it is the accumulation of knowledge to explain a phenomenon
Applied science: it is the application of scientific knowledge to a particular environment
◦ Machine learning can increase our comprehension of plasma physics
Learning does not mean 'learning by heart' (any computer can memorize)
Learning means 'generalization capability': we learn from some samples and predict for other samples
The learning problem is the problem of finding a desired dependence (function) using a limited number of observations (training data)
◦ Classification: the function can represent the separation frontier between two classes
◦ Regression: the function can provide a fit to the data
[Figure: left panel, "Classification" (two classes of points separated by a frontier); right panel, "Regression" (a curve fitted to data points).]
The general model of learning from examples is described through three components:
◦ Generator of random vectors x ∈ Rⁿ, drawn from a fixed but unknown probability distribution p(x)
◦ Supervisor: returns a label y for every x according to p(y|x), fixed and unknown
◦ Learning machine: implements a set of functions f(x, α) and outputs ŷ = f(x, α)
(x_i, y_i), i = 1, ..., N: training samples
The problem of learning is that of choosing, from the given set of functions f(x, α), the one that best approximates the supervisor's response: ŷ_1, ŷ_2, ..., ŷ_N "close" to y_1, y_2, ..., y_N
Main hypothesis
◦ The training set, (x_i, y_i), i = 1, ..., N, is made up of independent and identically distributed (iid) observations drawn according to p(x, y) = p(y|x) p(x)
Loss function: L(y, f(x, α))
◦ It measures the quality of the approximation performed by the learning algorithm, i.e. the discrepancy between the response y of the supervisor and the response f(x, α) of the learning machine. Its values are ≥ 0
Risk functional:
R(α) = ∫ L(y, f(x, α)) p(x, y) dx dy
The goal of a learning process is to find the function f(x, α₀) that minimizes R(α) (over the class of functions f(x, α)) in the situation where p(x, y) is unknown and the only available information is contained in the training set
R(α) = ∫ L(y, f(x, α)) p(x, y) dx dy
Pattern recognition (or classification):
L(y, f(x, α)) = 0 if y = f(x, α); 1 if y ≠ f(x, α)
Regression estimation:
L(y, f(x, α)) = (y − f(x, α))²
Density estimation:
L(p(x, α)) = −log p(x, α)
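In practice p(x, y) is unknown, so the risk is typically estimated by the empirical risk, the average loss over the training samples. Below is a minimal sketch of the three losses and of the empirical risk in Python with NumPy; the function names and toy labels are illustrative, not part of the lecture.

```python
# Minimal sketch of the three loss functions above and of the empirical
# risk (the average loss over the training samples, which is what can be
# computed when p(x, y) is unknown). Names are illustrative.
import numpy as np

def zero_one_loss(y, y_pred):          # pattern recognition / classification
    return np.where(y == y_pred, 0.0, 1.0)

def squared_loss(y, y_pred):           # regression estimation
    return (y - y_pred) ** 2

def log_loss_density(p_x):             # density estimation, p(x, alpha) > 0
    return -np.log(p_x)

def empirical_risk(loss, y, y_pred):
    return np.mean(loss(y, y_pred))

# Example: empirical risk of a classifier on 5 labelled samples
y_true = np.array([1, 1, 0, 0, 1])
y_hat  = np.array([1, 0, 0, 0, 1])
print(empirical_risk(zero_one_loss, y_true, y_hat))   # 0.2
```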
f(x, α) = 0 if x₂ < a·x₁ + b; 1 if x₂ ≥ a·x₁ + b
α = (a, b)
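As an illustration, a toy implementation of this parametric family of decision rules (assuming the reconstruction above, with α = (a, b) and the decision taken with respect to the line x₂ = a·x₁ + b):

```python
# Toy implementation of the linear decision rule f(x, alpha) above,
# with alpha = (a, b); the inequality direction follows the
# reconstruction of the slide and is only illustrative.
def f(x, alpha):
    """x = (x1, x2); returns 0 below the line x2 = a*x1 + b, 1 otherwise."""
    a, b = alpha
    x1, x2 = x
    return 0 if x2 < a * x1 + b else 1

print(f((1.0, 0.5), (1.0, 0.0)))   # 0: the point lies below the line x2 = x1
print(f((1.0, 2.0), (1.0, 0.0)))   # 1: the point lies above it
```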
[Further examples of candidate sets of functions f_1(x, α), f_2(x, α), f_3(x, α), each defined by its own parameter vector α (e.g. amplitude/frequency parameters A_i, f_i for the first two families; slope and intercept m, b for a linear model f_3(x, α) = m·x + b).]
Dataset: (x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_N, y_N)
x_i ∈ Rᵐ: features that are of distinctive nature (object description with attributes managed by computers)
y_i ∈ {L_1, L_2, ..., L_K}: label of the sample x_i
Feature types
◦ Quantitative (numerical)
   Continuous-valued (length, pressure)
   Discrete (total basketball score, number of citizens in a town)
◦ Qualitative (categorical)
   Ordinal (education degree)
   Nominal (profession, brand of a car)
Dataset: (x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_N, y_N)
x_i ∈ Rᵐ: features that are of distinctive nature (object description with attributes managed by computers)
y_i ∈ {L_1, L_2, ..., L_K}: known label of the sample x_i
Objective: to determine a separating function between classes (generalization) in order to predict the label of new samples with known feature vectors ((x_{N+1}, y_{N+1}), (x_{N+2}, y_{N+2}), ...)
[Figure: two decision boundaries for the same training data; a smooth boundary that generalizes well versus an overly complex one that overfits the training samples.]
How good is a classifier?
Dataset: (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)
◦ (x_i, y_i), i = 1, ..., J: training set
◦ (x_i, y_i), i = J+1, ..., N: test set
Training set: a model is created to make predictions. Given x_i, the model predicts y_i
Test set: model validation. The success rate is taken as the level of confidence and it is assumed to be the same for all future samples
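A minimal sketch of this train/test protocol in Python, assuming NumPy and scikit-learn are available; the synthetic dataset and all variable names are illustrative, not taken from the lecture.

```python
# Minimal sketch of the train/test protocol described above.
# Assumes scikit-learn and NumPy; the two-class dataset here is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Two Gaussian clouds in R^2, labels +1 / -1
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(3.0, 1.0, (100, 2))])
y = np.array([1] * 100 + [-1] * 100)

# Split: (x_i, y_i), i = 1..J for training, i = J+1..N for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = SVC(kernel="linear")      # any classifier could be used here
model.fit(X_train, y_train)       # model built on the training set only

# Success rate on the unseen test set = estimate of the confidence level
print("test success rate:", accuracy_score(y_test, model.predict(X_test)))
```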
Multi-class problems: K > 2
◦ They can be tackled as K binary problems: each class is compared with the rest (one-versus-the-rest approach)
[Figure: one-versus-the-rest decision regions for classes c_1, ..., c_4 (c_k versus not c_k); the overlap of the binary decisions leaves ambiguity regions.]
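A short sketch of the one-versus-the-rest strategy, assuming scikit-learn; the three-class synthetic data and the names are illustrative only.

```python
# Sketch of the one-versus-the-rest strategy for a K-class problem.
# Assumes scikit-learn; data and names are illustrative only.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
centres = np.array([[0, 0], [4, 0], [2, 4]])          # K = 3 classes
X = np.vstack([rng.normal(c, 1.0, (50, 2)) for c in centres])
y = np.repeat([0, 1, 2], 50)

# One binary classifier per class: class k versus "not class k"
ovr = OneVsRestClassifier(LinearSVC())
ovr.fit(X, y)

# Ambiguities between the binary classifiers are resolved by taking the
# class whose binary decision function gives the largest value
print(ovr.predict([[2.0, 1.5]]))
```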
Examples of feature vectors
◦ Disruptions (labels D = disruptive, N = non-disruptive), with signal values taken every T along the discharge:
x_1 = (I_p(t_s), n_e(t_s), ...), y_1 ∈ {D, N}
x_2 = (I_p(t_s + T), n_e(t_s + T), ...), y_2 ∈ {D, N}
x_3 = (I_p(t_s + 2T), n_e(t_s + 2T), ...), y_3 ∈ {D, N}
...
◦ L/H transition: x_i ∈ Rᵐ, y_i ∈ {L, H}
◦ Image classification: x_i: the set of pixels of an image, y_i ∈ {1, 2, 3}
[Figure: time traces of the plasma signals (time axis 0–20) from which the feature vectors are extracted.]
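A possible sketch of how such time-window feature vectors could be assembled from signal traces; the signal names, window statistics, and labelling rule are hypothetical and only illustrate the idea, they are not the lecture's actual feature extraction.

```python
# Sketch: building (x_i, y_i) pairs from time traces of plasma signals,
# one feature vector per time window of length T. Signal names, window
# statistics and the labelling rule are illustrative only.
import numpy as np

def make_feature_vectors(t, ip, ne, T, t_disr=None):
    """Cut the signals into consecutive windows of length T and return
    one feature vector (mean Ip, mean ne) per window, labelled 'D' if the
    window reaches the assumed disruption time, else 'N'."""
    X, y = [], []
    t_start = t[0]
    while t_start + T <= t[-1]:
        mask = (t >= t_start) & (t < t_start + T)
        X.append([ip[mask].mean(), ne[mask].mean()])   # features of the window
        label = "D" if (t_disr is not None and t_start + T >= t_disr) else "N"
        y.append(label)
        t_start += T
    return np.array(X), np.array(y)

# Toy usage with synthetic signals over 0-20 time units
t = np.linspace(0.0, 20.0, 2001)
ip = 2.0 - 0.05 * t + 0.01 * np.random.default_rng(4).normal(size=t.size)
ne = 1.0 + 0.02 * t
X, y = make_feature_vectors(t, ip, ne, T=1.0, t_disr=18.0)
print(X.shape, y[-3:])
```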
Single classifiers
◦ Support Vector Machines (SVM)
◦ Neural networks
◦ Bayes decision theory
   Parametric methods
   Non-parametric methods
◦ Classification trees
Combining classifiers
Binary classifier: it finds the optimal separating hyper-plane between classes
Samples: (x_k, y_k), x_k ∈ Rⁿ, k = 1, ..., N, y ∈ {C_{+1}, C_{−1}}
Separating hyper-plane: D(x) = w·x + b = 0, with margin hyper-planes D(x) = +1 and D(x) = −1; the distance from a sample x_k to the separating hyper-plane is |D(x_k)| / ||w||
Maximum margin: the hyper-plane must satisfy y_k D(x_k) / ||w|| ≥ τ, k = 1, ..., N, and the margin 2τ is to be maximized
To find the optimal hyper-plane it is necessary to determine the vector w that maximizes the margin τ. There are infinite solutions due to the presence of a scale factor; to avoid this, τ ||w|| = 1 is imposed. Therefore, maximizing the margin is equivalent to minimizing ||w||
Optimization problem: min J(w) = ||w||² / 2, subject to y_k (w·x_k + b) ≥ 1, k = 1, ..., N
[Figure: classes C_{+1} and C_{−1} separated by the hyper-plane D(x) = 0; samples with D(x) > +1 belong to C_{+1} and samples with D(x) < −1 to C_{−1}.]
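A minimal sketch of the (hard-margin) linear SVM just described, assuming scikit-learn; a very large regularization constant C approximates the hard-margin problem, and the synthetic data and names are illustrative.

```python
# Sketch of the hard-margin linear SVM described above, using scikit-learn.
# A very large C approximates the problem min ||w||^2/2 subject to
# y_k (w.x_k + b) >= 1. Data are synthetic and linearly separable.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)),    # class -1
               rng.normal(+2.0, 0.5, (50, 2))])   # class +1
y = np.array([-1] * 50 + [+1] * 50)

svm = SVC(kernel="linear", C=1e6)   # large C ~ hard margin
svm.fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
# Only the support vectors (non-zero Lagrange multipliers) define the solution
print("number of support vectors:", svm.support_vectors_.shape[0])
```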
Solution: w* = Σ_i α_i* y_i x_i, where the α_i are the Lagrange multipliers
Samples: (x_k, y_k), x_k ∈ Rⁿ, k = 1, ..., N, y ∈ {C_{+1}, C_{−1}}
Samples associated with α_i ≠ 0 are called "support vectors"; the rest of the training samples are irrelevant to classify new samples
The constant b is obtained from any Karush-Kuhn-Tucker condition: α_i [y_i (w·x_i + b) − 1] = 0, i = 1, ..., N
D(x) = w*·x + b* is the distance (with sign) from x to the separating hyper-plane
Given x to classify: if sign(Σ_{support vectors} α_i* y_i (x_i·x) + b*) > 0, then x ∈ C_{+1}; otherwise x ∈ C_{−1}
V. Cherkassky, F. Mulier. Learning from Data. 2nd edition. Wiley-Interscience.
Non-linearly separable case: the data are mapped from the input space to a feature space, where the inner product is computed through a kernel H(x, x′)
Kernels
◦ Linear: H(x, x′) = (x·x′)
◦ Polynomial of degree q: H(x, x′) = [(x·x′) + 1]^q
◦ Radial basis functions: H(x, x′) = exp(−||x − x′||² / σ²)
◦ Neural network: H(x, x′) = tanh(2(x·x′) + 1)
Given x to classify: if sign(Σ_{support vectors} α_i* y_i H(x_i, x) + b*) > 0, then x ∈ C_{+1}; otherwise x ∈ C_{−1}
V. Cherkassky, F. Mulier. Learning from Data. 2nd edition. Wiley-Interscience.
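A short sketch of a kernel SVM with the radial basis function kernel, assuming scikit-learn (whose "rbf" kernel is parametrized by gamma, playing the role of 1/σ²); the ring-shaped synthetic data and the names are illustrative.

```python
# Sketch of a kernel SVM for a non-linearly separable problem, using the
# radial basis function kernel listed above. Data are synthetic: one class
# inside a circle, the other outside it, so no hyper-plane separates them
# in the input space.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
radius = rng.uniform(0.0, 3.0, 200)
angle = rng.uniform(0.0, 2 * np.pi, 200)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
y = np.where(radius < 1.5, 1, -1)          # +1 inside, -1 outside

svm_rbf = SVC(kernel="rbf", gamma=0.5, C=10.0)   # gamma ~ 1/sigma^2
svm_rbf.fit(X, y)

# New samples classified by sign(sum_i alpha_i y_i H(x_i, x) + b)
print(svm_rbf.predict([[0.0, 0.0], [2.5, 0.0]]))  # inner point, outer point
```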