Machine Learning
Dimensionality Reduction
Hamid Beigy
Sharif University of Technology
Fall 1396
Table of contents
1. Introduction
2. High-dimensional space
3. Dimensionality reduction methods
4. Feature selection methods
5. Feature extraction
6. Feature extraction methods
   - Principal component analysis
   - Kernel principal component analysis
   - Factor analysis
   - Multidimensional scaling
   - Locally linear embedding
   - Isomap
   - Linear discriminant analysis
Introduction
The complexity of any classifier or regressor depends on the number of input variables or features. These complexities include:
- Time complexity: In most learning algorithms, the time complexity depends on the number of input dimensions (D) as well as on the size of the training set (N). Decreasing D decreases the time complexity of the algorithm for both the training and testing phases.
- Space complexity: Decreasing D also decreases the amount of memory needed for the training and testing phases.
- Sample complexity: Usually the number of training examples (N) is a function of the length of the feature vectors (D), so decreasing the number of features also decreases the required number of training examples. As a rule of thumb, the number of training patterns should be 10 to 20 times the number of features.
Introduction
There are several reasons why we are interested in reducing dimensionality as a separate preprocessing step:
- Decreasing the time complexity of classifiers or regressors.
- Decreasing the cost of extracting/producing unnecessary features.
- Simpler models are more robust on small data sets.
- Simpler models have less variance and are therefore less affected by noise and outliers.
- The description of the classifier or regressor is simpler/shorter.
- Visualization of the data is easier.
Peaking phenomenon
In practice, for a finite N, increasing the number of features initially improves performance, but beyond a critical value, adding further features increases the probability of error. This is known as the peaking phenomenon.
If the number of samples increases ($N_2 \gg N_1$), the peaking phenomenon occurs at a larger number of features ($l_2 > l_1$).
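A small simulation can make the peaking effect concrete. The sketch below is not from the lecture: the synthetic two-class Gaussian data, the quadratic classifier, and all parameter values are illustrative assumptions. With a fixed, small training set, the test error typically drops while informative features are added and then climbs again as noisy features accumulate.

```python
# Hypothetical illustration of the peaking phenomenon (assumed setup, not from the slides).
# Two Gaussian classes differ only in the first few features; the rest are pure noise.
# With a fixed training-set size, test error first drops, then rises as D grows.
# Exact numbers vary with the random seed and the chosen classifier.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

def make_data(n, d, informative=5, shift=0.6):
    """Two classes of n points each, separated along the first `informative` dimensions."""
    mean = np.zeros(d)
    mean[:min(informative, d)] = shift
    X0 = rng.normal(0.0, 1.0, size=(n, d))
    X1 = rng.normal(0.0, 1.0, size=(n, d)) + mean
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

N_train, N_test = 20, 2000          # small training set -> peaking becomes visible
for d in [1, 2, 5, 10, 20, 40]:
    Xtr, ytr = make_data(N_train, d)
    Xte, yte = make_data(N_test, d)
    clf = QuadraticDiscriminantAnalysis(reg_param=1e-3).fit(Xtr, ytr)
    err = 1.0 - clf.score(Xte, yte)
    print(f"D = {d:3d}  test error = {err:.3f}")
```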
High-dimensional space
In most applications of data mining / machine learning, the data is typically very high dimensional (the number of features can easily be in the hundreds or thousands).
Understanding the nature of high-dimensional space (hyperspace) is very important, because hyperspace does not behave like the more familiar geometry in two or three dimensions.
Consider the $N \times D$ data matrix
$S = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1D} \\ x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{ND} \end{pmatrix}$
Let the minimum and maximum values for each feature $x_j$ be given as
$\min(x_j) = \min_i \{ x_{ij} \}, \qquad \max(x_j) = \max_i \{ x_{ij} \}.$
The data hyperspace can be considered as a $D$-dimensional hyper-rectangle, defined as
$R_D = \prod_{j=1}^{D} [\min(x_j), \max(x_j)].$
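A minimal numpy sketch of these definitions (the random data and variable names are placeholders, not part of the slides): compute $\min(x_j)$ and $\max(x_j)$ for each feature and assemble the hyper-rectangle $R_D$ as the product of per-feature intervals.

```python
# Sketch of the hyper-rectangle R_D spanned by an N x D data matrix S.
# The names S and R_D follow the slide; the random data is just a placeholder.
import numpy as np

rng = np.random.default_rng(1)
N, D = 150, 4
S = rng.normal(size=(N, D))            # stand-in for a real data set

mins = S.min(axis=0)                   # min(x_j) for each feature j
maxs = S.max(axis=0)                   # max(x_j) for each feature j
R_D = list(zip(mins, maxs))            # R_D = product of [min(x_j), max(x_j)]

for j, (lo, hi) in enumerate(R_D):
    print(f"feature {j}: [{lo:+.3f}, {hi:+.3f}]")
```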
High-dimensional space (cont.)
Hypercube
Assume the data is centered to have mean $\mu = 0$. Let $m$ denote the largest absolute value in $S$:
$m = \max_{j=1}^{D} \max_{i=1}^{N} \{ |x_{ij}| \}.$
The data hyperspace can be represented as a hypercube $H_D(l)$, centered at 0, with all sides of length $l = 2m$:
$H_D(l) = \{ x = (x_1, \ldots, x_D)^T \mid \forall i,\; x_i \in [-\tfrac{l}{2}, \tfrac{l}{2}] \}.$
Hypersphere
Assume the data is centered to have mean $\mu = 0$. Let $r$ denote the largest magnitude among all points in $S$:
$r = \max_i \{ \lVert x_i \rVert \}.$
The data hyperspace can also be represented as a $D$-dimensional hyperball centered at 0 with radius $r$:
$B_D(r) = \{ x \mid \lVert x \rVert \le r \}.$
The surface of the hyperball is called a hypersphere, and it consists of all the points exactly at distance $r$ from the center of the hyperball:
$S_D(r) = \{ x \mid \lVert x \rVert = r \}.$
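The enclosing hypercube and hyperball can be computed directly from a centered data matrix. The sketch below uses synthetic data and is only meant to illustrate the definitions of $m$, $l = 2m$, and $r$.

```python
# Sketch: enclosing hypercube H_D(l) and hyperball B_D(r) of a centered data set.
# The data below is synthetic; with real data, replace S accordingly.
import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(150, 4))
S = S - S.mean(axis=0)                 # center the data so that mu = 0

m = np.abs(S).max()                    # largest absolute coordinate value in S
l = 2 * m                              # side length of the hypercube H_D(l)
r = np.linalg.norm(S, axis=1).max()    # radius of the enclosing hyperball B_D(r)

print(f"hypercube side l = {l:.3f}, hyperball radius r = {r:.3f}")
```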
High-dimensional space (cont.)
Consider two features of the Iris data set: $X_1$ (sepal length) and $X_2$ (sepal width).
[Figure: scatter plot of the centered sepal length vs. sepal width values together with the enclosing circle of radius r; both axes range roughly from -2 to 2.]
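One possible way to reproduce a plot of this kind, assuming scikit-learn and matplotlib are available (the centering and the enclosing circle of radius r follow the description above):

```python
# Sketch: plot two centered Iris features and the enclosing circle of radius r.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

X = load_iris().data[:, :2]            # sepal length, sepal width
X = X - X.mean(axis=0)                 # center so that mu = 0
r = np.linalg.norm(X, axis=1).max()    # radius of the enclosing circle

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], s=10)
ax.add_patch(plt.Circle((0, 0), r, fill=False))
ax.set_xlabel("X1: sepal length")
ax.set_ylabel("X2: sepal width")
ax.set_aspect("equal")
plt.show()
```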
High-dimensional volumes
The volume of a hypercube with edge length $l$ equals
$\mathrm{vol}(H_D(l)) = l^D.$
The volume of a hyperball and its corresponding hypersphere equals
$\mathrm{vol}(S_D(r)) = \frac{\pi^{D/2}}{\Gamma(\frac{D}{2} + 1)} \, r^D,$
where the gamma function for $\alpha > 0$ is defined as
$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x} \, dx.$
The surface area of the hypersphere can be obtained by differentiating its volume with respect to $r$:
$\mathrm{area}(S_D(r)) = \frac{d}{dr}\,\mathrm{vol}(S_D(r)) = \frac{2\pi^{D/2}}{\Gamma(\frac{D}{2})} \, r^{D-1}.$
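These formulas are straightforward to evaluate numerically. The sketch below uses scipy's gamma function and checks the familiar D = 2 and D = 3 cases; values for other D follow the same expressions.

```python
# Sketch: hyperball volume and hypersphere surface area as functions of D and r,
# using the formulas from the slide.
import numpy as np
from scipy.special import gamma

def ball_volume(D, r=1.0):
    return (np.pi ** (D / 2) / gamma(D / 2 + 1)) * r ** D

def sphere_area(D, r=1.0):
    return (2 * np.pi ** (D / 2) / gamma(D / 2)) * r ** (D - 1)

for D in [2, 3, 5, 10]:
    print(f"D={D:2d}  vol={ball_volume(D):.4f}  area={sphere_area(D):.4f}")
# D=2: vol = pi, area = 2*pi (the circle); D=3: vol = 4/3*pi, area = 4*pi, etc.
```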
Asymptotic volume
An interesting observation about the hypersphere volume is that as dimensionality increases, the volume first increases up to a point, then starts to decrease, and ultimately vanishes.
For the unit hypersphere ($r = 1$),
$\lim_{D \to \infty} \mathrm{vol}(S_D(r)) = \lim_{D \to \infty} \frac{\pi^{D/2}}{\Gamma(\frac{D}{2} + 1)} \, r^D \to 0.$
[Figure: vol(S_d(1)) plotted against the dimension d = 1, ..., 50; the volume peaks around d = 5 and then decays toward zero.]
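A few lines of numpy reproduce this behavior: evaluate $\mathrm{vol}(S_d(1))$ for d = 1, ..., 50 and locate its maximum, which falls at d = 5.

```python
# Sketch: unit-hypersphere volume vs. dimension, showing the rise and then the decay.
import numpy as np
from scipy.special import gamma

d = np.arange(1, 51)
vol = np.pi ** (d / 2) / gamma(d / 2 + 1)   # vol(S_d(1))

print("peak at d =", d[np.argmax(vol)])      # the maximum occurs at d = 5
for dd in [1, 2, 5, 10, 20, 50]:
    print(f"d={dd:2d}  vol={vol[dd - 1]:.6f}")
```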
Hypersphere inscribed within hypercube
Consider the space enclosed within the largest hypersphere that can be accommodated within a hypercube: a hypersphere of radius $r$ inscribed in a hypercube with sides of length $2r$.
The ratio of the volume of the hypersphere of radius $r$ to the volume of the hypercube with side length $l = 2r$ equals
$\frac{\mathrm{vol}(S_2(r))}{\mathrm{vol}(H_2(2r))} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4} \approx 0.785,$
$\frac{\mathrm{vol}(S_3(r))}{\mathrm{vol}(H_3(2r))} = \frac{\frac{4}{3}\pi r^3}{8 r^3} = \frac{\pi}{6} \approx 0.524,$
$\lim_{D \to \infty} \frac{\mathrm{vol}(S_D(r))}{\mathrm{vol}(H_D(2r))} = \lim_{D \to \infty} \frac{\pi^{D/2}}{2^D \, \Gamma(\frac{D}{2} + 1)} \to 0.$
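The same ratio can be tabulated for any D. The sketch below evaluates it with r = 1 (the ratio does not depend on r) and shows how quickly it collapses toward zero.

```python
# Sketch: fraction of a hypercube's volume occupied by its inscribed hypersphere.
import numpy as np
from scipy.special import gamma

def inscribed_ratio(D):
    # vol(S_D(r)) / vol(H_D(2r)); independent of r
    return np.pi ** (D / 2) / (2 ** D * gamma(D / 2 + 1))

for D in [2, 3, 5, 10, 20, 50]:
    print(f"D={D:2d}  vol(S_D)/vol(H_D) = {inscribed_ratio(D):.2e}")
# D=2 -> 0.785, D=3 -> 0.524, and the ratio rapidly vanishes as D grows.
```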
Hypersphere inscribed within hypercube
[Figure: hypersphere inscribed inside a hypercube for two and three dimensions.]
[Figure: conceptual view of high-dimensional space for two, three, four, and higher dimensions (panels a-d).]
In $d$ dimensions there are $2^d$ corners and $2^{d-1}$ diagonals.
Volume of thin hypersphere shell
Consider the volume of a thin hypersphere shell of width $\epsilon$ bounded by an outer hypersphere of radius $r$ and an inner hypersphere of radius $r - \epsilon$.
The volume of the thin shell equals the difference between the volumes of the two bounding hyperspheres.
[Figure: outer hypersphere of radius r and inner hypersphere of radius r - ε; the shell of width ε lies between them.]
Let $S_D(r, \epsilon)$ denote the thin hypersphere shell of width $\epsilon$. Its volume equals
$\mathrm{vol}(S_D(r, \epsilon)) = \mathrm{vol}(S_D(r)) - \mathrm{vol}(S_D(r - \epsilon)) = K_D r^D - K_D (r - \epsilon)^D,$
where
$K_D = \frac{\pi^{D/2}}{\Gamma(\frac{D}{2} + 1)}.$
Volume of thin hypersphere shell (cont.)
The ratio of the volume of the thin shell to the volume of the outer sphere equals
$\frac{\mathrm{vol}(S_D(r, \epsilon))}{\mathrm{vol}(S_D(r))} = \frac{K_D r^D - K_D (r - \epsilon)^D}{K_D r^D} = 1 - \left(1 - \frac{\epsilon}{r}\right)^D.$
For $r = 1$ and $\epsilon = 0.01$:
$\frac{\mathrm{vol}(S_2(1, 0.01))}{\mathrm{vol}(S_2(1))} = 1 - \left(1 - \frac{0.01}{1}\right)^2 \approx 0.02,$
$\frac{\mathrm{vol}(S_3(1, 0.01))}{\mathrm{vol}(S_3(1))} = 1 - \left(1 - \frac{0.01}{1}\right)^3 \approx 0.03,$
$\frac{\mathrm{vol}(S_4(1, 0.01))}{\mathrm{vol}(S_4(1))} = 1 - \left(1 - \frac{0.01}{1}\right)^4 \approx 0.04,$
$\frac{\mathrm{vol}(S_5(1, 0.01))}{\mathrm{vol}(S_5(1))} = 1 - \left(1 - \frac{0.01}{1}\right)^5 \approx 0.05.$
As $D$ increases, in the limit we obtain
$\lim_{D \to \infty} \frac{\mathrm{vol}(S_D(r, \epsilon))}{\mathrm{vol}(S_D(r))} = \lim_{D \to \infty} \left[ 1 - \left(1 - \frac{\epsilon}{r}\right)^D \right] \to 1.$
Almost all of the volume of the hypersphere is contained in the thin shell as $D \to \infty$.
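The shell fraction $1 - (1 - \epsilon/r)^D$ needs no special libraries. The sketch below evaluates it for the values used above and for much larger D to show the limit.

```python
# Sketch: fraction of a unit hypersphere's volume inside a thin outer shell of width eps.
eps, r = 0.01, 1.0

def shell_fraction(D, eps=eps, r=r):
    return 1.0 - (1.0 - eps / r) ** D

for D in [2, 3, 4, 5, 100, 1000]:
    print(f"D={D:4d}  shell fraction = {shell_fraction(D):.4f}")
# 0.02, 0.03, 0.04, 0.05, ..., approaching 1 as D grows:
# nearly all of the volume ends up in the thin outer shell.
```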