Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (1) and Wagner Meira Jr. (2)
(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 6: High-dimensional Data
High-dimensional Space

Let D be an n × d data matrix. In data mining the data is typically very high dimensional. Understanding the nature of high-dimensional space, or hyperspace, is very important, especially because it does not behave like the more familiar geometry in two or three dimensions.

Hyper-rectangle: The data space is a d-dimensional hyper-rectangle
    $R^d = \prod_{j=1}^{d} \big[\min(X_j), \max(X_j)\big]$
where $\min(X_j)$ and $\max(X_j)$ specify the range of attribute $X_j$.

Hypercube: Assume the data is centered, and let m denote the maximum absolute attribute value
    $m = \max_{j=1}^{d} \max_{i=1}^{n} |x_{ij}|$
The data hyperspace can then be represented as a hypercube, centered at 0, with all sides of length $l = 2m$, given as
    $H_d(l) = \big\{ x = (x_1, x_2, \ldots, x_d)^T \mid \forall i,\; x_i \in [-l/2, l/2] \big\}$
The unit hypercube has all sides of length $l = 1$, and is denoted $H_d(1)$.
Hypersphere

Assume that the data has been centered, so that $\mu = 0$. Let r denote the largest magnitude among all points:
    $r = \max_i \big\{ \|x_i\| \big\}$
The data hyperspace can be represented as a d-dimensional hyperball centered at 0 with radius r, defined as
    $B_d(r) = \big\{ x \mid \|x\| \le r \big\}$   or   $B_d(r) = \big\{ x = (x_1, x_2, \ldots, x_d)^T \mid \sum_{j=1}^{d} x_j^2 \le r^2 \big\}$
The surface of the hyperball is called a hypersphere, and it consists of all the points exactly at distance r from the center of the hyperball:
    $S_d(r) = \big\{ x \mid \|x\| = r \big\}$   or   $S_d(r) = \big\{ x = (x_1, x_2, \ldots, x_d)^T \mid \sum_{j=1}^{d} x_j^2 = r^2 \big\}$
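A minimal numerical sketch of these two enclosing shapes, assuming a small centered data matrix drawn at random (the array D and its size are illustrative, not the Iris data shown on the next slide):

import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((150, 2))   # hypothetical n x d data matrix
D = D - D.mean(axis=0)              # center the data so that mu = 0

# Hypercube: side length l = 2m, where m is the largest absolute attribute value
m = np.abs(D).max()
l = 2 * m

# Hyperball: radius r is the largest point magnitude ||x_i||
r = np.linalg.norm(D, axis=1).max()

print(f"hypercube side l = {l:.2f}, hyperball radius r = {r:.2f}")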
Iris Data Hyperspace: Hypercube and Hypersphere

[Figure: scatter plot of the centered Iris data, $X_1$: sepal length versus $X_2$: sepal width, enclosed by the hypercube with side $l = 4.12$ and the hyperball with radius $r = 2.19$.]
High-dimensional Volumes

Hypercube: The volume of a hypercube with edge length l is given as
    $\text{vol}(H_d(l)) = l^d$

Hypersphere: The volume of a hyperball and its corresponding hypersphere is identical. The volume of a hypersphere is given as
    In 1D: $\text{vol}(S_1(r)) = 2r$
    In 2D: $\text{vol}(S_2(r)) = \pi r^2$
    In 3D: $\text{vol}(S_3(r)) = \frac{4}{3}\pi r^3$
    In d dimensions: $\text{vol}(S_d(r)) = K_d\, r^d = \dfrac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2}+1\right)}\, r^d$
where
    $\Gamma\!\left(\tfrac{d}{2}+1\right) = \left(\tfrac{d}{2}\right)!$ if d is even, and $\Gamma\!\left(\tfrac{d}{2}+1\right) = \sqrt{\pi}\,\dfrac{d!!}{2^{(d+1)/2}}$ if d is odd.
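The closed-form volume is easy to verify numerically with the gamma function; a short Python sketch (the helper name sphere_volume is my own):

import math

def sphere_volume(d, r=1.0):
    """Volume of the d-dimensional hyperball of radius r: K_d * r^d."""
    K_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return K_d * r ** d

# Recover the familiar low-dimensional cases for r = 2
r = 2.0
print(sphere_volume(1, r), 2 * r)                      # 2r
print(sphere_volume(2, r), math.pi * r ** 2)           # pi r^2
print(sphere_volume(3, r), 4 / 3 * math.pi * r ** 3)   # (4/3) pi r^3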
Volume of Unit Hypersphere

With increasing dimensionality the hypersphere volume first increases up to a point, and then starts to decrease, and ultimately vanishes. In particular, for the unit hypersphere with $r = 1$,
    $\lim_{d \to \infty} \text{vol}(S_d(1)) = \lim_{d \to \infty} \dfrac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2}+1\right)} \to 0$

[Figure: plot of $\text{vol}(S_d(1))$ versus dimensionality $d$ for $d = 0$ to $50$; the volume peaks and then decays toward 0.]
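Reusing the hypothetical sphere_volume helper from the previous sketch, the rise and eventual collapse of the unit-ball volume can be tabulated directly:

for d in (1, 2, 5, 10, 20, 50):
    print(f"d = {d:2d}: vol(S_d(1)) = {sphere_volume(d):.6f}")
# The volume peaks near d = 5 (about 5.26) and is essentially zero by d = 50.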
Hypersphere Inscribed within Hypercube

Consider the space enclosed within the largest hypersphere that can be accommodated within a hypercube (which represents the dataspace). The ratio of the volume of the hypersphere of radius r to the hypercube with side length $l = 2r$ is given as:
    In 2 dimensions: $\dfrac{\text{vol}(S_2(r))}{\text{vol}(H_2(2r))} = \dfrac{\pi r^2}{4 r^2} = \dfrac{\pi}{4} = 78.5\%$
    In 3 dimensions: $\dfrac{\text{vol}(S_3(r))}{\text{vol}(H_3(2r))} = \dfrac{\frac{4}{3}\pi r^3}{8 r^3} = \dfrac{\pi}{6} = 52.4\%$
    In d dimensions: $\lim_{d \to \infty} \dfrac{\text{vol}(S_d(r))}{\text{vol}(H_d(2r))} = \lim_{d \to \infty} \dfrac{\pi^{d/2}}{2^d\, \Gamma\!\left(\frac{d}{2}+1\right)} \to 0$

As the dimensionality increases, most of the volume of the hypercube is in the "corners," whereas the center is essentially empty.
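A quick numerical check of this ratio, again assuming the sphere_volume helper defined earlier:

def sphere_to_cube_ratio(d):
    """Fraction of the hypercube H_d(2r) occupied by the inscribed ball S_d(r)."""
    # The radius cancels, so take r = 1: vol(S_d(1)) / vol(H_d(2)) = K_d / 2^d
    return sphere_volume(d) / 2 ** d

for d in (2, 3, 10, 20):
    print(f"d = {d:2d}: ratio = {sphere_to_cube_ratio(d):.8f}")
# d = 2 gives 0.785 (78.5%), d = 3 gives 0.524 (52.4%), and the ratio rapidly approaches 0.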
Hypersphere Inscribed inside a Hypercube

[Figure: hypersphere of radius r inscribed inside the hypercube $[-r, r]^d$.]
Conceptual View of High-dimensional Space

Two, three, four, and higher dimensions: all the volume of the hyperspace is in the corners, with the center being essentially empty. High-dimensional space looks like a rolled-up porcupine!

[Figure: conceptual illustration in (a) 2D, (b) 3D, (c) 4D, and (d) dD.]
Volume of a Thin Shell

The volume of a thin hypershell of width $\epsilon$ is given as
    $\text{vol}(S_d(r, \epsilon)) = \text{vol}(S_d(r)) - \text{vol}(S_d(r - \epsilon)) = K_d r^d - K_d (r - \epsilon)^d$

The ratio of the volume of the thin shell to the volume of the outer sphere is
    $\dfrac{\text{vol}(S_d(r, \epsilon))}{\text{vol}(S_d(r))} = \dfrac{K_d r^d - K_d (r - \epsilon)^d}{K_d r^d} = 1 - \left(1 - \dfrac{\epsilon}{r}\right)^{d}$

As d increases, we have
    $\lim_{d \to \infty} \dfrac{\text{vol}(S_d(r, \epsilon))}{\text{vol}(S_d(r))} = \lim_{d \to \infty} 1 - \left(1 - \dfrac{\epsilon}{r}\right)^{d} \to 1$
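A numeric illustration of the shell fraction $1 - (1 - \epsilon/r)^d$; the chosen r and eps values are arbitrary:

def shell_fraction(d, r=1.0, eps=0.01):
    """Fraction of the ball's volume lying in the thin outer shell of width eps."""
    return 1.0 - (1.0 - eps / r) ** d

for d in (2, 10, 100, 1000):
    print(f"d = {d:4d}: shell fraction = {shell_fraction(d):.4f}")
# Even a shell only 1% of the radius wide holds essentially all of the volume once d is large.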
Diagonals in Hyperspace

Consider a d-dimensional hypercube, with origin $0_d = (0_1, 0_2, \ldots, 0_d)^T$, and bounded in each dimension in the range $[-1, 1]$. Each "corner" of the hyperspace is a d-dimensional vector of the form $(\pm 1_1, \pm 1_2, \ldots, \pm 1_d)^T$.

Let $e_i = (0_1, \ldots, 1_i, \ldots, 0_d)^T$ denote the d-dimensional canonical unit vector in dimension i, and let $\mathbf{1}$ denote the d-dimensional diagonal vector $(1_1, 1_2, \ldots, 1_d)^T$.

Consider the angle $\theta_d$ between the diagonal vector $\mathbf{1}$ and the first axis $e_1$, in d dimensions:
    $\cos\theta_d = \dfrac{e_1^T \mathbf{1}}{\|e_1\|\,\|\mathbf{1}\|} = \dfrac{e_1^T \mathbf{1}}{\sqrt{e_1^T e_1}\,\sqrt{\mathbf{1}^T \mathbf{1}}} = \dfrac{1}{\sqrt{1}\,\sqrt{d}} = \dfrac{1}{\sqrt{d}}$
Diagonals in Hyperspace

As d increases, we have
    $\lim_{d \to \infty} \cos\theta_d = \lim_{d \to \infty} \dfrac{1}{\sqrt{d}} \to 0$
which implies that
    $\lim_{d \to \infty} \theta_d \to \dfrac{\pi}{2} = 90^{\circ}$
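The limiting angle can be checked numerically; a minimal sketch (the vector construction and function name are illustrative):

import numpy as np

def diagonal_axis_angle(d):
    """Angle (in degrees) between the all-ones diagonal vector and the first axis e_1."""
    e1 = np.zeros(d)
    e1[0] = 1.0
    ones = np.ones(d)
    cos_theta = e1 @ ones / (np.linalg.norm(e1) * np.linalg.norm(ones))  # equals 1/sqrt(d)
    return np.degrees(np.arccos(cos_theta))

for d in (2, 3, 10, 100, 1000):
    print(f"d = {d:4d}: theta_d = {diagonal_axis_angle(d):.2f} degrees")
# 45.00, 54.74, 71.57, 84.26, 88.19, ... approaching 90 degrees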
Angle between Diagonal Vector 1 and e_1

[Figure: the angle $\theta$ between the diagonal vector $\mathbf{1}$ and the first axis $e_1$, shown (a) in 2D and (b) in 3D.]

In high dimensions all of the diagonal vectors are essentially perpendicular (or orthogonal) to all the coordinate axes! Each of the $2^{d-1}$ new axes connecting pairs of the $2^d$ corners is essentially orthogonal to all of the d principal coordinate axes. Thus, in effect, high-dimensional space has an exponential number of orthogonal "axes."
Density of the Multivariate Normal

Consider the standard multivariate normal distribution with $\mu = 0$ and $\Sigma = I$:
    $f(x) = \dfrac{1}{(\sqrt{2\pi})^{d}} \exp\left(-\dfrac{x^T x}{2}\right)$
The peak of the density is at the mean.

Consider the set of points x with density at least an $\alpha$ fraction of the density at the mean:
    $\dfrac{f(x)}{f(0)} \ge \alpha$
    $\exp\left(-\dfrac{x^T x}{2}\right) \ge \alpha$
    $x^T x \le -2\ln(\alpha)$
    $\sum_{i=1}^{d} x_i^2 \le -2\ln(\alpha)$
The sum of squared IID standard normal random variables follows a chi-squared distribution $\chi^2_d$. Thus,
    $P\left(\dfrac{f(x)}{f(0)} \ge \alpha\right) = F_{\chi^2_d}\!\left(-2\ln(\alpha)\right)$
where $F_{\chi^2_d}$ is the CDF of the $\chi^2_d$ distribution.
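The probability mass inside the alpha-density contour follows directly from the chi-squared CDF; a sketch using scipy (the d values are illustrative, and the d = 1 and d = 2 results match the next two slides):

import math
from scipy.stats import chi2

alpha = 0.5
threshold = -2 * math.log(alpha)   # = 1.386

for d in (1, 2, 10, 100):
    p = chi2.cdf(threshold, df=d)  # P( f(x)/f(0) >= alpha )
    print(f"d = {d:3d}: P = {p:.4f}")
# d = 1 gives about 0.76, d = 2 gives 0.50, and for large d almost no mass lies near the mean.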
Density Contour for α Fraction of the Density at the Mean: One Dimension

Let $\alpha = 0.5$; then $-2\ln(0.5) = 1.386$ and $F_{\chi^2_1}(1.386) = 0.76$. Thus, 24% of the density is in the tail regions.

[Figure: standard normal density in one dimension over $[-4, 4]$, with the central region where $f(x)/f(0) \ge \alpha = 0.5$ highlighted.]
Density Contour for α Fraction of the Density at the Mean: Two Dimensions

Let $\alpha = 0.5$; then $-2\ln(0.5) = 1.386$ and $F_{\chi^2_2}(1.386) = 0.50$. Thus, 50% of the density is in the tail regions.

[Figure: standard bivariate normal density $f(x)$ over $X_1, X_2 \in [-4, 4]$, with the contour where $f(x)/f(0) = \alpha = 0.5$ marked.]