Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (1), Wagner Meira Jr. (2)
(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 1: Data Mining and Analysis

Zaki & Meira Jr. (RPI and UFMG), Data Mining and Machine Learning, Chapter 1: Data Mining and Analysis

Data Matrix

Data can often be represented or abstracted as an n × d data matrix, with n rows and d columns, given as

$$
D = \begin{array}{c|cccc}
 & X_1 & X_2 & \cdots & X_d \\
\hline
x_1 & x_{11} & x_{12} & \cdots & x_{1d} \\
x_2 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_n & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{array}
$$

Rows: Also called instances, examples, records, transactions, objects, points, feature-vectors, etc. Each row is given as a d-tuple $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$.

Columns: Also called attributes, properties, features, dimensions, variables, fields, etc. Each column is given as an n-tuple $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$.

Iris Dataset Extract

         Sepal length   Sepal width   Petal length   Petal width   Class
         X_1            X_2           X_3            X_4           X_5
x_1      5.9            3.0           4.2            1.5           Iris-versicolor
x_2      6.9            3.1           4.9            1.5           Iris-versicolor
x_3      6.6            2.9           4.6            1.3           Iris-versicolor
x_4      4.6            3.2           1.4            0.2           Iris-setosa
x_5      6.0            2.2           4.0            1.0           Iris-versicolor
x_6      4.7            3.2           1.3            0.2           Iris-setosa
x_7      6.5            3.0           5.8            2.2           Iris-virginica
x_8      5.8            2.7           5.1            1.9           Iris-virginica
...      ...            ...           ...            ...           ...
x_149    7.7            3.8           6.7            2.2           Iris-virginica
x_150    5.1            3.4           1.5            0.2           Iris-setosa
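
As a rough illustration of the row and column views of a data matrix, the following Python sketch stores the numeric part of the first four Iris rows as a list of rows. The names `D`, `row`, and `column` are illustrative choices, not notation from the slides, and plain lists are used only to keep the sketch dependency-free.

```python
# Numeric part of the Iris extract: rows x_1..x_4, columns X_1..X_4.
D = [
    [5.9, 3.0, 4.2, 1.5],
    [6.9, 3.1, 4.9, 1.5],
    [6.6, 2.9, 4.6, 1.3],
    [4.6, 3.2, 1.4, 0.2],
]

n = len(D)      # number of rows (points)
d = len(D[0])   # number of columns (attributes)


def row(D, i):
    """Return point x_i as a d-tuple (1-based index, as on the slides)."""
    return tuple(D[i - 1])


def column(D, j):
    """Return attribute X_j as an n-tuple (1-based index, as on the slides)."""
    return tuple(r[j - 1] for r in D)


x1 = row(D, 1)      # first point: all four measurements of one flower
X2 = column(D, 2)   # second attribute: sepal width of every flower
```

Here `x1` holds one flower's measurements, while `X2` holds the sepal width of all four flowers.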

Attributes

Attributes may be classified into two main types.

Numeric Attributes: real-valued or integer-valued domain
    Interval-scaled: only differences are meaningful, e.g., temperature
    Ratio-scaled: differences and ratios are meaningful, e.g., Age

Categorical Attributes: set-valued domain composed of a set of symbols
    Nominal: only equality is meaningful, e.g., domain(Sex) = {M, F}
    Ordinal: both equality (are two values the same?) and inequality (is one value less than another?) are meaningful, e.g., domain(Education) = {High School, BS, MS, PhD}
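
One possible way to make the four attribute types concrete is sketched below in Python. The helper names and the example ages and temperatures are assumptions made for illustration; only the Sex and Education domains come from the slide.

```python
# Ordinal attribute: the domain carries an order, stored here as a ranked list
# (lowest to highest).
EDUCATION = ["High School", "BS", "MS", "PhD"]


def education_less_than(a, b):
    """Ordinal comparison: is level a below level b?"""
    return EDUCATION.index(a) < EDUCATION.index(b)


# Nominal attribute: only equality is meaningful, so a set is enough.
SEX = {"M", "F"}


def same_value(a, b):
    """Nominal comparison: are the two symbols identical?"""
    return a == b


# Ratio-scaled attribute (illustrative ages): differences and ratios both make sense.
age_a, age_b = 20, 10
age_difference = age_a - age_b   # "10 years older"
age_ratio = age_a / age_b        # "twice as old"

# Interval-scaled attribute (illustrative Celsius readings): the difference is
# meaningful, but 20 degrees is not "twice as hot" as 10 degrees.
temp_difference = 20.0 - 10.0
```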

Data: Algebraic and Geometric View

For a numeric data matrix D, each row or point is a d-dimensional column vector:

$$
x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix}
    = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{pmatrix}^T \in \mathbb{R}^d
$$

whereas each column or attribute is an n-dimensional column vector:

$$
X_j = \begin{pmatrix} x_{1j} & x_{2j} & \cdots & x_{nj} \end{pmatrix}^T \in \mathbb{R}^n
$$

Figure: Projections of $x_1 = (5.9, 3.0, 4.2, 1.5)^T$ in 2D and 3D. (a) $\mathbb{R}^2$: $x_1 = (5.9, 3.0)^T$ on axes X_1, X_2. (b) $\mathbb{R}^3$: $x_1 = (5.9, 3.0, 4.2)^T$ on axes X_1, X_2, X_3.
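
The 2D and 3D views in the figure amount to keeping only the first two or three coordinates of the point. A minimal Python sketch, where the helper name `project_onto_first` is an illustrative assumption:

```python
x1 = (5.9, 3.0, 4.2, 1.5)   # first Iris point, a vector in R^4


def project_onto_first(x, k):
    """Keep the first k coordinates, i.e. project onto attributes X_1..X_k."""
    return x[:k]


x1_2d = project_onto_first(x1, 2)   # the point plotted in panel (a)
x1_3d = project_onto_first(x1, 3)   # the point plotted in panel (b)
```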

Scatterplot: 2D Iris Dataset

Sepal length versus sepal width. Visualizing the Iris dataset as points/vectors in 2D. The solid circle shows the mean point.

[Figure: scatterplot with horizontal axis X_1: sepal length and vertical axis X_2: sepal width]
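
Numeric Data Matrix

If all attributes are numeric, then the data matrix D is an n × d matrix, or equivalently a set of n row vectors $x_i^T \in \mathbb{R}^d$ or a set of d column vectors $X_j \in \mathbb{R}^n$:

$$
D = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
= \begin{pmatrix}
\text{---}\; x_1^T \;\text{---} \\
\text{---}\; x_2^T \;\text{---} \\
\vdots \\
\text{---}\; x_n^T \;\text{---}
\end{pmatrix}
= \begin{pmatrix}
| & | & & | \\
X_1 & X_2 & \cdots & X_d \\
| & | & & |
\end{pmatrix}
$$

The mean of the data matrix D is the average of all the points:

$$
\operatorname{mean}(D) = \mu = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

The centered data matrix is obtained by subtracting the mean from all the points:

$$
Z = D - \mathbf{1} \cdot \mu^T
= \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}
- \begin{pmatrix} \mu^T \\ \mu^T \\ \vdots \\ \mu^T \end{pmatrix}
= \begin{pmatrix} x_1^T - \mu^T \\ x_2^T - \mu^T \\ \vdots \\ x_n^T - \mu^T \end{pmatrix}
= \begin{pmatrix} z_1^T \\ z_2^T \\ \vdots \\ z_n^T \end{pmatrix}
\tag{1}
$$

where $z_i = x_i - \mu$ is a centered point, and $\mathbf{1} \in \mathbb{R}^n$ is the vector of ones.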
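
The mean and the centering step can be sketched in a few lines of Python. This is a minimal illustration, assuming the sepal length and sepal width of the first four Iris points as input; the function names `mean` and `center` are illustrative.

```python
# Sepal length and sepal width of Iris points x_1..x_4.
D = [
    [5.9, 3.0],
    [6.9, 3.1],
    [6.6, 2.9],
    [4.6, 3.2],
]


def mean(D):
    """Average of all points: mu = (1/n) * sum of the rows x_i."""
    n = len(D)
    d = len(D[0])
    return [sum(x[j] for x in D) / n for j in range(d)]


def center(D):
    """Centered data matrix Z, whose rows are z_i = x_i - mu."""
    mu = mean(D)
    return [[x_ij - mu_j for x_ij, mu_j in zip(x, mu)] for x in D]


mu = mean(D)    # approximately (6.0, 3.05) for these four points
Z = center(D)   # each column of Z averages to (approximately) zero
```

After centering, the mean of Z is the zero vector up to floating-point rounding, which is a quick sanity check on any implementation.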

Norm, Distance and Angle

Given two points $a, b \in \mathbb{R}^m$, their dot product is defined as the scalar

$$
a^T b = a_1 b_1 + a_2 b_2 + \cdots + a_m b_m = \sum_{i=1}^{m} a_i b_i
$$

The Euclidean norm or length of a vector a is defined as

$$
\|a\| = \sqrt{a^T a} = \sqrt{\sum_{i=1}^{m} a_i^2}
$$

The unit vector in the direction of a is $u = \dfrac{a}{\|a\|}$, with $\|u\| = 1$.

The distance between a and b is given as

$$
\|a - b\| = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2}
$$

The angle between a and b is given as

$$
\cos\theta = \frac{a^T b}{\|a\| \, \|b\|} = \left( \frac{a}{\|a\|} \right)^T \left( \frac{b}{\|b\|} \right)
$$

[Figure: the points (5, 3) and (1, 4) on axes X_1, X_2, showing the vectors a and b, the difference a − b, and the angle θ between them]
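
The definitions above translate directly into code. The following Python sketch uses the two points from the figure, taking a = (5, 3) and b = (1, 4); which point carries which label is an assumption, and the function names are illustrative.

```python
import math


def dot(a, b):
    """Dot product a^T b."""
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    """Euclidean norm, the square root of a^T a."""
    return math.sqrt(dot(a, a))


def distance(a, b):
    """Euclidean distance, the norm of a - b."""
    return norm([ai - bi for ai, bi in zip(a, b)])


def cos_angle(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (norm(a) * norm(b))


def unit(a):
    """Unit vector in the direction of a."""
    length = norm(a)
    return [ai / length for ai in a]


a = (5, 3)
b = (1, 4)

ab = dot(a, b)                                      # 5*1 + 3*4 = 17
dist = distance(a, b)                               # sqrt(17)
theta = math.degrees(math.acos(cos_angle(a, b)))    # about 45 degrees
u = unit(a)                                         # has length 1 up to rounding
```

For these two points the cosine works out to 17 / (√34 · √17) = 1/√2, so the angle is 45°.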

Orthogonal Projection

Two vectors a and b are orthogonal iff $a^T b = 0$, i.e., the angle between them is 90°.

The orthogonal projection of b on a comprises the vector $p = b_{\parallel}$ parallel to a, and $r = b_{\perp}$ perpendicular or orthogonal to a, given as

$$
b = b_{\parallel} + b_{\perp} = p + r
$$

where

$$
p = b_{\parallel} = \left( \frac{a^T b}{a^T a} \right) a
$$

[Figure: vectors a and b on axes X_1, X_2, with the parallel component $p = b_{\parallel}$ along a and the perpendicular component $r = b_{\perp}$]
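
A minimal Python sketch of the decomposition b = p + r follows. It assumes a = (5, 3) and b = (1, 4) as example vectors, and obtains r as b − p, which follows from b = p + r; the function name `project` is illustrative.

```python
def dot(a, b):
    """Dot product a^T b."""
    return sum(ai * bi for ai, bi in zip(a, b))


def project(b, a):
    """Split b into p (parallel to a) and r (orthogonal to a)."""
    scale = dot(a, b) / dot(a, a)            # scalar (a^T b) / (a^T a)
    p = [scale * ai for ai in a]             # p = scale * a
    r = [bi - pi for bi, pi in zip(b, p)]    # r = b - p
    return p, r


a = (5, 3)
b = (1, 4)
p, r = project(b, a)

# Sanity checks: r is orthogonal to a, and p + r recovers b.
orthogonality = dot(a, r)
recovered = [pi + ri for pi, ri in zip(p, r)]
```

For these vectors the scale factor is 17/34 = 0.5, so p = (2.5, 1.5) and r = (−1.5, 2.5), and a^T r = −7.5 + 7.5 = 0 as required.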

Projection of Centered Iris Data Onto a Line ℓ

[Figure: centered Iris points and their projections onto the line ℓ, on axes X_1 and X_2]

Data: Probabilistic View

A random variable X is a function $X : \mathcal{O} \to \mathbb{R}$, where $\mathcal{O}$ is the set of all possible outcomes of the experiment, also called the sample space.

A discrete random variable takes on only a finite or countably infinite number of values, whereas a continuous random variable can take on any value in $\mathbb{R}$.

By default, a numeric attribute X_j is considered as the identity random variable given as $X(v) = v$ for all $v \in \mathcal{O}$. Here $\mathcal{O} = \mathbb{R}$.

Discrete Variable: Long Sepal Length

Define the random variable A, denoting long sepal length (7 cm or more), as follows:

$$
A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \ge 7 \end{cases}
$$

The sample space of A is $\mathcal{O} = [4.3, 7.9]$, and its range is $\{0, 1\}$. Thus, A is discrete.
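
The random variable A is simply a function from sepal lengths to {0, 1}. A short Python sketch, applied to the ten sepal lengths that appear in the Iris extract earlier in the chapter; the variable names are illustrative.

```python
def A(v):
    """Long sepal length indicator: 1 if v is 7 cm or more, else 0."""
    return 1 if v >= 7 else 0


# Sepal lengths (cm) of the ten rows shown in the Iris extract.
sepal_lengths = [5.9, 6.9, 6.6, 4.6, 6.0, 4.7, 6.5, 5.8, 7.7, 5.1]

values = [A(v) for v in sepal_lengths]
observed_range = set(values)   # a subset of {0, 1}, so A is discrete
```

Only the 7.7 cm flower maps to 1 in this small extract; every other value maps to 0.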