Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (1)    Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 3: Categorical Attributes

Zaki & Meira Jr. (RPI and UFMG), Data Mining and Machine Learning, Chapter 3: Categorical Attributes, 26 slides
Univariate Analysis: Bernoulli Variable

Consider a single categorical attribute, X, with domain dom(X) = {a_1, a_2, ..., a_m} comprising m symbolic values. The data D is an n x 1 symbolic data matrix

    D = (x_1, x_2, \ldots, x_n)^T

where each point x_i \in dom(X).

Bernoulli Variable: the special case when m = 2. Map the two symbols to binary values:

    X(v) = \begin{cases} 1 & \text{if } v = a_1 \\ 0 & \text{if } v = a_2 \end{cases}

so that dom(X) = {0, 1}.
Bernoulli Variable: Mean and Variance

Assume that each symbolic point has been mapped to its binary value, so that {x_1, x_2, ..., x_n} is a random sample drawn from X.

The probability mass function (PMF) of X is

    P(X = x) = f(x) = p^x (1 - p)^{1 - x}

The expected value of X is

    \mu = E[X] = 1 \cdot p + 0 \cdot (1 - p) = p

and the variance of X is

    \sigma^2 = \mathrm{var}(X) = p(1 - p)

The sample mean is

    \hat\mu = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{n_1}{n} = \hat p

where n_i is the number of points with x_j = i in the random sample (equal to the number of occurrences of symbol a_i). The sample variance is

    \hat\sigma^2 = \hat p (1 - \hat p)
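As a minimal sketch of these estimators, the sample values below are purely illustrative:

```python
# Estimating the Bernoulli parameter from a binary sample (made-up data).
sample = [1, 0, 1, 1, 0, 1, 0, 1]  # each x_i in {0, 1}

n = len(sample)
n1 = sum(sample)               # number of occurrences of symbol a_1
p_hat = n1 / n                 # sample mean: mu_hat = n_1 / n
var_hat = p_hat * (1 - p_hat)  # sample variance: p_hat (1 - p_hat)
```

Here the sample mean and the estimate of p coincide, exactly as the formula above states.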
Binomial Distribution: Number of Occurrences

Given the Bernoulli variable X, let {x_1, x_2, ..., x_n} be a random sample of size n. Let N be the random variable denoting the number of occurrences of the symbol a_1 (value X = 1). N has a binomial distribution, given as

    f(N = n_1 \mid n, p) = \binom{n}{n_1} p^{n_1} (1 - p)^{n - n_1}

N is the sum of the n independent Bernoulli random variables x_i, IID with X, that is, N = \sum_{i=1}^{n} x_i.

The mean or expected number of occurrences of a_1 is

    \mu_N = E[N] = E\Big[\sum_{i=1}^{n} x_i\Big] = \sum_{i=1}^{n} E[x_i] = \sum_{i=1}^{n} p = np

The variance of N is

    \sigma_N^2 = \mathrm{var}(N) = \sum_{i=1}^{n} \mathrm{var}(x_i) = \sum_{i=1}^{n} p(1 - p) = np(1 - p)
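A small sketch of the binomial PMF and its moments, using illustrative values n = 10 and p = 0.3:

```python
from math import comb

def binomial_pmf(n1, n, p):
    """P(N = n_1) for N ~ Binomial(n, p): C(n, n_1) p^n1 (1-p)^(n-n1)."""
    return comb(n, n1) * p**n1 * (1 - p)**(n - n1)

n, p = 10, 0.3
mean_N = n * p           # E[N] = np
var_N = n * p * (1 - p)  # var(N) = np(1 - p)

# Sanity check: the PMF sums to 1 over n_1 = 0, ..., n.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
```

The closed-form mean np and variance np(1-p) avoid any summation over the sample.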
Multivariate Bernoulli Variable

For the general case when dom(X) = {a_1, a_2, ..., a_m}, we model X as an m-dimensional or multivariate Bernoulli random variable X = (A_1, A_2, ..., A_m)^T, where each A_i is a Bernoulli variable with parameter p_i denoting the probability of observing symbol a_i. However, X can assume only one of the symbolic values at any one time. Thus,

    X(v) = e_i \quad \text{if } v = a_i

where e_i is the i-th standard basis vector in m dimensions. The range of X consists of m distinct vector values {e_1, e_2, ..., e_m}.

The PMF of X is

    P(X = e_i) = f(e_i) = p_i = \prod_{j=1}^{m} p_j^{e_{ij}}

with \sum_{i=1}^{m} p_i = 1.
Multivariate Bernoulli: Mean

The mean or expected value of X can be obtained as

    \mu = E[X] = \sum_{i=1}^{m} e_i f(e_i) = \sum_{i=1}^{m} e_i p_i
        = p_1 (1, 0, \ldots, 0)^T + \cdots + p_m (0, \ldots, 0, 1)^T
        = (p_1, p_2, \ldots, p_m)^T = p

The sample mean is

    \hat\mu = \frac{1}{n} \sum_{i=1}^{n} x_i = \sum_{i=1}^{m} \frac{n_i}{n} e_i
            = (n_1/n, n_2/n, \ldots, n_m/n)^T
            = (\hat p_1, \hat p_2, \ldots, \hat p_m)^T = \hat p

where n_i is the number of occurrences of the vector value e_i in the sample, i.e., the number of occurrences of the symbol a_i. Furthermore, \sum_{i=1}^{m} n_i = n.
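A minimal sketch of the sample mean as a vector of relative symbol frequencies; the symbol names and the five-point sample are made up for illustration:

```python
# Sample mean of a multivariate Bernoulli variable: count each symbol's
# occurrences n_i and divide by the sample size n.
symbols = ["a1", "a2", "a3"]                 # dom(X), m = 3
data = ["a2", "a1", "a2", "a3", "a2"]        # illustrative sample, n = 5

n = len(data)
counts = [data.count(s) for s in symbols]    # n_i per symbol
p_hat = [c / n for c in counts]              # mu_hat = (n_1/n, ..., n_m/n)^T
```

Summing over the one-hot vectors e_i and dividing by n gives exactly these relative frequencies, so no explicit one-hot encoding is needed to compute the mean.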
Multivariate Bernoulli Variable: sepal length

We model sepal length as a multivariate Bernoulli variable X:

    X(v) = \begin{cases}
        e_1 = (1, 0, 0, 0)^T & \text{if } v = a_1 \\
        e_2 = (0, 1, 0, 0)^T & \text{if } v = a_2 \\
        e_3 = (0, 0, 1, 0)^T & \text{if } v = a_3 \\
        e_4 = (0, 0, 0, 1)^T & \text{if } v = a_4
    \end{cases}

    Bins          Domain             Counts
    [4.3, 5.2]    Very Short (a_1)   n_1 = 45
    (5.2, 6.1]    Short (a_2)        n_2 = 50
    (6.1, 7.0]    Long (a_3)         n_3 = 43
    (7.0, 7.9]    Very Long (a_4)    n_4 = 12

For example, the symbolic point x_1 = Short = a_2 is represented as the vector (0, 1, 0, 0)^T = e_2.

The total sample size is n = 150; the estimates \hat p_i are:

    \hat p_1 = 45/150 = 0.3
    \hat p_2 = 50/150 = 0.333
    \hat p_3 = 43/150 = 0.287
    \hat p_4 = 12/150 = 0.08

[Figure: bar plot of the PMF f(x) over e_1, ..., e_4 (Very Short, Short, Long, Very Long) with heights 0.3, 0.333, 0.287, 0.08.]
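The estimates on this slide follow directly from the bin counts; a quick sketch of the computation:

```python
# PMF estimates for the binned sepal length attribute (counts from the slide).
counts = {"Very Short": 45, "Short": 50, "Long": 43, "Very Long": 12}

n = sum(counts.values())                    # total sample size
p_hat = {k: v / n for k, v in counts.items()}  # p_hat_i = n_i / n
```

Rounding p_hat to three decimals reproduces the values 0.3, 0.333, 0.287, and 0.08 shown above.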
Multivariate Bernoulli Variable: Covariance Matrix

We have X = (A_1, A_2, ..., A_m)^T, where A_i is the Bernoulli variable corresponding to symbol a_i. The variance for each Bernoulli variable A_i is

    \sigma_i^2 = \mathrm{var}(A_i) = p_i (1 - p_i)

The covariance between A_i and A_j is

    \sigma_{ij} = E[A_i A_j] - E[A_i] \cdot E[A_j] = 0 - p_i p_j = -p_i p_j

a negative relationship, since A_i and A_j cannot both be 1 at the same time. The covariance matrix for X is given as

    \Sigma = \begin{pmatrix}
        \sigma_1^2   & \sigma_{12} & \cdots & \sigma_{1m} \\
        \sigma_{12}  & \sigma_2^2  & \cdots & \sigma_{2m} \\
        \vdots       & \vdots      & \ddots & \vdots      \\
        \sigma_{1m}  & \sigma_{2m} & \cdots & \sigma_m^2
    \end{pmatrix}
    = \begin{pmatrix}
        p_1(1 - p_1) & -p_1 p_2     & \cdots & -p_1 p_m \\
        -p_1 p_2     & p_2(1 - p_2) & \cdots & -p_2 p_m \\
        \vdots       & \vdots       & \ddots & \vdots   \\
        -p_1 p_m     & -p_2 p_m     & \cdots & p_m(1 - p_m)
    \end{pmatrix}

More compactly,

    \Sigma = \mathrm{diag}(p) - p \, p^T

where \mu = p = (p_1, \ldots, p_m)^T.
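The compact form diag(p) - p p^T is a one-liner in numpy. A sketch using the sepal length estimates from the previous slide as an illustrative probability vector:

```python
import numpy as np

# Covariance matrix of a multivariate Bernoulli variable:
# Sigma = diag(p) - p p^T.
p = np.array([0.3, 0.333, 0.287, 0.08])  # must sum to 1
Sigma = np.diag(p) - np.outer(p, p)
```

Note that each row of Sigma sums to p_i (1 - sum_j p_j) = 0, reflecting the constraint that the components of X always sum to 1.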
Categorical, Mapped Binary and Centered Dataset

Modeling X as a multivariate Bernoulli variable is equivalent to treating the mapped points X(x_i) as a new n x m binary data matrix:

         X        A_1  A_2      Z_1    Z_2
    x_1  Short     0    1      -0.4    0.4
    x_2  Short     0    1      -0.4    0.4
    x_3  Long      1    0       0.6   -0.6
    x_4  Short     0    1      -0.4    0.4
    x_5  Long      1    0       0.6   -0.6

Here X is the multivariate Bernoulli variable

    X(v) = \begin{cases}
        e_1 = (1, 0)^T & \text{if } v = \text{Long } (a_1) \\
        e_2 = (0, 1)^T & \text{if } v = \text{Short } (a_2)
    \end{cases}

The sample mean and covariance matrix are

    \hat\mu = \hat p = (2/5, 3/5)^T = (0.4, 0.6)^T

    \hat\Sigma = \mathrm{diag}(\hat p) - \hat p \hat p^T
               = \begin{pmatrix} 0.24 & -0.24 \\ -0.24 & 0.24 \end{pmatrix}

From the centered data Z = (Z_1, Z_2), we also have

    \hat\Sigma = \frac{1}{5} Z^T Z = \begin{pmatrix} 0.24 & -0.24 \\ -0.24 & 0.24 \end{pmatrix}
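A sketch reproducing this five-point example, checking that the centered-data form (1/n) Z^T Z agrees with diag(p_hat) - p_hat p_hat^T:

```python
import numpy as np

# Binary data matrix for the example: Long -> e_1 = (1,0), Short -> e_2 = (0,1).
A = np.array([[0, 1],   # x_1 = Short
              [0, 1],   # x_2 = Short
              [1, 0],   # x_3 = Long
              [0, 1],   # x_4 = Short
              [1, 0]],  # x_5 = Long
             dtype=float)

p_hat = A.mean(axis=0)          # sample mean (0.4, 0.6)
Z = A - p_hat                   # centered data matrix
Sigma = Z.T @ Z / len(A)        # (1/n) Z^T Z
Sigma_compact = np.diag(p_hat) - np.outer(p_hat, p_hat)
```

Both routes yield the same 2 x 2 matrix with 0.24 on the diagonal and -0.24 off it.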
Multinomial Distribution: Number of Occurrences

Let {x_1, x_2, ..., x_n} be a random sample from X. Let N_i be the random variable denoting the number of occurrences of symbol a_i in the sample, and let N = (N_1, N_2, ..., N_m)^T. N has a multinomial distribution, given as

    f\big(N = (n_1, n_2, \ldots, n_m) \mid p\big)
        = \binom{n}{n_1\, n_2\, \cdots\, n_m} \prod_{i=1}^{m} p_i^{n_i}

The mean and covariance matrix of N are

    \mu_N = E[N] = n E[X] = n \mu = n p = (np_1, \ldots, np_m)^T

    \Sigma_N = n \big(\mathrm{diag}(p) - p p^T\big)
        = \begin{pmatrix}
            np_1(1 - p_1) & -np_1 p_2     & \cdots & -np_1 p_m \\
            -np_1 p_2     & np_2(1 - p_2) & \cdots & -np_2 p_m \\
            \vdots        & \vdots        & \ddots & \vdots    \\
            -np_1 p_m     & -np_2 p_m     & \cdots & np_m(1 - p_m)
        \end{pmatrix}

The sample mean and covariance matrix for N are

    \hat\mu_N = n \hat p \qquad
    \hat\Sigma_N = n \big(\mathrm{diag}(\hat p) - \hat p \hat p^T\big)
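The multinomial moments simply scale the multivariate Bernoulli ones by n. A sketch with illustrative values n = 100 and p = (0.2, 0.5, 0.3):

```python
import numpy as np

# Mean vector and covariance matrix of the multinomial count vector N.
n = 100
p = np.array([0.2, 0.5, 0.3])

mu_N = n * p                                   # E[N] = n p
Sigma_N = n * (np.diag(p) - np.outer(p, p))    # Sigma_N = n (diag(p) - p p^T)
```

Diagonal entries are n p_i (1 - p_i); off-diagonal entries are -n p_i p_j, always negative since the counts compete for the same n draws.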
Bivariate Analysis

Assume the data comprises two categorical attributes, X_1 and X_2, with

    dom(X_1) = {a_{11}, a_{12}, \ldots, a_{1 m_1}}
    dom(X_2) = {a_{21}, a_{22}, \ldots, a_{2 m_2}}

We model X_1 and X_2 as multivariate Bernoulli variables X_1 and X_2 with dimensions m_1 and m_2, respectively. The joint distribution of X_1 and X_2 is modeled as the (m_1 + m_2)-dimensional vector variable

    X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad
    X\big((v_1, v_2)^T\big) = \begin{pmatrix} X_1(v_1) \\ X_2(v_2) \end{pmatrix}
        = \begin{pmatrix} e_{1i} \\ e_{2j} \end{pmatrix}

provided that v_1 = a_{1i} and v_2 = a_{2j}. The joint PMF for X is given as the m_1 x m_2 matrix

    P_{12} = \begin{pmatrix}
        p_{11}    & p_{12}    & \cdots & p_{1 m_2} \\
        p_{21}    & p_{22}    & \cdots & p_{2 m_2} \\
        \vdots    & \vdots    & \ddots & \vdots    \\
        p_{m_1 1} & p_{m_1 2} & \cdots & p_{m_1 m_2}
    \end{pmatrix}
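A sketch of estimating the joint PMF matrix P_12 from paired observations; the domains and the four data pairs below are made up for illustration:

```python
import numpy as np

# Empirical joint PMF: p_ij = n_ij / n, where n_ij counts pairs (a_1i, a_2j).
dom1 = ["a11", "a12"]                    # dom(X_1), m_1 = 2
dom2 = ["a21", "a22", "a23"]             # dom(X_2), m_2 = 3
pairs = [("a11", "a21"), ("a11", "a23"),
         ("a12", "a22"), ("a12", "a22")]

P12 = np.zeros((len(dom1), len(dom2)))
for v1, v2 in pairs:
    P12[dom1.index(v1), dom2.index(v2)] += 1
P12 /= len(pairs)                        # normalize so entries sum to 1
```

Row and column sums of P12 recover the marginal PMF estimates for X_1 and X_2, respectively.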