Data Mining
Machine Intelligence
Thomas D. Nielsen
September 2008
What is Data Mining?
What is Data Mining? Data Mining in practice

[Workflow figure: real-life data is preprocessed to fit an off-the-shelf algorithm, the algorithm is adapted to the data, and the result is evaluated and iterated. General algorithmic methods meet data/domain-specific operations.]
What is Data Mining? An overview

- Supervised learning (labeled data): classification, predictive modeling
- Unsupervised learning (unlabeled data): clustering, descriptive modeling, rule mining / association analysis
Classification: A high-level view

Example 1 (spam filtering): a classifier maps the attributes SubAllCap, TrustSend, InvRet, Body'adult', ..., Body'zambia' (each yes/no) to the class Spam (yes/no).

Example 2 (character recognition): a classifier maps the attributes Cell-1, Cell-2, Cell-3, ..., Cell-324 (each with values 1..64) to the class Symbol (A..Z, 0..9).
Classification: Labeled Data

Instances (cases, examples) are described by attributes (features, predictor variables) and a class variable (target variable).

Spam data:

  SubAllCap  TrustSend  InvRet  ...  B'zambia'  |  Spam
  y          n          n       ...  n          |  y
  n          n          n       ...  n          |  n
  n          y          n       ...  n          |  y
  n          n          n       ...  n          |  n
  ...

Symbol data:

  Cell-1  Cell-2  Cell-3  ...  Cell-324  |  Symbol
  1       1       4       ...  12        |  B
  1       1       1       ...  3         |  1
  34      37      43      ...  22        |  Z
  1       1       1       ...  7         |  0
  ...

(In principle, any attribute can become the designated class variable.)
Classification: Classification in general

- Attributes: variables A1, A2, ..., An (discrete or continuous).
- Class variable: variable C. Always discrete: states(C) = {c1, ..., cl} (the set of class labels).
- A classifier (for complete data) is a mapping C: states(A1, ..., An) → states(C).
- A classifier able to handle incomplete data provides mappings C: states(A_i1, ..., A_ik) → states(C) for subsets {A_i1, ..., A_ik} of {A1, ..., An}.
- A classifier partitions the attribute-value space (also: instance space) into subsets labelled with class labels.
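To make the mapping view concrete, a minimal sketch in Python (the attributes mirror the spam example above, but the rule itself is made up for illustration):

```python
# A classifier is a mapping from attribute configurations to class labels.
# Here: a toy rule-based spam classifier over three yes/no attributes;
# the decision rule is hypothetical, not taken from the slides.

def spam_classifier(sub_all_cap: str, trust_send: str, inv_ret: str) -> str:
    """Map a point in the attribute space {y,n}^3 to states(Spam) = {y, n}."""
    if trust_send == "y":          # mail from a trusted sender: never spam
        return "n"
    if sub_all_cap == "y" or inv_ret == "y":
        return "y"
    return "n"

print(spam_classifier("y", "n", "n"))  # -> 'y'
```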
Classification: Iris dataset

Measurements of petal width/length (PW, PL) and sepal width/length (SW, SL) for 150 flowers of 3 different species of Iris.

First reported in: Fisher, R.A., "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7 (1936).

Attributes: SL, SW, PL, PW. Class variable: Species.

  SL   SW   PL   PW   |  Species
  5.1  3.5  1.4  0.2  |  Setosa
  4.9  3.0  1.4  0.2  |  Setosa
  6.3  2.9  6.0  2.1  |  Virginica
  6.3  2.5  4.9  1.5  |  Versicolor
  ...
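The dataset ships with several libraries; a quick way to inspect it (assuming scikit-learn is available, which the slides themselves do not use):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target          # 150 x 4 attribute matrix, 150 class labels
print(iris.feature_names)              # sepal/petal length and width (cm)
print(iris.target_names)               # ['setosa' 'versicolor' 'virginica']
print(X[0], iris.target_names[y[0]])   # first instance: [5.1 3.5 1.4 0.2] setosa
```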
Classification: Labeled data in instance space

[Figure: the Iris instances plotted in instance space, labeled by species (Setosa, Versicolor, Virginica), together with the partition defined by a classifier.]
Classification: Decision Regions

- Axis-parallel linear: e.g. decision trees
- Piecewise linear: e.g. naive Bayes
- Nonlinear: e.g. neural networks
Classification: Classifiers differ in ...

- Model space: types of partitions and their representation.
- How they compute the class label corresponding to a point in instance space (the actual classification task).
- How they are learned from data.

Some important types of classifiers:
- Decision trees
- Naive Bayes classifier
- Other probabilistic classifiers (TAN, ...)
- Neural networks
- K-nearest neighbors
Decision Trees: Example

Attributes: height ∈ [0, 2.5], sex ∈ {m, f}. Class labels: {tall, short}.

[Figure: partition of the instance space and its representation by a decision tree. The root tests sex; under f, height < 1.7 gives short and height ≥ 1.7 gives tall; under m, height < 1.8 gives short and height ≥ 1.8 gives tall.]
Decision Trees

A decision tree is a tree
- whose internal nodes are labeled with attributes,
- whose leaves are labeled with class labels,
- whose edges going out from a node labeled with attribute A are labeled with subsets of states(A), such that all labels combined form a partition of states(A).

Possible partitions:
- states(A) = R:  ]−∞, 2.3[, [2.3, ∞[   or   ]−∞, 1.9[, [1.9, 3.5[, [3.5, ∞[
- states(A) = {a, b, c}:  {a}, {b}, {c}   or   {a, b}, {c}
Decision Trees: Decision tree classification

Each point in the instance space is sorted into a leaf by the decision tree, and is classified according to the class label at that leaf.

[Figure: the tree from the example; the root tests s, with edges f and m; under f, h < 1.7 → short and h ≥ 1.7 → tall; under m, h < 1.8 → short and h ≥ 1.8 → tall.]

Example: C([m, 1.85]) = tall.
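A minimal sketch of this particular tree in Python, with the structure hard-coded to mirror the figure (nothing more than the example above):

```python
# Decision tree for the height/sex example, encoded by hand.
# Internal nodes test one attribute; leaves carry a class label.

def classify(sex: str, height: float) -> str:
    """Sort the instance (sex, height) into a leaf and return its label."""
    if sex == "f":
        return "tall" if height >= 1.7 else "short"
    else:  # sex == "m"
        return "tall" if height >= 1.8 else "short"

print(classify("m", 1.85))  # -> 'tall', matching C([m, 1.85]) = tall on the slide
```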
Decision Trees: Learning a decision tree

In general, we look for a small decision tree with minimal classification error over the data set ⟨(a1, c1), (a2, c2), ..., (an, cn)⟩.

[Figure: two decision trees over attributes A and B with class labels c1, c2 that represent the same classification: a larger "bad" tree and a smaller "good" tree.]

Note: if the data is noise-free, i.e. there are no instances (ai, ci), (aj, cj) with ai = aj and ci ≠ cj, then there always exists a decision tree with zero classification error.
Decision Trees: The ID3 algorithm

[Figure: a partially constructed tree; the root A has edge t leading to the leaf 'yes' and edge f leading to an open node X, which is then expanded with attribute B.]

Top-down construction of the decision tree. For an "open" node X:
- Let D(X) be the instances that can reach X.
- If all instances in D(X) agree on the class c, then label X with c and make it a leaf.
- Otherwise, find the best attribute A and partition of states(A), replace X with A, and make an outgoing edge from A for each member of the partition.
Decision Trees: Notes

- The exact algorithm is formulated as a recursive procedure.
- One can modify the algorithm by providing weaker conditions for termination (necessary for noisy data):
  - If <some other termination condition applies>, turn X into a leaf with <most appropriate class label>.
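A compact recursive sketch of this construction, assuming discrete attributes only and one branch per attribute value; the attribute choice uses the expected-entropy score described on the next slides. The data format (list of attribute-dict/label pairs) is an assumption made for this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the class-label distribution in a list of labels."""
    n = len(labels)
    return -sum(k / n * log2(k / n) for k in Counter(labels).values())

def expected_entropy(data, attr):
    """Expected entropy after splitting data on attr (one branch per value)."""
    n = len(data)
    total = 0.0
    for value in {inst[attr] for inst, _ in data}:
        branch = [label for inst, label in data if inst[attr] == value]
        total += len(branch) / n * entropy(branch)
    return total

def id3(data, attributes):
    """data: list of (attribute-dict, class-label) pairs.
    Returns ('leaf', label) or ('node', attr, {value: subtree})."""
    labels = [label for _, label in data]
    if len(set(labels)) == 1:                       # all instances agree on the class
        return ("leaf", labels[0])
    if not attributes:                              # weaker termination (noisy data)
        return ("leaf", Counter(labels).most_common(1)[0][0])
    best = min(attributes, key=lambda a: expected_entropy(data, a))
    children = {}
    for value in {inst[best] for inst, _ in data}:
        subset = [(inst, label) for inst, label in data if inst[best] == value]
        children[value] = id3(subset, [a for a in attributes if a != best])
    return ("node", best, children)

# Tiny usage example with made-up data:
data = [({"A": "t", "B": "t"}, "yes"), ({"A": "t", "B": "f"}, "yes"),
        ({"A": "f", "B": "t"}, "no"),  ({"A": "f", "B": "f"}, "yes")]
print(id3(data, ["A", "B"]))
```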
Decision Trees: Scoring new partitions

[Figure: an open node X with instance set D(X); replacing X with attribute A and the partition a1, a2, a3 of states(A) creates children X1, X2, X3 with instance sets D(X1), D(X2), D(X3).]

For each candidate attribute A with partition a1, a2, a3 of states(A):

- Let p_i(c) be the relative frequency of class label c in D(X_i).
- Measure for the uniformity of the class-label distribution in D(X_i) (entropy):

    H_{X_i} := − Σ_{c ∈ states(C)} p_i(c) log2( p_i(c) )

- Score of the new partition (negative expected entropy):

    Score(A, a1, a2, a3) := − Σ_{i=1..3} ( |D(X_i)| / |D(X)| ) · H_{X_i}
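As a sanity check, the two formulas computed for a hypothetical split of 10 instances into three branches (the class counts per branch are invented for illustration):

```python
from math import log2

def H(counts):
    """Entropy of a class-label distribution given absolute counts per label."""
    n = sum(counts)
    return -sum(k / n * log2(k / n) for k in counts if k > 0)

def score(branches):
    """Score(A, a1, ..., ak) = - expected entropy over the branches.
    branches: one class-count list per branch, e.g. [[4, 1], [0, 3], [1, 1]]."""
    n = sum(sum(b) for b in branches)
    return -sum(sum(b) / n * H(b) for b in branches)

# Hypothetical D(X) of 10 instances (2 class labels) split into D(X1), D(X2), D(X3):
branches = [[4, 1], [0, 3], [1, 1]]
print(score(branches))   # ~ -0.56; scores closer to 0 mean purer branches
```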
Decision Trees: Searching for partitions

When trying attribute A, look for the partition of states(A) with the highest score. In practice:
- We can try all choices for A.
- We cannot try all partitions of states(A). Therefore:
  - For states(A) = R: only consider partitions of the form ]−∞, r[, [r, ∞[, and pick the one with minimal expected entropy.

    Example:
      A: 1 3 4 6 10 12 17 18 22 25
      C: y y y n n  y  y  y  n  n

  - For states(A) = {a1, ..., ak}: only consider the partition {a1}, ..., {ak}.
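A sketch of the threshold search on the example above (candidate thresholds are placed midway between consecutive distinct values; the midpoint choice is an assumption, any point between the values works):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    """Return (r, expected_entropy) for the best split ]-inf, r[ , [r, inf[."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no threshold between equal values
        r = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v < r]
        right = [c for v, c in pairs if v >= r]
        exp_h = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or exp_h < best[1]:
            best = (r, exp_h)
    return best

# Example from the slide:
A = [1, 3, 4, 6, 10, 12, 17, 18, 22, 25]
C = ["y", "y", "y", "n", "n", "y", "y", "y", "n", "n"]
print(best_threshold(A, C))   # -> (20.0, ~0.65): split between 18 and 22 is best
```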
Decision Trees: Decision boundaries revisited

[Figure: decision boundaries in instance space produced by a learned decision tree.]
Decision Trees: Attributes with many values

The expected entropy measure favors attributes with many values: for example, an attribute Date (with the possible dates as states) will have a very low expected entropy but is unable to generalize!

One approach for avoiding this problem is to select attributes based on the gain ratio:

    GainRatio(D, A) = Score(D, A) / H_A

    H_A = − Σ_{a ∈ states(A)} p(a) log2( p(a) )

where p(a) is the relative frequency of A = a in D.
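A small sketch of the correction. Note that this implements the gain ratio in its standard C4.5 form (information gain divided by the split information H_A), which differs from the slide's Score-based numerator only by the constant entropy of D; the example counts are invented:

```python
from math import log2

def entropy(counts):
    """Entropy of a distribution given absolute counts."""
    n = sum(counts)
    return -sum(k / n * log2(k / n) for k in counts if k > 0)

def gain_ratio(branches):
    """branches: one class-count list per value of A, e.g. [[4, 1], [0, 3], [1, 1]].
    Standard form: information gain divided by H_A (entropy of p(a))."""
    n = sum(sum(b) for b in branches)
    class_totals = [sum(col) for col in zip(*branches)]       # class counts in D
    gain = entropy(class_totals) - sum(sum(b) / n * entropy(b) for b in branches)
    h_a = entropy([sum(b) for b in branches])                  # split information H_A
    return gain / h_a

# An informative 2-way split vs. a 10-way 'Date-like' split of the same 10 instances:
print(gain_ratio([[5, 0], [1, 4]]))              # ~ 0.61
print(gain_ratio([[1, 0]] * 6 + [[0, 1]] * 4))   # ~ 0.29: many-valued split is penalized
```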