course content

Course Content Week 5 (April 7) and Week 6 (April 14) Introduction - PDF document



  1. Lecture 4
     Week 5 (April 7) and Week 6 (April 14)
     33459-01 Principles of Knowledge Discovery in Data
     Classification: Neural Networks, Naïve Bayesian Classification, k-Nearest Neighbors, Decision Trees & Associative Classifiers
     Lecture by: Dr. Osmar R. Zaïane
     33459-01: Principles of Knowledge Discovery in Data – March-June, 2006 (Dr. O. Zaiane)

     Course Content
     • Introduction to Data Mining
     • Association analysis
     • Sequential Pattern Analysis
     • Classification and prediction
     • Contrast Sets
     • Data Clustering
     • Outlier Detection
     • Web Mining

     What is Classification?
     The goal of data classification is to organize and categorize data in distinct classes.
     • A model is first created based on the data distribution.
     • The model is then used to classify new data.
     • Given the model, a class can be predicted for new data.
     With classification, I can predict in which bucket to put the ball, but I can’t predict the weight of the ball.

     Classification = Learning a Model
     [Figure: a labeled training set is used to learn a classification model; the model then assigns new unlabeled data to one of the classes 1, 2, 3, 4, …, n. Labeling = Classification.]

     Classification is a three-step process
     1. Model construction (Learning):
     • Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
     • The set of all tuples used for construction of the model is called the training set.
     • The model is represented in one of the following forms:
       – Classification rules (IF-THEN statements)
       – Decision tree
       – Mathematical formulae

     Classification is a three-step process
     2. Model Evaluation (Accuracy):
     Estimate the accuracy rate of the model based on a test set.
       – The known label of each test sample is compared with the classified result from the model.
       – The accuracy rate is the percentage of test set samples that are correctly classified by the model.
       – The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected.
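The accuracy rate described in step 2 can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function name `accuracy_rate` and the sample labels are hypothetical.

```python
def accuracy_rate(known_labels, predicted_labels):
    """Percentage of test samples whose predicted label matches the known label."""
    if len(known_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not known_labels:
        raise ValueError("test set is empty")
    correct = sum(k == p for k, p in zip(known_labels, predicted_labels))
    return 100.0 * correct / len(known_labels)


# Hypothetical test set: known labels versus the model's output.
known = ["good", "bad", "good", "bad"]
predicted = ["good", "bad", "bad", "bad"]
print(accuracy_rate(known, predicted))  # 75.0 (3 of 4 correct)
```

The samples passed in should come from a test set that was not used to build the model, as the slide requires.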

  2. Classification is a three-step process
     3. Model Use (Classification):
     The model is used to classify unseen objects.
     • Give a class label to a new tuple
     • Predict the value of an actual attribute

     Classification with Holdout
     [Figure: the data is split into training data, used to derive the classifier (model), and testing data, used to estimate its accuracy.]
     • Holdout
     • Random sub-sampling
     • K-fold cross validation
     • Bootstrapping
     • …

     1. Classification Process (Learning)
     [Figure: training data is passed to the classification algorithms, which produce the classifier (model).]
     Training data:
       Name      Income   Age        Credit rating
       Bruce     Low      <30        bad
       Dave      Medium   [30..40]   good
       William   High     <30        good
       Marie     Medium   >40        good
       Anne      Low      [30..40]   good
       Chris     Medium   <30        bad
     Classifier (model):
       IF Income = ‘High’ OR Age > 30 THEN CreditRating = ‘Good’

     2. Classification Process (Accuracy Evaluation)
     [Figure: testing data is passed to the classifier (model).]
     Testing data:
       Name      Income   Age        Credit rating
       Tom       Medium   <30        bad
       Jane      High     <30        bad
       Wei       High     >40        good
       Hua       Medium   [30..40]   good
     Classifier (model):
       IF Income = ‘High’ OR Age > 30 THEN CreditRating = ‘Good’
     How accurate is the model?

     3. Classification Process (Classification)
     [Figure: new data is passed to the classifier (model).]
     New data:
       Name      Income   Age        Credit rating
       Paul      High     [30..40]   ?
     Credit Rating?

     Improving Accuracy
     [Figure: the data is used to build Classifier 1, Classifier 2, Classifier 3, …, Classifier n; their votes on new data are combined into a composite classifier.]
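The accuracy evaluation and classification steps above can be sketched by applying the slide's rule to a few records. This is a minimal sketch under stated assumptions: the four testing records are taken to be Tom, Jane, Wei and Hua; the age bands "[30..40]" and ">40" are taken to satisfy `Age > 30`; and records that do not match the rule are labeled "bad", which the rule itself does not state. The function name `credit_rating` is hypothetical.

```python
def credit_rating(income, age_band):
    """Slide rule: IF Income = 'High' OR Age > 30 THEN CreditRating = 'Good'."""
    # Assumption: "[30..40]" and ">40" count as Age > 30; "<30" does not.
    older_than_30 = age_band in ("[30..40]", ">40")
    # Assumption: anything the rule does not cover is rated "bad".
    return "good" if income == "High" or older_than_30 else "bad"


# Assumed testing records: (name, income, age band, known credit rating).
testing_data = [
    ("Tom", "Medium", "<30", "bad"),
    ("Jane", "High", "<30", "bad"),
    ("Wei", "High", ">40", "good"),
    ("Hua", "Medium", "[30..40]", "good"),
]

# Accuracy evaluation: compare the known label with the rule's output.
correct = sum(credit_rating(inc, age) == label for _, inc, age, label in testing_data)
accuracy = 100.0 * correct / len(testing_data)
print(f"accuracy: {accuracy:.0f}%")  # accuracy: 75%

# Classification: label the new, unlabeled tuple (Paul, High, [30..40]).
print("Paul:", credit_rating("High", "[30..40]"))  # Paul: good
```

Under these assumptions the only misclassified test record is Jane (high income, under 30, known rating bad).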

  3. Framework (Supervised Learning)
     [Figure: labeled data is split into training data and testing data; the training data is used to derive the classifier (model), the testing data to estimate its accuracy; the classifier is then applied to unlabeled new data.]

     Classification Methods
     • Decision Tree Induction (next week)
     • Neural Networks (today)
     • Bayesian Classification (today)
     • Associative Classifiers (next week)
     • K-Nearest Neighbour (next week)
     • Support Vector Machines
     • Case-Based Reasoning
     • Genetic Algorithms
     • Rough Set Theory
     • Fuzzy Sets
     • Etc.

     Lecture Outline
     Part I: Artificial Neural Networks (ANN) (1 hour)
     • Introduction to Neural Networks
       – Biological Neural System
       – What is an artificial neural network?
       – Neuron model and activation function
       – Construction of a neural network
     • Learning: Backpropagation Algorithm
       – Forward propagation of signal
       – Backward propagation of error
       – Example
     Part II: Bayesian Classifiers (Statistical-based) (1 hour)
     • What is Bayesian Classification
     • Bayes theorem
     • Naïve Bayes Algorithm
       – Using Laplace Estimate
       – Handling Missing Values and Numerical Data
     • Belief Networks

     Human Nervous System
     • We have only just begun to understand how our neural system operates
     • A huge number of neurons and interconnections between them
       – 100 billion (i.e. 10^11) neurons in the brain
         • for comparison, a full Olympic-sized swimming pool contains about 10^10 raindrops, and the number of stars in the Milky Way is of a similar magnitude
       – 10^4 connections per neuron
     • Biological neurons are slower than computers
       – Neurons operate in 10^-3 seconds, computers in 10^-9 seconds
       – The brain makes up for the slow rate of operation of a single neuron by the large number of neurons and connections (think about the speed of face recognition by a human, for example, and the time it takes fast computers to do the same task.)
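As a quick check of the speed gap quoted above, taking the slide's two operation times at face value (the symbols are introduced here for illustration only):

```latex
\frac{t_{\text{neuron}}}{t_{\text{computer}}}
  = \frac{10^{-3}\,\text{s}}{10^{-9}\,\text{s}}
  = 10^{6}
```

On these figures a single neuron is about a million times slower per operation than a computer, which is why the brain's advantage has to come from the number of neurons and connections working in parallel.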
     Biological Neurons
     • The purpose of neurons: transmit information in the form of electrical signals
       – a neuron accepts many inputs, which are all added up in some way
       – if enough active inputs are received at once, the neuron will be activated and fire; if not, it remains in its inactive state
     • Structure of a neuron
       – Cell body: contains the nucleus holding the chromosomes
       – Dendrites
       – Axon
       – Synapse
         • couples the axon with the dendrite of another cell;
         • information is passed from one neuron to another through synapses;
         • there is no direct linkage across the junction; it is a chemical one.

     Operation of biological neurons
     • Signals are transmitted between neurons by electrical pulses (action potentials, AP) traveling along the axon.
     • When the potential at the synapse is raised sufficiently by the AP, the synapse releases chemicals called neurotransmitters.
       – it may take the arrival of more than one AP before the synapse is triggered
     • The neurotransmitters diffuse across the gap and chemically activate gates on the dendrites, which allow charged ions to flow.
     • The flow of ions alters the potential of the dendrite and provides a voltage pulse on the dendrite (post-synaptic potential, PSP).
       – some synapses excite the dendrite they affect, while others inhibit it
       – the synapses also determine the strength of the new input signal
     • Each PSP travels along its dendrite and spreads over the soma (cell body).
     • The soma sums the effects of thousands of PSPs; if the resulting potential exceeds a threshold, the neuron fires and generates another AP.
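The summing-and-threshold behaviour of the soma described above can be sketched as a toy function. This is a deliberately simplified illustration, not a physiological model: the function name `neuron_fires`, the PSP values and the threshold are all hypothetical, with positive values standing for excitatory synapses and negative values for inhibitory ones.

```python
def neuron_fires(psps, threshold):
    """Return True if the summed post-synaptic potentials exceed the threshold."""
    return sum(psps) > threshold


# Hypothetical PSPs: positive = excitatory synapse, negative = inhibitory synapse.
strong_input = [0.6, 0.5, -0.3, 0.4]   # sums to about 1.2
weak_input = [0.6, -0.3]               # sums to about 0.3

print(neuron_fires(strong_input, threshold=1.0))  # True: the neuron fires
print(neuron_fires(weak_input, threshold=1.0))    # False: it stays inactive
```

The magnitude of each value plays the role of the synapse-determined signal strength mentioned in the slide.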
