  1. Probability and Statistics for Computer Science. "…many problems are naturally classification problems" ---Prof. Forsyth. (Credit: wikipedia) Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 11.5.2020

  2. Last time • Demo of Principal Component Analysis • Introduction to classification

  3. Objectives • Decision tree (II) • Random forest • Support Vector Machine (I)

  4. Classifiers • Why do we need classifiers? (prediction; finding patterns efficiently) • What do we use to quantify the performance of a classifier? (the confusion matrix) • What is the baseline accuracy of a 5-class classifier using the 0-1 loss function? (1/5 = 20%) • What's validation and cross-validation in classification?

  5. Performance of a multiclass classifier • Assuming there are c classes, the class confusion matrix is c × c (rows: true label, columns: predicted label) • Under the 0-1 loss function, accuracy = (sum of diagonal terms) / (sum of all terms); in the example from scikit-learn, accuracy = 32/38 ≈ 84% • The baseline accuracy is 1/c.
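A minimal sketch (not from the slides) of reading accuracy off a confusion matrix under the 0-1 loss; the 3×3 matrix below is made up so that it reproduces the 32/38 figure quoted above, it is not the actual scikit-learn example.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true label, columns: predicted label),
# chosen so the diagonal sums to 32 out of 38 items, as in the slide's example.
confusion = np.array([
    [13, 1,  0],
    [ 2, 9,  1],
    [ 0, 2, 10],
])

# Under the 0-1 loss, accuracy = sum of diagonal terms / sum of all terms.
accuracy = np.trace(confusion) / confusion.sum()
print(f"accuracy = {accuracy:.0%}")   # 32/38 ≈ 84%

# The baseline accuracy of always guessing a single class in a c-class problem is 1/c.
baseline = 1 / confusion.shape[0]
```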

  6. Cross-validation • Split the data randomly, in multiple ways, into a training set and a validation (testing) set • Examples: k-fold cross-validation, leave-one-out • What is the purpose?
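A minimal sketch of k-fold cross-validation, under the assumption that `train(X, y)` returns a fitted model and `evaluate(model, X, y)` returns its score; both names are placeholders, not from the lecture.

```python
import numpy as np

def k_fold_cross_validation(X, y, k, train, evaluate, seed=0):
    """Split the data randomly into k folds; each fold is used once for
    validation while the remaining k-1 folds are used for training."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])
        scores.append(evaluate(model, X[val_idx], y[val_idx]))
    return np.mean(scores)   # average validation score over the k splits

# "Leave-one-out" is the special case k = len(X): each item is its own fold.
```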

  7. Q1. Cross-validation • Cross-validation is a method used to prevent overfitting in classification. A. TRUE B. FALSE

  8. Decision tree: object classification • The object classification decision tree can classify objects into multiple classes using a sequence of simple tests. It will naturally grow into a tree. (Figure: an example decision tree with tests such as "moving or not moving" and "human or non-human", whose leaves include chair leg, toddler, cat, dog, sofa, and box.)

  9. Training a decision tree: example • The "Iris" data set contains three classes: Setosa, Versicolor, and Virginica. (Figure: scatter plot of the Iris data; where should split 1 go?)

  10. Training a decision tree • Choose a dimension/feature and a split • Split the training data into left- and right-child subsets D_l and D_r • Repeat the two steps above recursively on each child • Stop the recursion based on some conditions • Label the leaves with class labels (see the sketch below)
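The steps above might look like the following sketch; `choose_split` is an assumed helper (one possible version is sketched after slide 24), and the stopping parameters are illustrative defaults, not values from the lecture.

```python
from collections import Counter

def train_tree(data, labels, depth=0, max_depth=5, min_size=5):
    """Recursively grow a decision tree following the steps above (a sketch)."""
    # Stop the recursion: pure subset, subset too small, or maximum depth reached.
    if len(set(labels)) == 1 or len(labels) < min_size or depth >= max_depth:
        # Label the leaf with the most common class in its subset.
        return {"leaf": True, "label": Counter(labels).most_common(1)[0][0]}

    # Choose a dimension/feature and a split (choose_split is an assumed helper).
    feature, threshold = choose_split(data, labels)

    # Split the data into left- and right-child subsets D_l and D_r.
    left = [i for i, x in enumerate(data) if x[feature] <= threshold]
    right = [i for i, x in enumerate(data) if x[feature] > threshold]

    # Repeat recursively on each child.
    return {
        "leaf": False, "feature": feature, "threshold": threshold,
        "left": train_tree([data[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth, min_size),
        "right": train_tree([data[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth, min_size),
    }
```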

  11. Classifying with a decision tree: example • The "Iris" data set. (Figure: the Iris scatter plot with the learned splits separating Setosa, Versicolor, and Virginica.)

  12. Choosing a split • An informative split makes the subsets more concentrated and reduces uncertainty about class labels

  13. Choosing a split • An informative split makes the subsets more concentrated and reduces uncertainty about class labels

  14. Choosing a split • An informative split makes the subsets more concentrated and reduces uncertainty about class labels (✔ good split, ✖ poor split in the figure)

  15. Which is more informative?

  16. Quantifying uncertainty using entropy • We can measure uncertainty as the number of bits of information needed to distinguish between classes in a dataset (first introduced by Claude Shannon, 1916-2001) • We need log2(2) = 1 bit to distinguish 2 equal classes • We need log2(4) = 2 bits to distinguish 4 equal classes

  17. Quantifying uncertainty using entropy • Entropy (Shannon entropy) is the measure of uncertainty for a general distribution • If class i contains a fraction P(i) of the data, we need log2(1/P(i)) bits for that class • The entropy H(D) of a dataset is defined as the weighted mean of entropy over the classes: H(D) = Σ_{i=1}^{c} P(i) log2(1/P(i)) = −Σ_i P(i) log2 P(i)
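A small sketch of the entropy formula above, assuming the dataset is given as a plain list of class labels.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(D) of a dataset, given its list of class labels."""
    n = len(labels)
    fractions = (count / n for count in Counter(labels).values())
    # H(D) = sum_i P(i) * log2(1 / P(i)) = -sum_i P(i) * log2 P(i)
    return sum(p * log2(1 / p) for p in fractions)

print(entropy(["a", "b"]))            # 1.0 bit for 2 equal classes
print(entropy(["a", "b", "c", "d"]))  # 2.0 bits for 4 equal classes
```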

  18. Entropy: before the split • With one class making up 3/5 of the data and the other 2/5: H(D) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971 bits • After a candidate split into D_l and D_r: H(D_l) = ? H(D_r) = ?

  19. Entropy: examples • H(D) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971 bits • H(D_l) = −1 · log2(1) = 0 bits

  20. Entropy: examples • H(D) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971 bits • H(D_l) = −1 · log2(1) = 0 bits • H(D_r) = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.918 bits

  21. Information gain of a split • The information gain of a split is the amount of entropy that is reduced, on average, after the split: I = H(D) − (N_Dl/N_D) H(D_l) − (N_Dr/N_D) H(D_r) • where N_D is the number of items in the dataset D, N_Dl is the number of items in the left-child dataset D_l, and N_Dr is the number of items in the right-child dataset D_r

  22. Information gain: examples • I = H(D) − (N_Dl/N_D) H(D_l) − (N_Dr/N_D) H(D_r) = 0.971 − (24/60 × 0 + 36/60 × 0.918) = 0.420 bits
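The numbers on slides 18-22 can be checked directly; this sketch assumes the counts N_D = 60, N_Dl = 24, N_Dr = 36 implied by the example above.

```python
from math import log2

def H(fractions):
    """Entropy of a class distribution given as fractions summing to 1."""
    return -sum(p * log2(p) for p in fractions if p > 0)

H_D  = H([3/5, 2/5])    # ≈ 0.971 bits, the parent dataset D (slide 18)
H_Dl = H([1.0])         # =  0 bits, the pure left child D_l (slide 19)
H_Dr = H([1/3, 2/3])    # ≈ 0.918 bits, the right child D_r (slide 20)

# Information gain I = H(D) - (N_Dl/N_D * H(D_l) + N_Dr/N_D * H(D_r))
N_D, N_Dl, N_Dr = 60, 24, 36
I = H_D - (N_Dl / N_D * H_Dl + N_Dr / N_D * H_Dr)
print(round(I, 3))      # 0.42 bits, matching slide 22
```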

  23. Q. Is the splitting method a global optimum? • A. Yes • B. No (the feature and split are decided locally, i.e., the method is greedy)

  24. How to choose a dimension and split • If there are d dimensions, choose approximately √d of them as candidates at random • For each candidate, find the split that maximizes the information gain • Choose the best overall dimension and split (see the sketch below) • Note that splitting can be generalized to categorical features for which there is no natural ordering of the data
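A self-contained sketch of this search, which could serve as the `choose_split` helper assumed in the earlier tree-training sketch; scanning midpoints between consecutive observed values is an implementation assumption, not something specified on the slide.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def choose_split(data, labels):
    """Pick ~sqrt(d) candidate features at random, then the most informative split."""
    d = len(data[0])
    candidates = random.sample(range(d), max(1, round(math.sqrt(d))))
    best = None
    for feature in candidates:
        values = sorted({x[feature] for x in data})
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in zip(data, labels) if x[feature] <= t]
            right = [y for x, y in zip(data, labels) if x[feature] > t]
            # Information gain of this split (slide 21).
            gain = entropy(labels) - (len(left) / len(labels) * entropy(left)
                                      + len(right) / len(labels) * entropy(right))
            if best is None or gain > best[0]:
                best = (gain, feature, t)
    if best is None:
        # All sampled features were constant; no informative split exists here.
        feature = candidates[0]
        return feature, data[0][feature]
    return best[1], best[2]   # the most informative dimension and split
```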

  25. When to stop growing the decision tree? • Growing the tree too deep can lead to overfitting to the training data • Stop recursion on a data subset if any of the following occurs: • All items in the data subset are in the same class • The data subset becomes smaller than a predetermined size • A predetermined maximum tree depth has been reached.

  26. How to label the leaves of a decision tree • A leaf will usually have a data subset containing many class labels • Choose the class that has the most items in the subset (a "hard" label) • Alternatively, label the leaf with the number of items it contains in each class, for a probabilistic "soft" classification (see the sketch below)
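A tiny sketch of the two labelling options, assuming the leaf's subset is given as a list of class labels.

```python
from collections import Counter

def label_leaf(labels, soft=False):
    """Hard label: the most common class. Soft label: the count of each class."""
    counts = Counter(labels)
    return dict(counts) if soft else counts.most_common(1)[0][0]

print(label_leaf(["cat", "dog", "cat"]))             # 'cat'
print(label_leaf(["cat", "dog", "cat"], soft=True))  # {'cat': 2, 'dog': 1}
```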

  27. Pros and Cons of a decision tree • Pros: intuitive and easy to implement; fast (low computational cost); works with both continuous and discrete features • Cons: a single tree's decision boundary is not very accurate, and it is prone to overfitting

  28. Training, evaluation and classification • Build the random forest by training each decision tree on a random subset drawn with replacement from the training data, with a subset of the features also randomly selected ("bagging") • Evaluate the random forest by testing on its out-of-bag items • Classify by merging the classifications of the individual decision trees, either by a simple vote, or by adding the soft classifications together and then taking a vote

  29. An example of bagging • Drawing random samples from our training set with replacement. E.g., if our training set consists of 7 training samples, our bootstrap samples (here: n = 7) can look as follows, where C_1, C_2, …, C_M symbolize the decision tree classifiers:
     Sample index | Bagging Round 1 | Bagging Round 2 | … | Bagging Round M
           1      |        2        |        7        | …
           2      |        2        |        3        | …
           3      |        1        |        2        | …
           4      |        3        |        1        | …
           5      |        4        |        1        | …
           6      |        7        |        7        | …
           7      |        2        |        1        | …
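A short sketch of the bootstrap sampling behind bagging, with n = 7 training samples as in the example above; the drawn indices are random, so they will not literally match the table.

```python
import random

def bootstrap_indices(n, num_rounds, seed=0):
    """Draw num_rounds bootstrap samples of size n, with replacement, from {1, ..., n}."""
    rng = random.Random(seed)
    return [[rng.randint(1, n) for _ in range(n)] for _ in range(num_rounds)]

n = 7
for r, sample in enumerate(bootstrap_indices(n, num_rounds=3), start=1):
    # Items never drawn in a round are "out-of-bag" for that round's tree C_r.
    out_of_bag = sorted(set(range(1, n + 1)) - set(sample))
    print(f"Bagging round {r}: sample = {sample}, out-of-bag = {out_of_bag}")
```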

  30. Pros and Cons of Random forest • Pros: usually more accurate, and less likely to overfit • Cons: relatively higher computational cost (takes longer)

  31. Q2. Do you think random forest will always outperform a simple decision tree? • A. Yes • B. No (the individual trees are made less correlated by training them on different random subsets of the d features)

  32. Considerations in choosing a classifier • When solving a classification problem, it is good to try several techniques • Criteria to consider in choosing the classifier include: accuracy; speed (of training the model and of classifying new data); flexibility (the variety of data it can handle); interpretability; and how it scales from small to big data

  33. Support Vector Machine (SVM) overview • The decision boundary and classification function of a Support Vector Machine • Loss function (cost function in the book) • Training • Validation • Extension to multiclass classification

  34. SVM problem formulation • At first we assume a binary classification problem • The training set consists of N items: feature vectors x_i of dimension d and corresponding class labels y_i ∈ {±1} • We can picture the training data as a d-dimensional scatter plot with colored labels

  35. Decision boundary of SVM • SVM uses a hyperplane as its decision boundary • The decision boundary is: a_1 x^(1) + a_2 x^(2) + … + a_d x^(d) + b = 0 • In vector notation, the hyperplane can be written as: a^T x + b = 0 (points with a^T x + b > 0 lie on one side of it, and points with a^T x + b < 0 on the other)

  36. Q3. How many solutions can we have for the decision boundary a^T x + b = 0? • A. One • B. Several • C. Infinite

  37. Classification function of SVM • SVM assigns a class label to a feature vector according to the following rule: +1 if a^T x_i + b ≥ 0, −1 if a^T x_i + b < 0 • In other words, the classification function is sign(a^T x_i + b) • Note that if |a^T x_i + b| is small, then x_i is close to the decision boundary; if |a^T x_i + b| is large, then x_i is far from the boundary (see the sketch below)
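A minimal sketch of the classification rule sign(a^T x_i + b); the weight vector a and offset b below are arbitrary made-up numbers, not values from the lecture.

```python
import numpy as np

def svm_classify(a, b, x):
    """Return +1 if a^T x + b >= 0, and -1 otherwise."""
    return 1 if a @ x + b >= 0 else -1

# Made-up hyperplane a^T x + b = 0 in d = 2 dimensions.
a = np.array([2.0, -1.0])
b = 0.5

x = np.array([1.0, 1.5])
print(svm_classify(a, b, x))   # sign(2.0*1.0 - 1.0*1.5 + 0.5) = sign(1.0) = +1

# |a^T x + b| is small when x is close to the decision boundary and large when far.
print(abs(a @ x + b))          # 1.0
```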

  38. What if there is no clean-cut boundary? • Some boundaries are better than others for the training data • Some boundaries are likely more robust for run-time data • We need a quantitative measure to decide about the boundary • The loss function can help decide if one boundary is better than others
