SLIDE 1


Probability and Statistics for Computer Science

“…many problems are naturally classification problems” ---Prof. Forsyth

Hongye Liu, Teaching Assistant Professor, CS361, UIUC, 11.5.2020. Credit: Wikipedia
SLIDE 2

Last time

Demo of Principal Component Analysis

Introduction to classification

SLIDE 3

Objectives

Decision tree (II)
Random forest
Support Vector Machine (I)
SLIDE 4

Classifiers

Why do we need classifiers? What do we use to quantify the performance of a classifier? What is the baseline accuracy of a 5-class classifier using the 0-1 loss function?

What are validation and cross-validation in classification?

(Board notes: efficient prediction of patterns; the confusion matrix.)
SLIDE 5

Performance of a multiclass classifier

Assuming there are c classes:
The class confusion matrix is c × c.
Under the 0-1 loss function, accuracy = (sum of diagonal terms) / (sum of all terms).
In the example at right (a confusion matrix of true vs. predicted labels; source: scikit-learn), accuracy = 32/38 ≈ 84%.
The baseline accuracy is 1/c.
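As a minimal sketch of this computation (assuming NumPy; the 3×3 matrix below is a hypothetical example chosen to reproduce the 32/38 figure):

```python
import numpy as np

# Hypothetical confusion matrix: rows = true labels, columns = predicted labels.
conf = np.array([[13,  0, 0],
                 [ 0, 10, 6],
                 [ 0,  0, 9]])

accuracy = np.trace(conf) / conf.sum()  # sum of diagonal terms / sum of all terms
print(accuracy)                         # 32/38 ≈ 0.84
```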

SLIDE 6

Cross-validation

Split the data in multiple ways into training, validation, and testing sets: randomly, or “leave-one-out”. What is the purpose? (See the sketch below.)
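One way to realize these splits (a sketch assuming scikit-learn, which the slides already use as a source; any classifier could stand in for the tree):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier()

# Split the data in multiple ways: five random folds ...
kfold_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True))
# ... or leave-one-out: each item serves as the test set exactly once.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```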

SLIDE 7

Q1. Cross-validation

Cross-validation is a method used to prevent overfitting in classification.
  • A. TRUE
  • B. FALSE

SLIDE 8

Decision tree: object classification

The object classification decision tree can classify objects into multiple classes using a sequence of simple tests. It will naturally grow into a tree.

(Example tree from the board: moving or not moving? parts or whole? human or non-human? big or small? The leaves classify cat, toddler, dog, chair leg, sofa, box.)
SLIDE 9

Training a decision tree: example

The “Iris” data set

Setosa, Versicolor, Virginica. Split 1? Where?

(Figure: scatter plot of the Iris data, with 50 points per species.)
SLIDE 10

Training a decision tree

Choose a dimension/feature and a split.
Split the training data into left- and right-child subsets Dl and Dr.
Repeat the two steps above recursively on each child.
Stop the recursion based on some conditions.
Label the leaves with class labels.

(A sketch of this recursion follows below.)
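A minimal sketch of this recursion, under stated assumptions: X and y are NumPy arrays, leaves store a majority label, and the helper best_split (choosing the dimension and split; see the sketch on a later slide) is assumed.

```python
from collections import Counter

def train_tree(X, y, depth=0, max_depth=5, min_size=5):
    # Stop the recursion: pure subset, subset too small, or tree too deep.
    if len(set(y)) == 1 or len(y) < min_size or depth >= max_depth:
        return {"label": Counter(y).most_common(1)[0][0]}  # leaf: majority class
    dim, split = best_split(X, y)      # choose a dimension/feature and a split
    if dim is None:                    # no usable split found: make a leaf
        return {"label": Counter(y).most_common(1)[0][0]}
    left = X[:, dim] <= split          # boolean mask for the left-child subset Dl
    return {"dim": dim, "split": split,  # internal node: recurse on each child
            "left":  train_tree(X[left],  y[left],  depth + 1, max_depth, min_size),
            "right": train_tree(X[~left], y[~left], depth + 1, max_depth, min_size)}
```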

SLIDE 11

Classifying with a decision tree: example

The “Iris” data set

Setosa Versicolor Virginica
SLIDE 12

Choosing a split

An informative split makes the subsets more concentrated and reduces uncertainty about class labels.

SLIDE 15

Which is more informative?

SLIDE 16

Quantifying uncertainty using entropy

We can measure uncertainty as the number of bits of information needed to distinguish between classes in a dataset (first introduced by Claude Shannon).

We need log2 2 = 1 bit to distinguish 2 equal classes.

We need log2 4 = 2 bits to distinguish 4 equal classes.

Claude Shannon (1916–2001)
SLIDE 17

Quantifying uncertainty using entropy

Entropy (Shannon entropy) is the measure of uncertainty for a general distribution.

If class i contains a fraction P(i) of the data, we need \log_2 \frac{1}{P(i)} bits for that class. The entropy H(D) of a dataset is defined as the weighted mean of entropy for every class:

H(D) = \sum_{i=1}^{c} P(i) \log_2 \frac{1}{P(i)} = -\sum_{i=1}^{c} P(i) \log_2 P(i)
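A direct transcription of this formula, as a sketch (assuming NumPy, with labels given as a sequence):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(D) of a sequence of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # fraction P(i) of the data in class i
    return -(p * np.log2(p)).sum()     # -sum_i P(i) log2 P(i)

print(entropy(list("xxxoo")))          # 0.971 bits, matching the next slide
```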
SLIDE 18

Entropy: before the split

H(D) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971 bits

(Board: P(x) = 3/5, P(o) = 2/5; for the split into D_l and D_r, H(D_l) = ? and H(D_r) = ?)
SLIDE 20

Entropy: examples

H(D) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971 bits

H(D_l) = -1\log_2 1 = 0 bits

H(D_r) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.918 bits

SLIDE 21

Information gain of a split

The information gain of a split is the amount of entropy that was reduced, on average, after the split:

I = H(D) - \left( \frac{N_{D_l}}{N_D} H(D_l) + \frac{N_{D_r}}{N_D} H(D_r) \right)

where
N_D is the number of items in the dataset D,
N_{D_l} is the number of items in the left-child dataset D_l,
N_{D_r} is the number of items in the right-child dataset D_r.

SLIDE 22

Information gain: examples

I = H(D) - \left( \frac{N_{D_l}}{N_D} H(D_l) + \frac{N_{D_r}}{N_D} H(D_r) \right) = 0.971 - \left( \frac{24}{60} \times 0 + \frac{36}{60} \times 0.918 \right) = 0.420 bits
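The same arithmetic as a sketch, reusing the entropy function from the earlier block (assumes y is a NumPy array and left_mask is a boolean mask):

```python
def information_gain(y, left_mask):
    """Entropy reduced on average by splitting labels y into left/right subsets."""
    y_l, y_r = y[left_mask], y[~left_mask]
    return entropy(y) - (len(y_l) / len(y) * entropy(y_l)
                         + len(y_r) / len(y) * entropy(y_r))

# With H(D) = 0.971 bits, a pure 24-item left child, and a 36-item right child
# of entropy 0.918 bits, this evaluates to 0.420 bits, as above.
```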
SLIDE 23
Q. Is the splitting method globally optimum?
  • A. Yes
  • B. No

(No: each split is decided locally, for a specific feature; the method is greedy.)
SLIDE 24

How to choose a dimension and split

If there are d dimensions, choose approximately √d of them as candidates at random.

For each candidate, find the split that maximizes the information gain.

Choose the best overall dimension and split.

Note that splitting can be generalized to categorical features for which there is no natural ordering of the data.

(A sketch of this procedure follows below.)
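A sketch of this procedure, which also supplies the best_split helper assumed in the earlier training sketch (reuses the information_gain function above; taking the candidate thresholds to be the observed values, excluding the largest so that neither child is empty, is an assumption of this sketch):

```python
import numpy as np

def best_split(X, y):
    """Pick the (dimension, threshold) pair with maximal information gain."""
    d = X.shape[1]
    candidates = np.random.choice(d, max(1, int(np.sqrt(d))), replace=False)
    best_dim, best_t, best_gain = None, None, -1.0
    for dim in candidates:                   # ~sqrt(d) random dimensions
        for t in np.unique(X[:, dim])[:-1]:  # candidate thresholds on this dim
            gain = information_gain(y, X[:, dim] <= t)
            if gain > best_gain:
                best_dim, best_t, best_gain = dim, t, gain
    return best_dim, best_t
```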
SLIDE 25

When to stop growing the decision tree?

Growing the tree too deep can lead to overfitting to the training data.

Stop recursion on a data subset if any of the following occurs:
All items in the data subset are in the same class.
The data subset becomes smaller than a predetermined size.
A predetermined maximum tree depth has been reached.
SLIDE 26

How to label the leaves of a decision tree

A leaf will usually have a data subset containing many class labels.

Choose the class that has the most items in the subset (a “hard” classification).

Alternatively, label the leaf with the number it contains in each class, for a probabilistic “soft” classification. (Both labelings are sketched below.)
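Both labelings in a short sketch (the leaf subset below is a hypothetical example):

```python
from collections import Counter

leaf_labels = ["c1", "c1", "c3", "c1", "c2"]      # hypothetical leaf subset

# Hard label: the class with the most items in the subset.
hard = Counter(leaf_labels).most_common(1)[0][0]  # 'c1'

# Soft label: the count (here, as a fraction) per class.
soft = {c: n / len(leaf_labels) for c, n in Counter(leaf_labels).items()}
# {'c1': 0.6, 'c3': 0.2, 'c2': 0.2}
```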
SLIDE 27

Pros and Cons of a decision tree

Pros: intuitive; easy to implement; low cost, hence fast; handles both discrete and continuous variables.

Cons: the decision boundary is not very accurate; prone to overfitting.
SLIDE 28

Training, evaluation and classification

Build the random forest by training each decision tree on a random subset drawn with replacement from the training data; a subset of the features is also randomly selected (“bagging”).

Evaluate the random forest by testing on its out-of-bag items.

Classify by merging the classifications of the individual decision trees:
By simple vote,
Or by adding soft classifications together and then taking a vote.
SLIDE 29

An example of bagging

Draw random samples from our training set with replacement. E.g., if our training set consists of 7 training samples, our bootstrap samples (here: n = 7) can look as follows, where C1, C2, …, Cm symbolize the decision tree classifiers.

Sample index:    1  2  3  4  5  6  7
Bagging Round 1: 2  2  1  3  4  7  2  (trains C1)
Bagging Round 2: 7  3  2  1  1  7  1  (trains C2)
…
Bagging Round M:                      (trains Cm)

(Board: with d = 9 features, a random subset of the features is also selected.)
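One bagging round as a sketch (assuming NumPy; each round would train one classifier Ci on the rows drawn, and the forest is evaluated on the out-of-bag rows):

```python
import numpy as np

samples = np.arange(1, 8)                  # sample indices 1..7, as on the slide
boot = np.random.choice(samples, size=7, replace=True)  # one bagging round
oob = np.setdiff1d(samples, boot)          # out-of-bag items, used for evaluation
print(boot, oob)
```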

SLIDE 30

Pros and Cons of Random forest

Pros: usually more accurate; less likely to overfit.

Cons: relatively slower; more cost in computing.

SLIDE 31
Q2. Do you think a random forest will always outperform a simple decision tree?
  • A. Yes
  • B. No

(Board: the trees are correlated; using different subsets of the d features helps decorrelate them.)

SLIDE 32

Considerations in choosing a classifier

When solving a classification problem, it is good to try several techniques.

Criteria to consider in choosing the classifier include:
* Accuracy
* Speed (of training the model; of classification given new data)
* Flexibility (variety of data, small vs. big)
* Interpretation
* Scaling effects
SLIDE 33

Support Vector Machine (SVM) overview

The decision boundary and classification function of a Support Vector Machine

Loss function (cost function in the book)
Training
Validation
Extension to multiclass classification

SLIDE 34

SVM problem formulation

At first we assume a binary classification problem. The training set consists of N items:
Feature vectors x_i of dimension d
Corresponding class labels y_i ∈ {±1}

We can picture the training data as a d-dimensional scatter plot with colored labels.

(Figure: a two-dimensional example with axes x^{(1)} and x^{(2)}.)
SLIDE 35

Decision boundary of SVM

SVM uses a hyperplane as its decision boundary. The decision boundary is:

a_1 x^{(1)} + a_2 x^{(2)} + \cdots + a_d x^{(d)} + b = 0

In vector notation, the hyperplane can be written as:

a^T x + b = 0

(Figure: the hyperplane in the (x^{(1)}, x^{(2)}) plane, with a^T x + b > 0 on one side and a^T x + b < 0 on the other.)
SLIDE 36
Q3. How many solutions can we have for the decision boundary a^T x + b = 0?
  • A. One
  • B. Several
  • C. Infinite

SLIDE 37

Classification function of SVM

SVM assigns a class label to a feature vector according to the following rule:
+1 if a^T x_i + b ≥ 0
−1 if a^T x_i + b < 0

In other words, the classification function is sign(a^T x_i + b). Note that:
If |a^T x_i + b| is small, then x_i is close to the decision boundary.
If |a^T x_i + b| is large, then x_i is far from the decision boundary.

(A sketch follows below.)
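The rule as a minimal sketch (assuming NumPy vectors; the parameters a and b would come from training):

```python
import numpy as np

def svm_classify(x, a, b):
    """Return +1 if a^T x + b >= 0, else -1."""
    return 1 if a @ x + b >= 0 else -1

print(svm_classify(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))  # +1
```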

SLIDE 38

What if there is no clean-cut boundary?

Some boundaries are better than others for the training data. Some boundaries are likely more robust for run-time data. We need a quantitative measure to decide about the boundary. The loss function can help decide if one boundary is better than others.
SLIDE 39

Loss function 1

For any given feature vector x_i with class label y_i ∈ {±1}, we want:
Zero loss if x_i is classified correctly, i.e. sign(a^T x_i + b) = y_i
Positive loss if x_i is misclassified, i.e. sign(a^T x_i + b) ≠ y_i
More loss if a misclassified x_i is further away from the boundary

This loss function 1 meets the criteria above:

\max(0, -y_i(a^T x_i + b))

Training error cost:

S(a, b) = \frac{1}{N} \sum_{i=1}^{N} \max(0, -y_i(a^T x_i + b))

SLIDE 40
Q4. What is the value of the function \max(0, -y_i(a^T x_i + b)) if sign(a^T x_i + b) = y_i?
  • A. 0.
  • B. Others.

SLIDE 41
Q5. What is the value of the function \max(0, -y_i(a^T x_i + b)) if sign(a^T x_i + b) ≠ y_i?
  • A. 0.
  • B. A value greater than or equal to 0.

SLIDE 42

The problem with loss function 1

Loss function 1 does not distinguish between the following decision boundaries if they both classify x_i correctly:
One that passes the two classes closely,
One that passes with a wider margin.

But leaving a larger margin gives robustness for run-time data: the large-margin principle.

Credit: Kevin Murphy
SLIDE 43
Q6. Wondering what “support vector” means?
  • A. Yes.
  • B. No.

Support vectors are those data points in the training data that uniquely define the decision boundary.

SLIDE 44
Q7. SVM classification is faster than a decision tree in terms of time complexity.
  • A. TRUE.
  • B. FALSE.

SLIDE 45

Loss function 2: the hinge loss

We want to impose a small positive loss if x_i is correctly classified but close to the boundary. The hinge loss function meets the criteria above:

\max(0, 1 - y_i(a^T x_i + b))

Training error cost:

S(a, b) = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i(a^T x_i + b))

(Both loss functions are sketched below.)
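Both loss functions in one short sketch (assuming NumPy, a feature matrix X of shape N × d, and labels y in {±1}):

```python
import numpy as np

def loss1(a, b, X, y):
    """Training error cost with loss function 1: zero unless misclassified."""
    return np.maximum(0, -y * (X @ a + b)).mean()

def hinge_loss(a, b, X, y):
    """Training error cost with the hinge loss: also penalizes small margins."""
    return np.maximum(0, 1 - y * (X @ a + b)).mean()
```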

SLIDE 47

The problem with loss function 2

Loss function 2 favors decision boundaries that have large ‖a‖, because increasing ‖a‖ can zero out the loss for a correctly classified x_i near the boundary. But a large ‖a‖ makes the classification function sign(a^T x_i + b) extremely sensitive to small changes in x_i and less robust to run-time data. So a small ‖a‖ is better.
SLIDE 48

Assignments

Read Chapter 11 of the textbook.

Next time: SVM regularization, stochastic gradient descent.

SLIDE 49

Additional References

✺ Robert V. Hogg, Elliot A. Tanis and Dale L. Zimmerman, “Probability and Statistical Inference”

✺ Morris H. DeGroot and Mark J. Schervish, “Probability and Statistics”

✺ Kevin Murphy, “Machine Learning: A Probabilistic Perspective”

SLIDE 50

See you next time

See You!