MACHINE LEARNING
Introduction
Alessandro Moschitti
Department of Computer Science and Information Engineering, University of Trento
Email: moschitti@disi.unitn.it
Course Schedule - Revised
• 27 April, 9:30-12:30, Garda (Introduction to Machine Learning - Decision Trees and Bayesian Classifiers)
• 2 May, 14:30-18:30, Ofek (Introduction to Statistical Learning Theory - Vector Space Model)
• 4 May, 9:30-12:30, Ofek (Linear Classifiers)
• 28 May, 9:30-12:30, Ofek (VC dimension, Perceptron and Support Vector Machines)
• 29 May, 9:30-12:30, Garda (Kernel Methods for NLP Applications)
Lectures
• Introduction to ML
• Decision Trees
• Bayesian Classifiers
• Vector spaces
• Vector Space Categorization
  • Feature design, selection and weighting
  • Document representation
  • Category Learning: Rocchio and KNN
  • Measuring Performance
  • From binary to multi-class classification
Lectures
• PAC Learning
• VC dimension
• Perceptron
• Vector Space Model
• Representer Theorem
• Support Vector Machines (SVMs)
• Hard/Soft Margin (Classification)
• Regression and ranking
Lectures
• Kernel Methods
  • Theory and algebraic properties
  • Linear, Polynomial, Gaussian kernels
  • Kernel construction
• Kernels for structured data
  • Sequence and Tree Kernels
• Structured Output
Reference Book + some articles
Today
• Introduction to Machine Learning
• Vector Spaces
Why Learn Functions Automatically?
• Almost anything can be described as a function
  • from planetary motion
  • to the input/output behavior of your computer
• If we could learn any function, any problem could be solved automatically
More concretely
• Given the user requirements (input/output relations), we write programs
• Different cases are typically handled with if-then rules applied to the input variables
• What happens when
  • millions of variables are present, and/or
  • values are not reliable (e.g., noisy data)?
• Machine learning writes the program (the rules) for you
What is Statistical Learning?
• Statistical methods: algorithms that learn relations in the data from examples
• Simple relations are expressed by pairs of variables: 〈x_1, y_1〉, 〈x_2, y_2〉, …, 〈x_n, y_n〉
• Learning means finding f such that, given a new value x*, we can evaluate y*, i.e. 〈x*, f(x*)〉 = 〈x*, y*〉
You have already tackled the learning problem
[plot: data points in the X-Y plane]
Linear Regression
[plot: linear fit of the points]
Degree 2
[plot: degree-2 polynomial fit]
Degree
[plot: polynomial fit of higher degree]
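A minimal sketch (not from the slides) of the regression example above, fitting polynomials of increasing degree to noisy 1-D data with NumPy; the data and the chosen degrees are made up for illustration.

import numpy as np

# Toy data: y is a noisy quadratic function of x (made-up example).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.shape)

for degree in (1, 2, 9):
    # Least-squares fit of a polynomial of the given degree.
    coeffs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coeffs, x)
    train_error = np.mean((y - y_hat) ** 2)
    print(f"degree {degree}: training MSE = {train_error:.4f}")

# The degree-9 fit has the lowest training error but will typically
# generalize worse to new points (overfitting).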
Machine Learning Problems
• Overfitting
• How do we deal with millions of variables instead of only two?
• How do we deal with real-world objects instead of real values?
Learning Models
• Real-valued output: regression
• Finite (integer) values: classification
• Binary classifiers:
  • 2 classes, e.g. f(x) → {cats, dogs}
Decision Trees
Decision Tree (between Dogs/Cats)
Taller than 50 cm?
  • Yes → Output: Dog
  • No → Short hair?
      • ... (further tests) ...
      • Mustaches?
          • Yes → Output: Cat
          • No → Output: Dog
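As a sketch of how a tree like the one above turns into code (the feature names are hypothetical, and the branch the slide elides is skipped):

def classify(animal):
    """Hand-written decision tree mirroring the slide; feature names are made up."""
    if animal["height_cm"] > 50:
        return "dog"
    # The slide elides some intermediate tests on hair length here.
    if animal["has_long_mustaches"]:
        return "cat"
    return "dog"

print(classify({"height_cm": 30, "has_long_mustaches": True}))   # -> cat
print(classify({"height_cm": 70, "has_long_mustaches": False}))  # -> dog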
Mustaches or Whiskers
• They are an important orientation tool for both dogs and cats
• All dogs and cats have them ⇒ their presence is not a good feature
• We may use their length instead
• What about mustaches?
Mustaches?
Entropy-based feature selection
• Entropy of the class distribution P(C_i):
  H(S) = - Σ_i P(C_i) log₂ P(C_i)
• It measures how close the class distribution is to uniform
• Given the sets S_1 … S_n obtained by partitioning S with respect to a feature, the overall entropy is:
  H(S_1, …, S_n) = Σ_j (|S_j| / |S|) · H(S_j)
Example: cats and dogs classification (S_0)
• p(dog) = p(cat) = 4/8 = ½
• H(S_0) = ½·log₂(2) + ½·log₂(2) = 1
Has the animal more than 6 siblings? (splits S_0 into S_1, S_2)
• p(dog) = p(cat) = 2/4 = ½ in both subsets
• H(S_1) = H(S_2) = ½·log₂(2) + ½·log₂(2) = 1
• Overall: (4/8)·H(S_1) + (4/8)·H(S_2) = 0.5 + 0.5 = 1.0
Does the animal have short hair? (splits S_0 into S_1, S_2)
• p(dog) = 1/4, p(cat) = 3/4 (and vice versa in the other subset)
• H(S_1) = H(S_2) = (1/4)·log₂(4) + (3/4)·log₂(4/3) = 0.5 + 0.31 = 0.81
• Overall: (4/8)·H(S_1) + (4/8)·H(S_2) = 0.81 (note that |S_1| = |S_2|)
Follow up
• The hair-length feature is better than the number of siblings, since 0.81 is lower than 1.0
• Test all the features
• Choose the best one
• Recurse: repeat the procedure on each subset induced by the best feature
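A small sketch of the selection step just described; the dog/cat counts match the example above, while the function and variable names are my own.

import math

def entropy(class_counts):
    """Entropy (base 2) of a class distribution given as a list of counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def split_entropy(subsets):
    """Weighted entropy of a partition; each subset is a list of class counts."""
    total = sum(sum(s) for s in subsets)
    return sum((sum(s) / total) * entropy(s) for s in subsets)

# S_0: 4 dogs and 4 cats.
print(entropy([4, 4]))                     # 1.0

# Feature "more than 6 siblings": two subsets with 2 dogs / 2 cats each.
print(split_entropy([[2, 2], [2, 2]]))     # 1.0

# Feature "short hair": subsets with 1 dog / 3 cats and 3 dogs / 1 cat.
print(split_entropy([[1, 3], [3, 1]]))     # ~0.81 -> the better feature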
Probabilistic Classifier
Probability (1)
• Let Ω be a space and β a collection of subsets of Ω
• β is the collection of events
• A probability function P is defined as: P : β → [0, 1]
Definition of Probability
• P is a function that associates each event E with a number P(E), called the probability of E, such that:
  1) 0 ≤ P(E) ≤ 1
  2) P(Ω) = 1
  3) P(E_1 ∨ E_2 ∨ … ∨ E_n ∨ …) = Σ_{i=1}^{∞} P(E_i), if E_i ∧ E_j = ∅ for all i ≠ j
Finite Partition and Uniform Distribution
• Given a partition E_1, …, E_n of n events, uniformly distributed (each with probability 1/n), and
• given an event E, we can evaluate its probability as:
  P(E) = P(E ∧ E_tot) = P(E ∧ (E_1 ∨ E_2 ∨ … ∨ E_n))
       = Σ_{E_i ⊂ E} P(E ∧ E_i) = Σ_{E_i ⊂ E} P(E_i) = Σ_{E_i ⊂ E} 1/n
       = |{i : E_i ⊂ E}| / n = Target Cases / All Cases
Conditional Probability
• P(A | B) is the probability of A given B
• B is the piece of information that we know
• The following rule holds:
  P(A | B) = P(A ∧ B) / P(B)
Independence
• A and B are independent iff:
  P(A | B) = P(A) and P(B | A) = P(B)
• If A and B are independent:
  P(A | B) = P(A ∧ B) / P(B) = P(A), hence P(A ∧ B) = P(A) P(B)
Bayes' Theorem
  P(A | B) = P(B | A) P(A) / P(B)
Proof:
  P(A | B) = P(A ∧ B) / P(B)   (def. of conditional probability)
  P(B | A) = P(A ∧ B) / P(A)   (def. of conditional probability)
  hence P(A ∧ B) = P(B | A) P(A), and therefore
  P(A | B) = P(B | A) P(A) / P(B)
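A quick numeric sanity check of the theorem on a made-up joint distribution (the numbers are arbitrary, not from the slides):

# Made-up joint probabilities over two events A and B.
p_a_and_b = 0.12
p_a = 0.30
p_b = 0.40

p_a_given_b = p_a_and_b / p_b          # definition of conditional probability
p_b_given_a = p_a_and_b / p_a

# Bayes' theorem recovers P(A|B) from P(B|A), P(A) and P(B).
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
print(p_a_given_b)   # 0.3 (here A and B happen to be independent: 0.12 = 0.3 * 0.4)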
Bayesian Classifier
• Given a set of categories {c_1, c_2, …, c_n}
• Let E be the description of the example to be classified
• The category of E can be derived by using the following probability:
  P(c_i | E) = P(c_i) P(E | c_i) / P(E)
  Σ_{i=1}^{n} P(c_i | E) = Σ_{i=1}^{n} P(c_i) P(E | c_i) / P(E) = 1
  P(E) = Σ_{i=1}^{n} P(c_i) P(E | c_i)
Bayesian Classifier (cont.)
• We need to compute:
  • the prior probability: P(c_i)
  • the conditional probability: P(E | c_i)
• P(c_i) can be estimated from the training set D:
  • given n_i examples of class c_i in D, P(c_i) = n_i / |D|
• Suppose that an example is represented by m features:
  E = e_1 ∧ e_2 ∧ … ∧ e_m
• The number of possible descriptions E is exponential in m, so there are not enough training examples to estimate P(E | c_i) directly
Naïve Bayes Classifiers
• The features are assumed to be independent given the category c_i:
  P(E | c_i) = P(e_1 ∧ e_2 ∧ … ∧ e_m | c_i) = Π_{j=1}^{m} P(e_j | c_i)
• This allows us to estimate only P(e_j | c_i) for each feature and category.
An example of the Naïve Bayes Classifier
• C = {Allergy, Cold, Healthy}
• e_1 = sneeze; e_2 = cough; e_3 = fever
• E = {sneeze, cough, ¬fever}

  Prob             Healthy   Cold   Allergy
  P(c_i)           0.9       0.05   0.05
  P(sneeze | c_i)  0.1       0.9    0.9
  P(cough | c_i)   0.1       0.8    0.7
  P(fever | c_i)   0.01      0.7    0.4
An example of the Naïve Bayes Classifier (cont.)
• E = {sneeze, cough, ¬fever}
• P(Healthy | E) = (0.9)(0.1)(0.1)(0.99) / P(E) = 0.0089 / P(E)
• P(Cold | E) = (0.05)(0.9)(0.8)(0.3) / P(E) = 0.01 / P(E)
• P(Allergy | E) = (0.05)(0.9)(0.7)(0.6) / P(E) = 0.019 / P(E)
• P(E) = 0.0089 + 0.01 + 0.019 = 0.0379
• P(Healthy | E) = 0.23, P(Cold | E) = 0.26, P(Allergy | E) = 0.50
• The most probable category is Allergy
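A minimal sketch that reproduces the computation above from the table; the dictionary layout and variable names are my own.

# Priors and per-feature likelihoods from the table above.
priors = {"Healthy": 0.9, "Cold": 0.05, "Allergy": 0.05}
likelihoods = {
    "Healthy": {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "Cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "Allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}

# Observed evidence: sneeze and cough are present, fever is absent.
evidence = {"sneeze": True, "cough": True, "fever": False}

scores = {}
for c, prior in priors.items():
    score = prior
    for feature, present in evidence.items():
        p = likelihoods[c][feature]
        score *= p if present else (1.0 - p)   # P(e_j | c) or P(¬e_j | c)
    scores[c] = score

p_e = sum(scores.values())                      # P(E) by total probability
posteriors = {c: s / p_e for c, s in scores.items()}
print(posteriors)        # Allergy gets the highest posterior (~0.5)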
Probability Estimation
• Estimate counts from the training data:
  • let n_i be the number of examples in c_i
  • let n_ij be the number of examples of c_i containing feature e_j, then:
    P(e_j | c_i) = n_ij / n_i
• Problem: the data set may still be too small
• For a rare feature e_k we may have P(e_k | c_i) = 0 for all c_i
Smoothing
• Probabilities are estimated even for feature/category pairs not seen in the data
• Laplace smoothing:
  • each feature has an a priori probability p
  • we assume the feature has been observed in a virtual sample of size m
    P(e_j | c_i) = (n_ij + m·p) / (n_i + m)
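A tiny illustration of the smoothed estimate versus the raw count ratio; the counts below are invented.

def mle(n_ij, n_i):
    """Raw relative-frequency estimate P(e_j | c_i) = n_ij / n_i."""
    return n_ij / n_i

def laplace(n_ij, n_i, m, p):
    """Smoothed estimate P(e_j | c_i) = (n_ij + m*p) / (n_i + m)."""
    return (n_ij + m * p) / (n_i + m)

# A feature never seen with this category: the raw estimate is 0,
# the smoothed one is small but non-zero.
print(mle(0, 100))                    # 0.0
print(laplace(0, 100, m=10, p=0.5))   # ~0.045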
Naïve Bayes for text classification
• "Bag of words" model
• The examples are the documents of each category
• Features: the vocabulary V = {w_1, w_2, …, w_m}
• P(w_j | c_i) is the probability of observing w_j in category c_i
• Use Laplace smoothing with a uniform prior (p = 1/|V|) and m = |V|
• That is, each word is assumed to appear exactly once in each category
Training (version 1)
• V is built from all training documents D
• For each category c_i ∈ C:
  • let D_i be the subset of documents of D in c_i ⇒ P(c_i) = |D_i| / |D|
  • n_i is the total number of word occurrences in D_i
  • for each w_j ∈ V, n_ij is the count of w_j in D_i ⇒ P(w_j | c_i) = (n_ij + 1) / (n_i + |V|)
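A compact sketch of this training procedure; the data layout and function names are assumptions, not from the slides.

from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (category, list_of_words) pairs. Returns priors, word probs, V."""
    vocab = {w for _, words in docs for w in words}
    categories = {c for c, _ in docs}

    priors, word_probs = {}, {}
    for c in categories:
        docs_c = [words for cat, words in docs if cat == c]
        priors[c] = len(docs_c) / len(docs)                # P(c_i) = |D_i| / |D|
        counts = Counter(w for words in docs_c for w in words)
        n_i = sum(counts.values())                         # total words in D_i
        # Laplace-smoothed P(w_j | c_i) = (n_ij + 1) / (n_i + |V|)
        word_probs[c] = {w: (counts[w] + 1) / (n_i + len(vocab)) for w in vocab}
    return priors, word_probs, vocab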
Testing
• Given a test document X
• Let n be the number of words of X
• The assigned category is:
  argmax_{c_i ∈ C} P(c_i) Π_{j=1}^{n} P(a_j | c_i)
  where a_j is the word at the j-th position in X
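And the corresponding classification step, using log probabilities to avoid underflow (a common practical variant, not stated on the slide); words unseen at training time are skipped here, which is one possible convention.

import math

def classify_nb(x_words, priors, word_probs):
    """Return argmax_c P(c) * prod_j P(a_j | c), computed in log space."""
    best_c, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in x_words:
            if w in word_probs[c]:          # skip out-of-vocabulary words
                score += math.log(word_probs[c][w])
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Toy usage with the training sketch above (made-up mini corpus).
docs = [("sports", ["ball", "goal", "team"]),
        ("politics", ["vote", "law", "team"])]
priors, word_probs, V = train_naive_bayes(docs)
print(classify_nb(["goal", "ball"], priors, word_probs))   # -> "sports"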
Part I: Abstract View of Statistical Learning Theory