CS 4100: Artificial Intelligence
Perceptrons and Logistic Regression
Jan-Willem van de Meent, Northeastern University
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Linear Classifiers
Feature Vectors

Example (spam filtering): the email
  "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..."
maps to a feature vector such as
  # free      : 2
  YOUR_NAME   : 0
  MISSPELLED  : 2
  FROM_FRIEND : 0
  ...
and is labeled SPAM. Likewise, an image of a digit maps to pixel and shape features
  PIXEL-7,12 : 1
  PIXEL-7,13 : 0
  ...
  NUM_LOOPS  : 1
  ...
and is labeled "2".

Some (Simplified) Biology
• Very loose inspiration: human neurons
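To make this concrete, here is a minimal sketch of how an email might be turned into such a sparse feature vector. The function name and the tokenization are mine, and the misspelling list is a stand-in; features like YOUR_NAME and FROM_FRIEND would come from message metadata in a real system.

```python
def spam_features(email_text, known_misspellings={"printr", "cartriges"}):
    """Map an email to a sparse feature vector (a dict of feature name -> value)."""
    words = email_text.lower().split()  # deliberately naive tokenization
    return {
        "# free": float(words.count("free")),
        "MISSPELLED": float(sum(1 for w in words if w in known_misspellings)),
        "YOUR_NAME": 0.0,    # would come from metadata; stubbed here
        "FROM_FRIEND": 0.0,  # would come from metadata; stubbed here
    }
```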
Linear Classifiers
• Inputs are feature values
• Each feature has a weight
• Sum is the activation:

  activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)

• If the activation is:
  • Positive, output +1
  • Negative, output -1

[Diagram: features f_1, f_2, f_3 feed a weighted sum Σ with weights w_1, w_2, w_3, followed by a >0? threshold.]

• Binary case: compare features to a weight vector
• Learning: figure out the weight vector from examples

Example weight vector and feature vectors (a positive dot product means the positive class):
  weights      # free : 4   YOUR_NAME : -1   MISSPELLED : 1   FROM_FRIEND : -3   ...
  spam email   # free : 2   YOUR_NAME :  0   MISSPELLED : 2   FROM_FRIEND :  0   ...
  ham email    # free : 0   YOUR_NAME :  1   MISSPELLED : 1   FROM_FRIEND :  1   ...
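In code, the activation is just a sparse dot product. A minimal sketch over feature dicts (the helper name `activation` is mine):

```python
def activation(weights, features):
    """Compute w . f(x) over sparse feature dicts; absent features count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```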
Decision Rules

Binary Decision Rule
• In the space of feature vectors:
  • Examples are points
  • Any weight vector is a hyperplane
  • One side corresponds to y = +1
  • The other side corresponds to y = -1

[Figure: 2D feature space with axes "free" and "money"; the weight vector
  BIAS  : -3
  free  : 4
  money : 2
defines a line separating the +1 = SPAM region from the -1 = HAM region.]
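A small self-contained sketch of the decision rule, using the weight vector from the figure above; `classify_binary` and the example feature vector are mine:

```python
def classify_binary(weights, features):
    """Return +1 (SPAM) if w . f(x) > 0, else -1 (HAM)."""
    score = sum(weights.get(k, 0.0) * v for k, v in features.items())
    return +1 if score > 0 else -1

w = {"BIAS": -3.0, "free": 4.0, "money": 2.0}  # weight vector from the figure
x = {"BIAS": 1.0, "free": 1.0, "money": 1.0}   # a hypothetical email's features
print(classify_binary(w, x))  # activation = -3 + 4 + 2 = 3 > 0, so +1 = SPAM
```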
Weight Updates
Learning: Binary Perceptron
• Start with weights w = 0
• For each training instance (f(x), y*):
  • Classify with current weights
  • If correct (i.e., y = y*): no change!
  • If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1:

    w ← w + y* · f(x)

Examples: Perceptron
• Separable Case
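A minimal sketch of this training loop over sparse feature dicts; the `num_passes` parameter is my addition (the slide simply iterates over training instances):

```python
def train_binary_perceptron(examples, num_passes=10):
    """Binary perceptron. examples: list of (features, y_star), y_star in {+1, -1}."""
    weights = {}
    for _ in range(num_passes):
        for features, y_star in examples:
            score = sum(weights.get(k, 0.0) * v for k, v in features.items())
            y = +1 if score > 0 else -1      # classify with current weights
            if y != y_star:                  # wrong: w <- w + y* . f(x)
                for k, v in features.items():
                    weights[k] = weights.get(k, 0.0) + y_star * v
    return weights
```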
Multiclass Decision Rule
• If we have multiple classes:
  • A weight vector w_y for each class y
  • Score (activation) of a class y:  w_y · f(x)
  • Prediction: the class with the highest score wins:  y = argmax_y w_y · f(x)
• Binary = multiclass where the negative class has weight zero

Learning: Multiclass Perceptron
• Start with all weights = 0
• Pick training examples one by one
• Predict with current weights
• If correct: no change!
• If wrong: lower the score of the wrong answer, raise the score of the right answer:

    w_y  ← w_y  - f(x)    (wrong answer)
    w_y* ← w_y* + f(x)    (right answer)
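The same loop with one weight vector per class; a sketch, with `num_passes` again my addition:

```python
def train_multiclass_perceptron(examples, classes, num_passes=10):
    """Multiclass perceptron. examples: list of (features, y_star).
    Keeps one sparse weight dict per class."""
    weights = {c: {} for c in classes}

    def score(c, features):
        return sum(weights[c].get(k, 0.0) * v for k, v in features.items())

    for _ in range(num_passes):
        for features, y_star in examples:
            y = max(classes, key=lambda c: score(c, features))  # highest score wins
            if y != y_star:
                for k, v in features.items():
                    weights[y][k] = weights[y].get(k, 0.0) - v            # lower wrong answer
                    weights[y_star][k] = weights[y_star].get(k, 0.0) + v  # raise right answer
    return weights
```

Note that the worked example that follows starts from a w_sports with BIAS = 1 rather than all zeros, so reproducing its exact trace requires that initialization.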
Example: Multiclass Perceptron
Question: What will the weights w be for each class after 3 updates?
  y_1 = "politics", x_1 = "win the vote"
  y_2 = "politics", x_2 = "win the election"
  y_3 = "sports",   x_3 = "win the game"

Initial weights:
           w_sports   w_politics   w_tech
  BIAS   :    1           0           0
  win    :    0           0           0
  game   :    0           0           0
  vote   :    0           0           0
  the    :    0           0           0
  ...

Step 1: x_1 = "win the vote", f(x_1) = {BIAS: 1, win: 1, vote: 1, the: 1}
  Scores: w_sports · f(x_1) = 1, w_politics · f(x_1) = 0, w_tech · f(x_1) = 0
  Prediction: "sports" (wrong; y_1 = "politics")
  Update: subtract f(x_1) from w_sports, add f(x_1) to w_politics
Step 2: x_2 = "win the election", f(x_2) = {BIAS: 1, win: 1, the: 1}
  Weights after step 1:
           w_sports   w_politics   w_tech
  BIAS   :    0           1           0
  win    :   -1           1           0
  game   :    0           0           0
  vote   :   -1           1           0
  the    :   -1           1           0
  Scores: w_sports · f(x_2) = -2, w_politics · f(x_2) = 3, w_tech · f(x_2) = 0
  Prediction: "politics" (correct; no change)

Step 3: x_3 = "win the game", f(x_3) = {BIAS: 1, win: 1, game: 1, the: 1}
  Scores: w_sports · f(x_3) = -2, w_politics · f(x_3) = 3, w_tech · f(x_3) = 0
  Prediction: "politics" (wrong; y_3 = "sports")
  Update: add f(x_3) to w_sports, subtract f(x_3) from w_politics
Answer (weights after 3 updates):
           w_sports   w_politics   w_tech
  BIAS   :    1           0           0
  win    :    0           0           0
  game   :    1          -1           0
  vote   :   -1           1           0
  the    :    0           0           0
  ...

Properties of Perceptrons
• Separability: true if there exist weights w that get the training set perfectly correct
• Convergence: if the training data are separable, a perceptron will eventually converge (binary case)
• Mistake Bound: the maximum number of mistakes (updates, binary case) is related to the number of features k and the margin δ (degree of separability):

    mistakes < k / δ²

[Figures: a separable point set with margin δ, and a non-separable point set.]
Problems with the Perceptron
• Noise: if the data isn't separable, the weights might thrash
  • Averaging weight vectors over time can help (averaged perceptron; see the sketch below)
• Mediocre generalization: finds a "barely" separating solution
• Overtraining: test / held-out accuracy usually rises, then falls
  • Overtraining is a kind of overfitting

Improving the Perceptron
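As a concrete instance of the averaging fix mentioned above, here is a minimal sketch of a binary averaged perceptron: it trains as usual but returns the average of the weight vector over all steps rather than the final (possibly thrashing) one. The names and the `num_passes` parameter are mine:

```python
def train_averaged_perceptron(examples, num_passes=10):
    """Binary averaged perceptron over sparse feature dicts."""
    weights, totals, steps = {}, {}, 0
    for _ in range(num_passes):
        for features, y_star in examples:
            score = sum(weights.get(k, 0.0) * v for k, v in features.items())
            if (+1 if score > 0 else -1) != y_star:  # mistake: w <- w + y* . f(x)
                for k, v in features.items():
                    weights[k] = weights.get(k, 0.0) + y_star * v
            for k, v in weights.items():             # accumulate running sum of w
                totals[k] = totals.get(k, 0.0) + v
            steps += 1
    return {k: total / steps for k, total in totals.items()}
```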
Non-Separable Case: Deterministic Decision
• Even the best linear boundary makes at least one mistake

Non-Separable Case: Probabilistic Decision
[Figure: the same non-separable data with a probabilistic boundary; contour labels show P(+1) | P(-1) values of 0.9 | 0.1, 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7, and 0.1 | 0.9 across the boundary.]
How to get probabilistic decisions?
• Perceptron scoring:  z = w · f(x)
• If z = w · f(x) is very positive → want probability going to 1
• If z = w · f(x) is very negative → want probability going to 0
• Sigmoid function:

  φ(z) = 1 / (1 + e^{-z})

Best w?
• Maximum likelihood estimation:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

  with:
  P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^{-w · f(x^(i))})
  P(y^(i) = -1 | x^(i); w) = 1 - 1 / (1 + e^{-w · f(x^(i))})

• This is called Logistic Regression
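A minimal NumPy sketch of the sigmoid and this binary log-likelihood; it relies on the identity P(y | x; w) = φ(y · (w · f(x))), which follows from the two cases above since 1 - φ(z) = φ(-z). Function names are mine:

```python
import numpy as np

def sigmoid(z):
    """phi(z) = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """ll(w) = sum_i log P(y_i | x_i; w) for binary logistic regression.
    X: (n, d) array of feature vectors, y: (n,) labels in {+1, -1}, w: (d,)."""
    return np.sum(np.log(sigmoid(y * (X @ w))))
```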
Separable Case: Deterministic Decision – Many Options
[Figure: a separable point set with several different separating lines.]

Separable Case: Probabilistic Decision – Clear Preference
[Figure: the same data with probability contours (0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7) for two candidate boundaries; the larger-margin boundary is clearly preferred.]
Multiclass Logistic Regression
• Recall the multiclass perceptron:
  • A weight vector w_y for each class y
  • Score (activation) of a class y:  z_y = w_y · f(x)
  • Prediction: the class with the highest score wins
• How to turn scores into probabilities?

  z_1, z_2, z_3  →  e^{z_1} / (e^{z_1} + e^{z_2} + e^{z_3}),
                    e^{z_2} / (e^{z_1} + e^{z_2} + e^{z_3}),
                    e^{z_3} / (e^{z_1} + e^{z_2} + e^{z_3})

  original activations → softmax activations

Best w?
• Maximum likelihood estimation:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

  with:
  P(y^(i) | x^(i); w) = e^{w_y^(i) · f(x^(i))} / Σ_y e^{w_y · f(x^(i))}

• This is called Multi-Class Logistic Regression
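A matching NumPy sketch of the softmax and the multiclass log-likelihood; the max-shift is a standard numerical-stability trick, and the array shapes are my assumptions:

```python
import numpy as np

def softmax(z):
    """Map a vector of activations z to probabilities that sum to 1."""
    exp_z = np.exp(z - np.max(z))  # subtract max(z) to avoid overflow
    return exp_z / np.sum(exp_z)

def multiclass_log_likelihood(W, X, y):
    """ll(w) = sum_i log P(y_i | x_i; w) with softmax probabilities.
    W: (k, d) array, one weight row per class; X: (n, d); y: (n,) class indices."""
    scores = X @ W.T                                      # (n, k) activations w_y . f(x)
    shifted = scores - scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()
```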
Next Lecture
• Optimization
• i.e., how do we solve:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)