CS 678 Machine Learning Lecture Notes
1 Week 1 - chapter 1 and probability

1.1 General syllabus

• what do students know (prog. lang., stats, math, calculus)

1.2 machine learning

1.2.1 general concepts

• example of predicting basketball players (height and speed)
• detecting patterns or regularities
• application of ML to large databases is data mining
• pattern recognition (face recognition, fingerprint, character, etc.)
• combines math, statistics and computer science

1.2.2 examples of ML

• learning associations
• classification
  – classes
  – discriminant, prediction
  – OCR, face recognition, medical diagnosis, speech recognition
  – knowledge extraction, compression, outlier detection
• regression

1.3 probability

• events, probability and sample space
• axioms
  – $0 \le P(E) \le 1$
  – $P(S) = 1$; example:
    ∗ $E_1$ = die shows 1
    ∗ $S = E_1 \cup E_2 \cup E_3 \cup E_4 \cup E_5 \cup E_6$
    ∗ $P(E_2) = 1/6$
    ∗ $P(S) = 1$
  – $P(\cup_i E_i) = \sum_i P(E_i)$ for mutually exclusive events
  – $P(E \cup E^c) = P(E) + P(E^c) = 1$
  – $P(E \cup F) = P(E) + P(F) - P(E \cap F)$
• conditional probability:
  – $P(E|F) = P(E \cap F)/P(F)$
  – $P(F|E) = P(E|F)P(F)/P(E)$ — Bayes formula (show derivation)
    ∗ $E$ = have lung cancer, $F$ = smoke
    ∗ $P(E)$ = (people with lung cancer)/(all people) = .05
    ∗ $P(F)$ = (people who smoke)/(all people) = .50
    ∗ $P(F|E)$ = (people who smoke and have lung cancer)/(people who have lung cancer) = .80
    ∗ $P(E|F) = (.80 \cdot .05)/.50 = .08$
  – marginals
    ∗ $P(X) = \sum_i P(X|Y_i)P(Y_i)$
    ∗ $E_i$ = first die is $i$, $T$ = total of two dice
    ∗ $P(T = 7 | E_i) = 1/6$
    ∗ so $P(T = 7) = P(T = 7|E_1)P(E_1) + P(T = 7|E_2)P(E_2) + \dots = \sum_{i=1}^{6} 1/36 = 1/6$ (see the numeric check at the end of this section)
    ∗ also do the same with $P(T = 3)$
    ∗ can also be done with continuous distributions
  – $P(E_1|F) = \dfrac{P(F|E_1)P(E_1)}{P(F)} = \dfrac{P(F|E_1)P(E_1)}{\sum_i P(F|E_i)P(E_i)}$
  – $P(E \cap F) = P(E)P(F)$ if $E$ and $F$ are independent
    ∗ $P(E|F) = P(E \cap F)/P(F)$
    ∗ $P(E \cap F) = P(E|F)P(F)$
    ∗ so if $E$ and $F$ are independent, $P(E|F) = P(E)$
    ∗ for example, given the first die is 2, $P(\text{die 2} = 3) = 1/6$
    ∗ independence is THE big assumption in machine learning: i.i.d.
• random variables

  – probability distributions
    ∗ $F(a) = P\{X \le a\}$
    ∗ $P\{a < X \le b\} = F(b) - F(a)$
    ∗ discrete: $F(a) = \sum_{x \le a} P(x)$
    ∗ continuous: $F(a) = \int_{-\infty}^{a} p(x)\,dx$
  – joint distributions
    ∗ $F(x, y) = P\{X \le x, Y \le y\}$
    ∗ $F_X(x) = P\{X \le x, Y \le \infty\}$ marginal (show both the discrete and continuous)
  – conditional distributions: $P_{X|Y}(x|y) = P\{X = x | Y = y\} = \dfrac{P\{X = x, Y = y\}}{P\{Y = y\}}$
  – bayes rule: $P(y|x) = P(x|y)P_Y(y)/P_X(x)$ (posterior = likelihood × prior / evidence)
  – expectation (mean): $E[X] = \sum_i x_i P(x_i)$ or $E[X] = \int x\,p(x)\,dx$
  – variance: $\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - \mu^2$ (numeric check below)
  – distributions
    ∗ binomial
    ∗ multinomial
    ∗ uniform
    ∗ normal
    ∗ others (chi-sq, t, F, etc.)
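A few of the numbers above are easy to verify mechanically. Below is a minimal Python check I have added (not part of the original notes): the marginalization and Bayes computations from section 1.3, and the mean/variance identity for a fair die. The variable names are my own.

from fractions import Fraction

# Marginalization: P(T=7) = sum_i P(T=7 | E_i) P(E_i) = 6 * (1/36) = 1/6
p_T7 = sum(Fraction(1, 6) * Fraction(1, 6) for _ in range(6))

# Bayes formula, lung-cancer example: P(E|F) = P(F|E) P(E) / P(F)
p_E, p_F, p_F_given_E = 0.05, 0.50, 0.80
p_E_given_F = p_F_given_E * p_E / p_F           # 0.08

# Expectation and variance of a fair die: Var(X) = E[X^2] - mu^2
faces, p = range(1, 7), Fraction(1, 6)
mu = sum(x * p for x in faces)                  # E[X]   = 7/2
var = sum(x * x * p for x in faces) - mu ** 2   # Var(X) = 35/12

print(p_T7, p_E_given_F, mu, var)               # 1/6 0.08 7/2 35/12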

2 Week 2 - chapter 2 supervised learning

2.1 learning from examples

• positive, negative examples
• $x = (x_1, \dots, x_d)$ input representation (just the pertinent attributes)
• $X = \{x^t, r^t\}_{t=1}^{N}$
• hypothesis $h$, hypothesis class, parameters. $h(x) = 1$ if $h$ classifies $x$ as positive
• empirical error - the number of examples in $X$ the classifier gets wrong: $E(h|X) = \sum_{t=1}^{N} \mathbf{1}(h(x^t) \ne r^t)$
• generalization - most specific (S) vs. most general (G) hypotheses (false positives and negatives). Doubt - points in G − S are not certain, so we do not make a decision

2.2 vapnik-chervonenkis dimension

the maximum number of points that can be shattered by the hypothesis class. Draw example with 4 points and rectangles.

2.3 PAC learning

• want the error to be at most $\epsilon$; for the rectangle, each of the 4 boundary strips gets $\epsilon/4$
• prob that some strip is missed by all $N$ samples is at most $4(1 - \epsilon/4)^N$
• given the inequality $(1 - x) \le e^{-x}$, we want to choose $N$ and $\delta$ so that $4e^{-\epsilon N/4} \le \delta$, which leads to
• $N \ge (4/\epsilon)\ln(4/\delta)$
• example: with $\epsilon = .1$ and $\delta = .05$ we need at least 176 samples (a numeric check appears after section 2.5)

2.4 noise

imprecision in recording, labeling mistakes, additional attributes. Question: do you think it is possible to predict with certainty something like "will so-and-so like a particular movie" given all pertinent data? Complex models can be more accurate, but simple models are easier to use, train and explain, and may even generalize better, since complex models can overfit - occam's razor.

2.5 learning multiple classes

create rectangles for each class
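The PAC sample-size bound from section 2.3 can be checked with a few lines of Python. This is a sketch I have added; the function name is my own.

import math

def pac_samples(eps, delta):
    # N >= (4/eps) * ln(4/delta), derived from 4 * exp(-eps*N/4) <= delta
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_samples(0.1, 0.05))    # 176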

2.6 regression

• $X = \{x^t, r^t\}_{t=1}^{N}$ where $r^t \in \Re$
• interpolation: $r^t = f(x^t)$, regression: $r^t = f(x^t) + \epsilon$
• empirical error: $E(g|X) = \frac{1}{N}\sum_{t=1}^{N} [r^t - g(x^t)]^2$
• if linear: $g(x) = w_1 x_1 + \dots + w_d x_d + w_0 = \sum_{j=1}^{d} w_j x_j + w_0$
• with one attribute: $g(x) = w_1 x + w_0$
• error function: $E(w_1, w_0 | X) = \sum_{t=1}^{N} [r^t - (w_1 x^t + w_0)]^2$
• taking the partials, setting to zero and solving (a code sketch follows section 2.8):
  – $w_1 = \dfrac{\sum_{t=1}^{N} x^t r^t - \bar{x}\bar{r}N}{\sum_{t=1}^{N} (x^t)^2 - N\bar{x}^2}$
  – $w_0 = \bar{r} - w_1 \bar{x}$
• quadratic and higher-order polynomials

2.7 model selection and generalization

• Go over example in table 2.1.
• When the data does not identify a model with certainty, it is an ill-posed problem.
• Inductive bias is the set of assumptions that are adopted.
• Model selection is choosing the right bias.
• Underfitting is when the hypothesis is less complex than the underlying function.
• Overfitting is when the hypothesis is too complex.

2.8 dimensions of supervised ML algorithm (recap)

• model: $g(x|\theta)$
• loss function: $E(\theta|X) = \sum_{t=1}^{N} L(r^t, g(x^t|\theta))$
• optimization method: $\theta^* = \arg\min_\theta E(\theta|X)$
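The closed-form solution for $w_1$ and $w_0$ in section 2.6 translates directly into code. A minimal sketch I have added (the data values are invented for illustration):

def fit_line(x, r):
    # w1 = (sum x^t r^t - N xbar rbar) / (sum (x^t)^2 - N xbar^2); w0 = rbar - w1 xbar
    N = len(x)
    xbar = sum(x) / N
    rbar = sum(r) / N
    w1 = (sum(xt * rt for xt, rt in zip(x, r)) - N * xbar * rbar) \
         / (sum(xt ** 2 for xt in x) - N * xbar ** 2)
    w0 = rbar - w1 * xbar
    return w1, w0

x = [1.0, 2.0, 3.0, 4.0]
r = [2.1, 3.9, 6.2, 7.8]
print(fit_line(x, r))    # about (1.94, 0.15)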

2.9 implementation

• program to find most specific parameters (a minimal sketch follows this list)
• program to find most general parameters
• program to learn for multiple classes
• program to do regression (many packages)
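For the first item above, a sketch of one possible approach: with the rectangle hypothesis class from section 2.1, the most specific hypothesis S is the tightest axis-aligned rectangle enclosing the positive examples. This code and its data are my own illustration, not from the notes.

def most_specific_rectangle(examples):
    # examples: list of ((x1, x2), label) pairs, label 1 = positive
    pos = [p for p, label in examples if label == 1]
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))   # (x1_min, x1_max, x2_min, x2_max)

data = [((2, 3), 1), ((4, 5), 1), ((3, 4), 1), ((8, 1), 0)]
print(most_specific_rectangle(data))              # (2, 4, 3, 5)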

3 Week 3 - chapter 3 Bayesian decision theory

• observable ($x$) and unobservable ($z$) variables, $x = f(z)$
• choose the most probable event
• estimate $P(X)$ using samples, e.g. for coin tosses $\hat{p}_0 = \dfrac{\#\text{heads}}{\#\text{total tosses}}$

3.1 classification

• use the observable variables to predict the class
• choose $C = 1$ if $P(C = 1 | x_1, x_2) > .5$
• prob of error is $1 - \max(P(C = 1 | x_1, x_2), P(C = 0 | x_1, x_2))$
• bayes rule: $P(C|x) = \dfrac{p(x|C)P(C)}{p(x)}$
• prior is the probability of the class
• class likelihood is the probability of the data given the class
• evidence is the probability of the data, a normalization constant
• classifier: choose the class with the highest posterior prob: choose $C_i$ if $P(C_i|x) = \max_k P(C_k|x)$
• example: want to predict success of college applicant given: gpa, sat score
• example: predict a patient's reaction (get better, no diff, get worse) given their blood pressure and ethnic background

3.2 losses and risks

need to weight decisions as not all decisions have the same consequences

• let $\alpha_i$ be the action of choosing $C_i$
• and $\lambda_{ik}$ be the loss associated with taking action $\alpha_i$ when the class is really $C_k$
• then the risk of taking action $\alpha_i$ is $R(\alpha_i|x) = \sum_{k=1}^{K} \lambda_{ik} P(C_k|x)$
• zero-one loss is often assumed to simplify things. assigning risks can always be done as a post-processing step.
• example: say $P(C_0|x) = .4$ and $P(C_1|x) = .6$ but $\lambda_{00} = 0$, $\lambda_{01} = 10$, $\lambda_{10} = 20$ and $\lambda_{11} = 0$. So
  – $R(\alpha_0|x) = 0 \cdot .4 + 10 \cdot .6 = 6$
  – $R(\alpha_1|x) = 20 \cdot .4 + 0 \cdot .6 = 8$, so choose $\alpha_0$ (numeric check below)
• reject option - create one more $\alpha$ and $\lambda$
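A quick numeric check of the risk example above, added by me; the posteriors and losses are the ones given in the example.

posterior = {0: 0.4, 1: 0.6}               # P(C_k | x), from the notes
loss = {(0, 0): 0, (0, 1): 10,              # lambda_ik: loss of action alpha_i
        (1, 0): 20, (1, 1): 0}              # when the true class is C_k

def risk(i):
    # R(alpha_i | x) = sum_k lambda_ik P(C_k | x)
    return sum(loss[i, k] * posterior[k] for k in posterior)

print(risk(0), risk(1))                     # 6.0 8.0 -> choose alpha_0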

3.3 discriminant functions

• $g_i(x) = -R(\alpha_i|x)$
• $g_i(x) = P(x|C_i)P(C_i)$ when the zero-one loss function is used
• show briefly the quadratic discriminant

3.4 utility theory

• utility function: $EU(\alpha_i|x) = \sum_k U_{ik} P(S_k|x)$
• choose $\alpha_i$ if $EU(\alpha_i|x) = \max_j EU(\alpha_j|x)$
• typically defined in monetary terms

3.5 value of information

• assessing the value of additional information (attributes)
• expected utility of current best action: $EU(x) = \max_i \sum_k U_{ik} P(S_k|x)$
• with new feature $z$: $EU(x, z) = \max_i \sum_k U_{ik} P(S_k|x, z)$
• if $EU(x, z) > EU(x)$, then $z$ is useful, but only if the utility of the additional feature exceeds the cost of observing and processing it

3.6 bayesian nets

• define probabilistic networks, graphical models and DAGs
• (slides) define causal and diagnostic arcs in the network ($R$ = rain, $S$ = sprinkler, $W$ = wet grass in the classic example)
• explain $P(R|W) = \dfrac{P(W|R)P(R)}{P(W)}$
• explain $P(W|S) = P(W|R, S)P(R|S) + P(W|\neg R, S)P(\neg R|S)$
• $P(W) = P(W|R, S)P(R, S) + P(W|R, \neg S)P(R, \neg S) + P(W|\neg R, S)P(\neg R, S) + P(W|\neg R, \neg S)P(\neg R, \neg S)$
• explain why $P(S|R, W)$ is less than $P(S|W)$ ("explaining away"; see the numeric sketch at the end of these notes)
• local structure - results in storing fewer parameters and making computations easier
• belief propagation and junction trees are methods for efficiently solving nets
• classification

3.7 influence diagrams

3.8 association rules
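A numeric sketch of the rain/sprinkler/wet-grass computations in section 3.6, added by me. The conditional probability values below are invented for illustration (the notes give none); the point is only that the formulas compute, and that $P(S|R, W) < P(S|W)$: observing rain "explains away" the sprinkler.

p_R = 0.4                                   # P(rain) - assumed value
p_S = 0.2                                   # P(sprinkler), independent of rain here
p_W = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}   # P(wet | rain, sprinkler) - assumed

def pr(r): return p_R if r else 1 - p_R
def ps(s): return p_S if s else 1 - p_S

# P(W) = sum over r, s of P(W | r, s) P(r) P(s)
p_wet = sum(p_W[r, s] * pr(r) * ps(s)
            for r in (True, False) for s in (True, False))

# P(S | W) vs P(S | R, W)
p_s_given_w = sum(p_W[r, True] * pr(r) * p_S for r in (True, False)) / p_wet
p_w_given_r = sum(p_W[True, s] * ps(s) for s in (True, False))
p_s_given_rw = p_W[True, True] * p_S / p_w_given_r

print(round(p_wet, 3), round(p_s_given_w, 3), round(p_s_given_rw, 3))
# 0.52 0.354 0.209 -> P(S|R,W) < P(S|W)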
