CS 678 Machine Learning Lecture Notes
1 Week 1 - chapter 1 and probability

1.1 General syllabus

• what do students know (prog. lang., stats, math, calculus)

1.2 machine learning

1.2.1 general concepts

• example of predicting basketball players (height and speed)
• detecting patterns or regularities
• application of ML to large databases is data mining
• pattern recognition (face recognition, fingerprint, character, etc.)
• combines math, statistics and computer science

1.2.2 examples of ML

• learning associations
• classification
  – classes
  – discriminant, prediction
  – OCR, face recognition, medical diagnosis, speech recognition
  – knowledge extraction, compression, outlier detection
• regression

1.3 probability

• events, probability and sample space
• axioms
  – $0 \le P(E) \le 1$
  – $P(S) = 1$; example:
    ∗ $E_1$ = die shows 1
    ∗ $S = E_1 \cup E_2 \cup E_3 \cup E_4 \cup E_5 \cup E_6$
    ∗ $P(E_2) = 1/6$
    ∗ $P(S) = 1$
  – $P(\cup_i E_i) = \sum_i P(E_i)$ for mutually exclusive events
  – $P(E \cup E^c) = P(E) + P(E^c) = 1$
  – $P(E \cup F) = P(E) + P(F) - P(E \cap F)$
• conditional probability:
  – $P(E|F) = P(E \cap F)/P(F)$
  – $P(F|E) = P(E|F)P(F)/P(E)$ — Bayes formula (show derivation)
    ∗ $E$ = have lung cancer, $F$ = smoke
    ∗ $P(E)$ = (people with lung cancer)/(all people) = .05
    ∗ $P(F)$ = (people who smoke)/(all people) = .50
    ∗ $P(F|E)$ = (people who smoke and have lung cancer)/(people who have lung cancer) = .80
    ∗ $P(E|F) = (.80 \cdot .05)/.50 = .08$
  – marginals
    ∗ $P(X) = \sum_i P(X|Y_i)P(Y_i)$
    ∗ $E_i$ = first die is $i$, $T$ = total of two dice
    ∗ $P(T = 7 | E_i) = 1/6$
    ∗ so $P(T = 7) = P(T = 7|E_1)P(E_1) + P(T = 7|E_2)P(E_2) + \dots = \sum_{i=1}^{6} 1/36 = 1/6$ (see the numeric check at the end of this section)
    ∗ also do the same with $P(T = 3)$
    ∗ can also be done with continuous distributions
  – $P(E_1|F) = \dfrac{P(F|E_1)P(E_1)}{P(F)} = \dfrac{P(F|E_1)P(E_1)}{\sum_i P(F|E_i)P(E_i)}$
  – $P(E \cap F) = P(E)P(F)$ if $E$ and $F$ are independent
    ∗ $P(E|F) = P(E \cap F)/P(F)$
    ∗ $P(E \cap F) = P(E|F)P(F)$
    ∗ so if $E$ and $F$ are independent, $P(E|F) = P(E)$
    ∗ for example, given the first die is 2, $P(\text{die 2} = 3) = 1/6$
    ∗ independence is THE big assumption in machine learning: i.i.d.
• random variables

  – probability distributions
    ∗ $F(a) = P\{X \le a\}$
    ∗ $P\{a < X \le b\} = F(b) - F(a)$
    ∗ discrete: $F(a) = \sum_{x \le a} P(x)$
    ∗ continuous: $F(a) = \int_{-\infty}^{a} p(x)\,dx$
  – joint distributions
    ∗ $F(x, y) = P\{X \le x, Y \le y\}$
    ∗ $F_X(x) = P\{X \le x, Y \le \infty\}$ marginal (show both the discrete and continuous)
  – conditional distributions: $P_{X|Y}(x|y) = P\{X = x | Y = y\} = \dfrac{P\{X = x, Y = y\}}{P\{Y = y\}}$
  – bayes rule: $P(y|x) = P(x|y)P_Y(y)/P_X(x)$ (posterior = likelihood × prior / evidence)
  – expectation (mean): $E[X] = \sum_i x_i P(x_i)$ or $E[X] = \int x\,p(x)\,dx$
  – variance: $\mathrm{Var}(X) = E[(X - \mu)^2] = E[X^2] - \mu^2$ (numeric check below)
  – distributions
    ∗ binomial
    ∗ multinomial
    ∗ uniform
    ∗ normal
    ∗ others (chi-sq, t, F, etc.)
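A few of the numbers above are easy to verify mechanically. Below is a minimal Python check I have added (not part of the original notes): the marginalization and Bayes computations from section 1.3, and the mean/variance identity for a fair die. The variable names are my own.

from fractions import Fraction

# Marginalization: P(T=7) = sum_i P(T=7 | E_i) P(E_i) = 6 * (1/36) = 1/6
p_T7 = sum(Fraction(1, 6) * Fraction(1, 6) for _ in range(6))

# Bayes formula, lung-cancer example: P(E|F) = P(F|E) P(E) / P(F)
p_E, p_F, p_F_given_E = 0.05, 0.50, 0.80
p_E_given_F = p_F_given_E * p_E / p_F           # 0.08

# Expectation and variance of a fair die: Var(X) = E[X^2] - mu^2
faces, p = range(1, 7), Fraction(1, 6)
mu = sum(x * p for x in faces)                  # E[X]   = 7/2
var = sum(x * x * p for x in faces) - mu ** 2   # Var(X) = 35/12

print(p_T7, p_E_given_F, mu, var)               # 1/6 0.08 7/2 35/12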

2 Week 2 - chapter 2 supervised learning

2.1 learning from examples

• positive, negative examples
• $x = (x_1, \dots, x_d)$ input representation (just the pertinent attributes)
• $X = \{x^t, r^t\}_{t=1}^{N}$
• hypothesis $h$, hypothesis class, parameters. $h(x) = 1$ if $h$ classifies $x$ as positive
• empirical error - the number of examples in $X$ the classifier gets wrong: $E(h|X) = \sum_{t=1}^{N} \mathbf{1}(h(x^t) \ne r^t)$
• generalization - most specific (S) vs. most general (G) hypotheses (false positives and negatives). Doubt - points in G − S are not certain, so we do not make a decision

2.2 vapnik-chervonenkis dimension

the maximum number of points that can be shattered by the hypothesis class. Draw example with 4 points and rectangles.

2.3 PAC learning

• want the error to be at most $\epsilon$; for the rectangle, each of the 4 boundary strips gets $\epsilon/4$
• prob that some strip is missed by all $N$ samples is at most $4(1 - \epsilon/4)^N$
• given the inequality $(1 - x) \le e^{-x}$, we want to choose $N$ and $\delta$ so that $4e^{-\epsilon N/4} \le \delta$, which leads to
• $N \ge (4/\epsilon)\ln(4/\delta)$
• example: with $\epsilon = .1$ and $\delta = .05$ we need at least 176 samples (a numeric check appears after section 2.5)

2.4 noise

imprecision in recording, labeling mistakes, additional attributes. Question: do you think it is possible to predict with certainty something like "will so-and-so like a particular movie" given all pertinent data? Complex models can be more accurate, but simple models are easier to use, train and explain, and may even generalize better, since complex models can overfit - occam's razor.

2.5 learning multiple classes

create rectangles for each class
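The PAC sample-size bound from section 2.3 can be checked with a few lines of Python. This is a sketch I have added; the function name is my own.

import math

def pac_samples(eps, delta):
    # N >= (4/eps) * ln(4/delta), derived from 4 * exp(-eps*N/4) <= delta
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_samples(0.1, 0.05))    # 176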

2.6 regression

• $X = \{x^t, r^t\}_{t=1}^{N}$ where $r^t \in \Re$
• interpolation: $r^t = f(x^t)$, regression: $r^t = f(x^t) + \epsilon$
• empirical error: $E(g|X) = \frac{1}{N}\sum_{t=1}^{N} [r^t - g(x^t)]^2$
• if linear: $g(x) = w_1 x_1 + \dots + w_d x_d + w_0 = \sum_{j=1}^{d} w_j x_j + w_0$
• with one attribute: $g(x) = w_1 x + w_0$
• error function: $E(w_1, w_0 | X) = \sum_{t=1}^{N} [r^t - (w_1 x^t + w_0)]^2$
• taking the partials, setting to zero and solving (a code sketch follows section 2.8):
  – $w_1 = \dfrac{\sum_{t=1}^{N} x^t r^t - \bar{x}\bar{r}N}{\sum_{t=1}^{N} (x^t)^2 - N\bar{x}^2}$
  – $w_0 = \bar{r} - w_1 \bar{x}$
• quadratic and higher-order polynomials

2.7 model selection and generalization

• Go over example in table 2.1.
• When the data does not identify a model with certainty, it is an ill-posed problem.
• Inductive bias is the set of assumptions that are adopted.
• Model selection is choosing the right bias.
• Underfitting is when the hypothesis is less complex than the underlying function.
• Overfitting is when the hypothesis is too complex.

2.8 dimensions of supervised ML algorithm (recap)

• model: $g(x|\theta)$
• loss function: $E(\theta|X) = \sum_{t=1}^{N} L(r^t, g(x^t|\theta))$
• optimization method: $\theta^* = \arg\min_\theta E(\theta|X)$
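The closed-form solution for $w_1$ and $w_0$ in section 2.6 translates directly into code. A minimal sketch I have added (the data values are invented for illustration):

def fit_line(x, r):
    # w1 = (sum x^t r^t - N xbar rbar) / (sum (x^t)^2 - N xbar^2); w0 = rbar - w1 xbar
    N = len(x)
    xbar = sum(x) / N
    rbar = sum(r) / N
    w1 = (sum(xt * rt for xt, rt in zip(x, r)) - N * xbar * rbar) \
         / (sum(xt ** 2 for xt in x) - N * xbar ** 2)
    w0 = rbar - w1 * xbar
    return w1, w0

x = [1.0, 2.0, 3.0, 4.0]
r = [2.1, 3.9, 6.2, 7.8]
print(fit_line(x, r))    # about (1.94, 0.15)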

2.9 implementation

• program to find most specific parameters (a minimal sketch follows this list)
• program to find most general parameters
• program to learn for multiple classes
• program to do regression (many packages)
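For the first item above, a sketch of one possible approach: with the rectangle hypothesis class from section 2.1, the most specific hypothesis S is the tightest axis-aligned rectangle enclosing the positive examples. This code and its data are my own illustration, not from the notes.

def most_specific_rectangle(examples):
    # examples: list of ((x1, x2), label) pairs, label 1 = positive
    pos = [p for p, label in examples if label == 1]
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))   # (x1_min, x1_max, x2_min, x2_max)

data = [((2, 3), 1), ((4, 5), 1), ((3, 4), 1), ((8, 1), 0)]
print(most_specific_rectangle(data))              # (2, 4, 3, 5)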

3 Week 3 - chapter 3 Bayesian decision theory

• observable ($x$) and unobservable ($z$) variables, $x = f(z)$
• choose the most probable event
• estimate $P(X)$ using samples, e.g. for coin tosses $\hat{p}_0 = \dfrac{\#\text{heads}}{\#\text{total tosses}}$

3.1 classification

• use the observable variables to predict the class
• choose $C = 1$ if $P(C = 1 | x_1, x_2) > .5$
• prob of error is $1 - \max(P(C = 1 | x_1, x_2), P(C = 0 | x_1, x_2))$
• bayes rule: $P(C|x) = \dfrac{p(x|C)P(C)}{p(x)}$
• prior is the probability of the class
• class likelihood is the probability of the data given the class
• evidence is the probability of the data, a normalization constant
• classifier: choose the class with the highest posterior prob: choose $C_i$ if $P(C_i|x) = \max_k P(C_k|x)$
• example: want to predict success of college applicant given: gpa, sat score
• example: predict a patient's reaction (get better, no diff, get worse) given their blood pressure and ethnic background

3.2 losses and risks

need to weight decisions as not all decisions have the same consequences

• let $\alpha_i$ be the action of choosing $C_i$
• and $\lambda_{ik}$ be the loss associated with taking action $\alpha_i$ when the class is really $C_k$
• then the risk of taking action $\alpha_i$ is $R(\alpha_i|x) = \sum_{k=1}^{K} \lambda_{ik} P(C_k|x)$
• zero-one loss is often assumed to simplify things. assigning risks can always be done as a post-processing step.
• example: say $P(C_0|x) = .4$ and $P(C_1|x) = .6$ but $\lambda_{00} = 0$, $\lambda_{01} = 10$, $\lambda_{10} = 20$ and $\lambda_{11} = 0$. So
  – $R(\alpha_0|x) = 0 \cdot .4 + 10 \cdot .6 = 6$
  – $R(\alpha_1|x) = 20 \cdot .4 + 0 \cdot .6 = 8$, so choose $\alpha_0$ (numeric check below)
• reject option - create one more $\alpha$ and $\lambda$
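A quick numeric check of the risk example above, added by me; the posteriors and losses are the ones given in the example.

posterior = {0: 0.4, 1: 0.6}               # P(C_k | x), from the notes
loss = {(0, 0): 0, (0, 1): 10,              # lambda_ik: loss of action alpha_i
        (1, 0): 20, (1, 1): 0}              # when the true class is C_k

def risk(i):
    # R(alpha_i | x) = sum_k lambda_ik P(C_k | x)
    return sum(loss[i, k] * posterior[k] for k in posterior)

print(risk(0), risk(1))                     # 6.0 8.0 -> choose alpha_0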

3.3 discriminant functions

• $g_i(x) = -R(\alpha_i|x)$
• $g_i(x) = P(x|C_i)P(C_i)$ when the zero-one loss function is used
• show briefly the quadratic discriminant

3.4 utility theory

• utility function: $EU(\alpha_i|x) = \sum_k U_{ik} P(S_k|x)$
• choose $\alpha_i$ if $EU(\alpha_i|x) = \max_j EU(\alpha_j|x)$
• typically defined in monetary terms

3.5 value of information

• assessing the value of additional information (attributes)
• expected utility of current best action: $EU(x) = \max_i \sum_k U_{ik} P(S_k|x)$
• with new feature $z$: $EU(x, z) = \max_i \sum_k U_{ik} P(S_k|x, z)$
• if $EU(x, z) > EU(x)$, then $z$ is useful, but only if the utility of the additional feature exceeds the cost of observing and processing it

3.6 bayesian nets

• define probabilistic networks, graphical models and DAGs
• (slides) define causal and diagnostic arcs in the network ($R$ = rain, $S$ = sprinkler, $W$ = wet grass in the classic example)
• explain $P(R|W) = \dfrac{P(W|R)P(R)}{P(W)}$
• explain $P(W|S) = P(W|R, S)P(R|S) + P(W|\neg R, S)P(\neg R|S)$
• $P(W) = P(W|R, S)P(R, S) + P(W|R, \neg S)P(R, \neg S) + P(W|\neg R, S)P(\neg R, S) + P(W|\neg R, \neg S)P(\neg R, \neg S)$
• explain why $P(S|R, W)$ is less than $P(S|W)$ ("explaining away"; see the numeric sketch at the end of these notes)
• local structure - results in storing fewer parameters and making computations easier
• belief propagation and junction trees are methods for efficiently solving nets
• classification

3.7 influence diagrams

3.8 association rules
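A numeric sketch of the rain/sprinkler/wet-grass computations in section 3.6, added by me. The conditional probability values below are invented for illustration (the notes give none); the point is only that the formulas compute, and that $P(S|R, W) < P(S|W)$: observing rain "explains away" the sprinkler.

p_R = 0.4                                   # P(rain) - assumed value
p_S = 0.2                                   # P(sprinkler), independent of rain here
p_W = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}   # P(wet | rain, sprinkler) - assumed

def pr(r): return p_R if r else 1 - p_R
def ps(s): return p_S if s else 1 - p_S

# P(W) = sum over r, s of P(W | r, s) P(r) P(s)
p_wet = sum(p_W[r, s] * pr(r) * ps(s)
            for r in (True, False) for s in (True, False))

# P(S | W) vs P(S | R, W)
p_s_given_w = sum(p_W[r, True] * pr(r) * p_S for r in (True, False)) / p_wet
p_w_given_r = sum(p_W[True, s] * ps(s) for s in (True, False))
p_s_given_rw = p_W[True, True] * p_S / p_w_given_r

print(round(p_wet, 3), round(p_s_given_w, 3), round(p_s_given_rw, 3))
# 0.52 0.354 0.209 -> P(S|R,W) < P(S|W)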
