

1. Introduction to Statistical Machine Learning

Marcus Hutter
Canberra, ACT, 0200, Australia
http://www.hutter1.net/
ANU RSISE NICTA

Machine Learning Summer School MLSS-2009, 26 January – 6 February, Canberra

2. Abstract

This course provides a brief overview of the methods and practice of statistical machine learning. Its purpose is to (a) give a mini-introduction and background to logicians interested in the AI courses, and (b) to summarize the core concepts covered by the machine learning courses during this week. Topics covered include Bayesian inference and maximum likelihood modeling; regression, classification, density estimation, clustering, principal component analysis; parametric, semi-parametric, and non-parametric models; basis functions, neural networks, kernel methods, and graphical models; deterministic and stochastic optimization; overfitting, regularization, and validation.

3. Table of Contents

1. Introduction / Overview / Preliminaries
2. Linear Methods for Regression
3. Nonlinear Methods for Regression
4. Model Assessment & Selection
5. Large Problems
6. Unsupervised Learning
7. Sequential & (Re)Active Settings
8. Summary

4. 1 INTRO/OVERVIEW/PRELIMINARIES

• What is Machine Learning? Why Learn?
• Related Fields
• Applications of Machine Learning
• Supervised ↔ Unsupervised ↔ Reinforcement Learning
• Dichotomies in Machine Learning
• Mini-Introduction to Probabilities

5. What is Machine Learning?

Machine Learning is concerned with the development of algorithms and techniques that allow computers to learn. Learning in this context is the process of gaining understanding by constructing models of observed data, with the intention of using them for prediction.

Related fields:
• Artificial Intelligence: smart algorithms
• Statistics: inference from a sample
• Data Mining: searching through large volumes of data
• Computer Science: efficient algorithms and complex models

6. Why 'Learn'?

There is no need to "learn" to calculate payroll. Learning is used when:
• human expertise does not exist (navigating on Mars),
• humans are unable to explain their expertise (speech recognition),
• the solution changes over time (routing on a computer network),
• the solution needs to be adapted to particular cases (user biometrics).

Example: It is easier to write a program that learns to play checkers or backgammon well by self-play than to convert the expertise of a master player into a program.

7. Handwritten Character Recognition

An example of a difficult machine learning problem. Task: learn a general mapping from pixel images to digits from examples.

8. Applications of Machine Learning

Machine learning has a wide spectrum of applications, including:
• natural language processing,
• search engines,
• medical diagnosis,
• detecting credit card fraud,
• stock market analysis,
• bio-informatics, e.g. classifying DNA sequences,
• speech and handwriting recognition,
• object recognition in computer vision,
• playing games by learning from self-play: Checkers, Backgammon,
• robot locomotion.

9. Some Fundamental Types of Learning

• Supervised Learning: Classification, Regression
• Unsupervised Learning: Clustering, Density Estimation
• Reinforcement Learning: Agents
• Others: Semi-Supervised Learning, Association, Active Learning

10. Supervised Learning

• Prediction of future cases: use the rule to predict the output for future inputs
• Knowledge extraction: the rule is easy to understand
• Compression: the rule is simpler than the data it explains
• Outlier detection: exceptions that are not covered by the rule, e.g. fraud

11. Classification

Example: Credit scoring. Differentiate between low-risk and high-risk customers from their Income and Savings.

Discriminant: IF income > θ_1 AND savings > θ_2 THEN low-risk ELSE high-risk
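
A minimal sketch of this discriminant in Python; the threshold values and the two-argument interface are illustrative assumptions, not numbers from the slides:

    # Hypothetical threshold discriminant for credit scoring.
    THETA1 = 30_000   # income threshold theta_1 (assumed example value)
    THETA2 = 10_000   # savings threshold theta_2 (assumed example value)

    def classify(income: float, savings: float) -> str:
        """IF income > theta_1 AND savings > theta_2 THEN low-risk ELSE high-risk."""
        return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

    print(classify(45_000, 12_000))  # low-risk
    print(classify(45_000, 5_000))   # high-risk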

12. Regression

Example: Price y of a used car as a function of its age x: y = f(x) + noise.

13. Unsupervised Learning

• Learning "what normally happens"
• No output
• Clustering: grouping similar instances
• Example applications: customer segmentation in CRM; image compression: color quantization; bioinformatics: learning motifs

14. Reinforcement Learning

• Learning a policy: a sequence of outputs
• No supervised output, but delayed reward
• Credit assignment problem
• Game playing
• Robot in a maze
• Multiple agents, partial observability, ...

15. Dichotomies in Machine Learning

(machine) learning / statistical ⇔ logic/knowledge-based (GOFAI)
induction ⇔ prediction ⇔ decision ⇔ action
regression ⇔ classification
independent identically distributed ⇔ sequential / non-iid
online learning ⇔ offline/batch learning
passive prediction ⇔ active learning
parametric ⇔ non-parametric
conceptual/mathematical ⇔ computational issues
exact/principled ⇔ heuristic
supervised ⇔ unsupervised ⇔ reinforcement learning

16. Probability Basics

Probability is used to describe uncertain events: the chance or belief that something is or will be true.

Example: Fair Six-Sided Die:
• Sample space: Ω = {1, 2, 3, 4, 5, 6}
• Events: Even = {2, 4, 6}, Odd = {1, 3, 5} ⊆ Ω
• Probability: P(6) = 1/6, P(Even) = P(Odd) = 1/2
• Outcome: e.g. 6 ∈ Ω
• Conditional probability: P(6|Even) = P(6 and Even)/P(Even) = (1/6)/(1/2) = 1/3

General Axioms:
• P({}) = 0 ≤ P(A) ≤ 1 = P(Ω)
• P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
• P(A ∩ B) = P(A|B) P(B)
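
These numbers can be checked by direct enumeration over Ω; a small sketch using exact fractions:

    from fractions import Fraction

    omega = {1, 2, 3, 4, 5, 6}          # sample space of a fair die
    even, odd = {2, 4, 6}, {1, 3, 5}

    def prob(event):
        """Uniform probability: |event ∩ Ω| / |Ω|."""
        return Fraction(len(event & omega), len(omega))

    assert prob({6}) == Fraction(1, 6)
    assert prob(even) == prob(odd) == Fraction(1, 2)
    # Conditional probability P(6|Even) = P({6} ∩ Even) / P(Even) = 1/3
    assert prob({6} & even) / prob(even) == Fraction(1, 3)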

17. Probability Jargon

Example: (un)fair coin: Ω = {Tail, Head} ≃ {0, 1}. P(1) = θ ∈ [0, 1]:
• Likelihood: P(1101|θ) = θ × θ × (1−θ) × θ = θ³(1−θ)
• Maximum Likelihood (ML) estimate: θ̂ = argmax_θ P(1101|θ) = 3/4
• Prior: if we are indifferent, then P(θ) = const. = 1
• Evidence: P(1101) = Σ_θ P(1101|θ) P(θ) = 1/20 (actually ∫ dθ)
• Posterior: P(θ|1101) = P(1101|θ) P(θ) / P(1101) ∝ θ³(1−θ) (BAYES RULE!)
• Maximum a Posteriori (MAP) estimate: θ̂ = argmax_θ P(θ|1101) = 3/4
• Predictive distribution: P(1|1101) = P(11011)/P(1101) = 2/3
• Expectation: E[f|...] = Σ_θ f(θ) P(θ|...), e.g. E[θ|1101] = 2/3
• Variance: Var(θ) = E[(θ − Eθ)²|1101] = 2/63
• Probability density: P(θ) = (1/ε) P([θ, θ+ε]) for ε → 0
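
The quantities above can be verified numerically; a minimal sketch using an assumed grid approximation over θ (the grid size is arbitrary, not a method from the slides):

    import numpy as np

    theta = np.linspace(0.0, 1.0, 100_001)     # grid over theta in [0, 1]
    likelihood = theta**3 * (1 - theta)        # P(1101|theta) = theta^3 (1-theta)
    prior = np.ones_like(theta)                # indifferent (uniform) prior

    evidence = np.trapz(likelihood * prior, theta)   # P(1101) ≈ 1/20
    posterior = likelihood * prior / evidence        # Bayes' rule

    print(theta[np.argmax(likelihood)])              # ML estimate ≈ 0.75
    print(theta[np.argmax(posterior)])               # MAP estimate ≈ 0.75
    mean = np.trapz(theta * posterior, theta)        # E[theta|1101] ≈ 2/3,
    print(mean)                                      # which equals P(1|1101)
    print(np.trapz((theta - mean)**2 * posterior, theta))  # Var ≈ 2/63 ≈ 0.0317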

18. 2 LINEAR METHODS FOR REGRESSION

• Linear Regression
• Coefficient Subset Selection
• Coefficient Shrinkage
• Linear Methods for Classification
• Linear Basis Function Regression (LBFR)
• Piecewise Linear, Splines, Wavelets
• Local Smoothing & Kernel Regression
• Regularization & 1D Smoothing Splines

19. Linear Regression

Fitting a linear function to the data:
• Input "feature" vector x := (1 ≡ x^(0), x^(1), ..., x^(d)) ∈ ℝ^(d+1)
• Real-valued noisy response y ∈ ℝ
• Linear regression model: ŷ = f_w(x) = w_0 x^(0) + ... + w_d x^(d)
• Data: D = ((x_1, y_1), ..., (x_n, y_n))
• Error or loss function, e.g. residual sum of squares: Loss(w) = Σ_{i=1}^n (y_i − f_w(x_i))²
• Least squares (LSQ) regression: ŵ = argmin_w Loss(w)
• Example: person's weight y as a function of age x_1 and height x_2
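
A minimal least-squares sketch; the synthetic data, dimensions, and seed are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 2
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # x^(0) ≡ 1
    w_true = np.array([1.0, 2.0, -3.0])
    y = X @ w_true + 0.1 * rng.normal(size=n)                  # noisy response

    # LSQ: w_hat = argmin_w sum_i (y_i - f_w(x_i))^2
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w_hat)   # close to w_true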

20. Coefficient Subset Selection

Problems with least squares regression if d is large:
• Overfitting: the plane fits the data well (perfectly for d ≥ n), but predicts (generalizes) badly.
• Interpretation: we want to identify a small subset of features that are important/relevant for predicting y.

Solution 1: Subset selection: take those k out of d features that minimize the LSQ error.
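
A brute-force sketch of this selection (exhaustive search over all k-subsets, so only feasible for small d); the helper name and the unpenalized intercept handling are assumptions:

    from itertools import combinations
    import numpy as np

    def best_subset(X, y, k):
        """Pick the k feature columns (keeping the intercept, column 0)
        whose least-squares fit minimizes the residual sum of squares."""
        best, best_rss = None, np.inf
        for subset in combinations(range(1, X.shape[1]), k):
            cols = [0] + list(subset)          # always keep the intercept
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ w) ** 2)
            if rss < best_rss:
                best, best_rss = subset, rss
        return best, best_rss

    # e.g. best_subset(X, y, k=1) with X, y from the regression sketch above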

21. Coefficient Shrinkage

Solution 2: Shrinkage methods: shrink the least squares ŵ by penalizing the Loss:
• Ridge regression: add a penalty ∝ ||w||_2²
• Lasso: add a penalty ∝ ||w||_1
• Bayesian linear regression: compute the MAP estimate argmax_w P(w|D) from a prior P(w) and sampling model P(D|w)

Weights of low-variance components shrink most.
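
Ridge regression admits the closed form ŵ = (XᵀX + λI)⁻¹ Xᵀy; a minimal sketch, where the choice of λ and leaving the intercept unpenalized are illustrative assumptions:

    import numpy as np

    def ridge(X, y, lam=1.0):
        """Closed-form ridge regression: w = (X'X + lam*I)^{-1} X'y."""
        penalty = lam * np.eye(X.shape[1])
        penalty[0, 0] = 0.0                     # do not shrink the intercept
        return np.linalg.solve(X.T @ X + penalty, X.T @ y)

    # Larger lam shrinks the weights toward zero;
    # ridge(X, y, lam=0.0) recovers ordinary least squares.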
