
PATTERN RECOGNITION AND MACHINE LEARNING Polynomial Curve Fitting - PowerPoint PPT Presentation



  1. Christopher M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING

  2. Polynomial Curve Fitting

  3. Sum-of-Squares Error Function

  4. 0th Order Polynomial

  5. 1st Order Polynomial

  6. 3rd Order Polynomial

  7. 9th Order Polynomial

  8. Over-fitting: Root-Mean-Square (RMS) Error: $E_{\mathrm{RMS}} = \sqrt{2E(\mathbf{w}^\star)/N}$
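
The fits on the preceding slides can be sketched in a few lines of NumPy. The setup below is an assumption, not taken from the slides: targets are noisy samples of sin(2πx), and the sample sizes, noise level, seed, and the helper names `fit_polynomial` and `rms_error` are illustrative choices.

```python
import numpy as np

def fit_polynomial(x, t, order):
    """Least-squares coefficients w (lowest power first) minimizing the sum-of-squares error."""
    design = np.vander(x, order + 1, increasing=True)
    w, *_ = np.linalg.lstsq(design, t, rcond=None)
    return w

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), with E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2."""
    y = np.vander(x, len(w), increasing=True) @ w
    return float(np.sqrt(np.mean((y - t) ** 2)))

# Assumed toy data: sin(2*pi*x) plus Gaussian noise.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=x_train.size)
x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=x_test.size)

train_rms, test_rms = {}, {}
for order in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, order)
    train_rms[order] = rms_error(w, x_train, t_train)
    test_rms[order] = rms_error(w, x_test, t_test)
    print(order, train_rms[order], test_rms[order])
```

With 10 training points, the 9th-order polynomial has enough freedom to pass through every point, so its training error is essentially zero. The held-out error is the quantity that reveals over-fitting.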

  9. Polynomial Coefficients

  10. Data Set Size: 9th Order Polynomial

  11. Data Set Size: 9th Order Polynomial

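  12. Regularization: Penalize large coefficient values, $\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$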

  13. Regularization:

  14. Regularization:

  15. Regularization: $E_{\mathrm{RMS}}$ vs. $\ln \lambda$

  16. Polynomial Coefficients
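
The effect of the penalty on the coefficients can be illustrated with a closed-form regularized fit. This is a minimal sketch under assumptions: the data are noisy samples of sin(2πx), the two values of λ are illustrative, the bias coefficient is penalized along with the others, and `fit_regularized` is a hypothetical helper name.

```python
import numpy as np

def fit_regularized(x, t, order, lam):
    """Minimize 0.5 * sum_n (y(x_n, w) - t_n)^2 + 0.5 * lam * ||w||^2 in closed form."""
    design = np.vander(x, order + 1, increasing=True)
    gram = design.T @ design + lam * np.eye(order + 1)
    return np.linalg.solve(gram, design.T @ t)

# Assumed toy data: sin(2*pi*x) plus Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

w_weak = fit_regularized(x, t, order=9, lam=np.exp(-18.0))   # very weak penalty
w_strong = fit_regularized(x, t, order=9, lam=1.0)           # strong penalty
print(np.abs(w_weak).max(), np.abs(w_strong).max())
```

A larger λ shrinks the coefficient vector toward zero, which is the behaviour the coefficient table on the slide is meant to show.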

  17. The Gaussian Distribution

  18. Gaussian Parameter Estimation: Likelihood function $p(\mathbf{x}|\mu,\sigma^2) = \prod_{n=1}^{N}\mathcal{N}(x_n|\mu,\sigma^2)$

  19. Maximum (Log) Likelihood

  20. Properties of $\mu_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$
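
One property of the maximum likelihood estimates is that the mean estimate is unbiased while the variance estimate is biased by a factor (N − 1)/N. The simulation below is an illustrative check; the true parameters, the sample size N = 5, and the number of repetitions are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma2, n = 0.0, 1.0, 5

# Many independent data sets of size n, one per row.
samples = rng.normal(true_mu, np.sqrt(true_sigma2), size=(200_000, n))

mu_ml = samples.mean(axis=1)                     # ML estimate of the mean
sigma2_ml = samples.var(axis=1)                  # ML estimate, divides by N
sigma2_unbiased = samples.var(axis=1, ddof=1)    # divides by N - 1

print(mu_ml.mean(), sigma2_ml.mean(), sigma2_unbiased.mean())
```

Averaged over data sets, `sigma2_ml` comes out close to (N − 1)/N times the true variance, while the `ddof=1` version comes out close to the true variance.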

  21. Curve Fitting Re-visited

  22. Maximum Likelihood: Determine $\mathbf{w}_{\mathrm{ML}}$ by minimizing the sum-of-squares error, $E(\mathbf{w})$.

  23. Predictive Distribution

  24. MAP: A Step towards Bayes. Determine $\mathbf{w}_{\mathrm{MAP}}$ by minimizing the regularized sum-of-squares error, $\tilde{E}(\mathbf{w})$.

  25. Bayesian Curve Fitting

  26. Bayesian Predictive Distribution
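
For a polynomial model with a zero-mean Gaussian prior over the weights (precision α) and Gaussian noise (precision β), the predictive distribution is Gaussian with a mean and variance that depend on the input. The sketch below assumes that standard setup, with S⁻¹ = αI + β Σₙ φ(xₙ)φ(xₙ)ᵀ; the data and the values of α and β are illustrative, and `bayesian_predictive` is a hypothetical helper name.

```python
import numpy as np

def bayesian_predictive(x_train, t_train, x_new, order, alpha, beta):
    """Mean and variance of the Gaussian predictive distribution at x_new.

    alpha: precision of the zero-mean Gaussian prior over w (assumed known)
    beta:  precision of the Gaussian noise on the targets (assumed known)
    """
    phi = np.vander(x_train, order + 1, increasing=True)
    phi_new = np.vander(np.atleast_1d(x_new), order + 1, increasing=True)
    s_inv = alpha * np.eye(order + 1) + beta * phi.T @ phi
    s = np.linalg.inv(s_inv)
    mean = beta * phi_new @ s @ phi.T @ t_train
    # Noise variance plus the uncertainty in w projected onto phi(x).
    var = 1.0 / beta + np.sum((phi_new @ s) * phi_new, axis=1)
    return mean, var

# Assumed toy data: sin(2*pi*x) plus Gaussian noise.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=x_train.size)
x_new = np.linspace(0.0, 1.0, 5)
alpha, beta = 5e-3, 11.1
mean, var = bayesian_predictive(x_train, t_train, x_new, 9, alpha, beta)
print(mean, np.sqrt(var))
```

The predictive mean coincides with the regularized least-squares fit with λ = α/β. What the Bayesian treatment adds is the input-dependent variance.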

  27. Model Selection: Cross-Validation
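
Cross-validation for choosing the polynomial order can be sketched as follows. The number of folds, the toy data, and the helper name `cross_validated_rms` are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def cross_validated_rms(x, t, order, folds=4, seed=0):
    """Held-out RMS error: square root of the mean of the per-fold mean squared errors."""
    idx = np.random.default_rng(seed).permutation(x.size)
    fold_mse = []
    for held_out in np.array_split(idx, folds):
        train = np.setdiff1d(idx, held_out)
        w = P.polyfit(x[train], t[train], order)          # fit on the other folds
        residual = P.polyval(x[held_out], w) - t[held_out]
        fold_mse.append(np.mean(residual ** 2))
    return float(np.sqrt(np.mean(fold_mse)))

# Assumed toy data: sin(2*pi*x) plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

scores = {order: cross_validated_rms(x, t, order) for order in (0, 1, 3, 9)}
best_order = min(scores, key=scores.get)
print(scores, best_order)
```

Because the same seed is used for every order, all candidate models are scored on identical folds, which keeps the comparison fair.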

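  28. Parametric Distributions: Basic building blocks: $p(\mathbf{x}|\boldsymbol{\theta})$. Need to determine $\boldsymbol{\theta}$ given $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$. Representation: $\boldsymbol{\theta}^\star$ or $p(\boldsymbol{\theta})$? Recall Curve Fitting.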

  29. Binary Variables (1): Coin flipping: heads = 1, tails = 0. Bernoulli Distribution: $\mathrm{Bern}(x|\mu) = \mu^{x}(1-\mu)^{1-x}$

  30. Binary Variables (2): $N$ coin flips. Binomial Distribution: $\mathrm{Bin}(m|N,\mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m}$

  31. Binomial Distribution

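  32. Parameter Estimation (1): ML for Bernoulli. Given: $\mathcal{D} = \{x_1, \dots, x_N\}$ with $m$ heads (1) and $N - m$ tails (0), the maximum likelihood estimate is $\mu_{\mathrm{ML}} = m/N$.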

  33. Parameter Estimation (2): Example: if every observed toss in $\mathcal{D}$ landed heads, then $\mu_{\mathrm{ML}} = 1$. Prediction: all future tosses will land heads up. Overfitting to $\mathcal{D}$.

  34. Beta Distribution: Distribution over $\mu \in [0, 1]$.

  35. Bayesian Bernoulli: The Beta distribution provides the conjugate prior for the Bernoulli distribution.

  36. Beta Distribution

  37. Prior ∙ Likelihood = Posterior

  38. Properties of the Posterior: As the size of the data set, $N$, increases, $\mathbb{E}[\mu] \to \mu_{\mathrm{ML}}$ and $\operatorname{var}[\mu] \to 0$.

  39. Prediction under the Posterior: What is the probability that the next coin toss will land heads up?
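
The contrast between the maximum likelihood prediction and the prediction under a Beta posterior can be shown with two small functions. This is a sketch: the prior hyperparameters a = b = 2, the three-heads data set, and the function names are illustrative assumptions.

```python
def ml_estimate(heads, tails):
    """Maximum likelihood estimate mu_ML = m / N."""
    return heads / (heads + tails)

def posterior_predictive(heads, tails, a=2.0, b=2.0):
    """p(x = 1 | D) under a Beta(a, b) prior: (m + a) / (m + a + l + b)."""
    return (heads + a) / (heads + a + tails + b)

# Three tosses, all heads (assumed example data).
print(ml_estimate(3, 0))            # 1.0
print(posterior_predictive(3, 0))   # 5/7, about 0.714
```

The prior acts like extra pseudo-observations, so the Bayesian prediction stays away from the extreme value 1 that maximum likelihood produces on this tiny data set.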

  40. Multinomial Variables: 1-of-K coding scheme:

  41. ML Parameter Estimation: Given: $\mathcal{D} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$. To ensure $\sum_k \mu_k = 1$, use a Lagrange multiplier, $\lambda$.

  42. The Multinomial Distribution

  43. The Dirichlet Distribution: Conjugate prior for the multinomial distribution.

  44. Bayesian Multinomial (1)

  45. Bayesian Multinomial (2)
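
Because the Dirichlet is conjugate to the multinomial, the posterior update amounts to adding the observed counts to the prior parameters. The sketch below uses an assumed uniform prior and made-up counts; `dirichlet_posterior` is a hypothetical helper name.

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters: alpha_k + m_k for each of the K states."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

alpha_prior = [1.0, 1.0, 1.0]        # assumed uniform prior over K = 3 states
counts = np.array([5.0, 2.0, 0.0])   # assumed observed counts m_k

alpha_post = dirichlet_posterior(alpha_prior, counts)
posterior_mean = alpha_post / alpha_post.sum()
mu_ml = counts / counts.sum()

print(alpha_post)       # [6. 3. 1.]
print(posterior_mean)   # [0.6 0.3 0.1]
print(mu_ml)            # ML assigns probability 0 to the unseen state
```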

  46. The Gaussian Distribution

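  47. Maximum Likelihood for the Gaussian (1): Given i.i.d. data $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)^{\mathsf T}$, the log likelihood function is given by $\ln p(\mathbf{X}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu})$. Sufficient statistics: $\sum_n \mathbf{x}_n$ and $\sum_n \mathbf{x}_n\mathbf{x}_n^{\mathsf T}$.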

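  48. Maximum Likelihood for the Gaussian (2): Set the derivative of the log likelihood function to zero, $\frac{\partial}{\partial\boldsymbol{\mu}}\ln p(\mathbf{X}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = \sum_{n=1}^{N}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu}) = 0$, and solve to obtain $\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$. Similarly, $\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})^{\mathsf T}$.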

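  49. Maximum Likelihood for the Gaussian (3): Under the true distribution, $\mathbb{E}[\boldsymbol{\mu}_{\mathrm{ML}}] = \boldsymbol{\mu}$ and $\mathbb{E}[\boldsymbol{\Sigma}_{\mathrm{ML}}] = \frac{N-1}{N}\boldsymbol{\Sigma}$. Hence define $\tilde{\boldsymbol{\Sigma}} = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})^{\mathsf T}$.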
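
The maximum likelihood estimates for a multivariate Gaussian depend on the data only through two sums, the sufficient statistics. The sketch below computes both estimates from those sums; the true mean, covariance, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]],
                            size=500)
N = X.shape[0]

# Sufficient statistics: sum of x_n and sum of x_n x_n^T.
sum_x = X.sum(axis=0)
sum_xxT = X.T @ X

mu_ml = sum_x / N
sigma_ml = sum_xxT / N - np.outer(mu_ml, mu_ml)   # biased ML estimate
sigma_unbiased = sigma_ml * N / (N - 1)           # corrected estimate

print(mu_ml)
print(sigma_ml)
```

The two covariance estimates match `np.cov(X, rowvar=False, bias=True)` and `np.cov(X, rowvar=False)` respectively.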

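  50. Bayesian Inference for the Gaussian (1): Assume $\sigma^2$ is known. Given i.i.d. data $\mathbf{x} = \{x_1, \dots, x_N\}$, the likelihood function for $\mu$ is given by $p(\mathbf{x}|\mu) = \prod_{n=1}^{N} p(x_n|\mu) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2\right\}$. This has a Gaussian shape as a function of $\mu$ (but it is not a distribution over $\mu$).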

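  51. Bayesian Inference for the Gaussian (2): Combined with a Gaussian prior over $\mu$, $p(\mu) = \mathcal{N}(\mu|\mu_0,\sigma_0^2)$, this gives the posterior $p(\mu|\mathbf{x}) \propto p(\mathbf{x}|\mu)\,p(\mu)$. Completing the square over $\mu$, we see that $p(\mu|\mathbf{x}) = \mathcal{N}(\mu|\mu_N,\sigma_N^2)$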

  52. Bayesian Inference for the Gaussian (3) … where Note:

  53. Bayesian Inference for the Gaussian (4) Example: for N = 0, 1, 2 and 10.

  54. Bayesian Inference for the Gaussian (5): Sequential Estimation. The posterior obtained after observing $N - 1$ data points becomes the prior when we observe the $N$th data point.
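
Sequential estimation of the mean with known variance can be checked numerically: feeding the data one point at a time should give the same posterior as a single batch update. The sketch assumes the standard conjugate result 1/σ_N² = 1/σ₀² + N/σ² and μ_N = σ_N²(μ₀/σ₀² + Σₙ xₙ/σ²); the prior, the data, and the helper name `update_mean_posterior` are illustrative.

```python
import numpy as np

def update_mean_posterior(mu0, var0, x, noise_var):
    """Posterior N(mu | mu_n, var_n) after observing x, with known noise variance."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    precision = 1.0 / var0 + x.size / noise_var
    var_n = 1.0 / precision
    mu_n = var_n * (mu0 / var0 + x.sum() / noise_var)
    return mu_n, var_n

rng = np.random.default_rng(0)
noise_var = 0.1
data = rng.normal(0.8, np.sqrt(noise_var), size=10)   # assumed toy data

# Batch update with all ten points at once.
batch = update_mean_posterior(0.0, 0.1, data, noise_var)

# Sequential update: each posterior becomes the prior for the next point.
mu, var = 0.0, 0.1
for x_n in data:
    mu, var = update_mean_posterior(mu, var, x_n, noise_var)
sequential = (mu, var)

print(batch, sequential)
```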

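  55. Bayesian Inference for the Gaussian (6): Now assume $\mu$ is known. The likelihood function for $\lambda = 1/\sigma^2$ is given by $p(\mathbf{x}|\lambda) = \prod_{n=1}^{N}\mathcal{N}(x_n|\mu,\lambda^{-1}) \propto \lambda^{N/2}\exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n-\mu)^2\right\}$. This has a Gamma shape as a function of $\lambda$.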

  56. Bayesian Inference for the Gaussian (7) The Gamma distribution

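  57. Bayesian Inference for the Gaussian (8): Now we combine a Gamma prior, $\mathrm{Gam}(\lambda|a_0,b_0)$, with the likelihood function for $\lambda$ to obtain a posterior which we recognize as $\mathrm{Gam}(\lambda|a_N,b_N)$ with $a_N = a_0 + \frac{N}{2}$ and $b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^2 = b_0 + \frac{N}{2}\sigma^2_{\mathrm{ML}}$.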
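
The Gamma update for the precision with known mean is a two-line computation. The sketch assumes the standard conjugate result a_N = a₀ + N/2 and b_N = b₀ + ½ Σₙ (xₙ − μ)²; the prior values, the data, and the helper name `update_precision_posterior` are made up for illustration.

```python
import numpy as np

def update_precision_posterior(a0, b0, x, mu):
    """Parameters (a_N, b_N) of the Gamma posterior over the precision, mean mu known."""
    x = np.asarray(x, dtype=float)
    a_n = a0 + x.size / 2.0
    b_n = b0 + 0.5 * np.sum((x - mu) ** 2)
    return a_n, b_n

a_n, b_n = update_precision_posterior(a0=1.0, b0=1.0, x=[1.0, 3.0], mu=2.0)
print(a_n, b_n)     # 2.0 2.0
print(a_n / b_n)    # posterior mean of the precision: 1.0
```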

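  58. Bayesian Inference for the Gaussian (9): If both $\mu$ and $\lambda$ are unknown, the joint likelihood function is given by $p(\mathbf{x}|\mu,\lambda) = \prod_{n=1}^{N}\left(\frac{\lambda}{2\pi}\right)^{1/2}\exp\left\{-\frac{\lambda}{2}(x_n-\mu)^2\right\}$. We need a prior with the same functional dependence on $\mu$ and $\lambda$.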

  59. Bayesian Inference for the Gaussian (10): The Gaussian-gamma distribution • Quadratic in $\mu$. • Gamma distribution over $\lambda$. • Linear in $\lambda$. • Independent of $\mu$.

  60. Bayesian Inference for the Gaussian (11) The Gaussian-gamma distribution

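  61. Bayesian Inference for the Gaussian (12): Multivariate conjugate priors • $\boldsymbol{\mu}$ unknown, $\boldsymbol{\Lambda}$ known: $p(\boldsymbol{\mu})$ Gaussian. • $\boldsymbol{\Lambda}$ unknown, $\boldsymbol{\mu}$ known: $p(\boldsymbol{\Lambda})$ Wishart. • $\boldsymbol{\Lambda}$ and $\boldsymbol{\mu}$ unknown: $p(\boldsymbol{\mu}, \boldsymbol{\Lambda})$ Gaussian-Wishart.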

  62. Student’s t-Distribution where Infinite mixture of Gaussians.

  63. Student’s t-Distribution

  64. Student’s t-Distribution: Robustness to outliers: Gaussian vs. t-distribution.

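  65. Student’s t-Distribution: The $D$-variate case, where $\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu})$. Properties: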

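  66. The Exponential Family (1): $p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^{\mathsf T}\mathbf{u}(\mathbf{x})\}$, where $\boldsymbol{\eta}$ is the natural parameter and $g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\{\boldsymbol{\eta}^{\mathsf T}\mathbf{u}(\mathbf{x})\}\,\mathrm{d}\mathbf{x} = 1$, so $g(\boldsymbol{\eta})$ can be interpreted as a normalization coefficient.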

  67. The Exponential Family (2.1): The Bernoulli Distribution. Comparing with the general form we see that $\eta = \ln\left(\frac{\mu}{1-\mu}\right)$ and so $\mu = \sigma(\eta) = \frac{1}{1+\exp(-\eta)}$ (the logistic sigmoid).

  68. The Exponential Family (2.2): The Bernoulli distribution can hence be written as $p(x|\eta) = \sigma(-\eta)\exp(\eta x)$, where $u(x) = x$, $h(x) = 1$ and $g(\eta) = \sigma(-\eta)$.

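  69. The Exponential Family (3.1): The Multinomial Distribution $p(\mathbf{x}|\boldsymbol{\mu}) = \prod_{k=1}^{M}\mu_k^{x_k} = \exp\left\{\sum_{k=1}^{M}x_k\ln\mu_k\right\}$, where $\mathbf{x} = (x_1, \dots, x_M)^{\mathsf T}$, $\boldsymbol{\eta} = (\eta_1, \dots, \eta_M)^{\mathsf T}$, and $\eta_k = \ln\mu_k$. NOTE: The $\eta_k$ parameters are not independent since the corresponding $\mu_k$ must satisfy $\sum_{k=1}^{M}\mu_k = 1$.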

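  70. The Exponential Family (3.2): Let $\mu_M = 1 - \sum_{k=1}^{M-1}\mu_k$. This leads to $\eta_k = \ln\left(\frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}\right)$ and $\mu_k = \frac{\exp(\eta_k)}{1+\sum_{j=1}^{M-1}\exp(\eta_j)}$ (softmax). Here the $\eta_k$ parameters are independent. Note that $0 \le \mu_k \le 1$ and $\sum_{k=1}^{M-1}\mu_k \le 1$.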

  71. The Exponential Family (3.3) The Multinomial distribution can then be written as where
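
The logistic sigmoid and the softmax in its M − 1 parameter form can be written directly from these natural-parameter relations. The sketch assumes the standard forms μ = 1/(1 + exp(−η)) and μ_k = exp(η_k)/(1 + Σⱼ exp(η_j)); the function names and the example values are hypothetical, and the code does not guard against overflow for very large η.

```python
import numpy as np

def sigmoid(eta):
    """Logistic sigmoid: maps the Bernoulli natural parameter to mu."""
    return 1.0 / (1.0 + np.exp(-eta))

def softmax_from_natural(eta):
    """Map M - 1 independent natural parameters to M probabilities.

    The last probability is the remainder 1 - sum_k mu_k, which corresponds
    to fixing the natural parameter of the M-th state at zero.
    """
    eta = np.asarray(eta, dtype=float)
    denom = 1.0 + np.exp(eta).sum()
    return np.append(np.exp(eta) / denom, 1.0 / denom)

eta = np.array([0.5, -1.0])        # assumed example with M = 3 states
mu = softmax_from_natural(eta)
print(mu, mu.sum())

# Inverse map: eta_k = ln(mu_k / mu_M) recovers the natural parameters.
print(np.log(mu[:-1] / mu[-1]))
```

With a single natural parameter (M = 2) the softmax reduces to the logistic sigmoid, which ties the multinomial case back to the Bernoulli case.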

  72. The Exponential Family (4) The Gaussian Distribution where

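  73. ML for the Exponential Family (1): From the definition of $g(\boldsymbol{\eta})$ we get $\nabla g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\{\boldsymbol{\eta}^{\mathsf T}\mathbf{u}(\mathbf{x})\}\,\mathrm{d}\mathbf{x} + g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\{\boldsymbol{\eta}^{\mathsf T}\mathbf{u}(\mathbf{x})\}\,\mathbf{u}(\mathbf{x})\,\mathrm{d}\mathbf{x} = 0$. Thus $-\nabla\ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]$.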

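  74. ML for the Exponential Family (2): Given a data set, $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, the likelihood function is given by $p(\mathbf{X}|\boldsymbol{\eta}) = \left(\prod_{n=1}^{N}h(\mathbf{x}_n)\right)g(\boldsymbol{\eta})^{N}\exp\left\{\boldsymbol{\eta}^{\mathsf T}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)\right\}$. Thus we have $-\nabla\ln g(\boldsymbol{\eta}_{\mathrm{ML}}) = \frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)$, where $\sum_n \mathbf{u}(\mathbf{x}_n)$ is the sufficient statistic.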

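  75. Conjugate priors: For any member of the exponential family, there exists a prior $p(\boldsymbol{\eta}|\boldsymbol{\chi},\nu) = f(\boldsymbol{\chi},\nu)\,g(\boldsymbol{\eta})^{\nu}\exp\{\nu\,\boldsymbol{\eta}^{\mathsf T}\boldsymbol{\chi}\}$. Combining with the likelihood function, we get $p(\boldsymbol{\eta}|\mathbf{X},\boldsymbol{\chi},\nu) \propto g(\boldsymbol{\eta})^{\nu+N}\exp\left\{\boldsymbol{\eta}^{\mathsf T}\left(\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n) + \nu\boldsymbol{\chi}\right)\right\}$. The prior corresponds to $\nu$ pseudo-observations with value $\boldsymbol{\chi}$.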
