

  1. PATTERN RECOGNITION AND MACHINE LEARNING
  Slide Set 2: Estimation Theory
  October 2019
  Heikki Huttunen, heikki.huttunen@tuni.fi
  Signal Processing, Tampere University

  2. Classical Estimation and Detection Theory
  • Before the machine learning part, we will take a look at classical estimation theory.
  • Estimation theory has many connections to the foundations of modern machine learning.
  • Outline of the next few hours:
    1. Estimation theory: fundamentals, maximum likelihood, examples.
    2. Detection theory: fundamentals, error metrics, examples.

  3. Introduction - Estimation
  • Our goal is to estimate the values of a group of parameters from data.
  • Examples: radar, sonar, speech, image analysis, biomedicine, communications, control, seismology, etc.
  • Parameter estimation: given an N-point data set x = {x[0], x[1], ..., x[N−1]} which depends on the unknown parameter θ ∈ R, we wish to design an estimator g(·) for θ:
    θ̂ = g(x[0], x[1], ..., x[N−1]).
  • The fundamental questions are:
    1. What is the model for our data?
    2. How to determine its parameters?

  4. Introductory Example – Straight line
  • Suppose we have the illustrated time series and would like to approximate the relationship of the two coordinates.
  • The relationship looks linear, so we could assume the following model:
    y[n] = a·x[n] + b + w[n],
    with a ∈ R and b ∈ R unknown and w[n] ~ N(0, σ²).
  • N(0, σ²) is the normal distribution with mean 0 and variance σ².
  (Figure: scatter plot of the data, y against x.)

  5. Introductory Example – Straight line
  • Each pair of a and b represents one line.
  • Which line of the three would best describe the data set? Or some other line?
  (Figure: the data with three candidate lines: model candidate 1 (a = 0.07, b = 0.49), model candidate 2 (a = 0.06, b = 0.33), model candidate 3 (a = 0.08, b = 0.51).)

  6. Introductory Example – Straight line
  • It can be shown that the best solution (in the maximum likelihood sense; to be defined later) is given by
    â = −6/(N(N+1)) · Σ_{n=0}^{N−1} y(n) + 12/(N(N²−1)) · Σ_{n=0}^{N−1} x(n)·y(n)
    b̂ = 2(2N−1)/(N(N+1)) · Σ_{n=0}^{N−1} y(n) − 6/(N(N+1)) · Σ_{n=0}^{N−1} x(n)·y(n).
  • Or, as we will later learn, in an easy matrix form:
    θ̂ = [â, b̂]ᵀ = (XᵀX)⁻¹ Xᵀ y
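The matrix form is easy to verify numerically. Below is a minimal Python sketch, not part of the original slides: the synthetic data and the true values a = 0.07, b = 0.5 are assumptions chosen for illustration. It builds the design matrix X and solves the normal equations.

```python
import numpy as np

# Synthetic line data (assumed true parameters, illustration only).
rng = np.random.default_rng(0)
N = 100
x = np.linspace(-10, 10, N)
y = 0.07 * x + 0.5 + rng.normal(0, 0.1, N)

# Design matrix X with rows [x[n], 1]; theta_hat = (X^T X)^{-1} X^T y.
X = np.column_stack([x, np.ones(N)])
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
a_hat, b_hat = theta_hat
print(f"a_hat = {a_hat:.4f}, b_hat = {b_hat:.4f}")
```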

  7. Introductory Example – Straight line
  • In this case, â = 0.07401 and b̂ = 0.49319, which produces the line shown on the right.
  • The line also minimizes the squared distances (green dashed lines) between the model (blue line) and the data (red circles).
  (Figure: best fit y = 0.0740x + 0.4932, sum of squares = 3.62.)

  8. Introductory Example 2 – Sinusoid
  • Consider transmitting the sinusoid below.
  (Figure: clean sinusoidal signal, about 160 samples.)

  9. Introductory Example 2 – Sinusoid
  • When the data is received, it is corrupted by noise and the received samples look like below.
  • Can we recover the parameters of the sinusoid?
  (Figure: noisy received samples, about 160 samples.)

  10. Introductory Example 2 – Sinusoid
  • In this case, the problem is to find good values for A, f₀ and φ in the following model:
    x[n] = A·cos(2π·f₀·n + φ) + w[n],
    with w[n] ~ N(0, σ²).

  11. Introductory Example 2 – Sinusoid
  • It can be shown that the maximum likelihood estimates (MLE) of the parameters A, f₀ and φ are given by
    f̂₀ = the value of f that maximizes |Σ_{n=0}^{N−1} x(n)·e^{−2πifn}|,
    Â = (2/N) · |Σ_{n=0}^{N−1} x(n)·e^{−2πi·f̂₀·n}|,
    φ̂ = arctan( −Σ_{n=0}^{N−1} x(n)·sin(2π·f̂₀·n) / Σ_{n=0}^{N−1} x(n)·cos(2π·f̂₀·n) ).
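As a quick sanity check, these estimators can be evaluated on simulated data. The sketch below is illustrative only: the true values and the noise level are assumptions, the frequency is found by a dense grid search over |Σ x(n)·e^{−2πifn}|, and arctan2 is used in place of arctan to resolve the quadrant of φ̂.

```python
import numpy as np

# Simulated noisy sinusoid (assumed true values, illustration only).
rng = np.random.default_rng(1)
N = 160
n = np.arange(N)
A_true, f0_true, phi_true = 0.667, 0.068, 0.609
x = A_true * np.cos(2 * np.pi * f0_true * n + phi_true) + rng.normal(0, 1.0, N)

# f0_hat: maximize |sum_n x[n] exp(-2*pi*i*f*n)| over a dense frequency grid.
freqs = np.linspace(0.001, 0.5, 10000)
spectrum = np.abs(np.exp(-2j * np.pi * np.outer(freqs, n)) @ x)
f0_hat = freqs[np.argmax(spectrum)]

# A_hat and phi_hat from the closed-form expressions above.
A_hat = 2 / N * np.abs(np.sum(x * np.exp(-2j * np.pi * f0_hat * n)))
phi_hat = np.arctan2(-np.sum(x * np.sin(2 * np.pi * f0_hat * n)),
                     np.sum(x * np.cos(2 * np.pi * f0_hat * n)))
print(f"f0_hat = {f0_hat:.3f}, A_hat = {A_hat:.3f}, phi_hat = {phi_hat:.3f}")
```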

  12. Introductory Example 2 – Sinusoid
  • It turns out that the sinusoidal parameter estimation is very successful:
    f̂₀ = 0.068 (0.068); Â = 0.692 (0.667); φ̂ = 0.238 (0.609)
  • The blue curve is the original sinusoid, and the red curve is the one estimated from the green circles.
  • The estimates are shown in the figure (true values in parentheses).

  13. Introductory Example 2 – Sinusoid
  • However, the results are different for each realization of the noise w[n]:
    f̂₀ = 0.068; Â = 0.652; φ̂ = −0.023
    f̂₀ = 0.066; Â = 0.660; φ̂ = 0.851
    f̂₀ = 0.067; Â = 0.786; φ̂ = 0.814
    f̂₀ = 0.459; Â = 0.618; φ̂ = 0.299
  (Figure: four noise realizations and the corresponding fitted sinusoids.)

  14. Introductory Example 2 – Sinusoid
  • Thus, we are not very interested in an individual case, but rather in the distributions of the estimates.
  • What are the expectations E[f̂₀], E[φ̂] and E[Â]?
  • What are their respective variances? (See the simulation sketch below.)
  • Could there be a better formula that would yield smaller variance?
  • If yes, how to discover the better estimators?
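One way to approach the first two questions empirically is a Monte Carlo simulation: generate many independent noise realizations, re-estimate the parameters for each, and look at the sample mean and variance of the estimates. A minimal sketch follows; the true values and noise level are assumptions, and only f̂₀ and Â are tracked for brevity.

```python
import numpy as np

# Monte Carlo approximation of the mean and variance of the sinusoidal MLEs.
rng = np.random.default_rng(2)
N, A, f0, phi, sigma = 160, 0.667, 0.068, 0.609, 1.0
n = np.arange(N)
freqs = np.linspace(0.001, 0.5, 2000)
E = np.exp(-2j * np.pi * np.outer(freqs, n))   # candidate-frequency grid

estimates = []
for _ in range(1000):
    x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(0, sigma, N)
    f0_hat = freqs[np.argmax(np.abs(E @ x))]
    A_hat = 2 / N * np.abs(np.sum(x * np.exp(-2j * np.pi * f0_hat * n)))
    estimates.append((f0_hat, A_hat))

estimates = np.array(estimates)
print("mean of (f0_hat, A_hat):", estimates.mean(axis=0))
print("variance of (f0_hat, A_hat):", estimates.var(axis=0))
```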

  15. LEAST SQUARES

  16. Least Squares
  • The general solution for the linear case is easy to remember.
  • Consider the line fitting case: y[n] = a·x[n] + b + w[n].
  • This can be written in matrix form as follows:
    [ y[0]   ]   [ x[0]   1 ]
    [ y[1]   ]   [ x[1]   1 ]   [ a ]
    [ y[2]   ] = [ x[2]   1 ] · [ b ] + w,
    [  ...   ]   [   ...    ]
    [ y[N−1] ]   [ x[N−1] 1 ]
    where the left-hand vector is y, the tall matrix is X, and [a, b]ᵀ = θ.

  17. Least Squares
  • Now the model is written compactly as y = X·θ + w.
  • Solution: the value of θ minimizing the error wᵀw is given by
    θ̂_LS = (XᵀX)⁻¹ Xᵀ y.
  • See sample Python code here: https://github.com/mahehu/SGN-41007/blob/master/code/Least_Squares.ipynb
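The linked notebook contains the instructor's worked example; a minimal self-contained sketch of the same formula (the random X, y and true θ here are purely illustrative) could look like this:

```python
import numpy as np

# General least squares: y = X theta + w, solved two equivalent ways.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))                 # N = 50 observations, 3 parameters
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(0, 0.1, 50)

theta_ls = np.linalg.inv(X.T @ X) @ X.T @ y           # (X^T X)^{-1} X^T y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerically preferred
print(theta_ls, theta_lstsq)
```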

  18. LS Example with two variables
  • Consider an example where microscope images have an uneven illumination.
  • The illumination pattern appears to have the highest brightness at the center.
  • Let’s try to fit a paraboloid to remove the effect:
    z(x, y) = c₁x² + c₂y² + c₃xy + c₄x + c₅y + c₆,
    with z(x, y) the brightness at pixel (x, y).

  19. LS Example with two variables
  • In matrix form, the model looks like the following:
    [z₁, z₂, ..., z_N]ᵀ = H·c + ε,        (1)
    where the k-th row of H is [x_k², y_k², x_k·y_k, x_k, y_k, 1], c = [c₁, c₂, ..., c₆]ᵀ, and z_k is the grayscale value at the k-th pixel (x_k, y_k).

  20. LS Example with two variables
  • As a result we get the LS fit by ĉ = (HᵀH)⁻¹ Hᵀ z:
    ĉ = (−0.000080, −0.000288, 0.000123, 0.022064, 0.284020, 106.538687)
  • Or, in other words,
    z(x, y) = −0.000080x² − 0.000288y² + 0.000123xy + 0.022064x + 0.284020y + 106.538687.

  21. LS Example with two variables
  • Finally, we remove the illumination component by subtraction.
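A sketch of the whole correction pipeline, under stated assumptions: `img` stands in for the microscope image (here replaced by a synthetic uneven-illumination image), the design matrix H follows the paraboloid model above, and the fitted surface is subtracted at the end. The coefficient values will differ from the slide's because the data is synthetic.

```python
import numpy as np

# Synthetic image with a smooth illumination gradient plus noise (assumption).
rng = np.random.default_rng(4)
ys, xs = np.mgrid[0:200, 0:300]                    # pixel coordinate grids
img = (100 + 0.02 * xs + 0.3 * ys
       - 0.0003 * (xs - 150.0) ** 2
       + rng.normal(0, 2, xs.shape))

x = xs.ravel().astype(float)
y = ys.ravel().astype(float)
z = img.ravel()

# Design matrix H: one row [x^2, y^2, xy, x, y, 1] per pixel.
H = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])
c, *_ = np.linalg.lstsq(H, z, rcond=None)          # c_hat = (H^T H)^{-1} H^T z

illumination = (H @ c).reshape(img.shape)          # fitted smooth surface
corrected = img - illumination                     # remove the illumination component
print("fitted coefficients:", np.round(c, 6))
```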

  22. MAXIMUM LIKELIHOOD

  23. Maximum Likelihood Estimation
  • Maximum likelihood estimation (MLE) is the most popular estimation approach due to its applicability in complicated estimation problems.
  • Maximization of likelihood also appears often as the optimality criterion in machine learning.
  • The method was proposed by Fisher in 1922, though he published the basic principle already in 1912 as a third-year undergraduate.
  • The basic principle is simple: find the parameter θ that is the most probable to have generated the data x.
  • The MLE may or may not be optimal in the minimum variance sense. It is not necessarily unbiased, either.

  24. The Likelihood Function
  • Consider again the problem of estimating the mean level A of noisy data.
  • Assume that the data originates from the following model:
    x[n] = A + w[n],
    where w[n] ~ N(0, σ²): a constant plus Gaussian random noise with zero mean and variance σ².
  (Figure: noisy samples fluctuating around the constant level A, about 200 samples.)
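Before writing the likelihood down formally, it can help to see it numerically. A small sketch (the true A, σ and the grid range are assumptions) evaluates the Gaussian log-likelihood over a grid of candidate values of A; the maximizer essentially coincides with the sample mean.

```python
import numpy as np

# Data from the constant-plus-noise model (assumed A and sigma).
rng = np.random.default_rng(5)
N, A_true, sigma = 200, 1.0, 1.0
x = A_true + rng.normal(0, sigma, N)

# Log-likelihood of a candidate A: sum_n log N(x[n]; A, sigma^2).
A_grid = np.linspace(0.0, 2.0, 1001)
loglik = np.array([-0.5 * np.sum((x - A) ** 2) / sigma**2
                   - 0.5 * N * np.log(2 * np.pi * sigma**2) for A in A_grid])

A_ml = A_grid[np.argmax(loglik)]
print(f"grid maximizer = {A_ml:.3f}, sample mean = {x.mean():.3f}")
```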
