Likelihood of Data

• Consider n I.I.D. random variables X_1, X_2, ..., X_n
  o Each X_i is a sample from density function f(X_i | θ)
  o Note: we now explicitly specify the parameter θ of the distribution
• We want to determine how "likely" the observed data (x_1, x_2, ..., x_n) is based on the density f(X_i | θ)
• Define the Likelihood function, L(θ):
  $L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$
  o This is just a product since the X_i are I.I.D.
• Intuitively: what is the probability of the observed data under density function f(X_i | θ), for some choice of θ?
• Demo

Maximum Likelihood Estimator

• The Maximum Likelihood Estimator (MLE) of θ is the value of θ that maximizes L(θ)
• More formally:
  $\theta_{MLE} = \arg\max_{\theta} L(\theta)$
• It is more convenient to use the log-likelihood function, LL(θ):
  $LL(\theta) = \log L(\theta) = \log \prod_{i=1}^{n} f(X_i \mid \theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$
• Note that the log function is "monotone" for positive values
  o Formally: x ≤ y ⟹ log(x) ≤ log(y) for all x, y > 0
• So, the θ that maximizes LL(θ) also maximizes L(θ)
  o Formally: $\arg\max_{\theta} LL(\theta) = \arg\max_{\theta} L(\theta)$
  o Similarly, for any positive constant c (not dependent on θ):
    $\arg\max_{\theta} \big(c \cdot LL(\theta)\big) = \arg\max_{\theta} LL(\theta) = \arg\max_{\theta} L(\theta)$

Computing the MLE

• General approach for finding the MLE of θ:
  o Determine a formula for LL(θ)
  o Differentiate LL(θ) w.r.t. (each) θ: $\frac{\partial LL(\theta)}{\partial \theta}$
  o To maximize, set $\frac{\partial LL(\theta)}{\partial \theta} = 0$
  o Solve the resulting (simultaneous) equations to get $\hat{\theta}_{MLE}$
• Make sure the derived $\hat{\theta}_{MLE}$ is actually a maximum (and not a minimum or saddle point), e.g., check that $LL(\hat{\theta}_{MLE} \pm \epsilon) < LL(\hat{\theta}_{MLE})$
  o This step is often ignored in expository derivations
  o So, we'll ignore it here too (and won't require it in this class)
• For many standard distributions, someone has already done this work for you. (Yay!)

Maximizing Likelihood with Bernoulli

• Consider I.I.D. random variables X_1, X_2, ..., X_n with X_i ~ Ber(p)
• The probability mass function f(X_i | p) can be written as:
  $f(X_i \mid p) = p^{x_i}(1-p)^{1-x_i}$ where $x_i = 0$ or $1$
• Likelihood:
  $L(p) = \prod_{i=1}^{n} p^{X_i}(1-p)^{1-X_i}$
• Log-likelihood:
  $LL(p) = \sum_{i=1}^{n} \log\!\left(p^{X_i}(1-p)^{1-X_i}\right) = \sum_{i=1}^{n} \big[X_i \log p + (1-X_i)\log(1-p)\big] = Y \log p + (n-Y)\log(1-p)$, where $Y = \sum_{i=1}^{n} X_i$
• Differentiate w.r.t. p, and set to 0:
  $\frac{\partial LL(p)}{\partial p} = Y\,\frac{1}{p} - (n-Y)\,\frac{1}{1-p} = 0 \;\Rightarrow\; p_{MLE} = \frac{Y}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Maximizing Likelihood with Poisson

• Consider I.I.D. random variables X_1, X_2, ..., X_n with X_i ~ Poi(λ)
• PMF:
  $f(X_i \mid \lambda) = \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}$
• Likelihood:
  $L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{X_i}}{X_i!}$
• Log-likelihood:
  $LL(\lambda) = \sum_{i=1}^{n} \log\!\left(\frac{e^{-\lambda}\lambda^{X_i}}{X_i!}\right) = \sum_{i=1}^{n} \big[-\lambda + X_i \log\lambda - \log(X_i!)\big] = -n\lambda + \log(\lambda)\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\log(X_i!)$
• Differentiate w.r.t. λ, and set to 0:
  $\frac{\partial LL(\lambda)}{\partial \lambda} = -n + \frac{1}{\lambda}\sum_{i=1}^{n} X_i = 0 \;\Rightarrow\; \lambda_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Maximizing Likelihood with Normal

• Consider I.I.D. random variables X_1, X_2, ..., X_n with X_i ~ N(μ, σ²)
• PDF:
  $f(X_i \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(X_i-\mu)^2/(2\sigma^2)}$
• Log-likelihood:
  $LL(\mu, \sigma^2) = \sum_{i=1}^{n} \log\!\left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(X_i-\mu)^2/(2\sigma^2)}\right) = \sum_{i=1}^{n} \left[\log\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(X_i-\mu)^2}{2\sigma^2}\right]$
• First, differentiate w.r.t. μ, and set to 0:
  $\frac{\partial LL(\mu,\sigma^2)}{\partial \mu} = \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma^2} = 0$
• Then, differentiate w.r.t. σ, and set to 0:
  $\frac{\partial LL(\mu,\sigma^2)}{\partial \sigma} = \sum_{i=1}^{n} \left[-\frac{1}{\sigma} + \frac{2(X_i-\mu)^2}{2\sigma^3}\right] = \sum_{i=1}^{n} \left[-\frac{1}{\sigma} + \frac{(X_i-\mu)^2}{\sigma^3}\right] = 0$
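The Bernoulli and Poisson derivations above both end in a closed-form answer (the sample mean). Below is a minimal sketch, not part of the original slides, that checks those closed forms by maximizing the log-likelihood numerically (equivalently, minimizing the negative log-likelihood). It assumes NumPy and SciPy are available; the true parameters used to simulate data (p = 0.3, λ = 4.0) are arbitrary choices for illustration.

```python
# Hypothetical check (not from the slides): compare the closed-form MLEs
# p_MLE = mean(X) and lambda_MLE = mean(X) against a direct numerical
# maximization of the log-likelihood.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Bernoulli: simulate data, then maximize LL(p) = sum_i log f(X_i | p)
x_bern = rng.binomial(1, 0.3, size=1000)               # true p = 0.3 (arbitrary)
neg_ll_bern = lambda p: -np.sum(stats.bernoulli.logpmf(x_bern, p))
p_numeric = optimize.minimize_scalar(neg_ll_bern, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print("Bernoulli:", x_bern.mean(), "vs", p_numeric)    # closed form vs numerical

# Poisson: same recipe with LL(lambda)
x_pois = rng.poisson(4.0, size=1000)                   # true lambda = 4.0 (arbitrary)
neg_ll_pois = lambda lam: -np.sum(stats.poisson.logpmf(x_pois, lam))
lam_numeric = optimize.minimize_scalar(neg_ll_pois, bounds=(1e-6, 50.0), method="bounded").x
print("Poisson:", x_pois.mean(), "vs", lam_numeric)    # closed form vs numerical
```

Both pairs should agree to several decimal places, which is what the "differentiate, set to zero, solve" recipe guarantees when the log-likelihood has a single interior maximum.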
Being Normal, Simultaneously

• We now have two equations and two unknowns:
  $\sum_{i=1}^{n} \frac{X_i - \mu}{\sigma^2} = 0 \qquad\quad \sum_{i=1}^{n} \left[-\frac{1}{\sigma} + \frac{(X_i-\mu)^2}{\sigma^3}\right] = 0$
• First, solve for μ_MLE:
  $\frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \mu) = 0 \;\Rightarrow\; \sum_{i=1}^{n} X_i - n\mu = 0 \;\Rightarrow\; \mu_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i$
• Then, solve for σ²_MLE:
  $\sum_{i=1}^{n} \left[-\frac{1}{\sigma} + \frac{(X_i-\mu)^2}{\sigma^3}\right] = 0 \;\Rightarrow\; \sum_{i=1}^{n} (X_i-\mu)^2 = n\sigma^2 \;\Rightarrow\; \sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu_{MLE})^2$
• Note: μ_MLE is unbiased, but σ²_MLE is biased (same as the Method of Moments estimators)

Maximizing Likelihood with Uniform

• Consider I.I.D. random variables X_1, X_2, ..., X_n with X_i ~ Uni(a, b)
• PDF:
  $f(X_i \mid a, b) = \begin{cases} \frac{1}{b-a} & a \le x_i \le b \\ 0 & \text{otherwise} \end{cases}$
• Likelihood:
  $L(a, b) = \begin{cases} \left(\frac{1}{b-a}\right)^{n} & a \le x_1, x_2, \ldots, x_n \le b \\ 0 & \text{otherwise} \end{cases}$
  o The constraint a ≤ x_1, x_2, …, x_n ≤ b makes differentiation tricky
  o Intuition: we want the interval size (b − a) to be as small as possible, to maximize the likelihood contribution of each data point
  o But we need to make sure all observed data are contained in the interval
    • If all observed data are not in the interval, then L(θ) = 0
• Solution: a_MLE = min(x_1, …, x_n), b_MLE = max(x_1, …, x_n)

Understanding MLE with Uniform

• Consider I.I.D. random variables X_1, X_2, ..., X_n with X_i ~ Uni(0, 1)
• Observed data: 0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75
• [Plots: the likelihood L(a, 1) as a function of a, and L(0, b) as a function of b, showing that the likelihood is maximized by pushing a up to min(x_i) and b down to max(x_i)]

Once Again, Small Samples = Problems

• How do small samples affect the MLE?
• In many cases, μ_MLE = sample mean = $\frac{1}{n}\sum_{i=1}^{n} X_i$
  o Unbiased. Not too shabby…
• As seen with the Normal, $\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu_{MLE})^2$
  o Biased. Underestimates σ² for small n (e.g., 0 for n = 1)
• As seen with the Uniform, a_MLE ≥ a and b_MLE ≤ b
  o Biased. Problematic for small n (e.g., a_MLE = b_MLE when n = 1)
• These small-sample phenomena intuitively make sense:
  o Maximum likelihood best explains the data we've seen
  o It does not attempt to generalize to unseen data

Properties of MLE

• Maximum Likelihood Estimators are generally:
  o Consistent: for ε > 0, $\lim_{n \to \infty} P(|\hat{\theta} - \theta| < \epsilon) = 1$
  o Potentially biased (though asymptotically less so)
  o Asymptotically optimal: has the smallest variance of "good" estimators for large samples
• Often used in practice when the sample size is large relative to the parameter space
  o But be careful: there are some very large parameter spaces
  o Joint distributions of several variables can cause problems
    • The parameter space grows exponentially
    • The parameter space for 10 dependent binary variables has $2^{10} - 1$ parameters
• Rock on, dog!

Maximizing Likelihood with Multinomial

• Consider I.I.D. random variables Y_1, Y_2, ..., Y_n
  o Y_i ~ Multinomial(p_1, p_2, ..., p_m), where $\sum_{i=1}^{m} p_i = 1$
  o X_i = number of trials with outcome i, where $\sum_{i=1}^{m} X_i = n$
• PMF:
  $f(X_1, \ldots, X_m \mid p_1, \ldots, p_m) = \frac{n!}{x_1!\, x_2! \cdots x_m!}\; p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m}$
• Log-likelihood:
  $LL(p) = \log(n!) - \sum_{i=1}^{m} \log(X_i!) + \sum_{i=1}^{m} X_i \log(p_i)$
• We must account for the constraint $\sum_{i=1}^{m} p_i = 1$ when differentiating LL(p)
• Use Lagrange multipliers (dropping the terms that do not involve the p_i):
  $A(p, \lambda) = \sum_{i=1}^{m} X_i \log(p_i) + \lambda\left(1 - \sum_{i=1}^{m} p_i\right)$
  (Joseph-Louis Lagrange, 1736–1813)
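Differentiating A(p, λ) with respect to each p_i, setting the derivatives to zero, and enforcing the constraint gives the intuitive answer $p_{i,MLE} = X_i / n$ (the fraction of trials with outcome i). Below is a minimal sketch, not part of the original slides: instead of forming the Lagrangian by hand, it hands the constraint $\sum_i p_i = 1$ to a constrained optimizer and compares the result against $X_i / n$. The counts are made-up illustration values, and NumPy/SciPy are assumed to be available.

```python
# Hypothetical check (not from the slides): maximize the multinomial log-likelihood
# subject to sum(p_i) = 1 and compare against the closed-form answer p_i = X_i / n.
import numpy as np
from scipy.optimize import minimize

counts = np.array([12, 7, 31, 50])          # X_i: number of trials with outcome i (made up)
n = counts.sum()

def neg_ll(p):
    # -sum_i X_i log(p_i); the log(n!) and log(X_i!) terms don't depend on p
    return -np.sum(counts * np.log(p))

res = minimize(
    neg_ll,
    x0=np.full(len(counts), 1.0 / len(counts)),                    # start from uniform probabilities
    bounds=[(1e-9, 1.0)] * len(counts),                            # keep each p_i in (0, 1]
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],  # enforce sum(p_i) = 1
    method="SLSQP",
)

print("numerical MLE:     ", np.round(res.x, 4))
print("closed form X_i / n:", np.round(counts / n, 4))             # should match
```

This is the same recipe used throughout the lecture: write down LL, then maximize it; here the constraint is handled by the optimizer rather than by an explicit multiplier λ, but the resulting probabilities are the same.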