Maximum Likelihood Estimation for Learning Populations of Parameters

Ramya Korlakai Vinayak, Postdoctoral Researcher, Paul G. Allen School of CSE
Joint work with Weihao Kong, Gregory Valiant, and Sham Kakade
Poster #189 · ramya@cs.washington.edu
Motivation: Large yet Sparse Data

Example: flu data. Suppose that for a large random subset of the population in California, we observe whether each person caught the flu or not in each of the last 5 years.
Person i has an unknown probability p_i of catching the flu — the "bias of coin i".
Observations for person i: {0, 0, 1, 0, 1}, so x_i = 2 and the per-person estimate is p̂_i = x_i / t = 0.4 ± 0.45.

Goal: Can we learn the distribution of the biases over the population?
Why? Useful for downstream analysis: testing and estimating properties of the distribution.

• Application domains: epidemiology, social sciences, psychology, medicine, biology
• Population size is large, often hundreds of thousands or millions
• Number of observations per individual is limited (sparse), prohibiting accurate estimation of the parameters of interest
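As a sanity check on the ±0.45 (assuming it denotes a 95% confidence interval, which the slide does not state): with t = 5 tosses and p̂_i = 0.4,

$$\hat{p}_i \pm 1.96\sqrt{\frac{\hat{p}_i(1-\hat{p}_i)}{t}} = 0.4 \pm 1.96\sqrt{\frac{0.4 \cdot 0.6}{5}} \approx 0.4 \pm 0.43.$$

The interval covers nearly all of [0, 1]: five observations say almost nothing about an individual p_i, which is why per-individual estimation is hopeless and the distributional question is the natural one to ask.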
Model: Non-parametric Mixture of Binomials (Lord 1965, 1969)

• N independent coins, i = 1, 2, ..., N. Each coin has its own bias drawn from the unknown true distribution P* on [0, 1]: p_i ~ P*.
• We observe t tosses of every coin: X_i ~ Bin(t, p_i) ∈ {0, 1, ..., t}. Example: t = 5 tosses, observations {0, 0, 1, 0, 1}, so x_i = 2.
• Given {X_i}_{i=1}^N, return P̂, an estimate of P*.
• Error metric: the Wasserstein-1 distance W_1(P*, P̂) (Earth Mover's Distance).

A simulation sketch of this model follows.
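A minimal simulation sketch of the data-generating model. The Beta-shaped P* and the specific parameters are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, t = 100_000, 5                    # N coins, t tosses per coin

# Hypothetical true distribution P* over biases (unknown in practice)
p = rng.beta(2, 5, size=N)           # p_i ~ P*

# Observations: total heads per coin, X_i ~ Bin(t, p_i) in {0, 1, ..., t}
x = rng.binomial(t, p)
```

The estimation task is to recover the distribution of the p_i from x alone, without ever seeing the individual biases.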
Learning with Sparse Observations is Non-trivial

• The empirical plug-in estimator, P̂_plug-in = histogram of {X_1/t, ..., X_i/t, ..., X_N/t}, is bad: when t ≪ N it incurs an error of Θ(1/√t), where t = number of tosses per coin and N = number of coins (illustrated in the sketch after this slide).
• Many recent works estimate symmetric properties of a discrete distribution from sparse observations (Paninski 2003; Valiant and Valiant 2011; Jiao et al. 2015; Orlitsky et al. 2016; Acharya et al. 2017; ...). The setting in this work is different.
• Tian et al. 2017 proposed a moment-matching-based estimator that achieves the optimal error of O(1/t) when t < c log N. Its weakness is that it fails to obtain the optimal error when t > c log N, due to the higher variance of the larger moments.

What about the Maximum Likelihood Estimator?
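A quick numerical illustration of the plug-in estimator's Θ(1/√t) error floor, under the same hypothetical Beta-shaped P* as before (a sketch, not the paper's experiment):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
N, t = 100_000, 5
p = rng.beta(2, 5, size=N)            # hypothetical true biases p_i ~ P*
x = rng.binomial(t, p)                # observations X_i ~ Bin(t, p_i)

# Plug-in estimate of P*: the empirical distribution of X_i / t
plug_in = x / t

# W_1 between the true biases and the plug-in estimate: even with huge N,
# per-coin binomial noise of order sqrt(p(1-p)/t) keeps this stuck near
# 1/sqrt(t), far above the O(1/t) rate that is achievable
print(wasserstein_distance(p, plug_in))
```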
Maximum Likelihood Estimator

Sufficient statistic: the fingerprint. h_s = # coins that show s heads, for s = 0, 1, ..., t, and h = [h_0, h_1, ..., h_s, ..., h_t] is the fingerprint vector.
[Figure: histogram of the fingerprint h_s against s = 0, 1, ..., 5]

P̂_mle ∈ argmin_{Q ∈ dist[0,1]} KL(observed h ‖ expected h under the distribution Q)

How well does the MLE recover the distribution?

• NOT the empirical estimator
• Convex optimization: efficient (polynomial time)
• Proposed in the late 1960s by Frederic Lord in the context of psychological testing. Several works study the geometry, identifiability, and uniqueness of the MLE solution (Lord 1965, 1969; Turnbull 1976; Laird 1978; Lindsay 1983; Wood 1999).
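The MLE above is a convex program over distributions on [0, 1]. A minimal sketch of one standard way to solve it (not necessarily the paper's implementation): discretize [0, 1] into a grid of candidate biases and run the classical EM fixed-point iteration for nonparametric mixture weights (in the spirit of Laird 1978), which minimizes the KL objective between the observed and expected fingerprints. The grid size and iteration count here are arbitrary choices.

```python
import numpy as np
from scipy.stats import binom

def mle_fingerprint(h, t, grid_size=100, n_iters=5000):
    """Approximate the NPMLE over a grid of biases from the fingerprint h,
    where h[s] = number of coins showing s heads (s = 0, ..., t)."""
    f = np.asarray(h, dtype=float)
    f /= f.sum()                              # observed fingerprint fractions
    p_grid = np.linspace(0.0, 1.0, grid_size)
    # A[s, j] = Pr(Bin(t, p_j) = s); the expected fingerprint under q is A @ q
    A = binom.pmf(np.arange(t + 1)[:, None], t, p_grid[None, :])
    q = np.full(grid_size, 1.0 / grid_size)   # uniform initial weights
    for _ in range(n_iters):
        m = A @ q                             # expected fingerprint under q
        q *= A.T @ (f / np.maximum(m, 1e-300))  # multiplicative EM update
    return p_grid, q                          # grid points and MLE weights

# Usage on a hypothetical fingerprint from N coins tossed t = 5 times each:
# p_grid, q = mle_fingerprint(h=[310, 280, 200, 120, 60, 30], t=5)
```

Each EM update preserves the normalization of q and monotonically improves the likelihood, so the iterates converge to the (grid-restricted) convex optimum.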