

  1. On Learning Parametric Non-Smooth Continuous Distributions. Sudeep Kamath (PDT Partners), Alon Orlitsky (Department of Electrical and Computer Engineering, University of California San Diego), Venkatadheeraj Pichapati (Apple Inc.), Ehsan Zobeidi (Department of Electrical and Computer Engineering, University of California San Diego)

  2. Motivation
  • Learning a distribution: a classical problem in statistics
  • Several applications: weather, finance
  • Data is rarely discrete, and rarely drawn from a smooth class of distributions
  • Can we learn a class of (non-smooth) continuous distributions?

  3. Notation
  • p.d.f.: f(x) such that ∫_{−∞}^{∞} f(x) dx = 1
  • c.d.f.: F(x) = ∫_{−∞}^{x} f(t) dt
  • Continuous distributions: no Dirac delta components in f(x)
  • Parametric class C_θ: distributions that can be defined by a parameter (or parameters) θ
  • E.g., the class of exponential distributions: f_λ(x) = λ e^{−λx}

  4. Problem
  • X^n = X_1, X_2, ..., X_n: n i.i.d. samples from f(x)
  • Learn f(x) from X^n
  • Output: a p.d.f. g_{X^n}(x)
  • How do we measure how well g_{X^n}(x) approximates f(x)?
  • A distance function D(f, g_{X^n})
  • How do we evaluate the distance over all sample sequences? Expected loss: E_{X^n∼f} D(f, g_{X^n})

  5. Distance
  • Distances between distributions (a numerical sketch follows below):
  • ℓ_1: D_{ℓ_1}(f, g) = ∫_{−∞}^{∞} |f(x) − g(x)| dx
  • ℓ_2^2: D_{ℓ_2^2}(f, g) = ∫_{−∞}^{∞} (f(x) − g(x))^2 dx
  • KL: D_{KL}(f, g) = D(f || g) = ∫_{−∞}^{∞} f(x) log (f(x)/g(x)) dx
  • For parametric continuous distributions, estimating the parameter(s) reduces both the ℓ_1 and ℓ_2^2 losses
  • KL loss applications: compression (information theory), machine learning (log loss)
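
A minimal numerical sketch (not from the talk) of the three distances above; the grid, the example exponential densities, and the function names are illustrative choices:

```python
import numpy as np

# Uniform grid for numerical integration; limits chosen for the example densities below.
xs = np.linspace(1e-6, 50.0, 500_000)
dx = xs[1] - xs[0]

def l1_dist(fx, gx):
    return np.sum(np.abs(fx - gx)) * dx            # ∫ |f(x) − g(x)| dx

def l2sq_dist(fx, gx):
    return np.sum((fx - gx) ** 2) * dx             # ∫ (f(x) − g(x))^2 dx

def kl_div(fx, gx):
    # D(f || g) = ∫ f log(f/g) dx; infinite if g vanishes somewhere f does not
    if np.any((fx > 0) & (gx == 0)):
        return np.inf
    ratio = np.where(fx > 0, fx / np.where(gx > 0, gx, 1.0), 1.0)
    return np.sum(np.where(fx > 0, fx * np.log(ratio), 0.0)) * dx

# Example: two exponential densities f_λ(x) = λ e^{−λx}
fx = 1.0 * np.exp(-1.0 * xs)    # λ = 1
gx = 1.5 * np.exp(-1.5 * xs)    # λ = 1.5
print(l1_dist(fx, gx), l2sq_dist(fx, gx), kl_div(fx, gx))
# Sanity check: for exponentials, D(f_1 || f_1.5) = log(1/1.5) + 1.5 − 1 ≈ 0.0945
```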

  6. Loss function
  • Learning loss: E_{X^n∼f} D(f || g_{X^n}), the average additional bits required to code the (n+1)-th sample
  • Loss over the class C_θ: r_n(C_θ, g) = max_{f∈C_θ} E_{X^n∼f} D(f || g_{X^n})
  • Instantaneous redundancy (minimax KL loss): r_n(C_θ) = min_g r_n(C_θ, g)

  7. Cumulative Redundancy
  • Compression loss: additional bits required to code X^n
  • R_n(C_θ) = min_g R_n(C_θ, g) = min_g max_{f∈C_θ} ∑_{j=0}^{n−1} E_{X^j∼f} D(f || g_{X^j})
  • One can code X^n one sample at a time: code the first sample X_1, get an estimate of the distribution (using X_1) and code the second sample, and so on
  • Hence R_n ≤ ∑_{i=0}^{n−1} r_i (a short check of this bound follows below)
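
A short worked check of the bound R_n ≤ ∑ r_i stated above; the per-step predictor g^{(j)} (my notation, not from the slides) is one that achieves the instantaneous redundancy r_j:

```latex
% Choose, at each step j, a predictor g^{(j)} achieving r_j, then bound
% the maximum of a sum by the sum of maxima:
R_n(\mathcal{C}_\theta)
  \le \max_{f \in \mathcal{C}_\theta} \sum_{j=0}^{n-1}
        \mathbb{E}_{X^j \sim f}\, D\!\left(f \,\big\|\, g^{(j)}_{X^j}\right)
  \le \sum_{j=0}^{n-1} \max_{f \in \mathcal{C}_\theta}
        \mathbb{E}_{X^j \sim f}\, D\!\left(f \,\big\|\, g^{(j)}_{X^j}\right)
  = \sum_{j=0}^{n-1} r_j(\mathcal{C}_\theta).
```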

  8. Gaussian Distributions
  • Class of Gaussian distributions with unknown mean and known variance: f_θ(x) = (1/√(2π)) e^{−(x−θ)²/2}
  • Estimate the mean as (1/n) ∑_{i=1}^{n} X_i (the ML estimator, a sufficient statistic)
  • Output: the distribution with the estimated mean
  • Near-optimal estimator: r_n = (1/(2n))(1 + o(1)) (see the Monte Carlo sketch below)
  • Is this true for any class?
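
A quick Monte Carlo sketch (my own, not from the talk) of the plug-in estimator's KL loss for this class, using the closed form D(N(θ,1) || N(θ̂,1)) = (θ − θ̂)²/2; the sample size, trial count, and θ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, theta = 20, 200_000, 0.7

# Plug-in estimator: Gaussian with the sample mean as its mean, variance known to be 1.
samples = rng.normal(theta, 1.0, size=(trials, n))
theta_hat = samples.mean(axis=1)

# KL between unit-variance Gaussians: D(N(θ,1) || N(θ̂,1)) = (θ − θ̂)^2 / 2.
kl = 0.5 * (theta - theta_hat) ** 2
print(kl.mean(), 1.0 / (2 * n))   # both ≈ 1/(2n), matching r_n = (1/(2n))(1 + o(1))
```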

  9. Smooth Distributions
  • Asymptotic normality of the MLE: √n (θ − θ̂_ML(X^n)) → N(0, 1/I(θ))
  • I(θ) is the Fisher information: ∂²/∂δ² D(f_θ || f_{θ+δ}) |_{δ=0} = I(θ)
  • Second-order expansion of the loss: E D(f_θ || f_{θ̂_ML}) ≈ E[ D(f_θ || f_θ) + ∂/∂δ D(f_θ || f_{θ+δ})|_{δ=0} (θ̂_ML − θ) + ½ ∂²/∂δ² D(f_θ || f_{θ+δ})|_{δ=0} (θ̂_ML − θ)² ]
  • The first two terms vanish, leaving (I(θ)/2) · E(θ̂_ML − θ)² = (I(θ)/2) · 1/(n I(θ)) = 1/(2n)

  10. Smooth distributions
  • Lower bound: if the parameter can be estimated only to accuracy 1/n^α, then lim sup_n n·r_n ≥ α
  • For smooth distributions it has actually been shown¹ that r_n ≈ (# parameters)/(2n)
  • How about non-smooth distributions, i.e., distributions with no Fisher information?

  ¹ A. Barron, N. Hengartner et al., "Information theory and superefficiency," The Annals of Statistics, vol. 26, no. 5, pp. 1800–1825, 1998.

  11. Uniform distributions
  • Class of uniform distributions: f_θ(x) = (1/θ) 1_{0 ≤ x ≤ θ}
  • ML estimator: max(X^n) (estimates θ to an accuracy of ≈ 1/n)
  • KL loss for the plug-in ML estimator: infinite, while the ℓ_1 and ℓ_2^2 losses are still finite (see the worked check below)
  • The output estimator should therefore place mass even beyond max(X^n)
  • How can we allocate this probability? Can r_n be finite?
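
A short worked check (not from the slides) of why the plug-in estimator's KL loss is infinite while its ℓ_1 loss stays finite; write m = max(X^n) < θ and let f_m denote the plug-in uniform density on [0, m]:

```latex
% KL blows up because the plug-in estimate assigns zero density to (m, θ], where f_θ > 0:
D\!\left(f_\theta \,\|\, f_{m}\right)
  = \int_0^{\theta} \frac{1}{\theta}\,\log\frac{1/\theta}{f_{m}(x)}\,dx = \infty .
% The \ell_1 loss remains finite:
D_{\ell_1}\!\left(f_\theta, f_{m}\right)
  = \int_0^{m}\left(\frac{1}{m}-\frac{1}{\theta}\right)dx
    + \int_{m}^{\theta}\frac{1}{\theta}\,dx
  = 2\left(1-\frac{m}{\theta}\right) < \infty .
```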

  12. Prior
  • To derive an estimator:
  • Consider a Pareto prior Π on θ
  • There exists a closed-form solution for arg min_g E_{θ∼Π} D(f_θ, g_{X^n}): the posterior predictive g_{x^n}(x) = f(x | x^n, θ ∼ Π)
  • Further, r_n ≥ min_g E_{θ∼Π} D(f_θ, g_{X^n}) (the Bayes redundancy lower-bounds the minimax redundancy)

  13. r_n for the Uniform class
  • Allocates mass n/(n+1) uniformly up to max(X^n)
  • The remaining 1/(n+1) mass falls polynomially (as 1/x^{n+1}) after max(X^n)
  • This estimator incurs the same loss over all uniform distributions
  • Hence the upper and lower bounds match: r_n ≈ 1/n (a code sketch of this estimator follows below)
  • Here one parameter leads to a loss of 1/n; recall 1/(2n) for a smooth class
  [Figure: p.d.f. of the original uniform distribution and of the estimator]
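
A runnable sketch (my own code, following the description above) of this predictive density for Uniform[0, θ], with a Monte Carlo check of the ≈ 1/n loss; the tail constant n·m^n/((n+1)·x^{n+1}) is the one implied by the stated n/(n+1) vs. 1/(n+1) mass split:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_pdf(x, m, n):
    """Mass n/(n+1) spread uniformly on [0, m], remaining 1/(n+1) decaying as
    1/x^(n+1) beyond m, where m = max(X^n)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= m, n / ((n + 1) * m), n * m**n / ((n + 1) * x ** (n + 1)))

def kl_loss(theta, n, grid=50_000):
    """D(f_θ || g_{X^n}) for one draw of X^n, by midpoint-rule integration over [0, θ]."""
    m = rng.uniform(0, theta, size=n).max()
    xs = (np.arange(grid) + 0.5) * (theta / grid)
    f = 1.0 / theta
    return np.sum(f * np.log(f / predictive_pdf(xs, m, n))) * (theta / grid)

n = 50
losses = [kl_loss(theta=1.0, n=n) for _ in range(2_000)]
# The average loss should come out ≈ 1/n, and it is the same for every θ (scale invariance).
print(np.mean(losses), 1.0 / n)
```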

  14. Uniform with 2 parameters
  • Class of uniform distributions with both ends unknown
  • A similar technique derives the optimal estimator:
  • Allocates mass (n−2)/n uniformly between min(X^n) and max(X^n)
  • 1/n probability falling polynomially (as 1/(x − min(X^n))^{n−1}) after max(X^n)
  • 1/n probability falling polynomially (as 1/(max(X^n) − x)^{n−1}) before min(X^n)
  • r_n ≈ 2/n (a loss of 1/n per parameter)
  [Figure: p.d.f. of the original two-parameter uniform distribution and of the estimator]

  15. Uniform with fixed width
  • Class of uniform distributions with known width but unknown start point: f_θ(x) = 1 for θ ≤ x ≤ θ+1
  • Once again the optimal estimator is derived using the "prior" technique
  • The optimal estimator:
  • Allocates a p.d.f. of 1 between min(X^n) and max(X^n)
  • The p.d.f. falls linearly between max(X^n) and min(X^n) + 1
  • The p.d.f. falls linearly between max(X^n) − 1 and min(X^n)
  • r_n ≈ 1/n (a code sketch of this density shape follows below)
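
A small sketch (my own code) of the density shape described above; it integrates to 1 because the flat part carries mass max − min and each linear ramp carries mass (1 − (max − min))/2. The edge case max − min = 1 is ignored here:

```python
import numpy as np

def fixed_width_predictive(x, lo, hi):
    """Density 1 on [lo, hi] = [min(X^n), max(X^n)], falling linearly to 0 on
    [hi, lo + 1], and rising linearly from 0 on [hi − 1, lo]."""
    x = np.asarray(x, dtype=float)
    w = hi - lo                                   # observed width, < 1
    rise = (x - (hi - 1.0)) / (1.0 - w)           # 0 at hi − 1, 1 at lo
    fall = ((lo + 1.0) - x) / (1.0 - w)           # 1 at hi, 0 at lo + 1
    return np.where((x >= lo) & (x <= hi), 1.0,
           np.where((x > hi) & (x <= lo + 1.0), fall,
           np.where((x >= hi - 1.0) & (x < lo), rise, 0.0)))

xs = np.linspace(-0.6, 1.6, 1_000_000)
g = fixed_width_predictive(xs, lo=0.2, hi=0.9)
print(np.sum(g) * (xs[1] - xs[0]))   # ≈ 1
```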

  16. Truncated distributions
  • Consider any continuous distribution f
  • Truncated class: the class of distributions generated by truncating f at θ: f_θ(x) = (f(x)/F(θ)) 1_{x ≤ θ}
  • No Fisher information for this class either
  • The transformation y = F(x) maps this class to the class of uniform distributions
  • The optimal estimator is already known there
  • Map back to x using the transformation x = F^{−1}(y) (see the sketch below)
  • r_n ≈ 1/n
  [Figure: p.d.f. of the original truncated distribution and of the estimator]
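
A sketch (my own code) of this reduction for an illustrative base distribution, the standard exponential with F(x) = 1 − e^{−x}: map the samples through F, apply the uniform-class predictive from slide 13, and map back with the change of variables g_X(x) = g_Y(F(x)) · f(x):

```python
import numpy as np

def uniform_predictive(y, m, n):
    # Predictive for Uniform[0, η] given n samples with maximum m (slide 13).
    y = np.asarray(y, dtype=float)
    return np.where(y <= m, n / ((n + 1) * m), n * m**n / ((n + 1) * y ** (n + 1)))

# Base distribution (illustrative choice): standard exponential.
f = lambda x: np.exp(-x)            # base p.d.f.
F = lambda x: 1.0 - np.exp(-x)      # base c.d.f.
F_inv = lambda y: -np.log(1.0 - y)

rng = np.random.default_rng(1)
theta, n = 2.0, 100
# Inverse-transform sampling from the truncated density f(x)/F(θ) on [0, θ].
x_samples = F_inv(rng.uniform(0.0, F(theta), size=n))

# Reduce to the uniform problem, estimate there, and map back.
m = F(x_samples).max()
g_X = lambda x: uniform_predictive(F(x), m, n) * f(x)   # g_X(x) = g_Y(F(x)) · f(x)

xs = np.linspace(1e-4, 8.0, 800_000)
print(np.sum(g_X(xs)) * (xs[1] - xs[0]))   # ≈ 1: a valid density, with mass beyond max(X^n)
```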

  17. r_n in general?
  • Is r_n always 1/n per parameter?
  • Consider the triangle distribution
  • Its Fisher information doesn't exist
  • It looks smoother than the uniform distribution
  • Is the loss 1/n or 1/(2n)?
  [Figure: p.d.f. of the triangle distribution and of the estimator]

  18. Scaled Distributions
  • The triangle distribution can in fact be seen as a scaled distribution
  • Consider a p.d.f. f with all its mass between 0 and 1
  • One can scale (stretch) the distribution: f_θ(x) = (1/θ) f(x/θ)
  • The Pareto distribution is again the "least favorable" prior
  • The optimal estimator can be derived (a numerical sketch follows below):
    g_{x^n}(x_{n+1}) = [∫_0^∞ (1/θ) ∏_{i=1}^{n+1} (1/θ) f(x_i/θ) dθ] / [∫_0^∞ (1/θ) ∏_{i=1}^{n} (1/θ) f(x_i/θ) dθ]
  • If, for all x ∈ (0,1), f(x) ≠ 0, f(1−) ≠ 0, and f′(1−) is finite, then r_n ≈ 1/n
  • This recovers r_n for the class of uniform distributions starting at 0
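
A numerical sketch (my own code) of this predictive, integrating over θ on a grid. The formula implemented is my reconstruction above, with the limiting scale-invariant prior weight dθ/θ; the base density f(x) = 2x on [0, 1] is only an illustrative choice, and for larger n the products should be computed in log space:

```python
import numpy as np

def scale_predictive(x_next, xn, f, thetas):
    """Predictive density for the scale family f_θ(x) = (1/θ) f(x/θ),
    integrating over θ on the uniform grid `thetas` with prior weight dθ/θ."""
    dth = thetas[1] - thetas[0]
    th = thetas[:, None]
    lik = np.prod(f(xn[None, :] / th) / th, axis=1)            # ∏_i (1/θ) f(x_i/θ)
    numer = np.sum(lik * (f(x_next / thetas) / thetas) / thetas) * dth
    denom = np.sum(lik / thetas) * dth
    return numer / denom

# Illustrative base density supported on [0, 1]: f(x) = 2x.
f = lambda x: np.where((x >= 0) & (x <= 1), 2.0 * x, 0.0)

rng = np.random.default_rng(2)
theta_true, n = 3.0, 40
xn = theta_true * np.sqrt(rng.uniform(0.0, 1.0, size=n))       # inverse-transform samples of f_θ

thetas = np.linspace(1e-3, 20.0, 40_000)
print(scale_predictive(2.0, xn, f, thetas), scale_predictive(5.0, xn, f, thetas))
```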

  19. Scaled Distributions
  • For the triangle distribution, f(1−) = 0, so the previous result doesn't apply
  • Calculating r_n is tricky
  • We derived bounds on R_n, which suggest bounds on r_n
  • We know that lim_{n→∞} r_n = 0 and that r_n ≥ 1/(2n)
  • The bounds suggest that for triangle distributions r_n lies between (3/2 − π/4)/n ≈ 0.715/n and 1/n

  20. Future Work
  • Establishing r_n:
  • for scaled distributions
  • for other classes of distributions
