Size Estimation - Statistical Models for Underreporting Gerhard - PowerPoint PPT Presentation

Size Estimation - Statistical Models for Underreporting Gerhard Neubauer, Gordana Djuraš & Herwig Friedl JOANNEUM RESEARCH and Technical University, Graz useR! 2009, Rennes 8.-10. July 2009 – p. 1

1 Introduction useR! 2009, Rennes 8.-10. July 2009 – p. 2

Underreporting Any sample of count data may be incomplete ■ criminology: crimes with an aspect of shame (sexuality, domestic violence) or theft of low values goods ■ public health: infectious (HIV) or chronic (diabetes) disease data ■ production: error counts in a production process ■ traffic accidents with minor damage Estimation of total number of cases useR! 2009, Rennes 8.-10. July 2009 – p. 3

Binomial Model Event reported Yes No R 1 − R R ∼ Bernoulli ( π ) useR! 2009, Rennes 8.-10. July 2009 – p. 4

Binomial Model Event reported Yes No R 1 − R R ∼ Bernoulli ( π ) iid sample of size λ ⇒ Y = � i R i ∼ Binomial ( λ, π ) var( Y ) = σ 2 = λπ (1 − π ) E( Y ) = µ = λπ, useR! 2009, Rennes 8.-10. July 2009 – p. 4

Binomial Model Event reported Yes No R 1 − R R ∼ Bernoulli ( π ) iid sample of size λ ⇒ Y = � i R i ∼ Binomial ( λ, π ) var( Y ) = σ 2 = λπ (1 − π ) E( Y ) = µ = λπ, Y the number of reported events π the reporting probability the total number of events - size parameter λ U = λ − Y the number of unreported events useR! 2009, Rennes 8.-10. July 2009 – p. 4

Binomial Model Event reported Yes No R 1 − R R ∼ Bernoulli ( π ) iid sample of size λ ⇒ Y = � i R i ∼ Binomial ( λ, π ) var( Y ) = σ 2 = λπ (1 − π ) E( Y ) = µ = λπ, Both λ and π have to be estimated No longer member of Exponential Family useR! 2009, Rennes 8.-10. July 2009 – p. 5

Estimation For T iid samples Y t Method of Moments For the binomial we have var( Y ) = µ − µ 2 /λ ≤ µ Limitation to data with s 2 < ¯ y For s 2 > ¯ y 1. Regression approach using ML ind Y t ∼ Binomial ( λ t , π ) with λ t = f ( x t , β ) Neubauer & Friedl (2006) 2. Mixed model approaches useR! 2009, Rennes 8.-10. July 2009 – p. 6

2 Alternative Models useR! 2009, Rennes 8.-10. July 2009 – p. 7

Beta-Binomial Random reporting probability P Y t | P ∼ Binomial ( λ, p ) P ∼ Beta ( γ, δ ) Y t ∼ Beta-Binomial ( λ, γ, δ ) Mean-variance relation µ − µ 2 � � var( Y t ) = φ λ φ = λ + γ + δ 1 + γ + δ ≥ 1 useR! 2009, Rennes 8.-10. July 2009 – p. 8

Poisson Random total number of cases L Y t | L ∼ Binomial ( l, π ) L ∼ Poisson ( λ ) Y t ∼ Poisson ( λπ ) Parameters not identified useR! 2009, Rennes 8.-10. July 2009 – p. 9

Negative Binomial Additional randomness in E( L ) L | K ∼ Poisson ( kλ ) Y t | K ∼ Poisson ( kλπ ) K ∼ Gamma ( ω, ω ) Y t ∼ Negative Binomial ( ω, 1 − π ) ω the number of unreported cases π the reporting probability Mean-variance relation var( Y t ) = µ + µ 2 ω ≥ µ useR! 2009, Rennes 8.-10. July 2009 – p. 10

Beta-Poisson Consider both binomial parameters as random Y | L, P ∼ Binomial ( L, P ) L ∼ Poisson ( λ ) P ∼ Beta ( γ, δ ) Y ∼ Beta-Poisson ( λ, γ, δ ) γ E ( Y ) = λπ = µ where π = γ + δ φ = 1 + λ (1 − π ) var( Y ) = µφ with 1 + γ + δ ≥ 1 useR! 2009, Rennes 8.-10. July 2009 – p. 11

Generalized Poisson Distribution Consul(1989) Moments θ θ E( Y ) = var( Y ) = (1 − τ ) (1 − τ ) 3 ■ τ = 0 : E( Y ) = var( Y ) ⇒ Poisson ( θ ) ■ 0 < τ < 1 : E( Y ) < var( Y ) ⇒ Neg. Binomial ■ τ < 0 : E( Y ) > var( Y ) ⇒ Binomial useR! 2009, Rennes 8.-10. July 2009 – p. 12

Conditional Poisson Models Motivation: π → 1 leads to Y → λ in the binomial approach Assume Y | L ∼ Poisson ( L ) Choose p ( L ) such that E( Y ) = λπ and var( Y ) = λπφ For example: L ∼ Binomial ( λ, π ) 1 < φ = 2 − π < 2 2 < φ = 2 − π L ∼ Negative Binomial ( λ, π ) 1 − π < ∞ useR! 2009, Rennes 8.-10. July 2009 – p. 13

Regression Modelling For all models - except the GP - we use exp( α ) λ t,β = exp( x ′ t β ) and π α = 1 + exp( α ) x t , d -vector of known regressors β , d -vector of unknown parameters For the GP model we use θ t,β = exp( x ′ t β ) τ α = 1 − exp( − α ) and Test α = 0 ⇒ Identify Poisson misdispersion useR! 2009, Rennes 8.-10. July 2009 – p. 14

3 Implementation useR! 2009, Rennes 8.-10. July 2009 – p. 15

R package sizEst Full ML estimation of all models: done Conditional Poisson in work Testing competing models: in development Testing parameters within models: done Model diagnostics: done Main functions: arrayEst, sizEst Implemented methods: sizEst: plot, predict, residuals, summary summary.arrayEst useR! 2009, Rennes 8.-10. July 2009 – p. 16

4 Real Data Application useR! 2009, Rennes 8.-10. July 2009 – p. 17

Stroke Data Hypothesis: Slight strokes are not seen in hospitals Data: Hospital discharges Output from function arrayEst() iterations loglik chi.sq gradient p1 GP 0.5 3 -405.256 94.602 0.000 0.4594 NegBin 0.5 3 -405.195 94.411 0.000 0.4604 BetaBin 0.5 19 -402.135 80.779 0.000 0.8230 BetaPois 0.5 30 -408.170 80.811 0.000 0.9097 useR! 2009, Rennes 8.-10. July 2009 – p. 18

Stroke Data Output from function sizEst() Distribution: BetaPois Formula: y ˜ beta01 + T.cos1 + T.sin1 - 1 Estimate Std. Error t value Pr(>|t|) alpha1 2.309 0.289 8.002 0 beta01 5.071 0.025 204.435 0 T.cos1 0.060 0.016 3.673 0 T.sin1 -0.083 0.016 -5.249 0 Theta 11.231 --- --- - Performance measures: measures loglik -408.170 chi.sq 80.811 df.residual 92.000 aic 824.341 bic 834.599 Reporting Probabilities: lower estimate upper alpha1 0.8622 0.9097 0.9571 useR! 2009, Rennes 8.-10. July 2009 – p. 19

Stroke Data Distribution=BetaPois, p0=0.3 Distribution=BetaPois, p0=0.3 200 150 150 Frequency Frequency 100 100 50 50 0 0 0 20 40 60 80 0 20 40 60 80 Mean function from Quasi−Poisson−Model (blue) and BetaPois model (red) Lambda function from BetaPois model (red) Pearson residuals Pearson residuals 1.0 0.4 0.8 0.6 0.3 Density 0.4 ACF 0.2 0.2 0.1 −0.2 0.0 −2 −1 0 1 2 0 5 10 15 Histogram of residuals and N(0,1) density Lag π = 0 . 91 ˆ useR! 2009, Rennes 8.-10. July 2009 – p. 20

Crime Register Data SIMO : Austrian online crime register ■ Time range: 2004 - 2007 ■ weekly counts ■ 132 regions ■ different crime categories In most cases Poisson overdispersion useR! 2009, Rennes 8. -10. July 2009 – p. 21

GP model estimates Shop Lifting Bicycle Theft 120 100 100 80 80 Frequency Frequency 60 60 40 40 20 20 0 0 0 50 100 150 200 0 50 100 150 200 Week Week π = 0 . 69 ˆ π = 0 . 67 ˆ useR! 2009, Rennes 8. -10. July 2009 – p. 22

Beta-Poisson model estimates Shop Lifting Bicycle Theft 120 150 100 80 Frequency Frequency 100 60 40 50 20 0 0 0 50 100 150 200 0 50 100 150 200 Week Week π = 0 . 65 ˆ π = 0 . 52 ˆ useR! 2009, Rennes 8. -10. July 2009 – p. 23

Summary ■ Great variety of models ■ MLE based implementation in R ■ Good performance for simulated data ■ Reasonable estimates for real data useR! 2009, Rennes 8. -10. July 2009 – p. 24

Summary ■ Great variety of models ■ MLE based implementation in R ■ Good performance for simulated data ■ Reasonable estimates for real data Future work ■ Implement Conditional Poisson models ■ Non-nested Testing for more than two models useR! 2009, Rennes 8. -10. July 2009 – p. 24

Size Estimation - Statistical Models for Underreporting Gerhard - PowerPoint PPT Presentation

Size Estimation - Statistical Models for Underreporting Gerhard Neubauer, Gordana Djura & Herwig Friedl JOANNEUM RESEARCH and Technical University, Graz useR! 2009, Rennes 8.-10. July 2009 p. 1 1 Introduction useR! 2009, Rennes

Learning Models from Data with Measurement Error: Tackling Underreporting Roy Adams, Yuelong Ji,

Statistical Natural Language Processing Statistical models: learning, inference, estimation,

STA 214: Probability & Statistical Models STA 214: Analysis of Statistical Models

Lumber Size Lumber Size Control Control Studies Studies Lumber Size Control Lumber Size

Lab 2 discussion Last Time Debugging Its a science use experiments to refine

TRAUMA AND HELP SEEKING How to understand underreporting DENISE SAINT ARNAULT, PHD, RN

Estimating MCMIS Crash File Underreporting from Carrier Records March 22, 2017 Office of

Congressional Budget Office July 21, 2017 Correcting for Underreporting of Government Transfers:

How CBO Adjusts for Underreporting of Means-Tested Transfers in Its Distributional Analyses

Envelope Wages, Underreporting and Tax Evasion: The Case of Turkey Selin Pelek (Galatasaray

Motion Estimation by Affine Transforms Motion Estimation by Affine Transforms Motion Estimation

Lecture 6. Bayesian estimation Lecture 6. Bayesian estimation 1 (172) 6. Bayesian estimation

Sunthud Pornprasertmanit W. Joel Schneider Sample Size Estimation Approach Power

Software size measures and their usefulness for software project estimation Software Size

Statistical Statistical Statistical Model Statistical Model Model Checking Model Checking

Handling array size limitations Handling array size limitations Issue: array size is fixed

C o l o r G l a s s C o n d e n s a t e a n d p a r t o n s a t u

Inference and Estimation Using Nearest Neighbors 2019 The Second Korea-Japan Machine Learning

Introduction to Machine Learning Random Forest: Benchmarking Trees, Forests, and Bagging K-NN

Non-Bayesian Classifiers Part I: k -Nearest Neighbor Classifier and Distance Functions Selim

Monodromy Solver: sequential and parallel joint with Nathan Bliss (UIC), Tim Duff (Georgia Tech),

Pr [ X i ] . i = 1 E [ X ] = Pr [ X i ] . i = 1 Proof: One has E [

Classification and Clustering of RNAseq data Verena Zuber IMISE, University of Leipzig 5th June

QGP-like effects in Small Systems with LHC Run3+ Naghmeh Mohammadi arxiv:1812.06772 (HL-LHC WG5

Size Estimation - Statistical Models for Underreporting Gerhard - PowerPoint PPT Presentation

Size Estimation - Statistical Models for Underreporting Gerhard Neubauer, Gordana Djura & Herwig Friedl JOANNEUM RESEARCH and Technical University, Graz useR! 2009, Rennes 8.-10. July 2009 p. 1 1 Introduction useR! 2009, Rennes

Learning Models from Data with Measurement Error: Tackling Underreporting Roy Adams, Yuelong Ji,

Statistical Natural Language Processing Statistical models: learning, inference, estimation,

STA 214: Probability &amp; Statistical Models STA 214: Analysis of Statistical Models

Lumber Size Lumber Size Control Control Studies Studies Lumber Size Control Lumber Size

Lab 2 discussion Last Time Debugging Its a science use experiments to refine

TRAUMA AND HELP SEEKING How to understand underreporting DENISE SAINT ARNAULT, PHD, RN

Estimating MCMIS Crash File Underreporting from Carrier Records March 22, 2017 Office of

Congressional Budget Office July 21, 2017 Correcting for Underreporting of Government Transfers:

How CBO Adjusts for Underreporting of Means-Tested Transfers in Its Distributional Analyses

Envelope Wages, Underreporting and Tax Evasion: The Case of Turkey Selin Pelek (Galatasaray

Motion Estimation by Affine Transforms Motion Estimation by Affine Transforms Motion Estimation

Lecture 6. Bayesian estimation Lecture 6. Bayesian estimation 1 (172) 6. Bayesian estimation

Sunthud Pornprasertmanit W. Joel Schneider Sample Size Estimation Approach Power

Software size measures and their usefulness for software project estimation Software Size

Statistical Statistical Statistical Model Statistical Model Model Checking Model Checking

Handling array size limitations Handling array size limitations Issue: array size is fixed

C o l o r G l a s s C o n d e n s a t e a n d p a r t o n s a t u

Inference and Estimation Using Nearest Neighbors 2019 The Second Korea-Japan Machine Learning

Introduction to Machine Learning Random Forest: Benchmarking Trees, Forests, and Bagging K-NN

Non-Bayesian Classifiers Part I: k -Nearest Neighbor Classifier and Distance Functions Selim

Monodromy Solver: sequential and parallel joint with Nathan Bliss (UIC), Tim Duff (Georgia Tech),

Pr [ X i ] . i = 1 E [ X ] = Pr [ X i ] . i = 1 Proof: One has E [

Classification and Clustering of RNAseq data Verena Zuber IMISE, University of Leipzig 5th June

QGP-like effects in Small Systems with LHC Run3+ Naghmeh Mohammadi arxiv:1812.06772 (HL-LHC WG5

STA 214: Probability & Statistical Models STA 214: Analysis of Statistical Models