(U) A Method for Regression Analysis on Sparse Datasets
Daniel Barkmeyer, NRO CAAG
June 2015
Background
• Traditional regression analysis, e.g. Zero-Bias Minimum Percent Error (ZMPE), can run into problems when important independent variables are known only for some datapoints (sparsely populated)
  • Omit data for which not all drivers tested are known, or
  • Do not test as drivers those data fields that are not fully populated
• NRO CAAG's Commercial-like Acquisition Program Study (CAPS)* ameliorated this issue
  • Empirically derived scoring term based on known drivers
  • Scores independent of unknown drivers
  • Regression determines contribution of drivers to score, and coefficients expressing the dependent variable as a function of score
  • Linear regression only
OBJECTIVE: Apply score-based regression to power-form functions with multiplicative error terms

* Alvarado, W., Barkmeyer, D., and Burgess, E. "Commercial-like Acquisitions: Practices and Costs." Journal of Cost Analysis and Parametrics, Vol. 3, Issue 1.
Regression on Sparse Datasets
• Advantage – retains the explanatory power of sparsely-populated drivers and the degrees of freedom in regressions derived from sparsely-populated datasets
• For a given independent variable n, if x_n is unknown for a datapoint, the influence of n is removed from the score for that datapoint
• Datapoint can be retained in the regression as long as some x_n are known
• Allows all partially-populated datapoints to inform regression (see the sketch after the table)

| Data Point | Cost | Weight | Operating Wavelength | Mobile (1) or Stationary (0) | Operational (1) or Experimental (0) | ZMPE Regression | Scoring Regression |
|---|---|---|---|---|---|---|---|
| 1  | $18  | 154 | 250 | 0 | 0 | Include | Include: S = f(W, λ, Mob, Op) |
| 2  | $95  | 650 | –   | – | 1 | Omit    | Include: S = f(W, Op) |
| 3  | $54  | –   | 450 | 0 | – | Omit    | Include: S = f(λ, Mob) |
| 4  | $52  | 310 | 500 | – | 1 | Omit    | Include: S = f(W, λ, Op) |
| 5  | $68  | 776 | 450 | 0 | 0 | Include | Include: S = f(W, λ, Mob, Op) |
| 6  | $165 | 490 | 505 | 1 | 1 | Include | Include: S = f(W, λ, Mob, Op) |
| 7  | $307 | 900 | 800 | – | 0 | Omit    | Include: S = f(W, λ, Op) |
| 8  | $60  | 100 | –   | 1 | 1 | Omit    | Include: S = f(W, Mob, Op) |
| 9  | $123 | 281 | 550 | 1 | 0 | Include | Include: S = f(W, λ, Mob, Op) |
| 10 | $82  | 200 | 380 | 1 | – | Omit    | Include: S = f(W, λ, Mob) |

ZMPE regression: with 4 drivers, 0 DOF. Scoring regression: with 4 drivers, 6 DOF.
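As a rough illustration of the retention rule above (not the NRO tool itself), the Python sketch below counts which rows of the example table each approach can use; the driver values are the ones shown in the table, with None marking an unknown value.

```python
# Illustrative driver values from the 10-point example table (None = unknown).
# Columns: weight, wavelength, mobile, operational.
data = [
    [154,  250,  0,    0   ],
    [650,  None, None, 1   ],
    [None, 450,  0,    None],
    [310,  500,  None, 1   ],
    [776,  450,  0,    0   ],
    [490,  505,  1,    1   ],
    [900,  800,  None, 0   ],
    [100,  None, 1,    1   ],
    [281,  550,  1,    0   ],
    [200,  380,  1,    None],
]

n_drivers = 4

# Traditional (ZMPE-style) regression: a row is usable only if every driver is known.
zmpe_rows = [row for row in data if all(v is not None for v in row)]

# Score-based regression: a row is usable as long as at least one driver is known;
# the unknown drivers simply drop out of that row's score.
score_rows = [row for row in data if any(v is not None for v in row)]

print("ZMPE rows retained:", len(zmpe_rows),
      "-> DOF =", len(zmpe_rows) - n_drivers)    # 4 - 4 = 0
print("Score-based rows retained:", len(score_rows),
      "-> DOF =", len(score_rows) - n_drivers)   # 10 - 4 = 6
```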
Scoring Method for Power Function Form
• For a linear regression equation in the CAPS model, the score was calculated as a weighted average of the normalized known drivers:

$$S_{linear} = \frac{\sum_{n\,\text{known}} w_n \, \bar{x}_n}{\sum_{n\,\text{known}} w_n}$$

• For a power function equation, the desired form of the score is a weighted geometric mean of the normalized known drivers:

$$S_{power} = e^{\left(\sum_{n\,\text{known}} w_n \, \overline{\ln x_n}\right) \big/ \left(\sum_{n\,\text{known}} w_n\right)}
\quad\text{where}\quad
\overline{\ln x_n} =
\begin{cases}
\dfrac{\ln x_n - (\ln x_n)_{min}}{(\ln x_n)_{max} - (\ln x_n)_{min}}, & x_n \text{ continuous} \\[2ex]
\dfrac{x_n - (x_n)_{min}}{(x_n)_{max} - (x_n)_{min}}, & x_n \text{ binary}
\end{cases}$$

• Where x_n is not known, the n-th term drops out of the numerator and denominator in the score
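A minimal Python sketch of the power-form score, assuming the normalization defined above; the weights, driver ranges, and sample values below are hypothetical placeholders, not values from the study.

```python
import math

def normalized_log(x, x_min, x_max, continuous=True):
    """Normalize a driver value to [0, 1]: log-scaled for continuous drivers,
    linear for binary drivers, per the definition on this slide."""
    if continuous:
        return (math.log(x) - math.log(x_min)) / (math.log(x_max) - math.log(x_min))
    return (x - x_min) / (x_max - x_min)

def power_score(x, weights, mins, maxs, continuous):
    """Weighted geometric-mean score S_power for one datapoint.
    Drivers whose value is None drop out of numerator and denominator."""
    num = 0.0
    den = 0.0
    for n, xn in enumerate(x):
        if xn is None:          # unknown driver: no influence on this point's score
            continue
        num += weights[n] * normalized_log(xn, mins[n], maxs[n], continuous[n])
        den += weights[n]
    return math.exp(num / den)

# Hypothetical weights and ranges, just to exercise the function.
weights    = [0.4, 0.3, 0.2, 0.1]
mins, maxs = [100, 1, 0, 0], [2000, 120, 1, 1]
continuous = [True, True, False, False]

print(power_score([500, 15, 1, None], weights, mins, maxs, continuous))
```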
Power Function Form
• It can be shown that this form for the score reduces the regression equation

$$y = A + B \cdot S_{power}^{\;C}$$

to the desired power function form:

$$y = A + Q \cdot \prod_{x_n\,\text{cont.}} x_n^{P_n} \cdot \prod_{x_n\,\text{bin.}} P_n^{x_n}$$

with constants P_n and Q functions of B, C, the weightings, and the max/min values of the drivers (all constants)
• This form can be used on sparse or full datasets
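The re-conversion from the score constants to the power-form constants follows from expanding the exponential term by term under the normalization defined on the previous slide: each continuous driver contributes x_n^{P_n} with its leftover minimum-value factor absorbed into Q, and each binary driver contributes P_n^{x_n}. The sketch below illustrates that algebra; it is not the study's implementation, and the function name is illustrative.

```python
import math

def score_to_power_form(B, C, weights, mins, maxs, continuous):
    """Expand y = A + B * S_power**C into y = A + Q * prod(x_n**P_n) * prod(P_n**x_n).
    A passes through unchanged, so only Q and the P_n are returned."""
    W = sum(weights)
    P = []
    Q = B
    for n, w in enumerate(weights):
        if continuous[n]:
            k = 1.0 / (math.log(maxs[n]) - math.log(mins[n]))
            p = C * w * k / W        # exponent on the continuous driver x_n
            Q *= mins[n] ** (-p)     # constant (x_n,min)^(-P_n) factor absorbed into Q
        else:
            p = math.exp(C * w / W)  # base raised to the binary driver x_n
        P.append(p)
    return Q, P
```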
Dataset for Testing
• Created a dataset representative of a typical cost estimating problem
• 100 datapoints, 4 independent variable drivers
  • Driver 1: continuous, lognormally distributed, mean value 500, coefficient of variation 0.65, minimum value 100
  • Driver 2: continuous, lognormally distributed, mean value 15, coefficient of variation 5.0, minimum value 0
  • Driver 3: binary, 33% of data has value of 1
  • Driver 4: binary, 50% of data has value of 1
• Dependent variable values set by the equation

$$y = 100 + 20 \cdot x_1^{0.6} \cdot x_2^{0.3} \cdot 3^{x_3} \cdot 1.2^{x_4} \cdot \varepsilon$$

(i.e., A = 100, Q = 20, P_1 = 0.6, P_2 = 0.3, P_3 = 3, P_4 = 1.2), with error term ε lognormally distributed, mean value 1, coefficient of variation 0.4, minimum value 0
• Underlying behavior of the data is known
• Regression results can be compared against the expected result (a generation sketch follows below)
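One plausible way to generate a dataset with these properties in NumPy is sketched below. Treating the stated minimums as shifts of a lognormal whose mean and coefficient of variation match the slide's figures, and the choice of random seed, are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

def lognormal(mean, cv, size, rng):
    """Draw lognormal samples with a given arithmetic mean and coefficient of variation."""
    sigma2 = np.log(1.0 + cv**2)
    mu = np.log(mean) - sigma2 / 2.0
    return rng.lognormal(mu, np.sqrt(sigma2), size)

# Driver 1: mean 500, CV 0.65, minimum 100 -- read here as 100 plus a lognormal
# with mean 400 and the same absolute spread (sd = 0.65 * 500).
x1 = 100 + lognormal(500 - 100, 0.65 * 500 / (500 - 100), N, rng)
x2 = lognormal(15, 5.0, N, rng)                  # mean 15, CV 5.0, minimum 0
x3 = (rng.random(N) < 0.33).astype(int)          # binary, ~33% ones
x4 = (rng.random(N) < 0.50).astype(int)          # binary, ~50% ones

eps = lognormal(1.0, 0.4, N, rng)                # multiplicative error: mean 1, CV 0.4

# Dependent variable per the slide: A=100, Q=20, P1=0.6, P2=0.3, P3=3, P4=1.2
y = 100 + 20 * x1**0.6 * x2**0.3 * 3**x3 * 1.2**x4 * eps
```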
Validation – 100% Populated
• For the test dataset, regression of the form

$$y = \left(A + Q \cdot x_1^{P_1} \cdot x_2^{P_2} \cdot P_3^{x_3} \cdot P_4^{x_4}\right) \cdot \varepsilon$$

with objective to minimize the value

$$f_{obj} = \sum_{n=1}^{100} \varepsilon_n^2$$

• ZMPE: optimize A, Q, P_1, P_2, P_3, P_4
• Score-Based: convert the regression equation to

$$y = \left(A + B \cdot e^{\,C \cdot \left(w_1 \overline{\ln x_1} + w_2 \overline{\ln x_2} + w_3 \overline{\ln x_3} + w_4 \overline{\ln x_4}\right) / \left(w_1 + w_2 + w_3 + w_4\right)}\right) \cdot \varepsilon$$

then optimize A, B, C, w_1, w_2, w_3, w_4 and re-convert

• ZMPE solution: A = 0.0, Q = 18.9, P_1 = 0.63, P_2 = 0.34, P_3 = 2.89, P_4 = 1.14
• Score-Based solution: A = 0.0, B = 32.1, C = 7.10, w_1 = 17%, w_2 = 66%, w_3 = 15%, w_4 = 2%
• Score-Based solution re-converted: A = 0.0, Q = 18.9, P_1 = 0.63, P_2 = 0.34, P_3 = 2.89, P_4 = 1.14

Score-Based method reproduces the ZMPE solution on a fully-populated dataset (a fitting sketch follows below)
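A hedged sketch of a ZMPE-style fit for the power form above, using SciPy. The slide states only the sum-of-squared-error objective; the zero-bias constraint, bounds, and starting guess here are assumptions based on the usual ZMPE formulation, not details taken from the briefing.

```python
import numpy as np
from scipy.optimize import minimize

def percent_errors(params, x1, x2, x3, x4, y):
    """Multiplicative (percent) error of each point for the power-form CER."""
    A, Q, P1, P2, P3, P4 = params
    y_hat = A + Q * x1**P1 * x2**P2 * P3**x3 * P4**x4
    return y / y_hat - 1.0

def fit_zmpe(x1, x2, x3, x4, y):
    """Minimize the sum of squared percent errors, with a zero-bias constraint
    (sum of percent errors = 0) -- a sketch of the ZMPE idea, not the NRO tool."""
    obj = lambda p: np.sum(percent_errors(p, x1, x2, x3, x4, y) ** 2)
    bias = {"type": "eq",
            "fun": lambda p: np.sum(percent_errors(p, x1, x2, x3, x4, y))}
    p0 = np.array([50.0, 10.0, 0.5, 0.5, 2.0, 1.0])   # rough starting guess
    bounds = [(0.0, None)] * 6                         # keep parameters non-negative
    res = minimize(obj, p0, bounds=bounds, constraints=[bias], method="SLSQP")
    return res.x
```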
Score-Based Regression vs. ZMPE
• Validated that the score-based method is equivalent to ZMPE on a fully-populated dataset
• Next step: sparsely-populated test cases
  • Individual drivers sparsely populated
    • 50 regressions, each with randomly-selected values removed, at every 5% interval of population percent between 100% and 5% populated
    • Other 3 drivers fully populated
  • All drivers sparsely populated
    • 200 regressions, each with randomly-selected values removed, at every 5% interval of overall population percent between 100% and 5%
• Comparison metric: Characteristic Underlying Percent Error (CUPE)
  • CUPE is measured across the entire dataset, including values that were removed to simulate sparseness of data
  • Defined as

$$CUPE = \sqrt{\frac{\sum_{n=1}^{100} \varepsilon_{\%,n}^2}{DOF}}$$

    where ε_{%,n} is the percent error between the actual y and the regression equation's predicted y for the n-th datapoint
  • Measures how well the regression (with incomplete data) captures the underlying relationship (if the data were complete); a sketch of the metric and the masking step follows below
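A sketch of the CUPE metric and the value-removal (masking) step used to simulate sparseness. It assumes DOF is the datapoint count minus the number of drivers, as in the earlier 10-point example, and that CUPE is the root of the mean squared percent error; both are readings of the slide rather than confirmed definitions.

```python
import numpy as np

def cupe(y_actual, y_predicted, n_drivers):
    """Characteristic Underlying Percent Error over the FULL dataset,
    including points whose driver values were hidden during fitting."""
    pct_err = y_actual / y_predicted - 1.0
    dof = len(y_actual) - n_drivers          # assumed: points minus drivers
    return np.sqrt(np.sum(pct_err**2) / dof)

def mask_driver(x, pct_populated, rng):
    """Randomly hide (set to NaN) values of one driver so that only
    `pct_populated` of the points keep it."""
    x_sparse = x.astype(float).copy()
    n_hide = int(round((1.0 - pct_populated) * len(x)))
    hide = rng.choice(len(x), size=n_hide, replace=False)
    x_sparse[hide] = np.nan
    return x_sparse

# Example: keep roughly 30% of driver-1 values from the dataset sketch above.
# rng = np.random.default_rng(1)
# x1_sparse = mask_driver(x1, 0.30, rng)
```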
Score-Based Regression vs. ZMPE – Weaker Continuous Driver Sparse
• Degrees of Freedom
  [Chart: Average Degrees of Freedom of 50 Trial Regressions vs. Weak Continuous Driver (Driver 1) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE regression DOF decreases linearly with % populated
  • Score-based regression retains all DOF from the full dataset
• CUPE of resultant estimating relationship against full dataset
  [Chart: Average CUPE of 50 Trial Regressions vs. Weak Continuous Driver (Driver 1) Percent Populated; Sparse Data Method vs. ZMPE]
  • Models show similar performance down to 30% populated
  • Below 30% populated, the score-based method is better able to model the underlying relationship
Score-Based Regression vs. ZMPE – Stronger Continuous Driver Sparse
• Degrees of Freedom
  [Chart: Average Degrees of Freedom of 50 Trial Regressions vs. Strong Continuous Driver (Driver 2) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE regression DOF decreases linearly with % populated
  • Score-based regression retains all DOF from the full dataset
• CUPE of resultant estimating relationship against full dataset
  [Chart: Average CUPE of 50 Trial Regressions vs. Strong Continuous Driver (Driver 2) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE performs much better above very low population percentages
  • Score-based method only proves better able to capture the underlying relationship once ZMPE DOF becomes very small
Score-Based Regression vs. ZMPE – Binary Driver Sparse
• Degrees of Freedom
  [Chart: Average Degrees of Freedom of 50 Trial Regressions vs. Stratifier (Driver 3) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE regression DOF decreases linearly with % populated
  • Score-based regression retains all DOF from the full dataset
• CUPE of resultant estimating relationship against full dataset
  [Chart: Average CUPE of 50 Trial Regressions vs. Stratifier (Driver 3) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE performs slightly better above 20% populated
  • Score-based method proves better able to capture the underlying relationship once ZMPE DOF becomes small