(U) A Method for Regression Analysis on Sparse Datasets
Daniel Barkmeyer, NRO CAAG
June 2015
Background
• Traditional regression analysis, e.g. Zero-Bias Minimum Percent Error (ZMPE), can run into problems when important independent variables are known only for some datapoints (sparsely populated)
  • Omit data for which not all drivers tested are known, or
  • Do not test as drivers those data fields that are not fully populated
• NRO CAAG's Commercial-like Acquisition Program Study (CAPS)* ameliorated this issue
  • Empirically derived scoring term based on known drivers
  • Scores independent of unknown drivers
  • Regression determines contribution of drivers to score, and coefficients expressing the dependent variable as a function of score
  • Linear regression only
OBJECTIVE: Apply score-based regression to power-form functions with multiplicative error terms

* Alvarado, W., Barkmeyer, D., and Burgess, E. "Commercial-like Acquisitions: Practices and Costs." Journal of Cost Analysis and Parametrics, Vol. 3, Issue 1.
Regression on Sparse Datasets
• Advantage – retains the explanatory power of sparsely-populated drivers and the degrees of freedom in regressions derived from sparsely-populated datasets
• For a given independent variable n, if x_n is unknown for a datapoint, the influence of n is removed from the score for that datapoint
• Datapoint can be retained in the regression as long as some x_n are known
• Allows all partially-populated datapoints to inform regression (see the sketch after the table)

| Data Point | Cost | Weight | Operating Wavelength | Mobile (1) or Stationary (0) | Operational (1) or Experimental (0) | ZMPE Regression | Scoring Regression |
|---|---|---|---|---|---|---|---|
| 1  | $18  | 154 | 250 | 0 | 0 | Include | Include: S = f(W, λ, Mob, Op) |
| 2  | $95  | 650 | –   | – | 1 | Omit    | Include: S = f(W, Op) |
| 3  | $54  | –   | 450 | 0 | – | Omit    | Include: S = f(λ, Mob) |
| 4  | $52  | 310 | 500 | – | 1 | Omit    | Include: S = f(W, λ, Op) |
| 5  | $68  | 776 | 450 | 0 | 0 | Include | Include: S = f(W, λ, Mob, Op) |
| 6  | $165 | 490 | 505 | 1 | 1 | Include | Include: S = f(W, λ, Mob, Op) |
| 7  | $307 | 900 | 800 | – | 0 | Omit    | Include: S = f(W, λ, Op) |
| 8  | $60  | 100 | –   | 1 | 1 | Omit    | Include: S = f(W, Mob, Op) |
| 9  | $123 | 281 | 550 | 1 | 0 | Include | Include: S = f(W, λ, Mob, Op) |
| 10 | $82  | 200 | 380 | 1 | – | Omit    | Include: S = f(W, λ, Mob) |

ZMPE regression: with 4 drivers, 0 DOF. Scoring regression: with 4 drivers, 6 DOF.
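As a rough illustration of the retention rule above (not the NRO tool itself), the Python sketch below counts which rows of the example table each approach can use; the driver values are the ones shown in the table, with None marking an unknown value.

```python
# Illustrative driver values from the 10-point example table (None = unknown).
# Columns: weight, wavelength, mobile, operational.
data = [
    [154,  250,  0,    0   ],
    [650,  None, None, 1   ],
    [None, 450,  0,    None],
    [310,  500,  None, 1   ],
    [776,  450,  0,    0   ],
    [490,  505,  1,    1   ],
    [900,  800,  None, 0   ],
    [100,  None, 1,    1   ],
    [281,  550,  1,    0   ],
    [200,  380,  1,    None],
]

n_drivers = 4

# Traditional (ZMPE-style) regression: a row is usable only if every driver is known.
zmpe_rows = [row for row in data if all(v is not None for v in row)]

# Score-based regression: a row is usable as long as at least one driver is known;
# the unknown drivers simply drop out of that row's score.
score_rows = [row for row in data if any(v is not None for v in row)]

print("ZMPE rows retained:", len(zmpe_rows),
      "-> DOF =", len(zmpe_rows) - n_drivers)    # 4 - 4 = 0
print("Score-based rows retained:", len(score_rows),
      "-> DOF =", len(score_rows) - n_drivers)   # 10 - 4 = 6
```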
Scoring Method for Power Function Form
• For a linear regression equation in the CAPS model, the score was calculated as a weighted average of the normalized known drivers:

$$S_{linear} = \frac{\sum_{n\,\text{known}} w_n \, \bar{x}_n}{\sum_{n\,\text{known}} w_n}$$

• For a power function equation, the desired form of the score is a weighted geometric mean of the normalized known drivers:

$$S_{power} = e^{\left(\sum_{n\,\text{known}} w_n \, \overline{\ln x_n}\right) \big/ \left(\sum_{n\,\text{known}} w_n\right)}
\quad\text{where}\quad
\overline{\ln x_n} =
\begin{cases}
\dfrac{\ln x_n - (\ln x_n)_{min}}{(\ln x_n)_{max} - (\ln x_n)_{min}}, & x_n \text{ continuous} \\[2ex]
\dfrac{x_n - (x_n)_{min}}{(x_n)_{max} - (x_n)_{min}}, & x_n \text{ binary}
\end{cases}$$

• Where x_n is not known, the n-th term drops out of the numerator and denominator in the score
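A minimal Python sketch of the power-form score, assuming the normalization defined above; the weights, driver ranges, and sample values below are hypothetical placeholders, not values from the study.

```python
import math

def normalized_log(x, x_min, x_max, continuous=True):
    """Normalize a driver value to [0, 1]: log-scaled for continuous drivers,
    linear for binary drivers, per the definition on this slide."""
    if continuous:
        return (math.log(x) - math.log(x_min)) / (math.log(x_max) - math.log(x_min))
    return (x - x_min) / (x_max - x_min)

def power_score(x, weights, mins, maxs, continuous):
    """Weighted geometric-mean score S_power for one datapoint.
    Drivers whose value is None drop out of numerator and denominator."""
    num = 0.0
    den = 0.0
    for n, xn in enumerate(x):
        if xn is None:          # unknown driver: no influence on this point's score
            continue
        num += weights[n] * normalized_log(xn, mins[n], maxs[n], continuous[n])
        den += weights[n]
    return math.exp(num / den)

# Hypothetical weights and ranges, just to exercise the function.
weights    = [0.4, 0.3, 0.2, 0.1]
mins, maxs = [100, 1, 0, 0], [2000, 120, 1, 1]
continuous = [True, True, False, False]

print(power_score([500, 15, 1, None], weights, mins, maxs, continuous))
```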
Power Function Form
• It can be shown that this form for the score reduces the regression equation

$$y = A + B \cdot S_{power}^{\;C}$$

to the desired power function form:

$$y = A + Q \cdot \prod_{x_n\,\text{cont.}} x_n^{P_n} \cdot \prod_{x_n\,\text{bin.}} P_n^{x_n}$$

with constants P_n and Q functions of B, C, the weightings, and the max/min values of the drivers (all constants)
• This form can be used on sparse or full datasets
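The re-conversion from the score constants to the power-form constants follows from expanding the exponential term by term under the normalization defined on the previous slide: each continuous driver contributes x_n^{P_n} with its leftover minimum-value factor absorbed into Q, and each binary driver contributes P_n^{x_n}. The sketch below illustrates that algebra; it is not the study's implementation, and the function name is illustrative.

```python
import math

def score_to_power_form(B, C, weights, mins, maxs, continuous):
    """Expand y = A + B * S_power**C into y = A + Q * prod(x_n**P_n) * prod(P_n**x_n).
    A passes through unchanged, so only Q and the P_n are returned."""
    W = sum(weights)
    P = []
    Q = B
    for n, w in enumerate(weights):
        if continuous[n]:
            k = 1.0 / (math.log(maxs[n]) - math.log(mins[n]))
            p = C * w * k / W        # exponent on the continuous driver x_n
            Q *= mins[n] ** (-p)     # constant (x_n,min)^(-P_n) factor absorbed into Q
        else:
            p = math.exp(C * w / W)  # base raised to the binary driver x_n
        P.append(p)
    return Q, P
```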
Dataset for Testing
• Created a dataset representative of a typical cost estimating problem
• 100 datapoints, 4 independent variable drivers
  • Driver 1: continuous, lognormally distributed, mean value 500, coefficient of variation 0.65, minimum value 100
  • Driver 2: continuous, lognormally distributed, mean value 15, coefficient of variation 5.0, minimum value 0
  • Driver 3: binary, 33% of data has value of 1
  • Driver 4: binary, 50% of data has value of 1
• Dependent variable values set by the equation

$$y = 100 + 20 \cdot x_1^{0.6} \cdot x_2^{0.3} \cdot 3^{x_3} \cdot 1.2^{x_4} \cdot \varepsilon$$

(i.e., A = 100, Q = 20, P_1 = 0.6, P_2 = 0.3, P_3 = 3, P_4 = 1.2), with error term ε lognormally distributed, mean value 1, coefficient of variation 0.4, minimum value 0
• Underlying behavior of the data is known
• Regression results can be compared against the expected result (a generation sketch follows below)
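One plausible way to generate a dataset with these properties in NumPy is sketched below. Treating the stated minimums as shifts of a lognormal whose mean and coefficient of variation match the slide's figures, and the choice of random seed, are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

def lognormal(mean, cv, size, rng):
    """Draw lognormal samples with a given arithmetic mean and coefficient of variation."""
    sigma2 = np.log(1.0 + cv**2)
    mu = np.log(mean) - sigma2 / 2.0
    return rng.lognormal(mu, np.sqrt(sigma2), size)

# Driver 1: mean 500, CV 0.65, minimum 100 -- read here as 100 plus a lognormal
# with mean 400 and the same absolute spread (sd = 0.65 * 500).
x1 = 100 + lognormal(500 - 100, 0.65 * 500 / (500 - 100), N, rng)
x2 = lognormal(15, 5.0, N, rng)                  # mean 15, CV 5.0, minimum 0
x3 = (rng.random(N) < 0.33).astype(int)          # binary, ~33% ones
x4 = (rng.random(N) < 0.50).astype(int)          # binary, ~50% ones

eps = lognormal(1.0, 0.4, N, rng)                # multiplicative error: mean 1, CV 0.4

# Dependent variable per the slide: A=100, Q=20, P1=0.6, P2=0.3, P3=3, P4=1.2
y = 100 + 20 * x1**0.6 * x2**0.3 * 3**x3 * 1.2**x4 * eps
```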
Validation – 100% Populated
• For the test dataset, regression of the form

$$y = \left(A + Q \cdot x_1^{P_1} \cdot x_2^{P_2} \cdot P_3^{x_3} \cdot P_4^{x_4}\right) \cdot \varepsilon$$

with objective to minimize the value

$$f_{obj} = \sum_{n=1}^{100} \varepsilon_n^2$$

• ZMPE: optimize A, Q, P_1, P_2, P_3, P_4
• Score-Based: convert the regression equation to

$$y = \left(A + B \cdot e^{\,C \cdot \left(w_1 \overline{\ln x_1} + w_2 \overline{\ln x_2} + w_3 \overline{\ln x_3} + w_4 \overline{\ln x_4}\right) / \left(w_1 + w_2 + w_3 + w_4\right)}\right) \cdot \varepsilon$$

then optimize A, B, C, w_1, w_2, w_3, w_4 and re-convert

• ZMPE solution: A = 0.0, Q = 18.9, P_1 = 0.63, P_2 = 0.34, P_3 = 2.89, P_4 = 1.14
• Score-Based solution: A = 0.0, B = 32.1, C = 7.10, w_1 = 17%, w_2 = 66%, w_3 = 15%, w_4 = 2%
• Score-Based solution re-converted: A = 0.0, Q = 18.9, P_1 = 0.63, P_2 = 0.34, P_3 = 2.89, P_4 = 1.14

Score-Based method reproduces the ZMPE solution on a fully-populated dataset (a fitting sketch follows below)
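A hedged sketch of a ZMPE-style fit for the power form above, using SciPy. The slide states only the sum-of-squared-error objective; the zero-bias constraint, bounds, and starting guess here are assumptions based on the usual ZMPE formulation, not details taken from the briefing.

```python
import numpy as np
from scipy.optimize import minimize

def percent_errors(params, x1, x2, x3, x4, y):
    """Multiplicative (percent) error of each point for the power-form CER."""
    A, Q, P1, P2, P3, P4 = params
    y_hat = A + Q * x1**P1 * x2**P2 * P3**x3 * P4**x4
    return y / y_hat - 1.0

def fit_zmpe(x1, x2, x3, x4, y):
    """Minimize the sum of squared percent errors, with a zero-bias constraint
    (sum of percent errors = 0) -- a sketch of the ZMPE idea, not the NRO tool."""
    obj = lambda p: np.sum(percent_errors(p, x1, x2, x3, x4, y) ** 2)
    bias = {"type": "eq",
            "fun": lambda p: np.sum(percent_errors(p, x1, x2, x3, x4, y))}
    p0 = np.array([50.0, 10.0, 0.5, 0.5, 2.0, 1.0])   # rough starting guess
    bounds = [(0.0, None)] * 6                         # keep parameters non-negative
    res = minimize(obj, p0, bounds=bounds, constraints=[bias], method="SLSQP")
    return res.x
```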
Score-Based Regression vs. ZMPE
• Validated that the score-based method is equivalent to ZMPE on a fully-populated dataset
• Next step: sparsely-populated test cases
  • Individual drivers sparsely populated
    • 50 regressions, each with randomly-selected values removed, at every 5% interval of population percent between 100% and 5% populated
    • Other 3 drivers fully populated
  • All drivers sparsely populated
    • 200 regressions, each with randomly-selected values removed, at every 5% interval of overall population percent between 100% and 5%
• Comparison metric: Characteristic Underlying Percent Error (CUPE)
  • CUPE is measured across the entire dataset, including values that were removed to simulate sparseness of data
  • Defined as

$$CUPE = \sqrt{\frac{\sum_{n=1}^{100} \varepsilon_{\%,n}^2}{DOF}}$$

    where ε_{%,n} is the percent error between the actual y and the regression equation's predicted y for the n-th datapoint
  • Measures how well the regression (with incomplete data) captures the underlying relationship (if the data were complete); a sketch of the metric and the masking step follows below
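A sketch of the CUPE metric and the value-removal (masking) step used to simulate sparseness. It assumes DOF is the datapoint count minus the number of drivers, as in the earlier 10-point example, and that CUPE is the root of the mean squared percent error; both are readings of the slide rather than confirmed definitions.

```python
import numpy as np

def cupe(y_actual, y_predicted, n_drivers):
    """Characteristic Underlying Percent Error over the FULL dataset,
    including points whose driver values were hidden during fitting."""
    pct_err = y_actual / y_predicted - 1.0
    dof = len(y_actual) - n_drivers          # assumed: points minus drivers
    return np.sqrt(np.sum(pct_err**2) / dof)

def mask_driver(x, pct_populated, rng):
    """Randomly hide (set to NaN) values of one driver so that only
    `pct_populated` of the points keep it."""
    x_sparse = x.astype(float).copy()
    n_hide = int(round((1.0 - pct_populated) * len(x)))
    hide = rng.choice(len(x), size=n_hide, replace=False)
    x_sparse[hide] = np.nan
    return x_sparse

# Example: keep roughly 30% of driver-1 values from the dataset sketch above.
# rng = np.random.default_rng(1)
# x1_sparse = mask_driver(x1, 0.30, rng)
```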
Score-Based Regression vs. ZMPE – Weaker Continuous Driver Sparse
• Degrees of Freedom
  [Chart: Average Degrees of Freedom of 50 Trial Regressions vs. Weak Continuous Driver (Driver 1) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE regression DOF decreases linearly with % populated
  • Score-based regression retains all DOF from the full dataset
• CUPE of resultant estimating relationship against full dataset
  [Chart: Average CUPE of 50 Trial Regressions vs. Weak Continuous Driver (Driver 1) Percent Populated; Sparse Data Method vs. ZMPE]
  • Models show similar performance down to 30% populated
  • Below 30% populated, the score-based method is better able to model the underlying relationship
Score-Based Regression vs. ZMPE – Stronger Continuous Driver Sparse
• Degrees of Freedom
  [Chart: Average Degrees of Freedom of 50 Trial Regressions vs. Strong Continuous Driver (Driver 2) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE regression DOF decreases linearly with % populated
  • Score-based regression retains all DOF from the full dataset
• CUPE of resultant estimating relationship against full dataset
  [Chart: Average CUPE of 50 Trial Regressions vs. Strong Continuous Driver (Driver 2) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE performs much better above very low population percentages
  • Score-based method only proves better able to capture the underlying relationship once ZMPE DOF becomes very small
Score-Based Regression vs. ZMPE – Binary Driver Sparse
• Degrees of Freedom
  [Chart: Average Degrees of Freedom of 50 Trial Regressions vs. Stratifier (Driver 3) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE regression DOF decreases linearly with % populated
  • Score-based regression retains all DOF from the full dataset
• CUPE of resultant estimating relationship against full dataset
  [Chart: Average CUPE of 50 Trial Regressions vs. Stratifier (Driver 3) Percent Populated; Sparse Data Method vs. ZMPE]
  • ZMPE performs slightly better above 20% populated
  • Score-based method proves better able to capture the underlying relationship once ZMPE DOF becomes small