Fifth order

Mean Square Error by polynomial order:

Polynomial order   OHSU    Minn
1                  22.35   23.16
2                  21.22   23.27
3                  16.21   39.03
4                  15.61   36.77
5                  14.14   44.55
Sixth order

Mean Square Error by polynomial order:

Polynomial order   OHSU    Minn
1                  22.35   23.16
2                  21.22   23.27
3                  16.21   39.03
4                  15.61   36.77
5                  14.14   44.55
6                  14.13   49.96
Take-home message: Testing performance on the same data used to fit a model leads to overfitting. Do not do it.
How to know that the best model is a third order polynomial?

Mean Square Error by polynomial order:

Polynomial order   OHSU    Minn
1                  22.35   23.16
2                  21.22   23.27
3                  16.21   39.03
4                  15.61   36.77
5                  14.14   44.55
6                  14.13   49.96
How to know that the best model is a third order polynomial? Use hold-out cross-validation!
Let’s use hold-out cross-validation to fit the most generalizable model for this data set.
Make two partitions: let’s use 90% of the sample for modeling and hold 10% out for testing.
Use the modeling partition to fit the simplest model, then predict the in-sample and out-of-sample data. A reasonable cost function is the mean of the squared residuals.
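As a sketch of this step, assuming a made-up noisy cubic data set in place of the OHSU/Minn data (which is not available here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the slide's data: a noisy cubic trend.
x = np.linspace(-2, 2, 100)
y = 0.5 * x**3 - x + rng.normal(0, 1.0, size=x.size)

# Make two partitions: 90% for modeling, 10% held out for testing.
idx = rng.permutation(x.size)
model_idx, test_idx = idx[:90], idx[90:]

# Fit the simplest model (order 1) on the modeling partition only.
coeffs = np.polyfit(x[model_idx], y[model_idx], deg=1)

# Cost function: mean of the squared residuals, in- and out-of-sample.
mse_in = np.mean((np.polyval(coeffs, x[model_idx]) - y[model_idx]) ** 2)
mse_out = np.mean((np.polyval(coeffs, x[test_idx]) - y[test_idx]) ** 2)
```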
Resample and repeat, keeping track of the errors.
Repeat N times.
Increase model complexity (polynomial order), keeping track of the errors.
Third order
Fourth order
Visualize the results and pick the best model (lowest out-of-sample prediction error). Notice how the in-sample (modeling) error keeps decreasing as order increases: OVERFITTING.
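The whole loop — resample the split, fit every order, track both errors, pick the order with the lowest out-of-sample error — can be sketched as follows (same hypothetical cubic data as above, not the real study data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data: a noisy cubic trend.
x = np.linspace(-2, 2, 100)
y = 0.5 * x**3 - x + rng.normal(0, 1.0, size=x.size)

orders = range(1, 7)
in_err = {d: [] for d in orders}
out_err = {d: [] for d in orders}

# Resample the 90/10 split many times; for each split, fit every order.
for _ in range(100):
    idx = rng.permutation(x.size)
    m, t = idx[:90], idx[90:]
    for d in orders:
        c = np.polyfit(x[m], y[m], deg=d)
        in_err[d].append(np.mean((np.polyval(c, x[m]) - y[m]) ** 2))
        out_err[d].append(np.mean((np.polyval(c, x[t]) - y[t]) ** 2))

# Pick the order with the lowest average out-of-sample error; the
# in-sample error keeps shrinking with order (overfitting).
best = min(orders, key=lambda d: np.mean(out_err[d]))
```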
Take-home message: Cross-validation is a useful tool for predictive modeling. Partial least squares regression requires cross-validation to avoid overfitting.
Generating null-hypothesis data: Why is it important to generate a null distribution?
How do you know that your model behaves better than chance?
• What is chance in the context of modeling and hold-out cross-validation?
Let’s suppose this is your data (original data):

9y₁ − 7y₂ + ⋯ − 4yₒ = 21
−y₁ + 9y₂ + ⋯ + 2yₒ = 19
2y₁ + 7y₂ + ⋯ + 2yₒ = 77
1y₁ − 6y₂ + ⋯ + 1yₒ = 20
7y₁ − 2y₂ + ⋯ − 9yₒ = 62
Make two random partitions of the original data, "Modeling" and "Validation": each row (equation) of the system above is assigned at random to one of the two partitions.
Randomize the pairing of predictors and outcomes in the partition used for modeling: keep the predictor rows, but shuffle the outcome values among them (here 21 → 77, 77 → 20, 20 → 21, with 19 and 62 unchanged). The validation partition keeps its original pairing.
Estimate out-of-sample performance:
- Calculate the model on the "Modeling" partition (with its shuffled outcomes)
- Predict the outcome on the "Validation" partition
- Estimate the goodness of the fit: mean square error
Repeat with a new random shuffle of the modeling outcomes (here 19 → 62, 20 → 19, 62 → 20, with 21 and 77 unchanged) and keep track of the errors:
- Calculate the model on the "Modeling" partition
- Predict the outcome on the "Validation" partition
- Estimate the goodness of the fit: mean square error
Compare the distributions of out-of-sample mean square errors for the real and the shuffled data to determine whether your model predicts better than chance!
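The shuffling procedure above can be sketched end to end. The data here are synthetic (a known linear signal plus noise) purely for illustration; the point is the comparison of the two error distributions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: outcome z linearly related to a few of 5 predictors.
n = 100
X = rng.normal(size=(n, 5))
z = X @ np.array([1.5, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 1.0, size=n)

def holdout_mse(X, z, shuffle, rng, n_test=10):
    idx = rng.permutation(len(z))
    m, t = idx[:-n_test], idx[-n_test:]
    z_m = z[m].copy()
    if shuffle:
        rng.shuffle(z_m)  # break the predictor-outcome pairing (null model)
    b, *_ = np.linalg.lstsq(X[m], z_m, rcond=None)
    return np.mean((X[t] @ b - z[t]) ** 2)

real = [holdout_mse(X, z, False, rng) for _ in range(500)]
null = [holdout_mse(X, z, True, rng) for _ in range(500)]
# The real pairing should give systematically lower out-of-sample
# error than the shuffled (chance) pairing.
```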
Example using neuroimaging data: cross-validation, regularization, and PLSR (the fconn_regression tool).
I’ll use as a case study cueing for freezing of gait in Parkinson’s disease. Freezing of gait, a pretty descriptive name, is an additional symptom present in some patients. Freezing can lead to falls, which add an extra burden in Parkinson’s disease.
https://en.wikipedia.org/wiki/Parkinson's_disease
http://parkinsonteam.blogspot.com/2011/10/prevencion-de-caidas-en-personas-con.html
Auditory cues, like beats at a constant rate, are an effective intervention to reduce freezing episodes in some patients (open loop).
Ashoori A, Eagleman DM, Jankovic J. Effects of Auditory Rhythm and Music on Gait Disturbances in Parkinson’s Disease. Front Neurol 2015.
The goal of the study is to determine whether improvement after cueing can be predicted by resting state functional connectivity.
Available data: resting state functional MRI.
Approach
1. Calculate rs-fconn
   • Group data per functional network pairs: Default-Default, Default-Visual, …
2. Use PLSR and cross-validation to determine whether improvement can be predicted using connectivity from specific brain networks
3. Explore outputs
4. Report findings
The first step is to calculate resting state functional connectivity and group the data per functional system pairs.
PLSR and cross-validation: this can be done using the fconn_regression tool.

Parameters:
• Partition size: hold one out, hold three out
• How many components: 2, 3, 4, …
• Number of repetitions: 100? 500? …
• Calculate null-hypothesis data; number of repetitions: 10,000?
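fconn_regression itself is not reproduced here; as a sketch of the same idea, here is a minimal NIPALS-style PLS1 regression with repeated hold-three-out cross-validation in plain NumPy, run on made-up connectivity-like data (all sizes and signals are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def pls1_coefficients(X, z, n_comp):
    # Minimal NIPALS-style PLS1: extract components, deflate, and return
    # regression weights B = W (P'W)^-1 q (data assumed centered).
    Xr, zr = X.copy(), z.copy()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xr.T @ zr
        w /= np.linalg.norm(w)
        t = Xr @ w
        p = Xr.T @ t / (t @ t)
        W.append(w); P.append(p); q.append((zr @ t) / (t @ t))
        Xr = Xr - np.outer(t, p)
        zr = zr - q[-1] * t
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

# Hypothetical connectivity-like data: 60 subjects, 200 features,
# with the outcome driven by only a few features.
n, n_feat = 60, 200
X = rng.normal(size=(n, n_feat))
z = X[:, :3] @ np.array([1.0, -0.5, 0.8]) + rng.normal(0, 0.5, size=n)

# Repeated hold-three-out cross-validation with 2 components.
errs = []
for _ in range(100):
    idx = rng.permutation(n)
    m, t = idx[:-3], idx[-3:]
    mu_x, mu_z = X[m].mean(0), z[m].mean()
    B = pls1_coefficients(X[m] - mu_x, z[m] - mu_z, n_comp=2)
    pred = (X[t] - mu_x) @ B + mu_z
    errs.append(np.mean((pred - z[t]) ** 2))
```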
Comparing distributions of prediction errors (mean square error) for real versus null-hypothesis data, sorted by Cohen effect size. The three strongest network pairs: visual and subcortical (effect size = 0.87), auditory and default (effect size = 0.81), somatosensory lateral and ventral attention (effect size = 0.78).
We have a virtual machine and a working example. Let us know if you are interested in a break-out session.
Topics
• Partial least squares regression
• Feature selection
• Cross-validation
• Null distribution/permutations
• An example
• Regularization
• Truncated singular value decomposition
• Connectotyping: model-based functional connectivity
• Example: models that generalize across datasets!
Regularization: truncated singular value decomposition
Three cases: # measurements = # variables, # measurements > # variables, # measurements < # variables.

The system 4 = 2B has a unique solution, B = 2: one measurement, one variable.

What about repeated measurements (real data, with noise)?
4.0 = 2.0B → B = 2.00
3.9 = 2.1B → B ≈ 1.86
Select the solution with the lowest mean square error! Writing the system as z = yB, with y = [2.0, 2.1]ᵀ and z = [4.0, 3.9]ᵀ, linear algebra (the y pseudo-inverse) gives B = (y′y)⁻¹y′z ≈ 1.925. This B minimizes Σ residuals².

What about (real) limited data? In 8 = 4β + γ there are 2 variables (β and γ) and 1 measurement. Solving the system, γ = 8 − 4β: all the points on the line γ = 8 − 4β solve the system. In other words, there is an infinite number of solutions!
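The three cases can be checked numerically; `np.linalg.lstsq` computes the same pseudo-inverse (least squares) solution as the formula above:

```python
import numpy as np

# Case 1: as many measurements as variables -> a unique solution.
B_exact = 4.0 / 2.0            # 4 = 2B  ->  B = 2.00

# Case 2: more (noisy) measurements than variables -> least squares.
y = np.array([[2.0], [2.1]])   # predictor column
z = np.array([4.0, 3.9])       # noisy outcomes
B_ls, *_ = np.linalg.lstsq(y, z, rcond=None)
# Same as the pseudo-inverse formula B = (y'y)^-1 y'z; this B
# minimizes the sum of squared residuals.

# Case 3: fewer measurements than variables -> infinitely many solutions.
# 8 = 4*beta + gamma: every point on the line gamma = 8 - 4*beta works.
beta = 1.0
gamma = 8 - 4 * beta           # one of infinitely many valid pairs
```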
What if you can’t reduce the number of features? Regularization is a powerful approach to handle this kind of problem (ill-posed systems).
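One such regularizer, the truncated SVD named on the section slide, keeps only the largest singular values when inverting the system. A minimal sketch, using random data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def tsvd_solve(X, z, k):
    # Keep only the k largest singular values when inverting X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T @ ((U[:, :k].T @ z) / s[:k])

# Ill-posed toy system: 20 measurements, 50 variables.
X = rng.normal(size=(20, 50))
z = rng.normal(size=20)

b_full = np.linalg.pinv(X) @ z   # all singular values: fits z exactly
b_trunc = tsvd_solve(X, z, k=5)  # truncated: smaller, more stable weights
```

Discarding the small singular values trades a little in-sample fit for coefficients that are far less sensitive to noise.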
We know that the pseudo-inverse offers the optimal solution (lowest least squares error) for systems with more measurements than variables.
We can also use the pseudo-inverse to calculate a solution in systems with more variables than measurements.
Example: imagine a given outcome can be predicted by 379 variables:

1) z = β₁y₁ + β₂y₂ + ⋯ + β₃₇₉y₃₇₉
And that you have 163 observations, each with its own predictor and outcome values:

1) z(1) = β₁y₁(1) + β₂y₂(1) + ⋯ + β₃₇₉y₃₇₉(1)
2) z(2) = β₁y₁(2) + β₂y₂(2) + ⋯ + β₃₇₉y₃₇₉(2)
3) z(3) = β₁y₁(3) + β₂y₂(3) + ⋯ + β₃₇₉y₃₇₉(3)
…
163) z(163) = β₁y₁(163) + β₂y₂(163) + ⋯ + β₃₇₉y₃₇₉(163)
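A sketch with random stand-in data of the shape from the slides (163 observations, 379 predictors; real data would replace the random arrays): because there are more variables than observations, the pseudo-inverse fits the modeling data perfectly even when the outcome is pure noise.

```python
import numpy as np

rng = np.random.default_rng(5)

# Random stand-in data with the slide's shape: 163 obs, 379 predictors.
X = rng.normal(size=(163, 379))
z = rng.normal(size=163)

beta = np.linalg.pinv(X) @ z     # minimum-norm pseudo-inverse solution
mse_in = np.mean((X @ beta - z) ** 2)
# mse_in is essentially zero: the in-sample fit is perfect even though
# the outcome here is pure noise -- a warning sign, not a success.
```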
Using the pseudo-inverse you can obtain a solution with high predictability.
This solution, however, is problematic:
• unstable beta weights
• overfitting
• not applicable to outside datasets
What does “unstable beta weights” mean? Let’s suppose age and weight are two variables used in your model. For one participant you used:
• Age: 10.0 years
• Weight: 70 pounds
• Corresponding outcome: a “score” of 3.7
There was, however, an error in data collection, and the real values are:
• Age: 10.5 years
• Weight: 71 pounds
Updating predictions in the same model with the corrected values:
• Stable beta weights: score ≈ 3.9
• Unstable beta weights: score ≈ −344,587.42
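The −344,587.42 on the slide is illustrative, but the mechanism is easy to reproduce: when two predictors are nearly collinear, least squares assigns huge offsetting weights, and a tiny measurement correction swings the prediction. All numbers below are made up for illustration:

```python
import numpy as np

# Made-up data: weight is almost exactly 7 * age across participants,
# so the two predictors are nearly collinear.
X = np.array([[10.0, 70.0],
              [12.0, 84.01],
              [14.0, 98.0]])
scores = np.array([3.7, 4.4, 5.3])

beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
# beta contains large, offsetting weights: "unstable beta weights".

p_before = np.array([10.0, 70.0]) @ beta   # age 10.0 years, weight 70 lb
p_after = np.array([10.5, 71.0]) @ beta    # the corrected measurements
# A half-year, one-pound correction moves the predicted score wildly.
```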