tidyfun: Tidy Functional Data. A new framework for working with functional data in R


  1. tidyfun: Tidy Functional Data
     A new framework for working with functional data in R
     Fabian Scheipl (1), Jeff Goldsmith (2)
     (1) Dept. of Statistics, LMU Munich
     (2) Columbia University Mailman School of Public Health

  2. Functional Data
     Painful to work with:
     ◮ huge amounts of data
     ◮ regular grids? irregular grids?
     ◮ work with:
        ◮ raw data?
        ◮ smooth/interpolated?
        ◮ basis representations?

  3. Functional Data
     Painful: two (2.5, actually) bad options for keeping it in the same data.frame as the rest of your data:
     1. wide format:
     ◮ way too many weird columns
     ◮ need to keep track of argument values t separately somehow:

         ## Observations: 382
         ## Variables: 96
         ## $ id        <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009...
         ## $ sex       <fct> female, female, male, male, male, male, male, male, ...
         ## $ case      <fct> control, control, control, control, control, control...
         ## $ cca_0     <dbl> 0.4909345, 0.4721627, 0.5023738, 0.4021894, 0.401874...
         ## $ cca_0.011 <dbl> 0.5168018, 0.4868219, 0.5136516, 0.4225127, 0.405580...
         ## $ cca_0.022 <dbl> 0.5356539, 0.5022577, 0.5392542, 0.4398983, 0.398548...
         ## $ cca_0.033 <dbl> 0.5553587, 0.5233635, 0.5742101, 0.4600235, 0.386000...
         ## $ cca_0.043 <dbl> 0.5927610, 0.5524401, 0.6031339, 0.4751297, 0.408824...
         ## $ cca_0.054 <dbl> 0.6326935, 0.5872003, 0.6335913, 0.4990257, 0.425183...
         ## $ cca_0.065 <dbl> 0.6500317, 0.5968158, 0.6357108, 0.5165528, 0.429537...
         ## $ cca_0.076 <dbl> 0.6556130, 0.6026607, 0.6350799, 0.5552692, 0.444545...
         ## $ cca_0.087 <dbl> 0.6493701, 0.5922767, 0.6201638, 0.5826485, 0.486943...
         ## $ cca_0.098 <dbl> 0.6378739, 0.5791859, 0.6086281, 0.6005767, 0.510038...
         ## $ cca_0.109 <dbl> 0.6286463, 0.5714253, 0.5910287, 0.6135842, 0.537097...
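
The bookkeeping problem with the wide format can be made concrete with a small base R sketch. The data below are hypothetical toy values (three argument values instead of 93), and the `cca_` prefix handling is an assumption based on the column names shown on the slide.

```r
# Hypothetical toy data in wide format: one column per argument value.
wide <- data.frame(
  id      = c(1001, 1002),
  cca_0   = c(0.49, 0.47),
  cca_0.5 = c(0.52, 0.49),
  cca_1   = c(0.54, 0.50)
)

# The argument values t exist only as text inside the column names,
# so they have to be parsed back out before any computation.
fun_cols <- grep("^cca_", names(wide), value = TRUE)
arg <- as.numeric(sub("^cca_", "", fun_cols))

# Evaluations as a curves-by-arguments matrix.
values <- as.matrix(wide[, fun_cols])

print(arg)
print(dim(values))
```

Every later step that needs t (plotting, interpolation, integration) has to repeat or carry around this parsing step.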

  4. Functional Data
     Painful: two (2.5, actually) bad options for keeping it in the same data.frame as the rest of your data:
     2. long format:
     ◮ unwieldy numbers of rows, lots of duplication for non-functional data
     ◮ need to keep track of grouping structure (which rows belong to the same curve?) throughout
     ◮ infeasible if we have more than one function per observational unit

         ## Observations: 35,526
         ## Variables: 6
         ## $ id        <dbl> 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001...
         ## $ sex       <fct> female, female, female, female, female, female, fema...
         ## $ case      <fct> control, control, control, control, control, control...
         ## $ cca_id    <chr> "1001_1", "1001_1", "1001_1", "1001_1", "1001_1", "1...
         ## $ cca_arg   <dbl> 0.000, 0.011, 0.022, 0.033, 0.043, 0.054, 0.065, 0.0...
         ## $ cca_value <dbl> 0.4909345, 0.5168018, 0.5356539, 0.5553587, 0.592761...
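
A minimal base R sketch of the long format, again with hypothetical toy values, shows both complaints: scalar covariates are duplicated on every row, and any per-curve operation has to regroup the rows first.

```r
# Hypothetical toy data: 2 subjects, 3 argument values each.
arg <- c(0, 0.5, 1)
long <- data.frame(
  id        = rep(c(1001, 1002), each = length(arg)),
  sex       = rep(c("female", "male"), each = length(arg)),
  cca_arg   = rep(arg, times = 2),
  cca_value = c(0.49, 0.52, 0.54, 0.47, 0.49, 0.50)
)

# Scalar covariates such as sex are repeated once per evaluation.
n_rows <- nrow(long)

# Any per-curve summary must regroup the rows by curve first.
curve_means <- tapply(long$cca_value, long$id, mean)

print(n_rows)
print(curve_means)
```

With 382 curves on 93 grid points this is where the 35,526 rows in the output above come from.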

  5. Functional Data
     Painful to work with: third bad option: matrix columns in a data.frame. Sucks, too:
     ◮ not really well supported (breaks lots of tidyverse stuff, much unexpected behavior in base)
     ◮ more trouble than it’s worth: doesn’t solve how to keep track of argument values
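
One way the matrix-column approach loses track of argument values can be sketched in base R. Storing t as an attribute named `arg` is an assumption made for illustration here, not something the slides prescribe; the values are hypothetical.

```r
# Hypothetical toy data: 2 curves evaluated at 3 argument values.
arg <- c(0, 0.5, 1)
cca <- matrix(c(0.49, 0.47, 0.52, 0.49, 0.54, 0.50), nrow = 2)

# One (assumed) way to attach the argument values: a custom attribute.
attr(cca, "arg") <- arg

df <- data.frame(id = c(1001, 1002))
df$cca <- cca            # matrix column: one data frame column, three values per row

# Ordinary subsetting silently drops the custom attribute.
sub <- df$cca[1, , drop = FALSE]

print(ncol(df))
print(is.null(attr(sub, "arg")))
```

The evaluations survive subsetting, but the information about where they were measured does not.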

  6. Functional Data
     Despite all that, people keep measuring ever more of the damn things. Let’s make dealing with functional data in R less painful.

  7. Start at the end...

  8. This is what we’re aiming for:

         # group-wise functional medians:
         medians <- dti %>%
           group_by(case, sex) %>%
           summarize(median_rcst = median(rcst))
         ggplot(medians) +
           geom_spaghetti(aes(y = median_rcst, col = sex, linetype = case))

     [Figure: group-wise median rcst curves over (0, 1); color by sex (male, female), line type by case (control, MS)]

  9. This is what we’re aiming for:

         dti[, -1]
         ## # A tibble: 382 x 4
         ##    sex    case    cca                            rcst
         ##    <fct>  <fct>   <tfd>                          <tfd>
         ##  1 female control (0.000,0.49);(0.011,0.52);(... (0.000,0.26);(0.019,0.45);(...
         ##  2 female control (0.000,0.47);(0.011,0.49);(... ( 0.22,0.44);( 0.24,0.48);(...
         ##  3 male   control (0.000,0.50);(0.011,0.51);(... ( 0.22,0.42);( 0.24,0.41);(...
         ##  4 male   control (0.000,0.40);(0.011,0.42);(... (0.000,0.51);(0.019,0.50);(...
         ##  5 male   control (0.000,0.40);(0.011,0.41);(... ( 0.22,0.40);( 0.24,0.41);(...
         ##  6 male   control (0.000,0.45);(0.011,0.45);(... (0.056,0.47);(0.074,0.49);(...
         ##  7 male   control (0.000,0.55);(0.011,0.56);(... (0.000,0.52);(0.019,0.53);(...
         ##  8 male   control (0.000,0.45);(0.011,0.48);(... (0.000,0.33);(0.019,0.42);(...
         ##  9 male   control (0.000,0.50);(0.011,0.51);(... (0.000,0.57);(0.019,0.55);(...
         ## 10 male   control (0.000,0.46);(0.011,0.47);(... ( 0.22,0.44);( 0.24,0.45);(...
         ## # ... with 372 more rows

  10. tidyfun
      The goal of tidyfun is to provide accessible and well-documented software that makes functional data analysis in R easy, specifically data wrangling and exploratory analysis.
      tidyfun provides:
      ◮ new data types for representing functional data: tfd & tfb
      ◮ arithmetic operators, descriptive statistics and graphics functions for such data
      ◮ tidyverse verbs for handling functional data inside data frames.

  11. Plan for today
      ◮ tidyfun’s data types
      ◮ tidyfun’s methods & functions
      ◮ Discussion & Feedback:
         ◮ What’s stupid?
         ◮ What is too complicated?
         ◮ What am I missing?

  12. tf-Class: Definition

  13. tf-class
      tf is a new data type for (vectors of) functional data:
      ◮ an abstract superclass for functional data in 2 forms:
         ◮ as (argument, value)-tuples: subclass tfd, also irregular or sparse
         ◮ or in basis representation: subclass tfb
      ◮ basically, a glorified list of numeric vectors (... since lists work well as columns of data frames ...)
      ◮ with additional attributes that define function-like behavior:
         ◮ how to evaluate the given “functions” for new arguments
         ◮ their domain
         ◮ the resolution of the argument values
      ◮ S3 based
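
The "glorified list of numeric vectors with attributes" idea can be sketched in a few lines of base R. This is not tidyfun's implementation: the class name `toy_tf` and the helpers `new_toy_tf()` and `toy_evaluate()` are hypothetical, and the values are made up.

```r
# Hypothetical constructor: NOT tidyfun's implementation, only the idea.
new_toy_tf <- function(evaluations, arg, domain = range(arg)) {
  stopifnot(is.list(evaluations), all(lengths(evaluations) == length(arg)))
  structure(evaluations, arg = arg, domain = domain, class = "toy_tf")
}

# Evaluate every curve at new arguments by linear interpolation.
toy_evaluate <- function(x, new_arg) {
  arg <- attr(x, "arg")
  lapply(unclass(x), function(v) approx(arg, v, xout = new_arg)$y)
}

ex_toy <- new_toy_tf(
  list(A = c(0.49, 0.52, 0.54), B = c(0.47, 0.49, 0.50)),
  arg = c(0, 0.5, 1)
)

# A list can be stored as a data frame column, one curve per row.
df <- data.frame(id = c(1001, 1002))
df$cca <- ex_toy

mid <- toy_evaluate(ex_toy, 0.25)
print(mid$A)
print(attr(ex_toy, "domain"))
```

Because the argument values and domain travel with the object as attributes, the data frame keeps one row per observational unit while each cell still holds a whole curve.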

  14. Example Data
      [Figure: the five example curves A to E in ex, plotted over (0, 1)]

          ex
          ## tfd[5] on (0,1) based on 93 evaluations each
          ## interpolation by tf_approx_linear
          ## A: (0.000,0.49);(0.011,0.52);(0.022,0.54); ...
          ## B: (0.000,0.47);(0.011,0.49);(0.022,0.50); ...
          ## C: (0.000,0.50);(0.011,0.51);(0.022,0.54); ...
          ## D: (0.000,0.40);(0.011,0.42);(0.022,0.44); ...
          ## E: (0.000,0.40);(0.011,0.41);(0.022,0.40); ...

  15. Example Data

          glimpse(dti)
          ## Observations: 382
          ## Variables: 5
          ## $ id   <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 101...
          ## $ sex  <fct> female, female, male, male, male, male, male, male, male,...
          ## $ case <fct> control, control, control, control, control, control, con...
          ## $ cca  <tfd> 1001_1: (0.000,0.49);(0.011,0.52);(0.022,0.54); ..., 1002...
          ## $ rcst <tfd> 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ..., 1002...

  16. tf subclass: tfd
      tfd objects contain “raw” functional data:
      ◮ represented as a list of evaluations $f_i(t)|_{t = t'}$ and corresponding argument vector(s) $t'$
      ◮ has a domain: the range of valid args.

          ex %>% tf_evaluations() %>% str
          ## List of 5
          ##  $ : num [1:93] 0.491 0.517 0.536 0.555 0.593 ...
          ##  $ : num [1:93] 0.472 0.487 0.502 0.523 0.552 ...
          ##  $ : num [1:93] 0.502 0.514 0.539 0.574 0.603 ...
          ##  $ : num [1:93] 0.402 0.423 0.44 0.46 0.475 ...
          ##  $ : num [1:93] 0.402 0.406 0.399 0.386 0.409 ...
          ex %>% tf_arg() %>% str
          ##  num [1:93] 0 0.0109 0.0217 0.0326 0.0435 ...
          ex %>% tf_domain()
          ## [1] 0 1

  17. tf subclass: tfd
      ◮ contains an evaluator function that defines how to inter-/extrapolate evaluations between args (and remembers results of previous calls)

          tf_evaluator(ex) %>% str
          ## function (x, arg, evaluations)
          ##  - attr(*, "memoised")= logi TRUE
          ##  - attr(*, "class")= chr [1:2] "memoised" "function"
          tf_evaluator(ex) <- tf_approx_spline
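
What swapping the evaluator changes can be illustrated with base R's `approx()` and `spline()`. The functions `eval_linear()` and `eval_spline()` below are hypothetical stand-ins that only mimic the `(x, arg, evaluations)` signature printed above; they are not tidyfun's `tf_approx_linear` or `tf_approx_spline`, and the data are made up.

```r
# Hypothetical curve observed on a coarse grid.
arg <- c(0, 0.25, 0.5, 0.75, 1)
evaluations <- c(0.40, 0.46, 0.55, 0.50, 0.42)

# Two interchangeable evaluators sharing one signature.
eval_linear <- function(x, arg, evaluations) {
  approx(arg, evaluations, xout = x)$y
}
eval_spline <- function(x, arg, evaluations) {
  spline(arg, evaluations, xout = x, method = "natural")$y
}

# Both reproduce the observed value at a grid point ...
at_grid <- c(
  linear = eval_linear(0.5, arg, evaluations),
  spline = eval_spline(0.5, arg, evaluations)
)

# ... but they may disagree between grid points.
between <- c(
  linear = eval_linear(0.125, arg, evaluations),
  spline = eval_spline(0.125, arg, evaluations)
)

print(at_grid)
print(between)
```

The stored evaluations stay the same; only the rule for filling in values between them changes.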

  18. tf subclass: tfd
      ◮ tfd has subclasses for regular data with a common grid and irregular or sparse data.

          dti$rcst[1:2]
          ## tfd[2] on (0,1) based on 43 to 55 (mean: 49) evaluations each
          ## inter-/extrapolation by tf_approx_linear
          ## 1001_1: (0.000,0.26);(0.019,0.45);(0.037,0.40); ...
          ## 1002_1: ( 0.22,0.44);( 0.24,0.48);( 0.26,0.48); ...
          dti$rcst[1:2] %>% tf_arg() %>% str
          ## List of 2
          ##  $ 1001_1: num [1:55] 0 0.019 0.037 0.056 0.074 0.093 0.111 0.13 0.148 0.167 ...
          ##  $ 1002_1: num [1:43] 0.222 0.241 0.259 0.278 0.296 0.315 0.333 0.352 0.37 0.389 ...
          dti$rcst[1:2] %>% plot(pch = "x", col = viridis(2))

      [Figure: the two rcst curves plotted with "x" markers at their observed argument values over (0, 1)]
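
For irregular data each curve needs its own argument vector, which is why `tf_arg()` returns a list above. A base R sketch with hypothetical toy curves shows what happens when such curves are put on a common grid by linear interpolation without extrapolation.

```r
# Hypothetical irregular data: each curve has its own argument values.
args <- list(
  a = c(0, 0.25, 0.5, 0.75, 1),
  b = c(0.25, 0.5, 0.75)
)
vals <- list(
  a = c(0.26, 0.45, 0.40, 0.52, 0.48),
  b = c(0.44, 0.48, 0.46)
)

# Interpolate every curve onto one common grid; approx() returns NA
# outside the range in which a curve was observed (no extrapolation).
grid <- c(0, 0.25, 0.5, 0.75, 1)
on_grid <- mapply(
  function(t, v) approx(t, v, xout = grid)$y,
  args, vals
)

print(on_grid)
```

Curve `b` was only observed on part of the domain, so its values at the ends of the grid are missing rather than invented.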

  19. tf subclass: tfd

          dti$cca[1:3] %>% str(1)
          ## List of 3
          ##  $ 1001_1: num [1:93] 0.491 0.517 0.536 0.555 0.593 ...
          ##  $ 1002_1: num [1:93] 0.472 0.487 0.502 0.523 0.552 ...
          ##  $ 1003_1: num [1:93] 0.502 0.514 0.539 0.574 0.603 ...
          ##  - attr(*, "arg")=List of 1
          ##  - attr(*, "domain")= num [1:2] 0 1
          ##  - attr(*, "evaluator")=function (x, arg, evaluations)
          ##   ..- attr(*, "memoised")= logi TRUE
          ##   ..- attr(*, "class")= chr [1:2] "memoised" "function"
          ##  - attr(*, "evaluator_name")= chr "tf_approx_linear"
          ##  - attr(*, "resolution")= num 0.01
          ##  - attr(*, "class")= chr [1:3] "tfd_reg" "tfd" "tf"
