Logging changes in data with lumberjack
Mark van der Loo, Statistics Netherlands
@markvdloo | github.com/markvanderloo
eRum2018

The next 15 minutes
◮ Motivation
◮ How to do it
◮ Why it works
◮ Examples

Example
    # 'retailers' dataset from the 'validate' package
    head(dat, 3)
    ##      Id turnover other.rev total.rev
    ## 1 RET01       NA        NA      1130
    ## 2 RET02     1607        NA      1607
    ## 3 RET03     6886       -33      6919
Computing task
Estimate mean(other.rev) / mean(turnover)

Clean up and compute result
    library(dcmodify); library(simputation); library(dplyr)
    dat %>%
      modify_so(if (other.rev < 0) other.rev <- -1 * other.rev) %>%
      impute_const(other.rev ~ 0) %>%
      impute_rlm(turnover ~ total.rev) %>%
      impute_median(turnover ~ 1) %>%
      summarize(result = mean(other.rev) / mean(turnover))
    ##       result
    ## 1 0.08844255

Questions
We are using a pretty complex estimator:
    Estimate = f(input) = (mean ∘ impute ∘ clean)(input)
How important is each step for the final result?
◮ How many cells are altered by each step of the cleaning process?
◮ How do e.g. the column means change during the cleaning?
◮ How about the variance?
◮ ...

Logging changes in data
Wish list
◮ Works for all data-in/data-out functions
◮ User-definable logging
◮ Near-zero change in workflow

Using lumberjack
    out <- dat %L>%
      # Tag data for logging; use lumberjack
      start_log(cellwise$new(key = "Id")) %L>%
      # Do your cleanup
      modify_so(if (other.rev < 0) other.rev <- -1 * other.rev) %L>%
      impute_rlm(turnover ~ total.rev) %L>%
      impute_median(turnover ~ 1) %L>%
      impute_const(other.rev ~ 0) %L>%
      # Dump log to file
      dump_log() %L>%
      # Continue with analyses
      summarize(result = mean(other.rev) / mean(turnover))
    ## Dumped a log at cellwise.csv

Check the logging info
    read.csv("cellwise.csv") %L>% head(3)
    ##   step                     time
    ## 1    1 2018-05-16 10:30:42 CEST
    ## 2    2 2018-05-16 10:30:42 CEST
    ## 3    2 2018-05-16 10:30:42 CEST
    ##                                                  expression   key
    ## 1 modify_so(if (other.rev < 0) other.rev <- -1 * other.rev) RET03
    ## 2                          impute_rlm(turnover ~ total.rev) RET01
    ## 3                          impute_rlm(turnover ~ total.rev) RET05
    ##    variable old      new
    ## 1 other.rev -33   33.000
    ## 2  turnover  NA 1125.608
    ## 3  turnover  NA 5597.627
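A cellwise log like this answers the earlier question of how many cells each step alters. The sketch below is only an illustration in base R: it retypes the three rows shown above by hand (time column omitted, object names `cell_log` and `changes_per_step` are made up here) and counts logged changes per step and variable.

```r
# The three log rows shown on the slide, typed in by hand for illustration.
cell_log <- data.frame(
  step       = c(1, 2, 2),
  expression = c("modify_so(if (other.rev < 0) other.rev <- -1 * other.rev)",
                 "impute_rlm(turnover ~ total.rev)",
                 "impute_rlm(turnover ~ total.rev)"),
  key        = c("RET03", "RET01", "RET05"),
  variable   = c("other.rev", "turnover", "turnover"),
  old        = c(-33, NA, NA),
  new        = c(33, 1125.608, 5597.627),
  stringsAsFactors = FALSE
)

# Each row of a cellwise log is one changed cell, so counting rows
# per step (and per variable) gives the number of altered cells.
changes_per_step <- table(step = cell_log$step, variable = cell_log$variable)
print(changes_per_step)
```

For these three rows the table holds one change to other.rev at step 1 and two changes to turnover at step 2. With the full cellwise.csv the same call would be applied to the result of read.csv().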

How it works
start_log(data, logger)
    Attach a logger object to the data. The data ‘wants’ to be logged.
Lumberjack: %L>%
    Check if the data has a logger; if so, use it.
dump_log(data, stop=TRUE)
    Dump logging info, remove logger (by default).

The lumberjack operator
Instead of this:
    # not-a-pipe pseudocode
    `%>%` <- function(x, f) {
      f(x)
    }
Do this:
    # lumberjack pseudocode
    `%L>%` <- function(x, f) {
      input  <- x
      output <- f(x)
      if ( x wants to be logged )
        store logging info based on input and/or output
      output
    }
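The pseudocode can be turned into a small runnable toy in base R. This is a sketch under stated assumptions, not lumberjack's implementation: the names `start_toy_log`, `%log>%`, `toy_log` and `flip_negative` are hypothetical, the right-hand side must be a plain function, and the "logger" is just an environment stored in an attribute that records whether each step changed the data.

```r
# Hypothetical toy versions of start_log(), %L>% and a log accessor.
start_toy_log <- function(x) {
  logger <- new.env()              # environments are mutable, so the log
  logger$changed <- logical(0)     # survives copies of the data
  attr(x, "toy_logger") <- logger  # the data now 'wants' to be logged
  x
}

`%log>%` <- function(x, f) {
  logger <- attr(x, "toy_logger")
  output <- f(x)
  if (!is.null(logger)) {
    # Compare input and output without the logger attribute itself.
    strip <- function(d) { attr(d, "toy_logger") <- NULL; d }
    logger$changed <- c(logger$changed, !identical(strip(x), strip(output)))
    attr(output, "toy_logger") <- logger  # pass the logger on to the next step
  }
  output
}

toy_log <- function(x) attr(x, "toy_logger")$changed

# Illustrative cleaning step: make negative other.rev values positive.
flip_negative <- function(d) { d$other.rev <- abs(d$other.rev); d }

dat_toy <- data.frame(Id = c("RET01", "RET02", "RET03"),
                      other.rev = c(NA, NA, -33))
out_toy <- start_toy_log(dat_toy) %log>% flip_negative %log>% identity
print(toy_log(out_toy))
```

The first step changes a cell and the second does not, so the recorded log is TRUE followed by FALSE. Data without a logger attribute passes through `%log>%` unchanged, which is the "near-zero change in workflow" idea.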

Some loggers
In lumberjack
◮ simple: test if input is identical to output
◮ filedump: dump the whole dataset after each operation
◮ expression_logger: log the result of user-defined expressions
In validate
◮ lbj_cells: summary of cell changes (see next slide)
◮ lbj_rules: summary of changes in validation rule compliance
In daff
◮ lbj_daff: create a data diff file

The lbj_cells logger: count cells changed
[Figure: diagram of the cell-count categories: total, available, missing, still available, unadapted, adapted, removed, imputed, still missing. Source: Van der Loo and De Jonge (2018)]

The lbj_cells logger
    dat %L>%
      start_log(validate::lbj_cells()) %L>%
      ...
      dump_log() %L>%
      summarize(result = mean(other.rev) / mean(turnover))
    ## Dumped a log at /home/mark/projects/tex/eRum2018/pres/cells.csv
    ##       result
    ## 1 0.08844255

The lbj_cells logger
    library(tidyr); library(ggplot2)  # gather() and ggplot()
    read.csv("cells.csv") %>%
      gather(variable, n_cells, -step, -time, -expression) %>%
      ggplot(aes(x = step, y = n_cells, color = variable)) +
      geom_line(size = 1)
[Figure: line plot of n_cells against step (0 to 4), one line per variable: adapted, available, cells, imputed, missing, new_missing, still_available, still_missing, unadapted]

Log any list of expressions (version ≥ 0.3.0)
    logger <- expression_logger$new(
      mean_or = mean(other.rev, na.rm = TRUE),
      mean_to = mean(turnover, na.rm = TRUE)
    )
    dat %L>%
      start_log(logger) %L>%
      ...
      dump_log() %L>%
      summarize(result = mean(other.rev) / mean(turnover))
    ## Dumped a log at expression_log.csv

Log any list of expressions (version ≥ 0.3.0)
    read.csv("expression_log.csv") %>%
      gather(variable, value, -expression, -step) %>%
      ggplot(aes(x = step, y = value, col = variable)) +
      geom_line(size = 1) + geom_point()
[Figure: line plot of value against step (1 to 4), one line each for mean_or and mean_to]

Logger API: create your own loggers
A logger is an R6 or RC object with at least:
◮ $add(meta, input, output)
  − meta: list(expr, src) (expression and source)
  − input: input data
  − output: output data
◮ $dump(): dumps the logged information
For package authors
You can extend the lumberjack package (see vignette).
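As an illustration of this interface, here is a minimal sketch of a custom logger written as a Reference Class (RC) object. The class name `na_counter`, its `records` field and the `file` argument of `$dump()` are assumptions made for this example; the lumberjack version you use may expect more than the two methods listed, so check the "Creating loggers" vignette. The logger is exercised by hand here so the example runs without lumberjack installed.

```r
library(methods)

# Hypothetical logger: counts missing cells before and after each step.
na_counter <- setRefClass("na_counter",
  fields = list(records = "list"),
  methods = list(
    add = function(meta, input, output) {
      # meta$src is the source text of the step, as described on the slide
      step <- data.frame(
        expression = meta$src,
        na_before  = sum(is.na(input)),
        na_after   = sum(is.na(output)),
        stringsAsFactors = FALSE
      )
      records <<- c(records, list(step))
    },
    dump = function(file = "na_counter.csv", ...) {
      # 'file' is an assumed convenience argument, not part of the slide's API
      write.csv(do.call(rbind, records), file = file, row.names = FALSE)
    }
  )
)

# Call the logger directly, the way a logging pipe would after one step.
logger <- na_counter$new()
before <- data.frame(turnover = c(NA, 1607, 6886))
after  <- data.frame(turnover = c(1125.608, 1607, 6886))
logger$add(
  meta   = list(expr = quote(impute_rlm(turnover ~ total.rev)),
                src  = "impute_rlm(turnover ~ total.rev)"),
  input  = before,
  output = after
)
out_file <- tempfile(fileext = ".csv")
logger$dump(file = out_file)
na_log <- read.csv(out_file, stringsAsFactors = FALSE)
print(na_log)
```

Keeping the records in a list and binding them only in `$dump()` avoids growing a data frame row by row; any other storage would do as long as `$add()` and `$dump()` keep their signatures.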

More information
SDCR
    M. van der Loo and E. de Jonge (2018). Statistical Data Cleaning with Applications in R. Wiley, Inc.
lumberjack 0.2.0
◮ Available on CRAN
Vignettes
◮ Getting started
◮ Creating loggers
