overview an introduction to the r environment r
play

Overview An Introduction to the R Environment R Programming What - PowerPoint PPT Presentation

Overview An Introduction to the R Environment R Programming What does an R function look like Flow control Matrix algebra Peter Dalgaard Optimizers Department of Biostatistics Largish example: Time splitting University of


  1. Overview An Introduction to the R Environment R Programming ◮ What does an R function look like ◮ Flow control ◮ Matrix algebra Peter Dalgaard ◮ Optimizers Department of Biostatistics ◮ Largish example: Time splitting University of Copenhagen ◮ (The whole thing will be rather suprficial) ISCB Conference, Szeged, June 2005 R Programming ISCB2005 R Programming ISCB2005 Simple Functions Flow control ◮ logit <- function(p) log(p/(1-p)) ◮ if/else ◮ logit(0.5) ◮ ifelse() ◮ Formal arguments ◮ switch() ◮ Actual arguments ◮ for loops ◮ Positional matching: plot(x,y) ◮ repeat , while ◮ Keyword matching: t.test(x ~ g, mu=2, ◮ break alternative="less") ◮ Partial matching: t.test(x ~ g, mu=2, alt="l") R Programming ISCB2005 R Programming ISCB2005

  2. Apply-functions/loop avoidance Matrix algebra ◮ lapply – list-apply ◮ R contains a pretty full set of primitives for matrix calculus ◮ sapply – simplifying apply ◮ A %*% B for matrix multiplication ◮ tapply – tabulating apply ◮ solve(A, b) for solving linear equations. ( solve(A) for matrix inverse) ◮ apply , sweep – along slices of tables ◮ Various special products and decompositions ◮ replicate – repeat expression R Programming ISCB2005 R Programming ISCB2005 Example of Matrix Code Optimization Tools Code to calculate the Greenhouse-Geisser epsilon: ◮ 1-dimensional: optimize Psi <- T %*% Sigma %*% t(T) B <- T %*% object$SSD %*% t(T) ◮ nlm , Newton-style optimizer pp <- nrow(T) U <- solve(Psi,B) ◮ optim , wrapper for several advanced optimizers lambda <- Re(eigen(U)$values) GG.eps <- sum(lambda)^2/sum(lambda^2)/pp R Programming ISCB2005 R Programming ISCB2005

  3. Demo 1 Example: Time Splitting ◮ Split survival data into bands according to some time scale mll <- function(theta) -dbinom(4, 10, theta, log=TRUE) ◮ Used in survival analysis and epidemiology plot(mll) optimize(mll, c(0,1)) ◮ Vector of (delayed-entry) survival times nlm(mll, .5) optim(.5, mll, method="BFGS") ◮ Vector of break points ◮ (Possibly) individual origin of time scale for splitting R Programming ISCB2005 R Programming ISCB2005 “The SAS Way” The R Way Might mimic the SAS strategy, but it is inefficient in R. Here’s (pseudocode) another idea: loop over individuals (implicit) loop over intervals loop over intervals { { select subjects that overlap with interval if overlap with survival time trim times to interval { keep resulting vector trim survival time to interval } output modified case stick all vectors together } } That way, we can utilize vectorization of the selection and trimming tasks. R Programming ISCB2005 R Programming ISCB2005

  4. Trimming to a Single Interval Trimming as Function ◮ Survival: time1, time2, event trimToI <- function(I) ◮ Interval: left, right { left <- I[1] + origin ; right <- I[2] + origin ◮ Turns out to be easier to "shoot first and ask questions en <- pmax(time1, left) ex <- pmin(time2, right) later" about the overlap: ev <- event & (time2 <= right) valid <- (en < ex) en <- pmax(time1, left) data.frame(S=Surv(en[valid], ex[valid], ev[valid]), ex <- pmin(time2, right) subj=subj[valid]) ev <- event & (time2 <= right) } valid <- (en < ex) Surv(en[valid], ex[valid], ev[valid]) This is designed to be defined inside the actual timesplit function. Notice that some variables are left to be bound via Notice that left and right can be vectors and incorporate a lexical scope . subject-specific origin. R Programming ISCB2005 R Programming ISCB2005 Processing All Time Bands Describing the Timebands We want to use some sort of apply-function and collect results The output from our timesplitting function should contain a as list of data frames. Here’s a nice way: variable describing to which time band each piece belongs (this is not obvious if the origin differs between individuals) nbrk <- length(brks) Getting the result as a factor with the proper level set is a little Imat <- cbind(brks[-nbrk], brks[-1]) l <- apply(Imat, 1, trimToI) tricky. Here’s one way: and then just stick things together with lbl <- apply(Imat, 1, function(I) paste("(",I[1],",", I[2],"]", sep="")) f <- factor(lbl, levels=lbl) # avoid level sorting result <- do.call("rbind", l) rep(f, lapply(l, nrow)) as it turns out, you need to do a little bit more because the Next slide gives the final function. rbind turns the Surv objects into matrices. R Programming ISCB2005 R Programming ISCB2005

  5. timesplit <- function(S, brks, origin=0, subj=1:nrow(S)) { time1 <- S[,1] ; time2 <- S[,2] ; event <- S[,3] trimToI <- function(I) { left <- I[1] + origin ; right <- I[2] + origin en <- pmax(time1, left) ex <- pmin(time2, right) ev <- event & (time2 <= right) valid <- (en < ex) data.frame(S=Surv(en[valid], ex[valid], ev[valid]), subj=subj[valid]) } nbrk <- length(brks) Imat <- cbind(brks[-nbrk], brks[-1]) l <- apply(Imat, 1, trimToI) result <- do.call("rbind", l) attr(result$S, "type") <- "counting" class(result$S) <- "Surv" lbl <- apply(Imat, 1, function(I) paste("(",I[1],",", I[2],"]", sep="")) f <- factor(lbl, levels=lbl) # avoid level sorting issues cbind(result, band=rep(f, lapply(l, nrow))) }

Recommend


More recommend