Zihang Yin: Introduction to RHIPE


  1. Zihang Yin

  2. Introduction   R is commonly used as an open-source statistical software platform that enables analysts to perform complex statistical analyses with limited computing knowledge. Frequently these analytical methods require data sets that are far too large to analyze in local memory. Our assumption is that each analyst understands R but has only a limited understanding of Hadoop.

  3. Perspectives   The R and Hadoop Integrated Programming Environment (RHIPE) is an R package for computing across massive data sets: creating subsets, applying routines to subsets, and producing displays of subsets across a cluster of computers, using the Hadoop DFS and the Hadoop MapReduce framework. All of this is accomplished from within the R environment, using standard R programming idioms. Integrating these methods will drive greater analytical productivity and extend the capabilities of companies.

  4. Approach   The native language of Hadoop is Java. Java is not suited to the rapid development cycle that a data analysis environment requires. Hadoop Streaming bridges this gap: users can write MapReduce programs in other languages (e.g. Python, Ruby, Perl), which are then deployed over the cluster. Hadoop Streaming transfers the input data from Hadoop to the user program and vice versa. However, data analysis from R does not involve the user writing code to be deployed from the command line. The analyst has massive data sitting in the background; she needs to create data sets, partition them, and compute summaries or displays. These operations need to be evaluated from the R environment and the results returned to R, ideally without resorting to the command line.
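  For illustration only, here is a minimal sketch of what a streaming-style mapper looks like, written in R (the slides do not show one; the whitespace-separated field layout and the key/count-of-1 output are my assumptions). Hadoop Streaming pipes input records to the program's stdin and reads tab-separated key-value pairs from its stdout:

    #!/usr/bin/env Rscript
    ## Streaming mapper sketch: read input lines from stdin, emit
    ## key<TAB>value pairs on stdout for Hadoop to collect.
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      fields <- strsplit(line, "[[:space:]]+")[[1]]
      cat(fields[1], "\t1\n", sep = "")   ## first field as key, a count of 1 as value
    }
    close(con)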

  5. Solution --- RHIPE   RHIPE consists of several functions to interact with the HDFS, e.g. to save data sets, read data created by RHIPE MapReduce, and delete files. Compose and launch MapReduce jobs from R using the commands rhmr and rhex. Monitor the status using rhstatus, which returns an R object. Stop jobs using rhkill. Compute side-effect files: the output of parallel computations may include the creation of PDF files, R data sets, CSV files, etc. These will be copied by RHIPE to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system.
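  A minimal sketch of the launch-monitor-stop cycle using the functions this slide names, assuming a job object z already composed with rhmr as in the examples later in the deck:

    job <- rhex(z, async = TRUE)   ## submit the job without blocking the R session
    s <- rhstatus(job)             ## an R object describing the job's state and progress
    ## to stop a misbehaving job from within R:
    ## rhkill(job)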

  6. Solution --- RHIPE   Data sets created by RHIPE can be read using other languages such as Java, Perl, Python, and C. The serialization format used by RHIPE (converting R objects to binary data) is Google's Protocol Buffers, which is very fast and creates compact representations of R objects, ideal for massive data sets. Data sets created using RHIPE are key-value pairs: a key is mapped to a value, and a MapReduce computation iterates over the key-value pairs in parallel. If the output of a RHIPE job contains unique keys, the output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.
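  For illustration, a sketch of this dictionary-style usage with the rhwrite/rhread functions listed on the next slide (the path /tmp/dict and the example pairs are made up):

    library(Rhipe)
    ## Write key-value pairs to the HDFS; with unique keys the result
    ## behaves like an on-disk associative dictionary.
    kv <- list(list("alpha", 1:10), list("beta", rnorm(5)))
    rhwrite(kv, "/tmp/dict")
    d <- rhread("/tmp/dict")   ## read the pairs back into R, possibly much later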

  7. RHIPE FUNCTIONS
  rhget - Copying from the HDFS
  rhput - Copying to the HDFS
  rhwrite - Writing R data to the HDFS
  rhread - Reading data from the HDFS into R
  rhgetkeys - Reading values from map files
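  For illustration, a sketch of moving files between the local machine and the HDFS with the first two of these functions (both paths are hypothetical):

    library(Rhipe)
    rhput("/home/analyst/model.Rdata", "/tmp/model.Rdata")   ## local file -> HDFS
    rhget("/tmp/somenumbers/part-00000", "/home/analyst/")   ## HDFS file -> local disk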

  8. PACKAGING A JOB FOR MAPREDUCE
  rhex - Submitting a MapReduce R object to Hadoop
  rhmr - Creating the MapReduce object
  Functions to communicate with Hadoop during MapReduce:
  rhcollect - Writing data to Hadoop MapReduce
  rhstatus - Updating the status of the job during runtime
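  Putting these together, a skeletal job might look like the following sketch (the map body, folder paths, and text input are placeholders of my own; Examples 1 and 2 below show complete versions):

    map <- expression({
      ## rhcollect() hands a key-value pair back to Hadoop from inside the map
      lapply(map.values, function(v) rhcollect(NULL, nchar(v)))
    })
    z <- rhmr(map = map, inout = c("text", "sequence"),
              ifolder = "/tmp/lines", ofolder = "/tmp/lengths")
    job <- rhex(z, async = TRUE)
    rhstatus(job)   ## poll progress while the job runs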

  9. Setup
  Using Eucalyptus, create the Hadoop cluster; the cluster has one master node and one slave node.
  The Hadoop version compatible with RHIPE is 0.20.2.
  Install Google protobuf for serialization.
  Install R:
    ./configure --enable-R-shlib
    make
    make check
    make install
  Install RHIPE as an add-on package.
  Create an image on Eucalyptus so this setup can be reused without repeating the effort.

  10. Example 1   How to make a text file of random numbers

  make.numbers <- function(N, dest, cols = 5, factor = 1, local = FALSE) {
    ## factor: if equal to 1, exactly N rows are produced, otherwise N*factor rows
    ## cols: how many columns per row
    map <- as.expression(bquote({
      COLS <- .(COLS)
      F <- .(F)
      lapply(map.values, function(r) {
        for (i in 1:F) {
          f <- runif(COLS)
          rhcollect(NULL, f)
        }
      })
    }, list(COLS = cols, F = factor)))

  11. Example 1   How to make a text file of random numbers (continued)

    library(Rhipe)
    mapred <- list()
    if (local) mapred$mapred.job.tracker <- 'local'
    mapred[['mapred.field.separator']] <- " "
    mapred[['mapred.textoutputformat.usekey']] <- FALSE
    mapred$mapred.reduce.tasks <- 0
    z <- rhmr(map = map, N = N, ofolder = dest, inout = c("lapp", "text"), mapred = mapred)
    rhex(z)
  }
  make.numbers(N = 1000, "/tmp/somenumbers", cols = 10)
  ## read them in (don't if N is too large!)
  f <- rhread("/tmp/somenumbers/", type = "text")

  12. Example 2   How to compute the mean

  Mapper:
  ## We want to compute the mean and sd of each of the K columns. For this (and
  ## ignoring numerical accuracy), the sums and sums of squares of the K columns
  ## suffice; from those the mean and sd follow.
  map <- expression({
    ## K is the number of columns; the number of rows is the length of map.values
    ## map.values is a list of lines; this approach is fine if you want /all/ the columns
    K <- 10
    l <- length(map.values)
    all.lines <- as.numeric(unlist(strsplit(unlist(map.values), "[[:space:]]+")))
    ## reshape row-wise: each input line contributes one row of K numbers
    all.lines <- matrix(all.lines, ncol = K, byrow = TRUE)
    sums <- apply(all.lines, 2, sum)                    ## by columns
    sqs <- apply(all.lines, 2, function(r) sum(r^2))    ## by columns
    sapply(1:K, function(r) rhcollect(r, c(l, sums[r], sqs[r])))
  })

  13. Example 2   How to compute the mean

  Reducer:
  reduce <- expression(
    pre = { totals <- c(0, 0, 0) },
    reduce = { totals <- totals + apply(do.call('rbind', reduce.values), 2, sum) },
    post = { rhcollect(reduce.key, totals) }
  )
  ## the mapred bit is optional, but if you have K columns, why run more reducers?
  K <- 10   ## the number of columns, as in the mapper
  mr <- list(mapred.reduce.tasks = K)
  y <- rhmr(map = map, reduce = reduce, combiner = TRUE, inout = c("text", "sequence"),
            ifolder = "/tmp/somenumbers", ofolder = "/tmp/means", mapred = mr)
  w <- rhex(y, async = TRUE)
  z <- rhstatus(w, mon.sec = 5)
  results <- if (z$state == "SUCCEEDED") rhread("/tmp/means") else NULL
  if (!is.null(results)) {
    results <- cbind(unlist(lapply(results, "[[", 1)),
                     do.call('rbind', lapply(results, "[[", 2)))
    colnames(results) <- c("col.num", "n", "sum", "ssq")
  }
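  The slides stop at the per-column totals. As a final step (not shown on the slides), the mean and a population standard deviation can be derived from the results matrix:

    ## per-column means and (population) standard deviations from the totals
    means <- results[, "sum"] / results[, "n"]
    sds   <- sqrt(results[, "ssq"] / results[, "n"] - means^2)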

  14. Conclusion   In summary, the objective of RHIPE is to let the user focus on thinking about the data. The difficulties in distributing computations and storing data across a cluster are automatically handled by RHIPE and Hadoop.
