Zihang Yin
Introduction R is commonly used as an open-source statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical methods require data sets that are far too large to analyze in local memory. Our assumption is that each analyst understands R but has only a limited understanding of Hadoop.
Perspectives The R and Hadoop Integrated Programming Environment (RHIPE) is an R package for computing across massive data sets: creating subsets, applying routines to subsets, and producing displays of subsets across a cluster of computers, using the Hadoop DFS and the Hadoop MapReduce framework. This is accomplished from within the R environment, using standard R programming idioms. Integrating these methods will drive greater analytical productivity and extend the capabilities of companies.
Approach The native language of Hadoop is Java. Java is not suitable for the rapid development needed in a data analysis environment. Hadoop Streaming bridges this gap. Users can write MapReduce programs in other languages (e.g. Python, Ruby, Perl), which are then deployed over the cluster. Hadoop Streaming then transfers the input data from Hadoop to the user program and vice versa. However, data analysis from R does not involve the user writing code to be deployed from the command line. The analyst has massive data sitting in the background; she needs to create data, partition the data, and compute summaries or displays. These need to be evaluated from the R environment and the results returned to R, ideally without resorting to the command line.
Solution --- RHIPE RHIPE consists of several functions to interact with the HDFS, e.g. save data sets, read data created by RHIPE MapReduce, and delete files. Compose and launch MapReduce jobs from R using the commands rhmr and rhex. Monitor the status using rhstatus, which returns an R object. Stop jobs using rhkill. Compute side-effect files: the output of parallel computations may include the creation of PDF files, R data sets, CSV files, etc. These will be copied by RHIPE to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system.
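The compose/launch/monitor/stop cycle above can be sketched as follows. This is a minimal illustration, not a complete program: it assumes a running Hadoop cluster with RHIPE installed, and the input path /tmp/input is hypothetical.

```r
library(Rhipe)

## Compose a trivial job: count the input lines seen by each map task.
## map.values is supplied by RHIPE inside the cluster.
map <- expression({
  rhcollect("lines", length(map.values))
})
job <- rhmr(map=map, inout=c("text","sequence"),
            ifolder="/tmp/input", ofolder="/tmp/output")

## Launch asynchronously and poll the job status from within R.
handle <- rhex(job, async=TRUE)
state <- rhstatus(handle, mon.sec=5)  ## returns an R object describing the job

## If the job misbehaves, stop it without leaving R:
## rhkill(handle)
```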
Solution --- RHIPE Data sets that are created by RHIPE can be read using other languages such as Java, Perl, Python, and C. The serialization format used by RHIPE (converting R objects to binary data) is Google's Protocol Buffers, which is very fast and creates compact representations for R objects, ideal for massive data sets. Data sets created using RHIPE are key-value pairs: a key is mapped to a value. A MapReduce computation iterates over the key-value pairs in parallel. If the output of a RHIPE job has unique keys, the output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.
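The dictionary usage above can be sketched in a few lines. This is a hedged sketch: it assumes a completed RHIPE job wrote key-value pairs with unique keys to the hypothetical HDFS path /tmp/means.

```r
library(Rhipe)

## rhread returns a list of key-value pairs: each element is list(key, value).
kv <- rhread("/tmp/means")

## Because the keys are unique, the output behaves like an associative
## dictionary; convert it to a named list for in-memory lookup.
dict <- setNames(lapply(kv, "[[", 2),
                 sapply(kv, function(p) as.character(p[[1]])))
dict[["1"]]  ## the value stored under key 1
```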
RHIPE FUNCTIONS
rhget - Copying from the HDFS
rhput - Copying to the HDFS
rhwrite - Writing R data to the HDFS
rhread - Reading data from the HDFS into R
rhgetkeys - Reading values from map files
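A sketch of how the file-transfer functions above fit together. The paths are hypothetical, and the exact argument forms may vary across RHIPE versions.

```r
library(Rhipe)

## Move a local file onto the HDFS, and copy an HDFS file back.
rhput("/home/analyst/model.Rdata", "/tmp/model.Rdata")      ## local -> HDFS
rhget("/tmp/model.Rdata", "/home/analyst/model.copy.Rdata") ## HDFS -> local

## Write a list of R key-value pairs to the HDFS, then read them back into R.
rhwrite(list(list(1, runif(5)), list(2, runif(5))), "/tmp/pairs")
pairs <- rhread("/tmp/pairs")
```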
PACKAGING A JOB FOR MAPREDUCE
rhmr - Creating the MapReduce object
rhex - Submitting a MapReduce R object to Hadoop
Functions to communicate with Hadoop during MapReduce:
rhcollect - Writing data to Hadoop MapReduce
rhstatus - Updating the status of the job during runtime
Setup
Using Eucalyptus, create the Hadoop cluster. The cluster has one master node and one slave node. The Hadoop version compatible with RHIPE is 0.20.2.
Install Google protobuf for serialization.
Install R:
./configure --enable-R-shlib
make
make check
make install
Install RHIPE as an add-on package.
Create an image on Eucalyptus so the setup can be reused without further effort.
Example 1 How to make your text file with random numbers

make.numbers <- function(N, dest, cols=5, factor=1, local=FALSE){
  ## factor: if equal to 1, exactly N rows, otherwise N*factor rows
  ## cols: how many columns per row
  map <- as.expression(bquote({
    COLS <- .(COLS)
    F <- .(F)
    lapply(map.values, function(r){
      for(i in 1:F){
        f <- runif(COLS)
        rhcollect(NULL, f)
      }
    })
  }, list(COLS=cols, F=factor)))
Example 1 How to make your text file with random numbers (continued)

  library(Rhipe)
  mapred <- list()
  if (local) mapred$mapred.job.tracker <- 'local'
  mapred[['mapred.field.separator']] <- " "
  mapred[['mapred.textoutputformat.usekey']] <- FALSE
  mapred$mapred.reduce.tasks <- 0
  z <- rhmr(map=map, N=N, ofolder=dest, inout=c("lapp","text"), mapred=mapred)
  rhex(z)
}
make.numbers(N=1000, "/tmp/somenumbers", cols=10)
## read them in (don't if N is too large!)
f <- rhread("/tmp/somenumbers/", type="text")
Example 2 How to compute the mean Mapper

## To compute the mean and sd of the K columns (let's forget about numerical
## accuracy), we need the sums and sums of squares of each column. From those
## you can compute the mean and sd.
map <- expression({
  ## K is the number of columns
  ## the number of rows is the length of map.values
  ## map.values is a list of lines
  ## this approach is okay if you want /all/ the columns
  K <- 10
  l <- length(map.values)
  all.lines <- as.numeric(unlist(strsplit(unlist(map.values), "[[:space:]]+")))
  ## each line contributes K consecutive numbers, so fill the matrix row by row
  all.lines <- matrix(all.lines, ncol=K, byrow=TRUE)
  sums <- apply(all.lines, 2, sum)                   ## by columns
  sqs  <- apply(all.lines, 2, function(r) sum(r^2))  ## by columns
  sapply(1:K, function(r) rhcollect(r, c(l, sums[r], sqs[r])))
})
Example 2 How to compute the mean Reducer

reduce <- expression(
  pre = {
    totals <- c(0, 0, 0)
  },
  reduce = {
    totals <- totals + apply(do.call('rbind', reduce.values), 2, sum)
  },
  post = {
    rhcollect(reduce.key, totals)
  }
)
## the mapred bit is optional, but if you have K columns, why run more reducers?
K <- 10
mr <- list(mapred.reduce.tasks=K)
y <- rhmr(map=map, reduce=reduce, combiner=TRUE, inout=c("text","sequence"),
          ifolder="/tmp/somenumbers", ofolder="/tmp/means", mapred=mr)
w <- rhex(y, async=TRUE)
z <- rhstatus(w, mon.sec=5)
results <- if(z$state=="SUCCEEDED") rhread("/tmp/means") else NULL
if(!is.null(results)){
  results <- cbind(unlist(lapply(results, "[[", 1)),
                   do.call('rbind', lapply(results, "[[", 2)))
  colnames(results) <- c("col.num", "n", "sum", "ssq")
}
Conclusion In summary, the objective of RHIPE is to let the user focus on thinking about the data. The difficulties in distributing computations and storing data across a cluster are automatically handled by RHIPE and Hadoop.