

  1. R, SciDB, Julia (Mert Terzihan, Zhixiong Chen)

  2. R

  3. 1. What is R
  ● In the 1970s, at Bell Labs, John Chambers developed a statistical programming language, S
    ○ The aim was to turn ideas into software, quickly and faithfully
    ○ R is an implementation of S, initially written by Robert Gentleman and Ross Ihaka in 1993
  ● R is a language and environment for statistical computing and graphics

  4. 2. Features
  ● Object oriented
    ○ similar to Python
  ● Optimized for vector/matrix operations (see the short example below)
    ○ similar to Matlab
  ● Full support for statistical analysis
  ● Part of the GNU free software project
  ● Over 4,300 user-contributed packages
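  A minimal sketch of what "vectorized" means in practice (the numbers are made up for illustration): arithmetic applies element-wise to whole vectors, with no explicit loop.
    x <- c(1, 2, 3, 4)
    y <- c(10, 20, 30, 40)
    x * y          # element-wise product: 10 40 90 160
    x + 1          # 2 3 4 5
    sum(x * y)     # dot product: 300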

  5. 3. Study Plan
  ● Scalar
  ● Vector
  ● Matrix
  ● Data Frame
  ● The apply Function
  ● Statistics
  ● Plot

  6. Scalar
  ● Use R as a calculator
    > 4+6
    [1] 10
    > x <- 6                 # '<-' assigns the value 6 to the object x
    > y <- 4
    > x+y
    [1] 10
    > x <- "Hello world"     # strings are supported too
    > x
    [1] "Hello world"
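  A couple of further operators worth knowing at this stage (a brief extra sketch, not on the slide):
    > x <- 7
    > x^2; x %% 3; x %/% 3   # power, remainder, integer division: 49, 1, 2
    > class(x)               # "numeric"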

  7. Vector
  ● Create a vector
    > x <- c(5,9,1,0)        # c() concatenates individual elements into a vector
    > x
    [1] 5 9 1 0
    > x <- 1:10              # the integers from 1 to 10
    > x
    [1] 1 2 3 4 5 6 7 8 9 10
    > seq(1,9,by=2)          # from 1 to 9 in steps of 2
    [1] 1 3 5 7 9
    > seq(8,20,length=6)     # 6 evenly spaced numbers from 8 to 20, inclusive
    [1]  8.0 10.4 12.8 15.2 17.6 20.0

  8. Vector
  ● Access a vector: indexing starts at 1 and uses []
    > x <- rep(1:3,6)        # repeat the sequence 1,2,3 six times
    > x
    [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
    > x[1:9]                 # elements at positions 1 through 9
    [1] 1 2 3 1 2 3 1 2 3
    > x[c(3,6,9)]            # elements at positions 3, 6, and 9
    [1] 3 3 3
    > x[-c(3,6,9)]           # a negative index excludes those positions
    [1] 1 2 1 2 1 2 1 2 3 1 2 3 1 2 3

  9. Vector
  ● Access a vector: masking
    > x
    [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
    > mask <- x == 3         # create a mask
    > mask                   # the mask is a vector of logical (boolean) values
     [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
    [16] FALSE FALSE  TRUE
    > x[mask]
    [1] 3 3 3 3 3 3
    > x[!mask]               # '!' negates each logical value in the mask
    [1] 1 2 1 2 1 2 1 2 1 2 1 2
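  Two small companions to masking (an extra sketch using the same mask as above): logicals can be counted and turned back into positions.
    > sum(mask)              # number of TRUE values, i.e. how many elements equal 3: 6
    > which(mask)            # their positions: 3 6 9 12 15 18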

  10. Matrix
  ● Create a matrix
    > x <- c(5,7,9)
    > y <- c(6,3,4)
    > z <- cbind(x,y)        # bind two vectors together as the columns of a matrix
    > z
         x y
    [1,] 5 6
    [2,] 7 3
    [3,] 9 4
    > matrix(c(5,7,9,6,3,4),nrow=3)   # build a 3-row matrix from the vector (filled column by column)
         [,1] [,2]
    [1,]    5    6
    [2,]    7    3
    [3,]    9    4
    > diag(3)                # 3x3 identity matrix
         [,1] [,2] [,3]
    [1,]    1    0    0
    [2,]    0    1    0
    [3,]    0    0    1
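  Two handy companions for inspecting a matrix's shape (a brief extra sketch using the z built above):
    > dim(z)                 # 3 2: three rows, two columns
    > nrow(z); ncol(z)       # 3 and 2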

  11. Matrix
  ● Matrix operations, component-wise
    > z <- matrix(c(5,7,9,6,3,4),nrow=3,byrow=T)
    > z
         [,1] [,2]
    [1,]    5    7
    [2,]    9    6
    [3,]    3    4
    > y <- matrix(c(1,3,0,9,5,-1),nrow=3,byrow=T)
    > y
         [,1] [,2]
    [1,]    1    3
    [2,]    0    9
    [3,]    5   -1
    > y+z
         [,1] [,2]
    [1,]    6   10
    [2,]    9   15
    [3,]    8    3
    > y*z
         [,1] [,2]
    [1,]    5   21
    [2,]    0   54
    [3,]   15   -4

  12. Matrix
  ● Matrix operations, based on the mathematical definitions
    > y
         [,1] [,2]
    [1,]    1    3
    [2,]    0    9
    [3,]    5   -1
    > z <- matrix(c(3,4,-2,6),nrow=2,byrow=T)
    > z
         [,1] [,2]
    [1,]    3    4
    [2,]   -2    6
    > t(z)                   # transpose
         [,1] [,2]
    [1,]    3   -2
    [2,]    4    6
    > solve(z)               # inverse
               [,1]       [,2]
    [1,] 0.23076923 -0.1538462
    [2,] 0.07692308  0.1153846
    > x <- c(5,7)            # a length-2 vector, so that y %*% x is conformable (consistent with the output shown)
    > y %*% x                # matrix multiplication
         [,1]
    [1,]   26
    [2,]   63
    [3,]   18

  13. Matrix
  ● Access a matrix: indexing
    > y
         [,1] [,2]
    [1,]    1    3
    [2,]    0    9
    [3,]    5   -1
    > y[1,2]                 # fetch a single value
    [1] 3
    > y[c(1,2),]             # index with a vector
         [,1] [,2]
    [1,]    1    3
    [2,]    0    9
    > y[1:2,]                # fetch rows
         [,1] [,2]
    [1,]    1    3
    [2,]    0    9
    > y[,2]                  # fetch a column
    [1]  3  9 -1

  14. Matrix
  ● Access a matrix: masking
    > y
         [,1] [,2]
    [1,]    1    3
    [2,]    0    9
    [3,]    5   -1
    > mask <- y > 0
    > mask
          [,1]  [,2]
    [1,]  TRUE  TRUE
    [2,] FALSE  TRUE
    [3,]  TRUE FALSE
    > y[mask]                # matching elements, returned in column-major order
    [1] 1 5 3 9

  15. Data Frame
  ● Create a data frame: like a table in a database
    mydata <- data.frame(col1, col2, col3, ...)
    > patientID <- c(1, 2, 3, 4)
    > age <- c(25, 34, 28, 52)
    > diabetes <- c("Type1", "Type2", "Type1", "Type1")
    > status <- c("Poor", "Improved", "Excellent", "Poor")
    > patientdata <- data.frame(patientID, age, diabetes, status)
    > patientdata
      patientID age diabetes    status
    1         1  25    Type1      Poor
    2         2  34    Type2  Improved
    3         3  28    Type1 Excellent
    4         4  52    Type1      Poor

  16. Data Frame
  ● Access a data frame
    > patientdata
      patientID age diabetes    status
    1         1  25    Type1      Poor
    2         2  34    Type2  Improved
    3         3  28    Type1 Excellent
    4         4  52    Type1      Poor
    > patientdata[1:3,]            # treat it as a special kind of matrix
      patientID age diabetes    status
    1         1  25    Type1      Poor
    2         2  34    Type2  Improved
    3         3  28    Type1 Excellent
    > patientdata$patientID        # access a column by name
    [1] 1 2 3 4
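  The masking idea shown earlier for vectors also works on data frames; a minimal sketch reusing the patientdata frame built above (not taken from the original slides):
    > patientdata[patientdata$age > 30, ]                      # rows where age exceeds 30
    > patientdata[patientdata$diabetes == "Type1", "status"]   # status values of the Type1 patients
    > subset(patientdata, age > 30, select = c(patientID, status))   # the same, via subset()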

  17. The apply Function
  ● Apply a function to the elements of a data structure
    > y
         [,1] [,2]
    [1,]    1    3
    [2,]    0    9
    [3,]    5   -1
    > func <- function(x){        # define func(x) = 1 + 0.1*x
    +   x = x + 10
    +   return (x/10)
    + }
    > apply(y, c(1,2), func)      # apply func to every element of the matrix y
         [,1] [,2]
    [1,]  1.1  1.3
    [2,]  1.0  1.9
    [3,]  1.5  0.9
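  apply() has relatives for vectors, lists, and data-frame columns; a short illustrative sketch (the vector v is made up for the example):
    > v <- c(4, 9, 16)
    > sapply(v, sqrt)                        # apply sqrt to each element, return a vector
    [1] 2 3 4
    > lapply(v, function(x) x^2)             # same idea, but returns a list
    > sapply(patientdata[, c("patientID","age")], mean)   # column means of a data frame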

  18. Statistics
  ● Some handy distributions
    > dnorm(c(3,2),0,1)        # standard normal density at 3 and 2
    [1] 0.004431848 0.053990967
    > x <- seq(-5,10,by=.1)
    > dnorm(x,3,2)             # density of N(mean 3, sd 2) over a grid
    [1] 6.691511e-05 8.162820e-05 9.932774e-05 1.205633e-04 1.459735e-04 1.762978e-04
    [7] 2.123901e-04 2.552325e-04 3.059510e-04 3.658322e-04
    ...
  ● Naming convention
    ○ d*: density function (dnorm, dt, ...)
    ○ p*: distribution function (pnorm, pt, ...)
    ○ q*: quantile function, the inverse of the distribution function (qnorm, qt, ...)
    ○ Also available for the binomial, exponential, Poisson, gamma, and other distributions
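  The p* and q* functions named above follow the same calling convention as dnorm; a brief sketch:
    > pnorm(1.96)                    # P(Z <= 1.96) for a standard normal, about 0.975
    > qnorm(0.975)                   # the inverse: about 1.96
    > pbinom(3, size=10, prob=0.5)   # P(X <= 3) for Binomial(10, 0.5)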

  19. Statistics
  ● Simulation: draw 100 random observations from N(3, 4), i.e. mean 3 and standard deviation 2
    > rnorm(100,3,2)
    [1] 2.75259237 0.99932968 0.63348792 3.48292324 2.60880274 3.78258364 5.68923819
    [8] 0.08003764 1.93627124 2.53843236 3.52610754 5.31448617 2.73017110 3.35264165
    ...
  ● Other random generators: rnorm, rt, rpois, ...
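  A small follow-up sketch: fixing the random seed makes a simulation reproducible, and the sample statistics should land close to the true parameters.
    > set.seed(42)             # make the random draws reproducible
    > samp <- rnorm(100, mean = 3, sd = 2)
    > mean(samp); sd(samp)     # should be roughly 3 and 2
    > hist(samp)               # quick look at the empirical distribution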

  20. Plot
  ● Plotting x*sin(x)
    > f <- function(x) {       # define f(x) = x*sin(x)
    +   return (x*sin(x))
    + }
    > plot(f, -20*pi, 20*pi)   # plot f between -20*pi and 20*pi
    > abline(0, 1, lty=2)      # add a dashed line with intercept 0 and slope 1 (lty = 2 means dashed)
    > abline(0, -1, lty=2)     # add a dashed line with intercept 0 and slope -1
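  An equivalent one-liner using curve(), with axis labels added (a small extra sketch, not on the slide):
    > curve(x * sin(x), from = -20*pi, to = 20*pi,
    +       main = "x * sin(x)", xlab = "x", ylab = "f(x)")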

  21. More?
  ● The help() function
  ● The official manuals
    ○ http://cran.r-project.org/manuals.html
  ● A wonderful 4-week online course
    ○ http://blog.revolutionanalytics.com/2012/12/coursera-videos.html
  ● A good book
    ○ 'R in Action' by Robert Kabacoff
  ● Google

  22. 4. Bonus
  ● Installation
    ○ Tested on Ubuntu 12.04: http://livesoncoffee.wordpress.com/2012/12/09/installing-r-on-ubuntu-12-04/
    ○ Errors like "Unknown media type in type 'all/all'" can be ignored
  ● RStudio
    ○ a wonderful IDE for R programmers
    ○ http://www.rstudio.com/

  23. Ricardo: Integrating R and Hadoop

  24. Motivation
  ● Statistical software such as R provides rich functionality for data analysis and modeling, but can handle only limited amounts of data
  ● Data management systems such as Hadoop can handle large data, but provide insufficient analytical functionality
  ● Union is strength!

  25. Solution
  ● Ricardo decomposes data-analysis algorithms into (see the conceptual sketch below)
    ○ parts executed by the R statistical analysis system
    ○ parts handled by the Hadoop data management system
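  A purely conceptual sketch of that split, written in plain R: the helper summarize_on_hadoop below is hypothetical and stands in for the Jaql/Hadoop side; it is not part of Ricardo's actual API.
    # Hypothetical stand-in for the Hadoop side: in Ricardo this would be a Jaql
    # job aggregating the full data set on the cluster; here it just returns a
    # small, already-aggregated result so the sketch is self-contained.
    summarize_on_hadoop <- function(query) {
      data.frame(group = c("a", "b", "c"), x = c(1.2, 3.4, 5.1), y = c(2.0, 6.9, 10.3))
    }
    agg <- summarize_on_hadoop("per-group means of x and y")   # large-data part (simulated)
    fit <- lm(y ~ x, data = agg)                               # small-data modeling stays in R
    summary(fit)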

  26. Components
  ● R
    ○ the core of the statistical analysis
  ● Large-scale data management systems
    ○ HDFS
    ○ work with dirty, semi-structured, and unstructured data
    ○ massive data storage, manipulation, and parallel processing
  ● Jaql
    ○ a JSON query language
    ○ the declarative interface to Hadoop used by Ricardo
    ○ similar to Pig and Hive

  27. Architecture

  28. Conclusion
  ● The current version of Ricardo has poor performance

  29. Overview of SciDB: Large-Scale Array Storage, Processing, and Analysis

  30. Contents
  1. Background and Motivation
  2. Features and Functionality
  3. Data Definition
  4. Data Manipulation
  5. Architecture

  31. What is SciDB?
  ● A massively parallel storage manager
  ● Able to parallelize large-scale array processing algorithms

  32. 1. Background and Motivation
  ● Modern scientific data differs from business data in three important respects:
    ○ Sensor arrays consist of rectangular 'arrays' of individual sensors
    ○ Scientific analysis requires sophisticated data processing methods
      ■ e.g., noisy data needs to be 'cleaned'
    ○ Data generated by modern scientific instruments is extremely large
  ● An array data model is more desirable in scientific domains (see the sketch below)
    ○ it carries notions of adjacency and neighborhood
    ○ ordering is fundamental
  ● The complexity of the data processing calls for a much more flexible data management platform
    ○ a different kind of DBMS
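  To make the array data model concrete, a small R sketch (illustrative only, not SciDB syntax): because a cell's position in the array is meaningful, neighborhood operations such as local smoothing are natural to express.
    # A 4x4 grid of made-up sensor readings; position in the grid matters
    readings <- matrix(rnorm(16, mean = 20, sd = 2), nrow = 4)
    # Smooth one interior cell by averaging it with its four direct neighbors
    i <- 2; j <- 3
    neighborhood <- c(readings[i, j],
                      readings[i - 1, j], readings[i + 1, j],
                      readings[i, j - 1], readings[i, j + 1])
    mean(neighborhood)    # the 'cleaned' value for cell (i, j)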
