Why use R? Introduction to R: To perform inferential statistics - PowerPoint PPT Presentation

Why use R? Introduction to R: • To perform inferential statistics (e.g., use a statistical test to calculate a p value) test to calculate a p-value) Using R for statistics and data analysis U i R f t ti ti d d t l i • To do real statistics (unlike in Excel) • To create custom figures • To automate analysis routines (and make them more reproducible) • To reduce copying and pasting – But Unix commands may be easier – ask us But Unix commands may be easier ask us • To use up-to-date analysis algorithms • Real statisticians use it BaRC Hot Topics – October 2011 • It’s free George Bell, Ph.D. 2 http://iona.wi.mit.edu/bio/education/R2011/ Why not use R? Getting started • Log into tak • A spreadsheet application already works fine ssh ssh –l USERNAME tak l USERNAME tak • You’re already using another statistics package • Start R – Ex: Prism, MatLab R • It’s hard to use at first or – You have to know what commands to use • Real statisticians use it • Go to R (http://www.r-project.org/) ( p p j g ) • You don’t know how to get started Y d ’t k h t t t t d • Download “base” from CRAN and install it on – Irrelevant if you’re here today your computer • Open the program 3 4

Start of an R session RStudio interface On tak On your own computer Requires R; free download from http://rstudio.org/ 5 6 Getting help Handling data • Use the Help menu • Data can be numerical or text • Check out “Manuals” C ec out a ua s • Data can be organized into D b i d i Html help – http://www.r-project.org/ – Vectors (lists of values) – contributed documentation – Matrices (2-dimensional tables of data) • Use R’s help – Data frames (a combination of different types of data) ?median [show info] • Data can be entered ??median [search docs] • Search the web – By typing (using the “c” command to combine things) “ j di ” – “r-project median” – From files • Our favorite book: • Names of data should start with letters – Introductory Statistics with R – Uppercase + lowercase helps (myWTmice) (Peter Dalgard) – Can include dots (my.WT.mice) 7 8

Good practices Example commands # Number of tumors (from litter 2 on 11 July 2010) • Save all useful commands and rationale wt = c(5, 6, 7) – Add comments (starting with “#”) Add comments (starting with # ) ko = c(8, 9, 11) ko = c(8 9 11) – Use history() to get previous commands # Try default t-test settings (Welch's 2-sample t-test) t.test(wt, ko) • Two approaches # Do standard 2-sample t-test – Write commands in R and then paste into a text file, or t.test(wt, ko, var.equal=T) # Save the results as a variable • By convention, we end files of R commands with “.R” wt.vs.ko = t.test(wt, ko, var.equal=T) • Use a specific name for file (ex: compare_WT_KO_weights.R) # What are the different parts of this data frame? – Write commands in a text editor and paste into R session. names(wt.vs.ko) ( ) • Use the up-arrow to get to previous command # Just print the p-value wt.vs.ko$p.value – Minimize typing, as this increases potential errors. # What commands did we use? • To clear your R window, use Ctrl-L history(max.show=Inf) 9 10 Reading files - intro Running a series of commands • Copy and paste commands into R session, or • Take R to your preferred directory () • Execute a script in R, or • Execute a script in R or source("compare_WT_KO_weights.R") [but not so useful in this case, since we aren’t creating any files] • [tak only] – Change to working directory with Unix command cd /nfs/BaRC/Hot_Topics/Intro_to_R • Check where you are (e.g., get your working directory) y ( g , g y g y) – Run R, with script as input (print to screen), or R R ith i t i t ( i t t ) and see what files are there R --vanilla < compare_WT_KO_weights.R > getwd() – Run R, with script as input (save output) [1] "X:/bell/Hot_Topics/Intro_to_R“ R --vanilla < compare_WT_KO_weights.R > R_out.txt > dir() [1] "compare_WT_KO_weights.R" 11 12

Command output Reading data files • Usually it’s easiest to read data from a file – Organize in Excel with one-word column names Organize in Excel with one word column names – Save as tab-delimited text Partial output from • Check that file is there R on tak, if saved as a file (R_out.txt list.files() from previous • Read file slide), also looks something like this tumors = read delim("tumors wt ko txt" tumors = read.delim( tumors_wt_ko.txt , header=T) header=T) (but without the ( colors). • Check that it’s OK > tumors wt ko 1 5 8 2 6 9 3 7 11 13 14 Accessing data Creating an output table > tumors wt ko > tumors$wt # Use the column name 1 5 8 • Most analyses involve several outputs 2 6 9 [1] 5 6 7 • You may want to create a matrix to hold it all Y t t t t i t h ld it ll 3 7 11 > tumors[1:3,1] # [rows, columns] • Create an empty matrix [1] 5 6 7 > tumors[,1] # missing row or column => all – name rows and columns [1] 5 6 7 > tumors[1:2,1:2] # select a submatrix pvals.out = matrix(data=NA, ncol=2, nrow=2) wt ko colnames(pvals out) = c(“two tail" colnames(pvals.out) = c( two.tail , one.tail ) “one tail") 1 5 8 rownames(pvals.out) = c("Welch", "Wilcoxon") 2 6 9 pvals.out > t.test(tumors$wt, tumors$ko) # t-test as before two.tail one.tail Welch NA NA Wilcoxon NA NA 15 16

Filling the output table (matrix) Printing the output table • Do the stats • We may want to round the p-values pvals.out.rounded = round(pvals.out, 4) pvals out rounded = round(pvals out 4) # Welch’s test (t-test with pooled variance) # Welch’s test (t-test with pooled variance) pvals.out[1,1] = t.test(tumors$wt, tumors$ko)$p.value • Print the matrix (table) pvals.out[1,2] = t.test(tumors$wt, tumors$ko, write.table(pvals.out.rounded, alt="less")$p.value file="Tumor_pvals.txt", quote=F, sep="\t") # Wilcoxon rank sum test (non-parametric alternative to t-test) • Warning: output column names are shifted by 1 pvals.out[2,1] = wilcox.test(tumors$wt, when read in Excel tumors$ko)$p.value pvals.out[2,2] = wilcox.test(tumors$wt, tumors$ko, alt="less")$p.value pvals.out two.tail one.tail Welch 0.04191452 0.02095726 Wilcoxon 0.10000000 0.05000000 17 18 Introduction to figures Boxplot description • R is very powerful and very flexible with its figure generation generation • Any aspect of a figure should be modifiable <= 1.5 x IQR 75 th percentile • Some figures aren’t available in spreadsheets IQR median 25 th percentile • Boxplot example Any points beyond the whiskers are defined as defined as boxplot(tumors) # Simplest case “outliers” Right-click to save figure # Add some more details boxplot(tumors, col=c("gray", "red"), main="MFG appears to be a tumor suppressor", ylab="number of tumors") 19 20

Figure formats and sizes Bioconductor and other packages • By default, figures on tak are saved as “Rplots.pdf” • Many statisticians have extended R by creating packages (libraries) containing a set of commands packages (libraries) containing a set of commands • Helpful figure names can be included in code • Helpful figure names can be included in code to do something special – Ex: affy, limma, edgeR, made4 • To select name and size (in inches) of pdf file • For a huge list of Bioconductor packages, see pdf(“tumor_boxplot.pdf”, w=11, h=8.5) http://www.bioconductor.org/packages/release/Software.html boxplot(tumors) # can have >1 page • All require the package to be installed AND explicitly dev.off() # tell R that we’re done called, for example, library(limma) • To create another format (with size in pixels) • Install what you need on your computer or, for tak, png(“tumor_boxplot.png”, w=1800, h=1200) ask the IT group to install packages via boxplot(tumors) dev.off() http://tak.wi.mit.edu/trac/newticket 21 22 Other useful commands More resources from BaRC • “Statistics for Biologists” course: library() mean() round(x, n) – http://iona.wi.mit.edu/bio/education/stats2007/ http://iona wi mit edu/bio/education/stats2007/ dir() median() min() • “Microarray Analysis” course length() sd() max() – http://jura.wi.mit.edu/bio/education/bioinfo2007/arrays/ dim() rbind() paste() • R scripts for Bioinformatics – http://iona.wi.mit.edu/bio/bioinfo/Rscripts/ nrow() cbind() x[x>0] • List of R modules installed on tak List of R modules installed on tak ncol() l() sort() t() x[c(1,3,5)] [ (1 3 5)] – http://tak/trac/wiki/R unique() rev() seq(from, to, by) • We’re glad to share commands and/or scripts to t() log(x, base) commandArgs() get you started 23 24

Upcoming Hot Topics • Introduction to R Graphics (tomorrow) • • Introduction to Bioconductor Introduction to Bioconductor - microarray and RNA-Seq analysis microarray and RNA Seq analysis (Thursday) • Unix, Perl, and Perl modules (short course) • Quality control for high-throughput data • RNA-Seq analysis • Gene list enrichment analysis • Galaxy • S Sequence alignment: pairwise and multiple li t i i d lti l • See http://iona.wi.mit.edu/bio/hot_topics/ • Other ideas? Let us know. 25

Why use R? Introduction to R: To perform inferential statistics - PowerPoint PPT Presentation

Why use R? Introduction to R: To perform inferential statistics (e.g., use a statistical test to calculate a p value) test to calculate a p-value) Using R for statistics and data analysis U i R f t ti ti d d t l i To do real

For personal use only For personal use only For personal use only For personal use only For

Why Im NOT Why Im NOT Why Im NOT Why Im NOT a Hindu Why Im NOT a Hindu

How and Why We Use CTCSS (PL) in Amateur Radio How and Why We Use CTCSS (PL) in Amateur Radio How

Smart City Wonderful Life Guo

Why is there a price to pay? Why is there a price to pay? Why cant God just

International Time Use Community & Policy-Relevant Time Use Research Time Use Research

For personal use only nextdc.com 1 For personal use only nextdc.com 2 For personal use only

Use Cases Use Cases Use Use cases cases 2003 Giorgini Information Acquisition -- 1 Basi

ROCKIN the Ps! State Use Items Why is it mandatory to purchase State Use items? Why is it

July 29, 2018 WHY I PICKED POSTGRES OVER ORACLE WHY WHY This presentation Open source

Multi Use Civic Facility Multi Use Civic Facility Multi Use Civic Facility Multi Use Civic

Land Use Land Use Land Use Land Use Assumptions Plan Assumptions Plan 2011- 2020 2011- 2020

Wheelchair use in everyday life Stephen Sprigle Why understand wheelchair use in everyday

Treating Tobacco Use Disorders as an Addiction: Why clinicians should address it, and som e

Make clickers work Make clickers work Why use clickers? Why use clickers? for you... for

Why use Mannatech products ?? Why use Mannatech products ?? Modulate your immune system

Fast leakage assessment Oscar Reparaz COSIC / KU Leuven Benedikt Gierlichs CHES 2017 Taipei

Checking Assumptions Normal distributions: use probability plot (or quantile-quantile plot);

Estimating Mean and Need to Consider . . . Variance under Interval Simplest Case: . . .

Outlines Weighted Graph Cuts without Eigenvectors: A Multilevel Approach (PAMI 2007) User-Guided

University of the Basque Country (EHU) Systems for the NIST 2011 LRE Mikel Penagarikano, Amparo

Data privacy: introduction Vicen c Torra January 15, 2018 Privacy, Information and

Physics in DUNE UK WP1 Steve Dennis DUNE UK Meeting University of Manchester 05/03/2019 Steve

A First Course on Kinetics and Reaction Engineering Class 19 on Unit 18 Where Were Going

Why use R? Introduction to R: To perform inferential statistics - PowerPoint PPT Presentation

Why use R? Introduction to R: To perform inferential statistics (e.g., use a statistical test to calculate a p value) test to calculate a p-value) Using R for statistics and data analysis U i R f t ti ti d d t l i To do real

For personal use only For personal use only For personal use only For personal use only For

Why Im NOT Why Im NOT Why Im NOT Why Im NOT a Hindu Why Im NOT a Hindu

How and Why We Use CTCSS (PL) in Amateur Radio How and Why We Use CTCSS (PL) in Amateur Radio How

Smart City Wonderful Life Guo

Why is there a price to pay? Why is there a price to pay? Why cant God just

International Time Use Community &amp; Policy-Relevant Time Use Research Time Use Research

For personal use only nextdc.com 1 For personal use only nextdc.com 2 For personal use only

Use Cases Use Cases Use Use cases cases 2003 Giorgini Information Acquisition -- 1 Basi

ROCKIN the Ps! State Use Items Why is it mandatory to purchase State Use items? Why is it

July 29, 2018 WHY I PICKED POSTGRES OVER ORACLE WHY WHY This presentation Open source

Multi Use Civic Facility Multi Use Civic Facility Multi Use Civic Facility Multi Use Civic

Land Use Land Use Land Use Land Use Assumptions Plan Assumptions Plan 2011- 2020 2011- 2020

Wheelchair use in everyday life Stephen Sprigle Why understand wheelchair use in everyday

Treating Tobacco Use Disorders as an Addiction: Why clinicians should address it, and som e

Make clickers work Make clickers work Why use clickers? Why use clickers? for you... for

Why use Mannatech products ?? Why use Mannatech products ?? Modulate your immune system

Fast leakage assessment Oscar Reparaz COSIC / KU Leuven Benedikt Gierlichs CHES 2017 Taipei

Checking Assumptions Normal distributions: use probability plot (or quantile-quantile plot);

Estimating Mean and Need to Consider . . . Variance under Interval Simplest Case: . . .

Outlines Weighted Graph Cuts without Eigenvectors: A Multilevel Approach (PAMI 2007) User-Guided

University of the Basque Country (EHU) Systems for the NIST 2011 LRE Mikel Penagarikano, Amparo

Data privacy: introduction Vicen c Torra January 15, 2018 Privacy, Information and

Physics in DUNE UK WP1 Steve Dennis DUNE UK Meeting University of Manchester 05/03/2019 Steve

A First Course on Kinetics and Reaction Engineering Class 19 on Unit 18 Where Were Going

International Time Use Community & Policy-Relevant Time Use Research Time Use Research