A parallel random forest classifier for R in SPRINT Lawrence Mitchell EPCC lawrence.mitchell@ed.ac.uk
Outline
• SPRINT
• Random Forests
• Parallelisation results/lessons
• Future work

SPRINT
• The Simple Parallel R Interface: www.r-sprint.org [Hill et al, BMC Bioinformatics (2008)]
• Enable R users to easily exploit HPC resources
• Motivating use case: analysis of gene expression data
• R library: install.packages("sprint")
• Parallel replacements for time-consuming analysis functions
• Written in C with MPI for parallelisation
• Portable – multicore desktops; clusters; supercomputers; EC2

Example: permutation adjusted p-values

Serial, with multtest:

    library(multtest)
    data(golub)
    ## Can take a long time
    resT <- mt.maxT(golub, golub.cl, test="t", side="abs")
    quit(save="no")

Parallel, with SPRINT:

    library(sprint)
    data(golub)
    ## Run in parallel
    resT <- pmaxT(golub, golub.cl, test="t", side="abs")
    pterminate()
    quit(save="no")

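To show why this computation is expensive and why it parallelises well, here is a minimal base-R sketch of permutation adjusted p-values. It is not the multtest or SPRINT code: it uses the simpler single-step maxT adjustment (mt.maxT uses a step-down variant), and the toy data and the helper abs_t are hypothetical. Every permutation recomputes a statistic for every gene, and the permutations are independent of each other.

```r
set.seed(1)
# Hypothetical toy data: 20 genes (rows) x 10 samples (columns), two groups of 5
x  <- matrix(rnorm(200), nrow = 20)
cl <- rep(c(0, 1), each = 5)
x[1, cl == 1] <- x[1, cl == 1] + 5   # shift gene 1 in the second group

# Absolute two-sample Welch t-statistic for every row (gene)
abs_t <- function(x, cl) {
  a  <- x[, cl == 0, drop = FALSE]
  b  <- x[, cl == 1, drop = FALSE]
  se <- sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
  abs(rowMeans(b) - rowMeans(a)) / se
}

obs <- abs_t(x, cl)

# For each of B random relabellings, keep the largest statistic over all genes
B <- 200
max_perm <- replicate(B, max(abs_t(x, sample(cl))))

# Single-step maxT adjusted p-value: the fraction of permutations whose
# maximum statistic is at least as large as the observed statistic
adj_p <- sapply(obs, function(t) mean(max_perm >= t))
```

The replicate() loop is the costly part, and since its iterations do not depend on each other it is the natural piece to spread over processes.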
Supported functionality
• Available now on CRAN
  – Pearson correlation (replacing cor)
  – Permutation adjusted p-values (replacing mt.maxT, from multtest)
  – Partitioning around medoids (replacing pam)
• Coming soon: implemented, in test phase
  – Parallel apply (like apply)
  – Bootstrapping (replacing boot)
  – Random Forests (builds on randomForest package)
  – Rank Products (implements functionality from Bioconductor package RP)
• Under development
  – RMA analysis (affy package from Bioconductor)

Classifying data
• Have probe data from 100 patients in two groups
• Want to classify further patient data into the correct group
  – e.g. susceptibility to some disease
• Construct a classification model using training data
• Predict the classification of unseen data
• Random Forests provide such a method

Random Forests
• An ensemble tree classifier
• Bootstrap samples of a dataset
  – genetic data typically: O(100) cases; O(10000) probes
• One tree per sample, giving a forest ...
• Compute classifications over this ensemble

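The steps above can be sketched in base R. This is a toy stand-in, not the randomForest algorithm: each "tree" is a single split (a stump) chosen from a random subset of probes, and the data, fit_stump, predict_stump and the value mtry = 25 are all hypothetical choices made for illustration.

```r
set.seed(42)
# Hypothetical toy data: 60 cases x 50 probes; probe 1 separates the two groups
n <- 60; p <- 50
x <- matrix(rnorm(n * p), nrow = n)
y <- rep(c(0, 1), each = n / 2)
x[y == 1, 1] <- x[y == 1, 1] + 4

# Fit a one-split tree (stump) using a random subset of mtry probes
fit_stump <- function(x, y, mtry) {
  best <- list(err = Inf)
  for (j in sample(ncol(x), mtry)) {
    for (thr in unique(x[, j])) {
      pred <- as.integer(x[, j] > thr)
      for (flip in c(FALSE, TRUE)) {
        err <- mean((if (flip) 1 - pred else pred) != y)
        if (err < best$err) best <- list(err = err, j = j, thr = thr, flip = flip)
      }
    }
  }
  best
}

predict_stump <- function(s, x) {
  pred <- as.integer(x[, s$j] > s$thr)
  if (s$flip) 1 - pred else pred
}

# One stump per bootstrap sample of the cases, giving a (toy) forest
forest <- lapply(seq_len(100), function(i) {
  idx <- sample(n, replace = TRUE)
  fit_stump(x[idx, , drop = FALSE], y[idx], mtry = 25)
})

# Classify by majority vote over the ensemble
votes    <- sapply(forest, predict_stump, x = x)   # cases x trees matrix of 0/1 votes
pred     <- as.integer(rowMeans(votes) > 0.5)
accuracy <- mean(pred == y)
```

Each element of forest is built from its own bootstrap sample and nothing else, which is what makes the generation of trees easy to distribute.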
Existing implementations
• randomForest [Liaw & Wiener, R News (2002)]
  – serial, R version
• parf [Topić et al, Parallel Numerics (2005)]
  – task parallel F90, no R version
• randomjungle [Schwarz et al, Bioinformatics (2010)]
  – task parallel C++, designed for microarray data, no R version

Parallelisation
• Build on existing serial R package
• Spread the generation of bootstraps across processes
• Combine the results on a single (master) process
• 96000 probes; 65 cases

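A minimal sketch of this decomposition, assuming the work is divided by giving each process an even share of the trees. It is not the SPRINT implementation: trees_per_process and grow_subforest are hypothetical helpers, each "tree" is reduced to its bootstrap index vector, the figures 500 trees and 8 processes are made up, and a sequential Map() stands in for the processes.

```r
# Split ntree trees as evenly as possible over nproc processes
trees_per_process <- function(ntree, nproc) {
  base  <- ntree %/% nproc
  extra <- ntree %% nproc
  base + as.integer(seq_len(nproc) <= extra)
}

# Stand-in for growing a sub-forest: each "tree" is just the bootstrap
# sample (indices of cases drawn with replacement) it would be grown from
grow_subforest <- function(k, n_cases, seed) {
  set.seed(seed)  # a different seed per process so the samples differ
  lapply(seq_len(k), function(i) sample(n_cases, replace = TRUE))
}

counts <- trees_per_process(ntree = 500, nproc = 8)

# Map() runs sequentially here; in a parallel run each call would be one process
subforests <- Map(grow_subforest, counts, n_cases = 65, seed = seq_along(counts))

# Master process: concatenate the sub-forests into a single forest
forest <- do.call(c, subforests)
```

Setting a different seed per process is only a simple illustration; a real run would want properly independent random number streams.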
Combining takes time
Combine in parallel
• Do a tree reduction (compare MPI_Reduce)
• 24000 probes; 65 cases

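The idea of a tree reduction can be sketched in base R as follows. This is an illustration of the pattern only, not SPRINT or MPI code: tree_reduce is a hypothetical helper, and the pairs in each round are combined one after another here, whereas in a parallel run each pair would be combined on a different process at the same time.

```r
# Pairwise (tree) reduction: P partial results are combined in about
# log2(P) rounds rather than one at a time on a single process
tree_reduce <- function(parts, combine) {
  rounds <- 0
  while (length(parts) > 1) {
    n      <- length(parts)
    pairs  <- seq(1, n - 1, by = 2)
    merged <- lapply(pairs, function(i) combine(parts[[i]], parts[[i + 1]]))
    if (n %% 2 == 1) merged <- c(merged, parts[n])  # odd one out carries over
    parts  <- merged
    rounds <- rounds + 1
  }
  list(result = parts[[1]], rounds = rounds)
}

# Eight partial results combined with `+` take three rounds
sums <- tree_reduce(as.list(1:8), `+`)

# Sub-forests (lists of trees) can be combined the same way with c()
sub    <- list(list("t1", "t2"), list("t3"), list("t4", "t5"),
               list("t6"), list("t7"))
joined <- tree_reduce(sub, c)
```

With a single master, combining P sub-forests takes P - 1 sequential steps; with the pairwise scheme the number of rounds grows only with the logarithm of P.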
Faster time to solution
• More efficient exploitation of computing resources
• 96000 probes; 65 cases

Lessons
• HPC resources are hard to exploit efficiently
• Profile and benchmark your implementation carefully

Future work in SPRINT
• Better (transparent) support for large datasets: starts October 2011
• More analysis routines – we need user input here
• More efficient serial random forest (use randomjungle?)
• ...

Thanks
DPM Team
• Peter Ghazal
• Thorsten Forster
• Muriel Mewissen
EPCC Team
• Terry Sloan
• Michal Piotrowski
• Lawrence Mitchell
• Savvas Petrou
• Bartek Dobrzelecki
• Jon Hill
• Florian Scharinger
Wellcome Trust grant 086696/Z/08/Z and edikt2
The HECToR distributed CSE service operated by NAG Ltd
The Centre for Numerical Algorithms and Intelligent Software

Questions?