A parallel random forest classifier for R in SPRINT
Lawrence Mitchell
EPCC
lawrence.mitchell@ed.ac.uk
Outline
• SPRINT
• Random Forests
• Parallelisation results/lessons
• Future work
SPRINT
• The Simple Parallel R Interface www.r-sprint.org [Hill et al, BMC Bioinformatics (2008)]
• Enable R users to easily exploit HPC resources
• Motivating use case: analysis of gene expression data
• R library: install.packages("sprint")
• Parallel replacements for time-consuming analysis functions
• Written in C with MPI for parallelisation
• Portable – multicore desktops; clusters; supercomputers; EC2
Example: permutation adjusted p-values
Serial version:
    library(multtest)
    data(golub)
    ## Can take a long time
    resT <- mt.maxT(golub, golub.cl, test="t", side="abs")
    quit(save="no")
Parallel version with SPRINT:
    library(sprint)
    ## The golub dataset ships with multtest
    data(golub, package="multtest")
    ## Run in parallel
    resT <- pmaxT(golub, golub.cl, test="t", side="abs")
    pterminate()
    quit(save="no")
Supported functionality
• Available now on CRAN
– Pearson correlation (replacing cor)
– Permutation adjusted p-values (replacing mt.maxT, from multtest)
– Partitioning around medoids (replacing pam)
• Coming soon: implemented, in test phase
– Parallel apply (like apply)
– Bootstrapping (replacing boot)
– Random Forests (builds on randomForest package)
– Rank Products (implements functionality from Bioconductor package RP)
• Under development
– RMA analysis (affy package from Bioconductor)
Classifying data
• Have probe data from 100 patients in two groups
• Want to classify further patient data into the correct group
– e.g. susceptibility to some disease
• Construct a classification model using training data
• Predict the classification of unseen data
• Random Forests provide such a method
Random Forests
• An ensemble tree classifier
• Bootstrap samples of a dataset
– genetic data typically: O(100) cases; O(10000) probes
• One tree per sample, giving a forest ...
• Compute classifications over this ensemble
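The bootstrap-then-vote idea can be sketched in a few lines of base R. This is an illustration only, not the randomForest algorithm: each "tree" here is a one-split stump, and the data sizes, the mtry value and the helper names fit_stump and predict_stump are all hypothetical.

```r
## Bagging sketch in base R. Real random forests grow full trees and
## sample candidate variables at every split; this uses one-split stumps.
set.seed(1)

n_cases  <- 100   # hypothetical sizes, far smaller than real probe data
n_probes <- 20
x <- matrix(rnorm(n_cases * n_probes), nrow = n_cases)
y <- rep(c(0L, 1L), each = n_cases / 2)
## Make probes 1 to 5 informative by shifting them for class 1.
x[y == 1L, 1:5] <- x[y == 1L, 1:5] + 2

## Fit a stump: among mtry randomly chosen probes, pick the one whose
## class means differ most and split at the midpoint of those means.
fit_stump <- function(x, y, mtry = 5) {
  probes <- sample(ncol(x), mtry)
  m0 <- colMeans(x[y == 0L, probes, drop = FALSE])
  m1 <- colMeans(x[y == 1L, probes, drop = FALSE])
  best <- which.max(abs(m1 - m0))
  list(probe = probes[best],
       cut   = (m0[best] + m1[best]) / 2,
       up    = m1[best] > m0[best])
}

## Predict class 1 on the side of the cut where class 1 had the larger mean.
predict_stump <- function(s, x) {
  above <- x[, s$probe] > s$cut
  as.integer(if (s$up) above else !above)
}

## One bootstrap sample (cases drawn with replacement) per stump.
n_trees <- 51
forest <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(n_cases, replace = TRUE)
  fit_stump(x[idx, , drop = FALSE], y[idx])
})

## Classify by majority vote over the ensemble.
votes    <- sapply(forest, predict_stump, x = x)   # n_cases x n_trees
pred     <- as.integer(rowMeans(votes) > 0.5)
accuracy <- mean(pred == y)
```

Because every stump depends only on its own bootstrap sample, the lapply over trees has no dependencies between iterations, which is what makes the method a natural candidate for task parallelism.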
Existing implementations
• randomForest [Liaw & Wiener, R News (2002)]
– serial, R version
• parf [Topić et al, Parallel Numerics (2005)]
– task parallel F90, no R version
• randomjungle [Schwarz et al, Bioinformatics (2010)]
– task parallel C++, designed for microarray data, no R version
Parallelisation
• Build on existing serial R package
• Spread the generation of bootstraps across processes
• Combine the results on single (master) process
• 96000 probes; 65 cases
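The scheme above can be imitated with the serial randomForest package alone. The sketch below is not SPRINT's implementation: it assumes randomForest is installed, uses the built-in iris data as a stand-in dataset, and runs the hypothetical "ranks" one after another with lapply where SPRINT would use separate MPI processes.

```r
library(randomForest)   # the serial CRAN package the slides build on

n_proc <- 4     # hypothetical number of processes
n_tree <- 500   # total number of trees wanted

## Share the trees as evenly as possible: the first (n_tree %% n_proc)
## ranks grow one extra tree.
share <- rep(n_tree %/% n_proc, n_proc) + (seq_len(n_proc) <= n_tree %% n_proc)

## Stand-in for the parallel step. Each iteration plays the role of one
## process growing its own share of the forest.
sub_forests <- lapply(seq_len(n_proc), function(rank) {
  set.seed(1000 + rank)   # different bootstrap samples on each rank
  randomForest(x = iris[, 1:4], y = iris$Species, ntree = share[rank])
})

## Combine on the "master": randomForest::combine merges several forests
## grown on the same data into a single randomForest object.
forest <- do.call(combine, sub_forests)
pred   <- predict(forest, iris[, 1:4])
```

Seeding each rank differently matters: with identical seeds every process would draw the same bootstrap samples and the combined forest would contain duplicate trees.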
Combining takes time
Combine in parallel
• Do a tree reduction (compare MPI_Reduce)
• 24000 probes; 65 cases
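A tree reduction merges partial results pairwise, so P partial results need about log2(P) rounds of merging rather than P - 1 merges performed one after another on the master. A minimal base R sketch follows; tree_reduce is a hypothetical helper, the per-process results are made-up count vectors, and the pairs within a round are merged sequentially here, whereas under MPI each pair would be merged on a different process at the same time.

```r
## Pairwise (tree) reduction over a non-empty list of partial results.
## combine_fn must be associative for the answer to match a sequential fold.
tree_reduce <- function(parts, combine_fn) {
  rounds <- 0
  while (length(parts) > 1) {
    n <- length(parts)
    pairs <- seq(1, n - 1, by = 2)
    ## Merge neighbours (1,2), (3,4), ...; these merges are independent.
    merged <- lapply(pairs, function(i) combine_fn(parts[[i]], parts[[i + 1]]))
    ## With an odd count, the last element carries over to the next round.
    if (n %% 2 == 1) merged <- c(merged, parts[n])
    parts <- merged
    rounds <- rounds + 1
  }
  list(result = parts[[1]], rounds = rounds)
}

## Hypothetical per-process results: one integer count vector per rank.
set.seed(42)
n_proc <- 8
parts  <- lapply(seq_len(n_proc), function(rank) rpois(5, lambda = 10))

out        <- tree_reduce(parts, `+`)
sequential <- Reduce(`+`, parts)   # what a single master would compute
```

The same helper works for merging forests if combine_fn is replaced by a function that merges two forest objects, provided that merge is associative.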
Faster time to solution
• More efficient exploitation of computing resource
• 96000 probes; 65 cases
Lessons
• HPC resources are hard to exploit efficiently
• Profile and benchmark your implementation carefully
Future work in SPRINT
• Better (transparent) support for large datasets: starts October 2011
• More analysis routines – we need user input here
• More efficient serial random forest (use randomjungle?)
• ...
Thanks
DPM Team
• Peter Ghazal
• Thorsten Forster
• Muriel Mewissen
EPCC Team
• Terry Sloan
• Michal Piotrowski
• Lawrence Mitchell
• Savvas Petrou
• Bartek Dobrzelecki
• Jon Hill
• Florian Scharinger
Wellcome Trust grant 086696/Z/08/Z and edikt2
The HECToR distributed CSE service operated by NAG Ltd
The Centre for Numerical Algorithms and Intelligent Software
Questions?