
A parallel random forest classifier for R in SPRINT
Lawrence Mitchell, EPCC
lawrence.mitchell@ed.ac.uk

Outline
• SPRINT
• Random Forests
• Parallelisation results/lessons
• Future work

SPRINT
• The Simple Parallel R Interface, www.r-sprint.org [Hill et al, BMC Bioinformatics (2008)]
• Enable R users to easily exploit HPC resources
• Motivating use case: analysis of gene expression data
• R library: install.packages("sprint")
• Parallel replacements for time-consuming analysis functions
• Written in C with MPI for parallelisation
• Portable – multicore desktops; clusters; supercomputers; EC2

Example: permutation adjusted p-values

Serial version, using multtest:

    library(multtest)
    data(golub)
    ## Can take a long time
    resT <- mt.maxT(golub, golub.cl, test="t", side="abs")
    quit(save="no")

Parallel version, using SPRINT:

    library(sprint)
    data(golub, package="multtest")
    ## Run in parallel
    resT <- pmaxT(golub, golub.cl, test="t", side="abs")
    pterminate()
    quit(save="no")

Supported functionality
• Available now on CRAN
  – Pearson correlation (replacing cor)
  – Permutation adjusted p-values (replacing mt.maxT, from multtest)
  – Partitioning around medoids (replacing pam)
• Coming soon: implemented, in test phase
  – Parallel apply (like apply)
  – Bootstrapping (replacing boot)
  – Random Forests (builds on randomForest package)
  – Rank Products (implements functionality from Bioconductor package RP)
• Under development
  – RMA analysis (affy package from Bioconductor)

Classifying data
• Have probe data from 100 patients in two groups
• Want to classify further patient data into the correct group
  – e.g. susceptibility to some disease
• Construct a classification model using training data
• Predict the classification of unseen data
• Random Forests provide such a method

Random Forests
• An ensemble tree classifier
• Bootstrap samples of a dataset
  – genetic data typically: O(100) cases; O(10000) probes
• One tree per sample, giving a forest ...
• Compute classifications over this ensemble
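The two ingredients named above, bootstrap sampling and classification over the ensemble, can be sketched in base R. This is an illustrative sketch only, not code from SPRINT or randomForest: the forest size, the seed and the helper majority_vote are assumptions made for the example, and no trees are actually grown.

```r
# Illustrative sketch: bootstrap sampling and majority voting in base R.
set.seed(42)      # arbitrary seed, for reproducibility only
n_cases <- 100    # O(100) cases, as on the slide
n_trees <- 25     # hypothetical forest size

# One bootstrap sample per tree: draw n_cases row indices with replacement.
bootstraps <- lapply(seq_len(n_trees), function(i) {
  sample(n_cases, size = n_cases, replace = TRUE)
})

# Cases never drawn for a given tree are "out of bag" for that tree.
oob_fraction <- sapply(bootstraps, function(idx) {
  length(setdiff(seq_len(n_cases), idx)) / n_cases
})

# Hypothetical helper: each tree votes for a class label and the
# forest returns the most common label.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

majority_vote(c("A", "B", "A", "A", "B"))  # "A"
```

Each tree sees only its own bootstrap sample, so the trees differ from one another, and the vote over all trees gives the forest's classification.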

Existing implementations
• randomForest [Liaw & Wiener, R News (2002)]
  – serial, R version
• parf [Topić et al, Parallel Numerics (2005)]
  – task parallel F90, no R version
• randomjungle [Schwarz et al, Bioinformatics (2010)]
  – task parallel C++, designed for microarray data, no R version

Parallelisation
• Build on existing serial R package
• Spread the generation of bootstraps across processes
• Combine the results on single (master) process
• 96000 probes; 65 cases
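The strategy above can be imitated at the R level with the serial randomForest package, whose combine() function merges separately grown forests. This is a sketch under stated assumptions, not the SPRINT implementation (which is written in C with MPI): lapply() stands in for separate processes, and the worker count, tree count and the iris dataset are chosen purely for illustration.

```r
library(randomForest)  # the serial package the slides build on

set.seed(1)            # arbitrary seed, for reproducibility only
n_workers <- 4         # hypothetical number of processes
total_trees <- 200     # hypothetical forest size
trees_per_worker <- total_trees / n_workers

# Each "worker" grows its share of the trees on its own bootstrap samples.
# lapply() is a stand-in here; in a real run each call would be a process.
partial_forests <- lapply(seq_len(n_workers), function(rank) {
  randomForest(Species ~ ., data = iris,
               ntree = trees_per_worker, norm.votes = FALSE)
})

# The master merges the partial forests into a single ensemble.
forest <- do.call(randomForest::combine, partial_forests)

# The merged forest predicts like any other randomForest object.
predictions <- predict(forest, iris)
```

Growing the trees is independent work, so it spreads across processes without communication. The merge on the master is the serial step.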

Combining takes time

Combine in parallel
• Do a tree reduction (compare MPI_Reduce)
• 24000 probes; 65 cases
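A tree reduction merges results pairwise, round by round, instead of sending everything to one process. The base R sketch below shows the pattern only: tree_reduce and merge_fn are hypothetical names, the merges run serially here, and integers stand in for the partial forests that would be merged in practice.

```r
# Pairwise (tree) reduction: merge neighbours round by round
# until a single item is left. Returns the result and the round count.
tree_reduce <- function(items, merge_fn) {
  stopifnot(length(items) >= 1)
  rounds <- 0
  while (length(items) > 1) {
    n <- length(items)
    pairs <- seq(1, n - 1, by = 2)   # left index of each pair
    merged <- lapply(pairs, function(i) merge_fn(items[[i]], items[[i + 1]]))
    if (n %% 2 == 1) {               # odd item out moves on unmerged
      merged <- c(merged, items[n])
    }
    items <- merged
    rounds <- rounds + 1
  }
  list(result = items[[1]], rounds = rounds)
}

# Eight "partial results" are merged in three rounds.
out <- tree_reduce(as.list(1:8), `+`)
out$result  # 36
out$rounds  # 3
```

With P processes, the merges within a round can run at the same time on different processes, so the number of sequential merge steps is about log2(P) rather than the P - 1 needed when a single master combines everything.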

Faster time to solution
• More efficient exploitation of computing resource
• 96000 probes; 65 cases

Lessons
• HPC resources are hard to exploit efficiently
• Profile and benchmark your implementation carefully

Future work in SPRINT
• Better (transparent) support for large datasets: starts October '11
• More analysis routines – we need user input here
• More efficient serial random forest (use randomjungle?)
• ...

Thanks

EPCC Team
• Terry Sloan
• Michal Piotrowski
• Lawrence Mitchell
• Savvas Petrou
• Bartek Dobrzelecki
• Jon Hill
• Florian Scharinger

DPM Team
• Peter Ghazal
• Thorsten Forster
• Muriel Mewissen

Wellcome Trust grant 086696/Z/08/Z and edikt2
The HECToR distributed CSE service operated by NAG Ltd
The Centre for Numerical Algorithms and Intelligent Software

Questions?
