Power of Ensembles
Bargava Subramanian, Data Scientist, Cisco Systems, India

Two huntsmen go bird-hunting. Both huntsmen can hit a target with a probability of 0.2. They see a flock of 150 birds atop a banyan tree. The first huntsman takes aim and fires three consecutive shots. A minute after that, the second huntsman fires three shots at the banyan tree. How many birds did the second huntsman shoot?

And then, there were none.

Your model is only as good as you (and your features)

Feature identification/creation/generation takes a lot of time

Two different models with the same features can result in different outputs. Why?

Because they searched different regions of the solution space.

Some common problems faced by modelers
1. Different models
2. Model parameters
3. Number of features

Possible Solution Approach?

Ensemble models are our friends

What is an ensemble?

CPU as a proxy for human IQ

Clever algorithmic way to search the solution space

But is it new?

Known to researchers and academia for a long time. Wasn't widely used in industry until ....

Success Story: Netflix $1 million prize competition

Some Advantages
1. Improved accuracy
2. Robustness
3. Parallelization
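The accuracy gain can be illustrated with a small calculation. This is a sketch under strong assumptions that are not from the talk: a binary problem, an odd number of models, every model correct with the same probability p, and errors that are fully independent. The function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is right with probability p, independently of the others.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Compare a single model with ensembles of 5 and 25 models, each 60% accurate.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Real base models are rarely independent, so the gain in practice is usually smaller than this idealized figure.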

Base model diversity
Model aggregation

Base Model
1. Different training sets
2. Feature sampling
3. Different algorithms
4. Different hyperparameters
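The first two sources of diversity can be sketched with the standard library alone. This is a minimal illustration, not the speaker's code: `bootstrap_indices` and `feature_subset` are hypothetical helper names, and the row and feature counts are arbitrary.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def feature_subset(n_features, k, rng):
    """Pick k distinct feature indices without replacement."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows, n_features = 10, 6

# Each base model would be trained on its own (rows, features) view of the data.
views = [
    (bootstrap_indices(n_rows, rng), feature_subset(n_features, 3, rng))
    for _ in range(3)
]
for rows, cols in views:
    print("rows:", rows, "features:", cols)
```

Because each view repeats some rows, omits others, and hides some features, models trained on different views tend to make different mistakes.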

Model Aggregation
1. Voting
2. Averaging
3. Bagging
4. Stacking
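Voting and averaging, the two simplest items in this list, might look like the sketch below. The predictions are made-up values standing in for the outputs of three base models, and the function names are illustrative. Bagging and stacking are not covered here.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def average(scores):
    """Return the arithmetic mean of the models' numeric predictions."""
    return mean(scores)


# Hypothetical outputs of three base models for a single example.
class_predictions = ["spam", "ham", "spam"]
prob_predictions = [0.9, 0.4, 0.7]

voted = majority_vote(class_predictions)
averaged = average(prob_predictions)
print(voted, round(averaged, 3))
```

Voting suits class labels, while averaging suits probabilities or regression outputs.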

WHERE IS PYTHON?

RandomizedSearchCV

    from scipy.stats import randint as sp_randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # build a classifier
    clf = RandomForestClassifier(n_estimators=20)

    # specify parameters and distributions to sample from
    # (max_features up to 10 assumes the data has at least 10 features)
    param_dist = {
        "max_depth": [3, None],
        "max_features": sp_randint(1, 11),
        "min_samples_split": sp_randint(2, 11),  # must be at least 2
        "min_samples_leaf": sp_randint(1, 11),
        "bootstrap": [True, False],
        "criterion": ["gini", "entropy"],
    }

    # run randomized search
    n_iter_search = 20
    random_search = RandomizedSearchCV(
        clf, param_distributions=param_dist, n_iter=n_iter_search
    )
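The idea behind randomized search can be sketched without scikit-learn: draw a fixed number of parameter settings at random, score each one, and keep the best. The sketch below is a simplification, not the library's implementation. `toy_score` is a hypothetical stand-in for a cross-validated model score, and `random_search` is an illustrative name.

```python
import random


def random_search(score, param_dist, n_iter, seed=0):
    """Sample n_iter random settings and return the best (params, score) pair."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per parameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in param_dist.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that peaks at max_depth=5, min_samples_leaf=2."""
    return -abs(params["max_depth"] - 5) - abs(params["min_samples_leaf"] - 2)


param_dist = {
    "max_depth": list(range(1, 11)),
    "min_samples_leaf": list(range(1, 11)),
}
best_params, best_score = random_search(toy_score, param_dist, n_iter=20)
print(best_params, best_score)
```

With 100 possible settings and only 20 draws, the search trades a guaranteed optimum for a much smaller budget, which is the same trade-off `n_iter` controls in `RandomizedSearchCV`.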

hyperopt
Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions.
https://github.com/hyperopt/hyperopt

hyperopt

    from hyperopt import fmin, hp, tpe

    # define an objective function
    def objective(args):
        # Define the objective function here; it must return the loss to minimize
        raise NotImplementedError

    # define a search space
    # (randomForestModel and xgboostModel are model objects assumed to be defined elsewhere)
    space = hp.choice('a', [
        ('Model 1', randomForestModel),
        ('Model 2', xgboostModel),
    ])

    # minimize the objective over the space
    best = fmin(objective, space, algo=tpe.suggest, max_evals=100)

joblib
1. Transparent disk-caching of the output values and lazy re-evaluation (memoize pattern)
2. Easy, simple parallel computing
3. Logging and tracing of the execution

joblib

    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # build a classifier
    # (assumption: train.csv holds the features plus a label column named 'target')
    train = pd.read_csv('train.csv')
    X, y = train.drop(columns='target'), train['target']
    clf = RandomForestClassifier(n_estimators=20)
    clf.fit(X, y)

    # once the classifier is built we can store it on disk as a serialized object
    # and load it later to predict, instead of retraining.
    joblib.dump(clf, 'randomforest_20estimator.pkl')
    clf = joblib.load('randomforest_20estimator.pkl')

Disadvantages
1. Model human readability isn't great
2. Time/effort trade-off to improve accuracy may not make sense

Questions?