CS480/680 Lecture 24: July 29, 2019
Gradient Boosting, Bagging, Decision Forests
Readings: [RN] Sec. 18.10; [M] Sec. 16.2.5, 16.4.5; [B] Chap. 14; [HTF] Chap. 10, 15-16; [D] Chap. 13
University of Waterloo, CS480/680 Spring 2019, Pascal Poupart
Gradient Boosting
• AdaBoost is designed for classification
• How can we use boosting for regression?
• Answer: Gradient Boosting
Gradient Boosting Idea:
• Predictor $f_t(x)$ at stage $t$ incurs loss $\ell(f_t(x), y)$
• Train $h_{t+1}$ to approximate the negative gradient:
  $h_{t+1}(x) \approx -\frac{\partial \ell(f_t(x), y)}{\partial f_t(x)}$
• Update the predictor by adding a multiple $\eta_{t+1}$ of $h_{t+1}$:
  $f_{t+1}(x) \leftarrow f_t(x) + \eta_{t+1}\, h_{t+1}(x)$
Squared Loss
• Consider the squared loss $\ell(f_t(x_n), y_n) = \frac{1}{2}\big(f_t(x_n) - y_n\big)^2$
• The negative gradient corresponds to the residual:
  $-\frac{\partial \ell(f_t(x_n), y_n)}{\partial f_t(x_n)} = y_n - f_t(x_n) = r_n$
• Train the base learner $h_{t+1}$ with the residual dataset $\{(x_n, r_n)\ \forall n\}$
• The base learner $h_{t+1}$ can be any non-linear predictor (often a small decision tree)
Gradient Boosting Algorithm
• Initialize the predictor with a constant: $f_0(x) = \arg\min_c \sum_n \ell(c, y_n)$
• For $t = 1$ to $T$ do
  – Compute pseudo-residuals: $r_n = -\frac{\partial \ell(f_{t-1}(x_n), y_n)}{\partial f_{t-1}(x_n)}$
  – Train a base learner $h_t$ with the residual dataset $\{(x_n, r_n)\ \forall n\}$
  – Optimize the step length: $\eta_t = \arg\min_\eta \sum_n \ell\big(f_{t-1}(x_n) + \eta\, h_t(x_n),\ y_n\big)$
  – Update the predictor: $f_t(x) \leftarrow f_{t-1}(x) + \eta_t\, h_t(x)$
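To make the loop above concrete, here is a minimal sketch of gradient boosting for squared loss, where the pseudo-residuals are just ordinary residuals and each base learner is a small decision tree. It uses a fixed step length eta in place of the line search on the slide (a common practical choice); the class name and hyperparameter values are illustrative, not part of the lecture.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    class GradientBoostedRegressor:
        """Minimal gradient boosting for squared loss (illustrative sketch)."""
        def __init__(self, n_stages=100, eta=0.1, max_depth=3):
            self.n_stages = n_stages
            self.eta = eta            # fixed step length instead of a line search
            self.max_depth = max_depth

        def fit(self, X, y):
            # f_0(x): the constant minimizing squared loss is the mean of y
            self.f0 = y.mean()
            self.learners = []
            f = np.full(len(y), self.f0)
            for _ in range(self.n_stages):
                r = y - f                      # pseudo-residuals for squared loss
                h = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, r)
                f += self.eta * h.predict(X)   # f_t = f_{t-1} + eta * h_t
                self.learners.append(h)
            return self

        def predict(self, X):
            f = np.full(X.shape[0], self.f0)
            for h in self.learners:
                f += self.eta * h.predict(X)
            return f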
XGBoost
• eXtreme Gradient Boosting
  – Package optimized for speed and accuracy
  – XGBoost used in >12 winning entries for various machine learning challenges:
    https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions
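A minimal usage sketch, assuming the xgboost Python package is installed; the synthetic data and hyperparameter values are illustrative:

    import numpy as np
    import xgboost as xgb

    # Synthetic regression data (illustrative only)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)

    # n_estimators = number of boosting stages T, learning_rate = step length eta
    model = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
    model.fit(X, y)
    preds = model.predict(X)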
Boosting vs Bagging
• Review
Independent classifiers/predictors
• How can we obtain independent classifiers/predictors for bagging?
• Bootstrap sampling
  – Sample (with replacement) a subset of the data
• Random projection
  – Sample (without replacement) a subset of the features
• Learn a different classifier/predictor from each data subset and feature subset
Bagging
For k = 1 to K
  D_k ← sample a data subset
  F_k ← sample a feature subset
  h_k ← train a classifier/predictor based on D_k and F_k
Classification: majority(h_1(x), …, h_K(x))
Regression: average(h_1(x), …, h_K(x))
Random forest: bag of decision trees
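A minimal sketch of this loop with decision-tree base learners (i.e., a small random forest); the function names and the feature fraction are illustrative choices:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, K=25, feature_frac=0.6, seed=0):
        """Train K trees, each on a bootstrap sample and a random feature subset."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        ensemble = []
        for _ in range(K):
            rows = rng.integers(0, n, size=n)  # bootstrap: sample with replacement
            cols = rng.choice(d, size=max(1, int(feature_frac * d)), replace=False)
            h = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
            ensemble.append((h, cols))
        return ensemble

    def bagging_predict(ensemble, X):
        """Majority vote over the K classifiers (assumes integer class labels)."""
        votes = np.stack([h.predict(X[:, cols]) for h, cols in ensemble])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)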
Application: Xbox 360 Kinect
• Microsoft Cambridge
• Body part recognition: supervised learning
Depth camera
• Kinect
[Figure: grayscale depth map and infrared image from the Kinect depth camera]
Kinect Body Part Recognition
• Problem: label each pixel with a body part
Kinect Body Part Recognition
• Features: depth differences between pairs of pixels
• Classification: forest of decision trees
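The slide does not spell the features out; the sketch below follows the published Kinect approach (Shotton et al.), where a feature is the depth difference between two probe offsets around a pixel, with offsets scaled by the pixel's depth so the feature is roughly depth-invariant. The function name and the out-of-bounds convention are assumptions:

    import numpy as np

    def depth_difference_feature(depth, pixel, u, v):
        """Depth difference between two offsets u, v around `pixel`.
        Offset scaling by 1/depth follows Shotton et al.; it is not
        stated on the slide."""
        y, x = pixel
        d = depth[y, x]
        def probe(offset):
            oy, ox = int(y + offset[0] / d), int(x + offset[1] / d)
            if 0 <= oy < depth.shape[0] and 0 <= ox < depth.shape[1]:
                return depth[oy, ox]
            return np.inf  # out-of-bounds probes treated as far background
        return probe(u) - probe(v)

A decision tree in the forest thresholds one such feature at each internal node, so classifying a pixel only requires a handful of depth lookups.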
Large Scale Machine Learning
• Big data
  – Large number of data instances
  – Large number of features
• Solution: distribute computation (parallel computation)
  – GPU (Graphics Processing Unit)
  – Many cores
GPU computation
• Many machine learning algorithms consist of vector, matrix and tensor operations
  – A tensor is a multidimensional array
• GPUs (Graphics Processing Units) can perform arithmetic operations on all elements of a tensor in parallel
• Packages that facilitate ML programming on GPUs: Keras, PyTorch, TensorFlow, MXNet, Theano, Caffe, DL4J
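As a small illustration with PyTorch (one of the listed packages), the tensor operations below run element-parallel on the GPU when one is available; the CPU fallback just keeps the sketch runnable anywhere:

    import torch

    # Use the GPU when one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    A = torch.randn(1000, 1000, device=device)
    B = torch.randn(1000, 1000, device=device)

    C = A @ B          # matrix product, computed in parallel on the device
    D = torch.relu(C)  # elementwise op, parallel across the whole tensor

    print(D.device, D.shape)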
Multicore Computation
• Idea: train a different classifier/predictor with a subset of the data on each core
• How can we combine the classifiers/predictors?
• Should we take the average of the parameters of the classifiers/predictors?
  No, this might lead to a worse classifier/predictor. It is especially problematic for models with hidden variables/units, such as neural networks and hidden Markov models.
Bad case of parameter averaging
• Consider two threshold neural networks that encode the exclusive-or (XOR) Boolean function
• Averaging their weights yields a new neural network that does not encode exclusive-or
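A concrete sketch of this failure: both networks below compute XOR with two hidden threshold units (the second is the first with its hidden units swapped), yet their parameter average outputs 0 on every input. The specific weights are illustrative, not taken from the slides:

    import numpy as np

    step = lambda z: (z > 0).astype(int)  # threshold activation

    def net(x, W, b, w_out, b_out):
        """One hidden layer of threshold units, then a threshold output."""
        h = step(W @ x + b)
        return step(w_out @ h + b_out)

    # Network A: h1 = OR(x1,x2), h2 = AND(x1,x2), output = OR-and-not-AND = XOR
    A = (np.array([[1., 1.], [1., 1.]]), np.array([-0.5, -1.5]), np.array([1., -1.]), -0.5)
    # Network B: the same function with the two hidden units swapped
    B = (np.array([[1., 1.], [1., 1.]]), np.array([-1.5, -0.5]), np.array([-1., 1.]), -0.5)
    # Parameter average of A and B
    avg = tuple((pa + pb) / 2 for pa, pb in zip(A, B))

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        x = np.array(x, dtype=float)
        print(x, net(x, *A), net(x, *B), net(x, *avg))
    # A and B both print XOR; the averaged network's output weights cancel
    # to zero, so it prints 0 on every input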
Safely Combining Predictions
• A safe approach to ensemble learning is to combine the predictions (not the parameters)
• Classification: majority vote over the classes predicted by the classifiers
• Regression: average of the predictions computed by the regressors
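A minimal sketch of both combination rules, assuming each classifier/regressor (e.g., one per core) has already produced its predictions; integer class labels are assumed for the vote:

    import numpy as np

    # Predictions from 3 classifiers on 5 examples (integer class labels)
    class_preds = np.array([[0, 1, 2, 1, 0],
                            [0, 1, 1, 1, 0],
                            [2, 1, 2, 0, 0]])
    majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, class_preds)
    print(majority)  # [0 1 2 1 0]

    # Predictions from 3 regressors on the same examples
    reg_preds = np.array([[1.0, 2.0, 0.5, 3.0, 1.5],
                          [1.2, 1.8, 0.7, 2.8, 1.3],
                          [0.8, 2.2, 0.6, 3.2, 1.7]])
    print(reg_preds.mean(axis=0))  # [1.  2.  0.6 3.  1.5]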
Other UW Courses Related to ML
• CS486/686: Artificial Intelligence
• CS475/675: Computational Linear Algebra
• CS485/685: Theoretical Foundations of ML (Shai Ben-David)
• CS794: Optimization for Data Science
• CS795: Fundamentals of Optimization
• CS870: Biologically Plausible Neural Networks (Jeff Orchard)
• CS898: Deep Learning and its Applications (Ming Li)
• CS885: Reinforcement Learning (Pascal Poupart)
• STAT440/840: Computational Inference
• STAT441/841: Statistical Learning – Classification
• STAT442/890: Data Visualization
• STAT444/844: Statistical Learning – Regression
• STAT450/850: Estimation and Hypothesis Testing
Data Science at UW
• https://uwaterloo.ca/data-science/
• Intersection of AI, machine learning, data systems, statistics and optimization
• Bachelor in Data Science
• Master in Data Science (and Artificial Intelligence)
  – Course-based option
  – Thesis-based option