Machine Learning 2: Nonlinear Regression
Stefano Ermon
April 13, 2016
Non-linear regression
[Figure: high temperature (F) vs. peak hourly demand (GW) observations for all days in 2008-2011]
Central idea of non-linear regression: same as linear regression, just with non-linear features.
E.g. φ(x_i) = [x_i^2, x_i, 1]^T
Two ways to construct non-linear features: explicitly (construct the actual feature vector), or implicitly (using kernels).
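For concreteness, a minimal sketch of the explicit route in MATLAB/Octave, assuming column vectors x (high temperatures) and y (peak demands); the variable names are illustrative, not from the slides. The fit itself is ordinary least squares, exactly as in linear regression; only the feature matrix changes.
    % build degree-2 polynomial features and fit by least squares (assumed: x, y are m-by-1)
    Phi = [x.^2, x, ones(size(x))];     % row i is phi(x_i) = [x_i^2, x_i, 1]
    theta = (Phi' * Phi) \ (Phi' * y);  % same normal equations as linear regression
    y_hat = Phi * theta;                % predictions are still linear in theta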
[Figure: degree-2 polynomial fit to the observed data (high temperature, F vs. peak hourly demand, GW)]
[Figure: degree-3 polynomial fit to the observed data]
[Figure: degree-4 polynomial fit to the observed data]
Constructing explicit feature vectors
Polynomial features (max degree d)
Special case, n = 1: φ(z) = [z^d, z^{d-1}, ..., z, 1]^T ∈ R^{d+1}
General case: φ(z) = ( ∏_{i=1}^n z_i^{b_i} : b_i ≥ 0, ∑_{i=1}^n b_i ≤ d ) ∈ R^{C(n+d, d)}
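A minimal sketch of the n = 1 special case, assuming a column vector z of scalar inputs and a chosen max degree d (illustrative names):
    % polynomial features for scalar inputs (assumed: z is m-by-1)
    d = 4;
    Phi = zeros(length(z), d+1);
    for j = 0:d
        Phi(:, d+1-j) = z.^j;   % columns ordered [z^d, z^(d-1), ..., z, 1]
    end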
[Figure: plot of the polynomial bases 1, x, x^2, x^3 over x ∈ [−1, 1]]
Radial basis function (RBF) features
Defined by a bandwidth σ and k RBF centers μ_j ∈ R^n, j = 1, ..., k:
    φ_j(z) = exp( −‖z − μ_j‖^2 / (2σ^2) )
[Figure: RBF feature values vs. input for several centers spread over [0, 1]]
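A sketch of the RBF feature map, assuming an input matrix Z (one example per row), a matrix mu of centers (one per row), and a scalar bandwidth sigma; all names are illustrative:
    % RBF features (assumed: Z is m-by-n, mu is k-by-n, sigma a scalar)
    [m, n] = size(Z);
    k = size(mu, 1);
    Phi = zeros(m, k);
    for j = 1:k
        D = Z - repmat(mu(j,:), m, 1);               % z_i - mu_j for every example
        Phi(:, j) = exp(-sum(D.^2, 2) / (2*sigma^2));
    end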
Difficulties with non-linear features
Problem #1: Computational difficulties
Polynomial features: k = C(n+d, d) = O(d^n)
RBF features: suppose we want centers on a uniform grid over the input space (with d centers along each dimension): k = d^n
In both cases, the number of features is exponential in the input dimension n; quickly intractable even to store in memory
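To make the growth concrete (a quick numeric check, not from the slides): with n = 10 input dimensions and degree d = 5, the polynomial feature count is already 3003, and a uniform grid of RBF centers needs 5^10 ≈ 9.8 million centers.
    % feature counts for n = 10, d = 5 (illustrative numbers)
    n = 10; d = 5;
    k_poly = nchoosek(n + d, d)   % 3003 polynomial features
    k_rbf  = d^n                  % 9765625 grid RBF centers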
Problem #2: Representational difficulties
With many features, our prediction function becomes very expressive
Can lead to overfitting (low error on the input data points, but high error nearby)
Let's see an intuitive example
[Figure: least-squares fits with polynomial features of degree d = 1, 2, 4, 50 (high temperature, F vs. peak hourly demand, GW)]
[Figure: least-squares fits with 2, 4, 10, and 50 RBFs (λ = 0)]
A few ways to deal with the representational problem:
Choose a less expressive function (e.g., lower-degree polynomial, fewer RBF centers, larger RBF bandwidth)
Regularization: penalize large parameters θ
    minimize_θ  ∑_{i=1}^m ℓ(ŷ_i, y_i) + λ‖θ‖_2^2
λ: regularization parameter, trades off between low loss and small values of θ (often, we don't regularize the constant term)
[Figure: Pareto optimal surface, J(θ) vs. ‖θ‖_2, for 20 RBF functions]
[Figure: fits with 50 RBFs, varying the regularization parameter λ = 0, 2, 50, 1000 (not regularizing the constant term)]
Regularization: penalize large parameters θ
    minimize_θ  ∑_{i=1}^m ℓ(ŷ_i, y_i) + λ‖θ‖_2^2
λ: regularization parameter, trades off between low loss and small values of θ (often, we don't regularize the constant term)
Solve with the normal equations as before:
    minimize_θ  ‖Φθ − y‖_2^2 + λθ^T θ
    minimize_θ  θ^T Φ^T Φ θ − 2 y^T Φ θ + y^T y + λθ^T θ
    minimize_θ  θ^T (Φ^T Φ + λI) θ − 2 y^T Φ θ + y^T y
Setting the gradient to zero:
    θ⋆ = (Φ^T Φ + λI)^{-1} Φ^T y
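A sketch of the regularized solve, assuming Phi and y as before and that the constant feature is the last column of Phi, whose diagonal entry of the regularizer is zeroed to match the "don't regularize the constant term" note; the variable names and the value of λ are illustrative.
    % ridge solution via the regularized normal equations (assumed: Phi m-by-k, y m-by-1)
    lambda = 2;
    R = lambda * eye(size(Phi, 2));
    R(end, end) = 0;                        % leave the constant term unregularized
    theta = (Phi' * Phi + R) \ (Phi' * y);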
Evaluating algorithms
How do we determine when an algorithm achieves "good" performance?
How should we tune the parameters of the learning algorithm (regularization parameter, choice of features, etc.)?
How do we report the performance of learning algorithms?
One possibility: just look at the loss function
    J(θ) = ∑_{i=1}^m ℓ(θ^T φ(x_i), y_i)
The problem: adding more features will always decrease the loss
Example: with random outputs and random features, we can get zero loss given enough features
    m = 500;
    y = randn(m,1);
    Phi = randn(m,m);
    theta = (Phi' * Phi) \ (Phi' * y);
    norm(Phi*theta - y)^2
    ans = 2.3722e-22
A better criterion: training and testing loss
Training set: x_i ∈ R^n, y_i ∈ R, i = 1, ..., m
Testing set: x'_i ∈ R^n, y'_i ∈ R, i = 1, ..., m'
Find parameters by minimizing the loss on the training set, but evaluate on the testing set
Training:   θ⋆ = argmin_θ ∑_{i=1}^m ℓ(θ^T φ(x_i), y_i)
Evaluation: Average Loss = (1/m') ∑_{i=1}^{m'} ℓ((θ⋆)^T φ(x'_i), y'_i)
Performance on the test set is called generalization performance.
Sometimes there is a natural breakdown between training and testing data (e.g., train the system on one year, test on the next)
More commonly, simply divide the data: for example, use 70% for training, 30% for testing
    % Phi, y, m are all the data
    m_train = ceil(0.7*m);
    m_test = m - m_train;
    p = randperm(m);
    Phi_train = Phi(p(1:m_train),:);
    y_train = y(p(1:m_train));
    Phi_test = Phi(p(m_train+1:end),:);
    y_test = y(p(m_train+1:end));
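Continuing the sketch with the split above (assumes the variables defined in that snippet), fit on the training portion and report the average squared loss on both portions:
    % train on the training split, evaluate generalization on the test split
    theta = (Phi_train' * Phi_train) \ (Phi_train' * y_train);
    train_loss = mean((Phi_train * theta - y_train).^2);
    test_loss  = mean((Phi_test * theta - y_test).^2);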
[Figure: high temperature (F) vs. peak hourly demand (GW) observations]
[Figure: training and testing average squared loss vs. polynomial degree d]
[Figure: training and testing average squared loss (log scale) vs. polynomial degree d]
[Figure: training and testing average squared loss vs. number of RBF bases]
[Figure: training and testing average squared loss (log scale) vs. number of RBF bases]