SLIDE 1 CIVIL-557
Decision-aid methodologies in transportation
Transport and Mobility Laboratory TRANSP-OR École Polytechnique Fédérale de Lausanne EPFL
Tim Hillel
Lecture 3: Logistic regression and probabilistic metrics
SLIDE 2
Case study
SLIDE 3
Mode choice
SLIDE 4
Last week
Data science process
Dataset
Deterministic methods
– K-Nearest Neighbours (KNN)
– Decision Tree (DT)
Discrete metrics
SLIDE 5
Today
Theory of probabilistic classification
Probabilistic metrics
Probabilistic classifiers
– Logistic regression
SLIDE 6
But first…
…a bit more feature processing
SLIDE 7
Feature processing
Last week – scaling
– Crucial when algorithms consider distance
– Standard scaling – zero-mean, unit-variance (see the sketch below)
What about missing values? And categorical data?
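As a quick recap from last week, a minimal sketch of standard scaling with scikit-learn; the feature matrices are hypothetical, and the scaler is fitted on the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices: rows = trips, columns = numerical features
X_train = np.array([[1.2, 30.0], [0.4, 12.0], [3.5, 55.0]])
X_test = np.array([[2.0, 25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # zero-mean, unit-variance per feature
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/variance
```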
SLIDE 8
Missing values
SLIDE 9
Missing values
Possible solutions?
Remove rows (instances)
Remove columns (features)
Assume default value
– Zero
– Mean
– Random value
– Other?
SLIDE 10
Missing values
Removing rows – can introduce sampling bias!
Removing columns – reduces available features
Default value – needs to make sense (a sketch of these options follows)
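A minimal sketch of these options with pandas, assuming a hypothetical trip table with one missing distance value; the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical trip data with one missing distance value
df = pd.DataFrame({"distance_km": [1.2, np.nan, 3.5],
                   "duration_min": [15.0, 12.0, 40.0]})

df_drop_rows = df.dropna()        # remove instances (rows) with missing values
df_drop_cols = df.dropna(axis=1)  # remove features (columns) with missing values
df_zero = df.fillna(0)            # default value: zero
df_mean = df.fillna(df.mean())    # default value: per-column mean
```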
SLIDE 11
Categorical variables
SLIDE 12
Categorical data
All data must be numerical – possible solutions?
Numerical encoding
Binary encoding
Remove feature?
SLIDE 13
Numerical encoding
Colour  Code
Blue    1
Silver  2
Red     3

Implies order: Blue < Silver < Red
and distance: Red – Silver = Silver – Blue
SLIDE 14
Binary encoding
Car  Colour  Colour:Blue  Colour:Silver  Colour:Red
1    Blue    1            0              0
2    Silver  0            1              0
3    Red     0            0              1
4    …       …            …              …
A lot more features!
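A minimal sketch of this binary (one-hot) encoding with pandas, assuming a hypothetical colour column; pd.get_dummies creates one 0/1 column per category.

```python
import pandas as pd

# Hypothetical categorical feature
cars = pd.DataFrame({"colour": ["blue", "silver", "red", "blue"]})

# One 0/1 column per category: K categories become K new features
encoded = pd.get_dummies(cars, columns=["colour"], dtype=int)
```

With many categories this quickly inflates the feature count, which is the trade-off noted above.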
SLIDE 15
Validation schemes
Train-test split: the test set must be unseen data
How to sample the test set? How to test model hyperparameters?
[Diagram: dataset split into Train and Test sets]
SLIDE 16
Sampling test set
Previously used random sampling
– Assumes data is independent
Not always a good assumption
– Sampling bias/recording errors
– Hierarchical data
Instead, use external validation
– Validate the model on data sampled separately from the training data (e.g. a separate year)
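A minimal sketch of external validation, assuming a hypothetical year column on the trip records; one year is held out as the test set.

```python
import pandas as pd

# Hypothetical trip records collected over two years
df = pd.DataFrame({"distance_km": [1.2, 0.4, 3.5, 2.0],
                   "mode": ["walk", "cycle", "drive", "pt"],
                   "year": [2015, 2015, 2016, 2016]})

train = df[df["year"] < 2016]   # train on the earlier year
test = df[df["year"] == 2016]   # validate on separately sampled data
```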
SLIDE 17
Testing model hyperparameters
Can only be performed on the training data
Possible solution – split again: train-validate-test (see the sketch below)
[Diagram: dataset split into Train, Validate, and Test sets]
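A minimal sketch of a train-validate-test split with scikit-learn, using hypothetical arrays; the 60/20/20 fractions are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 trips, 5 features, 4 mode classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 4, size=100)

# First hold out the test set (20%) ...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# ... then split the remainder into train (60%) and validate (20%)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
# Tune hyperparameters on (X_val, y_val); (X_test, y_test) stays unseen
```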
SLIDE 18
Decision trees
[Diagram: decision tree – if trip < 1 km: Walk; else if owns car: Drive; else: Bus]
SLIDE 19
DT Metrics
Gini impurity:
$H(q) = \sum_{j=1}^{K} q_j (1 - q_j) = 1 - \sum_{j=1}^{K} q_j^2$
where $q_j$ is the proportion of class $j$
Entropy:
$I(q) = -\sum_{j=1}^{K} q_j \log_2 q_j$
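A minimal sketch of both node-impurity measures in numpy; the class proportions are illustrative.

```python
import numpy as np

def gini(q):
    """Gini impurity: 1 - sum_j q_j^2."""
    q = np.asarray(q)
    return 1.0 - np.sum(q ** 2)

def entropy(q):
    """Entropy: -sum_j q_j * log2(q_j), skipping empty classes to avoid log2(0)."""
    q = np.asarray(q)
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

q = [0.5, 0.25, 0.25]       # illustrative class proportions at a node
print(gini(q), entropy(q))  # 0.625 1.5
```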
SLIDE 20
Decision trees
Remember: decision trees only see the ranking of feature values…
…therefore scaling does not need to be applied!
SLIDE 21
Lab
Notebook 1: Categorical features
SLIDE 22
Probabilistic classification
Previously considered deterministic classifiers
– Classifier predicts discrete class $\hat{z}$
Now consider probabilistic classification
– Classifier predicts continuous probability for each class
SLIDE 23
Notation
Dataset $E$ contains $O$ elements and $K$ classes
Each element $o$ is associated with:
– Feature vector $y_o$ of $l$ real features: $y_o \in \mathbb{R}^l$
– Set of class indicators $z_o \in \{0,1\}^K$, with $\sum_{j=1}^{K} z_{jo} = 1$
Selected class (ground truth):
– $j_o = \sum_{j=1}^{K} j \, z_{jo}$
SLIDE 24
Notation continued
Classifier $Q$ maps feature vector $y_o$ to a probability distribution across the $K$ classes
– $Q: \mathbb{R}^l \to [0,1]^K$
Probability for each class $j$:
– $Q(j|y_o) > 0$
– $\sum_{j=1}^{K} Q(j|y_o) = 1$
Therefore, the probability for the selected (ground-truth) class is:
– $Q(j_o|y_o)$
SLIDE 25
Example
Four modes (walk, cycle, pt, drive) – $K = 4$
For a pt trip:
– $z_o = (0, 0, 1, 0)$
– $j_o = 3$
Predicted probabilities for an arbitrary classifier $Q$:
– $Q(y_o) = (0.1, 0.2, 0.4, 0.3)$
– $Q(j_o|y_o) = 0.4$
SLIDE 26
Likelihood

o  j_o  Q(1|y_o)  Q(2|y_o)  Q(3|y_o)  Q(4|y_o)  Q(j_o|y_o)
1  2    0.11      0.84      0.03      0.02      0.84
2  1    0.82      0.04      0.06      0.08      0.82
3  4    0.11      0.18      0.05      0.66      0.66
4  1    0.57      0.22      0.12      0.09      0.57
5  3    0.10      0.03      0.75      0.12      0.75
Likelihood:
$\prod_{o=1}^{O} Q(j_o|y_o) = 0.84 \times 0.82 \times 0.66 \times 0.57 \times 0.75 = 0.1943$
SLIDE 27
Likelihood

o  j_o  Q(1|y_o)  Q(2|y_o)  Q(3|y_o)  Q(4|y_o)  Q(j_o|y_o)
1  4    0.11      0.84      0.03      0.02      0.02
2  1    0.82      0.04      0.06      0.08      0.82
3  4    0.11      0.18      0.05      0.66      0.66
4  1    0.57      0.22      0.12      0.09      0.57
5  3    0.10      0.03      0.75      0.12      0.75
Likelihood:
$\prod_{o=1}^{O} Q(j_o|y_o) = 0.02 \times 0.82 \times 0.66 \times 0.57 \times 0.75 = 0.0046$
SLIDE 28
Log-likelihood
Likelihood bound between 0 and 1
Tends to 0 as $O$ increases
– Computational issues with small numbers
Use log-likelihood:
$\ln\left(\prod_{o=1}^{O} Q(j_o|y_o)\right) = \sum_{o=1}^{O} \ln Q(j_o|y_o)$
SLIDE 29
Log-likelihood

o  j_o  Q(1|y_o)  Q(2|y_o)  Q(3|y_o)  Q(4|y_o)  Q(j_o|y_o)
1  2    0.11      0.84      0.03      0.02      0.84
2  1    0.82      0.04      0.06      0.08      0.82
3  4    0.11      0.18      0.05      0.66      0.66
4  1    0.57      0.22      0.12      0.09      0.57
5  3    0.10      0.03      0.75      0.12      0.75
Log-likelihood:
$\sum_{o=1}^{O} \ln Q(j_o|y_o) = \ln(0.84 \times 0.82 \times 0.66 \times 0.57 \times 0.75) = -1.638$
SLIDE 30
Log-likelihood

o  j_o  Q(1|y_o)  Q(2|y_o)  Q(3|y_o)  Q(4|y_o)  Q(j_o|y_o)
1  4    0.11      0.84      0.03      0.02      0.02
2  1    0.82      0.04      0.06      0.08      0.82
3  4    0.11      0.18      0.05      0.66      0.66
4  1    0.57      0.22      0.12      0.09      0.57
5  3    0.10      0.03      0.75      0.12      0.75
Log-likelihood:
$\sum_{o=1}^{O} \ln Q(j_o|y_o) = \ln(0.02 \times 0.82 \times 0.66 \times 0.57 \times 0.75) = -5.376$
SLIDE 31
Cross-entropy loss
Log-likelihood bound between $-\infty$ and 0
– Tends to $-\infty$ as $O$ increases
Can't compare between different dataset sizes
– Normalise (divide by $O$) to get the cross-entropy loss (CEL) $I$
– Typically take the negative:
$M = -\frac{1}{O} \sum_{o=1}^{O} \ln Q(j_o|y_o)$
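A minimal sketch reproducing the numbers above in numpy, using the ground-truth probabilities Q(j_o|y_o) from the first table.

```python
import numpy as np

# Q(j_o|y_o) for the five observations in the first table
q_true = np.array([0.84, 0.82, 0.66, 0.57, 0.75])

likelihood = np.prod(q_true)             # 0.1943
log_likelihood = np.sum(np.log(q_true))  # -1.638
cel = -np.mean(np.log(q_true))           # cross-entropy loss M: 0.328
```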
SLIDE 32
Cross-entropy loss
Can also be derived from Shannon's cross-entropy $I(q, r)$ (hence the name)
Minimising cross-entropy loss is equivalent to maximising the likelihood of the data under the model
– Maximum likelihood estimation (MLE)
Other metrics exist, e.g. the Brier score (MSE), but CEL is the gold standard
SLIDE 33
Probabilistic classifiers
Need a model which generates a multiclass probability distribution from the feature vector $y_o$
Lots of possibilities!
Today: logistic regression
SLIDE 34
Linear regression
Logistic regression is an extension of linear regression
– Often called linear models
$g(y) = \sum_{l=1}^{L} \gamma_l y_l + \gamma_p$
Find parameters ($\gamma$) using gradient descent
– Typically minimise MSE
SLIDE 35
Logistic regression
Pass linear regression through a function $\tau$:
$g(y) = \tau\left(\sum_{l=1}^{L} \gamma_l y_l + \gamma_p\right)$
$\tau$ needs to take a real value and return a value in $[0,1]$:
– $\tau(V) \in [0,1] \;\forall\; V \in \mathbb{R}$
SLIDE 36
Binary case – sigmoid function
$\tau(V) = \frac{1}{1 + e^{-V}}$
SLIDE 37
Multinomial case – softmax function
$\tau(V)_j = \frac{e^{V_j}}{\sum_{k=1}^{K} e^{V_k}}$
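A minimal sketch of both functions in numpy; subtracting the maximum inside the softmax is an extra numerical-stability trick, not part of the definition above.

```python
import numpy as np

def sigmoid(v):
    """Binary case: maps a real value to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    """Multinomial case: maps K utilities to K probabilities that sum to 1."""
    e = np.exp(v - np.max(v))  # shift by the max for numerical stability
    return e / e.sum()

v = np.array([0.5, 1.5, -0.3, 0.1])  # illustrative utilities for K = 4 classes
print(softmax(v))                    # four probabilities summing to 1
```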
SLIDE 38
Logistic regression
Separate set of parameters ($\gamma$) for each class $j$
– Normalised to zero for one class
Separate value $V$ for each class
– Utilities in Discrete Choice Models
Minimise cross-entropy loss (MLE)
Solve using gradient descent for the parameters:
$\gamma^* = \arg\min_{\gamma} M(\gamma)$
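A minimal sketch of fitting a multinomial logistic regression with scikit-learn on hypothetical data; scikit-learn minimises the (regularised) cross-entropy loss internally rather than exposing the gradient-descent loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Hypothetical data: 100 trips, 5 features, 4 mode classes
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 4, size=100)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_train)  # Q(j|y_o) for every class j
cel = log_loss(y_train, probs)      # cross-entropy loss M on the training data
```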
SLIDE 39
Issues
Overconfidence
– Linearly separable data – parameters tend to infinity (unrealistic probability estimates)
Outliers
– Outliers can have a disproportionate effect on parameters (high variance)
Multicollinearity
– If features are highly correlated, can cause numerical instability
SLIDE 40
Solution: Regularisation
Penalise the model for large parameter values
– Reduces variance (and therefore overfitting)
Regularisation:
$\gamma^* = \arg\min_{\gamma} \{M(\gamma) + D \cdot g(\gamma)\}$
Two candidates for $g$:
– L1 regularisation (LASSO): $\sum_j |\gamma_j|$ (similar to Manhattan distance)
– L2 regularisation (Ridge): $\sum_j \gamma_j^2$ (similar to Euclidean distance)
SLIDE 41
Regularisation
L1 regularisation shrinks less important parameters to zero
– Feature selection?
L2 regularisation penalises larger parameters more
SLIDE 42
Regularisation
$D$ and the choice of regularisation (L1/L2) are hyperparameters
Larger $D$ – more regularisation (higher bias, lower variance); see the sketch below
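A minimal sketch of setting these hyperparameters in scikit-learn, reusing the hypothetical data from the previous sketch; note that scikit-learn parametrises strength as C = 1/D, so a larger D corresponds to a smaller C.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data, as in the earlier sketch
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 4, size=100)

# L2 (Ridge); C is the *inverse* regularisation strength, C = 1/D
clf_l2 = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X_train, y_train)

# L1 (LASSO) requires a solver that supports it, e.g. "saga"
clf_l1 = LogisticRegression(penalty="l1", C=0.1, solver="saga",
                            max_iter=1000).fit(X_train, y_train)
# L1 drives less important coefficients to exactly zero (feature selection)
```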
SLIDE 43
Logistic regression
[Diagram: logistic regression as a network – 5 input features $y$, one linear utility per class, softmax output over 3 classes]
SLIDE 44
Homework
Notebook 2: Logistic regression