CIVIL-557 Decision-aid methodologies in transportation
Lecture 3: Logistic regression and probabilistic metrics


SLIDE 1

CIVIL-557

Decision-aid methodologies in transportation

Transport and Mobility Laboratory (TRANSP-OR), École Polytechnique Fédérale de Lausanne (EPFL)

Tim Hillel

Lecture 3: Logistic regression and probabilistic metrics

SLIDE 2

Case study

SLIDE 3

Mode choice

SLIDE 4

Last week

Data science process
Dataset
Deterministic methods
– K-Nearest Neighbours (KNN)
– Decision Tree (DT)
Discrete metrics

SLIDE 5

Today

Theory of probabilistic classification
Probabilistic metrics
Probabilistic classifiers
– Logistic regression

SLIDE 6

But first…

…a bit more feature processing

SLIDE 7

Feature processing

Last week: scaling
– Crucial when algorithms consider distance
– Standard scaling: zero-mean, unit-variance
What about missing values? And categorical data?
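As a quick reminder of last week's scaling step, a minimal sketch assuming scikit-learn and a small pandas DataFrame; the column names are purely illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative numeric features (names made up for this example)
df = pd.DataFrame({"distance_km": [0.4, 2.5, 12.0, 5.3],
                   "age": [19, 44, 31, 67]})

# Standard scaling: subtract each column's mean and divide by its standard deviation
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_scaled.mean().round(2))       # ~0 for every column
print(df_scaled.std(ddof=0).round(2))  # 1 for every column (population std)
```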

SLIDE 8

Missing values

SLIDE 9

Missing values

Possible solutions?
Remove rows (instances)
Remove columns (features)
Assume default value
– Zero
– Mean
– Random value
– Other?

SLIDE 10

Missing values

Removing rows – can introduce sampling bias!
Removing columns – reduces available features
Default value – needs to make sense
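A minimal sketch of these three options, assuming pandas and scikit-learn's SimpleImputer; the DataFrame below is made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [32000, np.nan, 54000, 41000],
                   "n_cars": [1, 2, np.nan, 0]})

# Option 1: remove rows (instances) with missing values – risks sampling bias
df_drop_rows = df.dropna(axis=0)

# Option 2: remove columns (features) with missing values – loses information
df_drop_cols = df.dropna(axis=1)

# Option 3: fill with a default value, e.g. the column mean
imputer = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```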

SLIDE 11

Categorical variables

SLIDE 12

Categorical data

All data must be numerical – possible solutions?
Numerical encoding
Binary encoding
Remove feature?

SLIDE 13

Numerical encoding

Implies an order: Blue < Silver < Red, and a distance: Red − Silver = Silver − Blue

Blue   Silver   Red
1      2        3

SLIDE 14

Binary encoding

Car   Colour:Blue   Colour:Silver   Colour:Red
1     1
2                   1
3                                   1
4     1
…     …             …               …

A lot more features!
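A minimal sketch of binary (one-hot) encoding for the colour example, using pandas' `get_dummies` (scikit-learn's `OneHotEncoder` is an equivalent alternative):

```python
import pandas as pd

cars = pd.DataFrame({"colour": ["Blue", "Silver", "Red", "Blue"]})

# One 0/1 column per category instead of a single ordered numeric code
one_hot = pd.get_dummies(cars, columns=["colour"], prefix="colour", dtype=int)
print(one_hot)
# Each row has exactly one 1, and the feature count grows with the number of categories
```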

SLIDE 15

Dataset

Validation schemes

Train-test split: the test set must be unseen data
How to sample the test set?
How to test model hyperparameters?

[Diagram: dataset split into a Train portion and a Test portion]

SLIDE 16

Sampling test set

Previously used random sampling
Assumes data is independent
Not always a good assumption
– Sampling bias/recording errors
– Hierarchical data
Instead, use external validation
Validate the model on data sampled separately from the training data (e.g. a separate year)

SLIDE 17

Testing model hyperparameters

[Diagram: dataset split into Train, Validate, and Test portions]

Can only be performed on the training data
Possible solution – split again: train-validate-test
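A minimal sketch of a train-validate-test split with scikit-learn; the data and split proportions below are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 instances, 4 features, 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

# Hold out the test set first (here 20%); it is only touched for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the remaining data again to obtain a validation set for hyperparameter tuning
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
```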

SLIDE 18

Decision trees

[Decision tree diagram: Trip < 1 km? Yes → Walk; No → Owns car? Yes → Drive; No → Bus]

SLIDE 19

DT Metrics

Gini impurity: $H(q) = \sum_{j=1}^{K} q_j (1 - q_j) = 1 - \sum_{j=1}^{K} q_j^2$, where $q_j$ is the proportion of class $j$

Entropy: $I(q) = -\sum_{j=1}^{K} q_j \log_2 q_j$
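A small numpy check of both impurity measures for a made-up vector of class proportions $q$:

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])  # class proportions at a node (sum to 1)

gini = 1 - np.sum(q ** 2)           # 1 - sum_j q_j^2
entropy = -np.sum(q * np.log2(q))   # -sum_j q_j * log2(q_j)

print(round(gini, 3), round(entropy, 3))  # 0.62 1.485
```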

SLIDE 20

Decision trees

Remember: decision trees only see the ranking of feature values…
…therefore scaling does not need to be applied!

SLIDE 21

Lab

Notebook 1: Categorical features

SLIDE 22

Probabilistic classification

Previously considered deterministic classifiers
– Classifier predicts a discrete class $\hat{z}$
Now consider probabilistic classification
– Classifier predicts a continuous probability for each class

SLIDE 23

Notation

Dataset $E$ contains $O$ elements and $K$ classes
Each element $o$ is associated with:
– Feature vector $y_o$ of $l$ real features: $y_o \in \mathbb{R}^l$
– Set of class indicators $z_o \in \{0,1\}^K$: $\sum_{j=1}^{K} z_{jo} = 1$
Selected class (ground truth): $j_o = \sum_{j=1}^{K} j \, z_{jo}$

SLIDE 24

Notation continued

Classifier $Q$ maps feature vector $y_o$ to a probability distribution across the $K$ classes
– $Q: \mathbb{R}^l \to [0,1]^K$
Probability for each class $j$:
– $Q(j|y_o) > 0$
– $\sum_{j=1}^{K} Q(j|y_o) = 1$
Therefore, the probability for the selected (ground-truth) class is $Q(j_o|y_o)$

SLIDE 25

Example

Four modes (walk, cycle, pt, drive): $K = 4$
For a pt trip:
– $z_o = (0, 0, 1, 0)$
– $j_o = 3$
Predicted probabilities for an arbitrary classifier $Q$:
– $Q(y_o) = (0.1, 0.2, 0.4, 0.3)$
– $Q(j_o|y_o) = 0.4$

SLIDE 26

Likelihood

o   j_o   Q(1|y_o)   Q(2|y_o)   Q(3|y_o)   Q(4|y_o)   Q(j_o|y_o)
1   2     0.11       0.84       0.03       0.02       0.84
2   1     0.82       0.04       0.06       0.08       0.82
3   4     0.11       0.18       0.05       0.66       0.66
4   1     0.57       0.22       0.12       0.09       0.57
5   3     0.10       0.03       0.75       0.12       0.75

Likelihood $= \prod_{o=1}^{O} Q(j_o|y_o) = 0.84 \times 0.82 \times 0.66 \times 0.57 \times 0.75 \approx 0.1943$
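A small numpy check of this computation, using only the $Q(j_o|y_o)$ column of the table:

```python
import numpy as np

# Probability assigned to the ground-truth class of each of the 5 instances
q_chosen = np.array([0.84, 0.82, 0.66, 0.57, 0.75])

likelihood = np.prod(q_chosen)              # ~0.194
log_likelihood = np.sum(np.log(q_chosen))   # ~-1.638 (used on the next slides)
print(likelihood, log_likelihood)
```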

SLIDE 27

Likelihood

o   j_o   Q(1|y_o)   Q(2|y_o)   Q(3|y_o)   Q(4|y_o)   Q(j_o|y_o)
1   4     0.11       0.84       0.03       0.02       0.02
2   1     0.82       0.04       0.06       0.08       0.82
3   4     0.11       0.18       0.05       0.66       0.66
4   1     0.57       0.22       0.12       0.09       0.57
5   3     0.10       0.03       0.75       0.12       0.75

Likelihood $= \prod_{o=1}^{O} Q(j_o|y_o) = 0.02 \times 0.82 \times 0.66 \times 0.57 \times 0.75 \approx 0.0046$

SLIDE 28

Log-likelihood

Likelihood is bounded between 0 and 1
Tends to 0 as $O$ increases
– Computational issues with small numbers
Use the log-likelihood:
$\ln\left(\prod_{o=1}^{O} Q(j_o|y_o)\right) = \sum_{o=1}^{O} \ln Q(j_o|y_o)$

SLIDE 29

Log-likelihood

o   j_o   Q(1|y_o)   Q(2|y_o)   Q(3|y_o)   Q(4|y_o)   Q(j_o|y_o)
1   2     0.11       0.84       0.03       0.02       0.84
2   1     0.82       0.04       0.06       0.08       0.82
3   4     0.11       0.18       0.05       0.66       0.66
4   1     0.57       0.22       0.12       0.09       0.57
5   3     0.10       0.03       0.75       0.12       0.75

Log-likelihood $= \sum_{o=1}^{O} \ln Q(j_o|y_o) = \ln(0.84 \times 0.82 \times 0.66 \times 0.57 \times 0.75) = -1.638$

SLIDE 30

Log-likelihood

o   j_o   Q(1|y_o)   Q(2|y_o)   Q(3|y_o)   Q(4|y_o)   Q(j_o|y_o)
1   4     0.11       0.84       0.03       0.02       0.02
2   1     0.82       0.04       0.06       0.08       0.82
3   4     0.11       0.18       0.05       0.66       0.66
4   1     0.57       0.22       0.12       0.09       0.57
5   3     0.10       0.03       0.75       0.12       0.75

Log-likelihood $= \sum_{o=1}^{O} \ln Q(j_o|y_o) = \ln(0.02 \times 0.82 \times 0.66 \times 0.57 \times 0.75) = -5.376$

SLIDE 31

Cross-entropy loss

Log-likelihood is bounded between $-\infty$ and 0
– Tends to $-\infty$ as $O$ increases
Can't compare between different dataset sizes
– Normalise (divide by $O$) to get the cross-entropy loss (CEL)
– Typically take the negative:
$M = -\frac{1}{O} \sum_{o=1}^{O} \ln Q(j_o|y_o)$
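A hedged sketch of computing the cross-entropy loss on the table used above, once directly from the definition and once with scikit-learn's `log_loss`, which implements the same normalised negative log-likelihood:

```python
import numpy as np
from sklearn.metrics import log_loss

# Predicted probability distributions for 5 instances over 4 classes (from the table)
Q = np.array([[0.11, 0.84, 0.03, 0.02],
              [0.82, 0.04, 0.06, 0.08],
              [0.11, 0.18, 0.05, 0.66],
              [0.57, 0.22, 0.12, 0.09],
              [0.10, 0.03, 0.75, 0.12]])
j = np.array([2, 1, 4, 1, 3])  # ground-truth classes, 1-indexed as in the slides

# Direct definition: M = -(1/O) * sum_o ln Q(j_o | y_o)
cel = -np.mean(np.log(Q[np.arange(len(j)), j - 1]))

# scikit-learn equivalent
cel_sklearn = log_loss(j, Q, labels=[1, 2, 3, 4])
print(cel, cel_sklearn)  # both ~0.328 (i.e. 1.638 / 5)
```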

SLIDE 32

Cross-entropy loss

Can also be derived from Shannon's cross-entropy $I(q, r)$ (hence the name)
Minimising cross-entropy loss is equivalent to maximising the likelihood of the data under the model
– Maximum likelihood estimation (MLE)
Other metrics exist, e.g. the Brier score (MSE), but CEL is the gold standard

SLIDE 33

Probabilistic classifiers

Need a model which generates a multiclass probability distribution from the feature vector $y_o$
Lots of possibilities!
Today: logistic regression

SLIDE 34

Linear regression

Logistic regression is an extension of linear regression
– Often called linear models
$g(y) = \sum_{l=1}^{L} \gamma_l y_l + \gamma_p$
Find the parameters ($\gamma$) using gradient descent
– Typically minimise MSE

SLIDE 35

Logistic regression

Pass linear regression through a function $\tau$:
$g(y) = \tau\left(\sum_{l=1}^{L} \gamma_l y_l + \gamma_p\right)$
$\tau$ needs to take a real value and return a value in $[0,1]$
– $\tau(V) \in [0,1] \;\; \forall \, V \in \mathbb{R}$

SLIDE 36

Binary case – sigmoid function

$\tau(V) = \dfrac{1}{1 + e^{-V}}$
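A one-line numpy sketch of the sigmoid (the symbol $V$ follows the reconstruction of the formula above):

```python
import numpy as np

def sigmoid(V):
    """Map any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-V))

print(sigmoid(0.0), sigmoid(4.0), sigmoid(-4.0))  # 0.5, ~0.982, ~0.018
```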

SLIDE 37

Multinomial case – softmax function

$\tau(V)_j = \dfrac{e^{V_j}}{\sum_{k=1}^{K} e^{V_k}}$
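A corresponding numpy sketch of the softmax; subtracting the maximum is a standard numerical-stability trick, not part of the formula itself:

```python
import numpy as np

def softmax(V):
    """Map a vector of real values to a probability distribution over the classes."""
    e = np.exp(V - np.max(V))  # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 0.5])))  # non-negative values that sum to 1
```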

SLIDE 38

Logistic regression

Separate set of parameters ($\gamma$) for each class $j$
– Normalised to zero for one class
Separate value $V$ for each class
– Utilities in Discrete Choice Models
Minimise cross-entropy loss (MLE)
Solve using gradient descent for the parameters:
– $\gamma^* = \arg\min_{\gamma} M_\gamma$
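A hedged sketch of fitting a multinomial logistic regression with scikit-learn (which minimises the cross-entropy loss internally); the data below is made up, since the lab notebook defines the real features and target:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: 200 instances, 5 features, 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 3, size=200)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

probs = clf.predict_proba(X[:3])   # one probability distribution per instance
print(probs, probs.sum(axis=1))    # each row sums to 1
```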

SLIDE 39

Issues

Overconfidence
– Linearly separable data: parameters tend to infinity (unrealistic probability estimates)
Outliers
– Outliers can have a disproportionate effect on parameters (high variance)
Multicollinearity
– If features are highly correlated, can cause numerical instability

SLIDE 40

Solution: Regularisation

Penalise the model for large parameter values
– Reduces variance (and therefore overfitting)
Regularisation:
$\gamma^* = \arg\min_{\gamma} \left\{ M_\gamma + D \times g(\gamma) \right\}$
Two candidates for $g$:
– L1 regularisation (LASSO): $\sum_j |\gamma_j|$ (similar to Manhattan distance)
– L2 regularisation (Ridge): $\sum_j \gamma_j^2$ (similar to Euclidean distance)

SLIDE 41

Regularisation

L1 regularisation shrinks less important parameters to zero
– Feature selection?
L2 regularisation penalises larger parameters more

SLIDE 42

Regularisation

$D$ and the choice of regularisation (L1/L2) are hyperparameters
Larger $D$ means more regularisation (higher bias, lower variance)
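A sketch of how these hyperparameters appear in scikit-learn's `LogisticRegression`: `penalty` selects L1 or L2, and `C` is the inverse regularisation strength, so a larger $D$ in the notation above corresponds to a smaller `C`:

```python
from sklearn.linear_model import LogisticRegression

# L2 (Ridge-style) regularisation; smaller C means stronger regularisation
clf_l2 = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)

# L1 (LASSO-style) regularisation requires a solver that supports it, e.g. saga
clf_l1 = LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=1000)
```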

SLIDE 43

Logistic regression

[Diagram: logistic regression as a network – feature vector $y$ with 5 features, a value for each of 3 classes, passed through the softmax]

SLIDE 44

Homework

Notebook 2: Logistic regression