Decision-aid methodologies in transportation, Lecture 3: Logistic regression and probabilistic metrics



  1. CIVIL-557 Decision-aid methodologies in transportation Lecture 3: Logistic regression and probabilistic metrics Tim Hillel Transport and Mobility Laboratory TRANSP-OR École Polytechnique Fédérale de Lausanne EPFL

  2. Case study

  3. Mode choice

  4. Last week Data science process Dataset Deterministic methods – K-Nearest Neighbours (KNN) – Decision Tree (DT) Discrete metrics

  5. Today Theory of probabilistic classification Probabilistic metrics Probabilistic classifiers – Logistic regression

  6. But first… …a bit more feature processing

  7. Feature processing Last week – scaling – Crucial when algorithms consider distance – Standard scaling: zero-mean, unit-variance What about missing values? And categorical data?
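A minimal sketch of standard scaling with scikit-learn; the two feature columns are illustrative, not from the case-study dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training features, e.g. trip distance (km) and traveller age.
X_train = np.array([[1.2, 30.0],
                    [0.4, 45.0],
                    [2.1, 20.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit mean/std on training data only
# X_test_scaled = scaler.transform(X_test)       # reuse the same mean/std on unseen data
```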

  8. Missing values

  9. Missing values Possible solutions? Remove rows (instances) Remove columns (features) Assume default value – Zero – Mean – Random value – Other?

  10. Missing values Removing rows – can introduce sampling bias! Removing columns – reduces available features Default value – needs to make sense
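A minimal sketch of the three options with pandas; the columns and values are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"distance_km": [1.2, np.nan, 3.4],
                   "car_ownership": [1, 0, 1]})

dropped_rows = df.dropna()            # remove instances with missing values
dropped_cols = df.dropna(axis=1)      # remove features with missing values
filled = df.fillna(df.mean())         # impute missing values with the column mean
```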

  11. Categorical variables

  12. Categorical data All data must be numerical – possible solutions? Numerical encoding Binary encoding Remove feature?

  13. Numerical encoding Blue → 1, Silver → 2, Red → 3 Implies order: Blue < Silver < Red and distance: Red - Silver = Silver - Blue
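A minimal sketch of numerical encoding with pandas, using the colour example from the slide:

```python
import pandas as pd

colour = pd.Series(["Silver", "Blue", "Red", "Blue"])
mapping = {"Blue": 1, "Silver": 2, "Red": 3}
colour_encoded = colour.map(mapping)
# The encoded values imply Blue < Silver < Red and equal spacing between colours,
# which is meaningless for an unordered category such as colour.
```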

  14. Binary encoding
      Car | Colour | Colour:Blue | Colour:Silver | Colour:Red
       1  | Silver |      0      |       1       |     0
       2  | Blue   |      1      |       0       |     0
       3  | Red    |      0      |       0       |     1
       4  | Blue   |      1      |       0       |     0
       …  | …      |      …      |       …       |     …
      A lot more features!
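A minimal sketch of binary (one-hot) encoding with pandas.get_dummies, using the same colour example:

```python
import pandas as pd

df = pd.DataFrame({"Car": [1, 2, 3, 4],
                   "Colour": ["Silver", "Blue", "Red", "Blue"]})

one_hot = pd.get_dummies(df, columns=["Colour"])
# Produces Colour_Blue, Colour_Silver and Colour_Red indicator columns -
# one new feature per category level.
```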

  15. Validation schemes Train-test split: split the dataset into a train set and a test set The test set must be unseen data How to sample the test set? How to test model hyperparameters?
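A minimal sketch of a random train-test split with scikit-learn; the feature matrix and labels below are toy placeholders, not the case-study data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)          # toy feature matrix: 100 trips, 5 features
y = np.random.randint(0, 4, 100)    # toy mode labels: 4 classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```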

  16. Sampling test set Previously used random sampling Assumes data is independent Not always a good assumption – Sampling bias/recording errors – Hierarchical data Instead, use external validation Validate the model on data sampled separately from the training data (e.g. separate year)

  17. Testing model hyperparameters Can only be performed on the training data Possible solution – split again: train-validate-test (split the training data into train and validate sets, keeping the test set unseen)
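Continuing from the sketch above (so X_train and y_train are assumed from there), a minimal illustration of carving a validation set out of the training data:

```python
from sklearn.model_selection import train_test_split

# Second split: hold out part of the training data for hyperparameter tuning.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

# Fit candidate models on (X_fit, y_fit), compare them on (X_val, y_val),
# and evaluate only the final choice on the untouched test set.
```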

  18. Decision trees
      Trip < 1km?
        Yes → Walk
        No  → Owns car?
                Yes → Drive
                No  → Bus

  19. DT metrics
      GINI impurity: $H(q) = \sum_{j=1}^{K} q_j (1 - q_j) = 1 - \sum_{j=1}^{K} q_j^2$
      Entropy: $I(q) = -\sum_{j=1}^{K} q_j \log_2 q_j$
      where $q_j$ is the proportion of class $j$
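A minimal sketch of both impurity metrics in Python, taking a vector of class proportions q:

```python
import numpy as np

def gini_impurity(q):
    # 1 - sum of squared class proportions
    q = np.asarray(q, dtype=float)
    return 1.0 - np.sum(q ** 2)

def entropy(q):
    # - sum of q_j * log2(q_j), skipping empty classes to avoid log2(0)
    q = np.asarray(q, dtype=float)
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

q = [0.5, 0.25, 0.25]
print(gini_impurity(q))  # 0.625
print(entropy(q))        # 1.5
```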

  20. Decision trees Remember: decision trees only see the ranking of feature values… …therefore scaling does not need to be applied!

  21. Lab Notebook 1: Categorical features

  22. Probabilistic classification Previously considered deterministic classifiers – Classifier predicts a discrete class $\hat{z}$ Now consider probabilistic classification – Classifier predicts a continuous probability for each class

  23. Notation
      Dataset $E$ contains $O$ elements and $K$ classes
      Each element $o$ is associated with:
      – Feature vector $y_o$ of $l$ real features: $y_o \in \mathbb{R}^l$
      – Set of class indicators $z_o \in \{0,1\}^K$ with $\sum_{j=1}^{K} z_{jo} = 1$
      Selected class (ground truth): $j_o = \sum_{j=1}^{K} j \, z_{jo}$

  24. Notation continued
      Classifier $Q$ maps feature vector $y_o$ to a probability distribution across the $K$ classes:
      – $Q: \mathbb{R}^l \to [0,1]^K$
      Probability for each class $j$:
      – $Q(j|y_o) > 0$
      – $\sum_{j=1}^{K} Q(j|y_o) = 1$
      Therefore, probability for the selected (ground-truth) class: $Q(j_o|y_o)$

  25. Example Four modes (walk, cycle, pt, drive) – $K = 4$ For a pt trip: – $z_o = (0, 0, 1, 0)$ – $j_o = 3$ Predicted probabilities for an arbitrary classifier $Q$: – $Q(y_o) = (0.1, 0.2, 0.4, 0.3)$ – $Q(j_o|y_o) = 0.4$

  26. Likelihood
      o | j_o | Q(1|y_o) | Q(2|y_o) | Q(3|y_o) | Q(4|y_o) | Q(j_o|y_o)
      1 |  2  |   0.11   |   0.84   |   0.03   |   0.02   |   0.84
      2 |  1  |   0.82   |   0.04   |   0.06   |   0.08   |   0.82
      3 |  4  |   0.11   |   0.18   |   0.05   |   0.66   |   0.66
      4 |  1  |   0.57   |   0.22   |   0.12   |   0.09   |   0.57
      5 |  3  |   0.10   |   0.03   |   0.75   |   0.12   |   0.75
      Likelihood:
      – $= \prod_{o=1}^{O} Q(j_o|y_o)$
      – $= 0.84 \times 0.82 \times 0.66 \times 0.57 \times 0.75$
      – $= 0.1943$

  27. Likelihood
      o | j_o | Q(1|y_o) | Q(2|y_o) | Q(3|y_o) | Q(4|y_o) | Q(j_o|y_o)
      1 |  4  |   0.11   |   0.84   |   0.03   |   0.02   |   0.02
      2 |  1  |   0.82   |   0.04   |   0.06   |   0.08   |   0.82
      3 |  4  |   0.11   |   0.18   |   0.05   |   0.66   |   0.66
      4 |  1  |   0.57   |   0.22   |   0.12   |   0.09   |   0.57
      5 |  3  |   0.10   |   0.03   |   0.75   |   0.12   |   0.75
      Likelihood:
      – $= \prod_{o=1}^{O} Q(j_o|y_o)$
      – $= 0.02 \times 0.82 \times 0.66 \times 0.57 \times 0.75$
      – $= 0.0046$
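A minimal sketch reproducing both likelihood calculations above with numpy; the arrays simply transcribe the two tables.

```python
import numpy as np

# Predicted probabilities Q(j|y_o) for the 5 observations and 4 classes.
Q = np.array([[0.11, 0.84, 0.03, 0.02],
              [0.82, 0.04, 0.06, 0.08],
              [0.11, 0.18, 0.05, 0.66],
              [0.57, 0.22, 0.12, 0.09],
              [0.10, 0.03, 0.75, 0.12]])

j_first = np.array([2, 1, 4, 1, 3]) - 1   # ground-truth classes, first table (0-indexed)
j_second = np.array([4, 1, 4, 1, 3]) - 1  # ground-truth classes, second table

print(np.prod(Q[np.arange(5), j_first]))   # ~0.1943
print(np.prod(Q[np.arange(5), j_second]))  # ~0.0046
```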

  28. Log-likelihood Likelihood is bound between 0 and 1 Tends to 0 as $O$ increases – Computational issues with small numbers Use log-likelihood instead:
      – $= \ln\left(\prod_{o=1}^{O} Q(j_o|y_o)\right)$
      – $= \sum_{o=1}^{O} \ln Q(j_o|y_o)$

  29. Log-likelihood
      o | j_o | Q(1|y_o) | Q(2|y_o) | Q(3|y_o) | Q(4|y_o) | Q(j_o|y_o)
      1 |  2  |   0.11   |   0.84   |   0.03   |   0.02   |   0.84
      2 |  1  |   0.82   |   0.04   |   0.06   |   0.08   |   0.82
      3 |  4  |   0.11   |   0.18   |   0.05   |   0.66   |   0.66
      4 |  1  |   0.57   |   0.22   |   0.12   |   0.09   |   0.57
      5 |  3  |   0.10   |   0.03   |   0.75   |   0.12   |   0.75
      Log-likelihood:
      – $= \sum_{o=1}^{O} \ln Q(j_o|y_o)$
      – $= \ln(0.84 \times 0.82 \times 0.66 \times 0.57 \times 0.75)$
      – $= -1.638$

  30. Log-likelihood
      o | j_o | Q(1|y_o) | Q(2|y_o) | Q(3|y_o) | Q(4|y_o) | Q(j_o|y_o)
      1 |  4  |   0.11   |   0.84   |   0.03   |   0.02   |   0.02
      2 |  1  |   0.82   |   0.04   |   0.06   |   0.08   |   0.82
      3 |  4  |   0.11   |   0.18   |   0.05   |   0.66   |   0.66
      4 |  1  |   0.57   |   0.22   |   0.12   |   0.09   |   0.57
      5 |  3  |   0.10   |   0.03   |   0.75   |   0.12   |   0.75
      Log-likelihood:
      – $= \sum_{o=1}^{O} \ln Q(j_o|y_o)$
      – $= \ln(0.02 \times 0.82 \times 0.66 \times 0.57 \times 0.75)$
      – $= -5.376$

  31. Cross-entropy loss Log-likelihood is bound between $-\infty$ and 0 – Tends to $-\infty$ as $O$ increases Can't compare between different dataset sizes – Normalise (divide by $O$) to get the cross-entropy loss (CEL) – Typically take the negative:
      $M = -\frac{1}{O} \sum_{o=1}^{O} \ln Q(j_o|y_o)$
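A minimal sketch of the cross-entropy loss for the first table above, together with the equivalent scikit-learn call (sklearn.metrics.log_loss averages the negative log-probabilities of the ground-truth classes):

```python
import numpy as np
from sklearn.metrics import log_loss

Q = np.array([[0.11, 0.84, 0.03, 0.02],
              [0.82, 0.04, 0.06, 0.08],
              [0.11, 0.18, 0.05, 0.66],
              [0.57, 0.22, 0.12, 0.09],
              [0.10, 0.03, 0.75, 0.12]])
j = np.array([2, 1, 4, 1, 3]) - 1   # ground-truth classes (0-indexed)

cel = -np.mean(np.log(Q[np.arange(len(j)), j]))
print(cel)                                   # ~0.328 (= 1.638 / 5)
print(log_loss(j, Q, labels=[0, 1, 2, 3]))   # same value via scikit-learn
```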

  32. Cross-entropy loss Can also be derived from Shannon's cross entropy $I(q, r)$ (hence the name) Minimising the cross-entropy loss is equivalent to maximising the likelihood of the data under the model – Maximum likelihood estimation (MLE) Other metrics exist, e.g. the Brier score (MSE), but CEL is the gold standard

  33. Probabilistic classifiers Need a model which generates a multiclass probability distribution from the feature vector $y_o$ Lots of possibilities! Today, logistic regression

  34. Linear regression Logistic regression is an extension of linear regression – Often called linear models
      $g(y) = \sum_{l=1}^{L} \gamma_l y_l + \gamma_0$
      Find the parameters ($\gamma$) using gradient descent – Typically minimise MSE

  35. Logistic regression Pass the linear regression through a function $\tau$:
      $g(y) = \tau\left(\sum_{l=1}^{L} \gamma_l y_l + \gamma_0\right)$
      $\tau$ needs to take a real value and return a value in $[0,1]$:
      – $\tau(z) \in [0,1] \ \forall z \in \mathbb{R}$

  36. Binary case – sigmoid function
      $\tau(z) = \dfrac{1}{1 + e^{-z}}$

  37. Multinomial case – softmax function
      $\tau(z_j) = \dfrac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$
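A minimal sketch of both link functions in numpy; subtracting the maximum before exponentiating in the softmax is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def sigmoid(z):
    # Maps a real score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Maps a vector of real scores to a probability distribution over classes.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

print(sigmoid(0.0))              # 0.5
print(softmax([2.0, 1.0, 0.1]))  # probabilities summing to 1
```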

  38. Logistic regression Separate set of parameters ($\gamma$) for each class $j$ – Normalised to zero for one class Separate value $z$ for each class – Utilities in Discrete Choice Models Minimise the cross-entropy loss (MLE) Solve for the parameters using gradient descent – $\gamma^* = \arg\min_\gamma M(\gamma)$
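As an illustration, a minimal sketch of fitting such a model with scikit-learn's LogisticRegression; X_train, y_train and X_test are assumed to come from the train-test split sketch above, and scikit-learn adds an L2 penalty by default (see the regularisation slides below).

```python
from sklearn.linear_model import LogisticRegression

# Fit a multinomial logistic regression by minimising the (regularised)
# cross-entropy loss on the training data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predicted probability Q(j|y_o) for each test observation and each class.
probs = clf.predict_proba(X_test)
```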

  39. Issues Overconfidence – Linearly separable data – parameters tend to infinity (unrealistic probability estimates) Outliers – Outliers can have a disproportionate effect on the parameters (high variance) Multicollinearity – If features are highly correlated, can cause numerical instability

  40. Solution: Regularisation Penalise the model for large parameter values – Reduces variance (and therefore overfitting) Regularisation:
      – $\gamma^* = \arg\min_\gamma \{ M(\gamma) + D \cdot g(\gamma) \}$
      Two candidates for $g$:
      – L1 regularisation (LASSO): $\sum_j |\gamma_j|$ (similar to Manhattan distance)
      – L2 regularisation (Ridge): $\sum_j \gamma_j^2$ (similar to Euclidean distance)

  41. Regularisation L1 regularisation shrinks less important parameters to zero – Feature selection? L2 regularisation penalises larger parameters more

  42. Regularisation $D$ and the choice of regularisation (L1/L2) are hyperparameters Larger $D$ means more regularisation (higher bias, lower variance)
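For reference, a minimal sketch of how these hyperparameters appear in scikit-learn; note that its C parameter is the inverse of the penalty weight $D$ above, so a smaller C means stronger regularisation.

```python
from sklearn.linear_model import LogisticRegression

# L2 (Ridge) penalty - the scikit-learn default.
l2_model = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)

# L1 (LASSO) penalty - requires a solver that supports it, e.g. saga.
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=1000)

# Compare candidate values of C (and L1 vs L2) on the validation set, never the test set.
```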

  43. Logistic regression [Diagram: logistic regression as a network – 5 input features $y$, a linear score $z$ for each of 3 classes, softmax output]

  44. Homework Notebook 2: Logistic regression
