Decision Theory and Loss Functions (CMSC 691, UMBC)

  1. Decision Theory and Loss Functions. CMSC 691, UMBC. Some slides adapted from Hamed Pirsiavash.

  2. Today’s Goal: learn about empirical risk minimization,
     $\arg\min_{h} \sum_{i=1}^{N} \ell(y_i, h_\theta(\mathbf{x}_i))$,
     via the gradient-based optimization template:
       Set t = 0; pick a starting value ΞΈ_t.
       Until converged:
         1. Get value y_t = F(ΞΈ_t)
         2. Get derivative g_t = F’(ΞΈ_t)
         3. Get scaling factor ρ_t
         4. Set ΞΈ_{t+1} = ΞΈ_t + ρ_t Β· g_t
         5. Set t += 1
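
A minimal sketch of this loop in Python, with a toy squared-error objective and a fixed step size (both assumptions; the slide leaves F and ρ_t unspecified). The step is written as ΞΈ βˆ’ ρ·g so that the argmin objective decreases:

    import numpy as np

    # Toy data for the sketch (assumed): 1-D inputs x and targets y.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.0, 3.0, 5.0, 7.0])

    def F(theta):
        """Empirical risk: sum of squared losses for the linear predictor h_theta(x) = theta * x."""
        return np.sum((y - theta * x) ** 2)

    def F_prime(theta):
        """Derivative of F with respect to theta."""
        return np.sum(-2.0 * x * (y - theta * x))

    theta = 0.0                      # pick a starting value
    rho = 0.01                       # scaling factor (step size), held constant for simplicity
    for t in range(1000):            # "until converged", capped at 1000 steps
        value = F(theta)             # 1. get value
        g = F_prime(theta)           # 2. get derivative
        theta = theta - rho * g      # 3.-4. step against the gradient to decrease F
        if abs(g) < 1e-8:            # crude convergence check
            break

    print(theta, F(theta))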

  3. Outline: Decision Theory; Loss Functions; Multiclass vs. Multilabel Prediction

  4. Decision Theory β€œDecision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (β€œstate of the world”) Output: a decision yΜƒ

  5. Decision Theory β€œDecision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (β€œstate of the world”) Output: a decision yΜƒ Requirement 1: a decision (hypothesis) function h( x ) to produce yΜƒ

  6. Decision Theory β€œDecision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (β€œstate of the world”) Output: a decision yΜƒ Requirement 1: a decision (hypothesis) function h( x ) to produce yΜƒ Requirement 2: a function β„“ (y, yΜƒ) telling us how wrong we are

  7. Decision Theory β€œDecision theory is trivial, apart from the computational details” – MacKay, ITILA, Ch 36 Input: x (β€œstate of the world”) Output: a decision yΜƒ Requirement 1: a decision (hypothesis) function h( x ) to produce yΜƒ Requirement 2: a loss function β„“ (y, yΜƒ) telling us how wrong we are Goal: minimize our expected loss across any possible input
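
A tiny illustration of the two requirements, with a hypothetical rule-based decision function and a 0/1-style loss (the rule and names are made up for the example):

    def h(x):
        # Requirement 1: a decision (hypothesis) function producing a decision yΜƒ.
        # Hypothetical rule: predict "spam" if the message mentions "free money".
        return "spam" if "free money" in x else "not spam"

    def loss(y, y_pred):
        # Requirement 2: a loss function telling us how wrong the decision was.
        return 0.0 if y == y_pred else 1.0

    y_pred = h("claim your free money now")
    print(y_pred, loss("spam", y_pred))   # -> spam 0.0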

  8. Requirement 1: Decision Function. h(x) is our predictor (classifier, regression model, clustering model, etc.). [Diagram: instances 1-4 β†’ Machine Learning β†’ Predictor h(x) β†’ score β†’ Evaluator, with the gold/correct labels and extra knowledge as additional inputs.]

  9. Requirement 2: Loss Function. β„“ (β€œell”, the fancy-l character) compares the β€œcorrect” label/result y with the predicted label/result ŷ, and $\ell(y, \hat{y}) \geq 0$. Do we optimize β„“ by minimizing or by maximizing it? Loss: a function that tells you how much to penalize a prediction ŷ that differs from the correct answer y.

  10. Requirement 2: Loss Function. β„“ (β€œell”, the fancy-l character) compares the β€œcorrect” label/result y with the predicted label/result ŷ, and $\ell(y, \hat{y}) \geq 0$. Negative β„“ (βˆ’β„“) is called a utility or reward function. Loss: a function that tells you how much to penalize a prediction ŷ that differs from the correct answer y.

  11. Decision Theory: minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})]$

  12. Risk Minimization: minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(\mathbf{x}))]$. As written, this is for a particular, unspecified input pair (x, y)… but we want any possible pair.

  13. Decision Theory: minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(\mathbf{x}))] = \arg\min_{h} \mathbb{E}_{(\mathbf{x}, y) \sim P}[\ell(y, h(\mathbf{x}))]$. Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y.

  14. Risk Minimization: minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(\mathbf{x}))] = \arg\min_{h} \mathbb{E}_{(\mathbf{x}, y) \sim P}[\ell(y, h(\mathbf{x}))] = \arg\min_{h} \int \ell(y, h(\mathbf{x}))\, P(\mathbf{x}, y)\, d(\mathbf{x}, y)$

  15. Risk Minimization: minimize expected loss across any possible input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(\mathbf{x}))] = \arg\min_{h} \mathbb{E}_{(\mathbf{x}, y) \sim P}[\ell(y, h(\mathbf{x}))] = \arg\min_{h} \int \ell(y, h(\mathbf{x}))\, P(\mathbf{x}, y)\, d(\mathbf{x}, y)$. We don’t know this distribution*! (*we could try to approximate it analytically)
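
If we did know the joint distribution P, the expectation could be approximated by sampling from it; a sketch assuming a made-up Gaussian P(x, y) and squared loss (all of these specifics are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_from_P(n):
        # Assumed joint distribution P(x, y): x ~ N(0, 1), y = 2x + small Gaussian noise.
        x = rng.normal(size=n)
        y = 2.0 * x + rng.normal(scale=0.1, size=n)
        return x, y

    def risk(h, n=100_000):
        # Monte Carlo estimate of E_{(x, y) ~ P}[ loss(y, h(x)) ] with squared loss.
        x, y = sample_from_P(n)
        return np.mean((y - h(x)) ** 2)

    print(risk(lambda x: 2.0 * x))   # near the noise variance (~0.01): a good predictor
    print(risk(lambda x: 0.0 * x))   # much larger risk: a poor predictor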

  16. (Posterior) Empirical Risk Minimization: minimize expected (posterior) loss across our observed input: $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(\mathbf{x}))] = \arg\min_{h} \mathbb{E}_{(\mathbf{x}, y) \sim P}[\ell(y, h(\mathbf{x}))] \approx \arg\min_{h} \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{y \sim P(\cdot \mid \mathbf{x}_i)}[\ell(y, h(\mathbf{x}_i))]$

  17. Empirical Risk Minimization: minimize expected loss across our observed input (& output): $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(\mathbf{x}))] = \arg\min_{h} \mathbb{E}_{(\mathbf{x}, y) \sim P}[\ell(y, h(\mathbf{x}))] \approx \arg\min_{h} \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, h(\mathbf{x}_i))$

  18. Empirical Risk Minimization: minimize expected loss across our observed input (& output): $\arg\min_{h} \sum_{i=1}^{N} \ell(y_i, h(\mathbf{x}_i))$. Our classifier/predictor is controlled by our parameters ΞΈ: change ΞΈ β†’ change the behavior of the classifier.
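
Computing the empirical risk itself is just an average of per-example losses over the observed data; a short sketch with 0-1 loss and a stand-in threshold classifier (both chosen only for illustration):

    def empirical_risk(loss, h, xs, ys):
        # (1/N) * sum_i loss(y_i, h(x_i)) over the observed (x_i, y_i) pairs.
        return sum(loss(y, h(x)) for x, y in zip(xs, ys)) / len(xs)

    zero_one = lambda y, y_pred: 0.0 if y == y_pred else 1.0
    h = lambda x: 1 if x > 0 else 0          # stand-in classifier

    xs = [-2.0, -0.5, 0.3, 1.7]
    ys = [0, 1, 1, 1]
    print(empirical_risk(zero_one, h, xs, ys))   # -> 0.25 (one mistake out of four)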

  19. Best Case: Optimize Empirical Risk with Gradients. $\arg\min_{h} \sum_{i=1}^{N} \ell(y_i, h_\theta(\mathbf{x}_i))$ becomes $\arg\min_{\theta} \sum_{i=1}^{N} \ell(y_i, h_\theta(\mathbf{x}_i))$. Change ΞΈ β†’ change the behavior of the classifier.

  20. Best Case: Optimize Empirical Risk with Gradients. $\arg\min_{\theta} \sum_{i=1}^{N} \ell(y_i, h_\theta(\mathbf{x}_i))$; call this objective $F(\theta)$. Change ΞΈ β†’ change the behavior of the classifier. How? Use gradient descent on $F(\theta)$! Differentiating might not always work: β€œβ€¦ apart from the computational details”

  21. Best Case: Optimize Empirical Risk with Gradients. $\arg\min_{\theta} \sum_{i=1}^{N} \ell(y_i, h_\theta(\mathbf{x}_i))$. Change ΞΈ β†’ change the behavior of the classifier. $\nabla_\theta F = \sum_i \frac{\partial \ell(y_i, \hat{y})}{\partial \hat{y}} \nabla_\theta h_\theta(\mathbf{x}_i)$, where $\hat{y} = h_\theta(\mathbf{x}_i)$. Differentiating might not always work: β€œβ€¦ apart from the computational details”

  22. Best Case: Optimize Empirical Risk with Gradients. $\arg\min_{\theta} \sum_{i=1}^{N} \ell(y_i, h_\theta(\mathbf{x}_i))$. Change ΞΈ β†’ change the behavior of the classifier. $\nabla_\theta F = \sum_i \frac{\partial \ell(y_i, \hat{y})}{\partial \hat{y}} \nabla_\theta h_\theta(\mathbf{x}_i)$, where $\hat{y} = h_\theta(\mathbf{x}_i)$. Step 1: compute the gradient of the loss wrt the predicted value. Differentiating might not always work: β€œβ€¦ apart from the computational details”

  23. Best Case: Optimize Empirical Risk with Gradients. $\arg\min_{\theta} \sum_{i=1}^{N} \ell(y_i, h_\theta(\mathbf{x}_i))$. Change ΞΈ β†’ change the behavior of the classifier. $\nabla_\theta F = \sum_i \frac{\partial \ell(y_i, \hat{y})}{\partial \hat{y}} \nabla_\theta h_\theta(\mathbf{x}_i)$, where $\hat{y} = h_\theta(\mathbf{x}_i)$. Step 1: compute the gradient of the loss wrt the predicted value. Step 2: compute the gradient of the predicted value wrt ΞΈ. Differentiating might not always work: β€œβ€¦ apart from the computational details”
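
A sketch of those two steps for squared loss and a linear predictor h_ΞΈ(x) = ΞΈᡀx (both choices are assumptions made for the example):

    import numpy as np

    def grad_F(theta, X, y):
        """Gradient of F(theta) = sum_i loss(y_i, h_theta(x_i)) via the chain rule,
        for squared loss and the linear predictor h_theta(x) = theta . x."""
        grad = np.zeros_like(theta)
        for x_i, y_i in zip(X, y):
            y_hat = x_i @ theta                  # prediction
            dloss_dyhat = 2.0 * (y_hat - y_i)    # Step 1: gradient of the loss wrt the predicted value
            dyhat_dtheta = x_i                   # Step 2: gradient of the predicted value wrt theta
            grad += dloss_dyhat * dyhat_dtheta   # chain rule, summed over examples
        return grad

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    y = np.array([1.0, 2.0])
    print(grad_F(np.zeros(2), X, y))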

  24. Outline: Decision Theory; Loss Functions; Multiclass vs. Multilabel Prediction

  25. Loss Functions Serve a Task.
      The task (what kind of problem are you solving?): Classification, Regression, Clustering, …
      The data (amount of human input / number of labeled examples): Fully-supervised, Semi-supervised, Un-supervised
      The approach (how any data are being used): Probabilistic, Neural, Generative, Memory-based, Conditional, Exemplar, Spectral

  26. Classification: Supervised Machine Learning. Example tasks: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis; …
      Input: an instance d, a fixed set of classes C = {c_1, c_2, …, c_J}, and a training set of m hand-labeled instances (d_1, c_1), …, (d_m, c_m)
      Output: a learned classifier Ξ³ that maps instances to classes. Ξ³ learns to associate certain features of instances with their labels.

  27. Classification Loss Function Example: 0-1 Loss. $\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$
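
The same definition in code (a one-liner; the function name is arbitrary):

    def zero_one_loss(y, y_hat):
        # 0 when the prediction matches the correct label, 1 otherwise.
        return 0 if y == y_hat else 1

    print(zero_one_loss("cat", "cat"), zero_one_loss("cat", "dog"))   # -> 0 1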

  28. Classification Loss Function Example: 0-1 Loss. $\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$ Problem 1: not differentiable wrt ŷ (or ΞΈ).

  29. Convex Surrogate Loss Functions. Surrogate loss: replace the zero/one loss by a smooth function. Easier to optimize if the surrogate loss is convex. [Plot: losses as a function of $\hat{y}\, y_i$.] Courtesy Hamed Pirsiavash, CIML
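
A sketch of a few common convex surrogates written as functions of the margin ŷ·y_i (with y_i ∈ {βˆ’1, +1}); the specific surrogates in the CIML figure are not reproduced above, so hinge, logistic, and exponential are assumed as representative examples:

    import numpy as np

    def zero_one(margin):
        return np.where(margin > 0, 0.0, 1.0)

    def hinge(margin):
        return np.maximum(0.0, 1.0 - margin)

    def logistic(margin):
        return np.log(1.0 + np.exp(-margin))

    def exponential(margin):
        return np.exp(-margin)

    margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # y_hat * y_i
    for f in (zero_one, hinge, logistic, exponential):
        # the surrogates are smooth, convex stand-ins for the 0/1 loss
        print(f.__name__, f(margins))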

  30. Example: ERM with Exponential Loss. [Image: the objective.] Courtesy Hamed Pirsiavash

  31. Example: ERM with Exponential Loss. [Images: the objective and its gradient.] Courtesy Hamed Pirsiavash

  32. Example: ERM with Exponential Loss. [Images: the objective, its gradient, and the update.] The loss term is high for misclassified points. Courtesy Hamed Pirsiavash
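
The slide images are not reproduced here; the following is a hedged reconstruction in code, assuming the standard exponential loss β„“(y, ŷ) = exp(βˆ’y ŷ) with labels y ∈ {βˆ’1, +1} and a linear scorer ŷ = wᡀx (these specifics and the toy data are assumptions, not read off the images):

    import numpy as np

    def exp_loss_objective(w, X, y):
        # sum_i exp(-y_i * (w . x_i)); large for misclassified (negative-margin) points.
        return np.sum(np.exp(-y * (X @ w)))

    def exp_loss_gradient(w, X, y):
        # d/dw sum_i exp(-y_i * w.x_i) = sum_i -y_i * x_i * exp(-y_i * w.x_i)
        weights = np.exp(-y * (X @ w))         # misclassified points get large weight
        return -(X * (y * weights)[:, None]).sum(axis=0)

    # Tiny synthetic dataset (assumed for the sketch).
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w, rho = np.zeros(2), 0.1
    for _ in range(200):                       # gradient-descent updates on the objective
        w = w - rho * exp_loss_gradient(w, X, y)

    print(w, exp_loss_objective(w, X, y))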

  33. Structured Classification: Sequence & Structured Prediction Courtesy Hamed Pirsiavash

  34. Classification Loss Function Example: 0-1 Loss. $\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$ Problem 1: not differentiable wrt ŷ (or ΞΈ). Problem 2: too strict; structured prediction involves many individual decisions. Solution 1: specialize 0-1 loss to the structured problem at hand.
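
One common specialization (an illustrative choice here, not stated on the slide) is to apply the 0-1 loss to each individual decision and sum, i.e., a per-position Hamming loss over a predicted sequence:

    def hamming_loss(y, y_hat):
        # Per-position 0-1 loss over two equal-length label sequences.
        assert len(y) == len(y_hat)
        return sum(1 for a, b in zip(y, y_hat) if a != b)

    gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
    pred = ["DET", "NOUN", "NOUN", "DET", "NOUN"]
    print(hamming_loss(gold, pred))   # -> 1: only one of the many individual decisions is wrong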

  35. Regression Like classification, but real-valued

  36. Regression Example: Stock Market Prediction Courtesy Hamed Pirsiavash

  37. Regression Loss Function Examples. Squared loss / MSE (mean squared error): $\ell(y, \hat{y}) = (y - \hat{y})^2$. ŷ is a real value β†’ nicely differentiable (generally) ☺

  38. Regression Loss Function Examples. Squared loss / MSE (mean squared error): $\ell(y, \hat{y}) = (y - \hat{y})^2$; ŷ is a real value β†’ nicely differentiable (generally) ☺. Absolute loss: $\ell(y, \hat{y}) = |y - \hat{y}|$; the absolute value is mostly differentiable (not at $\hat{y} = y$).
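
Both regression losses and their derivatives wrt ŷ in a short sketch (taking 0 as the subgradient of the absolute loss at ŷ = y is an assumption made for the example):

    import numpy as np

    def squared_loss(y, y_hat):
        return (y - y_hat) ** 2

    def squared_loss_grad(y, y_hat):     # d/d y_hat of (y - y_hat)^2 = -2 (y - y_hat)
        return -2.0 * (y - y_hat)

    def absolute_loss(y, y_hat):
        return np.abs(y - y_hat)

    def absolute_loss_grad(y, y_hat):    # not differentiable at y_hat == y; np.sign gives 0 there
        return -np.sign(y - y_hat)

    y, y_hat = 3.0, 2.5
    print(squared_loss(y, y_hat), squared_loss_grad(y, y_hat))     # 0.25 -1.0
    print(absolute_loss(y, y_hat), absolute_loss_grad(y, y_hat))   # 0.5  -1.0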
