Decision Theory and Loss Functions. CMSC 691, UMBC. Some slides adapted from Hamed Pirsiavash.
Today's Goal: learn about empirical risk minimization, $\arg\min_h \sum_{i=1}^{N} \ell(y_i, h(x_i))$.
Generic gradient-based optimization of a function F:
Set t = 0. Pick a starting value θ_t. Until converged:
1. Get value y_t = F(θ_t)
2. Get derivative g_t = F'(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1
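For concreteness, here is a minimal Python sketch of the update recipe above; the objective F, its derivative, and the fixed (negative) scaling factor rho are illustrative assumptions, not part of the slide:

```python
# Minimal sketch of the generic gradient-step recipe above.
# F, F_prime, and rho are illustrative assumptions: a toy 1-D objective,
# its derivative, and a fixed negative scaling factor (so theta + rho*g
# moves downhill, matching the argmin goal).

def F(theta):
    return (theta - 3.0) ** 2        # toy objective, minimized at theta = 3

def F_prime(theta):
    return 2.0 * (theta - 3.0)       # derivative of the toy objective

theta = 0.0                          # pick a starting value
rho = -0.1                           # scaling factor (negative => descent)
for t in range(100):                 # "until converged", capped at 100 steps
    y_t = F(theta)                   # 1. get value y_t = F(theta_t)
    g_t = F_prime(theta)             # 2. get derivative g_t = F'(theta_t)
    theta = theta + rho * g_t        # 3-4. scale and update theta
print(theta)                         # approx. 3.0
```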
Outline Decision Theory Loss Functions Multiclass vs. Multilabel Prediction
Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch. 36). Input: x (the "state of the world"). Output: a decision ŷ.
Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch. 36). Input: x (the "state of the world"). Output: a decision ŷ. Requirement 1: a decision (hypothesis) function h(x) to produce ŷ.
Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch. 36). Input: x (the "state of the world"). Output: a decision ŷ. Requirement 1: a decision (hypothesis) function h(x) to produce ŷ. Requirement 2: a function ℓ(y, ŷ) telling us how wrong we are.
Decision Theory. "Decision theory is trivial, apart from the computational details" (MacKay, ITILA, Ch. 36). Input: x (the "state of the world"). Output: a decision ŷ. Requirement 1: a decision (hypothesis) function h(x) to produce ŷ. Requirement 2: a loss function ℓ(y, ŷ) telling us how wrong we are. Goal: minimize our expected loss across any possible input.
Requirement 1: Decision Function. [Diagram: labeled instances feed a machine learning system that produces a predictor h(x); the predictor's scores on new instances are compared against the gold/correct labels by an evaluator; extra knowledge may also be supplied.] h(x) is our predictor (classifier, regression model, clustering model, etc.).
Requirement 2: Loss Function. $\ell(y, \hat{y}) \ge 0$, where ℓ is "ell" (a fancy l character), ŷ is the predicted label/result, and y is the "correct" label/result. How do we optimize ℓ: minimize or maximize? Loss: a function that tells you how much to penalize a prediction ŷ that differs from the correct answer y.
Requirement 2: Loss Function. $\ell(y, \hat{y}) \ge 0$, where ŷ is the predicted label/result and y is the "correct" label/result. The negative of ℓ (i.e., $-\ell$) is called a utility or reward function. Loss: a function that tells you how much to penalize a prediction ŷ that differs from the correct answer y.
Decision Theory: minimize expected loss across any possible input. $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})]$
Risk Minimization: minimize expected loss across any possible input. $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(x))]$. This expectation is over a particular, unspecified input pair (x, y)… but we want any possible pair.
Decision Theory: minimize expected loss across any possible input. $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(x))] = \arg\min_{h} \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))]$. Assumption: there exists some true (but likely unknown) distribution P over inputs x and outputs y.
Risk Minimization: minimize expected loss across any possible input. $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(x))] = \arg\min_{h} \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))] = \arg\min_{h} \int \ell(y, h(x)) \, P(x, y) \, d(x, y)$
Risk Minimization: minimize expected loss across any possible input. $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(x))] = \arg\min_{h} \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))] = \arg\min_{h} \int \ell(y, h(x)) \, P(x, y) \, d(x, y)$. We don't know this distribution*! (*We could try to approximate it analytically.)
(Posterior) Empirical Risk Minimization: minimize expected (posterior) loss across our observed inputs. $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(x))] = \arg\min_{h} \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))] \approx \arg\min_{h} \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{y \sim p(\cdot \mid x_i)}[\ell(y, h(x_i))]$
Empirical Risk Minimization: minimize expected loss across our observed inputs (& outputs). $\arg\min_{\hat{y}} \mathbb{E}[\ell(y, \hat{y})] = \arg\min_{h} \mathbb{E}[\ell(y, h(x))] = \arg\min_{h} \mathbb{E}_{(x,y) \sim P}[\ell(y, h(x))] \approx \arg\min_{h} \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, h(x_i))$
Empirical Risk Minimization: minimize expected loss across our observed inputs (& outputs). $\arg\min_{h} \sum_{i=1}^{N} \ell(y_i, h(x_i))$. Our classifier/predictor h is controlled by our parameters θ: change θ → change the behavior of the classifier.
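As a concrete (hypothetical) instantiation of this objective, the sketch below assumes a linear predictor h_θ(x) = θ·x and a squared loss; the slide leaves both h and ℓ abstract:

```python
import numpy as np

def h(theta, x):
    # hypothetical predictor: a simple linear model h_theta(x) = theta . x
    return theta @ x

def loss(y, y_hat):
    # hypothetical choice of loss: squared error
    return (y - y_hat) ** 2

def empirical_risk(theta, xs, ys):
    # the objective above: sum_i loss(y_i, h_theta(x_i))
    return sum(loss(y, h(theta, x)) for x, y in zip(xs, ys))

# toy data: two 2-d examples with real-valued targets
xs = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
ys = [3.0, 0.0]
print(empirical_risk(np.array([1.0, 1.0]), xs, ys))
```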
Best Case: Optimize Empirical Risk with Gradients. $\arg\min_{h} \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i)) \rightarrow \arg\min_{\theta} \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$. Change θ → change the behavior of the classifier.
Best Case: Optimize Empirical Risk with Gradients. $F(\theta) = \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$; solve $\arg\min_{\theta} F(\theta)$. Change θ → change the behavior of the classifier. How? Use gradient descent on F(θ)! Differentiating might not always work: "… apart from the computational details."
Best Case: Optimize Empirical Risk with Gradients. $\arg\min_{\theta} \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$. Change θ → change the behavior of the classifier. Writing $\hat{y}_i = h_\theta(x_i)$, the chain rule gives $\nabla_\theta F = \sum_{i} \frac{\partial \ell(y_i, \hat{y}_i)}{\partial \hat{y}_i} \nabla_\theta h_\theta(x_i)$. Differentiating might not always work: "… apart from the computational details."
Best Case: Optimize Empirical Risk with Gradients. $\arg\min_{\theta} \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$. Change θ → change the behavior of the classifier. Writing $\hat{y}_i = h_\theta(x_i)$, the chain rule gives $\nabla_\theta F = \sum_{i} \frac{\partial \ell(y_i, \hat{y}_i)}{\partial \hat{y}_i} \nabla_\theta h_\theta(x_i)$. Step 1: compute the gradient of the loss wrt the predicted value. Differentiating might not always work: "… apart from the computational details."
Best Case: Optimize Empirical Risk with Gradients. $\arg\min_{\theta} \sum_{i=1}^{N} \ell(y_i, h_\theta(x_i))$. Change θ → change the behavior of the classifier. Writing $\hat{y}_i = h_\theta(x_i)$, the chain rule gives $\nabla_\theta F = \sum_{i} \frac{\partial \ell(y_i, \hat{y}_i)}{\partial \hat{y}_i} \nabla_\theta h_\theta(x_i)$. Step 1: compute the gradient of the loss wrt the predicted value. Step 2: compute the gradient of the predicted value wrt θ. Differentiating might not always work: "… apart from the computational details."
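Continuing the hypothetical linear-model, squared-loss setup from the earlier sketch, here the two steps above are chained into a gradient of the empirical risk (again, the specific model and loss are assumptions, not the slide's):

```python
import numpy as np

def grad_empirical_risk(theta, xs, ys):
    # gradient of F(theta) = sum_i loss(y_i, h_theta(x_i)) via the chain rule
    grad = np.zeros_like(theta)
    for x, y in zip(xs, ys):
        y_hat = theta @ x                 # prediction h_theta(x_i)
        dloss_dyhat = 2.0 * (y_hat - y)   # Step 1: d loss / d y_hat (squared loss)
        dyhat_dtheta = x                  # Step 2: d y_hat / d theta (linear model)
        grad += dloss_dyhat * dyhat_dtheta
    return grad

# one gradient-descent step on the toy data from before
xs = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
ys = [3.0, 0.0]
theta = np.array([1.0, 1.0])
theta = theta - 0.1 * grad_empirical_risk(theta, xs, ys)
print(theta)
```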
Outline Decision Theory Loss Functions Multiclass vs. Multilabel Prediction
Loss Functions Serve a Task. The task (what kind of problem are you solving?): classification, regression, clustering, … The data (amount of human input / number of labeled examples): fully-supervised, semi-supervised, un-supervised. The approach (how any data are being used): probabilistic, neural, generative, memory-based, conditional, exemplar, spectral, …
Classification: Supervised Machine Learning. Examples: assigning subject categories, topics, or genres; age/gender identification; language identification; sentiment analysis; spam detection; authorship identification; … Input: an instance d, a fixed set of classes C = {c_1, c_2, …, c_J}, and a training set of m hand-labeled instances (d_1, c_1), …, (d_m, c_m). Output: a learned classifier γ that maps instances to classes. γ learns to associate certain features of instances with their labels.
Classification Loss Function Example: 0-1 Loss. $\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$
Classification Loss Function Example: 0-1 Loss. $\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$ Problem 1: not differentiable wrt ŷ (or θ).
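A tiny sketch of the 0-1 loss over a batch of predictions (the example labels are illustrative):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # 1 where the prediction disagrees with the label, 0 where it matches
    return np.where(np.asarray(y) == np.asarray(y_hat), 0, 1)

print(zero_one_loss([1, 0, 1, 1], [1, 1, 1, 0]))         # [0 1 0 1]
print(zero_one_loss([1, 0, 1, 1], [1, 1, 1, 0]).mean())  # 0.5 = error rate
```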
Convex Surrogate Loss Functions. Surrogate loss: replace the zero/one loss by a smooth function. Easier to optimize if the surrogate loss is convex. [Plot: surrogate losses compared to the zero/one loss.] Courtesy Hamed Pirsiavash, CIML
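A sketch comparing a few standard convex surrogates to the 0-1 loss as a function of the margin m = y·ŷ; the particular surrogates listed here (hinge, logistic, exponential) are common choices and not necessarily the ones plotted on the original slide:

```python
import numpy as np

def zero_one(m):
    return (m <= 0).astype(float)        # 1 if the margin is non-positive

def hinge(m):
    return np.maximum(0.0, 1.0 - m)      # SVM-style hinge loss

def logistic(m):
    return np.log1p(np.exp(-m))          # logistic loss

def exponential(m):
    return np.exp(-m)                    # exponential loss (see example below)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("0/1", zero_one), ("hinge", hinge),
                ("logistic", logistic), ("exp", exponential)]:
    print(name, np.round(f(margins), 3))
```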
Example: ERM with exponential loss: objective. [Equation shown on slide.] Courtesy Hamed Pirsiavash
Example: ERM with exponential loss: objective and gradient. [Equations shown on slide.] Courtesy Hamed Pirsiavash
Example: ERM with exponential loss: objective, gradient, and update. The loss term is high for misclassified points. [Equations shown on slide.] Courtesy Hamed Pirsiavash
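A sketch of this example under one common setup: a linear scorer θ·x, labels in {-1, +1}, and the exponential-loss objective $\sum_i \exp(-y_i\, \theta^\top x_i)$. This parameterization is an assumption, since the slide's exact equations are not reproduced above:

```python
import numpy as np

def exp_loss_objective(theta, X, y):
    # sum_i exp(-y_i * theta . x_i): large for misclassified (negative-margin) points
    margins = y * (X @ theta)
    return np.exp(-margins).sum()

def exp_loss_gradient(theta, X, y):
    # d/dtheta sum_i exp(-y_i * theta . x_i) = -sum_i y_i x_i exp(-y_i * theta . x_i)
    margins = y * (X @ theta)
    return -(y * np.exp(-margins)) @ X

# toy linearly separable data, labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
theta = np.zeros(2)
for _ in range(200):
    theta -= 0.05 * exp_loss_gradient(theta, X, y)   # gradient-descent update
print(theta, exp_loss_objective(theta, X, y))
```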
Structured Classification: Sequence & Structured Prediction Courtesy Hamed Pirsiavash
Classification Loss Function Example: 0-1 Loss. $\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$ Problem 1: not differentiable wrt ŷ (or θ). Problem 2: too strict; structured prediction involves many individual decisions. Solution 1: specialize 0-1 to the structured problem at hand, as sketched below.
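One way such a specialization often looks for sequence outputs is a per-position Hamming loss; the choice of Hamming loss here is an illustration, not the slide's prescription:

```python
def hamming_loss(y, y_hat):
    # average per-position 0-1 loss over a structured (sequence) output
    assert len(y) == len(y_hat)
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)

# e.g., part-of-speech tags for a 4-token sentence
gold = ["DET", "NOUN", "VERB", "NOUN"]
pred = ["DET", "NOUN", "NOUN", "NOUN"]
print(hamming_loss(gold, pred))  # 0.25: one of four decisions is wrong
```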
Regression Like classification, but real-valued
Regression Example: Stock Market Prediction Courtesy Hamed Pirsiavash
Regression Loss Function Examples. Squared loss / MSE (mean squared error): $\ell(y, \hat{y}) = (y - \hat{y})^2$, where ŷ is a real value. Nicely differentiable (generally) ☺
Regression Loss Function Examples. Squared loss / MSE (mean squared error): $\ell(y, \hat{y}) = (y - \hat{y})^2$; nicely differentiable (generally). Absolute loss: $\ell(y, \hat{y}) = |y - \hat{y}|$; the absolute value is mostly differentiable (everywhere except $y = \hat{y}$). ŷ is a real value.
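A small sketch evaluating both regression losses on a batch of predictions (the arrays are illustrative):

```python
import numpy as np

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2              # differentiable everywhere

def absolute_loss(y, y_hat):
    return np.abs(y - y_hat)             # not differentiable at y == y_hat

y = np.array([3.0, -1.0, 2.5])
y_hat = np.array([2.5, 0.0, 2.5])
print(squared_loss(y, y_hat).mean())     # MSE
print(absolute_loss(y, y_hat).mean())    # MAE
```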