Calibration of Convex Surrogate Losses via Property Elicitation Jessie Finocchiaro October 10, 2019
Introduction. Machine learning makes predictions about future events; the target of a prediction is some property of the data.
Outline • Background • Surrogates • Calibration • Properties • Necessary and Sufficient Conditions • Case Study: Abstain • Dimensionality • Conclusion
Empirical Risk Minimization (ERM). Assumption: data comes i.i.d. from a distribution D over 𝒳 × 𝒴. Goal: minimize the risk
Risk(h) = ∫ L(h(x), y) dD(x, y).
Problem: we don't know D, so we settle for minimizing the empirical risk. Take n samples (x_i, y_i) ∈ 𝒳 × 𝒴 and average the loss of the prediction h(x_i), given feature x_i, against the outcome y_i:
ERisk(h) = (1/n) Σ_{i=1}^n L(h(x_i), y_i).
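A minimal sketch of ERM in Python (my own illustration, not from the talk): pick, from a small hypothesis class, the hypothesis with the lowest average loss on an i.i.d. sample. The names zero_one_loss, empirical_risk, and the threshold hypothesis class are illustrative assumptions.

import numpy as np

def zero_one_loss(r, y):
    return 0.0 if r == y else 1.0

def empirical_risk(h, xs, ys, loss):
    # average loss of h's predictions against the observed outcomes
    return np.mean([loss(h(x), y) for x, y in zip(xs, ys)])

# toy data: the label is 1 exactly when the feature is positive
rng = np.random.default_rng(0)
xs = rng.normal(size=50)
ys = (xs > 0).astype(int)

# a tiny hypothesis class: threshold classifiers h_t(x) = 1{x > t}
hypotheses = [lambda x, t=t: int(x > t) for t in np.linspace(-1, 1, 21)]
best = min(hypotheses, key=lambda h: empirical_risk(h, xs, ys, zero_one_loss))
print(empirical_risk(best, xs, ys, zero_one_loss))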
Conditional Probability. Form a hypothesis h: 𝒳 → ℛ. We want to learn h minimizing 𝔼_{(X,Y)∼D} L(h(X), Y). If h(x) minimizes 𝔼[L(r, Y) | X = x] over reports r for every x ∈ 𝒳, this is obviously achieved. So abstract h(x) to a report r and study the conditional distribution Pr[Y | X = x], a point on the simplex Δ_𝒴.
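The "obviously achieved" step can be spelled out with the tower property (my own filling-in of the standard argument):

\[
\mathbb{E}_{(X,Y)\sim D}\, L(h(X),Y)
  \;=\; \mathbb{E}_{X}\Big[\,\mathbb{E}_{Y\mid X}\, L(h(X),Y)\,\Big]
  \;\ge\; \mathbb{E}_{X}\Big[\min_{r\in\mathcal{R}} \mathbb{E}_{Y\mid X}\, L(r,Y)\Big],
\]

with equality when h(x) minimizes the inner conditional expectation for (D-almost) every x. So it suffices to study, for each conditional distribution p = Pr[Y | X = x] ∈ Δ_𝒴, which reports minimize 𝔼_{Y∼p} L(r, Y).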
Setting. Classification-like problems: multiclass classification (with reject option), ranking, top-k classification. Notation: finite outcome set 𝒴 with |𝒴| =: n; report set ℛ with |ℛ| =: k; probability distribution over 𝒴: p ∈ Δ_𝒴, with p_y = Pr[Y = y].
Outline • Background • Surrogates • Calibration • Properties • Necessary and Sufficient Conditions • Case Study: Abstain • Dimensionality • Conclusion
Surrogates. Loss functions measure error and are created with a task in mind; they are often discrete. Discrete losses are hard to optimize, so we use surrogates, which should approximate the original loss well. [Figure: surrogates for 0-1 loss with 𝒴 = {−1, 1}; plot of L(r, 1) against the report r.]
Calibration. A calibrated surrogate loss L is a "good approximation" of the discrete loss ℓ(r, y). Let ℓ be a discrete loss. A surrogate loss function L: ℝ^d × 𝒴 → ℝ₊ is said to be ℓ-calibrated if there exists a link function ψ: ℝ^d → ℛ such that, for all p ∈ Δ_𝒴,
inf_{u : ψ(u) ∉ argmin_r 𝔼_{Y∼p} ℓ(r, Y)} 𝔼_{Y∼p} L(u, Y) > inf_{u ∈ ℝ^d} 𝔼_{Y∼p} L(u, Y).
In words: the infimum of the expected surrogate loss over reports not linked to the argmin of the discrete loss is strictly greater than the unconstrained infimum.
Consistency. Let f_n: 𝒳 → ℝ^d be the hypothesis learned from training on n samples drawn i.i.d. from D. er_D^L(f) is the expected loss L incurred by predicting f(X) when (X, Y) ∼ D. The sequence (f_n) is said to be L-consistent if
er_D^L(f_n) → er_D^{L,*} := inf_{f : 𝒳 → ℝ^d} er_D^L(f).
Losses are calibrated. Hypothesis sequences are consistent.
Relating calibration and consistency. Let ℓ: ℛ × 𝒴 → ℝ₊ be a discrete loss. A surrogate loss function L: ℝ^d × 𝒴 → ℝ₊ is ℓ-calibrated if and only if there exists a link function ψ: ℝ^d → ℛ such that for all distributions D on 𝒳 × 𝒴 and all sequences of surrogate hypotheses f_n: 𝒳 → ℝ^d, we have
er_D^L(f_n) → er_D^{L,*}  ⇒  er_D^ℓ(ψ ∘ f_n) → er_D^{ℓ,*}.
Converging to the optimal surrogate risk implies the linked hypotheses converge to the optimal discrete risk. Ramaswamy et al. (2015) Theorem 3, originally Tewari and Bartlett (2007) Theorem 2.
Pause. Questions so far? Next we formalize properties, which we use to study calibration.
Properties. A property Γ: Δ_𝒴 → ℛ is a function mapping probability distributions to reports. If it's easier, think of p ∈ Δ_𝒴 as a conditional probability. A property Γ: Δ_𝒴 → ℛ is elicitable if there is a loss function L: ℛ × 𝒴 → ℝ₊ such that, for all p ∈ Δ_𝒴, Γ(p) = argmin_r 𝔼_{Y∼p} L(r, Y). Here, we say the loss L elicits Γ. Elicitable properties have convex level sets Γ_r = {p ∈ Δ_𝒴 : r ∈ Γ(p)} (Lambert and Shoham 2009).
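A small numeric check (mine, not from the talk) of a standard example: the 0-1 loss elicits the mode. For a distribution p over outcomes, the report minimizing expected 0-1 loss is argmax_y p_y.

import numpy as np

def expected_loss(r, p, loss):
    return sum(p[y] * loss(r, y) for y in range(len(p)))

zero_one = lambda r, y: 0.0 if r == y else 1.0

p = np.array([0.5, 0.3, 0.2])
reports = range(len(p))
minimizer = min(reports, key=lambda r: expected_loss(r, p, zero_one))
assert minimizer == int(np.argmax(p))   # the mode of p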
"Drawing" a property. The simplex over n outcomes, Δ_𝒴, sits in (n − 1)-dimensional space. Example: n = 3.
Calibration… in terms of properties. A property Γ: Δ_𝒴 → ℝ^d and link ψ: ℝ^d → ℛ are ℓ-calibrated if
u_n → Γ(p)  ⇒  𝔼_{Y∼p} ℓ(ψ(u_n), Y) → min_r 𝔼_{Y∼p} ℓ(r, Y).
i.e. the property value can always be linked to the argmin of the discrete loss. A tool to study geometric properties of losses eliciting Γ. Definition courtesy of Agarwal and Agarwal (2015).
Property Papers. Lambert, Shoham (2009): Eliciting Truthful Answers to Multiple-Choice Questions. Finite properties are elicitable iff their level sets form a power diagram. Agarwal, Agarwal (2015): On Consistent Surrogate Risk Minimization and Property Elicitation. There is a connection between properties and surrogate losses. Frongillo, Kash (2015): On Elicitation Complexity. Every property can be elicited indirectly; the question is how hard it is to elicit, i.e. in how many dimensions.
Calibrated surrogates. Positive normal sets. Necessary conditions. Sufficient conditions. The relationship between positive normal sets and the level sets of the property.
Positive Normal Sets. Finite outcome setting: rewrite the loss as L: ℝ^d → ℝ^n, where L(u) is the vector of loss values should each outcome occur. Linearity of expectation: 𝔼_{Y∼p} L(u, Y) = ⟨p, L(u)⟩. The positive normal set of L at u is
𝒩_L(u) = {p ∈ Δ_𝒴 : ⟨p, L(u)⟩ ≤ ⟨p, z⟩ for all z ∈ 𝒮_L},   𝒮_L := conv(im L),
i.e. the outcome distributions under which the expected loss vector at u attains the infimum of possible expected losses. (When the infimum is not attained, the definition is stated for sequences whose expected losses converge to the infimum.)
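A rough numeric sketch (mine, not from the talk), simplified to a loss with a finite report set rather than a surrogate over ℝ^d: approximate a positive normal set by sampling distributions p on a grid and checking where the loss vector at a given report attains the minimum expected loss.

import itertools
import numpy as np

def loss_matrix_01(n):
    # row r = vector of 0-1 loss values ℓ(r, y) for each outcome y
    return np.ones((n, n)) - np.eye(n)

def in_positive_normal_set(p, r, loss_matrix):
    expected = loss_matrix @ p          # expected loss of each report under p
    return np.isclose(expected[r], expected.min())

n = 3
L = loss_matrix_01(n)
# sample p on a grid over the simplex and collect the positive normal set of report 0
grid = [np.array(w) / 10 for w in itertools.product(range(11), repeat=n) if sum(w) == 10]
region = [p for p in grid if in_positive_normal_set(p, 0, L)]
# for 0-1 loss this is exactly {p : p_0 >= max_y p_y}, the level set of the mode at report 0
assert all(p[0] >= p.max() - 1e-9 for p in region)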
Necessary Condition. Let ℓ be a discrete loss and let L be ℓ-calibrated. Let γ be the property elicited by ℓ. For all u ∈ 𝒮_L = conv(im L), there exists an r ∈ ℛ such that 𝒩_L(u) ⊆ γ_r. Ramaswamy et al. (2015) Theorem 6.
Sufficient Condition. Suppose there exists some finite set of points u_i ∈ 𝒮_L such that ∪_i 𝒩_L(u_i) = Δ_𝒴, and for each i there exists an r_i ∈ ℛ such that 𝒩_L(u_i) ⊆ γ_{r_i}. Then L is ℓ-calibrated. Example (0-1 loss and hinge): 𝒩_hinge((2,0)) = {p ∈ Δ_2 : p_1 ≥ 1/2} ⊆ γ_1 and 𝒩_hinge((0,2)) = {p ∈ Δ_2 : p_1 ≤ 1/2} ⊆ γ_{−1}; the two sets cover Δ_2 and meet at p_1 = 1/2. Ramaswamy et al. (2015) Theorem 8.
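A quick numeric illustration (my own, not from the talk) of the hinge example: for each conditional probability of the label +1, the grid-approximated minimizer u* of expected hinge loss links, via sign(u*), to the Bayes-optimal 0-1 report.

import numpy as np

def hinge(u, y):            # y in {-1, +1}
    return max(0.0, 1.0 - u * y)

us = np.linspace(-3, 3, 601)
for p1 in np.linspace(0.05, 0.95, 10):        # p1 = Pr[Y = +1]; grid avoids the tie at 1/2
    exp_hinge = [p1 * hinge(u, +1) + (1 - p1) * hinge(u, -1) for u in us]
    u_star = us[int(np.argmin(exp_hinge))]
    bayes = +1 if p1 > 0.5 else -1            # argmin of expected 0-1 loss
    assert np.sign(u_star) == bayes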
Outline • Background • Surrogates • Calibration • Properties • Necessary and Sufficient Conditions • Case Study: Abstain • Dimensionality • Conclusion
Case Study: Abstain. Situations where the cost of misclassification is high: college admissions, medical diagnoses. Discrete loss for this problem:
ℓ_{1/2}(r, y) = 0 if r = y;  1/2 if r = ⊥;  1 if r ≠ y and r ≠ ⊥.
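A sketch (mine) of the abstain loss and the property it elicits: reporting class y costs 1 − p_y in expectation and abstaining costs 1/2, so the optimal report is argmax_y p_y when max_y p_y ≥ 1/2 and ⊥ otherwise.

import numpy as np

ABSTAIN = "⊥"

def abstain_loss(r, y):
    if r == ABSTAIN:
        return 0.5
    return 0.0 if r == y else 1.0

def optimal_report(p):
    # brute-force argmin of expected loss over all reports, including ⊥
    reports = list(range(len(p))) + [ABSTAIN]
    exp = {r: sum(p[y] * abstain_loss(r, y) for y in range(len(p))) for r in reports}
    return min(exp, key=exp.get)

print(optimal_report(np.array([0.7, 0.1, 0.1, 0.1])))      # -> 0 (confident class)
print(optimal_report(np.array([0.25, 0.25, 0.25, 0.25])))  # -> ⊥ (abstain)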
Historical calibrated surrogates
• Crammer, Singer (2001): L_CS(u, y) = (1 + max_{j≠y} u_j − u_y)_+, with link ψ_CS(u) = argmax_{1≤i≤n} u_i if u_(1) − u_(2) > τ (the gap between the two largest coordinates exceeds a threshold τ), and ⊥ otherwise.
• One vs All (Rifkin, Klautau 2004): L_OvA(u, y) = Σ_i [ 𝟙{y = i}(1 − u_i)_+ + 𝟙{y ≠ i}(1 + u_i)_+ ], with link ψ_OvA(u) = argmax_{1≤i≤n} u_i if max_i u_i > τ, and ⊥ otherwise.
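A transcription of the two surrogates above into Python (my own sketch; tau is the abstain threshold, u an ℝ^n surrogate report, y a class index):

import numpy as np

def L_CS(u, y):
    # Crammer-Singer: (1 + max_{j != y} u_j - u_y)_+
    others = np.delete(u, y)
    return max(0.0, 1.0 + others.max() - u[y])

def psi_CS(u, tau):
    top2 = np.sort(u)[::-1][:2]
    return int(np.argmax(u)) if top2[0] - top2[1] > tau else "⊥"

def L_OvA(u, y):
    # One-vs-All: hinge on the true class, reversed hinge on the rest
    return sum(max(0.0, 1.0 - u[i]) if i == y else max(0.0, 1.0 + u[i])
               for i in range(len(u)))

def psi_OvA(u, tau):
    return int(np.argmax(u)) if u.max() > tau else "⊥"

u = np.array([0.9, -0.8, -1.0, -0.7])
print(L_CS(u, 0), psi_CS(u, tau=0.5), L_OvA(u, 0), psi_OvA(u, tau=0.0))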
BEP Surrogate and Link. [Figure: the BEP surrogate and link illustrated at p = (1/4, 1/4, 1/4, 1/4) and p = (7/10, 1/10, 1/10, 1/10).] Ramaswamy et al. (2018) Section 4.
Why BEP? BEP, CS, and OvA are all calibrated for the abstain loss ℓ_{1/2}. Each convex surrogate takes a report u ∈ ℝ^d. BEP: d = ⌈log_2(n)⌉. CS and OvA: d = n. Why does the dimension d matter?
Outline • Background • Surrogates • Calibration • Properties • Necessary and Sufficient Conditions • Case Study: Abstain • Dimensionality • Conclusion
Dimensionality. We want algorithms that are efficient and accurate. Reducing the dimension makes the optimization problem more efficient; calibration guarantees accuracy.
Elicitation Complexity. Elic(Γ) = the minimum dimension d such that Γ is elicitable by a d-dimensional loss. Maybe Γ isn't itself 1-elicitable, but can be computed as Γ = g ∘ Γ̂, where Γ̂ is 1-elicitable; then Elic(Γ) = 1. This is called indirect elicitation. Example: Γ(p) = 𝔼_p[Y]². Elicit the mean 𝔼_p[Y] and apply g: x ↦ x².
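An illustration (mine) of indirect elicitation: Γ(p) = (𝔼_p[Y])² is not reported directly; instead elicit the mean with squared loss and apply the link g(x) = x².

import numpy as np

outcomes = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])

def expected_sq_loss(r):
    return np.sum(p * (r - outcomes) ** 2)

grid = np.linspace(outcomes.min(), outcomes.max(), 2001)
mean_hat = grid[np.argmin([expected_sq_loss(r) for r in grid])]  # elicits E_p[Y]
gamma = mean_hat ** 2                                            # link g(x) = x^2
print(mean_hat, gamma, (p @ outcomes) ** 2)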
Convex Calibration Dimension. A special case of elicitation complexity. ccdim(ℓ) = the minimum dimension d such that there is a convex ℓ-calibrated surrogate L: ℝ^d × 𝒴 → ℝ₊. Example: ccdim(ℓ_{1/2}) ≤ ⌈log_2(n)⌉ because of the BEP surrogate. From Ramaswamy et al. (2015) Definition 10.
Bounds on CC dimension. Understood through the feasible subspace dimension. For losses whose property has all level sets intersecting in the interior of the simplex, the bound is tight: ccdim(ℓ) = n − 1. This does not apply to abstain. These results are from Ramaswamy et al. (2015).
Feasible Subspace Dimension. The feasible subspace dimension μ_𝒞(p) of a convex set 𝒞 at the point p ∈ 𝒞 is the dimension of ℱ_𝒞(p) ∩ −ℱ_𝒞(p), where ℱ_𝒞(p) is the cone of feasible directions of 𝒞 at p. Essentially: the dimension of the smallest face of 𝒞 containing p.
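A small worked example (mine, not from the slides), taking the probability simplex as the convex set:

\[
\text{For } \mathcal{C} = \Delta_{\mathcal{Y}} \text{ and } p \in \Delta_{\mathcal{Y}} \text{ with support } S = \{y : p_y > 0\}:\qquad
\mu_{\Delta_{\mathcal{Y}}}(p) = |S| - 1 .
\]

So a vertex (|S| = 1) has feasible subspace dimension 0, a point in the relative interior of an edge has dimension 1, and a point in the relative interior of the simplex has dimension n − 1; this matches "the dimension of the smallest face containing p".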
Feasible Subspace Dimension. [Figure slides: illustrations of the feasible subspace dimension.]