Learning with Pairwise Losses: Problems, Algorithms and Analysis
Purushottam Kar, Microsoft Research India
E0 370: Statistical Learning Theory
Outline
• Part I: Introduction to pairwise loss functions
  • Example applications
• Part II: Batch learning with pairwise loss functions
  • Learning formulation: no algorithmic details
  • Generalization bounds
  • The coupling phenomenon
  • Decoupling techniques
• Part III: Online learning with pairwise loss functions
  • A generic online algorithm
  • Regret analysis
  • Online-to-batch conversion bounds
  • A decoupling technique for online-to-batch conversions
Part I: Introduction
What is a loss function?
A loss function is a map $\ell: \mathcal{H} \to \mathbb{R}_+$.
• We observe empirical losses on data $S = (x_1, \ldots, x_n)$, where $\ell_{x_i}(\cdot) = \ell(\cdot\,, x_i)$
• ... and try to minimize them (e.g. classification, regression): find $\hat{h}$ with $\hat{\mathcal{L}}_S(\hat{h}) = \inf_{h \in \mathcal{H}} \hat{\mathcal{L}}_S(h)$, where $\hat{\mathcal{L}}_S(h) = \frac{1}{n} \sum_i \ell_{x_i}(h)$
• ... in the hope that $\big\| \frac{1}{n} \sum_i \ell_{x_i}(\cdot) - \mathbb{E}\,\ell_x(\cdot) \big\|_\infty \le \epsilon$
• ... so that $\mathcal{L}(\hat{h}) \le \mathcal{L}(h^*) + 2\epsilon$, where $\mathcal{L}(h) = \mathbb{E}\,\ell_x(h)$
Metric Learning
• Penalize the metric for bringing blue and red points close
• The loss function needs to consider two points at a time!
• ... in other words, a pairwise loss function
• E.g.
$$\ell_{x_1,x_2}(M) = \begin{cases} 1, & y_1 \ne y_2 \text{ and } M(x_1, x_2) < \gamma_1 \\ 1, & y_1 = y_2 \text{ and } M(x_1, x_2) > \gamma_2 \\ 0, & \text{otherwise} \end{cases}$$
Pairwise Loss Functions
• Typically, loss functions are based on a ground truth: $\ell_x(h) = \ell(h(x), y(x))$
• Thus, for metric learning, loss functions look like $\ell_{x_1,x_2}(h) = \ell(h(x_1, x_2), y(x_1, x_2))$
• In the previous example, we had $h(x_1, x_2) = M(x_1, x_2)$ and $y(x_1, x_2) = y_1 y_2$
• Useful for learning patterns that capture data interactions
Pairwise Loss Functions
Examples ($\phi$ is any margin loss function, e.g. the hinge loss):
• Metric learning [Jin et al., NIPS '09]: $\ell_{x_1,x_2}(M) = \phi\big(y_1 y_2 (1 - M(x_1, x_2))\big)$
• Preference learning [Xing et al., NIPS '02]
• S-goodness [Balcan-Blum, ICML '06]: $\ell_{x_1,x_2}(K) = \phi\big(y_1 y_2\, K(x_1, x_2)\big)$
• Kernel-target alignment [Cortes et al., ICML '10]
• Bipartite ranking, (p)AUC [Narasimhan-Agarwal, ICML '13]: $\ell_{x_1,x_2}(f) = \phi\big((f(x_1) - f(x_2))(y_1 - y_2)\big)$
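To make the last example concrete, here is a minimal sketch (not from the slides) of the bipartite-ranking surrogate with $\phi$ the hinge loss and an assumed linear scorer $f(x) = \langle w, x \rangle$; the function name and toy data are purely illustrative.

```python
import numpy as np

def pairwise_hinge_loss(w, X, y):
    """Average of phi((f(x_i) - f(x_j)) * (y_i - y_j)) over ordered pairs i != j,
    with phi(u) = max(0, 1 - u) and a linear scorer f(x) = <w, x>."""
    scores = X @ w
    n = len(y)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            margin = (scores[i] - scores[j]) * (y[i] - y[j])
            total += max(0.0, 1.0 - margin)
            count += 1
    return total / count

# toy data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.sign(rng.normal(size=20))
w = rng.normal(size=5)
print(pairwise_hinge_loss(w, X, y))
```

A small sample keeps the double loop affordable here; for large samples one would subsample pairs, which is exactly the index-set choice $\Omega$ discussed in Part II.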
Learning Objectives in Pairwise Learning
• Given training data $x_1, x_2, \ldots, x_n$
• Learn $h: \mathcal{X} \times \mathcal{X} \to \mathcal{Y}$ such that $\mathcal{L}(\hat{h}) \le \mathcal{L}(h^*) + \epsilon$ (we will define $\mathcal{L}(\cdot)$ and $\hat{\mathcal{L}}(\cdot)$ shortly)
Challenges:
• Training data is given as singletons, not pairs
• Algorithmic efficiency
• Generalization error bounds
Part II: Batch Learning
Part II: Batch Learning
Batch Learning for Unary Losses
Training with Unary Loss Functions
• We need a notion of empirical loss $\hat{\mathcal{L}}: \mathcal{H} \to \mathbb{R}_+$
• Given training data $S = (x_1, \ldots, x_n)$, the natural notion is $\hat{\mathcal{L}}_S(\cdot) = \frac{1}{n} \sum_i \ell(\cdot\,, x_i)$
• Empirical risk minimization dictates that we find $\hat{h}$ s.t. $\hat{\mathcal{L}}_S(\hat{h}) \le \inf_{h \in \mathcal{H}} \hat{\mathcal{L}}_S(h)$
• Note that $\hat{\mathcal{L}}_S(\cdot)$ is a U-statistic
• U-statistic: a notion of "training loss" $\hat{\mathcal{L}}_S: \mathcal{H} \to \mathbb{R}_+$ s.t. $\forall h \in \mathcal{H},\ \mathbb{E}\big[\hat{\mathcal{L}}_S(h)\big] = \mathcal{L}(h)$
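As a quick illustration (a minimal sketch that is not part of the slides; the one-dimensional data, squared loss, and finite hypothesis grid are all assumptions), the unary empirical risk is just a sample average and ERM picks the hypothesis that minimizes it:

```python
import numpy as np

def empirical_risk(h, X, loss):
    """Unary empirical risk: the average of loss(h, x_i) over the sample."""
    return float(np.mean([loss(h, x) for x in X]))

# assumed toy setup: points on the real line, constant hypotheses, squared loss
loss = lambda h, x: (h - x) ** 2
X = np.random.default_rng(1).normal(loc=2.0, size=50)
candidates = np.linspace(-5.0, 5.0, 201)          # a small finite hypothesis class
h_hat = min(candidates, key=lambda h: empirical_risk(h, X, loss))
print(h_hat)    # for squared loss, ERM lands near the sample mean
```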
Generalization Bounds for Unary Loss Functions
• Step 1: Bound the excess risk by the supremum excess risk
$$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}_S(\hat{h}) \le \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big)$$
• Step 2: Apply McDiarmid's inequality ($\hat{\mathcal{L}}_S(h)$ is not perturbed much by changing any single $x_i$)
$$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}_S(\hat{h}) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big) + \mathcal{O}\!\left(\tfrac{1}{\sqrt{n}}\right)$$
• Step 3: Analyze the expected supremum excess risk
$$\mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big) = \mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathbb{E}\big[\hat{\mathcal{L}}_{\tilde{S}}(h)\big] - \hat{\mathcal{L}}_S(h) \big) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big( \hat{\mathcal{L}}_{\tilde{S}}(h) - \hat{\mathcal{L}}_S(h) \big) \quad \text{(Jensen's inequality)}$$
Analyzing the Expected Supremum Excess Risk
$$\mathbb{E} \sup_{h \in \mathcal{H}} \big( \hat{\mathcal{L}}_{\tilde{S}}(h) - \hat{\mathcal{L}}_S(h) \big)$$
• For unary losses, $\hat{\mathcal{L}}_S(\cdot) = \frac{1}{n} \sum_i \ell_{x_i}(\cdot)$
• Analyzing this term through symmetrization is easy:
$$\mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_i \big( \ell_{\tilde{x}_i}(h) - \ell_{x_i}(h) \big) \le 2\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_i \epsilon_i\, \ell_{x_i}(h) \le 2L\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_i \epsilon_i\, h(x_i) \approx \mathcal{O}\!\left(\tfrac{1}{\sqrt{n}}\right)$$
(here the $\epsilon_i$ are i.i.d. Rademacher signs and $L$ is the Lipschitz constant of the loss)
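The final quantity is the Rademacher complexity of $\mathcal{H}$. As a sanity check (a sketch of my own, not from the slides; the norm-bounded linear class and the closed-form inner supremum are assumptions), it can be estimated by Monte Carlo and seen to decay roughly like $1/\sqrt{n}$:

```python
import numpy as np

def rademacher_complexity(X, radius=1.0, trials=200, seed=0):
    """Monte-Carlo estimate of E_eps sup_{||w|| <= radius} (1/n) sum_i eps_i <w, x_i>.
    For a norm-bounded linear class the inner supremum has the closed form
    radius * || (1/n) sum_i eps_i x_i ||_2, so no optimization is needed."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = [radius * np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X / n)
            for _ in range(trials)]
    return float(np.mean(vals))

rng = np.random.default_rng(1)
for n in (100, 400, 1600):
    print(n, rademacher_complexity(rng.normal(size=(n, 10))))  # roughly halves as n quadruples
```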
Part II: Batch Learning
Batch Learning for Pairwise Loss Functions
Training with Pairwise Loss Functions
• Given training data $x_1, x_2, \ldots, x_n$, choose a U-statistic
• The U-statistic should use terms like $\ell_{x_i, x_j}(h)$ (the kernel)
• The population risk is defined as $\mathcal{L}(\cdot) = \mathbb{E}\, \ell_{x, x'}(\cdot)$
Examples:
• For any index set $\Omega \subset [n] \times [n]$, define $\hat{\mathcal{L}}_S(\cdot\,; \Omega) = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \ell_{x_i, x_j}(\cdot)$
• The choice $\Omega = \{(i, j) : i \ne j\}$ maximizes data utilization
• There are various ways of optimizing $\inf_{h \in \mathcal{H}} \hat{\mathcal{L}}_S(h)$ (e.g. SSG, stochastic subgradient methods)
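A minimal sketch of the all-pairs choice $\Omega = \{(i, j) : i \ne j\}$ (not from the slides; the Mahalanobis-style metric, the hinge surrogate, and the toy data are assumptions used only to make the U-statistic concrete):

```python
import numpy as np

def pairwise_empirical_risk(pair_loss, X, y):
    """U-statistic with kernel pair_loss and Omega = {(i, j) : i != j}:
    (1 / |Omega|) * sum over ordered pairs of pair_loss(x_i, y_i, x_j, y_j)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += pair_loss(X[i], y[i], X[j], y[j])
    return total / (n * (n - 1))

# illustrative kernel: hinge surrogate of the metric-learning loss from earlier,
# with a fixed Mahalanobis metric M(x, x') = (x - x')^T A (x - x')
A = np.eye(3)
def metric_pair_loss(x1, y1, x2, y2):
    d = x1 - x2
    return max(0.0, 1.0 - y1 * y2 * (1.0 - d @ A @ d))

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
y = np.sign(rng.normal(size=15))
print(pairwise_empirical_risk(metric_pair_loss, X, y))
```

Minimizing this quantity over a parameterized family of metrics (rather than the fixed `A` above) is what the stochastic subgradient approach mentioned on the slide would do, sampling pairs instead of enumerating all of them.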
Generalization Bounds for Pairwise Loss Functions
• Step 1: Bound the excess risk by the supremum excess risk
$$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}_S(\hat{h}) \le \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big)$$
• Step 2: Apply McDiarmid's inequality (check that $\hat{\mathcal{L}}_S(h)$ is not perturbed much by changing any single $x_i$)
$$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}_S(\hat{h}) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big) + \mathcal{O}\!\left(\tfrac{1}{\sqrt{n}}\right)$$
• Step 3: Analyze the expected supremum excess risk
$$\mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big) = \mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathbb{E}\big[\hat{\mathcal{L}}_{\tilde{S}}(h)\big] - \hat{\mathcal{L}}_S(h) \big) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big( \hat{\mathcal{L}}_{\tilde{S}}(h) - \hat{\mathcal{L}}_S(h) \big) \quad \text{(Jensen's inequality)}$$
Analyzing the Expected Supremum Excess Risk
$$\mathbb{E} \sup_{h \in \mathcal{H}} \big( \hat{\mathcal{L}}_{\tilde{S}}(h) - \hat{\mathcal{L}}_S(h) \big)$$
• For pairwise losses, $\hat{\mathcal{L}}_S(\cdot) = \frac{1}{n(n-1)} \sum_{i \ne j} \ell_{x_i, x_j}(\cdot)$
• Clean symmetrization is not possible due to coupling:
$$2\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \sum_{i \ne j} \big( \ell_{\tilde{x}_i, \tilde{x}_j}(h) - \ell_{x_i, x_j}(h) \big)$$
each point $x_i$ appears in many of the pair terms, so the summands are not independent and Rademacher variables cannot be introduced term by term
• Solutions [see Clémençon et al., Ann. Stat. '08]
  • Alternate representation of U-statistics
  • Hoeffding decomposition
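For reference, here is a sketch of the Hoeffding decomposition named on the slide, stated for a generic kernel $q(x, x') = \ell_{x, x'}(h)$ with mean $\theta = \mathbb{E}\, q(X, X')$ (the symbols $q_1$, $q_2$ are my notation, not the deck's):
$$q(x, x') = \theta + q_1(x) + q_1(x') + q_2(x, x'), \qquad q_1(x) = \mathbb{E}\big[ q(x, X') \big] - \theta$$
Substituting into the all-pairs U-statistic gives
$$\hat{\mathcal{L}}_S(h) - \mathcal{L}(h) = \frac{2}{n} \sum_{i=1}^{n} q_1(x_i) + \frac{1}{n(n-1)} \sum_{i \ne j} q_2(x_i, x_j)$$
The first term is an average of i.i.d. terms, so the unary machinery (symmetrization, Rademacher complexity) applies to it directly; the second, degenerate term has zero conditional mean given either argument and is of lower order.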
Part III: Online Learning
Part III: Online Learning
A Whirlwind Tour of Online Learning for Unary Losses
Model for Online Learning with Unary Losses
At each round $t$: propose a hypothesis $h_{t-1} \in \mathcal{H}$, receive a loss $\ell_t(\cdot) = \ell(x_t, \cdot)$, then update $h_{t-1} \to h_t$
• Regret: $\mathfrak{R}_T = \sum_t \ell_t(h_{t-1}) - \inf_{h \in \mathcal{H}} \sum_t \ell_t(h)$
Online Learning Algorithms
• Generalized Infinitesimal Gradient Ascent (GIGA) [Zinkevich '03]: $h_t = h_{t-1} - \eta_t \nabla_h \ell_t(h_{t-1})$
• Follow the Regularized Leader (FTRL) [Hazan et al. '06]: $h_t = \operatorname{argmin}_{h \in \mathcal{H}} \sum_{\tau=1}^{t-1} \ell_\tau(h) + \sigma_t \|h\|^2$
• Under some conditions, $\mathfrak{R}_T \le \mathcal{O}\big(\sqrt{T}\big)$
• Under stronger conditions (e.g. strongly convex losses), $\mathfrak{R}_T \le \mathcal{O}(\log T)$
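A minimal sketch of the GIGA / online gradient descent update (not from the slides; the squared loss, linear predictors, step size $\eta_t = 1/\sqrt{t}$, and the Euclidean-ball constraint are all assumptions chosen for illustration):

```python
import numpy as np

def online_gradient_descent(stream, dim, radius=1.0):
    """GIGA sketch for the squared loss of a linear predictor: at round t pay
    l_t(h) = (<h, x_t> - y_t)^2 on the current h, then take a gradient step
    h <- Proj_ball(h - eta_t * grad l_t(h)) with eta_t = 1/sqrt(t)."""
    h = np.zeros(dim)
    proposed, cum_loss = [h.copy()], 0.0
    for t, (x, y) in enumerate(stream, start=1):
        pred = h @ x
        cum_loss += (pred - y) ** 2              # loss incurred on an unseen point
        grad = 2.0 * (pred - y) * x              # gradient of the round-t loss
        h = h - grad / np.sqrt(t)                # GIGA step
        norm = np.linalg.norm(h)
        if norm > radius:                        # project back onto the ball H
            h *= radius / norm
        proposed.append(h.copy())
    return proposed, cum_loss

rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3])
stream = [(x, x @ w_true + 0.1 * rng.normal()) for x in rng.normal(size=(200, 2))]
proposed, cum_loss = online_gradient_descent(stream, dim=2)
print(cum_loss / len(stream))
```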
Online to Batch Conversion for Unary Losses
• Key insight: $h_{t-1}$ is evaluated on an unseen point [Cesa-Bianchi et al. '01]: $\mathbb{E}\big[ \ell_t(h_{t-1}) \mid \sigma(x_1, \ldots, x_{t-1}) \big] = \mathbb{E}\, \ell(h_{t-1}, x_t) = \mathcal{L}(h_{t-1})$
• Set up a martingale difference sequence: $V_t = \mathcal{L}(h_{t-1}) - \ell_t(h_{t-1})$, with $\mathbb{E}\big[ V_t \mid \sigma(x_1, \ldots, x_{t-1}) \big] = 0$
• Azuma-Hoeffding gives us $\sum_t \mathcal{L}(h_{t-1}) \le \sum_t \ell_t(h_{t-1}) + \mathcal{O}\big(\sqrt{T}\big)$ and $\sum_t \ell_t(h^*) \ge T \mathcal{L}(h^*) - \mathcal{O}\big(\sqrt{T}\big)$
• Together we get $\sum_t \mathcal{L}(h_{t-1}) - T \mathcal{L}(h^*) \le \mathfrak{R}_T + \mathcal{O}\big(\sqrt{T}\big)$
Online to Batch Conversion for Unary Losses
• Hypothesis selection: for convex loss functions, take $\hat{h} = \frac{1}{T} \sum_t h_{t-1}$, so that
$$\mathcal{L}(\hat{h}) \le \frac{1}{T} \sum_t \mathcal{L}(h_{t-1}) \le \mathcal{L}(h^*) + \frac{\mathfrak{R}_T}{T} + \mathcal{O}\!\left(\tfrac{1}{\sqrt{T}}\right)$$
• Hypothesis selection is more involved for non-convex losses
• Better results are possible [Kakade-Tewari '08]
  • Assume strongly convex loss functions: $\sum_t \mathcal{L}(h_{t-1}) \le T \mathcal{L}(h^*) + \mathfrak{R}_T + \mathcal{O}\big(\sqrt{\mathfrak{R}_T}\big)$
  • For $\mathfrak{R}_T = \mathcal{O}(\log T)$, this reduces to $\mathcal{L}(\hat{h}) \le \frac{1}{T} \sum_t \mathcal{L}(h_{t-1}) \le \mathcal{L}(h^*) + \mathcal{O}\!\left(\tfrac{\log T}{T}\right)$
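A compact end-to-end sketch of this conversion (again an assumed linear, squared-loss setup, not from the slides): run online gradient descent, average the proposed iterates, and compare the Monte-Carlo risk of the average against the best-in-class predictor.

```python
import numpy as np

rng = np.random.default_rng(2)
w_true = np.array([0.5, -0.3])

def make_stream(T):
    return [(x, x @ w_true + 0.1 * rng.normal()) for x in rng.normal(size=(T, 2))]

# online pass that records every proposed hypothesis h_{t-1}
h, proposed = np.zeros(2), []
for t, (x, y) in enumerate(make_stream(500), start=1):
    proposed.append(h.copy())
    h = h - (2.0 * (h @ x - y) * x) / np.sqrt(t)     # gradient step on round-t loss

h_bar = np.mean(proposed, axis=0)      # online-to-batch: average the iterates

def risk(h, sample):                   # Monte-Carlo estimate of the population risk
    return float(np.mean([(h @ x - y) ** 2 for x, y in sample]))

fresh = make_stream(2000)
print(risk(h_bar, fresh), risk(w_true, fresh))   # averaged iterate vs. best-in-class
```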
Part III: Online Learning
Online Learning for Pairwise Loss Functions
Model for Online Learning with Pairwise Losses
At each round $t$: propose a hypothesis $h_{t-1} \in \mathcal{H}$, receive a loss $\ell_t(\cdot) = \,?$, then update $h_{t-1} \to h_t$
• Regret: $\mathfrak{R}_T = \,?$
(points arrive one at a time, so the per-round pairwise loss and the notion of regret still need to be defined)