Learning with Pairwise Losses: Problems, Algorithms and Analysis
Purushottam Kar, Microsoft Research India
E0 370: Statistical Learning Theory
Outline
• Part I: Introduction to pairwise loss functions
  • Example applications
• Part II: Batch learning with pairwise loss functions
  • Learning formulation: no algorithmic details
  • Generalization bounds
  • The coupling phenomenon
  • Decoupling techniques
• Part III: Online learning with pairwise loss functions
  • A generic online algorithm
  • Regret analysis
  • Online-to-batch conversion bounds
  • A decoupling technique for online-to-batch conversions
Part I: Introduction
What is a loss function?
A loss function is a map $\ell: \mathcal{H} \to \mathbb{R}_+$.
• We observe empirical losses on data $S = (x_1, \ldots, x_n)$, where $\ell_{x_i}(\cdot) = \ell(\cdot\,, x_i)$
• ... and try to minimize them (e.g. classification, regression): find $\hat{h}$ with $\hat{\mathcal{L}}_S(\hat{h}) = \inf_{h \in \mathcal{H}} \hat{\mathcal{L}}_S(h)$, where $\hat{\mathcal{L}}_S(h) = \frac{1}{n} \sum_i \ell_{x_i}(h)$
• ... in the hope that $\big\| \frac{1}{n} \sum_i \ell_{x_i}(\cdot) - \mathbb{E}\,\ell_x(\cdot) \big\|_\infty \le \epsilon$
• ... so that $\mathcal{L}(\hat{h}) \le \mathcal{L}(h^*) + 2\epsilon$, where $\mathcal{L}(h) = \mathbb{E}\,\ell_x(h)$
Metric Learning
• Penalize the metric for bringing blue and red points close
• The loss function needs to consider two points at a time!
• ... in other words, a pairwise loss function
• E.g.
$$\ell_{x_1,x_2}(M) = \begin{cases} 1, & y_1 \ne y_2 \text{ and } M(x_1, x_2) < \gamma_1 \\ 1, & y_1 = y_2 \text{ and } M(x_1, x_2) > \gamma_2 \\ 0, & \text{otherwise} \end{cases}$$
Pairwise Loss Functions
• Typically, loss functions are based on a ground truth: $\ell_x(h) = \ell(h(x), y(x))$
• Thus, for metric learning, loss functions look like $\ell_{x_1,x_2}(h) = \ell(h(x_1, x_2), y(x_1, x_2))$
• In the previous example, we had $h(x_1, x_2) = M(x_1, x_2)$ and $y(x_1, x_2) = y_1 y_2$
• Useful for learning patterns that capture data interactions
Pairwise Loss Functions
Examples ($\phi$ is any margin loss function, e.g. the hinge loss):
• Metric learning [Jin et al., NIPS '09]: $\ell_{x_1,x_2}(M) = \phi\big(y_1 y_2 (1 - M(x_1, x_2))\big)$
• Preference learning [Xing et al., NIPS '02]
• S-goodness [Balcan-Blum, ICML '06]: $\ell_{x_1,x_2}(K) = \phi\big(y_1 y_2\, K(x_1, x_2)\big)$
• Kernel-target alignment [Cortes et al., ICML '10]
• Bipartite ranking, (p)AUC [Narasimhan-Agarwal, ICML '13]: $\ell_{x_1,x_2}(f) = \phi\big((f(x_1) - f(x_2))(y_1 - y_2)\big)$
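To make the last example concrete, here is a minimal sketch (not from the slides) of the bipartite-ranking surrogate with $\phi$ the hinge loss and an assumed linear scorer $f(x) = \langle w, x \rangle$; the function name and toy data are purely illustrative.

```python
import numpy as np

def pairwise_hinge_loss(w, X, y):
    """Average of phi((f(x_i) - f(x_j)) * (y_i - y_j)) over ordered pairs i != j,
    with phi(u) = max(0, 1 - u) and a linear scorer f(x) = <w, x>."""
    scores = X @ w
    n = len(y)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            margin = (scores[i] - scores[j]) * (y[i] - y[j])
            total += max(0.0, 1.0 - margin)
            count += 1
    return total / count

# toy data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.sign(rng.normal(size=20))
w = rng.normal(size=5)
print(pairwise_hinge_loss(w, X, y))
```

A small sample keeps the double loop affordable here; for large samples one would subsample pairs, which is exactly the index-set choice $\Omega$ discussed in Part II.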
Learning Objectives in Pairwise Learning
• Given training data $x_1, x_2, \ldots, x_n$
• Learn $h: \mathcal{X} \times \mathcal{X} \to \mathcal{Y}$ such that $\mathcal{L}(\hat{h}) \le \mathcal{L}(h^*) + \epsilon$ (we will define $\mathcal{L}(\cdot)$ and $\hat{\mathcal{L}}(\cdot)$ shortly)
Challenges:
• Training data is given as singletons, not pairs
• Algorithmic efficiency
• Generalization error bounds
Part II: Batch Learning
Part II: Batch Learning
Batch Learning for Unary Losses
Training with Unary Loss Functions
• We need a notion of empirical loss $\hat{\mathcal{L}}: \mathcal{H} \to \mathbb{R}_+$
• Given training data $S = (x_1, \ldots, x_n)$, the natural notion is $\hat{\mathcal{L}}_S(\cdot) = \frac{1}{n} \sum_i \ell(\cdot\,, x_i)$
• Empirical risk minimization dictates that we find $\hat{h}$ s.t. $\hat{\mathcal{L}}_S(\hat{h}) \le \inf_{h \in \mathcal{H}} \hat{\mathcal{L}}_S(h)$
• Note that $\hat{\mathcal{L}}_S(\cdot)$ is a U-statistic
• U-statistic: a notion of "training loss" $\hat{\mathcal{L}}_S: \mathcal{H} \to \mathbb{R}_+$ s.t. $\forall h \in \mathcal{H},\ \mathbb{E}\big[\hat{\mathcal{L}}_S(h)\big] = \mathcal{L}(h)$
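As a quick illustration (a minimal sketch that is not part of the slides; the one-dimensional data, squared loss, and finite hypothesis grid are all assumptions), the unary empirical risk is just a sample average and ERM picks the hypothesis that minimizes it:

```python
import numpy as np

def empirical_risk(h, X, loss):
    """Unary empirical risk: the average of loss(h, x_i) over the sample."""
    return float(np.mean([loss(h, x) for x in X]))

# assumed toy setup: points on the real line, constant hypotheses, squared loss
loss = lambda h, x: (h - x) ** 2
X = np.random.default_rng(1).normal(loc=2.0, size=50)
candidates = np.linspace(-5.0, 5.0, 201)          # a small finite hypothesis class
h_hat = min(candidates, key=lambda h: empirical_risk(h, X, loss))
print(h_hat)    # for squared loss, ERM lands near the sample mean
```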
Generalization Bounds for Unary Loss Functions
• Step 1: Bound the excess risk by the supremum excess risk
$$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}_S(\hat{h}) \le \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big)$$
• Step 2: Apply McDiarmid's inequality ($\hat{\mathcal{L}}_S(h)$ is not perturbed much by changing any single $x_i$)
$$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}_S(\hat{h}) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big) + \mathcal{O}\!\left(\tfrac{1}{\sqrt{n}}\right)$$
• Step 3: Analyze the expected supremum excess risk
$$\mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big) = \mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathbb{E}\big[\hat{\mathcal{L}}_{\tilde{S}}(h)\big] - \hat{\mathcal{L}}_S(h) \big) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big( \hat{\mathcal{L}}_{\tilde{S}}(h) - \hat{\mathcal{L}}_S(h) \big) \quad \text{(Jensen's inequality)}$$
Analyzing the Expected Supremum Excess Risk
$$\mathbb{E} \sup_{h \in \mathcal{H}} \big( \hat{\mathcal{L}}_{\tilde{S}}(h) - \hat{\mathcal{L}}_S(h) \big)$$
• For unary losses, $\hat{\mathcal{L}}_S(\cdot) = \frac{1}{n} \sum_i \ell_{x_i}(\cdot)$
• Analyzing this term through symmetrization is easy:
$$\mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_i \big( \ell_{\tilde{x}_i}(h) - \ell_{x_i}(h) \big) \le 2\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_i \epsilon_i\, \ell_{x_i}(h) \le 2L\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_i \epsilon_i\, h(x_i) \approx \mathcal{O}\!\left(\tfrac{1}{\sqrt{n}}\right)$$
(here the $\epsilon_i$ are i.i.d. Rademacher signs and $L$ is the Lipschitz constant of the loss)
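The final quantity is the Rademacher complexity of $\mathcal{H}$. As a sanity check (a sketch of my own, not from the slides; the norm-bounded linear class and the closed-form inner supremum are assumptions), it can be estimated by Monte Carlo and seen to decay roughly like $1/\sqrt{n}$:

```python
import numpy as np

def rademacher_complexity(X, radius=1.0, trials=200, seed=0):
    """Monte-Carlo estimate of E_eps sup_{||w|| <= radius} (1/n) sum_i eps_i <w, x_i>.
    For a norm-bounded linear class the inner supremum has the closed form
    radius * || (1/n) sum_i eps_i x_i ||_2, so no optimization is needed."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = [radius * np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X / n)
            for _ in range(trials)]
    return float(np.mean(vals))

rng = np.random.default_rng(1)
for n in (100, 400, 1600):
    print(n, rademacher_complexity(rng.normal(size=(n, 10))))  # roughly halves as n quadruples
```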
Part II: Batch Learning
Batch Learning for Pairwise Loss Functions
Training with Pairwise Loss Functions
• Given training data $x_1, x_2, \ldots, x_n$, choose a U-statistic
• The U-statistic should use terms like $\ell_{x_i, x_j}(h)$ (the kernel)
• The population risk is defined as $\mathcal{L}(\cdot) = \mathbb{E}\, \ell_{x, x'}(\cdot)$
Examples:
• For any index set $\Omega \subset [n] \times [n]$, define $\hat{\mathcal{L}}_S(\cdot\,; \Omega) = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \ell_{x_i, x_j}(\cdot)$
• The choice $\Omega = \{(i, j) : i \ne j\}$ maximizes data utilization
• There are various ways of optimizing $\inf_{h \in \mathcal{H}} \hat{\mathcal{L}}_S(h)$ (e.g. SSG, stochastic subgradient methods)
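A minimal sketch of the all-pairs choice $\Omega = \{(i, j) : i \ne j\}$ (not from the slides; the Mahalanobis-style metric, the hinge surrogate, and the toy data are assumptions used only to make the U-statistic concrete):

```python
import numpy as np

def pairwise_empirical_risk(pair_loss, X, y):
    """U-statistic with kernel pair_loss and Omega = {(i, j) : i != j}:
    (1 / |Omega|) * sum over ordered pairs of pair_loss(x_i, y_i, x_j, y_j)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += pair_loss(X[i], y[i], X[j], y[j])
    return total / (n * (n - 1))

# illustrative kernel: hinge surrogate of the metric-learning loss from earlier,
# with a fixed Mahalanobis metric M(x, x') = (x - x')^T A (x - x')
A = np.eye(3)
def metric_pair_loss(x1, y1, x2, y2):
    d = x1 - x2
    return max(0.0, 1.0 - y1 * y2 * (1.0 - d @ A @ d))

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
y = np.sign(rng.normal(size=15))
print(pairwise_empirical_risk(metric_pair_loss, X, y))
```

Minimizing this quantity over a parameterized family of metrics (rather than the fixed `A` above) is what the stochastic subgradient approach mentioned on the slide would do, sampling pairs instead of enumerating all of them.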
Generalization Bounds for Pairwise Loss Functions
• Step 1: Bound the excess risk by the supremum excess risk
$$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}_S(\hat{h}) \le \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big)$$
• Step 2: Apply McDiarmid's inequality (check that $\hat{\mathcal{L}}_S(h)$ is not perturbed much by changing any single $x_i$)
$$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}_S(\hat{h}) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big) + \mathcal{O}\!\left(\tfrac{1}{\sqrt{n}}\right)$$
• Step 3: Analyze the expected supremum excess risk
$$\mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathcal{L}(h) - \hat{\mathcal{L}}_S(h) \big) = \mathbb{E} \sup_{h \in \mathcal{H}} \big( \mathbb{E}\big[\hat{\mathcal{L}}_{\tilde{S}}(h)\big] - \hat{\mathcal{L}}_S(h) \big) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big( \hat{\mathcal{L}}_{\tilde{S}}(h) - \hat{\mathcal{L}}_S(h) \big) \quad \text{(Jensen's inequality)}$$
Analyzing the Expected Supremum Excess Risk
$$\mathbb{E} \sup_{h \in \mathcal{H}} \big( \hat{\mathcal{L}}_{\tilde{S}}(h) - \hat{\mathcal{L}}_S(h) \big)$$
• For pairwise losses, $\hat{\mathcal{L}}_S(\cdot) = \frac{1}{n(n-1)} \sum_{i \ne j} \ell_{x_i, x_j}(\cdot)$
• Clean symmetrization is not possible due to coupling:
$$2\, \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \sum_{i \ne j} \big( \ell_{\tilde{x}_i, \tilde{x}_j}(h) - \ell_{x_i, x_j}(h) \big)$$
each point $x_i$ appears in many of the pair terms, so the summands are not independent and Rademacher variables cannot be introduced term by term
• Solutions [see Clémençon et al., Ann. Stat. '08]
  • Alternate representation of U-statistics
  • Hoeffding decomposition
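For reference, here is a sketch of the Hoeffding decomposition named on the slide, stated for a generic kernel $q(x, x') = \ell_{x, x'}(h)$ with mean $\theta = \mathbb{E}\, q(X, X')$ (the symbols $q_1$, $q_2$ are my notation, not the deck's):
$$q(x, x') = \theta + q_1(x) + q_1(x') + q_2(x, x'), \qquad q_1(x) = \mathbb{E}\big[ q(x, X') \big] - \theta$$
Substituting into the all-pairs U-statistic gives
$$\hat{\mathcal{L}}_S(h) - \mathcal{L}(h) = \frac{2}{n} \sum_{i=1}^{n} q_1(x_i) + \frac{1}{n(n-1)} \sum_{i \ne j} q_2(x_i, x_j)$$
The first term is an average of i.i.d. terms, so the unary machinery (symmetrization, Rademacher complexity) applies to it directly; the second, degenerate term has zero conditional mean given either argument and is of lower order.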
Part III: Online Learning
Part III: Online Learning
A Whirlwind Tour of Online Learning for Unary Losses
Model for Online Learning with Unary Losses
At each round $t$: propose a hypothesis $h_{t-1} \in \mathcal{H}$, receive a loss $\ell_t(\cdot) = \ell(x_t, \cdot)$, then update $h_{t-1} \to h_t$
• Regret: $\mathfrak{R}_T = \sum_t \ell_t(h_{t-1}) - \inf_{h \in \mathcal{H}} \sum_t \ell_t(h)$
Online Learning Algorithms
• Generalized Infinitesimal Gradient Ascent (GIGA) [Zinkevich '03]: $h_t = h_{t-1} - \eta_t \nabla_h \ell_t(h_{t-1})$
• Follow the Regularized Leader (FTRL) [Hazan et al. '06]: $h_t = \operatorname{argmin}_{h \in \mathcal{H}} \sum_{\tau=1}^{t-1} \ell_\tau(h) + \sigma_t \|h\|^2$
• Under some conditions, $\mathfrak{R}_T \le \mathcal{O}\big(\sqrt{T}\big)$
• Under stronger conditions (e.g. strongly convex losses), $\mathfrak{R}_T \le \mathcal{O}(\log T)$
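A minimal sketch of the GIGA / online gradient descent update (not from the slides; the squared loss, linear predictors, step size $\eta_t = 1/\sqrt{t}$, and the Euclidean-ball constraint are all assumptions chosen for illustration):

```python
import numpy as np

def online_gradient_descent(stream, dim, radius=1.0):
    """GIGA sketch for the squared loss of a linear predictor: at round t pay
    l_t(h) = (<h, x_t> - y_t)^2 on the current h, then take a gradient step
    h <- Proj_ball(h - eta_t * grad l_t(h)) with eta_t = 1/sqrt(t)."""
    h = np.zeros(dim)
    proposed, cum_loss = [h.copy()], 0.0
    for t, (x, y) in enumerate(stream, start=1):
        pred = h @ x
        cum_loss += (pred - y) ** 2              # loss incurred on an unseen point
        grad = 2.0 * (pred - y) * x              # gradient of the round-t loss
        h = h - grad / np.sqrt(t)                # GIGA step
        norm = np.linalg.norm(h)
        if norm > radius:                        # project back onto the ball H
            h *= radius / norm
        proposed.append(h.copy())
    return proposed, cum_loss

rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3])
stream = [(x, x @ w_true + 0.1 * rng.normal()) for x in rng.normal(size=(200, 2))]
proposed, cum_loss = online_gradient_descent(stream, dim=2)
print(cum_loss / len(stream))
```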
Online to Batch Conversion for Unary Losses
• Key insight: $h_{t-1}$ is evaluated on an unseen point [Cesa-Bianchi et al. '01]: $\mathbb{E}\big[ \ell_t(h_{t-1}) \mid \sigma(x_1, \ldots, x_{t-1}) \big] = \mathbb{E}\, \ell(h_{t-1}, x_t) = \mathcal{L}(h_{t-1})$
• Set up a martingale difference sequence: $V_t = \mathcal{L}(h_{t-1}) - \ell_t(h_{t-1})$, with $\mathbb{E}\big[ V_t \mid \sigma(x_1, \ldots, x_{t-1}) \big] = 0$
• Azuma-Hoeffding gives us $\sum_t \mathcal{L}(h_{t-1}) \le \sum_t \ell_t(h_{t-1}) + \mathcal{O}\big(\sqrt{T}\big)$ and $\sum_t \ell_t(h^*) \ge T \mathcal{L}(h^*) - \mathcal{O}\big(\sqrt{T}\big)$
• Together we get $\sum_t \mathcal{L}(h_{t-1}) - T \mathcal{L}(h^*) \le \mathfrak{R}_T + \mathcal{O}\big(\sqrt{T}\big)$
Online to Batch Conversion for Unary Losses
• Hypothesis selection: for convex loss functions, take $\hat{h} = \frac{1}{T} \sum_t h_{t-1}$, so that
$$\mathcal{L}(\hat{h}) \le \frac{1}{T} \sum_t \mathcal{L}(h_{t-1}) \le \mathcal{L}(h^*) + \frac{\mathfrak{R}_T}{T} + \mathcal{O}\!\left(\tfrac{1}{\sqrt{T}}\right)$$
• Hypothesis selection is more involved for non-convex losses
• Better results are possible [Kakade-Tewari '08]
  • Assume strongly convex loss functions: $\sum_t \mathcal{L}(h_{t-1}) \le T \mathcal{L}(h^*) + \mathfrak{R}_T + \mathcal{O}\big(\sqrt{\mathfrak{R}_T}\big)$
  • For $\mathfrak{R}_T = \mathcal{O}(\log T)$, this reduces to $\mathcal{L}(\hat{h}) \le \frac{1}{T} \sum_t \mathcal{L}(h_{t-1}) \le \mathcal{L}(h^*) + \mathcal{O}\!\left(\tfrac{\log T}{T}\right)$
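A compact end-to-end sketch of this conversion (again an assumed linear, squared-loss setup, not from the slides): run online gradient descent, average the proposed iterates, and compare the Monte-Carlo risk of the average against the best-in-class predictor.

```python
import numpy as np

rng = np.random.default_rng(2)
w_true = np.array([0.5, -0.3])

def make_stream(T):
    return [(x, x @ w_true + 0.1 * rng.normal()) for x in rng.normal(size=(T, 2))]

# online pass that records every proposed hypothesis h_{t-1}
h, proposed = np.zeros(2), []
for t, (x, y) in enumerate(make_stream(500), start=1):
    proposed.append(h.copy())
    h = h - (2.0 * (h @ x - y) * x) / np.sqrt(t)     # gradient step on round-t loss

h_bar = np.mean(proposed, axis=0)      # online-to-batch: average the iterates

def risk(h, sample):                   # Monte-Carlo estimate of the population risk
    return float(np.mean([(h @ x - y) ** 2 for x, y in sample]))

fresh = make_stream(2000)
print(risk(h_bar, fresh), risk(w_true, fresh))   # averaged iterate vs. best-in-class
```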
Part III: Online Learning
Online Learning for Pairwise Loss Functions
Model for Online Learning with Pairwise Losses
At each round $t$: propose a hypothesis $h_{t-1} \in \mathcal{H}$, receive a loss $\ell_t(\cdot) = \,?$, then update $h_{t-1} \to h_t$
• Regret: $\mathfrak{R}_T = \,?$
(points arrive one at a time, so the per-round pairwise loss and the notion of regret still need to be defined)