Fast differentiable sorting and ranking
  1. Fast differentiable sorting and ranking. M. Blondel, O. Teboul, Q. Berthet, J. Djolonga. March 12th, 2020

  2. Background · Proposed method · Experimental results

  3. Background · Proposed method · Experimental results

  4. DL as Differentiable Programming

  5. DL as Differentiable Programming. Deep learning is increasingly synonymous with differentiable programming: “People are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization. An increasingly large number of people are defining the networks procedurally in a data-dependent way (with loops and conditionals), allowing them to change dynamically as a function of the input data fed to them.” (Yann LeCun, 2018)

  6. DL as Differentiable Programming. Deep learning is increasingly synonymous with differentiable programming (see the LeCun quote on the previous slide). Yet many computer programming operations remain poorly differentiable. In this work, we focus on sorting and ranking.

  7. Sorting as a subroutine in ML. Sorting / ranking (O(n log n)) appears throughout ML:
 ● Trimmed k-NN regression: (1) select neighbours, (2) majority vote
 ● MoM estimators: ignore large errors
 ● Classifiers: select top-k activations
 ● Learning to rank: NDCG loss and others
 ● Descriptive statistics: empirical distribution function, rank-based statistics, quantile normalization (data viewed as ranks)
 Slide credit: Marco Cuturi

  8. Sorting. [Figure: input values θ1, θ4, θ2, θ3 shown as bars] Argsort (descending): σ(θ) = (2, 4, 3, 1)

  9. Sorting. [Figure: input values θ1, θ4, θ2, θ3 shown as bars] Argsort (descending): σ(θ) = (2, 4, 3, 1). Sort (descending): s(θ) ≜ θ_σ(θ)

  10. Sorting. [Figure: input values θ1, θ4, θ2, θ3 shown as bars] Argsort (descending): σ(θ) = (2, 4, 3, 1). Sort (descending): s(θ) ≜ θ_σ(θ) = (θ2, θ4, θ3, θ1)

  11. Sorting. [Figure: input values θ1, θ4, θ2, θ3 shown as bars] Argsort (descending): σ(θ) = (2, 4, 3, 1). Sort (descending): s(θ) ≜ θ_σ(θ) = (θ2, θ4, θ3, θ1). s is piecewise linear, which induces non-convexity.
  12. Ranking. [Figure: input values θ1, θ4, θ2, θ3 shown as bars] Ranks: r(θ) ≜ σ⁻¹(θ)

  13. Ranking. [Figure: input values θ1, θ4, θ2, θ3 shown as bars] Ranks: r(θ) ≜ σ⁻¹(θ) = (4, 1, 3, 2)

  14. Ranking. [Figure: input values θ1, θ4, θ2, θ3 shown as bars] Ranks: r(θ) ≜ σ⁻¹(θ) = (4, 1, 3, 2). r is discontinuous and piecewise constant.
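
Continuing the sketch above, the ranks are the inverse of the argsort permutation; θ = (1, 4, 2, 3) is again a hypothetical input:

```python
import numpy as np

theta = np.array([1.0, 4.0, 2.0, 3.0])
sigma = np.argsort(-theta)               # descending argsort

r = np.empty_like(sigma)
r[sigma] = np.arange(1, len(theta) + 1)  # invert the permutation
print(r)                                 # [4 1 3 2] -- piecewise constant, zero gradient a.e.
```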

  15. Related work on soft ranks. Soft ranks: differentiable proxies to “hard” ranks.

  16. Related work on soft ranks. Soft ranks: differentiable proxies to “hard” ranks. ● Random perturbation technique to compute expected ranks in O(n³) time [Taylor et al., 2008]

  17. Related work on soft ranks. Soft ranks: differentiable proxies to “hard” ranks. ● Random perturbation technique to compute expected ranks in O(n³) time [Taylor et al., 2008] ● Pairwise comparisons in O(n²) time [Qin et al., 2010]: rᵢ(θ) ≜ 1 + ∑_{j≠i} 1[θᵢ < θⱼ]

  18. Related work on soft ranks. Soft ranks: differentiable proxies to “hard” ranks. ● Random perturbation technique to compute expected ranks in O(n³) time [Taylor et al., 2008] ● Pairwise comparisons in O(n²) time [Qin et al., 2010]: rᵢ(θ) ≜ 1 + ∑_{j≠i} 1[θᵢ < θⱼ] ● Regularized optimal transport approach solved with Sinkhorn in O(Tn²) time [Cuturi et al., 2019]

  19. Related work on soft ranks. Soft ranks: differentiable proxies to “hard” ranks. ● Random perturbation technique to compute expected ranks in O(n³) time [Taylor et al., 2008] ● Pairwise comparisons in O(n²) time [Qin et al., 2010]: rᵢ(θ) ≜ 1 + ∑_{j≠i} 1[θᵢ < θⱼ] ● Regularized optimal transport approach solved with Sinkhorn in O(Tn²) time [Cuturi et al., 2019] None of these works achieves O(n log n) complexity.
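
The pairwise construction above is easy to sketch in NumPy. The smoothed indicator below (a sigmoid with temperature τ) is an illustrative choice, not necessarily the exact surrogate used by Qin et al. [2010]:

```python
import numpy as np

def pairwise_ranks(theta, tau=None):
    """Ranks from pairwise comparisons in O(n^2):
    r_i(theta) = 1 + sum_{j != i} 1[theta_i < theta_j].
    If tau is given, the indicator is replaced by a sigmoid with temperature tau
    (an illustrative smoothing), making the output differentiable in theta."""
    diff = theta[None, :] - theta[:, None]        # diff[i, j] = theta_j - theta_i
    if tau is None:
        comp = (diff > 0).astype(float)           # hard indicator 1[theta_i < theta_j]
    else:
        comp = 1.0 / (1.0 + np.exp(-diff / tau))  # soft indicator
    np.fill_diagonal(comp, 0.0)                   # exclude j = i
    return 1.0 + comp.sum(axis=1)

theta = np.array([1.0, 4.0, 2.0, 3.0])
print(pairwise_ranks(theta))           # [4. 1. 3. 2.] -- the hard ranks
print(pairwise_ranks(theta, tau=0.5))  # smooth approximation of the ranks
```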

  20. Background · Proposed method · Experimental results

  21. Our proposal

  22. Our proposal • Differentiable (soft) relaxations of s(θ) and r(θ)

  23. Our proposal • Differentiable (soft) relaxations of s(θ) and r(θ) • Two formulations: L2- and entropy-regularized

  24. Our proposal • Differentiable (soft) relaxations of s(θ) and r(θ) • Two formulations: L2- and entropy-regularized • “Convexification” effect

  25. Our proposal • Differentiable (soft) relaxations of s(θ) and r(θ) • Two formulations: L2- and entropy-regularized • “Convexification” effect • Exact computation in O(n log n) time (forward pass)

  26. Our proposal • Differentiable (soft) relaxations of s(θ) and r(θ) • Two formulations: L2- and entropy-regularized • “Convexification” effect • Exact computation in O(n log n) time (forward pass) • Exact multiplication with the Jacobian in O(n) time without unrolling (backward pass)

  27. Strategy outline

  28. Strategy outline 1. Express s(θ) and r(θ) as linear programs (LP) over convex polytopes

  29. Strategy outline 1. Express s(θ) and r(θ) as linear programs (LP) over convex polytopes → Turn algorithmic function into an optimization problem

  30. Strategy outline 1. Express s(θ) and r(θ) as linear programs (LP) over convex polytopes → Turn algorithmic function into an optimization problem 2. Introduce regularization in the LP

  31. Strategy outline 1. Express s(θ) and r(θ) as linear programs (LP) over convex polytopes → Turn algorithmic function into an optimization problem 2. Introduce regularization in the LP → Turn LP into a projection onto convex polytopes

  32. Strategy outline 1. Express s(θ) and r(θ) as linear programs (LP) over convex polytopes → Turn algorithmic function into an optimization problem 2. Introduce regularization in the LP → Turn LP into a projection onto convex polytopes 3. Derive algorithm for computing the projection

  33. Strategy outline 1. Express s(θ) and r(θ) as linear programs (LP) over convex polytopes → Turn algorithmic function into an optimization problem 2. Introduce regularization in the LP → Turn LP into a projection onto convex polytopes 3. Derive algorithm for computing the projection → Ideally, the projection should be computable at the same cost as the original function…

  34. Strategy outline 1. Express s(θ) and r(θ) as linear programs (LP) over convex polytopes → Turn algorithmic function into an optimization problem 2. Introduce regularization in the LP → Turn LP into a projection onto convex polytopes 3. Derive algorithm for computing the projection → Ideally, the projection should be computable at the same cost as the original function… 4. Derive algorithm for differentiating the projection

  35. Strategy outline 1. Express s(θ) and r(θ) as linear programs (LP) over convex polytopes → Turn algorithmic function into an optimization problem 2. Introduce regularization in the LP → Turn LP into a projection onto convex polytopes 3. Derive algorithm for computing the projection → Ideally, the projection should be computable at the same cost as the original function… 4. Derive algorithm for differentiating the projection → Could be challenging (argmin differentiation problem)
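
To make steps 1–3 concrete, here is a self-contained sketch of the L2-regularized soft rank: project −θ/ε onto the permutahedron generated by ρ = (n, …, 1), reducing the projection to isotonic regression solved by Pool Adjacent Violators (PAV). This follows the paper's construction in spirit, but the function names (isotonic_nonincreasing, soft_rank_l2) are our own and this is a simplified illustration, not the reference implementation:

```python
import numpy as np

def isotonic_nonincreasing(y):
    """min_v ||v - y||^2 s.t. v1 >= v2 >= ... >= vn, via PAV in O(n)."""
    blocks = []  # stack of (sum, length) for pooled blocks
    for x in y:
        s, c = float(x), 1
        # merge while block means violate the nonincreasing constraint
        while blocks and blocks[-1][0] / blocks[-1][1] < s / c:
            ps, pc = blocks.pop()
            s, c = s + ps, c + pc
        blocks.append((s, c))
    v, i = np.empty(len(y)), 0
    for s, c in blocks:
        v[i:i + c] = s / c  # each block takes its mean value
        i += c
    return v

def soft_rank_l2(theta, eps=1.0):
    """L2-regularized soft ranks: Euclidean projection of -theta/eps onto the
    permutahedron of rho = (n, ..., 1), computed via isotonic regression."""
    n = len(theta)
    rho = np.arange(n, 0, -1, dtype=float)
    z = -theta / eps
    sigma = np.argsort(-z)                     # sort z in descending order
    v = isotonic_nonincreasing(z[sigma] - rho)
    out = np.empty(n)
    out[sigma] = z[sigma] - v                  # undo the sort
    return out

theta = np.array([1.0, 4.0, 2.0, 3.0])
print(soft_rank_l2(theta, eps=0.01))  # ~[4. 1. 3. 2.], the hard ranks
print(soft_rank_l2(theta, eps=10.0))  # smoother, pulled toward the mean rank 2.5
```

As ε → 0 the projection lands on a vertex of the permutahedron (the hard ranks); as ε grows, pooled blocks appear and the soft ranks contract toward (n+1)/2, which is the "convexification" effect mentioned earlier.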

  36. Strategy outline: Cuturi et al. [2019] vs. this work

  37. Strategy outline: Cuturi et al. [2019] vs. this work.
 1. LP domain: Birkhoff polytope ℬ ⊂ ℝ^{n×n} vs. permutahedron 𝒬 ⊂ ℝⁿ
 [Figure: the permutahedron 𝒬 with vertices the permutations of (1, 2, 3), and the Birkhoff polytope ℬ with vertices the corresponding permutation matrices ϕ(π)]

  38. Strategy outline: Cuturi et al. [2019] vs. this work.
 1. LP domain: Birkhoff polytope ℬ ⊂ ℝ^{n×n} vs. permutahedron 𝒬 ⊂ ℝⁿ
 2. Regularization: entropy vs. L2 or entropy
 [Figure: the permutahedron 𝒬 with vertices the permutations of (1, 2, 3), and the Birkhoff polytope ℬ with vertices the corresponding permutation matrices ϕ(π)]

  39. Strategy outline: Cuturi et al. [2019] vs. this work.
 1. LP domain: Birkhoff polytope ℬ ⊂ ℝ^{n×n} vs. permutahedron 𝒬 ⊂ ℝⁿ
 2. Regularization: entropy vs. L2 or entropy
 3. Computation: Sinkhorn vs. Pool Adjacent Violators (PAV)
 [Figure: the permutahedron 𝒬 with vertices the permutations of (1, 2, 3), and the Birkhoff polytope ℬ with vertices the corresponding permutation matrices ϕ(π)]

  40. Strategy outline: Cuturi et al. [2019] vs. this work.
 1. LP domain: Birkhoff polytope ℬ ⊂ ℝ^{n×n} vs. permutahedron 𝒬 ⊂ ℝⁿ
 2. Regularization: entropy vs. L2 or entropy
 3. Computation: Sinkhorn vs. Pool Adjacent Violators (PAV)
 4. Differentiation: backprop through Sinkhorn iterates vs. differentiating the PAV solution
 [Figure: the permutahedron 𝒬 with vertices the permutations of (1, 2, 3), and the Birkhoff polytope ℬ with vertices the corresponding permutation matrices ϕ(π)]
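
Row 4 is what makes the backward pass cheap. The isotonic (PAV) solution is a block-wise mean of its input, so its Jacobian is block-diagonal with constant 1/|block| entries, and multiplying it by a vector costs O(n); the full soft-rank Jacobian then composes this with the sort permutation and the −1/ε scaling. A sketch, assuming the PAV routine has been modified to also return the pooled block lengths:

```python
import numpy as np

def isotonic_jvp(block_lengths, u):
    """Multiply the Jacobian of the isotonic (PAV) solution by a vector u in O(n).
    Within each pooled block the solution equals the block mean of the input, so
    each output coordinate's derivative is 1/len(block) w.r.t. inputs in its own
    block and 0 elsewhere."""
    out = np.empty_like(u, dtype=float)
    i = 0
    for c in block_lengths:
        out[i:i + c] = u[i:i + c].mean()  # average u within each pooled block
        i += c
    return out

# e.g. if PAV pooled a length-4 input into blocks of sizes [1, 3]:
print(isotonic_jvp([1, 3], np.array([1.0, 2.0, 3.0, 4.0])))  # [1. 3. 3. 3.]
```

No unrolling of the solver iterations is needed, unlike backprop through Sinkhorn.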
