  1. Logarithmic Time Prediction. John Langford, Microsoft Research. DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms.

  2. The Multiclass Prediction Problem. Repeatedly: (1) see x; (2) predict ŷ ∈ {1, ..., K}; (3) see y.

  3. The Multiclass Prediction Problem. Repeatedly: (1) see x; (2) predict ŷ ∈ {1, ..., K}; (3) see y. Goal: find h(x) minimizing the error rate Pr_{(x,y)∼D}(h(x) ≠ y), with h(x) fast.
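
A minimal runnable sketch of this protocol in Python, with a deliberately dumb "always predict 1" rule standing in for h and x left abstract (here just a random number); the running count of mistakes estimates Pr_{(x,y)∼D}(h(x) ≠ y). Everything below is illustrative, not the talk's method.

    import random

    # Sketch of the repeated see-x / predict / see-y loop with a placeholder
    # predictor that ignores x.
    K = 4
    random.seed(0)
    stream = [(random.random(), random.randint(1, K)) for _ in range(1000)]

    mistakes = 0
    for x, y in stream:
        y_hat = 1                        # predict y_hat in {1, ..., K}
        mistakes += (y_hat != y)         # see y, record whether we erred
    print(mistakes / len(stream))        # empirical error rate, about 0.75 here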

  4. Why?

  5. Why?

  6. Trick #1: K is small

  7. Trick #2: A hierarchy exists

  8. Trick #2: A hierarchy exists. So use Trick #1 repeatedly.

  9. Trick #3: Shared representation

  10. Trick #3: Shared representation. Very helpful... but computation in the last layer can still blow up.

  11. Trick #4: “Structured Prediction”

  12. Trick #4: “Structured Prediction”. But what if the structure is unclear?

  13. Trick #5: GPU

  14. Trick #5: GPU. 4 teraflops is great... yet it still burns energy.

  15. How fast can we hope to go?

  16. How fast can we hope to go? Theorem: there exist multiclass classification problems where achieving a 0 error rate requires Ω(log K) time to train or test per example.

  17. How fast can we hope to go? Theorem: there exist multiclass classification problems where achieving a 0 error rate requires Ω(log K) time to train or test per example. Proof: by construction. Pick y ∼ U(1, ..., K).

  18. How fast can we hope to go? Theorem: there exist multiclass classification problems where achieving a 0 error rate requires Ω(log K) time to train or test per example. Proof: by construction. Pick y ∼ U(1, ..., K). Any prediction algorithm outputting fewer than log₂ K bits loses with constant probability. Any training algorithm that reads an example requires Ω(log₂ K) time.
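
Filling in the counting step (my paraphrase of the construction above): with y ∼ U(1, ..., K), an algorithm whose output is only b bits can name at most 2^b distinct labels, so every label outside that set is always predicted wrongly:

    Pr(ŷ ≠ y) ≥ 1 − 2^b / K ≥ 1/2   whenever b ≤ log₂ K − 1,

a constant loss probability. On the training side, merely reading the label of one example already costs Ω(log₂ K) time, since writing down an element of {1, ..., K} takes log₂ K bits.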

  19. Can we predict in time O(log₂ K)? [Plot: Computational Advantage of Log Time. Benefit K / log(K) versus K, on log-log axes, for K from 10 up to 1e+06.]
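
The "Benefit" curve on that plot is just K / log(K); a tiny Python script reproduces the plotted values (assuming base-2 logarithms, which the O(log₂ K) question suggests):

    import math

    # Speedup of an O(log K) predictor over an O(K) one at the K values on the axis.
    for K in (10, 100, 1000, 10_000, 100_000, 1_000_000):
        print(K, round(K / math.log2(K), 1))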

  20. Not it #1: Sparse Error Correcting Output Codes. (1) Create O(log K) binary vectors b_{iy} of length K.

  21. Not it #1: Sparse Error Correcting Output Codes. (1) Create O(log K) binary vectors b_{iy} of length K; (2) train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_{iy}).

  22. Not it #1: Sparse Error Correcting Output Codes. (1) Create O(log K) binary vectors b_{iy} of length K; (2) train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_{iy}); (3) predict by finding the y with minimal error.

  23. Not it #1: Sparse Error Correcting Output Codes. (1) Create O(log K) binary vectors b_{iy} of length K; (2) train O(log K) binary classifiers h_i to minimize the error rate Pr_{x,y}(h_i(x) ≠ b_{iy}); (3) predict by finding the y with minimal error. Prediction is Ω(K).
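
A sketch of where that Ω(K) comes from: decoding compares the O(log K) predicted bits against every label's code word. The code matrix below is a stand-in (just the binary encoding of y), not the sparse codes the slide refers to.

    import math

    # Stand-in code matrix: bit i of label y's code word.
    K = 16
    num_bits = math.ceil(math.log2(K))
    code = [[(y >> i) & 1 for i in range(num_bits)] for y in range(K)]

    def ecoc_predict(bit_predictions):
        """Pick the label whose code word disagrees least with the predicted bits.
        The loop over all K labels is what makes prediction Omega(K)."""
        best_y, best_dist = 0, num_bits + 1
        for y in range(K):                      # <-- Omega(K) scan
            dist = sum(b != c for b, c in zip(bit_predictions, code[y]))
            if dist < best_dist:
                best_y, best_dist = y, dist
        return best_y

    print(ecoc_predict(code[7]))  # exact bits decode back to label 7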

  24. Not it #2: Hierarchy Construction. (1) Build a confusion matrix of errors.

  25. Not it #2: Hierarchy Construction. (1) Build a confusion matrix of errors; (2) recursively partition it to create a hierarchy.

  26. Not it #2: Hierarchy Construction. (1) Build a confusion matrix of errors; (2) recursively partition it to create a hierarchy; (3) apply the hierarchy solution.

  27. Not it #2: Hierarchy Construction. (1) Build a confusion matrix of errors; (2) recursively partition it to create a hierarchy; (3) apply the hierarchy solution. Training is Ω(K) or worse.
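
A sketch of why this route isn't logarithmic: step 1 alone builds a K × K table, so training already touches every class (Ω(K) time and Ω(K²) space). The pre-trained classifier whose errors fill the matrix is simulated here with label noise.

    import random

    random.seed(0)
    K = 100
    # Toy data: (true label, noisy prediction from some earlier classifier).
    examples = [(y, y if random.random() < 0.7 else random.randrange(K))
                for y in (random.randrange(K) for _ in range(10_000))]

    # Step 1: confusion matrix of errors -- one row and column per class.
    confusion = [[0] * K for _ in range(K)]
    for y, y_hat in examples:
        confusion[y][y_hat] += 1
    # Step 2 would recursively partition the classes using `confusion`
    # (e.g. by graph cuts), which again touches all K classes.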

  28. Not it #3: Unnormalized learning. Train K regressors as follows. For each example (x, y): (1) train regressor y with (x, 1).

  29. Not it #3: Unnormalized learning. Train K regressors as follows. For each example (x, y): (1) train regressor y with (x, 1); (2) pick y′ ≠ y uniformly at random; (3) train regressor y′ with (x, −1).

  30. Not it #3: Unnormalized learning. Train K regressors as follows. For each example (x, y): (1) train regressor y with (x, 1); (2) pick y′ ≠ y uniformly at random; (3) train regressor y′ with (x, −1). Prediction is still Ω(K).
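
A sketch of the training rule and of where the remaining Ω(K) sits: each update touches only two regressors, but prediction still takes an argmax over all K of them. Linear regressors updated with a single squared-loss gradient step are my stand-in for "regressor y".

    import random

    random.seed(0)
    K, d, lr = 50, 20, 0.1
    w = [[0.0] * d for _ in range(K)]          # one linear regressor per class

    def score(y, x):
        return sum(wi * xi for wi, xi in zip(w[y], x))

    def train(x, y):
        # (1) regressor y gets target +1; (2)-(3) one random other regressor gets -1.
        y_neg = random.choice([k for k in range(K) if k != y])
        for label, target in ((y, 1.0), (y_neg, -1.0)):
            g = score(label, x) - target       # squared-loss gradient factor
            w[label] = [wi - lr * g * xi for wi, xi in zip(w[label], x)]

    def predict(x):
        # Omega(K): every regressor is evaluated before taking the argmax.
        return max(range(K), key=lambda y: score(y, x))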

  31. Can we predict in time O(log₂ K)?

  32. Is logarithmic time even possible? [Tree: root 1 v {2,3}, child 2 v 3, leaves 1, 2, 3; P(y=1) = .4, P(y=2) = .3, P(y=3) = .3.] P({2,3}) > P(1) ⇒ lose for divide and conquer.

  33. Filter Trees [BLR09]. [Same tree: root 1 v {2,3}, child 2 v 3; P(y=1) = .4, P(y=2) = .3, P(y=3) = .3.] (1) Learn 2 v 3 first; (2) throw away all examples the 2 v 3 classifier gets wrong; (3) learn 1 v survivors. Theorem: for all multiclass problems, for all binary classifiers, Multiclass Regret ≤ Average Binary Regret × log(K).
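
A runnable sketch of those three steps on the {1, 2, 3} example, with a deliberately weak majority-vote stub standing in for the binary learner; the essential part is step 2, where the root is trained only on examples its subtree classifies correctly.

    from collections import Counter

    class MajorityStub:
        """Stand-in 'binary classifier': ignores x and predicts the majority label."""
        def __init__(self, labels):
            self.label = Counter(labels).most_common(1)[0][0]
        def predict(self, x):
            return self.label

    def train_binary(pairs):                     # stand-in for a real binary learner
        return MajorityStub([y for _, y in pairs])

    data = [(0.1, 1), (0.4, 2), (0.9, 3), (0.5, 2), (0.8, 3), (0.2, 1)]

    # 1. Learn 2 v 3 at the leaf.
    leaf_23 = train_binary([(x, y) for x, y in data if y in (2, 3)])

    # 2. Throw away the examples the leaf gets wrong (label-1 examples unaffected).
    survivors = [(x, y) for x, y in data if y == 1 or leaf_23.predict(x) == y]

    # 3. Learn 1 v survivors at the root (output 1 means "predict label 1").
    root = train_binary([(x, 1 if y == 1 else 0) for x, y in survivors])

    def filter_tree_predict(x):
        # One root-to-leaf path per prediction: O(log K) binary evaluations.
        return 1 if root.predict(x) == 1 else leaf_23.predict(x)

    print([filter_tree_predict(x) for x, _ in data])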

  34. Can you make it robust? [Figure: a single-elimination tournament over labels 1–8 selecting one winner.]

  35. Can you make it robust? [Figure: repeated tournaments over labels 1–8 producing several winners.]

  36. Can you make it robust? [Figure: repeated tournaments over labels 1–8 producing several winners.]

  37. Can you make it robust? [Figure: repeated tournaments over labels 1–8.] Theorem [BLR09]: for all multiclass problems, for all binary classifiers, a log(K)-correcting tournament satisfies Multiclass Regret ≤ Average Binary Regret × 5.5. This mechanism determined the best paper prize for ICML 2012 (from the area chairs' decisions).

  38. How do you learn structure? Not all partitions are equally difficult. Compare {1, 7} v {3, 8} to {1, 8} v {3, 7}. Which is better?

  39. How do you learn structure? Not all partitions are equally difficult. Compare {1, 7} v {3, 8} to {1, 8} v {3, 7}. Which is better? [BWG10]: better to confuse near the leaves than near the root. Intuition: the root predictor tends to be overconstrained, while the predictors nearer the leaves are less constrained.

  40. The Partitioning Problem [CL14]. Given a set of n examples, each with one of K labels, find a partitioner h that maximizes E_{x,y} | Pr(h(x) = 1, y) − Pr(h(x) = 1) Pr(y) |.

  41. The Partitioning Problem [CL14]. Given a set of n examples, each with one of K labels, find a partitioner h that maximizes Σ_y Pr(y) | Pr_x(h(x) = 1 | x ∈ X_y) − Pr_x(h(x) = 1) |, where X_y is the set of x associated with y.

  42. The Partitioning Problem [CL14]. Given a set of n examples, each with one of K labels, find a partitioner h that maximizes the objective above. Nonconvex for any symmetric hypothesis class (ouch).
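
A brute-force evaluator for the slide-41 form of the objective on a finite sample; this only scores a given candidate h, it says nothing about how to search the nonconvex space.

    from collections import defaultdict

    def partition_objective(examples, h):
        """Empirical Sum_y Pr(y) * | Pr(h(x)=1 | y) - Pr(h(x)=1) | for h: x -> {0, 1}."""
        n = len(examples)
        p1 = sum(h(x) == 1 for x, _ in examples) / n          # Pr(h(x) = 1)
        by_label = defaultdict(list)
        for x, y in examples:
            by_label[y].append(x)
        return sum(len(xs) / n *                              # Pr(y)
                   abs(sum(h(x) == 1 for x in xs) / len(xs) - p1)
                   for xs in by_label.values())

    # Toy check: a threshold that separates the two labels scores higher
    # than a partitioner that ignores x entirely.
    data = [(0.1, 'a'), (0.2, 'a'), (0.8, 'b'), (0.9, 'b')]
    print(partition_objective(data, lambda x: int(x > 0.5)))   # 0.5
    print(partition_objective(data, lambda x: 1))              # 0.0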

  43. Bottom Up doesn’t work. [Figure: three classes 1, 2, 3 placed left to right along a line.] Suppose you use linear representations.

  44. Bottom Up doesn’t work. [Figure: three classes 1, 2, 3 placed left to right along a line.] Suppose you use linear representations. Suppose you first build a 1 v 3 predictor.

  45. Bottom Up doesn’t work. [Figure: three classes 1, 2, 3 placed left to right along a line.] Suppose you use linear representations. Suppose you first build a 1 v 3 predictor. Suppose you then build a 2 v {1 v 3} predictor. You lose.

  46. Does partitioning recurse well? Theorem: if at every node n, E_{x,y} | Pr(h(x) = 1, y) − Pr(h(x) = 1) Pr(y) | > γ, then after (1/ε)^{4(1−γ)² ln k / γ²} splits the multiclass error is less than ε.

  47. Online Partitioning. Relax the optimization criterion to E_{x,y} | E_{x|y}[ŷ(x)] − E_x[ŷ(x)] | ... and approximate it with running averages.

  48. Online Partitioning. Relax the optimization criterion to E_{x,y} | E_{x|y}[ŷ(x)] − E_x[ŷ(x)] | ... and approximate it with running averages. Let e = 0 and, for all y, e_y = 0, n_y = 0. For each example (x, y): (1) if e_y < e then b = −1 else b = 1; (2) update w using (x, b); (3) n_y ← n_y + 1; (4) e_y ← ((n_y − 1) e_y + ŷ(x)) / n_y; (5) e ← ((t − 1) e + ŷ(x)) / t. Apply recursively to construct a tree structure.
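
A sketch of a single node's update loop in Python. Assumptions not in the slide: ŷ(x) is the output of a linear scorer w, and "update w using (x, b)" is one squared-loss gradient step; any online binary learner could be substituted.

    from collections import defaultdict

    class PartitionNode:
        def __init__(self, d, lr=0.1):
            self.w = [0.0] * d                    # linear scorer (an assumption)
            self.lr = lr
            self.e = 0.0                          # running mean of y_hat over all x
            self.e_y = defaultdict(float)         # running mean of y_hat given label y
            self.n_y = defaultdict(int)
            self.t = 0

        def y_hat(self, x):
            return sum(wi * xi for wi, xi in zip(self.w, x))

        def update(self, x, y):
            # (1) Send label y toward the side that widens its |e_y - e| gap.
            b = -1.0 if self.e_y[y] < self.e else 1.0
            # (2) One binary-learner step toward target b (here: squared loss).
            g = self.y_hat(x) - b
            self.w = [wi - self.lr * g * xi for wi, xi in zip(self.w, x)]
            # (3)-(5) Update the running averages with the current score.
            self.n_y[y] += 1
            self.t += 1
            s = self.y_hat(x)
            self.e_y[y] += (s - self.e_y[y]) / self.n_y[y]
            self.e += (s - self.e) / self.t
            return b

To grow the tree, one natural rule (again an assumption, not spelled out on the slide) routes each example to the left or right child by the sign of ŷ(x) and applies the same node logic recursively, matching the slide's "apply recursively to construct a tree structure".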

  49. Accuracy for a fixed training time: LOMtree vs one-against-all. [Plot: accuracy on a log scale (0.001 to 1) versus number of classes, for isolet (26), sector (105), aloi (1000), imagenet (21841), and ODP (105033); LOMtree and OAA curves.]

  50. Test Error %, optimized, no train-time constraint. [Bar chart: performance of log-time algorithms; Test Error % (0 to 100) for Rand, Filter, and LOM on Isolet, Sector, Aloi, Imagenet, and ODP.]

  51. Test Error %, optimized, no train-time constraint. [Bar chart: compared to OAA; Test Error % (0 to 100) for Rand, Filter, LOM, and OAA on Isolet, Sector, Aloi, Imagenet, and ODP.]

  52. Classes vs test time ratio: LOMtree vs one-against-all. [Plot: log₂(time ratio) from 2 to 12 versus log₂(number of classes) from 6 to 16.]

  53. Can we predict in time O(log₂ K)?

  54. Can we predict in time O(log₂ K)? What is the right way to achieve consistency and a dynamic partition?

  55. Can we predict in time O(log₂ K)? What is the right way to achieve consistency and a dynamic partition? How can you balance representation complexity and sample complexity?

  56. Bibliography. [BLR09] Alina Beygelzimer, John Langford, Pradeep Ravikumar. Error-Correcting Tournaments. http://arxiv.org/abs/0902.3176. [BWG10] Samy Bengio, Jason Weston, David Grangier. Label Embedding Trees for Large Multi-Class Tasks. NIPS 2010. [CL14] Anna Choromanska, John Langford. Logarithmic Time Online Multiclass Prediction. http://arxiv.org/abs/1406.1822.
