Isotonic regression

Definition
Given data {(x_t, y_t)}_{t=1}^T ⊂ R × R, find an isotonic (nondecreasing) f* : R → R which minimizes the squared error over the labels:
\[
\min_{f} \; \sum_{t=1}^{T} \big(y_t - f(x_t)\big)^2,
\qquad \text{subject to: } x_t \ge x_q \implies f(x_t) \ge f(x_q), \quad q, t \in \{1, \ldots, T\}.
\]
The optimal solution f* is called the isotonic regression function. Only the values f(x_t), t = 1, ..., T, matter.
Isotonic regression example (source: scikit-learn.org)
Properties of isotonic regression
- Depends on the instances (x) only through their order relation.
- Only defined at the points {x_1, ..., x_T}; often extended to R by linear interpolation.
- Piecewise constant (splits the data into level sets).
- Self-averaging property: the value of f* on a given level set equals the average of the labels in that level set. For any value v:
\[
v = \frac{1}{|S_v|} \sum_{t \in S_v} y_t, \qquad \text{where } S_v = \{t : f^*(x_t) = v\}.
\]
- When y ∈ {0, 1}, it produces calibrated (empirical) probabilities: E_emp[y | f* = v] = v.
Pool Adjacent Violators Algorithm (PAVA)
- Iteratively merges data points into blocks until no violators of the isotonic constraints remain.
- The value assigned to each block is the average of the labels in that block.
- The final assignment to blocks corresponds to the level sets of the isotonic regression.
- Works in linear O(T) time, but requires the data to be sorted.
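A minimal stack-based sketch of PAVA (Python; the function name and the example are illustrative, not taken from the slides): each point starts as its own block, and adjacent blocks that violate monotonicity are merged into their weighted average.

```python
def pava(y, w=None):
    """Pool Adjacent Violators: isotonic (nondecreasing) least-squares fit
    to the labels y, assumed listed in increasing order of x.
    Returns one fitted value per input point."""
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    blocks = []  # each block: [value = weighted mean of its labels, total weight, #points]
    for i in range(n):
        blocks.append([float(y[i]), float(w[i]), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fit = []  # expand blocks back to per-point values (the level sets)
    for v, _, c in blocks:
        fit.extend([v] * c)
    return fit

# Example: one violation gets pooled into a level set with the average label.
print(pava([0.1, 0.6, 0.4, 0.9]))  # -> [0.1, 0.5, 0.5, 0.9]
```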
Generalized isotonic regression

Definition
Given data {(x_t, y_t)}_{t=1}^T ⊂ R × R, find an isotonic f* : R → R which minimizes:
\[
\min_{\text{isotonic } f} \; \sum_{t=1}^{T} \Delta\big(y_t, f(x_t)\big).
\]
The squared loss (y_t − f(x_t))² is replaced with a general loss Δ(y_t, f(x_t)).

Theorem [Robertson et al., 1998]
All loss functions of the form
\[
\Delta(y, z) = \Psi(y) - \Psi(z) - \Psi'(z)(y - z)
\]
for some strictly convex Ψ result in the same isotonic regression function f*.
Generalized isotonic regression – examples

Δ(y, z) = Ψ(y) − Ψ(z) − Ψ′(z)(y − z)

- Squared function Ψ(y) = y²:
  Δ(y, z) = y² − z² − 2z(y − z) = (y − z)²  (squared loss).
- Negative entropy Ψ(y) = y log y + (1 − y) log(1 − y), y ∈ [0, 1]:
  Δ(y, z) = y log(y/z) + (1 − y) log((1 − y)/(1 − z))  (relative entropy; equal to the cross-entropy −y log z − (1 − y) log(1 − z) up to a term that does not depend on z).
- Negative logarithm Ψ(y) = −log y, y > 0:
  Δ(y, z) = y/z − log(y/z) − 1  (Itakura-Saito distance / Burg entropy).
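As a quick check of the last example (my own algebra, not spelled out on the slide): with Ψ(y) = −log y we have Ψ′(z) = −1/z, so
\[
\Delta(y, z) = -\log y + \log z + \frac{1}{z}(y - z) = \frac{y}{z} - \log\frac{y}{z} - 1 .
\]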
Outline
1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions
Online learning framework
- A theoretical framework for the analysis of online algorithms.
- The learning process is, by its very nature, incremental.
- Avoids stochastic (e.g., i.i.d.) assumptions on the data sequence; designs algorithms which work well for any data.
- Meaningful performance guarantees based on observed quantities: regret bounds.
Online learning framework
[Diagram of the prediction loop: the learner (strategy f_t : X → Y) receives a new instance (x_t, ?), predicts ŷ_t = f_t(x_t), gets the feedback y_t, suffers loss ℓ(y_t, ŷ_t), and moves to the next round t → t + 1.]
Online learning framework
Set of strategies (actions) F; known loss function ℓ.
Learner starts with some initial strategy (action) f_1.
For t = 1, 2, ...:
1. Learner observes instance x_t.
2. Learner predicts ŷ_t = f_t(x_t).
3. The environment reveals the outcome y_t.
4. Learner suffers loss ℓ(y_t, ŷ_t).
5. Learner updates its strategy f_t → f_{t+1}.
Online learning framework
The goal of the learner is to be close to the best f in hindsight.
Cumulative loss of the learner:
\[
\hat{L}_T = \sum_{t=1}^{T} \ell(y_t, \hat{y}_t).
\]
Cumulative loss of the best strategy f in hindsight:
\[
L^*_T = \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell\big(y_t, f(x_t)\big).
\]
Regret of the learner: regret_T = L̂_T − L*_T.
The goal is to minimize the regret over all possible data sequences.
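As a sketch, the protocol is just a loop in which the prediction is committed before the label is revealed. The strategy interface below (predict/update) is an assumption of mine, not notation from the slides; the regret is the returned cumulative loss minus the best strategy's loss in hindsight.

```python
def run_online_protocol(strategy, data, loss):
    """Generic online learning loop; returns the learner's cumulative loss L_T.

    strategy : object with predict(x) -> prediction and update(x, y) methods
    data     : sequence of (x_t, y_t) pairs, revealed one trial at a time
    loss     : loss function ell(y, y_hat)
    """
    cumulative = 0.0
    for x_t, y_t in data:
        y_hat = strategy.predict(x_t)   # step 2: predict before seeing the label
        cumulative += loss(y_t, y_hat)  # step 4: suffer the loss
        strategy.update(x_t, y_t)       # step 5: update the strategy f_t -> f_{t+1}
    return cumulative
```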
Outline
1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions
Online isotonic regression
[Illustration over several slides: points x_1 < ... < x_8 on the X axis with labels in [0, 1]. The environment picks a yet-unlabeled point (e.g., x_5, then x_1), the learner predicts ŷ, the label y is revealed, and the learner suffers the squared loss (ŷ − y)².]
Online isotonic regression

The protocol
Given: x_1 < x_2 < ... < x_T. At trial t = 1, ..., T:
- Environment chooses a yet unlabeled point x_{i_t}.
- Learner predicts ŷ_{i_t} ∈ [0, 1].
- Environment reveals the label y_{i_t} ∈ [0, 1].
- Learner suffers the squared loss (y_{i_t} − ŷ_{i_t})².

Strategies = isotonic functions:
\[
\mathcal{F} = \{ f : f(x_1) \le f(x_2) \le \ldots \le f(x_T) \}
\]
\[
\mathrm{regret}_T = \sum_{t=1}^{T} (y_{i_t} - \hat{y}_{i_t})^2 - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \big(y_{i_t} - f(x_{i_t})\big)^2
\]
Online isotonic regression
- The cumulative loss of the learner should not be much larger than the loss of the (optimal) isotonic regression function in hindsight.
- Only the order x_1 < ... < x_T matters, not the actual values.
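The regret can be estimated empirically by replaying a revelation order against any prediction rule and comparing with the offline isotonic fit in hindsight. A sketch (Python, using scikit-learn's IsotonicRegression as the comparator; the function name and interface are my own):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def online_ir_regret(xs, ys, order, predict):
    """Regret of a learner in the online isotonic regression protocol.

    xs, ys  : the T points x_1 < ... < x_T and their labels in [0, 1]
    order   : indices i_1, ..., i_T in which the environment reveals the points
    predict : function(train_x, train_y, x_new) -> prediction in [0, 1];
              it is called with empty training arrays at the first trial
    """
    learner_loss = 0.0
    seen_x, seen_y = [], []
    for i in order:
        y_hat = predict(np.array(seen_x), np.array(seen_y), xs[i])
        learner_loss += (ys[i] - y_hat) ** 2
        seen_x.append(xs[i])
        seen_y.append(ys[i])
    # Comparator: the (offline) isotonic regression fit on all the data in hindsight.
    best_fit = IsotonicRegression(y_min=0.0, y_max=1.0).fit(xs, ys).predict(xs)
    return learner_loss - float(np.sum((np.asarray(ys) - best_fit) ** 2))
```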
The adversary is too powerful! Every algorithm will have Ω(T) regret.
[Illustration over several slides: in each trial the adversary queries a point lying between all points labeled 0 so far and all points labeled 1 so far, waits for the learner's prediction ŷ ∈ [0, 1], and then reveals whichever label y ∈ {0, 1} is farther from ŷ, so the revealed labels stay consistent with some isotonic function.]
Algorithm's loss ≥ 1/4 per trial, while the loss of the best isotonic function in hindsight = 0.
Outline
1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions
Fixed design
- The data x_1, ..., x_T is known in advance to the learner.
- We will show that in such a model, efficient online algorithms exist.

K., Koolen, Malek: Online Isotonic Regression. Proc. of the Conference on Learning Theory (COLT), pp. 1165–1189, 2016.
Off-the-shelf online algorithms

Algorithm                   | General bound       | Bound for online IR
Stochastic Gradient Descent | G_2 D_2 √T          | T
Exponentiated Gradient      | G_∞ D_1 √(T log d)  | √(T log T)
Follow the Leader           | G_2² D_2² d log T   | T² log T
Exponential Weights         | d log T             | T log T

These bounds are tight (up to logarithmic factors).
Exponential Weights (Bayes) with uniform prior
Let f = (f_1, ..., f_T) denote the values of f at (x_1, ..., x_T).
\[
\pi(f) = \text{const}, \quad \text{for all } f: f_1 \le \ldots \le f_T,
\]
\[
P(f \mid y_{i_1}, \ldots, y_{i_t}) \propto \pi(f)\, e^{-\frac{1}{2}\,\mathrm{loss}_{1 \ldots t}(f)},
\]
\[
\hat{y}_{i_{t+1}} = \int f_{i_{t+1}}\, P(f \mid y_{i_1}, \ldots, y_{i_t})\, \mathrm{d}f
\quad (= \text{posterior mean}).
\]
Exponential Weights with uniform prior does not learn
[Plots of the prior mean and the posterior mean at t = 10, 20, 50, 100 labeled points.]
The algorithm
Exponential Weights on a covering net:
\[
\mathcal{F}_K = \Big\{ f : f_t = \tfrac{k_t}{K},\; k_t \in \{0, 1, \ldots, K\},\; f_1 \le \ldots \le f_T \Big\},
\]
π(f) uniform on F_K.
Efficient implementation by dynamic programming: O(Kt) at trial t.
Speed-up to O(K) if the data is revealed in isotonic order.
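A minimal sketch of the dynamic program (Python/NumPy; not the authors' implementation, and the names are illustrative): a forward and a backward pass over the grid give the posterior marginal at the queried point, and the prediction is its mean. The full recomputation shown here costs O(TK) per trial; a more careful incremental implementation achieves the O(Kt) per-trial cost stated above.

```python
import numpy as np

def ew_grid_predict(labels, j_query, T, K):
    """Posterior-mean prediction of Exponential Weights over the covering net F_K.

    labels  : dict {position: label in [0, 1]} of the already revealed points
    j_query : 0-based position of the (unlabeled) point to predict
    T, K    : number of points and grid resolution (levels 0, 1/K, ..., 1)
    """
    grid = np.arange(K + 1) / K

    # Per-position weights: exp(-1/2 * squared loss) at labeled points, 1 elsewhere.
    loc = np.ones((T, K + 1))
    for j, y in labels.items():
        loc[j] = np.exp(-0.5 * (y - grid) ** 2)

    # Forward pass: A[j, k] = weight of all isotonic prefixes f_1 <= ... <= f_j with f_j = k/K.
    A = np.zeros((T, K + 1))
    A[0] = loc[0]
    for j in range(1, T):
        A[j] = np.cumsum(A[j - 1]) * loc[j]               # cumsum sums over f_{j-1} <= f_j

    # Backward pass: B[j, k] = weight of all isotonic suffixes starting with f_j = k/K.
    B = np.zeros((T, K + 1))
    B[-1] = loc[-1]
    for j in range(T - 2, -1, -1):
        B[j] = np.cumsum(B[j + 1][::-1])[::-1] * loc[j]   # reversed cumsum: f_j <= f_{j+1}

    # Posterior marginal at the queried position (divide out the doubly counted local factor).
    # For clarity only; a real implementation would normalize to avoid underflow for large T.
    marginal = A[j_query] * B[j_query] / loc[j_query]
    marginal /= marginal.sum()
    return float(np.dot(grid, marginal))                  # prediction = posterior mean
```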
Covering net
A finite set of isotonic functions on a discrete grid of y values.
[Grid illustration: levels 0, 0.1, ..., 1 over the points x_1, ..., x_12, with several isotonic grid functions highlighted.]
There are O(T^K) functions in F_K.
Performance of the algorithm

Regret bound
When K = Θ(T^{1/3} log^{−1/3}(T)),
\[
\mathrm{Regret} = O\big(T^{1/3} \log^{2/3}(T)\big).
\]
Matching lower bound Ω(T^{1/3}) (up to log factors).

Proof idea
\[
\mathrm{Regret}
= \underbrace{\mathrm{Loss(alg)} - \min_{f \in \mathcal{F}_K} \mathrm{Loss}(f)}_{= 2 \log |\mathcal{F}_K| = O(K \log T)}
+ \underbrace{\min_{f \in \mathcal{F}_K} \mathrm{Loss}(f) - \min_{\text{isotonic } f} \mathrm{Loss}(f)}_{= \frac{T}{4K^2}}
\]
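Balancing the two terms above (a quick arithmetic check, not spelled out on the slide):
\[
\mathrm{Regret} \;\le\; O(K \log T) + \frac{T}{4K^2},
\qquad
K \log T \;\asymp\; \frac{T}{K^2}
\;\iff\;
K = \Theta\!\left(\left(\tfrac{T}{\log T}\right)^{1/3}\right),
\]
\[
\text{which gives} \quad \mathrm{Regret} = O\!\left(T^{1/3} \log^{2/3} T\right).
\]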
Performance of the algorithm
[Plots of the prior mean and the posterior mean of the covering-net algorithm at t = 10, 20, 50, 100 labeled points.]
Other loss functions

Cross-entropy loss ℓ(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ):
- The same bound O(T^{1/3} log^{2/3}(T)).
- The covering net F_K is obtained by a non-uniform discretization.

Absolute loss ℓ(y, ŷ) = |y − ŷ|:
- O(√(T log T)), obtained by Exponentiated Gradient.
- Matching lower bound Ω(√T) (up to log factors).
Outline
1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions
Random permutation model
- A more realistic scenario for generating x_1, ..., x_T, which allows the data to be unknown in advance.
- The data are chosen adversarially before the game begins, but are then presented to the learner in a random order.
- Motivation: the data-gathering process is independent of the underlying data-generation mechanism. Still a very weak assumption.
- Evaluation: regret averaged over all permutations of the data: E_σ[regret_T].

K., Koolen, Malek: Random Permutation Online Isotonic Regression. NIPS, pp. 4180–4189, 2017.
Leave-one-out loss

Definition
Given t labeled points {(x_i, y_i)}_{i=1}^t, for i = 1, ..., t:
- Take out the i-th point and give the remaining t − 1 points to the learner as training data.
- The learner predicts ŷ_i on x_i and receives loss ℓ(y_i, ŷ_i).
Evaluate the learner by
\[
\ell^{\mathrm{oo}}_t = \frac{1}{t} \sum_{i=1}^{t} \ell(y_i, \hat{y}_i).
\]
There is no sequential structure in the definition.

Theorem
If ℓ^oo_t ≤ g(t) for all t, then E_σ[regret_T] ≤ Σ_{t=1}^T g(t).
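A direct translation of the definition into code (Python; the predict interface is my own assumption, matching the earlier regret sketch):

```python
import numpy as np

def leave_one_out_loss(xs, ys, predict, loss=lambda y, p: (y - p) ** 2):
    """Leave-one-out loss ell^oo_t of a prediction rule on t labeled points.

    predict : function(train_x, train_y, x_held_out) -> prediction in [0, 1]
    """
    xs, ys = np.asarray(xs), np.asarray(ys)
    losses = []
    for i in range(len(xs)):
        mask = np.arange(len(xs)) != i            # hold out the i-th point
        y_hat = predict(xs[mask], ys[mask], xs[i])
        losses.append(loss(ys[i], y_hat))
    return float(np.mean(losses))
```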
Fixed design to random permutation conversion
- Any algorithm for fixed design can be used in the random permutation setup by re-running it from scratch in each trial.
- We have shown that:
\[
\ell^{\mathrm{oo}}_t \le \frac{1}{t}\, \mathbb{E}_\sigma[\text{fixed-design-regret}_t].
\]
- We thus get an optimal algorithm (Exponential Weights on a grid) with Õ(T^{−2/3}) leave-one-out loss "for free", but it is complicated.
- Can we get simpler algorithms to work in this setup?
Follow the Leader (FTL) algorithm

Definition
Given the past t − 1 data points, compute the optimal (loss-minimizing) isotonic function f* and predict on a new instance x according to f*(x).

FTL is undefined for isotonic regression:

x     | −3 | −1  |  0  |  2  | 3
y     |  0 | 0.2 |  ?  | 0.7 | 1
f*(x) |  0 | 0.2 | ??  | 0.7 | 1

The past-data optimum f* is only determined at the previously seen points, so its value at the new instance x = 0 (anything in [0.2, 0.7]) is not unique.
Forward Algorithm (FA)

Definition
Given the past t − 1 data points and a new instance x, take any guess y′ ∈ [0, 1] of the new label and predict according to the optimal isotonic function f* computed on the past data including the new point (x, y′).

Example with guess y′ = 1:

x     | −3 | −1  |   0    |  2   | 3
y     |  0 | 0.2 | y′ = 1 | 0.7  | 1
f*(x) |  0 | 0.2 | 0.85   | 0.85 | 1

Various popular prediction algorithms for IR fall into this framework (including linear interpolation [Zadrozny & Elkan, 2002] and many others [Vovk et al., 2015]).
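A small sketch of the forward algorithm (Python, using scikit-learn's IsotonicRegression to solve the offline problem; the helper name is mine). With guess y′ = 1 it reproduces the fitted values 0, 0.2, 0.85, 0.85, 1 from the table above.

```python
from sklearn.isotonic import IsotonicRegression

def forward_algorithm_predict(train_x, train_y, x_new, guess=1.0):
    """Forward Algorithm sketch: append (x_new, guess), run offline isotonic
    regression on the past data plus the guessed point, and predict the
    fitted value at x_new. guess=1 and guess=0 give the extreme FAs f*_1 and f*_0."""
    ir = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    ir.fit(list(train_x) + [x_new], list(train_y) + [guess])
    return float(ir.predict([x_new])[0])

# The example from the slide: past data and the new instance x = 0 with guess y' = 1.
print(forward_algorithm_predict([-3, -1, 2, 3], [0, 0.2, 0.7, 1], 0, guess=1.0))  # 0.85
```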
Forward Algorithm (FA)
- Two extreme FAs: guess-1 and guess-0, denoted f*_1 and f*_0.
- The prediction of any FA always lies between them: f*_0(x) ≤ f*(x) ≤ f*_1(x).
[Plot: labeled points with the curves f*_0 and f*_1; every FA predicts in the range between them.]
Performance of FA

Theorem
For the squared loss, every forward algorithm has
\[
\ell^{\mathrm{oo}}_t = O\!\left(\sqrt{\frac{\log t}{t}}\right).
\]
The bound is suboptimal, but only by a factor of O(t^{1/6}).
For the cross-entropy loss, the same bound holds, but a more careful choice of the guess must be made.
Outline
1 Motivation
2 Isotonic regression
3 Online learning
4 Online isotonic regression
5 Fixed design online isotonic regression
6 Random permutation online isotonic regression
7 Conclusions