Sequential complexities and uniform martingale laws of large numbers
Ambuj Tewari (based on joint work with Alexander Rakhlin and Karthik Sridharan)
Department of Statistics and Department of EECS, University of Michigan, Ann Arbor
November 15, 2014
Some Prediction Problems
- Will a friendship relation form between two Facebook users?
- Which ads should Google show me when I search for "flights to Mexico"?
- 507,000 webpages match "game-theoretic probability": in which order should Google show them to me?
- Should Gmail put the email with subject "FREE ONLINE COURSES!!!" in the spam folder?
Mathematical Formulation of Prediction Problems
- Input space X (vectors, matrices, text, graphs)
- Label space Y:
  - classification: Y = {±1}
  - regression: Y = [−1, +1]
  - ranking: Y = S_k, the group of k-permutations
- Want to learn a prediction function f : X → Y
- Loss function: how bad is prediction f(x) if the "truth" is y
Predictions and Losses
- Learner/Statistician/Decision Maker chooses a prediction function f : X → Y
- Adversary/Nature/Environment produces examples (x, y) ∈ X × Y
- Learner's loss: ℓ(f(x), y)
- Assume ℓ is bounded
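As an illustration (not part of the original slides), here is a minimal Python sketch of two bounded losses of the kind assumed above: the zero-one loss for classification and a clipped squared loss for regression. The function names are made up for this example.

```python
def zero_one_loss(prediction, y):
    """Zero-one loss for classification with labels in {-1, +1}; bounded in [0, 1]."""
    return 0.0 if prediction == y else 1.0

def clipped_squared_loss(prediction, y):
    """Squared loss for regression with predictions and labels in [-1, +1].

    Since |prediction - y| <= 2, the loss is at most 4; we rescale it to [0, 1].
    """
    return min((prediction - y) ** 2, 4.0) / 4.0
```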
Probabilistic Approach
- (x_t, y_t) are drawn from a stochastic process
- For instance, (x_t, y_t) i.i.d. from some distribution P
- Parametric case: P = P_θ with θ ∈ Θ ⊆ R^p
- Distribution-free or "agnostic" case: P arbitrary
- Goal: choose $\hat f$ based on the sample $((x_t, y_t))_{t=1}^n$ to have small expected loss
  $\mathbb{E}_{x_{1:n},\, y_{1:n},\, (x,y)\sim P}\big[\ell(\hat f(x), y)\big]$
Empirical Risk Minimization
- Risk and empirical risk:
  $L(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x), y)\big], \qquad \hat L(f) = \frac{1}{n}\sum_{t=1}^n \ell(f(x_t), y_t)$
- Risk minimizer: $f^\star = \operatorname{argmin}_{f\in\mathcal F} L(f)$
- Empirical risk minimizer (ERM): $\hat f = \operatorname{argmin}_{f\in\mathcal F} \hat L(f)$
- Excess risk: $L(\hat f) - L(f^\star)$
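To make these definitions concrete, here is a small Python sketch (an illustration, not the slides' own example) of ERM over a finite class of threshold classifiers on i.i.d. data; the class, the data distribution, and all names are made up for this sketch.

```python
import random

# Finite class of threshold classifiers f_c(x) = sign(x - c), a toy stand-in for F.
thresholds = [i / 10 for i in range(-10, 11)]

def predict(c, x):
    return 1 if x - c >= 0 else -1

def zero_one_loss(prediction, y):
    return 0.0 if prediction == y else 1.0

def sample(n, rng):
    """Draw n i.i.d. examples: x uniform on [-1, 1], label sign(x - 0.3) flipped w.p. 0.1."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = 1 if x >= 0.3 else -1
        if rng.random() < 0.1:
            y = -y
        data.append((x, y))
    return data

def empirical_risk(c, data):
    return sum(zero_one_loss(predict(c, x), y) for x, y in data) / len(data)

rng = random.Random(0)
train = sample(200, rng)
erm = min(thresholds, key=lambda c: empirical_risk(c, train))   # the ERM \hat f

# Estimate the risk of the ERM on a large fresh sample.
test = sample(100_000, rng)
print("ERM threshold:", erm, " estimated risk:", empirical_risk(erm, test))
```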
Game-Theoretic Approach
FOR t = 1 to n:
  Adversary plays x_t ∈ X
  Learner plays f_t ∈ F
  Adversary plays y_t ∈ Y
  Learner suffers ℓ(f_t(x_t), y_t)
ENDFOR
- No assumption on the data-generating mechanism
- Want to "do well" on every sequence (x_1, y_1), ..., (x_n, y_n)
- The goal is tricky to define
Regret
- Measure the learner's loss relative to some benchmark computed in hindsight
- (External) Regret:
  $\sum_{t=1}^n \ell(f_t(x_t), y_t) \;-\; \min_{f\in\mathcal F} \sum_{t=1}^n \ell(f(x_t), y_t)$
- The benchmark here is the best fixed decision in hindsight
- Many variants exist (switching regret, Φ-regret)
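Here is a minimal Python sketch (not from the slides) of the protocol above together with the external regret it incurs, using a made-up finite class of constant predictors and a learner that simply plays a fixed function; all names and data are illustrative only.

```python
import random

# Toy finite class F: constant predictors that always output -1 or always +1.
F = [-1, +1]

def loss(prediction, y):
    return 0.0 if prediction == y else 1.0    # bounded zero-one loss

def play(n, learner, adversary, rng):
    """Run the online protocol for n rounds and return the learner's external regret."""
    total_loss = 0.0
    cumulative = {f: 0.0 for f in F}          # loss of each fixed f, for the hindsight benchmark
    history = []
    for t in range(n):
        x = rng.uniform(-1, 1)                # adversary plays x_t (random here)
        f_t = learner(history)                # learner plays f_t in F (a constant predictor)
        y = adversary(x, rng)                 # adversary plays y_t
        total_loss += loss(f_t, y)            # learner suffers loss of f_t(x_t) against y_t
        for f in F:
            cumulative[f] += loss(f, y)
        history.append((x, y))
    return total_loss - min(cumulative.values())

# A naive learner that always plays +1, against a slightly biased random adversary.
always_plus = lambda history: +1
adversary = lambda x, rng: +1 if rng.random() < 0.6 else -1

rng = random.Random(0)
print("regret over 1000 rounds:", play(1000, always_plus, adversary, rng))
```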
Why Study Regret?
- Lets us proceed with no assumptions on the data-generating process
- Regret-minimizing algorithms perform well if the data are i.i.d.
- Yields simple one-pass algorithms
- If players in a game follow regret-minimizing algorithms, the empirical distribution of play converges to an equilibrium
- Long history in Computer Science, Finance, Game Theory, Information Theory, and Statistics
Two pioneers
James Hannan (1922-2010) and David Blackwell (1919-2010)
Simplest Case: Finite Class of Functions
- |F| = K
- Hannan's theorem: there is a (randomized) learner strategy for which (expected) regret = o(n)
- "No-regret learning" or "Hannan consistency": regret = o(n)
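The slides do not specify an algorithm; one standard strategy that achieves o(n) expected regret for a finite class with bounded losses is exponential weights (Hedge). The following Python sketch is illustrative only, with a made-up two-expert example; the learning rate used is one common choice, not necessarily Hannan's original scheme.

```python
import math
import random

def exponential_weights(n, experts, get_losses, rng):
    """Randomized learner for a finite class ("experts") with losses in [0, 1].

    get_losses(t) returns one loss per expert, revealed after the learner's random choice.
    Returns the realized regret against the best fixed expert.
    """
    K = len(experts)
    eta = math.sqrt(8 * math.log(K) / n)          # a standard learning rate choice
    weights = [1.0] * K
    total_loss = 0.0
    cumulative = [0.0] * K
    for t in range(n):
        total = sum(weights)
        probs = [w / total for w in weights]
        i = rng.choices(range(K), weights=probs)[0]   # randomized play
        losses = get_losses(t)
        total_loss += losses[i]
        for k in range(K):
            cumulative[k] += losses[k]
            weights[k] *= math.exp(-eta * losses[k])
    return total_loss - min(cumulative)

# Toy example: two constant predictors, labels are mostly +1 (fixed in advance).
data_rng = random.Random(0)
labels = [+1 if data_rng.random() < 0.7 else -1 for _ in range(5000)]
experts = [-1, +1]
get_losses = lambda t: [0.0 if e == labels[t] else 1.0 for e in experts]
print("regret:", exponential_weights(5000, experts, get_losses, random.Random(1)))
```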
Multiple Discovery
- Originally proved by Hannan (1956)
- Blackwell (1956) showed how it follows from his approachability theorem
- The result has been proven many times since then: Banos (1968), Cover (1991), Foster & Vohra (1993), Vovk (1993)
Rest of the Talk
- Rademacher complexity and its sequential analog
- Fat-shattering dimension and its sequential analog
- Uniform martingale law of large numbers
Rademacher Complexity
- Recall the ERM $\hat f$ and the risk minimizer $f^\star$:
  $f^\star = \operatorname{argmin}_{f\in\mathcal F} L(f), \qquad \hat f = \operatorname{argmin}_{f\in\mathcal F} \hat L(f)$
- Easy to show:
  $\mathbb{E}\big[L(\hat f) - L(f^\star)\big] \;\le\; \mathbb{E}\Big[\sup_{f\in\mathcal F} \big(L(f) - \hat L(f)\big)\Big]$
- Symmetrization (the $\epsilon_t$'s are Rademacher, i.e. symmetric Bernoulli):
  $\mathbb{E}\Big[\sup_{f\in\mathcal F} \big(L(f) - \hat L(f)\big)\Big] \;\le\; 2\,\mathbb{E}_{\epsilon_{1:n},\, x_{1:n},\, y_{1:n}}\Big[\sup_{f\in\mathcal F} \frac{1}{n}\sum_{t=1}^n \epsilon_t\, \ell(f(x_t), y_t)\Big]$
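A small Python sketch (illustrative only, not from the slides) that estimates the symmetrized quantity above, i.e. the empirical Rademacher complexity of the loss class, by Monte Carlo over the Rademacher signs; the toy threshold class and the data are made up.

```python
import random

thresholds = [i / 10 for i in range(-10, 11)]           # toy finite class F

def loss(c, x, y):
    prediction = 1 if x - c >= 0 else -1
    return 0.0 if prediction == y else 1.0              # zero-one loss

def empirical_rademacher(data, num_draws, rng):
    """Monte Carlo estimate of E_eps sup_f (1/n) sum_t eps_t * loss(f, x_t, y_t)."""
    n = len(data)
    total = 0.0
    for _ in range(num_draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]   # fresh Rademacher signs
        best = max(
            sum(e * loss(c, x, y) for e, (x, y) in zip(eps, data)) / n
            for c in thresholds
        )
        total += best
    return total / num_draws

rng = random.Random(0)
data = [(rng.uniform(-1, 1), rng.choice((-1, 1))) for _ in range(100)]
print("estimated Rademacher complexity:", empirical_rademacher(data, 500, rng))
```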
Which Algorithm Should We Analyze?
- The obvious analogue of ERM is "follow-the-leader" or "fictitious play":
  $f_{t+1} = \operatorname{argmin}_{f\in\mathcal F} \sum_{s=1}^t \ell(f(x_s), y_s)$
- It does not enjoy a good regret bound
- The lack of a generic regret-minimizing strategy is a problem
- Instead, directly attack the minimax regret
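To illustrate why follow-the-leader can fail, here is a minimal Python sketch (not from the slides) of FTL on a two-action problem together with the standard alternating loss sequence on which its regret grows linearly; the example and names are made up here.

```python
def follow_the_leader(loss_vectors):
    """FTL over a finite set of actions: at each round play the action with the
    smallest cumulative loss so far (ties broken by index); return its regret."""
    K = len(loss_vectors[0])
    cumulative = [0.0] * K
    ftl_loss = 0.0
    for losses in loss_vectors:
        action = min(range(K), key=lambda k: cumulative[k])
        ftl_loss += losses[action]
        for k in range(K):
            cumulative[k] += losses[k]
    return ftl_loss - min(cumulative)   # regret against the best fixed action

# Classic two-action sequence on which FTL keeps switching to the wrong action:
# round 1 has losses (0.5, 0); afterwards the losses alternate (0, 1), (1, 0), ...
# FTL then loses 1 every round, while the best fixed action loses about n/2.
n = 1000
loss_vectors = [(0.5, 0.0)] + [((0.0, 1.0) if t % 2 == 1 else (1.0, 0.0)) for t in range(1, n)]
print("FTL regret over", n, "rounds:", follow_the_leader(loss_vectors))
```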
Minimax Regret
- Minimax regret:
  $V_n := \min_{\text{learner strategies}} \; \max_{\text{adversary strategies}} \; \mathbb{E}\Big[\sum_{t=1}^n \ell(f_t(x_t), y_t) - \min_{f\in\mathcal F} \sum_{t=1}^n \ell(f(x_t), y_t)\Big]$
- Theorem (Rakhlin, Sridharan, Tewari (2010)): $V_n \le 2\,\mathcal{R}^{seq}_n$
- Important precursor: Abernethy et al. (2009)
Sequential Rademacher Complexity
$\mathcal{R}^{seq}_n := \sup_{\mathbf{x},\mathbf{y}}\; \mathbb{E}_{\epsilon_{1:n}}\Big[\sup_{f\in\mathcal F} \sum_{t=1}^n \epsilon_t\, \ell\big(f(\mathbf{x}(\epsilon_{1:t-1})),\, \mathbf{y}(\epsilon_{1:t-1})\big)\Big]$
[Figure: a complete binary tree (x, y) of depth 3 with node labels (x_1, y_1), ..., (x_7, y_7). A node at level t is indexed by the sign prefix ε_{1:t−1}: x(∅) = x_1, x(−1) = x_2, x(−1, +1) = x_5, and so on. The slides step through the sample path ε = (−1, +1, +1), which traverses x_1, x_2, x_5.]
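To make the tree picture concrete, here is a small Python sketch (illustrative only) that evaluates the inner quantity, the expectation over sign paths of the supremum over a tiny finite class, exactly for one fixed, made-up tree by enumerating all sign sequences; the outer supremum over trees in the definition is not attempted here, and the class of [-1, +1]-valued functions below stands in for the composed class ℓ∘F.

```python
import itertools

n = 3   # depth of the tree

# A fixed X-valued binary tree of depth 3: the node at level t is indexed by the
# sign prefix eps_{1:t-1}.  Values are arbitrary, made up for this example.
x_tree = {
    (): 0.1,
    (-1,): -0.4, (+1,): 0.6,
    (-1, -1): -0.9, (-1, +1): 0.2, (+1, -1): 0.3, (+1, +1): 0.8,
}

# A tiny finite class standing in for the composed class loss∘F:
# [-1, +1]-valued threshold functions g_c(x) = sign(x - c).
def g(c, x):
    return 1.0 if x - c >= 0 else -1.0

thresholds = [-0.5, 0.0, 0.5]

def tree_value(x_tree, n):
    """E_eps sup_g sum_t eps_t * g(x(eps_{1:t-1})), averaged over all 2^n sign paths."""
    total = 0.0
    for eps in itertools.product((-1, +1), repeat=n):
        best = max(
            sum(eps[t] * g(c, x_tree[eps[:t]]) for t in range(n))
            for c in thresholds
        )
        total += best
    return total / 2 ** n

print("value of this fixed tree:", tree_value(x_tree, n))
```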
Rademacher Complexity: Classical vs. Sequential
$\mathcal{R}_n(\ell\circ\mathcal F) := \mathbb{E}_{\epsilon_{1:n},\, x_{1:n},\, y_{1:n}}\Big[\sup_{f\in\mathcal F} \sum_{t=1}^n \epsilon_t\, \ell(f(x_t), y_t)\Big]$
$\mathcal{R}^{seq}_n(\ell\circ\mathcal F) := \sup_{\mathbf{x},\mathbf{y}}\; \mathbb{E}_{\epsilon_{1:n}}\Big[\sup_{f\in\mathcal F} \sum_{t=1}^n \epsilon_t\, \ell\big(f(\mathbf{x}(\epsilon_{1:t-1})),\, \mathbf{y}(\epsilon_{1:t-1})\big)\Big]$
- The sequences x_{1:n}, y_{1:n} are replaced by a tree (x, y)
- The expectation over sequences x_{1:n}, y_{1:n} is replaced by a supremum over trees (x, y)
Seq. Rademacher Complexity: Properties
- (inclusion) If $\mathcal F \subseteq \mathcal F'$ then $\mathcal{R}^{seq}_n(\ell\circ\mathcal F) \le \mathcal{R}^{seq}_n(\ell\circ\mathcal F')$
- (scaling) If $c \in \mathbb{R}$ then $\mathcal{R}^{seq}_n(c\,\ell\circ\mathcal F) = |c|\cdot \mathcal{R}^{seq}_n(\ell\circ\mathcal F)$
- (translation) If $\ell' = \ell + h$ then $\mathcal{R}^{seq}_n(\ell'\circ\mathcal F) = \mathcal{R}^{seq}_n(\ell\circ\mathcal F)$
- Using these and other properties, it is possible to bound the sequential Rademacher complexity of decision trees, neural networks, etc.