Feature Selection Risk Alex Chinco University of Illinois at Urbana-Champaign September 15, 2014
“Our model allows us to identify and interpret events faster than more traditional methods used by other investors.” (Quant. Fund Pitch Book)
Imagine you’re a trader. Each stock can have Y/N exposure to 7 features. Whether or not...
1. It’s involved in a crowded trade
2. It’s mentioned in M&A rumors
3. Its major supplier closed down
4. Its labor force unionized
5. It belongs to the alcohol/tobacco/gaming (ATG) industry
6. It’s referenced in a scientific article
7. It’s been added to the S&P 500
1 of the 7 features might have realized a shock. Having the mystery feature raises demand for a stock by $\alpha > 0$ shares.
Question: How many observations do you need to see in order to decide which (if any) of the 7 features has realized a shock?
Answer: Only 3!
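◮ Stock 1: crowded trade, supplier closed, ATG industry, S&P 500 addition.
◮ Stock 2: M&A rumor, supplier closed, scientific article, S&P 500 addition.
◮ Stock 3: labor unionization, ATG industry, scientific article, S&P 500 addition.
The $3 \times 7$ data matrix $X$ tells you if stock $n$ has attribute $q$:
$$x_{n,q} = \begin{cases} 1 & \text{if yes} \\ 0 & \text{if no} \end{cases} \qquad \text{with } \epsilon_n \overset{\text{iid}}{\sim} \mathrm{N}(0, \sigma_\epsilon^2), \quad \alpha \gg \sigma_\epsilon$$
e.g., if only $d_1 \approx \alpha$, then crowded trade shock:
$$\underbrace{\begin{bmatrix} \alpha \\ 0 \\ 0 \end{bmatrix}}_{(d)_{3 \times 1}} \approx \underbrace{\begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}}_{(X)_{3 \times 7}} \underbrace{\begin{bmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{bmatrix}}_{(\alpha)_{7 \times 1}} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}}_{(\epsilon)_{3 \times 1}}$$
e.g., if $d_1 \approx d_2 \approx d_3 \approx \alpha$, then S&P 500 addition shock.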
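The decoding logic of this example can be sketched in a few lines of Python. This is only an illustration under the slide's assumptions (at most one shock, known size $\alpha$, noise much smaller than $\alpha$); the function name `identify_shock` and the rounding rule `d_n > alpha / 2` are hypothetical choices, not part of the talk.

```python
# Exposure matrix from the three stocks above (rows = stocks, columns = features 1..7).
X = [
    [1, 0, 1, 0, 1, 0, 1],  # Stock 1
    [0, 1, 1, 0, 0, 1, 1],  # Stock 2
    [0, 0, 0, 1, 1, 1, 1],  # Stock 3
]
FEATURES = [
    "crowded trade", "M&A rumor", "supplier closed", "labor unionized",
    "ATG industry", "scientific article", "S&P 500 addition",
]


def identify_shock(d, alpha):
    """Return the name of the shocked feature, or None if no feature was shocked.

    Assumes at most one shock of known size alpha and noise much smaller
    than alpha, so each demand d_n can be rounded to 1 (moved) or 0 (did not).
    """
    pattern = [1 if d_n > alpha / 2 else 0 for d_n in d]
    if pattern == [0, 0, 0]:
        return None
    for q, name in enumerate(FEATURES):
        if [row[q] for row in X] == pattern:
            return name
    # Not reachable here: the 7 columns are exactly the 7 nonzero 0/1 patterns.
    raise ValueError("demand pattern matches no feature")


print(identify_shock([1.02, -0.03, 0.01], alpha=1.0))  # crowded trade
print(identify_shock([0.98, 1.01, 0.97], alpha=1.0))   # S&P 500 addition
```

Three observations suffice because the 7 columns of $X$ are all distinct and nonzero, so each single-feature shock leaves its own demand pattern.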
Key Insight: The inference problem changes character at $N^\star = 3$.
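First, imagine you’ve seen $N = 4$ observations:
$$\underbrace{\begin{bmatrix} \alpha \\ 0 \\ 0 \\ \alpha \end{bmatrix}}_{(d)_{4 \times 1}} \approx \underbrace{\begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}}_{(X)_{4 \times 7}} \underbrace{\begin{bmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{bmatrix}}_{(\alpha)_{7 \times 1}} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \end{bmatrix}}_{(\epsilon)_{4 \times 1}}$$
The estimate of $\alpha$ is now $(d_1 + d_4)/2 \approx \alpha \pm \sigma_\epsilon / \sqrt{2}$.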
Now, imagine you’ve instead seen only $N = 2$ observations:
$$\underbrace{\begin{bmatrix} \alpha \\ 0 \end{bmatrix}}_{(d)_{2 \times 1}} \approx \underbrace{\begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \end{bmatrix}}_{(X)_{2 \times 7}} \underbrace{\begin{bmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{bmatrix}}_{(\alpha)_{7 \times 1}} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix}}_{(\epsilon)_{2 \times 1}}$$
The shock could be to either crowded trade or ATG industry. How to value the 3rd asset?
$$x_3 = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}$$
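The ambiguity with only two observations can be checked directly. The sketch below reuses the exposure matrix of the opening example; the helper name `candidates` is hypothetical. It lists every feature whose exposure column is consistent with the observed 0/1 demand pattern.

```python
# Exposure matrix from the opening example (rows = stocks 1..3, columns = features 1..7).
X = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
FEATURES = [
    "crowded trade", "M&A rumor", "supplier closed", "labor unionized",
    "ATG industry", "scientific article", "S&P 500 addition",
]


def candidates(pattern, n_obs):
    """Features whose exposures in the first n_obs stocks equal the demand pattern."""
    return [
        FEATURES[q]
        for q in range(len(FEATURES))
        if [X[n][q] for n in range(n_obs)] == pattern
    ]


two_obs = candidates([1, 0], n_obs=2)       # only stocks 1 and 2 observed
three_obs = candidates([1, 0, 0], n_obs=3)  # all three stocks observed

# Demand for stock 3 (in units of alpha) implied by each two-observation candidate.
implied_d3 = {name: X[2][FEATURES.index(name)] for name in two_obs}

print(two_obs)     # ['crowded trade', 'ATG industry']
print(three_obs)   # ['crowded trade']
print(implied_d3)  # {'crowded trade': 0, 'ATG industry': 1}
```

With two observations the two surviving candidates imply different demand for the third stock (0 versus $\alpha$), which is why the third asset cannot be valued.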
This is a stylized example, but... the problem scales! Suppose $Q = 400$, $K = 5$, and $x_{n,q} \overset{\text{iid}}{\sim} \mathrm{N}(0, 1)$:
$$d_n = \tilde{d}_n - \mathrm{E}[\tilde{d}_n \mid f] = \sum_{q=1}^{400} \alpha_q \cdot x_{n,q} + \epsilon_n$$
[Figure: three panels (Bonferroni Threshold, FDR Threshold, LASSO) plotting $\frac{1}{25} \cdot \big( \sum_{q=1}^{400} 1\{\alpha_q \neq \hat{\alpha}_q\} \big)^2$ on a 0.00 to 1.00 scale against $\log(N)$ from 3 to 6; each panel marks $N^\star \approx 22$.]
1) Derive feature selection bound
2) Embed in equilibrium asset-pricing model
3) Outline empirical predictions:
◮ Noise trader and feature selection risks are substitutes.
◮ Derivatives more informative than Arrow securities.
Slogan: There are fundamental limits on how quickly even the most sophisticated trader can interpret market signals.
Sparse B.R.: Gabaix (2012); Compressed Sensing: Candes, Romberg, and Tao (2004); Candes and Tao (2005); Donoho (2006); Cogn. Control: Chinco (2014); High-D. Inference: Chinco and Clark-Joseph (2014); Info-Based Asset Pricing: Grossman and Stiglitz (1980); Kyle (1985); Veldkamp (2006); Behavioral Finance: Barberis, Shleifer, and Wurgler (2005); Garleanu and Pedersen (2012).
Consider sequences of Kyle (1985)-type markets where:
$$\lim_{N \to \infty} Q_N, K_N = \infty, \qquad N \geq K_N, \qquad \lim_{N \to \infty} K_N / Q_N = 0$$
Agents must use a feature selection rule, $\phi(d, X)$, to identify shocks:
$$\phi : \mathbb{R}^N \times \mathbb{R}^{N \times Q} \mapsto \mathbb{R}^Q$$
where $\mathrm{FSE}[\phi]$ is the probability that $\phi$ identifies the wrong features.
Proposition (Feature Selection Bound) If there exists some constant $C > 0$ such that:
$$N < C \times K_N \cdot \log(Q_N / K_N)$$
as $N \to \infty$, then there exists some constant $c > 0$ such that:
$$\min_{\phi \in \Phi} \mathrm{FSE}[\phi] > c$$
$N^\star(Q, K) \asymp K \cdot \log(Q / K)$ is the feature selection bound.
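As a quick arithmetic check, the order of the bound can be evaluated for the simulated case $Q = 400$, $K = 5$ from the earlier slide. The sketch assumes the natural logarithm and ignores the unspecified proportionality constant in $\asymp$; the function name is hypothetical.

```python
import math


def feature_selection_bound(Q, K):
    """Order of the feature selection bound, K * log(Q / K), using the natural log.

    The proportionality constant hidden in the asymptotic statement is
    ignored, so this is only an order-of-magnitude figure.
    """
    return K * math.log(Q / K)


n_star = feature_selection_bound(Q=400, K=5)
print(round(n_star, 1))  # 21.9
```

This is in line with the $N^\star \approx 22$ marked in the simulation panels.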
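Static Kyle (1985)-type model with $N$ assets. $N$ informed traders each get a private signal about the value of a single asset. A single market maker (MM) views aggregate demand for the $N$ assets:
$$\hat{\alpha} = \arg\min_{\hat{\alpha} \in \mathbb{R}^Q} \left\{ \left\| X \hat{\alpha} - (1/\theta) \cdot d \right\|_2^2 + \gamma \cdot \| \hat{\alpha} \|_1 \right\}$$
◮ Informed trader demand rule: $y_n = \theta \cdot v_n$
◮ Market maker pricing rule: $p_n = \lambda \cdot d_n$
Proposition (Equilibrium Using the LASSO) If the MM uses the LASSO and $N > N^\star$, then there exists an equilibrium:
$$\lambda = \frac{1}{2 \cdot \theta} \qquad \text{and} \qquad \theta = \sqrt{C \cdot \frac{K}{N} \cdot \log(Q)} \times \left( \frac{\sigma_z}{\sigma_v} \right)$$
for $C > 0$ and $\gamma = 2 \cdot (\sigma_z / \theta) \cdot \sqrt{2 \cdot \log(Q)}$.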
Informed trader expected profit:
$$\sqrt{C/2 \cdot K/N \cdot \log(Q)} \times \sigma_v \cdot \sigma_z$$
Question: What is the exchange rate between feature count and noise trader demand volatility that leaves informed traders indifferent?
Consider transformations:
$$Q \mapsto Q' = Q \cdot (1 + \Delta Q) \qquad \text{and} \qquad \sigma_z \mapsto \sigma_z' = \sigma_z \cdot (1 + \Delta \sigma_z)$$
Proposition (Substituting Risks) If $\sigma_z$ decreases by $\Delta \sigma_z < 0$, then informed trader expected profits are unchanged if $Q$ increases by $\Delta Q > 0$:
$$\Delta Q = 2 \cdot \log(Q) \cdot \left( \frac{Q}{\sigma_z} \right) \times -\Delta \sigma_z$$
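The trade-off can be illustrated numerically. The sketch below assumes only that expected profit depends on $Q$ and $\sigma_z$ through $\sqrt{\log(Q)} \cdot \sigma_z$, as in the profit expression above, and it works in relative changes; the function names and the example numbers are hypothetical. It solves exactly for the feature count that offsets a fall in noise trader volatility and compares the result with the first-order approximation, relative change in $Q$ $\approx 2 \cdot \log(Q) \times$ (minus the relative change in $\sigma_z$).

```python
import math


def profit_scale(Q, sigma_z):
    """The part of expected profit that varies with Q and sigma_z (an assumption)."""
    return math.sqrt(math.log(Q)) * sigma_z


def indifferent_Q(Q, rel_change_sigma_z):
    """Feature count Q' that keeps profit_scale fixed when sigma_z is
    multiplied by (1 + rel_change_sigma_z).

    Solves sqrt(log Q') * (1 + change) = sqrt(log Q) for Q'.
    """
    return math.exp(math.log(Q) / (1.0 + rel_change_sigma_z) ** 2)


Q, sigma_z, change = 400, 1.0, -0.01  # illustrative values: a 1% fall in volatility
Q_new = indifferent_Q(Q, change)

exact_rel_change = Q_new / Q - 1.0
first_order = 2.0 * math.log(Q) * (-change)

print(round(Q_new, 1))
print(round(exact_rel_change, 3), round(first_order, 3))
```

The two relative changes are close for a small fall in volatility and drift apart as the change grows, as expected for a first-order statement.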
Question: What kind of asset reveals shocks using fewest obs.?
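Could look at Arrow securities:
$$\begin{bmatrix} d^{(A)}_1 \\ d^{(A)}_2 \\ d^{(A)}_3 \\ \vdots \\ d^{(A)}_Q \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}}_{X^{(A)}} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \vdots \\ \alpha_Q \end{bmatrix} + \text{“Noise”}$$
... but this is overkill!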
Could also look at $N$ derivatives constructed by financial engineering from the $Q$ Arrow securities:
$$\underbrace{X}_{N \times Q} = \underbrace{D}_{N \times Q} \, \underbrace{X^{(A)}}_{Q \times Q}$$
Can’t have independent exposures to all $Q$ features since $N \ll Q$. e.g., all derivatives must have similar exposure to, say, crowded trade and S&P 500 inclusion.
Key insight: Don’t need complete independence! If any $(2 \cdot K)$ columns of $X$ are linearly independent, then any $K$-sparse signal $\alpha \in \mathbb{R}^Q$ can be reconstructed uniquely from $X \alpha$.
Why? Suppose not, i.e., there exist distinct $K$-sparse $\alpha, \alpha' \in \mathbb{R}^Q$ with $X \alpha = X \alpha'$. This implies $X (\alpha - \alpha') = 0$, which is a contradiction: $\alpha - \alpha'$ is at most $(2 \cdot K)$-sparse, and there can’t be linear dependence between $(2 \cdot K)$ columns of $X$ by assumption.
Proposition (Seemingly Redundant Assets) If $N \geq N^\star(Q, K)$, then a MM studying derivatives using the LASSO can identify $K$-sparse shocks with probability greater than $1 - C_1 \cdot e^{-C_2 \cdot K}$ using:
$$\Theta[K / Q \cdot \log(Q / K)]$$
times fewer assets than a MM studying Arrow securities, with $C_1, C_2 > 0$.
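The linear-independence condition can be checked by brute force on a small case. The sketch below borrows the $3 \times 7$ binary exposure matrix from the opening example purely as an illustration, with $K = 1$ so that $2 \cdot K = 2$; the helper names are hypothetical. Two vectors are linearly independent exactly when some $2 \times 2$ minor is nonzero.

```python
from itertools import combinations

# Exposure matrix from the opening example (rows = stocks, columns = features).
X = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
N_FEATURES = 7


def column(q, n_obs):
    """Column q of X restricted to the first n_obs observations."""
    return [X[n][q] for n in range(n_obs)]


def independent(u, v):
    """True if u and v are linearly independent (some 2x2 minor is nonzero)."""
    return any(
        u[i] * v[j] - u[j] * v[i] != 0
        for i, j in combinations(range(len(u)), 2)
    )


def every_pair_independent(n_obs):
    """Check the K = 1 condition: any 2 columns are linearly independent."""
    return all(
        independent(column(p, n_obs), column(q, n_obs))
        for p, q in combinations(range(N_FEATURES), 2)
    )


print(every_pair_independent(3))  # True: any 1-sparse shock is identified
print(every_pair_independent(2))  # False: e.g. crowded trade and ATG columns coincide
```

With all three observations every pair of columns is independent, so a single shock is pinned down uniquely; with only the first two observations the condition fails, matching the $N = 2$ ambiguity earlier in the talk.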
Thanks!