Principal Component Analysis of High Frequency Data


  1. Principal Component Analysis of High Frequency Data
Yacine Aït-Sahalia (Department of Economics, Princeton University) and Dacheng Xiu (Booth School of Business, University of Chicago)
FERM 2014, Central University of Finance and Economics, June 28, 2014
Outline: Motivation, Model Setup, Inference, Simulations, Empirical Work, Conclusion

  2. Motivation
◮ Principal component analysis (PCA) is one of the oldest and most popular techniques for multivariate analysis.
  1. Pearson (1901, Philosophical Magazine)
  2. Hotelling (1933, J. Educ. Psych.)
◮ PCA is a dimension-reduction technique that seeks to describe the multivariate structure of the data.
◮ The central idea is to identify a small number of factors that effectively summarize the variation in the data.

  3. Statistical Inference on PCA
◮ Estimating the eigenvalues of the sample covariance matrix is the key step of PCA.
◮ Anderson (1963, AOS) studies the statistical inference problem for these eigenvalues and finds that
$$\sqrt{n}\,(\hat{\lambda} - \lambda) \xrightarrow{\;d\;} N\!\left(0,\; 2\,\mathrm{Diag}\!\left(\lambda_1^2, \lambda_2^2, \ldots, \lambda_d^2\right)\right),$$
where $\hat{\lambda}$ and $\lambda$ are the vectors of eigenvalues of the sample and population covariance matrices, and $\lambda$ is simple (all eigenvalues distinct).
◮ When eigenvalues are repeated, the asymptotic theory is considerably more complicated to use.
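As a quick numerical illustration of this limit (not part of the slides), the following Python sketch simulates i.i.d. normal data with distinct population eigenvalues and compares the empirical variance of $\sqrt{n}(\hat{\lambda} - \lambda)$ with $2\lambda^2$; the dimension, sample size, and eigenvalues are arbitrary choices.

```python
# Monte Carlo check of Anderson's (1963) limit for simple eigenvalues:
# sqrt(n)*(lambda_hat - lambda) should be roughly N(0, 2*lambda^2) coordinate-wise.
# d, n, n_sims, and the eigenvalues below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_sims = 3, 2000, 500
lam = np.array([4.0, 2.0, 1.0])                  # distinct population eigenvalues

errs = np.empty((n_sims, d))
for s in range(n_sims):
    X = rng.normal(size=(n, d)) * np.sqrt(lam)   # columns independent, cov = diag(lam)
    lam_hat = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    errs[s] = np.sqrt(n) * (lam_hat - lam)

print("empirical variances  :", errs.var(axis=0))
print("theoretical 2*lam^2  :", 2 * lam**2)
```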

  4. Drawbacks
There are at least two obvious drawbacks of the classical asymptotic theory.
◮ The derivation requires i.i.d. sampling and multivariate normality.
  ◮ Extensions to non-normality or time-series data are possible, e.g. Waternaux (1976, AOS), Tyler (1983, AOS), Stock and Watson (1998, WP), etc.
◮ The second drawback is the curse of dimensionality.
  ◮ It is well known that when $n/d \to C \ge 1$,
$$\hat{\lambda}_1 \xrightarrow{\;a.s.\;} \left(1 + C^{-1/2}\right)^2,$$
even though the true eigenvalue is 1; see e.g. Geman (1980, AOP), Bai (1999, Statist. Sinica), and Johnstone (2001, AOS).
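This bias is easy to reproduce numerically. The sketch below (an illustration, not from the slides) draws i.i.d. standard normal data with identity covariance and shows the largest sample eigenvalue settling near $(1 + C^{-1/2})^2$ rather than 1; the particular $C$ and $d$ are arbitrary.

```python
# High-dimensional bias of the largest sample eigenvalue under identity covariance.
# C and d below are illustrative assumptions for the demo.
import numpy as np

rng = np.random.default_rng(1)
C = 2.0                          # target ratio n/d
d = 500
n = int(C * d)

X = rng.normal(size=(n, d))                      # true covariance = identity
lam1 = np.linalg.eigvalsh(X.T @ X / n).max()     # largest sample eigenvalue

print("largest sample eigenvalue    :", lam1)
print("Geman limit (1 + C^{-1/2})^2 :", (1 + C**-0.5)**2)
```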

  5. Limited Applications
◮ An ETF that tracks the Dow Jones index holds d = 30 stocks. Its covariance matrix has 465 parameters if no additional structure is imposed.
◮ It is very demanding to conduct nonparametric inference with a limited amount of data; years of daily data are required.
◮ Moreover, stock returns exhibit time-varying volatility and heavy tails, which deviate substantially from i.i.d. normality.
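For reference, the 465 figure is simply the number of free entries in a symmetric 30 × 30 matrix:
$$\frac{d(d+1)}{2} = \frac{30 \times 31}{2} = 465.$$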

  6. Why Is Using High Frequency Data Better?
◮ The dimension does not need to be "large."
  ◮ The large amount of data eases the curse of dimensionality, to the extent that asymptotic results with the dimension held fixed may serve as rather good approximations.
  ◮ E.g., on a typical trading day we have at least 78 observations of 5-minute returns (6.5 trading hours × 12 intervals per hour).
◮ The time span does not need to be "long."
  ◮ Fixed time span [0, T], with T = 1 day or 1 month.
  ◮ The in-fill asymptotic framework enables nonparametric analysis of general continuous-time stochastic processes.
◮ Instead of estimating "expectations," we measure "realizations."

  7. Main Contribution
◮ We define the concept of (realized) PCA for data sampled from a continuous-time stochastic process within a fixed time window.
◮ We develop asymptotic theory for spectral functions, eigenvalues, eigenvectors, and principal components, under general nonparametric models, using intraday data.
◮ Empirically, we use this new technique to analyze the constituents of the Dow Jones Index and document a factor structure within a short window.

  8. PCA and Factor Models
◮ Applications in finance and macro: Ross (1976, JET), Stock and Watson (2002, JBES).
◮ Classic PCA and factor analysis: Hotelling (1933, J. Educ. Psych.), Thomson (1934, J. Educ. Psych.), Anderson and Amemiya (1988, AOS).
◮ Large-d setting: Chamberlain and Rothschild (1983, ECMA), Connor and Korajczyk (1998, JFE), Stock and Watson (2002, JASA), Bai and Ng (2002, ECMA), Bai (2003, ECMA), Forni, Reichlin, Hallin, and Lippi (1999, Rev. Econ. Stat.), Lam and Yao (2012, AOS).

  9. A Large Literature on High Frequency Data
$$dX_t = \mu_t\,dt + \sigma_t\,dW_t + dJ_t$$
◮ QV and components of QV, e.g. $\int_0^t \sigma_u^2\,du$, $\sum_{u \le t} (\Delta J_u)^2$.
◮ Covariance, e.g. $\int_0^t \sigma_u \sigma_u^{\intercal}\,du$.
◮ Downside risk, e.g. $\sum_{u \le t} (\Delta J_u)^2\,\mathbf{1}_{\{\Delta J_u < 0\}}$.
◮ Skewness, $\sum_{u \le t} (\Delta J_u)^3$.
◮ Other nonlinear functionals of volatility: $\int_0^t \sigma_u^4\,du$, $\int_0^t e^{-x \sigma_u^2}\,du$, $\int_0^t \mathbf{1}_{\{\sigma_u^2 \le x\}}\,du$.
◮ Testing for jumps, estimation of jump activity and tails.
◮ Robustness to microstructure noise, asynchronous trading, endogenous trading times.
◮ Leverage effect.
◮ From realized to spot variance: $\sigma_u \sigma_u^{\intercal}$.

  10. Spot Variance Related Papers
◮ Jacod and Rosenbaum (2013, AOS): $\int_0^t g(\sigma_s \sigma_s^{\intercal})\,ds$, e.g. $\int_0^t \sigma_s^4\,ds$.
◮ Fixed T, fixed d: eigenvalue-related problems.
  ◮ Tests of rank: Jacod, Lejay, and Talay (2008, Bernoulli), Jacod and Podolskij (2013, AOS).
◮ Fixed T, large d:
  ◮ High-dimensional covariance matrix estimation with high frequency data: Wang and Zou (2010, JASA), Tao, Wang, and Zhou (2013, AOS), Tao, Wang, and Chen (2013, ET), and Tao, Wang, Yao, and Zou (2011, JASA).
  ◮ Spectral distribution of the realized covariance matrix: Zheng and Li (2011, AOS).

  11. Classical PCA
Suppose R is a d-dimensional vector-valued random variable. The first principal component is the linear combination of R, $\gamma_1^{\intercal} R$, that maximizes its variance. The weight $\gamma_1$ solves the following optimization problem:
$$\max_{\gamma_1}\ \gamma_1^{\intercal} c\,\gamma_1, \quad \text{subject to } \gamma_1^{\intercal}\gamma_1 = 1,$$
where $c = \mathrm{cov}(R)$. Using a Lagrange multiplier, the problem is to maximize
$$\gamma_1^{\intercal} c\,\gamma_1 - \lambda_1\left(\gamma_1^{\intercal}\gamma_1 - 1\right),$$
which yields $c\,\gamma_1 = \lambda_1 \gamma_1$ and $\gamma_1^{\intercal} c\,\gamma_1 = \lambda_1$.

  12. Classical PCA (continued)
◮ Therefore, $\lambda_1$ is the largest eigenvalue of the population covariance matrix c, and $\gamma_1$ is the corresponding eigenvector.
◮ The second principal component solves the following optimization problem:
$$\max_{\gamma_2}\ \gamma_2^{\intercal} c\,\gamma_2, \quad \text{subject to } \gamma_2^{\intercal}\gamma_2 = 1 \ \text{ and } \ \mathrm{cov}(\gamma_1^{\intercal} R,\, \gamma_2^{\intercal} R) = 0.$$
It turns out that the solution $\gamma_2$ corresponds to the second largest eigenvalue $\lambda_2$.
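The following Python sketch (illustrative only; the toy data and dimensions are arbitrary) makes the last two slides concrete: the principal component weights come out of the eigendecomposition of the covariance matrix, the maximized variance of the first component equals the largest eigenvalue, and the first two components are uncorrelated.

```python
# Classical PCA via the eigendecomposition of the covariance matrix.
# The synthetic "returns" R and their dimensions are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(2)
R = rng.normal(size=(1000, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])   # toy data

c = np.cov(R, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(c)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
lam, gamma = eigvals[order], eigvecs[:, order]

# gamma[:, 0] solves max_g g' c g subject to g'g = 1, with optimum lam[0];
# gamma[:, 1] solves the same problem under the extra constraint
# cov(gamma[:,0]' R, gamma[:,1]' R) = 0, with optimum lam[1].
print("first two eigenvalues :", lam[:2])
print("gamma_1' c gamma_1    :", gamma[:, 0] @ c @ gamma[:, 0])
print("cross-covariance      :", gamma[:, 0] @ c @ gamma[:, 1])
```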

  13. Continuous-Time Model
We consider a d-dimensional Itô semimartingale, defined on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \ge 0}, \mathbb{P})$, with the following representation:
$$X_t = X_0 + \int_0^t b_s\,ds + \int_0^t \sigma_s\,dW_s + J_t,$$
and, writing $c_t = (\sigma\sigma^{\intercal})_t$,
$$c_t = c_0 + \int_0^t \tilde{b}_s\,ds + \int_0^t \tilde{\sigma}_s\,d\tilde{W}_s + \tilde{J}_t,$$
where W is a d-dimensional Brownian motion, $\tilde{W}$ is another Brownian motion, possibly correlated with W, and $J_t$ and $\tilde{J}_t$ are price and volatility jumps.
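A minimal Euler-discretization sketch of this model class, under simplifying assumptions (d = 2, zero drift, no price or volatility jumps, and an illustrative square-root variance process standing in for the generic dynamics of $c_t$), might look as follows; none of the numerical choices come from the paper, and the correlation between W and $\tilde{W}$ (the leverage effect) is omitted.

```python
# Euler-scheme sketch of a 2-dimensional Ito semimartingale with stochastic
# spot covariance c_t = sigma_t sigma_t'. All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(4)
d, n = 2, 23400
dt = 1.0 / n                        # normalize the day to [0, 1]

X = np.zeros((n + 1, d))            # log-price paths
v = np.array([0.04, 0.09])          # initial variances (assumption)
theta = np.array([0.04, 0.09])      # long-run variances (assumption)
rho = 0.5                           # constant cross-asset correlation (assumption)

for i in range(n):
    # Heston-type variance dynamics, floored to stay positive after discretization.
    v = np.maximum(v + 5.0 * (theta - v) * dt
                   + 0.3 * np.sqrt(v) * rng.normal(size=d) * np.sqrt(dt), 1e-8)
    c = np.diag(np.sqrt(v)) @ np.array([[1.0, rho], [rho, 1.0]]) @ np.diag(np.sqrt(v))
    sigma = np.linalg.cholesky(c)                       # spot volatility matrix
    X[i + 1] = X[i] + sigma @ rng.normal(size=d) * np.sqrt(dt)
```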

  14. Principal Component Analysis
How do we introduce PCA in this setting?
◮ Instead of maximizing the variance, we maximize the continuous component of the quadratic variation.
◮ Theorem: There exists a sequence $\{\lambda_{g,s}, \gamma_{g,s}\}$, $1 \le g \le d$, $0 \le s \le t$, such that
$$c_s \gamma_{g,s} = \lambda_{g,s}\gamma_{g,s}, \quad \gamma_{g,s}^{\intercal}\gamma_{g,s} = 1, \quad \text{and} \quad \gamma_{h,s}^{\intercal} c_s \gamma_{g,s} = 0 \ \ (h \neq g),$$
where $\lambda_{1,s} \ge \lambda_{2,s} \ge \ldots \ge \lambda_{d,s} \ge 0$. Moreover, for any càdlàg, vector-valued adapted process $\gamma_s$ such that $\gamma_s^{\intercal}\gamma_s = 1$ and, for $1 \le h \le g-1$,
$$\left[\int_0^{u} \gamma_{s-}^{\intercal}\,dX_s,\ \int_0^{u} \gamma_{h,s-}^{\intercal}\,dX_s\right]^c = 0, \quad \text{we have} \quad \int_0^u \lambda_{g,s}\,ds \ \ge\ \left[\int_0^{u} \gamma_{s-}^{\intercal}\,dX_s,\ \int_0^{u} \gamma_{s-}^{\intercal}\,dX_s\right]^c.$$

  15. Eigenvalue as a Function
Let's start with integrated eigenvalues, from which we learn the relative importance of the different components.
◮ Lemma: The function $\lambda : \mathcal{M}_d^+ \to \bar{\mathbb{R}}_d^+$ is Lipschitz.
  ◮ $\bar{\mathbb{R}}_d^+$ is the subset of $\mathbb{R}^d$ with ordered, nonnegative coordinates.
  ◮ $\mathcal{M}_d^+$ is the space of nonnegative (positive semidefinite) matrices.
◮ Therefore, $\int_0^t \lambda(c_s)\,ds$ is well-defined.
◮ Moreover, $\lambda_g$, if simple, is a $C^{\infty}$ function; so is its corresponding eigenvector function $\gamma_g$, which is unique up to a sign.

  16. Estimation Strategy
◮ The idea is simple:
  1. Decompose the interval [0, t] into many subintervals.
  2. Estimate $c_s$ within each subinterval using the local sample covariance matrix.
  3. Aggregate the eigenvalues of $\hat{c}_s$, i.e. $\lambda(\hat{c}_s)$.
◮ Clearly, we need some handle on the derivatives of $\lambda(\cdot)$ with respect to a matrix, since the estimation error depends on the smoothness of $\lambda(\cdot)$.
◮ Then, how do we handle repeated eigenvalues, given that they are only Lipschitz?
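A minimal numerical sketch of this three-step strategy follows, under simplifying assumptions: no jumps, no microstructure noise, a toy one-factor spot covariance, an arbitrary block size, and none of the bias corrections a formal treatment would require.

```python
# Three-step sketch: (1) split [0, 1] into blocks, (2) estimate the spot covariance
# on each block, (3) aggregate the local eigenvalues into integrated eigenvalues.
# The data-generating process and block size k_n are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 23400                 # e.g. 1-second returns over a 6.5-hour day
dt = 1.0 / n                    # normalize the day to [0, 1]
k_n = 390                       # observations per local block (assumption)

# Toy spot covariance: one "market" factor with time-varying volatility.
times = np.arange(n) * dt
beta = np.ones(d)
factor_vol = 0.2 * (1 + 0.5 * np.cos(2 * np.pi * times))
idio_vol = 0.1

# Simulate the continuous log-price increments dX_t = sigma_t dW_t (no jumps).
dW_factor = rng.normal(size=n) * np.sqrt(dt)
dW_idio = rng.normal(size=(n, d)) * np.sqrt(dt)
dX = np.outer(factor_vol * dW_factor, beta) + idio_vol * dW_idio

n_blocks = n // k_n
integrated_eigs = np.zeros(d)
for b in range(n_blocks):
    block = dX[b * k_n:(b + 1) * k_n]
    c_hat = block.T @ block / (k_n * dt)        # local spot covariance estimate
    lam_hat = np.sort(np.linalg.eigvalsh(c_hat))[::-1]
    integrated_eigs += lam_hat * (k_n * dt)     # Riemann-sum aggregation

# True integrated largest eigenvalue: int_0^1 (|beta|^2 factor_vol^2 + idio_vol^2) dt
true_lam1 = np.mean(d * factor_vol**2 + idio_vol**2)
print("estimated integrated eigenvalues            :", integrated_eigs)
print("true integrated largest eigenvalue (approx.):", true_lam1)
```

The block size governs the usual trade-off: longer blocks reduce the noise in each local covariance estimate but smear out the time variation of $c_s$.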
