Estimation with Infinite Dimensional Kernel Exponential Families


1. Estimation with Infinite Dimensional Kernel Exponential Families. Kenji Fukumizu, The Institute of Statistical Mathematics. Joint work with Bharath Sriperumbudur (Penn State U), Arthur Gretton (UCL), Aapo Hyvärinen (U Helsinki), and Revant Kumar (Georgia Tech). IGAIA IV, June 12-17, 2016, Liblice, Czech Republic.

2. Introduction

3. Infinite dimensional exponential family
■ (Finite dim.) exponential family: p_θ(x) = exp( Σ_{k=1}^m θ_k T_k(x) − A(θ) ) q_0(x).
■ Infinite dimensional extension? p_f(x) = exp( f(x) − A(f) ) q_0(x), where A(f) := log ∫ e^{f(x)} q_0(x) dx and f is a natural parameter in an infinite dimensional function class.
– Maximal exponential model (Pistone & Sempi, AoS 1995):
• An Orlicz space (Banach space) is used.
• Estimation is not at all obvious: the "empirical" mean parameter cannot be defined.

4. ■ Kernel exponential manifold (Fukumizu 2009; Canu & Smola 2005): a reproducing kernel Hilbert space is used.
• p_f(x) = exp( ⟨f, k(·, x)⟩ − A(f) ) q_0(x), where f is an infinite dimensional parameter and k(·, x) is the sufficient statistic.
• Empirical estimation is possible.
– Mean parameter: m_f = E_{p_f}[ k(·, X) ].
– Maximum likelihood estimator of the mean parameter: m̂_n = (1/n) Σ_{i=1}^n k(·, X_i).
• A manifold structure can be defined (Fukumizu 2009).
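To make the empirical mean parameter concrete, here is a minimal Python sketch (not from the original slides; the Gaussian kernel, bandwidth, and sample are illustrative choices) that evaluates m̂_n(x) = (1/n) Σ_i k(x, X_i) at a query point:

import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    # Gaussian kernel k(x, y) = exp(-|x - y|^2 / (2 sigma^2)) on R
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=200)      # i.i.d. sample X_1, ..., X_n

def mean_parameter(x, sigma=1.0):
    # empirical mean parameter m_hat(x) = (1/n) sum_i k(x, X_i)
    return np.mean(gauss_kernel(x, X, sigma))

print(mean_parameter(0.0), mean_parameter(3.0))   # larger where the data are dense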

5. Problems in estimation
■ Normalization constant / partition function
– Even in the finite dim. case, A(θ) := log ∫ exp( Σ_{k=1}^m θ_k T_k(x) ) q_0(x) dx is not easy to compute.
– MLE: converting the mean parameter to the natural parameter requires solving ∂A(θ)/∂θ = (1/n) Σ_{i=1}^n T(X_i).
– It is even more difficult for an infinite dimensional exponential family.
■ This talk → score matching (Hyvärinen, JMLR 2005)
– An estimation method that avoids the normalization constant.
– Introducing a new method for (unnormalized) density estimation.

6. Score Matching

7. Score matching for exponential family (Hyvärinen, JMLR 2005)
■ Fisher divergence: for two p.d.f.'s p, q on Ω = Π_{a=1}^d (L_a, U_a), with L_a, U_a ∈ ℝ ∪ {±∞},
J(p‖q) := (1/2) ∫ Σ_{a=1}^d ( ∂ log p(x)/∂x_a − ∂ log q(x)/∂x_a )² p(x) dx.
– J(p‖q) ≥ 0, and equality holds iff p = q (under mild conditions).
– The derivative is taken w.r.t. x, not w.r.t. the parameter.
• For a location parameter, p(x) = f(x − θ), so ∂ log p(x)/∂x_a = − ∂ log p(x)/∂θ_a, and J(p‖q) is the squared L²-distance between the Fisher scores.
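As a quick numerical illustration (my own Python sketch, not from the slides): for one-dimensional Gaussians the score is ∂ log p(x)/∂x = −(x − μ)/σ², so the Fisher divergence can be estimated by Monte Carlo, and it vanishes exactly when the two densities coincide.

import numpy as np

def fisher_divergence_mc(mu_p, s_p, mu_q, s_q, n=200000, seed=0):
    # Monte Carlo estimate of J(p||q) = 0.5 E_p[(d/dx log p - d/dx log q)^2]
    # for 1-d Gaussians p = N(mu_p, s_p^2), q = N(mu_q, s_q^2).
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_p, s_p, size=n)             # samples from p
    score_p = -(x - mu_p) / s_p ** 2              # d/dx log p(x)
    score_q = -(x - mu_q) / s_q ** 2              # d/dx log q(x)
    return 0.5 * np.mean((score_p - score_q) ** 2)

print(fisher_divergence_mc(0.0, 1.0, 0.0, 1.0))   # ~ 0   (p = q)
print(fisher_divergence_mc(0.0, 1.0, 1.0, 1.0))   # ~ 0.5 (unit location shift)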

8. Set p = p_0 (the true density) and q = p_θ (the model to be estimated).
J(θ) := J(p_0‖p_θ) = (1/2) ∫ Σ_{a=1}^d ( ∂ log p_θ(x)/∂x_a − ∂ log p_0(x)/∂x_a )² p_0(x) dx
≡ J̃(θ) + const, where
J̃(θ) = (1/2) ∫ Σ_{a=1}^d ( ∂ log p_θ(x)/∂x_a )² p_0(x) dx + ∫ Σ_{a=1}^d ( ∂² log p_θ(x)/∂x_a² ) p_0(x) dx.
• Assume lim_{x_a → L_a or U_a} ( ∂ log p_θ(x)/∂x_a ) p_0(x) = 0 and use integration by parts:
∫ ( ∂² log p_θ(x)/∂x_a² ) p_0(x) dx = [ ( ∂ log p_θ(x)/∂x_a ) p_0(x) ]_{x_a = L_a}^{U_a} − ∫ ( ∂ log p_θ(x)/∂x_a )( ∂ log p_0(x)/∂x_a ) p_0(x) dx.
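A small numerical check of this step (my own Python sketch; p_0 = N(0, 1) and the location model p_θ = N(θ, 1) are chosen purely for illustration): the direct form of J(θ) and the integrated-by-parts form should differ only by a constant that does not depend on θ.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=500000)             # samples from the true p_0 = N(0, 1)

def J_direct(theta):
    # 0.5 E_p0[(d/dx log p_theta - d/dx log p_0)^2] for p_theta = N(theta, 1)
    return 0.5 * np.mean((-(x - theta) - (-x)) ** 2)

def J_ibp(theta):
    # integration-by-parts form: E_p0[0.5 (d/dx log p_theta)^2 + d^2/dx^2 log p_theta]
    return np.mean(0.5 * (x - theta) ** 2 - 1.0)

for theta in [-1.0, 0.0, 0.5, 2.0]:
    print(theta, J_direct(theta) - J_ibp(theta))  # ~ 0.5 for every theta: the dropped constant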

9. ■ Empirical estimation
J̃(θ) = (1/2) ∫ Σ_{a=1}^d ( ∂ log p_θ(x)/∂x_a )² p_0(x) dx + ∫ Σ_{a=1}^d ( ∂² log p_θ(x)/∂x_a² ) p_0(x) dx.
X_1, …, X_n: i.i.d. sample ~ p_0.
Ĵ_n(θ) = (1/n) Σ_{i=1}^n Σ_{a=1}^d [ (1/2) ( ∂ log p_θ(X_i)/∂x_a )² + ∂² log p_θ(X_i)/∂x_a² ].
θ̂ = argmin_θ Ĵ_n(θ): the score matching estimator.
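For instance (a toy Python sketch of my own, not from the slides), for the location model p_θ = N(θ, 1) the empirical objective reduces to Ĵ_n(θ) = (1/n) Σ_i [ (1/2)(X_i − θ)² − 1 ], so the score matching estimator should coincide with the sample mean; a grid search confirms this.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(1.5, 1.0, size=1000)               # i.i.d. sample from p_0 = N(1.5, 1)

def J_hat(theta):
    # empirical score-matching objective for the model p_theta = N(theta, 1):
    # (1/n) sum_i [ 0.5 (d/dx log p_theta(X_i))^2 + d^2/dx^2 log p_theta(X_i) ]
    return np.mean(0.5 * (X - theta) ** 2 - 1.0)

grid = np.linspace(-2.0, 4.0, 2001)
theta_hat = grid[np.argmin([J_hat(t) for t in grid])]
print(theta_hat, X.mean())                        # the two should nearly coincide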

10. Score matching for exponential family
– For an exponential family p_θ(x) = exp( Σ_k θ_k T_k(x) − A(θ) ) q_0(x),
Ĵ_n(θ) = (1/n) Σ_{i=1}^n Σ_{a=1}^d [ (1/2) ( Σ_{k=1}^m θ_k ∂T_k(X_i)/∂x_a + ∂ log q_0(X_i)/∂x_a )² + Σ_{k=1}^m θ_k ∂²T_k(X_i)/∂x_a² + ∂² log q_0(X_i)/∂x_a² ].
• No need of A(θ)! (the derivatives are w.r.t. x)
• Quadratic form w.r.t. θ → solvable!
• In the Gaussian case, θ̂ coincides with the MLE.
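Because the objective is quadratic, θ̂ solves a small linear system. A minimal Python sketch (my own example: sufficient statistics T(x) = (x, −x²/2) and base measure q_0 ≡ 1, so p_θ is the Gaussian family in natural parameters): writing Ĵ_n(θ) = (1/2) θᵀAθ + bᵀθ + const with A = (1/n) Σ_i T′(X_i) T′(X_i)ᵀ and b = (1/n) Σ_i T″(X_i), the minimizer is θ̂ = −A⁻¹b, and it indeed reproduces the Gaussian MLE.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(2.0, 1.5, size=2000)               # data from N(2, 1.5^2)

# sufficient statistics T(x) = (x, -x^2/2), base density q_0 = 1 (Lebesgue):
# dT/dx = (1, -x),  d^2T/dx^2 = (0, -1)
dT  = np.stack([np.ones_like(X), -X], axis=1)
d2T = np.stack([np.zeros_like(X), -np.ones_like(X)], axis=1)

A = dT.T @ dT / len(X)                            # quadratic part of J_hat
b = d2T.mean(axis=0)                              # linear part of J_hat
theta_hat = np.linalg.solve(A, -b)                # score-matching estimate of theta

# convert natural parameters back: theta = (mu/sigma^2, 1/sigma^2)
sigma2_hat = 1.0 / theta_hat[1]
mu_hat = theta_hat[0] * sigma2_hat
print(mu_hat, sigma2_hat)                         # ~ MLE below
print(X.mean(), X.var())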

11. Kernel Exponential Family

12. Reproducing kernel Hilbert space
– Def. Ω: a set. H: a Hilbert space consisting of functions on Ω. H is a reproducing kernel Hilbert space (RKHS) if for any x ∈ Ω there is k_x ∈ H s.t. ⟨f, k_x⟩ = f(x) for all f ∈ H [reproducing property].
– k(x, y) := k_x(y). k is a positive definite kernel, i.e., k(x, y) = k(y, x) and the Gram matrix ( k(x_i, x_j) )_{ij} is positive semidefinite for any x_1, …, x_n.
– Moore–Aronszajn theorem: for any positive definite kernel on Ω there uniquely exists an RKHS whose reproducing kernel is k(·, x). (One-to-one correspondence between p.d. kernels and RKHSs.)
– Example of a pos. def. kernel on ℝ^d (Gaussian kernel): k(x, y) = exp( −‖x − y‖² / (2σ²) ).
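As a small sanity check (my own Python sketch, not part of the slides), the Gram matrix of the Gaussian kernel on arbitrary points is symmetric and positive semidefinite, which can be verified numerically:

import numpy as np

def gauss_gram(X, sigma=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for rows of X
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))                      # 50 points in R^3
K = gauss_gram(X)

print(np.allclose(K, K.T))                        # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)      # eigenvalues >= 0 up to round-off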

13. Kernel exponential family
Def. k: a pos. def. kernel on Ω = Π_{a=1}^d (L_a, U_a), with L_a, U_a ∈ ℝ ∪ {±∞}. H_k: its RKHS. q_0: a p.d.f. on Ω with supp q_0 = Ω.
F_k := { f ∈ H_k | ∫ e^{f(x)} q_0(x) dx < ∞ }: the (functional) parameter space.
P_k := { p_f : Ω → (0, ∞) | p_f(x) = e^{ f(x) − A(f) } q_0(x), f ∈ F_k }, where A(f) := log ∫ e^{f(x)} q_0(x) dx.
P_k: the kernel exponential family (KEF).
– With a finite dimensional H_k, the KEF reduces to a finite dim. exponential family, e.g. k(x, y) = (1 + xᵀy)² → Gaussian distributions.
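To see what a member of P_k looks like, here is a toy one-dimensional Python sketch (the kernel, centres, coefficients, and base density q_0 = N(0, 2²) are all my own illustrative choices): take f(x) = Σ_m c_m k(x, z_m) ∈ H_k and normalize e^{f} q_0 by quadrature, which is feasible in one dimension.

import numpy as np

sigma = 0.7                                       # kernel bandwidth (illustrative)
centres = np.array([-1.5, 0.0, 2.0])              # z_m
coeffs  = np.array([1.0, -0.5, 1.5])              # c_m

def f(x):
    # f(x) = sum_m c_m k(x, z_m), an element of the RKHS H_k
    return np.sum(coeffs * np.exp(-(x[:, None] - centres) ** 2 / (2 * sigma ** 2)), axis=1)

def q0(x):
    # base density q_0 = N(0, 2^2)
    return np.exp(-x ** 2 / 8.0) / np.sqrt(8.0 * np.pi)

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
unnorm = np.exp(f(x)) * q0(x)                     # e^{f(x)} q_0(x)
A = np.log(np.sum(unnorm) * dx)                   # A(f) by simple quadrature (1-d only)
p_f = np.exp(f(x) - A) * q0(x)                    # member of the kernel exponential family
print(np.sum(p_f) * dx)                           # ~ 1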

14. Score matching for KEF
Assume k is of class C² ( ∂^{α+β} k(x, y)/∂x^α ∂y^β exists and is continuous for α + β ≤ 2 ) and that lim_{x_a → L_a or U_a} ( ∂²k(x, y)/∂x_a ∂y_a |_{y=x} ) p_0(x) = 0 (needed for the integration by parts).
– Score matching objective function:
Ĵ_n(f) := (1/n) Σ_{i=1}^n Σ_{a=1}^d [ (1/2) ( ∂f(X_i)/∂x_a + ∂ log q_0(X_i)/∂x_a )² + ∂²f(X_i)/∂x_a² + ∂² log q_0(X_i)/∂x_a² ].
Note f(X_i) = ⟨f, k(·, X_i)⟩, ∂f(X_i)/∂x_a = ⟨f, ∂k(·, X_i)/∂x_a⟩, and ∂²f(X_i)/∂x_a² = ⟨f, ∂²k(·, X_i)/∂x_a²⟩.
Ĵ_n(f) is a quadratic form w.r.t. f ∈ H_k.

15. – Estimation: the minimizer of Ĵ_n solves Ĉ_n f = −ξ̂_n, where
Ĉ_n := (1/n) Σ_{i=1}^n Σ_{a=1}^d ( ∂k(·, X_i)/∂x_a ) ⊗ ( ∂k(·, X_i)/∂x_a ) : H_k → H_k,
ξ̂_n := (1/n) Σ_{i=1}^n Σ_{a=1}^d [ ( ∂ log q_0(X_i)/∂x_a ) ∂k(·, X_i)/∂x_a + ∂²k(·, X_i)/∂x_a² ] ∈ H_k.
– Regularized estimator:
f̂_n = −( Ĉ_n + λ_n I )^{−1} ξ̂_n, i.e., f̂_n = argmin_{f ∈ H_k} Ĵ_n(f) + (λ_n/2) ‖f‖²_{H_k}.
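Below is a simplified one-dimensional Python sketch of this regularized fit (my own simplification, not the exact closed form on the next slide): instead of the full representer-theorem solution, f is restricted to a fixed finite span f(x) = Σ_m c_m k(x, z_m) with Gaussian kernel centres z_m, which turns Ĵ_n(f) + (λ/2)‖f‖²_{H_k} into a quadratic in the coefficient vector c, solved by one linear system. The kernel bandwidth, base density q_0 = N(0, τ²), centres, and λ are all illustrative choices.

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(1.0, 0.8, size=400)                # 1-d data; target density is N(1, 0.8^2)

sigma, tau, lam = 0.8, 3.0, 1e-2                  # kernel bandwidth, base scale, ridge
Z = np.linspace(-2.0, 4.0, 15)                    # fixed kernel centres z_m

def k(x, z):
    # Gaussian kernel matrix k(x_i, z_m)
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * sigma ** 2))

D = X[:, None] - Z[None, :]
K  = k(X, Z)
D1 = -(D / sigma ** 2) * K                        # d k(X_i, z_m) / dx
D2 = (D ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * K # d^2 k(X_i, z_m) / dx^2
s  = -X / tau ** 2                                # d log q_0(X_i)/dx for q_0 = N(0, tau^2)

n = len(X)
G = D1.T @ D1 / n                                 # quadratic part of J_hat in c
b = (D1.T @ s + D2.sum(axis=0)) / n               # linear part of J_hat in c
Kzz = k(Z, Z)                                     # ||f||^2_{H_k} = c^T Kzz c
c = np.linalg.solve(G + lam * Kzz, -b)            # minimize J_hat(c) + (lam/2) c^T Kzz c

# evaluate the (numerically normalized) estimate against the truth on a grid
x = np.linspace(-4, 6, 2001); dx = x[1] - x[0]
f_hat = k(x, Z) @ c
q0 = np.exp(-x ** 2 / (2 * tau ** 2)) / np.sqrt(2 * np.pi * tau ** 2)
p_hat = np.exp(f_hat) * q0; p_hat /= p_hat.sum() * dx
p_true = np.exp(-(x - 1.0) ** 2 / (2 * 0.8 ** 2)) / np.sqrt(2 * np.pi * 0.8 ** 2)
print(0.5 * np.sum(np.abs(p_hat - p_true)) * dx)  # gap in [0, 1]; small means a good fit
print(x[np.argmax(p_hat)])                        # estimated mode, should be near 1

Normalizing e^{f̂} q_0 on a grid here is only for visual comparison; as slide 17 notes, computing A(f) in general is itself nontrivial.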

16. Explicit solution
– Estimator (from the representer theorem):
f̂_n = α̂ ξ̂_n + Σ_{j=1}^n Σ_{a=1}^d β̂_{ja} ∂k(·, X_j)/∂x_a,
where the coefficients (α̂, β̂) solve a linear system whose entries involve
G_{(j,a),(l,b)} := ⟨ ∂k(·, X_j)/∂x_a, ∂k(·, X_l)/∂x_b ⟩ = ∂²k(X_j, X_l)/∂x_a ∂y_b,
h_{ja} := ⟨ ξ̂_n, ∂k(·, X_j)/∂x_a ⟩ = (1/n) Σ_{i=1}^n Σ_{b=1}^d [ ∂³k(X_i, X_j)/∂x_b² ∂y_a + ( ∂ log q_0(X_i)/∂x_b ) ∂²k(X_i, X_j)/∂x_b ∂y_a ],
and ‖ξ̂_n‖²_{H_k}, which involves up to fourth-order derivatives of k.
• f̂_n can be taken in Span{ ξ̂_n, ∂k(·, X_j)/∂x_a }.
• The estimator is obtained simply by solving a (1 + nd)-dimensional linear equation.

17. Unnormalized p.d.f.
– Score matching for the KEF gives only f̂(x), or e^{f̂(x)} q_0(x): an unnormalized p.d.f.
• Estimation of A(f) := log ∫ e^{f(x)} q_0(x) dx is still nontrivial.
– There are interesting applications.
1) Nonparametric structure learning for graphical models from data (Sun, Kolar, Xu, NIPS 2015):
p(X) ∝ Π_{(i,j)∈E} p_{ij}(X_i, X_j), G = (V, E),
where each factor p_{ij} is estimated nonparametrically with a KEF (with sparse edges).
[Figure: example undirected graph with nodes a, b, c, d, e]
