Kernel Methods for Topological Data Analysis
Kenji Fukumizu, The Institute of Statistical Mathematics (Tokyo, Japan)
Joint work with Genki Kusano and Yasuaki Hiraoka (Tohoku Univ.), supported by JST CREST.
STM2016 at ISM, July 22, 2016
Topological Data Analysis
• TDA: a new methodology for extracting topological and geometrical information from data.
  Key technology: persistence homology (Edelsbrunner et al. 2002; Carlsson 2005)
Background
• Complex data: data with complex geometric structure must be analyzed.
• Progress of computational topology: computing topological invariants has become easy.
TDA: Various applications
Data of highly complex geometric structure, for which it is often difficult to define good feature vectors / descriptors:
• Computer vision: shape signatures, natural image statistics (Freedman & Chen 2009)
• Biochemistry: structural change of proteins, e.g., open / closed (Kovacev-Nikolic et al. 2015)
• Brain science: brain artery trees, e.g., age effect (Bendich et al. 2014)
• Material science: glass / liquid, non-crystalline materials (Nakamura, Hiraoka, Hirata, Escolar, Nishiura, Nanotechnology 26 (2015))
Persistence homology provides a compact representation for such data.
Outline
• A brief introduction to persistence homology
• Statistical approach with kernels to topological data analysis
• Applications
  • Material science
  • Protein classification
• Summary
Topology
[Figure: examples of topologically equivalent (≅) shapes]
Topology: two sets are equivalent if one can be deformed into the other without tearing or attaching.
Topological invariants: quantities that take the same value on any pair of equivalent sets, e.g., the numbers of connected components, rings, and cavities.
[Figure: example shapes with their numbers of connected components, rings, and cavities]
Algebraic Topology
• Algebraic treatment of topological spaces: represent a space by a simplicial complex (a union of simplexes) and compute various topological invariants, e.g., the Euler number, by algebraic operations.
• Classify topological spaces by their topological invariants.
• Homology group: independent "holes"
H_k(X): the k-th homology group of a topological space X (k = 0, 1, 2, ...), representing the k-dimensional holes:
• H_0(X): connected components
• H_1(X): rings
• H_2(X): cavities
[Figure: example spaces with their homology groups, e.g., H_1 ≅ ℤ for one ring and H_1 ≅ ℤ ⊕ ℤ for two independent rings; the generators of the 1st homology group are indicated. Worked examples follow below.]
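For concreteness, two standard examples (textbook facts, not taken from the slide's figure), written as a short math block:

```latex
H_k(S^1) \cong
\begin{cases}
  \mathbb{Z} & k = 0, 1\\
  0          & k \ge 2
\end{cases}
\qquad
H_k(T^2) \cong
\begin{cases}
  \mathbb{Z}                   & k = 0, 2\\
  \mathbb{Z} \oplus \mathbb{Z} & k = 1\\
  0                            & k \ge 3
\end{cases}
```

The circle S^1 has one connected component and one ring; the torus T^2 additionally has one cavity and two independent rings.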
Topology of statistical data?
A noisy finite sample is drawn from a true structure, and the union of ε-balls around the points is used to recover it (cf. manifold learning). Stable extraction of topology is NOT easy:
• ε small: the object is disconnected.
• ε large: the small ring is not visible.
Persistence Homology
• Consider all ε simultaneously: for X = {x_i}_{i=1}^n ⊂ ℝ^d, let X_ε := ∪_{i=1}^n B_ε(x_i).
From ε small to ε large, the two rings (generators of the 1-dim homology) persist over a long interval.
• Persistence homology (formal definition)
𝕏: X_1 ⊂ X_2 ⊂ ⋯ ⊂ X_T, a filtration of topological spaces.
PH_k(𝕏): H_k(X_1) → H_k(X_2) → ⋯ → H_k(X_T) ≅ ⊕_{i=1}^{m_k} I[b_i, d_i]  (irreducible decomposition),
where I[b, d] ≅ 0 → ⋯ → 0 → K → ⋯ → K → 0 → ⋯ → 0 (K: a field), with the first K at X_b (birth) and the last K at X_d (death).
The lifetime (birth, death) of each generator is rigorously defined and can be computed numerically.
[Figure: birth and death of a generator of PH_1(X)]
• Two popular (equivalent) expressions of PH (considered for each dimension):
• Barcode: a bar from the birth to the death of each generator.
• Persistence diagram (PD): the births b and deaths d of the generators plotted as points (b, d) in a 2D graph (d ≥ b).
They serve as handy descriptors or features of complex geometric objects. (A computational sketch follows below.)
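As a concrete illustration (not part of the original slides), a minimal sketch computing a persistence diagram from a point cloud. It assumes the GUDHI Python library rather than the CGAL/PHAT pipeline mentioned later in the talk; the noisy-circle data and all parameter values are illustrative.

```python
import numpy as np
import gudhi

# Sample points from a noisy circle, as in the motivating pictures.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 100)
points = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(100, 2))

# Filtration approximating the union of eps-balls (Vietoris-Rips complex).
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)
diag = st.persistence()  # list of (dimension, (birth, death)) pairs

# 1-dimensional generators = "rings"; one long-lived pair for the circle.
pd1 = [(b, d) for dim, (b, d) in diag if dim == 1]
```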
Beyond topology
• PH contains more than topological information: the birth and death scales of the generators carry geometric information about the object.
[Figure: barcodes of 1-dim PH as ε grows]
Statistical approach with kernels to topological data analysis
Statistical approach to TDA
• Conventional TDA pipeline: data (e.g., molecular dynamics simulation) → computation of PH (software: CGAL / PHAT) → visualization (PD) → analysis by experts.
CGAL: The Computational Geometry Algorithms Library, http://www.cgal.org/
PHAT: Persistent Homology Algorithm Toolbox, https://bitbucket.org/phat-code/phat
• Statistical approach to TDA (Kusano, Fukumizu, Hiraoka ICML 2016; Reininghaus et al. CVPR 2015; Kwitt et al. NIPS 2015; Fasy et al. 2014)
Many data sets → computation of PH → many PDs (PD_1, PD_2, ..., PD_n) → statistical analysis of the PDs. But how? We need features / descriptors of PDs.
Kernel representation of PD
• Vectorization of a PD by a positive definite kernel.
• A PD is a discrete measure: μ_D := Σ_{z ∈ D} δ_z.
• Kernel embedding of PDs into an RKHS (vectorization):
  E_k: μ_D ↦ ∫ k(·, x) dμ_D(x) = Σ_i k(·, x_i) ∈ H_k,
  where k is a positive definite kernel and H_k the corresponding RKHS.
• For some kernels (e.g., Gaussian, Laplace), E_k is injective.
• By vectorization,
  • a number of methods for data analysis can be applied: SVM, regression, PCA, CCA, etc.;
  • tractable computation is possible with the kernel trick (see the sketch below).
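Since E_k(μ_D) = Σ_i k(·, x_i), RKHS inner products between embedded PDs reduce to double sums of kernel values. A minimal sketch, assuming a Gaussian kernel and representing a PD as a sequence of (birth, death) pairs (function names are ours, not from the talk):

```python
import numpy as np

def gauss(x, y, sigma):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def embed_inner(D1, D2, sigma):
    """<E_k(mu_D1), E_k(mu_D2)>_{H_k} = sum_i sum_j k(x_i, y_j)."""
    return sum(gauss(x, y, sigma) for x in D1 for y in D2)
```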
Persistence Weighted Gaussian (PWG) Kernel
Generators close to the diagonal may be noise and should be discounted:
k_PWG(x, y) = w(x) w(y) exp(−‖x − y‖² / (2σ²)),
w(x) = w_{C,p}(x) := arctan(C · pers(x)^p)  (C, p > 0),
pers(x) := d − b for x = (b, d) ∈ {(b, d) ∈ ℝ² | d ≥ b}.
(A sketch follows below.)
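A minimal sketch of the PWG kernel; the default parameter values C = 1 and p = 5 are illustrative only (the stability result on the next slide requires p > d + 1):

```python
import numpy as np

def pwg_weight(x, C=1.0, p=5):
    """w(x) = arctan(C * pers(x)^p), pers(x) = d - b; generators near the
    diagonal (likely noise) receive weights close to 0."""
    b, d = x
    return np.arctan(C * (d - b) ** p)

def pwg_kernel(x, y, sigma, C=1.0, p=5):
    """k_PWG(x, y) = w(x) w(y) exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return pwg_weight(x, C, p) * pwg_weight(y, C, p) * np.exp(-sq / (2 * sigma ** 2))
```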
• Stability with the PWG kernel embedding
PWGK defines a distance on persistence diagrams:
d_k(D_1, D_2) := ‖E_k(D_1) − E_k(D_2)‖_{H_k},  D_1, D_2: persistence diagrams.
Stability Theorem (Kusano, Hiraoka, Fukumizu 2015). Let M be a compact subset of ℝ^d, and let S ⊂ M and T ⊂ ℝ^d be finite sets. If p > d + 1, then with the PWG kernel (p, C, σ),
d_k(D_q(S), D_q(T)) ≤ L · d_H(S, T)  (Lipschitz continuity),
where L is a constant depending only on M, p, d, C, σ; D_q(S) is the q-th persistence diagram of S; and d_H is the Hausdorff distance.
That is, a small change of a point set causes only a small change in the PD. This stability is NOT known for the Gaussian kernel.
2nd-level kernel
Pipeline: data sets → PDs (PD_1, ..., PD_m) → embedding E_k into the RKHS, giving vectors E_k(PD_1), ..., E_k(PD_m) → application of a positive definite kernel on the RKHS → data analysis method.
2nd-level kernel (SVM for measures; Muandet, Fukumizu, Dinuzzo, Schölkopf 2012):
the RKHS-Gaussian kernel K(V_1, V_2) = exp(−‖V_1 − V_2‖²_{H_k} / (2τ²)) derives
K(D_i, D_j) = exp(−‖E_k(D_i) − E_k(D_j)‖²_{H_k} / (2τ²)),  D_i, D_j: persistence diagrams.
(A sketch follows below.)
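A minimal sketch of the 2nd-level Gram matrix via the kernel trick (this costs O(m² N²) kernel evaluations; see the next slide). The helper names are ours; `k` can be, e.g., the PWG kernel sketched above, wrapped as `k = lambda x, y: pwg_kernel(x, y, sigma=0.1)`:

```python
import numpy as np

def pd_inner(D1, D2, k):
    """<E_k(mu_D1), E_k(mu_D2)>_{H_k} = sum_i sum_j k(x_i, y_j)."""
    return sum(k(x, y) for x in D1 for y in D2)

def second_level_gram(pds, k, tau):
    """K(D_i, D_j) = exp(-||E_k(D_i) - E_k(D_j)||^2_{H_k} / (2 tau^2))."""
    m = len(pds)
    G = np.array([[pd_inner(pds[i], pds[j], k) for j in range(m)]
                  for i in range(m)])
    # Squared RKHS distances recovered from the inner products.
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G
    return np.exp(-sq / (2 * tau ** 2))
```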
Computational issue
The number of generators in a PD may be large (≥ 10³–10⁴).
For D_i = Σ_{a=1}^{N_i} δ_{x_a^(i)}, computing K(D_i, D_j) = exp(−‖E_k(D_i) − E_k(D_j)‖²_{H_k} / (2τ²)) requires
‖E_k(D_i) − E_k(D_j)‖²_{H_k} = Σ_{a=1}^{N_i} Σ_{b=1}^{N_i} k(x_a^(i), x_b^(i)) + Σ_{a=1}^{N_j} Σ_{b=1}^{N_j} k(x_a^(j), x_b^(j)) − 2 Σ_{a=1}^{N_i} Σ_{b=1}^{N_j} k(x_a^(i), x_b^(j)).
The number of evaluations of exp(−‖x_a − x_b‖²/(2σ²)) is O(m² N²), where N = max{N_i | i = 1, ..., m} and m is the number of PDs — computationally expensive for N ≈ 10⁴.
• Approximation by random features (Rahimi & Recht 2008)
By Bochner's theorem (Fourier transform),
exp(−‖x_a − x_b‖²/(2σ²)) = c ∫ e^{−√−1 ωᵀ(x_a − x_b)} dQ_σ(ω),  Q_σ: Gaussian distribution with density ∝ e^{−σ²‖ω‖²/2}.
Approximation by sampling ω_1, ..., ω_M i.i.d. ~ Q_σ:
exp(−‖x_a − x_b‖²/(2σ²)) ≈ (c/M) Σ_{ℓ=1}^M e^{−√−1 ω_ℓᵀ x_a} · e^{√−1 ω_ℓᵀ x_b},
so that
Σ_{a=1}^{N_i} Σ_{b=1}^{N_j} w(x_a^(i)) w(x_b^(j)) k(x_a^(i), x_b^(j)) ≈ (c/M) Σ_{ℓ=1}^M (Σ_a w(x_a^(i)) e^{−√−1 ω_ℓᵀ x_a^(i)}) · (Σ_b w(x_b^(j)) e^{√−1 ω_ℓᵀ x_b^(j)}),
i.e., each PD is reduced to an M-dimensional vector. (A sketch follows below.)
Computational cost: the 2nd-level Gram matrix costs O(mMN + m²M), c.f. O(m²N²) — a big reduction if M, m ≪ N.
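A minimal sketch of the random feature approximation of the weighted embedding, using the real form φ(x) = M^{−1/2} [cos(ω_ℓᵀx), sin(ω_ℓᵀx)]_ℓ in place of the complex exponentials (an equivalent, standard variant of Rahimi & Recht 2008; names and parameter values are illustrative):

```python
import numpy as np

def rff_embed(D, w, Omega):
    """Approximate E_k(mu_D) = sum_a w(x_a) k(., x_a) by a 2M-dim vector,
    using features phi(x) = (1/sqrt(M)) [cos(Omega x), sin(Omega x)]."""
    M = Omega.shape[0]
    proj = np.asarray(D, float) @ Omega.T            # (N, M): omega_l^T x_a
    feats = np.c_[np.cos(proj), np.sin(proj)] / np.sqrt(M)
    return np.asarray(w, float) @ feats              # weighted sum over generators

# Shared random frequencies for the Gaussian factor with bandwidth sigma:
# omega_l ~ N(0, sigma^{-2} I), i.i.d., by Bochner's theorem.
sigma, M = 1.0, 500
rng = np.random.default_rng(0)
Omega = rng.normal(scale=1.0 / sigma, size=(M, 2))

# v_i = rff_embed(D_i, [pwg_weight(x) for x in D_i], Omega); then
# np.dot(v_i, v_j) approximates the weighted double sum at O(NM) per diagram.
```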
Comparison: Persistence Scale Space Kernel (Reininghaus et al. 2015)
• PSS kernel:
k_PSS(x, y) = (1/(8πt)) [exp(−‖x − y‖²/(8t)) − exp(−‖x − ȳ‖²/(8t))],
where ȳ := (d, b) for y = (b, d). It is positive definite on {(b, d) | d ≥ b} and is 0 on the diagonal Δ; the embedding E_k(D) is considered. (A sketch follows below.)
• Comparison between PWGK and PSSK:
• PWGK can control the discount around the diagonal independently of the bandwidth parameter.
• PSSK is not shift-invariant, so the random feature approximation is not applicable.
• In Reininghaus et al. 2015, the 2nd-level kernel is not considered.
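For comparison, a minimal sketch of the PSS kernel (the mirrored term makes it vanish on the diagonal; the function name is ours):

```python
import numpy as np

def pss_kernel(x, y, t):
    """k_PSS(x, y) = (1/(8 pi t)) [exp(-||x - y||^2 / (8t))
                                   - exp(-||x - ybar||^2 / (8t))],
    where ybar = (d, b) mirrors y = (b, d) across the diagonal."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    ybar = y[::-1]  # reflect (b, d) -> (d, b)
    return (np.exp(-np.sum((x - y) ** 2) / (8 * t))
            - np.exp(-np.sum((x - ybar) ** 2) / (8 * t))) / (8 * np.pi * t)
```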