Resampling PCA & GP Inference
Manfred Opper (ISIS, University of Southampton)
Motivation
• Construct a "simple" intractable GP model.
• Study approximate (EC/EP) inference.
• "MC" conceptually simple.
• Get a quantitative idea why EC inference works.
Resampling (Bootstrap)
Estimate average-case properties (test errors) of statistical estimators based on a single dataset D_0 = {y_1, y_2, y_3}.
Bootstrap: resample with replacement → generate pseudo data,
D_1 = {y_1, y_2, y_2}, D_2 = {y_1, y_1, y_1}, D_3 = {y_2, y_3, y_3}, ... etc.
Problem: each sample requires retraining of some learning algorithm.
Mapping to probabilistic model & approximate inference: only a single training (inference) for a single (effective) model is required (Malzahn & Opper 2003).
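For reference, a minimal sketch (not from the talk) of the direct bootstrap loop that this mapping avoids: every pseudo dataset is drawn with replacement from D_0 and the learning algorithm is retrained on each draw; fit_estimator and test_error are placeholder callables.

import numpy as np

def bootstrap_test_error(D0, fit_estimator, test_error, B=100, rng=None):
    # D0: array of shape (N, d), the single original data set.
    rng = np.random.default_rng(rng)
    N = len(D0)
    errors = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)          # resample with replacement -> pseudo data
        oob = np.setdiff1d(np.arange(N), idx)     # points left out of this resample
        model = fit_estimator(D0[idx])            # retraining needed for every pseudo data set
        if oob.size > 0:
            errors.append(test_error(model, D0[oob]))
    return float(np.mean(errors))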
PCA
• Goal: project (d-dimensional) data vectors y → P_q[y] onto a q < d dimensional subspace with minimal reconstruction error E ||y − P_q[y]||^2.
• Method: approximate the expectation by N training data D_0, given by the (d × N) matrix Y = (y_1, y_2, ..., y_N), y_i ∈ R^d; d = ∞ is allowed (feature vectors).
• Optimal subspace spanned by the eigenvectors u_l of the data covariance matrix C = \frac{1}{N} Y Y^T corresponding to the q largest eigenvalues λ_l ≥ λ.
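A minimal numerical sketch (assumed, finite d only) of this construction: diagonalize C = Y Y^T / N, keep the eigenvectors whose eigenvalues exceed the threshold λ, and measure the mean squared reconstruction error of the projection.

import numpy as np

def pca_subspace(Y, lam):
    # Y: (d, N) data matrix, one column per data vector y_i.
    d, N = Y.shape
    C = Y @ Y.T / N                        # covariance matrix C = (1/N) Y Y^T
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    U = eigvecs[:, eigvals >= lam]         # keep directions with lambda_l >= lambda
    return eigvals, U

def reconstruction_error(Y, U):
    # Empirical mean of || y - P_q[y] ||^2 for the projection onto span(U).
    residual = Y - U @ (U.T @ Y)
    return float(np.mean(np.sum(residual**2, axis=0)))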
Reconstruction Error
Expected reconstruction error (on novel data):
\varepsilon(\lambda) = \sum_{l:\, \lambda_l < \lambda} E\, ( y \cdot u_l )^2
Resample-averaged reconstruction error:
E_r = \frac{1}{N}\, E_{D}\Big[ \mathrm{Tr} \sum_{y_i \notin D;\ \lambda_l < \lambda} y_i y_i^T\, u_l u_l^T \Big]
(The u_l, λ_l are those of the resampled covariance, i.e. trained on D; the error is evaluated on the points y_i left out of D.)
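A brute-force Monte Carlo estimate of E_r for comparison (an assumed sketch at the standard resampling rate μ = 1, not the method of the talk): for each bootstrap draw, diagonalize the resampled covariance and accumulate the squared projections of the left-out points onto the discarded eigendirections.

import numpy as np

def resampled_reconstruction_error(Y, lam, B=200, rng=None):
    # Y: (d, N) original data; lam: eigenvalue cut-off lambda.
    rng = np.random.default_rng(rng)
    d, N = Y.shape
    total = 0.0
    for _ in range(B):
        s = rng.multinomial(N, np.ones(N) / N)   # s_i = number of times y_i is drawn
        C_res = (Y * s) @ Y.T / N                # covariance of the resampled data
        eigvals, U = np.linalg.eigh(C_res)
        U_low = U[:, eigvals < lam]              # discarded directions with lambda_l < lambda
        proj = U_low.T @ Y[:, s == 0]            # projections (y_i . u_l) of left-out points
        total += np.sum(proj**2) / N
    return total / B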
Bootstrap of density of Eigenvalues
[Figure: density of eigenvalues for N = 50 random data, Dim = 25; 1× and 3× oversampled bootstrap, with the eigenvalue λ on the horizontal axis.]
The model
• Let s_i = number of times y_i ∈ D.
• Diagonal random matrix D_{ii} = D_i = \frac{1}{\mu \Gamma}\,( s_i + \epsilon\, \delta_{s_i,0} ), and C(\epsilon) = \frac{\Gamma}{N}\, Y D Y^T; C(0) ∝ covariance matrix of the resampled data.
• Kernel matrix K = \frac{1}{N}\, Y^T Y.
• Partition function
Z = \int d^N x \, \exp\Big( -\tfrac{1}{2}\, x^T ( K^{-1} + D )\, x \Big)
  = |K|^{1/2}\, \Gamma^{d/2}\, (2\pi)^{(N-d)/2} \int d^d z \, \exp\Big( -\tfrac{1}{2}\, z^T ( C(\epsilon) + \Gamma I )\, z \Big).
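The equality of the two Gaussian representations can be checked numerically via log-determinants. The sketch below assumes the reconstruction above, d > N so that K is invertible, and the standard resampling rate μ = 1; all parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
d, N, mu, Gamma, eps = 30, 20, 1.0, 0.5, 0.3
Y = rng.normal(size=(d, N))                      # data matrix (d x N), d > N so K is invertible
s = rng.multinomial(N, np.ones(N) / N)           # s_i = times y_i appears in the resample
D = (s + eps * (s == 0)) / (mu * Gamma)          # D_i = (s_i + eps*delta_{s_i,0}) / (mu*Gamma)
K = Y.T @ Y / N                                  # kernel matrix K = Y^T Y / N
C_eps = (Gamma / N) * (Y * D) @ Y.T              # C(eps) = (Gamma/N) Y D Y^T

# ln Z from the N-dimensional x-integral
_, logdet_x = np.linalg.slogdet(np.linalg.inv(K) + np.diag(D))
lnZ_x = 0.5 * N * np.log(2 * np.pi) - 0.5 * logdet_x

# ln Z from the d-dimensional z-integral representation
_, logdet_K = np.linalg.slogdet(K)
_, logdet_z = np.linalg.slogdet(C_eps + Gamma * np.eye(d))
lnZ_z = (0.5 * logdet_K + 0.5 * d * np.log(Gamma)
         + 0.5 * N * np.log(2 * np.pi) - 0.5 * logdet_z)

print(lnZ_x, lnZ_z)   # the two values should coincide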
Z as generating function
-2\, \frac{\partial \ln Z}{\partial \epsilon}\Big|_{\epsilon=0} = \frac{1}{\mu N} \sum_{j=1}^{N} \delta_{s_j,0}\, \mathrm{Tr}\big[ y_j y_j^T\, G(\Gamma) \big]
-2\, \frac{\partial \ln Z}{\partial \Gamma} = \mathrm{Tr}\, G(\Gamma) - \frac{d}{\Gamma}
with
G(\Gamma) = ( C(0) + \Gamma I )^{-1} = \sum_k \frac{u_k u_k^T}{\lambda_k + \Gamma}.
Compare with the (resample-averaged) reconstruction error
E_r = \frac{1}{N}\, E_{D}\Big[ \mathrm{Tr} \sum_{y_i \notin D;\ \lambda_l < \lambda} y_i y_i^T\, u_l u_l^T \Big].
Analytical Continuation
Reconstruction error
E_r = \frac{1}{N}\, E_{D}\Big[ \mathrm{Tr} \sum_{y_i \notin D;\ \lambda_l < \lambda} y_i y_i^T\, u_l u_l^T \Big]
Use the representation of the Dirac δ,
\delta(x) = \lim_{\eta \to 0^+} \frac{1}{\pi}\, \Im\, \frac{1}{x - i\eta},
and get
E_r = E_r^0 + \int_{0^+}^{\lambda} d\lambda'\, \varepsilon_r(\lambda')
where
\varepsilon_r(\lambda) = \frac{1}{\pi N} \lim_{\eta \to 0^+} \Im\, E_{D}\Big[ \sum_j \delta_{s_j,0}\, \mathrm{Tr}\big( y_j y_j^T\, G(-\lambda - i\eta) \big) \Big]
defines the error density from all eigenvalues > 0, and E_r^0 is the contribution from the eigenspace with λ_k = 0.
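A numerical illustration of the analytic continuation (assumed, for a single fixed dataset and without the resampling average): the imaginary part of the resolvent at Γ = −λ − iη acts as a smoothed indicator of eigenvalues near λ, so integrating it over λ reproduces the hard spectral cut-off as η → 0.

import numpy as np

rng = np.random.default_rng(1)
d, N, eta = 25, 50, 1e-3
Y = rng.normal(size=(d, N))
C = Y @ Y.T / N
eigvals, U = np.linalg.eigh(C)
A = np.outer(Y[:, 0], Y[:, 0])                   # example matrix y_j y_j^T for one data point

def smoothed_density(lam):
    # (1/pi) Im Tr[ A G(-lam - i*eta) ]  with  G(Gamma) = (C + Gamma I)^(-1)
    G = np.linalg.inv(C + (-lam - 1j * eta) * np.eye(d))
    return np.imag(np.trace(A @ G)) / np.pi

lam_cut = 1.0
edges = np.linspace(0.0, lam_cut, 4001)
mids = 0.5 * (edges[1:] + edges[:-1])
integral = sum(smoothed_density(l) for l in mids) * (edges[1] - edges[0])
exact = sum(float(Y[:, 0] @ U[:, k]) ** 2
            for k in range(d) if 0.0 < eigvals[k] < lam_cut)
print(integral, exact)   # the two values approach each other as eta -> 0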
Replica Trick
Data-averaged free energy
-E_{D}[\ln Z] = -\lim_{n \to 0} \frac{1}{n} \ln E_{D}[Z^n],
for integer n:
Z^{(n)} \doteq E_{D}[Z^n] = \int dx\, \psi_1(x)\, \psi_2(x)
where we set x \doteq (x_1, \ldots, x_n) and
\psi_1(x) = E_{D}\Big[ \exp\Big( -\tfrac{1}{2} \sum_{a=1}^{n} x_a^T D\, x_a \Big) \Big], \qquad \psi_2(x) = \exp\Big( -\tfrac{1}{2} \sum_{a=1}^{n} x_a^T K^{-1} x_a \Big)
intractable!
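For completeness, the standard identity behind the n → 0 limit (not spelled out on the slide) follows from expanding Z^n = e^{n \ln Z} for small n:

\[
E_{D}[Z^n] = E_{D}\big[ e^{\, n \ln Z} \big] = 1 + n\, E_{D}[\ln Z] + O(n^2)
\quad \Longrightarrow \quad
E_{D}[\ln Z] = \lim_{n \to 0} \frac{1}{n} \ln E_{D}[Z^n].
\]

The expectation on the left is only tractable for integer n, which is why the replicated integral is evaluated for integer n and then continued to n → 0.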
Approximate Inference (EC: Opper & Winther)
p_1(x) = \frac{1}{Z_1}\, \psi_1(x)\, e^{-\frac{1}{2} \Lambda_1 x^T x}, \qquad p_0(x) = \frac{1}{Z_0}\, e^{-\frac{1}{2} \Lambda_0 x^T x},
with Λ_1 and Λ_0 "variational" parameters.
Z^{(n)} = Z_1 \int dx\, p_1(x)\, \psi_2(x)\, e^{\frac{1}{2} \Lambda_1 x^T x} \approx Z_1 \int dx\, p_0(x)\, \psi_2(x)\, e^{\frac{1}{2} \Lambda_1 x^T x} \equiv Z^{(n)}_{EC}(\Lambda_1, \Lambda_0)
Match moments \langle x^T x \rangle_1 = \langle x^T x \rangle_0 & stationarity w.r.t. Λ_1.
Final result
-\ln Z_{EC} = -E_{D}\Big[ \ln \int dx\, e^{-\frac{1}{2} x^T ( D + (\Lambda_0 - \Lambda) I )\, x} \Big] - \ln \int dx\, e^{-\frac{1}{2} x^T ( K^{-1} + \Lambda I )\, x} + \ln \int dx\, e^{-\frac{1}{2} \Lambda_0\, x^T x},
where we have set Λ = Λ_0 − Λ_1. Tractable!
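Since every term is a Gaussian integral, −ln Z_EC reduces to log-determinants. The sketch below (assumed, not from the talk) evaluates it for given variational parameters, replacing the average E_D by a Monte Carlo average over bootstrap counts; solving the moment-matching and stationarity conditions for Λ_0, Λ_1 is not implemented, and drawing μN points to realize the oversampling rate μ is an assumption.

import numpy as np

def ec_free_energy(K, Lambda0, Lambda, mu, Gamma, eps=0.0, B=500, rng=None):
    # -ln Z_EC for given (Lambda0, Lambda); assumes K invertible and Lambda0 > Lambda.
    rng = np.random.default_rng(rng)
    N = K.shape[0]
    const = 0.5 * N * np.log(2 * np.pi)      # ln of the Gaussian normalisation (2 pi)^(N/2)

    # term 1: -E_D[ ln \int dx exp(-x^T (D + (Lambda0 - Lambda) I) x / 2) ]
    term1 = 0.0
    for _ in range(B):
        s = rng.multinomial(int(round(mu * N)), np.ones(N) / N)   # resampled counts s_i
        D = (s + eps * (s == 0)) / (mu * Gamma)
        term1 -= const - 0.5 * np.sum(np.log(D + Lambda0 - Lambda))
    term1 /= B

    # term 2: -ln \int dx exp(-x^T (K^{-1} + Lambda I) x / 2)
    _, logdet = np.linalg.slogdet(np.linalg.inv(K) + Lambda * np.eye(N))
    term2 = -(const - 0.5 * logdet)

    # term 3: +ln \int dx exp(-Lambda0 x^T x / 2)
    term3 = const - 0.5 * N * np.log(Lambda0)

    return term1 + term2 + term3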
Result: Artificial Data
[Figure: N = 50 data, Dim = 25, 3× oversampled; EC vs. resampling, plotted against the eigenvalue λ.]
The PCA Reconstruction Error
[Figure: N = 32 artificial random data, Dim = 25; approximate bootstrap, 3× oversampled. Test error versus sum of eigenvalues (training error), with the eigenvalue λ on the horizontal axis.]
Approximate Bootstrap: handwritten Digits
[Figure: N = 100 data, Dim = 784; density of eigenvalues and reconstruction error, with the eigenvalue λ on the horizontal axis.]
The result without replicas
-\ln Z = -\ln \int dx\, e^{-\frac{1}{2} x^T ( D + (\Lambda_0 - \Lambda) I )\, x} - \ln \int dx\, e^{-\frac{1}{2} x^T ( K^{-1} + \Lambda I )\, x} + \ln \int dx\, e^{-\frac{1}{2} \Lambda_0\, x^T x} + \frac{1}{2} \ln \det( I + r )
with
r_{ij} = \Big( 1 - \frac{\Lambda_0}{\Lambda_0 - \Lambda + D_i} \Big)\, \Big[ \Lambda_0 \big( K^{-1} + \Lambda I \big)^{-1} - I \Big]_{ij}.
Expand
\ln \det( I + r ) = \mathrm{Tr}\, \ln( I + r ) = \mathrm{Tr} \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k}\, r^k.
We have E_{D}[ r_{ij} ] = 0, so the first-order term vanishes after averaging; the second-order term yields on average
\Delta F = -\frac{1}{4} \sum_i E_{D}\Big[ \Big( 1 - \frac{\Lambda_0}{\Lambda_0 - \Lambda + D_i} \Big)^2 \Big]\, \Big( \Big[ \Lambda_0 \big( K^{-1} + \Lambda I \big)^{-1} - I \Big]_{ii} \Big)^2.
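A direct numerical transcription of ΔF (an assumed sketch; the bootstrap expectation over D_i is again a Monte Carlo average, with Λ_0 and Λ taken as given):

import numpy as np

def delta_F(K, Lambda0, Lambda, mu, Gamma, eps=0.0, B=2000, rng=None):
    # Second-order correction Delta F from the expansion of (1/2) ln det(I + r).
    # Assumes Lambda0 > Lambda so the denominators below stay positive.
    rng = np.random.default_rng(rng)
    N = K.shape[0]
    M = Lambda0 * np.linalg.inv(np.linalg.inv(K) + Lambda * np.eye(N)) - np.eye(N)
    a2 = np.zeros(N)                 # E_D[(1 - Lambda0/(Lambda0 - Lambda + D_i))^2]
    for _ in range(B):
        s = rng.multinomial(int(round(mu * N)), np.ones(N) / N)
        D = (s + eps * (s == 0)) / (mu * Gamma)
        a2 += (1.0 - Lambda0 / (Lambda0 - Lambda + D)) ** 2
    a2 /= B
    return -0.25 * np.sum(a2 * np.diag(M) ** 2)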
Correction
[Figure: Correction to the resampling error, and resampled reconstruction error (λ = 0), each plotted against the resampling rate µ.]
Correction to EC
Z^{(n)} = Z_1 \int dx\, p_1(x)\, \psi_2(x)\, e^{\frac{1}{2} \Lambda_1 x^T x} = Z_1 \int dx\, \psi_2(x)\, e^{\frac{1}{2} \Lambda_1 x^T x} \int \frac{dk}{(2\pi)^{Nn}}\, e^{-i k^T x}\, \chi(k),
where \chi(k) \doteq \int dx\, p_1(x)\, e^{-i k^T x} is the characteristic function of the density p_1. The cumulant expansion starts with a quadratic term (EC),
\ln \chi(k) = -\frac{M_2}{2}\, k^T k + R(k),
where M_2 = \langle x_a^T x_a \rangle_1. Expanding the 4th-order term in R(k) as e^{R(k)} = 1 + R(k) + \ldots leads to ΔF. Possibility of perturbative improvement?
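For orientation, a standard identity (not on the slide; σ² is a generic per-component variance, a placeholder symbol): if only the quadratic cumulant is kept, the k-integral simply reproduces a Gaussian density, which is exactly the Gaussian replacement underlying EC.

\[
\int \frac{d^{M} k}{(2\pi)^{M}}\; e^{-i k^T x}\; e^{-\frac{\sigma^2}{2} k^T k}
= \big( 2\pi \sigma^2 \big)^{-M/2} \exp\!\Big( -\frac{x^T x}{2 \sigma^2} \Big).
\]

Keeping e^{R(k)} ≈ 1 + R(k) on top of this Gaussian term then produces the leading (fourth-cumulant) correction, of the same type as the ΔF on the previous slide.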
Conclusion
• Non-Bayesian inference problems can be related to "hidden" probabilistic models via analytic continuation.
• EC approximate inference appears to be robust and survives analytic continuation and limits.