
Pseudo Orthogonal Bases Give the Optimal Generalization Capability

Pseudo Orthogonal Bases Give the Optimal Generalization Capability in Neural Network Learning. Masashi Sugiyama and Hidemitsu Ogawa, Department of Computer Science, Tokyo Institute of Technology, Japan. SPIE's 44th Annual Meeting and Exhibition.


  1. Pseudo Orthogonal Bases Give the Optimal Generalization Capability in Neural Network Learning. Masashi Sugiyama and Hidemitsu Ogawa, Department of Computer Science, Tokyo Institute of Technology, Japan.

  2. SPIE’s 44th Annual Meeting and Exhibition, Wavelet Applications in Signal and Image Processing VII

     Pseudo Orthogonal Bases (POBs)

     Definition. Let $H$ be a finite-dimensional Hilbert space and let $M \ge \dim(H)$. A set $\{\varphi_m\}_{m=1}^{M}$ of elements in $H$ is called a POB if any $f$ in $H$ is expressed as

     $$f = \sum_{m=1}^{M} \langle f, \varphi_m \rangle \varphi_m,$$

     where $\langle \cdot, \cdot \rangle$ denotes the inner product in $H$.

     • If $M = \dim(H)$, a POB reduces to an orthonormal basis (ONB).
     • A POB is a tight frame with frame bound 1: $\|f\|^2 = \sum_{m=1}^{M} |\langle f, \varphi_m \rangle|^2$.

     If $\|\varphi_1\| = \|\varphi_2\| = \cdots = \|\varphi_M\|$, then $\{\varphi_m\}_{m=1}^{M}$ is called a pseudo orthonormal basis (PONB).

     [Figure: a POB $\{\varphi_1, \varphi_2, \varphi_3\}$ in $H = \mathbb{R}^2$ with $M = 3$.]
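The definition can be checked numerically on a small case. The sketch below uses a hypothetical set of three vectors in $\mathbb{R}^2$, 120 degrees apart and scaled to norm $\sqrt{2/3}$; this choice is an assumption made for illustration and is not necessarily the set drawn in the slide's figure. The names `phi` and `pob_expand` are likewise illustrative.

```python
import numpy as np

# Hypothetical example (not taken from the slide's figure): three vectors
# 120 degrees apart in R^2, each scaled to norm sqrt(2/3).
angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
phi = np.sqrt(2.0 / 3.0) * np.stack([np.cos(angles), np.sin(angles)], axis=1)

def pob_expand(f, phi):
    """Return sum_m <f, phi_m> phi_m for real vectors (rows of phi)."""
    coeffs = phi @ f        # inner products <f, phi_m>
    return coeffs @ phi     # weighted sum of the phi_m

f = np.array([0.7, -1.3])
reconstructed = pob_expand(f, phi)          # should reproduce f
energy = float(np.sum((phi @ f) ** 2))      # should equal ||f||^2
norms = np.linalg.norm(phi, axis=1)         # equal norms, so this set is a PONB
```

Since $M = 3 > \dim(H) = 2$, the three vectors are linearly dependent, yet the expansion still reproduces every $f$ and preserves its energy.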

  3. Frame, POB, PBOB, ...

     • Frame: Duffin and Schaeffer (1952); Young (1980).
     • Pseudo orthogonal basis (POB): Ogawa and Iijima (1973).
       $$f = \sum_{m=1}^{M} \langle f, \varphi_m \rangle \varphi_m$$
     • Pseudo biorthogonal basis (PBOB): Ogawa (1978).
       $$f = \sum_{m=1}^{M} \langle f, \varphi_m^{*} \rangle \varphi_m$$

     Applications: signal restoration, computerized tomography, neural network learning, ...

  4. Learning in Neural Networks

     [Figure: a layered network with input $x$ ($\xi_1, \dots, \xi_L$), modifiable weights $v_{11}, \dots, v_{LN}$ and $w_1, \dots, w_N$ on the synapses, neurons $u_1, \dots, u_N$, and output $y = f_0(x)$.]

     Purpose of NN learning: modify the weights by using the training examples

     $$\{(x_m, y_m) \mid y_m = f(x_m) + n_m\}_{m=1}^{M}$$

     and obtain the underlying input-output rule.

     [Figure: target function $f$ and learning result $f_0$, with sample values $y_1, y_2, y_3$ at sample points $x_1, x_2, x_3$.]

  5. NN Learning as an Inverse Problem

     [Figure: the sampling operator $A$ maps the target function $f$ in the function space $H$ to the sample values $(f(x_1), f(x_2), \dots, f(x_M))^\top$ in the sampling space $\mathbb{C}^M$; noise $n$ is added to give the sample value vector $y$; the learning operator $X$ maps $y$ to the learning result $f_0$.]

     sampling: $y = (y_1, \dots, y_M)^\top = Af + n$
     learning: $f_0 = Xy$

     Representation of the sampling operator $A$:

     $$A = \sum_{m=1}^{M} (e_m \otimes \psi_m), \qquad \psi_m(x) = K(x, x_m), \qquad \langle f, \psi_m \rangle = f(x_m),$$

     where $K(x, x')$ is the reproducing kernel of $H$.

  6. Trigonometric Polynomial Space

     A Hilbert space $H$ is called a trigonometric polynomial space of order $N$ if $H$ is spanned by $\{\exp(inx)\}_{n=-N}^{N}$, which are defined on $[-\pi, \pi]$, and the inner product in $H$ is defined as

     $$\langle f, g \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\, \overline{g(x)}\, dx.$$

     The reproducing kernel of $H$ is

     $$K(x, x') = \begin{cases} \sin\dfrac{(2N+1)(x - x')}{2} \Big/ \sin\dfrac{x - x'}{2} & (x \ne x') \\[1ex] 2N + 1 & (x = x') \end{cases}$$

     [Figure: profile of the reproducing kernel of a trigonometric polynomial space of order 5 ($x' = 0$).]
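The closed form of the kernel can be compared against the finite sum $\sum_{n=-N}^{N} \exp(in(x - x'))$, which is what the reproducing kernel reduces to when the functions $\exp(inx)$ are orthonormal under the inner product above. The following is a minimal sketch; the function names are illustrative and not taken from the slides.

```python
import numpy as np

def kernel_closed_form(x, x_prime, N):
    """Closed-form reproducing kernel K(x, x') of order N."""
    d = np.asarray(x, dtype=float) - x_prime
    s = np.sin(d / 2.0)
    on_diagonal = np.isclose(s, 0.0)
    safe_s = np.where(on_diagonal, 1.0, s)      # avoid division by zero
    ratio = np.sin((2 * N + 1) * d / 2.0) / safe_s
    return np.where(on_diagonal, 2.0 * N + 1.0, ratio)

def kernel_by_sum(x, x_prime, N):
    """K(x, x') written as sum_{n=-N}^{N} exp(i n (x - x'))."""
    n = np.arange(-N, N + 1)
    d = np.asarray(x, dtype=float) - x_prime
    return np.exp(1j * np.multiply.outer(d, n)).sum(axis=-1).real

N = 5
x = np.linspace(-np.pi, np.pi, 201)
peak = float(kernel_closed_form(0.0, 0.0, N))   # value at x = x', equal to 2N + 1
max_gap = float(np.max(np.abs(kernel_closed_form(x, 0.0, N) - kernel_by_sum(x, 0.0, N))))
```

For order 5 the peak value is $2N + 1 = 11$, consistent with the height of the profile described in the figure caption.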

  7. Process of NN Learning

     [Figure: $A$ maps $f \in H$ to the sample values $(f(x_1), \dots, f(x_M))^\top \in \mathbb{C}^M$; noise $n$ is added to give $y$; $X$ maps $y$ to $f_0$.]

     1. (Active learning) Sample points $\{x_m\}_{m=1}^{M}$ are determined.
     2. Sample values $\{y_m\}_{m=1}^{M}$ are gathered.
     3. $X$ and $f_0$ are calculated by projection learning. When the noise covariance matrix is $\sigma^2 I$, $X = A^\dagger$, where $A^\dagger$ is the Moore-Penrose generalized inverse of $A$.

     Our goal: we give the optimal solution to active learning.

  8. Active Learning

     Find a set $\{x_m\}_{m=1}^{M}$ of sample points which minimizes the generalization error

     $$J_G = E_n \|f_0 - f\|^2,$$

     where $E_n$ denotes the ensemble average over the noise. If the noise covariance matrix is $\sigma^2 I$, then $J_G$ becomes

     $$J_G = \underbrace{\|P_{N(A)} f\|^2}_{\text{bias}} + \underbrace{\sigma^2 \operatorname{tr}\bigl((AA^*)^\dagger\bigr)}_{\text{variance}},$$

     where $N(A)$ denotes the null space of $A$. The bias of $f_0$ is 0 $\iff$ $N(A) = \{0\}$.

     Strategy: find a set $\{x_m\}_{m=1}^{M}$ of sample points which minimizes $J_G = \sigma^2 \operatorname{tr}\bigl((AA^*)^\dagger\bigr)$ under the constraint $N(A) = \{0\}$.
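The variance term can be evaluated for any candidate set of sample points. The sketch below assumes the trigonometric polynomial space of order $N$ and represents $A$ by its matrix with respect to the orthonormal functions $\exp(inx)$; it uses the identity $\operatorname{tr}((AA^*)^\dagger) = \sum_i 1/s_i^2$ over the nonzero singular values $s_i$ of $A$. The function names and the two example designs are assumptions made for illustration.

```python
import numpy as np

def sampling_matrix(x, N):
    """Matrix of A with respect to the orthonormal functions exp(inx), n = -N..N.

    Row m maps the coefficient vector of f to the sample value f(x_m).
    """
    n = np.arange(-N, N + 1)
    return np.exp(1j * np.outer(x, n))

def variance_term(x, N, sigma2):
    """sigma^2 * tr((A A^*)^dagger), computed only when N(A) = {0}."""
    A = sampling_matrix(np.asarray(x, dtype=float), N)
    s = np.linalg.svd(A, compute_uv=False)
    if len(s) < 2 * N + 1 or s.min() <= 1e-10 * s.max():
        raise ValueError("N(A) is not {0}: the bias term would not vanish")
    return sigma2 * float(np.sum(1.0 / s ** 2))

N, M, sigma2 = 3, 21, 1.0
x_spread = -np.pi + 2 * np.pi * np.arange(M) / M    # points spread over [-pi, pi)
x_clustered = np.linspace(-2.0, 2.0, M)             # hypothetical clustered design
j_spread = variance_term(x_spread, N, sigma2)
j_clustered = variance_term(x_clustered, N, sigma2)
```

Comparing `j_spread` and `j_clustered` shows how strongly the variance term depends on where the sample points are placed, which is the quantity the strategy above minimizes.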

  9. Main Theorem

     Suppose the noise covariance matrix is $\sigma^2 I$ with $\sigma^2 > 0$. $J_G$ is minimized under the constraint $N(A) = \{0\}$ if and only if $\bigl\{\frac{1}{\sqrt{M}} \psi_m\bigr\}_{m=1}^{M}$ forms a PONB in $H$. In this case, the minimum value of $J_G$ is

     $$\frac{\sigma^2 (2N + 1)}{M}.$$

     The PONB condition means

     $$f = \sum_{m=1}^{M} \Bigl\langle f, \tfrac{1}{\sqrt{M}} \psi_m \Bigr\rangle \tfrac{1}{\sqrt{M}} \psi_m \quad \text{for all } f \in H, \qquad \|\psi_1\| = \|\psi_2\| = \cdots = \|\psi_M\|.$$

     Here $\psi_m(x) = K(x, x_m)$, where $K(x, x')$ is the reproducing kernel: $K(x, x') = \sin\frac{(2N+1)(x - x')}{2} \big/ \sin\frac{x - x'}{2}$ for $x \ne x'$ and $K(x, x') = 2N + 1$ for $x = x'$.
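One way to see where the value $\sigma^2(2N+1)/M$ comes from is an eigenvalue argument. The sketch below is a reconstruction under the slide's assumptions and is not claimed to be the authors' proof; the symbols $\lambda_1, \dots, \lambda_{2N+1}$ (the eigenvalues of $A^*A$) are introduced here for illustration only.

```latex
\begin{align*}
\operatorname{tr}\bigl((AA^*)^\dagger\bigr)
  &= \sum_{i=1}^{2N+1} \frac{1}{\lambda_i}
  && \text{(all } \lambda_i > 0 \text{ because } N(A) = \{0\}\text{)} \\
\sum_{i=1}^{2N+1} \lambda_i
  &= \operatorname{tr}(A^*A) = \sum_{m=1}^{M} \|\psi_m\|^2
   = \sum_{m=1}^{M} K(x_m, x_m) = M(2N+1) \\
\sum_{i=1}^{2N+1} \frac{1}{\lambda_i}
  &\ge \frac{(2N+1)^2}{\sum_{i} \lambda_i} = \frac{2N+1}{M}
  && \text{(Cauchy--Schwarz)}
\end{align*}
```

Equality holds exactly when all $\lambda_i$ equal $M$, that is, when $A^*A = M I$. Since $A^*A f = \sum_{m} \langle f, \psi_m \rangle \psi_m$, this is the expansion $f = \sum_{m} \langle f, \frac{1}{\sqrt{M}}\psi_m \rangle \frac{1}{\sqrt{M}}\psi_m$ stated in the theorem; the equal-norm condition holds automatically because $\|\psi_m\|^2 = K(x_m, x_m) = 2N + 1$ for every $m$.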

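  10. Interpretation

      When $\bigl\{\frac{1}{\sqrt{M}} \psi_m\bigr\}_{m=1}^{M}$ forms a PONB in $H$, $\|Af\| = \sqrt{M}\, \|f\|$.

      Split the noise as $n = n_1 + n_2$, with $n_1$ in the range $R(A)$ and $n_2$ orthogonal to $R(A)$. Then

      $$f_0 = Xy = A^\dagger A f + A^\dagger n_1 + A^\dagger n_2.$$

      • $A^\dagger A f = f$ $\Longleftarrow$ $N(A) = \{0\}$
      • $A^\dagger n_2 = 0$ $\Longleftarrow$ $X$ is the projection learning operator
      • $\|A^\dagger n_1\| = \frac{1}{\sqrt{M}} \|n_1\|$ $\Longleftarrow$ $\bigl\{\frac{1}{\sqrt{M}} \psi_m\bigr\}_{m=1}^{M}$ is a PONB

      [Figure: $A$ maps $f \in H$ to $Af \in R(A) \subset \mathbb{C}^M$ with amplification $\times \sqrt{M}$; the noise $n$ is added to give $y = Af + n$; $X = A^\dagger$ maps back to $H$ with amplification $\times \frac{1}{\sqrt{M}}$, giving $f_0$.]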
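The three relations above can be checked numerically. The sketch below assumes the trigonometric polynomial space with $N = 3$, $M = 21$ equally spaced sample points, and a randomly drawn coefficient vector and noise vector; all variable names and the random seed are illustrative choices.

```python
import numpy as np

N, M = 3, 21
n_idx = np.arange(-N, N + 1)
x = -np.pi + 2 * np.pi * np.arange(M) / M      # equally spaced sample points
A = np.exp(1j * np.outer(x, n_idx))            # matrix of the sampling operator
A_pinv = np.linalg.pinv(A)                     # X = A^dagger

rng = np.random.default_rng(0)
f = rng.standard_normal(2 * N + 1) + 1j * rng.standard_normal(2 * N + 1)
noise = rng.standard_normal(M) + 1j * rng.standard_normal(M)

P_range = A @ A_pinv                           # orthogonal projector onto R(A)
n1 = P_range @ noise                           # noise component inside R(A)
n2 = noise - n1                                # noise component orthogonal to R(A)

signal_gain = np.linalg.norm(A @ f) / np.linalg.norm(f)         # expected sqrt(M)
noise_gain = np.linalg.norm(A_pinv @ n1) / np.linalg.norm(n1)   # expected 1/sqrt(M)
removed = np.linalg.norm(A_pinv @ n2)                           # expected ~0
recovered = A_pinv @ (A @ f)                                    # expected f
```

The signal is amplified by $\sqrt{M}$ on the way into $\mathbb{C}^M$, while the surviving noise component is attenuated by $1/\sqrt{M}$ on the way back, so adding sample points suppresses the noise.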

  11. Examples of PONB (1)

      Example 1. Let $M \ge 2N + 1$ $(= \dim(H))$ and let $c$ satisfy $-\pi \le c \le -\pi + \frac{2\pi}{M}$. If we put $\{x_m\}_{m=1}^{M}$ as

      $$x_m = c + \frac{2\pi}{M}(m - 1),$$

      then $\bigl\{\frac{1}{\sqrt{M}} \psi_m\bigr\}_{m=1}^{M}$ forms a PONB in $H$.

      [Figure: sample points $x_1, x_2, \dots, x_M$ on $[-\pi, \pi]$.] The $M$ sample points are placed at intervals of $2\pi/M$ and sample values are gathered once at each point.

      Here $\psi_m(x) = K(x, x_m)$, where $K(x, x')$ is the reproducing kernel: $K(x, x') = \sin\frac{(2N+1)(x - x')}{2} \big/ \sin\frac{x - x'}{2}$ for $x \ne x'$ and $K(x, x') = 2N + 1$ for $x = x'$.
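Example 1 can be verified with a short check. In coordinates with respect to the orthonormal functions $\exp(inx)$, the expansion $f = \sum_m \langle f, \frac{1}{\sqrt{M}}\psi_m \rangle \frac{1}{\sqrt{M}}\psi_m$ is equivalent to $\frac{1}{M} A^*A = I$, and the equal-norm condition holds automatically because $\|\psi_m\|^2 = K(x_m, x_m) = 2N + 1$. The helper names below are assumptions made for this sketch.

```python
import numpy as np

def example1_points(M, c):
    """Sample points x_m = c + (2*pi/M)(m - 1), m = 1..M."""
    return c + 2 * np.pi * np.arange(M) / M

def is_ponb(x, N, tol=1e-9):
    """True if {psi_m / sqrt(M)} reproduces every f in H, i.e. A^* A = M I."""
    x = np.asarray(x, dtype=float)
    n = np.arange(-N, N + 1)
    A = np.exp(1j * np.outer(x, n))
    gram = A.conj().T @ A / len(x)
    return bool(np.allclose(gram, np.eye(2 * N + 1), atol=tol))

N = 3
ok_minimal = is_ponb(example1_points(7, -np.pi), N)     # M = 2N + 1
ok_redundant = is_ponb(example1_points(21, -3.0), N)    # M > 2N + 1
too_few = is_ponb(example1_points(6, -np.pi), N)        # M < dim(H)
uneven = is_ponb(np.linspace(-2.0, 2.0, 21), N)         # not equally spaced over the period
```

The last two calls illustrate the role of the conditions: with fewer than $2N + 1$ points, or with points that do not cover the period evenly, the scaled Gram matrix is not the identity.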

  12. Examples of PONB (2)

      Let $M = k(2N + 1)$, where $k$ is a positive integer.

      For a general finite-dimensional Hilbert space $H$, $\{\varphi_m\}_{m=1}^{M}$ becomes a PONB if $\{\sqrt{k}\, \varphi_m\}_{m=1}^{M}$ consists of $k$ sets of ONBs in $H$.

      Example 2. Let $c$ satisfy $-\pi \le c \le -\pi + \frac{2\pi}{2N + 1}$. If we put $\{x_m\}_{m=1}^{M}$ as

      $$x_m = c + \frac{2\pi p}{2N + 1}, \qquad p = m - 1 \pmod{2N + 1},$$

      then $\bigl\{\frac{1}{\sqrt{M}} \psi_m\bigr\}_{m=1}^{M}$ forms a PONB in $H$.

      [Figure: the sample points $x_1, \dots, x_{2N+1}$, $x_{2N+2}, \dots, x_{2(2N+1)}$, ..., $x_{M-2N}, \dots, x_M$ stacked $k$ times over the same $2N + 1$ locations on $[-\pi, \pi]$.] The $2N + 1$ sample points are placed at intervals of $2\pi/(2N + 1)$ and sample values are gathered $k$ times at each point.
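The repeated-sampling design of Example 2 can be checked the same way as a sketch. The code assumes $N = 3$, $k = 3$, $c = -\pi$ and $\sigma^2 = 1$ as illustrative values, and uses $\operatorname{tr}((AA^*)^\dagger) = \operatorname{tr}((A^*A)^{-1})$, which holds when $N(A) = \{0\}$; the function names are hypothetical.

```python
import numpy as np

def example2_points(N, k, c):
    """Sample points of Example 2: 2N+1 equally spaced locations, each used k times."""
    M = k * (2 * N + 1)
    p = np.arange(M) % (2 * N + 1)      # p = m - 1 (mod 2N+1) for m = 1..M
    return c + 2 * np.pi * p / (2 * N + 1)

def scaled_gram(x, N):
    """(1/M) A^* A in the orthonormal system exp(inx), n = -N..N."""
    n = np.arange(-N, N + 1)
    A = np.exp(1j * np.outer(x, n))
    return A.conj().T @ A / len(x)

N, k = 3, 3
x = example2_points(N, k, c=-np.pi)
G = scaled_gram(x, N)                                        # identity for a PONB
variance = float(np.trace(np.linalg.inv(len(x) * G)).real)   # tr((A^*A)^{-1}), sigma^2 = 1
distinct_locations = len(np.unique(np.round(x, 12)))
```

With these values $M = 21$, so the variance term equals $(2N + 1)/M = 1/3$, the same value as for 21 distinct equally spaced points.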

  13. Computer Simulation 1

      $N = 3$ ($\dim(H) = 7$), $M = 21$.

      [Figure: target function and learning result plotted against $x$ for two sampling schemes.]
      (A) Optimal sampling: $J_G = 0.333$
      (B) Random sampling: $J_G = 1.202$

  14. Computer Simulation 2

      [Figure: $J_G$ versus the number of training examples (7 to 70 in steps of 7) for optimal sampling and for random sampling (average of 100 trials).]
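A comparison of this kind can be sketched as follows. Several details are assumptions, since the slides do not state them: $N = 3$ is carried over from Simulation 1, $\sigma^2 = 1$ is chosen because it is consistent with $J_G = 0.333$ at $M = 21$ under the main theorem, random sample points are drawn uniformly from $[-\pi, \pi]$, and $J_G$ is evaluated by the closed-form variance term rather than by averaging over noise realizations. The resulting random-sampling values are therefore not expected to match the slide's figures.

```python
import numpy as np

def variance_term(x, N, sigma2=1.0):
    """sigma^2 * tr((A A^*)^dagger) via the singular values of A (assumes N(A) = {0})."""
    n = np.arange(-N, N + 1)
    s = np.linalg.svd(np.exp(1j * np.outer(x, n)), compute_uv=False)
    return sigma2 * float(np.sum(1.0 / s ** 2))

def sweep(N=3, sizes=range(7, 71, 7), trials=100, seed=0):
    """Return (M, J_G for equally spaced points, mean J_G for random points)."""
    rng = np.random.default_rng(seed)
    rows = []
    for M in sizes:
        optimal = variance_term(-np.pi + 2 * np.pi * np.arange(M) / M, N)
        random_mean = np.mean(
            [variance_term(rng.uniform(-np.pi, np.pi, M), N) for _ in range(trials)]
        )
        rows.append((M, optimal, float(random_mean)))
    return rows

results = sweep()
for M, optimal, random_mean in results:
    print(f"M={M:3d}  optimal={optimal:.3f}  random(mean)={random_mean:.3f}")
```

At $M = 7$ a random draw can be nearly singular, so the random-sampling average at that size may be very large; the optimal curve follows $(2N + 1)/M$ exactly.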

  15. Conclusions

      1. We showed that pseudo orthogonal bases (POBs) give the optimal solution to active learning in neural networks.
      2. By utilizing properties of POBs, we clarified the mechanism of achieving the optimal generalization.
      3. We gave two construction methods of PONBs.

  16. Active Learning in Neural Networks

  17. Projection learning

      $$f_0 = \underbrace{XAf}_{\text{signal component}} + \underbrace{Xn}_{\text{noise component}}$$

      Minimize $E_n \|Xn\|^2$ under the constraint $XAf = P_{R(A^*)} f$.

      [Figure: $f$ in $H$, the approximation space $R(A^*)$, and the learning result $f_0$.]

      Projection learning operator:

      $$X = V^\dagger A^* U^\dagger + Y (I - U U^\dagger), \qquad U = AA^* + Q, \qquad V = A^* U^\dagger A,$$

      where $A^*$ is the adjoint operator of $A$, $Q$ is the noise covariance matrix, $U^\dagger$ is the Moore-Penrose generalized inverse of $U$, and $Y$ is an arbitrary operator.
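The operator formula can be evaluated directly and compared with $A^\dagger$ for white noise, $Q = \sigma^2 I$. The sketch below assumes the trigonometric polynomial space with $N = 3$, $M = 21$, $\sigma^2 = 0.5$, a hypothetical set of non-uniform sample points, and the choice $Y = 0$ for the arbitrary operator; none of these values come from the slides.

```python
import numpy as np

N, M, sigma2 = 3, 21, 0.5
n_idx = np.arange(-N, N + 1)
# Hypothetical non-uniform sample points, denser near -pi.
x = -np.pi + 2 * np.pi * (np.arange(M) / M) ** 1.5
A = np.exp(1j * np.outer(x, n_idx))       # matrix of the sampling operator
A_star = A.conj().T                       # adjoint of A
Q = sigma2 * np.eye(M)                    # white-noise covariance (assumption)

U = A @ A_star + Q
U_pinv = np.linalg.pinv(U)
V = A_star @ U_pinv @ A
Y = np.zeros((2 * N + 1, M))              # the arbitrary operator, set to zero here

X = np.linalg.pinv(V) @ A_star @ U_pinv + Y @ (np.eye(M) - U @ U_pinv)

matches_pinv = bool(np.allclose(X, np.linalg.pinv(A), atol=1e-8))
signal_part = X @ A                       # should be P_{R(A^*)}, here the identity
```

Because $U$ is invertible when $Q = \sigma^2 I$ with $\sigma^2 > 0$, the term $Y(I - UU^\dagger)$ vanishes for any $Y$, and the result agrees with the statement that $X = A^\dagger$ in this case.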
