Probabilistic symmetry and invariant neural networks
Benjamin Bloem-Reddy, University of Oxford
Work with Yee Whye Teh
14 January 2019, UBC Computer Science
Outline
• Symmetry in neural networks
• Permutation-invariant neural networks
• Symmetry in probability and statistics
• Exchangeable sequences
• Permutation-invariant neural networks as exchangeable probability models
• Symmetry in neural networks as probabilistic symmetry
Deep learning and statistics
• Deep neural networks have been applied successfully in a range of settings.
• Effort is under way to improve performance in data-poor and semi-/unsupervised domains.
• Here, the focus is on symmetry.
• The study of symmetry in probability and statistics has a long history.
Symmetric neural networks
A generic fully connected layer computes
f_{ℓ,i} = σ( ∑_{j=1}^n w^{(ℓ)}_{i,j} f_{ℓ−1,j} ).
For input X and output Y, model Y = h(X), where h ∈ H is a neural network. If X and Y are assumed to satisfy a symmetry property, how is H restricted?
Symmetric neural networks
Convolutional neural networks encode translation invariance.
[Illustration from medium.freecodecamp.org]
Why symmetry?
Encoding symmetry in network architecture is a Good Thing∗.
Stabler training and better generalization through
• reduction in the dimension of the parameter space through weight-tying; and
• capturing structure at multiple scales via pooling.
Historical note: Interest in invariant neural networks goes back at least to Minsky and Papert [MP88]; extended by Shawe-Taylor and Wood [Sha89; WS96]. More recent work by a host of others.
Neural networks for permutation-invariant data [Zah+17]
Consider a sequence X_n := (X_1, …, X_n), X_i ∈ X.
Permutation invariance: Y = h(X_n) = h(π · X_n) for all π ∈ S_n.
[Diagram: X_1, X_2, X_3, X_4 all feeding into Y]
Sum-decomposition: Y = h(X_n) ↦ Y = h̃( ∑_{i=1}^n φ(X_i) ).
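The sum-decomposition above is straightforward to implement. A minimal sketch in plain Python, where φ and h̃ are small hand-coded functions standing in for learned networks (both hypothetical choices, not from the slides):

```python
import math

def phi(x):
    # Per-element feature map (stand-in for a learned network).
    return (x, x * x)

def h_tilde(s):
    # Output map applied to the pooled features (also a stand-in).
    return math.tanh(s[0] + 0.5 * s[1])

def invariant_h(xs):
    # Deep-Sets-style invariant function: pool per-element features by
    # summation, then apply h_tilde. Summation discards the ordering,
    # so the result is the same for any permutation of xs.
    s0 = sum(phi(x)[0] for x in xs)
    s1 = sum(phi(x)[1] for x in xs)
    return h_tilde((s0, s1))

xs = [0.3, -1.2, 0.7, 2.0]
# Permuting the inputs leaves the pooled sums, and hence Y, unchanged
# (up to floating-point round-off from the reordered summation).
assert math.isclose(invariant_h(xs), invariant_h([2.0, 0.3, -1.2, 0.7]))
```

Any permutation-invariant pooling (max, mean) would work in place of the sum; the sum is the form used in [Zah+17].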
Neural networks for permutation-invariant data [Zah+17]
Equivariance: Y_n = h(X_n) such that h(π · X_n) = π · h(X_n) for all π ∈ S_n.
[Diagram: each X_i feeding into the corresponding Y_i]
Weight-tying: [h(X_n)]_i = σ( ∑_{j=1}^n w_{i,j} X_j ) ↦ [h(X_n)]_i = σ( w_0 X_i + w_1 ∑_{j=1}^n X_j ).
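The weight-tied layer can be sketched directly: two scalar weights replace the n × n matrix, which is exactly what forces equivariance. The weight values below are arbitrary illustrative choices:

```python
import math

def equivariant_layer(xs, w0=0.8, w1=-0.1):
    # Permutation-equivariant layer: [h(x)]_i = sigma(w0 * x_i + w1 * sum_j x_j).
    # The sum term is permutation invariant, and each output depends on its
    # own input only through x_i, so permuting inputs permutes outputs.
    total = sum(xs)
    return [math.tanh(w0 * x + w1 * total) for x in xs]

xs = [0.3, -1.2, 0.7, 2.0]
perm = [2, 0, 3, 1]
out = equivariant_layer(xs)
out_perm = equivariant_layer([xs[i] for i in perm])
# Permuting the inputs permutes the outputs the same way.
assert all(math.isclose(out_perm[k], out[perm[k]]) for k in range(4))
```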
Neural networks for permutation-invariant data . . .
⟨⟨ Deep learning hat, off; statistics hat, on ⟩⟩
Note to students: These were the first Google Image results for "deep learning hat" and "statistics hat". You could probably make some money making decent hats.
Statistical models and symmetry
Consider a sequence X_n := (X_1, …, X_n), X_i ∈ X.
A statistical model of X_n is a family of probability distributions on X^n: P = {P_θ : θ ∈ Ω}.
If X_n is assumed to satisfy a symmetry property, how is P restricted?
Exchangeable sequences
A distribution P on X^n is exchangeable if P(X_1, …, X_n) = P(X_{π(1)}, …, X_{π(n)}) for all π ∈ S_n. X_ℕ is infinitely exchangeable if this is true for all prefixes X_n ⊂ X_ℕ, n ∈ ℕ.
de Finetti's theorem: X_ℕ exchangeable ⟺ X_i | Q ~iid Q for some random distribution Q.
Implication for Bayesian inference: our models for X_ℕ need only consist of i.i.d. distributions on X.
Analogous theorems exist for other symmetries. The book by Kallenberg [Kal05] collects many of them. Some other accessible references: [Dia88; OR15].
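One direction of de Finetti's equivalence can be checked exactly for a finite mixture of i.i.d. laws: mixing over Q yields a joint pmf that is invariant under permutation of the coordinates. A small sketch with a hypothetical two-component mixture over Bernoulli parameters (the weights and parameters are made up for illustration):

```python
from itertools import permutations

# Hypothetical mixing law for Q: Bernoulli(0.2) w.p. 0.5, Bernoulli(0.9) w.p. 0.5.
mixture = [(0.5, 0.2), (0.5, 0.9)]

def joint_pmf(xs):
    # P(x_1, ..., x_n) = sum_q w_q * prod_i q(x_i): a mixture of i.i.d. laws.
    total = 0.0
    for w, p in mixture:
        prod = 1.0
        for x in xs:
            prod *= p if x == 1 else 1 - p
        total += w * prod
    return total

xs = (1, 0, 1, 1)
# The joint pmf agrees on every permutation of xs: the sequence is exchangeable.
vals = {round(joint_pmf(p), 12) for p in permutations(xs)}
assert len(vals) == 1
```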
Finite exchangeable sequences
de Finetti's theorem may fail for finite exchangeable sequences. What else can we say?
The empirical measure of X_n is M_{X_n}(•) := ∑_{i=1}^n δ_{X_i}(•).
Finite exchangeable sequences
The empirical measure is a sufficient statistic: P is exchangeable iff
P(X_n ∈ • | M_{X_n} = m) = U_m(•),
where U_m is the uniform distribution on all sequences (x_1, …, x_n) with empirical measure m.
Consider Y such that (π · X_n, Y) =_d (X_n, Y). The empirical measure is an adequate statistic for any such Y:
P(Y ∈ • | X_n = x_n) = P(Y ∈ • | M_{X_n} = M_{x_n}).
M_{X_n} contains all the information in X_n that is relevant for predicting Y.
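In code, the empirical measure of a sequence is just its multiset of values; a `Counter` computes it, and it is invariant under permutation:

```python
from collections import Counter

def empirical_measure(xs):
    # M_{X_n} = sum_i delta_{X_i}: record each value with its multiplicity.
    return Counter(xs)

xs = ["a", "b", "a", "c"]
# Reordering the sequence does not change its empirical measure...
assert empirical_measure(xs) == empirical_measure(["c", "a", "b", "a"])
# ...and changing a value does: distinct orbits, distinct measures.
assert empirical_measure(xs) != empirical_measure(["a", "b", "a", "a"])
```

Two sequences share an empirical measure exactly when one is a permutation of the other, which is why any prediction rule depending on X_n only through M_{X_n} is automatically permutation invariant.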
A useful theorem
Theorem (Invariant representation; B-R, Teh). Suppose X_n is an exchangeable sequence. Then (π · X_n, Y) =_d (X_n, Y) for all π ∈ S_n if and only if there is a measurable function h̃ : [0, 1] × M(X) → Y such that
(X_n, Y) =_a.s. (X_n, h̃(η, M_{X_n})), where η ~ Unif[0, 1] and η ⊥⊥ X_n.
Deterministic invariance [Zah+17] ↦ stochastic invariance [B-R, Teh]:
Y = h̃( ∑_{i=1}^n φ(X_i) ) ↦ Y = h̃( η, ∑_{i=1}^n δ_{X_i} ).
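The representation Y = h̃(η, M_{X_n}) can be sketched as a noise-augmented invariant function: independent noise η plus the empirical measure determine Y, and reusing the same η on a permuted input gives the same output. The h̃ below is a hypothetical stand-in for a learned network; integer-valued inputs keep the equality checks exact:

```python
import math
import random
from collections import Counter

def h_tilde(eta, measure):
    # Hypothetical measurable function of (noise, empirical measure).
    # Depending on X_n only through its Counter makes it permutation invariant.
    s = sum(value * count for value, count in measure.items())
    return math.tanh(s + eta)

def sample_Y(xs, rng):
    # eta ~ Unif[0, 1], drawn independently of X_n.
    eta = rng.random()
    return h_tilde(eta, Counter(xs))

rng = random.Random(0)
y = sample_Y([1, 3, 3, 7], rng)

# Conditional on the same eta, a permuted input yields exactly the same Y.
eta = 0.42
assert h_tilde(eta, Counter([1, 3, 3, 7])) == h_tilde(eta, Counter([7, 3, 1, 3]))
```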
Another useful theorem
Theorem (Equivariant representation; B-R, Teh). Suppose X_n is an exchangeable sequence and Y_i ⊥⊥_{X_n} (Y_n \ Y_i). Then (π · X_n, π · Y_n) =_d (X_n, Y_n) for all π ∈ S_n if and only if there is a measurable function h̃ : [0, 1] × X × M(X) → Y such that
(X_n, Y_n) =_a.s. (X_n, (h̃(η_i, X_i, M_{X_n}))_{i ∈ [n]}), where η_i ~iid Unif[0, 1] and (η_i)_{i ∈ [n]} ⊥⊥ X_n.
Deterministic equivariance [Zah+17] ↦ stochastic equivariance [B-R, Teh]:
Y_i = σ( w_0 X_i + w_1 ∑_{j=1}^n X_j ) ↦ Y_i = h̃( η_i, X_i, ∑_{j=1}^n δ_{X_j} ).
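Analogously to the invariant case, the equivariant representation Y_i = h̃(η_i, X_i, M_{X_n}) can be sketched with per-element noise: permuting the inputs (carrying each η_i along with its X_i) permutes the outputs. Again h̃ is a hypothetical stand-in; integer inputs keep the comparisons exact:

```python
import math
from collections import Counter

def h_tilde(eta_i, x_i, measure):
    # Hypothetical h~: depends on the sequence only through (eta_i, x_i, M_{X_n}).
    total = sum(value * count for value, count in measure.items())
    return math.tanh(x_i + 0.1 * total + eta_i)

def equivariant(xs, etas):
    # Apply h~ to each (eta_i, x_i) with the shared empirical measure.
    m = Counter(xs)
    return [h_tilde(e, x, m) for e, x in zip(etas, xs)]

xs = [1, 3, 3, 7]
etas = [0.11, 0.47, 0.08, 0.93]
perm = [2, 0, 3, 1]
out = equivariant(xs, etas)
out_perm = equivariant([xs[i] for i in perm], [etas[i] for i in perm])
# Jointly permuting (eta_i, X_i) pairs permutes the outputs the same way.
assert all(out_perm[k] == out[perm[k]] for k in range(4))
```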
Outline
• Symmetry in neural networks
• Permutation-invariant neural networks
• Symmetry in probability and statistics
• Exchangeable sequences
• Permutation-invariant neural networks as exchangeable probability models
• Symmetry in neural networks as probabilistic symmetry
A bit of group theory
For a group G acting on a set X:
• The orbit of any x ∈ X is the subset of X generated by applying G to x: G · x = {g · x : g ∈ G}.
• A maximal invariant statistic M : X → S (i) is constant on each orbit, i.e., M(g · x) = M(x) for all g ∈ G and x ∈ X; and (ii) takes a different value on each orbit, i.e., M(x_1) = M(x_2) implies x_1 = g · x_2 for some g ∈ G.
• A maximal equivariant τ : X → G satisfies τ(g · x) = g · τ(x) for all g ∈ G and x ∈ X.
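As a concrete instance (not on the slides): for S_n acting on sequences by permutation, sorting is a maximal invariant. A quick check of both defining properties:

```python
from itertools import permutations

def M(xs):
    # Sorted tuple: a maximal invariant for S_n acting on sequences.
    return tuple(sorted(xs))

x = (3, 1, 2, 1)
# (i) Constant on the orbit of x: every permutation of x has the same statistic.
assert all(M(p) == M(x) for p in permutations(x))
# (ii) Separates orbits: a rearrangement of x gets the same value,
# while a sequence from a different orbit gets a different one.
assert M((1, 2, 1, 3)) == M(x)
assert M((1, 2, 2, 3)) != M(x)
```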
A general invariance theorem
Theorem (B-R, Teh). Let G be a compact group and assume that g · X =_d X for all g ∈ G. Let M : X → S be a maximal invariant. Then (g · X, Y) =_d (X, Y) for all g ∈ G if and only if there exists a measurable function h̃ : [0, 1] × S → Y such that
(X, Y) =_a.s. (X, h̃(η, M(X))), with η ~ Unif[0, 1] and η ⊥⊥ X.
Proof by picture
P(g · X, Y) = P(X, Y) for all g ∈ G
[Graphical model: X → Y]
Proof by picture
P(g · X, M(g · X), Y) = P(X, M(X), Y) for all g ∈ G ⇒ Y ⊥⊥ X | M(X)
[Graphical model: X → M(X) → Y]
A general equivariance theorem
Theorem (Kallenberg; B-R, Teh). Let G be a compact group and assume that g · X =_d X for all g ∈ G. Assume that a maximal equivariant τ : X → G exists. Then (g · X, g · Y) =_d (X, Y) for all g ∈ G if and only if there exists a measurable function h̃ : [0, 1] × X → Y such that
(X, Y) =_a.s. (X, h̃(η, X)), with η ~ Unif[0, 1] and η ⊥⊥ X,
where h̃ is equivariant: h̃(η, g · X) =_a.s. g · h̃(η, X), g ∈ G.
Proof by picture
P(g · X, g · Y) = P(X, Y) for all g ∈ G
[Graphical model: X → Y]