On the Practical Computational Power of Finite Precision RNNs for Language Recognition
Gail Weiss, Yoav Goldberg, Eran Yahav
GRU < LSTM (!?)
Supported by European Union’s Seventh Framework Programme (FP7) under grant agreement no. 615688 (PRIME)
Current State
• RNNs are everywhere
• We don’t know too much about the differences between them:
• Gated RNNs are shown to train better, beyond that:
• “RNNs are Turing Complete”?
Turing Complete?
Turing Complete?
1993 Proof:
1. Requires Infinite Precision:
   • Uses stack(s), maintained in certain dimension(s)
   • Zeros are pushed using division ($g = g/4 + 1/4$)
   • In 32 bits, this reaches the limit after 15 pushes
2. Requires Infinite Time:
   • Allows processing steps beyond reading input (not the standard use case!)
Unreasonable assumptions!
Turing Complete?
1993 Proof: infinite precision and infinite time are unreasonable assumptions. Turing tarpit!
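The precision limit of the push rule $g = g/4 + 1/4$ can be checked with a small sketch. It assumes that $g = 0$ encodes the empty stack and that the state is stored as an IEEE 754 32-bit float; the helper names are hypothetical. The exact count depends on the number format and rounding mode, so it need not match the figure quoted on the slide.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def push_zero(g: float) -> float:
    """Push a 0 onto the encoded stack: g -> g/4 + 1/4, stored in 32 bits."""
    return to_float32(g / 4 + 0.25)


def pushes_until_saturation(limit: int = 100) -> int:
    """Count the pushes that still change the stored value, starting from g = 0."""
    g = 0.0  # assumption: 0.0 encodes the empty stack
    for n in range(limit):
        nxt = push_zero(g)
        if nxt == g:
            # The push no longer changes the state, so the stack depth is lost.
            return n
        g = nxt
    return limit


if __name__ == "__main__":
    print("distinguishable pushes in float32:", pushes_until_saturation())
```

Once the value stops changing, further pushes are invisible to the network, which is why the stack construction does not carry over to fixed-width hardware.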
What happens on real hardware and real use-cases?
Real Use
• Gated architectures have the best performance
• LSTM and GRU are the most popular
• The choice between these two is unclear
Main Result
• We accept that all RNN types can simulate DFAs
• We show that LSTMs and IRNNs (ReLU-activated simple RNNs) can also count
• And that the GRU and the SRNN (simple RNN with a squashing activation) cannot
Power of Counting
• Practical: in NMT, the LSTM is better at capturing target length
• Theoretical: Finite State Machines vs Counter Machines
K-Counter Machines (SKCMs)
Fischer, Meyer, Rosenberg - 1968
• Similar to finite automata, but also maintain k counters
• A counter has 4 operations: inc/dec by one, do nothing, reset
• Counters are observed by comparison to zero
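A minimal sketch of such a machine, assuming a hand-written two-state control and a single counter (k = 1) that recognizes $a^n b^n$ for $n \ge 1$. The class and function names are hypothetical; the control only ever observes the counter through `is_zero`.

```python
class Counter:
    """One counter with the four operations named on the slide."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self) -> None:
        self.value += 1

    def dec(self) -> None:
        self.value -= 1

    def noop(self) -> None:
        pass

    def reset(self) -> None:
        self.value = 0

    def is_zero(self) -> bool:
        # The only way the finite control may observe the counter.
        return self.value == 0


def accepts_anbn(word: str) -> bool:
    """Recognize a^n b^n (n >= 1) with a two-state control and one counter."""
    counter = Counter()
    state = "reading_a"
    for ch in word:
        if state == "reading_a" and ch == "a":
            counter.inc()
        elif ch == "b" and not counter.is_zero():
            state = "reading_b"
            counter.dec()
        else:
            return False  # wrong symbol order, or more b's than a's
    return state == "reading_b" and counter.is_zero()


if __name__ == "__main__":
    for w in ["ab", "aabb", "aab", "abb", "ba", ""]:
        print(repr(w), accepts_anbn(w))
```

A finite automaton alone cannot do this, because n is unbounded; the counter is what supplies the extra power.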
Counting Machines and Chomsky Hierarchy
Nested classes, innermost to outermost:
• Regular Languages (RL)
• Context Free Languages (CFL)
• Context Sensitive Languages (CSL)
• Recursively Enumerable Languages (RE)
Chomsky Hierarchy and SKCMs
[Figure: the nested classes RL, CFL, CSL, RE with example languages marked]
• $a^n b^n$: context free, recognized by an SKCM
• Palindromes: context free, not recognized by an SKCM
• $a^n b^n c^n$: context sensitive, recognized by an SKCM
SKCMs cross the Chomsky Hierarchy!
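To illustrate how counters reach past the context-free class, here is a sketch of a recognizer for $a^n b^n c^n$ ($n \ge 1$) that uses a three-phase finite control and two counters. The function and variable names are hypothetical, and the counters are plain integers that are only incremented, decremented, or compared to zero.

```python
def accepts_anbncn(word: str) -> bool:
    """Recognize a^n b^n c^n (n >= 1) with finite control plus two counters."""
    pending_b = 0  # counter 1: a's not yet matched by a b
    pending_c = 0  # counter 2: a's not yet matched by a c
    phase = "a"    # finite control: which block of symbols we are in
    for ch in word:
        if ch == "a" and phase == "a":
            pending_b += 1
            pending_c += 1
        elif ch == "b" and phase in ("a", "b") and pending_b != 0:
            phase = "b"
            pending_b -= 1
        elif ch == "c" and phase in ("b", "c") and pending_b == 0 and pending_c != 0:
            phase = "c"
            pending_c -= 1
        else:
            return False  # out-of-order symbol or a counter would go below zero
    return phase == "c" and pending_b == 0 and pending_c == 0


if __name__ == "__main__":
    for w in ["abc", "aabbcc", "aabcc", "abcc", "aabbc", ""]:
        print(repr(w), accepts_anbncn(w))
```

No comparable trick works for palindromes, since a counter stores how many symbols were seen but not which ones.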
Summary so Far
• Counters give additional formal power
• We claimed that LSTM can count and GRU cannot
• Let’s see why
Popular Architectures

GRU
gates:
$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
candidate vector:
$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$
update function:
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$

LSTM
gates:
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
candidate vector:
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
update functions:
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
$h_t = o_t \circ g(c_t)$
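The two sets of update equations can be written out as a sketch for a single hidden unit, so every weight is a scalar. The parameter dictionary `p`, its key names, and the choice of `tanh` as the default for `g` are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: float, h_prev: float, p: dict) -> float:
    """One GRU step for a single hidden unit; p holds scalar weights."""
    z = sigmoid(p["Wz"] * x + p["Uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["Wr"] * x + p["Ur"] * h_prev + p["br"])  # reset gate
    h_cand = math.tanh(p["Wh"] * x + p["Uh"] * (r * h_prev) + p["bh"])
    return z * h_prev + (1.0 - z) * h_cand  # interpolation


def lstm_step(x: float, h_prev: float, c_prev: float, p: dict, g=math.tanh):
    """One LSTM step for a single hidden unit; returns (h_t, c_t)."""
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["bf"])  # forget gate
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["bi"])  # input gate
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["bo"])  # output gate
    c_cand = math.tanh(p["Wc"] * x + p["Uc"] * h_prev + p["bc"])
    c = f * c_prev + i * c_cand  # addition
    h = o * g(c)
    return h, c


def zero_params(names: str) -> dict:
    """Hypothetical helper: all-zero W, U, b for the given gate letters."""
    return {prefix + n: 0.0 for n in names for prefix in ("W", "U", "b")}


if __name__ == "__main__":
    print(gru_step(0.0, 0.4, zero_params("zrh")))
    print(lstm_step(0.0, 0.0, 2.0, zero_params("fioc")))
```

With all-zero parameters every gate equals 0.5 and every candidate equals 0, which makes the two update rules easy to check by hand.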
Popular Architectures

GRU
$z_t \in (0,1)$, $r_t \in (0,1)$, $\tilde{h}_t \in (-1,1)$ (tanh)
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
Interpolation: $h_t$ always lies between $h_{t-1}$ and $\tilde{h}_t$. Bounded!

LSTM
$f_t \in (0,1)$, $i_t \in (0,1)$, $o_t \in (0,1)$, $\tilde{c}_t \in (-1,1)$ (tanh)
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
$h_t = o_t \circ g(c_t)$
Addition
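The difference between interpolation and addition can be seen numerically in a small sketch. The gate and candidate values are drawn at random or fixed by hand instead of being computed from weights; these are illustrative assumptions, not trained values.

```python
import random


def gru_update(h_prev: float, z: float, h_cand: float) -> float:
    """GRU state update: an interpolation between h_prev and the candidate."""
    return z * h_prev + (1.0 - z) * h_cand


def lstm_update(c_prev: float, f: float, i: float, c_cand: float) -> float:
    """LSTM cell update: a gated sum of c_prev and the candidate."""
    return f * c_prev + i * c_cand


def run(steps: int = 1000, seed: int = 0):
    """Return (largest |h| seen for the GRU, final LSTM cell value)."""
    rng = random.Random(seed)
    h = c = 0.0
    max_abs_h = 0.0
    for _ in range(steps):
        z = rng.uniform(0.0, 1.0)        # arbitrary gate value
        h_cand = rng.uniform(-1.0, 1.0)  # arbitrary candidate value
        h = gru_update(h, z, h_cand)
        max_abs_h = max(max_abs_h, abs(h))
        # Hand-picked, nearly saturated gates and candidate for the LSTM.
        c = lstm_update(c, f=0.999, i=0.999, c_cand=0.999)
    return max_abs_h, c


if __name__ == "__main__":
    max_abs_h, c = run()
    print("largest |h_t| (GRU):", max_abs_h)
    print("final c_t (LSTM):", c)
```

Whatever the gates do, the GRU state cannot leave the range of its candidates, while the LSTM cell moves well outside $(-1, 1)$.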
Popular Architectures

GRU
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
Interpolation. Bounded!

LSTM
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$, $h_t = o_t \circ g(c_t)$
With saturated gates, the cell update implements the counter operations:
• Addition: $f_t \approx 1$, $i_t \approx 1$, so $c_t \approx c_{t-1} + \tilde{c}_t$
• Increase by 1: $f_t \approx 1$, $i_t \approx 1$, $\tilde{c}_t \approx 1$, so $c_t \approx c_{t-1} + 1$
• Decrease by 1: $f_t \approx 1$, $i_t \approx 1$, $\tilde{c}_t \approx -1$, so $c_t \approx c_{t-1} - 1$
• Do Nothing: $f_t \approx 1$, $i_t \approx 0$, so $c_t \approx c_{t-1}$
• Reset: $f_t \approx 0$, $i_t \approx 0$, so $c_t \approx 0$
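The four counter operations can be sketched with a single LSTM cell whose gates are driven into saturation. The pre-activation magnitude `BIG` and the `OPS` table are hand-picked assumptions standing in for what trained $W x_t + U h_{t-1} + b$ terms would have to produce.

```python
import math

BIG = 40.0  # assumption: large enough to saturate sigmoid and tanh in float64


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical pre-activations (forget gate, input gate, candidate) per operation.
OPS = {
    "inc": (BIG, BIG, BIG),      # f ~ 1, i ~ 1, candidate ~ +1
    "dec": (BIG, BIG, -BIG),     # f ~ 1, i ~ 1, candidate ~ -1
    "noop": (BIG, -BIG, 0.0),    # f ~ 1, i ~ 0
    "reset": (-BIG, -BIG, 0.0),  # f ~ 0, i ~ 0
}


def cell_update(c_prev: float, op: str) -> float:
    """Apply c_t = f * c_prev + i * candidate with gates set for `op`."""
    pre_f, pre_i, pre_c = OPS[op]
    f, i, c_cand = sigmoid(pre_f), sigmoid(pre_i), math.tanh(pre_c)
    return f * c_prev + i * c_cand


def run(ops, c: float = 0.0) -> float:
    """Apply a sequence of operations to the cell and return its final value."""
    for op in ops:
        c = cell_update(c, op)
    return c


if __name__ == "__main__":
    c = run(["inc"] * 5 + ["dec"] * 2 + ["noop"])
    print("after 5 inc, 2 dec, 1 noop:", c)
    print("after reset:", run(["reset"], c))
```

The GRU has no analogous setting: its interpolation keeps the state within the range of the candidates, so no choice of gates yields an unbounded count.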