On the Practical Computational Power of Finite Precision RNNs for Language Recognition

  1. On the Practical Computational Power of Finite Precision RNNs for Language Recognition. Gail Weiss, Yoav Goldberg, Eran Yahav. GRU < LSTM (!?) Supported by the European Union’s Seventh Framework Programme (FP7) under grant agreement no. 615688 (PRIME).

  2. Current State • RNNs are everywhere • We don’t know too much about the differences between them: • Gated RNNs are shown to train better; beyond that: • “RNNs are Turing Complete”?

  3. Turing Complete?

  4. Turing Complete? 1993 proof: (1) Requires infinite precision: uses stack(s), maintained in certain dimension(s); zeros are pushed using division (using $g = g/4 + 1/4$); in 32 bits, this reaches the limit after 15 pushes. (2) Requires infinite time: allows processing steps beyond reading the input (not the standard use case!). Unreasonable assumptions!

  5. Turing Complete? 1993 proof: (1) Requires infinite precision: uses stack(s), maintained in certain dimension(s); zeros are pushed using division (using $g = g/4 + 1/4$); in 32 bits, this reaches the limit after 15 pushes. (2) Requires infinite time: allows processing steps beyond reading the input (not the standard use case!). Unreasonable assumptions! Overlay: TURING TARPIT!
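
The precision limit on slides 4 and 5 can be checked numerically. The sketch below is a minimal illustration, not the 1993 construction itself: it assumes an initially empty stack encoded as g = 0, applies only the zero-push update g = g/4 + 1/4, and rounds every result to a 32-bit float through Python's `struct` module. The helper names `to_float32` and `pushes_until_saturation` are hypothetical. The exact count depends on the stack encoding and the starting value, so it need not match the figure quoted on the slide.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def pushes_until_saturation(max_steps: int = 1000) -> int:
    """Count zero-pushes (g = g/4 + 1/4) that still change a float32 value.

    Assumption: the stack starts empty and is encoded as g = 0.0.
    """
    g = 0.0
    for n in range(max_steps):
        nxt = to_float32(g / 4 + 0.25)
        if nxt == g:
            # This push left g unchanged, so it cannot be recovered later.
            return n
        g = nxt
    return max_steps


if __name__ == "__main__":
    print("pushes before float32 saturates:", pushes_until_saturation())
```

Each push moves g closer to 1/3 by a factor of four, so the number of distinguishable pushes grows only linearly with the number of mantissa bits.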

  6. What happens on real hardware and real use-cases?

  7. Real Use • Gated architectures have the best performance • LSTM and GRU are most popular • Of these, the choice between them is unclear

  8. Main Result: We accept that all RNN types can simulate DFAs. We show that LSTMs and IRNNs can also count, and that GRUs and SRNNs cannot.

  9. Power of Counting • Practical: in NMT, the LSTM is better at capturing target length

  10. Power of Counting • Practical: in NMT, the LSTM is better at capturing target length • Theoretical: Finite State Machines vs Counter Machines

  11. K-Counter Machines (SKCMs), Fischer, Meyer, Rosenberg, 1968 • Similar to finite automata, but also maintain k counters • A counter has 4 operations: inc/dec by one, do nothing, reset • Counters are observed by comparison to zero
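
The counter-machine description on slide 11 can be made concrete with a one-counter recognizer. This is a minimal sketch under stated assumptions, not the formal definition from the 1968 paper: the function name `accepts_anbn` is hypothetical, the finite control is reduced to a single boolean, and the empty word (n = 0) is accepted. The counter is only ever incremented, decremented, or compared to zero.

```python
def accepts_anbn(word: str) -> bool:
    """Accept a^n b^n (n >= 0) with one counter and a two-state control."""
    counter = 0
    seen_b = False  # finite control: have we started reading b's?
    for ch in word:
        if ch == "a":
            if seen_b:
                return False  # an 'a' after a 'b' breaks the a*b* shape
            counter += 1  # inc
        elif ch == "b":
            seen_b = True
            if counter == 0:  # counters are observed only by comparison to zero
                return False
            counter -= 1  # dec
        else:
            return False  # symbol outside the alphabet {a, b}
    return counter == 0


if __name__ == "__main__":
    for w in ["", "ab", "aabb", "aab", "abab"]:
        print(repr(w), accepts_anbn(w))
```

A plain finite automaton cannot do this, because the number of pending a's is unbounded while its state set is fixed.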

  12. Counting Machines and Chomsky Hierarchy: Regular Languages (RL) ⊂ Context Free Languages (CFL) ⊂ Context Sensitive Languages (CSL) ⊂ Recursively Enumerable Languages (RE)

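  13. Chomsky Hierarchy and SKCMs: Regular Languages (RL) ⊂ Context Free Languages (CFL) ⊂ Context Sensitive Languages (CSL) ⊂ Recursively Enumerable Languages (RE). Example languages marked on the diagram: $a^n b^n$ and palindromes (both context-free, not regular).

  14. Chomsky Hierarchy and SKCMs: Regular Languages (RL) ⊂ Context Free Languages (CFL) ⊂ Context Sensitive Languages (CSL) ⊂ Recursively Enumerable Languages (RE). Example languages marked on the diagram: $a^n b^n$ and palindromes (both context-free, not regular), and $a^n b^n c^n$ (context-sensitive, not context-free).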

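  17. Chomsky Hierarchy and SKCMs: SKCMs cross the Chomsky Hierarchy! Regular Languages (RL) ⊂ Context Free Languages (CFL) ⊂ Context Sensitive Languages (CSL) ⊂ Recursively Enumerable Languages (RE). Example languages marked on the diagram: $a^n b^n$ and palindromes (both context-free, not regular), and $a^n b^n c^n$ (context-sensitive, not context-free).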
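
To illustrate how counters reach past the context-free level, the sketch below recognizes $a^n b^n c^n$ with two counters. It is an illustrative assumption rather than a construction taken from the slides: the name `accepts_anbncn` is hypothetical, the finite control tracks only the current phase of a*b*c*, and n = 0 is accepted.

```python
def accepts_anbncn(word: str) -> bool:
    """Accept a^n b^n c^n (n >= 0) with two counters and a finite control."""
    order = {"a": 0, "b": 1, "c": 2}
    phase = 0        # finite control: which block of a*b*c* we are in
    a_minus_b = 0    # counter 1: a's not yet matched by b's
    b_minus_c = 0    # counter 2: b's not yet matched by c's
    for ch in word:
        if ch not in order or order[ch] < phase:
            return False  # unknown symbol, or a symbol out of order
        phase = order[ch]
        if ch == "a":
            a_minus_b += 1
        elif ch == "b":
            if a_minus_b == 0:  # zero test: more b's than a's
                return False
            a_minus_b -= 1
            b_minus_c += 1
        else:
            if b_minus_c == 0:  # zero test: more c's than b's
                return False
            b_minus_c -= 1
    return a_minus_b == 0 and b_minus_c == 0


if __name__ == "__main__":
    for w in ["", "abc", "aabbcc", "aabbc", "abcabc"]:
        print(repr(w), accepts_anbncn(w))
```

Each input symbol triggers at most one increment or decrement per counter, and acceptance depends only on the final state and on whether the counters are zero.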

  18. Summary so Far • Counters give additional formal power • We claimed that LSTM can count and GRU cannot • Let’s see why

  20. Popular Architectures
      GRU:
      $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
      $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
      $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$
      $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
      LSTM:
      $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
      $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
      $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
      $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
      $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
      $h_t = o_t \circ g(c_t)$

  21. Popular Architectures
      Gates:
      GRU: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$; $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
      LSTM: $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$; $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$; $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
      Candidate vectors:
      GRU: $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$
      LSTM: $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
      Update functions:
      GRU: $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
      LSTM: $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$
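
The GRU and LSTM equations on slides 20 and 21 can be read as two small step functions. The sketch below is a simplification for illustration only: it assumes a single hidden unit and a scalar input, so every weight is a plain number and the elementwise product becomes ordinary multiplication. The function names and the parameter dictionary keys (`"Wz"`, `"Uz"`, `"bz"`, and so on) are hypothetical, and `g` defaults to tanh as an assumption.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: float, h_prev: float, p: dict) -> float:
    """One GRU step for a single hidden unit; returns the new state h_t."""
    z = sigmoid(p["Wz"] * x + p["Uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["Wr"] * x + p["Ur"] * h_prev + p["br"])  # reset gate
    h_cand = math.tanh(p["Wh"] * x + p["Uh"] * (r * h_prev) + p["bh"])
    return z * h_prev + (1.0 - z) * h_cand  # interpolation


def lstm_step(x: float, h_prev: float, c_prev: float, p: dict, g=math.tanh):
    """One LSTM step for a single hidden unit; returns (h_t, c_t)."""
    f = sigmoid(p["Wf"] * x + p["Uf"] * h_prev + p["bf"])  # forget gate
    i = sigmoid(p["Wi"] * x + p["Ui"] * h_prev + p["bi"])  # input gate
    o = sigmoid(p["Wo"] * x + p["Uo"] * h_prev + p["bo"])  # output gate
    c_cand = math.tanh(p["Wc"] * x + p["Uc"] * h_prev + p["bc"])
    c = f * c_prev + i * c_cand  # addition
    h = o * g(c)
    return h, c
```

The only structural difference that matters for counting is the last update line of each function: the GRU mixes the old state with the candidate, while the LSTM adds to the old cell value.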

  22. Popular Architectures
      Gates:
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$
      LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$
      Candidate vectors:
      GRU: $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$
      LSTM: $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
      Update functions:
      GRU: $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
      LSTM: $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$

  23. Popular Architectures
      Gates:
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$
      LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$
      Candidate vectors (tanh):
      GRU: $\tilde{h}_t \in (-1,1)$
      LSTM: $\tilde{c}_t \in (-1,1)$
      Update functions:
      GRU: $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
      LSTM: $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$

  24. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
      LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$

  25. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation)
      LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$

  26. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation: Bounded!)
      LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$; $h_t = o_t \circ g(c_t)$

  27. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation: Bounded!)
      LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$ (highlighted); $h_t = o_t \circ g(c_t)$

  28. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation: Bounded!)
      LSTM: $f_t \in (0,1)$; $i_t \in (0,1)$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$ (Addition); $h_t = o_t \circ g(c_t)$
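
The contrast between interpolation and addition on slides 25 to 28 can be seen by iterating the two update functions. The sketch below is an idealized illustration: the gate and candidate values are fixed constants chosen by hand (hypothetical, not learned), and the LSTM gates are set to exactly 1.0, which a sigmoid only approaches. The function names are also hypothetical.

```python
def gru_update(h_prev: float, z: float, h_cand: float) -> float:
    """GRU update: a convex combination of the old state and the candidate."""
    return z * h_prev + (1.0 - z) * h_cand


def lstm_update(c_prev: float, f: float, i: float, c_cand: float) -> float:
    """LSTM cell update: gated old cell value plus gated candidate."""
    return f * c_prev + i * c_cand


def run(steps: int):
    """Apply both updates `steps` times with fixed, hand-picked values."""
    h, c = 0.0, 0.0
    for _ in range(steps):
        h = gru_update(h, z=0.5, h_cand=0.999)        # stays inside (-1, 1)
        c = lstm_update(c, f=1.0, i=1.0, c_cand=0.999)  # grows with each step
    return h, c


if __name__ == "__main__":
    h, c = run(100)
    print("GRU state after 100 steps:", h)
    print("LSTM cell after 100 steps:", c)
```

Because the GRU state is always a weighted average of values in (-1, 1), it can never leave that interval, whereas the LSTM cell value grows roughly linearly with the number of steps.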

  29. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation: Bounded!)
      LSTM: $f_t \approx 1$; $i_t \approx 1$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \approx c_{t-1} + \tilde{c}_t$ (Addition); $h_t = o_t \circ g(c_t)$

  30. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation: Bounded!)
      LSTM: $f_t \approx 1$; $i_t \approx 1$; $o_t \in (0,1)$; $\tilde{c}_t \approx 1$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \approx c_{t-1} + 1$ (Increase by 1); $h_t = o_t \circ g(c_t)$

  31. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation: Bounded!)
      LSTM: $f_t \approx 1$; $i_t \approx 1$; $o_t \in (0,1)$; $\tilde{c}_t \approx -1$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \approx c_{t-1} - 1$ (Decrease by 1); $h_t = o_t \circ g(c_t)$

  32. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation: Bounded!)
      LSTM: $f_t \approx 1$; $i_t \approx 0$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \approx c_{t-1}$ (Do Nothing); $h_t = o_t \circ g(c_t)$

  33. Popular Architectures
      GRU: $z_t \in (0,1)$; $r_t \in (0,1)$; $\tilde{h}_t \in (-1,1)$; $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (Interpolation: Bounded!)
      LSTM: $f_t \approx 0$; $i_t \approx 0$; $o_t \in (0,1)$; $\tilde{c}_t \in (-1,1)$; $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \approx 0$ (Reset); $h_t = o_t \circ g(c_t)$
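
The four gate settings on slides 30 to 33 correspond to the four counter operations (increase, decrease, do nothing, reset). The sketch below is a hand-built illustration under stated assumptions, not a trained network: the pre-activation values are fixed constants (`BIG = 20.0` is an arbitrary choice that saturates the sigmoid and tanh), and the names `OPS`, `apply_op`, and `run` are hypothetical. In a real LSTM these pre-activations would be produced by the weights from the current input and previous state.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


BIG = 20.0  # assumed pre-activation magnitude; sigmoid(20) is within ~2e-9 of 1

# Pre-activations for (forget gate, input gate, candidate), per operation.
OPS = {
    "inc":   (BIG,  BIG,  BIG),   # f ~ 1, i ~ 1, candidate ~ +1
    "dec":   (BIG,  BIG, -BIG),   # f ~ 1, i ~ 1, candidate ~ -1
    "noop":  (BIG, -BIG,  0.0),   # f ~ 1, i ~ 0
    "reset": (-BIG, -BIG, 0.0),   # f ~ 0, i ~ 0
}


def apply_op(c_prev: float, op: str) -> float:
    """Apply one LSTM cell update with gates saturated for the given operation."""
    f_pre, i_pre, c_pre = OPS[op]
    f, i, c_cand = sigmoid(f_pre), sigmoid(i_pre), math.tanh(c_pre)
    return f * c_prev + i * c_cand


def run(ops, c: float = 0.0) -> float:
    """Apply a sequence of operations to the cell value, starting from c."""
    for op in ops:
        c = apply_op(c, op)
    return c


if __name__ == "__main__":
    print(run(["inc", "inc", "inc", "dec", "noop"]))  # close to 2
    print(run(["inc", "inc", "reset"]))               # close to 0
```

The results are only approximately integer, because a sigmoid never reaches 0 or 1 exactly; the small per-step error is the finite-precision cost of using the cell as a counter.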
