

  1. Parallelizable StackLSTM. Shuoyang Ding, Philipp Koehn. NAACL 2019 Structured Prediction Workshop, Minneapolis, MN, United States. June 7th, 2019.

  2. Outline
  • What is StackLSTM?
  • Parallelization Problem
  • Homogenizing Computation
  • Experiments

  3. What is StackLSTM?

  4. A Partial Tree

  5-6. Good Edge?

  7. LSTM?

  8. :(

  9. StackLSTM
  • An LSTM whose states are stored in a stack
  • Computation is conditioned on the stack operation
  (Dyer et al., 2015; Ballesteros et al., 2017)
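As a concrete reference point, here is a minimal sequential sketch of the idea in PyTorch (hypothetical class and method names, not the authors' implementation): push runs one LSTMCell step conditioned on the current stack top, while pop only discards the top state.

```python
import torch
import torch.nn as nn

class SequentialStackLSTM(nn.Module):
    """Minimal (unbatched) StackLSTM sketch: hidden states live in a
    stack, and each step is conditioned on a stack operation."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size

    def reset(self):
        # The stack holds (h, c) pairs; position 0 is an empty initial state.
        zero = torch.zeros(1, self.hidden_size)
        self.stack = [(zero, zero)]

    def push(self, x):
        # Read the stack top, run one LSTM step, put the new state on top.
        h, c = self.stack[-1]
        self.stack.append(self.cell(x, (h, c)))

    def pop(self):
        # Popping discards the top state; no LSTM computation happens.
        self.stack.pop()

    def top(self):
        # Hidden state of the current stack top.
        return self.stack[-1][0]
```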

  10. StackLSTM

  11. Push ","

  12. Pop

  13. Push "61"

  14. Push "years"

  15. Push "old"

  16. Pop

  17. Pop

  18. Pop

  19. Push ","

  20. Pop

  21. Push "will"

  22. Push "join"

  23. :)
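Replaying the push/pop walkthrough above with that sketch (random vectors stand in for learned word embeddings):

```python
import torch

# Replay slides 11-22 with the SequentialStackLSTM sketch from above.
model = SequentialStackLSTM(input_size=50, hidden_size=100)
model.reset()

trace = [("push", ","), ("pop", None), ("push", "61"), ("push", "years"),
         ("push", "old"), ("pop", None), ("pop", None), ("pop", None),
         ("push", ","), ("pop", None), ("push", "will"), ("push", "join")]

for op, word in trace:
    if op == "push":
        model.push(torch.randn(1, 50))  # stand-in embedding of `word`
    else:
        model.pop()

print(model.top().shape)  # torch.Size([1, 100])
```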

  24. Parallelization Problem

  25-26. LSTM

  27. Batched LSTM

  28. Batched… StackLSTM?

  29. :(

  30. Wouldn’t it be nice if…
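The obstacle, made concrete with a hypothetical two-example batch (none of this is from the paper): a plain LSTM advances the whole batch in one cell call, while StackLSTM's data-dependent operations force a per-example branch.

```python
import torch
import torch.nn as nn

batch, input_size, hidden_size = 2, 50, 100
cell = nn.LSTMCell(input_size, hidden_size)
x_t = torch.randn(batch, input_size)

# Plain LSTM: every example does the same computation at step t,
# so one batched cell call covers everything.
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
h, c = cell(x_t, (h, c))

# StackLSTM: at the same step t, example 0 pops while example 1 pushes.
# The data-dependent branch serializes the batch into one-example calls.
ops_t = ["pop", "push"]
for b, op in enumerate(ops_t):
    if op == "push":
        h_b, c_b = cell(x_t[b:b+1], (h[b:b+1], c[b:b+1]))
    else:
        pass  # only example b's stack pointer moves; no cell call
```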

  31. Homogenizing Computation

  32-36. Push
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t) + 1};
  • update stack top pointer p(t+1) = p(t) + 1.

  37-38. Pop
  • update stack top pointer p(t+1) = p(t) - 1.

  39. Observation 1
  Push:
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t) + 1};
  • update stack top pointer p(t+1) = p(t) + 1.
  Pop:
  • update stack top pointer p(t+1) = p(t) - 1.

  40. Observation 1
  Use op = +1 for push and op = -1 for pop: both pointer updates become p(t+1) = p(t) + op.

  41. Observation 1: The computation performed for the Pop operation is a subset of that for the Push operation.
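Observation 1 collapses the two pointer updates into one branch-free line that also batches trivially; a tiny sketch with made-up pointer values:

```python
import torch

# Stack-top pointers for a batch of four parsers, and this step's
# operations encoded as +1 (push) or -1 (pop).
p = torch.tensor([3, 1, 2, 5])
op = torch.tensor([+1, -1, -1, +1])

# One update rule covers both operations for the whole batch:
p = p + op  # tensor([4, 0, 1, 6])
```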

  42. Observation 2: Is it safe to perform the remaining Push computations for Pop as well?

  43-44. Observation 2
  Push:
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t) + 1};
  • update stack top pointer p(t+1) = p(t) + op.
  Pop:
  • update stack top pointer p(t+1) = p(t) + op.

  45. Observation 2: A write will always happen before the stack top pointer advances.

  46-47. Observation 2: If one wants to write anything at a higher position than the current stack top pointer… Just do it!


  49. Observation 2
  Pop can now run the same steps as Push:
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t) + 1};
  • update stack top pointer p(t+1) = p(t) + op.

  50. Done!
  • read the stack top hidden state h_{p(t)};
  • perform LSTM forward computation with x(t) and h_{p(t)};
  • write new hidden state to h_{p(t) + 1};
  • update stack top pointer p(t+1) = p(t) + op.
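With every operation now executing the same four sub-steps, the whole batch can advance in a single LSTMCell call. Below is a minimal batched sketch of that recipe (hypothetical names and a fixed maximum stack depth; this is not the authors' hoolock code, which among other things must also mask out steps beyond the end of shorter action sequences):

```python
import torch
import torch.nn as nn

class BatchedStackLSTM(nn.Module):
    """Sketch of the homogenized step: every example runs the same four
    sub-steps, so one LSTMCell call serves the whole batch."""

    def __init__(self, input_size, hidden_size, max_depth=150):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.max_depth = max_depth

    def reset(self, batch_size):
        # One state buffer per example, plus a stack-top pointer p.
        self.h = torch.zeros(batch_size, self.max_depth, self.hidden_size)
        self.c = torch.zeros(batch_size, self.max_depth, self.hidden_size)
        self.p = torch.zeros(batch_size, dtype=torch.long)

    def step(self, x, op):
        """x: (batch, input_size); op: (batch,) of +1 (push) / -1 (pop)."""
        idx = torch.arange(x.size(0))
        # 1. read the stack top hidden state h_{p(t)}
        h_top, c_top = self.h[idx, self.p], self.c[idx, self.p]
        # 2. LSTM forward with x(t) and h_{p(t)}, for pops too (wasted
        #    but harmless work that keeps the batch homogeneous)
        h_new, c_new = self.cell(x, (h_top, c_top))
        # 3. write the new state to position p(t) + 1; cloning keeps the
        #    buffers out-of-place for autograd. A pop's write lands above
        #    its new stack top and is overwritten before it can be read.
        self.h = self.h.clone()
        self.c = self.c.clone()
        self.h[idx, self.p + 1] = h_new
        self.c[idx, self.p + 1] = c_new
        # 4. update the stack top pointer: p(t+1) = p(t) + op
        self.p = self.p + op
        return self.h[idx, self.p]  # new stack tops

# Example: four parsers advance in lockstep, two pushing and two popping.
model = BatchedStackLSTM(input_size=50, hidden_size=100)
model.reset(batch_size=4)
model.step(torch.randn(4, 50), torch.tensor([+1, +1, +1, +1]))
model.step(torch.randn(4, 50), torch.tensor([+1, -1, +1, -1]))
```

The price of homogenization is that Pop performs, and immediately discards, a full Push's worth of computation; in exchange, each step is one batched matrix multiply instead of a Python loop over examples.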

  51. Experiments

  52. Benchmark
  • Transition-based dependency parsing on the Stanford Dependency Treebank
  • PyTorch, single K80 GPU

  53. Hyperparameters
  • Largely following Dyer et al. (2015) and Ballesteros et al. (2017), except:
  • Adam w/ ReduceLROnPlateau and warmup (see the sketch below)
  • Arc-Hybrid w/o composition function
  • Slightly larger models (200 hidden, 200 state, 48 action embedding) perform better
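A sketch of that optimizer setup (the warmup shape and every number below are hypothetical; the slide names the components but not their settings):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the parser
base_lr = 1e-3  # hypothetical
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
# Cut the learning rate when the dev score stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)

warmup_steps = 2000  # hypothetical

def lr_at(step):
    # Linear warmup to base_lr (one plausible shape; not specified).
    return base_lr * min(1.0, (step + 1) / warmup_steps)

# Inside the training loop:
#   for g in optimizer.param_groups:
#       g["lr"] = lr_at(step)
#   ... forward / backward / optimizer.step() ...
# After each dev evaluation:
#   scheduler.step(dev_score)
```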

  54-55. Speed

  56. Performance
  [Chart: parsing accuracy (91 to 93) vs. batch size (8, 16, 32, 64, 128, 256), ours vs. Ballesteros 2017]

  57. Conclusion

  58. Conclusion
  • We propose a parallelization scheme for the StackLSTM architecture.
  • Together with a different optimizer, we are able to train parsers of comparable performance within 1 hour.
  Paper, code, and slides: https://github.com/shuoyangd/hoolock
