Parallelizable StackLSTM
Shuoyang Ding, Philipp Koehn
NAACL 2019 Structured Prediction Workshop
Minneapolis, MN, United States
June 7th, 2019
Outline
• What is StackLSTM?
• Parallelization Problem
• Homogenizing Computation
• Experiments
What is StackLSTM?
A Partial Tree
Good Edge?
LSTM?
:(
StackLSTM
• An LSTM whose states are stored in a stack
• Computation is conditioned on the stack operation
Dyer et al. (2015); Ballesteros et al. (2017)
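To make the idea concrete, here is a minimal sequential sketch of a StackLSTM in PyTorch. The class, its interface, and the "push"/"pop" string ops are assumptions for illustration, not the implementation of Dyer et al.:

```python
import torch
import torch.nn as nn

class StackLSTM(nn.Module):
    """Minimal sketch: hidden states live on a stack; each transition
    either pushes a new state or pops back to an earlier one."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size

    def forward(self, inputs, ops):
        # inputs: list of (input_size,) tensors; ops: list of "push"/"pop"
        h0 = torch.zeros(1, self.hidden_size)
        stack = [(h0, h0.clone())]          # bottom-of-stack state
        for x, op in zip(inputs, ops):
            if op == "push":
                h, c = self.cell(x.unsqueeze(0), stack[-1])
                stack.append((h, c))        # new state becomes the stack top
            else:
                stack.pop()                 # pop: discard the stack top
        return stack[-1][0]                 # hidden state at the stack top
```

Each Push runs one LSTM step on top of the current stack top; each Pop simply discards the top state, so the next Push continues from an earlier point in the history.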
StackLSTM
Push ,
Pop
Push 61
Push years
Push old
Pop
Pop
Pop
Push ,
Pop
Push will
Push join
:)
Parallelization Problem
LSTM
Batched LSTM
Batched… StackLSTM?
:(
Wouldn’t it be nice if…
Homogenizing Computation
Push
• read the stack top hidden state h_{p(t)};
• perform LSTM forward computation with x(t) and h_{p(t)};
• write the new hidden state to h_{p(t) + 1};
• update the stack top pointer: p(t+1) = p(t) + 1.
Pop
• update the stack top pointer: p(t+1) = p(t) - 1.
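In buffer terms, the two operations look like the following sketch. The preallocated hbuf/cbuf buffers and helper names are our assumptions, following the slide notation:

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(100, 200)                      # illustrative sizes
hbuf = [torch.zeros(1, 200) for _ in range(64)]   # preallocated h buffer
cbuf = [torch.zeros(1, 200) for _ in range(64)]   # preallocated c buffer

def push(hbuf, cbuf, cell, x, p):
    h, c = cell(x, (hbuf[p], cbuf[p]))  # read stack top h_{p(t)}, run LSTM
    hbuf[p + 1], cbuf[p + 1] = h, c     # write new state to p(t) + 1
    return p + 1                        # p(t+1) = p(t) + 1

def pop(p):
    return p - 1                        # only the pointer moves
```

Push touches the LSTM cell and the buffers; Pop is pure pointer arithmetic. That asymmetry is what the next two observations remove.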
Observation 1
Push: read the stack top hidden state h_{p(t)}; perform LSTM forward computation with x(t) and h_{p(t)}; write the new hidden state to h_{p(t) + 1}; update the stack top pointer p(t+1) = p(t) + 1.
Pop: update the stack top pointer p(t+1) = p(t) - 1.
The computation performed for the Pop operation is a subset of the Push operation.
Use op = +1 for Push and op = -1 for Pop, so both operations share the same pointer update: p(t+1) = p(t) + op.
Observation 2
Is it safe to perform the remaining Push computations for Pop as well?
A write always happens before the stack top pointer advances.
So if one wants to write anything to a position above the current stack top pointer… just do it!
Anything above the stack top is guaranteed to be overwritten before it can ever be read, so the extra write performed for Pop is harmless.
Push and Pop now perform exactly the same sequence of computations.
Done!
• read the stack top hidden state h_{p(t)};
• perform LSTM forward computation with x(t) and h_{p(t)};
• write the new hidden state to h_{p(t) + 1};
• update the stack top pointer: p(t+1) = p(t) + op.
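A batched sketch of this homogenized step follows; shapes, names, and the op encoding are assumptions based on the slides, and this is a forward-pass illustration rather than the released implementation:

```python
import torch

def batched_step(hbuf, cbuf, cell, x, p, op):
    # hbuf, cbuf: (batch, max_depth, hidden) state buffers
    # x: (batch, input_size) inputs; p: (batch,) stack-top pointers
    # op: (batch,) long tensor with +1 for Push and -1 for Pop
    b = torch.arange(x.size(0))
    h, c = cell(x, (hbuf[b, p], cbuf[b, p]))  # read all stack tops, one LSTM call
    hbuf[b, p + 1] = h                        # write to p(t)+1 unconditionally;
    cbuf[b, p + 1] = c                        # for Pop this slot is never read
    return p + op                             # p(t+1) = p(t) + op
```

Because every batch element now executes the identical read / compute / write / update sequence, the whole batch advances with a single LSTMCell call per transition; during training, the in-place buffer writes would need the usual autograd care.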
Experiments
Benchmark
• Transition-based dependency parsing on the Stanford Dependency Treebank
• PyTorch, single K80 GPU
Hyperparameters
• Largely following Dyer et al. (2015) and Ballesteros et al. (2017), except:
• Adam with ReduceLROnPlateau and warmup (a sketch follows this list)
• Arc-Hybrid without the composition function
• Slightly larger models (200 hidden units, 200 state units, 48-dimensional action embeddings) perform better
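A sketch of that optimizer setup; all numeric values are illustrative placeholders, not the paper's exact settings:

```python
import torch

model = torch.nn.Linear(200, 200)   # stand-in for the actual parser
base_lr, warmup_steps = 1e-3, 500   # assumed values

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
# after warmup, decay the learning rate when the dev metric plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)

def set_warmup_lr(step):
    # linear warmup over the first warmup_steps updates
    if step < warmup_steps:
        for group in optimizer.param_groups:
            group["lr"] = base_lr * (step + 1) / warmup_steps

# per update: set_warmup_lr(step); per epoch: scheduler.step(dev_accuracy)
```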
Speed
Performance
[Figure: parsing accuracy (y-axis, 91 to 93) vs. batch size (x-axis, 8 to 256) for Ours and Ballesteros 2017]
Conclusion
Conclusion
• We propose a parallelization scheme for the StackLSTM architecture.
• Together with a different optimizer, we are able to train parsers of comparable performance within 1 hour.
paper / code / slides: https://github.com/shuoyangd/hoolock