Multi-Threaded Composition of Finite-State Automata
Bryan Jurish, Berlin-Brandenburg Academy of Sciences, jurish@bbaw.de
Kay-Michael Würzner, University of Potsdam, wuerzner@uni-potsdam.de
FSMNLP 2013, St. Andrews, 17th July, 2013
FSMNLP 2013 / Jurish | Würzner / Multi-threaded composition – p. 1/25

Overview
  The Big Idea
    - The Situation
    - The Approach
  Parallel Composition Algorithms
    - Master-Slave
    - Peer-to-Peer
  Experiments
    - Materials
    - Method
    - Results
  Concluding Remarks

— The Big Idea —

The Situation
  No Free Lunch (anymore)
    - CPU frequency growth stagnating
    - Multiprocessor systems increasingly popular
    ⇒ “horizontal” scaling / multi-threading
  (W)FST Composition: $T_3 = (T_1 \circ T_2)$
    - Online: lexical lookup, Viterbi decoding, parsing, ...
    - Offline: lexicon compilation, statistical modelling, ...
    - no generic parallel implementation (that we know of)
  Amdahl’s Law: $S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$
    - Not all algorithms scale well horizontally ($P \ll 1$)
    - For FSTs, $P$ may depend on FST topology
    ⇒ not all FST compositions scale horizontally!

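To make Amdahl's Law concrete, here is a minimal Python sketch that evaluates S(N) for a few parallel fractions P. The function name `amdahl_speedup` and the sample values of P are illustrative assumptions, not taken from the slides.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's Law: S(N) = 1 / ((1 - P) + P / N)."""
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    # Illustrative parallel fractions (assumed values, not from the slides).
    for p in (0.5, 0.9, 0.99):
        row = ", ".join(
            f"N={n}: {amdahl_speedup(p, n):.2f}" for n in (1, 2, 4, 8, 16)
        )
        print(f"P={p}: {row}")
```

With P = 0.5, for instance, the speedup stays below 2 however many threads are added, since S(N) approaches 1 / (1 - P) as N grows.
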
The Basics
  Definition
    Given two $\varepsilon$-free FSTs
      $T_1 = \langle \Sigma, \Gamma, Q_1, q_{0_1}, F_1, E_1 \rangle$ and
      $T_2 = \langle \Gamma, \Delta, Q_2, q_{0_2}, F_2, E_2 \rangle$,
    $T_3 = (T_1 \circ T_2)$ is itself an FST with:
      $T_3 = \langle \Sigma, \Delta, Q_1 \times Q_2, (q_{0_1}, q_{0_2}), F_1 \times F_2, E_3 \rangle$
      $E_3 = \{ ((q_1, q_2), (r_1, r_2), a, c) \mid (q_1, r_1, a, b) \in E_1,\ (q_2, r_2, b, c) \in E_2 \}$
      $[\![T_3]\!] = [\![T_1]\!] \circ [\![T_2]\!] = \{ (x, z) \mid \exists y : (x, y) \in [\![T_1]\!] \ \&\ (y, z) \in [\![T_2]\!] \}$
  Properties
    - simple construction
    - requires $\varepsilon$-free FSTs
    - worst-case time: $O(|E_1 \times E_2|)$

Serial Algorithm
  compose($T_1 = \langle \Sigma, \Gamma, Q_1, q_{0_1}, F_1, E_1 \rangle$, $T_2 = \langle \Gamma, \Delta, Q_2, q_{0_2}, F_2, E_2 \rangle$)
   1  $Q \leftarrow \{(q_{0_1}, q_{0_2})\}$                                  /* initialize */
   2  $V \leftarrow \{(q_{0_1}, q_{0_2})\}$                                  /* visitation queue */
   3  while $V \neq \emptyset$ do
   4    $(q_1, q_2) \leftarrow \mathrm{pop}(V)$                              /* visit state */
   5    if $(q_1, q_2) \in F_1 \times F_2$ then                              /* final state */
   6      $F \leftarrow F \cup \{(q_1, q_2)\}$
   7    foreach $(e_1, e_2) \in E[q_1] \times E[q_2]$ with $\mathrm{o}[e_1] = \mathrm{i}[e_2]$ do   /* align edges */
   8      if $(\mathrm{n}[e_1], \mathrm{n}[e_2]) \notin Q$ then
   9        $Q \leftarrow Q \cup \{(\mathrm{n}[e_1], \mathrm{n}[e_2])\}$
  10        $V \leftarrow V \cup \{(\mathrm{n}[e_1], \mathrm{n}[e_2])\}$     /* enqueue for visitation */
  11      $E \leftarrow E \cup \{((q_1, q_2), (\mathrm{n}[e_1], \mathrm{n}[e_2]), \mathrm{i}[e_1], \mathrm{o}[e_2])\}$
  12  return $T_3 = \langle \Sigma, \Delta, Q, (q_{0_1}, q_{0_2}), F, E \rangle$

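The serial pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the authors' C++ implementation. The dictionary-based FST representation (keys `start`, `finals`, `arcs`) and the toy transducers `t1` and `t2` are assumptions made for this example. Weights are omitted, and both inputs are assumed to be ε-free.

```python
from collections import deque


def compose(t1, t2):
    """Breadth-first composition of two epsilon-free FSTs.

    Each FST is a dict with keys "start", "finals" (a set) and "arcs",
    where arcs maps a state to a list of (next_state, input, output).
    """
    start = (t1["start"], t2["start"])
    states = {start}             # Q: output states seen so far
    finals = set()               # F: output final states
    edges = set()                # E: output transitions
    queue = deque([start])       # V: FIFO visitation queue
    while queue:
        q1, q2 = queue.popleft()                      # visit state
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals.add((q1, q2))
        for r1, a, b in t1["arcs"].get(q1, ()):       # align edges
            for r2, b2, c in t2["arcs"].get(q2, ()):
                if b != b2:                           # o[e1] must equal i[e2]
                    continue
                nxt = (r1, r2)
                if nxt not in states:
                    states.add(nxt)
                    queue.append(nxt)                 # enqueue for visitation
                edges.add(((q1, q2), nxt, a, c))
    return {"start": start, "states": states, "finals": finals, "edges": edges}


# Toy example (hypothetical): t1 maps "a" to "b", t2 maps "b" to "c".
t1 = {"start": 0, "finals": {1}, "arcs": {0: [(1, "a", "b")]}}
t2 = {"start": 0, "finals": {1}, "arcs": {0: [(1, "b", "c")]}}
t3 = compose(t1, t2)
```

Only state pairs reachable from the initial pair are constructed, which is why the visitation queue drives the loop rather than an enumeration of all pairs in Q_1 × Q_2.
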
The Approach
  Parallel State Visitation (lines 4–11)
    - breadth-first search of output states ($V$: FIFO)
    - distributed output data ($Q$, $F$, $E$)
    - shared visitation queue ($V$)
  Amdahl’s Law Revisited
    $S_{\max} \approx \frac{|Q|}{1 + \mathrm{depth}(T_3)} = \frac{|Q|}{1 + \max_{q \in Q} \min_{\pi \in \Pi(q_0, q)} |\pi|}$
    $1 - P = \frac{1}{S_{\max}}$
    - assumes constant (average) state complexity
    - worst-case breadth-first visitation
  [Figure: two example automata, one with states 0–3 and one with states 0–5]
    - first example: $S_{\max} = 1$; $P = 0$
    - second example: $S_{\max} = \frac{3}{2}$; $P = \frac{1}{3}$

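The estimate S_max ≈ |Q| / (1 + depth) can be checked on small graphs with a breadth-first search. The sketch below is illustrative only. The function names and the adjacency-list representation are assumptions. The `chain` and `branching` topologies are hypothetical graphs chosen to reproduce the two pairs of values on the slide, since the original figures are not recoverable from the transcript.

```python
from collections import deque


def s_max(successors, start):
    """Estimate S_max = |Q| / (1 + depth) over states reachable from start.

    depth is the largest shortest-path distance from start, found by BFS.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for r in successors.get(q, ()):
            if r not in dist:
                dist[r] = dist[q] + 1
                queue.append(r)
    depth = max(dist.values())
    return len(dist) / (1 + depth)


def parallel_fraction(s):
    """Invert 1 - P = 1 / S_max to obtain P."""
    return 1.0 - 1.0 / s


# Hypothetical topologies (not recovered from the original figures).
chain = {0: [1], 1: [2], 2: [3]}                   # 4 states, depth 3
branching = {0: [1, 4], 1: [2], 2: [3], 4: [5]}    # 6 states, depth 3
```

A pure chain offers no parallelism because each breadth-first level holds a single state, whereas wider levels raise |Q| without increasing the depth.
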
— Algorithms —

Algorithm (Sketch): Master-Slave
  [Diagram: master connected to slave0, slave1, slave2, slave3]
  Superordinate Distribution of Work
    - state pairs $(q_1, q_2)$ passed to slaves for visitation
  Slave Tasks
    - align & expand transitions, globally enqueue visitation requests
  Shared Global Data
    - $V \subseteq Q_1 \times Q_2$: visitation queue
    - $Q \subseteq Q_1 \times Q_2$: visited states
    - n_q: output state counter (for serialization)
    - n_up: number of tasks currently assigned (for termination)

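A minimal sketch of the shared-data scheme above, using Python threads. It is an approximation under several assumptions, not the authors' implementation. There is no separate master thread, because workers pull state pairs directly from the shared queue. The n_q counter is omitted. The arc representation (state mapped to a list of `(next_state, input, output)` tuples) is hypothetical. Because of CPython's global interpreter lock, this shows only the synchronization structure and will not produce a real speedup.

```python
import threading
from collections import deque


def compose_shared_queue(arcs1, arcs2, start, n_workers=4):
    """Visit state pairs with worker threads sharing V, Q and n_up.

    arcs1 and arcs2 map a state to a list of (next_state, input, output).
    Returns (visited_states, edges).
    """
    visited = {start}               # Q: visited state pairs
    queue = deque([start])          # V: shared visitation queue
    edges = set()                   # E: output transitions
    n_up = 0                        # tasks currently assigned
    cond = threading.Condition()    # guards all shared data above

    def worker():
        nonlocal n_up
        while True:
            with cond:
                # Wait while idle but other workers may still enqueue work.
                while not queue and n_up > 0:
                    cond.wait()
                if not queue:       # empty queue and n_up == 0: terminate
                    return
                q1, q2 = queue.popleft()
                n_up += 1
            # Align and expand transitions outside the lock.
            found = [((q1, q2), (r1, r2), a, c)
                     for r1, a, b in arcs1.get(q1, ())
                     for r2, b2, c in arcs2.get(q2, ())
                     if b == b2]
            with cond:
                for edge in found:
                    edges.add(edge)
                    if edge[1] not in visited:
                        visited.add(edge[1])
                        queue.append(edge[1])   # global visitation request
                n_up -= 1
                cond.notify_all()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return visited, edges
```

Every enqueue and every update of the visited set passes through one lock here, which is the kind of shared-data contention the slide's global structures imply.
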
Algorithm (Sketch): Peer-to-Peer
  [Diagram: peer0, peer1, peer2, peer3 exchanging messages]
  State Partitioning Function
    $r : (q_1, q_2) \mapsto \left\lfloor \frac{q_1 + q_2}{2} \right\rfloor \bmod N$
    - peer $i$ visits states with $r(q_1, q_2) = i$
  Peer-to-Peer Message Passing
    $V \in \wp(E_1 \times E_2)^{N \times N}$
    - messages are aligned transitions $(e_1, e_2)$
    - sender: $r(\mathrm{p}[e_1], \mathrm{p}[e_2])$ → receiver: $r(\mathrm{n}[e_1], \mathrm{n}[e_2])$
  Shared Global Data
    - n_q: output state counter (for serialization)
    - n_up: number of messages currently enqueued (for termination)

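The partitioning and routing rules can be illustrated with a few lines of Python. This sketch assumes that states are non-negative integer IDs and that a transition is a tuple `(source, target, input, output)`. The function names `partition` and `route` are hypothetical.

```python
def partition(q1: int, q2: int, n_peers: int) -> int:
    """State partitioning function r(q1, q2) = floor((q1 + q2) / 2) mod N."""
    return ((q1 + q2) // 2) % n_peers


def route(e1, e2, n_peers):
    """Return (sender, receiver) peer indices for an aligned transition pair.

    Each transition is assumed to be (source, target, input, output).
    The sender owns the source state pair; the receiver owns the target pair.
    """
    sender = partition(e1[0], e2[0], n_peers)
    receiver = partition(e1[1], e2[1], n_peers)
    return sender, receiver


# Hypothetical aligned pair: 3 -> 5 in T1 and 4 -> 6 in T2, with N = 4 peers.
example = route((3, 5, "a", "b"), (4, 6, "b", "c"), 4)
```

Because each output state has exactly one owning peer, no peer needs a globally shared visited set, and only the message queues between pairs of peers are contended.
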
— Experiments —

Experiments
  Materials
    - 2,266 randomly generated WFSTs $T$
        - trie spine + random arcs: $\mathrm{depth}(T) \le 32$
        - (piecewise-)uniform sampling: $|Q_T|$, $|E_T|$, $|\Sigma|$
        - “embarrassingly parallel” topology: $P(T^{-1} \circ T) > 99\%$
    - algorithms implemented in C++: g++ v4.4.5
    - hexadecacore test machine: 16 hardware cores
  Method
    - for each generated $T$, compute $(T^{-1} \circ T)$
    - sample selection filter: $\frac{1}{64}\,\mathrm{sec} \le t_{\mathrm{serial}} \le 8\,\mathrm{sec}$
    - varied number of threads: $N \in \{1, 2, 4, 8, 16\}$
  Evaluation
    - average running time: 8 iterations per configuration
    - structural properties of $T$, $(T^{-1} \circ T)$: $|Q|$, $|E|$, ...

Results: Master-Slave
  [Plot: speedup S = t.serial / t.ms versus E / Q (log-log axes); series: serial, ms: 2, ms: 4, ms: 8, ms: 16]
  $P \approx -23.5\%$, $\sigma = 82.3\%$

Results: Peer-to-Peer
  [Plot: speedup S = t.serial / t.pp versus E / Q (log-log axes); series: serial, pp: 2, pp: 4, pp: 8, pp: 16]
  $P \approx 83.1\%$, $\sigma = 7.18\%$

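As a rough reading of these estimates, the values of P reported on the two results slides can be plugged into Amdahl's Law from the start of the talk. The numbers below are implied by the formula alone and are not measurements from the experiments. The subscripts pp and ms are labels introduced here for the two algorithms.

```latex
\begin{align*}
S(N) &= \frac{1}{(1 - P) + \frac{P}{N}} \\
S_{pp}(16) &\approx \frac{1}{(1 - 0.831) + \frac{0.831}{16}}
            = \frac{1}{0.169 + 0.0519} \approx 4.53 \\
\lim_{N \to \infty} S_{pp}(N) &= \frac{1}{1 - 0.831} \approx 5.92 \\
S_{ms}(16) &\approx \frac{1}{(1 + 0.235) - \frac{0.235}{16}}
            = \frac{1}{1.235 - 0.0147} \approx 0.82
\end{align*}
```

A negative P, as fitted for the master-slave variant, corresponds to S(N) < 1 for every N > 1, that is, a slowdown relative to the serial algorithm.
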
So What About NLP?
  Lexical Lookup
    - many “small” compositions: $\mathrm{Id}(w) \circ T_{\mathrm{Lex}}$ for $w \in W$
    - topology-dependent $S_{\max}$ ⇒ prefer high-level fork() over $W$
  Corpus Analysis
    - single “large” composition: $A_{\mathrm{Corpus}} \circ T_{\mathrm{Anal}}$
    - distributed representation ⇒ serialization overhead
  Model Compilation
    - offline “large” composition: $T_{\mathrm{Error}} \circ A_{\mathrm{Lex}}$
    - partitioning function ⇒ task-dependent tuning

Concluding Remarks
  Summary
    - No (more) Free Lunch: parallelization of “traditional” serial algorithms
    - Amdahl’s Law Applied: maximum speedup depends on FST topology
    - Sharing (data) Hurts: distributed synchronization improves performance
  Future Directions
    - improve sampling procedure
    - extend to other FST operations
        - determinization
        - minimization
        - cascaded best-path lookup

The End
  Thank you for listening!

— Addenda —
  2d Plots
    - t.serial : S
    - E : S
    - E/Q : S
  3d Plots
    - E/Q : N : S
    - t.serial : N : S
    - Q : E : histogram
    - t.serial : E/Q : histogram

Plots: 2d: t.serial : S
  [Four panels: speedup S = t.serial / t.ms (master-slave) and S = t.serial / t.pp (peer-to-peer) versus t.serial; top row: serial and 2, 4, 8, 16 threads; bottom row: serial and 8 threads only]

Plots: 2d: E : S
  [Four panels: speedup S = t.serial / t.ms and S = t.serial / t.pp versus nec (number of edges, 100000 to 1e+07); top row: serial and 2, 4, 8, 16 threads; bottom row: serial and 8 threads only]

Plots: 2d: E/Q : S
  [Four panels: speedup S = t.serial / t.ms and S = t.serial / t.pp versus E / Q (1 to 128); top row: serial and 2, 4, 8, 16 threads; bottom row: serial and 8 threads only]

Plots: 3d: E/Q : N : S
  [Two panels, master-slave and peer-to-peer: speedup S = t.serial / t.ms and S = t.serial / t.pp as a function of E / Q (1 to 128) and number of threads N (2 to 16)]

Plots: 3d: t.serial : N : S
  [Two panels, master-slave and peer-to-peer: speedup S = t.serial / t.ms and S = t.serial / t.pp as a function of t.serial and number of threads N (2 to 16)]

Plots: 3d: Q : E : histogram
  [Two panels, raw and smoothed: histogram of the samples over nqc (10000 to 1e+07) and nec (100000 to 1e+07)]