Parallel HMMs: Parallel Implementation of Hidden Markov Models for Wireless Applications

Authors:
• Shawn Hymel (Wireless@VT, Virginia Tech)
• Ihsan Akbar (Harris Corporation)
• Jeffrey Reed (Wireless@VT, Virginia Tech)
Agenda

• Overview of GPGPU
• Overview of HMMs
• Parallelization
• Results
• Applications
• Why Is This Useful?
General-Purpose Processing on GPUs

• CUDA-specific
• Important terms (illustrated in the sketch below):
  – Threads
  – Blocks
  – Grid
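As a point of reference, here is a minimal CUDA kernel showing how the three terms fit together: threads are grouped into blocks, blocks form the grid, and each thread derives a unique global index from its position. The kernel name and the scaling operation are illustrative, not taken from the original slides.

```cuda
// threadIdx.x - this thread's index within its block
// blockIdx.x  - this block's index within the grid
// blockDim.x  - number of threads per block
__global__ void scaleArray(float *data, float factor, int n)
{
    // Unique global index across the whole grid.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // The grid may contain more threads than array elements.
    if (idx < n) {
        data[idx] *= factor;
    }
}
```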
CUDA Code Flow
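The slide's flow diagram did not survive extraction, so the typical host-side sequence is sketched below under common-case assumptions: allocate device memory, copy inputs to the device, launch a kernel (the illustrative scaleArray from the previous sketch, repeated here so this compiles on its own), copy results back, and free. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleArray(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= factor;
}

int main(void)
{
    const int n = 1024;
    float h_data[n];
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    // 1. Allocate device memory.
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 2. Copy input from host to device.
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // 3. Launch the kernel across enough blocks to cover the array.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleArray<<<blocks, threads>>>(d_data, 2.0f, n);

    // 4. Copy results back to the host.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    // 5. Free device memory.
    cudaFree(d_data);

    printf("h_data[10] = %f\n", h_data[10]);  // expect 20.0
    return 0;
}
```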
Hidden Markov Model

λ = (A, B, π)

• Initialization probabilities (π): P(Rainy) = 0.6, P(Sunny) = 0.4
• State transition matrix (A), states = {Rainy, Sunny}:
            Rainy  Sunny
  Rainy      0.7    0.3
  Sunny      0.4    0.6
• Observation probability matrix (B), observations = {Walk, Shop, Clean}:
            Walk   Shop   Clean
  Rainy      0.1    0.4    0.5
  Sunny      0.6    0.3    0.1
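For concreteness, the same example model can be written as the row-major arrays a C/CUDA implementation would consume. The names N, M, pi, A, and B follow the notation on the slides, but this particular encoding is an assumption, not the authors' code.

```cuda
#define N 2  // number of states:  0 = Rainy, 1 = Sunny
#define M 3  // number of symbols: 0 = Walk, 1 = Shop, 2 = Clean

// Initial state distribution (pi).
static const float pi[N] = { 0.6f, 0.4f };

// State transition matrix A (row = from-state, column = to-state).
static const float A[N * N] = { 0.7f, 0.3f,
                                0.4f, 0.6f };

// Observation probability matrix B (row = state, column = symbol).
static const float B[N * M] = { 0.1f, 0.4f, 0.5f,
                                0.6f, 0.3f, 0.1f };
```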
HMM Canonical Problems

• Evaluation: P(O|λ)
  – Forward Algorithm
  – Backward Algorithm
• Find the most likely state sequence
  – Viterbi Algorithm
• Training (maximize P(O|λ))
  – Baum-Welch Algorithm
Forward Algorithm

Given a model and an observation sequence, calculate P(O|λ)
  – T = number of observations
  – N = number of states
  – M = number of possible symbols

Initialization:
  $\alpha_1(i) = \pi_i \, b_i(O_1), \quad 1 \le i \le N$

Induction:
  $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(O_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N$

Termination:
  $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
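For reference, a minimal serial sketch of these three steps in C, assuming row-major A (N×N) and B (N×M), an integer observation sequence O of length T, and a caller-provided T×N workspace for α. The scaling normally used to avoid floating-point underflow on long sequences is omitted.

```cuda
// Serial forward algorithm: returns P(O | lambda).
// A: N x N transitions, B: N x M emissions (both row-major),
// pi: initial distribution, O: observation indices, alpha: T x N workspace.
float forward(const float *A, const float *B, const float *pi,
              const int *O, float *alpha, int N, int M, int T)
{
    // Initialization: alpha_1(i) = pi_i * b_i(O_1)
    for (int i = 0; i < N; i++)
        alpha[i] = pi[i] * B[i * M + O[0]];

    // Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(O_{t+1})
    for (int t = 0; t < T - 1; t++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int i = 0; i < N; i++)
                sum += alpha[t * N + i] * A[i * N + j];
            alpha[(t + 1) * N + j] = sum * B[j * M + O[t + 1]];
        }
    }

    // Termination: P(O | lambda) = sum_i alpha_T(i)
    float p = 0.0f;
    for (int i = 0; i < N; i++)
        p += alpha[(T - 1) * N + i];
    return p;
}
```

Note that each α_{t+1} depends only on α_t, which is what makes the induction step a matrix-vector product amenable to parallelization, as the next slide shows.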
Example of Parallelization

The induction step
  $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(O_{t+1})$
splits into two data-parallel operations:

• For all j, matrix multiplication: $\alpha' = \alpha_t A$ (a 1×N vector times an N×N matrix)
• For all j, element-by-element multiplication: $\alpha_{t+1} = \alpha' \circ b(O_{t+1})$

We can perform this step in parallel! O(TN²) → O(T log N)
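One way to realize this on a GPU (a sketch, not necessarily the authors' kernel): launch one block per destination state j and let the block's threads compute the products α_t(i)·a_ij, then sum them with a shared-memory tree reduction, which is where the log N factor comes from. This assumes N fits within a single block and that the block size is a power of two.

```cuda
// One block per destination state j; threads cooperate over source states i.
// alpha_t, alpha_next: length-N vectors; A: N x N row-major;
// b_col[j] = B[j*M + O[t+1]], the emission column for the next symbol.
__global__ void forwardStep(const float *alpha_t, const float *A,
                            const float *b_col, float *alpha_next, int N)
{
    extern __shared__ float partial[];
    int j = blockIdx.x;   // destination state
    int i = threadIdx.x;  // source state handled by this thread

    // Element-by-element products alpha_t(i) * a_ij into shared memory.
    partial[i] = (i < N) ? alpha_t[i] * A[i * N + j] : 0.0f;
    __syncthreads();

    // Tree reduction: O(log N) steps instead of a serial O(N) sum.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (i < stride) partial[i] += partial[i + stride];
        __syncthreads();
    }

    // Thread 0 applies the emission probability and writes alpha_{t+1}(j).
    if (i == 0) alpha_next[j] = partial[0] * b_col[j];
}
```

The host then loops over t, launching forwardStep<<<N, blockSize, blockSize * sizeof(float)>>>(...) once per observation, where blockSize is the next power of two at or above N.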
Computational Complexity

Algorithm              Serial     Parallel
Forward Algorithm      O(TN²)     O(T log N)
Viterbi Algorithm      O(TN²)     O(T log N)
Baum-Welch Algorithm   O(TN²)     O(T log N) or O(TMN)
Test Procedures

• Time execution of each algorithm (C vs. CUDA)
  – Vary states
  – Vary symbols
  – Vary sequence length
• Calculate total energy consumption (C vs. CUDA)
  – PowerTOP software

Test Hardware

Component          Specification
CPU                Intel Core 2 Duo U7300 @ 1.30 GHz
GPU                NVIDIA GeForce GT 335M
GPU Core Speed     450 MHz
GPU Shader Speed   1080 MHz
GPU Memory Speed   1066 MHz
CUDA Cores         72
Speed Results

Number of States   CPU Runtime (s)   GPU Runtime (s)   Speed Increase

Forward Algorithm
4                  0.001             0.1531            0.007x
40                 0.04              0.1393            0.287x
400                4.2816            0.2379            17.99x
4000               534.2028          2.9495            181.12x

Viterbi Algorithm
4                  0.0033            0.1605            0.021x
40                 0.0436            0.1801            0.242x
400                4.2684            1.6595            2.57x
4000               534.5543          116.2531          4.60x

Baum-Welch Algorithm
4                  0.0021            0.4142            0.005x
40                 0.1946            0.4299            0.453x
400                17.6719           0.7502            23.56x
4000               1834.672          28.1271           65.23x
Energy Consumption

             Power (W)         States to
Algorithm    C        CUDA     Break Even
Forward      18.5     26.5     ~100
Viterbi      18.5     29.1     ~120
BWA          18.3     26.1     ~70

[Plot: Energy Consumption for Forward Algorithm: energy consumed (kWh) vs. number of states, 0-250, CPU vs. GPU]
Applications

• Pattern Recognition
  – Spectrum Sensing
  – Signal Classification
  – Specific Emitter Identification
  – Geolocation
• Modeling
  – Channel Fading
  – Call Drop Prediction
Why Is This Useful?

• Evolution of GPUs and multi-core processors
  – Smart phones, tablets, SDR
  – Co-processor
• Utilize existing hardware for HMM applications
  – Large number of states
  – 2D/3D HMMs
• Uses in other fields (speech recognition, computer vision)
• Extrapolation to other algorithms (pattern recognition)
Questions?

Contact Information
  Email: hymelsr@vt.edu
  Blog: http://sgmustadio.wordpress.com/
  Code: http://code.google.com/p/hmm-cuda/

Other Good Resources
  cuHMM: http://code.google.com/p/chmm/
  MATLAB: http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
  HTK: http://htk.eng.cam.ac.uk/
Supporting Slide: Reductions

MATLAB example:
  >> sum(A)

C implementation:
  sum = 0;
  for (i = 0; i < length; i++) {
      sum = sum + A[i];
  }

Parallelization: reducing an array to a single value (e.g., a sum) goes from O(N) to O(log N)
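A minimal CUDA counterpart, using the same tree-reduction pattern as the forward-step sketch earlier (the kernel name and two-pass structure are assumptions, not the authors' code): each block folds its slice of the array in half repeatedly, taking O(log N) steps where the serial loop takes O(N).

```cuda
// Each block reduces blockDim.x input elements to one partial sum.
// Assumes blockDim.x is a power of two; out-of-range threads load 0.
__global__ void reduceSum(const float *in, float *blockSums, int n)
{
    extern __shared__ float sdata[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads each step: O(log blockDim.x).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }

    // One partial sum per block; a second launch (or the host) sums these.
    if (threadIdx.x == 0) blockSums[blockIdx.x] = sdata[0];
}
```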
Supporting Slide: Timing Results (Forward)

[Plot: Execution Time for Forward Algorithm (vary states): execution time (s) vs. number of states, 0-5000, CPU vs. GPU]
[Plot: Execution Time for Forward Algorithm on GPU (vary states): execution time (s) vs. number of states, 1-10000, log scale]
[Plot: Execution Time for Forward Algorithm (vary symbols): execution time (s) vs. number of observations, 0-10000, CPU vs. GPU]
Supporting Slide: Timing Results (Viterbi)

[Plot: Execution Time for Viterbi Algorithm (vary states): execution time (s) vs. number of states, 0-5000, CPU vs. GPU]
[Plot: Execution Time for Viterbi on GPU (vary states): execution time (s) vs. number of states, 1-10000, log scale]
[Plot: Execution Time for Viterbi Algorithm (vary symbols): execution time (s) vs. number of symbols, 0-8000, CPU vs. GPU]
Supporting Slide: Timing Results (BWA)

[Plot: Execution Time for Baum-Welch Algorithm (vary states): execution time (s) vs. number of states, 0-5000, CPU vs. GPU]
[Plot: Execution Time for BWA on GPU (vary states): execution time (s) vs. number of states, 1-10000, log scale]
[Plot: Execution Time for Baum-Welch Algorithm (vary symbols): execution time (s) vs. number of symbols, 0-8000, CPU vs. GPU]