Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan
Virginia Polytechnic Institute and State University
Reverse-engineer the brain: one of the National Academy of Engineering's top 5 Grand Challenges.
[Figure: neuron anatomy, with dendrites (receiver), axon (wires), and axon terminal (transmitter); image cited from Sciseek.com. Neurons communicate via action potentials (spikes).]
Question: How are the neurons connected?
[Figure: a Multi-Electrode Array (MEA) chip with neurons grown on it; recordings from channels A, B, C over time form a spike train stream.]
Goal: find repeating patterns in the spike train stream and infer network connectivity.
Fast data mining of spike train streams on Graphics Processing Units (GPUs).
[Figure: the Multi-Electrode Array (MEA) chip alongside a GPU chip, an NVIDIA GTX 280 graphics card.]
Two key algorithmic strategies address the scalability problem on the GPU:
- A hybrid mining approach
- A two-pass elimination approach
Event stream data: a sequence of neuron firings (E1, t1), (E2, t2), ..., (En, tn).
[Figure: raster of neurons A, B, C, D firing over time; for example, an event of type D occurs at t = 5 and an event of type A occurs at t = 6.]
A pattern, or episode, is an ordered sequence of event types with inter-event constraints; we count its non-overlapped occurrences.
[Figure: the episode appears twice (non-overlapped) in the event stream.]
Data mining problem: find all episodes/patterns that occur more than X times in the event sequence.
Challenge: combinatorial explosion, i.e., a large number of episodes to count. The candidate space grows rapidly with episode size/length:
- Size 1: A, B, ...
- Size 2: A → B, B → A, A → C, B → C, A → D, ...
- Size 3: A → B → C, A → C → B, B → A → C, B → C → A, ...
- Size 4: A → B → C → D, A → C → B → D, A → C → D → B, A → D → B → C, A → D → C → B, ...
Mining algorithm (a level-wise procedure to control the combinatorial explosion):
1. Generate an initial list of candidate size-1 episodes.
2. Repeat until no candidate episodes remain:
   - Count: occurrences of the size-M candidate episodes (the computational bottleneck)
   - Prune: retain only the frequent episodes
   - Candidate generation: build size-(M+1) candidates from the size-M frequent episodes
3. Output all the frequent episodes.
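The level-wise loop above can be sketched in serial Python. This is a minimal illustration, not the paper's GPU implementation: `count_occurrences` here is a greedy counter that ignores inter-event constraints, and candidate generation is the naive cross-product rather than the join step used in the actual algorithm.

```python
def count_occurrences(events, episode):
    """Greedy non-overlapped count of a serial episode (an ordered tuple
    of event types, no timing constraints) in a time-sorted stream of
    (event_type, time) pairs."""
    count, pos = 0, 0
    for etype, _t in events:
        if etype == episode[pos]:
            pos += 1
            if pos == len(episode):   # one full occurrence found
                count, pos = count + 1, 0
    return count

def mine_frequent_episodes(events, event_types, support, max_size):
    """Level-wise mining: count size-M candidates, prune to the frequent
    ones, and extend those to size-(M+1) candidates."""
    candidates = [(t,) for t in event_types]   # initial size-1 candidates
    frequent = []
    while candidates:
        # Count: occurrences of each candidate (the bottleneck step).
        counts = {ep: count_occurrences(events, ep) for ep in candidates}
        # Prune: retain only the frequent episodes at this level.
        level = [ep for ep in candidates if counts[ep] >= support]
        frequent.extend(level)
        # Candidate generation: naively extend each frequent episode.
        candidates = [ep + (t,) for ep in level if len(ep) < max_size
                      for t in event_types]
    return frequent
```

On a toy stream `[('A',1), ('B',2), ('A',3), ('B',4)]` with support 2, the frequent episodes are `(A)`, `(B)`, and `(A → B)`.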
Counting algorithm (for one episode): the episode A → B → C → D is tracked with a state machine whose states Accept_A(), Accept_B(), Accept_C(), Accept_D() consume matching events from the stream.
[Figure: the automaton advancing over the event stream A@1, A@2, B@4, A@5, C@10, B@12, C@13, D@17.]
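The state-machine idea can be sketched as follows. This is a deliberately simplified single-automaton version with inter-event windows `(low, high]` between consecutive events; the actual algorithm tracks multiple automata per episode, so this greedy sketch can miss some occurrences, but it shows the accept-state logic.

```python
def count_with_constraints(events, episode, windows):
    """Count non-overlapped occurrences of a serial episode whose
    consecutive events must satisfy low < dt <= high, using a single
    greedy automaton over a time-sorted stream of (event_type, time)."""
    count, pos, last_time = 0, 0, None
    for etype, t in events:
        if etype != episode[pos]:
            continue                      # stay in the current state
        if pos > 0:
            low, high = windows[pos - 1]
            dt = t - last_time
            if not (low < dt <= high):
                continue                  # gap violates the constraint
        last_time, pos = t, pos + 1       # advance the automaton
        if pos == len(episode):           # reached the accepting state
            count += 1
            pos, last_time = 0, None      # non-overlapped: start over
    return count
```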
Goal: find an efficient counting algorithm on the GPU to count the occurrences of N size-M episodes in an event stream, addressing the scalability problem on the GPU's massively parallel execution architecture.
One episode per GPU thread (PTPE): each thread counts one episode, a simple extension of the serial counting algorithm.
[Figure: N episodes mapped to N GPU threads across the multiprocessors; the event stream resides in global memory.]
Efficient when the number of episodes is larger than the number of GPU cores.
When there are not enough episodes, some GPU cores will be idle. Solution: increase the level of parallelism with Multiple Threads Per Episode (MTPE).
[Figure: the event stream is split into M segments; N episodes and M segments are mapped to N × M GPU threads.]
Problem with a simple count merge: an occurrence that straddles a segment boundary is lost when the per-segment counts are simply summed, so the merge step must handle boundary-crossing occurrences.
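The boundary problem can be demonstrated with a small sketch. This is an illustration of why the naive merge is wrong, not the paper's corrected merge procedure; the segment split and greedy counter are simplified stand-ins.

```python
def greedy_count(events, episode):
    """Non-overlapped count of a serial episode, no timing constraints."""
    count, pos = 0, 0
    for etype, _t in events:
        if etype == episode[pos]:
            pos += 1
            if pos == len(episode):
                count, pos = count + 1, 0
    return count

def mtpe_naive_count(events, episode, n_segments):
    """MTPE sketch: split the stream into segments (one thread per
    segment on the GPU) and naively sum the per-segment counts.
    This merge is wrong: occurrences straddling a boundary are lost."""
    seg_len = -(-len(events) // n_segments)   # ceiling division
    segments = [events[i:i + seg_len] for i in range(0, len(events), seg_len)]
    return sum(greedy_count(seg, episode) for seg in segments)
```

For the stream `A A | B B` split into two segments, the whole-stream count of A → B is 1, but the naive per-segment sum is 0: the only occurrence crosses the boundary.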
Choose the right algorithm with respect to the number of episodes N by defining a switching threshold, the crossover point (CP): if N < CP, use MTPE; otherwise, use PTPE.

CP = MP × B_MP × T_B × f(size)

Here MP × B_MP × T_B is the GPU's computing capacity (MP: number of multiprocessors; B_MP: blocks per multiprocessor; T_B: threads per block), and f(size) is a penalty factor that depends on episode size.
Problem: the original counting algorithm is too complex for a GPU kernel function:
- Large shared memory usage
- Large register file usage
- Large number of branching instructions
[Figure: the full state machine (Accept_A() through Accept_D()) mapped onto the GPU's multiprocessors.]
Solution: the PreElim algorithm, a less-constrained counting with a simple kernel function. Only the upper bound of each inter-event constraint is enforced, e.g. A --(−∞,5]--> B --(−∞,10]--> C --(−∞,5]--> D, so PreElim's count is an upper bound on the true count.
[Figure: the relaxed state machine Accept_A() through Accept_D() advancing over the event stream.]
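A serial sketch of the upper-bound-only counting, assuming the state representation is a single "latest time this prefix was completed" per episode position; the real kernel's data layout differs, but the relaxation is the same: any episode PreElim finds infrequent can be eliminated before the exact pass.

```python
def preelim_count(events, episode, upper_bounds):
    """PreElim-style sketch: enforce only the upper bound of each
    inter-event constraint (dt <= high). The result is an upper bound
    on the fully constrained non-overlapped count."""
    n = len(episode)
    # last_reach[i]: most recent time a length-i prefix was completed.
    last_reach = [None] * (n + 1)
    count = 0
    for etype, t in events:
        for i in range(n, 0, -1):         # deepest state first
            if episode[i - 1] != etype:
                continue
            if i == 1:
                last_reach[1] = t         # restarting is always allowed
            elif (last_reach[i - 1] is not None
                  and t - last_reach[i - 1] <= upper_bounds[i - 2]):
                last_reach[i] = t
        if last_reach[n] is not None:     # accepting state reached
            count += 1
            last_reach = [None] * (n + 1) # non-overlapped: reset
    return count
```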
A simpler kernel function:

                   Shared memory         Registers   Local memory
  PreElim          4 × episode size      13          0
  Normal counting  44 × episode size     17          80
Solution: the two-pass elimination approach.
- Pass 1: less-constrained counting (PreElim) over all candidate episodes eliminates most of them.
- Pass 2: normal counting runs over the far fewer surviving episodes.
[Figure: in pass 1, all episodes and their threads scan the event stream; in pass 2, only the surviving episodes do.]
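The two-pass pipeline reduces to a short filter chain. This sketch takes the two counting procedures as callables; the correctness argument is that the cheap count is an upper bound on the exact count, so pass 1 never eliminates a truly frequent episode.

```python
def two_pass_mine(candidates, support, cheap_count, full_count):
    """Two-pass elimination sketch: pass 1 applies a cheap upper-bound
    count (PreElim) to discard episodes that cannot be frequent; pass 2
    runs the expensive exact count only on the survivors."""
    survivors = [ep for ep in candidates if cheap_count(ep) >= support]
    return [ep for ep in survivors if full_count(ep) >= support]
```

For example, with upper-bound counts {X: 10, Y: 3, Z: 7} and exact counts {X: 9, Y: 2, Z: 4} at support 5, pass 1 eliminates Y, and pass 2 eliminates Z, leaving only X.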
A simpler kernel function.

Compile-time difference:

                   Shared memory         Registers   Local memory
  PreElim          4 × episode size      13          0
  Normal counting  44 × episode size     17          80

Run-time difference:

             Local memory loads/stores   Divergent branching
  Two-pass   24,770,310                  12,258,590
  Hybrid     210,773,785                 14,161,399
Hardware:
- Computer (custom-built): Intel Core 2 Quad @ 2.33 GHz, 4 GB memory
- Graphics card (NVIDIA GTX 280 GPU): 240 cores (30 MPs × 8 cores) @ 1.3 GHz, 1 GB global memory, 16 KB shared memory per MP
Datasets:
- Synthetic (Sym26): 60 seconds with 50,000 events
- Real (culture grown for 5 weeks):
  - Day 33: 2-1-33 (333,478 events)
  - Day 34: 2-1-34 (406,795 events)
  - Day 35: 2-1-35 (526,380 events)
PTPE vs. MTPE: [Figure: running-time comparison with the crossover points marked.]
Performance of the hybrid approach: [Figure: running time (ms) vs. episode size (1 through 7) for PTPE, MTPE, and the hybrid approach, with the crossover points marked. Sym26 dataset, support = 100.]
Crossover point estimation: the penalty factor f(size) = a / size + b is the better fit; a least-squares fit is performed to determine a and b.
Two-pass approach vs. hybrid approach: the first pass eliminates up to 99.9% of the episodes before the expensive second pass.
Performance of the two-pass approach (2-1-35 dataset, support = 3150):

  Episode size         1      2       3        4         5
  One-pass time (ms)   93.2   1839.8  16139.7  132752.6  7036.6
  Two-pass time (ms)   160.4  1716.6  12602.6  41581.7   1844.6
  Total episodes       64     6210    33623    173408    6288
  First-pass cull      18     2677    21442    169360    6288
Percentage of episodes eliminated by each pass: [Figure: for supports 3000 through 4000 on the 2-1-35 dataset (episode size = 4), the first pass eliminates roughly 91% to 100% of the episodes; the second pass eliminates the remainder.]
GPU vs. CPU: the GPU is always faster than the CPU, with a 5x to 15x speedup. The comparison is fair: the two-pass algorithm and maximum threading are used on both.
Massive parallelism is required to conquer the near-exponential search space, and GPUs are far more accessible than high-performance clusters. Frequent episode mining is not data-parallel, so the algorithm was redesigned, yielding a framework for real-time, interactive analysis of spike train experimental data.
Summary: a fast temporal data mining framework on GPUs, a commoditized system with a massively parallel execution architecture. Two programming strategies:
- The hybrid approach increases the level of parallelism (data segmentation + map-reduce).
- The two-pass elimination approach decreases algorithm complexity (task decomposition).
Questions?
(Backup) CPU implementation: parallel execution via pthreads, optimized for CPU execution (minimized disk access, cache performance). It implements the two-pass approach: PreElim, a simpler and quicker state machine, followed by the full state machine, which is slower but required to eliminate all unsupported episodes.
[Figure: candidate episodes (ACE, ACDE, AEF, EFG, ...) distributed across worker threads.]
(Backup) Level-wise candidate generation: size-N frequent episodes are combined to form size-(N+1) candidates.
[Figure: two size-3 frequent episodes joining to form the candidate A → B → C → D.]
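One common join scheme for serial episodes can be sketched as follows (an illustration of the level-wise step, not necessarily the exact join used in the paper): two size-N episodes combine when the suffix of one matches the prefix of the other.

```python
def generate_candidates(frequent):
    """Join step sketch: two size-N frequent episodes form a size-(N+1)
    candidate when the (N-1)-suffix of the first equals the (N-1)-prefix
    of the second, e.g. A->B->C joined with B->C->D gives A->B->C->D."""
    return [e1 + e2[-1:]
            for e1 in frequent for e2 in frequent
            if e1[1:] == e2[:-1]]
```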