
Scalable String Matching on the Cell BE Processor - PowerPoint PPT Presentation



  1. Scalable String Matching on the Cell BE Processor
     Daniele Scarpazza, Oreste Villa, Fabrizio Petrini
     Applied Computer Science Group, Pacific Northwest National Laboratory
     fabrizio.petrini@pnl.gov
     Georgia Tech, Sony/Toshiba/IBM Workshop on Software and Applications for the Cell BE Processor, Atlanta, GA, June 19 2007

  2. Outline
     The problem
     - Network Intrusion Detection Systems (NIDS) are becoming an essential part of data centers
     - At the heart of a NIDS there is a string matching algorithm
     The Aho-Corasick algorithm
     - A Deterministic Finite Automaton (DFA)
     Multicore processors
     - An interesting opportunity to accelerate keyword scanning
     - Most existing work has been done on FPGAs and specialized processors
     Goals and challenges
     - Scalability of the dictionary and the network speed
     DFAs with very high speed
     - Two SPEs can handle a 10 Gbit/sec rate with a transition table of less than 200 KB

  3. The advent of teraflop-scale, many-core processors
     [Figure: number of threads (1, 10, 100) versus year (2003 to 2013). Labels: SMT; Small Number of Traditional Cores; Arrays of Throughput Cores; Medieval Times; Renaissance Period; Industrial Age. Courtesy of Doug Carmean, Intel]

  4. Set Pattern Matching Problem
     Find the patterns P = {P_1, P_2, ..., P_q} in a text T
     Aho and Corasick proposed an interesting algorithm for multi-pattern string matching
     - Uses a state machine
     Important problem in a number of fields
     - Text processing, biology, network security, etc.

  5. Aho-Corasick - Example
     P = {her, iris, he, is}
     T = “the iris for her”
     [Figure: keyword tree with nodes ϖ (root), h, i, he, ir, is, her, iri, iris]

  6. Aho-Corasick - Example
     P = {her, iris, he, is}
     T = “the iris for her”
     [Figure: keyword tree with nodes ϖ (root), h, i, he, ir, is, her, iri, iris; the annotation text on this slide is unreadable in the source]

  7. Aho-Corasick - Example
     P = {her, iris, he, is}
     T = “the iris for her”
     [Figure: keyword tree with nodes ϖ (root), h, i, he, ir, is, her, iri, iris; the annotation text on this slide is unreadable in the source]
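
The example on these slides can be reproduced with a short sketch. This is a textbook-style Aho-Corasick matcher (keyword tree plus failure links), not the Cell implementation from the talk; the names `build_automaton` and `search` are hypothetical and chosen only for illustration.

```python
from collections import deque

def build_automaton(patterns):
    """Build the keyword tree, the failure links, and the per-state outputs."""
    goto = [{}]      # goto[state][char] -> child state in the keyword tree
    fail = [0]       # failure link of each state (root is state 0)
    out = [set()]    # patterns recognized on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its label that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def search(text, patterns):
    """Return (start_index, pattern) for every occurrence, in scan order."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in sorted(out[state]):
            hits.append((i - len(pat) + 1, pat))
    return hits

hits = search("the iris for her", ["her", "iris", "he", "is"])
print(hits)
```

For the slide's dictionary and text, the scan reports "he" inside "the", "iris" and "is" inside "iris", and "he" and "her" inside "her".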

  8. First Step: Keyword Tree

  9. Second Step: Failed Transitions (Non-deterministic Finite Automaton, NFA)

  10. Extend Failed Transitions for Each Character

  11. Build an Optimized Deterministic Finite Automaton (DFA)
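
The construction steps named on slides 8 to 11 (keyword tree, failed transitions, extension to every character, DFA) can be sketched as follows. This is a generic textbook construction, assuming a small fixed alphabet that contains every pattern and text character; it does not include whatever optimizations the slide title refers to, and `build_dfa` and `scan` are hypothetical names.

```python
from collections import deque

def build_dfa(patterns, alphabet):
    """Return (delta, final): a full transition table and per-state match flags."""
    index = {ch: k for k, ch in enumerate(alphabet)}
    # Step 1: keyword tree.
    trie, final = [{}], [False]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in trie[state]:
                trie.append({})
                final.append(False)
                trie[state][ch] = len(trie) - 1
            state = trie[state][ch]
        final[state] = True
    # Steps 2 to 4: resolve failed transitions for every character, so the
    # scanner never has to follow a failure link at run time.
    delta = [[0] * len(alphabet) for _ in trie]
    fail = [0] * len(trie)
    queue = deque()
    for ch, nxt in trie[0].items():
        delta[0][index[ch]] = nxt
        queue.append(nxt)
    while queue:
        state = queue.popleft()
        # A state is final if any suffix of its label is a pattern.
        final[state] = final[state] or final[fail[state]]
        for ch in alphabet:
            k = index[ch]
            if ch in trie[state]:
                nxt = trie[state][ch]
                fail[nxt] = delta[fail[state]][k]
                delta[state][k] = nxt
                queue.append(nxt)
            else:
                delta[state][k] = delta[fail[state]][k]
    return delta, final

def scan(text, delta, final, alphabet):
    """One table lookup per input character; return end positions of matches."""
    index = {ch: k for k, ch in enumerate(alphabet)}
    state, ends = 0, []
    for i, ch in enumerate(text):
        state = delta[state][index[ch]]
        if final[state]:
            ends.append(i)
    return ends

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumed alphabet for the example
delta, final = build_dfa(["her", "iris", "he", "is"], ALPHABET)
ends = scan("the iris for her", delta, final, ALPHABET)
print(len(delta), ends)
```

The trade-off is the one the next slide raises: the DFA costs one lookup per character regardless of the input, but every state stores a full row of transitions, so the table grows with both the dictionary and the alphabet.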

  12. Design Challenges: Speed vs Size of the Dictionary

  13. Mapping the Aho-Corasick Algorithm on the Cell Processor: Data Streaming and SIMD Parallelism
      [Figure: Cell BE block diagram with PPE, SPE0 to SPE7, Data Arbiter, MIC, BIF, IOIF0, IOIF1]

  14. Aho-Corasick: A Multi-level Parallelization
      General approach
      - Multithreaded parallelism within a Synergistic Processing Unit (SPU), using multiple segments/connections of the input stream
      - SIMD parallelism, pipeline parallelism (even/odd pipelines of the SPU)
      - An arsenal of techniques: loop unrolling, removing speculation, restricted pointers, etc.
      Using multiple SPUs to increase processing bandwidth/dictionary size
      Dynamic loading of dictionaries

  15. Aggregate Main Memory Bandwidth: Memory Access Traffic Explicitly Orchestrated at User Level

  16. SIMD and Pipeline Parallelism
      [Figure: data flow for one step of 16 DFAs that share one State Transition Table. Labels, in order: 16 interleaved input streams; 16 input characters; SIMD shl << 3; 16 input symbols; SIMD shr >> 1; 16 offsets to the transition table cells; current state pointers for the 16 DFAs; 16 SISD adds; addresses to the cells containing the next state pointers; 16 loads; split with masks 0xFFFFFFFE and 0x00000001 (16 SISD ands each); next state pointers for the 16 DFAs; final state flags for the 16 DFAs]
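
The two masks on this slide suggest that each transition table cell packs a next state pointer together with a final state flag in its lowest bit. The sketch below illustrates that reading under stated assumptions: cells are 32 bits wide, a state pointer is the byte offset of the state's row, and a plain Python loop stands in for the 16 parallel lanes. The names `pack_table` and `step_all` and the toy DFA are hypothetical.

```python
CELL_BYTES = 4           # assumed cell width (32-bit cells)
PTR_MASK = 0xFFFFFFFE    # mask from the slide: keeps the next state pointer
FLAG_MASK = 0x00000001   # mask from the slide: keeps the final state flag

def pack_table(delta, final):
    """Flatten delta into cells: (byte offset of next row) | final flag."""
    row_bytes = len(delta[0]) * CELL_BYTES  # always even, so bit 0 is free
    cells = []
    for row in delta:
        for nxt in row:
            cells.append((nxt * row_bytes) | int(final[nxt]))
    return cells

def step_all(state_ptrs, symbols, cells):
    """Advance every DFA by one symbol: add, load, then split with two ands."""
    loaded = [cells[(ptr + sym * CELL_BYTES) // CELL_BYTES]
              for ptr, sym in zip(state_ptrs, symbols)]
    next_ptrs = [cell & PTR_MASK for cell in loaded]
    flags = [cell & FLAG_MASK for cell in loaded]
    return next_ptrs, flags

# Hypothetical toy DFA over symbols {0: 'a', 1: 'b'} that recognizes "ab".
delta = [[1, 0], [1, 2], [1, 0]]
final = [False, False, True]
cells = pack_table(delta, final)

# 16 interleaved streams: even lanes read "ab", odd lanes read "ba".
streams = [[0, 1] if lane % 2 == 0 else [1, 0] for lane in range(16)]
ptrs = [0] * 16
for t in range(2):
    ptrs, flags = step_all(ptrs, [s[t] for s in streams], cells)
print(flags)
```

Packing the flag into the pointer means a single load per DFA per character yields both the next state and the match indication, which is why the split needs only two ands.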

  17. Local Storage Usage (total size of the local store: 256 k)
      Case 1: DFA state transition table 190 k (1520 states, 32 input symbols); input buffer 0, 16 k; input buffer 1, 16 k; code and stack, 34 k
      Case 2: DFA state transition table 206 k (1648 states, 32 input symbols); input buffer 0, 8 k; input buffer 1, 8 k; code and stack, 34 k
      Case 3: DFA state transition table 214 k (1712 states, 32 input symbols); input buffer 0, 4 k; input buffer 1, 4 k; code and stack, 34 k
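
The three layouts can be checked with a few lines of arithmetic. The 4-byte cell size is an assumption, not something the slide states; it is the value that makes the state counts and table sizes agree (for example, 1520 states x 32 symbols x 4 bytes = 190 k).

```python
LOCAL_STORE_KB = 256     # from the slide
CODE_AND_STACK_KB = 34   # from the slide
SYMBOLS = 32             # from the slide
CELL_BYTES = 4           # assumption: one 32-bit cell per (state, symbol)

# case number -> (DFA states, size of each of the two input buffers in k)
CASES = {1: (1520, 16), 2: (1648, 8), 3: (1712, 4)}

def stt_kb(states):
    """Size of the state transition table in k (1 k = 1024 bytes)."""
    return states * SYMBOLS * CELL_BYTES / 1024

totals = {}
for case, (states, buffer_kb) in CASES.items():
    totals[case] = stt_kb(states) + 2 * buffer_kb + CODE_AND_STACK_KB
    print(case, stt_kb(states), totals[case])
```

Under that assumption each case fills the 256 k local store exactly, so shrinking the input buffers is the only way to make room for more states.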

  18. Overlapping Computation with Communication
      Timeline (computation overlapped with data transfer):
      - Load buffer 0 (5.94 us)
      - Process buffer 0 (25.64 us) while loading buffer 1 (5.94 us)
      - Process buffer 1 (25.64 us) while loading buffer 0 (5.94 us)
      - Process buffer 0 (25.64 us) while loading buffer 1 (5.94 us)
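
The timings on this slide give a rough per-SPE throughput estimate. The sketch assumes 16 KB input buffers (Case 1 of the local storage slide); the slide itself does not say which buffer size was measured, so treat the resulting figures as illustrative.

```python
LOAD_US = 5.94             # from the slide: time to load one input buffer
PROCESS_US = 25.64         # from the slide: time to process one input buffer
BUFFER_BYTES = 16 * 1024   # assumption: 16 KB buffers (Case 1)

def gbit_per_s(n_bytes, period_us):
    """Throughput when n_bytes are consumed every period_us microseconds."""
    return n_bytes * 8 / (period_us * 1e-6) / 1e9

# With double buffering the load hides behind the computation, so the
# steady-state period is the longer of the two phases.
overlapped = gbit_per_s(BUFFER_BYTES, max(LOAD_US, PROCESS_US))
# Without overlap the two phases would run back to back.
serialized = gbit_per_s(BUFFER_BYTES, LOAD_US + PROCESS_US)

print(round(overlapped, 2), round(serialized, 2))
print(round(2 * overlapped, 1), round(8 * overlapped, 1))
```

Under these assumptions one SPE sustains a little over 5 Gbit/sec, so two SPEs exceed 10 Gbit/sec and eight approach 41 Gbit/sec, in line with the figures quoted in the outline and the conclusion.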

  19. Schedule of a Dynamic State Transition Table (STT) Replacement
      Timeline (computation overlapped with data transfer):
      - Load input to buffer 0 (5.94 us)
      - Process buffer 0, match against STT 0 (25.64 us), while loading input to buffer 1 (5.94 us) and loading the next STT into STT 1, chunk 1/2 (48 kbyte, 17.83 us)
      - Process buffer 1, match against STT 0 (25.64 us), while loading input to buffer 0 (5.94 us) and loading the next STT into STT 1, chunk 2/2 (47 kbyte, 17.46 us)
      - Process buffer 0, match against STT 1 (25.64 us), while loading input to buffer 1 (5.94 us) and loading the next STT into STT 0, chunk 1/2 (48 kbyte, 17.83 us)
      - Process buffer 1, match against STT 1 (25.64 us), while loading input to buffer 0 (5.94 us) and loading the next STT into STT 0, chunk 2/2 (47 kbyte, 17.46 us)
      - Process buffer 0, match against STT 0 (25.64 us), while loading input to buffer 1 (5.94 us) and loading the next STT into STT 1, chunk 1/2 (48 kbyte, 17.83 us)
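
A quick check shows why this schedule does not slow the scanner down. The sketch assumes the worst case in which the input load and the STT chunk load are issued back to back rather than concurrently; the timings and chunk sizes are the ones on the slide.

```python
PROCESS_US = 25.64      # from the slide: processing time per input buffer
INPUT_LOAD_US = 5.94    # from the slide: load time per input buffer
STT_CHUNKS = [(48, 17.83), (47, 17.46)]   # from the slide: (kbyte, us)

def transfer_slack_us(chunk_us):
    """Idle transfer time left in one processing period (assumed serialized)."""
    return PROCESS_US - (INPUT_LOAD_US + chunk_us)

slacks = [transfer_slack_us(us) for _, us in STT_CHUNKS]
stt_kbyte = sum(kb for kb, _ in STT_CHUNKS)
replacement_period_us = PROCESS_US * len(STT_CHUNKS)

print([round(s, 2) for s in slacks], stt_kbyte, round(replacement_period_us, 2))
```

Both transfers fit inside one processing period with about 2 us to spare, so a 95 kbyte STT can be swapped in every two buffer periods while matching continues against the other table.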

  20. Throughput Provided by the STT Replacement with a Variable Number of Tiles (1 to 8)

  21. Conclusion
      Multi-core processors are competitive with FPGAs and specialized network processors
      Multiple data streaming options to perform string matching
      Performance from 40 Gbits/sec to 5 Gbits/sec
      - With small dictionaries
      Future work includes
      - Addressing larger dictionaries
      - Compression of the STT
      Paper available at http://hpc.pnl.gov/people/fabrizio/papers/smtps07.pdf
