

  1. MIDeA: A Multi-Parallel Intrusion Detection Architecture
     Giorgos Vasiliadis, FORTH-ICS, Greece
     Michalis Polychronakis, Columbia U., USA
     Sotiris Ioannidis, FORTH-ICS, Greece
     CCS 2011, 19 October 2011

  2. Network Intrusion Detection Systems
     • Typically deployed at ingress/egress points
       – Inspect all network traffic
       – Look for suspicious activities
       – Alert on malicious actions
     [Diagram: the NIDS sits on the 10 GbE link between the Internet and the internal network]

  3. Challenges
     • Traffic rates are increasing
       – 10 Gbit/s Ethernet speeds are common in metro/enterprise networks
       – Up to 40 Gbit/s at the core
     • Ever more complex analysis must be performed at these higher speeds
       – Deep packet inspection
       – Stateful analysis
       – 1000s of attack signatures

  4. Designing a NIDS
     • Fast
       – Needs to handle many Gbit/s
       – Scalable
         • Single-core performance no longer improves at Moore's-law rates
     • Commodity hardware
       – Cheap
       – Easily programmable

  5. Today: fast or commodity
     • Fast “hardware” NIDS
       – FPGA/TCAM/ASIC based
       – Throughput: high
     • Commodity “software” NIDS
       – Processing by general-purpose processors
       – Throughput: low

  6. MIDeA
     • A NIDS built out of commodity components
       – Single-box implementation
       – Easy programmability
       – Low price
     Can we build a 10 Gbit/s NIDS with commodity hardware?

  7. Outline
     • Architecture
     • Implementation
     • Performance Evaluation
     • Conclusions

  8. Single-threaded performance
     [Diagram: single pipeline, NIC → Preprocess → Pattern matching → Output]
     • Vanilla Snort: 0.2 Gbit/s

  9. Problem #1: Scalability
     • Single-threaded NIDS have limited performance
       – They do not scale with the number of CPU cores

  10. Multi-threaded performance
     [Diagram: the NIC feeds multiple parallel Preprocess → Pattern matching → Output pipelines]
     • Vanilla Snort: 0.2 Gbit/s
     • With multiple CPU-cores: 0.9 Gbit/s

  11. Problem #2: How to split traffic across cores?
     • Splitting traffic in software incurs synchronization overheads and cache misses
     • Instead, use Receive-Side Scaling (RSS) on the NIC

  12. Multi-queue performance
     [Diagram: the NIC uses RSS to feed multiple parallel Preprocess → Pattern matching → Output pipelines]
     • Vanilla Snort: 0.2 Gbit/s
     • With multiple CPU-cores: 0.9 Gbit/s
     • With multiple Rx-queues: 1.1 Gbit/s

  13. Problem #3: Pattern matching is the bottleneck
     • Pattern matching accounts for more than 75% of the processing time
       [Diagram: NIC → Preprocess → Pattern matching → Output]
     • Solution: offload pattern matching to the GPU
       [Diagram: NIC → Preprocess → Pattern matching (on GPU) → Output]

  14. Why GPU?
     • General-purpose computing
       – Flexible and programmable
     • Powerful and ubiquitous
       – Constant innovation
     • Data-parallel model
       – More transistors for data processing rather than data caching and flow control

  15. Offloading pattern matching to the GPU
     [Diagram: the NIC uses RSS to feed multiple parallel Preprocess → Pattern matching → Output pipelines, with pattern matching running on the GPU]
     • Vanilla Snort: 0.2 Gbit/s
     • With multiple CPU-cores: 0.9 Gbit/s
     • With multiple Rx-queues: 1.1 Gbit/s
     • With GPU: 5.2 Gbit/s

  16. Outline
     • Architecture
     • Implementation
     • Performance Evaluation
     • Conclusions

  17. Multiple data transfers
     [Diagram: NIC → (PCIe) → CPU → (PCIe) → GPU]
     • Several data transfers between different devices
     Are the data transfers worth the computational gains offered?

  18. Capturing packets from NIC
     [Diagram: the network interface hashes each packet to an Rx-queue; each Rx-queue is exposed to user space through a memory-mapped ring buffer]
     • Packets are hashed in the NIC and distributed to different Rx-queues
     • Memory-mapped ring buffers for each Rx-queue
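
     The slides do not name a capture library; as a rough illustration only, a per-queue
     capture loop using PF_RING (one common choice for memory-mapped ring capture) could
     look like the sketch below. The "eth0@<queue>" device naming, the snaplen, and the
     helper names are assumptions, not details taken from the presentation.

       /* Illustrative only: one capture thread per Rx-queue, each polling its
        * own memory-mapped ring. PF_RING is used as a stand-in capture layer. */
       #include <pfring.h>
       #include <stdio.h>

       static pfring *open_rx_queue(const char *ifname, int queue_id) {
           char dev[64];
           snprintf(dev, sizeof(dev), "%s@%d", ifname, queue_id);  /* e.g. "eth0@2" */
           pfring *ring = pfring_open(dev, 1518 /* snaplen */, PF_RING_PROMISC);
           if (ring == NULL) return NULL;
           pfring_enable_ring(ring);
           return ring;
       }

       static void capture_loop(pfring *ring) {
           struct pfring_pkthdr hdr;
           u_char *pkt;
           /* buffer_len = 0 requests a zero-copy pointer into the ring */
           while (pfring_recv(ring, &pkt, 0, &hdr, 1 /* block */) > 0) {
               /* hand (pkt, hdr.caplen) to this core's preprocessing stage */
           }
       }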

  19. CPU Processing
     • Packet capturing is performed by different CPU-cores in parallel
       – Process affinity (sketched below)
     • Each core normalizes and reassembles captured packets into streams
       – Remove ambiguities
       – Detect attacks that span multiple packets
     • Packets of the same connection always end up on the same core
       – No synchronization
       – Cache locality
     • Reassembled packet streams are then transferred to the GPU for pattern matching
       – How to access the GPU?
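
     A minimal sketch of the process-affinity point: pinning each capture/preprocessing
     thread to a fixed core keeps a flow's packets on one core and preserves cache
     locality. This assumes Linux and pthreads; the helper name is illustrative.

       #define _GNU_SOURCE
       #include <pthread.h>
       #include <sched.h>

       /* Pin the calling worker thread to a single CPU core; each worker would
        * call this with the id of the core that owns its Rx-queue. */
       static int pin_to_core(int core_id) {
           cpu_set_t set;
           CPU_ZERO(&set);
           CPU_SET(core_id, &set);
           return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
       }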

  20. Accessing the GPU
     • Solution #1: Master/Slave model
       [Diagram: four CPU threads, with a single master thread driving the GPU over PCIe (64 Gbit/s)]
     • Execution flow example: only one batch (P1) is in flight at a time; transfer to the GPU, GPU execution, and transfer back are serialized
       – Achieved throughput: 14.6 Gbit/s

  21. Accessing the GPU
     • Solution #2: Shared execution by multiple threads
       [Diagram: all four CPU threads access the GPU directly over PCIe (64 Gbit/s)]
     • Execution flow example: batches from different threads (P1, P2, P3, ...) are pipelined, so transfers to the GPU, GPU execution, and transfers back overlap
       – Achieved throughput: 48.1 Gbit/s
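
     One way to realize this sharing with the CUDA runtime is to let every worker thread
     submit its batch on its own stream, so the driver can interleave copies and kernel
     launches from different threads instead of serializing them behind a master thread.
     The slides do not specify the mechanism; the kernel and buffer names below are
     placeholders, and this is only a sketch under that assumption.

       #include <cuda_runtime.h>

       /* Sketch: each CPU worker owns one cudaStream_t and submits its batch
        * asynchronously; batches from different workers can overlap on the GPU.
        * h_batch must be pinned host memory for the copy to be truly asynchronous. */
       void worker_submit(const unsigned char *h_batch, unsigned char *d_batch,
                          size_t batch_bytes, cudaStream_t my_stream) {
           cudaMemcpyAsync(d_batch, h_batch, batch_bytes,
                           cudaMemcpyHostToDevice, my_stream);
           /* match_kernel<<<blocks, threads, 0, my_stream>>>(d_batch, ...);  */
           /* cudaMemcpyAsync(h_matches, d_matches, ..., my_stream);          */
           cudaStreamSynchronize(my_stream);  /* wait only for this worker's batch */
       }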

  22. Transferring to GPU
     [Diagram: each CPU-core pushes data to the GPU, which scans it]
     • Small transfers result in degraded PCIe throughput
     • Therefore, each core batches many reassembled packets into a single buffer
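
     A minimal sketch of the batching idea: each core appends reassembled packets into one
     large pinned buffer and ships it to the GPU in a single copy, instead of issuing one
     small PCIe transfer per packet. The buffer size, layout, and names are assumptions,
     not taken from the presentation.

       #include <cuda_runtime.h>
       #include <string.h>

       #define BATCH_BYTES (16 * 1024 * 1024)   /* illustrative batch size */

       typedef struct {
           unsigned char *host_buf;   /* pinned host memory (fast PCIe transfers) */
           unsigned char *dev_buf;    /* device-side copy of the batch */
           size_t         used;
       } batch_t;

       void batch_init(batch_t *b) {
           cudaHostAlloc((void **)&b->host_buf, BATCH_BYTES, cudaHostAllocDefault);
           cudaMalloc((void **)&b->dev_buf, BATCH_BYTES);
           b->used = 0;
       }

       /* Append one reassembled packet; flush with a single large copy when full. */
       void batch_append(batch_t *b, const unsigned char *data, size_t len) {
           if (b->used + len > BATCH_BYTES) {
               cudaMemcpy(b->dev_buf, b->host_buf, b->used, cudaMemcpyHostToDevice);
               /* ... launch the matching kernel on dev_buf here ... */
               b->used = 0;
           }
           memcpy(b->host_buf + b->used, data, len);
           b->used += len;
       }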

  23. Pattern Matching on GPU
     [Diagram: a packet buffer is spread across many GPU cores, which emit the matches]
     • Reassembled packet streams are distributed uniformly, one GPU core per stream
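
     The per-stream scanning can be sketched as a CUDA kernel in which each GPU thread
     walks a DFA (e.g. an Aho-Corasick state table) over one reassembled stream. The flat
     table layout and the offsets/lengths arrays are illustrative assumptions, not the
     paper's exact data structures.

       /* Each thread scans one reassembled stream against a byte-at-a-time DFA.
        * dfa[state * 256 + byte] gives the next state; accepting[state] holds a
        * nonzero pattern id when that state is accepting (simplified bookkeeping). */
       __global__ void match_kernel(const unsigned char *payloads,
                                    const int *offsets, const int *lengths,
                                    const int *dfa, const int *accepting,
                                    int n_streams, int *matches)
       {
           int i = blockIdx.x * blockDim.x + threadIdx.x;
           if (i >= n_streams) return;

           const unsigned char *p = payloads + offsets[i];
           int state = 0;
           for (int j = 0; j < lengths[i]; j++) {
               state = dfa[state * 256 + p[j]];
               if (accepting[state])
                   matches[i] = accepting[state];   /* keeps only the last match */
           }
       }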

  24. Pipelining CPU and GPU
     [Diagram: each CPU core alternates between two packet buffers]
     • Double-buffering
       – Each CPU core collects new reassembled packets, while the GPUs process the previous batch
       – Effectively hides GPU communication costs
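
     A sketch of the double-buffering loop, reusing the batch_t structure from the earlier
     sketch and a hypothetical fill_with_reassembled_packets() helper: while the GPU works
     on one buffer via an asynchronous copy and kernel launch, the CPU core fills the other.

       #include <cuda_runtime.h>

       /* Double-buffering sketch: buf[0] and buf[1] alternate between "being
        * filled by the CPU" and "being transferred/scanned on the GPU". */
       void process_loop(batch_t buf[2], cudaStream_t stream) {
           int cur = 0;
           for (;;) {
               fill_with_reassembled_packets(&buf[cur]);   /* CPU side (hypothetical) */
               cudaStreamSynchronize(stream);              /* previous batch finished */
               cudaMemcpyAsync(buf[cur].dev_buf, buf[cur].host_buf, buf[cur].used,
                               cudaMemcpyHostToDevice, stream);
               /* match_kernel<<<grid, block, 0, stream>>>(buf[cur].dev_buf, ...); */
               cur ^= 1;   /* next iteration fills the other buffer while this one runs */
           }
       }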

  25. Recap
     [Layered diagram, bottom to top:
      NIC: demultiplexes packets (1-10 Gbps) into packet streams
      Per-flow CPUs: protocol analysis, producing reassembled packet streams
      Data-parallel GPUs: content matching]

  26. Outline
     • Architecture
     • Implementation
     • Performance Evaluation
     • Conclusions

  27. Setup: Hardware
     [Diagram: two NUMA nodes, each with a CPU, local memory, an I/O hub (IOH), and a GPU; a 10 GbE NIC is attached via PCIe]
     • NUMA architecture, QuickPath Interconnect
     • 2 x CPU: Intel E5520, 2.27 GHz x 4 cores
     • 2 x GPU: NVIDIA GTX480, 1.4 GHz x 480 cores
     • 1 x NIC: Intel 82599EB, 10 GbE

  28. Pattern Matching Performance
     [Chart: GPU pattern-matching throughput (Gbit/s) vs. number of CPU-cores:
      1 core: 14.6, 2 cores: 26.7, 4 cores: 42.5, 8 cores: 48.1 (bounded by PCIe capacity)]
     • The performance of a single GPU increases as the number of CPU-cores increases

  29. Pattern Matching Performance
     [Chart: same as the previous slide, with a second GPU added]
     • Adding a second GPU raises pattern-matching throughput to 70.7 Gbit/s

  30. Setup: Network
     [Diagram: a traffic generator/replayer feeds MIDeA over a 10 GbE link]

  31. Synthetic traffic
     [Chart: throughput (Gbit/s) vs. packet size:
      200-byte packets: MIDeA 2.4, Snort (8 cores) 1.1
      800-byte packets: MIDeA 4.8, Snort (8 cores) 1.5
      1500-byte packets: MIDeA 7.2, Snort (8 cores) 2.1]
     • Randomly generated traffic

  32. Real traffic
     [Chart: MIDeA 5.2 Gbit/s vs. Snort (8 cores) 1.1 Gbit/s]
     • 5.2 Gbit/s with zero packet loss
       – Replayed trace captured at the gateway of a university campus

  33. Summary
     • MIDeA: a multi-parallel network intrusion detection architecture
       – Single-box implementation
       – Based on commodity hardware
       – Less than $1500
     • Operates at 5.2 Gbit/s with zero packet loss
       – 70 Gbit/s pattern-matching throughput

  34. Thank you!
     gvasil@ics.forth.gr
