MIDeA: A Multi-Parallel Intrusion Detection Architecture
Giorgos Vasiliadis, FORTH-ICS, Greece
Michalis Polychronakis, Columbia U., USA
Sotiris Ioannidis, FORTH-ICS, Greece
CCS 2011, 19 October 2011
Network Intrusion Detection Systems
• Typically deployed at ingress/egress points
  – Inspect all network traffic
  – Look for suspicious activities
  – Alert on malicious actions
[Diagram: Internet connected over a 10 GbE link to the NIDS, which sits in front of the internal network]
Challenges
• Traffic rates are increasing
  – 10 Gbit/s Ethernet is common in metro/enterprise networks
  – Up to 40 Gbit/s at the core
• Increasingly complex analysis must be performed at these higher speeds
  – Deep packet inspection
  – Stateful analysis
  – Thousands of attack signatures
Designing a NIDS
• Fast
  – Needs to handle many Gbit/s
  – Scalable
    • Moore's law no longer delivers faster single cores, so performance must scale across many cores
• Commodity hardware
  – Cheap
  – Easily programmable
Today: fast or commodity
• Fast “hardware” NIDS
  – FPGA/TCAM/ASIC based
  – Throughput: high
• Commodity “software” NIDS
  – Processing on general-purpose processors
  – Throughput: low
MIDeA
• A NIDS built out of commodity components
  – Single-box implementation
  – Easy programmability
  – Low price
• Can we build a 10 Gbit/s NIDS with commodity hardware?
Outline
• Architecture
• Implementation
• Performance Evaluation
• Conclusions
Single-threaded performance
[Pipeline: NIC → Preprocess → Pattern matching → Output]
• Vanilla Snort: 0.2 Gbit/s
Problem #1: Scalability
• Single-threaded NIDSs have limited performance
  – They do not scale with the number of CPU cores
Multi-threaded performance
[Diagram: NIC feeding multiple parallel Preprocess → Pattern matching → Output pipelines]
• Vanilla Snort: 0.2 Gbit/s
• With multiple CPU cores: 0.9 Gbit/s
Problem #2: How to split traffic across cores
• Naively distributing packets from the NIC across cores incurs synchronization overheads and cache misses
• Solution: Receive-Side Scaling (RSS), where the NIC hashes each packet's flow and steers it to one of several Rx-queues (see the sketch below)
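RSS works because the NIC computes a hash over each packet's connection 5-tuple and uses it to select an Rx-queue, so all packets of a flow land on the same queue and therefore the same core. The C sketch below only illustrates that flow-to-queue mapping in software; the struct, the function names, and the simple symmetric mix are illustrative assumptions, not the Toeplitz hash that RSS hardware actually computes.

#include <stdint.h>

/* Conceptual sketch of RSS-style dispatch: hash the connection 5-tuple so
 * that every packet of a flow maps to the same Rx-queue (and thus the same
 * CPU core). The real hash is a Toeplitz function computed in NIC hardware;
 * this simple symmetric mix only illustrates the idea. */
struct flow_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

static uint32_t flow_hash(const struct flow_tuple *f)
{
    /* XORing the two endpoints keeps the hash symmetric, so both directions
     * of a connection end up on the same queue. */
    uint32_t h = f->src_ip ^ f->dst_ip;
    h ^= (uint32_t)(f->src_port ^ f->dst_port);
    h ^= f->proto;
    h *= 0x9e3779b1u;                    /* integer mixing step */
    h ^= h >> 16;
    return h;
}

static unsigned pick_queue(const struct flow_tuple *f, unsigned num_queues)
{
    return flow_hash(f) % num_queues;    /* same flow -> same queue -> same core */
}

Symmetry matters here: both directions of a TCP connection must reach the same core so that stream reassembly (described later) needs no cross-core synchronization.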
Multi-queue performance
[Diagram: NIC with RSS feeding multiple parallel Preprocess → Pattern matching → Output pipelines]
• Vanilla Snort: 0.2 Gbit/s
• With multiple CPU cores: 0.9 Gbit/s
• With multiple Rx-queues: 1.1 Gbit/s
Problem #3: Pattern matching is the bottleneck
• Pattern matching accounts for more than 75% of the processing time
[Pipeline: NIC → Preprocess → Pattern matching → Output, with pattern matching dominating]
• Solution: offload pattern matching to the GPU
Why GPU?
• General-purpose computing
  – Flexible and programmable
• Powerful and ubiquitous
  – Constant innovation
• Data-parallel model
  – More transistors devoted to data processing rather than caching and flow control
Offloading pattern matching to the GPU
[Diagram: NIC with RSS feeding multiple Preprocess pipelines, with pattern matching offloaded to the GPU before Output]
• Vanilla Snort: 0.2 Gbit/s
• With multiple CPU cores: 0.9 Gbit/s
• With multiple Rx-queues: 1.1 Gbit/s
• With GPU offloading: 5.2 Gbit/s
Outline
• Architecture
• Implementation
• Performance Evaluation
• Conclusions
Multiple data transfers
[Diagram: NIC → CPU → GPU, connected over PCIe]
• Several data transfers between different devices
• Are the data transfers worth the computational gains offered?
Capturing packets from the NIC
[Diagram: NIC Rx-queues mapped to per-queue ring buffers shared between kernel space and user space]
• Packets are hashed in the NIC and distributed to different Rx-queues
• Memory-mapped ring buffers for each Rx-queue
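The capture path hands packets to user space through a memory-mapped ring per Rx-queue, so each core can drain its own ring without copies or locks. The sketch below shows the general shape of such a consumer; the ring layout (head/tail indices and fixed-size slots) and every name in it are hypothetical, standing in for whatever memory-mapped capture interface is actually used.

#include <stdint.h>

/* Hypothetical layout of one memory-mapped Rx ring shared with the kernel:
 * the driver advances `head` as packets arrive, the user-space consumer
 * advances `tail` as it processes them. Real capture frameworks differ in
 * detail; this only illustrates lock-free per-queue consumption. */
#define RING_SLOTS 4096
#define SLOT_SIZE  2048

struct rx_slot { uint32_t len; unsigned char data[SLOT_SIZE]; };

struct rx_ring {
    volatile uint32_t head;            /* written by the driver   */
    volatile uint32_t tail;            /* written by the consumer */
    struct rx_slot    slot[RING_SLOTS];
};

/* Per-core consumer loop: drain packets from this core's ring only. */
static void drain_ring(struct rx_ring *r,
                       void (*handle)(const unsigned char *pkt, uint32_t len))
{
    while (r->tail != r->head) {
        struct rx_slot *s = &r->slot[r->tail % RING_SLOTS];
        handle(s->data, s->len);
        r->tail++;                     /* slot may now be reused by the driver */
    }
}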
CPU processing
• Packet capturing is performed by different CPU cores in parallel
  – Process affinity pins each capture thread to its own core (sketched below)
• Each core normalizes captured packets and reassembles them into streams
  – Removes ambiguities
  – Detects attacks that span multiple packets
• Packets of the same connection always end up on the same core
  – No synchronization
  – Good cache locality
• Reassembled packet streams are then transferred to the GPU for pattern matching
  – How should the GPU be accessed?
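One straightforward way to realize the per-core layout above is to spawn one worker thread per Rx-queue and pin it to its own core. The sketch below does the pinning with the standard pthread_setaffinity_np() call; capture_and_reassemble() is a placeholder for the per-queue capture and stream-reassembly loop, and the queue-i-on-core-i mapping is an assumption about how the slide's process affinity is set up.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NUM_QUEUES 8   /* one worker per Rx-queue, e.g. 8 queues on 8 cores */

/* Placeholder: poll the memory-mapped ring of Rx-queue `queue_id`, normalize
 * and reassemble packets into streams, and batch them for the GPU. */
static void capture_and_reassemble(int queue_id) { (void)queue_id; }

static void *worker(void *arg)
{
    int core = (int)(long)arg;

    /* Pin this thread to its own core: same flow -> same queue -> same core,
     * so reassembly state never migrates and needs no locking. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    capture_and_reassemble(core);      /* queue i is handled by core i */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_QUEUES];
    for (long i = 0; i < NUM_QUEUES; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_QUEUES; i++)
        pthread_join(tid[i], NULL);
    return 0;
}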
Accessing the GPU
• Solution #1: Master/slave model
  – A single "master" thread performs all transfers and kernel launches over the 64 Gbit/s PCIe link on behalf of the other threads
• Execution flow example: transfers to the GPU, kernel execution, and transfers back proceed one batch at a time behind the master thread
• Resulting pattern-matching throughput: 14.6 Gbit/s
Accessing the GPU
• Solution #2: Shared execution, where all threads access the GPU directly over the 64 Gbit/s PCIe link
• Execution flow example: transfers and kernel executions issued by different threads interleave, keeping both the PCIe link and the GPU busy
• Resulting pattern-matching throughput: 48.1 Gbit/s
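The slide does not show how shared execution is implemented, so the CUDA sketch below is only one plausible arrangement: each host thread owns a CUDA stream and a pinned staging buffer, so copies and kernel launches issued by different cores are queued on the shared GPU independently instead of funneling through one master thread. The struct, the function names, and the launch configuration are illustrative assumptions; match_kernel stands in for the pattern-matching kernel sketched later.

#include <cuda_runtime.h>
#include <stddef.h>

/* Placeholder for the pattern-matching kernel (see the later sketch). */
__global__ void match_kernel(const unsigned char *buf, size_t len) { }

/* Each CPU core (host thread) owns a stream and a pinned host buffer, so its
 * transfers and launches can overlap with those of the other cores. */
struct per_core_gpu_ctx {
    cudaStream_t   stream;
    unsigned char *host_buf;   /* pinned (page-locked) memory enables async DMA */
    unsigned char *dev_buf;
    size_t         buf_size;
};

void ctx_init(struct per_core_gpu_ctx *c, size_t buf_size)
{
    c->buf_size = buf_size;
    cudaStreamCreate(&c->stream);
    cudaMallocHost((void **)&c->host_buf, buf_size);
    cudaMalloc((void **)&c->dev_buf, buf_size);
}

/* Called by a core once its staging buffer holds `used` bytes of reassembled
 * packet data: copy and scan asynchronously on this core's own stream. */
void ctx_submit(struct per_core_gpu_ctx *c, size_t used)
{
    cudaMemcpyAsync(c->dev_buf, c->host_buf, used,
                    cudaMemcpyHostToDevice, c->stream);
    match_kernel<<<1, 256, 0, c->stream>>>(c->dev_buf, used);
}

In current CUDA all host threads of a process share one device context, so giving each thread its own stream and buffers is enough to express this kind of sharing.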
Transferring data to the GPU
[Diagram: each CPU core pushes batches to the GPU, which scans them]
• Small transfers degrade PCIe throughput
• Each core therefore batches many reassembled packets into a single buffer before transferring it (see the sketch below)
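Batching is what keeps the PCIe link efficient: instead of one small transfer per packet stream, each core appends many reassembled streams into one large staging buffer and ships it in a single copy. Below is a minimal sketch; the length-prefixed layout, the 1 MB batch size, and the names are assumptions for illustration, not MIDeA's actual buffer format.

#include <stdint.h>
#include <string.h>

#define BATCH_BYTES (1 << 20)   /* e.g. 1 MB per PCIe transfer; tunable */

/* Per-core batching buffer: reassembled packet data is appended back to back,
 * here with a simple 4-byte length prefix per stream. */
struct batch {
    unsigned char data[BATCH_BYTES];
    size_t used;
};

/* Returns 0 on success, -1 if the batch is full and must be flushed
 * (transferred to the GPU) before more data can be appended. */
int batch_append(struct batch *b, const unsigned char *stream, uint32_t len)
{
    if (b->used + sizeof(len) + len > BATCH_BYTES)
        return -1;
    memcpy(b->data + b->used, &len, sizeof(len));
    memcpy(b->data + b->used + sizeof(len), stream, len);
    b->used += sizeof(len) + len;
    return 0;
}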
Pattern matching on the GPU
[Diagram: the transferred packet buffer is split across GPU cores, which report matches]
• Each reassembled packet stream is uniformly assigned to one GPU core (kernel sketch below)
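On the GPU, multi-pattern matching is usually expressed as a table-driven DFA walk in the spirit of Aho-Corasick: each GPU thread takes one reassembled stream from the batch and advances a state machine over its bytes. The kernel below is a simplified sketch of that idea; the per-stream offset/length layout, the flat state-transition table, and the per-stream match counters are illustrative assumptions rather than MIDeA's exact data structures.

#include <cuda_runtime.h>
#include <stdint.h>

#define ALPHABET 256

/* One thread per reassembled packet stream: walk the stream's bytes through
 * a DFA and count how many accepting states were reached. A real system
 * would also record the matching rule and offset for later verification. */
__global__ void match_kernel(const unsigned char *buf,
                             const uint32_t *offsets,    /* start of each stream   */
                             const uint32_t *lengths,    /* length of each stream  */
                             int num_streams,
                             const int *dfa,             /* [state][byte] -> state */
                             const uint8_t *accepting,   /* nonzero => match state */
                             uint32_t *match_count)      /* one counter per stream */
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= num_streams)
        return;

    const unsigned char *p = buf + offsets[s];
    uint32_t len = lengths[s];
    int state = 0;
    uint32_t matches = 0;

    for (uint32_t i = 0; i < len; i++) {
        state = dfa[state * ALPHABET + p[i]];
        if (accepting[state])
            matches++;
    }
    match_count[s] = matches;
}

A launch with one thread per stream, for example match_kernel<<<(num_streams + 255) / 256, 256>>>(...), covers the whole batch; the match results are then copied back to the host, corresponding to the "Matches" output in the diagram.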
Pipelining the CPU and the GPU
[Diagram: per-core packet buffers alternating between CPU filling and GPU processing]
• Double buffering
  – Each CPU core collects newly reassembled packets while the GPUs process the previous batch
  – Effectively hides the GPU communication costs
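The double-buffering scheme can be written down compactly: each core keeps two staging buffers and alternates between filling one on the CPU and having the GPU scan the other, so transfer and kernel time hide behind the CPU-side work. In the sketch below, fill_batch() and scan_batch_async() are placeholders for the batching and GPU-submission code sketched earlier, and the one-stream-per-core structure is an assumption.

#include <cuda_runtime.h>
#include <stddef.h>

size_t fill_batch(unsigned char *buf, size_t cap);          /* CPU: collect packets */
void   scan_batch_async(const unsigned char *buf, size_t len,
                        cudaStream_t stream);               /* GPU: copy + scan     */

void core_loop(unsigned char *buf[2], size_t cap, cudaStream_t stream)
{
    int cur = 0;
    for (;;) {
        size_t used = fill_batch(buf[cur], cap);   /* CPU fills buffer `cur`        */
        cudaStreamSynchronize(stream);             /* previous batch done on GPU    */
        scan_batch_async(buf[cur], used, stream);  /* GPU starts scanning `cur`     */
        cur ^= 1;                                  /* CPU moves to the other buffer */
    }
}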
Recap
• NIC: demultiplexes 1-10 Gbit/s of incoming packets across Rx-queues
• Per-flow CPUs: protocol analysis, producing reassembled packet streams
• Data-parallel GPUs: content matching over the reassembled packet streams
Outline
• Architecture
• Implementation
• Performance Evaluation
• Conclusions
Setup: Hardware
[Diagram: two NUMA nodes, each with a CPU, memory, an IOH, and a GPU, connected by QuickPath Interconnect, plus the 10 GbE NIC]
• NUMA architecture, QuickPath Interconnect

  Component   Model            Specs
  2 x CPU     Intel E5520      2.27 GHz x 4 cores
  2 x GPU     NVIDIA GTX480    1.4 GHz x 480 cores
  1 x NIC     Intel 82599EB    10 GbE
Pattern Matching Performance
[Plot: pattern-matching throughput of a single GPU vs. number of CPU cores feeding it: 14.6 Gbit/s with 1 core, 26.7 with 2, 42.5 with 4, and 48.1 with 8, at which point it is bounded by PCIe capacity]
• The throughput of a single GPU increases as the number of CPU cores increases
Pattern Matching Performance
[Plot: same as before, with a second GPU added; aggregate pattern-matching throughput reaches 70.7 Gbit/s with 8 CPU cores]
• Adding a second GPU raises the pattern-matching throughput to 70.7 Gbit/s
Setup: Network
[Diagram: a traffic generator/replayer connected to MIDeA over a 10 GbE link]
Synthetic traffic
[Plot: throughput vs. packet size for randomly generated traffic. MIDeA: 2.4 Gbit/s at 200-byte packets, 4.8 at 800, and 7.2 at 1500; Snort on 8 cores: 1.1, 1.5, and 2.1 Gbit/s respectively]
• Randomly generated traffic
Real traffic
[Plot: MIDeA at 5.2 Gbit/s vs. Snort on 8 cores at 1.1 Gbit/s]
• 5.2 Gbit/s with zero packet loss
  – Replayed trace captured at the gateway of a university campus
Summary
• MIDeA: a multi-parallel network intrusion detection architecture
  – Single-box implementation
  – Based on commodity hardware
  – Costs less than $1,500
• Operates at 5.2 Gbit/s with zero packet loss
  – 70 Gbit/s pattern-matching throughput
Thank you!
gvasil@ics.forth.gr