Towards Adaptive Big Data Cyber-attack Detection via Semantic Link Networks
George Karabatis (1), Jianwu Wang (1), Ahmed AlEroud (2)
{georgek, jianwu, ahmed21}@umbc.edu
(1) Department of Information Systems, University of Maryland, Baltimore County
(2) Department of Computer Information Systems, Yarmouk University, Irbid, Jordan
Mission Critical Big Data Analytics MCBDA – Prairie View, TX, May 2016
1 UMBC

Why are Cyber Attacks an issue?
• Sabotage of Operations
• Data Security (Database & Communication)
• Communication Interference
• Financial fraud
• Grid Security

Intrusion Detection Systems
• Packet-based IDSs: Analyze the content of network packets to predict attacks
  – A fairly hard task on today’s high-speed Gigabit networks, which carry vast volumes of network packets
• Flow-based IDSs: Detect cyber attacks by analyzing net-flows
  – The content of packets is not available
  – Only traffic-based features

Packet-based Intrusion Detection
Advantages
• Have full access to payload
• More information is available
• More accurate intrusion detection
Disadvantages
• Increasing network bandwidth generates huge amounts of data
• Analysis of data is computationally expensive
Result: Perfect big data problem

Network Flows (flows)
Think of it like phone call metadata: who called whom, when, but without the conversation
• Source/Destination IP
• Input/Output Router Interface
• Protocol
• Type of Service
• Packet Count
• Octet Count
• Start/End Time
• TCP Flags
• Source/Dest Network Mask
• Input/Output Interface encapsulation size
• IP Address of next hop within the peer
• Router IP of cache shortcut in supervisor

NetFlow
[Figure: a packet stream partitioned into flow 1, flow 2, flow 3, and flow 4]
• Set of packets that “belong together”
  – Source/destination IP addresses and port numbers
  – Same protocol, …
  – Same input/output interfaces at a router (if known)
• Packets that are “close” together in time
  – Maximum spacing between packets (e.g., 30 sec)

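The grouping rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the exporter's actual algorithm: the `Packet` tuple, the `group_into_flows` function, and the dictionary layout of a flow are hypothetical names, and only the 30-second maximum spacing comes from the slide.

```python
from collections import namedtuple

# Hypothetical packet record; field names are assumptions for this sketch.
Packet = namedtuple("Packet", "ts src_ip dst_ip src_port dst_port proto size")

def group_into_flows(packets, max_gap=30.0):
    """Group packets that share a 5-tuple and are at most max_gap seconds apart."""
    flows = []        # all flows, ordered by the time of their first packet
    open_flows = {}   # 5-tuple -> index of the most recent flow with that key
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        idx = open_flows.get(key)
        if idx is not None and p.ts - flows[idx]["end"] <= max_gap:
            # Packet is "close" in time to the previous one: extend the flow.
            flow = flows[idx]
            flow["end"] = p.ts
            flow["packets"] += 1
            flow["octets"] += p.size
        else:
            # First packet for this key, or the gap was too long: new flow.
            flows.append({"key": key, "start": p.ts, "end": p.ts,
                          "packets": 1, "octets": p.size})
            open_flows[key] = len(flows) - 1
    return flows

packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 100),
    Packet(10.0, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 50),
    Packet(100.0, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 20),
    Packet(5.0, "10.0.0.3", "10.0.0.2", 5555, 53, "udp", 10),
]
flows = group_into_flows(packets)
```

The third TCP packet arrives 90 seconds after the second, so it opens a new flow even though its 5-tuple is unchanged. The per-flow packet and octet counters correspond to the flow fields listed on the previous slide.
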
Flow-based intrusion detection
Advantages
• Much less data needs to be processed
• Detection process is faster due to less data
Disadvantages
• Have no access to payload
• Only a subset of attacks is detected
• Accuracy not as good as packet-based

Semantic Link Networks (SLN)
• An SLN is a graph with nodes and edges
• Nodes: Represent alerts or benign activity
• Edges: Weighted links representing similarity of the nodes
  – Measured in terms of context: time, location, numerical, and descriptive features

Contextual features
• Time
  – Start, end time of flows
• Location
  – Source, destination IP addresses, port numbers
• Numerical
  – Traffic statistics, e.g. # of packets, octets
• Descriptive
  – Other characteristics, e.g. flags, protocol

Constructing SLNs
• Nodes represent either alerts or benign activities
  – Each node is initially represented as a feature vector
• Edges: weighted links (calculated using Anderberg and Pearson)
  – Weight of the edge between nodes n1 and n2: $p(n_1, n_2) = \dfrac{\mathit{Sim}(n_1, n_2)}{\sum_{m=1} \mathit{Sim}_m}$

Feature vectors using numerical weights, compared with Pe_Sim(n1, n2):

      n1     n2
f1    0.7    0.8
f2    0.02   0.5
f3    0.9    0.03
f4    0.01   0.01

Binary feature vectors (e.g. TCP flags), compared with An_Sim(n1, n2):

      n1   n2
f1    1    1
f2    0    1
f3    1    0
f4    0    0

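To make the two similarity measures concrete, here is a small Python sketch applied to the example vectors from this slide. The Pearson correlation is the standard one. The slide does not give the Anderberg formula, so the sketch assumes one commonly cited form, a / (a + 2(b + c)), where a counts features set in both vectors and b + c counts mismatches; the authors may have used a different variant. The function names are illustrative only.

```python
import math

def pearson_sim(x, y):
    """Pearson correlation between two equal-length numeric feature vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # constant vector: correlation undefined, treated as 0 here
    return cov / (norm_x * norm_y)

def anderberg_sim(x, y):
    """Assumed Anderberg form a / (a + 2(b + c)) for binary vectors."""
    both_set = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)   # a
    mismatches = sum(1 for p, q in zip(x, y) if p != q)            # b + c
    denom = both_set + 2 * mismatches
    return both_set / denom if denom else 0.0

# Example vectors copied from the slide (features f1..f4).
n1_numeric, n2_numeric = [0.7, 0.02, 0.9, 0.01], [0.8, 0.5, 0.03, 0.01]
n1_binary, n2_binary = [1, 0, 1, 0], [1, 1, 0, 0]

numeric_similarity = pearson_sim(n1_numeric, n2_numeric)
binary_similarity = anderberg_sim(n1_binary, n2_binary)
```

For the binary example, one feature is set in both vectors and two features disagree, so the assumed Anderberg form gives 1 / (1 + 4) = 0.2.
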
Intrusion Detection with SLNs
After the SLN is complete, and during run-time:
• Investigate features of an incoming flow
• Find the start node in the SLN with features similar to the incoming flow
  – Individual flows are classified using a rule-based classifier that works on flow features (J48)
• Expand the set of nodes with additional ones based on:
  – Connectivity on the graph
  – Threshold value (controls scope of expansion)
• Recall is increased, but the expanded set may contain false positives

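The expansion step can be sketched as a bounded graph traversal. This is an illustration under assumptions, not the authors' implementation: the SLN is represented as a plain dictionary of weighted neighbors, the node names and weights are invented, and the `max_hops` limit is a hypothetical extra knob alongside the threshold named on the slide.

```python
from collections import deque

def expand(sln, start, threshold, max_hops=2):
    """Return the start node plus nodes reachable over edges with weight >= threshold."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, weight in sln.get(node, {}).items():
            if weight >= threshold and neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen

# Toy SLN: node -> {neighbor: similarity weight}. All values are made up.
sln = {
    "port_scan": {"syn_flood": 0.8, "benign_http": 0.1},
    "syn_flood": {"port_scan": 0.8, "dos": 0.6},
    "dos": {"syn_flood": 0.6},
    "benign_http": {"port_scan": 0.1},
}
predicted = expand(sln, "port_scan", threshold=0.5)
```

A lower threshold pulls in more related nodes (higher recall, more false positives), while a higher one keeps the result close to the start node.
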
Intrusion Detection with SLNs
• Apply context filters
  – Limit the expanded result set
  – Reduce the false positives
• Precision increases
• SLN must be updated when new attacks (nodes) are discovered
  – Graph re-generation is expensive
  – Dynamic approach is more promising

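A context filter can be thought of as a predicate that each expanded node must satisfy for the incoming flow. The sketch below is hypothetical: the slides do not specify the filters, so the two predicates (`same_protocol`, `port_in_scope`), the node attributes, and the example data are all assumptions chosen to mirror the descriptive and location context features.

```python
def same_protocol(node, flow):
    """Descriptive context: the attack node applies to the flow's protocol."""
    return node["protocol"] == flow["protocol"]

def port_in_scope(node, flow):
    """Location context: the flow's destination port is one the attack targets."""
    return flow["dst_port"] in node["dst_ports"]

def apply_context_filters(candidates, flow, filters):
    """Keep only candidate nodes that pass every context filter for this flow."""
    return [node for node in candidates if all(f(node, flow) for f in filters)]

# Expanded result set (invented attributes) and an incoming flow.
candidates = [
    {"name": "syn_flood", "protocol": "tcp", "dst_ports": {80, 443}},
    {"name": "dns_amplification", "protocol": "udp", "dst_ports": {53}},
    {"name": "ssh_brute_force", "protocol": "tcp", "dst_ports": {22}},
]
flow = {"protocol": "tcp", "dst_port": 443}
kept = apply_context_filters(candidates, flow, [same_protocol, port_in_scope])
```

Here the UDP attack and the SSH attack are dropped because they do not fit the flow's context, which shrinks the result set and removes likely false positives.
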
Attack Prediction Process
[Figure: an incoming flow is matched against classification rules (R1, R2, …, Rn) for initial prediction; FPs are then filtered to produce the final predictions]

Hybrid intrusion detection
• Combines flow-based and packet-based detection
• Takes advantage of both approaches
• Requires a big data platform
• Expected to increase the accuracy of predictions, since payload is available for questionable flows

Hybrid intrusion detection
Layer one
• Flow-based approach is applied
• If the prediction is benign, allow the flow to pass
• If the prediction is suspicious, analyze further
  – If the flow is marked suspicious with high probability, enforce the appropriate policy:
    • Deny entry
    • Divert to another system (e.g. honeypot)
  – If the flow is marked suspicious with medium probability, proceed to layer two

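The layer-one decision logic amounts to a small routing function. The sketch below is illustrative only: the function name, the action labels, and the 0.9 cut-off for "high probability" are assumptions, since the slides give no numeric thresholds. The slides also only describe high and medium probability, so sending every other suspicious flow to layer two is a simplification made here.

```python
def layer_one_action(label, probability, high_threshold=0.9):
    """Route a flow after the flow-based prediction (thresholds are hypothetical)."""
    if label == "benign":
        return "allow"
    if probability >= high_threshold:
        # Policy choice: deny entry or divert to another system such as a honeypot.
        return "enforce_policy"
    # Not confident enough to act on flow data alone: inspect packets in layer two.
    return "layer_two"

decisions = [
    layer_one_action("benign", 0.97),
    layer_one_action("suspicious", 0.95),
    layer_one_action("suspicious", 0.60),
]
```

Only the third flow reaches packet-level analysis, which is where the computational savings of the hybrid design come from.
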
Hybrid intrusion detection
Layer two
• More information is needed to decide
• Corresponding packets are passed to a Spark-based platform
• A Spark DStream is applied
• Map functions run in parallel for both individual and multistage packet analysis

Hybrid Big-Data IDS
[Figure: hybrid big-data IDS architecture, showing the flow-based layer]

Hybrid intrusion detection
• Multistage attacks
  – Require current and past packets (historical packets with the same IP address)
  – A NoSQL DB (Cassandra) stores suspicious packets and is queried for matching patterns
  – Newly discovered attacks are used to dynamically update the SLN

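The multistage lookup can be sketched without a database. In the code below, `SuspiciousPacketStore` is an in-memory stand-in for the Cassandra store named on the slide, not a Cassandra client, and the event labels and the ordered-stage pattern are invented for illustration. The idea shown is only this: keep suspicious events per source IP, then check whether the history contains the stages of a known attack in order.

```python
from collections import defaultdict

class SuspiciousPacketStore:
    """In-memory stand-in for the NoSQL store of suspicious packets (assumption)."""

    def __init__(self):
        self._by_ip = defaultdict(list)

    def add(self, src_ip, timestamp, event):
        self._by_ip[src_ip].append((timestamp, event))

    def history(self, src_ip):
        """Events seen from src_ip, oldest first."""
        return [event for _, event in sorted(self._by_ip.get(src_ip, []))]

def matches_multistage(history, pattern):
    """True if the stages in pattern occur in history in order (gaps allowed)."""
    remaining = iter(history)
    # `stage in remaining` consumes the iterator up to the first match,
    # so each stage must appear after the previous one.
    return all(stage in remaining for stage in pattern)

store = SuspiciousPacketStore()
store.add("10.0.0.5", 3, "exploit")
store.add("10.0.0.5", 1, "scan")
store.add("10.0.0.5", 2, "login_failure")

events = store.history("10.0.0.5")
is_multistage = matches_multistage(events, ["scan", "exploit"])
```

A current packet alone would show only the "exploit" event; the stored history is what reveals the earlier scan from the same address.
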
Advantages of Hybrid Approach
• Flows that are predicted as benign, or as suspicious with high probability, do not reach the second layer (packet examination), saving computational resources
• Only questionable flows are further examined at the packet level
• Accuracy of the prediction is expected to rise, since more information (payload) is available
• More attacks may be recognized (since there is access to payload, in addition to flow data)
• Compared to packet-based approaches, our approach requires fewer computational resources

Packet Analysis on Spark
• Create a Spark streaming context with a batch interval of n seconds
• Create a DStream from an incoming network socket; the DStream contains all packets within the batch interval time window
• Apply the full packet analysis function to each packet in parallel through the DStream’s map function, and output each suspicious packet and its attackType using a key-value structure
• Report new types of attacks to update the SLN
• Apply the multistage packet analysis function to each DStream element in parallel through the DStream’s map function, and output each suspicious multistage packet and its attackType using a key-value structure

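The first three steps can be sketched with PySpark's DStream API (`StreamingContext`, `socketTextStream`, `map`, `filter`, `pprint`). This is a rough outline under several assumptions: packets arrive as text lines on a socket, `analyze_packet` is a toy placeholder for the full packet analysis function (its two signature checks are invented), and the host, port, and batch interval are arbitrary. The multistage step is omitted. PySpark is imported inside `build_streaming_job` so the analysis function can be exercised without a Spark installation.

```python
def analyze_packet(line):
    """Toy per-packet analysis: returns a (packet, attackType) key-value pair.

    attackType is None for packets that look benign. The checks below are
    placeholders, not real detection rules.
    """
    payload = line.lower()
    if "union select" in payload:
        return (line, "sql_injection")
    if "/etc/passwd" in payload:
        return (line, "path_traversal")
    return (line, None)

def build_streaming_job(host, port, batch_seconds):
    """Wire the analysis into a DStream pipeline (requires pyspark)."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="PacketAnalysisSketch")
    ssc = StreamingContext(sc, batch_seconds)       # batch interval of n seconds
    packets = ssc.socketTextStream(host, port)      # one DStream element per line
    suspicious = (packets
                  .map(analyze_packet)              # runs in parallel per packet
                  .filter(lambda kv: kv[1] is not None))
    suspicious.pprint()                             # print suspicious pairs per batch
    return ssc

sample = analyze_packet("GET /item?id=1 UNION SELECT password FROM users")
```

To run the pipeline, one would call `ssc = build_streaming_job("localhost", 9999, 5)`, then `ssc.start()` and `ssc.awaitTermination()`. In a fuller version, the suspicious pairs would be written to the packet store and new attack types reported back to update the SLN, rather than printed.
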
Conclusions
• A promising technique for huge amounts of network data
• Takes advantage of flow and packet approaches
• Builds on previous success on packet-based and flow-based intrusion detection
• Work in progress on the hybrid approach for big data
  – Implementation for Spark platform
  – Evaluation with datasets

Questions?