A Labeled Data Set for Flow-based Intrusion Detection




  1. A Labeled Data Set For Flow-based Intrusion Detection
     Anna Sperotto, Ramin Sadre, Frank van Vliet, Aiko Pras
     Design and Analysis of Communication Systems, University of Twente, The Netherlands
     NMRG Workshop on Netflow/IPFIX Usage in Network Management, Maastricht, July 30, 2010

  2. Contents
     • Operational experience in trace collections
     • Experimental setup
     • Data processing and labeling
     • The labeled data set

  3. Introduction
     [Figure: trace 1 → system 1 → "30% attacks!"; trace 2 → system 2 → "85% attacks!"]
     • Systems are evaluated on proprietary traces
     • No shared ground truth
     • Results cannot be directly compared!

  4. Data set requirements
     We want the data set to be:
     • realistic
     • complete and correct in labeling
     • achievable in an acceptable labeling time
     • of sufficient trace size
     The requirements will determine the collection setup.

  5. Measurement scale
     • NETWORK
       • realistic
       • not complete
       • it does not scale

  6. Measurement scale
     • NETWORK
       • realistic
       • not complete
       • it does not scale
     • SUBNETWORK
       • realistic
       • not complete

  7. Measurement scale
     • NETWORK
       • realistic
       • not complete
       • it does not scale
     • SUBNETWORK
       • realistic
       • not complete
     • SINGLE HOST
       • realistic
       • enhanced logging (honeypot)

  8. Setup
     [Diagram labels: HONEYPOT (ssh, http, ftp; ssh session transcript), XEN SERVER, tcpdump]
     • daily used services with enhanced logging
     • direct connection to the Internet
     • attack exposure
     • complete tcpdump of the traffic (offline flow creation)

  9. Data set creation
     [Pipeline diagram. Boxes: traffic dump, flows, logs, typescripts, log events, alert generation/correlation, clustering & causality, labelled dataset]
     Preprocessing
     • packets → flows: $F = (I_{src}, I_{dst}, P_{src}, P_{dst}, Pckts, Octs, T_{start}, T_{end}, Flags, Prot)$
     • logs → log events: $L = (T, I_{src}, P_{src}, I_{dst}, P_{dst}, Descr, Auto, Succ, Corr)$
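The flow tuple F and the log event tuple L map naturally onto record types. The following is a minimal Python sketch, not the authors' implementation: the class names, field names, field types, and the sample values are assumptions chosen to mirror the symbols on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """F = (I_src, I_dst, P_src, P_dst, Pckts, Octs, T_start, T_end, Flags, Prot)."""
    i_src: str      # source IP address
    i_dst: str      # destination IP address
    p_src: int      # source port
    p_dst: int      # destination port
    pckts: int      # number of packets
    octs: int       # number of octets
    t_start: float  # start time (seconds is an assumed unit)
    t_end: float    # end time
    flags: int      # TCP flags (assumed to be the OR over all packets)
    prot: int       # IP protocol number, e.g. 6 for TCP


@dataclass
class LogEvent:
    """L = (T, I_src, P_src, I_dst, P_dst, Descr, Auto, Succ, Corr)."""
    t: float
    i_src: str
    p_src: int
    i_dst: str
    p_dst: int
    descr: str
    auto: bool          # automated (as opposed to manual) attack
    succ: bool          # attack succeeded
    corr: bool = False  # becomes True once the event is correlated


# Hypothetical sample records using documentation-range addresses.
flow = Flow("192.0.2.10", "198.51.100.5", 40000, 22, 12, 1840, 100.0, 103.5, 0x1B, 6)
event = LogEvent(101.2, "192.0.2.10", 40000, "198.51.100.5", 22,
                 "ssh: failed password", auto=True, succ=False)

# The event timestamp falls inside the flow's lifetime.
inside = flow.t_start <= event.t <= flow.t_end
```

The flow record is frozen because a finished flow does not change, while the log event stays mutable so that its Corr field can be updated during correlation.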

  10. Data set creation
      [Pipeline diagram. Boxes: traffic dump, flows, logs, typescripts, log events, alert generation/correlation, clustering & causality, labelled dataset]
      • The correlation process results in alerts $A = (T, Descr, Auto, Succ, Serv, Type)$

  11. Correlation procedure
      [Diagram: at the honeypot (HP), flows F1 and F2 are correlated with the logs to produce an alert A, yielding the pairs (F1, A) and (F2, A)]

  12. Cluster and Causality
      [Diagram: an alert cluster, alerts, and flows; edges are labelled "alert cluster relation" and "causality relation"]
      • Hierarchical view of the alerts to enrich the data set with extra information on the traffic
      • Group simple alerts into cluster alerts
      • High-level view of malicious activities
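One way to picture the two relations is as parent-to-child edge maps over alert identifiers. The Python sketch below is only an illustration under assumptions: the identifiers, the dictionary layout, and the helper name `descendants` are hypothetical, and the direction of the causality edges (cause to consequence) is assumed rather than taken from the slide.

```python
def descendants(relation, root):
    """Return every node reachable from root by following parent -> child edges."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in relation.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical example data.
clustering = {"scan-1": {"a1", "a2"}}   # cluster alert -> member alerts
causality = {"a2": {"a3"}}              # alert -> alerts it led to (assumed direction)
alert_flows = {"a1": {101, 102}, "a2": {103}, "a3": {104}}  # alert -> flow ids

# All simple alerts grouped under the cluster alert.
members = descendants(clustering, "scan-1")

# All flows that belong to the cluster, via its member alerts.
cluster_flows = set().union(*(alert_flows.get(a, set()) for a in members))

# Alerts that follow from alert a2 according to the causality relation.
consequences = descendants(causality, "a2")
```

Using the same traversal for both relations keeps the sketch short; a real data set might need to treat them differently.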

  13. Implementation
      • Packets to flows (AUTOMATIC)
        • softflowd
      • Logs to log events
        • shell scripts (SEMI-AUTOMATIC)
        • discriminate between manual/automated attacks (MANUAL)
      • Alert correlation (SEMI-AUTOMATIC)
        • correlation procedure
        • extensible for other attacks
      • Cluster and causality (MANUAL)
        • analysis of typescripts
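The talk uses softflowd for the packets-to-flows step. To illustrate what such a step does conceptually, here is a minimal Python sketch that groups time-ordered packets by their 5-tuple and closes a flow after an idle period. It is not softflowd's algorithm: the packet tuple layout, the function name `packets_to_flows`, the dictionary keys, and the 60-second idle timeout are all assumptions.

```python
def packets_to_flows(packets, idle_timeout=60.0):
    """Aggregate time-ordered packets into unidirectional flow records.

    Each packet is assumed to be a tuple
    (timestamp, src_ip, dst_ip, src_port, dst_port, protocol, length, tcp_flags).
    """
    active = {}    # 5-tuple key -> flow record still being updated
    finished = []  # flows closed by the idle timeout
    for ts, src, dst, sport, dport, prot, length, flags in packets:
        key = (src, dst, sport, dport, prot)
        rec = active.get(key)
        # Close the flow if it has been idle for longer than the timeout.
        if rec is not None and ts - rec["t_end"] > idle_timeout:
            finished.append(active.pop(key))
            rec = None
        if rec is None:
            rec = active[key] = {
                "i_src": src, "i_dst": dst, "p_src": sport, "p_dst": dport,
                "pckts": 0, "octs": 0, "t_start": ts, "t_end": ts,
                "flags": 0, "prot": prot,
            }
        rec["pckts"] += 1
        rec["octs"] += length
        rec["t_end"] = ts
        rec["flags"] |= flags  # cumulative OR of the TCP flags seen
    # Flows still open at the end of the trace are emitted as they are.
    finished.extend(active.values())
    return finished


# Hypothetical packets: an SSH exchange, then a late packet that starts a new flow.
packets = [
    (0.0, "192.0.2.10", "198.51.100.5", 40000, 22, 6, 60, 0x02),
    (0.1, "198.51.100.5", "192.0.2.10", 22, 40000, 6, 60, 0x12),
    (0.2, "192.0.2.10", "198.51.100.5", 40000, 22, 6, 52, 0x10),
    (100.0, "192.0.2.10", "198.51.100.5", 40000, 22, 6, 60, 0x02),
]
flows = packets_to_flows(packets)
```

The record keys follow the fields of the flow tuple F, so the output can be fed directly into later correlation steps.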

  14. The Dataset
      dump file: 24 GB; flows: 14M; alerts: 7.6M
      • Flow breakdown (bar charts, logarithmic x-axis "number of flows" from 1.0E+00 to 1.0E+08):
        • SSH: 13942629
        • ICMP: 18038
        • FTP: 13
        • HTTP: 9798
        • TCP: 14151511
        • AUTH/IDENT: 191339
        • IRC: 7383
        • UDP: 583
        • OTHERS: 18970

  15. The Dataset
      dump file: 24 GB; flows: 14M; alerts: 7.6M
      • Alert breakdown (two bar charts with x-axis "number of alerts", one linear from 0 to 40 and one logarithmic from 1.0E+00 to 1.0E+08):
        • SSH IN: 8756
        • SSH IN: 10
        • SSH OUT: 7591869
        • FTP: 6
        • HTTP: 5317
        • SSH OUT: 35
        • AUTH/IDENT IN: 95664
        • AUTH/IDENT OUT: 6
        • HTTP: 4
        • IRC OUT: 3692
        • ICMP IN: 16382

  16. The Dataset
      • We labeled 98.5% of the flows and 99.99% of the alerts
      • Mainly malicious traffic:
        • ssh brute force attacks
        • automated http connections
      • Small percentage of side-effect traffic:
        • auth/ident on port 113
        • IRC traffic

  17. Conclusions
      • We presented the first labeled data set for flow-based intrusion detection
        • http://traces.simpleweb.org/
      • Semi-automated correlation process
        • manual intervention is still needed
      • The data set consists mainly of malicious traffic
        • need to extend to benign traffic

  18. Conclusions
      • Reactions:
        • Since publication (October 2009): about 7 requests
        • We do not monitor the downloads at the webpage
        • In contact with Philipp Winter (Hagenberg University, Austria): MSc project "Inductive Intrusion Detection in Flow-Based Network Data using One-Class Support Vector Machines"

  19. Implementation
      Database schema (tables and their columns):
      • ALERTS_CLUSTERING: parent, child
      • ALERTS_CAUSALITY: parent, child
      • ALERT_TYPES: id, description
      • SERVICES: id, description
      • ALERTS: id, description, automated, succeeded, timestamp, type, service
      • NETFLOW_ALERTS: flowid, alertid
      • NETFLOWS: id, src_ip, dst_ip, packets, octets, start_time, start_msec, end_time, end_msec, src_port, dst_port, tcp_flag, prot
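To show how the tables on this slide could fit together, the following Python sketch builds them in an in-memory SQLite database and joins flows to their alerts. The table and column names come from the slide; the column types, the foreign-key links, the choice of SQLite, and the sample rows are assumptions, not the schema as actually distributed.

```python
import sqlite3

# Column types and REFERENCES clauses are assumed; only the names are from the slide.
SCHEMA = """
CREATE TABLE services    (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE alert_types (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE alerts (
    id INTEGER PRIMARY KEY, timestamp REAL, description TEXT,
    automated INTEGER, succeeded INTEGER,
    service INTEGER REFERENCES services(id),
    type INTEGER REFERENCES alert_types(id));
CREATE TABLE netflows (
    id INTEGER PRIMARY KEY, src_ip TEXT, dst_ip TEXT,
    packets INTEGER, octets INTEGER,
    start_time INTEGER, start_msec INTEGER, end_time INTEGER, end_msec INTEGER,
    src_port INTEGER, dst_port INTEGER, tcp_flag INTEGER, prot INTEGER);
CREATE TABLE netflow_alerts (
    flowid INTEGER REFERENCES netflows(id),
    alertid INTEGER REFERENCES alerts(id));
CREATE TABLE alerts_clustering (
    parent INTEGER REFERENCES alerts(id), child INTEGER REFERENCES alerts(id));
CREATE TABLE alerts_causality (
    parent INTEGER REFERENCES alerts(id), child INTEGER REFERENCES alerts(id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical sample rows.
conn.execute("INSERT INTO services VALUES (1, 'ssh')")
conn.execute("INSERT INTO alert_types VALUES (1, 'CONN')")
conn.execute("INSERT INTO alerts VALUES (1, 101.2, 'ssh: failed password', 1, 0, 1, 1)")
conn.execute(
    "INSERT INTO netflows VALUES "
    "(1, '192.0.2.10', '198.51.100.5', 12, 1840, 100, 0, 103, 500, 40000, 22, 27, 6)"
)
conn.execute("INSERT INTO netflow_alerts VALUES (1, 1)")

# Flows labeled by an automated alert, together with the alert description.
rows = conn.execute(
    """
    SELECT f.src_ip, f.dst_port, a.description
    FROM netflows AS f
    JOIN netflow_alerts AS fa ON fa.flowid = f.id
    JOIN alerts AS a ON a.id = fa.alertid
    WHERE a.automated = 1
    """
).fetchall()
```

The NETFLOW_ALERTS table acts as the link between a flow and its label, so unlabeled flows are simply those without a row in it.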

  20. Correlation procedure
      Algorithm 1 Correlation procedure
      procedure ProcessFlowsForService(s : service)
        for all incoming flows $F_1$ for the service s do
          Retrieve the matching response flow $F_2$ such that
            $F_2.I_{src} = F_1.I_{dst} \land F_2.I_{dst} = F_1.I_{src} \land F_2.P_{src} = F_1.P_{dst} \land F_2.P_{dst} = F_1.P_{src}$
            $\land\; F_1.T_{start} \le F_2.T_{start} \le F_1.T_{start} + \delta$
            with smallest $F_2.T_{start} - F_1.T_{start}$;
          Retrieve a matching log event $L$ such that
            $L.I_{src} = F_1.I_{src} \land L.I_{dst} = F_1.I_{dst} \land L.P_{src} = F_1.P_{src} \land L.P_{dst} = F_1.P_{dst}$
            $\land\; F_1.T_{start} \le L.T \le F_1.T_{end} \land \lnot L.Corr$
            with smallest $L.T - F_1.T_{start}$;
          if $L$ exists then
            Create alert $A = (L.T, L.Descr, L.Auto, L.Succ, s, CONN)$.
            Correlate $F_1$ to $A$;
            if $F_2$ exists then
              Correlate $F_2$ to $A$; $L.Corr \leftarrow true$;
            end if
          end if
        end for
      end procedure
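A compact Python rendering of Algorithm 1 may help when reading the conditions. This is a sketch under stated assumptions rather than the authors' code: the class and function names are hypothetical, flows and log events are plain in-memory lists, the value of δ (`delta`, 1 second here) is not given on the slide, and the log event's ports are matched in the same direction as its addresses (L.P_src = F1.P_src, L.P_dst = F1.P_dst).

```python
from dataclasses import dataclass


@dataclass
class Flow:
    i_src: str
    i_dst: str
    p_src: int
    p_dst: int
    t_start: float
    t_end: float


@dataclass
class LogEvent:
    t: float
    i_src: str
    p_src: int
    i_dst: str
    p_dst: int
    descr: str
    auto: bool
    succ: bool
    corr: bool = False


@dataclass
class Alert:
    t: float
    descr: str
    auto: bool
    succ: bool
    serv: str
    type: str


def process_flows_for_service(service, incoming, outgoing, logs, delta=1.0):
    """Return the (flow, alert) correlations for one service."""
    pairs = []
    for f1 in incoming:
        # Response flow: reversed endpoints, starting within delta after f1.
        responses = [
            f for f in outgoing
            if (f.i_src, f.i_dst, f.p_src, f.p_dst)
            == (f1.i_dst, f1.i_src, f1.p_dst, f1.p_src)
            and f1.t_start <= f.t_start <= f1.t_start + delta
        ]
        f2 = min(responses, key=lambda f: f.t_start - f1.t_start, default=None)

        # Uncorrelated log event with the same endpoints, inside f1's lifetime.
        events = [
            e for e in logs
            if (e.i_src, e.i_dst, e.p_src, e.p_dst)
            == (f1.i_src, f1.i_dst, f1.p_src, f1.p_dst)
            and f1.t_start <= e.t <= f1.t_end
            and not e.corr
        ]
        log = min(events, key=lambda e: e.t - f1.t_start, default=None)

        if log is not None:
            alert = Alert(log.t, log.descr, log.auto, log.succ, service, "CONN")
            pairs.append((f1, alert))
            if f2 is not None:
                pairs.append((f2, alert))
                # As in the pseudocode, the event is marked in this branch.
                log.corr = True
    return pairs


# Hypothetical example: one SSH request flow, its response, and one log entry.
f1 = Flow("192.0.2.10", "198.51.100.5", 40000, 22, 100.0, 103.5)
f2 = Flow("198.51.100.5", "192.0.2.10", 22, 40000, 100.1, 103.4)
log = LogEvent(101.2, "192.0.2.10", 40000, "198.51.100.5", 22,
               "ssh: failed password", auto=True, succ=False)
pairs = process_flows_for_service("ssh", [f1], [f2], [log])
```

Both flows end up attached to the same alert object, which corresponds to the pairs (F1, A) and (F2, A) produced by the procedure.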
