Large-scale NetFlow Information Management
Adrien Raulot, Shahrukh Zaidi
University of Amsterdam
Supervisor: Wim Biemolt (SURFnet)
February 5, 2018
What is NetFlow?
- Traffic monitoring technology originally developed by Cisco.
- Flow: “a set of IP packets passing an observation point in the network during a certain time interval. All packets belonging to a particular flow have a set of common properties.”[4]
- Important differences from regular packet capture methods:
  - NetFlow is considered to be less privacy-sensitive
  - NetFlow requires fewer computational resources for analysis
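The flow definition above can be illustrated with a minimal sketch. It assumes the "common properties" are the usual 5-tuple (source address, destination address, source port, destination port, protocol). The `Packet` type, its field names and the sample addresses are illustrative assumptions, not part of any NetFlow implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Simplified packet header; the fields are illustrative assumptions."""
    ts: float    # arrival time in seconds
    src: str     # source IP address
    dst: str     # destination IP address
    sport: int   # source port
    dport: int   # destination port
    proto: int   # IP protocol number, e.g. 6 for TCP
    size: int    # packet size in bytes


def aggregate_flows(packets):
    """Group packets that share the same 5-tuple into flow records."""
    flows = defaultdict(
        lambda: {"packets": 0, "bytes": 0, "first": None, "last": None}
    )
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        rec = flows[key]
        rec["packets"] += 1
        rec["bytes"] += p.size
        # Track the first and last time a packet of this flow was seen.
        rec["first"] = p.ts if rec["first"] is None else min(rec["first"], p.ts)
        rec["last"] = p.ts if rec["last"] is None else max(rec["last"], p.ts)
    return dict(flows)


packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 40000, 23, 6, 60),
    Packet(0.5, "10.0.0.1", "10.0.0.2", 40000, 23, 6, 1500),
    Packet(0.7, "10.0.0.3", "10.0.0.2", 40001, 80, 6, 400),
]
flows = aggregate_flows(packets)
```

Three packets collapse into two flow records here, which hints at why flow data is cheaper to analyse than a full packet capture: only the aggregated counters are kept, not the payloads.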
What is NetFlow?
Figure 1: Schematic overview of the NetFlow export process.[2]
NetFlow Analysis
Three main application areas[3]:
- Flow analysis and reporting
- Threat detection
- Performance monitoring
NetFlow analysis techniques
NfDump:
Figure 2: Schematic overview of the NfDump tool set.[1]
NetFlow analysis techniques
Limitations of this setup[5]:
- Inefficient file-based store: NfDump typically stores NetFlow data in a separate file for every 5-minute time frame.
- Very slow processing speed: each file is read line by line from the beginning, so analysing large amounts of NetFlow data takes a long time.
- Limited analysis methods: as networks become more and more complex, new approaches to NetFlow data analysis are required.
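To make the file-based limitation concrete, the sketch below counts how many 5-minute capture files a query has to read for a given time frame. It assumes files are named `nfcapd.YYYYMMDDhhmm`, as in the `nfcapd.201801011245` example used elsewhere in this deck. The helper `nfcapd_files` is hypothetical and not part of NfDump.

```python
from datetime import datetime, timedelta


def nfcapd_files(start, hours, interval_minutes=5):
    """List the capture file names covering `hours` of data from `start`."""
    count = int(hours * 60 / interval_minutes)
    step = timedelta(minutes=interval_minutes)
    return [
        "nfcapd." + (start + i * step).strftime("%Y%m%d%H%M")
        for i in range(count)
    ]


start = datetime(2018, 1, 1, 12, 45)
for hours in (1, 3.5, 7):
    # The number of files to scan grows in proportion to the time frame.
    print(hours, "h ->", len(nfcapd_files(start, hours)), "files")
```

Because every file is scanned from the beginning, the work grows in proportion to the number of files, and therefore to the length of the time frame.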
Research question
Which data analysis technique could be used to analyse the current SURFnet NetFlow data in a more time-efficient manner?
What is Apache Hadoop?
- Framework for processing large datasets
- Distributed, local computation & storage
- Hadoop Distributed File System (HDFS)
- YARN (Yet Another Resource Negotiator)
- Batch, interactive & real-time jobs
- Designed to be scalable
What is Apache Hadoop?
Figure 3: Schematic overview of Hadoop 2.0.
What is Apache Spark?
- Related to Hadoop, but not limited to it
- Powerful computing engine for Big Data processing
- In-memory
- Built-in modules for streaming, SQL, machine learning, etc.
- Bindings for Java, Scala, Python and R
- Ease of use
What is Apache Parquet?
- Data store for Hadoop
- Column-oriented
- Fast access to data
Figure 4: Schematic overview of a row- vs column-oriented database.
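The difference between the two layouts can be sketched in plain Python. This is only an in-memory illustration of the idea, not how Parquet encodes data on disk. The field names `ts`, `sa` and `da` follow the query shown later in this deck; `ibyt` (byte count) and the sample values are assumptions.

```python
# Row-oriented layout: every record keeps all of its fields together.
rows = [
    {"ts": "2018-01-01 12:45:00", "sa": "10.0.0.1", "da": "10.0.0.2", "ibyt": 1560},
    {"ts": "2018-01-01 12:45:01", "sa": "10.0.0.3", "da": "10.0.0.2", "ibyt": 400},
    {"ts": "2018-01-01 12:45:02", "sa": "10.0.0.2", "da": "10.0.0.1", "ibyt": 90},
]

# Summing one field in the row layout touches every full record.
total_from_rows = sum(record["ibyt"] for record in rows)

# Column-oriented layout: all values of one field are stored together.
columns = {name: [record[name] for record in rows] for name in rows[0]}

# The same sum now reads only the single column it needs.
total_from_columns = sum(columns["ibyt"])
```

A query that selects a few fields out of many, such as `SELECT ts, sa, da`, benefits from the column layout because the remaining columns never have to be read.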
Choice of analysis technique (summary)
Figure 5: Apache Parquet logo. Figure 6: Apache Hadoop logo. Figure 7: Apache Spark logo.
To-Do list:
1. Store NetFlow data into Parquet files on HDFS
2. Load Parquet files using PySpark (Python API)
3. Query the data using Spark SQL
Experiments: test environment
NfDump server specifications:
- 1x Dell PowerEdge R230
- Intel Xeon CPU E3-1240L v5 @ 2.10GHz
- 4 cores
- 16GB of RAM
- ∼200GB of SSD storage
- NfDump v1.6.12
Hadoop cluster specifications:
- ∼100 nodes
- ∼600 cores
- ∼4TB of memory
- ∼2PB of storage
- Apache Hadoop 2.7.2
- Apache Spark 2.1.1
Experiments: implementation
1. Convert NetFlow binary data to CSV:
       nfdump -r nfcapd.201801011245 -o csv
2. Write two Spark jobs in Python:
   - Converter: converts CSV data to Parquet format
   - Querier: loads Parquet data & executes queries
3. Write the SQL query:
       query = 'SELECT ts, sa, da FROM nf_data'
4. Using the Querier, execute the query and cache the results
5. Proceed with further operations on the cached results:
       print(results.count())
       results.show()
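The Converter and Querier jobs are not shown in the deck, so the following is only a sketch of how they could look with the standard PySpark DataFrame API. The function names, the HDFS paths and the application name are hypothetical. Real nfdump CSV output may also need cleaning (for example, trailing summary lines) before conversion, which this sketch does not handle.

```python
def build_query(columns, table="nf_data"):
    """Build a simple SELECT statement over the registered table."""
    return "SELECT {} FROM {}".format(", ".join(columns), table)


def convert_csv_to_parquet(spark, csv_path, parquet_path):
    """Converter: read CSV files (with a header row) and write Parquet."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(parquet_path)


def run_query(spark, parquet_path, query, table="nf_data"):
    """Querier: load Parquet data, register it as a table, run the query."""
    df = spark.read.parquet(parquet_path)
    df.createOrReplaceTempView(table)
    # cache() keeps the result in memory for the follow-up operations.
    return spark.sql(query).cache()


def main():
    # Imported here so the helpers above can be read without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("netflow-sketch").getOrCreate()
    # Hypothetical HDFS locations; replace with the real ones.
    convert_csv_to_parquet(spark, "hdfs:///netflow/csv", "hdfs:///netflow/parquet")
    results = run_query(
        spark, "hdfs:///netflow/parquet", build_query(["ts", "sa", "da"])
    )
    print(results.count())
    results.show()
    spark.stop()
```

`main()` is deliberately not called here. On a machine with PySpark and access to the cluster, it could be invoked from a script submitted with `spark-submit`.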
Experiments: test queries
- Retrieve all flows containing a specific IP address
- Retrieve all flows with a byte count larger than 100MB
- List the top 10 Telnet connections with only the SYN flag set in the TCP header, ordered by the number of bits per second
- List the top 10 IP addresses receiving the largest amount of traffic
- Retrieve all flows with only the SYN flag set in the TCP header
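Three of these test queries can be sketched as plain SQL. The deck does not give the exact statements, so this is one possible formulation. The column names `dp` and `ibyt` are assumptions modelled on nfdump's CSV header, 100MB is taken as 100,000,000 bytes, and the addresses are sample values. Python's built-in `sqlite3` stands in for Spark SQL so the example runs anywhere; these simple statements should carry over, though the two SQL dialects differ in places.

```python
import sqlite3

QUERIES = {
    # All flows with the given address as source or destination.
    "flows_with_ip": (
        "SELECT * FROM nf_data WHERE sa = '10.0.0.1' OR da = '10.0.0.1'"
    ),
    # All flows with a byte count larger than 100MB (decimal megabytes).
    "large_flows": "SELECT * FROM nf_data WHERE ibyt > 100000000",
    # Top 10 destination addresses by total received bytes.
    "top_receivers": (
        "SELECT da, SUM(ibyt) AS total_bytes FROM nf_data "
        "GROUP BY da ORDER BY total_bytes DESC LIMIT 10"
    ),
}

# Tiny in-memory table with made-up flow records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nf_data (ts TEXT, sa TEXT, da TEXT, dp INTEGER, ibyt INTEGER)"
)
conn.executemany(
    "INSERT INTO nf_data VALUES (?, ?, ?, ?, ?)",
    [
        ("2018-01-01 12:45:00", "10.0.0.1", "10.0.0.2", 23, 1560),
        ("2018-01-01 12:45:01", "10.0.0.3", "10.0.0.2", 80, 200000000),
        ("2018-01-01 12:45:02", "10.0.0.2", "10.0.0.1", 443, 400),
    ],
)

results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
```

The two SYN-flag queries are left out because they depend on how the flags column is encoded in the exported data, which the deck does not specify.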
Results: retrieve all flows containing a specific IP address
Figure 8: Execution time of retrieving all flows containing a specific IP address. (Two bar charts; legend: NfDump, Hadoop+Spark; x-axis: time frame of 5min, 30min, 1hr, 3.5hrs, 7hrs; y-axis: execution time in minutes.)
Bar labels, first chart: 6:42, 3:33, 1:05, 0:33, 0:08. Bar labels, second chart: 3:46, 3:22, 3:00, 2:58, 2:37.
Results: retrieve all flows with byte count > 100MB
Figure 9: Execution time of retrieving all flows with byte count larger than 100MB. (Two bar charts; legend: NfDump, Hadoop+Spark; x-axis: time frame of 5min, 30min, 1hr, 3.5hrs, 7hrs; y-axis: execution time in minutes.)
Bar labels, first chart: 6:52, 3:39, 1:06, 0:28, 0:08. Bar labels, second chart: 3:15, 3:09, 3:07, 2:53, 2:50.
Results: list top 10 Telnet connections with only SYN flag set ordered by bps
Figure 10: Execution time of retrieving the top 10 Telnet connections with only the SYN flag set in the TCP header, ordered by the number of bits per second. (Two bar charts; legend: NfDump, Hadoop+Spark; x-axis: time frame of 5min, 30min, 1hr, 3.5hrs, 7hrs; y-axis: execution time in minutes.)
Bar labels, first chart (only three extracted): 2:09, 0:49, 0:09. Bar labels, second chart: 3:22, 3:15, 3:15, 3:09, 2:29.
Results: list top 10 IPs receiving most traffic
Figure 11: Execution time of retrieving the top 10 IP addresses receiving the largest amount of traffic. (Two bar charts; legend: NfDump, Hadoop+Spark; x-axis: time frame of 5min, 30min, 1hr, 3.5hrs, 7hrs; y-axis: execution time in minutes.)
Bar labels, first chart: 23:22, 11:12, 4:04, 1:25, 0:19. Bar labels, second chart: 4:03, 3:37, 2:39, 2:42, 2:38.
Results: retrieve all flows with only SYN flag set
Figure 12: Execution time of retrieving all flows with only the SYN flag set in the TCP header. (Two bar charts; legend: NfDump, Hadoop+Spark; x-axis: time frame of 5min, 30min, 1hr, 3.5hrs, 7hrs; y-axis: execution time in minutes.)
Bar labels, first chart: 89:23, 41:44, 11:53, 5:22, 1:02. Bar labels, second chart: 5:52, 5:06, 3:28, 3:21, 3:14.
Discussion
- Execution time of NfDump increases linearly with longer time frames.
- Hadoop scales very well: the execution time of Spark with Hadoop does not increase significantly when dealing with larger amounts of data.
- NfDump struggles with more complex queries, whereas Spark and Hadoop handle them without difficulty.
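The linear-scaling claim can be checked roughly against the Figure 8 bar labels. This sketch rests on two assumptions that cannot be confirmed from the extracted slide text: that the first chart of Figure 8 shows NfDump, and that its five labels correspond to the five time frames in ascending order. If execution time grows linearly, the processing time per hour of data should stay roughly constant.

```python
def to_seconds(label):
    """Convert a 'minutes:seconds' bar label into seconds."""
    minutes, seconds = label.split(":")
    return int(minutes) * 60 + int(seconds)


# Time frames from the x-axis, expressed in hours of NetFlow data.
time_frame_hours = [5 / 60, 0.5, 1, 3.5, 7]
# Assumed mapping: smallest label belongs to the shortest time frame.
nfdump_labels = ["0:08", "0:33", "1:05", "3:33", "6:42"]

# Seconds of processing per hour of data, for each time frame.
rates = [
    to_seconds(label) / hours
    for label, hours in zip(nfdump_labels, time_frame_hours)
]

for hours, rate in zip(time_frame_hours, rates):
    print("{:.2f} h of data: {:.0f} s per hour of data".format(hours, rate))
```

Under these assumptions the rate stays within the same order of magnitude across all five time frames, which is consistent with roughly linear growth in total execution time.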
Conclusion and future work
- The combination of Hadoop and Apache Spark is a viable option for analyzing large-scale NetFlow data.
- Tuning and optimization of the Spark implementation and the Hadoop cluster may lead to even better performance.
Questions?
References
[1] NfDump. http://nfdump.sourceforge.net/.
[2] Cisco. Introduction to Cisco IOS NetFlow - a technical overview, 2007.
[3] R. Hofstede, P. Čeleda, B. Trammell, I. Drago, R. Sadre, A. Sperotto, and A. Pras. Flow monitoring explained: From packet capture to data analysis with NetFlow and IPFIX. IEEE Communications Surveys & Tutorials, 16(4):2037–2064, 2014.
[4] G. Sadasivan. Architecture for IP flow information export. 2009.
[5] Z. Tian. Management of large scale NetFlow data by distributed systems. Master’s thesis, NTNU, 2016.