NetFlow Analysis with MapReduce
Wonchul Kang, Yeonhee Lee, Youngseok Lee
Chungnam National University
{teshi85, yhlee06, lee}@cnu.ac.kr
2010.04.24 (Sat)
Based on "An Internet Traffic Analysis Method with MapReduce", Cloudman workshop, April 2010

Introduction
• Flow-based traffic monitoring
  – Volume of processed data is reduced
  – Popular flow statistics tools: Cisco NetFlow [1]
• Traditional flow-based traffic monitoring
  – Runs on a high-performance central server
[Figure: routers export flow data to storage on a high-performance server]

Motivation
• A huge amount of flow data
  – Long-term collection of flow data
    Flow data in our campus network (/16 prefix):

    # of routers   1 day     1 month   1 year
    1              1.2 GB    13 GB     156 GB
    5              6 GB      65 GB     780 GB
    10             12 GB     130 GB    1.5 TB
    200            240 GB    2.6 TB    30 TB

  – Short-term period of flow data
    • Massive flow data from anomalous traffic such as Internet worms and DDoS
• Cluster file system and cloud computing platform
  – Google's programming model, MapReduce, and Bigtable [8]
  – Open-source system, Hadoop [9]

MapReduce
• MapReduce is a programming model for large data sets
• First suggested by Google
  – J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI, 2004 [8]
• Users specify only a map and a reduce function
  – Automatically parallelized and executed on a large cluster

MapReduce
[Figure: input splits 1 to 4 feed map tasks; map output passes through shuffle and sort to reduce tasks, which write the result. Data types: (k1, v1) -> list(k2, v2) -> (k2, list(v2)) -> list(v3)]
• Map: returns a list containing zero or more (k, v) pairs
  – Output can have a different key from the input
  – Output can have the same key
• Reduce: returns a new list of reduced output from the input
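The data flow on this slide can be illustrated with a small in-memory sketch. It is not Hadoop code: the function name `run_mapreduce` and the word-count example are assumptions chosen only to show how (k1, v1) records become list(k2, v2), are grouped into (k2, list(v2)), and are reduced to list(v3).

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle/sort, and reduce over an in-memory list of (k1, v1)."""
    # Map phase: each input record yields zero or more (k2, v2) pairs.
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            # Shuffle: collect every value that shares the same key.
            grouped[k2].append(v2)

    # Sort phase: visit keys in sorted order, then reduce each group.
    results = []
    for k2 in sorted(grouped):
        results.append((k2, reduce_fn(k2, grouped[k2])))
    return results


def word_map(offset, line):
    # Emits one (word, 1) pair per word; the input key (offset) is ignored.
    for word in line.split():
        yield word, 1


def count_reduce(word, counts):
    # Returns a list of reduced values, here a single total.
    return [sum(counts)]


if __name__ == "__main__":
    lines = [(0, "a b a"), (1, "b c")]
    print(run_mapreduce(lines, word_map, count_reduce))
```

On a real cluster the map and reduce calls run in parallel on different nodes; the sequential loop here only mirrors the order of the phases.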

Hadoop
• Open-source framework for running applications on large clusters built of commodity hardware
• Implementation of MapReduce and HDFS
  – MapReduce: computational paradigm
  – HDFS: distributed file system
• Node failures are automatically handled by the framework
• Hadoop in practice
  – Amazon: EC2, S3 services
  – Facebook: analysis of web log data

Related Work
• Widely used tools for flow statistics
  – flow-tools, FlowScan, or CoralReef [5]
• P2P-based distributed analysis of flow data
  – DIPStorage: each storage tank is associated with a rule [11]
• MapReduce software
  – Snort log analysis: NCHC cloud computing research group [16]

Contribution
• A flow analysis method with MapReduce
  – Processes flow data on a cloud computing platform, Hadoop
• Implementation of flow analysis programs with Hadoop
  – Decreases flow computation time
  – Enhances fault tolerance of flow analysis jobs

Architecture of Flow Measurement and Analysis System
• Each router exports flow data to a cluster node
• The cluster master manages the cluster nodes

Components of Cluster Node
• Flow file input processor
• Flow analysis map/reduce
• flow-tools
• Hadoop
  – HDFS
  – MapReduce
• Java VM
• OS: Linux
[Figure: layered stack of a cluster node, comprising the flow file input processor, flow analysis map/reduce, flow-tools, the MapReduce library and cluster file system (HDFS) of Hadoop, the Java Virtual Machine, the operating system (Linux), and hardware (CPU, HDD, memory, NIC)]

Flow File Input Processor
• Save NetFlow data in a binary flow file
• Convert the binary flow file into a text file
• Copy the text file to HDFS
[Figure: on the cluster master, NetFlow v5 data is saved to local disk as a flow file (binary format), converted to a flow file (text format), and copied to HDFS on the cluster nodes]
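The binary-to-text conversion step can be sketched as below. This is an illustration under stated assumptions, not the authors' converter: it decodes a raw NetFlow v5 export datagram (24-byte header, 48-byte records) rather than the flow-tools file format, and the text layout (source IP, destination IP, source port, destination port, protocol, packets, octets) is a hypothetical one chosen for this example.

```python
import socket
import struct

# NetFlow v5 export datagram layout, network byte order.
HEADER = struct.Struct("!HHIIIIBBH")                 # 24 bytes
RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")   # 48 bytes


def packet_to_text_lines(packet):
    """Convert one NetFlow v5 datagram into whitespace-separated text lines."""
    version, count = HEADER.unpack_from(packet, 0)[:2]
    if version != 5:
        raise ValueError("not a NetFlow v5 datagram")

    lines = []
    for i in range(count):
        f = RECORD.unpack_from(packet, HEADER.size + i * RECORD.size)
        src_ip = socket.inet_ntoa(f[0])
        dst_ip = socket.inet_ntoa(f[1])
        packets, octets = f[5], f[6]
        src_port, dst_port = f[9], f[10]
        protocol = f[13]
        # Hypothetical text layout, one flow per line.
        lines.append(
            f"{src_ip} {dst_ip} {src_port} {dst_port} {protocol} {packets} {octets}"
        )
    return lines


if __name__ == "__main__":
    header = HEADER.pack(5, 1, 0, 0, 0, 0, 0, 0, 0)
    record = RECORD.pack(
        socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"),
        b"\x00" * 4, 0, 0, 2, 128, 0, 0, 12345, 53, 0, 0, 17, 0, 0, 0, 0, 0, 0,
    )
    for line in packet_to_text_lines(header + record):
        print(line)
```

The resulting text file could then be copied into HDFS, for instance with `hadoop fs -put`.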

Flow Analysis Map/Reduce
• Read text flow files
• Run map tasks
  – Read each line
  – Parse flow data
  – Save results into temporary files
• Run reduce tasks
  – Read temporary files
  – Run sum process
• Write results to a file
[Figure: flows pass a validation check and are mapped to (key, value) pairs of destination port and octets, e.g. (53, 128) and (53, 64); the reducer receives (key, list[value]), e.g. (53, [64, 128]), and outputs the sum, (53, 192)]
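The map and reduce steps on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' Hadoop program: the whitespace-separated line layout and the field positions (`DST_PORT_FIELD`, `OCTETS_FIELD`, `NUM_FIELDS`) are assumptions made for the example, and the grouping that Hadoop performs between the two phases is imitated with a dictionary.

```python
from collections import defaultdict

# Hypothetical text layout:
# src_ip dst_ip src_port dst_port protocol packets octets
NUM_FIELDS = 7
DST_PORT_FIELD = 3
OCTETS_FIELD = 6


def map_flow_line(line):
    """Validate one text flow line and emit (dst_port, octets) if it is usable."""
    fields = line.split()
    if len(fields) != NUM_FIELDS:
        return  # validation check: wrong field count, emit nothing
    try:
        port = int(fields[DST_PORT_FIELD])
        octets = int(fields[OCTETS_FIELD])
    except ValueError:
        return  # validation check: non-numeric field, emit nothing
    if not 0 <= port <= 65535 or octets < 0:
        return  # validation check: out-of-range values, emit nothing
    yield port, octets


def reduce_port(port, octet_list):
    """Sum all octet counts seen for one destination port."""
    return port, sum(octet_list)


def port_breakdown(lines):
    """Map every line, group values by port, and reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for port, octets in map_flow_line(line):
            grouped[port].append(octets)
    return dict(reduce_port(p, v) for p, v in sorted(grouped.items()))


if __name__ == "__main__":
    sample = [
        "10.0.0.1 10.0.0.2 40000 53 17 1 128",
        "10.0.0.3 10.0.0.2 40001 53 17 1 64",
        "malformed line",
    ]
    print(port_breakdown(sample))
```

With the two port-53 flows from the slide's example (128 and 64 octets), the reducer sums them to 192, and the malformed line is dropped by the validation check.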

Performance Evaluation Environment
• Data: flow data from a /24 subnet

    Duration   Flow count   Flow file   Total binary     Total text
               (million)    count       file size (GB)   file size (GB)
    1 day      3.2          228         0.2              1.2
    1 week     19.0         1596        0.3              2.3
    1 month    109.1        7068        2.0              13.1

• Compared methods: computing byte count per destination port
  – flow-tools: flow-cat [flow data folder] | flow-stat -f 5
  – Our implementation with Hadoop
• Performance metric
  – Flow statistics computation time
• Fault recovery against map/reduce tasks

Our Testbed
• Hadoop 0.18.3
• Cluster master x 1
  – Core 2 Duo 2.33 GHz
  – Memory 2 GB
  – 1 GE
• Cluster node x 4
  – Core 2 Quad 2.83 GHz
  – Memory 4 GB
  – HDD 1.5 TB
  – 1 GE
[Figure: a Chungnam National University router connected to the Internet exports NetFlow v5 data over Gigabit Ethernet to the cluster master and cluster nodes]

Flow Statistics Computation Time
• Port breakdown computation time
  – 72% decrease with MR(4) on Hadoop
[Figure: port-breakdown computation time. x-axis: number of flows (duration): 3.2 million (one day), 19 million (one week), 109.1 million (one month); y-axis: port breakdown running time (sec), 0 to 18000; series: flow-tools, MR (1), MR (2), MR (3), MR (4). Annotations: flow-tools 4h 30m 23s; MR(4) 1h 15m 49s]
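The 72% figure follows from the two running times annotated on the chart. The short calculation below reproduces it; the helper name `to_seconds` is only illustrative.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration into seconds."""
    return hours * 3600 + minutes * 60 + seconds


flow_tools = to_seconds(4, 30, 23)   # 16223 s, flow-tools on a single server
mr4 = to_seconds(1, 15, 49)          # 4549 s, MR(4) on Hadoop

decrease = 1 - mr4 / flow_tools      # fraction of running time saved
speedup = flow_tools / mr4           # how many times faster MR(4) is

if __name__ == "__main__":
    print(f"decrease: {decrease:.1%}, speedup: {speedup:.2f}x")
```

The decrease comes out just under 72%, which corresponds to MR(4) running roughly 3.6 times faster than flow-tools on the one-month data set.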

Single Node Failure: Map Task
• Under 4 cluster nodes
• Map task fail time
  – 4 sec (M: 9%, R: 0%)
• Map task recover time
  – 266 sec (M: 99%, R: 32%)
[Figure annotations: fail time 4 sec; recover time 266 sec]

Single Node Failure: Reduce Task
• Under 4 cluster nodes
• Reduce task fail time
  – 29 sec (M: 41%, R: 10%)
• Reduce task recover time
  – 320 sec (M: 99%, R: 32%)
[Figure annotations: fail time 29 sec; recover time 320 sec]

Text vs. Binary NetFlow Files
• Flow analysis with text files
  – Pipeline: flow exporter -> (NetFlow packets) -> flow collector -> binary flow file -> text converter -> text flow file -> HDFS -> flow analyzer on Hadoop (map, reduce)
  – Formats: TextInputFormat, TextOutputFormat
  – K: Text, V: LongWritable
• Flow analysis with binary files
  – Pipeline: flow exporter -> (NetFlow packets) -> flow collector -> binary flow file -> HDFS -> flow analyzer on Hadoop (map, reduce)
  – Formats: BinaryInputFormat, BinaryOutputFormat
  – K: BytesWritable, V: BytesWritable

Binary Input in Hadoop
• Currently developing a BinaryInputFormat module for Hadoop
• Smaller storage with binary NetFlow files
  – Reduces the number of map tasks, which increases performance
• Decreasing computation time
  – By 18% ~ 55% for a single flow analysis job
  – By 58% ~ 75% for two flow analysis jobs

Prototype

Summary
• NetFlow data analysis with MapReduce
  – Easy management of big flow data
  – Decreased computation time
  – Fault-tolerant service against a single machine failure
• Ongoing work
  – Supporting binary NetFlow files
  – Faster processing of NetFlow files
