An Internet Protocol Address Clustering Algorithm
Robert Beverly, Karen Sollins
MIT Computer Science and Artificial Intelligence Laboratory
{rbeverly,sollins}@csail.mit.edu
December 11, 2008
USENIX SysML 2008
R. Beverly, K. Sollins (MIT) IP Clustering SysML 2008

Scope of Talk
1. Motivation: Learning to operate in an increasingly complex and malicious Internet
2. Challenges: Many at Internet-scale, in dynamic environment
3. Needed: Building-blocks for network and systems designers
4. Approach (and why we didn’t do X): An IP Clustering Algorithm as one building-block with many practical applications
5. Results: Predictive performance, including ability to detect changed network portions
6. Future: What’s next, work building upon this research

Outline
1. Internet-Scale Learning
2. Defining the Problem
3. Exploiting Network Structure
4. Results

Internet-Scale Learning / Evolution of Internet Architecture
The Internet is a phenomenal success, but original assumptions underlying its design have changed, e.g.:
- Security ... historically a second concern
- Trust ... in a world of botnets, phishers, etc.
- Scale ... traffic, routes, multi-homing, etc.
- Complexity ... policy constraints, network demands, economics
And it’s continuing to evolve and grow more complex, e.g.:
- Scale along a new dimension: bad hosts/users
- Support for increasingly critical services
- Trend toward content-based networking
- Adding devices with intermittent connectivity (sensor nets, DTNs)

Internet-Scale Learning / The Research Challenge
- Apply statistical learning to embrace the Internet’s natural complexity
- Find predictive models: generalize to unseen data, new situations
- Networking problems are a challenging learning environment:
  - Non-stationary
  - On-line
  - Distributed
  - Tradeoff between effort vs. improvement obtained vs. errors
Needed: Building blocks to realize ML promise while mitigating challenges

Defining the Problem / Overview / IP Clustering as a Building Block
- Internet Protocol (IP) v4 addresses are unsigned 32-bit integers, e.g. 18.26.0.230
- Hosts are given addresses based on the network on which they reside
An IP Address Clustering Algorithm:
- Supervised learning (change detection is described later)
- Given (informally):
  - Training samples from a portion of the IP address space
  - Labeled with a real or discrete property (e.g. latency, security reputation, etc.)
- Find a “good” partitioning of the space
Why?
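
To make the "unsigned 32-bit integer" view concrete, here is a minimal sketch using Python's standard `ipaddress` module. It converts the slide's example address to its integer value and back. The helper names `ip_to_int` and `int_to_ip` are illustrative and do not come from the talk.

```python
import ipaddress

def ip_to_int(dotted: str) -> int:
    """Return the unsigned 32-bit integer value of a dotted-quad IPv4 address."""
    return int(ipaddress.IPv4Address(dotted))

def int_to_ip(value: int) -> str:
    """Inverse of ip_to_int: format a 32-bit integer as a dotted quad."""
    return str(ipaddress.IPv4Address(value))

addr = ip_to_int("18.26.0.230")
print(addr)             # equals 18*2**24 + 26*2**16 + 0*2**8 + 230
print(int_to_ip(addr))  # round-trips to the dotted-quad form
```

Treating addresses as integers is what lets the address space be drawn as a single axis and partitioned into contiguous ranges.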

Defining the Problem / Motivation
IPs as Identifiers: For better or worse, IP addresses are overloaded. IPs serve as identifiers for:
- End hosts
- Location in the network topology
- Location in the physical topology
Implications of this conflation:
- Security policy (firewalls, etc.)
- Reputation (spam sources, etc.)
- Service selection, load balancing, performance optimization (P2P, CDNs, etc.)
- User-directed routing, grid computing, more...
For example...

Defining the Problem / Building Intuition / Practical Example: Internet Mail Server
[Figure: a mail server receives mail from sources spread across the IP address space (0 to 2^32); known sources are labeled Spam or Ham, and one source is unlabeled (???)]
- Assuming spam originates from “grouped” hosts/networks
- Can a mail server build a predictive model of likely spam sources/networks?

Defining the Problem / Building Intuition / Emulating Ideal World
- Ideally, a “knowledge plane” would provide oracle information on every node in the network
- Unfortunately, the size (∼3B addresses, ∼300K networks) and dynamics of the Internet generally preclude complete knowledge
- Instead, leverage the Internet’s inherent structure due to physical, logical and administrative boundaries
How much structure exists?

Defining the Problem / Building Intuition / IANA /8 Allocations by Continent
- IP addressing is hierarchical
- Discontinuous, fragmented
- Correct granularity?
- Hosts within the same subnetwork likely have consistent policy, latencies, routes, etc.

Defining the Problem / Building Intuition / Learning Structure
Idea 1: Statically divide input space
Email server example:
[Figure: the IP address space (0 to 2^32) from the mail server example, divided into fixed blocks; the estimates shown for the blocks are P(Spam|Struct) = 0.5, P(S) = 0, and P(S) = 1]
Issues:
- Pre-supposes a structure; we may want to infer this
- Requires large amount of memory to perform decently
- Static alignment with data leads to inferior performance compared to other approaches
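
A minimal sketch of Idea 1, assuming the static division uses fixed-length prefixes (here /8, an arbitrary choice that the slides do not specify). The function name `static_spam_rates` and the training labels are made up for illustration. The per-block estimate is simply the fraction of labeled samples in that block that were spam.

```python
from collections import defaultdict
import ipaddress

def static_spam_rates(samples, prefix_len=8):
    """Estimate P(spam) per fixed-size block of the IPv4 space.

    samples: iterable of (dotted_quad, is_spam) pairs.
    prefix_len: number of leading bits that define a block (assumed value).
    Returns {block_index: fraction of samples in that block labeled spam}.
    """
    shift = 32 - prefix_len
    counts = defaultdict(lambda: [0, 0])  # block -> [spam count, total count]
    for ip, is_spam in samples:
        block = int(ipaddress.IPv4Address(ip)) >> shift
        counts[block][0] += int(is_spam)
        counts[block][1] += 1
    return {block: spam / total for block, (spam, total) in counts.items()}

# Made-up training labels, for illustration only.
training = [
    ("10.1.2.3", True), ("10.200.0.9", False),  # same /8, mixed labels
    ("172.16.5.5", False),
    ("192.0.2.77", True),
]
rates = static_spam_rates(training, prefix_len=8)
print(rates)
```

The mixed block illustrates the alignment issue listed above: when a fixed boundary does not coincide with the real structure, the block's estimate is uninformative, and shrinking the blocks to compensate multiplies the number of counters to store.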

Defining the Problem / Building Intuition / Idea 2: Leverage network routing
IP Hierarchy and Aggregation:
- Blocks (varying size) of contiguous addresses assigned to networks (e.g. AT&T, UCSD, Level3, etc.)
- Aggregated unit: prefix/mask (defined precisely in paper)
- E.g. 18.0.0.0/8 is a large prefix with 2^24 addresses
- Smaller blocks are further sub-delegated (“smaller” prefixen)
- Routers exchange aggregated prefixes, perform per-packet longest-match forwarding to get packet closer to destination
Implication: There’s an existing source of rich data, e.g. [Balachandar & Wang]
For example...
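
The prefix arithmetic and longest-match rule can be sketched with Python's standard `ipaddress` module. The route table below is hypothetical (only 18.0.0.0/8 appears in the slides), and the linear scan in `longest_match` is for clarity only; routers typically use trie-like structures for this lookup.

```python
import ipaddress

def longest_match(ip, prefixes):
    """Return the most specific prefix in `prefixes` containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for p in prefixes:
        net = ipaddress.ip_network(p)
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best

# A /8 leaves 32 - 8 = 24 host bits, hence 2**24 addresses.
print(ipaddress.ip_network("18.0.0.0/8").num_addresses == 2**24)

# Hypothetical table: a default route, a /8, and a sub-delegated /16.
table = ["0.0.0.0/0", "18.0.0.0/8", "18.26.0.0/16"]
print(longest_match("18.26.0.230", table))  # the /16 is the most specific match
print(longest_match("18.1.1.1", table))     # only the /8 and the default match
```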

Defining the Problem / Building Intuition / Learning Structure
Idea 2: Leverage network routing
Email server example:
[Figure: the IP address space (0 to 2^32) divided into routed prefixes labeled Sprint, AT&T, and Qwest, with further blocks labeled Seaworld, Qualcomm, and Hotel]
Issues:
- Inferior to more sophisticated approaches
- Even if readily available, typically at wrong granularity
- Similar problems in using registry databases
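
A minimal sketch of Idea 2, assuming a list of routed prefixes is available (for instance from a BGP table dump). Each labeled sample is assigned to its longest-matching prefix, and a per-prefix spam rate is computed. The function name `rates_by_routed_prefix`, the routes, and the labels are all made up for illustration.

```python
from collections import defaultdict
import ipaddress

def rates_by_routed_prefix(samples, routes):
    """Group labeled samples by their longest-matching routed prefix.

    samples: iterable of (dotted_quad, is_spam) pairs.
    routes: iterable of prefix strings, standing in for a routing table.
    Returns {prefix_string: fraction of its samples labeled spam};
    samples that match no route are grouped under None.
    """
    nets = sorted((ipaddress.ip_network(r) for r in routes),
                  key=lambda n: n.prefixlen, reverse=True)
    counts = defaultdict(lambda: [0, 0])  # prefix -> [spam count, total count]
    for ip, is_spam in samples:
        addr = ipaddress.ip_address(ip)
        # nets is sorted most-specific first, so the first hit is the longest match.
        match = next((str(n) for n in nets if addr in n), None)
        counts[match][0] += int(is_spam)
        counts[match][1] += 1
    return {p: s / t for p, (s, t) in counts.items()}

routes = ["10.0.0.0/8", "10.64.0.0/10"]  # made-up routes
samples = [("10.1.2.3", False), ("10.2.3.4", False),
           ("10.65.0.1", True), ("10.100.9.9", True)]
rates = rates_by_routed_prefix(samples, routes)
print(rates)
```

Here the variable-length prefixes separate samples that a static /8 split would have mixed. The granularity issue listed above still applies: the routed prefixes are whatever operators announce, which need not match the structure of the property being predicted.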

Defining the Problem / Takeaways / How to Best Learn/Exploit Structure?
- Temptation to formulate network task into a learning problem (i.e. use out-of-the-box “black-box” algorithms)
  - Often suboptimal, e.g. how to set thresholds, regularization parameter, kernel, etc.?
- How about Internet-specific learning algorithms?
  - Leverage domain-specific knowledge
  - Learn in a way amenable to non-stationary environment, on-line directed learning
As Important:
- Must be fast (ideally suitable for Internet core / high-speed routers)
- Memory efficient (think FIBs not RIBs)

Exploiting Network Structure / Refining the Problem / Data Set
Latency Data Set
Live RTT Measurements:
- Reference data set drawn from live Internet measurements
- Use round-trip latency as per-IP property (label)
- Note algorithm isn’t specific to latency prediction
- Latency is evocative of many structural properties (e.g. latencies of sub-networks are often a function of the network to which they belong)
[Figure: a measurement agent pings hosts IP 1, IP 2, ..., IP N and records RTT 1, RTT 2, ..., RTT N]

Exploiting Network Structure / Refining the Problem / Data Set
- Find: 30,000 random Internet hosts responding to ping
- Gather: Average latency to each over 5 pings
- Several modes, non-trivial distribution
[Figure: histogram of latency; x-axis: Latency (ms), 0 to 500; y-axis: Probability, 0 to 0.07]
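
The labeling step and the histogram can be sketched as follows. This is only an illustration: the RTT values are synthetic, the host names are placeholders, and the 10 ms bin width is an assumption, since the slides do not state how the histogram was binned.

```python
from collections import Counter

def mean_rtt(pings_ms):
    """Average round-trip time over a host's successful pings."""
    return sum(pings_ms) / len(pings_ms)

def latency_histogram(latencies_ms, bin_width_ms=10.0):
    """Return {bin_start_ms: probability}, with probabilities summing to 1."""
    bins = Counter(int(x // bin_width_ms) * bin_width_ms for x in latencies_ms)
    n = len(latencies_ms)
    return {start: count / n for start, count in sorted(bins.items())}

# Synthetic RTTs (ms) for three imaginary hosts, five pings each.
hosts = {
    "host-a": [20, 22, 21, 19, 23],
    "host-b": [24, 26, 25, 25, 25],
    "host-c": [180, 185, 190, 175, 170],
}
averages = [mean_rtt(p) for p in hosts.values()]
hist = latency_histogram(averages)
print(averages)
print(hist)
```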

Exploiting Network Structure / Refining the Problem / Black-box Performance
Let’s try out-of-the-box SVM regression:
- Predict latency to unknown destinations
- With lots of tuning, performs reasonably well; several insights from feature selection
- 75% within 30%
- Points within yellow lines represent good predictions
[Figure: scatter plot; x-axis: Measured Latency (ms), 0 to 500; y-axis: Predicted Latency (ms), 0 to 500]
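
One plausible reading of "75% within 30%" is that 75% of predictions fall within 30% of the measured latency. Under that assumption (the slides do not define the metric), the score could be computed as below. The function name `fraction_within` and the sample values are illustrative, not results from the talk.

```python
def fraction_within(measured, predicted, tolerance=0.30):
    """Fraction of predictions whose relative error is at most `tolerance`.

    Relative error is |predicted - measured| / measured, so measured
    values must be positive (latencies in ms).
    """
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    hits = sum(1 for m, p in zip(measured, predicted)
               if abs(p - m) <= tolerance * m)
    return hits / len(measured)

# Made-up values: three of the four predictions are within 30%.
measured = [100.0, 200.0, 50.0, 400.0]
predicted = [120.0, 290.0, 55.0, 300.0]
score = fraction_within(measured, predicted)
print(score)
```

A relative tolerance of this kind corresponds to a band between two lines through the origin on a predicted-versus-measured scatter plot.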
