data mining based detection methods
play

Data Mining Based Detection Methods Data Mining in Intrusion - PDF document

Outline Related Data Mining Background Data Mining Based Detection Methods Data Mining in Intrusion detection Feng Pan Outline Typical Dataset in Data Mining Related Data Mining Background Dataset consists of records.


  1. Outline � Related Data Mining Background Data Mining Based Detection Methods � Data Mining in Intrusion detection Feng Pan Outline Typical Dataset in Data Mining � Related Data Mining Background � Dataset consists of records. � Pattern Mining � Records consist of a set of items. � Association Pattern r1 host_IP=“10.11.0.1”, � Sequence Pattern host_port>1100, flag=“SF” r2 flag=“SF” � Association Rules r3 host_IP=“10.11.0.1”, � Classification host_port>1100, duration>100ms, bytes>1KB r4 flag=“SF”,service=“telnet”, � Decision Tree bytes>1KB � CBA: Classification based on association rules r5 host_IP=“10.11.0.1”, flag=“SF”, duration>100ms, bytes>1KB r6 host_IP=“10.11.0.1”, host_port>1100, flag=“SF” Pattern Mining Association Rules � From the association patterns, we can get � Association Patterns association rules (LHS->RHS) A set of items frequently occur, order � cf is an association pattern, then we get the rule as doesn’t matter. flag=“SF” -> bytes>1KB, Pattern1: host_IP=“10.11.0.1”, host_port>1100, � Two measurements for the goodness of rules Pattern 2: flag=“SF”, bytes>1KB � Support: number of records containing ( ) Pattern 3: duration>100ms, bytes>1KB LHS � RHS support ( flag=“SF” -> bytes>1KB )=2, � Confidence: number of records containing ( ) / LHS � RHS number of records containing ( ) LHS confidence ( flag=“SF” -> bytes>1KB )=40%

  2. Classification CBA •Add a new Column: Class Label r1 host_IP=“10.11.0.1”, normal � Many different classifications host_port>1100, flag=“SF” •Records are labeled by user using domain knowledge. � Decision Tree r2 flag=“SF” normal •Class Label can be considered as a � Neural Network special feature. r3 host_IP=“10.11.0.1”, abnormal host_port>1100, duration>100ms, bytes>1KB •We find association rule � CBA (Classification based on Association duration>100ms, bytes>1KB ->abnormal r4 flag=“SF”,service=“telnet”, normal Rules) support =2, confidence =100% bytes>1KB � CBA is constructed based on the r5 host_IP=“10.11.0.1”, abnormal •Then a simple rule of the classifier is flag=“SF”, duration>100ms, bytes>1KB association rules. duration>100ms, bytes>1KB ->abnormal r6 host_IP=“10.11.0.1”, abnormal host_port>1100, flag=“SF” CBA CBA r1 host_IP=“10.11.0.1”, normal •We can find many rules with different � Given an unknown record, apply the rules in order. host_port>1100, flag=“SF” support and confidence � “host_IP=“10.11.0.1”, host_port>1100, flag=“SF””: duration>100ms, bytes>1KB ->abnormal r2 flag=“SF” normal apply rule 3 -> classify as class abnormal : support =2, confidence =100% host_IP=“10.11.0.1”, host_port>1100 ->normal r3 host_IP=“10.11.0.1”, abnormal � “host_IP=“10.11.0.1”, host_port>1100, flag=“SF”, service=“telnet””: host_port>1100, : support =2, confidence =66% duration>100ms, bytes>1KB apply rule2 -> classify as class normal flag=“SF”,service=“telnet”->normal : support =1, confidence =100% � rule 3 can also apply to it, but rule2 has higher support and confidence r4 flag=“SF”,service=“telnet”, normal bytes>1KB � “host_port>1100, service=“telnet”, duration>100ms” r5 host_IP=“10.11.0.1”, abnormal •Sorting all the rules according to their : no rule can apply, then classify to default class flag=“SF”, duration>100ms, bytes>1KB confidence and support. � Random r6 host_IP=“10.11.0.1”, abnormal 1) duration>100ms, bytes>1KB ->abnormal � Majority class host_port>1100, flag=“SF” 2) flag=“SF”,service=“telnet”->normal 3) host_IP=“10.11.0.1”, host_port>1100 ->normal Outline Why Data Mining? � The dataset is large. � Why Data Ming � Challenge for Data Mining in intrusion � Constructing IDS manually is expensive detection. and slow. � Two layers to use Data Mining � Update is frequent since new intrusion � Mining in the connection data occurs frequently. � Mining in the alarm records.

  3. Can Data Mining work? Challenge in feature selection � Challenges for Data Mining in building IDS � Many features in the connection records, relevant or irrelevant. � Develop techniques to automate the � Automatic detection (classifiers) are sensitive to processing of knowledge-intensive feature features. Missing of key features for some attack may selection. result worst performance � Customize the general algorithm to � The missing of “host_count” feature will make the IDS unable to incorporate domain knowledge so only detect DOS attack in the experiments on DARPAR data. relevant patterns are reported � Different attacks require different features � Compute detection models that are accurate � Some useful features are not in the original data and efficient in run-time Challenge in Pattern Mining Challenge in Building Models � Large amount of patterns can be found in the � Single model is not able to capture all dataset. System may be overwhelmed. type of attacks. � For different attacks, pattern mining shall focus � An ideal model consists of several light on different feature subsets. weighted models each of which focuses on its own aspects. � For sequence patterns, different attacks has different optimal window size. Mining in the data Framework of Building IDS � Step1: Preprocessing. Summarize the raw data. � Tow kinds of datset. � Step2: Association Rule Mining. � Network based dataset � Step3: Find sequence patterns (Frequent � Host based dataset Episodes) based on the association rules. � Step4: Construct new features based on the � Build IDS by mining in the records. sequence patterns. � When find attacks, give alarms to � Step5: Construct Classifiers on different set of features administration system.

  4. Preprocessing Association Rule Mining � To summarize raw data to high level event, e.g. � Customizing: Only report important and a network connection (network based data) or relevant patterns. host session (host based data). � Define relevant features, reference features. � Bro and NFR can be used to do the summarizing. � Bro policy script � const ftp_guest_ids = { "anonymous", "ftp", "guest", } &redef; � Pattern must contain relevant features or redef ftp_guest_ids += { "visitor", "student" }; reference features. Otherwise, the redef ftp_guest_ids -= "ftp"; � NFR N-Codes patterns are not interesting. Association Rule Mining Sequence Pattern Mining � Example on the Shell Command Data. � Frequent Episodes. Shell Command Data is a host based dataset � X,Y->Z, [ c , s , w ] � With the existence of itemset X and Y, Z will occur in time w . � example � Different window size may generate different results. � Window size is related to attack type. � DOS: w=2 sec � slow Probing: w=1 min Feature Construction Build Model (classifier) � Construct new feature according to the � Build different classifiers for different frequent episode. attacks. � Some features will show close relationship to each other. Then combine the features. is no is no is � Some frequent episode may indicate Classfier1 Classfier2 Classfier3 DOS? Probing? R2L? yes yes interesting new features. yes DOS Probing R2L no Normal ……..

  5. The DARPA data Preprocessing � 4G compressed tcpdump data of 7 weeks of � Use Bro script to summarize the raw data to network traffics. records for each connection. � Contains 4 main categories of attacks � Each connection contains some “intrinsic” features. � DOS : denial of service, e.g., ping-of-death, syn flood � t ime , duration , service , src_host , dst_host , src_port , � R2L : unauthorized access from a remote machine, e.g., guessing password wrong_fragment , flag � wrong_fragment : e.g., fragment size is not multiple of � U2R : unauthorized access to local super user 8 bytes, fragment offset are overlapped privileges by a local unprivileged user, e.g., buffer overflow � flag : how the connection is established and terminated � PROBING : e.g., port-scan, ping-sweep Build training data Feature Construction � Normal data set: � Time-based “traffic” features: can detect DOS and PROBING attacks. � randomly extract sequences of normal � example connections records � Data set for each attach type: � “same host” � Exam only the connections in the past 2 seconds that have � extract all the records that fall within a the same dst_host as the current one � Features: surrounding time window of plus and minus 5 count , percentage of same service , percentage of different minutes of the whole duration of each attack service , percentage of S0 flag , percentage of rejected connection flag . Feature Construction Feature Construction � Some slow probing attack need a larger � “same service” window size � Exam only the connections in the past 2 � Host-based “traffic” features seconds that have the same service as the current one � Instead of the time window of 2 seconds, use a connection window of 100 connections. � Features : � Same set of features on the connection count , percentage of different dst_host , window. percentage of S0 flag , percentage of rejected � Can detect slow Probing . connection flag .

Recommend


More recommend