Alert classification to reduce false positives in intrusion detection
PhD Defense Presentation
Tadeusz Pietraszek /ˈtʌ·deuʃ pɪe·ˈtrʌ·ʃek/
tadek@pietraszek.org
Albert-Ludwigs-Universität Freiburg, Fakultät für Angewandte Wissenschaften
Dec 5, 2006
Thesis Statement
A thesis at the intersection of machine learning and computer security:
1. Using machine learning, it is possible to train classifiers in the form of human-readable classification rules by observing the human analyst.
2. Abstaining classifiers can significantly reduce the number of misclassified alerts at an acceptable abstention rate and are useful in intrusion detection.
3. Combining supervised and unsupervised learning in a two-stage alert-processing system forms a robust framework for alert processing.
Outline
• Background and problem statement.
1. Adaptive learning for alert classification.
2. Abstaining classifiers.
3. Combining supervised and unsupervised learning.
• Summary and conclusions.
Intrusion Detection Background
• Intrusion Detection Systems (IDSs) [And80, Den87] detect intrusions, i.e., sets of actions that attempt to compromise the integrity, confidentiality, or availability of a computer resource [HLMS90].
• IDSs have to be effective (detect as many intrusions as possible) and keep false positives at an acceptable level; however, in real environments 95–99% of alerts are false positives [Axe99, Jul01, Jul03].
• Eliminating false positives is a difficult problem:
– intrusions may differ only slightly from normal actions (IDSs have limited context-processing capabilities),
– writing a good signature is a difficult task (specific vs. general),
– actions considered intrusive in one system may be normal in another,
– viewed as a statistical problem, it suffers from the base-rate fallacy (see the worked example below).
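A short worked illustration of the base-rate fallacy, with invented but typical numbers in the spirit of [Axe99], not taken from the thesis: suppose intrusions have prior probability $P(I) = 10^{-4}$ and the IDS has a perfect detection rate and a false-positive rate of $10^{-2}$. By Bayes' rule,

$$P(I \mid \text{alert}) = \frac{P(\text{alert} \mid I)\,P(I)}{P(\text{alert} \mid I)\,P(I) + P(\text{alert} \mid \neg I)\,P(\neg I)} = \frac{1 \cdot 10^{-4}}{1 \cdot 10^{-4} + 10^{-2}\,(1 - 10^{-4})} \approx 0.01,$$

i.e., even this very good detector produces roughly 99 false alerts for every true one.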
Global picture – IDS monitoring
• Manual knowledge acquisition is not used for classifying alerts.
– Fact 1: There is a large database of historical alerts.
– Fact 2: The analyst typically analyzes alerts in real time.
Problem statement
• Given:
– a sequence of alerts (A_1, A_2, …, A_i, …) in an alert log L,
– a set of classes C = {C_1, C_2, …, C_n},
– an intrusion detection analyst O sequentially and in real time assigning classes to alerts,
– a utility function U describing the value of a classifier to the analyst O.
• Find:
– a system classifying alerts that maximizes the utility function U, taking into account:
• misclassified alerts,
• the analyst's workload,
• abstentions.
Outline
• Background and problem statement.
1. Adaptive learning for alert classification.
2. Abstaining classifiers.
3. Combining supervised and unsupervised learning.
• Summary and conclusions.
ALAC (Adaptive Learner for Alert Classification)
• Automatically learns an alert classifier from the analyst's feedback using machine learning techniques.
[Diagram: ALAC in recommender mode. IDS alerts are labeled by the alert classifier and passed, together with the proposed classification, to the analyst; the analyst's feedback and background knowledge form training examples, from which the machine learning component updates the classification rules.]
• Recommender mode: the utility function focuses on misclassifications.
ALAC (Adaptive Learner for Alert Classification)
[Diagram: ALAC in agent mode. The same learning loop as in recommender mode, but alerts the classifier labels with sufficient confidence are processed autonomously; only low-confidence alerts are forwarded to the analyst.]
• Agent mode: the utility function focuses on misclassifications and the analyst's workload.
Why does learning work and why can it be difficult?
• The approach hinges on two assumptions:
– analysts are able to classify most alerts correctly,
– it is possible to learn a classifier from historical alerts.
• It is a difficult learning problem; the system must:
1. use the analyst's feedback (learning from training examples),
2. generate rules in a human-readable form (so that their correctness can be verified),
3. be efficient on large data sets,
4. use background knowledge,
5. assess the confidence of its classifications,
6. work with skewed class distributions / misclassification costs,
7. adapt to environment changes.
Requirements – revisited
1. Core algorithm: RIPPER.
2. Rules in readable form.
3. Efficient on large datasets.
4. Background knowledge represented in attribute–value form.
5. Confidence: rule performance on testing data with Laplace correction (see the sketch below).
6. Cost sensitivity: weighted examples.
7. Incremental learning: a "batch incremental" approach, where the batch size depends on the current classification accuracy.
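A minimal sketch of the Laplace-corrected rule confidence mentioned in item 5; the function name and the two-class default are illustrative assumptions, not code from the thesis:

```python
def laplace_confidence(covered_correct: int, covered_total: int,
                       n_classes: int = 2) -> float:
    """Laplace-corrected confidence of a rule: (p + 1) / (t + k), where
    p = correctly classified examples covered by the rule on test data,
    t = all examples it covers, k = number of classes. The correction
    pulls estimates for rules with little coverage towards 1/k instead
    of an overconfident 0 or 1."""
    return (covered_correct + 1) / (covered_total + n_classes)

# A rule covering only 2 test examples, both correct, gets 0.75, not 1.0:
assert laplace_confidence(2, 2) == 0.75
# A rule with broad, accurate coverage approaches its raw accuracy:
print(laplace_confidence(98, 100))  # ~0.971
```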
Results – Thesis Statement (1)
• Adaptive Learner for Alert Classification (ALAC): human feedback, background knowledge, ML techniques.
– Recommender mode (focusing on misclassifications in the utility function U): good performance, fn = 0.025, fp = 0.038 (DARPA); fn = 0.003, fp = 0.12 (Data Set B).
– Agent mode (focusing on misclassifications and the workload in the utility function U): a similar number of misclassifications, while more than 66% of false positives are discarded automatically.
– Many rules are interpretable.
Outline
• Background and problem statement.
1. Adaptive learning for alert classification.
2. Abstaining classifiers.
3. Combining supervised and unsupervised learning.
• Summary and conclusions.
Metaclassifier A_{α,β}
• An abstaining binary classifier A is a classifier that in certain cases can refrain from classification. We construct it from two binary classifiers C_α, C_β as follows:

$$A_{\alpha,\beta}(x) = \begin{cases} + & \text{if } C_\alpha(x) = + \\ ? & \text{if } C_\alpha(x) = - \,\wedge\, C_\beta(x) = + \\ - & \text{if } C_\beta(x) = - \end{cases}$$

where C_α, C_β are such that

$$\forall x:\; \big(C_\alpha(x) = + \Rightarrow C_\beta(x) = +\big) \,\wedge\, \big(C_\beta(x) = - \Rightarrow C_\alpha(x) = -\big),$$

so the combination C_α(x) = + ∧ C_β(x) = − is impossible. (These are the conditions used by Flach & Wu [FW05] in their work on repairing concavities of ROC curves; they are met in particular if C_α, C_β are constructed from a single scoring classifier R.)
• Can we optimally select C_α, C_β?
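A minimal sketch of this construction from a single scoring classifier R with two thresholds (function and variable names are illustrative, not from the thesis); choosing t_α ≥ t_β guarantees the condition C_α(x) = + ⇒ C_β(x) = +:

```python
def make_abstaining_classifier(score, t_alpha: float, t_beta: float):
    """Build the metaclassifier A_{alpha,beta} from one scoring function.
    C_alpha(x) = + iff score(x) >= t_alpha (the stricter classifier);
    C_beta(x)  = + iff score(x) >= t_beta  (the more liberal one).
    With t_alpha >= t_beta, C_alpha's '+' implies C_beta's '+', so the
    'impossible' combination (C_alpha = +, C_beta = -) never occurs."""
    assert t_alpha >= t_beta
    def classify(x) -> str:
        s = score(x)
        if s >= t_alpha:
            return "+"   # both classifiers say +
        if s >= t_beta:
            return "?"   # C_alpha says -, C_beta says +: abstain
        return "-"       # both classifiers say -
    return classify

# Instances scoring in [0.4, 0.7) fall into the abstention window:
clf = make_abstaining_classifier(lambda x: x, t_alpha=0.7, t_beta=0.4)
print(clf(0.9), clf(0.5), clf(0.1))  # -> + ? -
```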
“Optimal” Metaclassifier A_{α,β}
• How do we compare binary classifiers and abstaining classifiers? How do we select an optimal classifier?
• No clear answer:
– use a cost-based model (an extension of [Tor04]), or
– use boundary conditions:
• a maximum number of instances classified as “?” (bounded-abstention model),
• a maximum misclassification cost (bounded-improvement model).
Cost-based model – a simulated example
[Figure: three panels. Two surface plots of the misclassification cost for different combinations of the operating points of A and B, and an ROC curve with the two optimal classifiers A and B marked.]
• For a 2×3 cost matrix, the optimal operating points fp_α and fp_β lie where the ROC curve has the slopes

$$f'_{ROC}(fp_\beta) = \frac{c_{23}}{c_{12} - c_{13}} \cdot \frac{N}{P}, \qquad f'_{ROC}(fp_\alpha) = \frac{c_{21} - c_{23}}{c_{13}} \cdot \frac{N}{P}.$$
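A small numeric illustration of these slope conditions. The cost values and class ratio below are invented, and the cost-matrix convention (c_12 = false-negative cost, c_21 = false-positive cost, c_13/c_23 = abstention costs on positives/negatives) is an assumption for illustration:

```python
# Illustrative costs (not from the thesis): missing an intrusion (c12)
# is expensive, a false alert (c21) is cheap, and abstaining (c13, c23)
# costs a fixed amount of analyst attention.
c12, c21, c13, c23 = 10.0, 1.0, 0.5, 0.5
N_over_P = 9.0  # negatives outnumber positives 9:1

slope_beta = c23 / (c12 - c13) * N_over_P    # target ROC slope at fp_beta
slope_alpha = (c21 - c23) / c13 * N_over_P   # target ROC slope at fp_alpha
print(f"f'(fp_beta) = {slope_beta:.2f}, f'(fp_alpha) = {slope_alpha:.2f}")
# f'(fp_beta) = 0.47, f'(fp_alpha) = 9.00: two distinct operating points,
# with the abstention window between them.
```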
Bounded models
• Problem: a 2×3 cost matrix is not always given and would then have to be estimated; however, the resulting classifier is very sensitive to c_13 and c_23.
• Goal: find other optimization criteria for an abstaining classifier using only a standard cost matrix.
– Calculate the misclassification cost per classified instance.
• Follow the same reasoning to find the optimal classifier.
Bounded models equations
• We obtain the following equations, determining the relationship between k and rc as functions of the classifiers C_α, C_β:

$$rc = \frac{FP_\alpha\, c_{21} + FN_\beta\, c_{12}}{(1-k)(N+P)}, \qquad k = \frac{(FP_\beta - FP_\alpha) + (FN_\alpha - FN_\beta)}{N+P}$$

– Constrain k, minimize rc → bounded-abstention model.
– Constrain rc, minimize k → bounded-improvement model.
• There is no algebraic solution; however, for a convex ROCCH we can show an efficient algorithm.
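A direct transcription of the two equations above into code; the counts are toy values, and the cost convention (c_21 = false-positive cost, c_12 = false-negative cost) is carried over from the previous slides:

```python
def abstention_rate_and_cost(FP_a, FN_a, FP_b, FN_b, N, P, c21, c12):
    """Abstention rate k and misclassification cost per classified
    instance rc for the metaclassifier A_{alpha,beta}.
    FP_a/FN_a are the false positive/negative counts of C_alpha,
    FP_b/FN_b those of C_beta; the nesting condition implies
    FP_b >= FP_a and FN_a >= FN_b."""
    k = ((FP_b - FP_a) + (FN_a - FN_b)) / (N + P)
    rc = (FP_a * c21 + FN_b * c12) / ((1 - k) * (N + P))
    return k, rc

# Toy counts: C_alpha is strict (few FPs), C_beta liberal (few FNs).
k, rc = abstention_rate_and_cost(FP_a=5, FN_a=30, FP_b=40, FN_b=10,
                                 N=900, P=100, c21=1.0, c12=10.0)
print(f"k = {k:.3f}, rc = {rc:.4f}")  # k = 0.055, rc = 0.1111
```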
Bounded-abstention model
• Among classifiers abstaining on no more than a fraction k_MAX of instances, find the one that minimizes rc.
• A useful application is real-time processing, where the non-classified instances are handled by another classifier (e.g., the human analyst) with a limited processing speed.
• Algorithm: a three-step derivation (a brute-force baseline is sketched after this list).
– Step 1: Show an (impractical) solution for a smooth ROCCH and the equality k = k_MAX.
– Step 2: Extend it to the inequality k ≤ k_MAX.
– Step 3: Derive an algorithm for the ROCCH.
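For concreteness, a brute-force baseline for the bounded-abstention model; this is not the efficient ROCCH algorithm from the thesis, and all names and data are illustrative. It enumerates threshold pairs of a single scoring classifier and keeps the pair with the smallest rc whose abstention rate stays within k_max:

```python
import numpy as np

def best_bounded_abstention(scores, labels, k_max, c21=1.0, c12=10.0):
    """O(n^2) search over threshold pairs (t_beta <= t_alpha) of one
    scoring classifier, minimizing rc subject to k <= k_max."""
    thresholds = np.unique(scores)
    pos, neg = labels == 1, labels == 0
    P, N = pos.sum(), neg.sum()
    best = None
    for t_beta in thresholds:
        for t_alpha in thresholds[thresholds >= t_beta]:
            FP_a = (neg & (scores >= t_alpha)).sum()  # FPs of C_alpha
            FN_a = (pos & (scores < t_alpha)).sum()   # FNs of C_alpha
            FP_b = (neg & (scores >= t_beta)).sum()   # FPs of C_beta
            FN_b = (pos & (scores < t_beta)).sum()    # FNs of C_beta
            k = ((FP_b - FP_a) + (FN_a - FN_b)) / (N + P)
            if k > k_max or k == 1:
                continue
            rc = (FP_a * c21 + FN_b * c12) / ((1 - k) * (N + P))
            if best is None or rc < best[0]:
                best = (rc, k, t_alpha, t_beta)
    return best  # (rc, k, t_alpha, t_beta)

rng = np.random.default_rng(0)
labels = (rng.random(500) < 0.1).astype(int)           # ~10% positives
scores = np.clip(rng.normal(0.2 + 0.6 * labels, 0.2), 0, 1)
print(best_bounded_abstention(scores, labels, k_max=0.10))
```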