Detecting Hidden Anomalies in DNS Communication CZ.NIC Ondrej Mikle-Barat / ondrej.mikle@nic.cz Karel Slaný / karel.slany@nic.cz 18. 10. 2011 1
Outline ● Motivation ● Method description – original work – algorithm – DNS specifics ● Experiments – set-up – results ● Conclusion 2
Motivation ● Most internet communication starts with a DNS query. – Communication can therefore be tracked at a certain level of the DNS hierarchy, e.g. for intrusion detection or botnet discovery. ● We want a tool that: – detects suspicious behaviour – scans high-volume traffic – detects low-volume anomalies – works in real time, i.e. has a low computation cost – does not need any initial knowledge about the analysed traffic ● Will the tool be able to detect something at a ccTLD? 3
Original Work Extracting Hidden Anomalies using Sketch and Non Gaussian Multiresolution Statistical Detection Procedures by G. Dewaele, K. Fukuda, P. Borgnat, P. Abry, K. Cho ● Blindly analyses large-scale packet trace databases. ● Able to detect short-lived anomalies as well as longer ones. ● Detection method is sensitive to statistical characteristics. ● Promises a very low computation cost. 4
Method Description ● The algorithm analyses the traffic within a sliding time-window. ● The analysis iterates over the following steps: 1) random projection (sketches) 2) data aggregation 3) Gamma distribution estimation 4) reference values computation 5) distance from reference evaluation 6) sketch combination and anomaly identification 5
Random Projections ● A fixed size time-window of captured traffic is split into sketches using a hash function. ● Selected packet attribute (policy) serves as hash key. ● Hash table size is fixed. 6
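The random projection step can be illustrated with a minimal Python sketch. The hash function (salted BLAKE2b), the function names and the `(timestamp, key)` packet representation are assumptions made for illustration only; the slides do not say which hash the tool uses.

```python
import hashlib
from collections import defaultdict

def sketch_index(key, seed, table_size):
    """Hypothetical seeded hash: map a hash key to a sketch index in [0, table_size)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % table_size

def split_into_sketches(packets, seed, table_size):
    """Split one time-window of (timestamp_seconds, key) pairs into sketches.

    Returns a dict mapping sketch index -> list of packet timestamps.
    """
    sketches = defaultdict(list)
    for timestamp, key in packets:
        sketches[sketch_index(key, seed, table_size)].append(timestamp)
    return sketches
```

All packets that share a key (for example the same source address) always fall into the same sketch for a given seed, while changing the seed gives a different mapping.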
Aggregation, Gamma Distribution Parameters ● The sketches are aggregated jointly over a collection of aggregation levels to form series of packet counts that arrived during each aggregation period. – Aggregation levels transform the time-scale granularity. ● Data from the aggregated time series are modelled using a Gamma distribution. – Shape (α) and scale (β) Gamma distribution parameters are computed for each aggregation level. [Figure: example packet counts aggregated over levels 1 to 4, with Gamma parameters (α1, β1) to (α4, β4) estimated at each level.] 7
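A minimal Python sketch of the aggregation and parameter estimation follows. It assumes one-second base bins, aggregation by summing adjacent pairs (steps of 1, 2, 4, ... seconds) and method-of-moments estimates (α = mean²/variance, β = variance/mean). The estimator and all function names are assumptions; the slides do not state how the tool fits the Gamma parameters.

```python
def bin_counts(timestamps, window_start, window_seconds):
    """Count packets per one-second bin inside the time-window."""
    counts = [0] * window_seconds
    for t in timestamps:
        idx = int(t - window_start)
        if 0 <= idx < window_seconds:
            counts[idx] += 1
    return counts

def aggregate(counts, levels):
    """Return one series per level; each level sums adjacent pairs of the previous one."""
    series = [list(counts)]
    for _ in range(levels - 1):
        prev = series[-1]
        series.append([prev[i] + prev[i + 1] for i in range(0, len(prev) - 1, 2)])
    return series

def gamma_moments(series):
    """Method-of-moments (shape, scale) estimate; None if the series is degenerate."""
    n = len(series)
    if n == 0:
        return None
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if mean <= 0 or var <= 0:
        return None
    return mean * mean / var, var / mean
```

For example, `aggregate([1, 0, 2, 1, 0, 2, 1, 1], 3)` yields the per-second counts, then the two-second sums `[1, 3, 2, 2]`, then the four-second sums `[4, 4]`.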
Reference Values, Identification of Anomalous Sketches ● For each aggregation level, the standard sample mean and variance of the computed Gamma parameters are computed across all sketches. ● For each sketch, the average Mahalanobis distance to the 'centre of gravity' is computed. ● Sketches whose average distance exceeds a given threshold are marked as anomalous. [Figure: illustration with axes labelled α1 to α4 and β1 to β4.] 8
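The reference computation and thresholding might look like the Python sketch below. It handles a single parameter (for example the shape α) and assumes a diagonal Mahalanobis distance, i.e. the squared deviation from the per-level mean divided by the per-level sample variance, averaged over levels. The exact distance form used by the tool is not given on the slides, and all names here are hypothetical.

```python
def reference(values):
    """Sample mean and (n - 1) variance of one parameter at one level across sketches."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var

def average_distances(params):
    """params[s][j] is the parameter of sketch s at aggregation level j.

    Returns, per sketch, the squared standardised deviation averaged over levels.
    Requires at least two sketches.
    """
    levels = len(params[0])
    refs = [reference([p[j] for p in params]) for j in range(levels)]
    distances = []
    for p in params:
        total = 0.0
        for j, (mean, var) in enumerate(refs):
            if var > 0:  # a level with zero variance contributes no distance
                total += (p[j] - mean) ** 2 / var
        distances.append(total / levels)
    return distances

def anomalous_sketches(params, threshold):
    """Indices of sketches whose average distance exceeds the threshold."""
    return [s for s, d in enumerate(average_distances(params)) if d > threshold]
```

With four similar sketches and one outlier, only the outlier's average distance rises above a threshold such as 0.8.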
Anomaly Identification ● All packet attributes (hash keys) contained in an anomalous sketch are considered suspicious. ● Using a different hash function provides a different mapping into sketches and therefore a different set of anomalous sketches. ● A list of attributes corresponding to detected anomalies is obtained by combining the results for several hash functions and computing the intersection of anomalous sketches. 9
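The combination step can be sketched in Python as an intersection over hash functions: a key stays suspicious only if it lands in an anomalous sketch under every hash function. The seeded BLAKE2b hash and the function names are illustrative assumptions, not the tool's actual implementation.

```python
import hashlib

def sketch_index(key, seed, table_size):
    """Hypothetical seeded hash: map a key to a sketch index in [0, table_size)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % table_size

def suspicious_keys(keys, anomalous_by_seed, table_size):
    """Intersect, over all hash seeds, the keys that fall into anomalous sketches.

    anomalous_by_seed maps a hash seed to the set of anomalous sketch indices
    found with that hash function.
    """
    result = None
    for seed, bad_sketches in anomalous_by_seed.items():
        hits = {k for k in keys if sketch_index(k, seed, table_size) in bad_sketches}
        result = hits if result is None else result & hits
    return result if result is not None else set()
```

An innocent key that shares a sketch with an anomalous one under one hash function is unlikely to share it under all of them, so the intersection shrinks the suspect list as more hash functions are used.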
Modification for DNS ● The method was designed to analyse the whole TCP/IP traffic. – Works with TCP/IP connection identifiers (src/dst port/address). ● We extended it to meet DNS traffic specifics. ● Policies: – IP address policy: based on the original paper, uses the TCP/IP connection identifiers; supports IPv4 and IPv6; helps find suspicious traffic sources. – Query name policy: the first domain name of the query is extracted and used as the hash key; helps find suspicious traffic from legitimate sources. 10
The Tool ● The algorithm is implemented in C++. ● It is freely available at git://git.nic.cz/dns-anomaly/ – licensed under GPLv3 ● Command line parameters: – window size + detection interval – count of aggregation levels (aggregation steps are powers of 2 in seconds, i.e. 1, 2, 4, 8, ...) – analyse shape, scale or both – detection threshold – policy – hash function count – sketch count (hash table size) 11
Experiments ● Tested on DITL 2011 data collected in April 2011 on .cz authoritative DNS servers.
parameter              value
time-window size       10 minutes
detection interval     10 minutes
hash function count    25
hash table size        32
aggregation levels     8
distance threshold     0.8
12
Results Types of traffic labelled as anomalies: ● Traffic from legitimate sources (exhibiting specific patterns) – large recursive resolvers, web crawlers ● Domain enumeration – Blind or dictionary based (gTLD domain, prefix and postfix alteration for given words, e.g. bank or various trademarks) – With knowledge of the content (little or no NXDOMAIN replies) ● Suspicious – Traffic generated by broken resolvers or testing scripts, e.g. bursts of queries for the same name from a single host – Repeated queries due to short TTL 13
Generic Traffic ● Recursive resolver (srcIP policy): Originates at webhosting/ISP. The pattern is very regular with a period of approximately 12 seconds. ● Web crawler farm (srcIP policy): Possibly web crawlers. They generate lots of queries whenever they encounter sites with many references. 14
Domain Enumeration ● Blind domain enumeration (srcIP policy): When analysing the DNS queries a pattern emerged: prefix and postfix variations using well-known trademarks. ● Known domain enumeration (srcIP policy): The source must have a very good knowledge of the content of the domain. Very few NXDOMAIN replies are generated. 15
Other Suspicious ● Broken resolver (srcIP policy): Hundreds of queries for a single record are generated in less than two seconds. ● Possible spam attack (qname policy): Multiple hosts are querying the same MX record. ● ??? (qname policy): Multiple hosts evenly distributed around the world are generating bursts of queries for the same record. The pattern is visible throughout the entire tested period, always as characteristic spikes. 16
Conclusion ● The tool is able to pinpoint low- and high-volume anomalies. ● Two policies implemented with different effect: – IP policy serves best for domain enumeration detection. – Query name policy divulges domain-related events, e.g. presence of short TTL domains (fast flux). ● The classification of the anomalies is currently left to be done manually. – Future work: automate this process. 17
The End Thank you for your attention. Questions? 18