Methodology and tools to analyze DITL DNS data


  1. Methodology and tools to analyze DITL DNS data
     Sebastian Castro, secastro@caida.org, CAIDA
     9th CAIDA/WIDE workshop, January 2008

  2. Process overview
     • Data collection
     • Trace curation
     • Trace merging
     • Trace analysis
     • Plotting
     (The slide's flow diagram shows root server operators uploading traces to the OARC fileserver, curation, merging, and analysis running on an analysis server and a CAIDA box with intermediate data kept on the fileserver, results loaded into a database, and graphs/aggregated data as the final output.)

  3. Data collection
     • Done by each operator based on CAIDA recommendations
       – http://www.caida.org/projects/ditl/
     • Using tcpdump, dnscap, etc., plus helper scripts (a capture sketch follows below)
     • All traces uploaded to the OARC file server
       – All further processing is done on OARC boxes due to data access restrictions.
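
To make the collection step concrete, here is a minimal libpcap sketch of what a tcpdump- or dnscap-style capture does: full-length port-53 packets written to a pcap file. This is an illustration only, not one of the DITL helper scripts; the interface name and output filename are placeholders.

```cpp
// Minimal capture sketch: full-length port-53 packets to a pcap file,
// roughly what "tcpdump -s 0 -w out.pcap port 53" does.
// "eth0" and "ditl-capture.pcap" are placeholders.
#include <pcap.h>
#include <cstdio>

int main() {
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t* handle = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
    if (!handle) { std::fprintf(stderr, "pcap_open_live: %s\n", errbuf); return 1; }

    // Restrict the capture to DNS traffic with a BPF filter.
    struct bpf_program filter;
    if (pcap_compile(handle, &filter, "port 53", 1, PCAP_NETMASK_UNKNOWN) == -1 ||
        pcap_setfilter(handle, &filter) == -1) {
        std::fprintf(stderr, "filter: %s\n", pcap_geterr(handle));
        return 1;
    }

    pcap_dumper_t* dumper = pcap_dump_open(handle, "ditl-capture.pcap");
    if (!dumper) { std::fprintf(stderr, "%s\n", pcap_geterr(handle)); return 1; }

    // Write every matching packet to the dump file until interrupted.
    pcap_loop(handle, -1, pcap_dump, reinterpret_cast<u_char*>(dumper));

    pcap_dump_close(dumper);
    pcap_close(handle);
    return 0;
}
```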

  4. Data verification
     • Verify trace completeness and integrity
       – Are there missing pieces?
       – Truncated packets?
       – Truncated gzip files?
       – Check clock skew
       – Count DNS queries, responses, IPv4 packets, TCP, UDP, etc. (see the sketch below)
     • Select the best dataset available, in terms of coverage
       – Coverage is defined as the number of packets seen versus the number of packets expected to be seen.
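
A sketch of the counting side of verification, under simplifying assumptions: Ethernet link type, IPv4/UDP only, and plain pcap input (gzipped traces would need to be decompressed first). It tallies IPv4 packets and DNS queries versus responses using the QR bit of the DNS header; this is the idea behind the checks, not the project's actual verification tool.

```cpp
// Counting sketch: IPv4 packets and DNS queries vs. responses in a trace.
// Assumes Ethernet framing and IPv4/UDP; VLAN tags, IPv6, and TCP DNS
// are ignored for brevity.
#include <pcap.h>
#include <cstdio>
#include <cstdint>

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s trace.pcap\n", argv[0]); return 1; }
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t* p = pcap_open_offline(argv[1], errbuf);
    if (!p) { std::fprintf(stderr, "%s\n", errbuf); return 1; }

    uint64_t ipv4 = 0, queries = 0, responses = 0;
    struct pcap_pkthdr* hdr;
    const u_char* pkt;
    while (pcap_next_ex(p, &hdr, &pkt) == 1) {
        if (hdr->caplen < 14 + 20 + 8 + 3) continue;      // eth + min IP + UDP + DNS flags
        if (pkt[12] != 0x08 || pkt[13] != 0x00) continue; // not IPv4
        ipv4++;
        unsigned ihl = (pkt[14] & 0x0f) * 4;              // IP header length
        if (pkt[14 + 9] != 17) continue;                  // not UDP
        if (hdr->caplen < 14 + ihl + 8 + 3) continue;     // truncated packet
        const u_char* udp = pkt + 14 + ihl;
        unsigned sport = (udp[0] << 8) | udp[1];
        unsigned dport = (udp[2] << 8) | udp[3];
        if (sport != 53 && dport != 53) continue;         // not DNS
        const u_char* dns = udp + 8;
        if (dns[2] & 0x80) responses++; else queries++;   // QR bit of DNS flags
    }
    std::printf("ipv4=%llu queries=%llu responses=%llu\n",
                (unsigned long long)ipv4, (unsigned long long)queries,
                (unsigned long long)responses);
    pcap_close(p);
    return 0;
}
```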

  5. Example of coverage
     (The slide shows a coverage plot; the image is not reproduced here.)

  6. Trace merging
     • Transform the original traces:
       – Split into homogeneous time intervals (1-hour chunks)
       – Correct clock skew (where known)
       – Translate destination addresses (see the sketch below)
         • All instances of the same root share the same IP address, so they are impossible to distinguish.
         • Some use private addresses internally.
         • Transform 192.33.4.12 (C-root) into 3.0.0.4 (3 represents C, 4 represents the instance number).
       – Filter other traffic:
         • DNS queries sent to other addresses on the same machine
         • DNS traffic generated by the machine itself (zone synchronization traffic)
         • Leave only queries
     • The goal is one file per hour with all instances included.
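
The destination-address translation can be illustrated with a small hypothetical helper: encode the root letter and instance number into a synthetic IPv4 address, so C-root instance 4 becomes 3.0.0.4. The function name is invented for illustration, and how each capture file maps to a (letter, instance) pair is assumed to come from trace metadata.

```cpp
// Hypothetical helper mirroring the rewrite described above; the real
// tool's interface may differ.
#include <cstdint>
#include <cstdio>

// A=1, B=2, C=3, ... in the first octet; instance number in the last.
uint32_t synthetic_root_addr(char root_letter, unsigned instance) {
    uint32_t letter_index = (uint32_t)(root_letter - 'A' + 1);
    return (letter_index << 24) | (instance & 0xff);
}

int main() {
    uint32_t a = synthetic_root_addr('C', 4); // C-root, instance 4
    std::printf("%u.%u.%u.%u\n",
                a >> 24, (a >> 16) & 0xff, (a >> 8) & 0xff, a & 0xff);
    // prints 3.0.0.4
    return 0;
}
```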

  7. DNS analyses
     • Analyses currently available:
       – Client and query rates per instance
       – AS/prefix coverage per instance
       – Distribution of queries by query type, both global and deaggregated by root and instance
       – Node/cloud switching per client
       – Source port distribution
       – EDNS support (by client and by query), EDNS buffer size
       – Invalid queries
       – Recursive queries
       – RFC 1918-sourced query counts

  8. Trace analysis tool
     • Reads pcap and pcap.gz files
     • Output as text files
       – SQL files to create the tables and load the data
       – Plain files with some statistics
     • Written in C/C++
     • Memory footprint: 300 MB to 6 GB
     • Uses patricia trees to implement routing-table lookups (see the sketch below)
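
To show the idea behind the patricia-tree lookup, here is an uncompressed binary-trie sketch of longest-prefix matching. A real patricia/radix tree collapses single-child chains for speed and memory; only the matching logic is kept here, and the example prefix entry is hypothetical.

```cpp
// Uncompressed binary-trie sketch of longest-prefix matching over IPv4.
#include <cstdint>
#include <cstdio>
#include <string>

struct Node {
    Node* child[2] = {nullptr, nullptr};
    bool is_prefix = false;
    std::string label;   // e.g. origin AS for this prefix
};

void insert(Node* root, uint32_t prefix, int len, const std::string& label) {
    Node* n = root;
    for (int i = 0; i < len; i++) {
        int bit = (prefix >> (31 - i)) & 1;
        if (!n->child[bit]) n->child[bit] = new Node();
        n = n->child[bit];
    }
    n->is_prefix = true;
    n->label = label;
}

// Walk the trie bit by bit, remembering the deepest prefix node passed.
const Node* lookup(const Node* root, uint32_t addr) {
    const Node* best = nullptr;
    const Node* n = root;
    for (int i = 0; i < 32 && n; i++) {
        if (n->is_prefix) best = n;
        n = n->child[(addr >> (31 - i)) & 1];
    }
    if (n && n->is_prefix) best = n;
    return best;
}

int main() {
    Node root;
    insert(&root, 0xC0210400, 24, "192.33.4.0/24 (hypothetical entry)");
    const Node* hit = lookup(&root, 0xC021040C); // 192.33.4.12
    std::printf("%s\n", hit ? hit->label.c_str() : "no match");
    return 0;   // nodes are leaked; fine for a sketch
}
```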

  9. Database
     • PostgreSQL
       – Usually one table per analysis (example below)
       – Not much work done on performance tuning
       – Gave us some problems with table access control
     • One database per dataset
       – Root traces 2007
       – Root traces 2006
       – ORSN 2007
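
As a hypothetical illustration of the one-table-per-analysis layout (the table and column names are assumed, not the project's actual schema):

```sql
-- Hypothetical shape of one analysis table: one row per instance
-- per 1-hour chunk.
CREATE TABLE query_rate (
    instance TEXT,        -- e.g. 'C-root #4'
    hour     TIMESTAMP,   -- start of the 1-hour chunk
    queries  BIGINT,
    clients  BIGINT
);
```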

  10. Data presentation
      • Some preprocessing/data aggregation done using Perl/AWK
      • Graphs generated using ploticus
      • Grouping of results could be done easily

  11. Process example: DITL 2007 analysis flow
      DNS traces (~740 GB) → data curation (weeks) → trace merging (18-30 hours) → merged traces (~160 GB) → trace analysis (2-3 days) → SQL dump (tables and data) plus text files → database loading (15-20 min) → plot & report generation (1-5 min) → PNG/EPS plots

  12. Recent improvements
      • Better performance
        – Replaced map with hash_map (unordered associative arrays) for a 40% performance gain (see the sketch below)
      • Simpler selection of which analyses to run
        – Via the command line
      • An object-oriented design
        – More organized code
        – Allows others to add analyses
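
A sketch of the container swap behind that gain: per-client query counting keyed by source address. In 2008-era GCC the hashed container was __gnu_cxx::hash_map from <ext/hash_map>; the standard equivalent, std::unordered_map, is shown here, and the counter name is invented for illustration.

```cpp
// Container-swap sketch: hashed lookups are O(1) on average vs. O(log n)
// for the ordered std::map; iteration order (which the counters do not
// need) is given up in exchange.
#include <unordered_map>
#include <cstdint>
#include <cstdio>

int main() {
    std::unordered_map<uint32_t, uint64_t> queries_per_client; // keyed by source IPv4
    queries_per_client[0xC0A80001]++;   // hypothetical client 192.168.0.1
    queries_per_client[0xC0A80001]++;
    std::printf("clients=%zu\n", queries_per_client.size());
    return 0;
}
```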

  13. What's next
      • Add new analyses:
        – Daily patterns by query type
        – Locality of queries by TLD
        – Improved criteria for invalid-query classification
        – IPv6-related traffic (queries and packets)
        – … put your desired analysis here …

  14. Conclusions
      • Having tools and procedures to collect and analyze the data makes things easier.
        – It made comparisons between 2006 and 2007 pretty straightforward.
      • The current tools cover the basics.
        – They are clearly subject to improvement and extension.
        – Performance could become an issue with larger datasets.
