Inferring the Source of Encrypted HTTP Connections
Michael Lin, CSE 544
Hiding your identity
• You can wear a mask, but some distinguishing characteristics are still visible:
  • Height
  • Weight
  • Hair
  • Clothing
• Even if everyone looked the same, we could still determine some things about people from their habits
  • People who go to school every day are probably students or teachers
  • If you follow a strict schedule every day (school, coffee shop, gym), you can be identified to some degree of accuracy
    • “There are 10 people who follow this exact schedule every day.”
Profiling
• How would you identify someone in a world of clones?
  • Determine their schedule
  • Determine their habits
• Profiling allows us to identify something without knowing what it is
Hiding your online identity
• Encryption will save us from prying eyes. Or will it?
• We can hide the header and contents of a packet behind encryption
• But can we still say something about the packet itself?
  • Packet size
  • Packet direction
• What about traffic patterns?
  • Packet arrival rate/distribution
HTTP traffic profiling
• Using only packet size and direction, create profiles from traces of HTTP traffic for certain websites
  • Instance: <packet size, direction>
  • Class: URL
• Create sets of instances for each class and use these sets to identify other traces to unknown sites (see the sketch below)
• These sets are surprisingly unique
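A minimal sketch of this representation in Python (my illustration, not the paper's code): each trace becomes a set of <packet size, direction> pairs, and each URL accumulates the profiles of its training traces.

    from collections import defaultdict

    def make_profile(trace):
        # trace: iterable of (packet_size, direction) pairs;
        # a profile is the set of distinct pairs observed in one trace.
        return frozenset(trace)

    # Library: URL (class) -> list of profiles from training traces.
    library = defaultdict(list)

    def add_training_trace(url, trace):
        library[url].append(make_profile(trace))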
Comparing HTTP traces
• Two relatively simple methods give a similarity rating between two sets
  • Jaccard’s coefficient: J(A, B) = |A ∩ B| / |A ∪ B| (sketched in code below)
    • Intuitive: identical sets score 1, disjoint sets score 0
  • Naive Bayes classifier (“Idiot’s Bayes”)
    • “Naive” because it assumes every event is independent
    • A surprisingly good indicator of similarity
• Important: you need something to compare against!
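Jaccard's coefficient in code, over the set profiles sketched above (the naive Bayes variant would instead combine per-feature likelihoods; its details are omitted here):

    def jaccard(a, b):
        # |A intersect B| / |A union B|: 1.0 for identical sets,
        # 0.0 for disjoint ones.
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)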
Collecting HTTP traces
• Gathered 100,000 URLs from DNS server logs
• Used Firefox to access the top 2,000 pages over an SSH tunnel, 4 times a day for 2 months
• Used tcpdump to collect header information from these connections
• Analyzed the logs to get packet length and direction for connections to each site (sketch below)
• Created a library of profiles for sites
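A hedged sketch of the log-analysis step, assuming the tcpdump captures are read back with scapy and that MONITOR_IP is the client-side tunnel address (both are my assumptions; the slides only say tcpdump was used):

    from scapy.all import IP, rdpcap

    MONITOR_IP = "192.0.2.1"  # hypothetical client-side address

    def trace_instances(pcap_path):
        # Turn one capture into a list of (packet size, direction) instances.
        instances = []
        for pkt in rdpcap(pcap_path):
            if IP not in pkt:
                continue
            # Direction: +1 for outbound (client -> server), -1 for inbound.
            direction = 1 if pkt[IP].src == MONITOR_IP else -1
            instances.append((len(pkt), direction))
        return instances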
This is where the magic happens
• Now we have two methods for comparing sets and a big library of site profiles
• Say we intercepted some encrypted HTTP traffic and want to guess where it’s going...
• Compare it against every site in the library to find the best match, or the best two, or ten (sketch below)
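Putting the pieces together, reusing make_profile and jaccard from the sketches above. Taking the max over a site's training profiles is my choice here; the paper may aggregate the training samples differently.

    def classify(trace, library, k=1):
        # Rank every site in the library by its best Jaccard score
        # against the intercepted trace; return the top k guesses.
        target = make_profile(trace)
        scores = {
            url: max(jaccard(target, profile) for profile in profiles)
            for url, profiles in library.items()
        }
        return sorted(scores, key=scores.get, reverse=True)[:k]

Calling classify(trace, library, k=10) yields the "top k" guess lists whose accuracy is discussed on the next slides.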
How well does it work?
• Surprisingly well
• Lots of variables to play with:
  • Size of the “training set” (the data used to build a site’s library profile)
  • Size of the test set
  • Time between collection of the training and test sets
  • Desired accuracy (top 1 most likely site, or top k for k = 2, 3, 5, 10...)
  • Number of sites in the library
• Jaccard’s coefficient is generally better than naive Bayes
• Bottom line: with a training set of 4 samples and a test set of 4 samples, they got ~75% accuracy
Effect of variables
• Increasing the training set size up to 4 greatly improves accuracy; past 4 the returns diminish
• Increasing k increases accuracy (duh)
• Time between the training and test sets matters, but the difference is less than 10%, even after 4 weeks
  • It doesn’t matter whether the training set comes from before or after the test set
• The fewer total sites in the library, the better the accuracy, but accuracy falls off relatively slowly as the library grows from 200 to 2,000 sites (will this hold at 40 million?)
Is this good enough?
• This is a philosophical question
• Given the relatively small amount of data collected for each site, I think this is good enough to be interesting
• This kind of accuracy requires training and test sets of size 4+
  • How likely are you to get a test set of that size?
• Even with perfect data, a maximum of ~75% accuracy is limiting
How can we make it worse?
• This analysis is based entirely on packet size
• Change the packet sizes and you change the results
• 4 simple packet-size padding methods:
  • Linear
  • Exponential
  • Mice & elephants
  • MTU
• All increase packet sizes in a deterministic manner (sketched below)
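The four schemes as I understand them; the 128-byte linear step, the 128-byte mice threshold, and the 1,500-byte MTU are assumptions on my part, not values stated on the slides.

    MTU = 1500  # assumed Ethernet MTU

    def pad_linear(n, step=128):
        # Round up to the next multiple of `step` bytes.
        return min(MTU, -(-n // step) * step)

    def pad_exponential(n):
        # Round up to the next power of two.
        p = 1
        while p < n:
            p *= 2
        return min(MTU, p)

    def pad_mice_elephants(n, threshold=128):
        # Only two sizes survive: "mice" (threshold) and "elephants" (MTU).
        return threshold if n <= threshold else MTU

    def pad_mtu(n):
        # Every packet is padded to the full MTU.
        return MTU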
The effectiveness of padding
• Linear padding cuts accuracy in half
• Exponential padding makes it useless
• Total data transmitted remains small for linear and exponential padding
• Results for top-10 accuracy are much better

  Padding   Accuracy   Relative size
  none      0.721      1
  linear    0.477      1.034
  exp       0.056      1.089
  m & e     0.003      1.478
  MTU       0.001      2.453
The not so great...
• For this to be useful, you need a library of every website
  • Collecting this much data isn’t easy
  • How accurate will this be? With 38 million websites, a lot of sites are going to look the same
• They show that trivial packet padding makes this useless
• No results for test sets of size < 4
Future work
• The current analysis is vulnerable to packet padding; they are looking to use packet arrival times to overcome this
• Even for non-padded packets, packet timing can be important (but also hard to use)
• Padding packets non-deterministically may be even stronger against profiling
• How reasonable is building a huge library of profiles for the entire Internet?
• In the end, is 75% accuracy good enough?
Takeaway
You can say a lot about a book by its cover.