k-fingerprinting: A Robust, Scalable Website Fingerprinting Technique
George Danezis, Jamie Hayes
University College London
August 12, 2016
How does website fingerprinting work? - Training
[Figure: the adversary loads a set of monitored webpages over the Tor network (Relay 1, Relay 2, Relay 3) and creates a fingerprint for each of them.]
How does website fingerprinting work? - Attack
[Figure: the client loads a webpage (www) over the Tor network (Relay 1, Relay 2, Relay 3); the adversary, observing the client's traffic, checks whether its fingerprint matches the fingerprint of any monitored webpage.]
Experimental attack set-up
- Closed World: the client can access only the monitored webpages.
- Open World: the client can access any webpage.
Contributions
- k-FP - a new attack based on Random Forests and k-NN [1].
- An analysis of the features used in this and prior work, to determine which yield the most information about an encrypted or anonymized webpage.
- A large open world setting: in total we tested k-FP on 101,130 unique webpages.
- Experiments with both standard websites and Tor hidden services.
[1] Wang et al., "Effective Attacks and Provable Defenses for Website Fingerprinting", 2014.
Feature Analysis
Features need to be drawn from a diverse set to bypass targeted WF defenses.
[Figure: feature importance score (roughly 0.000-0.040) against feature rank (roughly 1-150).]
The "best" features were the number of packets (incoming/outgoing) and information leaked from the first few seconds of loading a webpage.
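As a rough illustration (not the paper's full feature set), the sketch below builds a few features of the kind named above from a packet trace and ranks them with a Random Forest's impurity-based importances. The trace format, the feature names, and the labelled training set (`traces`, `labels`) are assumptions for this example.

```python
# Rough illustration only. Assumptions (not from the slides): a trace is a list of
# (timestamp, direction, size) tuples with direction +1 for outgoing and -1 for
# incoming packets; `traces` and `labels` are a hypothetical labelled training set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURE_NAMES = ["n_outgoing", "n_incoming", "n_total",
                 "pkts_first_2s", "bytes_first_2s"]

def extract_features(trace):
    """Turn one page-load trace into a small numeric feature vector."""
    t0 = trace[0][0]
    n_out = sum(1 for t, d, s in trace if d > 0)
    n_in = sum(1 for t, d, s in trace if d < 0)
    early = [(t, d, s) for t, d, s in trace if t - t0 <= 2.0]  # first ~2 seconds
    return [n_out, n_in, len(trace), len(early), sum(s for _, _, s in early)]

def rank_features(traces, labels, n_trees=150):
    """Rank the features by Random Forest impurity-based importance."""
    X = np.array([extract_features(tr) for tr in traces])
    forest = RandomForestClassifier(n_estimators=n_trees).fit(X, labels)
    order = np.argsort(forest.feature_importances_)[::-1]
    return [(FEATURE_NAMES[i], float(forest.feature_importances_[i])) for i in order]
```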
k-FP Attack
- Train on a classification task with network traffic information as features.
- Use the Random Forest output as the fingerprint of a website load.
- Then use k-NN over those fingerprints for classification.
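A minimal sketch of this two-stage idea: the Random Forest maps each traffic instance to a fingerprint (the vector of leaf indices it reaches, one per tree), and test pages are classified by k-NN over those fingerprints. The Hamming metric and all parameter values are illustrative assumptions, not necessarily the paper's exact settings.

```python
# Sketch: Random Forest leaf vectors as fingerprints, then k-NN over fingerprints.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def fit_fingerprinter(X_train, y_train, n_trees=150):
    forest = RandomForestClassifier(n_estimators=n_trees).fit(X_train, y_train)
    train_fps = forest.apply(X_train)            # leaf indices, shape (n_samples, n_trees)
    return forest, train_fps

def nearest_labels(forest, train_fps, y_train, X_test, k=3):
    test_fps = forest.apply(X_test)
    # Hamming distance = fraction of trees whose leaves differ between two fingerprints.
    nn = NearestNeighbors(n_neighbors=k, metric="hamming").fit(train_fps)
    _, idx = nn.kneighbors(test_fps)
    return np.asarray(y_train)[idx]              # k candidate labels per test instance
```

In the open-world setting the k candidate labels would then be turned into a monitored/unmonitored decision; one possible rule is sketched after the parameter-tuning slide.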
Base Rate
Previous attacks reported a very high True Positive Rate (TPR) and a very low False Positive Rate (FPR), but as the number of tested samples rises, so too does the number of false alarms: in a large enough open world, the vast majority of alarms will be false positives.
Base Rate
The FPR needs to be very low for an accurate attack, because the number of false alarms grows with the number of fingerprints tested.
Suppose we have an FPR of 1%:
- If a client loads 100 unmonitored webpages, the attacker will mark 1 webpage incorrectly as monitored.
- If a client loads 1,000,000 unmonitored webpages, the attacker will mark 10,000 webpages incorrectly as monitored.
Accuracy metrics
TPR - the probability that a monitored page is classified as the correct monitored page.
FPR - the probability that an unmonitored page is incorrectly classified as a monitored page.
BDR - the probability that a page corresponds to the correct monitored page given that the classifier recognized it as that monitored page.
Assuming a uniform distribution of pages, the BDR can be found from the TPR and FPR using the formula

BDR = (TPR · Pr(M)) / (TPR · Pr(M) + FPR · Pr(U)),

where Pr(M) = |Monitored| / |Total Pages| and Pr(U) = 1 − Pr(M).
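For concreteness, a minimal helper that evaluates the formula above; the example reuses the 1% FPR from the previous slide and the 55-site Alexa monitored set described later, under the slide's uniform-prior assumption.

```python
# Bayesian detection rate under the slide's uniform-prior assumption.
def bdr(tpr, fpr, n_monitored, n_unmonitored):
    pr_m = n_monitored / (n_monitored + n_unmonitored)   # Pr(M)
    pr_u = 1.0 - pr_m                                     # Pr(U)
    return (tpr * pr_m) / (tpr * pr_m + fpr * pr_u)

# Even a perfect TPR with a 1% FPR yields a low BDR in a large open world:
print(bdr(tpr=1.0, fpr=0.01, n_monitored=55, n_unmonitored=100_000))   # ~0.05
```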
Tor hidden services
Protect receiver anonymity in addition to sender anonymity.
Sensitive servers such as SecureDrop use Tor hidden services.
Tor hidden services
[Figure, built up over several slides: the client connects to a hidden service (www) through the Tor network via an Introduction Point (IP) and a Rendezvous Point (RP), while the adversary observes the client's traffic.]
Prelims
- All traffic was collected via Tor.
- Websites monitored by the adversary: Alexa sites (Google, Facebook, Wikipedia, etc.) and popular Tor hidden services.
- Only the landing page of each website was collected.
- The Alexa monitored set consisted of 100 samples of each of 55 websites.
- The hidden services monitored set consisted of 80 samples of each of 30 hidden services.
- Extra sites for testing purposes: 100,000 websites (chosen from the top Alexa list).
Parameter tuning - number of neighbours and number of trees
[Figure, two panels: "Number of neighbours" (true positive rate against false positive rate, FPR roughly 0.006-0.020, TPR roughly 0.82-0.92) and "Number of Trees" (max and min accuracy against number of trees, 0-200).]
Using different k, the number of neighbours, allows us to tune the TPR and FPR.
After about 15 decision trees there is only incremental benefit in adding more.
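To make the neighbours panel concrete, below is a minimal sketch of one plausible way k trades TPR against FPR in the k-NN stage; the unanimity rule is an assumption for illustration, not necessarily the paper's exact rule.

```python
# Sketch of one possible open-world decision rule (assumption): raise a "monitored"
# alarm only when all k nearest training fingerprints carry the same monitored label.
# Larger k gives fewer but more certain alarms, i.e. lower FPR at the cost of lower TPR.
UNMONITORED = "unmonitored"

def decide(neighbour_labels):
    first = neighbour_labels[0]
    if first != UNMONITORED and all(lbl == first for lbl in neighbour_labels):
        return first        # unanimous monitored vote: raise an alarm
    return UNMONITORED      # any disagreement: stay silent
```

Sweeping k upwards under such a rule traces out points further along the TPR/FPR curve shown in the left panel.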
Alexa monitored set results
[Figure: true positive rate (roughly 0.70-1.00) and false positive rate (roughly 0.000-0.040) against the number of unmonitored sites (20,000-100,000), for k = 1, 5, 10.]
Tor hidden service monitored set results
[Figure: true positive rate (roughly 0.70-1.00) and false positive rate (roughly 0.000-0.040) against the number of unmonitored sites (20,000-100,000), for k = 1, 5, 10.]
BDR
[Figure: Bayesian detection rate (0.0-1.0) against the number of unmonitored sites (10,000-100,000), for k = 1, 5, 10; one panel for the Tor hidden services monitored set and one for the Alexa monitored set.]
Limitations
"The BDR implicitly assumes a base rate, with no particular backing in reality." - We assume a uniform expectation of visiting a webpage.
"I would like to better understand how these techniques would work if the attacker did not know the start/stop time that the user visits each website." - Website fingerprinting evaluations may not reflect practical risks.
Conclusion
- The open world is not as much of a problem as we had thought, and using state-of-the-art machine learning we expect to be able to tackle other obstacles such as start/stop time identification and multiple tabs.
- The attack is highly accurate over a large number of webpages.
- Tor hidden services are distinguishable from non-hidden-service websites.
Thanks
Questions?
j.hayes@cs.ucl.ac.uk
@_jamiedh
http://www.homepages.ucl.ac.uk/~ucabaye/