k-fingerprinting: a Robust Scalable Website Fingerprinting Technique
George Danezis, Jamie Hayes
University College London
August 12, 2016
How does website fingerprinting work? - Training
[Diagram: the adversary loads webpages through the Tor network (Relay 1, Relay 2, Relay 3) and records the resulting traffic.]
Create fingerprints for each website of interest.
How does website fingerprinting work? - Attack
[Diagram: the client loads a webpage (www) through the Tor network (Relay 1, Relay 2, Relay 3) while the adversary observes the client's traffic.]
The adversary checks whether the fingerprint of the observed traffic matches the fingerprint of a monitored website.
Experimental Attack set-up
Closed World - the client can access only a fixed set of websites.
Open World - the client can access any website.
Contributions
k-FP - a new attack based on Random Forests and k-NN1
An analysis of the features used in this and prior work, to determine which yield the most information about an encrypted or anonymized webpage.
A large open world setting: in total we tested k-FP on 101,130 unique webpages.
Experiments with both standard websites and Tor hidden services.
1 Wang et al., "Effective Attacks and Provable Defenses for Website Fingerprinting", 2014.
Feature Analysis
Features need to be drawn from a diverse set to bypass targeted WF defenses.
[Plot: feature importance score (0.000-0.040) against feature rank for the top ~150 features.]
The "best" features were the number of incoming/outgoing packets and the information leaked in the first few seconds of loading a webpage.
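This kind of ranking can be sketched with a Random Forest's built-in importance scores (assuming scikit-learn; the feature names and data below are hypothetical placeholders, not the paper's feature set):

```python
# Sketch: ranking traffic features by Random Forest importance.
# Data and feature names are toy placeholders for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
names = ["n_packets_in", "n_packets_out", "first_secs_bytes", "total_time"]
X = rng.random((300, len(names)))
y = (X[:, 0] > 0.5).astype(int)  # make the first feature informative

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# feature_importances_ sums to 1; a higher score means more informative.
ranked = sorted(zip(forest.feature_importances_, names), reverse=True)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

Here the synthetic label depends only on the first feature, so it dominates the ranking, mirroring how packet counts dominated in the real analysis.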
k-FP Attack
Train on a classification task with network traffic information as features. Use Random Forest output as the fingerprint of a website load. Then use k-NN for classification.
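A minimal sketch of this pipeline (assuming scikit-learn; using the vector of leaf indices as the fingerprint with a Hamming-style k-NN is one common way to realize "Random Forest output as fingerprint", and all data here are toy placeholders):

```python
# Toy sketch of the k-FP idea: forest leaves as fingerprints, k-NN on top.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((200, 8))          # hypothetical traffic features
y_train = rng.integers(0, 5, 200)       # labels for 5 "websites"
X_test = rng.random((10, 8))

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_train, y_train)

# Fingerprint = the leaf index each sample reaches in every tree.
train_fp = forest.apply(X_train)        # shape (200, 20)
test_fp = forest.apply(X_test)

def knn_classify(fp, train_fps, labels, k=3):
    # Distance = number of trees whose leaf indices disagree.
    dists = (train_fps != fp).sum(axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]      # majority vote

preds = [knn_classify(fp, train_fp, y_train) for fp in test_fp]
```

In an open-world setting, a stricter rule (e.g. classify as monitored only when all k neighbours agree) trades TPR for a lower FPR.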
Base Rate
Previous attacks reported very high True Positive Rates (TPR) and very low False Positive Rates (FPR), but as the number of tested samples grows, so does the number of false alarms; eventually the vast majority of alarms are false positives.
Base Rate
The FPR needs to be very low for an accurate attack, because many fingerprints are tested. Suppose we have a FPR of 1%. If a client loads 100 unmonitored webpages, the attacker will incorrectly mark 1 webpage as monitored. If a client loads 1,000,000 unmonitored webpages, the attacker will incorrectly mark 10,000 webpages as monitored.
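The arithmetic above, spelled out (a trivial sketch; the 1% FPR is the slide's example figure):

```python
# Expected false alarms = FPR x number of unmonitored page loads.
fpr = 0.01  # the 1% example from the slide
for n_unmonitored in (100, 1_000_000):
    false_alarms = fpr * n_unmonitored
    print(f"{n_unmonitored} unmonitored loads -> {false_alarms:.0f} false alarms")
```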
Accuracy metrics
TPR - the probability that a monitored page is classified as the correct monitored page.
FPR - the probability that an unmonitored page is incorrectly classified as a monitored page.
BDR - the probability that a page corresponds to the correct monitored page, given that the classifier recognized it as that monitored page. Assuming a uniform distribution over pages, the BDR can be found from the TPR and FPR using the formula

BDR = (TPR · Pr(M)) / (TPR · Pr(M) + FPR · Pr(U)),

where Pr(M) = |Monitored| / |Total Pages| and Pr(U) = 1 − Pr(M).
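The formula translates directly into code (a sketch with illustrative numbers: 55 monitored pages among 100,055 total matches the Alexa experiment described later, while the TPR and FPR values are assumed):

```python
def bdr(tpr: float, fpr: float, pr_m: float) -> float:
    # Bayesian detection rate: P(correct monitored page | flagged as monitored),
    # assuming a uniform prior over pages.
    pr_u = 1.0 - pr_m
    return (tpr * pr_m) / (tpr * pr_m + fpr * pr_u)

# 55 monitored pages among 100,055 total; TPR/FPR values are assumed.
print(round(bdr(tpr=0.93, fpr=0.01, pr_m=55 / 100_055), 4))  # ≈ 0.0487
```

Even with a seemingly excellent classifier, under a uniform base rate fewer than 5% of the pages flagged as monitored actually are.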
Tor hidden services
Protects receiver anonymity in addition to sender anonymity. Sensitive servers such as SecureDrop use Tor hidden services.
Tor hidden services

[Diagram sequence: the client connects to the hidden service (www) through the Tor network, first contacting the service's introduction point (IP), then meeting the service at a rendezvous point (RP); the adversary observes the client's side of the connection throughout.]
Prelims
All traffic was collected via Tor.
Websites monitored by the adversary: Alexa sites (Google, Facebook, Wikipedia, etc.) and popular Tor hidden services.
Only the landing page of each website was collected.
The Alexa monitored set consisted of 100 samples for each of 55 websites.
The hidden services monitored set consisted of 80 samples for each of 30 hidden services.
Extra sites for testing purposes: 100,000 websites (chosen from the top Alexa list).
Parameter tuning - number of neighbours and number of trees
[Left plot: TPR (0.82-0.92) against FPR (0.006-0.020), with max/min accuracy marked, as the number of neighbours varies. Right plot: TPR and FPR against the number of trees (50-200).]
Varying k, the number of neighbours, lets us tune the trade-off between the TPR and FPR. Beyond roughly 15 decision trees, there is only incremental benefit in adding more.
Alexa monitored set results
[Plots: TPR (0.70-1.00) and FPR (0.000-0.040) for k = 1, 5, 10, as the number of unmonitored sites grows from 20,000 to 100,000.]
Tor hidden service monitored set results
[Plots: TPR (0.70-1.00) and FPR (0.000-0.040) for k = 1, 5, 10, as the number of unmonitored sites grows from 20,000 to 100,000.]
BDR
Tor hidden services monitored set: [plot of Bayesian detection rate (0.0-1.0) for k = 1, 5, 10 against the number of unmonitored sites (10,000-100,000)].
Alexa monitored set: [the same plot for the Alexa monitored set].
Limitations
"The BDR implicitly assumes a base rate, with no particular backing in reality." - We assume a uniform expectation of visiting each webpage.
"I would like to better understand how these techniques would work if the attacker did not know the start/stop time that the user visits each website." - Website fingerprinting evaluations may not reflect practical risks.
Conclusion
The open world is not as much of a problem as we had thought, and using state-of-the-art machine learning we expect to tackle further obstacles such as start/stop time identification and multiple tabs.
The attack is highly accurate over a large number of webpages.
Tor hidden services are distinguishable from non-hidden-service websites.