k-fingerprinting: a Robust Scalable Website Fingerprinting Technique

SLIDE 1

k-fingerprinting: a Robust Scalable Website Fingerprinting Technique

Jamie Hayes George Danezis

University College London

August 12, 2016

SLIDE 2

How does website fingerprinting work? - Training

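[Figure: the adversary loads webpages through the Tor network (Relay 1, Relay 2, Relay 3).]

The adversary creates fingerprints for the websites it wants to recognize (shown as site icons in the original slide).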

SLIDE 3

How does website fingerprinting work? - Attack

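[Figure: the client reaches a website (www) through the Tor network (Relay 1, Relay 2, Relay 3); the adversary observes the client's traffic.]

The adversary checks whether the fingerprint of the observed traffic matches the fingerprint of a known website.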

SLIDE 4

Experimental Attack set-up

Closed World: the client accesses only the monitored websites.
Open World: the client may access any website.

SLIDE 5

Contributions

k-FP: a new attack based on Random Forests and k-NN [1].
An analysis of the features used in this and prior work to determine which yield the most information about an encrypted or anonymized webpage.
Large open world setting: in total we tested k-FP on 101,130 unique webpages.
Experimented with both standard websites and Tor hidden services.

[1] Wang et al., “Effective Attacks and Provable Defenses for Website Fingerprinting”, 2014.

SLIDE 6

Feature Analysis

Features need to be drawn from a diverse set to bypass targeted WF defenses.

[Figure: feature importance score (0.000 to 0.040) against feature rank (1 to 141).]

The “best” features were number of packets (incoming/outgoing) and information leaked from the first few seconds of loading a webpage.
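As a rough illustration of the kind of features named above, the sketch below counts incoming and outgoing packets over a whole trace and over an initial time window. The trace representation (a list of `(timestamp, direction)` pairs), the function name `basic_features`, and the 5-second window are assumptions made for this example; the slides do not specify them.

```python
def basic_features(trace, window_seconds=5.0):
    """Count packets by direction, overall and in the first few seconds.

    trace: list of (timestamp_seconds, direction) pairs, where direction is
    +1 for outgoing and -1 for incoming (a hypothetical encoding).
    window_seconds: length of the "first few seconds" window (assumed value).
    """
    if not trace:
        return {
            "total_packets": 0,
            "outgoing_packets": 0,
            "incoming_packets": 0,
            "early_outgoing": 0,
            "early_incoming": 0,
        }

    start = trace[0][0]
    # Directions of packets seen within the initial window of the page load.
    early = [d for t, d in trace if t - start <= window_seconds]

    return {
        "total_packets": len(trace),
        "outgoing_packets": sum(1 for _, d in trace if d > 0),
        "incoming_packets": sum(1 for _, d in trace if d < 0),
        "early_outgoing": sum(1 for d in early if d > 0),
        "early_incoming": sum(1 for d in early if d < 0),
    }


# Toy trace, invented for illustration only.
toy_trace = [(0.0, 1), (0.2, -1), (0.3, -1), (1.5, 1), (6.0, -1), (7.2, -1)]
features = basic_features(toy_trace)
```

A real feature set would be much larger (the plot above ranks roughly 150 features); this only shows the shape of the computation.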

SLIDE 7

k-FP Attack

Train on a classification task with network traffic information as features. Use Random Forest output as the fingerprint of a website load. Then use k-NN for classification.
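The classification step can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that a fingerprint is a fixed-length vector with one entry per tree (for example, the leaf each tree assigns to the page load), that fingerprints are compared by Hamming distance, and that a page is labelled as a monitored site only when all k nearest neighbours agree. None of these details are stated on the slide, and the toy fingerprints are invented.

```python
def hamming(a, b):
    """Number of positions at which two equal-length fingerprints differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_classify(train, query, k=3, unmonitored="unmonitored"):
    """Label `query` using its k nearest training fingerprints.

    train: list of (fingerprint, label) pairs.
    Returns the shared label if all k neighbours agree, otherwise
    `unmonitored` (the unanimity rule is an assumption of this sketch).
    """
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    labels = {label for _, label in nearest}
    return labels.pop() if len(labels) == 1 else unmonitored


# Hypothetical fingerprints: one entry per tree in a 4-tree forest.
train = [
    ((3, 7, 1, 4), "site_a"),
    ((3, 7, 2, 4), "site_a"),
    ((3, 6, 1, 4), "site_a"),
    ((9, 2, 5, 0), "site_b"),
    ((9, 2, 5, 1), "site_b"),
    ((8, 2, 5, 0), "site_b"),
    ((5, 5, 5, 5), "unmonitored"),
]

clear_match = knn_classify(train, (3, 7, 1, 5), k=3)
ambiguous = knn_classify(train, (3, 7, 5, 5), k=3)
```

Under the unanimity assumption, a larger k makes the classifier more conservative: fewer pages are labelled as monitored, which lowers both the true and false positive counts.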

SLIDE 8

Base Rate

Previous attacks had a very high True Positive Rate (TPR) and a very low False Positive Rate (FPR), but the number of false alarms rises with the number of samples tested. As the number of samples grows, the vast majority of alarms will be false positives.

SLIDE 9

Base Rate

FPR needs to be very low for an accurate attack as more fingerprints are tested. Suppose the FPR is 1%. If a client loads 100 unmonitored webpages, the attacker will on average mark 1 webpage incorrectly as monitored. If a client loads 1,000,000 unmonitored webpages, the attacker will on average mark 10,000 webpages incorrectly as monitored.
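The arithmetic behind these two cases is a single multiplication, sketched below. The function name `expected_false_alarms` is made up for this example.

```python
def expected_false_alarms(fpr, unmonitored_loads):
    """Expected number of unmonitored page loads flagged as monitored."""
    if not 0.0 <= fpr <= 1.0:
        raise ValueError("fpr must be between 0 and 1")
    if unmonitored_loads < 0:
        raise ValueError("unmonitored_loads must be non-negative")
    return fpr * unmonitored_loads


# The two cases from the slide: an FPR of 1% at two different scales.
small = expected_false_alarms(0.01, 100)
large = expected_false_alarms(0.01, 1_000_000)
```

The false positive rate is the same in both cases; only the number of page loads changes, which is why a fixed FPR becomes a problem at scale.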

SLIDE 10

Accuracy metrics

TPR - The probability that a monitored page is classified as the correct monitored page.
FPR - The probability that an unmonitored page is incorrectly classified as a monitored page.
BDR - The probability that a page corresponds to the correct monitored page given that the classifier recognized it as that monitored page.

Assuming a uniform distribution of pages, BDR can be found from TPR and FPR using the formula

$$\mathrm{BDR} = \frac{\mathrm{TPR} \cdot \Pr(M)}{\mathrm{TPR} \cdot \Pr(M) + \mathrm{FPR} \cdot \Pr(U)}$$

where

$$\Pr(M) = \frac{|\text{Monitored}|}{|\text{Total Pages}|}, \qquad \Pr(U) = 1 - \Pr(M).$$
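The BDR formula can be evaluated directly. The sketch below is a plain transcription of it; the function name and the numbers in the example (TPR 0.9, FPR 0.01, 100 monitored pages) are hypothetical and are not results from the talk.

```python
def bayesian_detection_rate(tpr, fpr, n_monitored, n_total):
    """BDR = TPR*Pr(M) / (TPR*Pr(M) + FPR*Pr(U)), with Pr(U) = 1 - Pr(M)."""
    if not 0 < n_monitored <= n_total:
        raise ValueError("need 0 < n_monitored <= n_total")
    pr_m = n_monitored / n_total   # Pr(M) = |Monitored| / |Total Pages|
    pr_u = 1.0 - pr_m              # Pr(U)
    denominator = tpr * pr_m + fpr * pr_u
    if denominator == 0:
        raise ValueError("BDR is undefined when no page is ever flagged")
    return tpr * pr_m / denominator


# Same hypothetical classifier, evaluated in a small and a large world.
bdr_small_world = bayesian_detection_rate(0.9, 0.01, 100, 1_000)
bdr_large_world = bayesian_detection_rate(0.9, 0.01, 100, 100_000)
```

With TPR and FPR held fixed, BDR falls as the share of unmonitored pages grows, which is the base rate effect described on the earlier slides.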

SLIDE 11

Tor hidden services

Protects receiver anonymity in addition to sender anonymity. Sensitive servers such as SecureDrop use Tor hidden services.

SLIDE 12

Tor hidden services

[Figure: Client, Tor network, Adversary, and hidden service (www).]

SLIDE 13

Tor hidden services

[Figure: Client, Tor network, introduction point (IP), Adversary, and hidden service (www).]

SLIDE 14

Tor hidden services

[Figure: Client, Tor network, rendezvous point (RP), introduction point (IP), Adversary, and hidden service (www).]

SLIDE 15

Tor hidden services

[Figure: Client, Tor network, rendezvous point (RP), introduction point (IP), Adversary, and hidden service (www).]

SLIDE 16

Tor hidden services

[Figure: Client, Tor network, rendezvous point (RP), introduction point (IP), Adversary, and hidden service (www).]

SLIDE 17

Prelims

All traffic was collected via Tor.
Websites monitored by the adversary: Alexa sites (Google, Facebook, Wikipedia etc.) and popular Tor Hidden Services.
Only the landing page of each website was collected.
The Alexa monitored set consisted of 100 samples for each of 55 websites.
The Hidden Services monitored set consisted of 80 samples for each of 30 Hidden Services.
Extra sites for testing purposes: 100,000 websites (chosen from the top Alexa list).

SLIDE 18

Parameter tuning - number of neighbours and number of trees

[Figure, panel "Number of neighbours": True positive (0.82 to 0.92) against False positive (0.006 to 0.020); legend: Max accuracy, Min accuracy.]

[Figure, panel "Number of Trees": Accuracy (0.0 to 1.0) against Number of trees (50 to 200); legend: True positive rate, False positive rate.]

Varying k, the number of neighbours, allows us to tune the TPR and FPR. Beyond 15 decision trees, adding more trees brings only incremental benefit.

SLIDE 19

Alexa monitored set results

[Figure: True positive rate (0.70 to 1.00) and False positive rate (0.000 to 0.040) against Number of unmonitored sites (20,000 to 100,000), for k=1, k=5 and k=10.]

SLIDE 20

Tor hidden service monitored set results

[Figure: True positive rate (0.70 to 1.00) and False positive rate (0.000 to 0.040) against Number of unmonitored sites (20,000 to 100,000), for k=1, k=5 and k=10.]

SLIDE 21

BDR

[Figure, two panels (Tor Hidden Services monitored set; Alexa monitored set): Bayesian detection rate (0.0 to 1.0) against Number of unmonitored sites (10,000 to 100,000), for k=1, k=5 and k=10.]

SLIDE 22

Limitations

“The BDR implicitly assumes a base rate, with no particular backing in reality.” - We assume a uniform expectation of visiting a webpage.
“I would like to better understand how these techniques would work if the attacker did not know the start/stop time that the user visits each website.” - Website fingerprinting evaluation may not reflect practical risks.

SLIDE 23

Conclusion

The open world is not as much of a problem as we had thought, and using state-of-the-art machine learning we expect to be able to tackle other obstacles such as start-stop time identification and multiple tabs.
The attack is highly accurate over a large number of webpages.
Distinguishability between Tor Hidden Services and non-Tor Hidden Services.

SLIDE 24

Thanks

Questions? j.hayes@cs.ucl.ac.uk @_jamiedh http://www.homepages.ucl.ac.uk/~ucabaye/