Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations
C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson

Talk by Goran Doychev
Selected Topics in Information Security and Cryptography Seminar
Overview

1 How does VoIP work?
2 Recognizing previously seen phrases
3 Recognizing phrases without example utterances
4 Evaluation
Part 1: How does VoIP work?
How does VoIP work?

• Control channel: SIP, XMPP, Skype
  • negotiates IP ports, supported codecs, etc.
• Voice data: RTP over UDP
• Speech codec: GSM, G.728, iSAC, Speex
Operation of a Codec

audio stream → sampling at 8000 or 16000 samples per second (Hz) → n most recent samples compressed into a packet (usually 20 ms)

Example:
• 16 kHz audio source: n = 320 samples per packet
• 8 kHz audio source: n = 160 samples per packet
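The packet-size arithmetic above can be sketched as follows; the 20 ms frame length is the usual value from the slide:

```python
# Samples per packet = sampling rate (Hz) * frame length (s).
def samples_per_packet(rate_hz: int, frame_ms: int = 20) -> int:
    return rate_hz * frame_ms // 1000

print(samples_per_packet(16000))  # 320
print(samples_per_packet(8000))   # 160
```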
Operation of a Codec (2)

• brute-force search over the entries in a codebook of audio vectors
• find the entry that most closely reproduces the audio packet
• transmit the index of that entry, not the packet itself
  (e.g., audio packet 01001110 → codebook entry 01001110 → output index 0111)
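The codebook lookup can be sketched as a brute-force nearest-entry search. The codebook contents and the use of Hamming distance on bit strings are illustrative stand-ins for a real codec's comparison of audio vectors:

```python
# Toy codebook: audio-vector bit pattern -> transmitted index.
codebook = {
    "01001010": "0110",
    "01001110": "0111",
    "01011001": "1000",
    "01011010": "1001",
}

def hamming(a: str, b: str) -> int:
    # number of differing bit positions
    return sum(x != y for x, y in zip(a, b))

def encode(frame: str) -> str:
    # brute-force search for the closest codebook entry
    best = min(codebook, key=lambda entry: hamming(entry, frame))
    return codebook[best]

print(encode("01001110"))  # exact match in the codebook -> "0111"
```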
Operation of a Codec (3)

• Quality of sound depends on the number of entries in the codebook
• Classification of coders according to bit rate:

  Category            Bit-rate range
  High bit-rate       > 15 kbps
  Medium bit-rate     5 to 15 kbps
  Low bit-rate        2 to 5 kbps
  Very low bit-rate   < 2 kbps
Variable Bit Rate

• Variable bit rate (VBR): adaptively choose the bit rate for each packet
• Balances audio quality against bandwidth
• Pays off: in a two-way conversation, a speaker is silent about 63% of the time
Variable Bit Rate (2)

LEAKAGE:
• The bit rate depends on the data being encoded
• e.g., Speex encodes vowel sounds (aa, aw) at a higher bit rate than fricative sounds (f, s)
Part 2: Recognizing previously seen phrases
Problem Description

Given:
• utterances of n phrases
• the packet sizes of one of the phrases: (5k, 7k, 3k, 8k, 12k, 2k, 1k)

Goal:
• recognize the phrase: (5k, 7k, 3k, 8k, 12k, 2k, 1k) → "the phrase"
Profile Hidden Markov Model (HMM)

• Match states - expected distribution of packet sizes at each position in the sequence
• Insert states - emit packets according to some (uniform) distribution; allow "insertion" of additional packets
• Delete states - silent states; allow "omitting" packets
Building a Profile HMM

Initially:
• set Match state emission probabilities to the uniform distribution
• transition probabilities: make Match the most likely transition

Train the HMM using example utterances:
• Apply the Baum-Welch algorithm, which iteratively improves the probability of the training sequences
• Baum-Welch finds only a locally optimal set of parameters ⇒ apply simulated annealing to escape local optima
• Apply Viterbi training to further refine the parameters
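The annealing step can be illustrated with a generic simulated-annealing loop; the bumpy 1-D objective below is a toy stand-in for the HMM likelihood, not the authors' actual training code:

```python
import math
import random

def anneal(score, state, neighbor, t0=1.0, cooling=0.95, steps=200):
    """Maximize score() by annealing: always accept improvements,
    accept worse moves with a temperature-dependent probability."""
    random.seed(0)  # deterministic for the example
    best, best_score = state, score(state)
    cur, cur_score, t = state, best_score, t0
    for _ in range(steps):
        cand = neighbor(cur)
        cand_score = score(cand)
        if cand_score > cur_score or random.random() < math.exp((cand_score - cur_score) / t):
            cur, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = cur, cur_score
        t *= cooling  # lower the temperature
    return best

# Stand-in objective with many local optima; global maximum near x = 2.
f = lambda x: -(x - 2.0) ** 2 + 0.5 * math.cos(8 * x)
x = anneal(f, 0.0, lambda x: x + random.uniform(-0.5, 0.5))
print(round(x, 1))
```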
Searching for a Phrase

Changes to the model:
• Random state - emits packets according to a uniform distribution; matches packets that are not part of the phrase of interest
• Profile Start/End states - match the start/end of the phrase
• from Profile Start: the transition to the first Match state is the most likely
Searching for a Phrase (2)

• Apply the Viterbi algorithm - find the most likely sequence of states to explain the observed packet sizes
• A "hit": a subsequence of states that belong to the profile part of the model
• Evaluate the hit's goodness: for the packet lengths l_i, ..., l_j of the phrase of interest,

  score_{i,j} = log ( Pr[l_i, ..., l_j | Profile] / Pr[l_i, ..., l_j | Random] )

• Discard hits whose score falls below a threshold
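The log-odds score can be sketched as follows, assuming (for simplicity) that the two models assign independent per-length probabilities; both probability tables are made up for illustration:

```python
import math

# Made-up per-packet-length probabilities under the two models.
profile = {3: 0.5, 5: 0.3, 7: 0.2}       # phrase-of-interest model
background = {3: 1/3, 5: 1/3, 7: 1/3}    # "Random" (uniform) model

def log_odds(lengths):
    # log of Pr[lengths | Profile] / Pr[lengths | Random]
    return sum(math.log(profile[l] / background[l]) for l in lengths)

hit = [5, 3, 3]
print(log_odds(hit) > 0.0)  # True: fits the profile better than background
```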
Part 3: Recognizing phrases without example utterances
Phrase Models from Phonemes

• Phonemes - sounds like b, ch, t, s, aa, aw (English has 40 to 60 phonemes)
• Idea: words are built up from concatenated phonemes ⇒ model phonemes instead of whole phrases

Advantages:
• Flexibility: any phrase can be assembled from phoneme models
• Cheaper: no example utterances of the target phrase are needed
Problem Description

Given:
• recordings of all phonemes: aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ey, f, g, hh, etc.
• the packet sizes of a phrase: (5k, 7k, 3k, 8k, 12k, 2k, 1k)

Goal:
• recognize the phrase: (5k, 7k, 3k, 8k, 12k, 2k, 1k) → "the phrase"
Phrase Models from Phonemes (2)

Straightforward method:
1 build HMMs for phonemes
2 concatenate them to build word HMMs
3 concatenate word HMMs into a phrase HMM
American English: "the phrase"

(5k, 7k, 1k, 8k, 12k, 2k, 1k)
  ↓
(dh, ah), (f, r, ey, z)
  ↓
("the"), ("phrase")
  ↓
"the phrase"
Scottish English: "the phrase"

(5k, 7k, 1k, 8k, 10k, 2k, 1k)
  ↓
(dh, ah), (f, r, eh, z)
  ↓
("the"), ("frese"?)
  ↓
?
Problem Description

Given:
• recordings of all phonemes: aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ey, f, g, hh, etc.
• the packet sizes of a phrase: (5k, 7k, 3k, 8k, 12k, 2k, 1k)
• a phonetic pronunciation dictionary

Goal:
• recognize the phrase: (5k, 7k, 3k, 8k, 12k, 2k, 1k) → "the phrase"
Phrase Models from Phonemes (3)

Advanced method:
• build the initial profile HMM for the phrase (as before)
• train it using a synthetic training set
• search for the phrase (as before)

Synthetic training set:
• phrase: "the phrase"
• split into words: "the" "phrase"
• create the list of phonemes: "dh ah" "f r ey z"
• replace phonemes with packet sizes: "9k 20k" "5k 8k 14k 3k"

Improved model: use diphones and triphones instead of words
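Synthetic training-set generation can be sketched as a pronunciation-dictionary lookup followed by sampling a packet size per phoneme; the dictionary entries and size tables below are made-up stand-ins for the real recordings:

```python
import random

# Pronunciation dictionary: word -> phoneme list (illustrative).
pron_dict = {"the": ["dh", "ah"], "phrase": ["f", "r", "ey", "z"]}

# Observed packet sizes per phoneme (illustrative values).
phoneme_sizes = {
    "dh": [9, 10], "ah": [20, 19],
    "f": [5, 6], "r": [8, 8], "ey": [14, 15], "z": [3, 4],
}

def synthesize(phrase, rng):
    """One synthetic packet-size sequence for the phrase."""
    sizes = []
    for word in phrase.split():
        for ph in pron_dict[word]:
            sizes.append(rng.choice(phoneme_sizes[ph]))
    return sizes

rng = random.Random(0)
training_set = [synthesize("the phrase", rng) for _ in range(3)]
print(training_set)  # three synthetic sequences, 6 packets each
```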
Part 4: Evaluation
Experimental Setup

• Use the TIMIT continuous speech corpus
• Concatenate sentences into a "conversation"
• Training of the HMM uses:
  • the TIMIT pronunciation dictionary ("proper" American English)
  • the PRONLEX pronunciation dictionary (more colloquial English)
Evaluation Metrics

• recall: probability that the algorithm finds the phrase
• precision: probability that a reported match is correct
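Both metrics reduce to ratios of counts of true positives (TP), false positives (FP), and false negatives (FN); the example counts below are hypothetical:

```python
def recall(tp: int, fn: int) -> float:
    # fraction of actual phrase occurrences that were found
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    # fraction of reported matches that were correct
    return tp / (tp + fp)

# e.g., 51 of 100 occurrences found; 50 of 100 reports correct
print(recall(51, 49))     # 0.51
print(precision(50, 50))  # 0.5
```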
Results of the Experiment

  recall   precision
  51%      50%

• Some phrases were found with high accuracy: "Young children should avoid exposure to contagious diseases." (recall = 0.99, precision = 1)
• Results vary widely across individual speakers
Robustness to Noise

Using pink noise:
• energy logarithmically distributed across the range of human hearing
• harder for noise-removal algorithms to filter out

  sound   noise   recall   precision
  100%    -       .51      .50
  90%     10%     .39      .40
  75%     25%     .23      .22
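Mixing speech and noise at a given proportion can be sketched as a weighted sum of samples; the signals here are toy random sequences, not real audio, and the mixing rule is an assumption about how the conditions in the table were produced:

```python
import random

def mix(speech, noise, speech_frac):
    """Weighted sum of two equal-length sample sequences."""
    assert len(speech) == len(noise)
    return [speech_frac * s + (1 - speech_frac) * n
            for s, n in zip(speech, noise)]

rng = random.Random(0)
speech = [rng.uniform(-1, 1) for _ in range(160)]  # one 20 ms frame at 8 kHz
noise = [rng.uniform(-1, 1) for _ in range(160)]
mixed = mix(speech, noise, 0.75)  # the "75% sound / 25% noise" condition
print(len(mixed))  # 160
```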