
Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer



  1. Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer
     Authors: Zhongjie Ba, Tianhang Zheng, Xinyu Zhang, Zhan Qin, Baochun Li, Xue Liu, and Kui Ren
     Presenter: Shiqing Luo

  2. Smartphone Sensors
     • Permission required
       • Voice sensor: microphone
       • Image sensor: camera
     • No permission needed
       • Motion sensors: gyroscope, accelerometer
       • Magnetic sensor: magnetometer

  3. Motion Sensor Threat to Speech Privacy
     • A smartphone gyroscope can pick up surface vibrations caused by an independent loudspeaker placed on the same table (Michalevsky et al., USENIX 2014).
     • Gyroscopes are (lousy but still) microphones.
       • Very low signal-to-noise ratio
       • Low sampling frequency

     | Speaker           | Speaker Identification | Digits Recognition |
     |-------------------|------------------------|--------------------|
     | Mixed female/male | 50%                    | 17%                |
     | Female speakers   | 45%                    | 26%                |
     | Male speakers     | 65%                    | 23%                |

  4. Motion Sensor Threat to Speech Privacy
     • Only loudspeaker-rendered speech signals traveling through a solid surface can create noticeable impacts on motion sensors (Anand et al., S&P 2018).
     • The threat does not go beyond the Loudspeaker-Same-Surface setup studied by Michalevsky et al.
     [Figure: sound reaching the gyroscope and accelerometer through a shared surface versus through air]

  5. Commonly Believed Limitations
     • Can only pick up a narrow band of speech signals
       • Android has a sampling ceiling of 200 Hz
       • iOS has a sampling ceiling of 100 Hz
     • Does not go beyond the Loudspeaker-Same-Surface setup
     • Very low SNR (signal-to-noise ratio)
     • Sensitive to sound angle of arrival
     [Figure: fundamental frequency range of human speech, 85-180 Hz and 165-255 Hz]

  6. Our Observations: Sampling Frequency
     • The actual sampling rates of motion sensors are determined by the performance of the smartphone.
     • Accelerometers on recent smartphones can cover almost the entire fundamental frequency band (85-255 Hz) of adult speech.
     • The 200 Hz sampling ceiling no longer exists.

     Sampling frequencies supported by Android [1]

     | Delay option  | Delay  | Sampling rate       |
     |---------------|--------|---------------------|
     | DELAY_NORMAL  | 200 ms | 5 Hz                |
     | DELAY_UI      | 60 ms  | 16.7 Hz             |
     | DELAY_GAME    | 20 ms  | 50 Hz               |
     | DELAY_FASTEST | 0 ms   | As fast as possible |

     Sampling rates by phone model

     | Model           | Year | Sampling rate |
     |-----------------|------|---------------|
     | Moto G4         | 2016 | 100 Hz        |
     | Samsung J3      | 2016 | 100 Hz        |
     | LG G5           | 2016 | 200 Hz        |
     | Huawei Mate 9   | 2016 | 250 Hz        |
     | Samsung S8      | 2017 | 420 Hz        |
     | Google Pixel 3  | 2018 | 410 Hz        |
     | Huawei P20 Pro  | 2018 | 500 Hz        |
     | Huawei Mate 20  | 2018 | 500 Hz        |

     [1] “Sensor Overview,” https://developer.android.com/guide/topics/sensors/sensors_overview.
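
One way to read the per-model rates on this slide is through the Nyquist criterion: a sampling rate f_s can represent frequencies only up to f_s / 2. The sketch below applies that rule to the listed rates. The helper name `covered_fraction` and the "fraction of the band covered" metric are illustrative assumptions, not something taken from the talk.

```python
# Fraction of the 85-255 Hz fundamental band that lies below the Nyquist
# frequency (half the sampling rate) for each rate listed on the slide.
BAND_LOW_HZ, BAND_HIGH_HZ = 85.0, 255.0

SAMPLING_RATES_HZ = {
    "Moto G4": 100,
    "Samsung J3": 100,
    "LG G5": 200,
    "Huawei Mate 9": 250,
    "Samsung S8": 420,
    "Google Pixel 3": 410,
    "Huawei P20 Pro": 500,
    "Huawei Mate 20": 500,
}


def covered_fraction(rate_hz):
    """Share of the 85-255 Hz band that is below rate_hz / 2 (hypothetical metric)."""
    nyquist_hz = rate_hz / 2.0
    # Clamp the Nyquist frequency into the band before measuring coverage.
    top_hz = min(max(nyquist_hz, BAND_LOW_HZ), BAND_HIGH_HZ)
    return (top_hz - BAND_LOW_HZ) / (BAND_HIGH_HZ - BAND_LOW_HZ)


for model, rate_hz in SAMPLING_RATES_HZ.items():
    print(f"{model}: Nyquist {rate_hz / 2:.0f} Hz, "
          f"band covered {covered_fraction(rate_hz):.0%}")
```

Under this reading, a 500 Hz accelerometer reaches 250 Hz, which is consistent with the slide's wording that recent phones cover almost the entire band.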

  7. Our Observations: New Setup
     • Employs a smartphone’s accelerometer to eavesdrop on the speaker in the same smartphone.
       • Much higher SNR
       • Sound always arrives from the same direction
     [Figure: accelerometer signals along the x-axis and z-axis]

  8. Our Observations: New Setup
     • Employs a smartphone’s accelerometer to eavesdrop on the speaker in the same smartphone.
       • Much higher SNR
       • Sound always arrives from the same direction
     • A smartphone speaker is more likely to reveal sensitive information than an independent loudspeaker.

  9. Threat Model
     [Figure: handhold setting and table setting]

  10. Accelerometer-based Smartphone Eavesdropping
     • Preprocessing: convert acceleration signals into spectrograms.
     • Speech recognition: convert spectrograms to text.
     • Speech reconstruction: reconstruct voice signals from spectrograms.

  11. Preprocessing
     • Problems in raw acceleration signals
       • Raw accelerometer measurements are not sampled at a fixed interval.
       • Raw accelerometer measurements can be distorted by human movement.
       • Raw accelerometer measurements capture multiple digits and need to be segmented.

     | Time (ms) | x-axis (m/s²) | y-axis (m/s²) | z-axis (m/s²) |
     |-----------|---------------|---------------|---------------|
     | 1         | -0.2130       | -0.1410       | 10.0020       |
     | 2         | -0.1870       | -0.1440       | 9.9970        |
     | 3         | -0.2110       | -0.1510       | 9.9970        |
     | 5         | -0.2110       | -0.1410       | 10.0070       |
     | 8         | -0.2080       | -0.1340       | 10.0120       |
     | 10        | -0.2150       | -0.1320       | 10.0070       |

  12. Step 1: Generate Sanitized Single-word Signals
     • Interpolation
       • Upsample accelerometer signals to 1000 Hz using linear interpolation.

     | Time (ms) | x-axis (m/s²) | y-axis (m/s²) | z-axis (m/s²) |
     |-----------|---------------|---------------|---------------|
     | 1         | -0.2130       | -0.1410       | 10.0020       |
     | 2         | -0.1870       | -0.1440       | 9.9970        |
     | 3         | -0.2110       | -0.1510       | 9.9970        |
     | 5         | -0.2110       | -0.1410       | 10.0070       |
     | 8         | -0.2080       | -0.1340       | 10.0120       |
     | 10        | -0.2150       | -0.1320       | 10.0070       |

  13. Step 1: Generate Sanitized Single-word Signals
     • Interpolation
       • Upsample accelerometer signals to 1000 Hz using linear interpolation.

     | Time (ms) | x-axis (m/s²) | y-axis (m/s²) | z-axis (m/s²) |
     |-----------|---------------|---------------|---------------|
     | 1         | -0.2130       | -0.1410       | 10.0020       |
     | 2         | -0.1870       | -0.1440       | 9.9970        |
     | 3         | -0.2110       | -0.1510       | 9.9970        |
     | 4         | -0.2110       | -0.1460       | 10.0020       |
     | 5         | -0.2110       | -0.1410       | 10.0070       |
     | 6         | -0.2100       | -0.1387       | 10.0087       |
     | 7         | -0.2090       | -0.1363       | 10.0103       |
     | 8         | -0.2080       | -0.1340       | 10.0120       |
     | 9         | -0.2115       | -0.1330       | 10.0095       |
     | 10        | -0.2150       | -0.1320       | 10.0070       |
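
The interpolated values on this slide can be reproduced with a few lines of code. The following is a minimal sketch using only the standard library; the function name `resample_linear` is a hypothetical helper, and the 1 ms step corresponds to the 1000 Hz target rate stated on the slide. Only the y-axis column is shown; the other axes work the same way.

```python
def resample_linear(times_ms, values, step_ms=1):
    """Linearly interpolate irregularly timed samples onto a uniform grid."""
    out_times, out_values = [], []
    j = 0  # index of the sample at or before the current grid point
    t = times_ms[0]
    while t <= times_ms[-1]:
        # Advance to the pair of samples that brackets t.
        while times_ms[j + 1] < t:
            j += 1
        t0, t1 = times_ms[j], times_ms[j + 1]
        weight = (t - t0) / (t1 - t0)
        out_times.append(t)
        out_values.append(values[j] + weight * (values[j + 1] - values[j]))
        t += step_ms
    return out_times, out_values


# Raw y-axis readings from the slide (timestamps in ms, values in m/s^2).
raw_times_ms = [1, 2, 3, 5, 8, 10]
raw_y = [-0.1410, -0.1440, -0.1510, -0.1410, -0.1340, -0.1320]

grid_ms, y_1khz = resample_linear(raw_times_ms, raw_y)
for t, v in zip(grid_ms, y_1khz):
    print(f"{t:2d} ms  {v:.4f}")
```

In practice a library routine such as NumPy's `interp` would typically replace the hand-written loop.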

  14. Step 1: Generate Sanitized Single-word Signals
     • Interpolation
       • Upsample accelerometer signals to 1000 Hz using linear interpolation.
     • High-pass filter
       • Convert the acceleration signal along each axis to the frequency domain and eliminate frequency components below 80 Hz.
     [Figure: fundamental frequency range of human speech, 85-180 Hz and 165-255 Hz]
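
The high-pass step described on this slide can be sketched as a frequency-domain mask. The code below is an illustration under stated assumptions, not the authors' implementation: it uses a naive O(n²) discrete Fourier transform from the standard library (an FFT routine would normally be used), and the function name `highpass_dft` and the synthetic 10 Hz "movement" and 150 Hz "speech" tones are hypothetical.

```python
import cmath
import math


def highpass_dft(signal, fs_hz, cutoff_hz=80.0):
    """Zero every DFT bin below cutoff_hz, then transform back to the time domain."""
    n = len(signal)
    # Forward DFT (quadratic cost; acceptable for a short illustration).
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bins k and n - k represent the same physical frequency.
        freq_hz = min(k, n - k) * fs_hz / n
        if freq_hz < cutoff_hz:
            spectrum[k] = 0
    # Inverse DFT; the output is real because the mask is symmetric.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


fs_hz, n = 1000, 200
movement = [math.sin(2 * math.pi * 10 * t / fs_hz) for t in range(n)]       # slow motion
speech = [0.1 * math.sin(2 * math.pi * 150 * t / fs_hz) for t in range(n)]  # in-band tone
mixed = [m + s for m, s in zip(movement, speech)]

filtered = highpass_dft(mixed, fs_hz)
print(max(abs(f - s) for f, s in zip(filtered, speech)))  # residual after filtering
```

Both test tones fall exactly on DFT bins here, so the low-frequency component is removed cleanly; real signals would show some spectral leakage.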

  15. Step 1: Generate Sanitized Single-word Signals
     • Interpolation
       • Upsample accelerometer signals to 1000 Hz using linear interpolation.
     • High-pass filter
       • Convert the acceleration signal along each axis to the frequency domain and eliminate frequency components below 80 Hz.
     [Figure panels: table setting, handhold setting]

  16. Step 1: Generate Sanitized Single-word Signals
     • Interpolation
       • Upsample accelerometer signals to 1000 Hz using linear interpolation.
     • High-pass filter
       • Convert the acceleration signal along each axis to the frequency domain and eliminate frequency components below 80 Hz.
     • Segmentation
       • Calculate the magnitude of the acceleration signal and smooth the resulting magnitude sequence with a moving average.
       • Locate all regions whose magnitude exceeds a threshold.
     [Figure panels: table setting, handhold setting]
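
The segmentation bullets above can be sketched as follows. This is a minimal illustration that assumes the input has already been high-pass filtered (so gravity is removed); the function name `segment`, the window length, and the threshold value are hypothetical choices, since the slide does not state them.

```python
import math


def segment(x, y, z, window=5, threshold=0.5):
    """Return (start, end) index pairs where the smoothed magnitude exceeds threshold."""
    magnitude = [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]

    # Centered moving average; the window shrinks at the edges of the signal.
    half = window // 2
    smooth = []
    for i in range(len(magnitude)):
        lo, hi = max(0, i - half), min(len(magnitude), i + half + 1)
        smooth.append(sum(magnitude[lo:hi]) / (hi - lo))

    # Collect contiguous runs above the threshold; end indices are exclusive.
    regions, start = [], None
    for i, value in enumerate(smooth):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(smooth)))
    return regions


# Two synthetic bursts of activity on the x-axis separated by silence.
x = [0.0] * 10 + [1.0] * 10 + [0.0] * 10 + [1.0] * 10 + [0.0] * 10
y = [0.0] * len(x)
z = [0.0] * len(x)
print(segment(x, y, z))
```

Each returned pair marks one candidate single-word region that can then be cut out of the three axis signals.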

  17. Step 2: Generate Spectrogram Images
     • Signal-to-spectrogram conversion
       • Divide the signal into multiple short segments with a fixed overlap.
       • Window each segment with a Hamming window and calculate its spectrum through the short-time Fourier transform (STFT).
       • Three spectrograms can be obtained for each single-word signal, one per acceleration axis.
     [Figure panels: table setting, handhold setting]
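
The signal-to-spectrogram conversion can be sketched with overlapping Hamming-windowed frames and a per-frame DFT. The frame length (64 samples) and hop size (16 samples) below are assumptions chosen for illustration, as the slide gives no values, and `stft_magnitude` is a hypothetical helper written with the standard library only.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_magnitude(signal, frame_len=64, hop=16):
    """Magnitude spectrogram: spec[k][j] is frequency bin k of frame j."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + i] * window[i] for i in range(frame_len)]
        # Keep only the non-negative frequency bins 0 .. frame_len / 2.
        frames.append([
            abs(sum(seg[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                    for t in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ])
    # Transpose from frame-major to bin-major layout.
    return [list(column) for column in zip(*frames)]


fs_hz = 1000
tone = [math.sin(2 * math.pi * 125 * t / fs_hz) for t in range(256)]
spec = stft_magnitude(tone)
print(len(spec), "bins x", len(spec[0]), "frames")
```

Running the same function once per axis yields the three spectrograms mentioned on the slide.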

  18. Step 2: Generate Spectrogram Images
     • Signal-to-spectrogram conversion
       • Divide the signal into multiple short segments with a fixed overlap.
       • Window each segment with a Hamming window and calculate its spectrum through the short-time Fourier transform (STFT).
       • Three spectrograms can be obtained for each single-word signal, one per acceleration axis.
     • Generate spectrogram images
       • Fit the three m × n spectrograms into one m × n × 3 tensor.
       • Take the square root of all the elements in the tensor and map the resulting values to integers between 0 and 255.
       • Export the m × n × 3 tensor as an image in PNG format.
     [Figure panels: table setting, handhold setting]
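
The tensor construction on this slide can be sketched as below. The slide does not say how values are mapped to 0-255, so the min-max scaling used here is an assumption, as is the helper name `spectrograms_to_image`. The PNG export itself is left out; an imaging library would handle that step.

```python
import math


def spectrograms_to_image(spec_x, spec_y, spec_z):
    """Stack three m x n spectrograms into an m x n x 3 nested list of 0-255 integers."""
    m, n = len(spec_x), len(spec_x[0])
    # The square root compresses the dynamic range before quantization.
    tensor = [
        [[math.sqrt(spec[i][j]) for spec in (spec_x, spec_y, spec_z)] for j in range(n)]
        for i in range(m)
    ]
    flat = [v for row in tensor for pixel in row for v in pixel]
    lo, hi = min(flat), max(flat)
    # Min-max scaling to 0-255 (assumed mapping; a constant input maps to all zeros).
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return [[[round((v - lo) * scale) for v in pixel] for pixel in row] for row in tensor]


# Tiny 2 x 2 example with one spectrogram per axis.
spec_x = [[0.0, 1.0], [4.0, 9.0]]
spec_y = [[16.0, 25.0], [36.0, 49.0]]
spec_z = [[64.0, 81.0], [100.0, 0.0]]
image = spectrograms_to_image(spec_x, spec_y, spec_z)
print(image[0][0], image[1][0])
```

Each pixel holds the x-, y-, and z-axis values in its three channels, so the result can be treated like an RGB image by an image classifier.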

  19. Speech Recognition
     • DenseNet: $x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$
       • Direct connections between each layer
       • Fewer nodes and parameters
       • Comparable performance with VGG & ResNet
     Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
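
The dense connectivity rule on this slide, in which each layer receives the concatenation of all earlier feature maps, can be illustrated with a toy example. In the sketch below, H_l is a stand-in (a random linear map followed by ReLU on plain vectors) rather than the convolutional layers of an actual DenseNet; `make_layer`, `dense_block`, and the `growth` parameter are hypothetical names used only for illustration.

```python
import random


def make_layer(in_dim, growth, rng):
    """Stand-in for H_l: a random linear map followed by ReLU."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(growth)]

    def layer(features):
        return [max(0.0, sum(w * f for w, f in zip(row, features))) for row in weights]

    return layer


def dense_block(x0, num_layers=3, growth=4, seed=0):
    """Apply x_l = H_l([x_0, ..., x_{l-1}]) and return [x_0, x_1, ..., x_L]."""
    rng = random.Random(seed)
    outputs = [x0]
    for _ in range(num_layers):
        # Concatenate every earlier output: [x_0, x_1, ..., x_{l-1}].
        concatenated = [v for x in outputs for v in x]
        layer = make_layer(len(concatenated), growth, rng)
        outputs.append(layer(concatenated))
    return outputs


outputs = dense_block([1.0, 2.0])
print([len(x) for x in outputs])
```

The input width of each layer grows by `growth` at every step (2, 6, 10 here), which is what the direct connections between layers amount to.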
