A Sequence Matching Network for Polyphonic Sound Event Localization and Detection
Paper: 3583, Session: AUD-L3 Acoustic Event Detection
T. N. T. Nguyen*, D. L. Jones†, W. S. Gan*
*School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
†Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA
6 May 2020, ICASSP
Sound event localization and detection
Sound event localization and detection (SELD)
• Sound event detection (SED)
• Direction-of-arrival (DOA) estimation
(Diagram labels: signal support, spatial filtering.)
SELDnet: joint SED and DOA estimation
The losses of the SED and DOA estimation tasks are jointly optimized.
S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, March 2019.
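To make the joint optimization concrete, here is a minimal PyTorch sketch of a weighted SELD loss in the SELDnet style: binary cross-entropy for the SED branch and masked mean-squared error for the DOA regression branch. The loss weight value and the exact masking scheme are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def seld_loss(sed_logits, sed_targets, doa_pred, doa_targets, weight=0.5):
    """Jointly weighted SELD loss (sketch; `weight` is an assumed value).

    sed_logits, sed_targets: (batch, frames, classes)
    doa_pred, doa_targets:   (batch, frames, classes, 2)  # azimuth, elevation
    """
    # SED branch: multi-label classification, one sigmoid per class per frame.
    sed_loss = F.binary_cross_entropy_with_logits(sed_logits, sed_targets)
    # DOA branch: regression, masked so that only frames in which a class
    # is active contribute to the DOA loss (an assumed masking choice).
    mask = sed_targets.unsqueeze(-1)
    doa_loss = F.mse_loss(doa_pred * mask, doa_targets * mask)
    # Joint objective: a weighted sum of the two task losses.
    return weight * sed_loss + (1.0 - weight) * doa_loss
```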
Two-stage SELD
Y. Cao, Q. Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley, "Polyphonic sound event detection and localization using a two-stage strategy," in Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
Observation
A sound event is described by: 1. timestamps (onset, offset), 2. sound class, 3. DOA.
(Figure: ground-truth vs. output sequences for Events 1-5. SED produces timestamps and sound classes; DOA estimation produces timestamps and DOAs; the output sequences contain a false positive.)
A sequence matching network (SMN) for SELD
• SED network: log-mel + GCC-PHAT input features → CRNN → SED sequences (n_frames × n_classes, n_classes = 11)
• DOAE module: complex spectrogram input features → single-source histogram → DOA sequences (n_frames × n_angles, n_angles = 324)
• Sequence matching network: the SED and DOA sequences are upsampled, concatenated, and passed through a bidirectional GRU (n_frames × 128), followed by fully connected softmax output heads:
  – number of events: n_max_events + 1 = 3
  – sound classes: n_max_events × (n_classes + 1) = 2 × 12
  – azimuth: n_max_events × n_azimuths = 2 × 36
  – elevation: n_max_events × n_elevations = 2 × 9
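The following is a hedged PyTorch sketch of the matching network, using the dimensions from the diagram (11 classes, 36 azimuths, 9 elevations, at most 2 simultaneous events). Layer details such as the hidden size and the absence of convolutional layers are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SequenceMatchingNetwork(nn.Module):
    """Sketch: matches per-frame SED sequences with DOA histogram sequences."""

    def __init__(self, n_classes=11, n_angles=324, n_azimuths=36,
                 n_elevations=9, n_max_events=2, hidden=128):
        super().__init__()
        self.n_max_events = n_max_events
        # Bidirectional GRU over the concatenated per-frame sequences.
        self.gru = nn.GRU(n_classes + n_angles, hidden,
                          batch_first=True, bidirectional=True)
        d = 2 * hidden
        # Output heads; softmax is left to the loss (e.g. CrossEntropyLoss).
        self.n_events_head = nn.Linear(d, n_max_events + 1)
        self.class_head = nn.Linear(d, n_max_events * (n_classes + 1))
        self.azimuth_head = nn.Linear(d, n_max_events * n_azimuths)
        self.elevation_head = nn.Linear(d, n_max_events * n_elevations)

    def forward(self, sed_seq, doa_seq):
        # sed_seq: (batch, frames, n_classes); doa_seq: (batch, frames, n_angles)
        x, _ = self.gru(torch.cat([sed_seq, doa_seq], dim=-1))
        b, t, _ = x.shape
        return {
            "n_events": self.n_events_head(x),
            "classes": self.class_head(x).view(b, t, self.n_max_events, -1),
            "azimuths": self.azimuth_head(x).view(b, t, self.n_max_events, -1),
            "elevations": self.elevation_head(x).view(b, t, self.n_max_events, -1),
        }
```

Each head is trained as an independent multi-class classification per frame, which is what makes the module decomposable from the SED and DOAE front-ends.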
Improved SED network
Architecture: log-mel + GCC-PHAT input features → CNN → bidirectional GRU → fully connected layer with sigmoid outputs (n_frames × n_classes).
Improvement: data augmentation. Random cutout is applied with the same mask for all log-mel and GCC-PHAT channels.
Y. Cao, Q. Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley, "Polyphonic sound event detection and localization using a two-stage strategy," in Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
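A minimal numpy sketch of this augmentation, sharing one time-frequency mask across every channel of the feature stack. The maximum mask sizes are assumed values, not hyperparameters from the paper.

```python
import numpy as np

def random_cutout(features, max_t=20, max_f=8, rng=None):
    """Zero a random time-frequency rectangle, shared across all channels.

    features: (n_channels, n_frames, n_bins) stack of log-mel and GCC-PHAT.
    max_t / max_f are assumed maximum mask sizes.
    """
    rng = rng or np.random.default_rng()
    _, n_frames, n_bins = features.shape
    t = rng.integers(1, max_t + 1)
    f = rng.integers(1, max_f + 1)
    t0 = rng.integers(0, max(1, n_frames - t))
    f0 = rng.integers(0, max(1, n_bins - f))
    out = features.copy()
    out[:, t0:t0 + t, f0:f0 + f] = 0.0  # same rectangle for every channel
    return out
```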
DOA estimation
Time-frequency input features (complex spectrogram) → binary mask → 2D histogram over elevation and azimuth → vectorized to n_angles per frame (n_frames × n_angles).
T. N. T. Nguyen, S. K. Zhao, and D. L. Jones, "Robust DOA estimation of multiple speech sources," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 2287–2291.
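A sketch of the histogram accumulation step only. How the per-bin DOA indices and the single-source binary mask are computed follows [5] and is not shown; the inputs here are assumed to be precomputed.

```python
import numpy as np

def doa_histogram_sequence(az_idx, el_idx, mask, n_azimuths=36, n_elevations=9):
    """Accumulate per-TF-bin DOA estimates into per-frame 2D histograms.

    az_idx, el_idx: integer azimuth/elevation bin indices per
    time-frequency bin, shape (n_frames, n_bins); mask is a boolean
    array of the same shape keeping only bins judged single-source.
    """
    n_frames, n_bins = az_idx.shape
    hist = np.zeros((n_frames, n_elevations, n_azimuths))
    frames = np.broadcast_to(np.arange(n_frames)[:, None], (n_frames, n_bins))
    # Count each selected bin into its (frame, elevation, azimuth) cell.
    np.add.at(hist, (frames[mask], el_idx[mask], az_idx[mask]), 1)
    # Vectorize each frame's histogram to n_angles = 9 * 36 = 324.
    return hist.reshape(n_frames, -1)
```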
Output format
Conventional output format:
• Sound classes: multi-label multi-class classification (n_classes)
• Azimuths: regression
• Elevations: regression
Proposed output format:
• Number of active events: multi-class classification (n_max_events + 1)
• For each of the n_max_events events (Event 1, Event 2):
  – Sound class: multi-class classification (n_classes + 1)
  – Azimuth: multi-class classification (n_azimuths)
  – Elevation: multi-class classification (n_elevations)
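A sketch of how per-frame targets could be encoded in the proposed format, using the dataset's angular resolution (10°, azimuth in [0°, 360°), elevation in [-40°, 40°]). The ordering of events within a frame and the padding convention are assumptions for illustration.

```python
import numpy as np

def encode_frame_targets(events, n_max_events=2, n_classes=11):
    """Encode one frame's events into the proposed target layout.

    events: list of (class_id, azimuth_deg, elevation_deg) tuples active
    in the frame. Class index n_classes is used as the "no event" label.
    """
    n_events = min(len(events), n_max_events)
    classes = np.full(n_max_events, n_classes, dtype=int)  # pad with "none"
    azimuths = np.zeros(n_max_events, dtype=int)
    elevations = np.zeros(n_max_events, dtype=int)
    for i, (cls, az, el) in enumerate(events[:n_max_events]):
        classes[i] = cls
        azimuths[i] = int(az % 360) // 10     # 36 azimuth classes
        elevations[i] = (int(el) + 40) // 10  # 9 elevation classes
    return n_events, classes, azimuths, elevations
```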
Dataset
• TAU Spatial Sound Events 2019 – Ambisonic (DCASE 2019, task 3). Development set: 400 one-minute recordings; evaluation set: 100 one-minute recordings.
• Data are synthesized using recorded room impulse responses (RIRs) and clean signals, with at most 2 overlapping sources per frame.
• SED: 11 indoor sound classes
• DOA: 324 angles
  – Azimuth in [0°, 360°), resolution 10°: 36 angles
  – Elevation in [-40°, 40°], resolution 10°: 9 angles
Evaluation metrics
SED (segment length: 1 second):
• Segment-based error rate
• Segment-based F1 score
DOA estimation (frame length: 0.02 second):
• Frame-based DOA error
• Frame-based frame recall
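The segment-based error rate follows the standard polyphonic SED definition of Mesaros et al. [4], ER = (S + D + I) / N with substitutions, deletions, and insertions counted per segment. A minimal numpy sketch:

```python
import numpy as np

def segment_error_rate(ref, est):
    """Segment-based error rate [4]: ER = (S + D + I) / N.

    ref, est: binary arrays (n_segments, n_classes) marking class
    activity within each 1-second segment.
    """
    ref, est = ref.astype(int), est.astype(int)
    fn = np.maximum(ref - est, 0).sum(axis=1)  # missed classes per segment
    fp = np.maximum(est - ref, 0).sum(axis=1)  # extra classes per segment
    s = np.minimum(fn, fp)                     # substitutions
    d = fn - s                                 # deletions
    i = fp - s                                 # insertions
    return (s + d + i).sum() / ref.sum()       # N = total ground-truth events
```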
New evaluation metrics: to account for correct matching of sound classes and DOAs
1. Matching F1 score (frame-based):
$$\text{matching precision } (mp) = \frac{a}{a + b + c}$$
$$\text{matching recall } (mr) = \frac{a}{a + b + d}$$
$$\text{matching } F1 = \frac{2 \cdot mp \cdot mr}{mp + mr}$$
2. Same-class matching accuracy (frame-based):
$$\text{matching accuracy } (MA) = \frac{\#\text{ of correctly predicted frame-based events that have the same sound class}}{\#\text{ of ground-truth frame-based events that have the same sound class}}$$
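The slide does not restate what the counts a, b, c, and d denote (they are defined in the paper). Given those counts, the matching F1 reduces to a few lines:

```python
def matching_f1(a, b, c, d):
    """Matching F1 from the slide's formulas; a, b, c, d are the
    event-matching counts defined in the paper (not restated here)."""
    mp = a / (a + b + c)  # matching precision
    mr = a / (a + b + d)  # matching recall
    return 2 * mp * mr / (mp + mr)
```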
Methods for comparison
Baselines:
• SELDnet: joint SED and DOAE [1], with log-mel and GCC-PHAT input features [2]
• Two-stage: two-stage SELD [2]
Improved baseline:
• Two-stage-aug: two-stage SELD with additional random cutout augmentation for input features
Inputs to SMNs:
• SED-net: the SED network of Two-stage-aug, used to produce SED sequences for the SMNs
• DOA-hist: single-source histogram for DOA estimation [5], used to produce DOA sequences for the SMNs
Proposed:
• SMN: SMN with the conventional SELD output format
• SMN-event: SMN with the new output format
Top DCASE SELD team ranking:
• Kapka-en: consecutive ensemble of CRNN models with heuristic rules; ranked 1st [6]
• Two-stage-en: ensemble based on two-stage training; ranked 2nd [7]
SELD evaluation results (↑: the higher, the better; ↓: the lower, the better)

Group                  | Method        | SED error rate ↓ | SED F1 score ↑ | DOA error ↓ | DOA frame recall ↑ | Matching F1 score ↑ | Same-class matching accuracy ↑
-----------------------|---------------|------------------|----------------|-------------|--------------------|---------------------|-------------------------------
Baselines              | SELDnet       | 0.212            | 0.880          | 9.75°       | 0.851              | 0.750               | 0.229
Baselines              | Two-stage     | 0.143            | 0.921          | 8.28°       | 0.876              | 0.786               | 0.270
Improved baseline      | Two-stage-aug | 0.108            | 0.944          | 8.42°       | 0.892              | 0.797               | 0.270
Inputs to SMNs         | SED-net       | 0.108            | 0.944          | NA          | NA                 | NA                  | NA
Inputs to SMNs         | DOA-hist      | NA               | NA             | 4.28°       | 0.825              | NA                  | NA
Proposed               | SMN           | 0.079            | 0.958          | 4.97°       | 0.913              | 0.869               | 0.359
Proposed               | SMN-event     | 0.079            | 0.957          | 5.50°       | 0.924              | 0.840               | 0.649
Top DCASE team ranking | Kapka-en      | 0.08             | 0.947          | 3.7°        | 0.968              | NA                  | NA
Top DCASE team ranking | Two-stage-en  | 0.08             | 0.955          | 5.5°        | 0.922              | NA                  | NA
Conclusions
• Our proposed sequence matching networks outperformed the state-of-the-art SELDnet and the two-stage method for sound event localization and detection.
• The sequence matching network is modular and hierarchical, which improves performance while increasing the flexibility of designing and optimizing its components.
• The sequence matching networks increase the correct association between sound classes and their corresponding DOAs in multiple-source cases. The new output format can also handle cases where multiple sound events of the same class have different DOAs.
• The new evaluation metrics address the problem of matching sound classes and DOAs, which could not be measured with the conventional SELD evaluation metrics.
References
1. S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, March 2019.
2. Y. Cao, Q. Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley, "Polyphonic sound event detection and localization using a two-stage strategy," in Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
3. S. Adavanne, A. Politis, and T. Virtanen, "A multi-room reverberant dataset for sound event localization and detection," in Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.
4. A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, pp. 162, 2016.
5. T. N. T. Nguyen, S. K. Zhao, and D. L. Jones, "Robust DOA estimation of multiple speech sources," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 2287–2291.
6. S. Kapka and M. Lewandowski, "Sound source detection, localization and classification using consecutive ensemble of CRNN models," Tech. Rep., DCASE2019 Challenge, June 2019.
7. Y. Cao, T. Iqbal, Q. Q. Kong, M. Galindo, W. Wang, and M. D. Plumbley, "Two-stage sound event localization and detection using intensity vector and generalized cross-correlation," Tech. Rep., DCASE2019 Challenge, June 2019.
Acknowledgement
This research was conducted at Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU), which is a collaboration between Singapore Telecommunications Limited (Singtel) and Nanyang Technological University (NTU) that is funded by the Singapore Government through the Industry Alignment Fund ‐ Industry Collaboration Projects Grant.