Deep Learning with Audio Signals: Prepare, Process, Design, Expect

  1. Deep Learning with Audio Signals Prepare, Process, Design, Expect Keunwoo Choi

  2. Keunwoo Choi Research Scientist QMUL, UK ETRI, S. Korea SNU, S. Korea @keunwoochoi (twtr, github)

  3. WARNING THIS MATERIAL IS WRITTEN FOR ATTENDEES IN QCON.AI, NAMELY, SOFTWARE ENGINEERS AND DEEP LEARNING PRACTITIONERS, TO PROVIDE AN OFF-THE-SHELF GUIDE. MY ADVICE MIGHT NOT BE THE FINAL SOLUTION FOR YOUR PROBLEM, BUT WOULD BE A GOOD STARTING POINT. ..ALSO, THERE'S NO SPOTIFY SECRET HERE :P

  4. Content • Prepare the dataset • Pre-process the signal • Design your network • Expect the result

  5. Prepare the datasets or, know your data Q. How to start an audio task?

  6. LMGTFY • Google them, of course • But....

  7. Audio dataset • Lucky → exactly the same class(es), many of them, yay! • Meh → same or similar classes, sounds alright.. • Ugh.. → there are 2 on freesound.org and 3 on youtube

  8. Audio (or, sound) dataset • Our algorithm lives in the digital space ("our lovely cyberspace") • So do the .wav files • But the sound is in the real world

  9. Audio dataset • Source → Noise → Reverberation → Microphone (room reverberation image from https://johnlsayers.com/Recmanual/Pages/Reverb.htm)

  10. Audio dataset Dear everyone, YOU ARE ALWAYS IN THE "UGH..." SITUATION → HOW TO BUILD A CORRECT AUDIO DATASET?

  11. What we can do • DL models are robust only within the variance they've seen → good at interpolation.. only. E.g., a model trained with clean signals probably can't deal with noisy signals • Know your real situation: noisy environment? cheap mic? • You can mimic noise/reverberation/mic effects if you have clean/dry/high-quality source signals

  12. Simulate the real world • clean signal + noise signal → noisy signal • dry signal * room impulse response → wet signal • original signal → band-pass filter → recorded signal

  13. What to Google • Noise: babble noise recording, home noise recording, cafe noise recording, street noise recording, white noise, brown noise → x_noise = x + alpha * noise • Reverberation: room impulse responses (RIR), reverberation simulators (maybe skip it) → x_wet = scipy.signal.fftconvolve(x, rir) • Microphone: microphone specification, speaker specification, frequency response → band-pass filter, scipy.signal filtering, or trimming off your spectrograms
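The two augmentations on this slide can be sketched in a few lines of numpy/scipy. This is a minimal illustration: the noise clip and the RIR below are synthetic stand-ins for the recordings you would actually download (babble noise, a measured room impulse response, etc.).

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # a clean 1-second signal

# Additive noise: x_noise = x + alpha * noise
noise = rng.standard_normal(len(x))                # stand-in for a recorded noise clip
alpha = 0.1
x_noise = x + alpha * noise

# Reverberation: convolve the dry signal with a room impulse response (RIR).
# A real RIR would be recorded or downloaded; a decaying noise burst stands in here.
rir = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800.0)
x_wet = fftconvolve(x, rir)                        # length = len(x) + len(rir) - 1

print(x_noise.shape, x_wet.shape)
```

Scaling `alpha` (or normalizing both signals first) controls the signal-to-noise ratio of the augmented example.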

  14. Pre-process the signals or, log(melgram) Q. What to do after loading the signals?

  15. Digital Audio 101 • 1 second of digital audio: size=(44100, ), dtype=int16 • MNIST: (28, 28, 1), int8 • CIFAR10: (32, 32, 3), int8 • ImageNet: (256, 256, 3), int8 • Audio: lots of data points in one item!
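Concretely, one second of CD-quality mono audio is just 44100 int16 samples. A quick numpy sketch (the 440 Hz tone is only a stand-in for real audio):

```python
import numpy as np

sr = 44100                                   # CD-quality sampling rate
t = np.arange(sr) / sr                       # 1 second of timestamps
# A 440 Hz sine scaled to the int16 range, standing in for real audio
x = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
print(x.shape, x.dtype)                      # (44100,) int16
```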

  16. Audio representations (type → data shape and size, for e.g., 1 second, sampling rate=44100) • Waveform: x → 44100 x [int16] • Spectrograms: STFT(x) → 513 x 87 x [float32]; Melspectrogram(x) → 128 x 87 x [float32]; CQT(x) → 72 x 87 x [float32] • Features: MFCC(x) → 20 x 87 x [float32] = some process on STFT(x) • Spoiler: log10(Melspectrograms) for the win, but let's see some details
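The spectrogram shapes above follow from the STFT parameters: an FFT size of 1024 gives 1024/2 + 1 = 513 frequency bins, and a hop of 512 samples gives roughly 44100/512 ≈ 87 frames per second (the exact frame count depends on padding). A sketch with scipy, assuming those parameter values:

```python
import numpy as np
from scipy.signal import stft

sr = 44100
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 second at 44.1 kHz

# nperseg=1024 -> 513 frequency bins; hop = 1024 - 512 = 512 samples
f, t, Zxx = stft(x, fs=sr, nperseg=1024, noverlap=512)
print(Zxx.shape)  # (513, ~87) complex values; |Zxx| is the spectrogram
```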

  17. Spectrograms • 2-dim representation of audio signal TODO: IMAGE

  18. Practitioner's choice • Rule of thumb: DISCARD ALL THE REDUNDANCY • Sample rate, or bandwidth • Goal: to optimize the input audio data for your model • by resampling - can be computationally heavy • by discarding some freq bands - can be storage heavy https://www.summerrankin.com/dogandponyshow/2017/10/16/catdog
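For example, if nothing interesting happens above ~11 kHz for your task, resampling from 44.1 kHz to 22.05 kHz halves the data. A minimal sketch with scipy (the tone is a placeholder signal):

```python
import numpy as np
from scipy.signal import resample_poly

sr = 44100
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 second at 44.1 kHz

# Downsample 44100 -> 22050 Hz: halves the sample count
# (and limits the bandwidth to ~11 kHz by Nyquist)
y = resample_poly(x, up=1, down=2)
print(len(x), len(y))  # 44100 22050
```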

  19. Practitioner's choice • Melspectrogram - in decibel scale - which only covers the frequency range you're interested in • Why? - smaller, therefore easier and faster training - perceptual: weighs more on the freq regions humans are more interested in - faster than CQT to compute - decibel scale: another perceptually motivated choice • Q. Ok, how can I compute them?

  20. import librosa import madmom • Python libraries - librosa/madmom/scipy/.. • Computations on CPU • Best when all the processing will be done before the training

  21. import kapre • Keras Audio Preprocessing layers • CPU and GPU • Best when you want to do things on the fly/on GPU = best to optimize audio-related parameters • pip install kapre • There's also pytorch-audio! (Disclaimer: I'm the maintainer of kapre)

  22. Design your network or, know the assumptions Q. What kind of network structure do I need?

  23. A dumb-but-strong-therefore-good-while-annoying-since-it's-from-computer-vision baseline approach • Trim the signals properly (e.g. 1-sec) • Do the classification with a 2D convnet, 3x3 kernels (aka VGGNet) • Raise $1B • Retire • Post "why i retired.." on Medium • Happy life!
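A sketch of that baseline in Keras: a small VGG-style stack of 3x3 convolutions over, say, 1-second log-melspectrograms. The input shape (128 mel bins x 87 frames x 1 channel) and the 10 output classes are placeholders for your own task, not values from the talk:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Minimal VGG-ish baseline for spectrogram classification (illustrative sizes)
model = keras.Sequential([
    layers.Input(shape=(128, 87, 1)),                 # (mel bins, frames, channels)
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),           # 10 classes, as an example
])

out = model.predict(np.zeros((1, 128, 87, 1), dtype="float32"))
print(out.shape)  # (1, 10)
```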

  24. Go even dumber • Just download some pre-trained networks for.. 
 - music 
 - audio 
 - image (?) • Re-use it for your task (aka transfer learning) • 1B - retire - Medium - happy - repeat

  25. Better and stronger, by understanding assumptions • assert "receptive field" size == size of the target pattern • How sparse is the target pattern? 
 - Is bird singing sparse? 
 - Is voice-in-music sparse? 
 - Is distortion-guitar-in-Metallica sparse?
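The receptive field of a conv stack is easy to compute: each layer grows it by (kernel - 1) times the cumulative stride. A small helper to sanity-check that assert, using the standard formula (the layer stack below is just an example):

```python
def receptive_field(stack):
    """Receptive field (in input samples/pixels) of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in stack:
        rf += (kernel - 1) * jump   # each layer widens the field by (k-1) * jump
        jump *= stride              # cumulative stride of the output grid
    return rf

# E.g., three [3x3 conv (stride 1) + 2x2 max-pool (stride 2)] blocks:
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(stack))  # 22
```

If your target pattern spans, say, 50 spectrogram frames, a 22-frame receptive field is too small; add blocks or dilation until they match.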

  26. Have no idea? • Go see how computer vision people are doing • Clone it • It's ok, it's a good baseline at least

  27. Don't use spectrograms as if they are images • "My spectrogram is 28x28 bc the model I downloaded is trained on MNIST.. BUT IT WORKS!" • "I don't know how to incorporate them into my model.." • It all boils down to pattern recognition, so they're actually similar tasks • But the time and frequency axes have totally different meanings

  28. Expecting the result or, know the problem Q. How would it work?

  29. YOU • You are responsible for the feasibility • Is it a task you can do? • Is the information in the input (mel-spectrogram)? • Are similar tasks being solved?

  30. Think about it! • Is it possible? To what extent? E.g., • Baby crying detection • Baby crying recognition and classification • Dog barking translation • Hit song detection

  31. Conclusion Conclusion.. Conclusion!

  32. Conclusion • Sound is analog; you might need to think about some analog processes, too • Pre-process: follow others when you're lost • Audio is big in data size but sparse in information. Reduce the size. Don't start with end-to-end. • Design: follow others when you're lost • Expect: make sure it's doable

  33. Deep Learning with Audio Signals Q&A Prepare, Process, Design, Expect Keunwoo Choi PS. See you soon at the panel talk!
