Audio: Generation & Extraction Charu Jaiswal Music Composition - PowerPoint PPT Presentation

Audio: Generation & Extraction Charu Jaiswal

Music Composition – which approach?
• A feed-forward NN can’t store information about the past (or keep track of position in the song)
• An RNN used as a single-step predictor struggles with composition, too
• Vanishing gradients: the error flow either vanishes or grows exponentially as it is propagated back through time
• So the network can’t deal with long-term dependencies
• But music is all about long-term dependencies!
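
The exponential decay or growth of the error can be illustrated with a deliberately simplified sketch. It assumes the back-propagated error is multiplied by the same scalar factor at every time step; `error_scale` and `per_step_factor` are hypothetical names, not taken from the slides.

```python
def error_scale(per_step_factor, n_steps):
    """Factor by which an error signal is scaled after n_steps of
    back-propagation, assuming one constant scalar factor per step
    (a simplification of the real, matrix-valued case)."""
    return per_step_factor ** n_steps


for factor in (0.5, 1.0, 2.0):
    # < 1 vanishes, > 1 explodes, exactly 1 keeps the error constant
    print(factor, error_scale(factor, 10))
```

Only a factor of exactly 1 keeps the error constant over many steps.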

Music
• Long-term dependencies define style:
  • Dependencies spanning bars and notes contribute to metrical and phrasal structure
• How do we introduce structure at multiple levels?
• Eck and Schmidhuber → LSTM

Why LSTM?
• Designed to obtain constant error flow through time
• Protects the error from perturbations
• Uses linear units to overcome the decay problems of RNNs
• Input gate: protects the flow from perturbation by irrelevant inputs
• Output gate: protects other units from perturbation by irrelevant memory contents
• Forget gate: resets the memory cell when its content is obsolete
(Hochreiter & Schmidhuber, 1997)
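
The roles of the three gates can be sketched with a single scalar memory cell. This is a minimal illustration in the commonly used LSTM formulation, not the exact network from the cited papers; the parameter names (`w_i`, `u_i`, `b_i`, and so on) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p is a dict of weights."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate
    c = f * c_prev + i * g   # linear update: the "constant error" path
    h = o * math.tanh(c)     # output gate shields other units from c
    return h, c


# Input gate closed (b_i very negative), forget gate open (b_f very positive):
# the stored value survives even though the input x is large.
params = dict(w_i=0.0, u_i=0.0, b_i=-50.0,
              w_f=0.0, u_f=0.0, b_f=50.0,
              w_o=0.0, u_o=0.0, b_o=0.0,
              w_g=1.0, u_g=0.0, b_g=0.0)
h, c = lstm_cell_step(x=3.0, h_prev=0.0, c_prev=0.7, p=params)

# Forget gate closed as well: the memory cell is reset.
reset_params = dict(params, b_f=-50.0)
h_reset, c_reset = lstm_cell_step(x=3.0, h_prev=0.0, c_prev=0.7, p=reset_params)
```

With the input gate closed the cell state stays at its previous value, and closing the forget gate clears it.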

Data Representation
• Chords: only quarter notes, no rests
• Notes: training melodies written by Eck
• Dataset of 4096 segments
(Eck and Schmidhuber, 2002)

Experiment 1 – Learning Chords
• Objective: show that LSTM can learn/represent chord structure in the absence of melody
• Network:
  • 4 cell blocks with 2 cells each, fully connected to each other and to the input
  • Output layer is fully connected to all cells and to the input layer
• Training & testing: predict the probability of a note being on or off
• Use the network’s predictions as input for ensuing time steps, with a decision threshold
• CAVEAT: outputs are treated as statistically independent. This is untrue! (Issue #1)
• Result: generated chord sequences
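
The feed-back generation scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `predict_probs` stands in for a trained network (any callable returning one on-probability per note), and the threshold of 0.5 is an assumed default.

```python
def generate(predict_probs, seed, n_steps, threshold=0.5):
    """Generate a note sequence by feeding thresholded predictions back in.

    predict_probs: hypothetical stand-in for the trained network; takes the
        sequence so far and returns one probability per note.
    seed: first time step, a list of 0/1 note states.
    """
    sequence = [list(seed)]
    for _ in range(n_steps):
        probs = predict_probs(sequence)
        # Each note is thresholded on its own: this is exactly the
        # "statistically independent outputs" simplification.
        next_step = [1 if p >= threshold else 0 for p in probs]
        sequence.append(next_step)
    return sequence


def rotate_predictor(sequence):
    """Toy predictor: expects each active note to move up by one position."""
    prev = sequence[-1]
    n = len(prev)
    return [0.9 if prev[(k - 1) % n] else 0.1 for k in range(n)]


result = generate(rotate_predictor, seed=[1, 0, 0], n_steps=2)
```

Because every output is thresholded separately, nothing prevents the network from switching on note combinations it never saw together, which is the issue flagged in the caveat.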

Experiment 2 – Learning Melody and Chords
• Can LSTM learn chord & melody structure, and use these structures for composition?
• Network:
  • Difference from Experiment 1: chord cell blocks have recurrent connections to themselves and to melody; melody cell blocks are only recurrently connected to melody
• Training: predict the probability of a note being on or off

Sample composition
• Training set: http://people.idsia.ch/~juergen/blues/train.32.mp3
• Chord + melody sample: http://people.idsia.ch/~juergen/blues/lstm_0224_1510.32.mp3

Issues
• No objective way to judge quality of compositions
• Repetition and similarity to training set
• Considered notes to be independent
• Limited to quarter notes + no rests
• Uses symbolic representations (modified sheet notation) → how could it handle real-time performance music (MIDI or audio)?
  • Would allow interaction (live improvisation)

Audio Extraction (source separation)
• How do we separate sources?
• Engineering approach: decompose the mixed audio signal into a spectrogram, then assign each time-frequency element to a source
• Ideal binary mask: each element is attributed to the source with the largest magnitude in the source spectrograms
• This is then used to est. reference separation
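
The ideal binary mask rule can be sketched in a few lines. This is a minimal illustration on plain nested lists; the function names are hypothetical, and ties are assumed to go to the first source listed.

```python
def ideal_binary_masks(source_mags):
    """Return one 0/1 mask per source.

    source_mags: list of magnitude spectrograms of identical shape, each a
        list of rows (frequency bins) of per-frame magnitudes.
    """
    n_rows = len(source_mags[0])
    n_cols = len(source_mags[0][0])
    masks = [[[0] * n_cols for _ in range(n_rows)] for _ in source_mags]
    for r in range(n_rows):
        for c in range(n_cols):
            mags = [s[r][c] for s in source_mags]
            winner = mags.index(max(mags))  # assumption: ties go to first source
            masks[winner][r][c] = 1
    return masks


def apply_mask(mixture_mag, mask):
    """Keep only the time-frequency elements assigned to one source."""
    return [[m * k for m, k in zip(m_row, k_row)]
            for m_row, k_row in zip(mixture_mag, mask)]


vocal = [[3.0, 1.0], [0.0, 2.0]]
accomp = [[1.0, 2.0], [5.0, 2.0]]
vocal_mask, accomp_mask = ideal_binary_masks([vocal, accomp])
```

Computing the mask requires the isolated source spectrograms, which is why it serves as a reference rather than a practical separator.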

DNN Approach
• Dataset: 63 pop songs (50 for training)
• Binary mask computed by comparing the magnitudes of the vocal and non-vocal spectrograms and assigning the mask a ‘1’ where the vocal had the greater magnitude

DNN
• Trained a feed-forward DNN to predict binary masks for separating the vocal and non-vocal signals of a song
• Each spectrogram window was unpacked into a vector
• Probabilistic binary mask: testing used a sliding window, and the model output described predictions of the binary mask in sliding-window format
• Confidence threshold (alpha): applied to obtain the binary mask Mv
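
One plausible way to turn sliding-window predictions into a thresholded mask is sketched below. The details are assumptions rather than facts from the slides: overlapping predictions are averaged per time-frequency element, each window vector is laid out frequency-major, and the function names are hypothetical.

```python
def probabilistic_mask(window_preds, n_freq, n_frames, win_len, hop=1):
    """Average overlapping window predictions into one mask in [0, 1].

    window_preds: list of flat vectors of length n_freq * win_len, one per
        window position (assumed frequency-major layout).
    """
    total = [[0.0] * n_frames for _ in range(n_freq)]
    count = [[0] * n_frames for _ in range(n_freq)]
    for w, pred in enumerate(window_preds):
        start = w * hop
        for f in range(n_freq):
            for t in range(win_len):
                total[f][start + t] += pred[f * win_len + t]
                count[f][start + t] += 1
    return [[total[f][t] / count[f][t] if count[f][t] else 0.0
             for t in range(n_frames)]
            for f in range(n_freq)]


def threshold_mask(prob_mask, alpha):
    """Binary mask: 1 where the averaged prediction exceeds alpha."""
    return [[1 if p > alpha else 0 for p in row] for row in prob_mask]


# One frequency bin, three frames, two overlapping windows of length 2.
preds = [[1.0, 0.0], [1.0, 0.0]]
prob = probabilistic_mask(preds, n_freq=1, n_frames=3, win_len=2)
```

Raising alpha keeps only the elements the model is more confident about, which trades interference against artefacts.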

Separation of sources using DNN

Separation quality as a function of alpha
• SIR (red) = signal-to-interference ratio
• SDR (green) = signal-to-distortion ratio
• SAR (blue) = signal-to-artefact ratio
• SAR and SIR can be interpreted as energetic equivalents of positive hit rate (SIR) and false positive rate (SAR)
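
For reference, the three ratios can be computed as below once an estimated source has been decomposed into a target part, an interference part, and an artefact part. This sketch follows the standard BSS Eval energy-ratio definitions with the noise term omitted; it assumes the decomposition is already given, and the function names are hypothetical.

```python
import math


def energy(signal):
    return sum(v * v for v in signal)


def separation_ratios(s_target, e_interf, e_artif):
    """Return (SDR, SIR, SAR) in dB for an estimate
    s_hat = s_target + e_interf + e_artif (noise term omitted)."""
    distortion = [a + b for a, b in zip(e_interf, e_artif)]
    target_plus_interf = [a + b for a, b in zip(s_target, e_interf)]
    sdr = 10 * math.log10(energy(s_target) / energy(distortion))
    sir = 10 * math.log10(energy(s_target) / energy(e_interf))
    sar = 10 * math.log10(energy(target_plus_interf) / energy(e_artif))
    return sdr, sir, sar


sdr, sir, sar = separation_ratios(
    s_target=[1.0, 0.0],
    e_interf=[0.0, 0.1],
    e_artif=[0.1, 0.0],
)
```

In practice the decomposition itself is the hard part and is usually left to an evaluation toolkit.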

Like-to-like Comparison
• Plots mean SAR as a function of mean SIR for both models
• DNN provides ~3 dB better SAR performance for a given SIR on average: ~5 dB for vocal and only a small advantage for non-vocal signals
• DNN seems to have biased its learning toward making good predictions via correct positive identification of vocal sounds

Critique of Paper + Next Steps
• DNN seems to have biased its learning toward making good predictions via correct positive identification of vocal sounds
• Only a small advantage to using DNN vs. traditional approach
• Expand data set
