Speech Enhancement Based on Adaptive Line Enhancer
Research Thesis
Aviva Atkins
April 7th, 2020 Supervised by Prof. Israel Cohen
▪ Introduction
▪ The problem researched
▪ The challenges
▪ Research contributions
▪ Adaptive Line Enhancer background
  ▪ Conventional fixed step size
  ▪ Mutual Information approach
▪ Proposed method
▪ Conclusions and future research
Sound test
[Diagram: source signal corrupted by interference – reverberation, echo, additive noise]
Speech Enhancement
▪ Contains deterministic sinusoidal components
Reducing nonstationary harmonic noise from a speech signal recorded with a single microphone
▪ Single channel: only the noisy signal is available, with no access to additional reference signals and no spatial information → only intrinsic properties of the speech or the noise can be used
▪ The vast majority of methods require an estimate of the noise spectrum
▪ When the noise is stationary, it can be estimated during segments when speech is absent
▪ When the noise is nonstationary, it needs to be tracked continuously → it is more difficult to estimate nonstationary noise
▪ Trade-off between noise reduction and speech distortion
▪ The developed method needs to be relevant for real-time applications
▪ Introduced a filtering method based on the frequency-domain Adaptive Line Enhancer that enables better reduction of nonstationary harmonic noise.
▪ Proposed the combined filter: the commonly-used forward adaptive linear filter and a non-causal backward adaptive linear filter, used together to increase the reduction span over the noise transient.
▪ Applied the filter based on a comparison to the noisy spectrum, reducing noise.
▪ Applied the filter based on a noise presence indicator for better speech preservation.
▪ Employed a set of filter lengths, to ensure the combined filter spans the entire noise transient.
▪ Investigated a statistical model as an alternative to the Decision-Directed a-priori SNR estimator and showed that it can eliminate the musical noise while compromising between signal distortion and noise reduction.
▪ Introduced a beamformer that enables fine tuning of the compromise between Directivity Factor and White Noise Gain, through a simple, computationally-efficient algorithm.
▪ Exploits the structure of the harmonic noise
▪ Simple, with low computational cost
▪ Modifies both magnitude and phase, so it has the potential to improve signal intelligibility and not just quality
Adaptive Noise Canceller:
▪ Primary input: 𝑥(𝑛) + 𝑣(𝑛) (signal source plus noise); reference input: 𝑣₀(𝑛), correlated with the noise
▪ The adaptive filter produces a noise estimate 𝑣̂(𝑛); the system output is the error 𝑥̂(𝑛) = 𝑥(𝑛) + 𝑣(𝑛) − 𝑣̂(𝑛)
▪ Since the signal and the noise are uncorrelated:
E[𝑥̂²] = E[𝑥²] + E[(𝑣 − 𝑣̂)²]
min E[𝑥̂²] ⟺ min E[(𝑣 − 𝑣̂)²] ⟺ min E[(𝑥 − 𝑥̂)²]
▪ Ideal case: 𝑣̂ = 𝑣, hence 𝑥̂ = 𝑥 (the noise is cancelled without distortion)
Adaptive Line Enhancer (ALE):
▪ Single input: the noisy signal 𝑦(𝑛) = 𝑥(𝑛) + 𝑣(𝑛); the reference input is a delayed copy 𝑦(𝑛 − 𝜏)
▪ The delay 𝜏 decorrelates the speech while the harmonic noise remains correlated, so the adaptive filter predicts only the noise
▪ The filter output 𝑣̂(𝑛) is the noise estimate; the error 𝑒(𝑛) = 𝑦(𝑛) − 𝑣̂(𝑛) is the enhanced signal
Frequency-Domain ALE:
▪ The ALE is applied per frequency bin in the STFT domain
▪ Noisy spectrum: 𝑌(𝑘, 𝑚) = 𝑋(𝑘, 𝑚) + 𝑉(𝑘, 𝑚), with frequency bin 𝑘 and frame index 𝑚
▪ The reference is the noisy spectrum delayed by 𝜏 frames
▪ The filter output 𝑉̂(𝑘, 𝑚) estimates the harmonic noise; the error 𝐸(𝑘, 𝑚) = 𝑌(𝑘, 𝑚) − 𝑉̂(𝑘, 𝑚) is the enhanced spectrum
NLMS adaptation in the frequency domain:
𝐸(𝑘, 𝑚) = 𝑌(𝑘, 𝑚) − 𝐡ᴴ(𝑘, 𝑚) 𝐲(𝑘, 𝑚)
𝐡(𝑘, 𝑚) = [𝐻₀(𝑘, 𝑚), …, 𝐻_{𝐿−1}(𝑘, 𝑚)]ᵀ
𝐲(𝑘, 𝑚) = [𝑌(𝑘, 𝑚 − 𝜏), …, 𝑌(𝑘, 𝑚 − 𝜏 − 𝐿 + 1)]ᵀ
NLMS: 𝐡(𝑘, 𝑚 + 1) = 𝐡(𝑘, 𝑚) + 𝜇 𝐲(𝑘, 𝑚) 𝐸*(𝑘, 𝑚) / ‖𝐲(𝑘, 𝑚)‖²
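As a concrete illustration, here is a minimal sketch of the per-bin frequency-domain ALE with NLMS adaptation, following the equations above; the function and parameter names are hypothetical, and missing past frames are simply zero-padded.

```python
import numpy as np

def fd_ale_nlms(Y, tau=3, L=1, mu=0.5, eps=1e-8):
    """Frequency-domain Adaptive Line Enhancer with NLMS adaptation (sketch).

    Y   : complex STFT of the noisy signal, shape (K freqs, M frames)
    tau : prediction delay in frames (decorrelates speech, keeps harmonic noise correlated)
    L   : filter length per frequency bin
    mu  : NLMS step size
    Returns the error E (enhanced spectrum) and the noise estimate V_hat.
    """
    K, M = Y.shape
    h = np.zeros((K, L), dtype=complex)          # per-bin filter coefficients
    E = np.zeros_like(Y)
    V_hat = np.zeros_like(Y)
    for m in range(M):
        # delayed input vector y(k, m) = [Y(k, m-tau), ..., Y(k, m-tau-L+1)]
        idx = [m - tau - i for i in range(L)]
        y = np.stack([Y[:, i] if i >= 0 else np.zeros(K, complex) for i in idx], axis=1)
        V_hat[:, m] = np.sum(np.conj(h) * y, axis=1)       # h^H y per bin
        E[:, m] = Y[:, m] - V_hat[:, m]                    # error = enhanced spectrum
        norm = np.sum(np.abs(y) ** 2, axis=1) + eps
        h += mu * y * np.conj(E[:, m])[:, None] / norm[:, None]   # NLMS update
    return E, V_hat
```

With a perfectly frame-correlated (tonal) component, the filter converges so that the error tends to zero, which is the mechanism the step-size discussion below relies on.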
For the conventional fixed step size, it is difficult to both reduce the noise and maintain high quality of the enhanced signal
[Spectrograms vs. frame index: (a) Clean signal, (b) Noisy signal, (c) Enhanced signal; 𝐿 = 1, 𝜏 = 3]
Taghia, J. and Martin, R., "A frequency-domain adaptive line enhancer with step-size control based on mutual information for harmonic noise reduction," IEEE/ACM Trans. Audio, Speech, Lang. Process., 2016.
▪ Frequency-dependent step size, detecting harmonic noise presence per frequency
▪ Based on Mutual Information (MI)
▪ Step size: 𝜇(𝑘) = 𝑄(𝑘) 𝜇₀, with a constant 𝜇₀ and a per-frequency decision
𝑄(𝑘) = 1 if 𝐼(𝑘) ≥ (thr/𝐾) Σₖ′ 𝐼(𝑘′), and 𝑄(𝑘) = 0 otherwise
where 𝐼(𝑘) is the estimated mutual information between the filter input and output at frequency 𝑘, and the threshold is set relative to the total MI over the 𝐾 frequencies.
[(b) Noisy signal, (c) MI step-size decision 𝑄(𝑘) vs. frame index: 𝑄 = 1 where harmonic noise is detected]
𝜈
Frame Index Frame Index
(a) Clean signal (c) Enhanced signal - MI Frame Index
(b) Enhanced signal – fixed step size
▪ Implemented in a block-wise manner
▪ Assumption: the noise is stationary over at least the block length
▪ Taghia and Martin use a block length of 3 seconds
The assumption does not hold for highly non-stationary signals, such as heart monitor beeping: for such signals the MI decision 𝑄(𝑘) is often zero, so the noise is not reduced.
Spectrogram of a 3.4 s long heart monitor beeping.
[(a) Noisy signal, (b) MI step-size decision vs. frame index: 𝑄 = 0 almost everywhere]
[Spectrograms vs. frame index: (a) Clean signal, (b) Noisy signal, (c) Enhanced signal – MI: the noise is left almost untouched]
Proposed: a non-causal backward ALE
▪ The forward ALE predicts the noise from past frames (delay +𝜏); a non-causal backward ALE predicts it from future frames (delay −𝜏)
▪ Used together, the forward and backward filters span a larger part of the noise transient
▪ Clean speech: 20 different speech signals from different speakers from the TIMIT database (half male, half female)
▪ Sampled at 16 kHz
▪ SNR range [0, 20] dB
▪ STFT with overlap-add
▪ Noise: 26 different non-stationary harmonic noise signals, e.g., heart monitor beeping, train door beeping, house alarm, railroad crossing bells
Normalized frame-correlation of the speech and of the noise as a function of the frame lag 𝜏:
𝛿_𝑋(𝑘, 𝑚, 𝜏) = E[𝑋(𝑘, 𝑚) 𝑋*(𝑘, 𝑚 − 𝜏)] / E[|𝑋(𝑘, 𝑚)|²]
𝛿_𝑉(𝑘, 𝑚, 𝜏) = E[𝑉(𝑘, 𝑚) 𝑉*(𝑘, 𝑚 − 𝜏)] / E[|𝑉(𝑘, 𝑚)|²]
[Correlation vs. lag 𝜏 in frames; 1 frame = 32 ms]
▪ Combined filter (CMLNLMS): per time-frequency bin,
𝐸_𝑐(𝑘, 𝑚) =
  𝐸_𝑏(𝑘, 𝑚 + 𝐿), if |𝐸_𝑏(𝑘, 𝑚 + 𝐿)|² ≤ |𝐸_𝑓(𝑘, 𝑚)|² and |𝐸_𝑏(𝑘, 𝑚 + 𝐿)|² ≤ |𝑌(𝑘, 𝑚)|²
  𝐸_𝑓(𝑘, 𝑚), if |𝐸_𝑏(𝑘, 𝑚 + 𝐿)|² > |𝐸_𝑓(𝑘, 𝑚)|² and |𝐸_𝑓(𝑘, 𝑚)|² ≤ |𝑌(𝑘, 𝑚)|²
  𝑌(𝑘, 𝑚), else
where 𝐸_𝑓 is the forward-filter error, 𝐸_𝑏 the backward-filter error, and 𝐿 the filter length.
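The per-bin selection rule can be sketched as follows, assuming the forward and backward error spectra are already computed; the array names are hypothetical, and frames near the edge are simply clipped:

```python
import numpy as np

def combine(E_f, E_b, Y, L):
    """Per-bin selection of the combined-filter output (sketch).

    E_f : forward-filter error spectrum, shape (K, M)
    E_b : backward-filter error spectrum, shape (K, M)
    Y   : noisy spectrum, shape (K, M)
    L   : filter length; the backward error is taken L frames ahead
    """
    K, M = Y.shape
    E_c = Y.copy()                       # default: keep the noisy bin
    for m in range(M):
        mb = min(m + L, M - 1)           # backward error L frames ahead (clipped)
        eb = np.abs(E_b[:, mb]) ** 2
        ef = np.abs(E_f[:, m]) ** 2
        yy = np.abs(Y[:, m]) ** 2
        use_b = (eb <= ef) & (eb <= yy)  # backward error smallest
        use_f = (eb > ef) & (ef <= yy)   # forward error smallest
        E_c[use_b, m] = E_b[use_b, mb]
        E_c[use_f, m] = E_f[use_f, m]
    return E_c
```

The rule never outputs a bin whose magnitude exceeds the noisy one, which is why it reduces noise without introducing extra energy.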
▪ Harmonic noise presence indicator for better speech preservation:
𝐼(𝑘, 𝑚) = 1 if 𝑉(𝑘, 𝑚) ∈ ℋ₁ (harmonic noise present), 0 if 𝑉(𝑘, 𝑚) ∈ ℋ₀ (noise absent)
▪ Set of filters with increasing lengths, up to the maximal filter length 𝐿, based on the available number of frames, to ensure the combined filter spans the noise transient
▪ Distortion Index
▪ Noise Reduction Factor
▪ Perceptual Evaluation of Speech Quality (PESQ), ITU-T P.862.2
▪ Short-Time Objective Intelligibility (STOI)
[NRR [dB] vs. frame index; 𝐿 = 3, 𝜏 = 3, 𝜇 = 0.5, indicator threshold −25 dB]
Better noise reduction, which leads to improved noise-reduction factor, PESQ, and STOI levels for the combined filter.
▪ An appropriate selection of the step size is required
[NRR [dB] vs. frame index for a fixed step size, the MI step size, and the maximal step size; 𝜇 = 0.5]
[PESQ, STOI, distortion index [dB], and noise-reduction factor [dB] vs. delay 𝜏 [frames]; 𝜇 = 0.5, indicator threshold −25 dB]
Combined & MI-Combined show better results than MI
[PESQ, STOI, distortion index [dB], and noise-reduction factor [dB] vs. delay 𝜏 [frames]; 𝜇 = 0.5, indicator threshold −25 dB]
Recommendation: a short filter length; Combined & MI-Combined show better results than MI
[𝐿 = 1, 𝜏 = 3, indicator threshold −25 dB]
[Spectrograms vs. frame index: (a) Clean signal, (b) Noisy signal, (c) Enhanced signal – MI, (d) Enhanced signal – Combined]
Conclusions:
▪ Introduced the combined filter
▪ Parameter selection
▪ Noise presence indicator impact
▪ Improved results compared to other methods
Future research:
▪ Noise presence indicator implementation
▪ Residual noise at transient edges
▪ Deep learning approach for noise reduction
▪ We investigate the use of the autoregressive conditional heteroscedasticity (ARCH) model as a replacement for the well-known decision-directed estimator of Ephraim and Malah.
▪ We employ three sound quality measures (speech distortion, noise reduction, and musical noise) and explain the effect the ARCH model parameters have on these measures.
▪ We demonstrate that the ARCH model achieves better results than the decision-directed estimator for some of these measures, while compromising between the speech distortion and the noise reduction.
▪ Let 𝑌ℓ(𝑘) = 𝑋ℓ(𝑘) + 𝐷ℓ(𝑘) denote an observed noisy speech signal in the STFT domain.
▪ Given an error function between the clean signal and its estimate, the spectral enhancement problem can be formulated as
𝑋̂ℓ(𝑘) = argmin_{𝑋̂} E[ 𝑓(𝑋ℓ(𝑘), 𝑋̂ℓ(𝑘)) | 𝑌₀(𝑘), …, 𝑌ℓ′(𝑘) ]
▪ We consider the causal case (ℓ′ ≤ ℓ) and the LSA error function
𝑓_LSA(𝑋ℓ(𝑘), 𝑋̂ℓ(𝑘)) = ( log|𝑋ℓ(𝑘)| − log|𝑋̂ℓ(𝑘)| )²
▪ The estimate is obtained by applying a spectral gain to each noisy spectral component:
𝑋̂ℓ(𝑘) = 𝐻_LSA(𝜉_{ℓ|ℓ′}, 𝛾ℓ) ∙ 𝑌ℓ(𝑘)
where the a-priori and a-posteriori SNRs are defined, respectively, by
𝜉_{ℓ|ℓ′} ≜ 𝜆_{ℓ|ℓ′} / 𝜎ℓ² , 𝛾ℓ ≜ |𝑌ℓ|² / 𝜎ℓ²
𝜎ℓ² = E[|𝐷ℓ|²] denotes the short-term spectrum of the noise, and
𝜆_{ℓ|ℓ′} = E[ |𝑋ℓ|² | 𝑌₀(𝑘), …, 𝑌ℓ′(𝑘) ]
denotes the conditional short-term spectrum of the speech signal.
Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-32, pp. 1109–1121, December 1984.
▪ Over the past decades, the decision-directed (DD) approach has become the accepted estimation method for the a-priori SNR:
𝜉̂_{ℓ|ℓ} = max( 𝛼 |𝑋̂_{ℓ−1}|² / 𝜎ℓ² + (1 − 𝛼) 𝑃[𝛾ℓ − 1], 𝜉min )
where 𝑃[𝑥] = 𝑥 if 𝑥 ≥ 0 and 𝑃[𝑥] = 0 otherwise.
▪ The decision-directed approach is not supported by a statistical model.
▪ 𝛼 and 𝜉min have to be determined by simulations.
▪ 𝛼 and 𝜉min are fixed constants and are not adapted to the speech components.
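The DD recursion can be sketched for a single frequency bin as follows; as a simplifying assumption, the previous-frame amplitude estimate uses the Wiener gain rather than the LSA gain, and the parameter values are illustrative:

```python
import numpy as np

def decision_directed(gamma, alpha=0.98, xi_min=10 ** (-15 / 10)):
    """Decision-directed a-priori SNR estimation (sketch).

    gamma  : a-posteriori SNRs |Y_l|^2 / sigma_l^2 over frames, one frequency bin
    alpha  : fixed weighting factor
    xi_min : a-priori SNR floor
    """
    xi_prev = xi_min
    H_prev = xi_prev / (1 + xi_prev)     # Wiener gain stand-in for the LSA gain
    gamma_prev = gamma[0]
    xi = np.empty_like(gamma, dtype=float)
    for l, g in enumerate(gamma):
        # |X_hat_{l-1}|^2 / sigma^2 = H_{l-1}^2 * gamma_{l-1}
        xi[l] = max(alpha * H_prev ** 2 * gamma_prev
                    + (1 - alpha) * max(g - 1.0, 0.0), xi_min)
        H_prev = xi[l] / (1 + xi[l])
        gamma_prev = g
    return xi
```

Note how the fixed alpha applies the same smoothing everywhere, which is exactly the limitation the ARCH model addresses.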
▪ The GARCH (generalized autoregressive conditional heteroscedasticity) model is extensively used in financial applications, where it is necessary to model time-varying volatility while taking into account heavy-tailed behavior and volatility clustering.
▪ Recently [1], it was proposed to use the GARCH for statistically modeling speech signals in the STFT domain, as they show these two characteristics.
▪ In this work, we investigate the use of a simplified case of the GARCH, the ARCH model, for a-priori SNR estimation; we evaluate it with commonly used performance measures and compare it to the decision-directed estimator.
[1] I. Cohen, “Modeling speech signals in the time frequency domain using GARCH,” Signal Processing, vol. 84 (12), pp. 2453–2459, 2004.
We use a two-step estimator to recursively update the estimate of the conditional a-priori SNR as new data arrives. Given an estimate 𝜉̂_{ℓ|ℓ−1} and a new noisy spectral component 𝑌ℓ:
Update step: 𝜉̂_{ℓ|ℓ} = E[ |𝑋ℓ|² / 𝜎ℓ² | 𝜉̂_{ℓ|ℓ−1}, 𝑌ℓ ]
Using ARCH(1), propagate the a-priori SNR to obtain the one-frame-ahead a-priori SNR:
Propagation step: 𝜉̂_{ℓ|ℓ−1} = 𝜅 + 𝜇 𝜉̂_{ℓ−1|ℓ−1}, 𝜅 > 0, 0 ≤ 𝜇 < 1
▪ Solving for the update step we get:
𝜉̂_{ℓ|ℓ} = 𝐻²_SP(𝜉̂_{ℓ|ℓ−1}, 𝛾ℓ) ∙ 𝛾ℓ
where 𝐻_SP(𝜉_{ℓ|ℓ′}, 𝛾ℓ) = [ (𝜉_{ℓ|ℓ′}/(𝜉_{ℓ|ℓ′}+1)) (1/𝛾ℓ + 𝜉_{ℓ|ℓ′}/(𝜉_{ℓ|ℓ′}+1)) ]^{1/2}
▪ Employing some algebra, we can write:
𝜉̂_{ℓ|ℓ} = 𝛼ℓ 𝜉̂_{ℓ|ℓ−1} + (1 − 𝛼ℓ)(𝛾ℓ − 1)
where 𝛼ℓ = 1 − ( 𝜉̂_{ℓ|ℓ−1}/(𝜉̂_{ℓ|ℓ−1}+1) )², 𝛼ℓ ∈ (0, 1]
▪ Note the similarity in form to the decision-directed estimator, but with a time-varying, frequency-dependent weighting factor 𝛼ℓ.
▪ Since the a-priori SNR needs to equal 𝜉min when speech is absent, we set 𝜅 = (1 − 𝜇) 𝜉min.
▪ Using ARCH(1) we have two parameters, 𝜉min and 𝜇:
Propagation step: 𝜉̂_{ℓ|ℓ−1} = (1 − 𝜇) 𝜉min + 𝜇 𝜉̂_{ℓ−1|ℓ−1}
Update step: 𝜉̂_{ℓ|ℓ} = 𝛼ℓ 𝜉̂_{ℓ|ℓ−1} + (1 − 𝛼ℓ)(𝛾ℓ − 1), where 𝛼ℓ = 1 − (𝜉̂_{ℓ|ℓ−1}/(𝜉̂_{ℓ|ℓ−1}+1))², 𝛼ℓ ∈ (0, 1]
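The two-step ARCH(1) recursion above can be sketched per frequency bin; the parameter values here are illustrative, not the ones used in the thesis:

```python
import numpy as np

def arch_apriori_snr(gamma, mu=0.9, xi_min=10 ** (-15 / 10)):
    """Two-step ARCH(1) a-priori SNR estimation (sketch).

    gamma  : a-posteriori SNRs over frames, one frequency bin
    Propagation: xi_{l|l-1} = (1-mu)*xi_min + mu*xi_{l-1|l-1}
    Update:      xi_{l|l}   = a_l*xi_{l|l-1} + (1-a_l)*(gamma_l - 1),
                 a_l = 1 - (xi_{l|l-1}/(xi_{l|l-1}+1))**2
    """
    xi_cond = xi_min                                   # xi_{0|-1}
    xi = np.empty_like(gamma, dtype=float)
    for l, g in enumerate(gamma):
        a = 1.0 - (xi_cond / (xi_cond + 1.0)) ** 2     # time-varying weighting
        xi[l] = a * xi_cond + (1.0 - a) * (g - 1.0)
        xi_cond = (1.0 - mu) * xi_min + mu * xi[l]     # propagate to next frame
    return xi
```

Unlike the DD recursion, the weighting factor here adapts per frame: it is close to 1 (heavy smoothing) at low SNR and decreases when the SNR estimate grows.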
We employ three performance measures commonly used for the quality assessment of a speech enhancement algorithm. The first two are easily understood when we express the estimated signal as
𝑋̂ℓ = 𝐻(𝜉_{ℓ|ℓ′}, 𝛾ℓ) 𝑋ℓ + 𝐻(𝜉_{ℓ|ℓ′}, 𝛾ℓ) 𝐷ℓ = 𝑋_fd + 𝐷_rn
Speech distortion: 𝐽_𝑋 ≜ E[ ( log|𝑋ℓ(𝑘)| − log|𝐻(𝜉_{ℓ|ℓ′}, 𝛾ℓ) 𝑋ℓ(𝑘)| )² ]
Noise Reduction Ratio (NRR): NRR ≜ E[|𝐷ℓ|²] / E[|𝐻(𝜉_{ℓ|ℓ′}, 𝛾ℓ) 𝐷ℓ|²]
The attenuated noise will be composed of isolated spectral components, also known as tonal components. The amount of tonal components can be quantified by the kurtosis: kurtosis = 𝜇₄/𝜇₂², where 𝜇ₙ is the 𝑛th-order central moment of the signal.
As we are interested in the amount of tonal components caused by the processing, we use the log-kurtosis ratio before and after processing:
LKR ≜ log₁₀( kurtosis_proc / kurtosis_orig )
which is evaluated on noise-only frames. The LKR increases as the musical noise increases; the absence of musical noise corresponds to an LKR of zero or below.
Analytical calculation of the kurtosis ratio requires the use of a specific noise reduction method, or assumptions about the statistics of the spectral components. Here, we use the sample kurtosis, averaged over the 𝑀 noise-only frames:
kurtosis = (1/𝑀) Σ_{ℓ=0}^{𝑀−1} [ (1/𝑁) Σ_{𝑘=0}^{𝑁−1} ( |𝐷ℓ(𝑘)|² − ⟨|𝐷ℓ|²⟩ )⁴ ] / [ (1/𝑁) Σ_{𝑘=0}^{𝑁−1} ( |𝐷ℓ(𝑘)|² − ⟨|𝐷ℓ|²⟩ )² ]²
where ⟨|𝐷ℓ|²⟩ = (1/𝑁) Σ_{𝑘=0}^{𝑁−1} |𝐷ℓ(𝑘)|².
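A sketch of the sample-kurtosis and LKR computation, assuming the noise spectra are given as an array of frames (names are hypothetical):

```python
import numpy as np

def sample_kurtosis(D):
    """Sample kurtosis of the squared spectral magnitudes, averaged over frames.

    D : complex spectra of noise-only frames, shape (M frames, N bins)
    """
    P = np.abs(D) ** 2                        # |D_l(k)|^2
    Pc = P - P.mean(axis=1, keepdims=True)    # remove the per-frame mean
    m4 = (Pc ** 4).mean(axis=1)
    m2 = (Pc ** 2).mean(axis=1)
    return np.mean(m4 / m2 ** 2)

def lkr(D_proc, D_orig):
    """Log-kurtosis ratio; values above zero indicate added musical (tonal) noise."""
    return np.log10(sample_kurtosis(D_proc) / sample_kurtosis(D_orig))
```

For complex Gaussian noise, |D|² is exponentially distributed, whose kurtosis is 9, so the estimator should hover around that value on unprocessed noise.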
▪ Speech signals: 20 different utterances from 20 different speakers, sampled at 16 kHz and degraded by white Gaussian noise with SNRs in the range [0, 20] dB.
▪ The noisy signals are transformed to the time-frequency domain using the STFT, with 75% overlapping Hamming analysis windows of 32 ms length.
▪ The evaluation of the musical noise was done separately on a complex white Gaussian noise in the time-frequency domain, to emulate performance in noise-only frames.
Comparison of the decision-directed (solid lines) and ARCH (dashed lines) estimators for 5 dB SNR: (a) distortion, (b) NRR, and (c) LKR, with varying 𝛼 (upper axis) and 𝜇 (lower axis) respectively per estimator, and 𝜉min of −20 dB (square), −15 dB (circle), and, for the decision-directed method only, 𝜉min = 0 (triangle).
We get the expected decision-directed behavior
For the ARCH estimator, increasing the value of 𝜇 decreases the distortion.
As 𝜇 increases, the NRR decreases as well. The lower we take the noise floor 𝜉min, the more noise reduction we get.
The musical noise mainly depends on the noise floor 𝜉min. Lower 𝜉min means higher 𝛼ℓ, resulting in a smoother a-priori SNR around 𝜉min, thus reducing the musical noise.
For the decision-directed estimator we have to compromise between the amount of distortion and the amount of musical noise, while for the ARCH estimator the musical noise can be eliminated by choosing an appropriate value of 𝜉min. However, for the ARCH estimator we need to compromise between the amount of distortion and the amount of residual noise.
Results summary:
▪ We presented the use of the ARCH estimator, which is based on a statistical model.
▪ We explained the effect the ARCH model parameters have on three commonly used quality measures.
▪ We demonstrated that the ARCH model can achieve better results than the decision-directed estimator, while compromising between the speech distortion and the noise reduction.
Future work:
▪ We used the ARCH(1) model for the a-priori SNR estimator, which is a special case of the GARCH(0,1). It would be interesting to expand the model to a full GARCH(p,q) model and conduct a similar analysis, to understand whether the full general model could provide additional advantages.
▪ We introduce an optimal beamformer design that facilitates a compromise between high directivity and low white noise amplification. ▪ The proposed beamformer involves a regularization factor, whose optimal value is determined using a simple and efficient one-dimensional search algorithm. ▪ Simulation results demonstrate controlled tuning of various gain properties of the desired beamformer, and improved performance compared to a competing method.
▪ We consider a plane wave, in the farfield, impinging on the array at angle 𝜃
▪ Uniform linear microphone array of 𝑁 sensors, with distance 𝛿 between them
▪ The desired signal 𝑋(𝜔) propagates from 𝜃 = 0 (endfire)
▪ Neglecting the propagation attenuation, the observed signal is
𝐲(𝜔) = 𝐝(𝜔, 𝜃) 𝑋(𝜔) + 𝐯(𝜔)
where 𝐝(𝜔, 𝜃) is the steering vector and 𝐯(𝜔) is the additive noise vector:
𝐝(𝜔, 𝜃) = [1, 𝑒^{−𝑗𝜔𝜏₀ cos 𝜃}, …, 𝑒^{−𝑗(𝑁−1)𝜔𝜏₀ cos 𝜃}]ᵀ, 𝜏₀ = 𝛿/𝑐
▪ For the endfire direction, 𝐝(𝜔) = 𝐝(𝜔, 0)
▪ Applying a complex linear filter 𝐡(𝜔), the estimated signal is
𝑋̂(𝜔) = 𝐡ᴴ(𝜔) 𝐲(𝜔) = 𝐡ᴴ(𝜔) 𝐝(𝜔) 𝑋(𝜔) + 𝐡ᴴ(𝜔) 𝐯(𝜔)
▪ The beamformer is distortionless when 𝐡ᴴ(𝜔) 𝐝(𝜔) = 1
▪ Taking the first microphone as reference, we define the input and output SNRs:
iSNR(𝜔) = 𝜑_𝑋(𝜔) / 𝜑_{𝑉₁}(𝜔)
oSNR[𝐡(𝜔)] = 𝜑_𝑋(𝜔)/𝜑_{𝑉₁}(𝜔) × |𝐡ᴴ(𝜔) 𝐝(𝜔)|² / (𝐡ᴴ(𝜔) 𝚪_𝐯(𝜔) 𝐡(𝜔))
where 𝜑_𝐴(𝜔) = E(|𝐴(𝜔)|²) is the variance of 𝐴 ∈ {𝑋, 𝑉₁}, and 𝚪_𝐯(𝜔) = E[𝐯(𝜔) 𝐯ᴴ(𝜔)] / 𝜑_{𝑉₁}(𝜔) is the pseudo-coherence matrix of the noise.
▪ We deduce the gain in SNR:
𝒢[𝐡(𝜔)] = oSNR[𝐡(𝜔)] / iSNR(𝜔) = |𝐡ᴴ(𝜔) 𝐝(𝜔)|² / (𝐡ᴴ(𝜔) 𝚪_𝐯(𝜔) 𝐡(𝜔))
▪ White Noise Gain (WNG): for 𝚪_𝐯(𝜔) = 𝐈_𝑁,
𝒲[𝐡(𝜔)] = |𝐡ᴴ(𝜔) 𝐝(𝜔)|² / (𝐡ᴴ(𝜔) 𝐡(𝜔))
▪ Directivity Factor (DF): for 𝚪_𝐯(𝜔) = 𝚪_𝐝(𝜔) = (1/2) ∫₀^𝜋 𝐝(𝜔, 𝜃) 𝐝ᴴ(𝜔, 𝜃) sin 𝜃 d𝜃,
𝒟[𝐡(𝜔)] = |𝐡ᴴ(𝜔) 𝐝(𝜔)|² / (𝐡ᴴ(𝜔) 𝚪_𝐝(𝜔) 𝐡(𝜔))
▪ Delay-and-Sum (DS): maximizes the WNG subject to the distortionless constraint
𝐡_DS(𝜔, 𝜃) = 𝐝(𝜔, 𝜃) / 𝑁
𝒲[𝐡_DS(𝜔, 𝜃)] = 𝑁 = 𝒲_max
𝒟[𝐡_DS(𝜔, 𝜃)] = 𝑁² / (𝐝ᴴ(𝜔, 𝜃) 𝚪_𝐝(𝜔) 𝐝(𝜔, 𝜃)) ≥ 1
While the DS maximizes the WNG, it never amplifies diffuse noise.
▪ Superdirective (SD): maximizes the DF subject to the distortionless constraint, for the specific case of 𝜃 = 0 and small 𝛿:
𝐡_SD(𝜔) = 𝚪_𝐝⁻¹(𝜔) 𝐝(𝜔) / (𝐝ᴴ(𝜔) 𝚪_𝐝⁻¹(𝜔) 𝐝(𝜔))
While maximizing the DF, 𝐡_SD(𝜔) can amplify the white noise, especially at low frequencies.
▪ Robust superdirective:
𝐡_{S,𝜀}(𝜔) = (𝚪_𝐝(𝜔) + 𝜀𝐈_𝑁)⁻¹ 𝐝(𝜔) / (𝐝ᴴ(𝜔) (𝚪_𝐝(𝜔) + 𝜀𝐈_𝑁)⁻¹ 𝐝(𝜔))
where 𝜀 ≥ 0 is a Lagrange multiplier, which enables a compromise between the DF and the WNG.
If we define 𝚪_𝜀(𝜔) = 𝚪_𝐝(𝜔) + 𝜀𝐈_𝑁, we can write
𝐡_{S,𝜀}(𝜔) = 𝚪_𝜀⁻¹(𝜔) 𝐝(𝜔) / (𝐝ᴴ(𝜔) 𝚪_𝜀⁻¹(𝜔) 𝐝(𝜔))
While the robust superdirective beamformer has control over the white noise amplification, it is not easy to find a closed-form expression for 𝜀 that yields a desired value of the WNG.
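To illustrate the DF-WNG compromise, a sketch that builds the steering vector and the diffuse coherence matrix for a ULA at endfire and sweeps the regularization (helper names are hypothetical; for large regularization the beamformer tends to the DS, whose WNG is N):

```python
import numpy as np

def steering(omega_tau0, N, theta=0.0):
    """Steering vector for a ULA; omega_tau0 = omega * delta / c."""
    return np.exp(-1j * np.arange(N) * omega_tau0 * np.cos(theta))

def gamma_diffuse(omega_tau0, N):
    """Spherically isotropic (diffuse) pseudo-coherence: sin(x)/x entries."""
    n = np.arange(N)
    return np.sinc(omega_tau0 * (n[:, None] - n[None, :]) / np.pi)

def robust_sd(d, Gd, eps):
    """h = (Gd + eps*I)^-1 d / (d^H (Gd + eps*I)^-1 d)  (distortionless)."""
    x = np.linalg.solve(Gd + eps * np.eye(len(d)), d)
    return x / (d.conj() @ x)

def wng(h, d):
    return np.abs(h.conj() @ d) ** 2 / np.real(h.conj() @ h)

def df(h, d, Gd):
    return np.abs(h.conj() @ d) ** 2 / np.real(h.conj() @ Gd @ h)
```

Sweeping eps from small to large trades DF for WNG monotonically, which is exactly why a search over the regularization can hit a desired gain value.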
R. Berkun, I. Cohen, and J. Benesty, "Combined beamformers for broadband regularized superdirective beamforming," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, pp. 877–886, May 2015.
Berkun et al. proposed the combined beamformer:
𝐡_{𝛼,𝜀}(𝜔) = (𝚪_𝜀⁻¹(𝜔) + 𝛼(𝜔) 𝐈_𝑁) 𝐝(𝜔) / (𝐝ᴴ(𝜔) (𝚪_𝜀⁻¹(𝜔) + 𝛼(𝜔) 𝐈_𝑁) 𝐝(𝜔)), 𝛼 ∈ ℝ
It can be reformulated as
𝐡_{𝛼,𝜀}(𝜔) = 𝐡_{S,𝜀}(𝜔) / (1 + 𝛼_𝜀(𝜔)) + 𝐡_DS(𝜔) / (1 + 𝛼_𝜀⁻¹(𝜔))
where 𝛼_𝜀(𝜔) = 𝛼(𝜔) 𝒲_max / 𝒢_{max,𝜀}(𝜔) and 𝒢_{max,𝜀}(𝜔) = 𝐝ᴴ(𝜔) 𝚪_𝜀⁻¹(𝜔) 𝐝(𝜔)
For a fixed 𝒲[𝐡_{𝛼,𝜀}(𝜔)] = 𝒲₀ < 𝑁, or a fixed 𝒟[𝐡_{𝛼,𝜀}(𝜔)] = 𝒟₀, it is possible to analytically calculate 𝛼_𝜀(𝜔) and hence 𝛼(𝜔). While this yields a closed-form solution for the parameter 𝛼(𝜔), which enables control of the trade-off, the method still involves the regularization parameter 𝜀 and assumes it is user-determined.
▪ We assume the signal is corrupted by both diffuse noise and additive white noise.
▪ The input and output SNRs:
iSNR(𝜔) = tr[𝜑_𝑋(𝜔) 𝐝(𝜔) 𝐝ᴴ(𝜔)] / tr[𝜑_d(𝜔) 𝚪_𝐝(𝜔) + 𝜑_w(𝜔) 𝐈_𝑁] = 𝜑_𝑋(𝜔) / (𝜑_d(𝜔) + 𝜑_w(𝜔))
oSNR[𝐡(𝜔)] = 𝜑_𝑋(𝜔) |𝐡ᴴ(𝜔) 𝐝(𝜔)|² / (𝜑_d(𝜔) 𝐡ᴴ(𝜔) 𝚪_𝐝(𝜔) 𝐡(𝜔) + 𝜑_w(𝜔) 𝐡ᴴ(𝜔) 𝐡(𝜔))
▪ The SNR gain:
𝒢[𝐡(𝜔)] = |𝐡ᴴ(𝜔) 𝐝(𝜔)|² / ( (1 − 𝛼(𝜔)) 𝐡ᴴ(𝜔) 𝚪_𝐝(𝜔) 𝐡(𝜔) + 𝛼(𝜔) 𝐡ᴴ(𝜔) 𝐡(𝜔) )
where 𝛼(𝜔) = 𝜑_w(𝜔) / (𝜑_d(𝜔) + 𝜑_w(𝜔)), 0 ≤ 𝛼(𝜔) ≤ 1
▪ The proposed beamformer, which maximizes the SNR gain, is:
𝐡_𝛼(𝜔) = 𝚪_{𝐝,𝛼}⁻¹(𝜔) 𝐝(𝜔) / (𝐝ᴴ(𝜔) 𝚪_{𝐝,𝛼}⁻¹(𝜔) 𝐝(𝜔)), where 𝚪_{𝐝,𝛼}(𝜔) = (1 − 𝛼(𝜔)) 𝚪_𝐝(𝜔) + 𝛼(𝜔) 𝐈_𝑁
▪ The SNR gain: 𝒢[𝐡_𝛼(𝜔)] = 𝐝ᴴ(𝜔) 𝚪_{𝐝,𝛼}⁻¹(𝜔) 𝐝(𝜔)
▪ The proposed beamformer is equivalent to 𝐡_{S,𝜀}(𝜔) with 𝜀(𝜔) = 𝛼(𝜔)/(1 − 𝛼(𝜔))
▪ Problem: 𝜑_d(𝜔), 𝜑_w(𝜔) are not known → 𝛼(𝜔) is not known.
▪ Advantage 1: 𝛼(𝜔) varies from 0 to 1.
▪ Advantage 2: the gain is continuous and has a single minimum point in this range, and the WNG and DF are monotonic in this range.
▪ Solution: 𝛼(𝜔) is found by employing a binary-like search on each monotonic section.
▪ Input: desired gain 𝒢₀ and tolerance
▪ Output: optimal regularization 𝛼
1. Find 𝛼min that minimizes the gain (e.g., using gradient descent)
2. Divide the range [0, 1] into 2 sections in which the gain is monotonic: [0, 𝛼min] and [𝛼min, 1]
3. For each section, apply the following continuous binary search:
4. Divide the section into 2 sub-sections
5. Calculate the gain 𝒢 in the middle of each sub-section
6. Choose the gain 𝒢, and its respective sub-section, for which |𝒢 − 𝒢₀| is minimal
7. If |𝒢 − 𝒢₀| ≤ tolerance, then
8. 𝛼 ← (middle of the chosen sub-section) and stop
9. Else, narrow the search to the chosen sub-section and go back to 4
10. Take the result of each section and choose the best result
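A simplified sketch of the search: because the gain is monotonic on each section, plain bisection per section stands in for the quarter-point rule of steps 4-9, and a coarse grid stands in for the gradient descent of step 1 (all names hypothetical):

```python
import numpy as np

def steering(omega_tau0, N):
    return np.exp(-1j * np.arange(N) * omega_tau0)   # endfire ULA steering vector

def gamma_diffuse(omega_tau0, N):
    n = np.arange(N)
    return np.sinc(omega_tau0 * (n[:, None] - n[None, :]) / np.pi)

def snr_gain(alpha, d, Gd):
    """G[h_alpha] = d^H ((1-alpha) Gd + alpha I)^-1 d."""
    A = (1 - alpha) * Gd + alpha * np.eye(len(d))
    return np.real(d.conj() @ np.linalg.solve(A, d))

def find_alpha(G0, d, Gd, tol=1e-9, iters=200):
    """Search for the alpha in [0, 1] whose SNR gain is closest to G0."""
    grid = np.linspace(1e-6, 1 - 1e-6, 2001)
    gains = np.array([snr_gain(a, d, Gd) for a in grid])
    a_min = grid[int(np.argmin(gains))]              # step 1: gain-minimizing alpha
    best, best_err = None, np.inf
    for lo, hi in [(1e-6, a_min), (a_min, 1 - 1e-6)]:   # step 2: monotonic sections
        for _ in range(iters):                       # bisection toward the target
            mid = 0.5 * (lo + hi)
            g = snr_gain(mid, d, Gd)
            if abs(g - G0) <= tol:
                break
            if (snr_gain(hi, d, Gd) > snr_gain(lo, d, Gd)) == (g < G0):
                lo = mid
            else:
                hi = mid
        err = abs(snr_gain(0.5 * (lo + hi), d, Gd) - G0)
        if err < best_err:
            best, best_err = 0.5 * (lo + hi), err
    return best                                      # step 10: best of both sections
```

Since the gain is convex in alpha, each section is well-behaved and the per-frequency search is cheap, which is what makes the multi-band designs below practical.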
Setup: 𝑁 = 8 microphones, 𝛿 = 1 cm spacing.
Array gains for fixed SNR gain: 𝛼(𝜔) is found for a desired fixed SNR gain 𝒢₀ using the proposed algorithm.
Array gains for fixed WNG: 𝛼(𝜔) is found for the maximal SNR gain under a constant desired WNG 𝒲₀, using the proposed algorithm from step 4.
→ Our proposed beamformer outperforms the combined beamformer with 𝜀 = 10⁻⁴.
Array gains for fixed DF in multi-band: 𝛼(𝜔) is found for the maximal SNR gain under a piece-wise constant, gradually increasing DF, using the proposed algorithm from step 4.
→ The WNG-DF trade-off can be considered at each frequency band separately!
→ Our proposed beamformer outperforms the combined beamformer with 𝜀 = 10⁻⁴.
Results summary:
▪ The proposed approach facilitates the design of beamformers with fixed SNR gain, beamformers with maximal SNR gain for constant WNG or DF, and multi-band fixed beamformers. ▪ Enables a fine tuning of the compromise between the DF and robustness against white noise.
Future work:
▪ Testing various angles of incidence other than the end-fire direction. ▪ Incorporating other considerations such as side-lobe requirements and performance under