Abstract
High level of noise reduces the perceptual quality and intelligibility of speech. Therefore, enhancing the captured speech signal is important in everyday applications such as telephony and teleconferencing. Microphone arrays are typically placed at a distance from a speaker and require processing to enhance the captured signal. Beamforming provides directional gain towards the source of interest and attenuation of interference. It is
often followed by a single channel post-filter to further enhance the signal. Non-linear spatial post-filters are capable of providing high noise suppression but can produce unwanted musical noise that lowers the perceptual quality of the output. This work proposes an artificial neural network (ANN) to learn the structure of naturally occurring post-filters to enhance speech
from interfering noise. The ANN uses phase-based features obtained from a multichannel array as an input. Simulations are used to train the ANN in a supervised manner. The performance
is measured with objective scores from speech recorded in an office environment. The post-filters predicted by the ANN are found to improve the perceptual quality over delay-and-sum
beamforming while maintaining the high noise suppression characteristic of spatial post-filters.
Index Terms: Speech enhancement, Microphone arrays, Array signal processing, Artificial neural networks, Psychoacoustics.
1. Introduction
Speech enhancement is used to improve the observed quality, and it is important in many everyday applications such as telephony and distant-talking interfaces. When the talker is distant from the capturing microphone, reverberation and background noise often reduce the captured quality significantly. Speech enhancement can remove noise (denoising), reverberation (dereverberation), or both. When multiple speakers are talking concurrently, the problem of removing the interfering speakers is called speech separation.
Time-frequency (T-F) masking is based on the windowing-disjoint orthogonality assumption of signals, i.e. speech energy is concentrated in only a few time-frequency points, which do not overlap between speakers [1]. A T-F mask typically approximates the ideal binary mask (IBM) and is applied by multiplying the observed mixture, thus passing only the desired components. However, musical noise artifacts can arise due to errors in mask estimation. Recently, the real-valued ideal Wiener filter (IWF) has been shown to improve speech intelligibility in noisy conditions over the IBM [2].
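The two mask types can be illustrated with oracle computations from known speech and noise powers; the arrays below are synthetic stand-ins for spectrograms, not data from the original work:

```python
import numpy as np

def ideal_binary_mask(speech_pow, noise_pow, lc_db=0.0):
    """IBM: pass a T-F point only when the local SNR exceeds the criterion (dB)."""
    return (speech_pow > noise_pow * 10.0 ** (lc_db / 10.0)).astype(float)

def ideal_wiener_filter(speech_pow, noise_pow):
    """IWF: a real-valued gain in [0, 1] computed from the true powers."""
    return speech_pow / np.maximum(speech_pow + noise_pow, 1e-12)

# Synthetic 2x3 power "spectrograms" (frames x bands), for illustration only.
S = np.array([[4.0, 1.0, 0.25], [9.0, 0.0, 1.0]])
N = np.ones_like(S)

ibm = ideal_binary_mask(S, N)
iwf = ideal_wiener_filter(S, N)
# Either mask is applied by element-wise multiplication with the mixture.
```

The IWF attenuates smoothly with the local SNR, whereas the IBM is all-or-nothing, which is one source of musical noise when the binary decisions are wrong.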
Machine learning techniques are popular in speech enhancement. In [3] a non-negative matrix factorization (NMF) technique is used to learn spectral bases of speech and different noise types. The NMF reconstruction is then used to denoise the observation. The authors of [4] train a long short-term memory (LSTM) recurrent neural network (RNN) to predict a T-F mask for speech enhancement. In [5] spectral features (such as Mel-frequency cepstral coefficients) and their delta components are used to train a deep neural network (DNN) to predict the instantaneous SNR for each frequency band, which is used to estimate the ideal ratio mask (IRM). The authors of [6] use a combination of DNNs and support vector machines (SVMs) for speech enhancement by binary classification of T-F bands. In [7], a deep recurrent autoencoder neural network is trained to denoise input features for noise-robust automatic speech recognition (ASR).
While the above methods primarily utilize a monophonic signal, binaural signals enable the use of spatial cues, i.e., interaural time delay (ITD) and interaural level difference (ILD).
The degenerate unmixing estimation technique (DUET) clusters each T-F point based on its cue values [8]. In [9] this is done by supervised learning via kernel-density estimation for a binary T-F mask value. In [10] the spatial cues (along with pitch features for voiced frames) are used to train two sets of multilayer perceptrons (MLPs) for each combination of azimuth angle and frequency band. This approach requires substantial training data and computation.
Beamforming is linear filtering applied to microphone array signals in order to amplify the desired direction(s) and/or attenuate unwanted one(s). The simplest fixed-weight beamformer is the delay-and-sum beamformer (DSB), which sums the temporally aligned input signals from the desired direction of arrival (DOA). In contrast, adaptive methods update the filter coefficients based on estimates of the noise and signal statistics. The beamforming output can be further enhanced by multiplying with a post-filter, i.e. a type of T-F mask. An adaptive beamformer known as minimum variance distortionless response (MVDR) combined with the single-channel Wiener filter has been shown to be an optimal approach in the minimum mean square error (MMSE) sense [11, Ch. 3]. The ability to increase the SNR of the beamformer output has been demonstrated with different post-filters [12, 11, 13, 14], which differ in the assumptions made of the signal and noise.
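A minimal frequency-domain DSB sketch, with a made-up four-microphone geometry and a single frequency bin; the sign convention for the steering delays is an assumption, chosen so that the plane-wave construction below is coherently summed:

```python
import numpy as np

# Hypothetical 4-microphone square array (positions in metres).
MIC_POS = np.array([[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0],
                    [0.0, 0.1, 0.0], [0.0, -0.1, 0.0]])
C = 343.0  # speed of sound (m/s)

def dsb_weights(mic_pos, doa_unit, omega, c=C):
    """Frequency-domain DSB weights: phase-align the channels, then average."""
    delays = mic_pos @ doa_unit / c        # arrival-time offsets per microphone
    return np.exp(1j * omega * delays) / len(mic_pos)

def dsb_output(X, w):
    """Weighted sum of one frame of M-channel STFT values X (shape (M,))."""
    return w @ X

# A unit-amplitude plane wave arriving from the look direction:
omega = 2 * np.pi * 1000.0
look = np.array([1.0, 0.0, 0.0])
tau = MIC_POS @ look / C
X = np.exp(-1j * omega * tau)
y = dsb_output(X, dsb_weights(MIC_POS, look, omega))   # coherent sum, |y| = 1
```

A wave from any other direction sums with mismatched phases, so its magnitude at the output is reduced; that is the directional gain the text describes.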
A spatial post-filter can also suppress point-like noise sources. Tashev et al. derived the instantaneous DOA (IDOA) filter in [15], in which phase-difference measurements form a likelihood function for post-filter estimation. Seltzer et al. [16] proposed a statistical generative model that estimates speech and noise parameters as Gaussian random variables, with application to post-filtering using phase-difference and spectral observations for a four-microphone linear array. As in [10], the phase-based features are dependent on the angle of the source. While spatial filtering achieves impressive noise suppression, as evident in [15], it can also produce unwanted artifacts that lead to lower perceptual quality than that of the simple DSB. Therefore, it is important to investigate the noise suppression capability of spatial filtering in conjunction with perceptual quality. Seltzer et al. [17] proposed a log-MMSE adaptive beamformer that uses the spatially post-filtered signal as the desired signal to produce higher perceptual quality over the DSB.
This work proposes the use of a multilayer perceptron (MLP), a type of artificial neural network, to learn the mapping from phase-based features directly into post-filter values using a circular microphone array. In contrast to angle-dependent models [16, 10], the input feature is angle independent and a single MLP can be used to predict the post-filter. This reduces the model complexity over previous methods. In contrast to previous binaural approaches that utilize the IBM as the target, the MLP here predicts the IWF, i.e., a real-valued post-filter. Finally, in contrast to traditional post-filters, the MLP does not require explicit assumptions or estimates of the signal and noise statistics. Instead, data generated by simulations is used to train the MLP, while the performance is evaluated with recorded speech. The proposed MLP-based post-filter operates in the MEL-frequency domain.
This paper is organized as follows. Section 2 reviews beamforming and DOA estimation. The conventional spatial post-filter is reviewed in Section 3. The proposed MLP based spatial post-filter is presented in Section 4. Section 5 describes the array speech recordings. Section 6 reports and discusses the results and is followed by the conclusions in Section 7.
2. Beamforming and DOA Estimation
The i-th microphone, i = 1, \ldots, M, captures the signal x_i(t). In the short-time Fourier transform (STFT) domain the observation can be written as

X_i(t, \omega) = H_i(t, \omega) S(t, \omega) + N_i(t, \omega),

where H_i(t, \omega) is the transfer function from the source to microphone i, \omega is the angular frequency, and t is the frame index. The delay-and-sum beamformer output is

Y_{\mathrm{DSB}}(t, \omega) = \frac{1}{M} \sum_{i=1}^{M} X_i(t, \omega)\, e^{-j \mathbf{k}^{T} \mathbf{m}_i},

and applying the post-filter yields

Y(t, \omega) = H_{\mathrm{post}}(t, \omega)\, Y_{\mathrm{DSB}}(t, \omega),

where H_{\mathrm{post}}(t, \omega) is the real-valued post-filter gain, \mathbf{m}_i \in \mathbb{R}^{3} is the position of the i-th microphone, and \mathbf{k} is the wave vector towards the source with \lVert\mathbf{k}\rVert = \omega / c, where c is the speed of sound.
2.1. DOA estimation
The generalized cross-correlation (GCC) is applied to estimate the source DOA k in frame t with the steered response power (SRP) method [18]

\hat{\mathbf{k}}(t) = \arg\max_{\mathbf{k}} \sum_{i=1}^{M-1} \sum_{i'=i+1}^{M} \int \frac{X_i(t,\omega)\, X_{i'}^{*}(t,\omega)}{\lvert X_i(t,\omega)\, X_{i'}^{*}(t,\omega) \rvert}\, e^{j\omega \Delta\tau_{i,i'}(\mathbf{k})}\, d\omega,

where the PHAT weighting removes the magnitude information, and \Delta\tau_{i,i'}(\mathbf{k}) is the time-difference of arrival between microphones i and i' for direction \mathbf{k}.

where E(t) is
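The GCC-PHAT computation for one microphone pair can be sketched as follows; a full SRP search would sum such correlations over all pairs and candidate directions. The signals here are synthetic, and the lag/sign conventions are assumptions of the sketch:

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """GCC with PHAT weighting: the cross-power spectrum magnitude is removed,
    leaving only phase, which sharpens the correlation peak for pure delays.
    Returns the estimated delay of x2 relative to x1 in seconds."""
    n = len(x1) + len(x2)                      # zero-pad for linear correlation
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT: keep phase only
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))  # centre zero lag
    lags = np.arange(-(n // 2), n // 2 + 1)
    return lags[np.argmax(cc)] / fs

# Synthetic check: x2 is x1 delayed by 5 samples.
fs = 16000
rng = np.random.default_rng(0)
x1 = rng.standard_normal(512)
x2 = np.concatenate((np.zeros(5), x1[:-5]))
tau_hat = gcc_phat(x1, x2, fs)
```

With the TDOA estimated per pair, the SRP method scores each candidate direction by how well its theoretical delays explain all pairwise correlations at once.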
3. Conventional Spatial Post-Filter
Following the azimuth angle IDOA filter definition of [15], and omitting the time index for brevity, the expression of IDOA for a DOA vector \mathbf{k} is

\Delta(\mathbf{k}, \omega) = \left[\delta_1(\omega) - \varphi_1(\mathbf{k}, \omega), \ldots, \delta_P(\omega) - \varphi_P(\mathbf{k}, \omega)\right]^{T},

where \delta_p(\omega) is the observed phase difference of microphone pair p = 1, \ldots, P, and \varphi_p(\mathbf{k}, \omega) is the theoretical phase difference for direction \mathbf{k}.
The probability density for frequency \omega to come from the desired direction \mathbf{k} is [15]

p(\mathbf{k} \mid \omega) = \frac{\exp\left(-\lVert \Delta(\mathbf{k}, \omega) \rVert^2 / 2\sigma^2\right)}{\sum_{l=1}^{L} \exp\left(-\lVert \Delta(\mathbf{k}_l, \omega) \rVert^2 / 2\sigma^2\right)},

where \mathbf{k}_l, l = 1, \ldots, L, denotes the different steering directions. The DSPF allows steep noise suppression but entails artifacts. Note that [15] proposes the additional use of an HMM framework.
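Under the assumption of a Gaussian kernel in IDOA space (the kernel width sigma and the toy phase differences below are made up, not parameters from [15]), the direction likelihood normalized over L steering directions can be sketched as:

```python
import numpy as np

def wrap(a):
    """Wrap phase differences to (-pi, pi]."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def dspf_gain(obs_pd, theo_pd, sigma=0.5):
    """Spatial post-filter gain for one frequency bin.

    obs_pd:  (P,) observed phase differences over microphone pairs
    theo_pd: (L, P) theoretical phase differences for L steering directions;
             row 0 is taken to be the desired look direction.
    Returns the normalized likelihood of the desired direction.
    """
    d2 = np.sum(wrap(obs_pd - theo_pd) ** 2, axis=1)   # squared IDOA distance
    lik = np.exp(-d2 / (2.0 * sigma ** 2))
    return lik[0] / np.sum(lik)

# Toy example: three steering directions, two microphone pairs.
theo = np.array([[0.0, 0.1], [1.0, -0.8], [-1.2, 0.9]])
g_match = dspf_gain(theo[0], theo)   # observation matches the look direction
g_miss = dspf_gain(theo[1], theo)    # observation matches an interferer
```

The gain approaches 1 when the observed phase differences agree with the look direction and collapses towards 0 otherwise, which is exactly the steep suppression (and artifact risk) discussed above.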
4. Neural Network Based Post-Filter
A block diagram of beamforming with spatial post-filtering is presented in Fig. 1. The post-filter values are obtained in the MEL frequency domain. A widely applied conversion from linear frequency f_{\mathrm{Hz}} (in Hz) to MEL frequency is

f_{\mathrm{MEL}} = 2595 \log_{10}\left(1 + \frac{f_{\mathrm{Hz}}}{700}\right).
The use of the MEL-frequency scale is motivated by the psychoacoustic properties of the human hearing system, i.e. closely spaced frequencies mask each other. Furthermore, computing the post-filter gain for B frequency bands instead of N_{\mathrm{DFT}} frequency bins can lead to large computational savings, since typically B \ll N_{\mathrm{DFT}}.
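Assuming the common 2595/700-constant variant of the conversion above, the mapping and its inverse can be sketched as follows; placing B band centres equally spaced on the MEL scale is one typical choice:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Linear frequency (Hz) to MEL frequency."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz) / 700.0)

def mel_to_hz(m):
    """MEL frequency back to linear frequency (Hz)."""
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

# B band centres equally spaced on the MEL scale up to Nyquist (fs = 16 kHz).
fs, B = 16000, 24
centres_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), B + 2))[1:-1]
```

Because the MEL scale is compressive, the centres are dense at low frequencies and sparse at high frequencies, mirroring the ear's resolution.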
Figure 1: A block diagram of the proposed post-filter approach. The phase differences are first extracted between microphone pairs, then subtracted from theoretical delays, and converted into input features u_t(b|k) for frequency bands b = 1, \ldots, B. Similarly, averaged features over other directions are extracted as v_t(b). Using these values, the MLP predicts the post-filter values for each frequency band. The frequency band values are then converted to linear scale H_{\mathrm{MLP}}(t, \omega). Finally, post-filter values are applied to the beamformer output Y_{\mathrm{DSB}}(t, \omega).
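The caption's final step, expanding MEL-band gains to linear-frequency bin gains, can be sketched as a piecewise-constant mapping; the band edges and the constant-per-band choice are illustrative assumptions (the original system may instead interpolate smoothly):

```python
import numpy as np

def band_gains_to_bins(gains, band_edges_hz, fs, n_fft):
    """Expand B per-band gains to n_fft//2 + 1 linear-frequency bin gains.

    band_edges_hz: (B+1,) band boundaries in Hz; each FFT bin simply takes
    the gain of the band it falls into (piecewise-constant mapping).
    """
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    idx = np.clip(np.searchsorted(band_edges_hz, freqs, side="right") - 1,
                  0, len(gains) - 1)
    return gains[idx]

# Toy example: three bands with hypothetical gains.
gains = np.array([0.2, 0.9, 0.5])
edges = np.array([0.0, 1000.0, 4000.0, 8000.0])
h_lin = band_gains_to_bins(gains, edges, fs=16000, n_fft=512)
# h_lin can now multiply the beamformer output spectrum bin by bin.
```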
4.1. MLP Structure
4.2. Training Data
An eight microphone circular array with 10 cm radius was used to simulate audio with 16 kHz sampling rate with added noise and reverberation. Two different sized rooms with reverberation times (T60) 0.4 s and 0.9 s with two source distances of 1.2 m
and 2.4 m were used to generate room impulse responses (RIR) for each microphone using the image method [20]. For each room and distance combination 100 randomly selected TIMIT database speech sentences were convolved with the RIRs to simulate the reverberant array signals. In each repetition, the array was placed in the center of the room, and the source angle was drawn randomly between surrounding azimuth angles
[0°, 360°]. Independent and identically distributed white Gaussian noise was added to the microphone signals, and the resulting SNR was drawn from a uniform distribution between [+12, +40] dB. The purpose of adding noise is to provide diverse training samples for the neural network in order to be generic enough to be applied in different conditions.
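The noise-addition step can be sketched as follows, scaling white Gaussian noise so the mixture attains an SNR drawn uniformly from [+12, +40] dB; the sinusoid below is only a stand-in for a speech signal:

```python
import numpy as np

def add_noise_at_snr(clean, snr_db, rng):
    """Add i.i.d. white Gaussian noise, scaled so the result has the target SNR."""
    noise = rng.standard_normal(clean.shape)
    scale = np.sqrt(np.mean(clean ** 2) /
                    (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(1)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)    # stand-in for one simulated channel
snr_db = rng.uniform(12.0, 40.0)         # SNR drawn as in the training setup
noisy = add_noise_at_snr(clean, snr_db, rng)
```

Scaling the noise (rather than the speech) keeps the target signal level fixed across training samples while the noise floor varies.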
A 32 ms window with 75 % overlap was used to extract the features (15) from all 400 simulated recordings. The target values were obtained from the ideal Wiener filter (IWF) [21]

H_{\mathrm{IWF}}(t, b) = \frac{|S(t, b)|^2}{|S(t, b)|^2 + |N(t, b)|^2},

where |S(t, b)|^2 and |N(t, b)|^2 denote the clean speech and noise powers in MEL band b of frame t.
¹The Deep Learn Toolbox implementation of MLP was used, http://github.com/rasmusbergpalm/DeepLearnToolbox.
5. Description of Recordings
A small office was used to capture speech recordings with an 8-channel microphone array with a 10 cm radius and a reference microphone mounted on a stand at 1.5 m height. The array was elevated on a stand at 1.0 m height, and consisted of omnidirectional electret condenser microphones (Sennheiser MKE 2). The reference microphone was a cardioid-pattern Røde NT 55 condenser microphone. The recordings consist of phonetically balanced sentences [22] captured at 1.3 m (near) and 2.0 m (far) distance from the array center with a 48 kHz sampling rate. Two PCs were emitting fan noise at approximately 1 m and 1.5 m distances from the array, at different angles than the speaker. A total of 77 recordings were captured from four different male speakers (38 far, 39 near), with an average sentence length of 3.8 s.
6. Results and Discussion
7. Conclusions
This paper proposes using an artificial neural network (ANN) in the design of spatial post-filtering for beamforming. More specifically, the multilayer perceptron (MLP) is applied. Spatial
cues from noisy and reverberant speech are used to train a MLP to predict post-filter values corresponding to the ideal Wiener Filter (IWF). The post-filter is obtained in the MEL-frequency scale and is converted to linear frequency scale before being applied to delay-and-sum beamforming (DSB). The method was evaluated with microphone array recordings of speech sentences in an office at two different distances. Objective measurements
of intelligibility (STOI) show that the MLP-based post-filter provides an increase in perceptual quality over DSB, while the segmental SNR and frequency-weighted segmental SNR indicate significant noise suppression over DSB.