Real-time audio and video communication (RTC) has become indispensable infrastructure for daily life and work, and the technologies behind it keep evolving to handle complex, multi-scenario problems. In audio, one such problem is how to give users a clear, realistic listening experience across multiple devices, multiple speakers, and many kinds of noise.
ICASSP (International Conference on Acoustics, Speech and Signal Processing) is the flagship international conference in speech signal processing and has long represented the most cutting-edge research directions in acoustics. ICASSP 2023 includes a number of papers on speech enhancement algorithms for audio signals. Among them, four research papers from the Volcano Engine RTC audio team were accepted, covering speaker-specific speech enhancement, echo cancellation, multi-channel speech enhancement, and sound quality restoration. This article introduces the core scenario problems these four papers address and their technical solutions, and shares the team's thinking and practice in speech noise suppression, echo cancellation, and interfering-speech removal.
Paper address:
https://www.php.cn/link/73740ea85c4ec25f00f9acbd859f861d
Real-time speaker-specific speech enhancement poses several problems. First, capturing the full frequency bandwidth of the sound makes the model's processing task harder. Second, compared with non-real-time scenarios, it is harder for a real-time model to locate the target speaker, and improving the interaction between the speaker embedding vector and the speech enhancement model is a particular difficulty in real-time processing. Inspired by human auditory attention, Volcano Engine proposes a Speaker Attentive Module (SAM) that introduces speaker information and fuses it with a single-channel speech enhancement model, the Band-Split Recurrent Neural Network (BSRNN). The result is a speaker-specific speech enhancement system that serves as a post-processing module for the echo cancellation model, with the cascade of the two models optimized.
The Band-Split Recurrent Neural Network (Band-Split RNN, BSRNN) is a SOTA model for full-band speech enhancement and music separation. Its structure is shown in the figure above. BSRNN consists of three modules: the Band-Split Module, the Band and Sequence Modeling Module, and the Band-Merge Module. The band-split module first divides the spectrum into K frequency bands. The features of each band are batch normalized (BN) and then compressed to a common feature dimension C by K fully connected (FC) layers. The features of all bands are then concatenated into a three-dimensional tensor and processed by the band and sequence modeling module, which uses GRUs to model the time and band dimensions of the feature tensor alternately. Finally, the band-merge module turns the processed features into a spectral mask, and the enhanced speech is obtained by multiplying this mask with the input spectrum. To build a speaker-specific speech enhancement model, we add a speaker attentive module after each band and sequence modeling module.
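The band-split step and the final masking can be illustrated with a small sketch. It is only an illustration under assumptions: the band widths, the dimension C = 3, the random weights standing in for trained FC layers, and the constant mask are all hypothetical; batch normalization is omitted; and stacking real and imaginary parts as the FC input is an assumed choice.

```python
import random

def split_bands(spectrum, band_widths):
    """Split one frame of frequency bins into consecutive bands."""
    assert sum(band_widths) == len(spectrum)
    bands, start = [], 0
    for width in band_widths:
        bands.append(spectrum[start:start + width])
        start += width
    return bands

def make_fc(in_dim, out_dim, rng):
    """Random weights standing in for a trained fully connected layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_fc(weights, features):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

rng = random.Random(0)
band_widths = [2, 2, 4, 8]   # K = 4 bands (hypothetical widths)
C = 3                        # shared feature dimension (hypothetical)
frame = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(sum(band_widths))]

# Each band has its own FC layer that maps it to the common dimension C.
bands = split_bands(frame, band_widths)
layers = [make_fc(2 * width, C, rng) for width in band_widths]
features = []
for band, layer in zip(bands, layers):
    stacked = [z.real for z in band] + [z.imag for z in band]
    features.append(apply_fc(layer, stacked))

# A trained network would predict the mask; a constant mask stands in here.
mask = [0.5] * len(frame)
enhanced = [m * z for m, z in zip(mask, frame)]
```

Because every band ends up with the same dimension C regardless of its width, the K band vectors can be stacked into one tensor for the sequence modeling module.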
The structure of the Speaker Attentive Module (SAM) is shown in the figure above. The core idea is to use the speaker embedding vector e as an attractor for the intermediate features of the speech enhancement model and to compute its correlation s with the intermediate features at every time step and frequency band; s is called the attention value. This attention value is then used to scale and regularize the intermediate features h. The computation has three steps:
First, e and h are transformed into k and q through a fully connected layer and a convolution, respectively.
Next, k and q are multiplied to obtain the attention value s.
Finally, the original features h are scaled by this attention value.
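The three steps can be sketched roughly as follows. This is a sketch under assumptions, not the paper's exact formulation: the convolution is taken to be a 1x1 convolution (equivalent to a per-position linear map), the product of k and q is taken to be a dot product squashed by a sigmoid, and all sizes and weights are hypothetical.

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product; stands in for an FC layer or a 1x1 convolution."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def speaker_attention(e, h, w_k, w_q):
    """Scale each feature vector h[t][b] by its attention value for embedding e."""
    k = linear(w_k, e)                    # key derived from the speaker embedding
    scaled, scores = [], []
    for frame in h:                       # loop over time steps
        scaled_frame, score_frame = [], []
        for feat in frame:                # loop over frequency bands
            q = linear(w_q, feat)         # query derived from the feature vector
            dot = sum(a * b for a, b in zip(k, q))
            s = 1.0 / (1.0 + math.exp(-dot))   # assumed sigmoid squashing
            score_frame.append(s)
            scaled_frame.append([s * v for v in feat])
        scaled.append(scaled_frame)
        scores.append(score_frame)
    return scaled, scores

rng = random.Random(0)
emb_dim, feat_dim, att_dim = 4, 3, 2      # hypothetical sizes
e = [rng.gauss(0, 1) for _ in range(emb_dim)]
h = [[[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(5)] for _ in range(6)]
w_k = [[rng.gauss(0, 1) for _ in range(emb_dim)] for _ in range(att_dim)]
w_q = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(att_dim)]
scaled, scores = speaker_attention(e, h, w_k, w_q)
```

With this choice, positions whose features correlate with the speaker embedding keep most of their energy, while the others are attenuated.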
For training data, we used the data from the speaker-specific speech enhancement track of the 5th DNS Challenge together with the high-quality speech data of DiDiSpeech. Through data cleaning, we obtained about 3,500 entries of clean speech data. For the cleaning itself, we used a pre-trained speaker recognition model based on ECAPA-TDNN [1] to remove residual interfering-speaker speech from the data, and the pre-trained model that won first place in the 4th DNS Challenge to remove residual noise. In the training phase, we generated more than 100,000 4-second speech clips, added reverberation to simulate different channels, and randomly mixed them with noise and interfering speech, producing four interference scenarios: one type of noise, two types of noise, noise plus an interfering speaker, and an interfering speaker only. The levels of the noisy speech and the target speech were also randomly scaled to simulate inputs of different loudness.
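As an illustration of the mixing and level scaling, the sketch below mixes a target signal with an interference signal and then applies a random gain. The source does not say how mixing levels were chosen, so the SNR-based mixing, the gain range, and the function names are assumptions made for this example.

```python
import math
import random

def rms(x):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mix_at_snr(target, interference, snr_db):
    """Scale the interference to the requested SNR and add it to the target."""
    gain = rms(target) / (rms(interference) * 10 ** (snr_db / 20))
    scaled = [gain * v for v in interference]
    noisy = [t + n for t, n in zip(target, scaled)]
    return noisy, scaled

def random_level(noisy, target, rng, min_db=-20.0, max_db=0.0):
    """Apply one random gain to both signals so input and label stay aligned."""
    gain = 10 ** (rng.uniform(min_db, max_db) / 20)
    return [gain * v for v in noisy], [gain * v for v in target]

rng = random.Random(0)
target = [rng.gauss(0, 1) for _ in range(1600)]   # stand-in for clean speech
noise = [rng.gauss(0, 3) for _ in range(1600)]    # stand-in for noise
noisy, scaled_noise = mix_at_snr(target, noise, snr_db=5.0)
noisy, target_out = random_level(noisy, target, rng)
achieved_snr = 20 * math.log10(rms(target) / rms(scaled_noise))
```

Applying the same gain to the noisy input and to the target keeps the training pair consistent while varying the overall input level.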
Paper address:
https://www.php.cn/link/7c7077ca5231fd6ad758b9d49a2a1eeb
Echo cancellation has always been an extremely complex and crucial problem in loudspeaker playback scenarios. To extract a high-quality, clean near-end speech signal, Volcano Engine proposes a lightweight echo cancellation system that combines signal processing with deep learning. Building on Personalized Deep Noise Suppression (pDNS), we further built a Personalized Acoustic Echo Cancellation (pAEC) system, which includes a pre-processing module based on digital signal processing, a two-stage model based on deep neural networks, and a speaker-specific speech extraction module based on BSRNN and SAM.
Overall framework of speaker-specific echo cancellation
The pre-processing module consists of two parts, time delay compensation (TDC) and linear echo cancellation (LAEC), both of which operate on sub-band features.
Sub-band linear echo cancellation framework based on signal processing
TDC is based on subband cross-correlation, which first estimates a delay in each subband separately, and then uses a voting method to determine the final time delay.
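The per-band estimate followed by a vote can be sketched as below. This is a simplified illustration: real sub-band signals are complex-valued, whereas this example uses real-valued sequences, and the majority vote, the search range, and the test data are assumptions.

```python
import random
from collections import Counter

def best_lag(ref, mic, max_lag):
    """Return the lag (in frames) with the largest cross-correlation."""
    best, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(ref[n - lag] * mic[n] for n in range(lag, len(mic)))
        if score > best_score:
            best, best_score = lag, score
    return best

def estimate_delay(ref_bands, mic_bands, max_lag):
    """Estimate one delay per sub-band, then take the most common value."""
    votes = Counter(best_lag(r, m, max_lag) for r, m in zip(ref_bands, mic_bands))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
true_delay, num_bands, num_frames = 3, 6, 200
ref_bands = [[rng.gauss(0, 1) for _ in range(num_frames)] for _ in range(num_bands)]
# The microphone bands are the reference bands delayed by true_delay frames.
mic_bands = [[0.0] * true_delay + band[:-true_delay] for band in ref_bands]
# One band carries unrelated content, so its estimate is unreliable.
mic_bands[0] = [rng.gauss(0, 1) for _ in range(num_frames)]
delay = estimate_delay(ref_bands, mic_bands, max_lag=10)
```

Voting makes the final estimate robust to a few bands where the echo is weak or masked, since those bands are simply outvoted.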
LAEC is a sub-band adaptive filtering method based on NLMS that uses two filters: a pre-filter and a post-filter. The post-filter updates its parameters adaptively with a dynamic step size, and the pre-filter is a backup of a stable post-filter. The residual energies output by the pre-filter and the post-filter are compared to decide which error signal is used as the final output.
LAEC processing flow chart
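A minimal sketch of the two-filter idea follows. It simplifies the description in several ways that should be read as assumptions: it runs in the time domain on the full band rather than per sub-band, uses a fixed step size instead of a dynamic one, and compares residual energies once per block of 32 samples. The echo path and signal lengths are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dual_filter_nlms(far, mic, taps=4, mu=0.5, block=32, eps=1e-8):
    """Return the selected residual signal and the final backup filter."""
    post = [0.0] * taps      # adaptive filter, updated on every sample
    pre = [0.0] * taps       # backup copy of the last post-filter that won
    history = [0.0] * taps   # most recent far-end samples, newest first
    output = []
    for start in range(0, len(mic), block):
        err_pre, err_post = [], []
        for x, d in zip(far[start:start + block], mic[start:start + block]):
            history = [x] + history[:-1]
            err_pre.append(d - dot(pre, history))
            e = d - dot(post, history)
            err_post.append(e)
            norm = dot(history, history) + eps
            # NLMS update: step along the input, normalized by its energy.
            post = [w + mu * e * h / norm for w, h in zip(post, history)]
        if dot(err_post, err_post) <= dot(err_pre, err_pre):
            output.extend(err_post)
            pre = list(post)  # the post-filter is behaving well: back it up
        else:
            output.extend(err_pre)
    return output, pre

rng = random.Random(0)
echo_path = [0.5, -0.3, 0.2, 0.1]   # hypothetical echo path
far = [rng.gauss(0, 1) for _ in range(2000)]
mic = [sum(echo_path[k] * far[n - k] for k in range(len(echo_path)) if n - k >= 0)
       for n in range(len(far))]
residual, backup = dual_filter_nlms(far, mic)
```

Keeping a backup means that if the adaptive filter diverges, for example during double-talk, the output can fall back to the last filter known to work.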
We propose decoupling the pAEC task into two tasks, "echo suppression" and "specific speaker extraction", to reduce the modeling burden. The post-processing network therefore consists mainly of two neural network modules: a lightweight CRN-based module for preliminary echo cancellation and noise suppression, and a pDNS-based post-processing module for better reconstruction of the near-end speech signal.
The lightweight CRN-based module consists of a band compression module, an encoder, two dual-path GRUs, a decoder, and a band decomposition module. We also introduce a Voice Activity Detection (VAD) module for multi-task learning, which helps improve the perception of near-end speech. The CRN takes the compressed magnitude as input and outputs a preliminary complex ideal ratio mask (cIRM) of the target signal and the near-end VAD probability.
The pDNS module at this stage comprises the Band-Split Recurrent Neural Network (BSRNN) introduced above and the Speaker Attentive Module (SAM), and is cascaded in series after the lightweight CRN module. Since our pDNS system already performs well on the speaker-specific speech enhancement task, we use the pre-trained pDNS model parameters to initialize the second stage, which further processes the output of the first stage.
We improve the two-stage model through cascade optimization so that the first stage predicts the near-end speech and the second stage predicts the near-end speech of the specific speaker. We also add a voice activity detection penalty for the near-end speaker to strengthen the model's ability to recognize near-end speech. In the loss function, the first- and second-stage terms compare the STFT features predicted by each stage with the STFT features of the near-end speech and of the near-end specific speaker's speech, respectively, and the VAD term compares the model's predicted VAD state with the target VAD state.
To enable the echo cancellation system to handle echoes from multiple devices, multiple reverberation conditions, and multiple noisy capture scenarios, we obtained 2,000 hours of training data by mixing echoes with clean speech. The echo data comes from the AEC Challenge 2023 far-end single-talk data, the clean speech comes from DNS Challenge 2023 and LibriSpeech, and the RIR set used to simulate near-end reverberation comes from the DNS Challenge. Because the echoes in the AEC Challenge 2023 far-end single-talk data contain a small amount of noisy data, using them directly as echo can easily distort the near-end speech. To alleviate this, we adopted a simple but effective data cleaning strategy: a pre-trained AEC model processes the far-end single-talk data, data with higher residual energy is identified as noisy data, and the cleaning process shown below is iterated repeatedly.
This speech enhancement system, which fuses echo cancellation with specific speaker extraction, was validated on the ICASSP 2023 AEC Challenge blind test set [2] on both subjective and objective metrics: it achieved a subjective mean opinion score (Subjective-MOS) of 4.44 and a speech recognition word accuracy (WAcc) of 82.2%.
"Multi-channel speech enhancement based on Fourier convolution attention mechanism"
Paper address:
https://www.php.cn/link/373cb8cd58cad5f1309b31c56e2d5a83
Beam weight estimation based on deep learning is one of the mainstream approaches to multi-channel speech enhancement: a network solves for beam weights, which are then used to filter the multi-channel signals and obtain clean speech. In beam weight estimation, spectral and spatial information play a role similar to that of the spatial covariance matrix in traditional beamforming algorithms. However, many existing neural beamformers cannot estimate beam weights optimally. To address this, Volcano Engine proposes a Fourier Convolutional Attention Encoder (FCAE), which provides a global receptive field along the frequency axis and strengthens the extraction of contextual features on that axis. We also propose an FCAE-based Convolutional Recurrent Encoder-Decoder (CRED) structure to capture spectral context features and spatial information from the input features.
Model framework structure
Beam weight estimation network
The network follows the structural paradigm of the Embedding and Beamforming Network (EaBNet), which divides the network into two parts: an embedding module and a beam module. The embedding module extracts an embedding that aggregates spectral and spatial information and passes it to the beam module, which derives the beam weights. Here a CRED structure learns the embedding tensor: the multi-channel input signal is transformed by STFT and fed into the CRED structure to extract the embedding tensor. This tensor is analogous to the spatial covariance matrix in traditional beamforming and contains features that distinguish speech from noise. The embedding tensor passes through a LayerNorm2d structure, then two stacked LSTM networks, and finally a linear layer that outputs the beam weights.
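Once beam weights are available, they are applied to the multi-channel spectrum by a filter-and-sum operation. The sketch below illustrates that operation for a single frame with two microphones. The spectrum values, the phase lags, and the hand-picked weights are hypothetical, and the weights are applied by plain complex multiplication (some formulations use the conjugate of the weights instead).

```python
import cmath

def filter_and_sum(weights, spectra):
    """Multiply each microphone's bins by its weights and sum over microphones.

    weights[m][f] and spectra[m][f] are complex values for mic m and bin f.
    """
    num_bins = len(spectra[0])
    return [sum(w[f] * x[f] for w, x in zip(weights, spectra))
            for f in range(num_bins)]

source = [1 + 2j, -0.5 + 0.3j, 0.8 - 1.1j]   # hypothetical clean spectrum
phases = [0.4, 0.9, 1.7]                     # per-bin phase lag at mic 2

# Mic 1 observes the source directly; mic 2 observes a phase-shifted copy.
spectra = [source, [s * cmath.exp(-1j * p) for s, p in zip(source, phases)]]

# Weights that undo the phase lag and average the two microphones.
weights = [[0.5] * len(source), [0.5 * cmath.exp(1j * p) for p in phases]]

output = filter_and_sum(weights, spectra)
```

In the network described here the weights are predicted per time-frequency bin rather than derived from known phase lags, but the application step has this same form.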
We apply the beam weights to the multi-channel input spectrum, perform filter-and-sum operations, and obtain the clean speech spectrum. An ISTFT then yields the target time-domain waveform.
CRED structure
The CRED structure we use is shown in the figure above. FCAE is the Fourier Convolutional Attention Encoder, and FCAD is the decoder symmetric to FCAE. The recurrent module uses a Deep Feedforward Sequential Memory Network (DFSMN) to model the temporal dependencies of the sequence, which reduces model size without affecting performance. The skip connections use serial Channel Attention and Spatial Attention modules to further extract cross-channel spatial information and to connect deep and shallow features, which eases the flow of information through the network.
The structure of the Fourier Convolutional Attention Encoder (FCAE) is shown in the figure above. Inspired by the Fourier convolution operator [3], this module exploits the fact that updating any single point in the transform domain of a discrete Fourier transform has a global effect on the signal in the original domain. By applying a one-dimensional FFT along the frequency axis of the features, it obtains a global receptive field on the frequency axis and thereby strengthens the extraction of contextual features there. In addition, we introduce a spatial attention module and a channel attention module to further enhance the expressive power of the convolutions, extract useful joint spectral-spatial information, and help the network learn features that distinguish clean speech from noise. In terms of final performance, the network achieved excellent multi-channel speech enhancement with only 0.74M parameters.
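The global-effect property is easy to check numerically. The sketch below takes a short feature vector along the frequency axis, changes a single coefficient in the transform domain (plus its conjugate partner, so the result stays real), and counts how many positions change after the inverse transform. The feature values are hypothetical, and a plain DFT is used in place of an FFT for brevity.

```python
import cmath

def dft(x, inverse=False):
    """Plain O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [v / n for v in out] if inverse else out

# One frame of features along the frequency axis (hypothetical values).
features = [0.0, 1.0, 0.5, -0.3, 0.8, 0.2, -0.6, 0.1]

spec = dft(features)
spec[1] *= 2.0    # pointwise change to one transform-domain coefficient
spec[-1] *= 2.0   # its conjugate partner, so the output stays real
changed = [v.real for v in dft(spec, inverse=True)]

# Count positions on the frequency axis affected by that single change.
num_changed = sum(abs(a - b) > 1e-9 for a, b in zip(features, changed))
```

Here every one of the eight positions changes, whereas a small convolution kernel applied in the original domain would only touch its immediate neighbors.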
For the dataset, we used the open-source data provided by the ConferencingSpeech 2021 challenge. The clean speech data includes AISHELL-1, AISHELL-3, VCTK, and LibriSpeech (train-clean-360); utterances with a signal-to-noise ratio above 15 dB were selected to generate the multi-channel mixtures, and MUSAN and AudioSet served as the noise datasets. To simulate realistic multi-room reverberation, the open-source data was convolved with more than 5,000 room impulse responses, simulated by varying room size, reverberation time, and the positions of sound and noise sources, which finally produced more than 60,000 multi-channel training samples.
Paper address:
https://www.php.cn/link/e614f646836aaed9f89ce58e837e2310
Beyond speaker-specific speech enhancement, echo cancellation, and multi-channel speech enhancement, Volcano Engine has also explored sound quality restoration. During real-time communication, different forms of distortion degrade the speech signal and reduce its clarity and intelligibility. Volcano Engine proposes a two-stage model that uses a staged divide-and-conquer strategy to repair the various distortions that affect speech quality.
The figure below shows the overall framework of the two-stage model. The first-stage model mainly repairs the missing parts of the spectrum, and the second-stage model mainly suppresses noise, reverberation, and any artifacts introduced by the first-stage model.
The first-stage (repair) model adopts the Deep Complex Convolution Recurrent Network (DCCRN) [4] architecture, which has three parts: an encoder, a temporal modeling module, and a decoder. Inspired by image inpainting, we introduce gated complex-valued convolution and gated complex-valued transposed convolution to replace the complex-valued convolution and complex-valued transposed convolution in the encoder and decoder. To further improve the naturalness of the repaired audio, we introduce a Multi-Period Discriminator and a Multi-Scale Discriminator for auxiliary training.
The second-stage model adopts the S-DCCRN architecture, which has three parts: an encoder, two lightweight DCCRN sub-modules, and a decoder. The two lightweight DCCRN sub-modules perform sub-band and full-band modeling, respectively. To improve the model's time-domain modeling ability, we replace the LSTM in the DCCRN sub-modules with the Squeezed Temporal Convolutional Module (STCM).
The clean audio, noise, and reverberation used to train the sound quality restoration models all come from the 2023 DNS Challenge dataset, with 750 hours of clean audio and 170 hours of noise in total. For data augmentation in the first stage, we convolve full-band audio with randomly generated filters, randomly set audio samples to zero using a window length of 20 ms, and randomly downsample the audio to simulate spectrum loss; we also multiply the audio magnitude spectrum and the audio samples by random scales. For data augmentation in the second stage, we convolve the data generated in the first stage with various room impulse responses to obtain audio with different levels of reverberation.
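One of the first-stage augmentations, zeroing samples in 20 ms windows, can be sketched as follows. The source does not give the exact procedure, so this example assumes that whole 20 ms windows are zeroed independently at random; the drop probability, the sample rate, and the function name are hypothetical.

```python
import random

def zero_random_windows(audio, sample_rate, rng, window_ms=20, drop_prob=0.2):
    """Zero out randomly chosen, non-overlapping windows of the signal."""
    win = sample_rate * window_ms // 1000   # window length in samples
    out = list(audio)
    dropped = []                            # start indices of zeroed windows
    for start in range(0, len(out), win):
        if rng.random() < drop_prob:
            end = min(start + win, len(out))
            out[start:end] = [0.0] * (end - start)
            dropped.append(start)
    return out, dropped, win

rng = random.Random(0)
sample_rate = 16000   # hypothetical rate chosen to keep the example small
audio = [rng.uniform(0.1, 1.0) for _ in range(sample_rate)]   # 1 s stand-in signal
damaged, dropped, win = zero_random_windows(audio, sample_rate, rng)
```

The undamaged signal serves as the training target and the damaged one as the input, so the model learns to fill in the missing segments.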
In the ICASSP 2023 AEC Challenge, the Volcano Engine RTC audio team won first place on both the general echo cancellation (Non-personalized AEC) track and the speaker-specific echo cancellation (Personalized AEC) track. Its results were significantly better than those of other participating teams on metrics including double-talk echo suppression, double-talk near-end speech preservation, near-end single-talk background noise suppression, overall subjective audio quality score, and final speech recognition accuracy, reaching an internationally leading level.
Below are examples of the speech enhancement results that Volcano Engine RTC achieves in different scenarios with the technical solutions above.
The following two examples compare the audio before and after the echo cancellation algorithm in scenarios with different signal-to-echo energy ratios.
Medium signal-to-echo ratio scenario
Ultra-low signal-to-echo ratio scenarios pose the greatest challenge to echo cancellation: the high-energy echo must be removed effectively while the weak target speech is preserved as much as possible. In this sample, the non-target speaker's voice (the echo) almost completely masks the target speaker's (female) voice, making it hard to discern.
Ultra-low signal-to-echo ratio scenario
The following two examples compare the audio before and after the specific speaker extraction algorithm in scenarios with noise and with background speaker interference.
In the following sample, the target speaker's voice is mixed with both doorbell-like noise and background speech. AI noise suppression alone removes only the doorbell noise, so speaker-specific processing is also needed to eliminate the interfering voice.
Target speaker with background interfering speech and noise
When the voiceprint characteristics of the target speaker and the background interfering voice are very similar, specific speaker extraction becomes more challenging, which tests the robustness of the algorithm. In the following sample, the target speaker and the background interfering voice are two similar female voices.
Target female voice mixed with interfering female voice
The above introduces some of the deep-learning-based solutions and results from the Volcano Engine RTC audio team in speaker-specific noise suppression, echo cancellation, multi-channel speech enhancement, and related areas. Future scenarios still pose challenges in several directions, such as how to make speech noise suppression adapt to the noise scene, how to restore multiple types of audio signal degradation across a wider range of sound quality restoration tasks, and how to run lightweight, low-complexity models on all kinds of terminals. These challenges will be the focus of our next research.