In real-time audio and video communication, the microphone picks up not only the user's voice but also a large amount of environmental noise. Traditional noise reduction algorithms work reasonably well on stationary noise (fan hum, white noise, circuit noise floor, etc.) but perform poorly on non-stationary transient noise (a noisy restaurant, a subway platform, a home kitchen, etc.), seriously degrading the user's call experience. To address the hundreds of non-stationary noise types found in complex home and office scenarios, the ecosystem enablement team of the Integrated Communications Systems Department independently developed an AI audio noise reduction technology based on a GRU model. Through algorithmic and engineering optimization, the model was compressed from 2.4 MB to 82 KB and its runtime memory reduced by about 65%; computational complexity was cut from about 186 Mflops to about 42 Mflops, improving running efficiency by 77%. On the existing test data set (in an experimental environment), speech and noise are effectively separated, raising the call-quality MOS (mean opinion score) to 4.25.
This article introduces how our team performs real-time noise suppression based on deep learning and deploys it on mobile devices and the Jiaqin APP. The article is organized as follows: first, the classification of noise and how to choose algorithms for each class; then, how we design the algorithm and train the AI model through deep learning; finally, the effect of the current AI noise reduction and its key application scenarios.
In real-time audio and video application scenarios, the device operates in a complex acoustic environment. When the microphone captures the voice signal, it also captures a large amount of noise, which poses a major challenge to real-time audio and video quality. There are many types of noise; by its statistical properties, noise can be divided into two categories:
- Stationary noise: the statistical characteristics of the noise do not change over a relatively long period of time, e.g. white noise, electric fans, air conditioners, car interior noise;
- Non-stationary noise: the statistical characteristics of the noise change over time, e.g. noisy restaurants, subway stations, offices, home kitchens.
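The distinction above can be made concrete with a small sketch: a stationary signal's short-time spectrum barely changes from frame to frame, while a transient (non-stationary) event makes the spectrum vary sharply over time. Nothing below comes from the team's actual code; the frame length, hop size, and the coefficient-of-variation measure are illustrative choices.

```python
import numpy as np

def spectral_variation(signal, frame_len=512, hop=256):
    """Frame the signal, take magnitude spectra, and measure how much each
    frequency bin's energy varies across frames (higher = less stationary)."""
    win = np.hanning(frame_len)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    mags = np.array([np.abs(np.fft.rfft(f * win)) for f in frames])
    # Coefficient of variation per bin, averaged over all bins.
    return np.mean(np.std(mags, axis=0) / (np.mean(mags, axis=0) + 1e-12))

rng = np.random.default_rng(0)
stationary = rng.normal(size=48000)                  # white noise: statistics fixed
transient = np.zeros(48000)
transient[10000:10500] = 10 * rng.normal(size=500)   # a single short bang

# The transient signal shows much higher frame-to-frame spectral variation.
print(spectral_variation(stationary), spectral_variation(transient))
```

A traditional noise estimator implicitly relies on the left-hand quantity being small; when it is large, as for the transient, the estimate lags reality and the noise leaks through.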
In real-time audio and video applications, calls are susceptible to interference from many types of noise, so real-time audio noise reduction has become an essential feature. Stationary noise, such as the hum of an air conditioner or the noise floor of recording equipment, does not change significantly over time; it can be estimated, predicted, and removed by simple subtraction. Common methods include spectral subtraction, Wiener filtering, and wavelet transforms. Non-stationary noise, such as cars whizzing by on the road, plates clattering in a restaurant, or pots and pans banging in a home kitchen, appears randomly and unexpectedly and cannot be estimated or predicted. Traditional algorithms struggle to estimate and eliminate non-stationary noise, which is why we use deep learning algorithms.

To improve the audio SDK's noise reduction across diverse noise scenes and make up for the shortcomings of traditional algorithms, we developed an RNN-based AI noise reduction module that combines traditional noise reduction techniques with deep learning. Focusing on home and office usage scenarios, we added a large number of indoor noise types to the noise data set, such as keyboard typing in the office, the friction of desks and office supplies being dragged, chairs scraping, home kitchen noises, and objects dropped on the floor. At the same time, to run real-time speech processing on mobile devices, the AI noise reduction algorithm keeps both computational overhead and library size very low.
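For reference, the "simple subtraction" that works for stationary noise can be sketched as classic spectral subtraction: estimate an average noise magnitude spectrum from a noise-only segment, then subtract it from each frame of the noisy signal. This is a generic textbook sketch, not the team's implementation; frame length, hop, and the spectral floor are assumed parameters.

```python
import numpy as np

def spectral_subtraction(noisy, noise_segment, frame_len=512, hop=256, floor=0.05):
    """Subtract an averaged noise magnitude spectrum from each frame,
    keep the noisy phase, and resynthesize with overlap-add."""
    win = np.hanning(frame_len)
    # Average noise magnitude spectrum from a noise-only segment.
    noise_frames = [noise_segment[i:i + frame_len]
                    for i in range(0, len(noise_segment) - frame_len + 1, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f * win)) for f in noise_frames], axis=0)

    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len + 1, hop):
        spec = np.fft.rfft(noisy[i:i + frame_len] * win)
        # Subtract the noise estimate, but never go below a spectral floor.
        mag = np.maximum(np.abs(spec) - noise_mag, floor * np.abs(spec))
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame_len)
        out[i:i + frame_len] += clean * win          # weighted overlap-add
        norm[i:i + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-12)
```

The key limitation is visible in the code: `noise_mag` is a single time-invariant estimate, so any noise whose spectrum changes faster than the estimate is updated (a clattering plate, a keystroke) passes through largely untouched.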
In terms of computational overhead, taking 48 kHz audio as an example, the RNN inference for each speech frame takes only about 17.5 Mflops, the FFT and IFFT about 7.5 Mflops per frame, and feature extraction about 12 Mflops, for a total of about 42 Mflops, roughly equivalent to the complexity of the 48 kHz Opus codec. On a mid-range phone of a certain brand, measurements show the RNN noise reduction module uses about 4% CPU. In terms of library size, enabling RNN noise reduction at compile time increases the audio engine library by only about 108 KB.

The module uses an RNN because, unlike other learning models such as CNNs, an RNN carries temporal information and can model time series, rather than treating audio frames as independent inputs and outputs. The model uses gated recurrent units (GRU, shown in Figure 1). Experiments show that the GRU performs slightly better than the LSTM on speech noise reduction tasks, and because the GRU has fewer weight parameters, it saves computing resources. Compared with a simple recurrent unit, a GRU has two extra gates: the reset gate controls how much of the previous state is used when computing the new state, and the update gate controls how much the state changes based on the new input. The update gate lets the GRU retain temporal information over long spans, which is why it outperforms a simple recurrent unit.
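The gating described above can be written out in a few lines. This is a minimal, randomly initialized GRU cell for illustration only; the input and hidden sizes (42 and 24) are hypothetical and not taken from the production model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with random weights (illustrative, untrained)."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        shape = (n_hidden, n_in + n_hidden)
        self.Wz = rng.normal(scale=0.1, size=shape)  # update-gate weights
        self.Wr = rng.normal(scale=0.1, size=shape)  # reset-gate weights
        self.Wh = rng.normal(scale=0.1, size=shape)  # candidate-state weights

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)        # update gate: how much the state changes
        r = sigmoid(self.Wr @ xh)        # reset gate: how much old state feeds the candidate
        h_cand = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_cand  # blend old state with the candidate

cell = GRUCell(n_in=42, n_hidden=24)
rng = np.random.default_rng(1)
h = np.zeros(24)
for _ in range(10):                      # run over 10 feature frames
    h = cell.step(rng.normal(size=42), h)
```

When `z` stays near 0, the old state `h` passes through almost unchanged, which is the mechanism that lets a GRU carry information across many frames.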
Figure 1. A simple recurrent unit (left) and the GRU (right)

The structure of the model is shown in Figure 2. The trained model is embedded into the audio and video communication SDK. The SDK reads the audio stream from the hardware device, splits it into frames, and sends them to the AI noise reduction preprocessing module, which computes the corresponding features and feeds them to the trained model. The model outputs gain values, which are applied to the signal to achieve noise reduction (see Figure 3).
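The gain-application step at the end of that pipeline can be sketched as follows: the model outputs one gain per frequency band, and the SDK scales each band of the frame's spectrum by its gain before resynthesis. The frame size (480 samples, 10 ms at 48 kHz) and band count (22) are assumptions for illustration, not the product's actual configuration.

```python
import numpy as np

FRAME = 480       # 10 ms at 48 kHz (assumed framing)
N_BANDS = 22      # number of bands the model's gains cover (assumed)

def band_edges(n_bins, n_bands):
    # Evenly spaced band boundaries over the FFT bins; a real system
    # would typically use perceptually spaced (e.g. Bark) bands.
    return np.linspace(0, n_bins, n_bands + 1).astype(int)

def apply_band_gains(frame, gains):
    """Scale each frequency band of one frame by the model's gain for it."""
    spec = np.fft.rfft(frame)
    edges = band_edges(len(spec), len(gains))
    for b, g in enumerate(gains):
        spec[edges[b]:edges[b + 1]] *= g
    return np.fft.irfft(spec, len(frame))

frame = np.random.default_rng(0).normal(size=FRAME)
gains = np.ones(N_BANDS)
gains[N_BANDS // 2:] = 0.0   # pretend the model suppressed the upper bands
out = apply_band_gains(frame, gains)
```

Because the network predicts per-band gains rather than raw samples, the output layer stays small, which is one reason the whole model fits in tens of kilobytes.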
Figure 2. The GRU-based RNN network model

Figure 3. The model training process (top) and the real-time noise reduction process (bottom)
Figure 4 compares the speech spectrograms before and after noise reduction of a recording containing keystrokes. The upper half is the noisy speech signal before noise reduction, with the keyboard tapping noise marked by red rectangles; the lower half is the signal after noise reduction. It can be seen that most of the keyboard tapping is suppressed while damage to the speech is kept low.
Figure 4. Noisy speech (with keyboard tapping) before and after noise reduction

The current AI noise reduction model has been launched on mobile phones and the Jiaqin APP to improve their call noise reduction. It suppresses more than 100 noise scenarios in homes, offices, and elsewhere while keeping voice distortion low. In the next stage, we will continue to optimize the computational complexity of the AI noise reduction model so that it can also be deployed on low-power IoT devices.