In 2019, Google launched Recorder, an Android recording app for its Pixel phones that is comparable to Voice Memos on iOS and supports recording, managing, and editing audio files. Since then, Google has steadily added machine-learning-powered features to Recorder, including speech recognition, audio event detection, automatic title generation, and smart scrolling.
However, when a recording is long and contains multiple speakers, Recorder could be inconvenient to use: the text produced by speech recognition alone cannot indicate who said each sentence. At this year's Made by Google event, Google announced an automatic speaker-labeling feature for the Recorder app. The feature adds anonymized speaker labels (such as "Speaker 1" or "Speaker 2") to the recognized text in real time, greatly improving the readability and usefulness of transcripts. The technology behind the feature is called speaker diarization, and Google first presented its diarization system, Turn-to-Diarize, at ICASSP 2022.
Left: transcript with speaker labels turned off. Right: transcript with speaker labels turned on.
Google's Turn-to-Diarize system combines several highly optimized models and algorithms to perform real-time speaker diarization of hours-long audio on mobile devices using very few computational resources. The system consists of three main components: a speaker turn detection model that detects changes of speaker, a speaker encoder model that extracts the voice characteristics of each speaker, and a multi-stage clustering algorithm that assigns speaker labels efficiently. All components run entirely on the user's device and do not rely on any server connection.
Architecture diagram of the Turn-to-Diarize system.
The first component of the system is a speaker turn detection model based on the Transformer Transducer (T-T). The model converts a sequence of acoustic features into a text sequence containing a special token, `<st>`, which marks a speaker turn event. Previous papers published by Google used speaker-specific tokens such as `<spk:1>` or `<spk:2>` to represent the identity of a particular speaker. Because the `<st>` token is not tied to a specific identity, it can be applied far more broadly.
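As a rough illustration of how downstream code might consume such a token stream (a minimal sketch, not Google's implementation; the token name `<st>` follows the description above), the snippet below splits recognized tokens into speaker turns:

```python
# Minimal sketch (not Google's implementation): split a decoded token
# stream into speaker turns at each "<st>" speaker-turn token.
from typing import List

TURN_TOKEN = "<st>"  # token name as described above

def split_into_turns(tokens: List[str]) -> List[str]:
    """Group recognized tokens into turn-level text segments."""
    turns, current = [], []
    for tok in tokens:
        if tok == TURN_TOKEN:
            if current:                     # close the current turn
                turns.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:                             # flush the final turn
        turns.append(" ".join(current))
    return turns

# Example: two speakers alternating.
print(split_into_turns(
    ["hello", "there", "<st>", "hi", "how", "are", "you", "<st>", "good"]
))
# -> ['hello there', 'hi how are you', 'good']
```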
For most applications, the output of a speaker diarization system is not shown to the user directly but is combined with the output of a speech recognition model. Since the speech recognition model is already optimized for word error rate during training, the speaker turn detection model can tolerate word errors, as long as the `<st>` token is predicted accurately. Based on this observation, Google proposed a new token-based loss function that enables accurate detection of speaker turn events with a much smaller model.
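To make the idea concrete, here is one hypothetical form such a token-focused objective could take: a cross-entropy loss that up-weights the `<st>` token relative to ordinary word tokens. The weighting scheme is an assumption for illustration, not the loss function from the paper.

```python
# Hypothetical sketch of a token-weighted loss: ordinary word tokens carry a
# lower weight, while the "<st>" speaker-turn token is up-weighted.
# Illustration of the idea only; not the actual Turn-to-Diarize loss.
import numpy as np

def turn_weighted_loss(log_probs, targets, st_id, st_weight=5.0, word_weight=1.0):
    """log_probs: [T, V] log-softmax outputs; targets: [T] target token ids."""
    weights = np.where(targets == st_id, st_weight, word_weight)
    nll = -log_probs[np.arange(len(targets)), targets]  # per-token negative log-likelihood
    return float((weights * nll).sum() / weights.sum())
```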
After the audio is segmented at speaker turn events, the system uses a speaker encoder model to extract a speaker embedding, known as a d-vector, from each segment. In previous papers published by Google, speaker embeddings were generally extracted from fixed-length audio windows. The new system improves on this in several ways. First, it avoids extracting embeddings from segments that contain speech from multiple speakers, which improves the overall quality of the embeddings. Second, each embedding covers a relatively long speech segment and therefore captures more information about that speaker's voice. Finally, the resulting embedding sequence is shorter, which makes the subsequent clustering algorithm much cheaper to run.
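The sketch below shows one way to pool frame-level encoder outputs into one d-vector per turn. It is a minimal illustration under stated assumptions: `encode_frames` is a hypothetical stand-in for the on-device speaker encoder, returning one embedding vector per input frame.

```python
# Minimal sketch: pool frame-level encoder outputs into one L2-normalized
# d-vector per speaker turn. `encode_frames` is a hypothetical stand-in
# for the on-device speaker encoder.
import numpy as np

def turn_dvectors(features, turn_frames, encode_frames):
    """features: [T, D] acoustic frames; turn_frames: frame indices of <st> events."""
    bounds = [0, *turn_frames, len(features)]
    dvectors = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end <= start:
            continue
        frame_emb = encode_frames(features[start:end])  # [n, E] per-frame embeddings
        d = frame_emb.mean(axis=0)                      # average-pool over the turn
        dvectors.append(d / np.linalg.norm(d))          # L2-normalize the d-vector
    return np.stack(dvectors)                           # [num_turns, E]

# Toy usage with a dummy "encoder" (a fixed random projection).
rng = np.random.default_rng(0)
proj = rng.normal(size=(40, 256))
emb = turn_dvectors(rng.normal(size=(300, 40)), [120, 210],
                    encode_frames=lambda x: x @ proj)
print(emb.shape)  # (3, 256)
```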
The last step of the diarization pipeline is to cluster the sequence of speaker embeddings produced in the previous steps. Since recordings made with the Recorder app can range from just a few seconds to as long as 18 hours, a key challenge for the clustering algorithm is handling embedding sequences of vastly different lengths.
To this end, Google's multi-stage clustering strategy cleverly combines the strengths of several clustering algorithms. For short sequences, the strategy uses agglomerative hierarchical clustering (AHC). For medium-length sequences, it uses spectral clustering and estimates the number of speakers with the eigengap criterion. For long sequences, it first pre-clusters the sequence with AHC and then applies spectral clustering, reducing the computational cost of the clustering step. Throughout streaming processing, previously computed clustering results are dynamically cached and reused, which caps the time and space complexity of each clustering call at a constant.
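The sketch below illustrates the multi-stage idea using off-the-shelf clustering libraries. The thresholds, sequence-length cutoffs, and the exact pre-clustering scheme are assumptions for illustration, not Google's production values; the caching of previous results is omitted.

```python
# Illustrative sketch of multi-stage clustering: AHC for short sequences,
# spectral clustering with an eigengap-based speaker count for medium ones,
# and AHC pre-clustering before spectral clustering for long ones.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import AgglomerativeClustering, SpectralClustering

def estimate_num_speakers(affinity, max_speakers=8):
    """Eigengap criterion: pick k at the largest gap between the smallest
    Laplacian eigenvalues."""
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals = eigh(laplacian, eigvals_only=True)[:max_speakers + 1]
    return int(np.argmax(np.diff(eigvals))) + 1

def multi_stage_cluster(emb, short=50, long=500):
    """emb: [N, E] L2-normalized speaker embeddings (d-vectors)."""
    if len(emb) <= short:
        # Short sequences: plain AHC with a cosine-distance threshold.
        return AgglomerativeClustering(
            n_clusters=None, distance_threshold=0.5,
            metric="cosine", linkage="average").fit_predict(emb)
    if len(emb) > long:
        # Long sequences: tight AHC pre-clustering shrinks the problem,
        # then the pre-cluster centroids are clustered recursively.
        pre = AgglomerativeClustering(
            n_clusters=None, distance_threshold=0.2,
            metric="cosine", linkage="average").fit_predict(emb)
        centroids = np.stack([emb[pre == c].mean(axis=0) for c in np.unique(pre)])
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
        return multi_stage_cluster(centroids, short, long)[pre]
    # Medium sequences: spectral clustering on a cosine affinity matrix.
    affinity = np.clip(emb @ emb.T, 0.0, None)
    k = estimate_num_speakers(affinity)
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(affinity)
```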
The multi-stage clustering strategy is a key optimization for on-device applications, where resources such as CPU, memory, and battery are usually scarce. It allows the system to keep running in a low-power state even after processing several hours of audio, and its constant complexity cap can be tuned per device model to balance accuracy against performance.
Schematic diagram of the multi-stage clustering strategy.
Because Turn-to-Diarize is a real-time streaming system, its predicted speaker labels become more accurate as more audio is processed. The Recorder app therefore continuously corrects previously predicted speaker labels while the user is recording, ensuring that the labels shown on screen are always the most accurate available.
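One simple way to keep on-screen labels stable when a new clustering pass reshuffles the raw cluster ids (an illustrative assumption, not necessarily how Recorder implements it) is to remap the new ids onto the previous ones by maximum overlap:

```python
# Hedged sketch: after re-clustering, remap new cluster ids so that turns
# the user already saw keep their old labels wherever possible
# (maximum-overlap matching via the Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

def remap_labels(old, new):
    """old, new: integer label arrays; `new` covers at least the turns in `old`."""
    n = min(len(old), len(new))
    k = max(old.max(), new.max()) + 1
    overlap = np.zeros((k, k), dtype=int)
    for o, w in zip(old[:n], new[:n]):
        overlap[w, o] += 1                        # rows: new ids, cols: old ids
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    mapping = dict(zip(rows, cols))
    return np.array([mapping[x] for x in new])

# Example: the new pass swapped ids 0 and 1; remapping restores consistency.
old = np.array([0, 0, 1, 1])
new = np.array([1, 1, 0, 0, 0])
print(remap_labels(old, new))  # -> [0 0 1 1 1]
```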
At the same time, the Recorder app's user interface lets users rename the speaker labels in each recording, for example renaming "Speaker 2" to "car dealership", making transcripts easier to read and remember.
Recorder allows users to rename speaker labels to improve readability.
Google's latest Pixel phones ship with its custom Google Tensor chip, and the current diarization system runs mainly on the CPU cores of Google Tensor. In the future, Google plans to run the system on the TPU block of Google Tensor to further reduce energy consumption. Google also hopes to extend the feature beyond English to other languages with the help of multilingual speaker encoders and speech recognition models.