


Google Recorder adds automatic speaker annotation, once again widening its feature lead over iOS Voice Memos
In 2019, Google launched Recorder, an audio recording app for its Pixel phones running Android. Comparable to Voice Memos on iOS, it supports recording, managing, and editing audio files. Since then, Google has added a series of machine-learning-based features to Recorder, including speech recognition, audio event detection, automatic title generation, and smart browsing.
However, when a recording is long and contains multiple speakers, some Recorder users find it inconvenient to use, because text obtained through speech recognition alone does not indicate who said each sentence. At this year’s Made By Google conference, Google announced an automatic speaker annotation feature for the Recorder app. The feature adds anonymous speaker labels (such as "Speaker 1" or "Speaker 2") to the speech-recognized text in real time, which greatly improves the readability and usefulness of the transcript. The underlying technology is called speaker diarization, referred to in this article as voiceprint segmentation and clustering. Google first presented its speaker diarization system, Turn-to-Diarize, at the ICASSP 2022 conference.
Left: the transcript with speaker annotation turned off. Right: the transcript with speaker annotation turned on.
System Architecture
Google’s Turn-to-Diarize system combines several highly optimized models and algorithms to perform real-time voiceprint segmentation and clustering of hours-long audio on mobile devices with very few computing resources. The system consists of three main components: a speaker switch detection model that detects changes of speaker, a voiceprint encoder model that extracts the voice characteristics of each speaker, and a multi-stage clustering algorithm that completes speaker annotation efficiently. All components run entirely on the user's device and do not rely on any server connection.
Architecture diagram of the Turn-to-Diarize system.
Speaker Switch Detection
The first component of the system is a speaker switch detection model based on the Transformer Transducer (T-T). This model converts the sequence of acoustic features into a text sequence that contains a special token, <st>, which marks a speaker switching event. Earlier papers published by Google used role-specific tokens such as <doctor> or <patient> to represent the identity of a particular speaker. Because the <st> token in the latest system is not tied to any specific identity, it applies more broadly.
In most applications, the output of the voiceprint segmentation and clustering system is not presented to the user directly but is combined with the output of a speech recognition model. Because the speech recognition model has already been optimized for word error rate during training, the speaker switch detection model can tolerate errors in the words themselves and instead focuses on the accuracy of the special speaker-switch token. On this basis, Google proposed a new token-level loss function that makes it possible to detect speaker switching events accurately with only a small model.
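To make the role of the speaker-switch token concrete, the following minimal Python sketch cuts a stream of timestamped tokens into single-speaker segments. It assumes the token is spelled "<st>" and that the detector emits (token, time) pairs; the names Segment and split_on_turns are hypothetical and are not part of Recorder or Turn-to-Diarize.

```python
from dataclasses import dataclass
from typing import List, Tuple

TURN_TOKEN = "<st>"  # assumed spelling of the speaker-switch token


@dataclass
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    words: List[str]  # recognized words inside the segment


def split_on_turns(tokens: List[Tuple[str, float]], audio_end: float) -> List[Segment]:
    """Cut a (token, timestamp) stream into single-speaker segments."""
    segments: List[Segment] = []
    words: List[str] = []
    seg_start = 0.0
    for token, time in tokens:
        if token == TURN_TOKEN:
            # A speaker switch closes the current segment, if it has any words.
            if words:
                segments.append(Segment(seg_start, time, words))
            words = []
            seg_start = time
        else:
            words.append(token)
    # Whatever remains after the last switch forms the final segment.
    if words:
        segments.append(Segment(seg_start, audio_end, words))
    return segments


tokens = [
    ("hello", 0.2), ("there", 0.6), ("<st>", 1.1),
    ("hi", 1.3), ("<st>", 2.0),
    ("how", 2.2), ("are", 2.5), ("you", 2.8),
]
segments = split_on_turns(tokens, audio_end=3.0)
for seg in segments:
    print(f"{seg.start:.1f}-{seg.end:.1f}s: {' '.join(seg.words)}")
```

Each resulting segment contains speech from a single turn, which is the unit that the later stages of the pipeline operate on.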
Voiceprint Feature Extraction
After the audio signal is segmented at speaker switching events, the system uses the voiceprint encoder model to extract a voiceprint embedding, known as a d-vector, from each speaker segment. In earlier papers published by Google, voiceprint embeddings were generally extracted from fixed-length audio windows. The new system improves on this in several ways. First, it avoids extracting embeddings from segments that contain speech from multiple speakers, which improves the overall quality of the embeddings. Second, each embedding corresponds to a relatively long speech segment and therefore carries more information about the speaker. Finally, the resulting sequence of embeddings is shorter, which makes the subsequent clustering step less computationally expensive.
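The idea of one embedding per speaker segment can be sketched as follows. The real voiceprint encoder is a trained neural network; here it is replaced by a stand-in that mean-pools frame features and L2-normalizes the result, purely to show the data flow. The feature dimensions, the frame boundaries, and the function names are assumptions made for illustration.

```python
import numpy as np


def embed_segment(frames: np.ndarray) -> np.ndarray:
    """Stand-in for the voiceprint encoder: mean-pool frames, then L2-normalize."""
    pooled = frames.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def embed_segments(features: np.ndarray, boundaries) -> np.ndarray:
    """Return one d-vector per (start_frame, end_frame) speaker segment."""
    return np.stack([embed_segment(features[s:e]) for s, e in boundaries])


rng = np.random.default_rng(0)
features = rng.normal(size=(300, 8))             # 300 frames of hypothetical 8-dim features
boundaries = [(0, 120), (120, 170), (170, 300)]  # variable-length single-speaker segments
d_vectors = embed_segments(features, boundaries)
print(d_vectors.shape)  # one row per segment, regardless of segment length
```

With fixed-length windows the same audio would yield many more embeddings, some of them straddling a speaker change; segment-level extraction yields one embedding per turn.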
Multi-stage clustering
The last step of voiceprint segmentation and clustering is to cluster the sequence of voiceprint embeddings obtained in the previous steps. Since recordings made with the Recorder app can range from just a few seconds to as long as 18 hours, a key challenge for the clustering algorithm is handling embedding sequences of widely varying lengths.
To this end, Google’s multi-stage clustering strategy combines the advantages of several different clustering algorithms. For shorter sequences, the strategy uses agglomerative hierarchical clustering (AHC). For medium-length sequences, it uses spectral clustering and applies the maximum eigengap criterion to the eigenvalues to estimate the number of speakers accurately. For longer sequences, it first uses AHC to pre-cluster the sequence and then runs spectral clustering, which reduces the computational cost of the clustering step. Throughout streaming processing, earlier clustering results are dynamically cached and reused, so the time and space complexity of each clustering call can be bounded by a constant.
The multi-stage clustering strategy is a key optimization for on-device applications, where resources such as CPU, memory, and battery are usually scarce. With this strategy, the system can keep running in a low-power state even after processing several hours of audio. The constant upper bound on complexity can usually be tuned for each device model to balance accuracy against performance.
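As an illustration of how the eigenvalues can be used to estimate the number of speakers, the sketch below applies a simplified eigengap criterion to a cosine affinity matrix: it sorts the eigenvalues in descending order and picks the position of the largest drop. This is a minimal sketch and not necessarily the exact criterion used in Recorder; the function name, the max_speakers parameter, and the synthetic data are assumptions.

```python
import numpy as np


def estimate_num_speakers(embeddings: np.ndarray, max_speakers: int = 8) -> int:
    """Estimate the speaker count from the largest gap between sorted eigenvalues."""
    n = embeddings.shape[0]
    if n < 2:
        return n
    # Cosine affinity between all pairs of embeddings, clipped to [0, 1].
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(normed @ normed.T, 0.0, 1.0)
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigenvalues = np.linalg.eigvalsh(affinity)[::-1]
    limit = min(max_speakers, n - 1)
    gaps = eigenvalues[:limit] - eigenvalues[1:limit + 1]
    # The k-th gap (1-based) separates eigenvalue k from eigenvalue k + 1.
    return int(np.argmax(gaps)) + 1


# Synthetic test data: three well-separated speakers, ten embeddings each.
rng = np.random.default_rng(0)
centers = np.eye(16)[:3]
embeddings = np.concatenate(
    [center + 0.05 * rng.normal(size=(10, 16)) for center in centers]
)
num_speakers = estimate_num_speakers(embeddings)
print(num_speakers)
```

When the speakers are well separated, the affinity matrix is close to block-diagonal, so it has one large eigenvalue per speaker followed by a sharp drop.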
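The length-based dispatch and the pre-clustering step can be sketched as below. The thresholds short_limit and main_limit are hypothetical placeholders for the device-specific bounds mentioned above, and ahc_compress is a deliberately naive agglomerative merge written for readability rather than speed; neither reflects the actual values or implementation in Recorder.

```python
import numpy as np


def choose_stage(num_embeddings: int, short_limit: int = 30, main_limit: int = 300) -> str:
    """Pick a clustering stage from the sequence length (thresholds are assumptions)."""
    if num_embeddings <= short_limit:
        return "ahc"
    if num_embeddings <= main_limit:
        return "spectral"
    return "ahc_then_spectral"


def ahc_compress(embeddings: np.ndarray, max_clusters: int):
    """Merge the most similar pair of centroids until at most max_clusters remain."""
    centroids = [e.astype(float) for e in embeddings]
    counts = [1] * len(centroids)
    while len(centroids) > max_clusters:
        mat = np.stack(centroids)
        normed = mat / np.linalg.norm(mat, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # never merge a centroid with itself
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        i, j = int(min(i, j)), int(max(i, j))
        # Replace the pair with its count-weighted mean.
        total = counts[i] + counts[j]
        centroids[i] = (centroids[i] * counts[i] + centroids[j] * counts[j]) / total
        counts[i] = total
        del centroids[j]
        del counts[j]
    return np.stack(centroids), counts


embeddings = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],
])
compressed, counts = ahc_compress(embeddings, max_clusters=2)
print(choose_stage(len(embeddings)), compressed.shape, counts)
```

Because the compressed sequence never exceeds max_clusters entries, the later spectral clustering call works on a matrix of bounded size no matter how long the recording is, which is the property that keeps the cost of each call constant.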
Schematic diagram of multi-stage clustering strategy.
Real-time correction and user annotation
Because Turn-to-Diarize is a real-time streaming system, its predicted speaker labels become more accurate as the model processes more audio. The Recorder app therefore keeps correcting previously predicted speaker labels while the user is recording, so that the labels shown on screen are always the most accurate ones available.
The Recorder user interface also lets users rename the speaker labels in each recording, for example renaming "Speaker 2" to "Car Dealer", which makes the transcript easier to read and remember.
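One simple way to turn raw cluster ids into the labels a user sees is sketched below. It numbers speakers by order of first appearance and then applies any user renames. The function name, the rename mapping, and the example name "Car Dealer" are illustrative assumptions and do not describe how Recorder stores labels.

```python
from typing import Dict, List, Optional


def display_labels(cluster_ids: List[int],
                   renames: Optional[Dict[str, str]] = None) -> List[str]:
    """Map cluster ids to "Speaker N" by first appearance, then apply user renames."""
    renames = renames or {}
    default_names: Dict[int, str] = {}
    labels: List[str] = []
    for cluster_id in cluster_ids:
        if cluster_id not in default_names:
            default_names[cluster_id] = f"Speaker {len(default_names) + 1}"
        default = default_names[cluster_id]
        # A user-chosen name, if present, overrides the anonymous label.
        labels.append(renames.get(default, default))
    return labels


# Cluster ids for five consecutive segments, as a clustering call might return them.
labels = display_labels([2, 2, 0, 2, 1], renames={"Speaker 2": "Car Dealer"})
print(labels)
```

Numbering by first appearance means the displayed names do not depend on the arbitrary ids that a particular clustering call happens to assign, so the function can be called again on the full label sequence each time earlier labels are corrected.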
Recorder allows users to rename speaker tags to improve readability.
Future Work
Google's latest Pixel phones ship with its custom-designed Google Tensor chip. The current voiceprint segmentation and clustering system runs mainly on the CPU block of Google Tensor. In the future, Google plans to run the system on the TPU block of Google Tensor to further reduce energy consumption. Google also hopes to extend the feature to languages other than English with the help of multilingual voiceprint encoders and speech recognition models.