With ChatGPT writing the script and Stable Diffusion generating the illustrations, the only thing still missing for a video is a voice actor. Now that is covered too!
Recently, researchers from Microsoft released VALL-E, a new text-to-speech (TTS) model that needs only a three-second audio sample to imitate a speaker's voice. It synthesizes the corresponding audio from the input text and can also preserve the speaker's emotional tone.
Paper link: https://www.php.cn/link/402cac3dacf2ef35050ca72743ae6ca7
Project link: https://valle-demo.github.io/
Code link: https://github.com/microsoft/unilm
Let's take a look at the results first. Suppose you have a 3-second recording.
[Audio sample: diversity_speaker, 0:03]
Then simply enter the text "Because we do not need it." to get the synthesized speech.
[Audio sample: diversity_s1, 0:01]
With a different random seed, the model produces a different rendition in the same personalized voice.
[Audio sample: diversity_s2, 0:02]
VALL-E can also preserve the acoustic environment of the speaker's recording. Take this clip as input, for example.
[Audio sample: env_speaker, 0:03]
Then, given the text "I think it's like you know um more convenient too.", the model outputs synthesized speech that keeps the same ambient sound.
[Audio sample: env_vall_e, 0:02]
VALL-E can also preserve the speaker's emotion. Take this angry voice as input, for example.
[Audio sample: anger_pt, 0:03]
Given the text "We have to reduce the number of plastic bags.", the synthesized speech also sounds angry.
[Audio sample: anger_ours, 0:02]
There are many more examples on the project website.
Specifically, the researchers trained a language model, VALL-E, on discrete codes extracted from an off-the-shelf neural audio codec model, and treated TTS as a conditional language modeling task rather than continuous signal regression.
In the pre-training stage, VALL-E was trained on 60,000 hours of English speech, hundreds of times more than the data used by existing systems.
VALL-E also demonstrates in-context learning: it needs only a 3-second enrolled recording of an unseen speaker as an acoustic prompt to synthesize high-quality personalized speech.
Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity, and that it can also preserve the speaker's emotion and the acoustic environment of the prompt in the synthesized speech.
Over the past decade, speech synthesis has made huge breakthroughs through the development of neural networks and end-to-end modeling.
However, current cascaded text-to-speech (TTS) systems usually rely on a pipeline of an acoustic model and a vocoder, with mel spectrograms as the intermediate representation.
Although some high-performance TTS systems can synthesize high-quality speech for single or multiple speakers, they still require clean, high-quality studio recordings. Large-scale data scraped from the Internet cannot meet this requirement and degrades model performance.
Because the training data is relatively small, current TTS systems still generalize poorly.
In the zero-shot setting, speaker similarity and speech naturalness drop sharply for speakers that do not appear in the training data.
To address zero-shot TTS, existing work usually relies on methods such as speaker adaptation and speaker encoding, which require additional fine-tuning, complex pre-designed features, or heavy structure engineering.
Rather than designing a complex, specialized network for this problem, the researchers believe the ultimate solution is to train the model on as much diverse data as possible, motivated by the success of this approach in text synthesis.
In text synthesis, large-scale unlabeled data from the Internet is fed directly into the model, and model performance keeps improving as the amount of training data grows.
The researchers carried this idea over to speech synthesis. VALL-E is the first language-model-based TTS framework, and it leverages massive, diverse, multi-speaker speech data.
To synthesize personalized speech, VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and on the phoneme prompt, which constrain the speaker and content information respectively.
Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder.
The discrete acoustic tokens from the audio codec model allow TTS to be treated as conditional codec language modeling, so advanced prompting-based large-model techniques (as in GPTs) can be applied to TTS tasks.
Different sampling strategies can also be applied to the acoustic tokens during inference to produce diverse synthesis results.
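To make the discrete representation more concrete, the short sketch below counts how many acoustic tokens a prompt of a given length would produce. The frame rate (75 codec frames per second) and the number of codebooks (8) are assumptions based on the codec configuration reported in the VALL-E paper, not figures stated in this article, and the function name is hypothetical.

```python
def acoustic_token_shape(duration_s, frame_rate_hz=75, n_codebooks=8):
    """Return (codebooks, frames) for a clip of the given duration.

    Assumed values: 75 codec frames per second and 8 codebooks per frame.
    """
    n_frames = round(duration_s * frame_rate_hz)
    return n_codebooks, n_frames


codebooks, frames = acoustic_token_shape(3.0)
# A 3-second prompt becomes an 8 x 225 grid of discrete tokens.
print(codebooks, frames, codebooks * frames)  # 8 225 1800
```

Under these assumptions, a 3-second prompt is a small grid of integers, which is what lets a language model consume it like any other token sequence.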
The researchers trained VALL-E on the LibriLight dataset, which consists of 60,000 hours of English speech from more than 7,000 unique speakers. The raw data is audio-only, so a speech recognition model was used to generate the transcripts.
Compared with previous TTS training datasets such as LibriTTS, this data contains more noisy speech and inaccurate transcriptions, but it offers more diverse speakers and prosodies.
The researchers believe that the proposed method is robust to noise and can leverage big data to generalize well.
It is worth noting that existing TTS systems are typically trained on dozens of hours of single-speaker data or hundreds of hours of multi-speaker data, which is hundreds of times less than what VALL-E uses.
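As a quick sanity check on the scale comparison, the arithmetic can be written out as follows. The 60,000-hour figure comes from the article; the two smaller corpus sizes are hypothetical values chosen only to stand for "dozens" and "hundreds" of hours.

```python
VALLE_HOURS = 60_000  # training data size stated in the article

# Hypothetical corpus sizes, used only to illustrate the order of magnitude.
EXAMPLE_CORPORA = {
    "corpus with dozens of hours (assumed 50 h)": 50,
    "corpus with hundreds of hours (assumed 500 h)": 500,
}


def scale_factor(large_hours, small_hours):
    """How many times larger the first corpus is than the second."""
    return large_hours / small_hours


for name, hours in EXAMPLE_CORPORA.items():
    print(f"{name}: VALL-E data is {scale_factor(VALLE_HOURS, hours):.0f}x larger")
```

With these assumed sizes the ratios come out at 1200x and 120x, which is consistent with the "hundreds of times" description.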
In short, VALL-E is a brand-new language-model approach to TTS that uses audio codec codes as the intermediate representation and leverages large, diverse data to give the model strong in-context learning capabilities.
Inference: In-Context Learning via Prompting
In-context learning is a surprising ability of text-based language models: they can predict labels for unseen inputs without any additional parameter updates.
For TTS, a model is considered to have in-context learning capability if it can synthesize high-quality speech for unseen speakers without fine-tuning.
However, existing TTS systems lack strong in-context learning capability, because they either require additional fine-tuning or degrade significantly on unseen speakers.
For language models, prompting is necessary to enable in-context learning in the zero-shot scenario.
The prompts and inference procedure designed by the researchers are as follows:
First, the text is converted into a phoneme sequence and the enrolled recording is encoded into an acoustic matrix, forming the phoneme prompt and the acoustic prompt. Both prompts are used by the AR and NAR models.
For the AR model, sampling-based decoding conditioned on the prompts is used, because beam search may cause the LM to fall into an infinite loop. Sampling-based decoding can also significantly increase the diversity of the output.
For the NAR model, greedy decoding is used to select the token with the highest probability.
Finally, the neural codec decoder generates the waveform conditioned on the eight code sequences.
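The two decoding rules described above can be sketched in a few lines. This is a minimal illustration over a toy logit vector, not the authors' implementation; the function names, the temperature parameter and the logit values are all hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng, temperature=1.0):
    """Sampling-based decoding: draw a token index from the distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_token(logits):
    """Greedy decoding: pick the index with the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


toy_logits = [1.2, 0.3, 2.5, 0.9]  # hypothetical scores over a 4-token vocabulary

print("greedy:", greedy_token(toy_logits))  # always index 2
for seed in (0, 1, 2):
    rng = random.Random(seed)
    print("seed", seed, "sampled:", [sample_token(toy_logits, rng) for _ in range(8)])
```

Greedy decoding is deterministic, while sampling can return different sequences for different seeds, which is the source of the output diversity mentioned in the text.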
The acoustic prompt may or may not be semantically related to the speech to be synthesized, which gives two cases:
VALL-E: The main goal is to generate the given content for unseen speakers.
The model's input is a text sentence, a segment of enrolled speech, and its corresponding transcription. The transcription phonemes of the enrolled speech are prepended to the phoneme sequence of the given sentence as the phoneme prompt, and the first-layer acoustic tokens of the enrolled speech are used as the acoustic prefix. With the phoneme prompt and the acoustic prefix, VALL-E generates the acoustic tokens for the given text, cloning the speaker's voice.
VALL-E-continual: The whole transcription and the first 3 seconds of the utterance are used as the phoneme and acoustic prompts respectively, and the model is asked to generate the continuation.
The inference process is the same as in the VALL-E setting, except that the enrolled speech and the generated speech are semantically continuous.
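The difference between the two settings comes down to how the prompts are assembled. The sketch below uses plain Python lists as stand-ins for phoneme and acoustic-token sequences. The function names, the toy phoneme symbols, the token values and the 75 frames-per-second rate are assumptions made for illustration, not the authors' code.

```python
def build_valle_prompt(enrolled_phonemes, enrolled_codes, target_phonemes):
    """VALL-E setting: the enrolled speech and the target text are unrelated.

    The enrolled transcript's phonemes are prepended to the target phonemes,
    and the enrolled recording's first-level codes form the acoustic prefix.
    """
    phoneme_prompt = list(enrolled_phonemes) + list(target_phonemes)
    acoustic_prefix = list(enrolled_codes)
    return phoneme_prompt, acoustic_prefix


def build_continual_prompt(utterance_phonemes, utterance_codes,
                           prefix_seconds=3, frame_rate_hz=75):
    """VALL-E-continual setting: the model continues the same utterance.

    The whole transcript is the phoneme prompt; only the first few seconds
    of the utterance's first-level codes are kept as the acoustic prefix.
    """
    n_prefix = prefix_seconds * frame_rate_hz  # assumed codec frame rate
    return list(utterance_phonemes), list(utterance_codes[:n_prefix])


# Toy inputs (hypothetical values).
phonemes, prefix = build_valle_prompt(["HH", "AH", "L", "OW"], [17, 4, 250], ["W", "IY"])
print(phonemes)  # ['HH', 'AH', 'L', 'OW', 'W', 'IY']
print(prefix)    # [17, 4, 250]

# A 4-second utterance at the assumed frame rate has 300 first-level codes.
_, cont_prefix = build_continual_prompt(["W", "IY"], list(range(300)))
print(len(cont_prefix))  # 225
```

In both cases the model then generates the acoustic tokens that follow the prefix, so the only change between the settings is what the prefix and the phoneme sequence contain.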
The researchers evaluated VALL-E on the LibriSpeech and VCTK datasets, where none of the test speakers appear in the training corpus.
VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity, with improvements of +0.12 in comparative mean opinion score (CMOS) and +0.93 in similarity mean opinion score (SMOS) on LibriSpeech.
VALL-E also beats the baseline on VCTK, with improvements of +0.11 SMOS and +0.23 CMOS. It even achieves a +0.04 CMOS score against the ground truth, indicating that on VCTK, synthesized speech for unseen speakers is as natural as human recordings.
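For readers unfamiliar with these metrics, both CMOS and SMOS are averages of listener ratings. The sketch below assumes commonly used scales: CMOS ratings compare two systems on a scale from -3 to 3, and SMOS ratings score speaker similarity from 1 to 5. The ratings in the example are made up for illustration and are not results from the paper.

```python
from statistics import mean


def cmos(ratings):
    """Comparative score: mean of per-listener preference ratings.

    Each rating is assumed to lie in [-3, 3]; positive values mean the
    proposed system was preferred over the reference.
    """
    if any(not -3 <= r <= 3 for r in ratings):
        raise ValueError("CMOS ratings are assumed to lie in [-3, 3]")
    return mean(ratings)


def smos(ratings):
    """Similarity score: mean of per-listener similarity ratings in [1, 5]."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("SMOS ratings are assumed to lie in [1, 5]")
    return mean(ratings)


# Made-up listener ratings, for illustration only.
print(f"CMOS: {cmos([1, 0, -1, 1, 0]):+.2f}")      # CMOS: +0.20
print(f"SMOS: {smos([4.0, 4.5, 3.5, 4.0]):.2f}")   # SMOS: 4.00
```

A CMOS close to zero against the ground truth therefore means listeners showed almost no preference between the synthesized speech and the real recording.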
Furthermore, qualitative analysis shows that VALL-E can synthesize diverse outputs for the same text and target speaker, which may benefit pseudo-data creation for speech recognition tasks.
The experiments also show that VALL-E can preserve the acoustic environment (such as reverberation) and the emotion (such as anger) of the acoustic prompt.
Security risks
If a powerful technology is misused, it can harm society. For example, the bar for phone fraud has just been lowered again!
Because of VALL-E's potential for mischief and deception, Microsoft has not released VALL-E's code or an interface for testing.
One netizen shared: Suppose you call the system administrator, record a few words as they say "Hello", and then use those words to re-synthesize the sentence "Hello, I am the system administrator. My voice is a unique identifier and can be safely verified." I always thought this was impossible, because you couldn't accomplish it with so little data. Now it seems I may have been wrong...
In the Ethics Statement at the end of the project page, the researchers state that "the experiments in this work were carried out under the assumption that the user of the model is the target speaker and has the speaker's approval. However, when the model is generalized to unseen speakers, the relevant components should be accompanied by speech editing models, including a protocol to ensure that the speaker agrees to the modification and a system to detect edited speech."
The authors also state in the paper that, since VALL-E can synthesize speech that preserves speaker identity, it carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker.
To mitigate this risk, a detection model can be built to discriminate whether an audio clip was synthesized by VALL-E. The authors add that they will put Microsoft AI Principles into practice as they further develop the models.