News on January 10th: Microsoft recently unveiled an artificial intelligence tool called VALL-E that can imitate a person's voice from just 3 seconds of sample audio.
The tool is trained on 60,000 hours of English speech data and generates speech from a 3-second clip of a specific speaker. Unlike many current AI tools, VALL-E can replicate a speaker's emotion and tone, even for words the speaker has never actually said.
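To make the workflow concrete, below is a minimal, purely illustrative Python sketch of what such prompt-based zero-shot speech synthesis looks like at the code level: a short enrollment clip is encoded into discrete audio tokens, and the model continues that token sequence conditioned on new text. VALL-E itself is not open sourced, so every function, class, and shape here is an assumption made up for illustration, not Microsoft's actual implementation.

```python
"""Illustrative sketch of prompt-conditioned zero-shot TTS (not VALL-E's real code)."""
from dataclasses import dataclass
from typing import List


@dataclass
class AcousticPrompt:
    codec_tokens: List[int]   # discrete audio tokens from a ~3-second speaker clip
    sample_rate: int


def encode_prompt(waveform: List[float], sample_rate: int) -> AcousticPrompt:
    # Stand-in for a neural audio codec encoder that quantizes the waveform
    # into discrete tokens; here samples are simply bucketed to stay runnable.
    clip = waveform[: sample_rate * 3]                      # keep only 3 seconds
    tokens = [int((s + 1.0) * 511.5) for s in clip]
    return AcousticPrompt(codec_tokens=tokens, sample_rate=sample_rate)


def synthesize(text: str, prompt: AcousticPrompt) -> List[int]:
    # Stand-in for the language-model stage: conditioned on the target text and
    # the prompt tokens, it emits a continuation of audio tokens intended to
    # preserve the prompt speaker's timbre and tone.
    continuation = [hash((ch, len(prompt.codec_tokens))) % 1024 for ch in text]
    return prompt.codec_tokens + continuation


if __name__ == "__main__":
    fake_clip = [0.0] * (16000 * 3)            # placeholder for a 3-second recording
    prompt = encode_prompt(fake_clip, 16000)
    tokens = synthesize("Words the speaker never actually said.", prompt)
    print(f"Generated {len(tokens)} audio tokens; a codec decoder would turn these into a waveform.")
```

The key design idea the sketch tries to convey is that the speaker's voice is captured only through the short acoustic prompt, so no per-speaker fine-tuning is needed.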
IT House learned that the researchers used VALL-E to synthesize a number of voice samples, which are described in a paper posted on arXiv (Cornell University's preprint server). The AI-synthesized audio samples can be heard on GitHub.
The researchers note that in many cases VALL-E outperforms current text-to-speech models. However, the paper also points out that the model still has several problems: some words in a text prompt may be pronounced unclearly, dropped entirely, or duplicated in the output. The model also has difficulty reproducing certain voices, particularly those with accents.

Like other new AI technologies, VALL-E has raised safety and ethics concerns. Microsoft has issued an ethics statement on the use of VALL-E, but its future availability remains unclear. VALL-E has not yet been open sourced; Microsoft has created a VALL-E repository on GitHub, but it currently contains only a description file.