January 19, 2022

A neural codec language model - VALL-E can reproduce a voice from a three-second audio recording

A team of researchers at Microsoft has introduced a new AI system that can mimic a person's voice from a recording just three seconds long. The researchers trained a neural codec language model called VALL-E on discrete codes derived from an off-the-shelf neural audio codec model, and they treat text-to-speech (TTS) as a conditional language modeling task rather than as continuous signal regression.
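To illustrate what "conditional language modeling" over discrete audio codes means, here is a minimal toy sketch in Python. It is not Microsoft's implementation: the function `toy_next_code_probs` is a hypothetical stand-in for a trained network, the codebook size of 16 is an arbitrary value chosen for the example, and all token values are made up. The only point is the loop structure: audio codes are sampled one at a time, each conditioned on the text, the speaker prompt, and the codes generated so far.

```python
import random

VOCAB_SIZE = 16  # hypothetical codebook size, kept tiny for this toy


def toy_next_code_probs(text_tokens, prompt_codes, generated):
    """Hypothetical stand-in for a trained network.

    Returns a probability distribution over the next audio code, given the
    text tokens, the speaker prompt codes, and the codes generated so far.
    """
    context = list(text_tokens) + list(prompt_codes) + list(generated)
    # Arbitrary deterministic scoring, so the result depends on the context.
    weights = [1 + (sum(context) + 7 * code) % 5 for code in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def synthesize_codes(text_tokens, prompt_codes, num_steps, seed=0):
    """Sample audio codes one at a time, the way a language model samples words."""
    rng = random.Random(seed)
    generated = []
    for _ in range(num_steps):
        probs = toy_next_code_probs(text_tokens, prompt_codes, generated)
        next_code = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        generated.append(next_code)
    return generated


# Made-up inputs: text tokens for the sentence to speak, and codes that
# would come from encoding the short speaker recording.
codes = synthesize_codes(text_tokens=[3, 1, 4], prompt_codes=[5, 9, 2], num_steps=10)
print(codes)
```

In a real system, the generated codes would then be passed to the audio codec's decoder to produce a waveform. A continuous signal regression approach would instead predict real-valued acoustic features directly, with no sampling from a discrete vocabulary.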

VALL-E builds on Meta's EnCodec audio compression technology, which was originally intended to improve the quality of phone conversations. Further work demonstrated that the technology is capable of much more. VALL-E can not only mimic a voice, but also reproduce the speaker's tone and even copy the acoustics of the environment in which the original recording was made. For example, if the original recording came from a telephone conversation, the synthesized speech will also sound like a telephone conversation.

VALL-E's developers used over 60,000 hours of recordings during the pre-training stage, hundreds of times more data than was used for existing systems. VALL-E exhibits emergent in-context learning capabilities and can synthesize high-quality personalized speech from as little as a 3-second audio recording.

In addition to reducing the time needed to generate a new voice, VALL-E creates a much more natural-sounding synthetic voice than other models. According to the experimental results, VALL-E significantly outperforms current TTS systems in terms of speech naturalness and speaker similarity.

See the model demo on the website.

In the samples presented on this website, the "Speaker Prompt" column contains the speech samples that the model is asked to imitate. The "Ground Truth" column contains the target text spoken by the same person as in the recorded sample. The "Baseline" column is an example of traditional text-to-speech synthesis. Finally, the "VALL-E" column demonstrates the output of the new AI model.

For an example of a traditional online text-to-speech converter, try the TTS service provided by Qudata. It is completely free and available on both desktop and mobile devices.

Microsoft has not made the source code for VALL-E public, noting that the model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. As a result, people who want to test the model themselves cannot currently do so.
