Google introduced MusicLM – a model for generating music from text
A team of Google engineers has presented a new music-generation AI system called MusicLM. The model creates high-quality music from textual descriptions such as "a calming violin melody backed by a distorted guitar riff." It works similarly to DALL-E, which generates images from text.
MusicLM uses AudioLM's multi-stage autoregressive modeling as its generative component and extends it to text conditioning. To address the main challenge, the scarcity of paired music-text data, the researchers applied MuLan, a joint music-text model trained to project music and its corresponding text description to nearby points in a shared embedding space.
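The idea behind MuLan's shared embedding space can be illustrated with a minimal sketch. The encoders and embeddings below are hypothetical stand-ins, not MuLan's actual towers; the point is only that a matching music-text pair should score a higher cosine similarity than a mismatched pair, which lets a text embedding stand in for audio at generation time.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: in a joint music-text model, an audio encoder
# and a text encoder each map their input into the same d-dimensional
# space, and training pulls matching pairs together while pushing
# mismatched pairs apart. We simulate that outcome directly.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=128)
matching_audio_emb = text_emb + 0.1 * rng.normal(size=128)  # close to the text
unrelated_audio_emb = rng.normal(size=128)                  # independent of it

print(cosine_similarity(text_emb, matching_audio_emb))   # high, near 1.0
print(cosine_similarity(text_emb, unrelated_audio_emb))  # near 0
```

With such a space, unlabeled music can be paired with pseudo-captions at training time and swapped for a real text prompt at inference, which is how the paired-data scarcity is sidestepped.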
MusicLM is trained on a large dataset of unlabeled music. It casts conditional music generation as a hierarchical sequence-to-sequence modeling task and generates music at 24 kHz that remains consistent over several minutes. To address the lack of evaluation data, the developers released MusicCaps, a new high-quality music-caption dataset with 5,500 music-text pairs prepared by professional musicians.
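The hierarchical modeling can be sketched as a two-stage pipeline: a first autoregressive stage produces coarse "semantic" tokens that capture long-term structure, and a second stage expands them into fine-grained acoustic tokens that a neural codec would decode to 24 kHz audio. Everything below is an illustrative stub under those assumptions; the vocabulary sizes, token rates, and the toy "models" are not the actual MusicLM components.

```python
import random

SEMANTIC_VOCAB = 1024     # assumed size of the coarse token vocabulary
ACOUSTIC_VOCAB = 4096     # assumed size of the fine token vocabulary
TOKENS_PER_SEMANTIC = 4   # acoustic tokens run at a higher rate

def generate_semantic_tokens(conditioning_seed: int, length: int) -> list[int]:
    """Stage 1 (stubbed): an autoregressive model over coarse semantic
    tokens, conditioned on a text/melody embedding (here just a seed)."""
    rng = random.Random(conditioning_seed)
    return [rng.randrange(SEMANTIC_VOCAB) for _ in range(length)]

def generate_acoustic_tokens(semantic: list[int]) -> list[int]:
    """Stage 2 (stubbed): predict fine acoustic tokens conditioned on the
    semantic tokens; a codec decoder would turn these into a waveform."""
    rng = random.Random(sum(semantic))
    return [rng.randrange(ACOUSTIC_VOCAB)
            for _ in range(len(semantic) * TOKENS_PER_SEMANTIC)]

semantic = generate_semantic_tokens(conditioning_seed=42, length=8)
acoustic = generate_acoustic_tokens(semantic)
print(len(semantic), len(acoustic))  # 8 32
```

The design point the hierarchy buys is that long-range structure (melody, sections) is decided over a short coarse sequence, while the expensive fine-grained sequence only has to stay locally consistent with it.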
Experiments demonstrate that MusicLM outperforms previous systems in both audio quality and adherence to the text description. In addition, MusicLM can be conditioned on both text and melody: it can transform a melody, even one that is whistled or hummed, to match the style described in the text.
See the model demo on the website.
The system was trained on a dataset of five million audio clips, representing 280,000 hours of music. MusicLM can create pieces of different lengths, from a quick riff to an entire song. It can even generate songs with alternating sections, as is common in symphonies, to create a sense of narrative. The system also handles specific requests, such as particular instruments or a certain genre, and can generate a semblance of vocals.
The creation of MusicLM is part of a wave of deep-learning AI applications designed to reproduce human mental abilities such as talking, writing papers, drawing, taking tests, and proving mathematical theorems.
For now, the developers have announced that Google will not release the system for public use. Testing showed that approximately 1% of the music generated by the model was copied directly from real recordings, so the team is wary of content misappropriation and lawsuits.