Music generation is one of the earliest applications of programmatic technology to enter public, recreational use. The synthesizer, the loop machine, and the beat pad are all examples of how basic programming could be used to create novel music, and such tools have only grown in versatility and applicability. Modern production employs complex software to make minute adjustments to recordings to achieve the polished sonic qualities we associate with released tracks. Beyond that, long-established digital audio workstations like Ableton Live and Logic Pro X let users create music directly themselves - with nothing but a laptop!
Machine and deep learning attempts at music generation have achieved mixed results in the past. Projects from large companies like Google have produced notable work in this field, such as MusicLM, but the trained models have largely remained restricted to internal use.
In today's article, we will be discussing MusicGen - one of the first powerful, easy-to-use music generators ever released to the public. This text-to-audio model was trained on 20,000 hours of audio data: 10,000 hours from an internal, proprietary dataset and the rest from the ShutterStock and Pond5 music collections. It is capable of quickly generating novel music unconditionally, from a text prompt, or as a continuation of an existing input song.
We will start with a brief overview of the MusicGen model itself, discussing its capabilities and architecture to help build a full understanding of the AudioCraft technology at play. Afterwards, we will walk through the provided demo Notebook in Paperspace Gradient, so that we can demonstrate the power of these models on Paperspace's powerful GPUs. Click the link at the top of this article to open the demo in a Free GPU-powered Notebook.
MusicGen is a sophisticated, autoregressive transformer-based model capable of generating novel music under a variety of tasks and conditions. These include:
- Unconditional: generating music without any sort of prompting or input
- Music continuation: generating a plausible continuation that picks up where a provided audio snippet leaves off
- Text-conditional generation: generating music with instruction provided by text that can control genre, instrumentation, tempo and much more
- Melody-conditional generation: generating music that follows both a text description and the melody of a provided audio track
This was made possible through a comprehensive training process: the models were trained on 20,000 hours of licensed music. Specifically, the researchers created an internal dataset of 10,000 high-quality music tracks, and augmented it with the ShutterStock and Pond5 music datasets, which contribute some 400,000 additional instrument-only tracks.
To learn from all this data, MusicGen was built around an autoregressive, transformer-based decoder architecture, conditioned on a text or melody representation. For the audio tokenization model, the authors used a non-causal, five-layer EnCodec model for 32 kHz monophonic audio. The embeddings are quantized with Residual Vector Quantization (RVQ) using four quantizers, each with a codebook size of 2048. Each quantizer encodes the quantization error left by the previous quantizer, so quantized values from different codebooks are, in general, not independent, and the first codebook is the most important one.
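The residual scheme described above is easy to illustrate. Below is a toy, pure-Python sketch of RVQ on scalar values - the codebooks here are invented for illustration, while the real model quantizes high-dimensional embedding vectors with learned codebooks of size 2048:

```python
# Toy sketch of Residual Vector Quantization (RVQ): each stage quantizes
# the error left by the previous stage, and decoding sums the entries.
# These 1-D codebooks are hypothetical, purely for illustration.

def quantize(value, codebook):
    """Return the index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

def rvq_encode(value, codebooks):
    """Each quantizer encodes the residual left by the previous one."""
    indices, residual = [], value
    for cb in codebooks:
        idx = quantize(residual, cb)
        indices.append(idx)
        residual -= cb[idx]  # pass the quantization error to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """The reconstruction is the sum of the selected entries."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Coarse first stage, progressively finer later stages.
codebooks = [
    [-1.0, 0.0, 1.0],     # stage 1: coarse
    [-0.25, 0.0, 0.25],   # stage 2: refines stage 1's error
    [-0.05, 0.0, 0.05],   # stage 3: finest
]
codes = rvq_encode(0.8, codebooks)    # [2, 0, 2]
approx = rvq_decode(codes, codebooks)  # 1.0 - 0.25 + 0.05 = 0.8
```

This is why the first codebook is the most important: it carries the coarse value, while later codebooks only correct ever-smaller residuals.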
To run the demo, click the link above or at the top of this page. This will open a new Gradient Notebook with everything needed within. All of this code can be found in the official repo for the AudioCraft MusicGen project.
```bash
!pip install -r requirements.txt
!pip install -e .
```
Once our Notebook has spun up, the first thing we want to do is install the requirements and the MusicGen package itself. To do so, run the first code cell in the notebook.
```python
from audiocraft.models import MusicGen

# Loading the melody model; `medium` or `large` would give better results.
model = MusicGen.get_pretrained('melody')
```
Once that has completed, scroll down to the next code cell. This is what we will use to load the model into our cache for this session. This won't count against our storage capacity, so feel free to try them all. It's doubtful that the large model will run quickly on the Free GPU, however, so let's use the melody model for now.
```python
model.set_generation_params(
    use_sampling=True,
    top_k=250,
    duration=5
)
```
In the next code cell, we can set our generation parameters. These settings will be used throughout the notebook unless overwritten later on. The only one we may want to alter is the duration: the 5-second default is barely a song snippet, and it can be extended all the way to 30 seconds.
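Longer durations also mean longer generation times. Assuming MusicGen's 32 kHz EnCodec tokenizer emits tokens at a 50 Hz frame rate (50 token steps per second of audio, per codebook - an assumption based on the model's published configuration), a quick back-of-the-envelope sketch shows how the autoregressive decoding cost scales:

```python
# Back-of-the-envelope decoding cost, assuming a 50 Hz tokenizer frame
# rate (50 autoregressive steps per second of audio, per codebook).
FRAME_RATE_HZ = 50  # assumed frame rate of the EnCodec tokenizer

def decoding_steps(duration_s, frame_rate=FRAME_RATE_HZ):
    """Autoregressive steps needed to generate duration_s seconds of audio."""
    return int(duration_s * frame_rate)

short_clip = decoding_steps(5)   # the notebook's 5-second default -> 250 steps
long_clip = decoding_steps(30)   # the 30-second maximum -> 1500 steps
```

So a 30-second clip requires six times as many sequential decoding steps as the default, which is why longer durations feel noticeably slower, especially on the Free GPU.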
```python
from audiocraft.utils.notebook import display_audio

output = model.generate_unconditional(num_samples=2, progress=True)
display_audio(output, sample_rate=32000)
```
The demo starts with a quick example of unconditional generation - synthesis without any control parameters or prompting. This will generate a tensor that we can then render within our notebook using the provided `display_audio` helper.
Here is the sample we got when we ran the melody model for a ten-second duration. While a tad rambling, it still maintains a relatively coherent beat and consistent instrumentation. While there is no miraculous emergence of a true melody, the quality speaks for itself.
One of the most interesting capabilities of this model is its ability to learn a song from a short snippet and mimic its instrumentation in a continuation from a set point. This allows for some creative methods of remixing and altering existing tracks, and it might serve as an excellent inspirational tool for artists.
```python
import math

import torch
import torchaudio
from audiocraft.utils.notebook import display_audio


def get_bip_bip(bip_duration=0.125, frequency=440,
                duration=0.5, sample_rate=32000, device="cuda"):
    """Generates a series of bips at the given frequency."""
    t = torch.arange(
        int(duration * sample_rate), device=device, dtype=torch.float) / sample_rate
    wav = torch.cos(2 * math.pi * frequency * t)[None]
    # Gate the tone: silent for the first half of every
    # 2 * bip_duration window, audible for the second half.
    tp = (t % (2 * bip_duration)) / (2 * bip_duration)
    envelope = (tp >= 0.5).float()
    return wav * envelope
```
To run the code, we first must define the `get_bip_bip` helper function. This generates a synthetic audio signal with which to prompt the model when no recorded input is provided.
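To see what `get_bip_bip` actually produces, its gating logic can be sketched in pure Python - a simplified re-implementation of the same arithmetic, not the library code: the tone is silenced for the first half of every `2 * bip_duration` window and audible for the second half, producing the "bip ... bip" pattern.

```python
import math

def bip_envelope(t, bip_duration=0.125):
    """1.0 when the bip is audible, 0.0 during the silent half-window."""
    phase = (t % (2 * bip_duration)) / (2 * bip_duration)
    return 1.0 if phase >= 0.5 else 0.0

sample_rate = 32000
# Half a second of a 440 Hz cosine tone, gated by the envelope.
times = [n / sample_rate for n in range(int(0.5 * sample_rate))]
signal = [math.cos(2 * math.pi * 440 * t) * bip_envelope(t) for t in times]
```

With the default `bip_duration=0.125`, the signal alternates between 0.125 s of silence and 0.125 s of tone, which gives the model a clear, metronome-like rhythmic prompt.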
```python
res = model.generate_continuation(
    get_bip_bip(0.125).expand(2, -1, -1), 32000,
    ['Jazz jazz and only jazz',
     'Heartful EDM with beautiful synths and chords'],
    progress=True)
display_audio(res, 32000)
```
We can then use that helper to generate the artificial signal that prompts the model at the start of synthesis. Here is an example made using two text prompts, each resulting in a generated song snippet.
Above are the examples we made in our run:
```python
# You can also use any audio from a file. Make sure to trim the file if it is too long!
prompt_waveform, prompt_sr = torchaudio.load("./assets/bach.mp3")  # <-- Path here
prompt_duration = 2
prompt_waveform = prompt_waveform[..., :int(prompt_duration * prompt_sr)]
output = model.generate_continuation(prompt_waveform, prompt_sample_rate=prompt_sr, progress=True)
display_audio(output, sample_rate=32000)
```
Finally, let's try this with our own music, rather than a synthetic signal. Let's use the code above, but replace the path to the song with one from our own library. For our example, we used a snippet from Ratatat's Shempi for a 30-second sample. Check it out below:
Now, rather than using a song as the initial input, let's try using text. This allows us a greater degree of control over the song's genre, instruments, tempo, and more with simple text. Use the code below to run text-conditional generation with MusicGen.
```python
from audiocraft.utils.notebook import display_audio

output = model.generate(
    descriptions=[
        '80s pop track with bassy drums and synth',
        '90s rock song with loud guitars and heavy drums',
    ],
    progress=True
)
display_audio(output, sample_rate=32000)
```
Here are the examples we received from our run:
Melody Conditional Generation
Now, let's combine everything in a single run. This way, we can augment our existing song with a text-controlled extension of the original track. Run the code below to get a sample using the provided Bach track, or substitute your own.
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.utils.notebook import display_audio

model = MusicGen.get_pretrained('melody')
model.set_generation_params(duration=30)

melody_waveform, sr = torchaudio.load("/notebooks/assets/bach.mp3")
melody_waveform = melody_waveform.unsqueeze(0)
output = model.generate_with_chroma(
    descriptions=[
        'Ratatat song',
    ],
    melody_wavs=melody_waveform,
    melody_sample_rate=sr,
    progress=True
)
display_audio(output, sample_rate=32000)
```
Listen to the original, and modified samples below:
MusicGen is a truly incredible model. It represents a substantive step forward for music synthesis, much as Stable Diffusion and GPT did for image and text synthesis. Look out in the coming months for this model to be iterated on heavily as open source developers seek to build on the massive success achieved here by Meta AI research.