MusicGen: the most powerful, easy to use music generator running on Paperspace

This tutorial walks through the MusicGen demo and shows how to run it in a Gradient Notebook.

3 years ago • 7 min read

By James Skelton

Bring this project to life

Music generation is notably one of the earliest applications of programmatic technology to enter public and recreational use. The synthesizer, looping machine, and beat pad are all examples of how basic programming could be used to create novel music, and this has only grown in versatility and applicability. Modern production employs complex software to make minute adjustments to physical recordings to achieve the sonic qualities we all associate with recorded tracks. Beyond that, there are plenty of long-established tools, like Ableton or Logic Pro X, that let users create music directly themselves - with nothing but a laptop!

Machine and Deep Learning attempts at music generation have achieved mixed results in the past. Projects from large companies like Google have produced notable work in this field, like their MusicLM project, but the trained models have largely remained restricted to internal use only.

In today's article, we will be discussing MusicGen - one of the first powerful, easy-to-use music generators ever to be released to the public. This text-to-audio model was trained on over 20,000 hours of licensed audio, drawn from an internal dataset of 10 thousand high-quality tracks plus the ShutterStock and Pond5 music datasets. It can quickly generate novel music unconditionally, from a text prompt, or as a continuation of an existing song supplied as input.

We will start with a brief overview of the MusicGen model itself, discussing its capabilities and architecture to help build a full understanding of the AudioCraft technology at play. Afterwards, we will walk through the provided demo Notebook in Paperspace Gradient, so that we can demonstrate the power of these models on Paperspace's powerful GPUs. Click the link at the top of this article to open the demo in a Free GPU powered Notebook.

MusicGen

The MusicGen application is a sophisticated transformer-based encoder-decoder model that can generate novel music under a variety of tasks and conditions. These include:

Unconditional: generating music without any sort of prompting or input
Music continuation: predicting the end portion of a song and recreating it
Text-conditional generation: generating music with instruction provided by text that can control genre, instrumentation, tempo and much more
Melody conditional generation: combining text and music continuation to create an augmented prediction for the music continuation

This was made possible by a comprehensive training process. The models were trained on 20 thousand hours of licensed music. Specifically, the authors created an internal dataset of 10 thousand high-quality music tracks and augmented it with the ShutterStock and Pond5 music datasets, which contribute some 400 thousand additional instrument-only tracks.

To learn from all this data, MusicGen was built as an autoregressive transformer-based decoder, conditioned on a text or melody representation. For audio tokenization, the authors used a non-causal, five-layer EnCodec model for 32 kHz monophonic audio. The embeddings are quantized with Residual Vector Quantization (RVQ) using four quantizers, each with a codebook size of 2048. Each quantizer encodes the quantization error left by the previous one, so the quantized values for different codebooks are generally not independent, and the first codebook is the most important.
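To make the residual quantization idea concrete, here is a minimal plain-Python sketch. It is not the EnCodec implementation: the codebooks are tiny and random (8 entries of dimension 2) rather than 2048 learned entries, and the names nearest, rvq_encode and rvq_decode are hypothetical helpers chosen for illustration. It only shows how each stage quantizes whatever error the previous stage left behind.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Whatever this stage could not represent is passed to the next one.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the chosen entry of every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup (assumption): 4 quantizers as in the article, but random
# codebooks whose entries shrink at each stage.
rng = random.Random(0)
codebooks = [
    [[rng.uniform(-1, 1) / (2 ** stage) for _ in range(2)] for _ in range(8)]
    for stage in range(4)
]

vec = [0.3, -0.7]
codes, residual = rvq_encode(vec, codebooks)
recon = rvq_decode(codes, codebooks)
print("codes:", codes)
print("reconstruction:", recon, "leftover error:", residual)
```

The output is one code per quantizer, which is why each audio frame ends up represented by four tokens. Because later stages only refine the error of earlier ones, the first code carries most of the information.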

Demo

Bring this project to life

Run on Gradient

To run the demo, click the link above or at the top of this page. This will open a new Gradient Notebook with everything needed within. All of this code can be found in the official repo for the AudioCraft MusicGen project.

!pip install -r requirements.txt
!pip install -e .

Once our Notebook has spun up, the first thing we want to do is install the requirements and the MusicGen package itself. To do so, run the first code cell in the notebook.

from audiocraft.models import MusicGen

# Using the melody model here; `small`, `medium`, and `large` are also available.
model = MusicGen.get_pretrained('melody')

Once that has completed, scroll down to the next code cell. This is what we will use to load the model into our cache for use in this session. Cached models won't count against our storage capacity, so feel free to try each of the available model sizes. It's doubtful that the large model will run quickly on the free GPU, however, so let's use the melody model for now.

model.set_generation_params(
    use_sampling=True,
    top_k=250,
    duration=5
)

In the next code cell, we can set our generation parameters. These will be the settings used throughout the notebook, unless overwritten later on. The only one we may want to alter is the duration. The 5 second default is barely a song snippet, and it can extend all the way to 30 seconds.
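For readers unfamiliar with the use_sampling and top_k settings, the following is a generic sketch of top-k sampling in plain Python. It is an illustration of the general technique, not AudioCraft's internal code, and top_k_sample is a hypothetical helper name. The idea is that at each step only the k highest-scoring candidate tokens are kept, and one of them is drawn at random in proportion to its softmax weight.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one index from the k highest-scoring entries of logits."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the survivors (subtract the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.1, 2.0, -1.0, 1.5, 0.3, 3.0]  # toy scores for 6 candidate tokens
samples = [top_k_sample(logits, k=3, rng=rng) for _ in range(200)]
print(sorted(set(samples)))  # only indices among the 3 best candidates appear
```

A larger k lets more unlikely tokens through and tends to give more varied output, while k=1 reduces to always picking the single most likely token.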

Unconditional generation

from audiocraft.utils.notebook import display_audio

output = model.generate_unconditional(num_samples=2, progress=True)
display_audio(output, sample_rate=32000)

The demo starts with a quick example of unconditional generation - synthesis without any control parameters or prompting. This will generate a tensor that we can then pass to the provided display_audio function to play within our notebook.

[Audio: Upload1, 10 seconds]

Here is the sample we got when we ran the melody model for a ten second duration. While a tad rambling, it still maintains a relatively coherent beat and consistency of instrumentation. While there is no miraculous generation of a true melody, the quality speaks for itself.

Music Continuation

One of the most interesting capabilities of this model is its ability to pick up on a song from a short snippet and mimic the song's instrumentation, continuing from a set point in the song. This allows for some creative methods to remix and alter existing tracks, and might serve as an excellent inspirational tool for artists.

import math
import torchaudio
import torch
from audiocraft.utils.notebook import display_audio

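def get_bip_bip(bip_duration=0.125, frequency=440,
                duration=0.5, sample_rate=32000, device="cuda"):
    """Generates a series of bip bip at the given frequency."""
    # Time axis in seconds, created on the requested device.
    t = torch.arange(
        int(duration * sample_rate), device=device, dtype=torch.float) / sample_rate
    # Pure tone at `frequency` Hz, with a leading channel dimension.
    wav = torch.cos(2 * math.pi * frequency * t)[None]
    # Gate the tone: silent for the first half of each 2 * bip_duration period.
    tp = (t % (2 * bip_duration)) / (2 * bip_duration)
    envelope = (tp >= 0.5).float()
    return wav * envelope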

Before running continuation, we first must define the get_bip_bip helper function. It produces a short synthetic beeping signal that we can use as the audio prompt when we have no real recording to start from.
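To see what this synthetic prompt looks like without a GPU, the gating logic of get_bip_bip can be reproduced in plain Python. This is only a sketch of the on/off envelope, using the same default values as the helper; bip_envelope is a hypothetical name, and the cosine tone itself is left out.

```python
def bip_envelope(bip_duration=0.125, duration=0.5, sample_rate=32000):
    """Return the 0/1 gate applied to the tone, one value per sample."""
    period = 2 * bip_duration
    envelope = []
    for n in range(int(duration * sample_rate)):
        t = n / sample_rate             # time of sample n in seconds
        phase = (t % period) / period   # position inside the current period
        envelope.append(1.0 if phase >= 0.5 else 0.0)
    return envelope


env = bip_envelope()
on_fraction = sum(env) / len(env)
print(len(env), "samples,", on_fraction, "of them audible")
```

With the defaults, the half-second prompt is silent for the first 125 ms, beeps for the next 125 ms, and then repeats that pattern once more.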

res = model.generate_continuation(
    get_bip_bip(0.125).expand(2, -1, -1), 
    32000, ['Jazz jazz and only jazz', 
            'Heartful EDM with beautiful synths and chords'], 
    progress=True)
display_audio(res, 32000)

We can then use that to generate the artificial signal to prompt the model at the start of synthesis. Here is an example made using two prompts, with each resulting in a generated song snippet.

[Audio: jazz sample, 10 seconds]

[Audio: EDM sample, 10 seconds]

Above are the examples we made in our run.

# You can also use any audio from a file. Make sure to trim the file if it is too long!
prompt_waveform, prompt_sr = torchaudio.load("./assets/bach.mp3") ## <-- Path here
prompt_duration = 2
prompt_waveform = prompt_waveform[..., :int(prompt_duration * prompt_sr)]
output = model.generate_continuation(prompt_waveform, prompt_sample_rate=prompt_sr, progress=True)
display_audio(output, sample_rate=32000)

Finally, let's try this with our own music, rather than a synthetic signal. Let's use the code above, but replace the path to the song with one in our own library. For our example, we used a snippet from Ratatat's Shempi for a 30 second sample. Check it out below:

[Audio: Shempi remix (starts at 20 seconds), 30 seconds]
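When using our own track, it can help to pick the prompt from somewhere other than the very start of the file. The sketch below shows the index arithmetic for that in plain Python. The helper prompt_slice is hypothetical (it is not part of AudioCraft), and the 44,100 Hz sample rate and 60 second length are assumed values for illustration. In practice, use the sample rate returned by torchaudio.load.

```python
def prompt_slice(start_s, duration_s, sample_rate, total_samples):
    """Convert a start time and length in seconds into sample indices.

    Both indices are clamped so the slice never runs past the end of the file.
    """
    start = min(int(start_s * sample_rate), total_samples)
    end = min(start + int(duration_s * sample_rate), total_samples)
    return start, end


# Assumed example: a 60 second file sampled at 44,100 Hz,
# with a 2 second prompt taken from the 20 second mark.
sample_rate = 44100
total_samples = sample_rate * 60
start, end = prompt_slice(20, 2, sample_rate, total_samples)
print(start, end)
```

The resulting indices can then replace the fixed trim in the cell above, for example prompt_waveform[..., start:end], since the last dimension of the loaded waveform is time.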

Text-conditional generation

Now, rather than using a song as the initial input, let's try using text. This gives us a greater degree of control over the song's genre, instruments, tempo, and so on, using simple text. Use the code below to run text-conditional generation with MusicGen.

from audiocraft.utils.notebook import display_audio

output = model.generate(
    descriptions=[
        '80s pop track with bassy drums and synth',
        '90s rock song with loud guitars and heavy drums',
    ],
    progress=True
)
display_audio(output, sample_rate=32000)

Here are the examples we received from our run:

[Audio: Pop, 10 seconds]

[Audio: Rock, 10 seconds]

Melody Conditional Generation

Now, let's combine everything together in a single run. This way, we can augment our existing song with a text controlled extension of the original track. Run the code below to get a sample using the provided Bach track, or substitute your own.

import torchaudio
from audiocraft.utils.notebook import display_audio

model = MusicGen.get_pretrained('melody')
model.set_generation_params(duration=30)

melody_waveform, sr = torchaudio.load("/notebooks/assets/bach.mp3")
melody_waveform = melody_waveform.unsqueeze(0)
output = model.generate_with_chroma(
    descriptions=[
        'Ratatat song',
    ],
    melody_wavs=melody_waveform,
    melody_sample_rate=sr,
    progress=True
)
display_audio(output, sample_rate=32000)

Listen to the original and modified samples below:

[Audio: Bach - original, 10 seconds]

[Audio: Bach - pop]

[Audio: Bach - rock]

Closing thoughts

MusicGen is a really incredible model. It represents a substantive step forward for music synthesis, in the same way Stable Diffusion and GPT did for image and text synthesis. Look out in the coming months for this model to be iterated on heavily as open source developers seek to augment and capitalize on the massive success achieved here by Meta AI research.
