Text guided video synthesis with background music and speech in Gradient

Check out our guide on creating a full video synthesis pipeline with audio and speech using VideoFusion, YourTTS, Riffusion, and MoviePy.

a year ago   •   9 min read

By James Skelton



We've extensively covered text2image synthesis frameworks over the last year and a half, ever since models like the PixRay project first hit the public scene. Since then, we have covered the major advancements in text2image synthesis, such as the advent of diffusion models, the application of guidance techniques to improve them, and comprehensive software to host and serve these models in deployed web pages.

There have already been a number of efforts to transfer the newfound capability of the popular diffusion models to video synthesis. Most of these approaches have been image2image based: each frame from a video is used as the starting image for the diffusion model, which then generates a new image guided by some prompt. These approaches have a lot of potential, but they still require work on the user's part to either film or source a video containing the features they want to transfer to the new image sequence.

One of the first tangible steps forward in the quest for complete text2video generation pipelines has come from Modelscope, with the release of their text2video pipeline. This fantastic model is capable of creating short-form videos (around 2-5 seconds before it starts to struggle) containing just about anything we can imagine!

In today's tutorial, we are going to show how to create a full video pipeline: generating the video with Modelscope, generating speech from text in any voice with YourTTS, and generating thematic background music to tie it all together with Riffusion. Combined, these give us one of the first unified ML pipelines for creating video with audio. Let's walk through each of the code components for the pipeline and discuss what each does. Afterwards, we will combine everything in a Gradio application so that we can query the models from a simple environment.

Pipeline sub-components

Before we dive into the code, let's examine each of the component models in the pipeline, so that we can have a better understanding of how they fit together in the application.

Modelscope VideoFusion


The most important and novel of all the model sub-components is the text2video model from Modelscope, VideoFusion. Their work introduces a decomposed diffusion process, where the per-frame noise is resolved into a base noise that is shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly-learned networks to match the noise decomposition accordingly. (Source)


Here we can see how the model creates the decomposed diffusion process of a video clip. As we can see, the base noise b^t is shared across different frames, with the residual values added on top. At each sampling step, the model first removes the base noise with the base generator. It then estimates the remaining residual noise via the residual generator. τᵢ denotes the DDIM sampling steps. μ denotes the mean-value predicted function of DDIM in the ε-prediction formulation. (Source)

By keeping b^t fixed across frames, they are able to conserve the image features while introducing novel displacements for those features in each frame. This variation is what produces the appearance of motion in the video, and it allows for much longer sequences of generated images.
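
To make the decomposition more concrete, here is a minimal NumPy sketch of how a per-frame noise tensor could be built from one shared base noise and a per-frame residual. This is only an illustration of the idea, not VideoFusion's actual implementation: the function name `decomposed_noise`, the single scalar mixing weight `lam`, and the tensor shapes are all assumptions we made for the example.

```python
import numpy as np

def decomposed_noise(num_frames, frame_shape, lam=0.5, seed=0):
    """Build per-frame noise from a shared base noise and per-frame residuals.

    lam is a hypothetical weight in [0, 1]: the share of each frame's noise
    variance that comes from the shared base noise.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(frame_shape)                      # shared by every frame
    residual = rng.standard_normal((num_frames, *frame_shape))   # varies along the time axis
    # The square-root weights keep each frame's noise at unit variance.
    noise = np.sqrt(lam) * base + np.sqrt(1.0 - lam) * residual
    return base, residual, noise

base, residual, noise = decomposed_noise(num_frames=8, frame_shape=(4, 64, 64), lam=0.8)

# Neighbouring frames are correlated because they share the same base noise.
corr = np.corrcoef(noise[0].ravel(), noise[1].ravel())[0, 1]
print(noise.shape, round(float(corr), 2))
```

A larger `lam` ties the frames more tightly together, while a smaller one lets each frame drift further from its neighbours.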

Put together, we get a complex diffusion-based model that is able to attend to previous and future generated frames as it creates more of them, producing a sequence that mimics the subject(s) moving.


YourTTS

YourTTS is a capable Text-To-Speech (TTS) model that is highly robust at synthesizing voices in multiple languages. Compared to other models, YourTTS significantly reduces data requirements by transferring knowledge among the languages in the training set. In practice, this makes the model highly capable at numerous tasks, such as multilingual TTS, multi-speaker TTS, zero-shot learning, speaker/language adaptation, cross-language voice transfer, and zero-shot voice conversion.


For today's demo, we are interested in the ability to learn and synthesize speech in the sample speaker's voice from a relatively small amount of data and without retraining the model, a type of zero-shot learning. We're going to use this capability in the application to generate novel speech in the voice of our desired subject, pulled from a YouTube video.

YourTTS will allow us to mimic the voice of any speaker on YouTube for our generated videos.


Riffusion

Riffusion is an amazing project that fine-tuned Stable Diffusion v1-5 on millions of spectrogram image-text pairs. The model generates spectrograms that closely follow the text prompt, and these can then be converted into audio files by inverting the Short-Time Fourier Transform (STFT). This allows for high-quality, prompt-accurate music generation via the Stable Diffusion pipeline.

electric guitar uplifting superhero theme

Here is a sample spectrogram and audio clip we made from the prompt "electric guitar uplifting superhero theme." As we can hear, this does a decent job of following the instructions. There are clear electric guitar sounds, and the tone is generally ethereal, as a superhero's theme might be.

Riffusion will allow us to generate thematically accurate background music for our generated videos.
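
For readers new to spectrograms, the following NumPy sketch shows what one encodes: we slice a signal into overlapping windows and take the magnitude of each window's Fourier transform. This is not Riffusion's code; the helper name `magnitude_spectrogram`, the frame size, the hop length, and the test tone are all assumptions for illustration. Going the other way, from image to audio, additionally requires estimating the phase, which this sketch leaves out.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_size=1024, hop=256):
    """Return an array of shape (frequency_bins, time_frames)."""
    window = np.hanning(frame_size)
    starts = range(0, len(signal) - frame_size + 1, hop)
    frames = np.stack([signal[s:s + frame_size] * window for s in starts])
    # rfft keeps only the non-negative frequencies of each real-valued frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate   # one second of audio
tone = np.sin(2 * np.pi * 440 * t)         # a 440 Hz sine wave

spec = magnitude_spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 1024    # convert the bin index back to Hz
print(spec.shape, peak_hz)
```

The brightest row of the resulting image sits at the frequency bin closest to the tone's pitch, which is the kind of structure the diffusion model learns to draw.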


To run this demo, we are going to need to log into Paperspace Gradient. This should be able to run on the Free GPU Notebooks, but we recommend at least 16 GB of VRAM to run this pipeline in any reasonable amount of time.

Click the link below to open the Pipeline in Gradient:

Bring this project to life


We've collated all the setup into a single bash script, setup.sh. Execute the first cell of the notebook.ipynb file to run the installation script. We will then restart the notebook kernel, so that all the installs are actively accessible for the pipeline. Setup will take around 5 minutes to complete.

Note: Running this requires a huggingface.co account. Follow the instructions that print in the cell to get the access token.
!huggingface-cli login
!bash setup.sh
import os

Generate Video

The next cell contains everything we need to generate the video with Modelscope! Running this cell will download the required model weights into the designated directory. These will then be used to start up the 'text-to-video-synthesis' pipeline, using the input text prompt as the guidance for the images within. The video is then written to video.mp4 in the outs/ directory.

Note: You can adjust the length of the video by changing the numeric value on line 16 of the configuration.json file in the directory /modelscope-damo-text-to-video-synthesis. We recommend a value of 42. Otherwise, the video will likely be shorter than the audio.
from huggingface_hub import snapshot_download
import os
import pathlib
import subprocess

from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys
import torch

model_dir = pathlib.Path('/notebooks/modelscope-damo-text-to-video-synthesis')

# Download the weights once, then copy in our edited configuration.json
if not os.path.exists('modelscope-damo-text-to-video-synthesis'):
    snapshot_download('damo-vilab/modelscope-damo-text-to-video-synthesis', repo_type='model', local_dir=model_dir)
    subprocess.run(['cp', 'configuration.json', 'modelscope-damo-text-to-video-synthesis/configuration.json'])

# Build the text-to-video pipeline from the local weights
pipe = pipeline('text-to-video-synthesis', model_dir.as_posix(), output_video='outs/video.mp4')
test_text = {
    'text': 'Alice in Wonderland animated disney princess dancing',
    'output_video_path': 'outs/video.mp4'
}
# Generate the clip and report where it was written
output_video_path = pipe(test_text, output_video='outs/video.mp4')[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)
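
As a quick sanity check on the video length note above, this small sketch works out how many frames are needed to cover a given audio duration. The frame rate of 8 frames per second is an assumption on our part, so check the frame rate of your own output video; the helper names are hypothetical.

```python
import math

ASSUMED_FPS = 8  # assumption: verify against the frame rate of your generated video

def frames_needed(audio_seconds, fps=ASSUMED_FPS):
    """Smallest frame count whose video is at least as long as the audio."""
    return math.ceil(audio_seconds * fps)

def clip_seconds(num_frames, fps=ASSUMED_FPS):
    """Length in seconds of a video with the given number of frames."""
    return num_frames / fps

print(frames_needed(5))   # frames required to cover a 5 second music clip
print(clip_seconds(42))   # length of a 42-frame video at the assumed frame rate
```

Under that assumption, the recommended value of 42 frames leaves a small margin over the 5 second music clip generated later in this tutorial.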

After this cell has completed running, we will be left with our initial video sample. Here is an example we made using the same prompt as above:


Generate Speech

!yt-dlp --extract-audio --audio-format wav https://www.youtube.com/watch?v=vst5C63iNh8 --output TTS/audio_samps/mspiggy.wav

First, we need to pull an audio sample from YouTube. For this example, let's grab a video of Miss Piggy. Be wary of videos with multiple speakers, as they may confuse the model. We will then use that audio clip as the sample to base the generated speech on.
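
Before handing the clip to the TTS model, it can help to check what was actually downloaded. The sketch below uses Python's built-in `wave` module to report the channel count, sample rate, and duration of a PCM WAV file. The helper names are hypothetical, and so that the example runs anywhere we inspect a short silent file that we write ourselves rather than the real sample.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Report basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "seconds": wav_file.getnframes() / rate,
        }

def write_silence(path, seconds=1.0, sample_rate=16000):
    """Write a mono, 16-bit silent WAV file (used only for this demo)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))

demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silence(demo_path, seconds=2.0)
info = describe_wav(demo_path)
print(info)
```

To inspect the real sample, point `describe_wav` at the file saved by yt-dlp instead of the demo file.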

from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False, gpu=True)
tts.tts_to_file('Oh what a beautiful day!', speaker_wav="/notebooks/TTS/audio_samps/mspiggy.wav", language="en", file_path="outs/speech.wav")

To generate speech, we are using YourTTS from CoquiAI. This generation pipeline is simple compared to the others, and can be done in as little as three lines. We first instantiate the TTS model, and then call the .tts_to_file() method to generate the sound bite relatively quickly. Below is an example we made using Miss Piggy's voice:


Generate Audio

Generating the background music is a bit more complicated in terms of code implementation. Because Riffusion is based on Stable Diffusion, we can run it with the diffusers StableDiffusionPipeline.

import torch

from PIL import Image
import numpy as np
from spectro import wav_bytes_from_spectrogram_image

from diffusers import StableDiffusionPipeline
from diffusers import StableDiffusionImg2ImgPipeline
import gradio as gr

device = "cuda"
MODEL_ID = "riffusion/riffusion-model-v1"
pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
pipe = pipe.to(device)

def predict(prompt, negative_prompt, audio_input, duration):
    # audio_input is unused here; only the text-to-spectrogram path is run
    return classic(prompt, negative_prompt, duration)

def classic(prompt, negative_prompt, duration):
    # 5 seconds maps to a 512px-wide spectrogram; each extra second adds 128px
    if duration == 5:
        width_duration = 512
    else:
        width_duration = 512 + ((int(duration)-5) * 128)
    spec = pipe(prompt, negative_prompt=negative_prompt, height=512, width=width_duration).images[0]
    wav = wav_bytes_from_spectrogram_image(spec)
    with open("outs/music.wav", "wb") as f:
        # Assumes the helper returns (BytesIO of WAV data, duration), as in the Riffusion reference code
        f.write(wav[0].getbuffer())
    return spec, 'outs/music.wav', gr.update(visible=True), gr.update(visible=True), gr.update(visible=True)

prompt_input = 'sesame street theme song'
negative_prompt = ''
audio_input = None
duration_input = 5
spectrogram_output, sound_output, share_button, community_icon, loading_icon = predict(prompt_input, negative_prompt, audio_input, duration_input)

In summary, the StableDiffusionPipeline creates a spectrogram image that corresponds to the text prompt we input, thanks to the training of Riffusion on spectrogram-text pairs. This spectrogram is then converted into a .wav file that we can overlay on top of our video as "background" music.
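
The mapping from requested duration to spectrogram width in the code above can be pulled out into a small standalone function, which makes it easier to see which widths different durations produce. The function name and keyword arguments here are our own; the arithmetic mirrors the `classic` function, and the divisibility check reflects Stable Diffusion pipelines expecting image sizes that are multiples of 8.

```python
def spectrogram_width(duration_seconds, base_width=512, base_seconds=5,
                      pixels_per_extra_second=128):
    """Width in pixels of the spectrogram image for a requested duration."""
    extra_seconds = int(duration_seconds) - base_seconds
    width = base_width + extra_seconds * pixels_per_extra_second
    if width <= 0 or width % 8 != 0:
        raise ValueError(f"unsupported spectrogram width: {width}")
    return width

for seconds in (5, 8, 10):
    print(seconds, spectrogram_width(seconds))
```

Wider images take more VRAM and time to generate, so longer clips are best tried gradually.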

Put it all together

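from moviepy.editor import *
# load the video
video_clip = VideoFileClip('outs/video.mp4')
# load the audio
music_clip = AudioFileClip('outs/music.wav')
speech_clip = AudioFileClip('outs/speech.wav')

# overlay the speech on the music and attach the mix to the video
new_audioclip = CompositeAudioClip([music_clip, speech_clip])
video_clip.audio = new_audioclip
# write the result to disk (the output filename here is an assumption)
video_clip.write_videofile('outs/final_video.mp4')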

We can now finish the pipeline by using MoviePy to composite all of the different files into a single video file with background music and generated speech. Here is the example we made using the code above:
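
Conceptually, overlaying two audio tracks comes down to adding their samples together. The pure-Python sketch below illustrates that idea with a gain on each track, so the music can sit underneath the speech, and with clipping to keep the result in range. It is not MoviePy's implementation; the function name, the gain values, and the use of plain lists of floats in [-1, 1] are assumptions for the example.

```python
def mix_tracks(music, speech, music_gain=0.3, speech_gain=1.0):
    """Sum two tracks sample by sample, padding the shorter one with silence."""
    length = max(len(music), len(speech))
    mixed = []
    for i in range(length):
        m = music[i] * music_gain if i < len(music) else 0.0
        s = speech[i] * speech_gain if i < len(speech) else 0.0
        # Clip to the valid sample range to avoid overflow when both are loud.
        mixed.append(max(-1.0, min(1.0, m + s)))
    return mixed

music = [1.0, 1.0, 1.0]   # a loud, constant music signal
speech = [0.5]            # speech that ends before the music does
mixed = mix_tracks(music, speech)
print(mixed)
```

If the music drowns out the generated speech in the final video, lowering the music volume before compositing is the first thing to try.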


While there is clearly a lot of work that may go into scaling up the quality, intricacy, and size of the images output by the text-to-video pipeline on display in this blog post, this still represents an amazing step toward creating completely text-guided video synthesis models.

Future projects may consider integrating guidance techniques like ControlNet or Self Attention Guidance to effectively improve the outputs without any additional training.


The final cell of the notebook file runs everything we saw above in a Gradio application. This application can be accessed from any web platform or browser, and it can be queried through its FastAPI integration if we don't want to use a proper browser (for example, from a chatbot request).

Execute the code below to spin up the sample app:

python app.py

You can then click the shared link to open the application in your browser.

This will present you with a simple interface to interact with each part of the pipeline, and view all of the results in a single place. This makes it easy to iterate and make changes on each of the three processes without affecting the others or the final video output.
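
One way to get this kind of modularity is to remember the latest output of each stage, so that any single stage can be re-run on its own. The sketch below illustrates that pattern; it is not the contents of app.py. The `PipelineState` class is hypothetical, and the stage functions are stand-ins that only return output paths, where real ones would call the models from the earlier sections.

```python
class PipelineState:
    """Track the latest output of each stage so stages can be re-run independently."""

    def __init__(self, stages):
        self.stages = dict(stages)   # stage name -> callable(prompt) -> output path
        self.outputs = {}

    def run_stage(self, name, prompt):
        # Re-running one stage replaces only that stage's output.
        self.outputs[name] = self.stages[name](prompt)
        return self.outputs[name]

    def ready_to_combine(self):
        return all(name in self.outputs for name in self.stages)

# Stand-in stage functions (hypothetical): real ones would run the models.
state = PipelineState({
    "video": lambda prompt: "outs/video.mp4",
    "speech": lambda text: "outs/speech.wav",
    "music": lambda prompt: "outs/music.wav",
})

state.run_stage("video", "Alice in Wonderland animated disney princess dancing")
state.run_stage("speech", "Oh what a beautiful day!")
ready_before_music = state.ready_to_combine()
state.run_stage("music", "sesame street theme song")
print(ready_before_music, state.ready_to_combine())
```

With this layout, trying a new music prompt only regenerates the music file before the final video is composited again.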

Closing thoughts

This was a very fun project to put together, and it represents a palpable, exciting step forward for hobbyists and non-FAANG ML engineers to start working on and creating increasingly high resolution text to image pipelines. Looking forward, we can't wait to see Stable Diffusion and Imagen's text2video pipelines when they eventually reach fruition.
