How to quickly clone your voice with TorToiSe Text-To-Speech

In this tutorial, we show how to clone voices with TorToiSe TTS and discuss the steps needed to get a good-quality clone.

a year ago • 7 min read

By James Skelton

Bring this project to life

Run on Paperspace

One of the coolest possibilities offered by AI and Deep Learning technologies is the ability to replicate things from the real world. Whether it is generating realistic images from scratch, the right response to an incoming chat request, or appropriate music for a given theme, we can rely on AI to deliver awesome approximations of things that were previously only possible when guided directly by a human hand.

Voice cloning is one of the interesting possibilities offered by this novel tech. It is the practice of mimicking a speaker's voice by using a deep learning model to recreate their specific intonation, accent, and pitch. When combined with technologies like Generative Pretrained Transformers and static image manipulators, like SadTalker, we can start to make some really interesting approximations of real-life human behavior, albeit from behind a screen and speaker.

In this short article, we will walk through each of the steps required to clone your own voice, and then generate accurate impersonations of yourself using Tortoise TTS in Paperspace. We can then take these clips and combine them with other projects to create some really interesting outcomes with AI.

Tortoise TTS

Released by solo author James Betker, Tortoise is, in our view, among the best and easiest-to-use voice cloning models that can run on local and cloud machines without requiring any sort of API or service payment. It makes it easy to clone a voice from just a few voice clips: three to five recordings of about 10 seconds each.

Tortoise takes both its design and its inspiration from image generation with autoregressive transformers and Denoising Diffusion Probabilistic Models (DDPMs). The author sought to reproduce the success of those approaches in speech generation. Such models learn to generate images through a step-wise probabilistic procedure that, given enough training time and data, comes to approximate the distribution of the training images.

TorToiSe is trained on mel spectrograms, which are image-like visualizations of speech audio. These representations can be modeled with much the same process used in typical DDPM settings, with only slight modifications to account for voice data. In addition, the model can mimic an existing voice by using reference clips of that voice as a conditioning input during generation.
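
To make the idea of a mel spectrogram a little more concrete, here is a minimal sketch of the mel scale itself, using the commonly cited conversion formula m = 2595 * log10(1 + f / 700). It only shows how frequency bands are spaced before a spectrogram is binned into them. The band count and frequency range are illustrative assumptions, not the settings Tortoise actually uses, and the function names are our own.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (commonly cited HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> list:
    """Centre frequencies (Hz) of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


# Illustrative values only: 8 bands between 0 Hz and 8 kHz.
centers = mel_band_centers(n_bands=8, f_min=0.0, f_max=8000.0)
for center in centers:
    print(f"{center:8.1f} Hz")
```

The printed centres sit close together at low frequencies and spread out at high ones. This mirrors how human hearing resolves pitch, and it is why mel-scaled spectrograms are a common input for speech models.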

Together, these pieces make it possible to recreate a voice accurately from very little input audio.

Demo

For the demo, we are going to use the IPython Notebook provided in the original TorToiSe TTS repo. To spin this up in a Paperspace Notebook on a Free GPU, all we need to do is use the link above! Once we are in the Notebook space, just click run to get started, and open up the tortoise_tts.ipynb notebook.

Voice Sample Selection

In addition to the author's own suggestions for selecting voice samples, we have a few of our own for making things easier:

If you do not have a proper microphone stand, we suggest using a mobile phone rather than a computer. The phone microphone will likely have much better noise reduction.
A good place to record will have no echoes. We tried to use samples of 'Bane' from "The Dark Knight Rises" for this demo, but his voice was too full of echo from the inside of his mask. We recommend a closet full of clothing, which will dampen any extra sound.
Write out a script for your recordings. This will help you avoid any stuttering, "uh" or "um" sounds, or minor flubs.
If possible, try to cover the widest possible variety of phonemes (the distinct sounds in a language). Phrases written to do this are called phonetic pangrams, and they help the model hear all the different potential sounds in your speech. An example would be "That quick beige fox jumped in the air over each thin dog. Look out, I shout, for he's foiled you again, creating chaos."
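
Before moving on, it can help to sanity-check the recordings against the "three to five clips of about 10 seconds" guidance from earlier. The sketch below uses only Python's standard `wave` module and assumes the clips are saved as PCM WAV files in a single folder. The tolerance value and the function names are our own illustrative choices, not part of Tortoise.

```python
import wave
from pathlib import Path

# Targets taken from the "3-5 clips of about 10 seconds" guidance above.
MIN_CLIPS, MAX_CLIPS = 3, 5
TARGET_SECONDS = 10.0
TOLERANCE_SECONDS = 3.0  # hypothetical slack; adjust to taste


def clip_duration_seconds(path) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_voice_folder(folder) -> list:
    """Return human-readable warnings about the WAV clips in `folder`."""
    clips = sorted(Path(folder).glob("*.wav"))
    warnings = []
    if not MIN_CLIPS <= len(clips) <= MAX_CLIPS:
        warnings.append(
            f"found {len(clips)} clips, expected {MIN_CLIPS}-{MAX_CLIPS}"
        )
    for clip in clips:
        seconds = clip_duration_seconds(clip)
        if abs(seconds - TARGET_SECONDS) > TOLERANCE_SECONDS:
            warnings.append(
                f"{clip.name}: {seconds:.1f}s is far from "
                f"{TARGET_SECONDS:.0f}s"
            )
    return warnings
```

Calling `check_voice_folder` on the folder holding your recordings returns an empty list when everything is within range. Otherwise each entry names the clip to re-record or trim.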

If you follow both our suggestions and the author's original ones, your clone should go off without a hitch. Here are the recordings we used for this demonstration:

Pangram 1