DeepFaceLab: Introduction to Deepfakes with Paperspace

DeepFaceLab is the most popular deepfaking software on the internet. We invited one of the contributors to teach us how to get started making deepfakes with a powerful cloud GPU machine from Paperspace.

By Nikolay Chervoniy

Nikolay Chervoniy is a contributor to the most popular deepfaking library on the internet, DeepFaceLab, and the author of the popular DeepFaceLab fork for IPython Notebooks called DFL-Colab.

Deepfakes have become a significant opportunity for all kinds of creative content, especially in film and TV where deepfake artists are pushing the boundary on what is possible in visual effects.

The last few years have seen a massive amount of content using deepfake technology, from faceswap videos made by fans to bigger projects that are gaining the attention of Hollywood studios and filmmakers.

Today we're going to learn to generate deepfakes using DeepFaceLab and cloud computing from Paperspace Core.

Install and setup

Before we begin, we need to decide which GPU to use. The more powerful the better. The recommended minimum performance for DeepFaceLab is 10-15 TFLOPS in FP32 precision.

According to the Ultimate Guide to Cloud GPU Providers created by Paperspace, there are currently 9 different GPUs available on Paperspace with more than 10 TFLOPS in FP32 (single precision), including popular new options like the RTX A5000, RTX A6000, and A100.

GPUs with more than 10 TFLOPs FP32 suitable for running DeepFaceLab
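
If you want a quick sanity check on whether a card clears that bar, peak FP32 throughput can be approximated as 2 FLOPs (one fused multiply-add) per CUDA core per clock. Here is a minimal sketch of that rule of thumb; the core count and boost clock below are approximate published specs, so check the vendor datasheet for your exact card.

```python
# Rough rule of thumb: peak FP32 TFLOPS ~= CUDA cores * 2 FLOPs (FMA) * boost clock (GHz) / 1000.
def fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    return cuda_cores * 2 * boost_clock_ghz / 1000.0

# Approximate RTX A4000 specs (~6144 cores, ~1.56 GHz boost) -> roughly 19 TFLOPS,
# comfortably above the 10-15 TFLOPS recommended for DeepFaceLab.
print(f"RTX A4000: ~{fp32_tflops(6144, 1.56):.1f} TFLOPS")
```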

Since DeepFaceLab is designed to be used directly on Windows, we will need to use a Windows machine with a remote desktop.

For a primer on how to create a new machine on Paperspace, check out this article in the docs:

Create Compute Machine | Paperspace

In short, you'll need to sign up for Paperspace and create a new GPU machine, preferably something toward the top of the list above, like an RTX A4000, RTX A5000, RTX A6000, A100, or V100.

Be sure to pick a data center region that is closest to you for maximum streaming performance since we will be using a Windows VM. Also keep in mind that we'll need quite a lot of disk space. There are two reasons for this. First, we will need to increase our Windows swap file to at least 100 GB in order to run correctly. Second, we will need storage space for our data. Longer videos will require larger amounts of storage.
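
To get a feel for how much space the project data can take, here is a back-of-the-envelope estimate for the frames we'll extract later. The per-frame sizes are rough assumptions for 1080p footage (roughly 6 MB per PNG and 0.6 MB per JPG), not measured values.

```python
# Rough storage estimate for extracted frames: frames = minutes * 60 * FPS,
# multiplied by an assumed average size per frame.
def frame_storage_gb(duration_min: float, fps: float, mb_per_frame: float) -> float:
    return duration_min * 60 * fps * mb_per_frame / 1024

print(f"10 min of SRC video at 20 FPS as PNG: ~{frame_storage_gb(10, 20, 6.0):.0f} GB")
print(f"10 min of SRC video at 20 FPS as JPG: ~{frame_storage_gb(10, 20, 0.6):.0f} GB")
```

Between extracted frames, aligned faces, model checkpoints, and the enlarged swap file, a few hundred gigabytes of disk can disappear quickly.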

Here is a short video of the installation and testing process on a new Paperspace Core machine.

Data: Collection and processing

First of all, we must understand that the basis of any deepfake is two sets of data: the source (SRC) data and the destination (DST) data.

The SRC dataset is the dataset of the face or person we want to transfer from. DST, meanwhile, is the destination or target video onto which we'll be applying the source face.

It is no exaggeration to say that 50% of a deepfake project's success is a result of creating a quality SRC dataset. The main idea is to collect as much high-quality and diverse SRC face video material as possible. In a perfect world, we would use exclusively high-quality video sources like professional films or interviews.

The SRC dataset should contain representations of faces from different angles and lighting. With more information of this kind, the neural network can better generalize the SRC face and, as a result, improve the quality of face swapping and light transfer.

Now let's assume that we've assembled videos with the required SRC faces and merged them into a single mp4 file. DeepFaceLab provides two example files, data_src and data_dst, on which we can try extraction ourselves to see how it works, but we should keep in mind that these are just example files.

So, what are the next steps?

Step 1: Extract frames from SRC and DST videos.

To extract images from our source video, we'll run the corresponding bat file 2) extract images from video data_src.

In the options, we should specify the FPS at which we want to extract frames as well as the output format (PNG or JPG). Note that this FPS setting defines the rate at which frames are extracted; it has nothing to do with the original frame rate of the video. As a rule of thumb, we don't need to extract every frame from a video, because doing so would produce too much redundancy from frame to frame.

As a guideline, if we have a 30 FPS video, it would be optimal to extract frames at 20 FPS.

Although we can choose any frame export format that we like, I recommend always using PNG because it's lossless. Keep in mind, however, that PNG takes up about 10x more space than JPG.

For the DST video we will do the same, running the bat file 3) extract images from video data_dst FULL FPS. The only difference is that all frames will be extracted from the DST video, because we want to replace the face in every frame.
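
Under the hood, these bat files drive ffmpeg. If you prefer to see (or script) the equivalent step yourself, here is a rough standalone sketch; the paths assume the default workspace layout, and extract_frames is just an illustrative helper, not part of DeepFaceLab.

```python
import subprocess

def extract_frames(video_path, out_dir, fps=None, fmt="png"):
    """Extract frames with ffmpeg. Pass fps=None to keep every frame (as for DST)."""
    vf = ["-vf", f"fps={fps}"] if fps else []
    subprocess.run(["ffmpeg", "-i", video_path, *vf, f"{out_dir}/%05d.{fmt}"], check=True)

extract_frames("workspace/data_src.mp4", "workspace/data_src", fps=20)  # SRC at a reduced FPS
extract_frames("workspace/data_dst.mp4", "workspace/data_dst")          # DST at full frame rate
```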

Step 2: Extract faces from SRC and DST frames

During this step, we will detect faces on the extracted frames, detect face landmarks (key points), and use them to align faces. To extract faces from SRC frames we will run 4) data_src faceset extract, and for DST we will run 5) data_dst faceset extract.

The extraction settings are simple. Let's take a look at what they mean below:

  • Face type - "Full face," "Whole face," or "Head." This setting defines what area of the face we are going to capture during extraction. "Full face" covers the face up until just above the eyebrows. "Whole face" covers the full area of the face including the forehead. "Head" covers the full head. "Whole face" is the most universal option and will be best for most cases.
  • Max number of faces from image - This is simply a limit on the number of faces that can be extracted from a frame. Faces are extracted in order of their size in the frame. For example, if a frame contains 5 faces and you set a limit of 2, the 2 largest faces will be extracted. It is recommended to limit the number of faces to 3.
  • Image size - The resolution of the image with the aligned face. Note that higher values require higher quality source frames. The recommended value is 512.
  • Jpeg quality - This defines the level of JPEG compression. The higher the value, the less compression. At 100, compression is minimal and the result is essentially equivalent to PNG, but near-lossless files take a lot of space, so I recommend setting it to 95.
  • Write debug images to aligned_debug - When this option is checked, frames annotated with the extracted face landmarks are saved. This option is intended mainly for DST frames: if a face is extracted incorrectly, we will be able to re-extract it manually.
Visualization of the face extraction process
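
To make the "align" part of this step less abstract, here is a toy version of landmark-based alignment using a similarity transform. It is not DeepFaceLab's actual aligner (which uses a full set of landmarks and its own face-type templates); the three template points below are arbitrary assumptions for illustration.

```python
import cv2
import numpy as np

def align_face(frame, landmarks, image_size=512):
    """Map three detected points (left eye, right eye, mouth center) onto fixed
    template positions with a similarity transform, producing an aligned crop."""
    template = np.float32([[0.35, 0.40], [0.65, 0.40], [0.50, 0.75]]) * image_size
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), template)
    return cv2.warpAffine(frame, matrix, (image_size, image_size))
```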

Step 3: Sorting, filtering, and segmentation masks

The next step is to sort and filter the SRC and DST datasets.

In the SRC dataset, our goal is to get rid of all unnecessary faces, because the extraction may have extracted faces of other people who were in frame. We also need to remove samples that were misaligned and samples that are of obviously poor quality.

It is important to understand that these samples are what the neural network will learn from. The better the samples in the dataset, the better the results.

Sorting the DST dataset is very simple and boils down to two tasks:

  • Remove unnecessary faces that we do not want to replace
  • Remove and manually re-extract misaligned samples
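
DeepFaceLab ships its own sorting helpers (by blur, by face yaw, and more) to make this review pass faster. As a standalone illustration of the idea, the sketch below ranks aligned SRC faces by a simple sharpness metric so the worst samples surface first; the folder path assumes the default workspace layout.

```python
import cv2
from pathlib import Path

def sharpness(path):
    """Variance of the Laplacian: a simple blur metric (higher = sharper)."""
    gray = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

# Rank the extracted SRC faces from blurriest to sharpest and list the worst 50 for review.
aligned = sorted(Path("workspace/data_src/aligned").glob("*.jpg"), key=sharpness)
for sample in aligned[:50]:
    print(sample)
```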

The final step in preparing the datasets is to apply segmentation masks to our samples. DeepFaceLab has its own tool for this called XSeg. The DeepFaceLab contributors decided to build their own segmentation model after the available open-source models proved inefficient for this purpose.

There is a lot to say about XSeg, so I will focus on the most important points.

First, DeepFaceLab has a pretrained XSeg model for whole face (WF) samples that does well in most cases.

Second, you can train your own XSeg model for your samples. To do this, you need to manually mark up your samples with the XSeg Editor tool. Importantly, you do not need to mark up all of your samples to train XSeg. For example, for a dataset of 5000 samples, it is enough to mark 100 samples to train XSeg and get masks for the whole dataset.

In order to apply the pretrained XSeg model to SRC and DST datasets, you need to run the appropriate bat files, the names of which begin with 5. XSeg Generic) ...
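
To see what the mask actually buys us, here is a miniature example of applying a segmentation mask to an aligned sample: everything outside the masked face region is zeroed out, so hair, hands, and background cannot leak into training or blending. The file names are placeholders; DeepFaceLab stores XSeg masks inside the sample files rather than as separate images.

```python
import cv2
import numpy as np

# Apply a face mask to an aligned sample (file names are placeholders).
face = cv2.imread("aligned_sample.jpg").astype(np.float32) / 255.0
mask = cv2.imread("xseg_mask_sample.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0

masked_face = face * mask[..., None]          # keep only the masked face region
cv2.imwrite("masked_sample.png", (masked_face * 255).astype(np.uint8))
```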

Training the model

The most difficult or perhaps most confusing stage for users is often the model training stage. Let's start by learning the three different models in DeepFaceLab.

Models and architectures

There are three models in DeepFaceLab: SAEHD, AMP, and Quick96 (Q96).

  • SAEHD - This is the main model that we will be working with most of the time. It is very flexible in its configuration and can produce high-quality photorealistic results.
  • AMP - This model is similar to SAEHD except that it allows us to swap faces depending on a morph factor. In other words, we can either get a pure transfer of the SRC face or mix the SRC and DST faces according to the morph value.
  • Quick96 (Q96) - This model is more for beginners. Its purpose is to provide a demo of face swapping so that beginners can see how model training works. It can also be used to get a quick demo of a specific SRC-DST pair to see how well the two fit together. Unlike the others, it trains very fast, but the result is of poorer quality.

The SAEHD model has several types of network architectures, as well as some tweaks that can be applied to achieve certain effects.

Model architecture types
  • DF - This is the classic design of the deepfake model that uses one encoder and two decoders. The basic feature of this architecture is to provide better SRC likeness. When using this architecture, we need the SRC and DST faces to be as similar as possible because unlike the LIAE structure, it does not adapt SRC face shape to DST face shape during transfer. We should also keep in mind that this architecture has worse color and light tolerance than the LIAE structure.
  • LIAE - This is a more interesting type of architecture than DF because it is more flexible and adaptive. LIAE is able to adapt the SRC face shape to DST and also performs better with light and color transfer.

Remember that these are only general characteristics of both architectures. It all depends on our skills. For example, a properly trained DF model may have better lighting transfer than an improperly trained LIAE. The opposite is also true – a properly trained LIAE may have better results than an improperly trained DF.

It all comes down to how well SRC and DST are matched, how well the SRC dataset is captured, and how well we understand the principles of model training. Even if we know basics it can still take a good amount of trial and error to get everything right.

For each architecture there are a number of tweaks that can be applied to change the behavior of the model in one way or another:

  • U - Improves SRC likeness and is generally always recommended to use.
  • D - Optimization to increase model resolution by a factor of 2 without additional computational cost. This option requires longer training times and the resolution must be a multiple of 32 (as opposed to 16 without this optimization).
  • T - Improves SRC likeness. Unlike -U, it does this by changing the architecture of the model which can affect performance.

One advantage of these architecture tweaks is that they are freely combinable with each other. To combine them, after DF/LIAE add the - symbol and then letters in the same order as presented above. Here are some examples: DF-UDT, LIAE-DT, LIAE-UDT, DF-UD, etc.

If you are a beginner and don't know what to choose, I recommend starting with LIAE-UDT. As you experiment with models and gain experience you'll get a better understanding of which architecture options you like the best.

Model options

The model itself also has many parameters and options. Some options affect the number of trainable parameters of the neural network while some affect the training process itself. I will describe the most basic options and provide a link where you can read more if you're interested.

Since the SAEHD model is an autoencoder, we have the encoder (which takes the input image), the decoder (which produces the reconstructed output image), and the inter, or hidden "bottleneck" layer, in between.

Accordingly, we can configure the base dimensions of these modules as follows:

  • AutoEncoder dimensions - The dimensions of the inter module, the bottleneck of the autoencoder, which is responsible for generalizing the facial features recognized by the encoder. The total capacity of the network depends on this value.
  • Encoder dimensions - The dimensions of the encoder module, which is responsible for detecting and recognizing facial features. When these dimensions are insufficient and the facial features are too diverse, the encoder has to sacrifice non-standard cases, which worsens the overall quality of the output.
  • Decoder dimensions - The dimensions of the decoder module, which is responsible for generating an image from the code received from the bottleneck. When these dimensions are insufficient and the output faces vary widely in color, illumination, etc., the decoder will sacrifice the maximum achievable sharpness.
  • Decoder mask dimensions - The dimensions of the mask decoder module, which is responsible for generating the mask image.
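
To make these modules concrete, here is a deliberately tiny PyTorch sketch of the DF layout described earlier: one shared encoder and inter (bottleneck) plus separate SRC and DST decoders. The layer sizes and shapes are arbitrary assumptions for illustration; the real SAEHD model is far larger and structured differently.

```python
import torch
import torch.nn as nn

class TinyDF(nn.Module):
    """Toy DF-style autoencoder: shared encoder + inter, two decoders. Expects 64x64 RGB."""
    def __init__(self, ae_dims=64, e_dims=32, d_dims=32):
        super().__init__()
        self.d_dims = d_dims
        self.encoder = nn.Sequential(                      # "Encoder dimensions"
            nn.Conv2d(3, e_dims, 5, 2, 2), nn.LeakyReLU(0.1),
            nn.Conv2d(e_dims, e_dims * 2, 5, 2, 2), nn.LeakyReLU(0.1),
            nn.Flatten(),
        )
        self.inter = nn.LazyLinear(ae_dims)                # "AutoEncoder dimensions" (bottleneck)
        self.expand = nn.Linear(ae_dims, d_dims * 16 * 16)
        def make_decoder():                                # "Decoder dimensions"
            return nn.Sequential(
                nn.Upsample(scale_factor=4),
                nn.Conv2d(d_dims, 3, 3, padding=1), nn.Sigmoid(),
            )
        self.decoder_src, self.decoder_dst = make_decoder(), make_decoder()

    def forward(self, x, which="src"):
        code = self.inter(self.encoder(x))                 # compress to the bottleneck
        feat = self.expand(code).view(-1, self.d_dims, 16, 16)
        decoder = self.decoder_src if which == "src" else self.decoder_dst
        return decoder(feat)                               # reconstruct as SRC or DST

print(TinyDF()(torch.rand(1, 3, 64, 64), which="dst").shape)  # torch.Size([1, 3, 64, 64])
```

Roughly speaking, the LIAE architecture rearranges the middle of the network into two inter modules (which is where the inter_AB.npy file mentioned in the training workflow below comes from), but the dimension options control the same kinds of capacity.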

The following options relate more to the training process (with the exception of resolution) than to the parameters of the model itself.

  • Batch size - Increasing this value improves the generalization of faces, which is especially useful in the early stage, but it also increases the time until faces become clear and raises memory consumption. In terms of the quality of the final fake, the higher the better. The optimal value is 8; do not set it lower than 4.
  • Resolution - At first glance, the more the better. However, if the face in the frame is small, there is no point in choosing a large resolution, and increasing the resolution increases training time. The WF face type needs more resolution because it covers a larger area of the face, leaving fewer pixels for facial detail; less than 224 makes no sense for WF.
  • Eyes and mouth priority - Helps fix eye problems during training, such as "alien eyes" and wrong gaze direction, and also increases the detail of the teeth.
  • Use learning rate dropout - Improves facial detail and the subpixel transitions of facial features. Uses more VRAM, so take this option into account when selecting a network configuration for your GPU.
  • Enable random warp of samples - A required condition for correct generalization of faces. Works like augmentation by randomly deforming the faces fed to the encoder input. Turn it off only when the face is already sufficiently trained.
  • GAN power - Adds a discriminator model based on the generative-adversarial principle, which improves facial detail. Turn it on only at the very end of the training process. Requires more memory and increases iteration time.
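
The resolution and batch size rules above (and the multiple-of-32 requirement of the -D tweak) are easy to trip over, so here is a small helper that simply restates them as checks; check_config is a hypothetical convenience, not part of DeepFaceLab.

```python
def check_config(resolution, batch_size, face_type="wf", uses_d=True):
    """Sanity-check a SAEHD configuration against the rules of thumb above."""
    warnings = []
    step = 32 if uses_d else 16                       # -D tweak needs a multiple of 32
    if resolution % step:
        warnings.append(f"resolution should be a multiple of {step}")
    if face_type == "wf" and resolution < 224:
        warnings.append("less than 224 makes little sense for the WF face type")
    if batch_size < 4:
        warnings.append("do not set batch size lower than 4 (8 is optimal)")
    return warnings

print(check_config(resolution=224, batch_size=8))     # [] -> nothing to complain about
print(check_config(resolution=208, batch_size=2))     # two warnings
```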

Training workflow

You can find many DeepFaceLab guides and tutorials on the internet. In them, people often demonstrate their approaches to training process based on their experience. In fact, there are an infinite number of ways to train a model.

Therefore, I will give the training workflow that I personally use and that will be simple enough for beginners. With experience you will be able to try your own variations of the training workflow and model configurations.

  1. Download a pretrained model or an RTT (Ready-to-Train) model

Personally, I always use an RTT model. The difference between an RTT model and a regular pretrained model is the way it was pretrained. The RTT model can be downloaded from here.

  2. Start training

Place the RTT model in the workspace/model folder, then start train SAEHD.bat. Don't change the settings, and train for 500k iterations with your SRC and DST datasets.

Periodically delete the inter_AB.npy file in the model folder. Deleting inter_AB.npy forces this module to be reinitialized, which has been found experimentally to improve the SRC likeness of the result.

  3. Disable random warp

Disable random warp and continue training for another +300k iterations. When you turn off random warp, the quality of reconstruction increases significantly. At the same time, it can have negative consequences for generalization, which increases the degree of morphing of the result. Therefore, we would not want to keep the model in this stage for too long – it is only needed to improve the quality of the reconstruction.

  4. Enable GAN

Set GAN power to 0.1. Leave GAN patch size at its default value, and set GAN dims to 32.

This option connects a discriminator model to training, which is trained in parallel to determine the level of "clarity" of SRC samples. When we enable this option, SRC loss will gradually increase, and we may see artifacts on SRC and SRC-DST samples in preview. This is normal. The model is adapting to new training conditions. This will go away with time. Train like this for another +500k iterations.

We should train the model until we are satisfied with the quality of the preview. There is no fixed number of iterations after which the model will give photorealistic results; it depends on the quality of the datasets and also varies from actor to actor.

That said, it's safe to say that training will take 1 million iterations or more.
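
Summarized as data, the schedule above looks like this. The iteration counts are the rough targets from this guide, not values DeepFaceLab enforces, and in practice you switch stages by changing the trainer options rather than by running a script like this.

```python
# The staged SAEHD schedule described above, written out as a checklist.
training_plan = [
    {"iters": 500_000, "random_warp": True,  "gan_power": 0.0},  # RTT warm-up, default settings
    {"iters": 300_000, "random_warp": False, "gan_power": 0.0},  # sharpen reconstruction
    {"iters": 500_000, "random_warp": False, "gan_power": 0.1},  # GAN refinement at the end
]
for stage in training_plan:
    print(f"train {stage['iters']:,} iterations "
          f"(random_warp={stage['random_warp']}, gan_power={stage['gan_power']})")
```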

Final conversion and merging

Once we've finished our training, we can move on to the final stage: merging.

There are two ways to get the final video output: merging with the DFL tool, or exporting the output frames with the face swap result and manually compositing them in professional software like Adobe After Effects, DaVinci Resolve, etc.

To merge frames with DeepFaceLab, we will run the bat file 7) merge SAEHD.bat. This will prompt us to choose between interactive merger mode and the default merger mode.

I recommend using the interactive merger mode. Interactive mode displays a visual preview of the merged image as well as the current frame number and settings. We will be able to change settings on each individual frame. A useful feature of this mode is that we can adjust merge settings on the fly for a better visual experience. Also, we can save our session and resume it later.

In the default, non-interactive mode, we will need to enter settings that are applied to the entire sequence. All frames will be processed one by one, and we will not be able to change settings or see the result until the end of the process.

Interactive merger controls

In interactive mode, the merger window will open with hotkeys. Hotkeys allow us to change the settings of the face overlay in frame. In the console we can see the number of the current frame and its settings (config/cfg). With the Tab key we can toggle the help window and frame preview window.

We can also export frames using raw-rgb blending mode for manual post-processing in software such as Adobe After Effects.

If we merged with DeepFaceLab and want a finished mp4 video, we will need to run 8) merged to mp4.bat. This assembles the already merged frames into a single mp4 file and does the same for the face mask. Thus, two videos, result.mp4 and result_mask.mp4, will be created in the workspace folder.
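
If you go the manual route (raw-rgb export plus your own compositing), the core per-frame operation is just a masked blend of the swapped face over the original frame. Here is a minimal sketch; the file names are placeholders for wherever your original, swapped, and mask frames live, and the Gaussian feathering is one simple choice among many.

```python
import cv2
import numpy as np

# Masked blend of a swapped face over the original frame (file names are placeholders).
frame = cv2.imread("original_00001.png").astype(np.float32)
swap  = cv2.imread("raw_swap_00001.png").astype(np.float32)
mask  = cv2.imread("mask_00001.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
mask  = cv2.GaussianBlur(mask, (15, 15), 0)[..., None]   # feather the mask edge

composite = swap * mask + frame * (1.0 - mask)
cv2.imwrite("composite_00001.png", composite.astype(np.uint8))
```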

Go ahead and open up the result.mp4 video and take a look at your work! Congratulations, you've created your first deepfake video using DeepFaceLab!

Afterword

I would like to note that this is just an introductory guide to get you started. In order to make really good deepfakes you need to spend a lot of time gaining experience – this will involve a certain amount of trial and error.

If you are a technician and you are interested in how DeepFaceLab models work from the inside, you can read our paper on arXiv: DeepFaceLab: Integrated, flexible and extensible face-swapping framework.

If you are interested in learning more about the process of creating deepfakes with DeepFaceLab, you can read more about it here.

Also, we have a YouTube video with the full training process from one of our contributors.
