# Custom Diffusion with Paperspace

Learn how to customize your diffusion model images with multiple concepts!

22 days ago   •   9 min read


On this blog, we have covered extensively how to set up fine-tuning and customization techniques for the popular Stable Diffusion model for generating novel images. We first demonstrated the efficacy of creating a novel token for the model to associate with the image features shown in the training set using Textual Inversion. Next, we looked at fine-tuning the model directly with Dreambooth, which leverages the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss to enable the synthesis of the subject in diverse scenes, poses, views, and lighting conditions that do not appear in the reference images. These powerful techniques give great control to the user during the image generation stage.

Today, we are going to look at the latest addition to this family of techniques for tuning diffusion models: Custom Diffusion. Unlike the previous two techniques, Custom Diffusion is designed to let the model learn either a single concept or several concepts at once. This in turn allows the user to write multi-concept prompts when synthesizing images from the model. Furthermore, because only a small part of the model is updated, the fine-tuned weights can be compressed into checkpoints far lighter than a full Stable Diffusion checkpoint.

Visit the GitHub repo for Custom Diffusion here.

# Background

Released to the public last month by Adobe Research, Custom Diffusion is an efficient method for augmenting existing text-to-image models. The authors find that optimizing only a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts while enabling fast tuning. Additionally, multiple concepts can be trained jointly, or multiple fine-tuned models can be combined into one via closed-form constrained optimization. The resulting fine-tuned model generates variations of multiple new concepts in novel, unseen settings.

## How it works:

In short, their method for model fine-tuning works by updating only a small subset of the weights in the cross-attention layers of the model. At the same time, a regularization set of real images is used to prevent overfitting on the few training samples of the target concepts.

To elaborate, when given images of new concepts, the method retrieves real images whose captions are similar to the given concepts. It uses these images, alongside the concept images, to build the training dataset for fine-tuning. Above is an example of concepts and their corresponding images.

During training, to represent personal concepts of a general category, they use a modifier token V∗. This token is placed directly in front of the category name to signal the concept textually. During training, the model optimizes key and value projection matrices in the diffusion model cross-attention layers along with the modifier token. The retrieved real images are then used as a regularization dataset during fine-tuning.
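To make the choice of trainable weights concrete, here is a minimal sketch of how one might pick out the cross-attention key and value projections by name. It assumes the naming convention of the Diffusers UNet, where cross-attention blocks are called `attn2` and their projections `to_k` and `to_v`. The parameter names in the list are illustrative examples, not a dump of the real model.

```python
# Sketch: select only cross-attention key/value projections for training.
# Assumption: Diffusers-style parameter names, where `attn1` is self-attention
# and `attn2` is cross-attention (the text-conditioning layers).

def is_trainable(param_name: str) -> bool:
    """Return True for cross-attention key/value projection weights."""
    return "attn2.to_k" in param_name or "attn2.to_v" in param_name

# Illustrative parameter names (a hypothetical subset of a UNet's parameters).
param_names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",
    "down_blocks.0.resnets.0.conv1.weight",
]

trainable = [name for name in param_names if is_trainable(name)]
frozen = [name for name in param_names if not is_trainable(name)]

print("trainable:", trainable)
print("frozen:", len(frozen), "parameters")
```

With a real model, the same predicate could be applied while iterating over the UNet's `named_parameters()` and setting `requires_grad` accordingly.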

The figure above shows an instance of the cross-attention layer and the trainable parameters. The cross-attention block modifies the latent features of the network according to the condition features, i.e., text features in the case of text-to-image diffusion models. We can represent this mathematically, as shown below:

$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d'}}\right) V$

Here, the latent image feature $f$ and the text feature $c$ are projected into the query $Q = W^q f$, key $K = W^k c$, and value $V = W^v c$. The output is then computed as a weighted sum of the values, weighted by the similarity between the query and key features. The figure highlights the parameters that are updated during fine-tuning: the key and value projection matrices $W^k$ and $W^v$.
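The attention computation can be sketched in a few lines of plain Python. This is a toy illustration only: the matrices are tiny made-up values, features are stored as rows (so projections are written as `f @ W` rather than `W f`), and the function names are hypothetical rather than taken from any library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(f, c, w_q, w_k, w_v):
    """f: image features (n_pixels x d_f); c: text features (n_tokens x d_c)."""
    q = matmul(f, w_q)  # queries come from the image features
    k = matmul(c, w_k)  # keys come from the text features
    v = matmul(c, w_v)  # values come from the text features
    d = len(q[0])
    scores = matmul(q, transpose(k))  # similarity of each pixel to each token
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy inputs: 2 image positions, 3 text tokens (all values made up).
f = [[1.0, 0.0], [0.0, 1.0]]
c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # updated during fine-tuning
w_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # updated during fine-tuning

out, weights = cross_attention(f, c, w_q, w_k, w_v)
print(out)
```

Because only `w_k` and `w_v` act on the text features, changing them alters how text tokens map onto image features while leaving the rest of the network untouched.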

#### Single concept training

The basic use for this method is to train the model on a single concept with only a few images; a minimum of four is recommended. The model retains the knowledge it held before fine-tuning, so it is able to combine what it already knew with the features learned during tuning.

#### Multi concept training

In order to learn multiple concepts, Custom Diffusion pools the training data for each individual concept and trains on them jointly with the same model. To denote the separate target concepts, the authors recommend using a different modifier token, V∗i, for each one. These tokens are initialized with different rarely-occurring tokens and optimized along with the cross-attention key and value matrices for each layer. In practice, this allows the model to learn multiple concepts simultaneously with very little loss in capability, and then to combine these subjects as desired during synthesis with the features already known to the pre-trained model.
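As a small illustration of how modifier tokens pair with category names, the sketch below splits a `+`-separated token string (the format used by the `--modifier_token` argument in the demo) and builds one prompt per concept. The helper names, the prompt template, and the categories are assumptions made for illustration.

```python
from typing import List

def split_modifier_tokens(arg: str) -> List[str]:
    """Split a '+'-separated modifier token argument into individual tokens."""
    return [tok for tok in arg.split("+") if tok]

def instance_prompt(token: str, category: str) -> str:
    """Place the modifier token directly in front of the category name."""
    return f"photo of a {token} {category}"

tokens = split_modifier_tokens("<new1>+<new2>")
categories = ["cat", "wooden pot"]  # hypothetical categories, one per concept

prompts = [instance_prompt(t, c) for t, c in zip(tokens, categories)]
print(prompts)
```

At synthesis time, both tokens can appear in a single prompt to combine the learned subjects.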

## Comparing the techniques

As you can see from the examples above, Custom Diffusion compares favorably with both Textual Inversion and Dreambooth for fine-tuning tasks on a purely subjective, qualitative measure. All three techniques do a fine job of imparting the characteristics of the image, but we can see a higher degree of variation, versatility, and spatial understanding in the Custom Diffusion examples.

Be sure to try out the Paperspace Stable Diffusion Runtime to generate your own comparisons!

# Demo

Now that we are familiar with the concepts at play behind Custom Diffusion, we can get started setting up our demo. We are going to be using a Gradient Notebook to test out the sample code, and you can follow along by clicking the link here.

## Setup

First, we are going to execute setup.sh, a shell script containing all of the installs we need to run the model. Let's look at the packages we need directly:

```bash
# Clone repos
cd custom-diffusion
git clone https://github.com/CompVis/stable-diffusion.git
cd stable-diffusion
pip install -e .
git clone https://huggingface.co/spaces/nupurkmr9/custom-diffusion
cd ../..

# Install requirements
pip install clip-retrieval
pip install diffusers bitsandbytes clip_retrieval gradio albumentations diffusers opencv-python pudb invisible-watermark imageio imageio-ffmpeg pytorch-lightning omegaconf test-tube streamlit einops torch-fidelity transformers torchmetrics kornia
pip install accelerate==0.15.0
pip install bitsandbytes==0.35.4
pip install diffusers==0.10.2
pip install ftfy==6.1.1
pip install Pillow==9.3.0
pip install torch==1.13.0
pip install torchvision==0.14.0
pip install transformers==4.25.1
pip install triton==2.0.0.dev20220701
pip install clip_retrieval
pip install -e git+https://github.com/openai/CLIP.git@main#egg=clip
pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
```

We first clone the Custom Diffusion repository, and clone the Stable Diffusion repo within it. This redundancy is only in place to make sure Custom Diffusion has access to all the scripts it needs to run. After installing the Stable Diffusion repo, we clone the Hugging Face Space for the Custom Diffusion application. This will enable less experienced coders to easily run Custom Diffusion using the Gradio application.

After we install all the requirements, we can move on to the next section.

## Run the demo

There are two ways to run the fine-tuning demo: using Diffusers or using the Gradio application.


#### Diffusers

Let's start by looking at the args for the Diffusers method:

For multi-concept training, we need to include a JSON file containing the information for each concept, similar to this one. We can then execute the code in the cell below:

```bash
# Launch the training script (2 GPUs recommended; increase --max_train_steps to 1000 if using 1 GPU).
# The JSON file passed to --concepts_list holds the info about each concept.
# Adjust CUDA_VISIBLE_DEVICES to the GPU indices available on your machine.
!CUDA_VISIBLE_DEVICES=2,3 accelerate launch src/diffuser_training.py \
    --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 \
    --output_dir=./logs/cat_wooden_pot \
    --concepts_list=./assets/concept_list.json \
    --with_prior_preservation --real_prior --prior_loss_weight=1.0 \
    --resolution=512 \
    --train_batch_size=2 \
    --learning_rate=1e-5 \
    --lr_warmup_steps=0 \
    --max_train_steps=500 \
    --num_class_images=200 \
    --scale_lr --hflip \
    --modifier_token "<new1>+<new2>"

# Sample from the fine-tuned model
!python src/sample_diffuser.py --delta_ckpt logs/cat_wooden_pot/delta.bin --ckpt "CompVis/stable-diffusion-v1-4" --prompt "<new1> cat sitting inside a <new2> wooden pot and looking up"
```
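For reference, a concept list file like the one passed to `--concepts_list` could be generated with a few lines of Python. This is a sketch under assumptions: the field names reflect my understanding of the repository's example file and should be checked against it, and all paths and prompts are hypothetical placeholders.

```python
import json

# Hypothetical paths and prompts; adjust them to your own data layout.
# The field names are an assumption and should be verified against the
# example concept list shipped with the Custom Diffusion repository.
concepts = [
    {
        "instance_prompt": "photo of a <new1> cat",
        "class_prompt": "cat",
        "instance_data_dir": "./data/cat",
        "class_data_dir": "./real_reg/samples_cat/",
    },
    {
        "instance_prompt": "photo of a <new2> wooden pot",
        "class_prompt": "wooden pot",
        "instance_data_dir": "./data/wooden_pot",
        "class_data_dir": "./real_reg/samples_wooden_pot/",
    },
]

concepts_json = json.dumps(concepts, indent=4)
print(concepts_json)

# To write the file used by --concepts_list, uncomment:
# with open("./assets/concept_list.json", "w") as fh:
#     fh.write(concepts_json)
```

Each entry pairs one modifier token with its category, its training images, and the directory of real regularization images for that category.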

Notably, the training script takes the following arguments worth paying attention to:

• pretrained_model_name_or_path: the Diffusers model checkpoint to fine-tune, given either as a Hugging Face model ID or as a path in your local environment
• instance_data_dir: the location of your input photos
• instance_prompt: the prompt describing the subject of the input photos, including the new modifier token
• max_train_steps: how many training steps to take

For multi-concept training, the data location and prompt for each concept are supplied through the JSON file passed to concepts_list rather than directly on the command line. Run this cell with the corresponding changes to the settings listed above to fine-tune the model on your input images.
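The advice to double the step count on a single GPU can be checked with some quick arithmetic. The sketch below assumes that `--scale_lr` follows the convention of the Diffusers example training scripts, multiplying the base learning rate by the batch size, the number of processes, and the gradient accumulation steps. The Custom Diffusion script may apply further factors, so treat the numbers as illustrative.

```python
def effective_learning_rate(base_lr, train_batch_size, num_processes,
                            gradient_accumulation_steps=1):
    """Scale the base learning rate by the effective batch size (assumed convention)."""
    return base_lr * gradient_accumulation_steps * train_batch_size * num_processes

# Values from the training command: lr=1e-5, batch size 2 per device.
two_gpu_lr = effective_learning_rate(1e-5, train_batch_size=2, num_processes=2)
one_gpu_lr = effective_learning_rate(1e-5, train_batch_size=2, num_processes=1)

# Images seen = steps * per-device batch size * number of devices.
images_two_gpus = 500 * 2 * 2
images_one_gpu = 1000 * 2 * 1

print(two_gpu_lr, one_gpu_lr, images_two_gpus, images_one_gpu)
```

Under these assumptions, 500 steps on two GPUs and 1000 steps on one GPU process the same number of training images.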

#### Gradio

Hugging Face user nupurkmr9 has created a wonderful Gradio app that demonstrates the functionality of this new technique in a low-code environment. To set up the application, open the file at notebooks/custom-diffusion/custom-diffusion/app.py. Navigate to the last line, and change "share = False" to "share = True" in the launch call.

Then, back in our Notebook, we can run the application by executing the cell with the code shown below:

```bash
%cd ~/../notebooks/custom-diffusion/custom-diffusion
!python app.py --share
```

## Sample from the new model

Now that we have our fine-tuned model, we can use the built-in sample script or the standard Diffusers pipeline to synthesize images from the model.

If we want to use the provided sample script, whose outputs will not display in the notebook automatically, we can execute the code in the cell below:

```bash
!python src/sample_diffuser.py --delta_ckpt logs/cat/delta.bin --ckpt "CompVis/stable-diffusion-v1-4" --prompt "<new1> cat playing with a ball"
```

Alternatively, we can use the Diffusers StableDiffusionPipeline to sample from the model within the Notebook. Below is a snippet that can be used to query from the model we just trained in this demo:

```python
import torch
from diffusers import StableDiffusionPipeline
from IPython.display import display

%cd ~/../notebooks

# Set the args
device = 'cuda'
model_ = 'custom-diffusion/results/checkpoint-500'
size_ = 512          # output height and width in pixels
precision = 'fp16'   # matches torch_dtype=torch.float16 below
sample_num = 5

print(f'Generating samples from Stable Diffusion {model_} checkpoint ({precision})')

# Instantiate the model pipe with StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(model_, torch_dtype=torch.float16)
pipe = pipe.to(device)

for j in range(sample_num):
    # Each call returns a list of PIL images (one image per call here)
    a = pipe(prompt='a photo of a <new1> cat',
             negative_prompt=None,
             guidance_scale=7.5,
             height=size_,
             width=size_,
             num_images_per_prompt=1,
             num_inference_steps=50)['images']
    for i in a:
        display(i)
        # Uncomment the next line to save results (the outputs/ directory must exist)
        # i.save(f'outputs/gen-image-{j}.png')
```


From here, you can adjust the parameters as needed to fit the subject of your training, and quickly generate any images you want with Paperspace's GPU-powered machines.

## Compress the model

By adjusting how much we compress the model, we can significantly affect the outputs of the diffusion model. Take a look at the examples below, which use varying amounts of compression on a Custom Diffusion model. The smallest, at 0 Rank, is a model approaching the minimum size of ~15 MB. This is orders of magnitude smaller than the original diffusion checkpoints, which can range from 2-8 GB in size.

Once training is complete, we can use model compression to greatly reduce the size of the model files. The trade-off is that stronger compression weakens the model's understanding of both the original Stable Diffusion training data and the novel subject we introduced in fine-tuning.

Doing so is relatively simple. Navigate back to the custom-diffusion repo, and run the following code cell to compress your model:

```bash
!python src/compress.py --delta_ckpt <finetuned-delta-path> --ckpt <pretrained-model-path>
```
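To get a feel for why compression shrinks the checkpoint so much, the arithmetic below compares storing a dense weight update with storing two low-rank factors. It assumes the compression step keeps a low-rank factorization of the difference between fine-tuned and pretrained weights, which should be confirmed against `src/compress.py`. The matrix shape and ranks are hypothetical.

```python
def full_params(rows: int, cols: int) -> int:
    """Number of values in a dense weight matrix."""
    return rows * cols

def low_rank_params(rows: int, cols: int, rank: int) -> int:
    """Number of values in two factors: U (rows x rank) and V (rank x cols)."""
    return rank * (rows + cols)

# Hypothetical projection mapping 768-dim text features to 320 channels.
rows, cols = 320, 768

for rank in (1, 5, 20):  # illustrative ranks
    ratio = low_rank_params(rows, cols, rank) / full_params(rows, cols)
    print(f"rank {rank}: {ratio:.1%} of the dense size")
```

Lower ranks store fewer values per layer, which is where the savings come from, at the cost of a coarser approximation of what was learned during fine-tuning.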

# Closing thoughts

In this article, we sought to better understand the newest technique for fine-tuning latent diffusion models. Specifically, the Custom Diffusion technique allows for fine-tuning on one or more concepts featured in the training images, and allows them to be used in tandem to synthesize combinations of those features in the output images. We showed this by walking through each of the capabilities of the new technique, before jumping into a coding demonstration using the technique within a Gradient Notebook.

We highly recommend you check out the Gradient Stable Diffusion runtime tile, which contains all the code from our work with Stable Diffusion models. We also recommend you try combining this technique with Textual Inversion for even greater degrees of control over your final image outputs.