YOLOR object detection in Gradient

In this new tutorial, we will examine YOLOR object detection with PyTorch in detail to see how it combines implicit and explicit information with a unified representation. We then demonstrate how to use YOLOR with Gradient Notebooks.

2 years ago   •   9 min read

By James Skelton

One of the most popular and immediately usable concepts to come out of the popularization and study of deep learning is object detection. Object detection is the practice of using object recognition with image segmentation to create labeled bounding boxes to identify and mark categorizations of objects in images and video. This concept has rapidly grown in popularity thanks to the utility of the freely accessible YOLO algorithms. In the past, we covered YOLOv4 extensively in our series on implementing the object detector in PyTorch on Gradient.

In this article, we will examine YOLOR (You Only Learn One Representation). YOLOR is an object detection algorithm released in 2021 that matches and even outperforms Scaled-YOLOv4. YOLOR is conceptually different from YOLO because it uses a unified network to encode implicit knowledge and explicit knowledge simultaneously. YOLOR can perform "kernel space alignment, prediction refinement, and multi-task learning in a convolutional neural network," and the authors' findings indicate that including implicit information improves performance on all tasks. (1)

In this tutorial, we will break down how YOLOR detects objects, contrast it with the well-known YOLO algorithm, and close with a coding demonstration that covers training a YOLOR model and using it to detect objects in a YouTube video.

YOLOR: architecture and capabilities

The basis of YOLOR's difference from YOLO lies in its ability to encode explicit and implicit knowledge in the same representation, hence the name, using a single, unified model. To clarify, explicit deep learning refers to understanding the coarse details of the image that are stored in the shallow layers of the network. Implicit deep learning focuses on the finer details and corresponds to the deeper layers of a network. By combining these two efforts in a single model, YOLOR can quickly and accurately detect fine details in high-definition photos, and YOLOR is approximately 88% faster than the Scaled-YOLOv4 models.

Let's break down how YOLOR makes use of implicit knowledge learning, and then examine the unified model.

How YOLOR accounts for implicit knowledge:

Before we look at the unified network, we will introduce a few concepts that the authors of the original paper suggest for this task. These techniques allowed the authors to model implicit knowledge and run inference with it quickly.

Manifold Space Reduction:

Manifold space reduction is the first technique mentioned in the paper. The authors assert that a good, single representation should be able to locate a corresponding projection in the manifold space to which it belongs (1). Using manifold learning, this technique aims to reduce the dimensions of the manifold space to a flat, featureless space. This near-Euclidean space then allows the algorithm to gain an implicit understanding of the high-dimensional structure of the data without any predetermined classifications. If the target categories can then be classified within this reduced projection space, we can improve the predictions while reducing dimensionality.

Kernel Alignment:

As shown in the figure above, it can be problematic to deal with kernel space misalignment with multi-task and multi-head neural networks. This misalignment can cause major disruptions in the capabilities of the model. To remedy this, the authors propose a series of operations on the output features and implicit representations. This allows the kernel space to be appropriately translated, rotated, and scaled so that each output kernel space is aligned (1). In practice, this means aligning the features of both the smaller details and coarser details of the outputs in Feature Pyramid Networks (2). These feature pyramids allow the recognition system to retain its capabilities at different scales.

Additional suggested methods:

As shown above in portion a, the introduction of addition can make a neural network predict the offset of the center coordinate, which can then be used to inform the final predictions. In portion b, we see introducing multiplication allows for anchor refinement. Anchor boxes allow one grid cell to detect multiple objects, and prevent complications from overlapping details. In practice, this makes a more robust model by effectively automatically scanning the hyperparameter set of the anchor. Finally, in section c, the figure shows how dot multiplication and concatenation can be applied to achieve multi-task feature selection and to set pre-conditions for subsequent calculations (1).
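
The first two operations can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the box values, the implicit vectors, and the function names are made up for the example, and in YOLOR itself the implicit representations are learned inside the network rather than set by hand.

```python
# Illustrative only: hand-set numbers stand in for learned implicit vectors.

def refine_center(center, implicit_offset):
    """Addition: shift a predicted box center by an implicit offset."""
    cx, cy = center
    dx, dy = implicit_offset
    return (cx + dx, cy + dy)


def refine_anchor(anchor, implicit_scale):
    """Multiplication: rescale an anchor's width and height."""
    w, h = anchor
    sw, sh = implicit_scale
    return (w * sw, h * sh)


predicted_center = (100.0, 50.0)   # hypothetical box center in pixels
anchor = (32.0, 64.0)              # hypothetical anchor (width, height)

print(refine_center(predicted_center, (2.0, -1.0)))  # (102.0, 49.0)
print(refine_anchor(anchor, (1.5, 0.5)))             # (48.0, 32.0)
```

Addition moves the center without changing the box shape, while multiplication changes the anchor's shape without moving it, which is why the two operations suit offset prediction and anchor refinement respectively.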

Architecture: the unified model

In their pursuit of a model that can handle both the explicit knowledge of a conventional neural network and the implicit knowledge captured through the techniques described above, the authors of YOLOR created a multimodal unified network (pictured above). This model generates one representation, containing both explicit and implicit knowledge, that serves multiple tasks.

In order to achieve this, the authors used explicit and implicit knowledge together to model the error term. This error term was then, in turn, used to guide the multi-purpose network training process. This can be mapped out to the following equation for training:

$y = f_\theta(x) + \epsilon + g_\phi(\epsilon_{ex}(x), \epsilon_{im}(z))$

$\text{minimize} \quad \epsilon + g_\phi(\epsilon_{ex}(x), \epsilon_{im}(z))$

where the εₑₓ and εᵢₘ terms represent operations that respectively model the explicit and implicit error terms from observation x and latent code z. gφ is then a task-specific operation that combines or selects information from the explicit and implicit knowledge stores. Since we have outlined methods to integrate explicit knowledge directly into fθ, we can then rewrite the equation as

$y = f_\theta(x) \star g_\phi(z)$

where ☆ represents the possible operators to combine fθ and gφ. Here the operators we discussed in the implicit knowledge section like addition, multiplication, and concatenation can all be used. By extending the derivation process of the error term to handle multiple tasks, we get the following equation:

$F(x, \theta, Z, \Phi, Y, \Psi) = 0$

where Z = {z₁, z₂, ..., z_T} is the set of implicit latent codes for T different tasks. Next, Φ denotes the parameters used to generate the implicit representation from Z. The term Ψ is then used to calculate the final output parameters from different combinations of the explicit representation and implicit representation.

Finally, from this we can derive the following formula to obtain predictions for all z ∈ Z, and use it to solve a variety of tasks.

$d_\Psi(f_\theta(x), g_\Phi(z), y) = 0$

Following this equation, we start all tasks with a common unified representation fθ(x). Each task then goes through task-specific implicit representation, gΦ(z). The task discriminator then completes each of the different tasks. (1)

In practice, this creates a system that is able to take in both the explicit knowledge that corresponds to the shallow layers of the network and the implicit knowledge that corresponds to the deeper layers of the network in a holistic representation. The discriminator can then direct the different tasks to completion quickly and efficiently.
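
To make the combination step concrete, here is a minimal sketch of y = fθ(x) ☆ gφ(z) for several tasks that share one explicit representation. It rests on several assumptions: plain Python lists stand in for feature tensors, the task names, values, and the combine helper are hypothetical, and the choice of operator per task is arbitrary rather than taken from the paper.

```python
# Illustrative sketch of y = f(x) ☆ g(z) with three choices of ☆.
# Plain lists stand in for feature tensors; all values are made up.

def combine(features, implicit, op):
    """Combine an explicit feature vector with an implicit vector."""
    if op == "concat":
        return features + implicit            # list concatenation
    if len(features) != len(implicit):
        raise ValueError("add/mul need vectors of equal length")
    if op == "add":
        return [f + z for f, z in zip(features, implicit)]
    if op == "mul":
        return [f * z for f, z in zip(features, implicit)]
    raise ValueError(f"unknown operator: {op}")


shared = [1.0, 2.0, 3.0]   # f(x): computed once, shared by every task

# One hypothetical implicit code and operator per task.
tasks = {
    "detection": ([0.5, 0.5, 0.5], "add"),
    "classification": ([2.0, 1.0, 0.5], "mul"),
    "embedding": ([9.0], "concat"),
}

for name, (z, op) in tasks.items():
    print(name, combine(shared, z, op))
```

The point of the sketch is that the shared representation is computed once, and only the small implicit code changes from task to task.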


We've talked a lot about YOLOR so far, and anyone reading this article who is familiar with YOLO can guess that YOLOR is also an extremely capable object detection algorithm. Through the combined effects of the process described earlier in the model, the YOLOR algorithm is able to detect and classify objects before surrounding them with a labeled bounding box. As you can see from the figure above, the authors have demonstrated that YOLOR can outperform YOLOX and Scaled-YOLOv4, when trained on the same dataset, in terms of both Average Precision and batch 1 latency.

Now that we have seen how YOLOR works, let's look at how we can train a YOLOR model, and use it to detect objects from images and videos in a Gradient Notebook.

Bring this project to life


To run YOLOR on a Gradient Notebook, first create a notebook with a PyTorch runtime. If you want to speed things up for training, you can set up a multi-GPU machine with the Growth package, and you may need to take steps to reduce the memory being used in the training process. Any of the available GPUs will run the detection script.

One final setup step is to toggle the advanced options and set the following URL as your Workspace URL:


Once you've finished the setup, hit create to spin up your Notebook.


Now within your Gradient Notebook, we can get started with setup. There are two things we need to do before we can train a model or use a pretrained model to make detections. First, we need to download and prepare the dataset. While the data is available as a Public Dataset, it is currently stored in an immutable volume, so we unfortunately need our own copy of the data in the working directory. The fastest way to get the COCO data ready for YOLOR, therefore, is to use the built-in shell script get_coco.sh in the scripts directory. Execute the following in the terminal to load the data into your working directory:

mkdir coco
cd coco
bash ../scripts/get_coco.sh

This will set up your data in the coco directory, and place the images, annotations, and label files in their correct places to run YOLOR.
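
Before moving on, it can be worth confirming that the download actually produced the expected folders. The following is a minimal sketch, not part of the YOLOR repository: the helper name missing_subdirs is hypothetical, and the sub-folder names images, annotations, and labels are assumptions based on the description above, so adjust them to whatever the script created in your workspace.

```python
from pathlib import Path

# Assumed sub-folder names; adjust them to match what the script created.
EXPECTED = ("images", "annotations", "labels")


def missing_subdirs(root, expected=EXPECTED):
    """Return the expected sub-folders that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_subdirs("coco")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```

Run it from the directory that contains coco. An empty result means every assumed folder is in place.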

Next, you will need to install gdown, a Google Drive downloading program, to get the pretrained YOLOR-CSP-X model weights that we will use later for object detection. You can install gdown and download the weights using the following snippet.

pip install gdown 
gdown https://drive.google.com/uc?id=1NbMG3ivuBQ4S8kEhFJ0FIqOQXevGje_w

Once this is complete, you can begin training YOLOR or detecting with the pretrained model we just downloaded.

How to use YOLOR to detect objects in videos and images

Let's start by looking at how YOLOR works in real time. Use youtube_dl to download a video of people walking through a crowd. You can install it with

pip install youtube_dl

and then execute a notebook cell with the following code to download the video to the inference directory.

import youtube_dl

url = 'https://www.youtube.com/watch?v=b8QZJ5ZodTs'

ydl_opts = {
    'outtmpl': 'inference/images/inputvid.mp4',  # save next to the sample image
    'format': '160',  # YouTube format code 160: low-resolution, video-only MP4
}
youtube_dl.YoutubeDL(ydl_opts).download([url])

The inference directory stores the images that we will use with YOLOR for the object detection task. It will come with an existing picture of horses. We are going to use this image and the video we just downloaded to demonstrate the speed and accuracy of YOLOR. You can do that by executing the following in the terminal:

python detect.py --source inference/images/ --cfg cfg/yolor_csp_x.cfg --weights yolor_csp_x.pt --conf 0.25 --img-size 1280 --device 0

This will then run the object detector on each of the files in the inference/images/ folder, and output them to inference/output/. When it finishes running, the model will output a video and image containing labeled bounding boxes over the classified objects in each file. They should look something like this:


Now, if you want to run the same process on your own data, all you need to do is put your image or video file into the inference/images folder.
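
The --conf 0.25 flag in the detection command above sets a confidence threshold: low-confidence predictions are discarded before anything is drawn. The sketch below shows only that idea. The detections and the filter_by_confidence helper are made up for illustration, and the real pipeline also applies non-maximum suppression, which this sketch leaves out.

```python
# Hypothetical detections as (label, confidence) pairs; values are made up.
detections = [
    ("person", 0.91),
    ("horse", 0.62),
    ("dog", 0.24),
    ("person", 0.08),
]


def filter_by_confidence(dets, threshold=0.25):
    """Keep detections whose confidence is above the threshold."""
    return [(label, conf) for label, conf in dets if conf > threshold]


kept = filter_by_confidence(detections)
print(kept)  # [('person', 0.91), ('horse', 0.62)]
```

Raising the threshold gives fewer but more reliable boxes, and lowering it shows more candidates at the cost of more false positives.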

How to train your own YOLOR model with COCO

Setting up training can be a bit more complicated. Earlier, we set up the COCO dataset using get_coco.sh. This positioned the images, annotations, and labels for YOLOR to use during training. To execute a training routine for a new YOLOR model with the COCO dataset, you can put the following into the console:

python train.py --batch-size 8 --img 1280 1280 --data data/coco.yaml --cfg cfg/yolor_csp_x.cfg --weights '' --device 0 --name yolor_csp_x_run1 --hyp data/hyp.scratch.1280.yaml --epochs 300

The coco.yaml file came with our clone of the YOLOR repo, and it points the training script to the correct directories and supplies the classification labels. This will run the training for 300 epochs, which will take a very long time. Once training is complete, you can use the weights of the best overall model to run the same detection process we used with the pretrained model:

python detect.py --source inference/images/horses.jpg --cfg cfg/yolor_csp_x.cfg --weights runs/train/yolor_csp_x_run1/weights/best_overall.pt --conf 0.25 --img-size 1280 --device 0
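
To get a rough sense of why the 300-epoch training run above takes so long, we can count the optimizer steps it implies. This is back-of-the-envelope arithmetic under one assumption that does not come from this tutorial: that the training split contains 118,287 images, the size of COCO train2017. The batch size and epoch count are taken from the training command.

```python
import math

# Assumption: 118,287 training images (the size of COCO train2017).
num_images = 118_287
batch_size = 8      # from --batch-size 8
epochs = 300        # from --epochs 300

iters_per_epoch = math.ceil(num_images / batch_size)
total_iters = iters_per_epoch * epochs

print(iters_per_epoch)  # 14786
print(total_iters)      # 4435800
```

At 1280-pixel resolution, several million iterations is a substantial job on a single GPU, which is why the pretrained weights are the practical starting point for most experiments.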

Concluding thoughts

In this article, we looked at YOLOR in detail. We saw how YOLOR is able to integrate implicit knowledge with explicit knowledge in a single unified model, and use that single representation to perform complex object detection tasks. We concluded with a demonstration showing how to launch YOLOR on a Gradient Notebook and run a detection task on a downloaded YouTube video.

For more information on YOLOR, you can read the original paper here and visit the authors' GitHub.
