The Swish Activation Function

This blog post is an in-depth discussion of the Google Brain paper titled "Searching for activation functions", which has since revived research into activation functions.


Activation functions might seem to be a very small component in the grand scheme of hundreds of layers and millions of parameters in deep neural networks, yet their importance is paramount. Activation functions not only help with training by introducing non-linearity, but they also help with network optimization. Since the inception of perceptrons, activation functions have been a key component impacting the training dynamics of neural networks. From the early days of a step function to the current default activation in most domains, ReLU, activation functions have remained a key area of research.

ReLU (Rectified Linear Unit) has been widely accepted as the default activation function for training deep neural networks because of its versatility across task domains and network types, as well as its extremely low computational cost (the formula is simply $\max(0, x)$). In this blog post, however, we take a look at a 2018 paper from Google Brain titled "Searching for activation functions", which spurred a new wave of research into the role of different types of activation functions. The paper proposes a novel activation function called Swish, which was discovered using a Neural Architecture Search (NAS) approach and was reported to perform better than standard activation functions like ReLU or Leaky ReLU.

This blog post is also based on a second paper, published at EMNLP and titled "Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks", which evaluates Swish empirically on various NLP-focused tasks. Note that we will discuss Swish itself and not the NAS method that the authors used to discover it.

We will first take a look at the motivation behind the paper, followed by a dissection of the structure of Swish and its similarities to SiLU (Sigmoid-weighted Linear Unit). We will then go through the results from the two aforementioned papers and finally provide some concluding remarks, along with the PyTorch code to train your own deep neural networks with Swish.

1. Motivation
2. Swish
3. PyTorch Code
4. Notable Results
5. Conclusion
6. References


Abstracts

Searching for activation functions

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x · sigmoid(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.

Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP Tasks

Activation functions play a crucial role in neural networks because they are the nonlinearities which have been attributed to the success story of deep learning. One of the currently most popular activation functions is ReLU, but several competitors have recently been proposed or ‘discovered’, including LReLU functions and swish. While most works compare newly proposed activation functions on few tasks (usually from image classification) and against few competitors (usually ReLU), we perform the first large-scale comparison of 21 activation functions across eight different NLP tasks. We find that a largely unknown activation function performs most stably across all tasks, the so-called penalized tanh function. We also show that it can successfully replace the sigmoid and tanh gates in LSTM cells, leading to a 2 percentage point (pp) improvement over the standard choices on a challenging NLP task.

Motivation

Activation functions have been a primary area of research in understanding the training dynamics of neural networks. Despite their large impact, they are often treated as a mere component in applied work, whereas in theoretical fields like Mean Field Theory (MFT) they receive much more attention. ReLU and Leaky ReLU have dominated the scene because of their simple formulation and low computational cost. Many smoother alternatives, such as ELU and Softplus, have been proposed to improve optimization and information propagation, but their gains have been short-lived and none has replaced the ever-versatile ReLU.

The authors of "Searching for activation functions" describe their approach as follows:

"In this work, we use automated search techniques to discover novel activation functions. We focus on finding new scalar activation functions, which take in as input a scalar and output a scalar, because scalar activation functions can be used to replace the ReLU function without changing the network architecture. Using a combination of exhaustive and reinforcement learning-based search, we find a number of novel activation functions that show promising performance. To further validate the effectiveness of using searches to discover scalar activation functions, we empirically evaluate the best discovered activation function. The best discovered activation function, which we call Swish, is $f(x) = x \cdot sigmoid(\beta x)$, where $\beta$ is a constant or trainable parameter. Our extensive experiments show that Swish consistently matches or outperforms ReLU on deep networks applied to a variety of challenging domains such as image classification and machine translation."

Swish

Simply put, Swish is an extension of the SiLU (Sigmoid-weighted Linear Unit) activation function, which was proposed in the paper "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning". SiLU's formula is $f(x) = x \ast sigmoid(x)$, where $sigmoid(x) = \frac{1}{1 + e^{-x}}$. The slight modification made in the Swish formulation is the addition of a $\beta$ parameter, which can be a constant or trainable, making it $f(x) = x \ast sigmoid(\beta x)$.

As the graph of the function shows, Swish has a few key properties that distinguish it from ReLU. Firstly, Swish is smooth everywhere, unlike ReLU, which is piecewise linear with a kink at zero. Secondly, Swish is non-monotonic: it lets small negative values pass through for negative inputs, while ReLU thresholds all negative inputs to zero. This property is considered crucial to the success of smooth non-monotonic activation functions like Swish in increasingly deep neural networks. Lastly, the trainable parameter $\beta$ lets the network tune the shape of the activation function: at $\beta = 0$ Swish reduces to the linear function $x/2$, and as $\beta$ grows it approaches ReLU. The intent is to improve information propagation and produce smoother gradients, which should make the loss landscape easier to optimize. Swish is also a self-gating activation function, since the input is multiplied by the sigmoid of itself and so acts as its own gate, a concept inspired by the sigmoid gates in Long Short-Term Memory (LSTM) networks.
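The properties described above can be checked numerically. The following is a minimal sketch in plain Python, not code from either paper; the helper names `sigmoid`, `swish`, `swish_grad`, and `relu` are chosen here for illustration. It also includes the derivative, which follows from the product rule: $f'(x) = sigmoid(\beta x) + \beta x \ast sigmoid(\beta x) \ast (1 - sigmoid(\beta x))$.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: f(x) = x * sigmoid(beta * x). beta = 1 gives SiLU."""
    return x * sigmoid(beta * x)


def swish_grad(x: float, beta: float = 1.0) -> float:
    """Derivative of Swish with respect to x, obtained via the product rule."""
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)


def relu(x: float) -> float:
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)


if __name__ == "__main__":
    # Compare ReLU and Swish (beta = 1) on a few illustrative inputs.
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(
            f"x={x:+.1f}  relu={relu(x):.4f}  "
            f"swish={swish(x):+.4f}  swish_grad={swish_grad(x):+.4f}"
        )
```

For negative inputs, `relu` returns exactly zero while `swish` returns a small negative value, and that value dips between $x = -4$ and $x = -1$ before rising back to zero, which is the non-monotonic "bump" discussed above. Setting `beta=0.0` gives $x/2$, and a large `beta` makes the output nearly indistinguishable from ReLU.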

PyTorch Code

The following code snippet provides a PyTorch implementation of Swish with $\beta = 1$, which is SiLU, since that is the most widely used variant.

import torch
import torch.nn as nn


class Swish(nn.Module):
    def __init__(self):
        """
        Init method.
        """
        super().__init__()

    def forward(self, input):
        """
        Forward pass of the function.
        """
        return input * torch.sigmoid(input)
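The module above fixes $\beta = 1$. As a rough sketch of the trainable-$\beta$ variant, the code below registers $\beta$ as a learnable parameter and swaps it in for the ReLU layers of an existing model. The names `TrainableSwish`, `beta_init`, and `replace_relu_with_swish` are hypothetical, and sharing one scalar $\beta$ per layer is an assumption made here for simplicity, not something this post specifies.

```python
import torch
import torch.nn as nn


class TrainableSwish(nn.Module):
    """Swish with a learnable beta: f(x) = x * sigmoid(beta * x)."""

    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        # Assumption: a single scalar beta shared by all units in this layer.
        self.beta = nn.Parameter(torch.tensor(float(beta_init)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)


def replace_relu_with_swish(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.ReLU submodule with a TrainableSwish, in place."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.ReLU):
            setattr(module, name, TrainableSwish())
        else:
            replace_relu_with_swish(child)
    return module


# Toy model used only for illustration.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4), nn.ReLU())
model = replace_relu_with_swish(model)

x = torch.randn(2, 8)
out = model(x)
print(model)
print(out.shape)
```

Note that this helper only catches activations registered as `nn.ReLU` modules; calls to `torch.nn.functional.relu` inside a `forward` method are not affected. If a fixed $\beta = 1$ is enough, recent PyTorch versions (1.7 and later) also ship this activation as `torch.nn.SiLU`.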

To run ResNet models equipped with a Swish activation function (for instance a ResNet-18) on CIFAR datasets, use the following commands.

CIFAR-10

python train_cifar_full.py --project Swish --name swish1 --version 1 --arch 1

CIFAR-100

python train_cifar100_full.py --project Swish --name swish --version 1 --arch 1

Note: You will need a Weights & Biases account to enable WandB dashboard logging.

Conclusion

While the Swish paper has been very influential in promoting more research into activation functions, Swish itself has not been able to replace ReLU, largely because of its higher computational cost: it requires evaluating a sigmoid (and hence an exponential) for every activation, whereas ReLU is a simple thresholding operation. It is hard to design an activation function as cheap as ReLU, but with more smooth activation functions being proposed, only time will tell whether ReLU will be dethroned.
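The cost difference can be measured directly on your own hardware. The snippet below is a rough timing sketch, not a rigorous benchmark: the tensor size, the repeat count, and the helper name `time_fn` are arbitrary choices made here, and the numbers will vary with device, tensor shape, and PyTorch version.

```python
import timeit

import torch
import torch.nn.functional as F

# Hypothetical workload: one large CPU tensor. Real costs depend on the
# hardware, the tensor shape, and whether the kernels are fused.
x = torch.randn(1024, 1024)


def time_fn(fn, repeats: int = 20) -> float:
    """Return the average wall-clock seconds per call over `repeats` calls."""
    fn(x)  # warm-up call so one-time setup is not measured
    return timeit.timeit(lambda: fn(x), number=repeats) / repeats


relu_time = time_fn(torch.relu)
swish_time = time_fn(lambda t: t * torch.sigmoid(t))  # unfused, as in the module above
silu_time = time_fn(F.silu)  # built-in SiLU, available in PyTorch 1.7 and later

print(f"relu:           {relu_time * 1e3:.3f} ms")
print(f"x * sigmoid(x): {swish_time * 1e3:.3f} ms")
print(f"F.silu:         {silu_time * 1e3:.3f} ms")
```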