Using Adapter Transformers at Hugging Face

In this tutorial, we show how to use the HuggingFace AdapterHub to access adapter transformers in Paperspace Notebooks.

2 years ago • 6 min read

By Adrien Payong

Introduction

NLP has recently made great strides thanks to transformer-based language models that have been pre-trained on massive amounts of text data. These models are then fine-tuned on a target task and achieve state-of-the-art (SotA) performance on most NLU tasks. Recent models have reached billions of parameters (Raffel et al., 2019; Brown et al., 2020), and their performance has been shown to scale with size (Kaplan et al., 2020). Large pre-trained models can be fine-tuned reasonably quickly on data from the target task, but it is typically impractical to fine-tune them for many tasks and share the resulting models. This hinders the exploration of more modular designs (Shazeer et al., 2017), task composition (Andreas et al., 2016), and the injection of biases and external information (e.g., world or language knowledge) into large models (Lauscher et al., 2019; Wang et al., 2020).

Lightweight fine-tuning techniques such as adapters (Houlsby et al., 2019) have recently been proposed as a viable alternative to full fine-tuning (Peters et al., 2019) for most tasks. Adapters are a small set of newly initialized weights added to each transformer layer. During fine-tuning, only these weights are trained, while the pre-trained parameters of the large model stay frozen. By training multiple task- and language-specific adapters on the same model, and then swapping and combining them post-hoc, we can efficiently share parameters across tasks. Recent work on adapters has led to impressive results in areas such as multi-task and cross-lingual transfer learning (Pfeiffer et al., 2020a,b).
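To make the idea concrete, here is a minimal pure-Python sketch of a single bottleneck adapter. It is not the adapter-transformers implementation. It assumes a common layout (down-projection, ReLU, up-projection, residual connection); the class name BottleneckAdapter, the toy sizes, and the zero initialization of the up-projection are illustrative assumptions, and biases are omitted for brevity.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


class BottleneckAdapter:
    """Toy adapter: project down, apply ReLU, project up, add the input back."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        # Down-projection (bottleneck_size x hidden_size), small random init.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                     for _ in range(bottleneck_size)]
        # Up-projection (hidden_size x bottleneck_size), zero init (assumption).
        self.up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

    def __call__(self, hidden_state):
        update = matvec(self.up, relu(matvec(self.down, hidden_state)))
        # Residual connection: the frozen layer's output passes through unchanged,
        # plus whatever the adapter has learned to add.
        return [h + u for h, u in zip(hidden_state, update)]

    def num_parameters(self):
        return sum(len(row) for row in self.down) + sum(len(row) for row in self.up)


adapter = BottleneckAdapter(hidden_size=8, bottleneck_size=2)
hidden_state = [0.5, -1.0, 2.0, 0.0, 1.5, -0.25, 3.0, 1.0]
print(adapter(hidden_state) == hidden_state)  # True: an untrained adapter is the identity here
print(adapter.num_parameters())               # 32 = 2 * hidden_size * bottleneck_size
```

With this initialization the adapter starts out as the identity function, so inserting it does not change the frozen model's output until training updates the adapter's own weights. The parameter count grows with the bottleneck size, which is why adapters stay small relative to the model they are attached to.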

However, the process of adapter reuse and distribution is not straightforward. Adapters are often not distributed individually; their designs differ in small but significant ways, and they depend on the model, the task, and the language. Researchers have proposed AdapterHub, a framework for seamless training and sharing of adapters, as a means to mitigate the identified problems and facilitate transfer learning using adapters in a variety of contexts.

AdapterHub is built on top of Hugging Face's popular transformers library, so it gives access to state-of-the-art pre-trained language models. It adds adapter modules to existing SotA models with minimal changes to the source code. It also provides a website for easily sharing pre-trained adapters between users.

You can access AdapterHub at http://AdapterHub.ml.

Benefits of using adapters

Task-specific Layer-wise Representation Learning: Before adapters, the full pre-trained transformer model had to be fine-tuned to achieve SotA performance on downstream tasks. By adjusting the representations at each layer, adapters have been shown to achieve results comparable to full fine-tuning.
Small, Scalable, Shareable: Transformer-based models are very deep neural networks with millions or billions of weights, so they need a lot of storage. XLM-R Large, for example, requires around 2.2 GB of compressed storage space (Conneau et al., 2020). With full fine-tuning, a separate copy of the fine-tuned model must be stored for each task. This makes it difficult to iterate and parallelize training, especially in settings with limited storage space. Adapters help with this issue. Depending on the size of the model and the size of the adapter's bottleneck, a single task may need as little as 0.9 MB of storage.
Modularity of Representations: An adapter learns to encode task-related information within a designated set of parameters. Because adapters are placed inside the model, with the surrounding parameters kept fixed, each adapter must learn an output representation that is compatible with the next layer of the transformer. This setup allows adapters to be stacked on top of one another or swapped out on the fly.
Non-Interfering Composition of Information: Sharing information across tasks has a long history in machine learning (Ruder, 2017). The most widely studied approach is probably multi-task learning (MTL), which shares a set of parameters across tasks. Despite its promise, MTL has a number of drawbacks, including catastrophic forgetting, in which information learned in earlier stages of training is "overwritten"; catastrophic interference, in which performance on a set of tasks deteriorates when new tasks are added; and complex task weighting for tasks with different distributions (Sanh et al., 2019). Because adapters are encapsulated, they are forced to learn output representations that are compatible across tasks. Each adapter stores the information from training on its downstream task in its own parameters. This allows many adapters to be combined, for example with attention (Pfeiffer et al., 2020a). Training each set of adapters separately removes the need for sampling heuristics that compensate for imbalanced dataset sizes. By decoupling knowledge extraction from knowledge composition, adapters avoid the two main drawbacks of multi-task learning: catastrophic forgetting and catastrophic interference.
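The storage argument in the list above can be checked with some back-of-the-envelope arithmetic. The sketch below reuses the two figures quoted in the text (about 2.2 GB per full model copy and about 0.9 MB per adapter). Treating 1 GB as 1000 MB and pairing the smallest quoted adapter size with this particular model are simplifying assumptions, and the function name storage_mb is hypothetical.

```python
FULL_MODEL_MB = 2200.0  # ~2.2 GB per fully fine-tuned copy (figure quoted in the text)
ADAPTER_MB = 0.9        # smallest per-task adapter size quoted in the text


def storage_mb(num_tasks):
    """Return (full fine-tuning, adapters) storage in MB for num_tasks tasks."""
    # Full fine-tuning keeps one complete model copy per task.
    full_fine_tuning = num_tasks * FULL_MODEL_MB
    # Adapters share a single frozen base model and add one small file per task.
    with_adapters = FULL_MODEL_MB + num_tasks * ADAPTER_MB
    return full_fine_tuning, with_adapters


for num_tasks in (1, 10, 100):
    full, adapters = storage_mb(num_tasks)
    print(f"{num_tasks:>3} tasks: full fine-tuning {full:>9.1f} MB, "
          f"adapters {adapters:>7.1f} MB")
```

Under these assumptions, a single task costs slightly more with an adapter, since the base model is still needed, but the gap reverses quickly as the number of tasks grows.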

Exploring adapter-transformers in the Hub

On the Hugging Face Hub's Models page, you can find over a hundred adapter-transformers models by using the filter options on the left. Further adapter models are available in the AdapterHub repository. Models from both sources are then aggregated on AdapterHub.

Using existing models

We suggest checking out the official guide for detailed instructions on loading pre-trained adapters. In short, once a model has been loaded with the standard model classes, an adapter can be loaded with the load_adapter method and then activated.

Using the pip command, you can install adapter-transformers:

pip install -U adapter-transformers

The code below loads a pre-trained adapter and activates it for a pre-trained BERT model.

from transformers import AutoModelWithHeads

model = AutoModelWithHeads.from_pretrained("bert-base-uncased")
adapter_name = model.load_adapter("AdapterHub/bert-base-uncased-pf-imdb", source="hf")
model.active_adapters = adapter_name

The AutoModelWithHeads class is imported from the transformers package, which adapter-transformers provides as a drop-in replacement for Hugging Face's library. This class loads a pre-trained model that supports adapters and prediction heads.
The from_pretrained method of AutoModelWithHeads then loads the pre-trained BERT model. The string "bert-base-uncased" identifies which checkpoint to load.
Next, the model's load_adapter method fetches the adapter. Passing the string "AdapterHub/bert-base-uncased-pf-imdb" loads a pre-trained adapter for the IMDB sentiment analysis task, and source="hf" indicates that the adapter should be retrieved from the Hugging Face model hub. Finally, assigning the returned name to model.active_adapters activates the adapter.

To list the available adapters, use the list_adapters function.

from transformers import list_adapters

# source can be "ah" (AdapterHub), "hf" (huggingface.co) or None (for both, default)
adapter_infos = list_adapters(source="ah", model_name="bert-base-uncased")

for adapter_info in adapter_infos:
    print("Id:", adapter_info.adapter_id)
    print("Model name:", adapter_info.model_name)
    print("Uploaded by:", adapter_info.username)

The first line imports the list_adapters function from the transformers library.
The list_adapters function is then called with two parameters:
source: Identifies where to look for adapters. There are three possible values: "ah" (AdapterHub), "hf" (Hugging Face model hub), and None (both, the default).
model_name: A string with the name of the pre-trained model for which adapters should be listed. Here, we pass "bert-base-uncased" to get the adapters for the pre-trained BERT model.
Finally, a for loop goes through the adapter_infos list and prints the ID, model name, and username of each adapter by reading the corresponding attributes of each AdapterInfo object.
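Once you have the list, ordinary Python is enough to organize it. The sketch below groups adapter IDs by uploader. So that it runs without installing adapter-transformers or accessing the network, it uses a hypothetical stand-in class, FakeAdapterInfo, that carries only the three attributes printed above. The sample entries are placeholders made up for illustration, not real query results, and the helper name group_ids_by_user is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeAdapterInfo:
    """Hypothetical stand-in exposing only the attributes used in the article."""
    adapter_id: str
    model_name: str
    username: str


def group_ids_by_user(adapter_infos):
    """Map each username to the list of adapter IDs that user uploaded."""
    grouped = defaultdict(list)
    for info in adapter_infos:
        grouped[info.username].append(info.adapter_id)
    return dict(grouped)


# Placeholder entries for illustration only (not real adapters).
sample = [
    FakeAdapterInfo("example-user/adapter-a", "bert-base-uncased", "example-user"),
    FakeAdapterInfo("other-user/adapter-b", "bert-base-uncased", "other-user"),
    FakeAdapterInfo("example-user/adapter-c", "bert-base-uncased", "example-user"),
]

for username, adapter_ids in group_ids_by_user(sample).items():
    print(username, adapter_ids)
```

In real code, you would pass the list returned by list_adapters instead of sample, assuming its items expose the same adapter_id and username attributes used in the loop above.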

Conclusion

The combination of adapter-transformers and AdapterHub provides a robust and time-saving way to fine-tune pre-trained language models and enable transfer learning. With these tools and resources, you can easily add adapters to your existing transformer models.
