# Graphcore on Paperspace: Introduction for Users

23 days ago   •   11 min read

Paperspace is well known for providing high-speed, easy-to-use, hardware-accelerated compute in the cloud, in the form of GPUs.

Since the initial launch of IPUs on Paperspace in August 2022, our partnership with Graphcore has upped the game by providing the same easy-to-use access to their Intelligence Processing Units, or IPUs.

In the latest phase, larger paid machines have been introduced, removing the constraints of the free tier.

While GPUs were originally built for graphics and gaming, then adapted to AI, IPUs are built from the ground up for AI, and hence provide considerably better performance than even the fastest GPUs. The latest system now available on Paperspace, the Bow-POD16, provides 5.6 petaFLOPS AI compute at half-precision floating-point (FP16).

In this blog post, we introduce IPUs on Paperspace and cover the details most relevant to those who would like to use them.

## Why are IPUs faster?

IPUs use a MIMD architecture, which means that they can deal with both multiple instructions (MI) and multiple pieces of data (MD) at the same time.

This compares to GPUs, which use SIMD: while they can also deal with large blocks of data, they execute a single instruction across that data at a time, and therefore need large blocks of data to be efficient.

The ability of IPUs to run individual processing threads on smaller data blocks, combined with their large on-chip memory that is next to the processor cores, means that they can be executing many parts of a model on many parts of the data in parallel, achieving higher performance than GPUs.
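As a purely conceptual illustration of the difference (a toy model in plain Python, not a description of how either chip schedules work), the sketch below applies one instruction to a whole block of data in the SIMD style, and a different instruction to each small block in the MIMD style. The function names and data are made up for the example.

```python
def simd_step(instruction, data):
    """SIMD style: one instruction applied across a whole block of data."""
    return [instruction(x) for x in data]


def mimd_step(instructions, blocks):
    """MIMD style: each worker runs its own instruction on its own small block."""
    return [
        [instruction(x) for x in block]
        for instruction, block in zip(instructions, blocks)
    ]


def double(x):
    return 2 * x


def square(x):
    return x * x


def negate(x):
    return -x


# One instruction, one large block.
simd_result = simd_step(double, [1, 2, 3, 4, 5, 6])

# Three different instructions, three small blocks, conceptually in parallel.
mimd_result = mimd_step([double, square, negate], [[1, 2], [3, 4], [5, 6]])

print(simd_result)
print(mimd_result)
```

In Python both versions run sequentially, of course; the point is only the shape of the work each style can express in a single step.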

Other AI-specific needs such as lower precision floating point and sparsity are also built into the hardware.

A comparison between CPUs, GPUs, and IPUs is shown in the diagram:

The latest generation of IPU, the Bow IPU used in the Bow-POD16, takes this further by using TSMC's Wafer-on-Wafer production technique. This allows power to be delivered in a closely coupled manner, enabling a higher operating frequency and delivering a 40% leap in performance as well as improved power efficiency.

### How much faster (and cheaper)?

The peak performance of a Bow-POD16 is 5.6 petaFLOPS, compared to 4.0 petaFLOPS for an IPU-POD16, or 312 teraFLOPS for an A100-80GB GPU.

So even if you had 16 A100-80GBs (and currently the most available is 8, on Paperspace Core), the Bow would be faster: 5.6 vs. 0.312*16 = 4.992 petaFLOPS.
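The arithmetic behind this comparison can be checked with a few lines of Python. This is a small sketch using only the peak figures quoted above; the variable and function names are illustrative, and peak FLOPS are not the same as sustained performance on a real model.

```python
# Peak figures quoted in the text above.
BOW_POD16_PFLOPS = 5.6   # petaFLOPS
IPU_POD16_PFLOPS = 4.0   # petaFLOPS
A100_TFLOPS = 312.0      # teraFLOPS per A100-80GB GPU


def gpus_to_pflops(n_gpus, tflops_per_gpu=A100_TFLOPS):
    """Aggregate peak performance of n_gpus GPUs, in petaFLOPS."""
    return n_gpus * tflops_per_gpu / 1000.0


sixteen_a100 = gpus_to_pflops(16)
eight_a100 = gpus_to_pflops(8)
bow_uplift = BOW_POD16_PFLOPS / IPU_POD16_PFLOPS

print(f"16 x A100-80GB: {sixteen_a100:.3f} petaFLOPS")
print(f" 8 x A100-80GB: {eight_a100:.3f} petaFLOPS")
print(f"Bow-POD16 vs IPU-POD16: {bow_uplift:.2f}x")
```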

IPUs also have much faster memory: the Bow IPU's memory bandwidth of 65TB/s compares to 2.04TB/s for an A100-80GB. The Bow-POD16 has 14.4GB of In-Processor-Memory™, made up of SRAM within the IPU, with a memory bandwidth of 65TB/s. This ultra-fast memory is complemented by DRAM chips of up to 512GB capacity, which can transfer data to the In-Processor-Memory.

IPUs are also cheaper. On Paperspace the Bow-POD16 is \$26.71/hr, giving a price of \$1.67/IPU/hr, which compares to \$3.18/GPU/hr for the A100-80GB. So this makes the best IPUs available at just over 50% of the price of the best GPUs available on Paperspace. The non-Bow IPUs are even lower priced.
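The per-device price comparison works out as follows. This is a short sketch using the prices quoted above, which may of course change over time; the variable names are illustrative.

```python
# Hourly prices quoted in the text above (US dollars).
BOW_POD16_PRICE_PER_HOUR = 26.71
IPUS_PER_POD16 = 16
A100_PRICE_PER_GPU_HOUR = 3.18

price_per_ipu_hour = BOW_POD16_PRICE_PER_HOUR / IPUS_PER_POD16
price_ratio = price_per_ipu_hour / A100_PRICE_PER_GPU_HOUR

print(f"Bow IPU:   ${price_per_ipu_hour:.2f} per IPU per hour")
print(f"A100-80GB: ${A100_PRICE_PER_GPU_HOUR:.2f} per GPU per hour")
print(f"IPU price as a fraction of GPU price: {price_ratio:.1%}")
```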

### Benchmarking?

Of course, the most complete way to measure the speed of IPU vs. GPU is to benchmark how quickly your computation of interest, such as training a model, finishes.

Unfortunately this is not a trivial task because a fair comparison would involve measuring the speed of an optimized GPU model versus that of an optimized IPU model. But the optimizations involved are very different for the two types of hardware, because their architectures differ substantially.

However, when using IPUs on our runtimes with Graphcore's software, you are already taking advantage of many optimizations built into the system, for example the composition of the model graphs. These optimizations are made at different levels. The ML frameworks are optimized for the IPU programming model, for example by using graph compilation by default. The ML models and algorithms are also optimized for the IPU, for example by working at small batch sizes and making use of model parallelism.

IPUs also provide higher throughput and lower latency for model inference, using the same hardware as for training.

So the appropriate benchmark criterion to consider is the practical reality of the speed and performance per dollar spent of your model on the IPU versus what you are getting from the GPU.
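One simple way to put this criterion into practice is to time the same workload on each machine and convert the timings to cost. The sketch below uses only the Python standard library; the helper names are hypothetical, and the timings in the usage example are placeholders for illustration, not measurements.

```python
import time


def time_run(workload, *args, **kwargs):
    """Return the wall-clock seconds taken by one call to `workload`."""
    start = time.perf_counter()
    workload(*args, **kwargs)
    return time.perf_counter() - start


def cost_of_run(seconds, price_per_hour):
    """Dollar cost of a run lasting `seconds` on a machine billed per hour."""
    return seconds / 3600.0 * price_per_hour


def compare(baseline_seconds, baseline_price, candidate_seconds, candidate_price):
    """Speedup and relative cost of a candidate machine versus a baseline."""
    baseline_cost = cost_of_run(baseline_seconds, baseline_price)
    candidate_cost = cost_of_run(candidate_seconds, candidate_price)
    return {
        "speedup": baseline_seconds / candidate_seconds,
        "cost_ratio": candidate_cost / baseline_cost,
    }


# Placeholder timings for illustration only. Substitute your own measurements,
# e.g. gpu_seconds = time_run(train_on_gpu) with your own training function.
summary = compare(
    baseline_seconds=7200.0, baseline_price=8 * 3.18,  # 8 x A100-80GB
    candidate_seconds=3600.0, candidate_price=26.71,   # Bow-POD16
)
print(summary)
```

A `speedup` above 1 means the candidate finished sooner, and a `cost_ratio` below 1 means it also cost less for the same job.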

And, if you do want to perform detailed benchmarks, Paperspace is an ideal place to do it because both GPUs and IPUs are available. You can run your own, or replicate published benchmarks such as MLPerf using the available content on our runtimes.

While the exact speedups are data- and model-dependent, various results from independent competitions and case studies point to 2x to over 10x performance improvements on real trained models when moving from GPU to IPU.

So, roughly, by moving from a common GPU to an IPU one might expect to train a model in days instead of weeks, or hours instead of days, and it costs less.

## What models are IPUs good for?

IPUs give large speedups to deep learning models in many different areas. If you convert your non-IPU code to run on the IPU, it is likely to result in a speedup.

Nevertheless, some areas are particularly strong and extensively developed:

• Natural language processing (NLP)
• Graph neural networks (GNN)
• Computer vision (CV)
• Speech processing

This is because the architecture of the IPU matches well with the architecture of these models, for example computer vision models using group or depth-wise convolutions (e.g., EfficientNet) which require more fine-grained compute capabilities, as described in more detail here. It also does well on BERT, GPT, U-Net, YOLO, and others.

For model benchmarks, see Graphcore's performance results page.

## Simplifying coding: Hugging Face's Optimum Graphcore

While IPUs give a large performance increase versus GPUs, developing a model from scratch using PyTorch or TensorFlow 2 requires some code changes, both at the library level (e.g., torch -> poptorch) and when defining the model (e.g., the loss computation has to be part of the forward() function in PyTorch).

Often, these are straightforward drop-in library or Python class replacements, with appropriate changes of arguments, but when things become more advanced or customized, changes to the model implementation to enable IPU optimizations can multiply the complexity.

This provides a strong motivation for a library that can simplify the usage of IPUs by organizing or abstracting what is needed, and this is the purpose of Hugging Face's Optimum for Graphcore (HF Optimum). Forming the interface between the HF Transformers library and Graphcore, this allows models supported by Transformers to be used on IPUs.

Typical code changes are to replace the transformers import of Trainer and TrainingArguments with the optimum.graphcore equivalents, plus the class to configure the IPUs. Then the IPU config needed is added to the code. Hugging Face's documentation shows an example:
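A minimal sketch of that pattern follows, assuming the `optimum.graphcore` package is installed (as it is on the Hugging Face Optimum runtime) and using `Graphcore/bert-base-ipu` as an example IPU configuration. The helper `build_ipu_trainer`, its arguments, and the hyperparameter values are placeholders for illustration, not part of the library.

```python
try:
    # Drop-in replacements for transformers.Trainer and TrainingArguments,
    # plus the class that configures the IPUs.
    from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments
except ImportError:  # optimum-graphcore is not installed on this machine
    IPUConfig = IPUTrainer = IPUTrainingArguments = None


def build_ipu_trainer(model, train_dataset, output_dir,
                      ipu_config_name="Graphcore/bert-base-ipu"):
    """Hypothetical helper: wire a Transformers model to an IPUTrainer."""
    if IPUTrainer is None:
        raise RuntimeError("optimum.graphcore is not installed")

    # The IPU config that is added to the code.
    ipu_config = IPUConfig.from_pretrained(ipu_config_name)

    training_args = IPUTrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=4,  # illustrative value
        learning_rate=1e-4,             # illustrative value
    )
    return IPUTrainer(
        model=model,
        ipu_config=ipu_config,
        args=training_args,
        train_dataset=train_dataset,
    )
```

With a model and dataset prepared in the usual Transformers way, training would then be started with `build_ipu_trainer(model, train_dataset, "./outputs").train()`.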

You can also use the pipeline API, which is even fewer lines of code, e.g.,

```python
from optimum.graphcore import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    ipu_config="Graphcore/distilbert-base-ipu",
)

classifier("We are very happy to introduce pipeline to the transformers repository.")
```

```
[{'label': 'POSITIVE', 'score': 0.9996947050094604}]
```


The large variety of transformer models already supported by the library means that working with these on IPU becomes an excellent option for many use cases.

In addition to the Transformers library, Hugging Face’s Diffusers library can support the IPU, as shown here, which enables diffusion models such as Stable Diffusion 2.0 to run on the IPU.

### Other Models

If the models you want to use are not in HF Optimum, you can port regular PyTorch or TensorFlow 2 code to IPU. The changes may be as simple as drop-in library replacements, or more complex.

For an introduction to the code changes needed, see here or here for PyTorch, and here for TensorFlow 2.

Another route to simplifying code in the general case for PyTorch is PyTorch Lightning, which organizes raw code into classes such as data preparation and models, and includes IPU support.

## IPUs on Paperspace

IPUs are the most powerful hardware available on Paperspace.

The types of IPU machines available are:

• Free IPU-POD4
• IPU-POD4
• IPU-POD16
• Bow-POD16

As might be expected, the IPU-POD16 consists of 16 IPUs, giving 4 petaFLOPS of performance as indicated earlier. The IPU-POD4 is 4 IPUs, so it will run proportionately slower, but it is also cheaper. This is good for those users who do not need 16 IPUs.

The Bow-POD16 is the latest-and-greatest addition to Graphcore's lineup, made available here and giving 5.6 petaFLOPS.

### Ease of use

A major aim of the Paperspace-Graphcore partnership is to combine making IPUs available to users with making them easier to use.

This is achieved via several avenues:

• No setup required: Gradient's runtimes with IPUs, notebooks, etc., are ready to go so you can start running IPUs right away. Previously, users have had to install the Poplar SDK themselves, as well as any other libraries needed.
• Runtimes for common tools: Three runtimes are available (Hugging Face Optimum, PyTorch, and TensorFlow 2), each containing installs of these libraries so you can start working with IPUs immediately.
• Curated set of examples in notebooks: A set of examples in Jupyter .ipynb notebooks has been newly curated and presented on Gradient, ready to run. Notebooks' ability to present inline commentary, data content, and visualization, while at the same time allowing the user to run the code and see its outputs, makes it easier to understand deep learning on IPUs.
• Hugging Face Optimum: As detailed above, this library can greatly simplify the coding for IPUs for models that it supports, and comes ready to use on Paperspace.
• Public datasets: Large datasets used in the example material are already on Gradient as public datasets, meaning that they do not have to be downloaded by the user.

### How to Run

To run IPUs, simply start up a Gradient Notebook in the usual way. Select a runtime showing IPU, then use the default settings or adjust them as needed.

Alternatively, you can instantly launch a specific notebook by exploring the full range of IPU-powered notebooks and sample models tabulated below, or here.

| Notebook | Framework |
| --- | --- |
| Stable Diffusion Image-to-Image Generation on IPU | HF Optimum |
| Stable Diffusion Text-to-Image Generation on IPU | HF Optimum |
| Stable Diffusion Text Guided In-Painting on IPU | HF Optimum |
| Training a ViT Hugging Face model in PyTorch using the IPU using your own dataset | PyTorch (on HF model) |
| Fine-tuning for Image Classification with Hugging Face Optimum on IPU | HF Optimum |
| BERT-Large Fine Tuning on IPU | PyTorch |
| Fast sentiment analysis using pre-trained models on Graphcore IPU | HF Optimum |
| Text Generation with GPT2 using IPUs | PyTorch (coming soon) |
| Inference for Named Entity Recognition with BERT on IPU | HF Optimum (coming soon) |
| Fine-Tuning for Named Entity Recognition with BERT on IPU | HF Optimum (coming soon) |
| Fine-tuning a model on a multiple choice task on IPU | HF Optimum |
| Fine-tuning a model on a summarization task on IPU | HF Optimum |
| Fine-tuning a model on a text classification task on IPU | HF Optimum |
| Fine-tuning a model on a token classification task on IPU | HF Optimum |
| Fine-tuning a model on a translation task on IPU | HF Optimum |
| Training large graphs efficiently with Cluster-GCN on IPU | TensorFlow2 |
| Training Dynamic Graphs with Temporal Graph Networks (TGN) on IPU | PyTorch |
| Prediction of molecular properties using SchNet on IPU | PyTorch |
| Predicting of molecular properties using GPS++ on IPU (OGB-LSC) | TensorFlow2 |
| Training for molecular property prediction using GPS++ on IPU (OGB-LSC) | TensorFlow2 |
| Link prediction training for knowledge graphs using Distributed KGE on IPU (OGB-LSC) | PyTorch (coming soon) |
| Fine-tune a wav2vec 2 checkpoint for Automatic Speech Recognition (ASR) on IPU | HF Optimum |
| Running Automated Speech Recognition (ASR) using a fine-tuned wav2vec 2.0 checkpoint on IPU | HF Optimum |

For users who would like to work in interfaces other than Gradient's own for notebooks, these are available:

• JupyterLab interface
• Remote kernel from an IDE, e.g., VS Code
• Terminal, e.g., for .py scripts instead of .ipynb notebooks, and the gc-monitor command to see IPU usage (or download PopVision Graph Analyser)
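From a terminal or a .py script, IPU usage can also be captured programmatically. The sketch below is one possible approach using only the Python standard library: it looks for the `gc-monitor` command on the PATH and returns its output, or `None` if the command is not available (for example, on a machine without the Poplar SDK). The helper name `run_monitor` is made up for this example.

```python
import shutil
import subprocess


def run_monitor(command="gc-monitor", timeout=30):
    """Run an IPU monitoring command and return its text output.

    Returns None if the command is not found on the PATH.
    """
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run(
        [path], capture_output=True, text=True, timeout=timeout, check=False
    )
    return result.stdout


if __name__ == "__main__":
    report = run_monitor()
    if report is None:
        print("gc-monitor was not found; is this an IPU runtime?")
    else:
        print(report)
```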

Combining the Advanced Options available when creating a Gradient Notebook with these interfaces allows you to run any combination of GitHub repository and Docker container from your preferred interface. This opens up the rest of Graphcore's content in their examples, tutorials, and HF Optimum repos that is in .py form rather than notebooks, as well as other Docker containers from their Docker Hub.

IPU developers can also run their own notebooks directly by selecting their own Docker image (or one of Graphcore’s) and/or pointing to their own GitHub repository under Advanced Options when creating a Gradient Notebook.

Some parts of Paperspace remain to be integrated with IPU hardware, for example, Core machines, and incorporation into Workflows and Deployments, but this is all in scope for future releases and coming very soon!

### Example content

Content is supplied in the usual manner for Paperspace Gradient, via runtimes. These combine a container (typically Docker) with a repository (typically GitHub) to allow you to begin coding immediately within a defined environment that has your desired installations and files present.

These are the main examples that are presented as .ipynb Jupyter notebooks. Many of these are quite extensive real-world or real-world-like dataflows and models.

For full details of the models available, see the list.

Hugging Face Optimum runtime

• Natural language processing (NLP): 10 notebooks, including Optimum, sentiment analysis with the BERT-Large model, question answering, translation, classification, and summarization
• Stable Diffusion: Image-to-image, inpainting, and text-to-image
• Image classification: Vision transformers (ViT)
• Audio processing: Wav2Vec2 for speech recognition
• Managing IPU resources

PyTorch runtime

• Knowledge Graph Embedding (KGE)
• Fine-tuning BERT-Large for question answering (NLP)
• PyG SchNet-GNN graph neural network using PyTorch Geometric (PyG)
• Temporal graph networks (TGN) for a dynamically evolving graph
• Vision Transformers (ViT) for image classification
• Learning PyTorch on IPU: Basics, efficient data loading, mixed precision data, and IPU pipelining
• Managing IPU resources

TensorFlow 2 runtime

• Learning TensorFlow 2 on IPU: Keras, MNIST, and TensorBoard
• Cluster-GCN graph neural network for large graphs
• GPS++ training and inference, which won the Open Graph Benchmark Large-Scale Challenge

The corresponding GitHub repositories accessed by the runtimes are: Hugging Face Optimum, PyTorch, and TensorFlow2, using containers from the Graphcore Docker Hub.

## Conclusion

We have introduced IPUs on Paperspace for users, and discussed:

• Why IPUs are faster than GPUs, and by how much
• What models IPUs are good for: NLP, GNN, computer vision, and many others
• Simplifying coding: Hugging Face Optimum
• Running IPUs on Paperspace

## Next steps

Get started on Gradient by signing up, creating a project, then creating a Gradient Notebook using the Hugging Face Optimum on IPU, PyTorch on IPU, or TensorFlow 2 on IPU runtimes.

For more on Paperspace + Graphcore, see our page or their page.

Graphcore also has a large amount of comprehensive and well presented information online about their hardware, software, and using IPUs for deep learning.

Some good entry points to Graphcore's content are:

Note that some of these refer to generic machine setup steps, such as installing the Poplar SDK, that are not required on Paperspace.

We plan to publish further entries on other IPU topics on Paperspace that dive into various areas in more detail, so keep an eye out for those!