Comparing the Ampere GPUs available on Paperspace

Follow this guide to learn what makes the Ampere GPUs so powerful. We will then show how the A4000, A5000, and A6000 are the most cost effective GPU offerings at Paperspace.

2 years ago   •   10 min read

By James Skelton
Table of contents

Ampere GPUs like the A100 have received excellent press, and for good reason, since entering the GPU market for their incredible capabilities in machine and deep learning tasks. Thanks to the advances in their generation's technology, they are consistently able to match and outperform older GPUs with some similar specifications, like GPU memory, at a much lower cost to run the other machine types. Even compared to powerful Tesla GPUs, they are comparable in performance at nearly half the price.

Selecting what GPU to use for any given task is always a challenge. This blog post will cover the advantages of using the Paperspace offerings for the Ampere RTX GPUs, the A4000, A5000, and A6000, over other GPU's with similar GPU memory values. Since this is a common first metric for users to base their choices on, we will also suggest some other statistics to consider when choosing the best GPU for your ML/DL task. We'll start by examining the Ampere RTX GPU architecture and its innovations, before jumping into a breakdown of the comparative abilities of these GPUs in practice with other generation GPUs with similar capabilities.

The Nvidia Ampere Quadro RTX Architecture

The Nvidia Ampere RTX and Quadro RTX GPU line was created to bring the most powerful technologies to professional visualization, and significantly enhance the performance of RTX chips for tasks such as AI training or rendering. The Quadro RTX series was originally based on the Turing microarchitecture, and features real-time raytracing. This is accelerated by the use of new RT cores, which are designed to process quadtrees and spherical hierarchies, and speed up collision tests with individual triangles. Ampere GPU memory also comes with error-correction code (ECC) to run processing tasks without compromising computing accuracy and reliability. Thanks to these upgraded features, the Ampere label has become synonymous with cutting edge GPU technology; something which has only continued further with the development of more advanced machines in the line, like the A100.

The Ampere RTX GPUs represent the second generation of the Quadro RTX technologies. The architecture builds on the power of the predecessor RTX technology to significantly enhance the performance of rendering, graphics, AI, and compute workloads over previous generations. These updated features include:

  • Second-generation Ray Tracing cores: As shown by their Deep Learning Super Sampling paradigm, RT cores can massively boost frame rates and help to generates sharp images in visualization task using the DLSS NN. This can be very useful for designers working with graphics on Paperspace Core machines.
  • Third generation tensor cores: "New Tensor Float 32 (TF32) precision provide up to 5X the training throughput over the previous generation to accelerate AI and data science model training without requiring any code changes." (1)
  • CUDA cores: CUDA cores in Ampere GPUs are up to 2 x as power efficient. These enable double speed processing for single-precision FP 32 operations, and enable significant performance gains for Deep Learning tasks
  • PCI Express Gen 4.0: PCIe Gen 4.0  provides 2X the bandwidth of PCIe Gen 3.0, which improves data transfer speeds from CPU memory for data-intensive tasks such as AI and data science. (1) One benchmark article demonstrated this by showing "the read data is 61% faster for the PCIe gen 4 versus the PCIe gen 3 and the write data is 46% faster for the PCIe gen 4." (2)
  • Third-Generation NVLink: NVLink enables 2 GPUs to connect and share capabilities and performance for a single task. "With up to 112 gigabytes per second (GB/s) of bidirectional bandwidth and combined graphics memory of up to 96 GB, professionals can tackle the largest rendering, AI, virtual reality, and visual computing workloads." (1)

Comparing Ampere RTX GPUs with others in their class

At Paperspace, you can access 4 different Ampere GPUs: the A4000, A5000, A6000, and A100. Each of these can be used in multi-GPU instances up to x 2 as well. We concluded the A100 was the best GPU for Deep Learning tasks in our recent computer vision benchmarking report in terms of speed and power, but noticed that the other Ampere RTX GPUs frequently were more cost effective to use in practice. In the interest of covering new material, we will focus more deeply on the A4000, A5000, and A6000 in this article, and demonstrate why these machines will often be the best choices for your Deep Learning tasks.

Ampere GPU Specifications

Here is a table comparing the Ampere RTX GPUs with others of comparable performance in terms of GPU memory. Since there is no comparable GPU for the A6000, in terms of GPU memory, available at the moment, we opted compare it with the A100.

GPUA4000RTX5000V100 - 16GBA5000P6000V100 - 32GBA6000A100
GenerationAmpereTuringVoltaAmperePascalVoltaAmpereAmpere
CUDA cores6,1443,0725,1208,1923,8405,12010,7526,912
GPU Memory (GB)1616162424324840
Single precision performance in TeraFLOPS (SP FP32) 19.211.214.027.812.014.038.719.5
Memory bandwidth (GB/s)4484489007684329007681,555
vCPU888888812
Memory4832324832324897
Storage incl. (GB)5050505050505050
Price per hour$0.76$0.82$2.30$1.38$1.10$2.30$1.89$3.09
Price per month (usage only no subscriptions)$0$0$0$0$0$0$0$0
Price of subscription required for this instance$8.00$8.00$39.00$39.00$8.00$39.00$39.00$39.00
Monthly price usage plus subscription$8$8$39$39$8$39$39$39
Available to be chosenyesyesyesyesyesyesyesyes
$ per GB/minute (Throughput) $0.10$0.11$0.15$0.11$0.15$0.15$0.15$0.12
$ per 100 CUDA cores (H)$0.01$0.03$0.04$0.02$0.03$0.04$0.02$0.04
$ per Memory (GB)$0.05$0.05$0.14$0.06$0.05$0.07$0.04$0.08
$ per TeraFLOPS SP 32$0.42$0.71$2.79$1.40$0.67$2.79$1.01$2.00

The reasons for Ampere RTX GPU's incredible performance can be shown numerically in three places: each Ampere GPU's contain more CUDA cores than comparable other machines, significantly higher single precision performance values, and, less importantly, more CPU memory. More CUDA cores translate directly to being capable of processing more data in parallel at any given time, the single precision performance reflects how many floating point operations per second can be done in TeraFLOPS, and the amount of CPU memory aids in other processes outside the GPU like data cleaning or plotting. Together, we can start to build our case for recommending Ampere RTX machines be used wherever possible.

In addition to these specs, here are the corresponding benchmarks for the selected GPUs from our recent benchmark report. These all show a measure of time to completion, so that we can better understand how these differences in specs manifest in practice.

A4000RTX5000V100-16GBA5000P5000V100-32GBA6000A100
YOLOR (s)18.65820.01619.33514.10760.97617.044713.87812.421
StyleGAN_XL (s)106.66710785.66783295105.33387.33388.333
EfficientNet (s)925867801769OOM*690627528
Cost ($ per hour)$0.76$0.82$2.30$1.38$0.78$2.30$1.89$3.09
YOLOR single run cost**$0.004$0.005$0.012$0.005$0.013$0.011$0.007$0.011
StyleGAN_XL single run cost**$0.02$0.02$0.05$0.03$0.06$0.07$0.05$0.08
EfficientNet single run cost**$0.20$0.20$0.51$0.29Fail!$0.44$0.33$0.45

*OOM: Out of Memory. This indicates the training failed due to lacking memory resources in the kernel.

** Single run cost: Time in seconds was converted to hours, and then multiplied by the cost per hour. Reflects the cost of a single run of the task on the GPU.

The YOLOR benchmark measured the time to generate image detections on a short video, the StyleGAN XL task was for training a single epoch on a dataset of Pokemon images, and the EfficientNet benchmark was a single training epoch on tiny-imagenet-200. We then compare each time to completion value based on their per hour cost. For the full benchmarking report, visit this link.

When to use the A4000:

As shown above, the A4000 performs very well both in terms of specs, cost metrics, and benchmark times. At 0.76 USD per hour, it also holds the spot as the most affordable option in this guide. It also has the highest CUDA core count, SP FP32, and CPU memory in its category.

While throughput is comparatively low, the A4000 only took 115.48% as long to train the EfficientNet benchmarking model with tiny-imagenet-200, and 124.51% as long to train the StyleGAN XL model when compared to the V100- 16 GB. For the detection task with YOLOR, it was actually faster. This indicates that the A4000 instance offers an excellent budget choice when compared to the V100 - 16GB. Moreover, the cost over throughput, cost per 100 CUDA cores, and cost over the SP FP32 value is the best of all the options in the table at large.

While the RTX5000 is nearly at the same performance and cost, you will still be getting a better value with the A4000. This will become more apparent as tasks become more complex for the machine, judging by the differences in SP FP32 and CUDA cores. Based on this cumulative evidence, we can recommend the A4000 as the first choice for any user seeking the most cost effective GPU on Gradient, running for less than 1 dollar per hour.

When to use the A5000

At a cost of 1.38 USD per hour, the A5000 represents an interesting place in the middle of the range of costs for Paperspace machines. Nonetheless, the A5000 has the second highest values for CUDA cores and SP 32 in the listing. It also has the third highest GPU memory, after the V100 32 GB and A6000.  

In practice, the A5000 performed exceptionally well on all the benchmarks, especially in comparison to others in its category. It eclipsed the detection time of the P5000, showcasing the significant difference in capability between the older Pascale and newer Ampere Quadro machines.

The A5000 outperformed the V100 32GB on all tasks except EfficientNet training, where it's training time was 111.44% of that of the V100 32 GB.  Surprisingly, it actually outperformed both the A100 and A6000, on average, for the StyleGAN XL tasks, as well. This is likely a fluke owing to a peculiarity in the training task itself, but still offers an interesting demonstration of how truly quick the Ampere architecture can be at performing these highly complex tasks.

In terms of cost, the A5000 is much more efficient than the other options in its category. It has overall the third cheapest average cost to make a single run of any of the training or detection tasks, and the third best ratios of cost over throughput and cost over 100 CUDA cores.

All in all, the data suggests that the A5000 represents an excellent middle choice between the most powerful GPUs on Gradient, like the A100, A6000, and V100, 32GB and the weakest, like the RTX4000 and P4000. For 1.38 USD per hour, it is a direct upgrade to the A4000 for an additional 62 cents per hour. The increases in speed and memory over the A4000, and therefore over all available budget GPUs and all others in its class, make the A5000 an excellent choice for more complex tasks like training an image generator or object detector on a budget. We can therefore recommend that users try the A5000 when both speed and budget must be considered in equal measure, as the increase in speed, and therefore decrease in overall usage time, will not totally offset the increased cost over the A4000.

When to use the A6000

The A6000 is, situationally, the single best GPU available on Gradient, and the A100 is the only competition. While the A100 has 8 less GB of RAM, it has a significantly higher throughput thanks to its upgraded architecture. It's this incredible throughput that makes the A100 so amazing for Deep Learning tasks, but at 3.09 USD per hour, it is the most expensive single GPU to use on Gradient (while still being cheaper than A100's with competitors).

To aid in understanding where the A6000 would best be used, here is a small selection of the benchmark reports for max batch size for training:

GPUA6000A100V100-32GB
Cost ($ per hour)$1.89$3.09$2.30
StyleGAN_XL (max batch size)323216
StyleGAN_XL ($ per batch item)0.05906250.09656250.14375
EfficientNet (max batch size)256128128
EfficientNet ($ per batch item)0.00738281250.0241406250.01796875

The 48 GB of GPU memory and 10,752 CUDA cores makes the A6000 the most capable GPU for handling large inputs and batch sizes. This also indicates the A6000 GPU is the best GPU on Gradient for performing parallel processing. Furthermore,  the A6000 was significantly more cost effective in terms of processing cost per batch item than its competitors, thanks in part to this ability to handle massive batch sizes.

At 1.89 USD per hour, the A6000 is 1.20 cents cheaper per hour to run than the A100. At nearly 2/3 the cost, the A6000 should always be considered as an alternative to the A100 when cost is being considered. The reasoning for this is reflected in the time to completion benchmark times. The YOLOR task and EfficientNet tasks were only 111.73 % and 118.75% as lengthy to run on the A6000 as the A100, and the StyleGAN XL training was actually faster on the A6000. It is therefore easy to see how running a model that takes ~115% as long to train for 61.16% of the cost would be an excellent tradeoff.

Therefore, we can infer from this evidence that the A6000 is a very cost effective alternative to the slightly faster A100. We recommend users use the A6000 when working on particularly complex tasks, large batch sizes, and especially when doing lengthy training on a model. The A6000 may be even better than the A100 for tasks with particularly large input sizes, in particular.

Concluding thoughts

In this blog post, we examined the Ampere RTX GPU architecture before comparing three of these GPUs, the A4000, A5000, and A6000, with other Paperspace GPUs with similar GPU memory values. We saw how the Ampere RTX architecture was improved over previous generation microarchitecture's, and then compared their specifications and benchmark results with these other, previous generation GPUs to see how they compare in real tasks.

We found that the A4000 was the best GPU on Paperspace for users seeking a machine costing under 1 USD per hour. The performance is still fantastic at a fraction of the cost of the V100 16 GB, which has similar specifications. Next, we determined the A5000 is an excellent middle ground GPU. When speed and cost must be considered in equal measure, then an A5000 is an excellent first choice. Finally, the showed that the A6000 can be situationally the most powerful GPU on Paperspace, at less than 2/3rds the cost of the A100. We recommend the use of the A6000 when working on particularly complex or lengthy tasks on Gradient, like training an object detection model like YOLOR.

For more information about Paperspace machines, visit our docs here.

For more information about Nvidia Ampere and Ampere RTX GPUs, visit the following links:

Thanks for reading!

Spread the word

Keep reading