Reliability and performance at scale

Recent and upcoming releases are focused on improving reliability and performance of GPU machines

5 months ago   •   1 min read

By Daniel Kobran

The Paperspace cloud has grown massively over the past few years. We now support 600K+ users and are approaching 100M hours of GPU compute served.

As we've grown, we've hit some scaling snags along the way. We are fully aware that outages and bugs are not acceptable, especially as more and more of our user base is running in production.

We've already shipped some improvements behind the scenes to enhance virtual machine health checks and alerting. We're seeing a drop in error rates, faster spin-up and spin-down durations in Core and Gradient, and better kernel performance in Gradient Notebooks. In the Paperspace console we've surpassed a 99.85% sustained crash-free session rate and the trend is continuing upward.

Console data from May 2022

We're working around the clock to address the remaining reliability and performance issues head-on. Over the coming months, look out for impactful changes on issues big and small. We'll be shipping improvements across hardware and software, including a rewrite of the billing engine, which has been a persistent thorn in our side for some time.

We'll be keeping these improvements coming behind the scenes over the next couple of releases. Thanks for your support as we scale the world's best cloud for accelerated computing.

💜 PS Engineering
