[12/2/2021 Update: This article contains information about Gradient Experiments. Experiments are now deprecated, and Gradient Workflows has replaced its functionality. Please see the Workflows docs for more information.]
The machine learning space is arguably one of the most advanced industries in the world, and yet, ironically, the way it operates is reminiscent of software development in the '90s.
Here are a few examples:
- An absurd amount of work happens on siloed desktop computers (even to this day).
- There's practically zero automation (e.g. automated tests).
- Collaboration is a mess.
- Pipelines are hacked together with a constellation of home-rolled, brittle scripts.
- Visibility into models deployed to development, staging, and production is practically non-existent.
- CI/CD is a foreign concept.
- Model versioning is difficult for many organizations to even define.
- Rigorous health and performance monitoring of models is extremely rare.
And the list goes on.
Sound familiar? The entire industry seems to be stuck in R+D mode, and needs a clear path to maturity.
When we started building Gradient, the vision was to take the past several decades of software engineering and DevOps best practices and apply them to the machine learning industry. We call this CI/CD for machine learning.
As a team made up primarily of software engineers, this analogy seems inevitable and perhaps even somewhat obvious. Yet, with some exceptions here and there, the industry has not matured to the point where ML in practice functions as efficiently and with as much rigor as software development. Why is this? We tend to take for granted how rich the ecosystem around the software industry has become. When software teams deploy each release, they rely on dozens of tools: GitHub, CircleCI, Docker, automated testing suites, health and performance monitoring tools like New Relic, etc. In contrast, there are very few comparable tools in the machine learning industry. We believe the lack of best practices for machine learning is not the result of a lack of will, but rather a lack of available tooling.
Buy vs Build?
Not only are many companies not equipped to develop an entire machine learning platform on their own, it's also extremely inefficient for every company in the world to build their own platform from scratch. Even the hyperscalers, which generally have spare cycles and the expertise to build internal tools, probably shouldn't rebuild GitHub. What a waste of time that would be. Companies should focus on building things internally that are in their wheelhouse and that produce some sort of competitive advantage (intellectual property or otherwise). Rebuilding GitHub checks neither of those boxes. The same goes for building a machine learning platform. This is why we've been so busy building Gradient: we can solve problems once and instantly ship new capabilities to thousands of organizations, so they can focus 100% of their resources on developing models instead of developing tools.
Introducing the concept of MLOps
Machine Learning Operations (MLOps) is a set of practices that provide determinism, scalability, agility, and governance in the model development and deployment pipeline. This new paradigm focuses on four key areas within model training, tuning, and deployment (inference): machine learning must be reproducible, it must be collaborative, it must be scalable, and it must be continuous.
Machine learning should be reproducible
For any modern software team, it would be trivial to view the individual components (code + dependencies) of a release from a year ago and re-deploy that exact version into production. In contrast, with today's lack of tooling, reconstructing a machine learning model from a year ago (to within a few percentage points of accuracy) is typically next to impossible. Reproducibility requires traceability covering all of the inputs:
- the dataset that was used
- the version of the machine learning framework
- the code commit
- the dependencies/packages
- the driver version and low-level libraries like CUDA and cuDNN
- the container or runtime
- the parameters used to train the model
- the device it was trained on
- ML-specific inputs such as the initialization of layer weights
Whether it's for regulatory purposes or simply because your organization values having a record of what it has developed and made available to customers and internal stakeholders, reproducibility is paramount. The path towards reproducible machine learning is a philosophical shift away from ad-hoc methodologies towards a more deterministic way of working.
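To make that shift concrete, here is a minimal sketch (plain Python, standard library only) of snapshotting the traceability inputs listed above into a run manifest. The file name and field names are illustrative, not any particular platform's format:

```python
import json
import platform
import subprocess
import sys

def _cmd(args):
    """Run a command and return its stdout, or "" if it isn't available."""
    try:
        return subprocess.run(args, capture_output=True, text=True).stdout.strip()
    except OSError:
        return ""

def capture_run_manifest(params, dataset_version):
    """Snapshot the inputs needed to reproduce a training run."""
    return {
        "python_version": sys.version,
        "platform": platform.platform(),
        # Exact code state: commit hash of the current repository.
        "git_commit": _cmd(["git", "rev-parse", "HEAD"]),
        # Pinned dependencies, equivalent to `pip freeze`.
        "packages": _cmd([sys.executable, "-m", "pip", "freeze"]).splitlines(),
        "dataset_version": dataset_version,  # e.g. a data-versioning tag
        "hyperparameters": params,
    }

manifest = capture_run_manifest({"lr": 1e-3, "seed": 42}, dataset_version="v2.1")
with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

A real platform would also record the container image digest, the hardware device, and framework-level seeds; the point is that every input becomes a queryable record rather than tribal knowledge.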
Machine learning should be collaborative
Working on a Python project and producing models on siloed workstations (or even a lone AWS instance) is an anti-pattern. If you're an ML team of one you can get away with this, but the moment you put a model into production or have a handful of contributors attempting to work together, this strategy quickly falls apart. The lack of a collaborative environment becomes especially problematic as the number and complexity of models increases.
Tactically, collaboration begins with having a unified hub where all activity, lineage, and model performance is tracked. This includes training runs, Jupyter notebooks, hyperparameter searches, visualizations, statistical metrics, datasets, code references, and a repository of model artifacts (often referred to as a model repository). Layering in things like granular permissions for team members, audit logs, and tags is also essential.
Ultimately, organization-wide visibility and realtime collaboration are essential to modern ML teams just as they are to software teams. This methodology should span each phase of the model lifecycle, from concept and R+D through testing and QA, all the way to production.
Machine learning should be scalable
This one is a bit wordy, but the concept is simple: in contrast with software engineering, machine learning in practice requires a large (and sometimes massive) amount of computational power and storage, and often requires esoteric infrastructure like purpose-built silicon (e.g. GPUs, TPUs, and the myriad other chips entering the market). Machine learning engineers need an infrastructure abstraction layer that makes it easy to schedule and scale workloads without years of experience in networking, Kubernetes, Docker, storage systems, etc. These are all major distractions from the actual modeling work.
Some examples of the value of infrastructure automation:
- Multi-cloud: A machine learning platform should make it trivial to train a model on-premise and seamlessly deploy that model to the public cloud (or vice versa).
- Scaling workloads: As computational demands increase, training or tuning models across multiple compute instances becomes essential. Plumbing shared storage volumes into a distributed fleet of containers running on heterogeneous hardware and connected via an MPI or gRPC message bus is not something you want your machine learning engineers spending cycles on.
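To make the abstraction-layer idea concrete, here is a minimal sketch of a declarative job spec in Python. The field names and machine type are hypothetical, not any real scheduler's API; the idea is that the engineer declares *what* to run and the platform handles containers, networking, and storage:

```python
from dataclasses import dataclass

@dataclass
class TrainingJob:
    """Declarative spec for a distributed training job: the engineer
    states what to run; the platform provisions nodes, mounts storage,
    and wires up inter-worker communication."""
    image: str                    # container pinning framework, CUDA, cuDNN
    command: str                  # entrypoint inside the container
    node_count: int = 1           # >1 implies a distributed fleet
    instance_type: str = "GPU+"   # hypothetical machine-type name
    shared_volume: str = "/data"  # dataset mounted into every worker

# Four GPU nodes, one shared dataset volume, a single pinned container image.
job = TrainingJob(
    image="tensorflow/tensorflow:2.4.1-gpu",
    command="python train.py --epochs 10",
    node_count=4,
)
```

Whether the job lands on-premise or in a public cloud becomes a scheduling detail rather than something baked into the training code.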
Ultimately, when ML teams can operate with full autonomy and own the entire stack, they are much more efficient and agile. Data scientists need access to on-demand compute and storage resources so they can iterate faster in the training, tuning, and inference phase. With MLOps the entire process is infrastructure-agnostic, scalable, and minimizes complexity for the data scientist.
Machine learning should be continuous
For an industry that lives at the forefront of automation (from chatbots to autonomous vehicles), there is remarkably little automation in the production of ML models themselves.
Many years ago, the software development industry consolidated around a process called CI/CD, in which a code merge by an engineer triggers a series of automated steps. In a basic pipeline, the application is automatically compiled, tested, and deployed, and once deployed it is standard to rely on automated health and performance monitoring. Concepts like these are critical to application reliability and developer velocity. Unfortunately, many of them have no equivalent in the machine learning industry. The result is huge inefficiency, as highly-paid data scientists spend the bulk of their time on repetitive, tedious, and error-prone manual tasks. Check out our post on CI/CD for machine learning to learn more about what this looks like in practice.
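As a sketch of what one automated step could look like in an ML pipeline, the snippet below implements a quality gate that decides whether a candidate model is promoted. The threshold, function names, and "deploy"/"reject" outcomes are illustrative, not any specific product's API:

```python
def passes_quality_gate(candidate_accuracy, baseline_accuracy, tolerance=0.01):
    """Automated check run on every pipeline build: the candidate model
    may not regress more than `tolerance` below the current baseline."""
    return candidate_accuracy >= baseline_accuracy - tolerance

def ci_step(candidate_accuracy, production_accuracy):
    """Decide the pipeline outcome, mirroring a deploy stage in software CI."""
    if passes_quality_gate(candidate_accuracy, production_accuracy):
        return "deploy"   # promote the artifact to the model repository
    return "reject"       # fail the build and notify the team
```

In a full pipeline this gate would sit between automated retraining and deployment, replacing the manual "eyeball the metrics and copy the weights over" step.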
Ultimately, the time it takes to move from concept to production and deliver business value is a major hurdle in the industry. That's why we need good MLOps practices designed to standardize and streamline the lifecycle of ML in production.
DevOps as a practice ensures that the software development and IT operations lifecycle is efficient, well-documented, scalable, and easy to troubleshoot. MLOps incorporates these practices to deliver machine learning applications and services at high velocity.