[12/2/2021 Update: This article contains information about Gradient Experiments. Experiments are now deprecated, and Gradient Workflows has replaced its functionality. Please see the Workflows docs for more information.]
If you follow the emerging discipline of MLOps, by now you've probably heard of some of the well-known internally developed ML platforms like Uber's Michelangelo and AirBnB's BigHead. The big technology companies are early adopters of ML and have invested enormous resources in developing sophisticated proprietary platforms that help them develop and deploy models at scale. The decision to build out their own platforms makes a lot of sense for two key reasons:
- They possess vast engineering resources to devote to building (complex) internal tools.
- They were investing in ML before any viable off-the-shelf platforms were available.
So when does buying make more sense than building?
Not all critical software platforms are built in-house. Source Control Management (eg GitHub) and CI/CD pipelines (eg Travis CI) are two examples that come to mind.
This raises the question: where do you draw the line? One popular approach is to "build" (versus "buy") when the specific problem being solved is in the company's wheelhouse. For example, if you're an online retailer, it may be strategic to build a custom ecommerce checkout system because it becomes a form of intellectual property (read: competitive advantage). However, this isn't always the case. Building and maintaining tools places a significant burden on the company, creates technical debt, and diverts resources away from other potentially more valuable efforts (a classic example of opportunity cost). If you buy a solution instead, the vendor will be laser-focused on this one problem, and that focus often leads to better products. Most importantly, the vendor will continue to innovate and improve its product, adding more and more value to your organization over time. Conversely, internal tools are often perennially out of date, clunky, unreliable, and a costly distraction in terms of resource allocation and operational efficiency.
And despite all the hype and media coverage that Uber's Michelangelo has received, there's even a case to be made that the Ubers and AirBnBs of the world should opt for buying an ML platform now that off-the-shelf solutions are viable. As long as a platform is sufficiently extensible to accommodate the needs of an organization, the cost-value equation will rarely justify building an internal tool.
What's involved in building an end-to-end ML platform?
Building and deploying a toy ML application is fairly trivial. Training large models on large datasets and deploying them reliably (eg serving 1000s of requests with low latency) is hard.
Iteration speed > iteration quality
To build accurate models on real data, success depends on the scale and speed of experimentation: running more experiments across more hyperparameters yields better results. Agile ML teams need to be able to rapidly iterate across these large, demanding models, and to do so in a reproducible fashion.
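To make that concrete, here is a minimal sketch in Python of fanning a hyperparameter grid out across parallel workers. The `train_and_evaluate` function, the search space, and the scoring are hypothetical stand-ins for a real training job, not any particular platform's API:

```python
# Minimal sketch: running a hyperparameter grid in parallel.
# train_and_evaluate and the search space are invented placeholders.
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def train_and_evaluate(params):
    """Stand-in for a real training run; returns a validation score."""
    lr, batch_size = params["lr"], params["batch_size"]
    # ... a real implementation would train a model here ...
    return {"params": params, "score": 1.0 / (lr * batch_size)}

search_space = {
    "lr": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
}

if __name__ == "__main__":
    grid = [dict(zip(search_space, values))
            for values in product(*search_space.values())]
    # Each experiment is independent, so they can run in parallel:
    # across processes here, across a cluster in a real platform.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(train_and_evaluate, grid))
    best = max(results, key=lambda r: r["score"])
    print(f"best params: {best['params']} (score={best['score']:.4f})")
```

Recording the exact parameters alongside every result is what makes a sweep like this reproducible; a real platform also versions the code and data behind each run.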
Running experiments in parallel and serving models is computationally expensive. In the last five years, the compute demands of the best models have grown by roughly 300,000x. To accelerate training and inference, models are often run on large compute clusters that sometimes span multiple clouds, or even hybrid cloud and on-premise environments. Providing a robust infrastructure orchestration layer, a queuing system, and resource controls (eg to place limits, track runaway jobs, etc.) is essential as you scale up.
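As a rough illustration of what queuing and resource controls mean in practice, the toy scheduler below only starts jobs when enough GPUs are free and evicts jobs that exceed a wall-clock limit. The `Job` structure and the specific limits are invented for this sketch:

```python
# Toy sketch of a job queue with resource limits and runaway-job control.
# The Job fields and the limits are invented for illustration.
import time
from collections import deque
from dataclasses import dataclass

TOTAL_GPUS = 8
MAX_RUNTIME_SECS = 3600  # flag anything running longer than an hour

@dataclass
class Job:
    name: str
    gpus_requested: int
    started_at: float = 0.0

queue = deque()  # jobs waiting for resources
running = []     # jobs currently holding GPUs

def schedule():
    """Start queued jobs in order while enough GPUs are free."""
    in_use = sum(j.gpus_requested for j in running)
    while queue and in_use + queue[0].gpus_requested <= TOTAL_GPUS:
        job = queue.popleft()
        job.started_at = time.time()
        running.append(job)
        in_use += job.gpus_requested

def reap_runaways():
    """Evict jobs that exceed the wall-clock limit, freeing their GPUs."""
    now = time.time()
    for job in list(running):
        if now - job.started_at > MAX_RUNTIME_SECS:
            running.remove(job)  # a real scheduler would also kill the process

queue.extend([Job("train-resnet", 4), Job("hpo-sweep", 4), Job("giant-model", 8)])
schedule()  # the first two jobs fit (4 + 4 GPUs); "giant-model" stays queued
print([j.name for j in running], [j.name for j in queue])
```

A production system layers the same ideas (admission control, quotas, preemption) on top of an orchestrator such as Kubernetes rather than an in-process list.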
The complexity compounds with every new model and every new ML team member. Model management (eg versioning, performance monitoring, governance, lineage, staging/production environments, etc.) requires automation, policy controls, and other capabilities at the compute and management layers.
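What a management layer looks like is easier to see in code. Below is a hedged sketch of a minimal model registry that versions every model, records lineage metadata, and gates promotion from staging to production behind a policy check; the schema is an assumption for illustration, not any particular product's API:

```python
# Minimal sketch of a model registry: versioning, lineage, and
# staging/production promotion. The schema is illustrative only.
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    version: int
    dataset_hash: str       # lineage: which data produced this model
    code_commit: str        # lineage: which code produced it
    metrics: dict
    stage: str = "staging"  # "staging" or "production"

@dataclass
class ModelRegistry:
    versions: dict = field(default_factory=dict)  # name -> [ModelVersion]

    def register(self, name, dataset_hash, code_commit, metrics):
        history = self.versions.setdefault(name, [])
        mv = ModelVersion(len(history) + 1, dataset_hash, code_commit, metrics)
        history.append(mv)
        return mv

    def promote(self, name, version, min_accuracy=0.9):
        """Policy control: only promote versions that clear a quality bar."""
        mv = self.versions[name][version - 1]
        if mv.metrics.get("accuracy", 0.0) < min_accuracy:
            raise ValueError(f"{name} v{version} is below the quality bar")
        for other in self.versions[name]:
            other.stage = "staging"  # demote any current production version
        mv.stage = "production"

registry = ModelRegistry()
registry.register("churn", "data-abc123", "commit-9f1e2d", {"accuracy": 0.93})
registry.promote("churn", version=1)
```

Even this toy version hints at the compounding complexity: each new model multiplies the versions, lineage records, and promotion decisions the platform has to automate.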
Production ML systems begin to degrade the moment they are deployed. Deploying these systems isn't just about shipping a model. It's about building infrastructure that continuously refines and retrains models on new data, alerts on unexpected behavior and supports debugging, and can roll out updates (and roll back changes) without interruption.
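To ground this, here is a sketch of that monitor-retrain-rollout loop under invented assumptions: it compares live accuracy against a baseline, triggers retraining when performance drifts, and only ships the new model if it beats what is currently deployed. The thresholds and the helper functions are hypothetical placeholders:

```python
# Sketch of a monitor -> retrain -> deploy-or-hold loop.
# All thresholds and helpers are hypothetical placeholders.

BASELINE_ACCURACY = 0.92
DRIFT_TOLERANCE = 0.05

def live_accuracy():
    """Stand-in for scoring the deployed model on fresh labeled data."""
    return 0.84  # pretend live performance has degraded

def retrain_on_new_data():
    """Stand-in for a full retraining run; returns (model, val_accuracy)."""
    return "model-v2", 0.90

def deploy(model):
    print(f"deploying {model}")

def keep_current_model():
    print("alerting the team; keeping the current model")

def monitoring_cycle():
    acc = live_accuracy()
    if BASELINE_ACCURACY - acc > DRIFT_TOLERANCE:
        print(f"alert: live accuracy dropped to {acc:.2f}; retraining")
        new_model, val_acc = retrain_on_new_data()
        if val_acc > acc:  # only ship if it beats what's live
            deploy(new_model)
        else:
            keep_current_model()

monitoring_cycle()
```

Rollback is the same promotion mechanism run in reverse: because every deployed version is recorded, reverting is a metadata change rather than an emergency rebuild.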
If all of this reminds you of traditional software development concepts like CI/CD, you are not alone. Applying CI/CD methodology to machine learning is a new concept that many believe will help mature the industry from a purely R&D or academic discipline into one focused on shipping models into the hands of end-users and driving business value.
As the industry evolves out of its infancy, companies will want to maximize the return on their ML investment. Pressures such as time to market will drive companies to streamline their ML efforts.
When a technology is in its infancy, find a trusted partner
Something to keep in mind is the philosophy of treating companies that are evangelizing a space as partners rather than mere software vendors. These thought leaders are pioneers bringing a new technology to market, and their interests are aligned with yours, especially in the early days of finding product-market fit, tracking trends and best practices, and absorbing as much customer feedback as possible.
As in traditional software development, the way machine learning is practiced is often almost identical from one organization to another, even across wildly different verticals and use-cases. This means the problems you are facing are most likely the same problems facing other organizations, and your feature wishlist is probably highly relevant to them as well. As a result, your technology partner can take your requests and ship new features to all of their customers at once, knowing that everyone will benefit. Put another way, if each of these organizations had opted to build their own ML platform, they would all be building the same capabilities over and over again in their siloed environments.
Wrapping up
Ultimately, machine learning is cross-disciplinary, spanning data engineering, data science, and DevOps. MLOps is an appropriate term to describe the emerging practice of "productionizing" machine learning. Today, organizations investing in machine learning spend the majority of their time on tooling and infrastructure, and very little of their time on building and deploying models. Many ML initiatives fail because companies simply haven't figured out how to operationalize their efforts. In response, MLOps platforms have sprung up to address these challenges by abstracting away the entire model management lifecycle. Building a platform from scratch is an enormously complex undertaking best suited to the tech behemoths, and arguably even then it's a stretch. We don't see companies rebuilding GitHub every time they want to start developing software. Building an ML platform should not be a prerequisite for investing in machine learning.