Introduction to Part 6
In this final part of the six-part series, we recap the main points from the series, and point to next steps, both for this work in terms of other things that Gradient can do, and for the reader who would like to learn more.
Part 1: Posing a business problem
Part 2: Preparing the data
Part 3: Building a TensorFlow model
Part 4: Tuning the model for best performance
Part 5: Deploying the model into production
Part 6: Summary, conclusions, and next steps
- The main location for accompanying material to this blog series is the GitHub repository at https://github.com/gradient-ai/Deep-Learning-Recommender-TF .
- This contains the notebook for the project, deep_learning_recommender_tf.ipynb, which can be run in Gradient Notebooks or the JupyterLab interface, and 3 files for the Gradient Workflow.
- The repo is designed so that it can be followed without referring to the blog series, and vice versa, but the two complement each other.
Model deployment support in the Gradient product on public clusters and from Workflows is currently pending, expected in 2021 Q4. Therefore section 5 of the Notebook deep_learning_recommender_tf.ipynb, on model deployment, is shown but will not yet run.
As we saw in part 1, there are various highlights of the series that we have aimed to show. Recapping these:
- Demonstrate a real-world-style example of machine learning on Gradient
- Incorporate an end-to-end dataflow with Gradient Notebooks and Workflows
- Use modern data science methodology based on Gradient's integrations with Git
- Use TensorFlow 2 and TensorFlow Recommenders (TFRS) to train a recommender model that includes deep learning
- Use training data that reflects real-world project variables and realities
- Construct a custom model using the full TensorFlow subclassing API
- Show working hyperparameter tuning that improves the results
- Deploy model using Gradient Deployments and its TensorFlow Serving integrations
- Include a self-contained working Jupyter notebook and Git repository
- Make it relatable to a broad audience including ML engineers, data scientists, and those somewhere in between
We can now add some discussion and further comment.
A lot of online material shows ML but neglects engineering, or shows engineering but neglects ML.
In other words, a sophisticated model might be trained but we don't see how to deploy it, or else a new engineering setup might be shown but only on a toy model or data.
In either case, essential elements like data preparation and model tuning are often neglected as well. This leaves the user who wants to do real things short of working examples.
Because Gradient spans data science experimentation and production (Notebooks and Workflows), and is designed to be used for data science on real end-to-end problems, we wanted to show both of these areas, and not skip them, while at the same time avoiding the series becoming too long.
Data science in the enterprise is still very much an evolving field.
Best practices are changing constantly. Data scientists sometimes have poorer software engineering skills than software engineers, and software engineers may be productionalizing models that contain analyses that they do not fully understand. This is exemplified by the difference between data scientists working in notebooks and engineers trying to translate that into a production deployment.
What is clear is that data science is fundamentally iterative, and that tools like Gradient can help data scientists and ML engineers iterate in a shared environment, which makes it easier to accomplish shared goals.
TensorFlow 2 (TF2) is much easier to use than TensorFlow 1. A lot of online material, such as tutorials, blogs, and Stack Overflow answers, still covers version 1 or doesn't differentiate between the two. This is true of much of Gradient's material too, since it has been around longer than TF2. So we considered it important here to start showing substantive examples of TF2 and how it can be used on Gradient.
TensorFlow Recommenders (TFRS) is an extension to TensorFlow that adds classes to make it easier to build state-of-the-art recommenders. At the time of writing, it was still quite new at version 0.4. But it looks impressive, and is well suited to the tasks we used it for here.
The subclassing API was good to show because real-world business problems often have some required custom component that simpler, higher-level interfaces, whether the Keras sequential and functional APIs or AutoML and no-code tools, can't easily express. Showing the subclassing API here demonstrates how the Gradient setup can be used for fully general analyses that solve real problems.
Deploying models is not easy.
One example of this is that at the time of writing, the 8 TFRS tutorials did not show a working production deployment.
There were some reasons for this. The pedagogical setup of their models means that they were arranged to clearly show how they worked rather than be deployed. They did show returning predictions in some form, but clearly showing production deployment would be a desirable addition. We show such deployment in this series, and it is made easier for us because Gradient already has the necessary infrastructure, containers, and compute hardware to do it. So all we had to do was rewrite the model classes appropriately.
The accompanying notebook roughly follows the blog, but does not correspond to all its sections exactly. The notebook is designed to be self-contained so you don't need to flip back-and-forth between it and this blog to use it. Hopefully this setup works well for those readers who also view the notebook.
Our business problem was stated as:
Demonstrate that Paperspace Gradient Notebooks and Workflows can be used for solving real machine learning problems. Do this by showing an end-to-end solution from raw data to production deployment. Further show that what is demonstrated can be plausibly built up into a full enterprise-grade system. In this case, the model is a recommender system whose results are improved by utilizing a tuned deep learning component.
We have shown that a recommender with tuned deep learning added improves the root-mean-square error between the predicted and true user ratings from 1.11 to 1.06. The main way to improve further is to add other elements of recommender architecture such as cross-features.
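To put those two numbers in context, here is a minimal sketch of how the root-mean-square error is computed and what the change from 1.11 to 1.06 means as a relative reduction. The function names are ours, chosen for illustration, and are not taken from the project notebook.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_error, new_error):
    """Fractional reduction in error relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# RMSE values reported in this series, before and after adding tuned deep learning.
improvement = relative_improvement(1.11, 1.06)
print(f"Relative RMSE reduction: {improvement:.1%}")
```

So the tuned model's typical rating error is about 4.5% smaller than the baseline's.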
Gradient combines a user- and analysis-friendly style of working for data scientists with a solid MLOps foundation for engineers. We have shown how the use of both Notebooks and Workflows can contribute to a working end-to-end data science project.
Thanks for reading!
Here we discuss some next steps for the project if the work were to be extended. They are arranged in roughly the order they would be encountered in the dataflow.
For next steps as a reader, try out the Notebook in the project GitHub repository, or follow the links below.
Data science and recommenders
These next steps are motivated by some general data science considerations, and ones more specific to recommenders and this project.
The business story
We started by talking about the business problem, but didn't expand at the end much beyond successful user rating predictions. Of course, here our actual aim was "Demonstrate that Paperspace Gradient can be used for solving real machine learning problems," but it could be shown, for example, how data science metrics are convertible to business metrics, and in turn show whether these meet quantitative targets and add value.
Exploratory data analysis (EDA)
Part 2 resolved some identified issues with the data, such as features and targets not being differentiated, but also mentioned various further explorations that would be done in a full-scale project. Plots such as the distributions of the user ratings in the training set would be the next step. We could also uncover other potential bad values, unexpected distributions, and so on. They could then be compared to the outputs on the testing set, and form one of the bases for monitoring for model drift.
Bad or missing values in data can take many forms, such as 9999.9, out of range, etc., and can in general only be fully located by use of domain knowledge and communication with those who supplied the data. In movie data, mismatched movies and IDs, duplicates, typos, mislabeled ratings, etc., may also be present. Some algorithms are more robust than others to such data, but in general for ML better data preparation will give better results.
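As a rough illustration of a first pass over such data, the sketch below flags missing ratings, sentinel values, out-of-range ratings, and duplicated user/movie pairs. The field names (user_id, movie_id, user_rating), the sentinel set, and the valid range are all assumptions made for this example. A real project would set them from domain knowledge.

```python
import math
from collections import Counter

SENTINELS = {9999.9}      # hypothetical placeholder values meaning "no data"
VALID_RANGE = (0.0, 5.0)  # assumed valid rating range


def find_suspect_rows(rows):
    """Return {row_index: [reasons]} for rows that look suspicious."""
    pair_counts = Counter((r["user_id"], r["movie_id"]) for r in rows)
    suspects = {}
    for i, row in enumerate(rows):
        reasons = []
        rating = row.get("user_rating")
        if rating is None or (isinstance(rating, float) and math.isnan(rating)):
            reasons.append("missing rating")
        elif rating in SENTINELS:
            reasons.append("sentinel value")
        elif not VALID_RANGE[0] <= rating <= VALID_RANGE[1]:
            reasons.append("out of range")
        # The same user rating the same movie more than once may be a duplicate.
        if pair_counts[(row["user_id"], row["movie_id"])] > 1:
            reasons.append("duplicate user/movie pair")
        if reasons:
            suspects[i] = reasons
    return suspects
```

Checks like these only catch the mechanical problems. Mismatched IDs or mislabeled ratings still need someone who knows the data.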
The term "big data" is slightly passé nowadays, but the author's favorite definition of it is "data of a size that breaks your favorite analysis tool." Recommender systems may be accessing data that contain millions of users and items of content, making for datasets much larger than MovieLens. Showing how to handle such data within the Gradient ecosystem, both for training and deployment (streaming), would be valuable.
With a deep learning component to the recommender, richer features can be used than those in, for example, matrix factorization. An obvious example to add here is to featurize the timestamps into time of day, day of week, etc., but there are also more sophisticated methods such as feature crosses that are shown in the TFRS tutorials.
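The timestamp idea can be sketched in a few lines. This assumes the timestamps are Unix seconds and interprets them in UTC. The feature names are ours, and a real pipeline would probably want the user's local time zone if it is known.

```python
from datetime import datetime, timezone


def featurize_timestamp(unix_seconds):
    """Turn a Unix timestamp (seconds) into simple calendar features, in UTC."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return {
        "hour_of_day": moment.hour,          # 0 to 23
        "day_of_week": moment.weekday(),     # Monday = 0, Sunday = 6
        "is_weekend": moment.weekday() >= 5,
    }


# 978300760 is 2000-12-31 22:12:40 UTC, a Sunday.
print(featurize_timestamp(978300760))
```

The resulting integers could then be treated as categorical inputs and embedded alongside the user and movie features.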
The notebook showed a basic hyperparameter grid search over a few learning rates. This was enough to show how they can be handled via Gradient Notebooks and Workflows. Other common parameters such as layer sizes, number of layers, optimizer, activation function, regularization, and network architecture could be more fully explored.
Writing all these as nested loops would become unwieldy, so some combination of passing parameters via environment variables, or using Keras Tuner or TensorBoard HParams, would give a more manageable search.
Similarly, some form of smart parameter search or AutoML can give better results than defining one's own hyperparameter grid.
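As a minimal sketch of the first option, the code below builds a grid from a dictionary rather than nested loops, and reads overrides from environment variables. The HP_ prefix and the parameter names are assumptions made for this example. They are not a Gradient or TensorFlow convention.

```python
import itertools
import os


def build_grid(search_space):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(search_space)
    value_lists = (search_space[name] for name in names)
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def overrides_from_env(defaults, prefix="HP_", environ=None):
    """Override defaults with <prefix><NAME> variables, cast to each default's type.

    Note: this simple cast is not suitable for boolean defaults,
    because bool("False") is True.
    """
    environ = os.environ if environ is None else environ
    params = dict(defaults)
    for name, default in defaults.items():
        raw = environ.get(prefix + name.upper())
        if raw is not None:
            params[name] = type(default)(raw)
    return params


grid = build_grid({"learning_rate": [0.001, 0.01, 0.1], "num_layers": [1, 2]})
print(len(grid), "combinations; first:", grid[0])
```

Each combination could be passed to one training run, for example by having a Workflow step set HP_LEARNING_RATE and HP_NUM_LAYERS before invoking the training script.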
Add the retrieval model
Our project shows the ranking model so the corresponding retrieval model could be added. Also possible would be a combined model that contains both.
More sophisticated recommenders
The TFRS tutorials show various more sophisticated models than retrieval and ranking, including a combined retrieval and ranking model, and deep cross networks (DCN) that implement the feature crosses mentioned above. DCNs will tend to give recommendations that are just as good while using far fewer parameters. Since we are already using and deploying the generic subclassing API form of the models, all of these could be added to our project.
Cold start problem
The cold start problem for recommenders manifests here as what to recommend to a new user who has not yet viewed any movies and so has given no ratings. This could be shown being explicitly addressed during deployment, by feeding in new users.
Recommending only movies very similar to what a user has viewed is probably too restrictive because they may not want to watch more of the same. Recommending movies too different is not useful either because they won't want to watch those. The amount of difference, or variety, between what is recommended and what the user has watched can be adjusted. As with many product adjustments, a good way to guide this would be to be specific about what business metric is to be maximized, and determine what works well via A/B testing.
Deployment at scale, performance
Here we showed a few rows of example data being passed to the deployed model, with the rows already in dictionary form. Better would be to start from a format likely to be passed in a production system, such as one produced by the part of the company's upstream stack that deals with incoming user activity but is still raw from the ML standpoint.
Because recommendations will likely need to be supplied in real time, a data stream would be better than batches. TFRS shows an example of performance enhancement, where the ScaNN approximate nearest neighbors library is used to speed up neighbor calculations and hence speed up the retrieval model portion of the system.
Convert the byte-encoded data
We sidestepped the issue of converting the byte-encoded movie data to JSON by just using sample rows to send to the model already given as dictionaries. If the incoming new movie data to a production deployment is byte-encoded and needs converting, this of course has to be addressed.
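One way the conversion might look is sketched below. It assumes the byte strings are UTF-8 encoded and that the request body uses the "instances" row format of the TensorFlow Serving REST API. The example row and its field names are made up for illustration.

```python
import json


def to_json_safe(value, encoding="utf-8"):
    """Recursively decode bytes so the result can be passed to json.dumps."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, dict):
        return {to_json_safe(k, encoding): to_json_safe(v, encoding) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v, encoding) for v in value]
    return value  # ints, floats, strings, None, and so on pass through unchanged


# Hypothetical row with byte-encoded string fields.
row = {"movie_title": b"Toy Story (1995)", "user_id": b"42", "timestamp": 879024327}
payload = json.dumps({"instances": [to_json_safe(row)]})
print(payload)
```

This sketch does not handle NumPy scalars or arrays, which would need converting to plain Python types first.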
Out-of-vocabulary (OOV) classes
Model deployments need to be robust to data that was not present in the training set, such as wrong formatting, out-of-range values for numerical data, or a new unseen class for categorical data. Particularly for something like movies, new unseen classes such as new movies might end up being fed to the deployed model. Aside from rejecting rows that don't fit all criteria, TensorFlow has the ability to deal with such classes by assigning them to one or more OOV class placeholders, and this could be shown.
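The idea behind OOV placeholders can be sketched in plain Python. In TensorFlow itself, Keras preprocessing layers such as tf.keras.layers.StringLookup expose this through a num_oov_indices argument. The class below is our own simplified stand-in, not that layer's implementation.

```python
import zlib


class VocabularyLookup:
    """Map known tokens to integer IDs and unseen tokens to reserved OOV buckets."""

    def __init__(self, vocabulary, num_oov_buckets=1):
        if num_oov_buckets < 1:
            raise ValueError("need at least one OOV bucket")
        self.num_oov_buckets = num_oov_buckets
        # IDs 0 .. num_oov_buckets-1 are reserved for OOV; known tokens follow.
        self.index = {token: i + num_oov_buckets for i, token in enumerate(vocabulary)}

    def __call__(self, token):
        if token in self.index:
            return self.index[token]
        # crc32 is stable across processes, unlike Python's built-in hash() for strings.
        return zlib.crc32(token.encode("utf-8")) % self.num_oov_buckets


lookup = VocabularyLookup(["Toy Story (1995)", "Heat (1995)"])
print(lookup("Heat (1995)"), lookup("Brand New Movie (2021)"))
```

With a single bucket, every unseen movie shares one embedding. More buckets spread unseen movies across several embeddings at the cost of extra parameters.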
Here we saw that the outputs look OK. They are numbers between 0 and 5 as expected. It might be the case, however, that the best mean-squared-error has been achieved by the model simply assigning most predictions to be about 3.5. Therefore, exploring the predictions and comparing them to the training set would be important.
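A quick check for this failure mode is sketched below. It relies on the fact that a model that always predicts the mean rating has an RMSE equal to the standard deviation of the true ratings. It also compares how spread out the predictions are relative to the true ratings. The function names and the example numbers are ours.

```python
import statistics


def constant_baseline_rmse(true_ratings):
    """RMSE of always predicting the mean rating (the population standard deviation)."""
    return statistics.pstdev(true_ratings)


def prediction_spread_ratio(predictions, true_ratings):
    """Spread of predictions relative to true ratings; near 0 suggests collapse to a constant."""
    return statistics.pstdev(predictions) / statistics.pstdev(true_ratings)


# Made-up numbers for illustration only.
true_ratings = [1, 2, 3, 4, 5]
predictions = [3.4, 3.5, 3.5, 3.6, 3.5]
print("baseline RMSE:", round(constant_baseline_rmse(true_ratings), 3))
print("spread ratio:", round(prediction_spread_ratio(predictions, true_ratings), 3))
```

If the model's RMSE is close to the constant baseline and the spread ratio is small, the model is mostly predicting the average and has learned little about individual users.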
Since Gradient has a setup that includes a Git repository, model versions, containers, and YAML workflow specifications, a recommender model training and result on given data should be reproducible.
In practice, TensorFlow and neural networks have a plethora of data selections, parameters, and random seeds that all vary, plus in distributed systems file line ordering is not guaranteed. So exact reproducibility may not be possible, but it could still be shown that the results are statistically reproducible – e.g. if error bars are derived on a rating using some sensible method, then future instances of the same version of the dataflow should be consistent within those errors.
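One simple way to derive such error bars is to repeat training with different random seeds and use the run-to-run scatter. The sketch below takes that approach. The RMSE values are invented for illustration, and the two-sigma threshold is an arbitrary choice.

```python
import statistics


def error_bar(values):
    """Mean and sample standard deviation of a metric across repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


def is_consistent(new_value, reference_values, n_sigma=2.0):
    """True if a new run lies within n_sigma run-to-run deviations of the reference mean."""
    mean, std = error_bar(reference_values)
    return abs(new_value - mean) <= n_sigma * std


# Hypothetical test-set RMSEs from five runs that differ only in random seed.
reference_rmse = [1.06, 1.07, 1.05, 1.06, 1.08]
print(is_consistent(1.07, reference_rmse))  # within the scatter
print(is_consistent(1.11, reference_rmse))  # well outside it
```

A later run of the same dataflow version that fails this check would be a prompt to look for an unintended change in data, code, or environment.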
Interpretation and explanations
As is well-known, various industries have regulatory requirements to say "why" a model made a certain prediction, and some algorithms are easier to interpret than others. Deep learning is one of the more difficult ones, but model-agnostic approaches such as SHAP could be shown. In addition, they would have to be suitable for a recommender as opposed to regular supervised learning.
The movement for models to be fair, accountable, and transparent (FAccT) is becoming increasingly important as they affect more aspects of our daily lives. While movie recommendations may not be the most crucial example of this, there is still potential for recommenders to be altered to promote certain content or suppress other content. Showing, for example, that the range of movies recommended fairly reflects what is out there, and there are not biases in recommendations that depend upon, say, discriminatory characteristics in user demographics, would be useful.
Machine learning can be compute-intensive, and in particular deep learning can require a lot of compute time and training data to give the best results. While Gradient is designed to make it easy to add both GPUs and distributed computing to your setup, it makes sense to reduce the computational burden where possible.
One method that is now becoming common is transfer learning, where networks trained on basic information, such as common components of images, can be used as a starting point for training something more specific, instead of training a network from scratch. Transfer learning is commonly used in computer vision and natural language processing.
Monitoring divides into application status (such as whether the model is running, uptime, latency, and throughput) and data science status (such as sensibility of outputs, concept drift, data drift, or model drift).
Since ground truth labels to compare the model's outputs against are in general not available, or maybe only available later, the model's performance needs to be monitored based upon the input and output data. Application monitoring could be shown via queries to the Prometheus database that Gradient uses in its backend.
Various data science metrics such as the "distance" between the distribution of ratings in the training set and those in the output data could be derived, possibly via integration with an existing monitoring tool. The key point for monitoring is that the model is on an API endpoint, so what it is doing is accessible to any tool that can see it.
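As one possible choice of "distance", the sketch below bins ratings to the nearest whole star and computes the total variation distance between two histograms. Both the binning and the choice of metric are assumptions made for this example. Other measures, such as the Jensen-Shannon divergence, would serve equally well.

```python
from collections import Counter


def rating_distribution(ratings, bins=(1, 2, 3, 4, 5)):
    """Fraction of ratings falling nearest to each bin value."""
    counts = Counter(min(bins, key=lambda b: abs(b - r)) for r in ratings)
    return {b: counts.get(b, 0) / len(ratings) for b in bins}


def total_variation_distance(p, q):
    """Half the sum of absolute differences; 0 = identical, 1 = no overlap."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


# Made-up data: uniform training ratings versus outputs clustered near 3.
training = rating_distribution([1, 2, 3, 4, 5])
outputs = rating_distribution([3.4, 3.4, 3.6, 3.2, 2.9])
print(round(total_variation_distance(training, outputs), 3))
```

Tracking a number like this over time, with an alert threshold, is one simple way to watch for drift in what the endpoint returns.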
While readers of this blog series and users of Gradient are most likely technical, the model's results, and therefore its business value, could be opened up to nontechnical users via an application that can see its endpoint, such as one built with Streamlit.
Links & further reading
Core, our cloud infrastructure
Advanced Technologies Group
Machine Learning Showcase
Recommenders & TensorFlow
Paperspace is a TensorFlow service partner and also works with fast.ai, Nvidia, and others. The code in this series is in part based on the TensorFlow Recommenders tutorials.
Google recommenders course
Modern recommender systems