End-to-end Data Science on Gradient: Nvidia Merlin

In this article, we discuss the process of conducting end-to-end data science on Gradient with Nvidia Merlin. This includes walkthroughs on 3 examples: Multi-stage recommenders, training and serving a MovieLens model, and scaling for the massive Criteo dataset.

By Nick Ball

Gradient is designed to be an end-to-end data science platform, and as such covers all stages of data science, from initial viewing of raw data through to models deployed in production with applications attached.

Therefore, it is of benefit to show working examples of end-to-end data science on Gradient.

Nvidia's GPU-accelerated recommender system, Merlin, is one such example. One of Nvidia's Use Case Frameworks, it has some excellent notebooks containing end-to-end work, and showcases a particularly large range of tools working together to interact with data at scale and provide actionable outputs.

Major parts of end-to-end data science shown in this blog entry are:

  • ETL/ELT at scale
  • Feature engineering
  • Feature store
  • Synthetic data
  • Model training of deep learning recommenders
  • Model saving
  • Deployment
  • Model ensembles, which solve preprocessing at deployment time

Additionally, further methods are demonstrated, such as approximate nearest neighbors search, single-hot and multi-hot encoding of categorical features, and frequency thresholding (classes with fewer than a given number of occurrences are mapped to the same index).
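
To give a flavor of the approximate nearest neighbors component, here is a minimal Faiss sketch, with random vectors standing in for trained item embeddings (all names here are illustrative, not from the Merlin notebooks):

```python
import numpy as np
import faiss

d = 64  # embedding dimension
item_embeddings = np.random.rand(10_000, d).astype("float32")  # stand-ins

# IVF index: cluster the vectors, then search only the closest clusters,
# trading a little recall for a large speedup over exact search
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 100, faiss.METRIC_INNER_PRODUCT)
index.train(item_embeddings)
index.add(item_embeddings)

query = np.random.rand(1, d).astype("float32")
scores, ids = index.search(query, 10)  # top-10 approximate neighbors
print(ids)
```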

The various stages, such as data preparation, model training, and deployment, are all done on the GPU, and are thus highly accelerated.

Tools included in the working notebook examples are NVTabular, cuDF, Parquet, Feast, Merlin Models, HugeCTR, TensorFlow 2, Dask, RAPIDS, and Triton Inference Server, plus others such as Faiss fast similarity search, Protobuf text files, and GraphViz.

Let's take a brief look at Nvidia Merlin, followed by how to run it on Gradient, and the three end-to-end examples provided.

Nvidia Merlin

Merlin is Nvidia's end-to-end recommender system designed for GPU-accelerated work at a production scale.

As such, it is specialized for recommenders as opposed to, for example, computer vision or NLP models. Recommenders are nonetheless a common use case in data science and, like any other sub-domain of deep learning, require their own frameworks to achieve optimal results.

It solves many of the issues that arise when attempting end-to-end data science at scale, for example:

  • Data preparation is done on the GPU, giving large speedups (NVTabular)
  • Large data can be handled in a familiar API without having to add parallelism to the code (cuDF; see the short sketch after this list)
  • An efficient column-oriented file storage format is used that is much faster than plain text files (Parquet)
  • A feature store is included so that feature engineering is organized, simplified and reusable (Feast)
  • Real training data can be augmented by synthetic data when needed, an increasingly popular component of AI work (Merlin generate_data())
  • Distributed model training enables better models to be trained faster (HugeCTR)
  • Models are optimized for inference and can be deployed into production in a robust manner (Triton Inference Server)
  • Preprocessing of incoming raw data in a deployment is solved by deploying the multiple components of a typical recommender as an ensemble (Triton ensembles)
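
To illustrate the cuDF point above: the code reads exactly like pandas but executes on the GPU. A minimal sketch, with a hypothetical click-log file:

```python
import cudf

# Identical to the pandas version of this code, but runs on the GPU
df = cudf.read_parquet("clicks.parquet")  # hypothetical input file
top_items = (
    df.groupby("item_id")["click"]
      .mean()
      .sort_values(ascending=False)
      .head(10)
)
print(top_items)
```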

The easiest way to run it is to use Nvidia's provided GPU-enabled Docker containers. By running on Gradient, many of the barriers to setting these up and using them, such as obtaining GPU hardware and setting up Docker, are removed.

Running it on Gradient

Because Merlin is supplied as Docker containers, plus a GitHub repository, it can be immediately run on Gradient using our runtimes, which combine these two components.

Create Notebook

After signing in, create a Project, then a Notebook.

Notebook creation in Gradient. You can see the use of the Merlin GitHub repository, Docker container, and Nvidia NGC Container catalog Docker registry credentials (see below) under Advanced Options.

Choose a GPU machine to run it on. We used the Ampere A6000 with 48GB of GPU memory, but others, like the A100, should also work. This is an expensive set of examples to run, so be mindful of GPU costs when choosing a machine type, and of the possibility of out-of-memory (OOM) errors on less powerful machines.

Select the TensorFlow recommended runtime (TensorFlow 2.9.1 at the time of writing). Under Advanced Options, for workspace enter https://github.com/NVIDIA-Merlin/Merlin, and for image, enter nvcr.io/nvidia/merlin/merlin-tensorflow:22.10 .

You can also launch the Notebook with a click using the link below (if you are a Growth account user; see the link for details):

Bring this project to life

Extra step needed: Nvidia API Key

For many containers and repositories, that's all there is to it! You are ready to go.

For this one, however, there is one extra step: Nvidia NGC Catalog Docker images require you to sign in to Nvidia first to download them. Luckily, Paperspace makes this easy because of its capability to sign into credentialed Docker registries.

So, under Advanced Options, for Registry Username enter $oauthtoken (this actual string, no substitution), and under Registry Password, paste in your Nvidia NGC Catalog API key.

💡
If you don't have an Nvidia NGC Catalog API key, you can create one by going to the NGC Catalog page, signing up, then in your profile dropdown on the top right under Setup there is an option to generate a key.

It is also possible to create a Gradient Notebook programmatically in the usual way (including the API key), using the Gradient command line interface.

For more on Gradient Notebooks and their creation, see the documentation.

The Examples

The Nvidia Merlin repository supplies three main examples:

  • Multi-stage recommender
  • MovieLens dataset
  • Scaling data with Criteo

Here we will not go into depth about how the recommenders themselves work, but rather:

(a) Talk about how common issues encountered in end-to-end work at scale are solved

(b) Provide a few notes and pointers for running the examples well on Gradient. In fact, they are remarkably easy to run for what they achieve, but the odd detail here and there is worth mentioning.

💡
Note: the examples provided in the Merlin containers have parallel tracks using either TensorFlow 2 or HugeCTR (plus some PyTorch), but these run in different containers. To simplify the presentation, we focus on the TensorFlow 2 route, which provides an end-to-end path in all cases.

General Setup

Some things to note for general Gradient Notebook setup:

  • We ran the Notebooks on an A6000 single GPU. Other GPUs should work, although some with less memory may not do so well.
  • If an out-of-memory error is encountered, resetting the kernel of the Jupyter notebook in question, or of other run notebooks, should clear the used GPU memory and allow things to run. You can check GPU memory usage via the GUI left-hand navigation bar metrics tab, or nvidia-smi on the terminal.
  • We recommend running each of Multi-stage, MovieLens, and Scaling Criteo in its own separate Gradient Notebook, so it remains clear which output came from which example. All 3 can use the same GitHub repository and Docker container, as given above under Create Notebook.
  • When the notebooks are run, you may see occasional warnings, such as NUMA node messages or deprecation notices. These may come from Gradient or from the original examples, and are not critical to running the notebooks through.

Example 1: Multi-Stage

The multi-stage recommender example can be found here, and consists of two Jupyter notebooks: 01-Building-Recommender-Systems-with-Merlin.ipynb and 02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb.

It demonstrates how to build a recommender system pipeline using four stages: retrieval, filtering, scoring and ordering. The figure from the first notebook shows what each of these are:

Multi-stage recommendation system (from the first notebook in this example)

It trains both a retrieval (two-tower) model and a scoring/ranking model (Deep Learning Recommendation Model, or DLRM), includes data preparation and a feature store, and, crucially, shows how to deploy the multi-model setup into production using Triton Inference Server.

In 01-Building-Recommender-Systems-with-Merlin.ipynb, the dataset is prepared, feature engineering is done, and the models are trained. By default, the dataset generated is synthetic, using Merlin's generate_data() function, but there is an option to use real data from ALI-CCP (Alibaba Click and Conversion Prediction) too. The feature engineering is done using NVTabular, and the feature store uses Feast. After model training, the output goes to /workspace/data/.
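
The synthetic data step is compact; a sketch following the pattern used in the notebook (the row count is illustrative, and the exact module path may vary between Merlin versions):

```python
from merlin.datasets.synthetic import generate_data

# 100,000 synthetic rows mimicking the ALI-CCP schema,
# split 70/30 into train and validation sets
train, valid = generate_data("aliccp-raw", 100_000, set_sizes=(0.7, 0.3))
print(train.to_ddf().head())
```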

In 02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb, the deployment is set up, including feature store, model graph exportation as an ensemble, and the Triton server. The model is then deployed, and recommendations can be retrieved.

Unfortunately, the notebook doesn't provide much insight into the correctness of the recommendations, or how they would then be used, but they look plausible.

Running the notebooks

When running notebook 1, you can do Run All, except that you will need to uncomment the %pip install "feast<0.20" faiss-gpu line in the first code cell.

For notebook 2, assuming you ran notebook 1, you can run everything except the last cell, which will fail. For that cell to run, first start Triton Inference Server in the terminal using tritonserver --model-repository=/Merlin/examples/Building-and-deploying-multi-stage-RecSys/poc_ensemble/ --backend-config=tensorflow,version=2. If it reports being out of memory, restart the notebook 1 kernel to free some GPU memory. Then you can run the last cell of notebook 2.
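
To check whether the server has come up before running that last cell, you can also poll it from Python with the tritonclient package included in the Merlin containers; a small sketch:

```python
import tritonclient.http as triton_http

# Triton serves HTTP on port 8000 by default
client = triton_http.InferenceServerClient(url="localhost:8000")

if client.is_server_live() and client.is_server_ready():
    # List the models loaded from the ensemble repository
    for model in client.get_model_repository_index():
        print(model["name"], model.get("state"))
```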

💡
Note: At the present time the deployment in this example will in fact hang on Gradient before becoming ready, and so the last cell in notebook 2 will still fail. This is because the Docker container invocation (docker run) used in the original Nvidia GitHub repository uses the argument memlock=-1, and in Gradient the container is invoked behind the scenes without this. A fix is planned in the first quarter of 2023. In the meantime, to see deployments that do run, go to the MovieLens and Scaling Data examples below, which do not have this issue.

Example 2: MovieLens

The MovieLens recommender example can be found here, and consists of notebooks in 4 steps. We will focus on the TensorFlow (TF) path, as that covers both training and deployment, and allows us to continue using the same Docker image as above. Therefore we run these 4 notebooks:

  • 01-Download-Convert.ipynb
  • 02-ETL-with-NVTabular.ipynb
  • 03-Training-with-TF.ipynb
  • 04-Triton-Inference-with-TF.ipynb

In 01-Download-Convert.ipynb, the MovieLens-25M dataset is downloaded and converted from its supplied CSV format into the much more efficient Parquet.
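
The conversion itself is brief; a sketch in the spirit of notebook 1, using cuDF (paths follow the MovieLens-25M layout):

```python
import cudf

# Read the ~25M-row ratings CSV and rewrite it as Parquet; subsequent
# reads from the columnar format are many times faster
ratings = cudf.read_csv("ml-25m/ratings.csv")
ratings.to_parquet("ml-25m/ratings.parquet")
```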

In 02-ETL-with-NVTabular.ipynb, NVTabular is used to perform data preparation. It has some useful capabilities, such as multi-hot features, meaning information like movie genres, where the number of categories varies from row to row, can be used in full. This is in addition to the usual use of the user, item, and rating information. NVTabular can also handle data larger than the GPU's memory.
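
A condensed sketch of such a workflow, assuming a Parquet file in which genres is already a list column (the real notebook builds this up in more detail):

```python
import nvtabular as nvt
from nvtabular import ops

# "genres" is a multi-hot (list) column: each row holds a variable
# number of genres. Categorify encodes single-hot and multi-hot alike.
cat_features = ["userId", "movieId", "genres"] >> ops.Categorify()

workflow = nvt.Workflow(cat_features + ["rating"])
train = nvt.Dataset("train.parquet")
workflow.fit(train)                                 # learn category mappings
workflow.transform(train).to_parquet("train_processed/")
```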

03-Training-with-TF.ipynb then trains the models, here using deep learning layers with embeddings, ReLU, etc. NVTabular's data loader is used to accelerate the training by reading the data directly into GPU memory, among other optimizations, removing what can otherwise be a bottleneck when using GPUs.
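
The data loader is a drop-in replacement for a Keras input; a sketch of the pattern (batch size and column names follow the example, paths illustrative):

```python
import nvtabular as nvt
from nvtabular.loader.tensorflow import KerasSequenceLoader

# Streams Parquet batches directly into GPU memory, keeping the GPU fed
train_loader = KerasSequenceLoader(
    nvt.Dataset("train_processed/*.parquet"),
    batch_size=65536,
    cat_names=["userId", "movieId", "genres"],
    cont_names=[],
    label_names=["rating"],
    shuffle=True,
)

# The loader is then passed straight to Keras: model.fit(train_loader, epochs=1)
```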

Finally, in 04-Triton-Inference-with-TF.ipynb the model is deployed using Triton Inference Server. As with the first example on Multi-Stage above, we don't get much insight into the utility of the recommendations, but it's clear that model outputs are being produced.

The examples here show again the capability of NVTabular in doing real data preparation, and speeding it up with GPUs:

NVTabular functionality (from the second notebook in this example)

We also see again the usage of Triton Inference Server to solve the problem of preprocessing incoming unseen data in production deployment:

Solving production deployment preprocessing by using a model ensemble (from the fourth notebook in this example)
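
A sketch of what such a request looks like from the client side, following the pattern in notebook 4 (the ensemble name and import path have moved between Merlin releases; inspect client.get_model_metadata if they differ in yours):

```python
import cudf
import tritonclient.grpc as grpcclient
from nvtabular.inference.triton import convert_df_to_triton_input

# Three raw, unpreprocessed rows; the NVTabular stage of the deployed
# ensemble applies the same transforms that were fit during training
batch = cudf.DataFrame({"userId": [100, 101, 102], "movieId": [5, 25, 75]})
inputs = convert_df_to_triton_input(["userId", "movieId"], batch, grpcclient.InferInput)

# Triton serves gRPC on port 8001 by default
with grpcclient.InferenceServerClient("localhost:8001") as client:
    response = client.infer("movielens", inputs)  # "movielens" = ensemble name
```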

Running the notebooks

The first 3 notebooks can be run as-is using Run All.

Before running notebook 4, start the Triton Inference Server, using the directories as they are created on Gradient: tritonserver --model-repository=/root/nvt-examples/models/ --backend-config=tensorflow,version=2 --model-control-mode=explicit.

Then notebook 4 also works with Run All.

Example 3: Scaling Criteo Data

💡
Note: This example takes longer to run than the others (20+ minutes), because notebook 1 has steps to download, extract, and then write out the now large-scale data.

In the third and final example, we see how to work with larger-scale data. The full public Criteo dataset is 1 terabyte, and consists of 4 billion rows of click logs over 24 days.

We focus on a 2-day subset, although we did find that downloading and preparing all 24 days works. Training and deployment worked for a Kaggle-sized 7-day subset, but 24 days needs a larger GPU setup.

For the full 24 days, the information online quotes "1.3 terabytes", but this is in fact the combined size of the compressed and uncompressed files, which do not both have to be kept on disk. 1 terabyte is the uncompressed size of the flat files; after conversion to Parquet they occupy about 250GB.

As with the first 2 examples, we focus on the TensorFlow 2 path, and run:

  • 01-Download-Convert.ipynb
  • 02-ETL-with-NVTabular.ipynb
  • 03-Training-with-Merlin-Models-TensorFlow.ipynb
  • 04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb

In 01-Download-Convert.ipynb, the dataset is downloaded and converted to Parquet format. By default, it does not download all 1 terabyte, but downloads 2 of the 24 days, totaling about 30GB compressed. There is the option to download any number of days between 2 and 24, so you can scale up as desired. This notebook also sets up and uses a Dask cluster.

In 02-ETL-with-NVTabular.ipynb, the data is prepared using NVTabular, and the setup includes the capability to extend to both multi-GPU and multi-node. This uses a Dask cluster and the Nvidia RAPIDS dask_cudf library to work with Dask dataframes. Data preparation includes imposing frequency thresholding, so that categories that occur infrequently are grouped together into one category. The resulting NVTabular workflows take about a minute each to run on a single GPU with 2 days' data, and the output goes to /raid/data/criteo/test_dask/output/.

Dask dataframes simplify working with large data in a distributed fashion (from the second notebook in this example)
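
The scaling machinery in notebook 2 condenses to a few lines; a hedged sketch (Criteo's columns are the categorical C1–C26 and continuous I1–I13, and the frequency threshold value is illustrative):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import nvtabular as nvt
from nvtabular import ops

# One Dask worker per visible GPU; the same code scales to multi-GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# Categories seen fewer than 15 times are mapped to a shared "rare" index
cats = [f"C{i}" for i in range(1, 27)] >> ops.Categorify(freq_threshold=15)
conts = [f"I{i}" for i in range(1, 14)] >> ops.FillMissing() >> ops.Normalize()

workflow = nvt.Workflow(cats + conts + ["label"], client=client)
dataset = nvt.Dataset("/raid/data/criteo/converted/criteo/*.parquet", engine="parquet")
workflow.fit(dataset)
workflow.transform(dataset).to_parquet("/raid/data/criteo/test_dask/output/")
```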

The model is trained in 03-Training-with-Merlin-Models-TensorFlow.ipynb. This time, it is again the DLRM (Deep Learning Recommendation Model), and training plus evaluation take a few minutes with the 2 days' worth of data. The model is saved to /raid/data/criteo/test_dask/output/dlrm/.
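
Merlin Models makes the training code in notebook 3 compact; a condensed sketch of the pattern it uses (layer sizes illustrative, paths as in the example):

```python
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("/raid/data/criteo/test_dask/output/train/*.parquet")
valid = Dataset("/raid/data/criteo/test_dask/output/valid/*.parquet")

# DLRM: embeddings for categoricals, an MLP "bottom" for continuous
# features, pairwise feature interactions, then a "top" MLP
model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask("label"),
)
model.compile(optimizer="adam")
model.fit(train, batch_size=16 * 1024)
model.evaluate(valid, batch_size=16 * 1024)
```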

When the model is deployed in 04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb, the Triton setup includes the NVTabular workflow for feature engineering that was set up in notebook 2.

As with the other 2 examples, the correctness and utility of the model output recommendations are not really addressed, but we again see that they are being produced, and so we have run our "end-to-end" data science.

Running the notebooks

All 4 can be run as-is with Run All.

💡
Note: For this example in particular, you should restart the notebook kernels after running each one both to avoid running out of GPU memory that has been used for the large data, and to avoid attempting to start a Dask cluster twice on the same port.

For notebook 1, after it has run and the original data is converted to Parquet, you can remove the original data to save disk space by doing rm -rf /raid/data/criteo/crit_orig/. The converted data is in /raid/data/criteo/converted/criteo/. Note that if you shut down and restart the Gradient Notebook, the data will be deleted as /raid is not a persistent directory. To retain it, move it to somewhere that is persisted, such as /storage.

For notebook 2, which uses Dask, Gradient supports multi-GPU, although you would need to change the machine type to a multi-GPU instance, such as A6000x2. This is not required, however, and it runs fine on a single A6000. The Dask dashboard that the notebook says will appear at 127.0.0.1:8787 unfortunately does not work with the current Gradient Notebook architecture, but it is not needed for the rest to function. You can still see the CUDA cluster setup in the cell output.

Notebook 3 runs through as-is.

For notebook 4, do Run All first before starting the Triton Inference Server using tritonserver --model-repository=/raid/data/criteo/test_dask/output/ensemble --backend-config=tensorflow,version=2. The penultimate cell will fail, but after the server has started you can run the last 2 cells. The server takes a minute or two to start up for these models.

Conclusions and Next Steps

We have seen how real end-to-end data science can be done on Paperspace Gradient, at a scale and robustness that is suitable for production. Nvidia Merlin recommenders provide an excellent demonstration of this.

We saw three examples:

  • Multi-stage recommender
  • MovieLens dataset
  • Scaling data with Criteo

and a large number of different tools that are commonly used in end-to-end data science working together.

For some next steps, you can:

  • Run the three examples yourself, using the repository and container given above
  • Explore the other examples in the Nvidia Merlin GitHub repository
  • See the Gradient documentation for more of what the platform can do

Good luck!
