Ten Steps to Production — Machine Learning Project’s Full Life Cycle
Today marks six months at my current company working as a machine learning engineer. While my previous position was also in the machine learning world, we had a less sophisticated ecosystem and focused more on research and proofs of concept. In contrast, my current team uses more mature tooling around machine learning projects to support their full life cycle.
In this post, I want to share my learnings of productionizing recommendation models. I will walk through the steps and life cycle of setting up a machine learning project in production, combined with materials from the Coursera course Machine Learning Engineering for Production (MLOps).
Stage Zero: Defining Architecture and SLA
Tools: meetings with PMs, cross team collaboration
The first step when productionizing a machine learning system is always to understand how it will be hit by the client (which could be another service that the user interacts with). Setting expectations on the SLA and estimating the traffic are crucial when deciding which features or model to use (complex models potentially have higher latency), the deployment strategy for the machine learning system (for example, what strategy we should use to scale up the Kubernetes pods), etc.
It is also important to align with PMs on the business goal, so that when evaluating models you can pick the right metrics to improve on and actually help the business.
In most cases, you can benefit from a model deployment framework that does the heavy lifting for you. In our case, since the model is developed in Python, we picked BentoML as our deployment framework because it supports the Tensorflow keras library. BentoML provides some performance-enhancing magic when serving the endpoint, so we don't have to worry about it too much. We will cover more details in Stage Four: Model Training and Serving below.
Stage One: Data
Tools: Kafka, Databricks, Spark
For the past six months, I’ve worked on productionizing recommendation models for cross-selling products on our platform. The data for recommendations comes from the click stream we collect via Kafka events. When users search and click on certain results, we treat that as a positive signal to recommend that listing. Since we are doing product cross-selling, purchase data is also taken into consideration when evaluating. However, there is much less purchase data than click data, so it could lead to skewed results.
When collecting data, we also need to test the data to ensure data quality and live up to the data contract for modeling.
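As a minimal sketch of such a check (assuming the raw click events land as parquet; the column names and path are made up for illustration), a Spark job can assert the data contract before anything downstream runs:

```python
# A minimal data quality check; column names and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
clicks = spark.read.parquet("s3://<bucket>/raw/clicks/")  # hypothetical location

# The data contract: these columns must exist...
required_columns = {"user_id", "listing_id", "click_ts"}
assert required_columns.issubset(clicks.columns), "data contract violated: missing columns"

# ...and ids must never be null.
null_ids = clicks.filter(F.col("user_id").isNull() | F.col("listing_id").isNull()).count()
assert null_ids == 0, f"data contract violated: {null_ids} rows with null ids"
```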
Stage Two: Model Features
Tools: Databricks, Spark
After we collect the data from our data source (parquet files accessible through Spark running on Databricks), we can write Spark jobs to evaluate and transform the data into what we need. This step converts the raw data fields into what the model should take as input. For example, instead of using the raw click timestamp, we convert it into day of week and day of month.
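A minimal PySpark sketch of this kind of transformation (the column names and path are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
clicks = spark.read.parquet("s3://<bucket>/raw/clicks/")  # hypothetical location

features = (
    clicks
    .withColumn("day_of_week", F.dayofweek("click_ts"))    # 1 = Sunday ... 7 = Saturday
    .withColumn("day_of_month", F.dayofmonth("click_ts"))
    .select("user_id", "listing_id", "day_of_week", "day_of_month")
)
```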
Another important feature in our use case is geo location, so it is filtered separately from the model layers. Looking for special scenarios like this in your use case and incorporating them into the project is important.
Another important thing we faced was handling unbalanced data, because there is a lot less data for the product we are trying to cross-sell. To handle this, we treat the positive clicks as anomaly events, and the predictions should be evaluated by precision, recall, and their harmonic mean (the so-called F1-score), instead of only looking at accuracy.
High precision means the proportion of recommended items in the top-k set that are relevant is high, while high recall means the proportion of relevant items found in the top-k recommendations is high. In our case, we favor high precision, since making sure the top items we recommend are relevant is more important; if we miss some relevant products, it is not that big of a deal. Compare this to factory inspection, which favors high recall: if there’s a false alarm, a human can double-check and it is fine, but they cannot risk missing any positives.
Consider the following two charts of skewed data: can you tell which model is better just by looking at the accuracy?
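A toy example (not our data) of why accuracy alone is misleading on skewed data: with 1% positives, a model that never recommends anything still reaches 99% accuracy while being useless.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1] * 10 + [0] * 990        # 1% positive clicks
y_pred = [0] * 1000                  # a "model" that never predicts a positive

print(accuracy_score(y_true, y_pred))                        # 0.99
print(precision_score(y_true, y_pred, zero_division=0))      # 0.0
print(recall_score(y_true, y_pred))                          # 0.0
print(f1_score(y_true, y_pred, zero_division=0))             # 0.0
```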
After computing the features, we need to upload them to a feature store (usually a NoSQL database) so that in production we can do fast lookups by id and save space at inference time (a process we call hydration). This is especially crucial for embeddings, where the feature is actually a large matrix.
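A minimal sketch of hydration, using Redis as a stand-in for the feature store (the key layout and helper names are made up):

```python
import json

import redis  # stand-in for whichever NoSQL feature store you use

r = redis.Redis(host="feature-store", port=6379)

def write_features(listing_id: str, embedding: list) -> None:
    # Offline: the feature job persists the precomputed embedding keyed by id.
    r.set(f"listing:{listing_id}:embedding", json.dumps(embedding))

def hydrate(listing_id: str) -> list:
    # Online: the inference service receives only ids in the request and looks
    # up the heavy features here instead of carrying them over the wire.
    raw = r.get(f"listing:{listing_id}:embedding")
    return json.loads(raw) if raw else []
```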
After the Spark jobs complete, we save the parquet files to an S3 bucket so the models can later access them easily for training.
Stage Three: Model Architecture
Tools: Databricks, Tensorflow
This stage is where most ML research focuses, so I will leave out the details here and simply note that we are using the Tensorflow keras library to train embeddings by performing a supervised classification task with a transformer-based model. Since product data is very sparse, we are using SparseCategoricalCrossEntropy as the loss function. The metric for evaluating the model is NDCG, which is pretty common for recommendation systems.
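A heavily simplified sketch of this kind of setup (the layer sizes, the single attention block standing in for the transformer, and the input shapes are made-up assumptions, not our actual architecture):

```python
import tensorflow as tf

vocab_size, embed_dim, seq_len = 50_000, 64, 20   # hypothetical sizes

inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")      # a sequence of clicked item ids
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)  # the embeddings we want to learn
x = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=embed_dim)(x, x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
logits = tf.keras.layers.Dense(vocab_size)(x)                 # score every candidate item

model = tf.keras.Model(inputs, logits)
model.compile(
    optimizer="adam",
    # Sparse labels: the target is a single integer item id, not a one-hot vector.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```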
Stage Four: Model Training and Serving
Tools: BentoML, Databricks GPU instance, feature store
After we’ve trained our model, looked at the metrics, and decided it’s good enough for deployment, we now want to package the model so we can serve it as a service.
When it comes to actually deploying the model, we need to consider the use case and make decisions accordingly. For example, is it a realtime or batch prediction? Can it be deployed on the cloud, or does it need to be on edge devices for speed or privacy concerns? How much latency (SLA) and throughput (QPS) should we support? In our case, the service should generate recommendations in real time whenever the user makes a purchase. Since there is no sensitive data involved, deploying on the cloud is the preferred choice for us.
Model training (on GPUs) and serving (most likely on CPUs) are usually done on separate machines, but the code is usually tightly coupled. To increase maintainability and reusability, we keep the training code in the serving repo and reuse the data transformation code for both training and inference. The flow for training and serving the model goes as follows (a minimal serving sketch comes after the list):
- Debug the training code locally, use the dbx CLI to initiate remote training on the GPU box, and use the BentoML library to save the model (bentoml.tensorflow.keras.save)
- Commit the saved model from the GPU box to Github to trigger a build on Jenkins
- Serve the model with BentoML library (bentoml.tensorflow.keras.load)
- In the endpoint, parse the input, lookup features in the feature store (usually a NoSQL database), hit the served model, and generate output to return to the client
- Spin up the endpoint with BentoML command and now your service is up!
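The sketch below shows roughly what the serving side can look like. It assumes a BentoML 1.x-style API (which may differ from the exact save/load calls above, depending on the BentoML version), and lookup_features is a hypothetical helper backed by the feature store:

```python
# service.py - a minimal sketch, not our production code.
import bentoml
import numpy as np
from bentoml.io import JSON

runner = bentoml.keras.get("cross_sell_model:latest").to_runner()
svc = bentoml.Service("cross_sell_recommender", runners=[runner])

def lookup_features(user_id: str) -> np.ndarray:
    # Hypothetical: fetch precomputed features/embeddings from the feature store.
    return np.zeros(64, dtype="float32")  # placeholder vector

@svc.api(input=JSON(), output=JSON())
async def recommend(payload: dict) -> dict:
    features = lookup_features(payload["user_id"])               # hydration
    # The runner method name may vary by framework/version.
    scores = await runner.predict.async_run(features[None, :])   # hit the served model
    top_k = np.argsort(np.asarray(scores)[0])[::-1][:10]         # rank and keep top-k
    return {"items": top_k.tolist()}
```

Locally, this would be started with something like `bentoml serve service:svc`.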
Stage Five: Profiling
Tools: cProfile, SnakeViz
After making sure the model can be served (the key challenge lies in the format for saving and loading models, which depends on how the BentoML library supports Tensorflow or PyTorch), it is now time to optimize. In this step, we look for room to improve performance and try to eliminate bottlenecks by precomputing things.
Since we’re using Python, we use cProfile to generate the profiling file and SnakeViz to visualize the results. A single request happened to take ~70ms, with the BentoML Keras runner taking up most of the time (50ms), meaning that if we want to further decrease the latency, we need to optimize the model layers.
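A minimal sketch of this profiling step; handle_request is a stand-in for whatever endpoint code path you are measuring:

```python
import cProfile
import pstats

def handle_request(payload):
    # Stand-in for the real endpoint handler under test.
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
handle_request({"user_id": "123"})
profiler.disable()

profiler.dump_stats("inference.prof")                            # file consumed by SnakeViz
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)  # quick look in the terminal

# Then visualize in the browser with:
#   snakeviz inference.prof
```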
Runners allow BentoML to better leverage multiple threads or processes for higher hardware utilization (CPU, GPU, memory) and higher concurrency: they parallelize data extraction, data transformation, or the execution of multiple runners.
Stage Six: CI/CD Pipeline and Deployment
Tools: Jenkins, Docker, Kubernetes
Now that everything looks good locally, it is time to make things work on the cloud. To port the Python libraries to the cloud via a CI/CD pipeline (such as Jenkins), we use poetry to lock the Python packages on the local machine (pyproject.toml plus the generated poetry.lock) and a multi-stage build in the Dockerfile. The first stage uses poetry to install the Python libraries; the second stage copies the model and serving files along with the Python libraries in the virtualenv from stage one.
On the CI/CD pipeline, we use docker to build, tag, and push the image, and deploy the image on Kubernetes to serve the inference endpoint.
Stage Seven: Blazemeter Performance Test
Tools: Blazemeter
After the inference endpoint is in production, we need to decide the scaling strategy with the help of load testing on the production inference endpoint.
When testing, we start with a slow load to measure the latency of one request per Kubernetes pod. This configuration gives us an idea of how long one request takes. For example, if one request takes 70 ms to process, then in 1000 ms a pod can process roughly 14 requests. So as long as the load stays under 14 requests per second per pod, there should be no backed-up requests.
Based on this number, we can test our hypotheses on Blazemeter with different scaling strategies to tune the chart values, and see if the latency behaves as planned (a back-of-the-envelope sketch follows the list):
- HPA set to a max of 10 pods and a min of 3 pods to avoid the noisy neighbor issue
- Concurrency = 30: HPA will spin up 3 pods (10 qps per pod); with throughput/qps set to 30, no requests should pile up
- Concurrency = 30: HPA will spin up 3 pods (10 qps per pod); with throughput/qps set to 60, requests will pile up, causing the latency to double
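The back-of-the-envelope math behind those scenarios, using the numbers above:

```python
import math

latency_ms = 70                                  # measured single-request latency
per_pod_capacity_qps = 1000 / latency_ms         # ~14.3 requests per second per pod

target_qps = 30                                  # expected production throughput
pods_needed = math.ceil(target_qps / per_pod_capacity_qps)   # -> 3 pods

print(f"per-pod capacity ~{per_pod_capacity_qps:.1f} qps, pods needed: {pods_needed}")
```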
We also need to decide how many CPUs and how much memory to request for each pod to support our scaling strategy. Before testing, we faced a situation where requests were backing up and latency got very bad, yet the service wasn't scaled up because no new pods were spun up. The reason, we later found out, is that CPU usage stayed low, and the HPA on Kubernetes only spins up new pods when CPU usage reaches 60%. Because the prediction service runs CPU-bound work interleaved with blocking I/O on a single thread, CPU usage never reaches 60% even when requests are backed up, so it never auto-scales.
BentoML Workers
Besides scaling out with more pods, we can also increase parallelism within a pod itself. In BentoML, there is a concept of a “worker”, which is defined in the configuration file under “bento_server” (the default configuration can be found here).
The benefit of having multiple workers in a single pod is to save memory, since a read-only copy of the model can be shared between the workers in that pod. There are three scenarios in which we can configure the workers:
- Scenario 1: 1 BentoML asyncio worker (already supported natively by the BentoML wrapper)
- Scenario 2: 2 BentoML asyncio workers, each handling both I/O and compute: the pod uses 2 CPUs to parallelize the work, roughly one core for compute and one for I/O (most of the CPU ends up consumed by the compute)
- Scenario 3: the optimal solution, a process pool: 1 front web server handling I/O only, plus 3 workers handling compute only (CPU intensive), with the web server dispatching requests to the worker processes via a process pool
Be aware of Python’s concurrency model: a single Python process can effectively use only one core, so the I/O and the CPU-intensive work end up on the same thread, which blocks that thread for ~70ms while the model is performing a prediction. Compare this to Java’s threading model, where every thread can occupy one CPU core: with 32 cores, Java can have 32 threads processing in parallel.
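A minimal sketch of the scenario 3 pattern in plain Python (not BentoML's internals): the event loop stays free for I/O while the CPU-bound prediction runs in a small pool of worker processes.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def predict(features):
    # Stand-in for the CPU-intensive model call; running it in a separate
    # process keeps the web server's event loop (and the GIL) unblocked.
    return sum(x * x for x in features)

pool = ProcessPoolExecutor(max_workers=3)   # the "3 compute workers"

async def handle_request(features):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, predict, features)

if __name__ == "__main__":
    print(asyncio.run(handle_request([0.1, 0.2, 0.3])))
```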
Stage Eight: Monitor deployment & online metrics
Tools: Grafana, Tensorflow dashboards
When our prediction service is ready to take on some real traffic, it is important to consider the deployment cases and avoid deploying without a ramp-up or monitoring. Common deployment cases include shadow mode, canary deployments, and blue-green deployments.
After the model is live and taking prediction traffic, tracking data drift and concept drift so we can re-evaluate and retrain the model is crucial to complete the ML life cycle, so the model does not go obsolete.
When the model is online, make sure the inference endpoint is passing in the data the model expects. Otherwise it could lead to severe model decay or misalignment between the offline and online models.
Input Drift Detection: feature or concept drift
To be able to react in a timely manner, model behavior should be monitored based solely on the feature values of the incoming data, without waiting for the ground truth to be available.
The logic is that if the data distribution (e.g., mean, standard deviation, correlations between features) diverges between the training and testing phases on one side and the production phase on the other, it is a strong signal that the model’s performance won’t be the same. It is not the perfect mitigation measure, as retraining on the drifted dataset is not yet an option (the ground truth is not available), but it can be part of the mitigation measures (e.g., reverting to a simpler model, re-weighting).
After you push the system to deployment, the question is: will the data change, or after you’ve run it for a few weeks or months, has the data changed yet again? When the data changes, the performance of the prediction system can degrade. It’s important to recognize how the data has changed and whether you need to update your learning algorithm as a result. Data changes in real life, and it can happen slowly or suddenly.
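One common way to detect such input drift (a general technique, not necessarily what our monitoring stack does) is a two-sample statistical test per feature between the training data and the data seen in production; a minimal sketch with a Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)   # feature values at training time
live_feature = rng.normal(loc=0.3, scale=1.0, size=10_000)    # (simulated) drifted production values

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"possible input drift (KS statistic = {stat:.3f}); investigate or retrain")
```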
Stage Nine: Data pipeline and Retraining
Tools: online metrics, Databricks
To complete the full ML project cycle, an important part of keeping the model learning is retraining. It is possible that we will see online metrics showing data or concept drift away from the offline metrics. The key to correcting the drift is to construct and track online metrics that can reflect the drift effectively.
When the model is live, we can collect online metrics from users’ interactions with the model. We can schedule Spark jobs (on Airflow or Databricks) to run the data pipeline (say weekly or monthly) to generate more up-to-date training data from the source and retrain on it.
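A minimal sketch of such a schedule, using an Airflow 2.x-style DAG (the DAG id, task names, and the two callables are hypothetical placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def build_training_data():
    ...  # e.g., trigger the Spark job that regenerates features from the latest click stream

def retrain_model():
    ...  # e.g., kick off training on the GPU instance and save the new model

with DAG(
    dag_id="recs_weekly_retrain",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    build = PythonOperator(task_id="build_training_data", python_callable=build_training_data)
    train = PythonOperator(task_id="retrain_model", python_callable=retrain_model)
    build >> train
```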