Machine Learning Operations (MLOps) is an interdisciplinary blend of practices from DevOps, Data Science, and Data Engineering. It’s about the lifecycle of ML systems: data, training, inference, monitoring, and much more.
Fig. 1: Illustration of MLOps as the intersection of disciplines
What exactly does an MLOps specialist do?
It depends… ML products evolve over several months: they start with data-science-oriented analyses and experiments and mature into a robust, continuously improved system. The example below describes a real-time system; some systems process batches instead, which requires corresponding architectural adjustments.
Let’s walk through the phases and what MLOps specialists do in our team!
Phase 1: Exploration
In Phase 1, our team doesn’t have a product yet. We’ve identified a use case where ML could solve a problem or replace non-ML algorithms. An example could be automating or personalizing the arrangement of manually sorted pages.
In this exploration phase, you, as an MLOps specialist, assist Data Scientists in extracting, transforming, and loading (storing) data (ETL). You might analyze the data to identify “dirty” records that need cleanup. Working closely with Data Scientists, you discuss which data is reliable and useful for our case. You should also be able to foresee issues that may arise as the system matures, such as switching from historical data to real-time data capture.
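To make that concrete, here is a minimal sketch of what such a cleanup step could look like, assuming pandas; the column names and rules are entirely made up for illustration:

```python
import pandas as pd

def clean_click_events(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleanup of raw event data before handing it to Data Science.

    Column names are hypothetical; the point is dropping obviously broken rows
    and normalizing types early in the ETL step.
    """
    df = raw.copy()
    # Drop rows without a user or page reference -- they cannot be joined later.
    df = df.dropna(subset=["user_id", "page_id"])
    # Remove duplicate events that the tracking sometimes emits.
    df = df.drop_duplicates(subset=["user_id", "page_id", "timestamp"])
    # Parse timestamps and discard rows we cannot interpret.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df = df[df["timestamp"].notna()]
    return df
```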
At OTTO, depending on the team, we use shared Data Science platforms or custom solutions. You might therefore provide an environment for Data Scientists to work and collaborate in, such as a Git repository or self-hosted runners for model training.
To summarize Phase 1: a lot of the work involves Data Engineering, Ops, and data discussions.
Fig. 2: Tasks by discipline in Phase 1 “Exploration”
On the data development side (beige), we handle data procurement, transformation, and storage, often controlled by a workflow management tool like Airflow. The Data Science tasks (red) involve trying out and iterating models until we find a promising approach.
There’s a constant feedback loop between disciplines and roles for mutual support.
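As a rough idea of the kind of workflow Airflow orchestrates here, a hypothetical daily ETL DAG might look like the sketch below. Task bodies and paths are placeholders, and the exact decorator API depends on the Airflow version in use:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def exploration_etl():
    """Hypothetical daily ETL feeding the exploration data set."""

    @task
    def extract() -> str:
        # Pull raw events from the source system and stage them (path is a placeholder).
        return "s3://some-bucket/raw/events.parquet"

    @task
    def transform(raw_path: str) -> str:
        # Clean and reshape the data as discussed with Data Science.
        return "s3://some-bucket/prepared/events.parquet"

    @task
    def load(prepared_path: str) -> None:
        # Store the result where the Data Scientists can reach it.
        ...

    load(transform(extract()))

exploration_etl()
```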
Phase 2: Going Live
After the exploration, we move directly to testing and proof of concept (POC).
Our team should now have a trained model that can be exported in whatever form is suitable for hosting. The training might already have been automated in Phase 1 to help the team test faster.
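How the model is exported depends entirely on the framework; as one possible sketch, a scikit-learn-style model could simply be serialized with joblib (the path layout and versioning scheme are made up):

```python
from pathlib import Path
import joblib  # assumes a scikit-learn-style model; other frameworks ship their own exporters

def export_model(model, version: str, target_dir: str = "models") -> Path:
    """Serialize a trained model so the inference service can load it later."""
    path = Path(target_dir) / f"model-{version}.joblib"
    path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path)
    return path
```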
Next, we implement live inference. We focus on the inference service and on feeding it real-time data rather than on perfecting the model validation cycle. Depending on latency requirements, we choose the appropriate technology for inference (e.g., a ready-made solution like NVIDIA Triton or a custom solution that handles multiprocessing well).
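If a custom solution is chosen, a minimal inference endpoint could be sketched like this, assuming FastAPI and a joblib-serialized model; the feature schema and artifact path are placeholders:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/model-latest.joblib")  # hypothetical artifact path

class Features(BaseModel):
    # Placeholder feature schema -- the real one mirrors the training data.
    values: list[float]

@app.post("/predict")
def predict(features: Features) -> dict:
    # Single-item prediction; batching or multiprocessing would be added for throughput.
    prediction = model.predict([features.values])[0]
    return {"prediction": float(prediction)}
```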
Monitoring is crucial to verify our POC in an A-B test. Tools like Grafana combined with Prometheus are ideal for most of our teams. You’ll collect and visualize information on our model’s performance, such as user interactions with the model’s output, the size and speed of our inference, and key performance indicators (KPIs).
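On the metrics side, here is a sketch of what exposing inference metrics to Prometheus could look like, using the prometheus_client library; the metric names are illustrative, and Grafana dashboards would be built on top of them:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names for the inference service.
PREDICTIONS = Counter("inference_predictions_total", "Number of predictions served")
LATENCY = Histogram("inference_latency_seconds", "Time spent per prediction")

def predict_with_metrics(model, features):
    with LATENCY.time():   # record how long inference takes
        result = model.predict([features])[0]
    PREDICTIONS.inc()      # count every served prediction
    return result

# Expose the metrics on a port that Prometheus scrapes.
start_http_server(8000)
```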
CI/CD is used where it aids us. Manual steps may still exist in this phase if they don’t bottleneck our POC creation.
To conclude Phase 2: We focus on gathering real feedback for our model as quickly as possible. There’s no need for perfect infrastructure yet, as we might start over if the model doesn’t meet our expectations.
Fig. 3: Tasks by discipline in Phase 2: Going Live
Phase 2 introduces many Ops tasks. CI/CD helps automate our system for reliability. The actual inference server needs to be added and monitored. Data extraction, previously on-demand, must now operate on real-time or production batch data. Monitoring the data to see if our features change is also important.
Phase 3: Improving and Maintaining the Product
Congratulations, we now have a product! Our A-B test was successful, and we can implement our model permanently.
In Phase 3, we balance maintaining the old model (monitoring, validation, training) while improving our product. Improvement can mean anything from tweaking some hyperparameters to creating a new model for a similar problem in our team’s domain.
First, we focus on a sophisticated ML lifecycle. Some parts are already automated; we handle the rest. The goal is a workflow or pipeline that retrieves and prepares data, trains a model, evaluates it against the current live model, and deploys it (Figure 4).
Fig. 4: Simplified ML Pipeline
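Expressed as plain Python, the pipeline from Figure 4 boils down to something like the sketch below; every step function is a placeholder for team-specific logic, so this only shows the orchestration and the promotion decision:

```python
from typing import Any, Tuple

# Placeholder steps -- the concrete implementations are team-specific.
def extract_data() -> Any: ...
def prepare(raw: Any) -> Tuple[Any, Any]: ...
def train_model(train_set: Any) -> Any: ...
def load_live_model() -> Any: ...
def evaluate(model: Any, eval_set: Any) -> float: ...
def deploy(model: Any) -> None: ...

def run_training_pipeline() -> None:
    """One pass through the pipeline from Figure 4."""
    raw = extract_data()                    # retrieve data
    train_set, eval_set = prepare(raw)      # prepare / split data
    candidate = train_model(train_set)      # train a candidate model

    # Evaluate the candidate against the model currently serving traffic
    # and only deploy it if it actually performs better.
    if evaluate(candidate, eval_set) > evaluate(load_live_model(), eval_set):
        deploy(candidate)
```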
Great! Once the pipeline is complete, our system is automated, but it’s not yet a lifecycle. We need a pipeline trigger. Ideally, we can detect when our model needs retraining (e.g., data shifts or inaccurate predictions) and initiate our pipeline under such conditions. A well-estimated time-based trigger is a good start if monitoring isn’t that advanced.
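Such a data-shift trigger can start out very simple. The sketch below compares the distribution of a single feature at training time against recent live data using a population stability index; the binning and the threshold are illustrative, not a recommendation:

```python
import numpy as np

def should_retrain(reference: np.ndarray, current: np.ndarray, threshold: float = 0.2) -> bool:
    """Crude drift check on one feature: retrain when the population stability
    index (PSI) between training-time data and recent live data gets too large."""
    bins = np.histogram_bin_edges(reference, bins=10)
    ref_hist, _ = np.histogram(reference, bins=bins)
    cur_hist, _ = np.histogram(current, bins=bins)
    # Normalize to proportions; a small epsilon avoids division by zero.
    eps = 1e-6
    ref_p = ref_hist / max(ref_hist.sum(), 1) + eps
    cur_p = cur_hist / max(cur_hist.sum(), 1) + eps
    psi = float(np.sum((cur_p - ref_p) * np.log(cur_p / ref_p)))
    return psi > threshold

# Hypothetical usage:
# if should_retrain(training_feature_sample, last_week_feature_sample):
#     trigger_pipeline()   # e.g. start the training pipeline described above
```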