Use MLflow with Apache Airflow | Astronomer Documentation

The MLflow Airflow provider has been deprecated and is no longer maintained. This tutorial was kept for reference purposes only.

MLflow is a popular tool for tracking and managing machine learning models. It can be used together with Airflow for ML orchestration (MLOps), leveraging both tools for what they do best. In this tutorial, you’ll learn about three different ways you can use MLflow with Airflow.

Three ways to use MLflow with Airflow

The DAG in this tutorial shows three different ways Airflow can interact with MLflow:

Use an MLflow operator from the MLflow Airflow provider. The MLflow provider contains several operators that abstract over common actions you might want to perform in MLflow, such as creating a deployment with the CreateDeploymentOperator or running predictions from an existing model with the ModelLoadAndPredictOperator.
Use an MLflow hook from the MLflow Airflow provider. The MLflow provider contains several Airflow hooks that allow you to connect to MLflow using credentials stored in an Airflow connection. You can use these hooks if you need to perform actions in MLflow for which no dedicated operator exists. You can also use these hooks to create your own custom operators.
Use the MLflow Python package directly in a @task decorated task. The MLflow Python package contains functionality like tracking metrics and artifacts with mlflow.sklearn.autolog. You can use this package to write custom Airflow tasks for ML-related actions like feature engineering.

Time to complete

This tutorial takes approximately 30 minutes to complete.

Assumed knowledge

To get the most out of this tutorial, make sure you have an understanding of:

The basics of MLflow. See MLflow Concepts.
Airflow fundamentals, such as writing DAGs and defining tasks. See Get started with Apache Airflow.
Airflow operators. See Operators 101.
Airflow hooks. See Hooks 101.
Airflow connections. See Managing your Connections in Apache Airflow.

Prerequisites

The Astro CLI.
An MLflow instance. This tutorial uses a local instance.
An object storage connected to your MLflow instance. This tutorial uses MinIO.

Step 1: Configure your Astro project

Create a new Astro project:

$ mkdir astro-mlflow-tutorial && cd astro-mlflow-tutorial
$ astro dev init

Add the following packages to your packages.txt file:

git
gcc
python3-dev

Add the following packages to your requirements.txt file:

airflow-provider-mlflow==1.1.0
mlflow-skinny==2.3.2

Step 2: Configure your Airflow connection

To connect Airflow to your MLflow instance, you need to create a connection in Airflow.

Run astro dev start in your Astro project to start up Airflow and open the Airflow UI at localhost:8080.
In the Airflow UI, go to Admin -> Connections and click +.
Create a new connection named mlflow_default and choose the HTTP connection type. Enter the following values to create a connection to a local MLflow instance:
- Connection ID: mlflow_default
- Connection Type: HTTP
- Host: http://host.docker.internal
- Port: 5000

If you are using a remote MLflow instance, enter your MLflow instance URL as the Host and your username and password as the Login and Password. If you are running MLflow via Databricks, enter your Databricks URL as the Host, token as the Login, and your Databricks personal access token as the Password. Note that the Test button in the Airflow UI might return a 405 error even if your credentials are correct.
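As an alternative to the UI, Airflow can also read connections from environment variables named AIRFLOW_CONN_<CONN_ID>. The sketch below builds such a variable for the local connection described above. It assumes an Airflow version that accepts JSON-formatted connection variables (2.3 or later) and that you place the resulting line in your Astro project's .env file; neither assumption is part of the original tutorial.

```python
import json

# Values mirror the connection form above; adjust host and port for your setup.
mlflow_conn = {
    "conn_type": "http",
    "host": "http://host.docker.internal",
    "port": 5000,
}

# Airflow looks up the connection ID "mlflow_default" under the
# environment variable AIRFLOW_CONN_MLFLOW_DEFAULT (ID in upper case).
env_line = "AIRFLOW_CONN_MLFLOW_DEFAULT='" + json.dumps(mlflow_conn) + "'"

# Print the line so it can be copied into a .env file.
print(env_line)
```

Connections defined this way do not appear in the Admin -> Connections list, so the UI steps remain the easier option if you want to inspect the connection later.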

Step 3: Create your DAG

In your dags folder, create a file called mlflow_tutorial_dag.py.

Copy the following code into the file. Make sure to provide the name of a bucket in your object storage that is connected to your MLflow instance to the ARTIFACT_BUCKET variable.

"""
### Show three ways to use MLflow with Airflow

This DAG shows how you can use the MLflowClientHook to create an experiment in MLflow,
directly log metrics and parameters to MLflow in a TaskFlow task via the mlflow Python package, and
create a new model using the CreateRegisteredModelOperator of the MLflow Airflow provider package.
"""

from airflow.decorators import dag, task
from pendulum import datetime
from mlflow_provider.hooks.client import MLflowClientHook
from mlflow_provider.operators.registry import CreateRegisteredModelOperator

# Adjust these parameters
# ID of the MLflow experiment that the Scaler run is logged to (MLflow expects a string)
EXPERIMENT_ID = "1"
ARTIFACT_BUCKET = "<your-bucket-name>"

## MLflow parameters
MLFLOW_CONN_ID = "mlflow_default"
EXPERIMENT_NAME = "Housing"
REGISTERED_MODEL_NAME = "my_model"


@dag(
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
)
def mlflow_tutorial_dag():
    # 1. Use a hook from the MLflow provider to interact with MLflow within a TaskFlow task
    @task
    def create_experiment(experiment_name, artifact_bucket, **context):
        """Create a new MLflow experiment with a specified name.
        Save artifacts to the specified S3 bucket."""

        # Prefix the experiment name with the run's timestamp to keep it unique
        ts = context["ts"]

        mlflow_hook = MLflowClientHook(mlflow_conn_id=MLFLOW_CONN_ID)
        new_experiment_information = mlflow_hook.run(
            endpoint="api/2.0/mlflow/experiments/create",
            request_params={
                "name": ts + "_" + experiment_name,
                "artifact_location": f"s3://{artifact_bucket}/",
            },
        ).json()

        return new_experiment_information

    # 2. Use mlflow.sklearn autologging in a TaskFlow task
    @task
    def scale_features(experiment_id: str):
        """Track feature scaling by sklearn in MLflow."""
        from sklearn.datasets import fetch_california_housing
        from sklearn.preprocessing import StandardScaler
        import mlflow
        import pandas as pd

        df = fetch_california_housing(download_if_missing=True, as_frame=True).frame

        mlflow.sklearn.autolog()

        target = "MedHouseVal"
        X = df.drop(target, axis=1)
        y = df[target]

        scaler = StandardScaler()

        with mlflow.start_run(experiment_id=experiment_id, run_name="Scaler"):
            X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
            # Log the fitted scaler as a model artifact
            mlflow.sklearn.log_model(scaler, artifact_path="scaler")
            # Log the mean of each feature as a metric named after the column
            mlflow.log_metrics(pd.DataFrame(scaler.mean_, index=X.columns)[0].to_dict())

        # Reattach the target column; the result is not used further in this tutorial
        X[target] = y

    # 3. Use an operator from the MLflow provider to interact with MLflow directly
    create_registered_model = CreateRegisteredModelOperator(
        task_id="create_registered_model",
        name="{{ ts }}" + "_" + REGISTERED_MODEL_NAME,
        tags=[
            {"key": "model_type", "value": "regression"},
            {"key": "data", "value": "housing"},
        ],
    )

    (
        create_experiment(
            experiment_name=EXPERIMENT_NAME, artifact_bucket=ARTIFACT_BUCKET
        )
        >> scale_features(experiment_id=EXPERIMENT_ID)
        >> create_registered_model
    )


mlflow_tutorial_dag()

This DAG consists of three tasks, each showing a different way to use MLflow with Airflow.

The create_experiment task creates a new experiment in MLflow by using the MLflowClientHook in a TaskFlow API task. The MLflowClientHook is one of several hooks in the MLflow provider that contain abstractions over calls to the MLflow API.
The scale_features task uses the mlflow package in a @task decorated task with scikit-learn to log information about the scaler to MLflow. This functionality is not included in any modules of the MLflow provider, so a custom Python function is the best way to implement this task.
The create_registered_model task uses the CreateRegisteredModelOperator to register a new model in your MLflow instance.
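To make the hook call in create_experiment more concrete, the sketch below builds the equivalent HTTP request with the Python standard library, without sending it. The endpoint and payload keys come from the DAG above. The helper name build_create_experiment_request, the base URL, the bucket name, and the sample ts value are illustrative assumptions, and the POST method is taken from the MLflow REST API rather than from the provider's source.

```python
import json
import urllib.request


def build_create_experiment_request(base_url, experiment_name, artifact_bucket, ts):
    """Build (but do not send) a request to MLflow's create-experiment endpoint."""
    payload = {
        # Same naming scheme as the DAG: timestamp, underscore, experiment name
        "name": ts + "_" + experiment_name,
        "artifact_location": f"s3://{artifact_bucket}/",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/2.0/mlflow/experiments/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_create_experiment_request(
    base_url="http://localhost:5000",  # assumed local MLflow instance
    experiment_name="Housing",
    artifact_bucket="my-bucket",  # hypothetical bucket name
    ts="2023-01-01T00:00:00+00:00",  # example value of Airflow's ts variable
)

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

According to the MLflow REST API reference, the response to this call includes an experiment_id field. One possible refinement of the DAG would be to pass that value to scale_features instead of the hardcoded EXPERIMENT_ID.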

Step 4: Run your DAG

In the Airflow UI, run the mlflow_tutorial_dag DAG by clicking the play button.
Open the MLflow UI (at localhost:5000 if you are running MLflow locally) to see the data recorded by each task in your DAG.

The create_experiment task created the Housing experiment, where your Scaler run from the scale_features task was recorded.

The create_registered_model task created a registered model with two tags.
Open your object storage (at localhost:9001 if you are using a local MinIO instance) to see your MLflow artifacts.
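When you inspect the Scaler run, you should find one metric per feature column, because scale_features logs the fitted scaler's per-feature means. The following standard-library sketch illustrates what those values represent and how standard scaling uses them. The toy numbers are made up for illustration, and the code only mirrors the idea behind scikit-learn's StandardScaler (mean removal and division by the population standard deviation), not its implementation.

```python
from statistics import fmean, pstdev

# Hypothetical toy data standing in for two columns of the housing dataset
columns = {
    "MedInc": [2.0, 4.0, 6.0],
    "HouseAge": [10.0, 20.0, 30.0],
}

# Per-feature means: the kind of name-to-value mapping the task logs as metrics
means = {name: fmean(values) for name, values in columns.items()}

# Standard scaling: subtract the mean, divide by the population standard deviation
scaled = {
    name: [(value - means[name]) / pstdev(values) for value in values]
    for name, values in columns.items()
}

print(means)
print(scaled["MedInc"])
```

After scaling, each column has a mean of zero and unit variance, which is why the original means are worth recording alongside the scaler artifact.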

Conclusion

Congratulations! You used MLflow and Airflow together in three different ways. Learn more about other operators and hooks in the MLflow Airflow provider in the official GitHub repository.
