ELT with Apache Airflow® and Databricks
The ELT with Apache Airflow® and Databricks GitHub repository is a free, open-source reference architecture. It shows how to use Apache Airflow® to copy synthetic data about a green energy initiative from an S3 bucket into a Databricks table, and then to analyze that data by running several Databricks notebooks as a Databricks job created by Airflow. The demo was showcased in the webinar How to Orchestrate Databricks Jobs Using Airflow.
Databricks is a popular unified data and analytics platform built around fully managed Apache Spark clusters. Using the Airflow Databricks provider package, you can define Databricks notebooks as tasks inside a task group in your Airflow DAG, and Airflow creates a single Databricks job from that task group. This lets you combine Airflow’s orchestration features with Databricks Workflows, which runs jobs on Databricks’ cheapest compute.
For more detailed instructions on using the Airflow Databricks provider, see our Orchestrate Databricks jobs with Airflow tutorial.
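To make the idea of "several notebooks running as one Databricks job" concrete, here is a minimal sketch of a multi-task job specification, built with the Python standard library only. The field names follow the Databricks Jobs API 2.1 as we understand it. The job name, notebook paths, cluster key, Spark version, and node type are hypothetical placeholders and are not taken from the repository.

```python
import json

# Hypothetical key for a job cluster shared by all notebook tasks.
CLUSTER_KEY = "shared_job_cluster"


def notebook_task(task_key, notebook_path, depends_on=()):
    """Build one notebook task entry for a multi-task Databricks job."""
    task = {
        "task_key": task_key,
        "notebook_task": {"notebook_path": notebook_path, "source": "WORKSPACE"},
        "job_cluster_key": CLUSTER_KEY,
    }
    if depends_on:
        # Upstream tasks are referenced by their task_key.
        task["depends_on"] = [{"task_key": key} for key in depends_on]
    return task


def build_job_spec(name, tasks):
    """Wrap the tasks and a shared job cluster into a single job specification."""
    return {
        "name": name,
        "job_clusters": [
            {
                "job_cluster_key": CLUSTER_KEY,
                "new_cluster": {
                    # Placeholder values; use what your workspace supports.
                    "spark_version": "15.4.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 1,
                },
            }
        ],
        "tasks": tasks,
    }


# Hypothetical notebooks: one preparation step followed by two analyses.
job_spec = build_job_spec(
    name="green_energy_analysis",
    tasks=[
        notebook_task("prepare_data", "/Shared/elt_demo/prepare_data"),
        notebook_task("analyze_solar", "/Shared/elt_demo/analyze_solar", ["prepare_data"]),
        notebook_task("analyze_wind", "/Shared/elt_demo/analyze_wind", ["prepare_data"]),
    ],
)

print(json.dumps(job_spec, indent=2))
```

With the Airflow Databricks provider you would not normally write this specification by hand. Classes such as `DatabricksWorkflowTaskGroup` and `DatabricksNotebookOperator` let you declare the notebooks as Airflow tasks, and the provider assembles a comparable job definition and submits it to Databricks for you.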


This reference architecture consists of 3 main components:
The DAGs in this reference architecture highlight several key Airflow best practices and features:
If you’d like to build your own ELT/ETL pipeline with Databricks and Apache Airflow®, feel free to fork the repository and adapt it to your use case. We recommend deploying the Airflow pipelines using a free trial of Astro.