Datacracy


Before diving into the integration steps, let’s take a quick look at the technologies we’ll be using.

  • **Apache Airflow**: An open-source platform used to programmatically author, schedule, and monitor workflows.
  • **Databricks**: A unified data analytics platform that provides a Spark-based processing engine, allowing you to
    build scalable data pipelines.
  • **dbt (data build tool)**: A tool that enables data analysts and engineers to transform data in a warehouse via
    SQL, with the added benefit of validating data quality through testing and documentation.

By combining these tools, you can orchestrate a full end-to-end data pipeline in which data is processed in Databricks, and dbt is used to validate the quality of that data.

Setting Up Airflow

Start by installing and setting up Apache Airflow. If you already have it running, you can skip this section.

  • Install Airflow via pip: pip install apache-airflow.
  • Initialize the database, then start the webserver and the scheduler:
    airflow db init
    airflow webserver
    airflow scheduler

Next, configure Airflow to work with Databricks.

Airflow Configuration for Databricks

  • Install the Databricks integration package:
    pip install apache-airflow-providers-databricks
  • Configure the Databricks connection in Airflow’s UI:
    Go to the Airflow Admin tab → Connections → Add a new connection.
    Set **Conn Type** to Databricks.
    Add your **Databricks host URL** and **token**.
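If you prefer the command line over the UI, the same connection can be created with the Airflow CLI. This is a sketch: the connection name matches the one used in the DAG below, but the workspace URL and token are placeholders you must replace with your own values. (The Databricks provider reads the personal access token from the connection’s password field.)

```shell
# Create the Databricks connection used by the DAG below.
# Host URL and token are placeholders, not real values.
airflow connections add 'databricks_default' \
    --conn-type 'databricks' \
    --conn-host 'https://<your-workspace>.cloud.databricks.com' \
    --conn-password '<your-databricks-personal-access-token>'
```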

Running a Databricks Job from Airflow

To orchestrate a Databricks job in Airflow, you can use the DatabricksSubmitRunOperator. Here’s a simple example of a
DAG that triggers a Databricks job.


from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago
# Define Databricks job parameters
DATABRICKS_TASK = {
    'new_cluster': {
        'spark_version': '7.3.x-scala2.12',
        'node_type_id': 'i3.xlarge',
        'num_workers': 2,
    },
    'notebook_task': {
        'notebook_path': '/Shared/your_notebook_path',
    },
}
default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
}
# Define the DAG
with DAG('databricks_job',
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False) as dag:
    # Submit the Databricks job using the cluster and notebook defined above
    run_databricks_job = DatabricksSubmitRunOperator(
        task_id='run_databricks_job',
        json=DATABRICKS_TASK,
        databricks_conn_id='databricks_default',
    )
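To complete the end-to-end pipeline described in the introduction, a dbt step can run after the Databricks job to validate its output. The commands below are a minimal sketch, assuming a dbt project at a hypothetical path; in Airflow they could be wrapped in a BashOperator task scheduled downstream of the Databricks task.

```shell
# Hypothetical dbt project location -- adjust to wherever your project lives.
DBT_PROJECT_DIR=/opt/airflow/dbt_project

# Build the models on top of the tables the Databricks job produced...
dbt run --project-dir "$DBT_PROJECT_DIR"

# ...then run dbt's data-quality tests against them.
dbt test --project-dir "$DBT_PROJECT_DIR"
```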
