Before diving into the integration steps, let’s take a quick look at the technologies we’ll be using.
By combining these tools, you can orchestrate a full end-to-end data pipeline in which Databricks processes the data and dbt validates its quality.
Start by installing and setting up Apache Airflow. If you already have it running, you can skip this section.
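If you are starting from scratch, a minimal local setup might look like the following. The package names assume Airflow 2.x with the Databricks provider; pin versions as appropriate for your environment.

```shell
# Install Airflow and the Databricks provider (versions are illustrative)
pip install apache-airflow apache-airflow-providers-databricks

# Initialize the metadata database and start a local all-in-one instance
airflow db init
airflow standalone
```

`airflow standalone` is convenient for local development; in production you would run the scheduler and webserver as separate services.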
To orchestrate a Databricks job in Airflow, you can use the DatabricksSubmitRunOperator. Here’s a simple example of a DAG that triggers a Databricks job.
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago

# Define the Databricks job: an ephemeral cluster plus the notebook to run
DATABRICKS_TASK = {
    'new_cluster': {
        'spark_version': '7.3.x-scala2.12',
        'node_type_id': 'i3.xlarge',
        'num_workers': 2,
    },
    'notebook_task': {
        'notebook_path': '/Shared/your_notebook_path',
    },
}

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
}

# Define the DAG
with DAG('databricks_job',
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False) as dag:

    # Submit the Databricks job via the Jobs Runs Submit API
    run_databricks_job = DatabricksSubmitRunOperator(
        task_id='run_databricks_job',
        json=DATABRICKS_TASK,
        databricks_conn_id='databricks_default',
    )

    run_databricks_job