Example Airflow DAG and Spark Job for Google Cloud Dataproc
This example demonstrates basic Airflow functionality for managing Dataproc clusters and Spark jobs. It also demonstrates usage of the BigQuery Spark Connector.
- Uses Airflow DataProcHook and Google Python API to check for existence of a Dataproc cluster
- Uses Airflow BranchPythonOperator to decide whether to create a Dataproc cluster
- Uses Airflow DataprocClusterCreateOperator to create a Dataproc cluster
- Uses Airflow DataProcSparkOperator to launch a Spark job
- Uses Airflow DataprocClusterDeleteOperator to delete the Dataproc cluster
- Creates a Spark Dataset from data loaded from a BigQuery table by the BigQuery Spark Connector (the overall DAG flow is sketched below)
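
The DAG wires these pieces together roughly as follows. This is a minimal sketch, assuming Airflow 1.x contrib operators and hooks (`DataProcHook`, `BranchPythonOperator`, `DataprocClusterCreateOperator`, `DataProcSparkOperator`, `DataprocClusterDeleteOperator`); the cluster name, task ids, Spark main class, and jar filename are placeholders rather than values taken from this repo, and parameter names may differ slightly across Airflow versions.

```python
# Minimal sketch of the check -> (maybe create) -> submit -> delete flow.
# Assumes Airflow 1.x contrib imports; cluster name, task ids, main class,
# and jar name below are placeholders.
from datetime import datetime, timedelta

from airflow import DAG, models
from airflow.contrib.hooks.gcp_dataproc_hook import DataProcHook
from airflow.contrib.operators.dataproc_operator import (
    DataprocClusterCreateOperator,
    DataprocClusterDeleteOperator,
    DataProcSparkOperator,
)
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.utils.trigger_rule import TriggerRule

PROJECT = models.Variable.get('project')
REGION = models.Variable.get('region')
CLUSTER_NAME = 'dataproc-example-cluster'  # placeholder


def choose_branch():
    """Return the id of the task to run next: create the cluster only if missing."""
    hook = DataProcHook()
    try:
        # Look the cluster up via the Google API client returned by the hook.
        hook.get_conn().projects().regions().clusters().get(
            projectId=PROJECT, region=REGION, clusterName=CLUSTER_NAME,
        ).execute()
        return 'cluster_exists'
    except Exception:
        return 'create_cluster'


default_args = {'owner': 'airflow', 'start_date': datetime(2018, 1, 1),
                'retries': 1, 'retry_delay': timedelta(minutes=5)}

with DAG('dataproc_example', default_args=default_args,
         schedule_interval=None) as dag:

    check_cluster = BranchPythonOperator(
        task_id='check_cluster', python_callable=choose_branch)

    cluster_exists = DummyOperator(task_id='cluster_exists')

    create_cluster = DataprocClusterCreateOperator(
        task_id='create_cluster',
        project_id=PROJECT,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        num_workers=2,
        subnetwork_uri=models.Variable.get('subnet'),  # short subnet id
        storage_bucket=models.Variable.get('bucket'),
    )

    submit_job = DataProcSparkOperator(
        task_id='submit_spark_job',
        region=REGION,
        cluster_name=CLUSTER_NAME,
        main_class='com.example.spark.BigQueryExample',  # placeholder class
        dataproc_spark_jars=[
            models.Variable.get('jarPrefix') + '/spark-example-assembly.jar'],
        # Run whether the cluster was just created or already existed.
        trigger_rule=TriggerRule.ONE_SUCCESS,
    )

    delete_cluster = DataprocClusterDeleteOperator(
        task_id='delete_cluster',
        project_id=PROJECT,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        # Tear the cluster down even if the Spark job failed.
        trigger_rule=TriggerRule.ALL_DONE,
    )

    check_cluster >> create_cluster >> submit_job
    check_cluster >> cluster_exists >> submit_job
    submit_job >> delete_cluster
```

The trigger rules carry the branching logic through: the submit task runs once either branch succeeds, and the delete task runs regardless of the job's outcome so the cluster is always torn down.
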
- Run `sbt assembly` in `spark-example` to create an assembly jar
- Upload the assembly jar to GCS
- Add `dataproc_dag.py` to your dags directory (`/home/airflow/airflow/dags/` on your Airflow server, or your dags directory in GCS if using Cloud Composer)
- In the Airflow UI, set the following variables (the sketch after the table shows how the DAG reads them):

| Variable  | Description |
|-----------|-------------|
| project   | GCP project id |
| region    | GCP region ('us-central1') |
| subnet    | VPC subnet id (short id, not the full uri) |
| bucket    | GCS bucket |
| prefix    | GCS prefix ('/dataproc_example') |
| dataset   | BigQuery dataset |
| table     | BigQuery table |
| jarPrefix | GCS prefix where you uploaded the assembly jar |
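
For reference, a DAG like this typically reads those Variables at parse time and passes the BigQuery and GCS settings on to the Spark job. The snippet below is only an illustrative sketch of that hand-off; the jar filename and the argument list are assumptions, not the repo's actual contract.

```python
from airflow import models

project = models.Variable.get('project')        # GCP project id
region = models.Variable.get('region')          # e.g. 'us-central1'
subnet = models.Variable.get('subnet')          # short VPC subnet id
bucket = models.Variable.get('bucket')          # GCS bucket
prefix = models.Variable.get('prefix')          # e.g. '/dataproc_example'
dataset = models.Variable.get('dataset')        # BigQuery dataset
table = models.Variable.get('table')            # BigQuery table
jar_prefix = models.Variable.get('jarPrefix')   # GCS path of the assembly jar

# Illustrative hand-off to DataProcSparkOperator: the jar to run and the
# arguments the Spark job would parse (names and order are placeholders).
spark_jar = jar_prefix + '/spark-example-assembly.jar'
spark_args = [project, dataset, table, bucket + prefix]
```
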
Apache License 2.0