Usage

Monitoring

The managed Airflow instances are automatically configured to export Prometheus metrics. See Monitoring for more details.

Configuration & Environment Overrides

The cluster definition also supports overriding configuration properties and environment variables, either per role or per role group, where the more specific override (role group) has precedence over the less specific one (role).

Overriding certain properties that are set by the operator (such as the HTTP port) can interfere with the operator and lead to problems. Additionally, for Airflow it is recommended that each component has the same configuration: not all components use each setting, but some things - such as external end-points - need to be consistent for things to work as expected.

Configuration Properties

Airflow exposes an environment variable for every Airflow configuration setting, a list of which can be found in the Configuration Reference.
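The naming scheme behind these environment variables can be sketched in a few lines. Airflow derives the variable name from the configuration section and key as `AIRFLOW__{SECTION}__{KEY}` (double underscores, upper case). The helper name `airflow_env_var` is hypothetical and only used for illustration:

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name for an Airflow config option.

    Hypothetical helper: Airflow itself only defines the naming scheme
    AIRFLOW__{SECTION}__{KEY}; it does not ship this function.
    """
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# The two settings used elsewhere on this page.
print(airflow_env_var("webserver", "auto_refresh_interval"))
print(airflow_env_var("core", "dags_folder"))
```

The resulting names are the keys that go under `envOverrides` in the cluster definition.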

Although Kubernetes can override these settings in one of two ways (configuration overrides or environment variable overrides), the effect is the same, and currently only the latter is implemented. This is described in the following section.

Environment Variables

These can be set - or overwritten - at either the role level:

webservers:
  envOverrides:
    AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "8"
  roleGroups:
    default:
      replicas: 1

Or per role group:

webservers:
  roleGroups:
    default:
      envOverrides:
        AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL: "8"
      replicas: 1

In both examples above we are replacing the default value of the UI DAG refresh (3s) with 8s. Note that all override property values must be strings.
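The precedence rule (role group over role) and the string-only requirement can be illustrated with a minimal sketch. This is not the operator's implementation; the function name `effective_env` and the plain-dict representation are assumptions made for the example:

```python
from typing import Dict, Optional


def effective_env(
    role: Optional[Dict[str, str]],
    role_group: Optional[Dict[str, str]],
) -> Dict[str, str]:
    """Merge env overrides; role-group entries win over role entries.

    Hypothetical helper that mirrors the documented precedence only.
    """
    merged = {**(role or {}), **(role_group or {})}
    for name, value in merged.items():
        # Override values must be strings, e.g. "8" rather than 8.
        if not isinstance(value, str):
            raise TypeError(
                f"{name}: override values must be strings, "
                f"got {type(value).__name__}"
            )
    return merged


role_overrides = {"AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL": "8"}
group_overrides = {"AIRFLOW__WEBSERVER__AUTO_REFRESH_INTERVAL": "5"}
print(effective_env(role_overrides, group_overrides))
```

With both levels set, the role group value ("5" in this made-up case) is the one that takes effect.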

Storage for data volumes

The Airflow Operator currently does not support using PersistentVolumeClaims for internal storage.

Resource Requests

Stackable operators handle resource requests in a slightly different manner than Kubernetes. Resource requests are defined on role or role group level. See Roles and role groups for details on these concepts. On role level this means that, for example, all workers will use the same resource requests and limits. This can be further specified on role group level (which takes priority over the role level) to apply different resources.

This is an example of how to specify CPU and memory resources using the Stackable Custom Resources:

---
apiVersion: example.stackable.tech/v1alpha1
kind: ExampleCluster
metadata:
  name: example
spec:
  workers: # role-level
    config:
      resources:
        cpu:
          min: 300m
          max: 600m
        memory:
          limit: 3Gi
    roleGroups: # role-group-level
      resources-from-role: # role-group 1
        replicas: 1
      resources-from-role-group: # role-group 2
        replicas: 1
        config:
          resources:
            cpu:
              min: 400m
              max: 800m
            memory:
              limit: 4Gi

In this case, the role group resources-from-role will inherit the resources specified on the role level, resulting in a maximum of 3Gi memory and 600m CPU resources.

The role group resources-from-role-group has a maximum of 4Gi memory and 800m CPU resources (which override the role-level resources).

For Java products the heap memory actually used is lower than the specified memory limit, because other processes in the container also require memory to run. Currently, 80% of the specified memory limit is passed to the JVM.
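As a rough illustration of the 80% rule, the sketch below converts a memory limit into the heap size that would be handed to a JVM. The helper names and the restriction to whole-number `Ki`/`Mi`/`Gi` quantities are simplifying assumptions, not the operator's parsing logic:

```python
# Binary suffixes handled by this simplified parser (assumption: the
# quantity is a whole number followed by Ki, Mi or Gi, or plain bytes).
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '3Gi' into bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def jvm_heap_bytes(memory_limit: str, percent: int = 80) -> int:
    """Return the share of the memory limit passed to the JVM, in bytes."""
    return parse_memory(memory_limit) * percent // 100


heap = jvm_heap_bytes("4Gi")
print(f"{heap} bytes, about {heap // 1024**2} MiB")
```

Integer arithmetic is used so that the result does not depend on floating-point rounding.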

For memory, only a limit can be specified, which will be set as both the memory request and the memory limit in the container. This guarantees the container the full amount of memory during Kubernetes scheduling.

If no resource requests are configured explicitly, the operator uses the following defaults:
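How role and role group settings combine into container resources can be sketched as follows. This is an illustration under stated assumptions, not the operator's code: the field-by-field merge in `deep_merge`, the mapping of `cpu.min` to the CPU request and `cpu.max` to the CPU limit, and both function names are assumptions. Only the memory handling (one limit used as request and limit) is taken directly from the text above.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override applied field by field (assumed semantics)."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def container_resources(role: dict, role_group: dict) -> dict:
    """Resolve role and role-group resources into requests and limits."""
    res = deep_merge(role, role_group)
    memory = res["memory"]["limit"]
    return {
        # Memory: the single limit is used as request and as limit.
        "requests": {"cpu": res["cpu"]["min"], "memory": memory},
        "limits": {"cpu": res["cpu"]["max"], "memory": memory},
    }


role = {"cpu": {"min": "300m", "max": "600m"}, "memory": {"limit": "3Gi"}}
group_1 = {}  # resources-from-role: nothing set, inherits from the role
group_2 = {"cpu": {"min": "400m", "max": "800m"}, "memory": {"limit": "4Gi"}}

print(container_resources(role, group_1))
print(container_resources(role, group_2))
```

The two calls correspond to the role groups resources-from-role and resources-from-role-group in the example above.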

workers:
  roleGroups:
    default:
      config:
        resources:
          cpu:
            min: '200m'
            max: "4"
          memory:
            limit: '2Gi'

Initializing the Airflow database

Airflow comes with a default embedded database (intended only for standalone mode). For cluster usage an external database such as PostgreSQL is used, as described above. This database must be initialized with an airflow schema and the Admin user defined in the airflow credentials Secret. This is done the first time the cluster is created and can take a few moments.

Using Airflow

When the Airflow cluster is created and the database is initialized, Airflow can be opened in the browser.

The Airflow port, which defaults to 8080, can be forwarded to the local host:

kubectl port-forward airflow-webserver-default-0 8080

Then it can be opened in the browser with http://localhost:8080.
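Before opening the browser, the forwarded port can also be checked from a script. The sketch below queries the `/health` endpoint that the Airflow 2 webserver provides; the function names are hypothetical, and the host and port are simply the values used in the port-forward command above:

```python
import json
import urllib.request


def health_url(host: str = "localhost", port: int = 8080) -> str:
    """Build the URL of the webserver health endpoint."""
    return f"http://{host}:{port}/health"


def fetch_health(url: str, timeout: float = 5.0) -> dict:
    """Fetch and decode the JSON health report (needs a reachable webserver)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_health(health_url())` while the port-forward is active should return a JSON document describing the state of the metadata database and the scheduler.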

Enter the admin credentials from the Kubernetes secret:

If the examples were loaded then some dashboards are already available:

Click on an example DAG and then invoke the job: if the scheduler is correctly set up then the job will run and the job tree will update automatically:

Authentication

Every user has to authenticate themselves before using Airflow and there are several ways of doing this.

Web interface

The default setting is to view and manually set up users via the Webserver UI. Note the blue "+" button where users can be added directly:

LDAP

Airflow supports authentication of users against an LDAP server. This requires setting up an AuthenticationClass for the LDAP server. The AuthenticationClass is then referenced in the AirflowCluster resource as follows:

apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow-with-ldap
spec:
  image:
    productVersion: 2.4.1
    stackableVersion: 0.7.0
  [...]
  authenticationConfig:
    authenticationClass: ldap # (1)
    userRegistrationRole: Admin # (2)

1	The reference to an AuthenticationClass called `ldap`
2	The default role that all users are assigned to

Users that log in with LDAP are assigned to a default Role which is specified with the userRegistrationRole property.

You can follow the Authentication with OpenLDAP tutorial to learn how to set up an AuthenticationClass for an LDAP server, and consult the AuthenticationClass reference for further details.

The users and roles can be viewed as before in the Webserver UI, but note that the blue "+" button is not available when authenticating against LDAP:

Authorization

The Airflow Webserver delegates the handling of user access control to Flask AppBuilder.

Web interface

You can view, add to, and assign the roles displayed in the Airflow Webserver UI to existing users.

LDAP

Airflow supports assigning Roles to users based on their LDAP group membership, though this is not yet supported by the Stackable operator. All the users logging in via LDAP get assigned to the same role which you can configure via the attribute authenticationConfig.userRegistrationRole on the AirflowCluster object:

apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow-with-ldap
spec:
  [...]
  authenticationConfig:
    authenticationClass: ldap
    userRegistrationRole: Admin # (1)

1	All users are assigned to the `Admin` role

Mounting DAGs

DAGs can be mounted by using a ConfigMap or a PersistentVolumeClaim. This is best illustrated with an example of each, shown in the next section.

via `ConfigMap`

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-dag # (1)
data:
  test_airflow_dag.py: | # (2)
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.dummy import DummyOperator

    with DAG(
        dag_id='test_airflow_dag',
        schedule_interval='0 0 * * *',
        start_date=datetime(2021, 1, 1),
        catchup=False,
        dagrun_timeout=timedelta(minutes=60),
        tags=['example', 'example2'],
        params={"example_key": "example_value"},
    ) as dag:
        run_this_last = DummyOperator(
            task_id='run_this_last',
        )

        # [START howto_operator_bash]
        run_this = BashOperator(
            task_id='run_after_loop',
            bash_command='echo 1',
        )
        # [END howto_operator_bash]

        run_this >> run_this_last

        for i in range(3):
            task = BashOperator(
                task_id='runme_' + str(i),
                bash_command='echo "{{ task_instance_key_str }}" && sleep 1',
            )
            task >> run_this

        # [START howto_operator_bash_template]
        also_run_this = BashOperator(
            task_id='also_run_this',
            bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
        )
        # [END howto_operator_bash_template]
        also_run_this >> run_this_last

    # [START howto_operator_bash_skip]
    this_will_skip = BashOperator(
        task_id='this_will_skip',
        bash_command='echo "hello world"; exit 99;',
        dag=dag,
    )
    # [END howto_operator_bash_skip]
    this_will_skip >> run_this_last

    if __name__ == "__main__":
        dag.cli()

---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.4.1
    stackableVersion: 0.7.0
  executor: CeleryExecutor
  loadExamples: false
  exposeConfig: false
  credentialsSecret: simple-airflow-credentials
  volumes:
    - name: cm-dag # (3)
      configMap:
        name: cm-dag # (4)
  volumeMounts:
    - name: cm-dag # (5)
      mountPath: /dags/test_airflow_dag.py # (6)
      subPath: test_airflow_dag.py # (7)
  webservers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/dags" # (8)
        replicas: 1
  workers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/dags" # (8)
        replicas: 2
  schedulers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/dags" # (8)
        replicas: 1

1	The name of the configuration map
2	The name of the DAG (this is a renamed copy of the `example_bash_operator.py` from the Airflow examples)
3	The volume backed by the configuration map
4	The name of the configuration map referenced by the Airflow cluster
5	The name of the mounted volume
6	The path of the mounted resource. Note that this should map to a single DAG.
7	The resource has to be defined using `subPath`: this is to prevent the versioning of configuration map elements which may cause a conflict with how Airflow propagates DAGs between its components.
8	If the mount path described above is anything other than the standard location (the default is `$AIRFLOW_HOME/dags`), then the location should be defined using the relevant environment variable.

The advantage of this approach is that a DAG can be provided "in-line", as it were. This becomes cumbersome when multiple DAGs are to be made available in this way, as each one has to be mapped individually. For multiple DAGs it is probably easier to expose them all via a mounted volume, which is shown below.
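To make the "each one has to be mapped individually" point concrete, the sketch below generates one `volumeMounts` entry per DAG file, each with its own `subPath`. The function name `dag_volume_mounts` and the second file name are hypothetical; the entry layout follows the ConfigMap example above:

```python
from typing import Dict, List


def dag_volume_mounts(
    volume_name: str,
    dag_files: List[str],
    dags_folder: str = "/dags",
) -> List[Dict[str, str]]:
    """Return one volumeMount entry (with subPath) per DAG file."""
    return [
        {
            "name": volume_name,
            "mountPath": f"{dags_folder}/{dag_file}",
            "subPath": dag_file,
        }
        for dag_file in dag_files
    ]


# 'another_dag.py' is a made-up second file, used only to show the growth.
for mount in dag_volume_mounts("cm-dag", ["test_airflow_dag.py", "another_dag.py"]):
    print(mount)
```

The list grows by one entry for every DAG file, which is why a shared volume tends to be more practical for larger numbers of DAGs.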

via `PersistentVolumeClaim`

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-airflow # (1)
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: batch/v1
kind: Job # (2)
metadata:
  name: airflow-dags
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: external-dags # (3)
          persistentVolumeClaim:
            claimName: pvc-airflow # (4)
      initContainers:
        - name: dest-dir
          image: docker.stackable.tech/stackable/tools:0.2.0-stackable0.4.0
          env:
            - name: DEST_DIR
              value: "/stackable/externals"
          command:
            [
              "bash",
              "-x",
              "-c",
              "mkdir -p $DEST_DIR && chown stackable:stackable ${DEST_DIR} && chmod -R a=,u=rwX,g=rwX ${DEST_DIR}",
            ]
          securityContext:
            runAsUser: 0
          volumeMounts:
            - name: external-dags # (5)
              mountPath: /stackable/externals # (6)
      containers:
        - name: airflow-dags
          image: docker.stackable.tech/stackable/tools:0.2.0-stackable0.4.0
          env:
            - name: DEST_DIR
              value: "/stackable/externals"
          command: # (7)
            [
              "bash",
              "-x",
              "-c",
              "curl -L https://raw.githubusercontent.com/apache/airflow/2.4.1/airflow/example_dags/example_bash_operator.py \
              -o ${DEST_DIR}/example_bash_operator.py && \
              curl -L https://raw.githubusercontent.com/apache/airflow/2.4.1/airflow/example_dags/example_complex.py \
              -o ${DEST_DIR}/example_complex.py && \
              curl -L https://raw.githubusercontent.com/apache/airflow/2.4.1/airflow/example_dags/example_branch_datetime_operator.py \
              -o ${DEST_DIR}/example_branch_datetime_operator.py",
            ]
          volumeMounts:
            - name: external-dags # (5)
              mountPath: /stackable/externals # (6)

---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.4.1
    stackableVersion: 0.7.0
  executor: CeleryExecutor
  loadExamples: false
  exposeConfig: false
  credentialsSecret: simple-airflow-credentials
  volumes:
    - name: external-dags # (8)
      persistentVolumeClaim:
        claimName: pvc-airflow # (9)
  volumeMounts:
    - name: external-dags # (10)
      mountPath: /stackable/external-dags # (11)
  webservers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/stackable/external-dags" # (12)
        replicas: 1
  workers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/stackable/external-dags" # (12)
        replicas: 2
  schedulers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW__CORE__DAGS_FOLDER: "/stackable/external-dags" # (12)
        replicas: 1

1	The name of the `PersistentVolumeClaim` that references the PV
2	Job used to populate the `PersistentVolumeClaim` with DAG files
3	The volume name that will be mounted as a target for the DAG files
4	Defines the `Volume` backed by the PVC, local to the Custom Resource
5	The `VolumeMount` used by the Custom Resource
6	The path for the `VolumeMount`
7	The command used to access/download the DAG files to a specified location
8	The `Volume` used by this Custom Resource
9	The `PersistentVolumeClaim` that backs this `Volume`
10	The `VolumeMount` referencing the `Volume` in the previous step
11	The path where this `Volume` is located for each role (webserver, worker, scheduler)
12	If the mount path described above is anything other than the standard location (the default is `$AIRFLOW_HOME/dags`), then the location should be defined using the relevant environment variable.

Node selection

Airflow expects that all its components (webserver, scheduler, workers etc.) have access to the DAG folder. If this is mounted via a PersistentVolumeClaim, then the permissible access modes on that claim may require that a specific node is selected. This can be done by providing a label-match as shown below:

workers:
  roleGroups:
    default:
      envOverrides:
        AIRFLOW__CORE__DAGS_FOLDER: "/stackable/external-dags"
      replicas: 1
      selector:
        matchLabels:
          node: "2"

Applying Custom Resources

Airflow can be used to apply custom resources from within a cluster. An example of this could be a SparkApplication job that is to be triggered by Airflow. The steps below describe how this can be done.

Define an in-cluster Kubernetes connection

An in-cluster connection can be created from within the Webserver UI (note that the "in cluster configuration" box is ticked):

Alternatively, the connection can be defined by an environment variable in URI format:

AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"
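The URI above is easier to read once it is clear how it is put together: the extras are a JSON document that is URL-encoded into the `__extra__` query parameter, and the variable name is `AIRFLOW_CONN_` followed by the upper-cased connection id. A minimal sketch using only the standard library (both function names are hypothetical):

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def connection_env_name(conn_id: str) -> str:
    """Name of the environment variable Airflow reads for a connection id."""
    return f"AIRFLOW_CONN_{conn_id.upper()}"


def in_cluster_connection_uri() -> str:
    """Build the in-cluster Kubernetes connection URI from its extras."""
    extras = {
        "extra__kubernetes__in_cluster": True,
        "extra__kubernetes__kube_config": "",
        "extra__kubernetes__kube_config_path": "",
        "extra__kubernetes__namespace": "",
    }
    # urlencode percent-encodes the JSON text and turns spaces into '+'.
    return "kubernetes://?" + urlencode({"__extra__": json.dumps(extras)})


uri = in_cluster_connection_uri()
print(connection_env_name("kubernetes_in_cluster"))
print(uri)

# Decoding the query string again recovers the readable extras.
decoded = json.loads(parse_qs(urlsplit(uri).query)["__extra__"][0])
print(decoded)
```

Under these assumptions the printed URI should match the value shown above, and the decoded dictionary shows the in-cluster flag that corresponds to the ticked box in the Webserver UI.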

This can be supplied directly in the custom resource for all roles (Airflow expects configuration to be common across components):

---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.4.1
    stackableVersion: 0.7.0
  executor: CeleryExecutor
  loadExamples: false
  exposeConfig: false
  credentialsSecret: simple-airflow-credentials
  webservers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"
        replicas: 1
  workers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"
        replicas: 1
  schedulers:
    roleGroups:
      default:
        envOverrides:
          AIRFLOW_CONN_KUBERNETES_IN_CLUSTER: "kubernetes://?__extra__=%7B%22extra__kubernetes__in_cluster%22%3A+true%2C+%22extra__kubernetes__kube_config%22%3A+%22%22%2C+%22extra__kubernetes__kube_config_path%22%3A+%22%22%2C+%22extra__kubernetes__namespace%22%3A+%22%22%7D"
        replicas: 1

Define a cluster role for Airflow to create SparkApplication resources

Airflow cannot create or access SparkApplication resources by default - a cluster role is required for this:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: airflow-spark-clusterrole
rules:
  - apiGroups:
      - spark.stackable.tech
    resources:
      - sparkapplications
    verbs:
      - create
      - get

and a corresponding cluster role binding:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: airflow-spark-clusterrole-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: airflow-spark-clusterrole
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts

DAG code

Now for the DAG itself. The job to be started is a simple Spark job that calculates the value of pi:

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: default
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.3.0
  mode: cluster
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
  executor:
    cores: 1
    instances: 3
    memory: "512m"

This will be called from within a DAG by using the connection that was defined earlier. It will be wrapped by the KubernetesHook that the Airflow Kubernetes provider makes available. There are two classes that are used to:

start the job
monitor the status of the job

These are written in-line in the python code below, though this is just to make it clear how the code is used (the classes SparkKubernetesOperator and SparkKubernetesSensor will be used for all custom resources and thus are best defined as separate python files that the DAG would reference).

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
"""Example DAG demonstrating how to apply a Kubernetes Resource from Airflow running in-cluster"""
from datetime import datetime, timedelta
from airflow import DAG
from typing import TYPE_CHECKING, Optional, Sequence, Dict
from kubernetes import client
from airflow.exceptions import AirflowException
from airflow.sensors.base import BaseSensorOperator
from airflow.models import BaseOperator
from airflow.providers.cncf.kubernetes.hooks.kubernetes import KubernetesHook

if TYPE_CHECKING:
    from airflow.utils.context import Context


class SparkKubernetesOperator(BaseOperator):  # (1)
    """
    Creates a SparkApplication resource in kubernetes:
    :param application_file: Defines a 'SparkApplication' custom resource as either a
        path to a '.yaml' file, '.json' file, YAML string or JSON string.
    :param namespace: kubernetes namespace for the SparkApplication
    :param kubernetes_conn_id: The :ref:`kubernetes connection id <howto/connection:kubernetes>`
        for the Kubernetes cluster.
    :param api_group: SparkApplication API group
    :param api_version: SparkApplication API version
    """

    template_fields: Sequence[str] = ('application_file', 'namespace')
    template_ext: Sequence[str] = ('.yaml', '.yml', '.json')
    ui_color = '#f4a460'

    def __init__(
        self,
        *,
        application_file: str,
        namespace: Optional[str] = None,
        kubernetes_conn_id: str = 'kubernetes_in_cluster',  # (2)
        api_group: str = 'spark.stackable.tech',
        api_version: str = 'v1alpha1',
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.application_file = application_file
        self.namespace = namespace
        self.kubernetes_conn_id = kubernetes_conn_id
        self.api_group = api_group
        self.api_version = api_version
        self.plural = "sparkapplications"

    def execute(self, context: 'Context'):
        hook = KubernetesHook(conn_id=self.kubernetes_conn_id)
        self.log.info("Creating SparkApplication...")
        response = hook.create_custom_object(
            group=self.api_group,
            version=self.api_version,
            plural=self.plural,
            body=self.application_file,
            namespace=self.namespace,
        )
        return response


class SparkKubernetesSensor(BaseSensorOperator):  # (3)
    """
    Monitors a SparkApplication resource in kubernetes:
    :param application_name: SparkApplication resource name
    :param namespace: the kubernetes namespace where the SparkApplication reside in
    :param kubernetes_conn_id: The :ref:`kubernetes connection<howto/connection:kubernetes>`
        to Kubernetes cluster.
    :param attach_log: determines whether logs for driver pod should be appended to the sensor log
    :param api_group: SparkApplication API group
    :param api_version: SparkApplication API version
    """

    template_fields = ("application_name", "namespace")
    FAILURE_STATES = ("Failed", "Unknown")
    # Trailing comma makes this a one-element tuple rather than a plain string,
    # so it can be concatenated with FAILURE_STATES in poke().
    SUCCESS_STATES = ("Succeeded",)

    def __init__(
        self,
        *,
        application_name: str,
        attach_log: bool = False,
        namespace: Optional[str] = None,
        kubernetes_conn_id: str = 'kubernetes_in_cluster',  # (2)
        api_group: str = 'spark.stackable.tech',
        api_version: str = 'v1alpha1',
        poke_interval: float = 60,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.application_name = application_name
        self.attach_log = attach_log
        self.namespace = namespace
        self.kubernetes_conn_id = kubernetes_conn_id
        self.hook = KubernetesHook(conn_id=self.kubernetes_conn_id)
        self.api_group = api_group
        self.api_version = api_version
        self.poke_interval = poke_interval

    def _log_driver(self, application_state: str, response: dict) -> None:
        if not self.attach_log:
            return
        status_info = response["status"]
        if "driverInfo" not in status_info:
            return
        driver_info = status_info["driverInfo"]
        if "podName" not in driver_info:
            return
        driver_pod_name = driver_info["podName"]
        namespace = response["metadata"]["namespace"]
        log_method = self.log.error if application_state in self.FAILURE_STATES else self.log.info
        try:
            log = ""
            for line in self.hook.get_pod_logs(driver_pod_name, namespace=namespace):
                log += line.decode()
            log_method(log)
        except client.rest.ApiException as e:
            self.log.warning(
                "Could not read logs for pod %s. It may have been disposed.\n"
                "Make sure timeToLiveSeconds is set on your SparkApplication spec.\n"
                "underlying exception: %s",
                driver_pod_name,
                e,
            )

    def poke(self, context: Dict) -> bool:
        self.log.info("Poking: %s", self.application_name)
        response = self.hook.get_custom_object(
            group=self.api_group,
            version=self.api_version,
            plural="sparkapplications",
            name=self.application_name,
            namespace=self.namespace,
        )
        try:
            application_state = response["status"]["phase"]
        except KeyError:
            self.log.debug(f"SparkApplication status could not be established: {response}")
            return False
        if self.attach_log and application_state in self.FAILURE_STATES + self.SUCCESS_STATES:
            self._log_driver(application_state, response)
        if application_state in self.FAILURE_STATES:
            raise AirflowException(f"SparkApplication failed with state: {application_state}")
        elif application_state in self.SUCCESS_STATES:
            self.log.info("SparkApplication ended successfully")
            return True
        else:
            self.log.info("SparkApplication is still in state: %s", application_state)
            return False


with DAG(  # (4)
    dag_id='sparkapp_dag',
    schedule_interval='0 0 * * *',
    start_date=datetime(2021, 1, 1),
    catchup=False,
    dagrun_timeout=timedelta(minutes=60),
    tags=['example'],
    params={"example_key": "example_value"},
) as dag:
    t1 = SparkKubernetesOperator(  # (5)
        task_id='spark_pi_submit',
        namespace="default",
        application_file="pyspark-pi.yaml",
        do_xcom_push=True,
        dag=dag,
    )

    t2 = SparkKubernetesSensor(  # (6)
        task_id='spark_pi_monitor',
        namespace="default",
        application_name="{{ task_instance.xcom_pull(task_ids='spark_pi_submit')['metadata']['name'] }}",
        poke_interval=5,
        dag=dag,
    )

    t1 >> t2  # (7)

1	the wrapper class used for calling the job via `KubernetesHook`
2	the connection that was created for in-cluster usage
3	the wrapper class used for monitoring the job via `KubernetesHook`
4	the start of the DAG code
5	the initial task to invoke the job
6	the subsequent task to monitor the job
7	the jobs are chained together in the correct order
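The decision logic of the monitoring step can be tried out without Airflow or a cluster. The sketch below is a simplified stand-in for the sensor's state handling, not the sensor itself: the function name `classify` and the returned labels are assumptions, while the state names and the `status.phase` lookup follow the DAG code above.

```python
FAILURE_STATES = ("Failed", "Unknown")
# A one-element tuple needs the trailing comma; ("Succeeded") would be a
# plain string and could not be concatenated with FAILURE_STATES.
SUCCESS_STATES = ("Succeeded",)


def classify(response: dict) -> str:
    """Map a SparkApplication status response to a coarse outcome."""
    try:
        phase = response["status"]["phase"]
    except KeyError:
        # No status reported yet: keep polling.
        return "pending"
    if phase in FAILURE_STATES:
        return "failed"
    if phase in SUCCESS_STATES:
        return "succeeded"
    return "running"


for sample in ({}, {"status": {"phase": "Running"}}, {"status": {"phase": "Succeeded"}}):
    print(classify(sample))
```

In the sensor, "pending" and "running" correspond to returning False from `poke`, "succeeded" to returning True, and "failed" to raising an exception.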

Once this DAG is mounted in the DAG folder it can be called and its progress viewed from within the Webserver UI:

Clicking on the "spark_pi_monitor" task and selecting the logs shows that the status of the job has been tracked by Airflow:
