Load DAGs per git-sync #177

@adwk67

Description

Issue #150 introduces one possibility for the general management of external resources. For DAGs, Airflow recommends using git-sync (see e.g. here for more info). This issue covers running git-sync in a container so that DAGs are kept up to date at regular intervals.

An example of this is given here. The Airflow CRD can be changed to include the optional section shown below. In Airflow the roles are expected to share the same config, so the git-sync definition can sit at top level. This might need to be revisited if git-sync turns out to be useful for other operators, e.g. for NiFi workflows.

There are quite a number of parameters that can be set (see here and the sections that follow the link). Analogous to the SparkHistoryServerSpec, the proposal is to define mandatory fields and expose the rest via a map:

...
credentialsSecret: test-airflow-credentials
gitSync:
  #---------------------------------------------
  # necessary and mandatory
  #---------------------------------------------
  name: git-sync
  repo: https://github.com/kubernetes/git-sync
  dagsDirectory: dags
  wait: 60
  #---------------------------------------------
  # optional
  #---------------------------------------------
  gitSyncConf:
    - # depth: default 1
    ...
webservers:
  roleGroups:
    ...
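As a rough illustration of the "mandatory fields plus a map" idea, the sketch below turns such a `gitSync` section into git-sync container arguments. It assumes the v3-style `--repo`, `--wait` and `--depth` flags, that `gitSyncConf` is a list of single-key maps, and that each optional key is passed through verbatim as a flag. The function name `git_sync_args` is hypothetical and not part of the operator.

```python
from typing import Any

# Mandatory fields as listed in the proposal above.
MANDATORY_FIELDS = ("name", "repo", "dagsDirectory", "wait")


def git_sync_args(git_sync: dict[str, Any]) -> list[str]:
    """Build git-sync arguments from a parsed gitSync section (hypothetical helper)."""
    missing = [field for field in MANDATORY_FIELDS if field not in git_sync]
    if missing:
        raise ValueError(f"gitSync is missing mandatory fields: {missing}")

    # name and dagsDirectory are validated but not passed to git-sync here:
    # the first would name the container, the second is the folder inside
    # the checked-out repo that holds the DAGs.
    args = [f"--repo={git_sync['repo']}", f"--wait={git_sync['wait']}"]

    # Assumption: gitSyncConf is a list of single-key maps, each passed
    # through unchanged as a flag.
    for entry in git_sync.get("gitSyncConf") or []:
        for key, value in entry.items():
            args.append(f"--{key}={value}")
    return args


if __name__ == "__main__":
    example = {
        "name": "git-sync",
        "repo": "https://github.com/kubernetes/git-sync",
        "dagsDirectory": "dags",
        "wait": 60,
        "gitSyncConf": [{"depth": 1}],
    }
    print(git_sync_args(example))
```

Passing the optional settings through unchanged keeps the CRD small, at the cost of leaving validation of those keys to git-sync itself.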

A git-sync container can only sync a single repo, so multiple repos would require a container each.
A git-sync block may occur at top-level (as would be the case for e.g. airflow), or at role level, if the product only requires external git resources for a specific component (i.e. the init-container will only be created for that role).
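The top-level versus role-level placement described above could be resolved along these lines. This is only a sketch: the helper name `git_sync_for_role` is hypothetical, the spec is modelled as a plain dictionary, and letting a role-level block take precedence over a top-level one is an assumption that the issue does not state.

```python
from typing import Any, Optional


def git_sync_for_role(cluster_spec: dict[str, Any], role: str) -> Optional[dict[str, Any]]:
    """Return the gitSync block that applies to a role, or None (hypothetical helper)."""
    role_spec = cluster_spec.get(role) or {}
    # Assumption: a role-level block wins; otherwise fall back to top level.
    return role_spec.get("gitSync") or cluster_spec.get("gitSync")


if __name__ == "__main__":
    # Top-level block: every role gets the same git-sync definition.
    airflow_like = {
        "gitSync": {"name": "git-sync", "repo": "https://github.com/kubernetes/git-sync"},
        "webservers": {},
        "workers": {},
    }
    # Role-level block: only that role gets a git-sync container.
    role_only = {
        "nodes": {"gitSync": {"name": "git-sync", "repo": "https://github.com/kubernetes/git-sync"}},
        "other": {},
    }
    for spec, role in [(airflow_like, "workers"), (role_only, "nodes"), (role_only, "other")]:
        print(role, git_sync_for_role(spec, role) is not None)
```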

Acceptance criteria

  • Git-sync implemented and documented.
  • Strategy for handling multiple git repos containing multiple DAGs developed and documented (the implementation of this is not covered by this ticket, but there should be a strategy to look at in a separate issue).
    • Strategy
      • implement git-sync as a list under a top-level cluster-config.
      • the current implementation only considers the first element of this list (thus avoiding a breaking change later, when the whole list is processed)
      • when implementing multiple repositories
        • one git-sync container is required per repository endpoint
        • code will be written to consolidate these endpoints into a single folder to be used as the DAG folder (possibly by using the webhook mechanism documented here)
  • The git-sync image must be available in off-line mode e.g. must be mirrored (image is incorporated into the airflow product images)
  • remove PVC-usage from tests and documentation.
    • N.B. inform @backstreetkiwi when this is complete so that node labels (needed for the airflow PVC test) can be removed from T2 cluster nodes

See stackabletech/docker-images#337
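The list-based strategy from the acceptance criteria can be sketched as follows: the git-sync configuration is a list, only the first element is used for now, and a later multi-repository mode yields one container per entry. The function name `plan_git_sync_containers`, the `multiple_repos` switch and the container naming scheme are all hypothetical. Consolidating the repositories into a single DAG folder is not covered here.

```python
from typing import Any


def plan_git_sync_containers(
    git_syncs: list[dict[str, Any]], multiple_repos: bool = False
) -> list[dict[str, Any]]:
    """Plan one git-sync container per repository that is processed (hypothetical helper)."""
    # Current behaviour per the strategy: only the first list element counts.
    selected = git_syncs if multiple_repos else git_syncs[:1]
    return [
        # The index suffix keeps container names unique if entries share a name.
        {"name": f"{entry['name']}-{index}", "repo": entry["repo"]}
        for index, entry in enumerate(selected)
    ]


if __name__ == "__main__":
    config = [
        {"name": "git-sync", "repo": "https://github.com/kubernetes/git-sync"},
        {"name": "git-sync", "repo": "https://example.invalid/second-repo"},  # placeholder URL
    ]
    print(len(plan_git_sync_containers(config)))
    print(len(plan_git_sync_containers(config, multiple_repos=True)))
```

Because the field is already a list, switching on multi-repository support later would change behaviour without changing the shape of the CRD.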

Status

Done
