Airflow 2.0


Overview

The purpose of this repository is to provide a base config for Airflow 2.0 with Docker.

This repository can be used as a template for other repositories to build upon. More information about creating a repository from this template can be found here.
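As a sketch using the GitHub CLI (the repository name my-airflow-project is just an example):

gh repo create my-airflow-project --template NiccoloSalvini/airflow-docker --private --clone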

Quick Start


This guide is based on Airflow's quick start guide, Running Airflow in Docker.

TL;DR

Create a .env file as follows:

AIRFLOW_UID=<YOUR LOCAL USER ID>
AIRFLOW_GID=<YOUR LOCAL GROUP ID>
AIRFLOW_IMAGE_NAME=<AIRFLOW DOCKER IMAGE e.g. apache/airflow:2.1.2>
_AIRFLOW_WWW_USER_USERNAME=<AIRFLOW ADMIN USERNAME>
_AIRFLOW_WWW_USER_PASSWORD=<AIRFLOW ADMIN PASSWORD>
FERNET_KEY=<YOUR FERNET KEY>

Initialise Airflow:

docker-compose up airflow-init

Start all Airflow services:

docker-compose up

Place your DAG files in dags/ and your plugins in plugins/.

Initialising Environment

Set file permissions

Within our docker-compose file we mount a number of directories, i.e. config/, dags/, logs/ and plugins/. To avoid issues with mismatched file permissions, we need to ensure that the mounted volumes in every Docker container use the native Linux filesystem user/group permissions.

To do this we specify the UID and GID for Airflow and append them to our .env file:

echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" >> .env

Create admin user and password for web authentication

Airflow 2.0 requires the RBAC UI, which means users must authenticate with a username and password before logging in.

To set the admin user, update the .env file to include values for _AIRFLOW_WWW_USER_USERNAME and _AIRFLOW_WWW_USER_PASSWORD:

#.env
_AIRFLOW_WWW_USER_USERNAME=<ADMIN USERNAME>
_AIRFLOW_WWW_USER_PASSWORD=<ADMIN PASSWORD>

In order to create additional users, you can use the following CLI command; you will then be prompted to enter a password for the new user:

./airflow.sh airflow users create \
    --username niall.oriordan \
    --firstname Niall \
    --lastname O\'Riordan \
    --role Op \
    --email oriordn@tcd.ie

For more information about different roles, visit Airflow's Access Control documentation. Additionally, more information about the airflow users create command can be found here.
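To confirm that the account exists, the same wrapper can list users (a minimal sketch, mirroring the command style above; airflow users list is a standard Airflow CLI command):

./airflow.sh airflow users list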

Specify the Airflow Docker image

Within the .env file, specify the Airflow Docker image to use:

#.env
AIRFLOW_IMAGE_NAME=<AIRFLOW DOCKER IMAGE e.g. apache/airflow:2.1.2>

Create a Fernet key for Airflow

Airflow uses Fernet to encrypt passwords within its connection and variable configuration.

To generate the key you will need to install the cryptography package (pip install cryptography) and then run the following command:

echo -e "FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")" >> .env

This will add the Fernet key to your .env file as follows:

# .env
FERNET_KEY=<YOUR FERNET KEY>

It is important to keep the generated Fernet key safe: a password encrypted with it cannot be read or manipulated without the key.
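For illustration only (Airflow performs this internally), a quick Fernet round trip with the cryptography package looks like this; the example password is made up:

python -c "
from cryptography.fernet import Fernet
key = Fernet.generate_key()                          # the kind of key stored in FERNET_KEY
token = Fernet(key).encrypt(b'connection-password')  # what gets stored in the metadata DB
print(Fernet(key).decrypt(token).decode())           # readable only with the key
"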

Initialise Airflow

It is necessary to initialise Airflow by running the database migrations and creating the first admin user. This can be achieved by running:

docker-compose up airflow-init

After initialization is complete, the last few messages should read like below:

airflow-init_1       | Upgrades done
airflow-init_1       | Admin user airflow created
airflow-init_1       | 2.1.2
start_airflow-init_1 exited with code 0

Run Airflow

To start all Airflow services run:

docker-compose up

Listing the running containers in another terminal (e.g. with docker ps) should show output similar to the one below:

CONTAINER ID   IMAGE                  COMMAND                  CREATED              STATUS                        PORTS                                                 NAMES
b60b1cf71c84   apache/airflow:2.1.2   "/usr/bin/dumb-init …"   About a minute ago   Up About a minute (healthy)   0.0.0.0:5555->5555/tcp, :::5555->5555/tcp, 8080/tcp   airflow-docker_flower_1
cc8d0e7d4313   apache/airflow:2.1.2   "/usr/bin/dumb-init …"   About a minute ago   Up About a minute (healthy)   8080/tcp                                              airflow-docker_airflow-scheduler_1
d8dd9720bffd   apache/airflow:2.1.2   "/usr/bin/dumb-init …"   About a minute ago   Up About a minute (healthy)   8080/tcp                                              airflow-docker_airflow-worker_1
8887045f093e   apache/airflow:2.1.2   "/usr/bin/dumb-init …"   About a minute ago   Up About a minute (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp             airflow-docker_airflow-webserver_1
cf55784b4c05   postgres:13            "docker-entrypoint.s…"   About a minute ago   Up About a minute (healthy)   5432/tcp                                              airflow-docker_postgres_1
e6b51c4d2d68   redis:latest           "docker-entrypoint.s…"   About a minute ago   Up About a minute (healthy)   0.0.0.0:6379->6379/tcp, :::6379->6379/tcp             airflow-docker_redis_1

Note:

  • It may take a few minutes for the containers to finish starting up
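To double-check that the webserver is up, you can also query its health endpoint (assuming the default port mapping of 8080 from the compose file):

curl http://localhost:8080/health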

Place your DAG files in dags/ and your plugins in plugins/.
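As a minimal sketch of a first DAG (the file name, DAG id and schedule below are illustrative, not part of this repository):

cat > dags/hello_world.py <<'EOF'
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal example DAG: a single task that echoes a message once a day.
with DAG(
    dag_id="hello_world",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo 'hello from airflow'")
EOF

The scheduler picks up new files in dags/ after a short delay, and the DAG then appears in the web UI.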

Architecture


All applications are packaged with Docker to isolate the software from its environment so that it behaves the same across different development environments. We also use Docker Compose to define and run the multi-container application.

Airflow consists of several components:

  • Metadata Database - Contains information about the status of tasks, DAGs, Variables, connections, etc.
  • Scheduler - Reads from the Metadata database and is responsible for adding the necessary tasks to the queue
  • Executor - Works closely with the Scheduler to determine what resources are needed to complete the tasks as they are queued
  • Web server - HTTP Server provides access to DAG/task status information

Metadata Database - Postgres

A number of databases can be used as the Metadata Database. More information here.

For our purposes we have chosen a Postgres backend.

Scheduler

A new feature in Airflow 2.0 is the option to run multiple schedulers, which enables high availability, scalability and greater performance.

More information here.

For instance we can scale up to two schedulers by running:

docker-compose up --scale airflow-scheduler=2
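To verify that both scheduler containers are running, you can list the containers for that service (assuming the compose service is named airflow-scheduler, as in the container list above):

docker-compose ps airflow-scheduler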

Executor

Airflow provides the option for Local or Remote executors. Local executors usually run tasks inside the scheduler process, while Remote executors usually run those tasks remotely via a pool of workers.

The default executor is a Sequential Executor which runs tasks sequentially without the option for parallelism.

Here we use the Celery Executor as a Remote Executor so that the number of workers can be scaled out.
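In the docker-compose setup this repository is based on, the executor is typically selected through an Airflow environment variable set on each Airflow service (shown here as a sketch, not a full compose excerpt):

# environment variable that switches Airflow to the Celery Executor
AIRFLOW__CORE__EXECUTOR=CeleryExecutor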

For instance we can scale up to three Celery workers by running:

docker-compose up --scale airflow-worker=3

Or scale up to three Celery workers and two schedulers by running:

docker-compose up --scale airflow-worker=3 --scale airflow-scheduler=2

Redis - Message Broker

In order to use the Celery Executor, a Celery backend such as RabbitMQ or Redis is required. For our purposes Redis was chosen.

Redis is a distributed in-memory key-value store. Here Airflow uses it as the message broker that delivers task messages to the Celery workers.
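To check that the broker is reachable, you can ping it through the compose service (assuming the service is named redis, as in the container list above):

docker-compose exec redis redis-cli ping   # expected reply: PONG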

Flower - Celery Monitoring Tool

Flower is a tool used for monitoring and administering Celery clusters.

Flower is accessible on port 5555.

For example, after scaling to three Celery workers, Flower should show a dashboard like the following:

(Screenshot: Flower dashboard listing the three Celery workers)
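Flower also exposes a small HTTP API; as a sketch (assuming the default 5555 port mapping), the registered workers can be listed with:

curl -s http://localhost:5555/api/workers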

Structure

  • docker-compose.yaml: defines the Airflow services
  • dags/: folder for Airflow DAGs
  • docker/: folder containing the Dockerfile
  • logs/: folder for Airflow logs
  • plugins/: folder for Airflow plugins
  • airflow.sh: convenience bash script for running Airflow CLI commands (see the example after this list)
  • airflow.cfg: Airflow config file
  • Makefile: Makefile with targets to start, stop and rerun the Airflow services based on the docker/Dockerfile build
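As a sketch of how the airflow.sh wrapper is used (mirroring the users create example above; dags list and version are standard Airflow CLI commands):

./airflow.sh airflow dags list   # list the DAGs found in dags/
./airflow.sh airflow version     # print the Airflow version running in the container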
