Merge pull request #1 from dataops-sre/adaptdockercompose
Adapt docker compose file and better documentations
dataops-sre authored Aug 6, 2021
2 parents 287b4c8 + 48c9e74 commit fcc217b
Showing 6 changed files with 75 additions and 66 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/ci.yml
@@ -2,9 +2,11 @@ name: Docker Image CI

on:
push:
branches: [ master ]
branches:
- '*'
pull_request:
branches: [ master ]
branches:
- master

jobs:

10 changes: 9 additions & 1 deletion Dockerfile
@@ -1,7 +1,15 @@
FROM apache/airflow:2.1.2-python3.8

LABEL maintainer="dataops-sre"

ARG AIRFLOW_VERSION=2.1.2
ARG MY_PYTHON_VERSION=3.8
ARG PYTHON_VERSION=3.8

ARG AIRFLOW_DEPS=""
ARG PYTHON_DEPS=""

RUN pip install apache-airflow[kubernetes,snowflake${AIRFLOW_DEPS:+,}${AIRFLOW_DEPS}]==${AIRFLOW_VERSION} \
&& if [ -n "${PYTHON_DEPS}" ]; then pip install ${PYTHON_DEPS}; fi

COPY script/entrypoint.sh /entrypoint.sh
COPY config/webserver_config.py $AIRFLOW_HOME/
54 changes: 36 additions & 18 deletions README.md
@@ -11,24 +11,44 @@ This repository contains **Dockerfile** of [apache-airflow2](https://github.com/

* Based on official Airflow 2 Image [apache/airflow:2.1.2-python3.8](https://hub.docker.com/r/apache/airflow) and uses the official [Postgres](https://hub.docker.com/_/postgres/) as backend and [Redis](https://hub.docker.com/_/redis/) as queue
* Docker entrypoint script is based on [puckel/docker-airflow](https://github.com/puckel/docker-airflow)
* Docker entrypoint script is forked from [puckel/docker-airflow](https://github.com/puckel/docker-airflow)
* Install [Docker](https://www.docker.com/)
* Install [Docker Compose](https://docs.docker.com/compose/install/)


## Installation
## Motivation
This repo is forked from [puckel/docker-airflow](https://github.com/puckel/docker-airflow); the original repo no longer seems to be maintained.

Pull the image from the Docker repository.
Airflow has been updated to version 2 and now ships an [official docker image](https://hub.docker.com/r/apache/airflow); there is also a [bitnami airflow image](https://hub.docker.com/r/bitnami/airflow). Nevertheless, puckel's image is still interesting: none of the available images runs Airflow with the LocalExecutor and the scheduler in a single container, which is extremely useful when deploying a simple Airflow to an AWS EKS cluster.

With Kubernetes you can resolve Airflow's scalability issues by using only KubernetesPodOperator in your DAGs. Airflow itself then needs zero computational power of its own and serves purely as a scheduler. Separating the scheduler and the webserver into two different pods is a bit problematic on an AWS EKS cluster: we want to keep DAGs and logs in a persistent volume, but AWS limits multi-attach for EBS volumes, so the webserver and scheduler pods have to be scheduled on the same EKS node, which is a bit annoying. Thus puckel's Airflow startup script is useful.

What this fork does:

* Disable the login screen in Airflow 2 by default
* Improve the current script so that it only takes Airflow environment variables into account
* Make sure the docker compose files work

You can use my [Airflow helm chart](https://github.com/dataops-sre/helm-charts) which deploys this image to a Kubernetes cluster.


## Build

Optionally install [Extra Airflow Packages](https://airflow.incubator.apache.org/installation.html#extra-package) and/or python dependencies at build time :

docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" -t dataopssre/docker-airflow2 .
docker build --rm --build-arg PYTHON_DEPS="requests" -t dataopssre/docker-airflow2 .

or combined

docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" --build-arg PYTHON_DEPS="requests" -t dataopssre/docker-airflow2 .
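The Dockerfile in this commit appends `AIRFLOW_DEPS` to the fixed `kubernetes,snowflake` extras with the `${AIRFLOW_DEPS:+,}` expansion, which yields a comma only when the variable is non-empty. The following is a minimal sketch of that expansion outside of Docker; the function name `airflow_requirement` is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): prints the pip requirement
# string that the Dockerfile's RUN line would produce for the given arguments.
airflow_requirement() {
  AIRFLOW_VERSION="$1"
  AIRFLOW_DEPS="$2"
  # ${AIRFLOW_DEPS:+,} expands to "," only if AIRFLOW_DEPS is set and non-empty,
  # so an empty value does not leave a trailing comma inside the brackets.
  printf '%s\n' "apache-airflow[kubernetes,snowflake${AIRFLOW_DEPS:+,}${AIRFLOW_DEPS}]==${AIRFLOW_VERSION}"
}

airflow_requirement 2.1.2 ""
airflow_requirement 2.1.2 "datadog,dask"
```

With an empty second argument the extras list stays at `kubernetes,snowflake`; with `datadog,dask` the two extras are appended after a separating comma.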

docker pull dataopssre/docker-airflow2

## Usage

By default, docker-airflow runs Airflow with **SequentialExecutor** :

docker run -d -p 8080:8080 puckel/docker-airflow webserver

If you want to run another executor, use the other docker-compose.yml files provided in this repository.
If you want to run another executor, use the docker-compose.yml files provided in this repository.

For **LocalExecutor** :

@@ -38,11 +58,10 @@ For **CeleryExecutor** :

docker-compose -f docker-compose-CeleryExecutor.yml up -d

NB : If you want to have DAGs example loaded (default=False), you've to set the following environment variable :
NB : If you want the example DAGs loaded (default=False), you have to set the following environment variable in the docker-compose files :

`LOAD_EX=n`
`AIRFLOW__CORE__LOAD_EXAMPLES=True`

docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow

If you want to use Ad hoc query, make sure you've configured connections:
Go to Admin -> Connections, edit "postgres_default" and set these values (equivalent to values in airflow.cfg/docker-compose*.yml) :
@@ -53,26 +72,26 @@ Go to Admin -> Connections and Edit "postgres_default" set this values (equivale

For encrypted connection passwords (with the Local or Celery executor), all containers must share the same fernet_key. By default docker-airflow generates the fernet_key at startup, so you have to set an environment variable in the docker-compose file (e.g. docker-compose-LocalExecutor.yml) to use the same key across containers. To generate a fernet_key :

docker run puckel/docker-airflow python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)"
docker run dataopssre/docker-airflow2 python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)"

## Configuring Airflow

It's possible to set any configuration value for Airflow from environment variables, which are used over values from the airflow.cfg.
It's possible to set any configuration value for Airflow from environment variables

The general rule is the environment variable should be named `AIRFLOW__<section>__<key>`, for example `AIRFLOW__CORE__SQL_ALCHEMY_CONN` sets the `sql_alchemy_conn` config option in the `[core]` section.
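As a rough illustration of the naming rule, the sketch below derives the variable name from a section and a key. The helper `airflow_env_name` is hypothetical (it is not provided by Airflow or by this image) and only uppercases its inputs and joins them with double underscores.

```shell
#!/bin/sh
# Hypothetical helper: maps an airflow.cfg section and key to the
# AIRFLOW__<SECTION>__<KEY> environment variable name.
airflow_env_name() {
  section=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  key=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
  printf 'AIRFLOW__%s__%s\n' "$section" "$key"
}

airflow_env_name core sql_alchemy_conn
airflow_env_name core load_examples
```

The second call gives the variable that the docker-compose files in this commit use in place of the former `LOAD_EX` setting.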

Check out the [Airflow documentation](http://airflow.readthedocs.io/en/latest/howto/set-config.html#setting-configuration-options) for more details
Check out the [Airflow documentation](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html) for more details

You can also define connections via environment variables by prefixing them with `AIRFLOW_CONN_` - for example `AIRFLOW_CONN_POSTGRES_MASTER=postgres://user:password@localhost:5432/master` for a connection called "postgres_master". The value is parsed as a URI. This will work for hooks etc, but won't show up in the "Ad-hoc Query" section unless an (empty) connection is also created in the DB

## Custom Airflow plugins

Airflow allows for custom user-created plugins which are typically found in `${AIRFLOW_HOME}/plugins` folder. Documentation on plugins can be found [here](https://airflow.apache.org/plugins.html)
Airflow allows for custom user-created plugins which are typically found in `${AIRFLOW_HOME}/plugins` folder. Documentation on plugins can be found [here](https://airflow.apache.org/docs/apache-airflow/stable/plugins.html)

In order to incorporate plugins into your docker container
- Create the plugins folders `plugins/` with your custom plugins.
- Mount the folder as a volume by doing either of the following:
- Include the folder as a volume in command-line `-v $(pwd)/plugins/:/usr/local/airflow/plugins`
- Include the folder as a volume in command-line `-v $(pwd)/plugins/:/opt/airflow/plugins`
- Use docker-compose-LocalExecutor.yml or docker-compose-CeleryExecutor.yml which contains support for adding the plugins folder as a volume

## Install custom python package
@@ -99,20 +118,19 @@ This can be used to scale to a multi node setup using docker swarm.

If you want to run other airflow sub-commands, such as `dags list` or `tasks clear`, you can do so like this:

docker run --rm -ti puckel/docker-airflow airflow list_dags
docker run --rm -ti dataopssre/docker-airflow2 airflow dags list

or with your docker-compose set up like this:

docker-compose -f docker-compose-CeleryExecutor.yml run --rm webserver airflow list_dags
docker-compose -f docker-compose-CeleryExecutor.yml run --rm webserver airflow dags list

You can also use this to run a bash shell or any other command in the same environment that airflow would be run in:

docker run --rm -ti puckel/docker-airflow bash
docker run --rm -ti puckel/docker-airflow ipython
docker run --rm -ti dataopssre/docker-airflow2 bash
docker run --rm -ti dataopssre/docker-airflow2 ipython

# Simplified SQL database configuration using PostgreSQL

If the executor type is set to anything else than *SequentialExecutor* you'll need an SQL database.
Here is a list of PostgreSQL configuration variables and their default values. They're used to compute
the `AIRFLOW__CORE__SQL_ALCHEMY_CONN` and `AIRFLOW__CELERY__RESULT_BACKEND` variables when needed for you
if you don't provide them explicitly:
42 changes: 21 additions & 21 deletions docker-compose-CeleryExecutor.yml
@@ -1,7 +1,7 @@
version: '2.1'
services:
redis:
image: 'redis:5.0.5'
image: 'redis:6.2'
# command: redis-server --requirepass redispass

postgres:
@@ -16,77 +16,77 @@ services:
# - ./pgdata:/var/lib/postgresql/data/pgdata

webserver:
image: dataopssre/docker-airflow2:2.1.2
build: .
restart: always
depends_on:
- postgres
- redis
environment:
- LOAD_EX=n
- FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
- EXECUTOR=Celery
- AIRFLOW__CORE__LOAD_EXAMPLES=false
- AIRFLOW__CORE__FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
- AIRFLOW__CORE__EXECUTOR=CeleryExecutor
# - POSTGRES_USER=airflow
# - POSTGRES_PASSWORD=airflow
# - POSTGRES_DB=airflow
# - REDIS_PASSWORD=redispass
volumes:
- ./dags:/usr/local/airflow/dags
- ./dags:/opt/airflow/dags
# Uncomment to include custom plugins
# - ./plugins:/usr/local/airflow/plugins
ports:
- "8080:8080"
command: webserver
healthcheck:
test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
interval: 30s
timeout: 30s
retries: 3

flower:
image: dataopssre/docker-airflow2:2.1.2
build: .
restart: always
depends_on:
- redis
environment:
- EXECUTOR=Celery
- AIRFLOW__CORE__EXECUTOR=CeleryExecutor
# - REDIS_PASSWORD=redispass
ports:
- "5555:5555"
command: flower
command: airflow celery flower

scheduler:
image: dataopssre/docker-airflow2:2.1.2
build: .
restart: always
depends_on:
- webserver
volumes:
- ./dags:/usr/local/airflow/dags
- ./dags:/opt/airflow/dags
# Uncomment to include custom plugins
# - ./plugins:/usr/local/airflow/plugins
environment:
- LOAD_EX=n
- FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
- EXECUTOR=Celery
- AIRFLOW__CORE__LOAD_EXAMPLES=false
- AIRFLOW__CORE__FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
- AIRFLOW__CORE__EXECUTOR=CeleryExecutor
# - POSTGRES_USER=airflow
# - POSTGRES_PASSWORD=airflow
# - POSTGRES_DB=airflow
# - REDIS_PASSWORD=redispass
command: scheduler
command: airflow scheduler

worker:
image: dataopssre/docker-airflow2:2.1.2
build: .
restart: always
depends_on:
- scheduler
volumes:
- ./dags:/usr/local/airflow/dags
- ./dags:/opt/airflow/dags
# Uncomment to include custom plugins
# - ./plugins:/usr/local/airflow/plugins
environment:
- FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
- EXECUTOR=Celery
- AIRFLOW__CORE__FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
- AIRFLOW__CORE__EXECUTOR=CeleryExecutor
# - POSTGRES_USER=airflow
# - POSTGRES_PASSWORD=airflow
# - POSTGRES_DB=airflow
# - REDIS_PASSWORD=redispass
command: worker
command: airflow celery worker
4 changes: 2 additions & 2 deletions docker-compose-LocalExecutor.yml
@@ -17,8 +17,8 @@ services:
depends_on:
- postgres
environment:
- LOAD_EX=n
- EXECUTOR=Local
- AIRFLOW__CORE__LOAD_EXAMPLES=false
- AIRFLOW__CORE__EXECUTOR=LocalExecutor
logging:
options:
max-size: 10m
25 changes: 3 additions & 22 deletions script/entrypoint.sh
@@ -8,14 +8,9 @@
TRY_LOOP="20"

# Global defaults and back-compat
: "${AIRFLOW_HOME:="/usr/local/airflow"}"
: "${AIRFLOW__CORE__FERNET_KEY:=${FERNET_KEY:=$(python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)")}}"
: "${AIRFLOW__CORE__EXECUTOR:=${EXECUTOR:-Sequential}Executor}"

# Load DAGs examples (default: Yes)
if [[ -z "$AIRFLOW__CORE__LOAD_EXAMPLES" && "${LOAD_EX:=n}" == n ]]; then
AIRFLOW__CORE__LOAD_EXAMPLES=False
fi
: "${AIRFLOW_HOME:="/opt/airflow"}"
: "${AIRFLOW__CORE__FERNET_KEY:=$(python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)")}"
: "${AIRFLOW__CORE__EXECUTOR:="SequentialExecutor"}"

export \
AIRFLOW_HOME \
@@ -110,27 +105,13 @@ fi
case "$1" in
webserver)
airflow db init
airflow users create --username airflow --password airflow --firstname Peter --lastname Parker --role Admin --email [email protected]
if [ "$AIRFLOW__CORE__EXECUTOR" = "LocalExecutor" ] || [ "$AIRFLOW__CORE__EXECUTOR" = "SequentialExecutor" ]; then
# With the "Local" and "Sequential" executors it should all run in one container.
airflow scheduler &
fi
exec airflow webserver
;;
worker|scheduler)
# Give the webserver time to run db init.
sleep 10
exec airflow "$@"
;;
flower)
sleep 10
exec airflow "$@"
;;
version)
exec airflow "$@"
;;
*)
# The command is something like bash, not an airflow subcommand. Just run it in the right environment.
exec "$@"
;;
esac
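
After this change the entrypoint falls back to its defaults only through the `: "${VAR:=default}"` idiom on the `AIRFLOW__CORE__*` variables, without the former `EXECUTOR` and `LOAD_EX` shortcuts. A minimal sketch of how that idiom behaves is shown here; the function `resolve_executor` is hypothetical and is not part of `script/entrypoint.sh`.

```shell
#!/bin/sh
# Hypothetical helper: shows which executor the ":=" default would select
# for a given incoming value of AIRFLOW__CORE__EXECUTOR.
resolve_executor() {
  # Run in a subshell so the caller's environment is left untouched.
  (
    AIRFLOW__CORE__EXECUTOR="$1"
    # ":=" assigns the default only when the variable is unset or empty.
    : "${AIRFLOW__CORE__EXECUTOR:=SequentialExecutor}"
    printf '%s\n' "$AIRFLOW__CORE__EXECUTOR"
  )
}

resolve_executor ""
resolve_executor "LocalExecutor"
```

An empty value falls back to `SequentialExecutor`, while an explicit value such as `LocalExecutor` is kept as-is.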
