Fyber - airflow best practices in production

Airflow
Production tales
Eran Shemesh - Senior Big Data Developer

Why?
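[Diagram: three flows built from these steps: Spark, Update DB; Http, Spark, Update DB, Send emails; Http, Spark, Update DB. Step durations shown: 30m-50m, 5m-10m, 1h-1.5h, 10 sec, 1m-3m, 20m-40m, 10 sec.]
The cron way
[Diagram: every step has its own cron entry: 0 * * * *, 0 * * * *, 15 * * * *, 50 * * * *, 0 * * * *, 5 * * * *, 55 * * * *.]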

The cron way
■ Each valid flow takes more time than it should
■ Each job must know the time buffer between its trigger time and the time it can actually start working
■ If a task in the flow is retried, the whole flow can fail
■ What if the time buffer is sometimes not enough?
■ What if one of the systems that runs a cron job was down for one or more runs?
■ What if the input data to a flow was incorrect?
■ What if a product requirement change means I need to re-run the past X runs?
■ Visibility
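
The first bullets can be made concrete with a little arithmetic. This is only a sketch: the offsets and durations are hypothetical numbers, not the ones from the diagram, and the function names are invented for illustration. It compares a flow whose steps start at fixed cron minutes with the same flow when each step starts as soon as the previous one ends, and it finds the steps whose buffer was too short.

```python
from typing import List


def cron_flow_minutes(offsets: List[int], durations: List[int]) -> int:
    """Minutes from the first cron trigger until the last step finishes."""
    return offsets[-1] - offsets[0] + durations[-1]


def chained_flow_minutes(durations: List[int]) -> int:
    """Minutes for the same flow when each step starts right after the previous one."""
    return sum(durations)


def buffer_violations(offsets: List[int], durations: List[int]) -> List[int]:
    """Indexes of steps whose cron trigger fires before the previous step has finished."""
    return [
        i
        for i in range(1, len(offsets))
        if offsets[i - 1] + durations[i - 1] > offsets[i]
    ]


# Hypothetical flow: cron triggers at minute 0, 50 and 60, sized for the worst case.
offsets = [0, 50, 60]
typical = [30, 5, 1]  # a normal hour
slow = [55, 5, 1]     # the first step overruns its 50-minute buffer

print(cron_flow_minutes(offsets, typical))  # 61
print(chained_flow_minutes(typical))        # 36
print(buffer_violations(offsets, slow))     # [1]
```

In the normal hour the cron flow needs 61 minutes for 36 minutes of work; in the slow hour the second step starts before its input is ready.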

The airflow way
■ Tasks really depend on each other
■ Easily scalable
■ Web UI
■ Can recover from downtime

The airflow way
■ Each valid flow takes more time than it should
■ Each job must know the time buffer between its trigger time and the time it can actually start working
■ If a task in the flow is retried, the whole flow can fail
■ What if the buffer is sometimes not enough?
■ What if one of the systems that runs a cron job was down for one or more runs?
■ What if the input data to a flow was incorrect?
■ What if a product requirement change means I need to re-run the past X runs?

Hello Airflow
■ An HTTP request to invoke a job on Databricks (SimpleHttpOperator)
■ Extract the Databricks task_id from the response (PythonOperator)
■ Monitor the task's progress by task id (HttpSensor)
■ On success, get the result (SimpleHttpOperator)
■ Extract the result from the HTTP response (PythonOperator)
SimpleHttpOperator → PythonOperator → HttpSensor → SimpleHttpOperator → PythonOperator
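
The Python side of this chain could look roughly like the sketch below. It covers only the parsing logic, as plain functions that take a response body as text. The JSON field names (`task_id`, `state`, `result`) and the state values are assumptions made for illustration and are not the real Databricks response format. In a DAG, functions like these would typically be handed to the PythonOperator steps as `python_callable`, and the status check would back the HttpSensor's `response_check`.

```python
import json


def extract_task_id(response_text: str) -> str:
    """Pull the remote task id out of the 'invoke job' response body."""
    payload = json.loads(response_text)
    return str(payload["task_id"])  # hypothetical field name


def is_task_done(response_text: str) -> bool:
    """Return True once the job has succeeded; raise if it failed so polling stops."""
    state = json.loads(response_text)["state"]  # hypothetical field name
    if state == "FAILED":
        raise RuntimeError("remote job failed")
    return state == "SUCCESS"


def extract_result(response_text: str) -> dict:
    """Pull the job result out of the final response body."""
    return json.loads(response_text)["result"]  # hypothetical field name


print(extract_task_id('{"task_id": 42}'))                # 42
print(is_task_done('{"state": "RUNNING"}'))              # False
print(extract_result('{"result": {"rows": 10}}'))        # {'rows': 10}
```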

Sub-DAGs
■ An operator like any other, which runs a group of tasks on its own
■ Better visualisation
■ Reusable components
■ Encapsulation

Retryable Sub-DAGs
■ There is no retry mechanism at the DAG level, only at the task level
■ Out of the box, a sub DAG does not retry well
■ We used the sub DAG's on_retry_callback to implement its retry mechanism when needed

Sub dags - use with caution!
[Diagram: a worker whose concurrency level is filled by a mix of subdags and tasks (subdag, task, task, subdag, task, task), with more waiting (task, subdag, task, task).]

[Diagram: the same worker with four slots held by subdags and two by tasks.]

[Diagram: the same worker with every slot held by a subdag, so no slot is left for the tasks inside them.]

Airflow 1.10's default solution:
SequentialExecutor (one process to run them all)
[Diagram: a worker holding four subdags and two tasks, next to a thread pool running four tasks.]

Second option:
Add more workers!
[Diagram: Worker 1 with every slot held by a subdag; Worker 2 with task, subdag, task; Worker 3 with task, task.]
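
The slot arithmetic behind these diagrams can be sketched as follows. The sketch assumes that every running subdag holds one worker slot for as long as its inner tasks are unfinished, and that those inner tasks compete for the same slots. The function names and numbers are hypothetical.

```python
def slots_left_for_tasks(workers: int, concurrency: int, running_subdags: int) -> int:
    """Slots that remain for ordinary tasks once running subdags hold theirs."""
    return max(workers * concurrency - running_subdags, 0)


def is_stuck(workers: int, concurrency: int, running_subdags: int) -> bool:
    """True when subdags hold every slot, so none of their inner tasks can start."""
    return running_subdags > 0 and slots_left_for_tasks(
        workers, concurrency, running_subdags
    ) == 0


print(slots_left_for_tasks(1, 6, 2))  # 4
print(is_stuck(1, 6, 6))              # True
print(is_stuck(3, 6, 7))              # False
```

Adding workers only raises the threshold: with enough concurrent subdags the same situation returns.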

Monitoring pipeline
A typical flow

Each task (or group of tasks) is followed by a monitoring task

Each monitoring task is a group of tasks for monitoring and auto fixing

Building modules
DAG extensions
■ A template of tasks and the dependencies between them
■ Using the template method design pattern, the module dictates the general flow, to be implemented by different business logic subclasses
■ Most commonly used inside a sub DAG, as in the monitoring example

Creating a template for a set of tasks

Further extending this template when needed
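
A minimal sketch of such a template in plain Python, without any Airflow imports. The base class fixes the flow (check, fix, verify) and returns task callables plus dependency edges, subclasses supply the business logic, and a further subclass extends the flow with one more step. All class names, task names and the row-count logic are hypothetical; in a real DAG the callables and edges would be turned into operators and dependencies.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple

Tasks = Dict[str, Callable[[], object]]
Edges = List[Tuple[str, str]]


class MonitoringTemplate(ABC):
    """Fixes the flow check -> fix -> verify; subclasses supply the business logic."""

    def build(self, prefix: str) -> Tuple[Tasks, Edges]:
        check, fix, verify = f"{prefix}.check", f"{prefix}.fix", f"{prefix}.verify"
        tasks: Tasks = {check: self.check, fix: self.fix, verify: self.check}
        edges: Edges = [(check, fix), (fix, verify)]
        return tasks, edges

    @abstractmethod
    def check(self) -> bool:
        """Return True when the monitored data looks healthy."""

    @abstractmethod
    def fix(self) -> None:
        """Try to repair the monitored data."""


class RowCountMonitoring(MonitoringTemplate):
    """Hypothetical business logic: the table must not be empty."""

    def __init__(self) -> None:
        self.rows = 0

    def check(self) -> bool:
        return self.rows > 0

    def fix(self) -> None:
        self.rows = 100  # stand-in for re-running the load


class NotifyingRowCountMonitoring(RowCountMonitoring):
    """Extends the template with one more step after verify."""

    def build(self, prefix: str) -> Tuple[Tasks, Edges]:
        tasks, edges = super().build(prefix)
        notify = f"{prefix}.notify"
        tasks[notify] = lambda: "notified"
        edges.append((f"{prefix}.verify", notify))
        return tasks, edges


tasks, edges = NotifyingRowCountMonitoring().build("load")
print(edges)
```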


Use case 1: Skipping daily tasks
Hourly and daily flow
■ Each hourly run calculates the hourly aggregation and then the daily aggregation
■ When fixing data, or when the task runs are delayed, it's unnecessary to calculate partial daily aggregations
■ Using the ShortCircuitOperator, we check whether the next execution should have happened already
■ If it should have, we skip all following tasks in the same DAG run
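
The check behind the ShortCircuitOperator could be sketched like this. It is an illustration under stated assumptions, not the code from the talk: the function name is hypothetical, and it assumes that a run starts one schedule interval after its execution date, so the next run is due at `next_execution_date + interval`. A ShortCircuitOperator skips its downstream tasks when its callable returns a falsy value, so the function returns False for runs that are behind.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_run_daily(
    next_execution_date: datetime,
    interval: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return False when the next run is already due, i.e. this run is catching up."""
    now = now or datetime.now(timezone.utc)
    next_run_due = next_execution_date + interval
    return now < next_run_due


hour = timedelta(hours=1)
next_exec = datetime(2019, 1, 23, 11, 0, tzinfo=timezone.utc)

# On time: the next run is due at 12:00, so the daily aggregation runs.
print(should_run_daily(next_exec, hour, now=datetime(2019, 1, 23, 11, 5, tzinfo=timezone.utc)))  # True
# Catching up: the next run was due hours ago, so the daily aggregation is skipped.
print(should_run_daily(next_exec, hour, now=datetime(2019, 1, 23, 15, 0, tzinfo=timezone.utc)))  # False
```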


Use case 2: Programmatically clearing a DAG
S3/{bucket_name}/day=23

■ Create a DAG that executes a single day's flow
■ This DAG is scheduled by another DAG, not by Airflow's scheduler
■ The scheduling DAG would:
○ Create a new run for each day in the target DAG
○ Clear the target DAG runs for the previous 14 days

Using another DAG to clear the above DAG for the last 14 days:
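
A minimal sketch of the date arithmetic behind that step, assuming the scheduling DAG runs once per day and clears the 14 days before its own run day. The helper name and the example date are hypothetical, and the Airflow call that performs the actual clearing is not shown.

```python
from datetime import date, timedelta
from typing import List


def days_to_clear(run_day: date, lookback_days: int = 14) -> List[date]:
    """Days before run_day whose runs in the target DAG should be cleared, oldest first."""
    return [run_day - timedelta(days=offset) for offset in range(lookback_days, 0, -1)]


window = days_to_clear(date(2019, 1, 23))
print(window[0], window[-1], len(window))  # 2019-01-09 2019-01-22 14
```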

Tips and best practices
■ Create only idempotent tasks
■ Note that the worker only creates an OS process for each task
■ Always set retries on a task; the workers can fail!
■ Use connections to store passwords and secret keys (so they are kept encrypted)
■ Note that your Python files get executed constantly by the scheduler
■ Use a docker compose environment on your dev machine
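
As an illustration of the first tip, here is one way an idempotent task body might look. It is a sketch with a hypothetical function name and file layout: the output path depends only on the day being processed, and the file is replaced rather than appended to, so a retry or a re-run leaves the same result.

```python
import json
import tempfile
from pathlib import Path


def write_daily_output(base_dir: Path, day: str, rows: list) -> Path:
    """Write one day's rows to a path derived only from the day, replacing earlier output."""
    target = base_dir / f"day={day}" / "part-0.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(rows))
    tmp.replace(target)  # rename over the old file instead of appending to it
    return target


with tempfile.TemporaryDirectory() as tmp_dir:
    base = Path(tmp_dir)
    first = write_daily_output(base, "23", [{"clicks": 3}])
    second = write_daily_output(base, "23", [{"clicks": 3}])  # simulated retry
    print(first == second)                          # True
    print(len(list((base / "day=23").iterdir())))   # 1
```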
