Airflow
Production tales
Eran Shemesh - Senior Big Data Developer
Pipeline
Airflow’s Architecture
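Why?
The cron way
[Diagram: flows scheduled with cron. Steps shown: Spark, Update DB, Http, Spark, Update DB, Send emails, Http, Spark, Update DB. Step durations shown: 30m-50m, 5m-10m, 1h-1.5h, 10 sec, 1m-3m, 20m-40m, 10 sec. Cron expressions shown: 0 * * * *, 0 * * * *, 15 * * * *, 50 * * * *, 0 * * * *, 5 * * * *, 55 * * * *]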
Why?
The cron way
■ Each valid flow takes more time than it should
■ Each job should be aware of the buffer between its execution time and its working time
■ If a task in the flow has to be retried, the whole flow can fail
■ What if the time buffer is sometimes not enough?
■ What if one of the systems that runs a cron job was down for a run or more?
■ What if the input data to a flow was incorrect?
■ What if, because of a product requirement change, I need to re-run the past X runs?
■ Visibility
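The buffer problem in the list above can be made concrete with a little arithmetic. The sketch below is illustrative only: the start minute, the downstream cron minute, and the durations are hypothetical values chosen to resemble the figures on the previous slide, and `slack_minutes` is a name invented for this example.

```python
def slack_minutes(upstream_start, upstream_duration, downstream_start):
    """Minutes between the upstream job finishing and the downstream cron firing.

    Positive: the downstream job sits idle for that long (wasted time).
    Negative: the downstream job starts before its input is ready.
    """
    upstream_finish = upstream_start + upstream_duration
    return downstream_start - upstream_finish


# Hypothetical setup: upstream fires at minute 0, downstream at minute 50.
for duration in (30, 50, 55):
    slack = slack_minutes(0, duration, 50)
    if slack >= 0:
        print(f"upstream took {duration}m: downstream idles for {slack}m")
    else:
        print(f"upstream took {duration}m: downstream starts {-slack}m too early")
```

With a fixed offset, a fast upstream run wastes time and a slow one breaks the flow; a dependency-driven scheduler removes the need to guess the offset.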
Why?
The airflow way
■ Tasks are really dependent on each other
■ Easily scalable
■ Web UI
■ Can recover from downtime
Why?
The airflow way
■ Each valid flow takes more time than it should
■ Each job should be aware of the buffer between its execution time and its working time
■ If a task in the flow has to be retried, the whole flow can fail
■ What if the buffer is sometimes not enough?
■ What if one of the systems that runs a cron job was down for a run or more?
■ What if the input data to a flow was incorrect?
■ What if, because of a product requirement change, I need to re-run the past X runs?
Hello Airflow
■ An HTTP request to invoke a job on Databricks (SimpleHttpOperator)
■ Extract the Databricks task_id from the response (PythonOperator)
■ Monitor the task's progress by task id (HttpSensor)
■ In case of success, get the result (SimpleHttpOperator)
■ Extract the result from the HTTP response (PythonOperator)
SimpleHttpOperator -> PythonOperator -> HttpSensor -> SimpleHttpOperator -> PythonOperator
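The five steps can be sketched in plain Python to show the data passed from one step to the next. This is not Airflow code and not the Databricks API: `FakeJobService`, its JSON fields (`task_id`, `state`, `output`), and `max_pokes` are all assumptions made up for the illustration.

```python
import json


class FakeJobService:
    """Hypothetical stand-in for the remote job API (assumption, not Databricks)."""

    def __init__(self, polls_until_done=2):
        self._polls_left = polls_until_done

    def submit(self):
        return json.dumps({"task_id": 42})

    def status(self, task_id):
        self._polls_left -= 1
        state = "DONE" if self._polls_left <= 0 else "RUNNING"
        return json.dumps({"task_id": task_id, "state": state})

    def result(self, task_id):
        return json.dumps({"task_id": task_id, "output": "ok"})


def extract_task_id(response_text):
    return json.loads(response_text)["task_id"]


def is_done(response_text):
    return json.loads(response_text)["state"] == "DONE"


def extract_result(response_text):
    return json.loads(response_text)["output"]


def run_flow(service, max_pokes=10):
    submit_response = service.submit()           # step 1: invoke the job
    task_id = extract_task_id(submit_response)   # step 2: extract the task id
    for _ in range(max_pokes):                   # step 3: poll until done
        if is_done(service.status(task_id)):
            break
    else:
        raise TimeoutError(f"task {task_id} did not finish in {max_pokes} pokes")
    result_response = service.result(task_id)    # step 4: fetch the result
    return extract_result(result_response)       # step 5: extract the result


print(run_flow(FakeJobService()))
```

In an Airflow DAG each numbered step would be its own task, so a failure in polling can be retried without re-submitting the job.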
Fyber - airflow best practices in production
Subdags
Use with caution!
Sub-DAGs
■ An operator like any other, for self-running a group of tasks
■ Better visualisation
■ Reusable components
■ Encapsulation
Retryable Sub DAGs
■ There is no retry mechanism at the DAG level, only at the task level
■ Out of the box, a sub DAG does not retry well
■ We used the sub DAG's on_retry_callback to implement its retry mechanism when needed
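One plausible reading of why a retry hook is needed: if the inner tasks keep their failed state, re-running the group changes nothing. The toy model below is plain Python, not Airflow; `TaskGroup`, `clear_failed`, and `make_flaky` are hypothetical names, and the callback's behaviour (resetting failed inner tasks) is an assumption rather than the author's confirmed implementation.

```python
class TaskGroup:
    """Toy group of tasks with a group-level retry and an optional retry hook."""

    def __init__(self, tasks, retries=1, on_retry_callback=None):
        self.tasks = tasks                          # name -> callable
        self.states = {name: None for name in tasks}
        self.retries = retries
        self.on_retry_callback = on_retry_callback

    def run_once(self):
        for name, fn in self.tasks.items():
            if self.states[name] == "success":
                continue                            # already done, do not redo
            if self.states[name] == "failed":
                return False                        # stale failure blocks the group
            try:
                fn()
                self.states[name] = "success"
            except Exception:
                self.states[name] = "failed"
                return False
        return True

    def run(self):
        for attempt in range(self.retries + 1):
            if self.run_once():
                return True
            if attempt < self.retries and self.on_retry_callback:
                self.on_retry_callback(self)        # give the hook a chance to reset state
        return False


def clear_failed(group):
    """Retry hook: reset failed inner tasks so the next attempt re-runs them."""
    for name in list(group.states):
        if group.states[name] == "failed":
            group.states[name] = None


def make_flaky(failures):
    """Return a task that raises `failures` times and then succeeds."""
    remaining = {"n": failures}

    def task():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("transient failure")

    return task


without_hook = TaskGroup({"extract": lambda: None, "load": make_flaky(1)})
with_hook = TaskGroup({"extract": lambda: None, "load": make_flaky(1)},
                      on_retry_callback=clear_failed)
print(without_hook.run(), with_hook.run())
```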
Airflow’s Architecture
Sub dags - use with caution!
[Diagram: Worker, Concurrency Level: subdag, task, task, subdag, task, task. Also shown: task, subdag, task, task]
[Diagram: Worker, Concurrency Level: subdag, subdag, subdag, subdag, task, task. Also shown: task, subdag, task, task]
[Diagram: Worker, Concurrency Level: subdag, subdag, subdag, subdag, subdag, subdag. Also shown: task, subdag, task, task]
Airflow 1.10's default solution: SequentialExecutor (one process to run them all)
[Diagram: Worker, Thread pool: subdag, subdag, subdag, subdag, task, task. Also shown: task, subdag, task, task; task, task, task, task]
Second option - add more workers!
[Diagram: Worker 1, Concurrency Level: six subdags. Worker 2, Concurrency Level: task, subdag, task. Worker 3, Concurrency Level: task, task. Also shown: task, subdag, task, task]
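The hazard in these diagrams can be reproduced with a small simulation. It is a simplified model, not Airflow's scheduler: it assumes a sub DAG operator holds a worker slot until all of its child tasks have run, that child tasks need slots from the same pool, and that the queue is first-in first-out. The function name and the item encoding are invented for this sketch.

```python
from collections import deque


def simulate(slots, items, inline_children=False):
    """Return "completed" or "deadlock" for a single pool of worker slots.

    items: ("task", None) for a plain task, ("subdag", n) for a sub DAG with
    n child tasks. With inline_children=True the sub DAG runs its children
    inside its own process, so they never compete for slots.
    """
    queue = deque(items)
    waiting = {}      # sub DAG id -> children still to run (each holds a slot)
    next_id = 0
    while queue:
        free = slots - len(waiting)
        if free <= 0:
            return "deadlock"   # every slot is held by a sub DAG whose children cannot start
        for _ in range(min(free, len(queue))):
            kind, value = queue.popleft()
            if kind == "subdag":
                if value > 0 and not inline_children:
                    waiting[next_id] = value
                    queue.extend([("task", next_id)] * value)
                    next_id += 1
            elif value is not None:
                waiting[value] -= 1          # a child task ran
                if waiting[value] == 0:
                    del waiting[value]       # parent sub DAG releases its slot
    return "completed"


work = [("subdag", 2), ("subdag", 2), ("task", None)]
print(simulate(2, work))                        # sub DAGs fill both slots
print(simulate(3, work))                        # one spare slot lets children run
print(simulate(2, work, inline_children=True))  # children run inside the sub DAG
```

The second and third calls correspond to the two remedies on the slides: more worker capacity, or running the children sequentially inside the sub DAG's own process.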
Monitoring
And auto fixing...
Pipeline
Monitoring pipeline
A typical flow
Each task (or group of tasks) is followed by a monitoring task
Each monitoring task is a group of tasks for monitoring and auto fixing
Building modules
DAG extensions
■ A template of tasks and dependencies between them
■ Using the template method design pattern, the module dictates the general flow, to be implemented by different business logic subclasses
■ Most commonly used inside a sub dag, like in the monitoring example
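A minimal sketch of the template method pattern applied to a monitoring module, in plain Python rather than Airflow operators. The class names, the check/fix/recheck flow, and the sample data are assumptions chosen for illustration; the deck's own modules are not reproduced here.

```python
from abc import ABC, abstractmethod


class MonitoringModule(ABC):
    """Template: the base class fixes the flow, subclasses supply the logic."""

    def run(self, data):
        # General flow dictated by the module: check, fix if needed, re-check.
        if self.check(data):
            return data
        fixed = self.fix(data)
        if not self.check(fixed):
            raise ValueError("auto fix did not resolve the problem")
        return fixed

    @abstractmethod
    def check(self, data):
        """Return True when the data is valid."""

    @abstractmethod
    def fix(self, data):
        """Return a repaired copy of the data."""


class NonNegativeCounts(MonitoringModule):
    """Hypothetical business rule: counts must not be negative."""

    def check(self, data):
        return all(value >= 0 for value in data)

    def fix(self, data):
        return [max(value, 0) for value in data]


print(NonNegativeCounts().run([1, -2, 3]))
```

In a DAG, `check`, `fix`, and the re-check would each become a task, with the base class wiring the dependencies between them.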
Creating a template for a set of tasks
Further extending this template when needed
Some dev paradigms
Use case 1: Skipping daily tasks
Hourly and daily flow
■ Each hourly run calculates the hourly aggregation and then the daily aggregation
■ When fixing data, or when task runs are delayed, it's unnecessary to calculate partial daily aggregations
■ Using the ShortCircuitOperator, we check whether the next execution should already have happened
■ If it has, we skip all following tasks in the same DAG run
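The check described above can be sketched as a plain function of the kind a ShortCircuitOperator would call: returning a falsy value skips the downstream tasks. The function name and the timing rule are assumptions. The sketch assumes Airflow's convention that a run covers the interval starting at its execution date and is triggered when that interval ends, so the next run becomes due two intervals after the current execution date.

```python
from datetime import datetime, timedelta


def should_run_daily_aggregation(execution_date, now, interval=timedelta(hours=1)):
    """Return True to continue to the daily tasks, False to skip them.

    Assumption: the run for `execution_date` is triggered at
    execution_date + interval, so the next run is due at
    execution_date + 2 * interval. If that moment has passed, a later run
    will recompute the daily aggregation anyway.
    """
    next_run_due = execution_date + 2 * interval
    return now < next_run_due


run = datetime(2020, 1, 1, 10, 0)
print(should_run_daily_aggregation(run, now=datetime(2020, 1, 1, 11, 5)))   # on time
print(should_run_daily_aggregation(run, now=datetime(2020, 1, 1, 14, 30)))  # backfill or delay
```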
Use case 2: Programmatically clearing DAG
S3/{bucket_name}/day=23
S3/{bucket_name}/day=22
S3/{bucket_name}/day=21
S3/{bucket_name}/day=10
■ Creating a DAG for executing a single day's flow
■ The above DAG is scheduled by another DAG (and not by Airflow's scheduler)
■ The scheduling DAG would:
○ Create a new run for each day in the target DAG
○ Clear the target DAG runs for the previous 14 days
Using another DAG to clear the above DAG for the last 14 days:
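A minimal sketch of the date arithmetic such a scheduling DAG might perform, in plain Python rather than Airflow's API. The function names and the bucket name are hypothetical, and the window is assumed to include the reference day, matching the day=23 down to day=10 partitions shown earlier.

```python
from datetime import date, timedelta


def days_to_clear(reference_day, lookback_days=14):
    """Return the days whose target DAG runs should be cleared, newest first.

    Assumption: the window includes `reference_day` itself.
    """
    return [reference_day - timedelta(days=offset) for offset in range(lookback_days)]


def partition_path(bucket_name, day):
    """Build a partition path in the style shown on the slide."""
    return f"S3/{bucket_name}/day={day.day}"


days = days_to_clear(date(2020, 1, 23))
for day in days[:3]:
    print(partition_path("my-bucket", day))   # "my-bucket" is a placeholder
print(len(days), days[-1])
```

In Airflow, each date would map to one run of the target DAG to clear; the mechanism used to clear it is left out here on purpose.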
Tips and best practices
■ Create only idempotent tasks
■ Notice that the worker only creates an OS process for each task
■ Always set retries on a task; the workers can fail!
■ Use connections to store passwords and secret keys (for encryption)
■ Notice that your Python files get executed constantly by the scheduler
■ Use a Docker Compose environment on your dev machine
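To illustrate the first tip, here is a sketch of an idempotent write: the output location is derived from the run's day and is overwritten rather than appended to, so a retry or a cleared run leaves the same result. The function name, directory layout, and file name are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def write_daily_output(base_dir, day, rows):
    """Write `rows` for `day`, replacing any output from a previous attempt."""
    target = Path(base_dir) / f"day={day}" / "part.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(rows))   # write to a temporary file first
    tmp.replace(target)                # then swap it in, overwriting old output
    return target


with tempfile.TemporaryDirectory() as base:
    first = write_daily_output(base, "2020-01-23", [{"clicks": 10}])
    second = write_daily_output(base, "2020-01-23", [{"clicks": 10}])  # simulated retry
    print(first == second, json.loads(second.read_text()))
```

An append-based version would double the rows on every retry, which is exactly what makes retries and re-runs unsafe.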
Thanks!
