Introduction
The book “Data Pipelines with Apache Airflow” introduces many Airflow concepts, mechanisms, and usage patterns.
Here I would like to share notes from two chapters whose ideas, I feel, also apply to other software and applications.
Pipeline job best practices
- Coding style
  - Developers on a team should follow the same style guides, e.g.:
    - Use style checkers (linters) to enforce the style in merge-request CI/CD
    - Run code formatters before committing
    - Agree on application-specific style conventions
  - Prefer a factory pattern for better code reusability (sketched after this list)
- Manage configurations
  - Use a central place to manage all changeable parameters (sketched after this list)
  - Take care of credentials & secrets
  - Put lists of variables into YAML/JSON files
- Design tasks
  - Group closely related tasks together
  - Use version control, and create new jobs for big changes
  - A single task should be designed to be (see the combined example after this list):
    - idempotent (rerunning the task leaves the same results in the source & destination)
    - deterministic (rerunning the task with the same input yields the same output)
    - written in a functional style
    - light on local resources (CPU, filesystem)
  - Optimize the tasks:
    - decouple light and heavy workloads into different environments, and connect them with asynchronous methods
    - split large datasets into smaller/incremental chunks for higher efficiency
    - improve task speed by caching intermediate data
  - Always add monitoring and alerting
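
To make the factory-pattern bullet concrete, here is a minimal sketch (assuming Airflow 2.4+; the dataset names, schedules, and callable are made up) in which one function builds several structurally identical DAGs from parameters:

```python
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator


def create_ingest_dag(dataset: str, schedule: str) -> DAG:
    """Factory: build one ingestion DAG per dataset from shared logic."""
    with DAG(
        dag_id=f"ingest_{dataset}",
        schedule=schedule,
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="ingest",
            # placeholder callable; real DAGs would call shared ingestion code
            python_callable=lambda ds=dataset: print(f"ingesting {ds}"),
        )
    return dag


# Hypothetical datasets; each one gets its own DAG with the same structure.
for name, cron in [("orders", "@daily"), ("customers", "@hourly")]:
    globals()[f"ingest_{name}"] = create_ingest_dag(name, cron)
```

Because the shared structure lives in one function, a style or logic fix propagates to every generated DAG.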
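For the configuration bullet, one simple approach (the file name and keys here are hypothetical) is to load every tunable parameter from a single YAML file, while credentials stay in Airflow Connections or a secrets backend:

```python
from pathlib import Path

import yaml

# Hypothetical central config file checked in next to the DAGs,
# e.g. dags/config/pipelines.yaml
CONFIG_PATH = Path(__file__).parent / "config" / "pipelines.yaml"


def load_pipeline_config() -> dict:
    """Read all changeable parameters from one central YAML file."""
    with open(CONFIG_PATH) as f:
        return yaml.safe_load(f)


config = load_pipeline_config()
# e.g. config["source_bucket"], config["batch_size"], ...
# Credentials & secrets do not belong here; keep them in Airflow
# Connections or a dedicated secrets backend.
```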
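The idempotency, incremental-loading, and alerting bullets can be illustrated together with a task that processes only its own schedule interval and overwrites the matching output partition, so a rerun reproduces exactly the same destination state; the S3 path and callback body are placeholders (again assuming Airflow 2.4+):

```python
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    """Alerting hook; a real callback would page or post to a chat channel."""
    print(f"task failed: {context['task_instance'].task_id}")


def export_partition(data_interval_start=None, **_):
    """Export one day of data, overwriting its output partition.

    Rerunning the task for the same interval rewrites the same partition
    (idempotent), and each run only touches a small incremental slice of
    the dataset instead of reprocessing everything.
    """
    partition = data_interval_start.to_date_string()  # e.g. "2024-01-01"
    output_path = f"s3://my-bucket/exports/date={partition}/"  # hypothetical
    print(f"overwriting partition {output_path}")


with DAG(
    dag_id="daily_export",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=True,
    default_args={"on_failure_callback": notify_on_failure},
) as dag:
    PythonOperator(task_id="export_partition", python_callable=export_partition)
```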
Airflow deployment pattern evolution
Simple architecture:
+ Use a single worker with multiple processes:
++ Use a queue to manage multiple workers:
Cloud environment deployment:
+ Use AWS services:
  - NAT gateway + ALB (Application Load Balancer): public endpoints
  - ENI (Elastic Network Interface): connects the public & private subnets
  - Fargate: Airflow webserver
  - RDS: metastore
  - EFS (Elastic File System): shared local storage
  - S3: log storage & object/data storage
  - Fargate: core engine (scheduler and workers)
++ Use Lambda for CI/CD
++ Use SQS (queue) to manage workers from the scheduler (see the configuration sketch below)
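
As a rough sketch of how the queue-based pieces above map onto Airflow settings (hypothetical endpoints and bucket names; networking, IAM, and Fargate task definitions are out of scope here), an airflow.cfg excerpt might look like:

```
[core]
# one scheduler, many workers: CeleryExecutor hands tasks over via a queue
executor = CeleryExecutor

[database]
# metastore on RDS (hypothetical endpoint)
sql_alchemy_conn = postgresql+psycopg2://airflow:***@my-rds-endpoint:5432/airflow

[celery]
# SQS as the Celery broker between the scheduler and the workers
broker_url = sqs://

[logging]
# ship task logs to S3 instead of the local disk
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/
remote_log_conn_id = aws_default
```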