All about Data Science, Data Engineering and Computer Science
View the Project on GitHub datainsightat/DataScience_Examples
Helps you manage data pipelines. Airflow vs. Beam.
Apache Airflow is an open source platform to programmatically author, schedule and monitor workflows.
Download Data > Process Data > Store Data
     |              |              |
    API           Spark       Insert/Update
Airflow is not a data streaming solution or a data processing framework.
Directed Acyclic Graph (DAG): a data pipeline. Nodes are tasks, edges are dependencies.
T1 \
T2 - T4
T3 /
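The diagram above can be sketched as a DAG file. This is a minimal sketch assuming Airflow 2.2+ (earlier 2.x versions use DummyOperator instead of EmptyOperator); the DAG id, start date and schedule are made-up values.

```python
# Sketch of the T1/T2/T3 -> T4 diagram as an Airflow DAG file.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="diagram_example",          # hypothetical DAG id
    start_date=datetime(2022, 2, 3),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = EmptyOperator(task_id="t1")
    t2 = EmptyOperator(task_id="t2")
    t3 = EmptyOperator(task_id="t3")
    t4 = EmptyOperator(task_id="t4")

    # Edges of the diagram: T1, T2 and T3 all run before T4.
    [t1, t2, t3] >> t4
```

Dropped into the dags/ folder, this shows up under the given dag_id in the web UI.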
A task in a data pipeline.
db = connect(host, credentials)
db.insert(sql_request)
Action operators: in charge of executing something (Python, Bash, SQL, …).
Transfer operators: transfer data from a source to a destination.
Sensors: wait for a condition to be met before doing something.
Task: an instance of an operator in a DAG.
Workflow: the combination of operators and dependencies in a DAG.
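A sketch of how these pieces combine in one DAG file, assuming Airflow 2.x; the file path, task ids and Bash command are illustrative only.

```python
# A sensor (wait for a condition) followed by an action operator
# (execute something); each instance is a task, the >> is a dependency.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator      # action operator
from airflow.sensors.filesystem import FileSensor    # sensor

with DAG(
    dag_id="operator_kinds",           # hypothetical DAG id
    start_date=datetime(2022, 2, 3),
    schedule_interval=None,
) as dag:
    # Sensor: wait for a file to exist before doing anything.
    wait_for_file = FileSensor(task_id="wait_for_file",
                               filepath="/tmp/data.csv")

    # Action operator: execute a Bash command.
    process = BashOperator(task_id="process",
                           bash_command="echo processing")

    wait_for_file >> process
```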
A provider adds new functionality (operators); extras only install additional dependencies.
$ docker container exec -it airflow /bin/bash
$ airflow -h
$ airflow db init
$ airflow db reset
Upgrade the schema of the metadata database.
$ airflow db upgrade
$ airflow webserver
$ airflow scheduler
$ airflow celery worker
$ airflow dags list
$ airflow dags trigger example_bash_operator -e 2022-02-03
$ airflow dags list-runs -d example_bash_operator
$ airflow dags backfill -s 2022-02-03 -e 2022-02-05 --reset-dagruns example_bash_operator
$ airflow tasks list example_bash_operator
$ airflow tasks test example_bash_operator runme_0 2022-01-01
Application   Service
Airflow ── Webserver -> Container
        \─ Scheduler -> Container
        \─ DB        -> Container
Containers run in the same network!
Dockerfile -> Docker Image (Airflow)
Docker Compose File -> Run Airflow Services
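As a rough illustration of the layout above, a stripped-down Compose file might look like this. This is a hypothetical sketch, not the official Airflow compose file; the service names, image tag and commands are assumptions.

```yaml
# Minimal sketch: one container per service, all on the
# shared default network Compose creates for the project.
version: "3"
services:
  postgres:                      # metadata DB
    image: postgres:13
  scheduler:
    image: apache/airflow:2.2.0
    command: scheduler
    depends_on: [postgres]
  webserver:
    image: apache/airflow:2.2.0
    command: webserver
    ports:
      - "8080:8080"
    depends_on: [postgres]
```

Because all three services share one network, the webserver and scheduler can reach the database by its service name.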