Code for blog at Data Engineering Project for Beginners.
Code available at beginner_de_project repository.
You can run this data pipeline using GitHub codespaces. Follow the instructions below.
- Create a codespace by going to the beginner_de_project repository, cloning it (or clicking the Use this template button), and then clicking the Create codespaces on main button.
- Wait for the codespace to start, then in the terminal type `make up`.
- Wait for `make up` to complete, and then wait for 30s (for Airflow to start).
- After 30s, go to the ports tab and click on the link exposing port 8080 to access the Airflow UI (username and password are both `airflow`).
Note: Make sure to switch off your codespaces instance when you are finished, since you only have limited free usage; see docs here.
To run locally, you need:
- git
- GitHub account
- Docker with at least 4GB of RAM and Docker Compose v1.27.0 or later
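The prerequisites above can be checked before running anything. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum`, and `missing_tools` are hypothetical, and it assumes you pass in the version string that your Docker Compose installation reports.

```python
import re
import shutil


def parse_version(text):
    """Extract (major, minor, patch) from strings like '1.27.0' or 'v2.24.5-desktop.1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(found, minimum="1.27.0"):
    """Compare versions numerically, so '1.9.0' is correctly older than '1.27.0'."""
    return parse_version(found) >= parse_version(minimum)


def missing_tools(tools=("git", "docker")):
    """Return the tools from `tools` that are not on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

For example, `meets_minimum("1.26.2")` is false while `meets_minimum("v2.24.5")` is true, and an empty list from `missing_tools()` suggests both git and Docker are installed.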
Clone the repo and run the following commands to start the data pipeline:
git clone https://github.com/josephmachado/beginner_de_project.git
cd beginner_de_project
make up
sleep 30 # wait for Airflow to start
make ci # run checks and tests
Go to http://localhost:8080 to see the Airflow UI. Username and password are both `airflow`.
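The fixed 30 second wait is a rough guess at how long Airflow takes to start. As an alternative, here is a minimal sketch that polls the web server until it answers. It assumes the Airflow UI is exposed at http://localhost:8080 as described above and that the web server serves a `/health` endpoint (which Airflow 2.x does; adjust the path if your version differs). The function name `wait_for_http` and its parameters are hypothetical and not part of the repository.

```python
import time
import urllib.request


def wait_for_http(url, timeout=120.0, interval=2.0, opener=urllib.request.urlopen):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds have elapsed.

    Returns True if the endpoint answered with 200, False otherwise.
    `opener` is injectable so the function can be exercised without a server.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers URLError, HTTPError, refused connections, and timeouts:
            # the server is not ready yet, so retry after a short pause.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_http("http://localhost:8080/health")` after `make up` could stand in for the `sleep 30` step; it returns as soon as the server responds instead of always waiting the full interval.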
This data engineering project includes the following:
1. Airflow: To schedule and orchestrate DAGs.
2. Postgres: To store Airflow's details (which you can see via the Airflow UI); it also has a schema to represent upstream databases.
3. DuckDB: To act as our warehouse.
4. Quarto with Plotly: To convert code in markdown format to HTML files that can be embedded in your app or served as is.
5. Apache Spark: To process our data, specifically to run a classification algorithm.
6. minio: To provide an S3-compatible open source storage system.
For simplicity, services 1-5 of the above are installed and run in one container, defined here.
The `user_analytics_dag` DAG in the Airflow UI will look like the image below:
On completion, you can see the dashboard HTML rendered at ./dags/scripts/dashboard/dashboard.html.
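To confirm that the DAG run produced the dashboard, you can check for the file and build a URI that a browser can open. This is a small sketch under the assumption that you run it from the repository root; the function name `dashboard_uri` is hypothetical.

```python
from pathlib import Path


def dashboard_uri(repo_root="."):
    """Return a file:// URI for the rendered dashboard, or None if it is missing."""
    path = Path(repo_root).resolve() / "dags" / "scripts" / "dashboard" / "dashboard.html"
    return path.as_uri() if path.is_file() else None
```

If the result is not `None`, passing it to `webbrowser.open` from the standard library should open the dashboard in your default browser.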
Read this post for information on setting up CI/CD, IaC (Terraform), "make" commands, and automated testing.