Data Engineering project for ZoomCamp'24: JSONL -> PostgreSQL + Metabase + Mage.AI
ETL for the Amazon Reviews'23 dataset.
Cloud environment: GitHub CodeSpaces - free and enough for testing (120 core-hours of Codespaces compute and 15 GB of Codespaces storage are free every month).
Reproducing and reviewing this project should (hopefully) take less than an hour and about 8 GB of storage in total for the datasets described below. You don't need anything extra like VS Code or trial accounts - a web browser and a GitHub account are enough.
This level of cloud resources allowed me to process, analyze and visualize datasets with at least 6 million reviews without issues.
This is my Data Engineering project started during DE ZoomCamp'24. The main goal is straightforward: Extract, Transform and Load data, then visualize some insights.
I chose to analyze the Amazon Reviews dataset. The full dataset is huge, but it can be downloaded in parts, by product category. The sub-datasets vary in size, and we will explore several of them here (Digital_Music, Health_and_Personal_Care, All_Beauty, Software). Each subset includes 2 files: products (meta) and ratings with user comments (reviews).
Some time ago I became very interested in how much we can trust all those ratings on Amazon, including 'bestseller' statuses. But without proper data it was hard to investigate. Now, years later, I have more skills in data analytics and this huge dataset, thanks to McAuley Lab. I chose to use only part of the available information:
- Products: categories and average ratings
- Reviews: user ratings, review dates, and whether the purchase was verified or not (I will analyze the review texts later - an awesome source for many insights, I think); see the small field sketch below
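To make that scope concrete, here is a tiny illustration (in Python) of the handful of fields per record this project cares about. The field names follow the published Amazon Reviews'23 schema, but treat them as my assumption - check the actual `.jsonl` files and `process_jsonl.py` for the exact names used.

```python
# Illustration only: the subset of fields this project focuses on.
# Field names are assumed from the Amazon Reviews'23 docs - verify against the data.
meta_record = {
    "parent_asin": "B00EXAMPLE",       # product id that reviews link to
    "main_category": "Digital Music",  # product category
    "average_rating": 4.3,             # average product rating
}

review_record = {
    "parent_asin": "B00EXAMPLE",       # joins the review back to the product
    "rating": 5.0,                     # user's star rating
    "timestamp": 1577836800000,        # review date (milliseconds since epoch)
    "verified_purchase": True,         # was it a verified purchase?
}
```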
Thanks to ZoomCamp for giving me a reason to learn many new tools and get back to my 'reviews & trust' questions!
- What are the trends in review ratings for verified vs. not verified purchases? Are they more negative or more positive than average?
- As reviews from not verified purchases can be manipulative, what is their share of the total number of reviews?
- As automated review submission has become technically easier in recent years, what are the rating trends over time for not verified purchases?
- Are there any significant differences in trends between product categories?
Let's explore together! These are the categories (datasets) I have played with so far.
- dataset urls are defined in .csv files stored in `/data` - `dataset_urls0.csv`, `dataset_urls1.csv`, etc
- my scripts (bash, python, sql)
- pre-configured Metabase dashboard
- pre-configured Mage AI pipelines
- Docker and docker-compose. All dockerized apps use default bridge network (172.17.0.x).
- PostgreSQL as a database/warehouse - open source and free
- Metabase as an analytics platform - open source, free, self-hosted
- Mage AI as an orchestration tool - open source, free, self-hosted
💡 The combination of CodeSpaces + PostgreSQL + Metabase + Mage AI is a good choice for those who prefer (or want to experiment with) a simple and open source approach, or those who hesitate to deal with the Big 3 cloud providers (AWS, Azure, GCP) and their serious costs. It looks like a good fit for early-stage startups, developers, and those who are learning Data Engineering like me (you?). Find one more open source and free option at the end of this instruction.
Oh, so many steps?! Please take a deep breath - they are all simple enough, that's why there are so many. One step at a time. I ran through them successfully many times while testing; you can do it too! Let's go!
NB: if you lose track, you can always start from scratch - create a new CodeSpace and go step by step. Good luck!
- Setup environment
- Download dataset and start PostgreSQL server
- Ingest data using CLI (bash & python)
- Visualize data
- Orchestration data ingestion (Mage AI)
- Instructions to stop the apps
- Fork this repo on GitHub.
- Create GitHub CodeSpace from the repo.
- Start CodeSpace, wait for `requirements.txt` installation to finish (it starts automatically, just watch & wait).
- Copy `dev.env` to `.env` - run `cp dev.env .env` in your CodeSpace terminal. You don't need to change anything to test the process. It contains all key settings (PostgreSQL etc).
```bash
cp dev.env .env
```
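Curious how the scripts pick these settings up? Here is a minimal sketch using python-dotenv. The variable names below are my assumption - open `dev.env` to see the real keys; the default values match the connection details used later in this instruction.

```python
# Minimal sketch of reading connection settings from .env with python-dotenv.
# Variable names are assumptions - check dev.env for the actual keys.
import os
from dotenv import load_dotenv

load_dotenv(".env")  # loads key=value pairs into the process environment

pg_host = os.getenv("POSTGRES_HOST", "172.17.0.2")
pg_port = os.getenv("POSTGRES_PORT", "5432")
pg_user = os.getenv("POSTGRES_USER", "postg")
pg_db = os.getenv("POSTGRES_DB", "amzn_reviews")

print(f"Will connect to postgresql://{pg_user}@{pg_host}:{pg_port}/{pg_db}")
```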
- Run `bash download.sh` to download the 1st dataset (smallest) - from `dataset_urls0.csv`. As a result you will have 2 files in your `data` directory: `meta_Digital_Music.jsonl.gz` and `Digital_Music.jsonl.gz`. Cool! ✅
```bash
bash download.sh
```
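This is not the actual `download.sh` - just a sketch of the idea in Python, assuming each `dataset_urls*.csv` simply lists one download URL per line (open the file to confirm the real format):

```python
# Sketch of what the download step could look like (the real logic lives in download.sh).
# Assumes dataset_urls*.csv contains one URL per line - verify the actual format.
import csv
import os
import urllib.request

def download_from_csv(csv_path="data/dataset_urls0.csv", out_dir="data"):
    with open(csv_path) as f:
        for row in csv.reader(f):
            if not row:
                continue
            url = row[0].strip()
            target = os.path.join(out_dir, url.split("/")[-1])
            print(f"Downloading {url} -> {target}")
            urllib.request.urlretrieve(url, target)

# download_from_csv()  # would fetch the meta_* and reviews .jsonl.gz files into data/
```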
- Start PostgreSQL in Docker: `bash start_postgres.sh`. It will automatically download the postgres container and start it.
```bash
bash start_postgres.sh
```
- Wait until you see the message `database system is ready to accept connections` - PostgreSQL is ready to serve you.
Let it run in this terminal so you can see the logs and easily stop it later (with just Ctrl-C). CodeSpace will pop up a notification that `Your application running on port 5432 is available.` - just ignore it (close that pop-up).
- Start a 2nd terminal (click the `+` button on the panel in the right corner above the terminal) and switch to it.
- You can check the database connection by running `pgcli -h 172.17.0.2 -p 5432 -u postg -d amzn_reviews`, password `postg`. All dockerized apps of this setup are supposed to run in the default `bridge` network, `172.17.0.x`, with `172.17.0.2` for PostgreSQL. If this step fails you're in trouble. But when you run it in a CodeSpace and follow the instructions it should work fine - I tested it multiple times. Type `quit` to return to the terminal CLI. Ok, PostgreSQL is running. ✅
```bash
pgcli -h 172.17.0.2 -p 5432 -u postg -d amzn_reviews
```
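If you prefer checking the connection from Python instead of pgcli, here is an equivalent sketch using the same credentials (assuming `psycopg2` is available in the CodeSpace environment):

```python
# Quick PostgreSQL connection check with the same credentials as the pgcli command above.
import psycopg2

conn = psycopg2.connect(
    host="172.17.0.2",
    port=5432,
    user="postg",
    password="postg",
    dbname="amzn_reviews",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # prints the PostgreSQL version string if all is well
conn.close()
```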
Dataset files have been downloaded, PostgreSQL is running - time to ingest your data!
- Run `bash process.sh` to ingest the dataset files into the PostgreSQL database.
```bash
bash process.sh
```
It executes the python script `process_jsonl.py` with default parameters from your `.env` settings. By default (without a parameter) it processes the 1st dataset files. As a result of successful loading you will see some progress messages and finally `Finished ingesting data/... into the PostgreSQL database! Total time ... second(s) +++ PROCESSING finished: OK!`
Congratulations, the first approach to loading data is accomplished, and you now have records in 2 tables of your database: `meta` with products and `reviews` with ratings. ✅
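For the curious, this is roughly the idea behind the ingestion, sketched with pandas + SQLAlchemy. It is not the actual `process_jsonl.py` - the column selection and table handling there may differ, so treat the names below as assumptions.

```python
# Minimal sketch of the ingestion idea (NOT the actual process_jsonl.py):
# stream a .jsonl.gz file in chunks and append rows to a PostgreSQL table.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postg:postg@172.17.0.2:5432/amzn_reviews")

chunks = pd.read_json("data/Digital_Music.jsonl.gz", lines=True, chunksize=10_000)
for i, chunk in enumerate(chunks):
    # keep only the columns this project analyzes (assumed names - check the real script)
    chunk = chunk[["parent_asin", "rating", "timestamp", "verified_purchase"]]
    chunk.to_sql("reviews", engine, if_exists="append", index=False)
    print(f"chunk {i}: {len(chunk)} rows ingested")
```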
- Now you can ingest the next dataset with the same approach:
- download 2nd dataset: run `bash download.sh dataset_urls1.csv` (Health_and_Personal_Care files)
```bash
bash download.sh dataset_urls1.csv
```
- process 2nd dataset: run `bash process.sh dataset_urls1.csv`
```bash
bash process.sh dataset_urls1.csv
```
- download 3rd dataset: run `bash download.sh dataset_urls2.csv` (All_Beauty files)
```bash
bash download.sh dataset_urls2.csv
```
If you want to try data workflow orchestration, have some patience and move on to the Visualize data step. If you don't want to (or have no time to) play with Mage AI, just process `dataset_urls2.csv` for All_Beauty in the same way, and then download and process `dataset_urls3.csv` for Software.
Finally, probably the most interesting part: let's see our data using Metabase - an open source, free, self-hosted Business Intelligence, Dashboards and Data Visualization tool.
- Start it in Docker - run `bash start_metabase.sh`. It will automatically download the Metabase container and start it.
- CodeSpace will pop up a notification that `Your application running on port 3000 is available.` - click `Open in Browser`.
The new page will probably open white at first. Please wait a couple of seconds to let it start, then refresh the page. Now you will see the login screen. Just log in with `john@mailinator.com`, password: `Jj123456` (no worries, it's self-hosted, all safe). The pre-configured dashboard with reports is already there for you! ✅
💡 In case you accidentally close that pop-up or the Metabase page and you need it later (after ingesting new datasets), you can always open that page from the `Ports` tab:
- Explore the dashboard on the main screen. There are 2 tabs: `Products` and `Reviews`.
- Products: you can see 2 reports:
  - pie chart with the number of Products by Main category
  - distribution of the number of Products by Average rating
- Reviews: you can see 4 reports:
  - pie chart with the number of Reviews by Verified/not verified purchase
  - distribution of Review ratings by Verified/not verified purchase over time (by month)
  - pie chart with the number of Reviews by Main category
  - distribution of Review ratings by Main category over time (by month)
You can see some screenshots below.
- The more datasets you load, the more categories you can see. That's why I suggest downloading and processing at least 2 datasets. You can process the 3rd dataset by following the familiar steps in Ingest data using CLI (bash & python), or go ahead and discover Mage AI.
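If you want to sanity-check the numbers behind these reports outside Metabase, here is a minimal sketch of querying the verified vs. not verified split directly. The column names are my assumption - check the tables created by `process_jsonl.py`.

```python
# Sketch: verified vs. not verified review counts and average ratings straight from PostgreSQL.
# Column names (verified_purchase, rating) are assumptions - adjust to your reviews table.
import psycopg2

conn = psycopg2.connect(
    host="172.17.0.2", port=5432,
    user="postg", password="postg", dbname="amzn_reviews",
)
query = """
    SELECT verified_purchase,
           COUNT(*)                       AS reviews_count,
           ROUND(AVG(rating)::numeric, 2) AS avg_rating
    FROM reviews
    GROUP BY verified_purchase;
"""
with conn.cursor() as cur:
    cur.execute(query)
    for verified, count, avg_rating in cur.fetchall():
        print(f"verified={verified}: {count} reviews, avg rating {avg_rating}")
conn.close()
```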
I wouldn't say it was as simple and easy as Matt showed us in the videos, but with some time and effort I managed to convert the logic of my `process_jsonl.py` script into Mage pipelines and blocks. Why is it worth my/your time? Because larger datasets (like Kindle_Store) will probably demand more serious cloud storage, database and compute than the free CodeSpace playground, and I (you?) need to learn how to deal with them in a more scalable system that provides long job execution, partitioning, monitoring and logs. So let's look at the 2 pipelines I managed to set up with Mage.
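To give a flavour of what a Mage block looks like, here is a rough sketch of a data loader that reads the dataset file name from a runtime variable. This is not my exact pipeline code, and the variable name `dataset_csv` is an assumption - the real pipelines may name things differently.

```python
# Rough sketch of a Mage AI data loader block (not the exact pipeline code).
# Mage passes runtime variables via **kwargs; 'dataset_csv' is an assumed name.
import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_dataset_urls(*args, **kwargs):
    csv_name = kwargs.get('dataset_csv', 'dataset_urls0.csv')
    # assume one URL per line in the csv (verify against the actual files)
    urls = pd.read_csv(f'data/{csv_name}', header=None)[0].tolist()
    # downstream blocks would download these files and ingest them into PostgreSQL
    return urls
```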
💡 In case something goes wrong or gets complicated, you can still ingest more datasets with step 11 Ingest data using CLI (bash & python) and then reload the Metabase dashboard page.
- Run `unzip mage.zip` to extract the pre-configured Mage AI workflow.
```bash
unzip mage.zip
```
- Run `docker-compose build` to automatically download and prepare the Docker container with Mage AI.
```bash
docker-compose build
```
- Run `docker-compose up` to start the Docker container with Mage AI.
```bash
docker-compose up
```
- CodeSpace will pop up a notification that `Your application running on port 6789 is available.` - click `Open in Browser`.
That new page will probably open white at first. Please wait a couple of seconds to let Mage AI start, then refresh the page. Now you will see the Mage dashboard. Your orchestration center is ready to serve you! ✅
- Go to `Pipelines` - move the mouse to the left edge of the window to open the menu slider, then click on `Pipelines`.
You will see 2 pipelines I configured:
- Click `load_run_dataset`. It will open the current pipeline triggers. Then click the `Run@once` button.
- In the `Run pipeline now` dialogue you can define a variable - in this case it's the file name of the dataset.
Enter `dataset_urls2.csv` (which you downloaded but have not processed yet) and click the `Run now` button. The ingestion process starts. You can see the progress on the `Pipeline runs` page. It appears with a little delay, no worries.
- After ingesting the new dataset you can switch to the Metabase page, refresh it and see the new data in the dashboard reports. Congratulations! ✅
- If you get to this point, please ⭐️star⭐️ my repo!
Bonus for those who survived reading the instructions!
In addition to using PostgreSQL in Docker (locally on your computer or in a GitHub CodeSpace as described above), you can experiment with Supabase - an open source Firebase alternative.
It provides a quite generous free plan (not a trial, no credit card required): 500 MB database space, 2 Core shared CPU, 1GB RAM, 5 GB bandwidth, 1 GB file storage. Free projects are paused after 1 week of inactivity. Limit of 2 active projects.
I tested it, it works with my scripts, and with Metabase as well. You just need to:
- register a free account
- get access credentials (host, port, user, password, database)
- put the new values into the `.env` file
- create 2 tables in Supabase - `meta` and `reviews` - then define their structure (according to my python script; see the sketch after this list)
- and finally use my `process.sh` (or the Mage AI pipelines) to ingest data there - Metabase works with Supabase as with any PostgreSQL database, so create a dashboard for your new database.
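As a starting point for the 'create 2 tables' step, here is a hedged sketch of creating them from Python. The Supabase host/user/password are placeholders for your own credentials, and the columns are assumptions - align them with whatever `process_jsonl.py` actually inserts.

```python
# Sketch: create the meta and reviews tables on Supabase (or any PostgreSQL).
# Host/user/password are placeholders; the column sets are assumptions -
# match them to the fields process_jsonl.py actually ingests.
import psycopg2

conn = psycopg2.connect(
    host="db.YOUR-PROJECT-REF.supabase.co",  # from your Supabase project settings
    port=5432,
    user="postgres",
    password="YOUR-PASSWORD",
    dbname="postgres",
)
with conn, conn.cursor() as cur:  # 'with conn' commits on success
    cur.execute("""
        CREATE TABLE IF NOT EXISTS meta (
            parent_asin    TEXT,
            main_category  TEXT,
            average_rating NUMERIC
        );
        CREATE TABLE IF NOT EXISTS reviews (
            parent_asin       TEXT,
            rating            NUMERIC,
            "timestamp"       BIGINT,
            verified_purchase BOOLEAN
        );
    """)
conn.close()
```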
Let me know how it worked for you!
- The simple way - stop everything at once by stopping your CodeSpace. Remember, this will leave all downloaded data in your CodeSpace - you can start it later and continue playing with the tools and data. You can also delete the CodeSpace together with all the data.
- Stop all active Docker containers - run this command in the terminal:
```bash
docker stop $(docker ps -a -q)
```
- Stop Mage AI:
  - switch to the 2nd terminal and press `Ctrl-C`
  - then run `docker-compose down` in the terminal
- Stop PostgreSQL - switch to the 1st terminal and press `Ctrl-C`
- You can delete the downloaded dataset `.jsonl.gz` files from the `/data` folder.
- Refactor python code
- Export to .parquet
- Export to BigQuery
- [/] BigQuery: update Mage orchestration, visualization and instructions
- [/] Load more data to analyze: `Kindle Store` dataset
- Visualize using Apache Superset (as an alternative to Metabase)
- Visualize using Looker Studio
- Orchestrate with Prefect (as an alternative to Mage)
- Orchestrate with Apache Airflow (as an alternative to Mage)
Stay tuned!
Thank you for your attention and time!
- If you experience any issues while following this instruction (or something is left unclear), please add them to Issues - I'll be glad to help/fix. Your feedback, questions & suggestions are welcome as well!
- Feel free to fork and submit pull requests.
If you find this project helpful, please ⭐️star⭐️ my repo https://github.com/dmytrovoytko/data-engineering-amazon-reviews to help other people discover it!
Made with ❤️ in Ukraine 🇺🇦 Dmytro Voytko