Effects of COVID-19 in Europe (An Azure Data Engineering project)

This project analyses COVID-19 datasets to study the overall effect of the pandemic in Europe in 2020.

Goal of the project

The goal is to visualize the death count, hospital admission count, and vaccination count per country in Europe. The project also attempts to find the relationship between confirmed cases and COVID-19 vaccinations.


Dataset

Data is ingested from two sources:

  1. The ECDC website, to extract confirmed cases, mortality, hospitalization/ICU cases, and testing numbers
  2. The Eurostat website, to extract population by age

We started by reading the data directly from the source websites. However, because of a new permission barrier on those sites, the course instructor created a GitHub repo for the datasets: https://github.com/cloudboxacademy/covid19.
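For reference, here is a minimal sketch of reading one of these files directly over HTTPS with pandas, which is what the ADF HTTP linked service and Copy activity do declaratively. The raw base URL path and file name below are assumptions about the repo layout, not confirmed paths.

```python
import pandas as pd

# Assumed raw-content base URL for the course repo; the actual folder layout may differ.
BASE_URL = "https://raw.githubusercontent.com/cloudboxacademy/covid19/main/ecdc_data"

# Hypothetical file name used only for illustration.
cases_deaths = pd.read_csv(f"{BASE_URL}/cases_deaths.csv")
print(cases_deaths.head())
```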

Solution Architecture

[Solution architecture diagram]

Environment Set-up

Azure Subscription
Data Factory
Blob Storage
Data Lake Storage Gen2
Azure SQL Database
Azure Databricks
HDInsight cluster

Ingestion

Ingesting the Population data

The population data is ingested as a .tsv.gz file and stored in Blob Storage.
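A quick sanity check of such a file is straightforward because pandas infers gzip compression from the .gz extension; the file name here is a placeholder, not the exact name used in the pipeline.

```python
import pandas as pd

# Read the gzipped TSV directly; compression is inferred from the extension.
# "population_by_age.tsv.gz" is an assumed file name for illustration.
pop = pd.read_csv("population_by_age.tsv.gz", sep="\t")
print(pop.shape)
print(pop.columns[:5])
```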


Solution flow for Population data

[Solution flow diagram for the Population data]

Steps:

Create a Linked Service to Azure Blob Storage
Create a Source dataset
Create a Linked Service to Azure Data Lake Storage Gen2
Create a Sink dataset
Create a Pipeline:
  • Execute the Copy activity when the file becomes available
  • Check the metadata counts before loading the data using the If Condition activity (see the sketch after this list)
  • Load the data into the destination
Schedule Trigger
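As a rough Python equivalent of the Get Metadata + If Condition check, the snippet below verifies that the population blob exists and is non-empty before a copy would run. The connection string, container, and blob names are placeholders, not the project's actual configuration.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder values: in the real pipeline these come from ADF's configuration.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="population", blob="population_by_age.tsv.gz")

# Mirror the If Condition: only proceed when the file is present and non-empty.
if blob.exists() and blob.get_blob_properties().size > 0:
    print("File is available - safe to copy it into the data lake.")
else:
    print("File missing or empty - skip the load.")
```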

Ingesting ECDC Data

The confirmed cases, mortality, hospitalization/ICU cases, and testing numbers data are read from the GitHub repo.

The files ingested are as follows:

Case & Deaths Data.csv
Hospital Admission Data.csv
testing.csv
country_response.csv


Solution flow for the ECDC Data

[Solution flow diagram for the ECDC data]

Steps:

Create a Linked Service using the HTTP connector
Create a Source dataset
Create a Linked Service to Azure Data Lake Storage Gen2
Create a Sink dataset
Create a Pipeline with parameters and variables
Lookup activity to read all the parameters from the JSON file, then pass them to a ForEach activity over the ECDC files, as shown below
Schedule Trigger

The JSON file:

[JSON parameter file listing the ECDC files to ingest]
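To make the Lookup → ForEach pattern concrete, here is a hedged Python sketch of the same loop: each entry of the parameter file describes one ECDC file to copy. The field names (sourceBaseURL, sourceRelativeURL, sinkFileName) are assumptions about the file's structure, not its confirmed schema.

```python
import json
import urllib.request

# Hypothetical local copy of the parameter file that the Lookup activity reads.
with open("ecdc_file_list.json") as f:
    file_list = json.load(f)

# The ForEach activity iterates over the entries and runs a Copy activity for each one.
for entry in file_list:
    url = entry["sourceBaseURL"] + entry["sourceRelativeURL"]
    urllib.request.urlretrieve(url, entry["sinkFileName"])
    print(f"Copied {entry['sinkFileName']}")
```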

Transformation

ADF Data Flows were used to transform the Cases and Deaths data as well as the Hospital Admissions data, while Databricks was used to transform the Population data.

The Data Flow transformations used are:

  • Select transformation
  • Lookup transformation
  • Filter transformation
  • Join transformation
  • Sort transformation
  • Conditional Split transformation
  • Derived Column transformation
  • Sink transformation

Cases and Deaths transformation

[Cases and Deaths data flow]

Steps:

Cases and Deaths source (Azure Data Lake Storage Gen2)
Filter Europe-only data
Select only the required columns
Pivot counts using the indicator values (confirmed cases, deaths) and take the sum of the daily counts (a PySpark sketch of this flow follows the list)
Lookup Country to get the country_code_2_digit and country_code_3_digit columns
Select only the required columns for the Sink
Create a Sink dataset (Azure Data Lake Storage Gen2)
Schedule Trigger
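A minimal PySpark sketch of the same flow, to make the Data Flow steps concrete. The file paths and column names (continent, country, reported_date, indicator, daily_count) are assumptions about the ECDC schema, not the project's exact dataset definitions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Source and lookup datasets (paths are placeholders).
cases = spark.read.csv("/mnt/covid/raw/ecdc/cases_deaths.csv", header=True, inferSchema=True)
country_lookup = spark.read.csv("/mnt/covid/lookup/country_lookup.csv", header=True, inferSchema=True)

# Filter Europe-only data.
europe = cases.filter(F.col("continent") == "Europe")

# Pivot the indicator column and sum the daily counts.
pivoted = (europe
           .groupBy("country", "reported_date")
           .pivot("indicator", ["confirmed cases", "deaths"])
           .sum("daily_count"))

# Lookup the 2- and 3-digit country codes, then keep only the required columns.
result = (pivoted
          .join(country_lookup, on="country", how="left")
          .select("country", "country_code_2_digit", "country_code_3_digit",
                  "reported_date", "confirmed cases", "deaths"))

# Sink to the processed layer of the data lake.
result.write.mode("overwrite").csv("/mnt/covid/processed/cases_deaths", header=True)
```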

Data flow pipeline for Cases and Deaths

[Cases and Deaths data flow pipeline]

Hospital Admissions transformation

[Hospital Admissions data flow]

Steps:

  1. Hospital Admissions source (Azure Data Lake Storage Gen2)
  2. Select only the required columns
  3. Lookup Country to get the country_code_2_digit and country_code_3_digit columns
  4. Select only the required columns
  5. Conditional Split into Weekly and Daily paths (a PySpark sketch of this split follows the list), using the conditions:
  • indicator=='Weekly new hospital admissions per 100k' || indicator=='Weekly new ICU admissions per 100k'
  • indicator=="Daily hospital occupancy" || indicator=="Daily ICU occupancy"
  6. For the Weekly path:
  • Join with Date to get the ecdc_Year_week, week_start_date, and week_End_date columns
  • Pivot counts on the indicator values and take the sum of the counts
  • Sort the data using reported_year_week ASC and Country DESC
  • Select only the required columns for the sink
  • Create a Sink dataset (Azure Data Lake Storage Gen2)
  • Schedule Trigger
  7. For the Daily path:
  • Pivot counts on the indicator values and take the sum of the counts
  • Sort the data using reported_year_week ASC and Country DESC
  • Select only the required columns for the sink
  • Create a Sink dataset (Azure Data Lake Storage Gen2)
  • Schedule Trigger
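A PySpark sketch of the Conditional Split and the Weekly path, reusing the indicator predicates listed above. The file paths and column names (value, reported_year_week, week_start_date, week_end_date) are assumptions about the hospital admissions schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
admissions = spark.read.csv("/mnt/covid/raw/ecdc/hospital_admissions.csv",
                            header=True, inferSchema=True)
dim_date = spark.read.csv("/mnt/covid/lookup/dim_date.csv", header=True, inferSchema=True)

# Conditional Split: route rows to the weekly or daily path based on the indicator.
weekly = admissions.filter(F.col("indicator").isin(
    "Weekly new hospital admissions per 100k", "Weekly new ICU admissions per 100k"))
daily = admissions.filter(F.col("indicator").isin(
    "Daily hospital occupancy", "Daily ICU occupancy"))

# Weekly path: join to the date dimension, pivot on the indicator, then sort.
weekly_out = (weekly
              .join(dim_date, on="reported_year_week", how="left")
              .groupBy("country", "reported_year_week", "week_start_date", "week_end_date")
              .pivot("indicator")
              .sum("value")
              .orderBy(F.col("reported_year_week").asc(), F.col("country").desc()))

weekly_out.write.mode("overwrite").csv("/mnt/covid/processed/hospital_admissions_weekly",
                                       header=True)
```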

Data flow pipeline for Hospital Admissions

[Hospital Admissions data flow pipeline]

Population data transformation

[Population data transformation in Databricks]
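As a rough sketch of what the Databricks notebook does: mount the data lake with a service principal, read the .tsv.gz file, and reshape it with DataFrame operations. The storage account, secret scope, paths, and the layout of the Eurostat key column are all assumptions; dbutils and spark are predefined inside a Databricks notebook.

```python
from pyspark.sql import functions as F

# Mount the raw container of the data lake using a service principal.
# Scope/key names, tenant id, and storage account name are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("covid-scope", "client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("covid-scope", "client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://raw@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/covid/raw",
    extra_configs=configs)

# Read the gzipped TSV and split the combined Eurostat key column into separate
# fields (the exact layout of that first column is an assumption).
pop = spark.read.csv("/mnt/covid/raw/population/population_by_age.tsv.gz",
                     sep="\t", header=True)
key_parts = F.split(pop[pop.columns[0]], ",")
pop = (pop
       .withColumn("age_group", key_parts.getItem(0))
       .withColumn("country_code", key_parts.getItem(1)))

# Write the reshaped population data to the processed layer.
pop.write.mode("overwrite").parquet("/mnt/covid/processed/population")
```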

Visualization

The link to the dashboard can be found here: https://app.powerbi.com/view?r=eyJrIjoiYjBhYWU0NTItMmVhOS00MGM5LTk1ZGEtMTQxZTdmZDUxMWUwIiwidCI6ImUwYjlhZTFlLWViMjYtNDZhOC1hZGYyLWQ3ZGJjZjIzNDBhOSJ9

Learning outcomes

  • Applied knowledge of ingesting files over HTTPS and from local storage
  • Understood the difference between a Data Warehouse and a Data Lake
  • Understood how to use transformations such as Select, Lookup, Filter, Join, Sort, Conditional Split, Derived Column, and Sink
  • Learnt how to write Python scripts to mount storage and perform transformations using DataFrames
  • Learnt how to create tables in Azure SQL Database and populate them from ADF

Resources

https://www.udemy.com/course/learn-azure-data-factory-from-scratch/learn/lecture/23983360#announcements
https://github.com/hbuddana/Azure_Data_Factory_COVID-19_Reporting?tab=readme-ov-file#azure-data-factory-covid-19-reporting
