Effects of COVID-19 in Europe (An Azure Data Engineering project)

This project analyses COVID-19 datasets to study the overall effect of the pandemic in Europe in 2020.

Goal of the project

The goal is to visualize the death count, hospital admission count, and vaccination count per country in Europe. The project also attempts to find the relationship between confirmed cases and COVID-19 vaccinations.


Dataset

Data is ingested from two sources:

  1. The ECDC website, to extract confirmed cases, mortality, hospitalization/ICU cases, and testing numbers
  2. The Eurostat website, to extract population by age

We started by reading the data directly from the source websites. However, because of a new permission barrier on those sites, the course instructor created a GitHub repo for the datasets: https://github.com/cloudboxacademy/covid19.
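For reference, here is a minimal sketch of reading one of these files directly over HTTPS with pandas, which is what the ADF HTTP linked service and Copy activity do declaratively. The raw base URL path and file name below are assumptions about the repo layout, not confirmed paths.

```python
import pandas as pd

# Assumed raw-content base URL for the course repo; the actual folder layout may differ.
BASE_URL = "https://raw.githubusercontent.com/cloudboxacademy/covid19/main/ecdc_data"

# Hypothetical file name used only for illustration.
cases_deaths = pd.read_csv(f"{BASE_URL}/cases_deaths.csv")
print(cases_deaths.head())
```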

Solution Architecture

[Solution architecture diagram]

Environment Set-up

Azure Subscription
Data Factory
Blob Storage
Data Lake Storage Gen2
Azure SQL Database
Azure Databricks
HDInsight cluster

Ingestion

Ingesting the Population data

The population data is ingested as a .tsv.gz file and stored in Blob Storage.
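A quick sanity check of such a file is straightforward because pandas infers gzip compression from the .gz extension; the file name here is a placeholder, not the exact name used in the pipeline.

```python
import pandas as pd

# Read the gzipped TSV directly; compression is inferred from the extension.
# "population_by_age.tsv.gz" is an assumed file name for illustration.
pop = pd.read_csv("population_by_age.tsv.gz", sep="\t")
print(pop.shape)
print(pop.columns[:5])
```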


Solution flow for Population data

[Solution flow diagram for the Population data]

Steps:

Create a Linked Service to Azure Blob Storage
Create a Source dataset
Create a Linked Service to Azure Data Lake Storage Gen2
Create a Sink dataset
Create a Pipeline:
  • Execute the Copy activity when the file becomes available
  • Check the metadata counts before loading the data using the If Condition activity (see the sketch after this list)
  • Load the data into the destination
Schedule Trigger
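As a rough Python equivalent of the Get Metadata + If Condition check, the snippet below verifies that the population blob exists and is non-empty before a copy would run. The connection string, container, and blob names are placeholders, not the project's actual configuration.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder values: in the real pipeline these come from ADF's configuration.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="population", blob="population_by_age.tsv.gz")

# Mirror the If Condition: only proceed when the file is present and non-empty.
if blob.exists() and blob.get_blob_properties().size > 0:
    print("File is available - safe to copy it into the data lake.")
else:
    print("File missing or empty - skip the load.")
```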

Ingesting ECDC Data

The confirmed cases, mortality, hospitalization/ICU cases, and testing numbers data are read from the GitHub repo.

The files ingested are as follows:

Case & Deaths Data.csv
Hospital Admission Data.csv
testing.csv
country_response.csv


Solution flow for the ECDC Data

[Solution flow diagram for the ECDC data]

Steps:

Create a Linked Service using the HTTP connector
Create a Source dataset
Create a Linked Service to Azure Data Lake Storage Gen2
Create a Sink dataset
Create a Pipeline with parameters and variables
Lookup activity to read all the parameters from the JSON file, then pass them to a ForEach activity over the ECDC files, as shown below
Schedule Trigger

The JSON file:

[JSON parameter file listing the ECDC files to ingest]
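To make the Lookup → ForEach pattern concrete, here is a hedged Python sketch of the same loop: each entry of the parameter file describes one ECDC file to copy. The field names (sourceBaseURL, sourceRelativeURL, sinkFileName) are assumptions about the file's structure, not its confirmed schema.

```python
import json
import urllib.request

# Hypothetical local copy of the parameter file that the Lookup activity reads.
with open("ecdc_file_list.json") as f:
    file_list = json.load(f)

# The ForEach activity iterates over the entries and runs a Copy activity for each one.
for entry in file_list:
    url = entry["sourceBaseURL"] + entry["sourceRelativeURL"]
    urllib.request.urlretrieve(url, entry["sinkFileName"])
    print(f"Copied {entry['sinkFileName']}")
```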

Transformation

ADF Data Flows were used to transform the Cases and Deaths data as well as the Hospital Admissions data, while Databricks was used to transform the Population data.

The Data Flow transformations used are:

  • Select transformation
  • Lookup transformation
  • Filter transformation
  • Join transformation
  • Sort transformation
  • Conditional Split transformation
  • Derived Column transformation
  • Sink transformation

Cases and Deaths transformation

[Cases and Deaths data flow]

Steps:

Cases and Deaths source (Azure Data Lake Storage Gen2)
Filter Europe-only data
Select only the required columns
Pivot counts using the indicator values (confirmed cases, deaths) and take the sum of the daily counts (a PySpark sketch of this flow follows the list)
Lookup Country to get the country_code_2_digit and country_code_3_digit columns
Select only the required columns for the Sink
Create a Sink dataset (Azure Data Lake Storage Gen2)
Schedule Trigger
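A minimal PySpark sketch of the same flow, to make the Data Flow steps concrete. The file paths and column names (continent, country, reported_date, indicator, daily_count) are assumptions about the ECDC schema, not the project's exact dataset definitions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Source and lookup datasets (paths are placeholders).
cases = spark.read.csv("/mnt/covid/raw/ecdc/cases_deaths.csv", header=True, inferSchema=True)
country_lookup = spark.read.csv("/mnt/covid/lookup/country_lookup.csv", header=True, inferSchema=True)

# Filter Europe-only data.
europe = cases.filter(F.col("continent") == "Europe")

# Pivot the indicator column and sum the daily counts.
pivoted = (europe
           .groupBy("country", "reported_date")
           .pivot("indicator", ["confirmed cases", "deaths"])
           .sum("daily_count"))

# Lookup the 2- and 3-digit country codes, then keep only the required columns.
result = (pivoted
          .join(country_lookup, on="country", how="left")
          .select("country", "country_code_2_digit", "country_code_3_digit",
                  "reported_date", "confirmed cases", "deaths"))

# Sink to the processed layer of the data lake.
result.write.mode("overwrite").csv("/mnt/covid/processed/cases_deaths", header=True)
```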

Data flow pipeline for Cases and Deaths

[Cases and Deaths data flow pipeline]

Hospital Admissions transformation

[Hospital Admissions data flow]

Steps:

  1. Hospital Admissions source (Azure Data Lake Storage Gen2)
  2. Select only the required columns
  3. Lookup Country to get the country_code_2_digit and country_code_3_digit columns
  4. Select only the required columns
  5. Conditional Split into Weekly and Daily paths (a PySpark sketch of this split follows the list), using the conditions:
  • indicator=='Weekly new hospital admissions per 100k' || indicator=='Weekly new ICU admissions per 100k'
  • indicator=="Daily hospital occupancy" || indicator=="Daily ICU occupancy"
  6. For the Weekly path:
  • Join with Date to get the ecdc_Year_week, week_start_date, and week_End_date columns
  • Pivot counts on the indicator values and take the sum of the counts
  • Sort the data using reported_year_week ASC and Country DESC
  • Select only the required columns for the sink
  • Create a Sink dataset (Azure Data Lake Storage Gen2)
  • Schedule Trigger
  7. For the Daily path:
  • Pivot counts on the indicator values and take the sum of the counts
  • Sort the data using reported_year_week ASC and Country DESC
  • Select only the required columns for the sink
  • Create a Sink dataset (Azure Data Lake Storage Gen2)
  • Schedule Trigger
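A PySpark sketch of the Conditional Split and the Weekly path, reusing the indicator predicates listed above. The file paths and column names (value, reported_year_week, week_start_date, week_end_date) are assumptions about the hospital admissions schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
admissions = spark.read.csv("/mnt/covid/raw/ecdc/hospital_admissions.csv",
                            header=True, inferSchema=True)
dim_date = spark.read.csv("/mnt/covid/lookup/dim_date.csv", header=True, inferSchema=True)

# Conditional Split: route rows to the weekly or daily path based on the indicator.
weekly = admissions.filter(F.col("indicator").isin(
    "Weekly new hospital admissions per 100k", "Weekly new ICU admissions per 100k"))
daily = admissions.filter(F.col("indicator").isin(
    "Daily hospital occupancy", "Daily ICU occupancy"))

# Weekly path: join to the date dimension, pivot on the indicator, then sort.
weekly_out = (weekly
              .join(dim_date, on="reported_year_week", how="left")
              .groupBy("country", "reported_year_week", "week_start_date", "week_end_date")
              .pivot("indicator")
              .sum("value")
              .orderBy(F.col("reported_year_week").asc(), F.col("country").desc()))

weekly_out.write.mode("overwrite").csv("/mnt/covid/processed/hospital_admissions_weekly",
                                       header=True)
```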

Data flow pipeline for Hospital Admissions

[Hospital Admissions data flow pipeline]

Population data transformation

[Population data transformation in Databricks]
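As a rough sketch of what the Databricks notebook does: mount the data lake with a service principal, read the .tsv.gz file, and reshape it with DataFrame operations. The storage account, secret scope, paths, and the layout of the Eurostat key column are all assumptions; dbutils and spark are predefined inside a Databricks notebook.

```python
from pyspark.sql import functions as F

# Mount the raw container of the data lake using a service principal.
# Scope/key names, tenant id, and storage account name are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("covid-scope", "client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("covid-scope", "client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://raw@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/covid/raw",
    extra_configs=configs)

# Read the gzipped TSV and split the combined Eurostat key column into separate
# fields (the exact layout of that first column is an assumption).
pop = spark.read.csv("/mnt/covid/raw/population/population_by_age.tsv.gz",
                     sep="\t", header=True)
key_parts = F.split(pop[pop.columns[0]], ",")
pop = (pop
       .withColumn("age_group", key_parts.getItem(0))
       .withColumn("country_code", key_parts.getItem(1)))

# Write the reshaped population data to the processed layer.
pop.write.mode("overwrite").parquet("/mnt/covid/processed/population")
```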

Visualization

The link to the dashboard can be found here: https://app.powerbi.com/view?r=eyJrIjoiYjBhYWU0NTItMmVhOS00MGM5LTk1ZGEtMTQxZTdmZDUxMWUwIiwidCI6ImUwYjlhZTFlLWViMjYtNDZhOC1hZGYyLWQ3ZGJjZjIzNDBhOSJ9

Learning outcomes

  • Applied knowledge of ingesting files over HTTPS and from local storage
  • Understood the difference between a Data Warehouse and a Data Lake
  • Understood how to use transformations such as Select, Lookup, Filter, Join, Sort, Conditional Split, Derived Column, and Sink
  • Learnt how to write Python scripts to mount storage and perform transformations using DataFrames
  • Learnt how to create tables in Azure SQL Database and populate them from ADF

Resources

https://www.udemy.com/course/learn-azure-data-factory-from-scratch/learn/lecture/23983360#announcements
https://github.com/hbuddana/Azure_Data_Factory_COVID-19_Reporting?tab=readme-ov-file#azure-data-factory-covid-19-reporting
