Example of the whole process of ingestion, storage, transformation and visualization.
This Azure ELT repository, published by Vitor Franklin de Lacerda Nimschofsky, processes JSON files. The idea of the project is to assemble an ELT pipeline from raw data ingested via the Open Brewery DB API (https://www.openbrewerydb.org/documentation), so that in the end the data can be analyzed and visualized in Power BI.
SUMMARY:
● Through requests to the List Breweries endpoint - https://api.openbrewerydb.org/v1/breweries - I retrieved the data for all countries, totaling 8,206 records in JSON format.
● I stored the records in Azure Blob Storage, in the land container.
● Azure Key Vault was used to create secrets that hide the storage token and connection strings used in the notebooks, always aiming for compliance and security.
● I consumed these JSON files in Azure Databricks to structure the Data Lakehouse and perform the necessary transformations. I used Databricks Delta storage to build the medallion architecture, benefiting from the optimization and speed of DBFS. Still in Azure Databricks, I created a bronze table with the initial dataframe without transformations; in the silver layer I partitioned the initial table by country, creating a separate Delta table for each country. Finally, in the gold layer I created a final Delta table to be consumed in Power BI, where insights are shown on the number of breweries by brewery type and by location.
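For reference, a minimal Python sketch of a single request to the List Breweries endpoint (the page size and printed fields are assumptions based on the public API documentation, not part of the pipeline itself):

```python
import requests

# Request one page from the List Breweries endpoint (public Open Brewery DB API).
# page and per_page are documented query parameters; the page size used here is illustrative.
url = "https://api.openbrewerydb.org/v1/breweries"
response = requests.get(url, params={"page": 1, "per_page": 200}, timeout=30)
response.raise_for_status()

breweries = response.json()          # list of JSON objects, one per brewery
print(len(breweries), "records on this page")
print(breweries[0])                  # fields such as name, brewery_type, country
```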
ARCHITECTURE:
RESOURCE CREATION:
Resource Group -
Azure Data Factory -
Azure Key Vault -
Azure Databricks -
Azure Storage -
Cluster Databricks -
The entire architecture was built in the Azure cloud, following the strategy below:
● Azure Data Factory: used to ingest the data coming from the public API; the entire data flow is orchestrated with Azure Data Factory.
The pipeline JSON is in the source-code folder; it shows the full pipeline configuration in detail.
As the List Breweries endpoint - https://api.openbrewerydb.org/v1/breweries - returns paged results, it was necessary to use the Until activity to request all the brewery data: as long as the subsequent pages returned records, the requests continued, stopping only when a request came back empty. Records from all the countries below were requested this way:
Austria
England
France
Isle of Man
Ireland
Poland
Portugal
Scotland
South Korea
United States
In addition to the Copy Data activity, which makes the request against the source and copies the data to our blob (land container) in the sink, we used an initial Set Variable for the first page and a temporary Set Variable to update the subsequent pages. Once the loop (ingestion and storage) finishes, the pipeline runs the three notebooks, bronze, silver and gold, in this sequence. A Python sketch of this pagination logic is shown below.
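The Until loop is roughly equivalent to the following Python sketch (the page size and variable names are illustrative; the real logic lives in the pipeline JSON):

```python
import json
import requests

url = "https://api.openbrewerydb.org/v1/breweries"
page = 1                      # initial "set variable" (first page)
per_page = 200                # illustrative page size
all_records = []

while True:
    response = requests.get(url, params={"page": page, "per_page": per_page}, timeout=30)
    response.raise_for_status()
    records = response.json()
    if not records:           # an empty result ends the loop, like the Until condition
        break
    all_records.extend(records)
    page += 1                 # temporary "set variable" moves to the next page

# In ADF the Copy Data activity writes each page to the land container;
# here everything is simply kept in memory and dumped once.
with open("breweries.json", "w", encoding="utf-8") as f:
    json.dump(all_records, f)
```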
● Blob Storage: all JSON files are stored here. The land container is the destination of the entire pipeline data stream.
I created a unix_timestamp variable that dynamically generates the current date and time in Unix format; this variable is used to give each JSON file a unique numeric name. The date on which the pipeline was executed is also used as a subfolder.
The file path used by the Copy Data activity from ADF to the blob is shown below:
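Purely as an illustration of this naming scheme (the actual ADF expression is in the pipeline JSON and the screenshot; the container and folder names below are assumptions), an equivalent path could be built in Python like this:

```python
import time
from datetime import datetime, timezone

# unix_timestamp gives each file a unique numeric name;
# the execution date becomes a subfolder inside the land container.
unix_timestamp = int(time.time())
run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")

# Hypothetical layout: container and folder names are illustrative only.
blob_path = f"land/{run_date}/breweries_{unix_timestamp}.json"
print(blob_path)  # e.g. land/2024-01-15/breweries_1705312345.json
```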
● Azure Key Vault: when creating the mount point it was necessary to use a storage connection token; this token was stored in our blob-key secret.
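A minimal sketch of how such a mount can be created in a Databricks notebook, assuming a secret scope (here called keyvault-scope), a storage account name and a mount point that are not stated in the text:

```python
# dbutils is available by default inside Databricks notebooks.
# The scope name, storage account and mount point below are assumptions.
storage_account = "<storage-account-name>"
container = "land"
mount_point = "/mnt/land"

# Read the storage key from the Key Vault-backed secret scope.
blob_key = dbutils.secrets.get(scope="keyvault-scope", key="blob-key")

# Mount the land container only if it is not mounted yet.
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
        mount_point=mount_point,
        extra_configs={
            f"fs.azure.account.key.{storage_account}.blob.core.windows.net": blob_key
        },
    )
```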
● Azure Databricks: three notebooks were created:
Brewery_Bronze reads the blob files and creates a single dataframe that is written to a Delta table called bronze_brewery.
Brewery_Silver reads the bronze_brewery Delta table and splits the dataframe into separate dataframes partitioned by country, writing a Delta table for each country.
Brewery_Gold reads the bronze_brewery table and creates a table with the relevant gold columns; this table is consumed in Power BI.
The code for the three notebooks is shared in the source-code folder; a condensed sketch of the three steps is shown below.
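As a condensed illustration only (the real code is in the source-code folder; the mount path, gold table name and column selection here are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in Databricks notebooks

# Bronze: read the raw JSON files from the mounted land container, no transformations.
bronze_df = spark.read.json("/mnt/land/*/*.json")
bronze_df.write.format("delta").mode("overwrite").saveAsTable("bronze_brewery")

# Silver: split the bronze table by country, writing one Delta table per country.
bronze_df = spark.table("bronze_brewery")
for row in bronze_df.select("country").distinct().collect():
    country = row["country"]
    table_name = "silver_" + country.lower().replace(" ", "_")
    bronze_df.filter(bronze_df.country == country) \
        .write.format("delta").mode("overwrite").saveAsTable(table_name)

# Gold: keep only the columns relevant for the Power BI report (illustrative selection).
gold_df = spark.table("bronze_brewery").select("name", "brewery_type", "country", "city")
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold_brewery")
```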
All code was written in Python and PySpark.
Below is a screenshot of the tables created in Delta.
● Visualization of the data through a dashboard in Power BI:
To visualize and draw insights from this data, I set up a dashboard that aggregates the number of breweries by location and by type; below is an image of the analysis page.
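The aggregation shown in the dashboard corresponds roughly to the following PySpark query over the gold table (table and column names as assumed in the sketch above):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Number of breweries per country and per brewery type,
# mirroring what the Power BI visuals aggregate.
summary = (
    spark.table("gold_brewery")
    .groupBy("country", "brewery_type")
    .agg(F.count("*").alias("brewery_count"))
    .orderBy(F.desc("brewery_count"))
)
summary.show()
```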
Filtered for all scenarios:
Filtered for South Korea:
Filtered by Brewery Type equal to large:
PROJECT COSTS:
Below is a picture showing the costs for this ELT, built entirely in Azure.