In this project, I am working on an ETL Movies Analysis bu utilizing Python, VS Code, RegEx and SQL Databases . Raw data exists in multiple places and forms. In order to perform any kind of data analysis, this data needs to be cleaned and structured. Data pipeline process ETL – Extract, Transform, and Load is a core concept in data engineering, ensuring that data is consistent, maintains its integrity, and nonetheless strives for automatization of data wrangling. Without a consistent and robust data structure, it’s nearly impossible to perform any meaningful analysis.
Extract - Transform - Load
The Amazing Prime, a video streaming company, decided to sponsor a Hackathon, where participants trying to predict which low budget movies being released will become popular. Participants of a Hackathon need a clean data in order to perform analyses for their algorithms. In order to provide organized and clean dataset, this project focuses on ETL Extract, Transform and Load process and includes the following:
1-Extracting data from two different sources.
A source from Wikipedia in Json
format for all movies released since 1990
A csv file from Kaggle website for rating data.
2-Transforming data with Python, Pandas and Python RegEx module.
3-Loading data using PostgreSQL and pgAdmin.
My goal is to create an automated pipeline that extracts, transform and loads data. In order to reach my goal, I have breakdown the analysis in four major steps as follow:
1- Function Test :
- WikiMovies.json and MetaData.csv captured from website.
- MetaData is transformed into Pandas data frames.
- JSON file :
- Loading file
- Transforming into data frame.
2- Function for cleaning WikiMovies data:
This function combines data of alternative languages into a new column alt_titles then help me to reduce number of columns in the dataset.
The functions contains following methods :
Python list comprehensions.
apply() and map()
methods in combination with lambda functions.
Regex
Regular expressions
3-Function for cleaning MetaData:
This function helps me to clean the data by applying following methods:
- Reestablish data-types by
pd.to_numeric, astype()
and comparison operators for Boolean types. - Find missing/unknown values and fill with correspond data and drop extra columns.
- Lastly merging data frames using
pd_merge
method.
4- Function -> Loading
- In this final step, the function connects to the database by Sqlalchemy library and
to_sql method.