Movies Extract - Transform - Load

Overview of Project

In this project, I am working on an ETL Movies Analysis bu utilizing Python, VS Code, RegEx and SQL Databases . Raw data exists in multiple places and forms. In order to perform any kind of data analysis, this data needs to be cleaned and structured. Data pipeline process ETL – Extract, Transform, and Load is a core concept in data engineering, ensuring that data is consistent, maintains its integrity, and nonetheless strives for automatization of data wrangling. Without a consistent and robust data structure, it’s nearly impossible to perform any meaningful analysis.

Extract - Transform - Load

Purpose of Movies ETL :

The Amazing Prime, a video streaming company, decided to sponsor a Hackathon, where participants trying to predict which low budget movies being released will become popular. Participants of a Hackathon need a clean data in order to perform analyses for their algorithms. In order to provide organized and clean dataset, this project focuses on ETL Extract, Transform and Load process and includes the following:

1-Extracting data from two different sources. A source from Wikipedia in Json format for all movies released since 1990 A csv file from Kaggle website for rating data.

2-Transforming data with Python, Pandas and Python RegEx module.

3-Loading data using PostgreSQL and pgAdmin.

Analysis

My goal is to create an automated pipeline that extracts, transform and loads data. In order to reach my goal, I have breakdown the analysis in four major steps as follow:

1- Function Test :

WikiMovies.json and MetaData.csv captured from website.
MetaData is transformed into Pandas data frames.
JSON file :
- Loading file
- Transforming into data frame.

2- Function for cleaning WikiMovies data:

This function combines data of alternative languages into a new column alt_titles then help me to reduce number of columns in the dataset.

The functions contains following methods : Python list comprehensions. apply() and map() methods in combination with lambda functions. Regex Regular expressions

3-Function for cleaning MetaData:

This function helps me to clean the data by applying following methods:

Reestablish data-types by pd.to_numeric, astype() and comparison operators for Boolean types.
Find missing/unknown values and fill with correspond data and drop extra columns.
Lastly merging data frames using pd_merge method.

4- Function -> Loading

In this final step, the function connects to the database by Sqlalchemy library and to_sql method.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
Resources		Resources
.DS_Store		.DS_Store
.gitignore		.gitignore
ETL_clean_kaggle_data.ipynb		ETL_clean_kaggle_data.ipynb
ETL_clean_wiki_movies.ipynb		ETL_clean_wiki_movies.ipynb
ETL_create_database.ipynb		ETL_create_database.ipynb
ETL_function_test.ipynb		ETL_function_test.ipynb
Movielens_Extract.ipynb		Movielens_Extract.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Movies Extract - Transform - Load

Overview of Project

Purpose of Movies ETL :

Analysis

About

Releases

Packages

Languages

aktugchelekche/Movies-ETL

Folders and files

Latest commit

History

Repository files navigation

Movies Extract - Transform - Load

Overview of Project

Purpose of Movies ETL :

Analysis

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages