Using Spark and Scala for data cleaning, pre-processing, and analytics on a movies dataset with a million records.
This project solves analytical questions on the semi-structured MovieLens dataset, which contains a million records, using Spark and Scala. It features the use of Spark RDDs, Spark SQL, and Spark DataFrames, executed in spark-shell (REPL) via the Scala API. The aim is to draw useful insights about users and movies by leveraging the different Spark APIs.
- Windows Subsystem for Linux (WSL2 / Ubuntu 20.04)
- Hadoop 2.7
- Spark 3.2.0
- Scala 2.12.15
- Simply clone the repository: `git clone https://github.com/fermat01/Data-analytics-using-spark-scala.git`
- In the repository, navigate to the Spark RDD, Spark SQL, or Spark DataFrame location as needed.
- Run the execute script to view the results: `sh execute.sh`
- The `execute.sh` script passes the Scala code through spark-shell and then displays the findings in the terminal from the results folder.
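For orientation, the kind of Scala code those scripts feed to spark-shell starts by loading and splitting the MovieLens .dat files on their double-colon delimiter. This is only a minimal sketch; the `data/` paths are assumptions and may differ from the repository layout:

```scala
// spark-shell provides `sc` (SparkContext) out of the box; adjust the paths
// to wherever the .dat files live in your clone -- they are assumptions here.
val moviesRaw  = sc.textFile("data/movies.dat")   // MovieID::Title::Genres
val ratingsRaw = sc.textFile("data/ratings.dat")  // UserID::MovieID::Rating::Timestamp

// Split on the double-colon delimiter used by the MovieLens 1M files.
val movies  = moviesRaw.map(_.split("::")).map(f => (f(0).toInt, f(1), f(2)))
val ratings = ratingsRaw.map(_.split("::")).map(f => (f(0).toInt, f(1).toInt, f(2).toDouble))

println(s"movies: ${movies.count()}, ratings: ${ratings.count()}")
```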
- What are the top 10 most viewed movies? (See the Spark RDD sketch after this list.)
- What are the distinct genres available?
- How many movies are there for each genre?
- How many movies start with a number or a letter (e.g. starting with 1/2/3.../A/B/C...Z)?
- List the latest released movies.
- Create tables for movies.dat, users.dat and ratings.dat: Saving Tables from Spark SQL
- Find the list of the oldest released movies.
- How many movies are released each year?
- How many movies are there for each rating?
- How many users have rated each movie?
- What is the total rating for each movie?
- What is the average rating for each movie? (See the Spark SQL sketch after this list.)
- Prepare Movies data: Extracting the Year and Genre from the Text
- Prepare Users data: Loading a double-delimited CSV file
- Prepare Ratings data: Programmatically specifying a schema for the DataFrame (see the schema sketch after this list)
- Import Data from URL: Scala
- Save table without defining DDL in Hive
- Broadcast Variable example
- Accumulator example (covered together with the broadcast variable in a sketch after this list)
- Databricks Community Edition
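As an illustration of the RDD questions above, the top-10 most viewed movies can be answered by counting ratings per movie and joining with the titles. This is only a minimal sketch, assuming the standard MovieLens 1M layout (UserID::MovieID::Rating::Timestamp and MovieID::Title::Genres) and hypothetical `data/` paths, not the exact code in the repository:

```scala
// Count how many users rated each movie (a proxy for "views"),
// join with movies.dat to recover the titles, and keep the top 10.
val views = sc.textFile("data/ratings.dat")
  .map(_.split("::"))
  .map(f => (f(1).toInt, 1))               // (movieId, 1)
  .reduceByKey(_ + _)                       // (movieId, viewCount)

val titles = sc.textFile("data/movies.dat")
  .map(_.split("::"))
  .map(f => (f(0).toInt, f(1)))             // (movieId, title)

val top10 = views.join(titles)              // (movieId, (viewCount, title))
  .map { case (_, (count, title)) => (count, title) }
  .sortByKey(ascending = false)
  .take(10)

top10.foreach { case (count, title) => println(s"$title\t$count") }
```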
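The Spark SQL questions follow the same pattern once the parsed files are registered as temporary views. A minimal sketch for the rating counts, totals, and averages per movie; the view name, column names, and path are assumptions:

```scala
import spark.implicits._

// Parse ratings.dat into a DataFrame and expose it to Spark SQL as a temp view.
val ratingsDF = sc.textFile("data/ratings.dat")
  .map(_.split("::"))
  .map(f => (f(0).toInt, f(1).toInt, f(2).toDouble))
  .toDF("userId", "movieId", "rating")

ratingsDF.createOrReplaceTempView("ratings")

// One grouped query covers the count, total, and average rating per movie.
val ratingStats = spark.sql(
  """SELECT movieId,
    |       COUNT(*)    AS num_ratings,
    |       SUM(rating) AS total_rating,
    |       AVG(rating) AS avg_rating
    |FROM ratings
    |GROUP BY movieId
    |ORDER BY avg_rating DESC""".stripMargin)

ratingStats.show(10)
```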
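For the "programmatically specifying a schema" item, a DataFrame can be built from an RDD of Rows plus an explicit StructType instead of relying on inference. A minimal sketch, with assumed column names and path:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Define the schema explicitly rather than using inference or case classes.
val ratingsSchema = StructType(Seq(
  StructField("userId",    IntegerType, nullable = false),
  StructField("movieId",   IntegerType, nullable = false),
  StructField("rating",    DoubleType,  nullable = false),
  StructField("timestamp", LongType,    nullable = false)
))

val rowRDD = sc.textFile("data/ratings.dat")
  .map(_.split("::"))
  .map(f => Row(f(0).toInt, f(1).toInt, f(2).toDouble, f(3).toLong))

val ratingsDF = spark.createDataFrame(rowRDD, ratingsSchema)
ratingsDF.printSchema()
ratingsDF.show(5)
```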
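The broadcast variable and accumulator items can be illustrated together: a small movieId-to-title map is broadcast to the executors while an accumulator counts malformed input lines. Again a sketch under assumed paths, not the repository's exact example:

```scala
// Broadcast a small lookup table (movieId -> title) so each executor gets one copy.
val titleMap = sc.textFile("data/movies.dat")
  .map(_.split("::"))
  .map(f => (f(0).toInt, f(1)))
  .collectAsMap()
val titleBc = sc.broadcast(titleMap)

// Accumulator that counts lines that fail to parse.
val badLines = sc.longAccumulator("malformed ratings lines")

val ratedTitles = sc.textFile("data/ratings.dat").flatMap { line =>
  val fields = line.split("::")
  if (fields.length == 4) Some(titleBc.value.getOrElse(fields(1).toInt, "unknown"))
  else { badLines.add(1); None }
}

println(s"sample: ${ratedTitles.take(5).mkString(", ")}")
println(s"malformed lines: ${badLines.value}")
```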
Note: the results were collected and repartitioned into a single text file. This is not recommended practice, since it significantly impacts performance, but it is done here for readability.
This repository is licensed under the Apache License 2.0; see the LICENSE file for more details.