Using Spark and Scala for data cleaning, pre-processing, and analytics on a movies dataset with a million records.
This project solves analytical questions on the semi-structured MovieLens dataset, which contains a million records, using Spark and Scala. It features the use of Spark RDDs, Spark SQL, and Spark DataFrames, executed in spark-shell (REPL) via the Scala API. The aim is to draw useful insights about users and movies by leveraging the different Spark APIs.
- Windows Subsystem for Linux (WSL2 / Ubuntu 20.04)
- Hadoop 2.7
- Spark 3.2.0
- Scala 2.12.15
- Simply clone the repository: `git clone https://github.com/fermat01/Data-analytics-using-spark-scala.git`
- In the repository, navigate to the Spark RDD, Spark SQL, or Spark DataFrame location as needed.
- Run the execute script to view the results: `sh execute.sh`
- The `execute.sh` script passes the Scala code through spark-shell and then displays the findings in the terminal from the results folder.
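For orientation, the kind of Scala code those scripts feed to spark-shell starts by loading and splitting the MovieLens .dat files on their double-colon delimiter. This is only a minimal sketch; the `data/` paths are assumptions and may differ from the repository layout:

```scala
// spark-shell provides `sc` (SparkContext) out of the box; adjust the paths
// to wherever the .dat files live in your clone -- they are assumptions here.
val moviesRaw  = sc.textFile("data/movies.dat")   // MovieID::Title::Genres
val ratingsRaw = sc.textFile("data/ratings.dat")  // UserID::MovieID::Rating::Timestamp

// Split on the double-colon delimiter used by the MovieLens 1M files.
val movies  = moviesRaw.map(_.split("::")).map(f => (f(0).toInt, f(1), f(2)))
val ratings = ratingsRaw.map(_.split("::")).map(f => (f(0).toInt, f(1).toInt, f(2).toDouble))

println(s"movies: ${movies.count()}, ratings: ${ratings.count()}")
```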
- What are the top 10 most viewed movies? (See the Spark RDD sketch after this list.)
- What are the distinct genres available?
- How many movies are there for each genre?
- How many movies start with a number or a letter (e.g. starting with 1/2/3.../A/B/C...Z)?
- List the latest released movies.
- Create tables for movies.dat, users.dat and ratings.dat: Saving Tables from Spark SQL
- Find the list of the oldest released movies.
- How many movies are released each year?
- How many movies are there for each rating?
- How many users have rated each movie?
- What is the total rating for each movie?
- What is the average rating for each movie? (See the Spark SQL sketch after this list.)
- Prepare Movies data: Extracting the Year and Genre from the Text
- Prepare Users data: Loading a double-delimited CSV file
- Prepare Ratings data: Programmatically specifying a schema for the DataFrame (see the schema sketch after this list)
- Import Data from URL: Scala
- Save table without defining DDL in Hive
- Broadcast Variable example
- Accumulator example (covered together with the broadcast variable in a sketch after this list)
- Databricks Community Edition
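As an illustration of the RDD questions above, the top-10 most viewed movies can be answered by counting ratings per movie and joining with the titles. This is only a minimal sketch, assuming the standard MovieLens 1M layout (UserID::MovieID::Rating::Timestamp and MovieID::Title::Genres) and hypothetical `data/` paths, not the exact code in the repository:

```scala
// Count how many users rated each movie (a proxy for "views"),
// join with movies.dat to recover the titles, and keep the top 10.
val views = sc.textFile("data/ratings.dat")
  .map(_.split("::"))
  .map(f => (f(1).toInt, 1))               // (movieId, 1)
  .reduceByKey(_ + _)                       // (movieId, viewCount)

val titles = sc.textFile("data/movies.dat")
  .map(_.split("::"))
  .map(f => (f(0).toInt, f(1)))             // (movieId, title)

val top10 = views.join(titles)              // (movieId, (viewCount, title))
  .map { case (_, (count, title)) => (count, title) }
  .sortByKey(ascending = false)
  .take(10)

top10.foreach { case (count, title) => println(s"$title\t$count") }
```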
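The Spark SQL questions follow the same pattern once the parsed files are registered as temporary views. A minimal sketch for the rating counts, totals, and averages per movie; the view name, column names, and path are assumptions:

```scala
import spark.implicits._

// Parse ratings.dat into a DataFrame and expose it to Spark SQL as a temp view.
val ratingsDF = sc.textFile("data/ratings.dat")
  .map(_.split("::"))
  .map(f => (f(0).toInt, f(1).toInt, f(2).toDouble))
  .toDF("userId", "movieId", "rating")

ratingsDF.createOrReplaceTempView("ratings")

// One grouped query covers the count, total, and average rating per movie.
val ratingStats = spark.sql(
  """SELECT movieId,
    |       COUNT(*)    AS num_ratings,
    |       SUM(rating) AS total_rating,
    |       AVG(rating) AS avg_rating
    |FROM ratings
    |GROUP BY movieId
    |ORDER BY avg_rating DESC""".stripMargin)

ratingStats.show(10)
```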
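For the "programmatically specifying a schema" item, a DataFrame can be built from an RDD of Rows plus an explicit StructType instead of relying on inference. A minimal sketch, with assumed column names and path:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Define the schema explicitly rather than using inference or case classes.
val ratingsSchema = StructType(Seq(
  StructField("userId",    IntegerType, nullable = false),
  StructField("movieId",   IntegerType, nullable = false),
  StructField("rating",    DoubleType,  nullable = false),
  StructField("timestamp", LongType,    nullable = false)
))

val rowRDD = sc.textFile("data/ratings.dat")
  .map(_.split("::"))
  .map(f => Row(f(0).toInt, f(1).toInt, f(2).toDouble, f(3).toLong))

val ratingsDF = spark.createDataFrame(rowRDD, ratingsSchema)
ratingsDF.printSchema()
ratingsDF.show(5)
```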
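The broadcast variable and accumulator items can be illustrated together: a small movieId-to-title map is broadcast to the executors while an accumulator counts malformed input lines. Again a sketch under assumed paths, not the repository's exact example:

```scala
// Broadcast a small lookup table (movieId -> title) so each executor gets one copy.
val titleMap = sc.textFile("data/movies.dat")
  .map(_.split("::"))
  .map(f => (f(0).toInt, f(1)))
  .collectAsMap()
val titleBc = sc.broadcast(titleMap)

// Accumulator that counts lines that fail to parse.
val badLines = sc.longAccumulator("malformed ratings lines")

val ratedTitles = sc.textFile("data/ratings.dat").flatMap { line =>
  val fields = line.split("::")
  if (fields.length == 4) Some(titleBc.value.getOrElse(fields(1).toInt, "unknown"))
  else { badLines.add(1); None }
}

println(s"sample: ${ratedTitles.take(5).mkString(", ")}")
println(s"malformed lines: ${badLines.value}")
```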
Note: the results were collected and repartitioned into a single text file. This is not recommended practice, since it significantly impacts performance, but it is done here for readability.
This repository is licensed under the Apache License 2.0; see the LICENSE file for more details.