Vagrant-PySpark is a Vagrant box that can be provisioned with any Spark version, ready to run Spark jobs (including PySpark) and PySpark unit tests.
It is intended to be used only for development and testing with small data sets.
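As an illustration, a job you could run inside the box might look like the following minimal sketch. It assumes a Spark 2.x setup (for 1.6.x you would use SQLContext instead of SparkSession); the file name and data are illustrative only.

```python
# my_job.py -- illustrative example, assuming a Spark 2.x box
from pyspark.sql import SparkSession

# A local master is enough for the small data sets this box targets.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("vagrant-pyspark-example")
         .getOrCreate())

# Tiny in-memory DataFrame instead of a real data source.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()

spark.stop()
```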
To start and provision the Vagrant box, you must create a file (ansible/variables.yml) with the required variables:
- Scala version
- Spark version
- Hadoop version
Versions must match the ones available here:
- For Scala: https://www.scala-lang.org/download/
- For Spark and Hadoop: http://spark.apache.org/downloads.html
The variables file should contain the following variables:
scala:
  version: 2.11.8
spark:
  version: 2.1.0
hadoop:
  version: 2.7
You can find example files for Spark 1.6.3 and 2.1.0 in the vars folder of this repo.
You can create a symbolic link to use them:
ln -s vars/vars_spark_2.1.0.yml ansible/variables.yml
If you use other versions, PRs with your version setup are welcome.
Set up the Vagrant box and clone your projects inside to run your jobs and tests.
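For the tests, a pytest-style suite could look like the following minimal sketch. It assumes pytest is available inside the box and a Spark 2.x setup; the file and function names are illustrative only.

```python
# test_my_job.py -- illustrative test layout, assuming pytest and Spark 2.x
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # One local session shared by the whole test run keeps the suite fast.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("pyspark-unit-tests")
               .getOrCreate())
    yield session
    session.stop()


def test_sum_by_key(spark):
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
    result = {row["key"]: row["sum(value)"]
              for row in df.groupBy("key").sum("value").collect()}
    assert result == {"a": 3, "b": 3}
```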
You can fork this repo and extend the Vagrantfile to sync your project folder into the Vagrant box. This makes all your changes immediately available to run in the Vagrant box.
# In the Vagrantfile, inside the Vagrant.configure block:
config.vm.synced_folder "/Project/path/in/host/machine", "/Destination/in/vagrant/box"
Alternatively, you can copy this project inside your Spark project to keep everything together.
You can find a good explanation and examples here