This repository contains a collection of Apache Spark scripts for learning the basics of batch processing of big data, and a collection of Apache Flink scripts for learning the basics of stream processing of big data.
The Apache Spark scripts cover a range of topics such as:
- manipulating RDDs via:
  - functional programming principles like pattern matching
  - regular expressions
  - functions like `map`, `flatMap`, `reduceByKey`, `flatten`, and `filter`
- manipulating DataFrames via:
  - Spark SQL
  - custom aggregation functions using `Window`
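The RDD operations listed above compose naturally into a classic word count. The following is a minimal sketch, not code taken from the scripts in this repository; the object name, input data, and local-mode `SparkSession` are all illustrative:

```scala
import org.apache.spark.sql.SparkSession

object RddWordCount {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; adjust master/app name as needed.
    val spark = SparkSession.builder()
      .appName("RddWordCount")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("to be or not to be", "to do is to be"))

    val counts = lines
      .flatMap(_.split("\\s+")) // split each line into words via a regex
      .filter(_.nonEmpty)       // drop empty tokens
      .map(word => (word, 1))   // pair each word with a count of 1
      .reduceByKey(_ + _)       // sum the counts per word

    counts.collect().foreach { case (word, n) => println(s"$word: $n") }
    spark.stop()
  }
}
```

The `case (word, n)` clause in the final `foreach` is an example of the pattern matching mentioned above: each tuple in the result is destructured directly into its two components.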
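Custom aggregations over DataFrames can be expressed either through the `Window` API or through an equivalent Spark SQL query. The sketch below shows both side by side; the column names and sample data are made up for illustration and do not come from this repository:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WindowDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(
      ("books", 10), ("books", 30), ("games", 20), ("games", 5)
    ).toDF("category", "amount")

    // Running total per category, ordered by amount, via the Window API.
    val byCategory = Window.partitionBy($"category").orderBy($"amount")
    sales.withColumn("running_total", sum($"amount").over(byCategory)).show()

    // The same aggregation expressed in Spark SQL on a temporary view.
    sales.createOrReplaceTempView("sales")
    spark.sql(
      """SELECT category, amount,
        |       SUM(amount) OVER (PARTITION BY category ORDER BY amount) AS running_total
        |FROM sales""".stripMargin).show()

    spark.stop()
  }
}
```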
The Apache Flink scripts cover a range of topics such as:
- basic manipulation of DataStreams via functions like `map`, `filter`, and `flatMap`
- working with stateful streams via `keyBy`
- dealing with infinite streams via:
  - different kinds of window assigners like `TumblingEventTimeWindows` or `SlidingEventTimeWindows`
  - keyed and non-keyed windows
  - the `ProcessWindowFunction`
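The DataStream transformations, `keyBy`, and window assigners listed above can be combined into a small streaming word count. This is a sketch rather than code from this repository; it uses Flink's Scala API with a processing-time tumbling window for brevity, since the event-time assigners mentioned above additionally require timestamp assignment and watermarks:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // A bounded example source; a real job would read from a socket, Kafka, etc.
    val lines: DataStream[String] = env.fromElements("spark flink", "flink")

    lines
      .flatMap(_.split("\\s+"))  // basic DataStream transformation
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .keyBy(_._1)               // partition the (now stateful) stream by word
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .sum(1)                    // per-key, per-window counts
      .print()

    env.execute("StreamingWordCount")
  }
}
```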
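Where the built-in aggregations are not enough, a `ProcessWindowFunction` gives access to all elements of a window plus window metadata. The class below is a hypothetical example, not taken from this repository, assuming a keyed stream of `(String, Int)` tuples keyed by the `String` component:

```scala
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Emits the per-key element count together with the window's end timestamp.
class CountPerWindow
    extends ProcessWindowFunction[(String, Int), String, String, TimeWindow] {
  override def process(key: String,
                       context: Context,
                       elements: Iterable[(String, Int)],
                       out: Collector[String]): Unit = {
    out.collect(s"$key: ${elements.size} events in window ending ${context.window.getEnd}")
  }
}
```

It would be attached to a keyed, windowed stream with `.process(new CountPerWindow)` in place of the simpler `.sum(...)` call.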
| Purpose | Name |
| --- | --- |
| Programming language | Scala |
| Cluster computing framework | Apache Spark, Apache Flink |
It is assumed that both a Java JDK and an IDE such as IntelliJ are installed and that the user's operating system is Windows.
- Install the Scala support plugin for your IDE.
- Import the corresponding subfolder of this repository as a Maven project and resolve all dependencies.
These Big Data scripts are published under the MIT licence, which can be found in the LICENSE file. For this repository, the terms laid out there shall not apply to any individual who is currently enrolled at a higher education institution as a student. Such individuals shall not interact with any part of this repository besides this README in any way, for example by cloning it or viewing its source code, nor shall they have someone else interact with this repository on their behalf.
The Apache Spark logo was taken from Wikipedia and the Apache Flink logo from .