Developed for the Principles of Distributed Systems class, Technical University of Crete.
This project implements a Map-Reduce framework on Kubernetes that computes word frequencies from input files stored in Minio. Its components include a UI service, an authentication service, Cassandra for metadata storage, and Kubernetes-managed workers that execute the Map-Reduce tasks.
To deploy the entire system on Kubernetes:
make deploy
To remove all deployed components from Kubernetes:
make clean
- Port Forwarding for UI Service:
  Forward the UI service port to access it locally:
  kubectl port-forward service/ui-service 8080:8080 -n dena
- Admin Operations:
  - Login as admin:
    python3 client.py login --username admin --password admin
  - Logout:
    python3 client.py logout
  - Create a new user as admin:
    python3 client.py admin create-user user2
- Job Submission:
  Submit a job to process a specific file (ensure the file is in the "map-reduce-input-files" bucket in Minio):
  python3 client.py jobs submit filename
- Job Status:
  Check the status of a submitted job using its job_id:
  python3 client.py jobs status job_id
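Because a job moves through several phases before it finishes, a client usually has to check the status more than once. The following is a minimal polling sketch, not part of the project's client. `wait_for_job` and its `fetch_status` argument are hypothetical names: `fetch_status` stands for any callable that returns the current status string for a job_id, for example a wrapper around the status command above. The sketch assumes the final status is the string "completed", as described in the execution flow; the exact output format of the status command is not specified here.

```python
import time


def wait_for_job(fetch_status, job_id, interval=2.0, timeout=300.0):
    """Poll fetch_status(job_id) until it returns "completed".

    fetch_status is a hypothetical callable supplied by the caller.
    Raises TimeoutError if the job has not completed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status == "completed":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout} s")
        time.sleep(interval)
```

Passing the status lookup in as a callable keeps the loop independent of how the status is actually obtained (CLI call or HTTP request).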
This directory includes useful scripts for development and testing:
- display_content.py: Displays contents of a file stored in Minio.
- generate_file.py: Generates input files with a specified number of words and stores them in Minio.
- test_cassandra.py: Tests connectivity and functionality of Cassandra.
- test_minio: Checks the contents of Minio buckets.
- UI Service: Flask API handling user commands.
- Auth Service: FastAPI managing user login and token assignment.
- Cassandra: Distributed data storage for job metadata and temporary data.
- Manager: Flask API coordinating Map-Reduce execution, managing workers, and metadata.
- Workers: Kubernetes jobs executing Map-Reduce tasks.
- Minio: Persistent storage for input files, output files, and Map-Reduce chunks.
- Job Initialization:
  - UI service forwards a job to the /initialize_job endpoint of the Manager service, storing job metadata in Cassandra with status initialized.
- Split Phase (Worker Job - Split):
  - Worker job retrieves the input file from Minio, splits it into chunks, and stores them in a Minio bucket (chunk_bucket). It notifies the Manager service upon completion.
- Map Phase (Worker Job - Map):
  - Manager creates mapper jobs based on the number of chunks.
  - Mapper jobs retrieve chunks from Minio, perform mapping, and store data in Cassandra (map_table). They notify the Manager upon completion.
- Shuffle-Sort Phase (Worker Job - Shuffle-Sort):
  - Manager aggregates and sorts mapped data into shuffle_sort_table in Cassandra.
  - Worker job retrieves data, performs shuffling and sorting, and stores results back in Cassandra. It notifies the Manager upon completion.
- Reduce Phase (Worker Job - Reduce):
  - Manager creates reducer jobs based on the configured number of reducers.
  - Reducer jobs retrieve data from Cassandra, perform reduction operations, and store results in reduce_table. They notify the Manager upon completion.
- Combine Phase (Worker Job - Combine):
  - Manager initiates a combine job to aggregate reduced results.
  - Worker job retrieves data from Cassandra, performs combining, and stores the output JSON file in the output Minio bucket. It notifies the Manager upon completion.
- Job Completion:
  - Manager updates job status to completed upon receiving notification from the combine job.
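
The phases above can be illustrated with a small in-memory sketch of the word-count data flow. This is not the project's worker code: it runs in a single process and replaces Minio, Cassandra, and the Kubernetes jobs with plain Python lists and dicts. All function names, the `chunk_size` and `num_reducers` parameters, the lowercasing of words, and the crc32-based assignment of words to reducers are assumptions made for illustration. Punctuation is not stripped.

```python
import json
from itertools import groupby
from zlib import crc32


def split(text, chunk_size):
    """Split phase: break the input into chunks of at most chunk_size words."""
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def map_chunk(chunk):
    """Map phase: emit one (word, 1) pair per word in a chunk."""
    return [(word.lower(), 1) for word in chunk]


def shuffle_sort(mapped):
    """Shuffle-sort phase: sort pairs by word and group the counts per word."""
    ordered = sorted(mapped, key=lambda pair: pair[0])
    return [(word, [count for _, count in group])
            for word, group in groupby(ordered, key=lambda pair: pair[0])]


def partition(grouped, num_reducers):
    """Assign each word to one reducer (assumption: stable crc32 hash)."""
    buckets = [[] for _ in range(num_reducers)]
    for word, counts in grouped:
        index = crc32(word.encode("utf-8")) % num_reducers
        buckets[index].append((word, counts))
    return buckets


def reduce_bucket(bucket):
    """Reduce phase: sum the counts of every word handled by one reducer."""
    return {word: sum(counts) for word, counts in bucket}


def combine(partials):
    """Combine phase: merge the reducer outputs into one JSON document."""
    merged = {}
    for partial in partials:
        merged.update(partial)  # each word belongs to exactly one reducer
    return json.dumps(merged, sort_keys=True)


def word_count(text, chunk_size=4, num_reducers=2):
    """Run all phases in order on an in-memory string."""
    mapped = [pair for chunk in split(text, chunk_size)
              for pair in map_chunk(chunk)]
    buckets = partition(shuffle_sort(mapped), num_reducers)
    return combine(reduce_bucket(bucket) for bucket in buckets)


if __name__ == "__main__":
    print(word_count("the cat and the dog and the bird"))
```

Since every word is assigned to exactly one reducer, the combine step only needs to merge the partial results without adding counts again.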