The number of clusters is not predefined; it is determined during execution. Processing and clustering run in a distributed environment using Spark for efficiency.
The dataset is available here: https://drive.google.com/drive/folders/10Ysiq2I1TQy_319uSWvzLIu_xzxSXQM5?usp=sharing
Different datasets can be used for experimentation.
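
To illustrate how a cluster count can emerge during execution instead of being fixed up front, here is a minimal sketch. It is not necessarily the method used in this project, which is not named above: it assumes a simple leader-style (distance threshold) clustering and runs in plain Python without Spark. The function name `leader_cluster`, the `radius` parameter, and the sample points are all hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def leader_cluster(points, radius):
    """Assign each point to the first centre within `radius`.

    If no existing centre is close enough, the point starts a new cluster,
    so the number of clusters is discovered while the data is processed.
    (Illustrative assumption, not the project's confirmed algorithm.)
    """
    centres = []  # cluster centres discovered so far
    labels = []   # cluster index assigned to each input point
    for p in points:
        for i, c in enumerate(centres):
            if dist(p, c) <= radius:
                labels.append(i)
                break
        else:
            # No centre within radius: open a new cluster at this point.
            centres.append(p)
            labels.append(len(centres) - 1)
    return centres, labels


# Made-up sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
centres, labels = leader_cluster(points, radius=1.0)
print(len(centres), labels)
```

In this sketch the `radius` threshold, not a preset count, controls how many clusters appear; a smaller radius generally yields more clusters. In a Spark setting, a routine like this could in principle be applied per partition and the resulting centres merged afterwards, but that step is an assumption and is not shown here.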