This is a solution notebook for an assignment question given in a graduate Data Mining course. Each code block is accompanied by relevant analysis where required.
Dataset link: https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq
Broadly, the following steps have been performed in this solution notebook:
- Performed minimal preprocessing on the dataset.
- Explained why agglomerative clustering is more widely used than divisive clustering.
- Visualized the given class labels using t-SNE.
- Ran agglomerative clustering with the following linkages: single, complete, group average, and minimum variance (Ward).
- Compared the clustering performance both visually and empirically on the dataset.
- Reported the best results on various cluster validity indices.
The assumptions and workflow above follow the questions asked in the assignment.
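
As a rough illustration of the linkage comparison listed above, the following is a minimal sketch and not the notebook's actual code. It assumes scikit-learn is available and uses synthetic data from `make_blobs` as a stand-in for the gene expression matrix; the sample count, feature count, number of clusters, and the choice of adjusted Rand index and silhouette score as validity indices are all illustrative assumptions.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the real dataset (assumption: sizes are arbitrary).
X, y_true = make_blobs(n_samples=300, centers=5, n_features=20, random_state=0)

# scikit-learn names: "average" is group average, "ward" is minimum variance.
linkages = ["single", "complete", "average", "ward"]

results = {}
for linkage in linkages:
    model = AgglomerativeClustering(n_clusters=5, linkage=linkage)
    labels = model.fit_predict(X)
    results[linkage] = {
        # External index: agreement with the known class labels.
        "ari": adjusted_rand_score(y_true, labels),
        # Internal index: cohesion vs. separation, no labels needed.
        "silhouette": silhouette_score(X, labels),
    }

for linkage, scores in results.items():
    print(f"{linkage:>8}: ARI={scores['ari']:.3f}, "
          f"silhouette={scores['silhouette']:.3f}")
```

Pairing one external index (which needs the given labels) with one internal index (which does not) is one way to compare linkages both against the ground truth and on cluster geometry alone.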