For the latest updates as well as for a standalone application capable of reproducing this work please refer to the PhD-additions branch of this repository.
This repository contains the code used to produce the results of the manuscript: An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations.
To reproduce the results of the paper run nag_init_methods_generate.m which will produce the folder NAG_init_res. This folder will be populated with several files in the format: [modelName]_model_[setNo].mat (for each clustering algorithm a subfolder will be created). Each of these files contains a struct array (rows: initialisation method, columns: number of data sets - default: 40). For stochastic methods multiple values are available based on the repetition of the clustering solution (default: 50).
Struct fields:
- centers: initial clustering centroids.
- idx: cluster index of each data point.
- centroids: final clustering centroids.
- iterations: number of iterations until convergence (not for Hartigan-Wong).
- Silh2 & Silh: silhouette index of the clustering solution (Silh2 was used in the manuscript).
- wcd: within-cluster-distance.
- bcd: between-cluster-distance.
- clPur: purity index of the clustering solution.
nag_init_methods_timing.m runs the execution time analysis. It requires the folder NAG_init_res generated by nag_init_methods_generate.m.
The results of nag_init_methods_generate.m and nag_init_methods_timing.m where then used to generate the figures and tables of the manuscript. For the results of a previous version of the manuscript (arXiv: https://arxiv.org/abs/1908.09946) code for plotting is provided, refer to commit 9e9a00
The results were produced using MATLAB R2018b with the following toolboxes:
- Statistics and Machine Learning Toolbox
- Parallel Computing Toolbox (if this toolbox is not available execution is still possible).
- The NAG Toolbox for MATLAB to run Hartigan and Wong's K-Means.
Clustering basic benchmark:
Gap models:
Weighted Gap models (YanYe):
Brodinova dataset generator:
MATLAB code was based on the R implementation of the algorithm; package: wrsk
Real datasets
Hartigan-Wong K-Means:
MATLAB's and Python's default K-Means clustering is Lloyd's K-Means (initialized with the K-Means++ method) while R uses Hartigan-Wong' K-Means. Here we use NAG Toolbox for MATLAB Hartigan and Wong's K-Means implementation thus in order to use this algorithm the toolbox is required.
Random:
K-Means++:
Original algorithm is described in the study of Arthur, D., & Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding, p 1027–1035. In SODA'07: proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA..
MATLAB implementation was based on the instructions of the MSDN Magazine Blog: Test Run - K-Means++ Data Clustering
Maximin:
Stochastic version is based on the study of Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical computer science, 38, 293-306..
Deterministic version is based on the study of: Katsavounidis, I., Kuo, C. C. J., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal processing letters, 1(10), 144-146..
ROBIN:
Original algorithm (deterministic) is described in the study of Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. J. (2009). Robust partitional clustering by outlier and density insensitive seeding. Pattern Recognition Letters, 30(11), 994-1002..
MATLAB code was originally based on the R implementation of the algorithm (stochastic version); package: wrsk
Density K-Means++:
MATLAB code was based on the R implementation of the algorithm; code: dkmpp_0.1.0
Kaufman:
MATLAB implementation was based on the pseudocode of Pena, J. M., Lozano, J. A., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern recognition letters, 20(10), 1027-1040.