CENTROIDS parameter does nothing #54
Comments
I'm sure you already know that once a user specifies CENTROIDS (that is, gives a centroid matrix as input), the initializers are skipped.
I first noticed the behavior using real data. I was experimenting with running KMeans_arma first to get a centroids matrix and then plugging it into KMeans_rcpp. That also produces an error: although KMeans_arma returns a matrix of centroids, it is for some reason not compatible with the requirements of the CENTROIDS parameter in KMeans_rcpp. The returned matrix has attr(,"class") [1] "k-means clustering", which might be the issue. Regardless, I can manually reconstruct a compatible centroids matrix. Even then the behavior is unchanged: KMeans_rcpp still defaults to kmeans++ when a CENTROIDS input is present. Here is an example using the soybean dataset:
iteration: 1 --> total WCSS: 7716 --> squared norm: 5.84451
===================== end of initialization 1 =====================
iteration: 1 --> total WCSS: 7716 --> squared norm: 5.84451
===================== end of initialization 1 =====================
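One way to do the manual reconstruction mentioned above is to drop the class attribute so that a plain numeric matrix remains. The sketch below uses a stand-in object rather than a real KMeans_arma result: the values and the names arma_like and plain_centroids are invented for illustration, and whether removing the attribute alone is enough to satisfy CENTROIDS is an assumption, not something confirmed in this thread.

```r
# Stand-in for a centroids matrix carrying the class attribute reported above.
# The values are invented; a real matrix would come from KMeans_arma().
arma_like <- structure(
  matrix(c(1.0, 2.0,
           3.5, 4.5), nrow = 2, ncol = 2, byrow = TRUE),
  class = "k-means clustering"
)

# unclass() removes the class attribute but keeps the dimensions
plain_centroids <- unclass(arma_like)

inherits(arma_like, "k-means clustering")        # TRUE
inherits(plain_centroids, "k-means clustering")  # FALSE
is.matrix(plain_centroids) && is.numeric(plain_centroids)
```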
Thank you for making me aware of this issue. This was an oversight on my part. I use R_NilValue as the default value of the CENTROIDS parameter in the .cpp files. Although I used it correctly in KMeans_arma and MiniBatchKmeans, I mistakenly passed R_NilValue rather than CENTROIDS as input to the KMeans_rcpp function. As a result, whenever CENTROIDS was supplied, the function received NULL and ran the default initializer instead. This commit fixes the issue, and the following code snippet serves as verification:
library(ClusterR)
data(soybean)
X = soybean[, -ncol(soybean)]
y = soybean[, ncol(soybean)]
clusters = length(unique(y))
# table(y)
dat = center_scale(X)
# computation of centroids
km = KMeans_rcpp(dat, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', verbose = TRUE, seed = 1)
str(km)
# the output centroids
centroids = km$centroids
str(centroids)
# we use the computed centroids as input to reproduce the output
km_centr = KMeans_rcpp(dat, clusters = clusters, num_init = 5, max_iters = 100, CENTROIDS = centroids, verbose = TRUE)
str(km_centr)
# we receive identical clusters
identical(x = km$clusters, y = km_centr$clusters)
# TRUE
# Dimensions of the centroids
dim(km_centr$centroids)
# Dimensions of the data
dim(dat)
# The centroids have in the first dimension the number of clusters and in the second the number of columns of the data
# for instance we can sample the rows of the data to create sample centroids
set.seed(seed = 1)
samp_rows = sample(x = 1:nrow(dat), size = clusters, replace = FALSE)
# samp_rows
# the sample (rows) centroids
sample_CENTROIDS = dat[samp_rows, , drop = FALSE]
dim(sample_CENTROIDS)
# we make sure that we don't receive the same results as previously
km_centr_sample = KMeans_rcpp(dat, clusters = clusters, CENTROIDS = sample_CENTROIDS, verbose = TRUE)
str(km_centr_sample)
# we receive different clusters compared to the initial run with initializer = 'kmeans++'
identical(x = km$clusters, y = km_centr_sample$clusters)
# FALSE
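The dimension requirement stated in the comments above (one row per cluster, one column per data column) can be checked before calling KMeans_rcpp. The following is a minimal base-R sketch; check_centroids is a hypothetical helper that is not part of ClusterR, and the toy matrix is invented for illustration.

```r
# Hypothetical helper (not part of ClusterR): checks that a candidate
# centroid matrix has one row per cluster and one column per data column.
check_centroids <- function(centroids, data, clusters) {
  is.matrix(centroids) &&
    is.numeric(centroids) &&
    nrow(centroids) == clusters &&
    ncol(centroids) == ncol(data)
}

# Toy data: 7 observations, 2 columns
toy <- matrix(seq_len(14) / 2, nrow = 7, ncol = 2)

good <- toy[c(1, 7), , drop = FALSE]   # 2 x 2: two clusters, two columns
bad  <- t(toy[c(1, 4, 7), ])           # 2 x 3: wrong number of columns

check_centroids(good, toy, clusters = 2)  # TRUE
check_centroids(bad,  toy, clusters = 2)  # FALSE
```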
You can install the latest version using remotes::install_github('mlampros/ClusterR', upgrade = 'always', dependencies = TRUE, repos = 'https://cloud.r-project.org/')
I'll submit the updated ClusterR package tomorrow morning to CRAN. Feel free to close the issue if the code now works as expected.
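For readers who want a picture of the omission described in this comment, here is a toy R analogue. It is not the actual ClusterR source: inner_kmeans, wrapper_buggy and wrapper_fixed are hypothetical names, and the real code passes R_NilValue on the C++ side rather than NULL in R.

```r
# Toy stand-in for the compiled routine: NULL means "no centroids supplied"
inner_kmeans <- function(CENTROIDS = NULL) {
  if (is.null(CENTROIDS)) "initializer" else "user centroids"
}

# Wrapper with the omission: it always forwards NULL
wrapper_buggy <- function(CENTROIDS = NULL) inner_kmeans(CENTROIDS = NULL)

# Wrapper after the fix: it forwards what the caller supplied
wrapper_fixed <- function(CENTROIDS = NULL) inner_kmeans(CENTROIDS = CENTROIDS)

m <- matrix(0, nrow = 2, ncol = 2)
wrapper_buggy(CENTROIDS = m)  # "initializer"
wrapper_fixed(CENTROIDS = m)  # "user centroids"
```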
Thank you for the fix and for your time @mlampros.
The KMeans_rcpp function does not behave correctly when given the CENTROIDS parameter. It ignores the supplied centroids matrix and simply defaults to the "kmeans++" initializer. I read the C++ code, and there should be a flag that is set to true when a centroid matrix is detected, instructing the program to skip all the other initializers; however, for some reason R matrices are not detected correctly. Am I doing something wrong, or is this behavior expected?
Example code and output:
library(ClusterR)
data <- matrix(c(
1.0, 1.0,
1.5, 2.0,
3.0, 4.0,
5.0, 7.0,
3.5, 5.0,
4.5, 5.0,
3.5, 4.5
), nrow = 7, ncol = 2, byrow = TRUE)
initial_centroids1 <- matrix(c(0.1, 5, 5, 0.1), nrow = 2, ncol = 2, byrow = TRUE)
initial_centroids2 <- matrix(c(2.5, 3, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
km_ic1 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids1, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
km_ic2 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids2, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
km_kmpp <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "kmeans++", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
km_oi <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "optimal_init", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
OUTPUT:
iteration: 1 --> total WCSS: 26.5 --> squared norm: 1.90485
iteration: 2 --> total WCSS: 11.2257 --> squared norm: 1.07748
iteration: 3 --> total WCSS: 8.525 --> squared norm: 0
===================== end of initialization 1 =====================
iteration: 1 --> total WCSS: 26.5 --> squared norm: 1.90485
iteration: 2 --> total WCSS: 11.2257 --> squared norm: 1.07748
iteration: 3 --> total WCSS: 8.525 --> squared norm: 0
===================== end of initialization 1 =====================
iteration: 1 --> total WCSS: 26.5 --> squared norm: 1.90485
iteration: 2 --> total WCSS: 11.2257 --> squared norm: 1.07748
iteration: 3 --> total WCSS: 8.525 --> squared norm: 0
===================== end of initialization 1 =====================
iteration: 1 --> total WCSS: 10 --> squared norm: 0.694622
iteration: 2 --> total WCSS: 8.525 --> squared norm: 0
===================== end of initialization 1 =====================
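As a rough independent check that the first three traces should not coincide, the sketch below measures how far the observations lie from each starting matrix using base R only. start_wcss is a hypothetical helper, not a ClusterR function, and it is not claimed to reproduce the WCSS values ClusterR prints. It only shows that the two user-supplied starts describe different initial states.

```r
# The seven observations and the two starting matrices from the example above
data <- matrix(c(1.0, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 7.0,
                 3.5, 5.0, 4.5, 5.0, 3.5, 4.5),
               nrow = 7, ncol = 2, byrow = TRUE)
initial_centroids1 <- matrix(c(0.1, 5, 5, 0.1), nrow = 2, ncol = 2, byrow = TRUE)
initial_centroids2 <- matrix(c(2.5, 3, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

# Hypothetical helper: squared distance from every observation to its
# nearest starting centroid, summed over all observations.
start_wcss <- function(x, centroids) {
  d2 <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(x, 2, centroids[k, ])^2)
  })
  sum(apply(d2, 1, min))
}

w1 <- start_wcss(data, initial_centroids1)
w2 <- start_wcss(data, initial_centroids2)
c(w1, w2)
isTRUE(all.equal(w1, w2))  # FALSE: the two starts are not equivalent
```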