CENTROIDS parameter does nothing #54

Closed
Nikola4213 opened this issue Jun 15, 2024 · 4 comments

@Nikola4213

The KMeans_rcpp function does not behave correctly when given the CENTROIDS parameter. It ignores the supplied centroids matrix and simply falls back to the "kmeans++" initializer. I read the C++ code, and there should be a flag that is set to true when a centroid matrix is detected, instructing the program to skip all of the other initializers. For some reason, however, R matrices are not detected correctly. Am I doing something wrong, or is this behavior expected?

Example code and output:

library(ClusterR)

data <- matrix(c(
1.0, 1.0,
1.5, 2.0,
3.0, 4.0,
5.0, 7.0,
3.5, 5.0,
4.5, 5.0,
3.5, 4.5
), nrow = 7, ncol = 2, byrow = TRUE)

initial_centroids1 <- matrix(c(0.1, 5, 5, 0.1), nrow = 2, ncol = 2, byrow = TRUE)
initial_centroids2 <- matrix(c(2.5, 3, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

km_ic1 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids1, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
km_ic2 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids2, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
km_kmpp <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "kmeans++", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)
km_oi <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "optimal_init", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

OUTPUT:

km_ic1 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids1, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

iteration: 1 --> total WCSS: 26.5 --> squared norm: 1.90485
iteration: 2 --> total WCSS: 11.2257 --> squared norm: 1.07748
iteration: 3 --> total WCSS: 8.525 --> squared norm: 0

===================== end of initialization 1 =====================

km_ic2 <- KMeans_rcpp(data, clusters = 2, max_iters = 100, CENTROIDS = initial_centroids2, verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

iteration: 1 --> total WCSS: 26.5 --> squared norm: 1.90485
iteration: 2 --> total WCSS: 11.2257 --> squared norm: 1.07748
iteration: 3 --> total WCSS: 8.525 --> squared norm: 0

===================== end of initialization 1 =====================

km_kmpp <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "kmeans++", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

iteration: 1 --> total WCSS: 26.5 --> squared norm: 1.90485
iteration: 2 --> total WCSS: 11.2257 --> squared norm: 1.07748
iteration: 3 --> total WCSS: 8.525 --> squared norm: 0

===================== end of initialization 1 =====================

km_oi <- KMeans_rcpp(data, clusters = 2, max_iters = 100, initializer = "optimal_init", verbose = TRUE, fuzzy = FALSE, tol = 1e-05)

iteration: 1 --> total WCSS: 10 --> squared norm: 0.694622
iteration: 2 --> total WCSS: 8.525 --> squared norm: 0

===================== end of initialization 1 =====================

@mlampros
Owner

@Nikola4213

I'm sure you already know that once a user specifies CENTROIDS (or gives it as input), the initializers are skipped.
The fact that you arrive at the same squared norm values in the first 3 of your 4 cases might be related to your toy dataset, which consists of only 7 rows, and to the centroid values you specified.
Did you try one of the real datasets that are included in the ClusterR package?
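
One way to check, independently of ClusterR, whether two starting matrices could plausibly lead to the same trace is to compute the sum of squared distances from each point to its nearest starting centroid. The sketch below uses base R only. `initial_wcss` is a hypothetical helper, not part of the package, and its value need not match the "total WCSS" printed with `verbose = TRUE`, since the package may compute that figure at a different step.

```r
# Toy data and starting centroids copied from the report above
data <- matrix(c(
  1.0, 1.0,
  1.5, 2.0,
  3.0, 4.0,
  5.0, 7.0,
  3.5, 5.0,
  4.5, 5.0,
  3.5, 4.5
), nrow = 7, ncol = 2, byrow = TRUE)

initial_centroids1 <- matrix(c(0.1, 5, 5, 0.1), nrow = 2, ncol = 2, byrow = TRUE)
initial_centroids2 <- matrix(c(2.5, 3, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

# Hypothetical helper (not part of ClusterR): sum over all points of the
# squared Euclidean distance to the nearest row of `centroids`
initial_wcss <- function(x, centroids) {
  # d2[i, k] = squared distance from point i to centroid k
  d2 <- sapply(seq_len(nrow(centroids)), function(k) {
    rowSums(sweep(x, 2, centroids[k, ])^2)
  })
  sum(apply(d2, 1, min))
}

wcss1 <- initial_wcss(data, initial_centroids1)
wcss2 <- initial_wcss(data, initial_centroids2)
print(c(wcss1, wcss2))
```

The two starts give clearly different values on this data, so a run that honors CENTROIDS would normally be expected to show different first iterations for km_ic1 and km_ic2.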

@Nikola4213
Author

Nikola4213 commented Jun 16, 2024

I first noticed the behavior using real data. I was experimenting with using KMeans_arma first to get a centroids matrix, and then plugging it into KMeans_rcpp. That also produces an error: although KMeans_arma returns a matrix of centroids, it is for some reason not compatible with the requirements of the CENTROIDS parameter in KMeans_rcpp. The returned matrix has attr(,"class") [1] "k-means clustering", which might be the issue. Regardless, I can manually reconstruct a compatible centroids matrix. The fallback to kmeans++ when a CENTROIDS input is present does not change. Here is an example using the soybean dataset:

library(ClusterR)

data_CR <- soybean[,1:35]

centroids_arma <- KMeans_arma(data = data_CR, clusters = 3, n_iter = 100, seed_mode = "random_spread", seed = 1)
km_rcpp_with_centroids <- KMeans_rcpp(data = data_CR, clusters = 3, num_init = 1, max_iters = 100, CENTROIDS = centroids_arma, verbose = TRUE)
Error in KMeans_rcpp(data = data_CR, clusters = 3, num_init = 1, max_iters = 100, :
CENTROIDS should be a matrix with number of rows equal to the number of clusters and number of columns equal to the number of columns of the data
centroids_matrix <- matrix(centroids_arma, nrow = 3, ncol = 35)
km_rcpp_with_centroids <- KMeans_rcpp(data= data_CR, clusters = 3, num_init = 1, max_iters = 100, CENTROIDS = centroids_matrix, verbose = TRUE)

iteration: 1 --> total WCSS: 7716 --> squared norm: 5.84451
iteration: 2 --> total WCSS: 4131.14 --> squared norm: 1.65376
iteration: 3 --> total WCSS: 3856.45 --> squared norm: 1.02322
iteration: 4 --> total WCSS: 3787.8 --> squared norm: 0.259271
iteration: 5 --> total WCSS: 3782.46 --> squared norm: 0.140202
iteration: 6 --> total WCSS: 3781.19 --> squared norm: 0

===================== end of initialization 1 =====================

km_rcpp_with_centroids <- KMeans_rcpp(data= data_CR, clusters = 3, num_init = 1, max_iters = 100, initializer = "kmeans++", verbose = TRUE)

iteration: 1 --> total WCSS: 7716 --> squared norm: 5.84451
iteration: 2 --> total WCSS: 4131.14 --> squared norm: 1.65376
iteration: 3 --> total WCSS: 3856.45 --> squared norm: 1.02322
iteration: 4 --> total WCSS: 3787.8 --> squared norm: 0.259271
iteration: 5 --> total WCSS: 3782.46 --> squared norm: 0.140202
iteration: 6 --> total WCSS: 3781.19 --> squared norm: 0

===================== end of initialization 1 =====================
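
The error on the KMeans_arma output is consistent with a class-based check on CENTROIDS, although that is an assumption about the package's validation and is not confirmed in this thread. The base R sketch below uses a placeholder matrix with the same dimensions and the class attribute quoted above to show how such a check can fail on something that is structurally a matrix, and that `unclass()` gives the same plain matrix as the manual `matrix()` rebuild.

```r
# Stand-in for the object returned by KMeans_arma: a 3 x 35 numeric matrix
# carrying the class attribute quoted above (the values are placeholders)
centroids_arma <- structure(
  matrix(seq_len(3 * 35) / 10, nrow = 3, ncol = 35),
  class = "k-means clustering"
)

# is.matrix() looks only at the dim attribute
structural_check <- is.matrix(centroids_arma)

# inherits() consults the class attribute, which no longer mentions "matrix"
class_check <- inherits(centroids_arma, "matrix")

# Dropping the class attribute leaves an ordinary matrix
centroids_plain <- unclass(centroids_arma)
plain_class_check <- inherits(centroids_plain, "matrix")

# The manual rebuild used in the comment above yields the same values and shape
centroids_rebuilt <- matrix(centroids_arma, nrow = 3, ncol = 35)
same_result <- isTRUE(all.equal(centroids_plain, centroids_rebuilt))

print(c(structural_check, class_check, plain_class_check, same_result))
```

If the validation works this way, passing `unclass(centroids_arma)` would be a slightly shorter alternative to rebuilding the matrix, since it does not require restating the dimensions.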

mlampros added a commit that referenced this issue Jun 16, 2024
@mlampros
Owner

mlampros commented Jun 16, 2024

@Nikola4213

Thank you for making me aware of this issue.

This was actually an oversight on my side. I use R_NilValue as the default value for CENTROIDS in the .cpp files, and although I used it correctly in KMeans_arma and MiniBatchKmeans, I mistakenly passed R_NilValue rather than CENTROIDS as input to the KMeans_rcpp function. As a result, whenever CENTROIDS was supplied, the function received NULL and ran the default initializer.
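
The pattern behind this kind of bug can be sketched in plain R. All names here (`run_kmeans`, `wrapper_buggy`, `wrapper_fixed`) are hypothetical stand-ins, not the package's actual functions; the real code sits in the C++ layer and uses R_NilValue where this sketch uses NULL.

```r
# Hypothetical stand-in for the compiled routine: it only reports which
# starting strategy it would use
run_kmeans <- function(data, centroids = NULL, initializer = "kmeans++") {
  if (is.null(centroids)) {
    paste("initializer:", initializer)
  } else {
    "user-supplied centroids"
  }
}

# Buggy wrapper: accepts CENTROIDS but forwards the default NULL instead
wrapper_buggy <- function(data, CENTROIDS = NULL) {
  run_kmeans(data, centroids = NULL)
}

# Fixed wrapper: forwards the caller's argument
wrapper_fixed <- function(data, CENTROIDS = NULL) {
  run_kmeans(data, centroids = CENTROIDS)
}

x <- matrix(rnorm(20), ncol = 2)
start <- x[1:2, , drop = FALSE]

res_buggy <- wrapper_buggy(x, CENTROIDS = start)
res_fixed <- wrapper_fixed(x, CENTROIDS = start)
print(c(res_buggy, res_fixed))
```

The buggy wrapper raises no error and still returns a valid result, which is why the problem only shows up when the traces of different starts are compared.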

This commit fixes the issue, and the following code snippet serves as verification:

library(ClusterR)

data(soybean)
X = soybean[, -ncol(soybean)]
y = soybean[, ncol(soybean)]

clusters = length(unique(y))
# table(y)

dat = center_scale(X)

# computation of centroids
km = KMeans_rcpp(dat, clusters = clusters, num_init = 5, max_iters = 100, initializer = 'kmeans++', verbose = TRUE, seed = 1)
str(km)

# the output centroids
centroids = km$centroids
str(centroids)

# we use the computed centroids as input to reproduce the output
km_centr = KMeans_rcpp(dat, clusters = clusters, num_init = 5, max_iters = 100, CENTROIDS = centroids, verbose = TRUE)
str(km_centr)

# we receive identical clusters
identical(x = km$clusters, y = km_centr$clusters)
# TRUE

# Dimensions of the centroids
dim(km_centr$centroids)

# Dimensions of the data
dim(dat)

# The centroids have in the first dimension the number of clusters and in the second the number of columns of the data
# for instance we can sample the rows of the data to create sample centroids
set.seed(seed = 1)
samp_rows = sample(x = 1:nrow(dat), size = clusters, replace = FALSE)
# samp_rows

# the sample (rows) centroids
sample_CENTROIDS = dat[samp_rows, , drop = FALSE]
dim(sample_CENTROIDS)

# we make sure that we don't receive the same results as previously
km_centr_sample = KMeans_rcpp(dat, clusters = clusters, CENTROIDS = sample_CENTROIDS, verbose = TRUE)
str(km_centr_sample)

# we receive different clusters compared to the initial run with initializer = 'kmeans++'
identical(x = km$clusters, y = km_centr_sample$clusters)
# FALSE

You can install the latest version using:

remotes::install_github('mlampros/ClusterR', upgrade = 'always', dependencies = TRUE, repos = 'https://cloud.r-project.org/')
 

I'll submit the updated ClusterR package tomorrow morning to CRAN.

Feel free to close the issue if the code now works as expected.

@Nikola4213
Author

Thank you for the fix and for your time @mlampros.
