When a single node goes down, removing a pool causes the MinIO cluster to become unavailable #2357

Closed
Clara12062 opened this issue Nov 18, 2024 · 2 comments

Comments


Clara12062 commented Nov 18, 2024

Expected Behavior

In a multi-node deployment, removing a pool should still work normally even if a single (unrelated) node is down, and the removal should not affect the MinIO service. After the failed node is restored, it should have no impact on the running service.

Current Behavior

Three nodes are currently deployed with two pools: pool-1 (1 server, 2 volumes) and pool-3 (3 servers, 6 volumes). The pod on the pool-3-0 node is shut down, pool-1 is decommissioned and then removed, and the pool-3-0 pod is started again.
Afterwards, the MinIO service is unavailable and cannot be restored.

Possible Solution

Steps to Reproduce (for bugs)

  1. The initial deployment consisted of pool-1 (1 server, 2 volumes).
  2. Subsequent horizontal scaling was performed by adding pool-3 (3 servers, 6 volumes).
  3. The decommissioning process was then started by running: mc admin decom start ALIAS http://{pool-1-address} (a rough command sketch of steps 3–5 follows the config excerpt below).
  4. The node hosting pool-3-0 was shut down (it is not the same node as pool-1-0) to simulate a failure scenario. At this point, the decommissioning process was still ongoing.
  5. Then, pool-1 was removed from the MinIO cluster.
    In theory, the sidecars of the pods in pool-3 should restart. However, the service is still unavailable, and the other pods have not been restarted. Here is the log for pool-3-1:
API: SYSTEM()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
Error: Post "http://minio-pool-1-0.minio-hl.minio.svc.cluster.local:9000/minio/peer/v24/getlocaldiskids": context deadline exceeded (*rest.NetworkError)
       peerAddress="minio-pool-1-0.minio-hl.minio.svc.cluster.local:9000"
       2: internal/logger/logger.go:258:logger.LogIf()
       1: cmd/notification.go:129:cmd.(*NotificationGroup).Go.func1()

API: SYSTEM()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
Error: Server not initialized, please try again (*errors.errorString)
       peerAddress="minio-pool-3-0.minio-hl.minio.svc.cluster.local:9000"
       2: internal/logger/logger.go:258:logger.LogIf()
       1: cmd/notification.go:129:cmd.(*NotificationGroup).Go.func1()

API: StorageInfo()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
RequestID: 1808FDD64C8EB4E2
RemoteHost: 10.244.4.167
Host: minio-pool-3-1.minio-hl.minio.svc.cluster.local:9000
UserAgent: MinIO (linux; amd64) madmin-go/2.0.0
Error: Post "http://minio-pool-1-0.minio-hl.minio.svc.cluster.local:9000/minio/peer/v24/getlocaldiskids": context deadline exceeded (*rest.NetworkError)
       4: internal/logger/logger.go:258:logger.LogIf()
       3: cmd/admin-handlers.go:1091:cmd.getAggregatedBackgroundHealState()
       2: cmd/admin-handlers.go:347:cmd.adminAPIHandlers.StorageInfoHandler()
       1: net/http/server.go:2136:http.HandlerFunc.ServeHTTP()

API: StorageInfo()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
RequestID: 1808FDD64C8EB4E2
RemoteHost: 10.244.4.167
Host: minio-pool-3-1.minio-hl.minio.svc.cluster.local:9000
UserAgent: MinIO (linux; amd64) madmin-go/2.0.0
Error: Server not initialized, please try again (*errors.errorString)
       4: internal/logger/logger.go:258:logger.LogIf()
       3: cmd/admin-handlers.go:1091:cmd.getAggregatedBackgroundHealState()
       2: cmd/admin-handlers.go:347:cmd.adminAPIHandlers.StorageInfoHandler()
       1: net/http/server.go:2136:http.HandlerFunc.ServeHTTP()

The config.env file for pods in pool-3 is as follows:

bash-4.4$ cat /tmp/minio/config.env 
export MINIO_ARGS="http://minio-pool-3-{0...2}.minio-hl.minio.svc.cluster.local/export/data"
export MINIO_PROMETHEUS_JOB_ID="minio-job"
export MINIO_ROOT_PASSWORD="test@123"
export MINIO_ROOT_USER="admin"
export MINIO_SERVER_URL="http://minio.minio.svc.cluster.local:9000"
export MINIO_UPDATE="on"
export MINIO_UPDATE_MINISIGN_PUBKEY="RWTx5Zr1tiHQLwG9keckT0c45M3AGeHD6IvimQHpyRywVWGbP1aVSGav"

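For reference, here is a rough command sketch of the reproduction flow in steps 3–5. The tenant name (minio), namespace (minio), and mc alias (myminio) are illustrative and not taken from the actual deployment:

# 3. Start decommissioning pool-1 (same placeholder as in the steps above for the pool address):
mc admin decommission start myminio/ http://{pool-1-address}

# Optionally watch the decommissioning progress:
mc admin decommission status myminio/

# 4. Simulate the failure while decommissioning is still running, e.g. by deleting
#    the minio-pool-3-0 pod (the report shuts down the node hosting it):
kubectl -n minio delete pod minio-pool-3-0

# 5. Remove pool-1 from the tenant by deleting its entry under .spec.pools:
kubectl -n minio edit tenant minio
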
Context

Regression

Your Environment

  • Version used (minio-operator): v6.0.4
  • Environment name and version (e.g. kubernetes v1.17.2): 1.25.17
  • Server type and version:
  • Operating System and version (uname -a):
  • Link to your deployment file:
@ramondeklein
Contributor

It doesn't look like pool-3-1 has restarted, because it still attempts to reach pool-1. Since that pool cannot be reached, the entire cluster is unavailable. After removing pool-1 from the tenant, the sidecar should indeed restart all MinIO nodes. Do you have logs from the operator to see what's going on? AFAIK, the MinIO operator will only restart the nodes when more than half of them are available.

PS: You may use mc admin service restart <alias> to force the nodes to restart. That should stop them from contacting pool-1.
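
A minimal sketch of both suggestions, assuming the operator runs as the minio-operator deployment in the minio-operator namespace and the tenant alias is myminio (all names are illustrative):

# Check the operator logs to see why the MinIO pods were not restarted:
kubectl -n minio-operator logs deployment/minio-operator --tail=200

# Force all MinIO nodes in the tenant to restart so they stop contacting pool-1:
mc admin service restart myminio/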

@Clara12062
Author

Yes, I found the cause. We had made changes to the code, which prevented the service from restarting. Thank you very much for your reply!
