When a single node goes down, removing a pool causes the MinIO cluster to become unavailable #2357

Closed
Clara12062 opened this issue Nov 18, 2024 · 2 comments

Comments


Clara12062 commented Nov 18, 2024

Expected Behavior

In a multi-node deployment, removing a pool should still work normally even if a single (unrelated) node is down, and the removal should not affect the MinIO service. After the failed node is restored, it should have no impact on the running service.

Current Behavior

Three nodes are currently deployed with two pools: pool-1 (1 server, 2 volumes) and pool-3 (3 servers, 6 volumes). The pod on the pool-3-0 node is shut down, pool-1 is decommissioned and then removed, and the pool-3-0 pod is started again.
Afterwards, the MinIO service is unavailable and cannot be restored.

Possible Solution

Steps to Reproduce (for bugs)

  1. The initial deployment consisted of pool-1 (1 server, 2 volumes).
  2. Subsequent horizontal scaling was performed by adding pool-3 (3 servers, 6 volumes).
  3. The decommissioning process was then started by running: mc admin decom start ALIAS http://{pool-1-address} (a rough command sketch of steps 3–5 follows the config excerpt below).
  4. The node hosting pool-3-0 was shut down (it is not the same node as pool-1-0) to simulate a failure scenario. At this point, the decommissioning process was still ongoing.
  5. Then, pool-1 was removed from the MinIO cluster.
    In theory, the sidecars of the pods in pool-3 should restart. However, the service is still unavailable, and the other pods have not been restarted. Here is the log for pool-3-1:
API: SYSTEM()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
Error: Post "http://minio-pool-1-0.minio-hl.minio.svc.cluster.local:9000/minio/peer/v24/getlocaldiskids": context deadline exceeded (*rest.NetworkError)
       peerAddress="minio-pool-1-0.minio-hl.minio.svc.cluster.local:9000"
       2: internal/logger/logger.go:258:logger.LogIf()
       1: cmd/notification.go:129:cmd.(*NotificationGroup).Go.func1()

API: SYSTEM()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
Error: Server not initialized, please try again (*errors.errorString)
       peerAddress="minio-pool-3-0.minio-hl.minio.svc.cluster.local:9000"
       2: internal/logger/logger.go:258:logger.LogIf()
       1: cmd/notification.go:129:cmd.(*NotificationGroup).Go.func1()

API: StorageInfo()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
RequestID: 1808FDD64C8EB4E2
RemoteHost: 10.244.4.167
Host: minio-pool-3-1.minio-hl.minio.svc.cluster.local:9000
UserAgent: MinIO (linux; amd64) madmin-go/2.0.0
Error: Post "http://minio-pool-1-0.minio-hl.minio.svc.cluster.local:9000/minio/peer/v24/getlocaldiskids": context deadline exceeded (*rest.NetworkError)
       4: internal/logger/logger.go:258:logger.LogIf()
       3: cmd/admin-handlers.go:1091:cmd.getAggregatedBackgroundHealState()
       2: cmd/admin-handlers.go:347:cmd.adminAPIHandlers.StorageInfoHandler()
       1: net/http/server.go:2136:http.HandlerFunc.ServeHTTP()

API: StorageInfo()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
RequestID: 1808FDD64C8EB4E2
RemoteHost: 10.244.4.167
Host: minio-pool-3-1.minio-hl.minio.svc.cluster.local:9000
UserAgent: MinIO (linux; amd64) madmin-go/2.0.0
Error: Server not initialized, please try again (*errors.errorString)
       4: internal/logger/logger.go:258:logger.LogIf()
       3: cmd/admin-handlers.go:1091:cmd.getAggregatedBackgroundHealState()
       2: cmd/admin-handlers.go:347:cmd.adminAPIHandlers.StorageInfoHandler()
       1: net/http/server.go:2136:http.HandlerFunc.ServeHTTP()

The config.env file for pods in pool-3 is as follows:

bash-4.4$ cat /tmp/minio/config.env 
export MINIO_ARGS="http://minio-pool-3-{0...2}.minio-hl.minio.svc.cluster.local/export/data"
export MINIO_PROMETHEUS_JOB_ID="minio-job"
export MINIO_ROOT_PASSWORD="test@123"
export MINIO_ROOT_USER="admin"
export MINIO_SERVER_URL="http://minio.minio.svc.cluster.local:9000"
export MINIO_UPDATE="on"
export MINIO_UPDATE_MINISIGN_PUBKEY="RWTx5Zr1tiHQLwG9keckT0c45M3AGeHD6IvimQHpyRywVWGbP1aVSGav"

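For reference, here is a rough command sketch of the reproduction flow in steps 3–5. The tenant name (minio), namespace (minio), and mc alias (myminio) are illustrative and not taken from the actual deployment:

# 3. Start decommissioning pool-1 (same placeholder as in the steps above for the pool address):
mc admin decommission start myminio/ http://{pool-1-address}

# Optionally watch the decommissioning progress:
mc admin decommission status myminio/

# 4. Simulate the failure while decommissioning is still running, e.g. by deleting
#    the minio-pool-3-0 pod (the report shuts down the node hosting it):
kubectl -n minio delete pod minio-pool-3-0

# 5. Remove pool-1 from the tenant by deleting its entry under .spec.pools:
kubectl -n minio edit tenant minio
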
Context

Regression

Your Environment

  • Version used (minio-operator): v6.0.4
  • Environment name and version (e.g. kubernetes v1.17.2): 1.25.17
  • Server type and version:
  • Operating System and version (uname -a):
  • Link to your deployment file:
@ramondeklein
Contributor

It doesn't look like pool-3-1 has restarted, because it still attempts to reach pool-1. Since that pool cannot be reached, the entire cluster is unavailable. After removing pool-1 from the tenant, the sidecar should indeed restart all MinIO nodes. Do you have logs from the operator to see what's going on? AFAIK, the MinIO operator will only restart the nodes when more than half of them are available.

PS: You may use mc admin service restart <alias> to force the nodes to restart. That should stop them from contacting pool-1.
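
A minimal sketch of both suggestions, assuming the operator runs as the minio-operator deployment in the minio-operator namespace and the tenant alias is myminio (all names are illustrative):

# Check the operator logs to see why the MinIO pods were not restarted:
kubectl -n minio-operator logs deployment/minio-operator --tail=200

# Force all MinIO nodes in the tenant to restart so they stop contacting pool-1:
mc admin service restart myminio/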

@Clara12062
Author

Yes, I found the cause. We had made changes to the code, which prevented the service from restarting. Thank you very much for your reply!
