Expected Behavior
In a multi-node scenario, if a single node fails, decommissioning and removing a pool should still work normally, and the removal should not affect the MinIO service. After the failed node is restored, it should have no impact on the running service.
Current Behavior
Three nodes are currently deployed, with two pools: pool-1 (1 server, 2 volumes) and pool-3 (3 servers, 6 volumes). The node hosting the pool-3-0 pod is shut down, pool-1 is decommissioned and then removed, and the pool-3-0 pod is brought back up.
The MinIO service is then unavailable and cannot be restored.
Possible Solution
Steps to Reproduce (for bugs)
1. The initial deployment consisted of pool-1 (1 server, 2 volumes).
2. Horizontal scaling was then performed by adding pool-3 (3 servers, 6 volumes).
3. The decommissioning process was started by running: mc admin decom start ALIAS http://{pool-1-address} (a command sketch follows this list).
4. The node hosting pool-3-0 (not the same node as pool-1-0) was shut down to simulate a failure; at this point the decommissioning process was still ongoing.
5. pool-1 was then removed from the MinIO cluster.
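A rough sketch of these steps, assuming an mc alias named ALIAS already points at the tenant, the tenant is named minio in namespace minio, and {pool-1-address} matches the pool-1 entry in the server's pool list (these names are placeholders, not taken from the actual deployment):

```sh
# Step 3: start decommissioning the single-server pool.
mc admin decom start ALIAS http://{pool-1-address}

# Optional: watch progress; decommissioning is still running at this point.
mc admin decom status ALIAS

# Step 4: simulate the failure by taking down the node that hosts pool-3-0
# (not the node hosting pool-1-0); cordoning the node and deleting the pod
# approximates a node shutdown.
kubectl cordon <node-hosting-pool-3-0>
kubectl -n minio delete pod minio-pool-3-0

# Step 5: once decommissioning finishes, remove pool-1 from the Tenant spec so
# the operator drops the pool.
kubectl -n minio edit tenant minio
```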
In theory, after pool-1 is removed, the sidecars of the pool-3 pods should restart the MinIO servers. However, the service is still unavailable, and the other pods have not been restarted. Here is the log from pool-3-1:
API: SYSTEM()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
Error: Post "http://minio-pool-1-0.minio-hl.minio.svc.cluster.local:9000/minio/peer/v24/getlocaldiskids": context deadline exceeded (*rest.NetworkError)
peerAddress="minio-pool-1-0.minio-hl.minio.svc.cluster.local:9000"
2: internal/logger/logger.go:258:logger.LogIf()
1: cmd/notification.go:129:cmd.(*NotificationGroup).Go.func1()
API: SYSTEM()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
Error: Server not initialized, please try again (*errors.errorString)
peerAddress="minio-pool-3-0.minio-hl.minio.svc.cluster.local:9000"
2: internal/logger/logger.go:258:logger.LogIf()
1: cmd/notification.go:129:cmd.(*NotificationGroup).Go.func1()
API: StorageInfo()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
RequestID: 1808FDD64C8EB4E2
RemoteHost: 10.244.4.167
Host: minio-pool-3-1.minio-hl.minio.svc.cluster.local:9000
UserAgent: MinIO (linux; amd64) madmin-go/2.0.0
Error: Post "http://minio-pool-1-0.minio-hl.minio.svc.cluster.local:9000/minio/peer/v24/getlocaldiskids": context deadline exceeded (*rest.NetworkError)
4: internal/logger/logger.go:258:logger.LogIf()
3: cmd/admin-handlers.go:1091:cmd.getAggregatedBackgroundHealState()
2: cmd/admin-handlers.go:347:cmd.adminAPIHandlers.StorageInfoHandler()
1: net/http/server.go:2136:http.HandlerFunc.ServeHTTP()
API: StorageInfo()
Time: 06:59:15 UTC 11/18/2024
DeploymentID: a47be0e4-6f4d-4859-a8e9-a33f7bd0ac75
RequestID: 1808FDD64C8EB4E2
RemoteHost: 10.244.4.167
Host: minio-pool-3-1.minio-hl.minio.svc.cluster.local:9000
UserAgent: MinIO (linux; amd64) madmin-go/2.0.0
Error: Server not initialized, please try again (*errors.errorString)
4: internal/logger/logger.go:258:logger.LogIf()
3: cmd/admin-handlers.go:1091:cmd.getAggregatedBackgroundHealState()
2: cmd/admin-handlers.go:347:cmd.adminAPIHandlers.StorageInfoHandler()
1: net/http/server.go:2136:http.HandlerFunc.ServeHTTP()
The config.env file for pods in pool-3 is as follows:
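(The sketch below shows only the general shape of an operator-generated config.env; the credentials, drive paths, and exact layout are illustrative assumptions that vary by operator version, not the file from this deployment. The relevant point is that MINIO_ARGS still lists the pool-1 endpoint until the operator rewrites the file and the servers restart.)

```sh
# Illustrative sketch only -- not the actual file from this deployment.
export MINIO_ROOT_USER="<redacted>"
export MINIO_ROOT_PASSWORD="<redacted>"
# The pool list still contains the removed pool-1 endpoint; MinIO keeps trying
# to reach it until this file is regenerated and the server is restarted.
export MINIO_ARGS="http://minio-pool-1-{0...0}.minio-hl.minio.svc.cluster.local/export{0...1} http://minio-pool-3-{0...2}.minio-hl.minio.svc.cluster.local/export{0...1}"
```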
Context

Regression

Your Environment
Version used (minio-operator): v6.0.4
Operating System and version (uname -a):

It doesn't look like pool-3-1 has restarted, because it still attempts to reach pool-1. Since that pool cannot be reached, the entire cluster is unavailable. After removing pool-1 from the tenant, the sidecar should indeed restart all MinIO nodes. Do you have logs from the operator to see what's going on? AFAIK, the MinIO operator will only restart the servers when more than half of the nodes are available.
PS: You may force a restart using mc admin service restart <alias>; that should stop the nodes from contacting pool-1.
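For completeness, a minimal sketch of the suggested forced restart, assuming the tenant's mc alias is ALIAS (a placeholder):

```sh
# Restart every MinIO server process in the deployment so it reloads the
# updated pool list that no longer contains pool-1.
mc admin service restart ALIAS

# Confirm the cluster comes back with only the pool-3 servers online.
mc admin info ALIAS
```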