Identify down OSDs and manually bring them back up.
Troubleshoot the Ceph health detail reporting down OSDs. Keeping OSDs operational and data balanced across them reduces the likelihood of hotspots being created.
This procedure requires admin privileges.
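Down OSDs normally show up first in the cluster health output. As context, they can be confirmed before starting with either of the following commands (the exact warning text varies by Ceph release):
ncn-m/s(001/2/3)# ceph -s
ncn-m/s(001/2/3)# ceph health detail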
- Identify the down OSDs.
ncn-m/s(001/2/3)# ceph osd tree down
Example output:
ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       62.87558 root default
-7       20.95853     host ncn-s002
 1   ssd  3.49309         osd.1        down  1.00000  1.00000
 3   ssd  3.49309         osd.3        down  1.00000  1.00000
 7   ssd  3.49309         osd.7        down  1.00000  1.00000
10   ssd  3.49309         osd.10       down  1.00000  1.00000
13   ssd  3.49309         osd.13       down  1.00000  1.00000
16   ssd  3.49309         osd.16       down  1.00000  1.00000
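As an optional cross-check (not part of the documented steps), a one-line summary of how many OSDs are up and in can be printed with:
ncn-m/s(001/2/3)# ceph osd stat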
- Restart the down OSDs.
- Option 1: Restart the OSD using ceph orch.
ncn-m/s(001/2/3)# ceph orch daemon restart osd.<number>
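For example, to restart the first down OSD from the sample output above (osd.1 is purely illustrative; substitute the actual OSD ID):
ncn-m/s(001/2/3)# ceph orch daemon restart osd.1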
- Option 2: Find the node that is hosting the down OSD so its logs can be checked. Use the OSD ID for the down OSD returned in the command above.
ncn-m/s(001/2/3)# ceph osd find OSD_ID
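For example, using osd.1 from the sample output above (illustrative only); the output reports the OSD's address and CRUSH location, including the host it runs on:
ncn-m/s(001/2/3)# ceph osd find 1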
- Manually restart the OSD. This step must be done on the node with the reported down OSD.
ceph orch daemon restart osd.<number>
Troubleshooting: If the service is not restarted with ceph orch, restart it using Manage Ceph Services.
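If the orchestrator cannot restart the daemon at all, one possible fallback on cephadm-managed clusters is to restart the OSD's systemd unit directly on the node hosting it. This is only a sketch and assumes a cephadm deployment; the unit name embeds the cluster FSID, and the supported path remains the Manage Ceph Services procedure:
ncn-s002# systemctl restart ceph-<cluster_fsid>@osd.<number>.service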
- Verify the OSDs are running again. The previously down OSDs should now report a STATUS of up.
ncn-m/s(001/2/3)# ceph osd tree
Example output:
ID CLASS WEIGHT   TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       62.87558 root default
-7       20.95853     host ncn-s002
 1   ssd  3.49309         osd.1          up  1.00000  1.00000
 3   ssd  3.49309         osd.3          up  1.00000  1.00000
 7   ssd  3.49309         osd.7          up  1.00000  1.00000
10   ssd  3.49309         osd.10         up  1.00000  1.00000
13   ssd  3.49309         osd.13         up  1.00000  1.00000
16   ssd  3.49309         osd.16         up  1.00000  1.00000
If the OSD goes down again, check dmesg on the node hosting it for drive failures.
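One non-exhaustive way to scan the kernel log for disk errors is shown below; ncn-s002 is the host from the example output above, and the grep patterns are only illustrative:
ncn-s002# dmesg -T | grep -iE 'i/o error|medium error|blk_update_request'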