Ceph Impact

Downtime

Oct 16 at 09:38pm CEST

Affected services

PVE-01

PVE-02

PVE-03

Resolved
Oct 16 at 10:45pm CEST

Impact resolved

Created
Oct 16 at 09:38pm CEST

Incident: OSD failure and resulting outage of the Ceph manager pool due to insufficient replication

Incident date/time: October 16, 2024, 9:38 PM CEST

Affected service: Ceph Cluster - Manager Pool (replication factor 2)

Incident description:
On October 16, 2024, at 9:38 PM CEST, an OSD (Object Storage Daemon) in the Ceph cluster failed. Because the manager pool was configured with a replication factor of 2, the loss of that OSD left the pool with too few replicas to remain operational. The manager pool became unavailable, which in turn impacted the availability of the Ceph cluster.

Cause:
The replication factor of the manager pool was incorrectly set to 2. As a result, the loss of a single OSD could not be compensated for, and the manager pool became inoperable.
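
To make the cause concrete, the following minimal sketch models when a placement group in a replicated pool can still serve I/O. It relies on Ceph's rule that a placement group stays active only while the number of surviving replicas is at least the pool's min_size. This report does not state the min_size of the manager pool, so the value of 2 used below is an assumption, and the function name is hypothetical.

```python
def pg_stays_active(size: int, min_size: int, failed_replicas: int) -> bool:
    """Return True if a placement group can still serve I/O.

    size            -- configured number of replicas (the replication factor)
    min_size        -- replicas that must be available for I/O (assumed value,
                       not stated in the incident report)
    failed_replicas -- replicas lost to failed OSDs
    """
    if not 1 <= min_size <= size:
        raise ValueError("min_size must be between 1 and size")
    surviving = max(size - failed_replicas, 0)
    return surviving >= min_size


# Assumed configuration at the time of the incident: size 2, min_size 2.
print(pg_stays_active(size=2, min_size=2, failed_replicas=1))  # False
# After the fix, assuming size 3 with min_size 2:
print(pg_stays_active(size=3, min_size=2, failed_replicas=1))  # True
```

Under these assumptions, a pool with size 2 has no headroom: a single OSD failure drops it below min_size. Lowering min_size to 1 would keep such a pool active, but on a single remaining copy with no redundancy.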

Immediate actions:

Increase replication factor: The manager pool replication factor was increased to 3 to restore redundancy.

Restore OSD: The failed OSD was investigated and attempts were made to rejoin it to the cluster. The exact cause of the failure is still being analyzed.

Rebalance data: The Ceph cluster automatically started the rebalance and recovery process to replicate the missing data to other available OSDs.

Cluster monitoring: The cluster was continuously monitored to ensure recovery progress and stability.
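
The replication change described above could look roughly like the following sketch, which only builds and prints the ceph CLI calls and does not run them. The pool name `.mgr` and the min_size of 2 are assumptions (the report says only "manager pool" and "increased to 3"), and the helper name is hypothetical.

```python
import shlex

POOL = ".mgr"  # assumed pool name; the report only says "manager pool"


def replication_fix_commands(pool: str, size: int = 3, min_size: int = 2) -> list[list[str]]:
    """Build (but do not run) the ceph CLI calls for raising replication."""
    return [
        # Raise the number of replicas kept for every object in the pool.
        ["ceph", "osd", "pool", "set", pool, "size", str(size)],
        # Assumed companion setting: replicas required before I/O is served.
        ["ceph", "osd", "pool", "set", pool, "min_size", str(min_size)],
        # Show cluster health and recovery progress.
        ["ceph", "status"],
    ]


for argv in replication_fix_commands(POOL):
    print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; on a real cluster they would be issued from a node with admin credentials.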

Service impact:
The manager pool was unavailable for 1 hour and 36 minutes, which posed a critical risk to operations. No data was lost during this time. All VMs are back online and the impact has been resolved.

Next steps:

Investigate OSD failure: The failed OSD will be analyzed further to rule out hardware issues, and corrective action will be taken if necessary.

Check replication factor: The replication settings of all pools in the Ceph cluster will be checked to ensure that the recommended replication factor is used.

Improve monitoring system: Alerts will be set up so that such misconfigurations are detected early.
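
The replication check and the planned alerting could start from a sketch like the one below, which flags every pool whose size is below a target value. It assumes the input is the output of `ceph osd pool ls detail --format json` and that the field names `pool_name` and `size` match that output. The target of 3, the function name, and the sample data are illustrative assumptions, not taken from the affected cluster.

```python
import json

RECOMMENDED_SIZE = 3  # assumed target; adjust to the cluster's policy


def under_replicated_pools(pool_detail_json: str, required_size: int = RECOMMENDED_SIZE) -> list[str]:
    """Return names of pools whose size is below required_size.

    pool_detail_json is expected to be the output of
    `ceph osd pool ls detail --format json`; the field names used here
    ("pool_name", "size") are assumed to match that output.
    Erasure-coded pools are not treated specially and would need
    their own threshold.
    """
    pools = json.loads(pool_detail_json)
    return [p["pool_name"] for p in pools if p.get("size", 0) < required_size]


# Hypothetical sample data, not taken from the affected cluster.
sample = json.dumps([
    {"pool_name": ".mgr", "size": 2, "min_size": 2},
    {"pool_name": "vm-disks", "size": 3, "min_size": 2},
])
print(under_replicated_pools(sample))  # ['.mgr']
```

A monitoring job could run such a check periodically and raise an alert whenever the returned list is not empty.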

Downtime: 1 hour and 36 minutes