KVM service currently disrupted – we are working on a solution
Resolved
Nov 24 at 09:00am CET
Root Cause Analysis – Cluster Outage on 23 November 2025
Overview
On Sunday, 23 November 2025, our Ceph cluster at the DE – Maincubes FRA1 site experienced a disruption that led to temporary unavailability of the productive RBD storage. The root cause was a combined failure of two NVMe OSDs from the same manufacturing batch during an active rebalance process, resulting in several Placement Groups (PGs) being irreparably damaged. As a consequence, a full restore from backup was required.
Timeline
• 23 November 2025, approx. 04:00 – Our monitoring system reported the failure of a single OSD in the cluster. Due to the existing redundancy, this was initially classified as non-critical, as the cluster is designed to tolerate the loss of an individual OSD without service impact.
• 23 November 2025, approx. 09:00 – A technician arrived at the data center and replaced the failed OSD. The cluster automatically initiated a rebalance process afterward.
• 23 November 2025, approx. 13:50 – Monitoring triggered another alert: a second OSD had failed under the additional load caused by the ongoing rebalance. Post-incident analysis revealed that both failed OSDs originated from the same production batch.
• Later on 23 November 2025 – As a result of the second OSD failure, 18 Placement Groups (PGs) were permanently lost. These PGs contained critical metadata relevant for the RBD storage, which led to a significant degradation of the cluster and, ultimately, to the unavailability of the RBD storage.
• 23 November 2025, late afternoon/evening – Extensive efforts were made to recover the affected OSDs and PGs. After thorough internal analysis and consultation with external experts, the affected OSDs had to be marked as “lost”; a direct recovery of the corrupted PGs from the cluster was no longer possible (an illustrative sketch of this step follows the timeline).
• 23 November 2025, approx. 21:00 – 24 November 2025, approx. 09:00 – To restore the environment, we reverted to our incremental backups. The restore was based on the backup taken on 23 November 2025 at 03:00. The restore process of the entire cluster (nearly 500 VMs) ran overnight and was completed on 24 November 2025 at approximately 09:00. At that point, all customer systems were back online.
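For reference, the decision to mark the OSDs as “lost” roughly corresponds to the following steps. This is a minimal illustrative sketch that wraps standard Ceph CLI commands from Python; the PG and OSD IDs shown are placeholders, not the real identifiers from this incident.

import subprocess

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its text output."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# 1) Confirm which PGs have no complete copy left after the two OSD failures.
print(ceph("health", "detail"))
print(ceph("pg", "dump_stuck", "inactive"))

# 2) Inspect the peering state of a specific damaged PG (placeholder ID).
print(ceph("pg", "2.1a", "query"))

# 3) Only once it is certain that no replica is recoverable: declare the dead
#    OSDs lost so the cluster stops waiting for them. Destructive -- left
#    commented out on purpose (placeholder OSD ID).
# ceph("osd", "lost", "12", "--yes-i-really-mean-it")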
Technical Root Cause
The incident can essentially be attributed to the following factors:
• Failure of two OSDs from the same batch – Both failed NVMe OSDs came from the same manufacturing batch, suggesting a batch-specific quality or reliability issue.
• Increased load due to rebalance – The second OSD failed during an active rebalance, which imposed additional I/O load on the drives involved. This increased load likely contributed significantly to the second drive’s failure.
• Loss of critical Placement Groups – Due to the combined failure of two OSDs across the relevant failure domains, 18 PGs were permanently lost, including PGs holding essential metadata for the RBD pool. This led to an inconsistent and unusable storage state for the affected pool.
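To make the last point more concrete: with a replica size of 2, every PG has exactly two copies, so losing the two OSDs that happen to hold both copies destroys that PG outright. The following self-contained sketch uses a purely hypothetical PG-to-OSD mapping (not data from the actual cluster) to illustrate the effect.

# Hypothetical acting sets: each PG stores one replica on each listed OSD (size=2).
pg_to_osds = {
    "2.0": [3, 7],
    "2.1": [5, 12],   # both replicas sit on the two drives that failed
    "2.2": [12, 9],
    "2.3": [5, 1],
}

failed_osds = {5, 12}  # the two NVMe drives from the same batch

for pg, osds in pg_to_osds.items():
    surviving = [o for o in osds if o not in failed_osds]
    if not surviving:
        print(f"PG {pg}: all replicas were on failed OSDs -> data lost")
    elif len(surviving) < len(osds):
        print(f"PG {pg}: degraded, can rebuild from OSD(s) {surviving}")
    else:
        print(f"PG {pg}: unaffected")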
Impact
• Temporary unavailability of the RBD storage in the cluster at the DE – Maincubes FRA1 site.
• Service impact on nearly 500 virtual machines hosted on this cluster.
• Required full restore from backup based on the backup state from 03:00 on 23 November 2025.
Preventive and Corrective Measures
To significantly reduce the risk of similar incidents in the future, we have implemented the following measures:
• Increased replication level
o The replica size of the affected cluster has been increased to 3, providing additional fault tolerance in the event of simultaneous OSD failures (a brief command sketch follows at the end of this list).
• Expansion and distribution of storage capacity
o Six additional NVMe OSDs were added to the cluster to improve data distribution and reduce the impact of load peaks (e.g., during rebalancing).
• Enhanced hardware quality control for OSDs
o Proactive removal of another OSD from the same batch as the failed drives to prevent potential follow-up issues.
o Introduction of a standardized validation process for new batches, including S.M.A.R.T. checks, burn-in tests, and benchmark/stress tests before drives are put into production (an illustrative intake check is sketched below this list).
• Internal processes and SOPs
o Creation of an internal Standard Operating Procedure (SOP) covering regular:
▪ S.M.A.R.T. analysis
▪ Benchmark and stress testing of all OSDs
o Clear definition of procedures for handling OSD failures during active rebalance operations.
• Monitoring improvements
o Tightening of proactive monitoring policies, in particular:
▪ Closer tracking of latency, I/O errors, and reallocations of individual OSDs
▪ Additional alert thresholds for rebalance load and cluster degradation
• Customer communication and compensation
o We sincerely regret the inconvenience caused by this incident. All affected customers will be informed separately about the compensation applicable to their individual case. We highly appreciate our customers’ patience and understanding during the disruption and the subsequent restoration process.
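For the increased replication level referenced above, the change boils down to two standard Ceph pool settings. A minimal sketch, assuming a placeholder pool name "rbd"; the actual pool names in our cluster differ.

import subprocess

POOL = "rbd"  # placeholder pool name, not the production pool

def ceph(*args: str) -> None:
    subprocess.run(["ceph", *args], check=True)

# Keep three copies of every object instead of two ...
ceph("osd", "pool", "set", POOL, "size", "3")
# ... and continue serving I/O as long as at least two copies are healthy.
ceph("osd", "pool", "set", POOL, "min_size", "2")
# Verify the setting; the cluster backfills the third replica in the background.
ceph("osd", "pool", "get", POOL, "size")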
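For the enhanced hardware quality control, the following sketch shows roughly what an automated S.M.A.R.T. intake check for new NVMe drives can look like before burn-in and benchmark tests. It relies on smartmontools (smartctl 7.0+ for JSON output); the field names and thresholds are assumptions for illustration, not our exact SOP values.

import json
import subprocess
import sys

def nvme_health(device: str) -> dict:
    """Return the NVMe SMART health log of a device as a dict (smartctl JSON)."""
    out = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,
    ).stdout
    data = json.loads(out)
    return data.get("nvme_smart_health_information_log", {})

def passes_intake(device: str) -> bool:
    """Rough acceptance check for a new drive before burn-in and benchmarks."""
    health = nvme_health(device)
    ok = (
        health.get("critical_warning", 1) == 0      # no controller warnings
        and health.get("media_errors", 1) == 0      # no media/data integrity errors
        and health.get("percentage_used", 100) < 5  # near-new wear level (assumed threshold)
    )
    print(f"{device}: {'OK' if ok else 'REJECT'} -> {health}")
    return ok

if __name__ == "__main__":
    # Example: python3 intake_check.py /dev/nvme0n1 /dev/nvme1n1
    results = [passes_intake(dev) for dev in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)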
Updated
Nov 24 at 05:51am CET
Many of our services have already been successfully restored.
We are continuing to work hard to ensure that all KVM products are running fully and stably again.
Please note: Some functions will only be available to a limited extent until services are fully restored.
Updated
Nov 23 at 11:24pm CET
We are currently still experiencing disruptions to some of our root server and web hosting services. The cause is a problem on our storage platform, which we have now clearly identified. Our engineering team is working hard to gradually restore all systems to normal operation and keep downtime to a minimum.
Created
Nov 23 at 02:00pm CET
We are experiencing a disruption to our server service – we are working at full speed to restore all affected services.