Incidents | Informaten | Status
Incidents reported on the status page for Informaten
https://status.informaten.com/

KVM service currently disrupted – we are working on a solution
https://status.informaten.com/incident/772646
Mon, 24 Nov 2025 08:00:00 -0000

Root Cause Analysis – Cluster Outage on 23 November 2025

1. Overview
On Sunday, 23 November 2025, our Ceph cluster at the DE – Maincubes FRA1 site experienced a disruption that led to temporary unavailability of the productive RBD storage. The root cause was the combined failure of two NVMe OSDs from the same manufacturing batch during an active rebalance, which left several Placement Groups (PGs) irreparably damaged. As a consequence, a full restore from backup was required.

2. Timeline
• 23 November 2025, approx. 04:00
Our monitoring system reported the failure of a single OSD in the cluster. Because of the existing redundancy, this was initially classified as non-critical: the cluster is designed to tolerate the loss of an individual OSD without service impact.
• 23 November 2025, approx. 09:00
A technician arrived at the data center and replaced the failed OSD. The cluster then automatically started a rebalance.
• 23 November 2025, approx. 13:50
Monitoring triggered another alert: under the additional load of the ongoing rebalance, a second OSD failed. Post-incident analysis revealed that both failed OSDs originated from the same production batch.
• Later on 23 November 2025
As a result of the second OSD failure, 18 Placement Groups were permanently lost. These PGs contained metadata critical to the RBD storage, which led to significant degradation of the cluster and, ultimately, to the unavailability of the RBD storage.
• 23 November 2025, late afternoon/evening
Extensive efforts were made to recover the affected OSDs and PGs. After thorough internal analysis and consultation with external experts, the affected OSDs had to be marked as "lost"; direct recovery of the corrupted PGs from the cluster was no longer possible.
• 23 November 2025, approx. 21:00 – 24 November 2025, approx. 09:00
To restore the environment, we reverted to our incremental backups, using the backup taken on 23 November 2025 at 03:00. The restore of the entire cluster (nearly 500 VMs) ran overnight and completed on 24 November 2025 at approximately 09:00. At that point, all customer systems were back online.

3. Technical Root Cause
The incident can essentially be attributed to the following factors:
1. Failure of two OSDs from the same batch: Both failed NVMe OSDs came from the same manufacturing batch, suggesting a batch-specific quality or reliability issue.
2. Increased load due to rebalance: The second OSD failed during an active rebalance, which imposed additional I/O load on the drives involved. This increased load likely contributed significantly to the second drive's failure.
3. Loss of critical Placement Groups: Because two OSDs failed across the relevant failure domains, 18 PGs were permanently lost, including PGs holding essential metadata for the RBD pool. This left the affected pool in an inconsistent and unusable state. (A short inspection sketch follows this section.)
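For readers less familiar with Ceph, the following is a minimal sketch (not taken from our runbook) of how PG damage of this kind typically surfaces: it summarizes PG states from the JSON output of `ceph status` and flags states that no longer serve I/O. The exact JSON layout varies between Ceph releases, and the destructive `ceph osd lost` step mentioned in the timeline appears only as a comment.

```python
#!/usr/bin/env python3
# Sketch: summarize PG states from `ceph status`; illustrative only.
# Assumes the ceph CLI is installed and the caller has monitor access.
import json
import subprocess

def pg_states() -> dict:
    """Return a mapping of PG state name -> count from `ceph status`."""
    out = subprocess.run(
        ["ceph", "status", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    status = json.loads(out)
    # The JSON layout can differ between Ceph releases.
    return {
        entry["state_name"]: entry["count"]
        for entry in status.get("pgmap", {}).get("pgs_by_state", [])
    }

if __name__ == "__main__":
    states = pg_states()
    bad = {s: n for s, n in states.items()
           if "incomplete" in s or "down" in s or "unknown" in s}
    print("PG states:", states)
    if bad:
        print("ATTENTION - PGs not serving I/O:", bad)
    # Last-resort step referenced in the timeline (do NOT run casually):
    #   ceph osd lost <osd-id> --yes-i-really-mean-it
```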
4. Impact
• Temporary unavailability of the RBD storage in the cluster at the DE – Maincubes FRA1 site.
• Service impact on nearly 500 virtual machines hosted on this cluster.
• A full restore from backup was required, based on the backup state from 03:00 on 23 November 2025.

5. Preventive and Corrective Measures
To significantly reduce the risk of similar incidents in the future, we have implemented the following measures (illustrative configuration sketches follow this report):
1. Increased replication level
o The replica size of the affected cluster has been increased to 3, providing additional fault tolerance in the event of simultaneous OSD failures.
2. Expansion and distribution of storage capacity
o Six additional NVMe OSDs were added to the cluster to improve data distribution and reduce the impact of load peaks (e.g., during rebalancing).
3. Enhanced hardware quality control for OSDs
o Proactive removal of another OSD from the same batch as the failed drives to prevent potential follow-up issues.
o Introduction of a standardized validation process for new batches, including S.M.A.R.T. checks, burn-in tests, and benchmark/stress tests before drives are put into production.
4. Internal processes and SOPs
o Creation of an internal Standard Operating Procedure (SOP) covering regular:
▪ S.M.A.R.T. analysis
▪ Benchmark and stress testing of all OSDs
o Clear definition of procedures for handling OSD failures during active rebalance operations.
5. Monitoring improvements
o Tightening of proactive monitoring policies, in particular:
▪ Closer tracking of latency, I/O errors, and reallocations of individual OSDs
▪ Additional alert thresholds for rebalance load and cluster degradation

6. Customer Communication and Compensation
We sincerely regret the inconvenience caused by this incident. All affected customers will be informed separately about the compensation applicable to their individual case. We greatly appreciate our customers' patience and understanding during the disruption and the subsequent restoration process.
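To illustrate measure 1 (increased replication level), here is a minimal sketch of how the replica count of an RBD pool can be raised and verified with the standard Ceph CLI. The pool name `rbd-prod` and the `min_size 2` setting are placeholders for illustration, not the actual configuration of the affected cluster.

```python
#!/usr/bin/env python3
# Sketch: raise and verify a pool's replica count; pool name is a placeholder.
import json
import subprocess

POOL = "rbd-prod"  # hypothetical pool name, not the production pool

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# Keep three copies of every object, and keep serving I/O while at least
# two copies are available (illustrative values).
ceph("osd", "pool", "set", POOL, "size", "3")
ceph("osd", "pool", "set", POOL, "min_size", "2")

# Verify the change.
size = json.loads(ceph("osd", "pool", "get", POOL, "size", "--format", "json"))
print(f"{POOL}: size={size['size']}")
```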
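To illustrate the validation process introduced under measure 3, the following sketch reads smartctl's JSON output for an NVMe device and applies example acceptance criteria. The device path and thresholds are placeholders, and the JSON field names depend on the smartmontools version; this is not the exact check used in our SOP.

```python
#!/usr/bin/env python3
# Sketch: pre-production NVMe health check via smartctl JSON output.
# Requires smartmontools >= 7.0 for the -j (JSON) flag; run with root rights.
import json
import subprocess
import sys

DEVICE = "/dev/nvme0n1"  # example device path

proc = subprocess.run(["smartctl", "-a", "-j", DEVICE],
                      capture_output=True, text=True)
# smartctl uses a non-zero exit bitmask even for readable-but-unhealthy
# devices, so parse the JSON instead of relying on the exit code.
report = json.loads(proc.stdout)

# Field names follow smartmontools' JSON schema and may differ by version.
log = report.get("nvme_smart_health_information_log", {})
problems = []
if log.get("critical_warning", 0) != 0:
    problems.append(f"critical_warning={log['critical_warning']}")
if log.get("media_errors", 0) > 0:
    problems.append(f"media_errors={log['media_errors']}")
if log.get("percentage_used", 0) >= 80:  # example wear threshold
    problems.append(f"percentage_used={log['percentage_used']}%")

if problems:
    print(f"{DEVICE}: FAILED validation -> {', '.join(problems)}")
    sys.exit(1)
print(f"{DEVICE}: passed basic S.M.A.R.T. validation")
```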
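Finally, a sketch of the kind of per-OSD latency tracking described under measure 5, based on the output of `ceph osd perf`. The alert threshold is an example only, and the JSON layout differs between Ceph releases, so the field handling below is an assumption rather than our actual monitoring rule.

```python
#!/usr/bin/env python3
# Sketch: flag OSDs whose commit/apply latency exceeds an example threshold.
import json
import subprocess

LATENCY_ALERT_MS = 50  # example threshold, not a recommendation

out = subprocess.run(["ceph", "osd", "perf", "--format", "json"],
                     check=True, capture_output=True, text=True).stdout
perf = json.loads(out)

# Older releases expose "osd_perf_infos" at the top level; newer ones nest
# it under "osdstats". Handle both, since the layout varies by version.
infos = (perf.get("osd_perf_infos")
         or perf.get("osdstats", {}).get("osd_perf_infos", []))

for osd in infos:
    stats = osd.get("perf_stats", {})
    commit = stats.get("commit_latency_ms", 0)
    apply_ = stats.get("apply_latency_ms", 0)
    if max(commit, apply_) > LATENCY_ALERT_MS:
        print(f"osd.{osd.get('id')}: commit={commit}ms apply={apply_}ms -> alert")
```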
KVM service currently disrupted – we are working on a solution
https://status.informaten.com/incident/772646
Mon, 24 Nov 2025 04:51:00 -0000
Many of our services have already been successfully restored. We are continuing to work hard to ensure that all KVM products are running fully and stably again. Please note: some functions will only be available to a limited extent until services are fully restored.
Webspaces recovered
https://status.informaten.com/
Sun, 23 Nov 2025 22:41:00 +0000

KVM service currently disrupted – we are working on a solution
https://status.informaten.com/incident/772646
Sun, 23 Nov 2025 22:24:00 -0000
We are currently still experiencing disruptions to some of our root server and web hosting services. The cause is a problem on our storage platform, which we have now clearly identified. Our engineering team is working hard to gradually restore all systems to normal operation and keep downtime to a minimum.
Component status events (https://status.informaten.com/):
PVE-2 recovered (Sun, 23 Nov 2025 22:08:01 +0000)
PVE-2 went down (Sun, 23 Nov 2025 22:06:35 +0000)
Landingpage & CP recovered (Sun, 23 Nov 2025 21:59:18 +0000)
INF18 recovered (Sun, 23 Nov 2025 21:58:02 +0000)
INF18 went down (Sun, 23 Nov 2025 21:56:53 +0000)
Landingpage & CP went down (Sun, 23 Nov 2025 21:51:12 +0000)
PVE-3 recovered (Sun, 23 Nov 2025 21:44:03 +0000)
PVE-3 went down (Sun, 23 Nov 2025 21:43:56 +0000)
INF12 recovered (Sun, 23 Nov 2025 21:40:01 +0000)
INF12 went down (Sun, 23 Nov 2025 21:33:42 +0000)
Landingpage & CP recovered (Sun, 23 Nov 2025 17:59:18 +0000)
Webspaces went down (Sun, 23 Nov 2025 14:12:59 +0000)
Landingpage & CP went down (Sun, 23 Nov 2025 13:24:12 +0000)

KVM service currently disrupted – we are working on a solution
https://status.informaten.com/incident/772646
Sun, 23 Nov 2025 13:00:00 -0000
We have a disruption in our server service and are working at full speed to restore all affected services.
Scheduled maintenance of the CP
https://status.informaten.com/incident/719634
Sun, 07 Sep 2025 16:00:43 -0000
Maintenance completed

Scheduled maintenance of the CP
https://status.informaten.com/incident/719634
Sat, 06 Sep 2025 22:00:43 -0000
We are currently performing maintenance work on the customer panel of our website. During this time, access to the customer area may be restricted or unavailable. We apologize for any inconvenience and are working to restore service as quickly as possible. Our services such as root servers, web space, and domains, as well as all managed services and colocation, are not affected.