Incidents | Informaten | Status
Incidents reported on the status page for Informaten
https://status.informaten.com/

Webspaces recovered – Sat, 31 Jan 2026 02:46:42 +0000
Webspaces went down – Sat, 31 Jan 2026 02:42:22 +0000
Landingpage & CP recovered – Tue, 06 Jan 2026 08:10:56 +0000
Landingpage & CP went down – Tue, 06 Jan 2026 08:05:46 +0000

KVM service currently disrupted – we are working on a solution
https://status.informaten.com/incident/772646
Mon, 24 Nov 2025 08:00:00 -0000

Root Cause Analysis – Cluster Outage on 23 November 2025

1. Overview
On Sunday, 23 November 2025, our Ceph cluster at the DE – Maincubes FRA1 site experienced a disruption that made the productive RBD storage temporarily unavailable. The root cause was the combined failure of two NVMe OSDs from the same manufacturing batch during an active rebalance, which left several Placement Groups (PGs) irreparably damaged. As a consequence, a full restore from backup was required.

2. Timeline
• 23 November 2025, approx. 04:00 – Monitoring reported the failure of a single OSD. Because the cluster is designed to tolerate the loss of one OSD without service impact, this was initially classified as non-critical.
• 23 November 2025, approx. 09:00 – A technician at the data center replaced the failed OSD; the cluster then automatically started a rebalance.
• 23 November 2025, approx. 13:50 – Monitoring alerted on the failure of a second OSD, which gave out under the additional load of the ongoing rebalance. Post-incident analysis showed that both failed OSDs came from the same production batch.
• Later on 23 November 2025 – As a result of the second failure, 18 Placement Groups (PGs) were permanently lost. These PGs held metadata critical to the RBD storage, which severely degraded the cluster and ultimately made the RBD storage unavailable.
• 23 November 2025, late afternoon/evening – Extensive attempts were made to recover the affected OSDs and PGs. After thorough internal analysis and consultation with external experts, the affected OSDs had to be marked as "lost"; direct recovery of the corrupted PGs from the cluster was no longer possible.
• 23 November 2025, approx. 21:00 – 24 November 2025, approx. 09:00 – We restored the environment from our incremental backups, based on the backup taken on 23 November 2025 at 03:00. The restore of the entire cluster (nearly 500 VMs) ran overnight and completed at approximately 09:00 on 24 November, at which point all customer systems were back online.

3. Technical Root Cause
The incident is attributable to the following factors:
1. Failure of two OSDs from the same batch – Both failed NVMe OSDs came from the same manufacturing batch, suggesting a batch-specific quality or reliability issue.
2. Increased load due to rebalance – The second OSD failed during an active rebalance, which imposed additional I/O load on the drives involved and likely contributed significantly to the failure.
3. Loss of critical Placement Groups – Because two OSDs failed across the relevant failure domains, 18 PGs were permanently lost, including PGs holding essential metadata for the RBD pool. This left the affected pool in an inconsistent and unusable state.

4. Impact
• Temporary unavailability of the RBD storage in the cluster at the DE – Maincubes FRA1 site.
• Service impact on nearly 500 virtual machines hosted on this cluster.
• A full restore from the backup state of 03:00 on 23 November 2025 was required.

5. Preventive and Corrective Measures
To significantly reduce the risk of similar incidents, we have implemented the following measures:
1. Increased replication level
  o The replica size of the affected cluster has been raised to 3, providing additional fault tolerance against simultaneous OSD failures.
2. Expansion and distribution of storage capacity
  o Six additional NVMe OSDs were added to the cluster to improve data distribution and soften load peaks (e.g., during rebalancing).
3. Enhanced hardware quality control for OSDs
  o Proactive removal of another OSD from the same batch as the failed drives, to prevent follow-up failures.
  o A standardized validation process for new batches, including S.M.A.R.T. checks, burn-in tests, and benchmark/stress tests before drives enter production.
4. Internal processes and SOPs
  o An internal Standard Operating Procedure (SOP) covering regular S.M.A.R.T. analysis and benchmark/stress testing of all OSDs.
  o Clearly defined procedures for handling OSD failures during active rebalance operations.
5. Monitoring improvements
  o Tightened proactive monitoring policies, in particular closer tracking of latency, I/O errors, and reallocations of individual OSDs, plus additional alert thresholds for rebalance load and cluster degradation.

6. Customer Communication and Compensation
We sincerely regret the inconvenience caused by this incident. All affected customers will be informed separately about the compensation applicable to their individual case. We greatly appreciate our customers' patience and understanding during the disruption and the subsequent restoration process.
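The benefit of raising the replica size from 2 to 3 can be illustrated with a small, self-contained sketch. Random placement stands in here for Ceph's actual CRUSH algorithm, and the PG and OSD counts are illustrative, not the cluster's real figures: with two replicas, a double OSD failure can destroy any PG whose two copies sat on exactly the failed pair, whereas with three replicas on distinct drives no two-OSD failure can lose a PG outright.

```python
import random

def surviving_pgs(pg_map, failed_osds):
    """Count PGs that still have at least one replica after the given OSD failures.

    pg_map: dict mapping PG id -> set of OSD ids holding a replica.
    failed_osds: set of OSD ids that have failed.
    Returns (available, lost) PG counts.
    """
    available = lost = 0
    for osds in pg_map.values():
        if osds - failed_osds:   # at least one replica survives
            available += 1
        else:
            lost += 1
    return available, lost

def random_pg_map(num_pgs, num_osds, replicas, seed=0):
    """Place each PG's replicas on distinct, randomly chosen OSDs
    (a toy stand-in for CRUSH placement)."""
    rng = random.Random(seed)
    return {pg: set(rng.sample(range(num_osds), replicas))
            for pg in range(num_pgs)}

# With replicas=2, the PGs whose two copies sit on the failed pair are lost;
# with replicas=3 and only two failed OSDs, every PG keeps at least one copy.
two_rep = random_pg_map(num_pgs=1024, num_osds=12, replicas=2)
three_rep = random_pg_map(num_pgs=1024, num_osds=12, replicas=3)
_, lost2 = surviving_pgs(two_rep, failed_osds={3, 7})
_, lost3 = surviving_pgs(three_rep, failed_osds={3, 7})
```

In Ceph terms the sketch corresponds to the pool `size` setting; the real failure domains (host, rack) additionally constrain which OSDs may share a PG.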
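The batch-validation measure described in the RCA can be sketched as a simple acceptance check. The field names and threshold values below are illustrative assumptions, not the provider's actual policy:

```python
from dataclasses import dataclass

@dataclass
class DriveReport:
    """Health figures collected for one NVMe drive during validation
    (fields and limits are illustrative, not the provider's real policy)."""
    serial: str
    media_errors: int        # NVMe media and data integrity errors
    percentage_used: int     # NVMe endurance estimate, 0-100
    burnin_hours: float      # hours of sustained stress testing completed
    seq_write_mbps: float    # benchmark result

# Illustrative acceptance limits for a new drive batch.
LIMITS = {
    "media_errors": 0,
    "percentage_used": 1,
    "min_burnin_hours": 48.0,
    "min_seq_write_mbps": 1000.0,
}

def validate_batch(reports):
    """Return serials that fail acceptance; an empty list clears the batch."""
    failed = []
    for r in reports:
        ok = (r.media_errors <= LIMITS["media_errors"]
              and r.percentage_used <= LIMITS["percentage_used"]
              and r.burnin_hours >= LIMITS["min_burnin_hours"]
              and r.seq_write_mbps >= LIMITS["min_seq_write_mbps"])
        if not ok:
            failed.append(r.serial)
    return failed
```

In practice the raw figures would come from `smartctl`/NVMe health logs; the point of the sketch is that a whole batch is held back if any drive in it fails the checks.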
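The tightened per-OSD monitoring (latency, I/O errors, reallocations) can be sketched as a threshold check. The default limits here are illustrative assumptions, not the provider's real alert policy:

```python
def osd_alerts(metrics, latency_ms_max=50.0, io_errors_max=0, realloc_max=0):
    """Return alert strings for OSDs breaching per-drive thresholds.

    metrics: dict osd_id -> {"latency_ms": float, "io_errors": int,
                             "reallocations": int}
    Threshold defaults are illustrative, not a real policy.
    """
    alerts = []
    for osd, m in sorted(metrics.items()):
        if m["latency_ms"] > latency_ms_max:
            alerts.append(f"osd.{osd}: latency {m['latency_ms']:.1f} ms")
        if m["io_errors"] > io_errors_max:
            alerts.append(f"osd.{osd}: {m['io_errors']} I/O errors")
        if m["reallocations"] > realloc_max:
            alerts.append(f"osd.{osd}: {m['reallocations']} reallocated sectors")
    return alerts
```

Catching the second drive's rising latency and error counts during the rebalance, before it failed outright, is exactly the scenario such per-OSD thresholds target.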
KVM service currently disrupted – we are working on a solution https://status.informaten.com/incident/772646 Mon, 24 Nov 2025 04:51:00 -0000 https://status.informaten.com/incident/772646#11791732fb1c691c06404cd6003f56c6ccac0ad5eb0ec2b6c2c876189e81c4d4 Many of our services have already been successfully restored. We are continuing to work hard to ensure that all KVM products are running fully and stably again. Please note: Some functions will only be available to a limited extent until services are fully restored.
KVM service currently disrupted – we are working on a solution https://status.informaten.com/incident/772646 Sun, 23 Nov 2025 22:24:00 -0000 https://status.informaten.com/incident/772646#3c3155ef3ed0dda83a4da57348293d71e098080a0884923202be2d186658b7c4 We are currently still experiencing disruptions to some of our root server and web hosting services. The cause is a problem on our storage platform, which we have now clearly identified. Our engineering team is working hard to gradually restore all systems to normal operation and keep downtime to a minimum.
KVM service currently disrupted – we are working on a solution https://status.informaten.com/incident/772646 Sun, 23 Nov 2025 13:00:00 -0000 https://status.informaten.com/incident/772646#734eb8ffa87bdd7497b99c2aa58b9044f2d709c27dbd5291208b5a8ca4f51874 We have a disruption in our server service – we are working at full speed to restore all affected services.
Scheduled maintenance of the CP https://status.informaten.com/incident/719634 Sun, 07 Sep 2025 16:00:43 -0000 https://status.informaten.com/incident/719634#4133bab81205f1fbd079fa59c91814a97f076154231d257cda79a9dfb3a4be98 Maintenance completed
Scheduled maintenance of the CP https://status.informaten.com/incident/719634 Sat, 06 Sep 2025 22:00:43 -0000 https://status.informaten.com/incident/719634#2b83c1dbf59f175f604d23a1255f3e89b063af2481079544ddb6d06834e2a2cc We are currently performing maintenance work on the Customer Panel of our website.
During this time, access to the customer area may be restricted or unavailable. We apologize for any inconvenience and are working to restore service as quickly as possible. Our services such as root servers, web space, and domains, as well as all managed services and colocation, are not affected.