2025-10-09

We lost a first hypervisor due to high memory pressure while, at the same time, an incident in one of our data centers made the water coolers unavailable.

This put additional pressure on the Paris region, which caused another hypervisor to enter the same state. We lost critical services in the meantime because the infrastructure was already under heavy load.

Status updates were communicated through https://www.clevercloudstatus.com/incidents/1020

Timeline

Time             | Description
-----------------|------------------------------------------------------------
2025-10-09 07:10 | Monitoring of the time series systems stops working properly; the distributed systems on-call operator is paged
2025-10-09 07:40 | The time series systems fail to receive ingress traffic
2025-10-09 07:59 | All time series systems load balancers are stuck due to too many connections
2025-10-09 08:01 | The distributed systems operator raises the alert to Level 2
2025-10-09 08:03 | One of the hypervisors is detected as down
2025-10-09 08:12 | Orchestration operations are restricted to contain a potential domino effect and reduce deployment pressure on the infrastructure
2025-10-09 08:15 | Restart of the time series systems ingresses; a hypervisor is experiencing high load (2300) and memory pressure is identified
2025-10-09 08:17 | The faulty hypervisor is passed to CLI for management
2025-10-09 08:18 | Metrics are being ingested by the time series systems again (1/5 of nominal traffic)
2025-10-09 08:23 | Time series systems load balancer datapoint sampling (blackhole) is enabled on 50% of traffic
2025-10-09 08:35 | Root cause identified: a RAM capacity issue causing a load spike on hypervisors
2025-10-09 08:37 | The faulty hypervisor is rebooted successfully (uptime: 2025-10-09 06:37:06 UTC)
2025-10-09 08:40 | Domino effect detected: another hypervisor is showing signs of stress
2025-10-09 08:43 | Multiple databases are impacted (down); the full Database Team is escalated
2025-10-09 08:46 | The time series systems are at 75% of nominal ingress traffic
2025-10-09 09:00 | The time series systems are fully operational and back to nominal traffic
2025-10-09 09:09 | The Mongo add-on provider is back online
2025-10-09 09:10 | On the first faulty hypervisor, all databases are up again
2025-10-09 09:14 | On the second faulty hypervisor, all databases are up again
2025-10-09 09:14 | The DBaaS team is now fixing individual databases that went down in the par8 datacenter (mainly small encrypted plans); reboots are failing due to a local Out Of Memory (OOM) condition
2025-10-09 09:18 | One of the four time series systems load balancers is not rebooting due to a missing base image
2025-10-09 09:25 | Some databases fail to reboot (OOM before start); they are restarted in rescue mode with logs deactivated to allow the reboot
2025-10-09 09:46 | All four time series systems load balancers are up again
2025-10-09 09:55 | The databases rebooted without logs are correctly migrated; logs are collected again
2025-10-09 09:56 | An automated system reports an expired certificate on the time series systems customer ingress endpoint; half of the traffic fails to reach the time series systems
2025-10-09 10:30 | Two of the four time series systems load balancers are exposing expired certificates
2025-10-09 10:39 | The time series systems expired certificate issue is fixed; back to nominal
2025-10-09 10:43 | All databases impacted by the incident are up and running
2025-10-09 10:46 | Mass restart of supernova/supernova-http on all hypervisors via Ansible; RabbitMQ connection flapping detected
2025-10-09 11:10 | Restart of proxy-manager / hayxorp / rabbitfrs across the infrastructure; verification that all services reconnected to RabbitMQ successfully

Analysis

A mechanical incident in the Paris region created unexpected memory pressure on some hypervisors, leading to a domino effect. The procurement of additional hardware to restore capacity buffers is currently in the final administrative approval stages.

Actions

  • Complete the deployment of oomd, which will help handle memory pressure more gracefully

  • Add a monitoring disaster trigger to detect when a hypervisor stays under pressure for too long (see the memory pressure sketch after this list)

  • Fix the small MySQL databases that were left in an error state after the reboot during the incident

  • Add a TLS certificate check on each warp10 load balancer for c2-warp10-clevercloud-customers.services.clever-cloud.com (see the TLS probe sketch after this list)

  • Install new hardware to reduce pressure on the infrastructure

  • Configure time series systems ingress thresholds to deny traffic before instances become stuck
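
Two of these actions are about reacting earlier to sustained memory pressure. Below is a minimal sketch of what the disaster trigger could look like, assuming the hypervisors are Linux hosts exposing PSI metrics (/proc/pressure/memory, the same signal oomd acts on); the threshold, duration, and alert() stub are illustrative assumptions, not the actual monitoring stack.

```python
# Sketch of a sustained-memory-pressure watchdog (illustrative, not the real
# monitoring stack). It reads the Linux PSI "full" 60-second average and
# raises a disaster-level alert if it stays above THRESHOLD for MAX_DURATION.
import time

PRESSURE_FILE = "/proc/pressure/memory"
THRESHOLD = 40.0     # percent of time fully stalled on memory (arbitrary)
MAX_DURATION = 300   # seconds the pressure may stay high before alerting (arbitrary)


def memory_pressure_avg60() -> float:
    """Return the 60-second 'full' memory pressure average from PSI."""
    with open(PRESSURE_FILE) as f:
        for line in f:
            if line.startswith("full"):
                # Line format: full avg10=0.00 avg60=0.00 avg300=0.00 total=0
                fields = dict(part.split("=") for part in line.split()[1:])
                return float(fields["avg60"])
    raise RuntimeError("no 'full' line in " + PRESSURE_FILE)


def alert(message: str) -> None:
    # Placeholder: in practice this would page the on-call operator.
    print("DISASTER:", message)


def watch() -> None:
    above_since = None
    while True:
        pressure = memory_pressure_avg60()
        if pressure >= THRESHOLD:
            if above_since is None:
                above_since = time.monotonic()
            if time.monotonic() - above_since >= MAX_DURATION:
                alert(f"memory pressure {pressure:.1f}% sustained for {MAX_DURATION}s")
                above_since = None  # re-arm after alerting
        else:
            above_since = None
        time.sleep(10)


if __name__ == "__main__":
    watch()
```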
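
Similarly, a minimal sketch of the TLS check on the customer ingress endpoint, assuming a plain Python probe is acceptable; the hostname is the one named in the action item above, and the 14-day warning threshold is an arbitrary choice.

```python
# Sketch of a TLS expiry probe for the time series customer endpoint.
# Uses only the Python standard library; the threshold is an assumption.
import socket
import ssl
from datetime import datetime, timezone

HOSTNAME = "c2-warp10-clevercloud-customers.services.clever-cloud.com"
WARN_DAYS = 14  # arbitrary warning threshold, not from the postmortem


def days_until_expiry(host: str, port: int = 443) -> float:
    """Return the number of days before the certificate served by host expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Oct  9 12:00:00 2026 GMT"
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    not_after = not_after.replace(tzinfo=timezone.utc)
    return (not_after - datetime.now(timezone.utc)).total_seconds() / 86400


if __name__ == "__main__":
    try:
        remaining = days_until_expiry(HOSTNAME)
    except ssl.SSLCertVerificationError as exc:
        # An already-expired certificate fails the handshake itself.
        print(f"CRITICAL: TLS verification failed for {HOSTNAME}: {exc}")
    else:
        status = "WARNING" if remaining < WARN_DAYS else "OK"
        print(f"{status}: certificate for {HOSTNAME} expires in {remaining:.1f} days")
```

Since only two of the four load balancers were exposing expired certificates during the incident, such a probe would need to target each load balancer address individually rather than only the DNS entry, otherwise a round-robin answer can hide the faulty instances.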

Conclusion

During the incident, our actions demonstrated good reactivity, but also revealed some potential enhancements. The infrastructure congestion is being resolved in two stages:

  • immediate capacity addition (done)
  • acceleration of the project for two new AZs, with a shipment planned for January

We also learned about the behavior of machine health in the case of a cooling issue in a given availability zone (AZ), a situation that had never occurred before.

Database management is being enhanced with two strategic actions:

  • a volume capability that enables an arbitrary restart in any other AZ
  • automatic failover for leader/follower setups

These two items are part of the roadmap, with a first step in Q1 FY26: the ability to manipulate clusters through the API.
