2025-10-09
We lost a first hypervisor due to high memory pressure while, at the same time, an incident in one of our data centers made the water coolers unavailable.
This put additional pressure on the Paris region and caused another hypervisor to enter the same state. In the meantime, we lost critical services because the infrastructure was already under heavy load.
Status updates were communicated through https://www.clevercloudstatus.com/incidents/1020
Timeline
| Time | Description |
|---|---|
| 2025-10-09 07:10 | Monitoring of time series systems fails to work properly; the distributed systems on-call operator is paged |
| 2025-10-09 07:40 | Time series systems stop receiving ingress traffic |
| 2025-10-09 07:59 | ALL time series systems load balancers are stuck due to too many connections |
| 2025-10-09 08:01 | Distributed systems operator raises the alert to Level 2 |
| 2025-10-09 08:03 | One of the hypervisors is detected as down |
| 2025-10-09 08:12 | Orchestration operations are restricted to limit a potential domino effect and reduce deployment pressure on the infrastructure |
| 2025-10-09 08:15 | Restart of time series systems ingresses / a hypervisor experiencing high load (2300); memory pressure identified |
| 2025-10-09 08:17 | Passing the faulty hypervisor to CLI for management |
| 2025-10-09 08:18 | Metrics are getting ingested by time series systems again (1/5 of nominal traffic) |
| 2025-10-09 08:23 | Datapoint sampling (blackhole) is enabled on time series systems load balancers for 50% of the traffic |
| 2025-10-09 08:35 | Root cause identified: RAM capacity issue causing load spike on hypervisors |
| 2025-10-09 08:37 | Faulty hypervisor is rebooted successfully (up since 2025-10-09 06:37:06 UTC) |
| 2025-10-09 08:40 | Domino effect detected: another hypervisor is showing signs of stress |
| 2025-10-09 08:43 | Multiple databases are impacted (down); the full Database Team is escalated |
| 2025-10-09 08:46 | Time series systems is at 75% of nominal ingress traffic |
| 2025-10-09 09:00 | Time series systems is fully operational & back to nominal traffic |
| 2025-10-09 09:09 | The Mongo addon provider is back online |
| 2025-10-09 09:10 | On the first faulty hypervisor, all databases are up again |
| 2025-10-09 09:14 | On the second faulty hypervisor, all databases are up again |
| 2025-10-09 09:14 | The DBaaS team is now fixing random databases that went down in the par8 datacenter (mainly small encrypted plans). Reboots are failing due to a local Out Of Memory (OOM) condition |
| 2025-10-09 09:18 | One of the four time series systems load balancers is not rebooting due to a missing base image |
| 2025-10-09 09:25 | Some databases fail to reboot (OOM before start). We restarted them in rescue mode and deactivated the logs to allow the reboot |
| 2025-10-09 09:46 | All four time series systems load balancers are up again |
| 2025-10-09 09:55 | The databases rebooted without logs were correctly migrated; logs are now collected again |
| 2025-10-09 09:56 | An automated system reported an expired certificate on the time series systems customer ingress endpoint; half of the traffic fails to reach time series systems |
| 2025-10-09 10:30 | Two of the four time series systems load balancers are exposing expired certificates |
| 2025-10-09 10:39 | The time series systems expired certificate issue is fixed; traffic is back to nominal |
| 2025-10-09 10:43 | All databases impacted by the incident are up and running |
| 2025-10-09 10:46 | Mass restart of supernova/supernova-http on all hypervisors via Ansible; RabbitMQ connection flapping detected |
| 2025-10-09 10:52 | Some databases were going down again; they are now fixed and migrated to resolve the issue |
| 2025-10-09 11:10 | Restart of proxy-manager / hayxorp / rabbitfrs across the infrastructure; verification that all services reconnected to RabbitMQ successfully |
Analysis
A mechanical incident in the Paris region created unexpected memory pressure on some hypervisors, leading to a domino effect. The procurement of additional hardware to restore capacity buffers is currently in the final administrative approval stages.
Actions
- Complete the deployment of oomd, which will help handle memory pressure better
- Add a monitoring disaster trigger to detect when a hypervisor stays under pressure for too long (see the first sketch after this list)
- Fix the small MySQL databases that were in error after the reboot during the incident
- Add a TLS check on each warp10 load balancer for c2-warp10-clevercloud-customers.services.clever-cloud.com (see the second sketch after this list)
- Install new hardware to reduce pressure on the infrastructure
- Configure time series systems ingress thresholds to deny traffic before instances become stuck
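As a reference for the monitoring disaster trigger mentioned above, here is a minimal sketch of the detection logic, assuming the hypervisors expose the Linux PSI interface at /proc/pressure/memory. The threshold, window, polling interval and the notify_oncall() hook are illustrative placeholders, not our actual tooling.

```python
#!/usr/bin/env python3
"""Illustrative sketch: page when a hypervisor stays under memory pressure too long."""
import time

PRESSURE_FILE = "/proc/pressure/memory"
AVG60_THRESHOLD = 20.0   # percent of time stalled on memory (illustrative)
SUSTAINED_SECONDS = 300  # how long pressure must persist before alerting
POLL_INTERVAL = 15       # seconds between samples


def read_memory_pressure_avg60() -> float:
    """Return the 'some' avg60 value from /proc/pressure/memory.

    The file looks like:
      some avg10=1.23 avg60=0.87 avg300=0.50 total=123456
      full avg10=0.10 avg60=0.05 avg300=0.02 total=6543
    """
    with open(PRESSURE_FILE) as f:
        for line in f:
            if line.startswith("some"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg60"])
    raise RuntimeError("unexpected PSI file format")


def notify_oncall(message: str) -> None:
    # Placeholder: wire this to the real paging system.
    print(f"ALERT: {message}")


def main() -> None:
    pressure_since = None
    while True:
        avg60 = read_memory_pressure_avg60()
        if avg60 >= AVG60_THRESHOLD:
            pressure_since = pressure_since or time.monotonic()
            if time.monotonic() - pressure_since >= SUSTAINED_SECONDS:
                notify_oncall(
                    f"memory pressure avg60={avg60:.1f}% sustained for "
                    f"{SUSTAINED_SECONDS}s on this hypervisor"
                )
                pressure_since = time.monotonic()  # rearm instead of spamming
        else:
            pressure_since = None
        time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    main()
```

Sampling avg60 rather than an instantaneous value avoids paging on short spikes while still catching the kind of sustained pressure seen during this incident.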
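For the TLS check action, a minimal sketch using only the Python standard library could look like the following. The hostname is the endpoint named above; the warning window and exit-code convention are assumptions, and in practice the check would target each warp10 load balancer individually rather than going through DNS.

```python
#!/usr/bin/env python3
"""Illustrative TLS expiry check for the time series customer ingress endpoint."""
import socket
import ssl
import sys
import time

HOSTNAME = "c2-warp10-clevercloud-customers.services.clever-cloud.com"
PORT = 443
WARN_BEFORE_DAYS = 14  # illustrative warning window


def check_certificate(hostname: str, port: int) -> int:
    """Return 0 if the certificate is valid and not close to expiry, 1 otherwise."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError as exc:
        # An already-expired (or otherwise invalid) certificate lands here.
        print(f"CRITICAL: certificate verification failed for {hostname}: {exc}")
        return 1

    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    days_left = (not_after - time.time()) / 86400
    if days_left < WARN_BEFORE_DAYS:
        print(f"WARNING: certificate for {hostname} expires in {days_left:.1f} days")
        return 1
    print(f"OK: certificate for {hostname} valid for {days_left:.1f} more days")
    return 0


if __name__ == "__main__":
    sys.exit(check_certificate(HOSTNAME, PORT))
```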
Conclusion
During the incident, our actions demonstrated good reactivity, but also revealed some potential enhancements. The infrastructure congestion has already been addressed in two stages:
- immediate capacity addition (done)
- acceleration of the two new AZs project, with a shipment planned for January
We also learned about the behavior of machine health in the event of a cooling issue in a given availability zone (AZ), something that had never occurred before.
Database management is being enhanced with two strategic actions:
- a volume capability that enables arbitrary restarts on any other AZ
- automatic failover for leader/follower setups
These two items are part of the roadmap, with a first step in Q1 FY26: the ability to manipulate clusters through the API.