2025-10-09

We lost a first hypervisor due to high memory pressure while, at the same time, an incident in one of our data centers made the water coolers unavailable.

This put additional pressure on the Paris region, which caused another hypervisor to enter the same state. We lost critical services in the meantime because the infrastructure was already under heavy load.

Status updates were communicated through https://www.clevercloudstatus.com/incidents/1020

Timeline

Time             | Description
-----------------|------------------------------------------------------------
2025-10-09 07:10 | Monitoring of the time series systems stops working properly; the distributed systems on-call operator is paged
2025-10-09 07:40 | The time series systems fail to receive ingress traffic
2025-10-09 07:59 | All time series systems load balancers are stuck due to too many connections
2025-10-09 08:01 | The distributed systems operator raises the alert to Level 2
2025-10-09 08:03 | One of the hypervisors is detected as down
2025-10-09 08:12 | Orchestration operations are restricted to contain a potential domino effect and reduce deployment pressure on the infrastructure
2025-10-09 08:15 | Restart of the time series systems ingresses; a hypervisor is experiencing high load (2300) and memory pressure is identified
2025-10-09 08:17 | The faulty hypervisor is passed to CLI for management
2025-10-09 08:18 | Metrics are being ingested by the time series systems again (1/5 of nominal traffic)
2025-10-09 08:23 | Time series systems load balancer datapoint sampling (blackhole) is enabled on 50% of traffic
2025-10-09 08:35 | Root cause identified: a RAM capacity issue causing a load spike on hypervisors
2025-10-09 08:37 | The faulty hypervisor is rebooted successfully (uptime: 2025-10-09 06:37:06 UTC)
2025-10-09 08:40 | Domino effect detected: another hypervisor is showing signs of stress
2025-10-09 08:43 | Multiple databases are impacted (down); the full Database Team is escalated
2025-10-09 08:46 | The time series systems are at 75% of nominal ingress traffic
2025-10-09 09:00 | The time series systems are fully operational and back to nominal traffic
2025-10-09 09:09 | The Mongo add-on provider is back online
2025-10-09 09:10 | On the first faulty hypervisor, all databases are up again
2025-10-09 09:14 | On the second faulty hypervisor, all databases are up again
2025-10-09 09:14 | The DBaaS team is now fixing individual databases that went down in the par8 datacenter (mainly small encrypted plans); reboots are failing due to a local Out Of Memory (OOM) condition
2025-10-09 09:18 | One of the four time series systems load balancers is not rebooting due to a missing base image
2025-10-09 09:25 | Some databases fail to reboot (OOM before start); they are restarted in rescue mode with logs deactivated to allow the reboot
2025-10-09 09:46 | All four time series systems load balancers are up again
2025-10-09 09:55 | The databases rebooted without logs are correctly migrated; logs are collected again
2025-10-09 09:56 | An automated system reports an expired certificate on the time series systems customer ingress endpoint; half of the traffic fails to reach the time series systems
2025-10-09 10:30 | Two of the four time series systems load balancers are exposing expired certificates
2025-10-09 10:39 | The time series systems expired certificate issue is fixed; back to nominal
2025-10-09 10:43 | All databases impacted by the incident are up and running
2025-10-09 10:46 | Mass restart of supernova/supernova-http on all hypervisors via Ansible; RabbitMQ connection flapping detected
2025-10-09 11:10 | Restart of proxy-manager / hayxorp / rabbitfrs across the infrastructure; verification that all services reconnected to RabbitMQ successfully

Analysis

A mechanical incident in the Paris region created unexpected memory pressure on some hypervisors, leading to a domino effect. The procurement of additional hardware to restore capacity buffers is currently in the final administrative approval stages.

Actions

  • Complete the deployment of oomd, which will help handle memory pressure more gracefully

  • Add a monitoring disaster trigger to detect when a hypervisor stays under pressure for too long (see the memory pressure sketch after this list)

  • Fix the small MySQL databases that were left in an error state after the reboot during the incident

  • Add a TLS certificate check on each warp10 load balancer for c2-warp10-clevercloud-customers.services.clever-cloud.com (see the TLS probe sketch after this list)

  • Install new hardware to reduce pressure on the infrastructure

  • Configure time series systems ingress thresholds to deny traffic before instances become stuck
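
Two of these actions are about reacting earlier to sustained memory pressure. Below is a minimal sketch of what the disaster trigger could look like, assuming the hypervisors are Linux hosts exposing PSI metrics (/proc/pressure/memory, the same signal oomd acts on); the threshold, duration, and alert() stub are illustrative assumptions, not the actual monitoring stack.

```python
# Sketch of a sustained-memory-pressure watchdog (illustrative, not the real
# monitoring stack). It reads the Linux PSI "full" 60-second average and
# raises a disaster-level alert if it stays above THRESHOLD for MAX_DURATION.
import time

PRESSURE_FILE = "/proc/pressure/memory"
THRESHOLD = 40.0     # percent of time fully stalled on memory (arbitrary)
MAX_DURATION = 300   # seconds the pressure may stay high before alerting (arbitrary)


def memory_pressure_avg60() -> float:
    """Return the 60-second 'full' memory pressure average from PSI."""
    with open(PRESSURE_FILE) as f:
        for line in f:
            if line.startswith("full"):
                # Line format: full avg10=0.00 avg60=0.00 avg300=0.00 total=0
                fields = dict(part.split("=") for part in line.split()[1:])
                return float(fields["avg60"])
    raise RuntimeError("no 'full' line in " + PRESSURE_FILE)


def alert(message: str) -> None:
    # Placeholder: in practice this would page the on-call operator.
    print("DISASTER:", message)


def watch() -> None:
    above_since = None
    while True:
        pressure = memory_pressure_avg60()
        if pressure >= THRESHOLD:
            if above_since is None:
                above_since = time.monotonic()
            if time.monotonic() - above_since >= MAX_DURATION:
                alert(f"memory pressure {pressure:.1f}% sustained for {MAX_DURATION}s")
                above_since = None  # re-arm after alerting
        else:
            above_since = None
        time.sleep(10)


if __name__ == "__main__":
    watch()
```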
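
Similarly, a minimal sketch of the TLS check on the customer ingress endpoint, assuming a plain Python probe is acceptable; the hostname is the one named in the action item above, and the 14-day warning threshold is an arbitrary choice.

```python
# Sketch of a TLS expiry probe for the time series customer endpoint.
# Uses only the Python standard library; the threshold is an assumption.
import socket
import ssl
from datetime import datetime, timezone

HOSTNAME = "c2-warp10-clevercloud-customers.services.clever-cloud.com"
WARN_DAYS = 14  # arbitrary warning threshold, not from the postmortem


def days_until_expiry(host: str, port: int = 443) -> float:
    """Return the number of days before the certificate served by host expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Oct  9 12:00:00 2026 GMT"
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    not_after = not_after.replace(tzinfo=timezone.utc)
    return (not_after - datetime.now(timezone.utc)).total_seconds() / 86400


if __name__ == "__main__":
    try:
        remaining = days_until_expiry(HOSTNAME)
    except ssl.SSLCertVerificationError as exc:
        # An already-expired certificate fails the handshake itself.
        print(f"CRITICAL: TLS verification failed for {HOSTNAME}: {exc}")
    else:
        status = "WARNING" if remaining < WARN_DAYS else "OK"
        print(f"{status}: certificate for {HOSTNAME} expires in {remaining:.1f} days")
```

Since only two of the four load balancers were exposing expired certificates during the incident, such a probe would need to target each load balancer address individually rather than only the DNS entry, otherwise a round-robin answer can hide the faulty instances.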

Conclusion

During the incident, our actions demonstrated good reactivity, but also revealed some potential enhancements. The infrastructure congestion is being resolved in two stages:

  • immediate capacity addition (done)
  • acceleration of the project for two new AZs, with a shipment planned for January

We also learned about the behavior of machine health in the case of a cooling issue in a given availability zone (AZ), a situation that had never occurred before.

Database management is being enhanced with two strategic actions:

  • a volume capability that enables an arbitrary restart in any other AZ
  • automatic failover for leader/follower setups

These two items are part of the roadmap, with a first step in Q1 FY26: the ability to manipulate clusters through the API.
