2026-03-19
A memory pressure spike on ~15 Paris hypervisors, amplified by a placement algorithm threshold effect, caused cascading failures impacting hosted services, the deployment pipeline, and observability from 15:50 to 18:09 UTC.
On the overloaded servers, the OS's memory reclaim mechanism kicked in; reclaim is CPU-intensive, and the resulting CPU spike destabilized the HAProxy load balancers — causing connectivity loss for hosted services and paralyzing internal automation (deployments, monitoring). Full recovery at 21:41 UTC.
Status updates were communicated through: https://www.clevercloudstatus.com/
Timeline
| Time | Description |
|---|---|
| 2026-03-19 15:20 | First degradation signals on internal services (VM registration, deployment service database access). No client impact yet. |
| 2026-03-19 15:50 | Degradations and outages visible for some clients. Services hosted on the most impacted hypervisors become unreachable. Deployment pipeline stops. |
| 2026-03-19 16:03 | Incident response team mobilized. Mitigation actions begin: stabilizing hypervisors under pressure, progressive restart of internal clusters (messaging, observability), partial deployment resumption. |
| 2026-03-19 17:30 | Full restoration of the deployment pipeline and observability. Continued recovery of impacted client services. |
| 2026-03-19 18:09 | All impacted client services restored. Residual alert cleanup begins. |
| 2026-03-19 21:41 | Service fully operational. All residual alerts cleared. |
Analysis
A simultaneous memory consumption spike was observed on approximately fifteen host servers (hypervisors) in the Paris region. The primary suspect was the TCP load-balancing process (HAProxy), whose instrumentation at the time lacked the granularity to confirm the cause immediately.
An amplifying effect from the placement algorithm was identified: servers were already running at elevated load due to a threshold effect in the orchestration and resource placement algorithm. During rapid workload growth, the algorithm did not distribute load linearly beyond a certain threshold, causing some servers to be more heavily loaded than they should have been.
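The threshold effect described above can be illustrated with a small sketch. Both scoring functions below are hypothetical, not Clever Cloud's actual placement algorithm: a hard cutoff in the score makes all hosts below the threshold look equally attractive, so new workloads can keep landing on an already-loaded host, while a smooth penalty spreads placements out before any host nears saturation.

```python
# Hypothetical sketch of a placement threshold effect.
# Neither scoring function is Clever Cloud's real algorithm.

def stepped_score(utilization: float) -> float:
    """Hard threshold: every host below 0.8 utilization scores the same,
    so placement cannot tell a nearly-full host from an idle one."""
    return 1.0 if utilization < 0.8 else 0.0

def smooth_score(utilization: float) -> float:
    """Smooth penalty: attractiveness falls continuously with load,
    so the least-loaded host always wins."""
    return max(0.0, 1.0 - utilization) ** 2

def pick_host(hosts: dict[str, float], score) -> str:
    """Place a workload on the highest-scoring host (ties keep dict order)."""
    return max(hosts, key=lambda h: score(hosts[h]))

hosts = {"hv-a": 0.79, "hv-b": 0.40, "hv-c": 0.10}

# Stepped: hv-a, hv-b and hv-c all score 1.0, so the already-loaded
# hv-a is picked (first maximum in dict order).
print(pick_host(hosts, stepped_score))  # hv-a
# Smooth: the least-loaded host wins.
print(pick_host(hosts, smooth_score))   # hv-c
```

Under rapid workload growth, the stepped variant keeps stacking load on `hv-a` until it crosses the cutoff, producing exactly the non-linear distribution the analysis describes.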
The memory saturation triggered the OS memory reclaim mechanism, which itself is highly CPU-intensive. This sudden CPU load spike disrupted the load balancers (HAProxy) on those servers — cascading into connectivity interruptions for hosted services and temporarily paralyzing internal automation services (deployments, monitoring).
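On Linux, this kind of reclaim-induced stall can be observed through the kernel's Pressure Stall Information (PSI) interface (`/proc/pressure/memory`). The sketch below parses PSI output and flags sustained memory pressure; the threshold value is illustrative, and this is an assumption about how such detection could work, not the monitoring Clever Cloud deployed.

```python
# Minimal sketch of a memory-pressure watcher based on Linux PSI output.
# The 5.0% threshold is illustrative only.

def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse /proc/pressure/memory output, e.g.:
    some avg10=1.23 avg60=0.50 avg300=0.10 total=12345
    full avg10=0.80 avg60=0.20 avg300=0.05 total=6789
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result

def memory_pressure_alert(text: str, full_avg10_threshold: float = 5.0) -> bool:
    """True when ALL runnable tasks were stalled on memory for more than
    `full_avg10_threshold` percent of the last 10 seconds ("full" pressure)."""
    psi = parse_psi(text)
    return psi["full"]["avg10"] > full_avg10_threshold

sample = (
    "some avg10=12.50 avg60=4.10 avg300=1.00 total=123456\n"
    "full avg10=7.30 avg60=2.00 avg300=0.40 total=65432\n"
)
print(memory_pressure_alert(sample))  # True: full avg10 of 7.30% > 5.0%
```

In production such a watcher would read the text from `/proc/pressure/memory` periodically; sustained "full" pressure is a strong signal that reclaim is stealing CPU from processes like HAProxy.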
Actions
- Placement scoring function modified to smooth load ramp-up and eliminate the threshold effect
- Dedicated monitoring added to detect any recurrence of this behavior
- Temporary fixes deployed immediately during the incident to mitigate effects
- Partial algorithm redesign to guarantee smooth and optimal resource allocation under high load
- Enhanced observability for non-VM processes running on hypervisors (especially HAProxy instrumentation)
- Memory pressure resilience improvements prioritized and delivered within the following week
- Two new datacenters planned for the Paris region in Q2 2026 to increase capacity and reduce sensitivity to load spikes
Conclusion
The incident was caused by a combination of a memory spike on HAProxy processes and a placement algorithm threshold effect that left insufficient headroom on impacted hypervisors. The response team was mobilized quickly and worked continuously until full recovery. Corrective measures have been applied to the placement algorithm, and observability improvements have been deployed to accelerate future diagnosis. Additional Paris datacenter capacity is planned for Q2 2026.