A minor update resulted in a cascade of errors: how it went wrong, what we’ve learnt

2024 08 14 clever cloud banniere blog incident 2 aout 2024 en 1
On Friday, August 2nd, 2024 Clever Cloud’s platform became very unstable, leading to downtime of varying duration and scope, for customers using services on the EU-FR-1 (PAR) region, and remote zones depending on the EU-FR-1 control plane (OVHcloud, Scaleway, and Oracle). Privates and on-premise zones weren’t impacted.

Minor update, bigger consequences

The global incident started after we initiated a minor rolling maintenance upgrade of Apache Pulsar (3.3.1). It enabled a new balancer algorithm for bundles placement optimization which caused troubles. We quickly detected it, kept the 3.3.1 release and rolled back the defective configuration. 

The initial problem, although fixed, resulted in a metadata conflict. We had to stop all messaging brokers to start only one, preventing metadata conflict between them, and succeeded in getting back to a nominal situation. But after that our infrastructure experienced I/O and memory pressure, causing some hypervisors to crash.

The unavailability of the Apache Pulsar messaging layer has led to the buffering of some telemetry agents on the virtual machines (VMs) running on hypervisors (HVs). They started to buffer in memory while messaging service endpoints weren’t available.

Reach the limits of our smaller VMs 

For small VMs it reached a memory limit which triggered a lot of pressure on the kernel. When it happens, the kernel tries to flush memory and caches, remove processes from memory to load them instruction per instruction from disk. This is generating a lot of I/O disk pressure on the underlying hypervisor. A few minutes after HVs load increased, we started to see kernel panics on some of them.

Small VMs are spreaded globally among our Availability Zones, so we ended up having an overloaded server infrastructure. As we needed the scheduler to manage the situation, the control plane was performing suboptimally and the infrastructure struggled to provide available resources.

So we had to shut down all non critical services and our SRE team redeployed one by one the overloaded small VMs. Then we started to regain control over all hypervisors and were able to take our infrastructure back to its optimal state.

What’s next?

For non minor releases, we use a simulation process that emulates a full environment where we inject changes. This validates upgrades and changes by observing the infrastructure behavior before going to production. This process wasn’t used for minor maintenance upgrades or (apparently) small changes in the configuration profile.

This incident taught us a lot and we identified some progress areas regarding our images, tools configuration and integration, or small VMs setup. We’ve taken some immediate actions and will work on more deep topics in the coming weeks. 

We sincerely apologize for any inconvenience this incident may have caused. Your trust is paramount to us, and we are committed to providing you with the highest level of service reliability. 

You can learn more about the schedule and technical analysis of the incident in our public postmortem. If you have any questions or concerns, please don’t hesitate to reach out to our support team.

Blog

À lire également

UP Program: Clever Cloud announces its fifth startup selection

With this new batch, Clever Cloud welcomes four startups to the UP Program: Sentibee, Pictaderm, Legaia and Cockpit Agriculture.
Company

Sōzu 2.0 — turning a reverse proxy into a programmable edge

Sōzu is the reverse proxy that sits in front of every application running on Clever Cloud. After eighteen months of work — first the HTTP/2 multiplexer, built on our existing kawa pivot, then almost every other layer of the proxy, and finally a long run in production on the cleverapps.io load balancers — Sōzu 2.0 is out.
Engineering

K3s vs K8s: What Are the Differences and Which One Should You Choose in 2026?

Kubernetes has become the standard for container orchestration. But depending on your infrastructure constraints (limited resources, edge computing, IoT, or large-scale enterprise clusters), the distribution you choose can radically change the operational experience. K3s and K8s (upstream Kubernetes) address different needs, even though both share the same CNCF-certified foundation.
Engineering Features