Shedding old code with ecdysis: graceful restarts for Rust services at Cloudflare

Manuel Olguín Muñoz
Rust, Open Source, Infrastructure, Engineering, Edge, Developers, Developer Platform, Application Services

AI-Generated Summary: This is an automated summary created using AI. For the full details and context, please read the original post.

Cloudflare's ecdysis: A Solution for Zero-Downtime Upgrades in Rust Services

Cloudflare has open-sourced ecdysis, a Rust library that enables zero-downtime upgrades for network services handling millions of requests per second. The library has run in production at Cloudflare for five years, saving millions of requests with every restart across the company's global network. Ecdysis addresses the challenge of upgrading a service without disrupting live connections, a critical requirement for services such as traffic routing, TLS lifecycle management, and firewall rule enforcement.

The Problem with Naive Restart Approaches

The traditional way to restart a service is to stop the old process and then start a new one. Between those two steps nothing is listening on the service's port, so new connections are refused and requests are dropped. For a service handling thousands of requests per second, even a brief restart can mean hundreds of dropped connections. Stopping the old process also kills the connections it had already established, so those clients are abruptly disconnected.
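The gap described above can be reproduced with nothing but the Rust standard library. The sketch below is illustrative only and is not part of ecdysis; the function names `can_connect` and `naive_restart_gap` are hypothetical. It binds a listener on a free loopback port, confirms that a client can connect, then closes the listener (as a stopped process would) and tries again before any replacement has bound the port.

```rust
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` can be established.
fn can_connect(addr: SocketAddr) -> bool {
    TcpStream::connect_timeout(&addr, Duration::from_secs(1)).is_ok()
}

/// Simulates a stop-then-start restart and reports whether a client could
/// connect (a) while the old listener was open and (b) after it was closed
/// but before a new one was bound.
fn naive_restart_gap() -> (bool, bool) {
    // The "old" service: port 0 asks the OS for any free port.
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind failed");
    let addr = listener.local_addr().expect("no local address");

    // While the socket is listening, the kernel completes the handshake
    // even though the program has not called accept() yet.
    let while_listening = can_connect(addr);

    // The old process stops: its listening socket is closed.
    drop(listener);

    // Nothing is listening on the port now, so the attempt is refused.
    let during_gap = can_connect(addr);

    (while_listening, during_gap)
}

fn main() {
    let (before, during) = naive_restart_gap();
    println!("connect while listening: {before}");
    println!("connect during restart gap: {during}");
}
```

In a real deployment the gap lasts as long as the new process needs to start and bind the port, and every client that arrives in that interval sees the failure shown in the second check.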

How ecdysis Solves the Problem

Ecdysis restarts a service without dropping live connections or refusing new ones. Rather than stopping the old process first, it follows the graceful-upgrade model popularized by NGINX: the running process starts the new version as a child process that inherits the listening sockets, keeps serving until the child reports that it is ready, and then stops accepting new connections while it finishes the ones already in flight. Because the listening sockets stay open throughout, incoming connections wait in the kernel's queue instead of being refused. Cloudflare reports that the approach has proven reliable and efficient in production.
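A common way to restart without refusing connections is to keep one listening socket open for the whole handover and pass it from the old generation of the service to the new one. The sketch below illustrates that idea under simplifying assumptions and is not the ecdysis API: it uses only the standard library, runs in a single process, and uses `TcpListener::try_clone` (which duplicates the handle to the same kernel socket) to stand in for a new process inheriting the socket. The name `handover_demo` and the reply `v2` are hypothetical.

```rust
use std::io::{self, Read, Write};
use std::net::{TcpListener, TcpStream};

/// Hands a listening socket from an "old" handle to a "new" one and shows
/// that clients connecting before and after the old handle is closed are
/// both served by the new one. Returns the replies the two clients received.
fn handover_demo() -> io::Result<String> {
    // Old generation: owns the listening socket.
    let old = TcpListener::bind("127.0.0.1:0")?;
    let addr = old.local_addr()?;

    // New generation: a second handle to the SAME kernel socket.
    // (Assumption: this stands in for a new process inheriting the socket.)
    let new = old.try_clone()?;

    // A client arrives while both generations hold the socket.
    let mut early = TcpStream::connect(addr)?;

    // The old generation stops listening. Only its handle is closed; the
    // socket itself stays open because the new handle still refers to it.
    drop(old);

    // A client arriving after the old handle is gone is still not refused.
    let mut late = TcpStream::connect(addr)?;

    // The new generation accepts both queued connections and replies.
    for _ in 0..2 {
        let (mut conn, _peer) = new.accept()?;
        conn.write_all(b"v2")?;
        // `conn` is dropped here, which closes the connection.
    }

    // Each client reads its reply until the server closes the connection.
    let mut replies = String::new();
    early.read_to_string(&mut replies)?;
    late.read_to_string(&mut replies)?;
    Ok(replies)
}

fn main() -> io::Result<()> {
    println!("replies: {}", handover_demo()?);
    Ok(())
}
```

The design point is that closing one handle does not close the socket: as long as some handle remains open, the kernel keeps queuing incoming connections, so there is no interval in which clients are refused. A production implementation also has to let the old generation finish its in-flight connections before exiting, which this sketch leaves out.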

Practical Implications for Developers

Developers can use the open-source ecdysis library to add zero-downtime upgrades to their own Rust services, so that deploying a new version neither interrupts requests already in flight nor refuses new connections.
