
When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

Esteban Carisimo, Antonio Vicente
Tags: Congestion Control, Debugging, QUIC, QUICHE, Networking, HTTP3, Rust

AI-Generated Summary: This is an automated summary created using AI. For the full details and context, please read the original post.

Linux Kernel Optimization Becomes a QUIC Bug

An investigation by Cloudflare engineers uncovered a bug in the CUBIC congestion controller. CUBIC is the default congestion controller in Linux and governs how most TCP and QUIC connections probe for available bandwidth. The bug pins the congestion window (cwnd) at its minimum after a congestion collapse event, so the connection never recovers. It appeared after a Linux kernel change, meant to bring CUBIC into line with the app-limited exclusion described in RFC 9438, was ported to quiche, Cloudflare's open-source QUIC implementation.
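For context on what "recovering" means here, RFC 9438 defines the window curve CUBIC follows during congestion avoidance: W_cubic(t) = C * (t - K)^3 + W_max, with K = cbrt((W_max - cwnd_epoch) / C). The sketch below is a minimal illustration of that curve only. The function names are hypothetical, windows are in segments, time is in seconds, and none of this is quiche's or Linux's actual code. The summary does not describe the exact mechanism of the bug, so the sketch shows only the growth a healthy controller would be expected to follow as t advances.

```rust
/// Constants recommended by RFC 9438.
const C: f64 = 0.4;
const BETA_CUBIC: f64 = 0.7;

/// Window (in segments) at which a new epoch starts after a loss,
/// given the window `w_max` reached just before the loss.
fn cwnd_after_loss(w_max: f64) -> f64 {
    w_max * BETA_CUBIC
}

/// K: seconds the curve needs to climb from `cwnd_epoch` back to `w_max`.
fn cubic_k(w_max: f64, cwnd_epoch: f64) -> f64 {
    ((w_max - cwnd_epoch) / C).cbrt()
}

/// W_cubic(t) = C * (t - K)^3 + W_max, where `t` is the time in seconds
/// since the current congestion avoidance epoch began.
fn w_cubic(t: f64, w_max: f64, cwnd_epoch: f64) -> f64 {
    let k = cubic_k(w_max, cwnd_epoch);
    C * (t - k).powi(3) + w_max
}
```

With W_max = 100 segments, for example, the epoch restarts at 70 segments and the curve returns to 100 segments after K = cbrt(30 / 0.4), roughly 4.22 seconds. Because the curve depends entirely on t, how a controller measures elapsed time across idle or app-limited periods directly affects whether the window grows.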

Symptoms and Investigation

The bug first surfaced as an unexpected failure in Cloudflare's ingress proxy integration test pipeline. The failing test simulates heavy loss early in the connection, a regime that congestion controllers rarely encounter. The investigation showed that CUBIC could not recover from the resulting congestion collapse, and the test failed 61% of the time.

Key Findings

  • After a congestion collapse event, the congestion window (cwnd) stays pinned at its minimum and never recovers.
  • The bug appeared when a Linux kernel change, which aligns CUBIC with the app-limited exclusion in RFC 9438, was ported to quiche.
  • The first symptom was an unexpected failure in Cloudflare's ingress proxy integration test pipeline, with a 61% failure rate.
  • The failing test exercises heavy loss early in the connection, an uncommon regime for congestion controllers.

Practical Implications

This bug matters to anyone who uses CUBIC congestion control in a QUIC implementation. It shows why congestion controllers need to be tested in uncommon regimes such as congestion collapse, and why Linux kernel changes should be reviewed for their effect on QUIC before they are ported.
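One way to picture such a test is a small harness that forces the window to its floor with repeated losses and then checks that clean ACKs lift it again. The following is a rough sketch under stated assumptions: the `CongestionController` trait, `ToyController`, the `MSS` and `MIN_CWND` values, and `recovers_after_early_loss` are all hypothetical names made up for illustration. They do not reflect quiche's API or Cloudflare's actual test.

```rust
/// Hypothetical minimal interface, for illustration only (not quiche's API).
trait CongestionController {
    fn on_ack(&mut self, acked_bytes: usize);
    fn on_loss(&mut self);
    fn cwnd(&self) -> usize;
}

/// Toy values chosen for the example.
const MSS: usize = 1200;
const MIN_CWND: usize = 2 * MSS;

/// Toy controller that exists only to exercise the harness.
struct ToyController {
    cwnd: usize,
    /// When true, ACKs never grow the window, imitating a pinned cwnd.
    pinned: bool,
}

impl CongestionController for ToyController {
    fn on_ack(&mut self, acked_bytes: usize) {
        if !self.pinned {
            // Additive increase: about one MSS per window of acked data.
            self.cwnd += MSS * acked_bytes / self.cwnd;
        }
    }

    fn on_loss(&mut self) {
        // Halve the window, but never drop below the floor.
        self.cwnd = (self.cwnd / 2).max(MIN_CWND);
    }

    fn cwnd(&self) -> usize {
        self.cwnd
    }
}

/// Applies `losses` loss events, then `clean_acks` ACKs of one MSS each.
/// Returns true if the window ends above where the losses left it.
fn recovers_after_early_loss<T: CongestionController>(
    cc: &mut T,
    losses: usize,
    clean_acks: usize,
) -> bool {
    for _ in 0..losses {
        cc.on_loss();
    }
    let floor = cc.cwnd();
    for _ in 0..clean_acks {
        cc.on_ack(MSS);
    }
    cc.cwnd() > floor
}
```

A controller with `pinned: false` passes this check, while one with `pinned: true` stays at `MIN_CWND` and fails it, which is the kind of signal a regression test for this regime would look for.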
