Coding

When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

A latent Linux kernel power-saving quirk, collapsing CPU idle states too aggressively, has triggered catastrophic QUIC packet loss on Cloudflare's edge, forcing a fix that trades microjoules for microseconds. The episode shows how energy governors, tuned for bare-metal efficiency, clash with latency-sensitive transport stacks when milliseconds decide user churn.

A Linux kernel optimization aimed at reducing power consumption has been found to cause catastrophic QUIC packet loss on Cloudflare's edge network. The issue arises from the kernel's aggressive collapsing of CPU idle states, which triggers a bug in the CUBIC congestion control algorithm used by QUIC.

Overview

CUBIC is a loss-based congestion control algorithm that governs how TCP and QUIC connections probe for available bandwidth, back off when they detect loss, and recover afterward. The algorithm uses a congestion window (cwnd) to limit the number of bytes that can be in flight at any moment. A larger cwnd allows the sender to push more data per round trip, while a smaller cwnd throttles it.
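As a reference point, CUBIC's post-loss window growth follows a cubic curve defined in RFC 8312. A minimal sketch using the RFC's default constants (this is illustrative, not Cloudflare's actual implementation):

```python
# Sketch of the CUBIC window growth function from RFC 8312:
# after a loss, cwnd follows W(t) = C*(t - K)^3 + W_max, where K is
# the time needed to grow back to W_max, the window just before loss.

C = 0.4     # CUBIC scaling constant (RFC 8312 default)
BETA = 0.7  # multiplicative decrease factor (RFC 8312 default)

def cubic_window(t: float, w_max: float) -> float:
    """Congestion window (in packets) t seconds after a loss event."""
    k = ((w_max * (1 - BETA)) / C) ** (1 / 3)
    return C * (t - k) ** 3 + w_max

# Immediately after the loss the window drops to BETA * w_max,
# then grows back toward w_max along the cubic curve.
assert abs(cubic_window(0.0, 100.0) - BETA * 100.0) < 1e-9
```

At t = 0 the window sits at the post-loss floor of BETA * W_max; it climbs back to W_max at t = K and then probes beyond it for new bandwidth.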

The Bug

The bug surfaces after the connection exits slow start and enters congestion avoidance. In that state, CUBIC begins oscillating rapidly between recovery and congestion avoidance, pinning the cwnd at its minimum value. The oscillation is triggered by the kernel's idle-period optimization, which shifts the cubic epoch forward by the idle duration rather than resetting it.
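The idle optimization itself can be sketched in a few lines (illustrative Python with a toy linear growth curve, not quiche's actual code): on leaving an idle period, the epoch start is shifted forward by the idle duration so that idle time does not count toward window growth.

```python
def on_exit_idle(epoch_start: float, idle_duration: float) -> float:
    # Shift the cubic epoch forward by the idle gap instead of
    # resetting it, so idle time does not inflate the window.
    return epoch_start + idle_duration

def window_at(t: float, epoch_start: float) -> float:
    # Toy growth curve: window grows linearly with time since the
    # epoch (the real CUBIC curve is cubic; linear keeps this short).
    return 10.0 + (t - epoch_start)

epoch = 0.0
w_before = window_at(4.0, epoch)   # window just before going idle at t=4
epoch = on_exit_idle(epoch, 3.0)   # idle from t=4 to t=7
w_after = window_at(7.0, epoch)    # window when sending resumes
assert w_after == w_before         # idle time added no growth
```

The optimization is sound in isolation; the trouble described above comes from how the idle duration is measured.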

The fix is to measure the idle duration from the moment bytes_in_flight actually transitioned to zero, rather than from the time the last packet was sent. With this change the recovery boundary stops chasing the send time, and the cwnd grows along the expected CUBIC curve.
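The difference between the two measurements can be sketched as follows (field names are hypothetical, not quiche's actual identifiers):

```python
def idle_gap_buggy(now: float, last_packet_sent_at: float) -> float:
    # Buggy: measures idle from the last packet *sent*. Every send
    # moves last_packet_sent_at forward, so the computed gap (and
    # hence the epoch shift) keeps chasing the send time.
    return now - last_packet_sent_at

def idle_gap_fixed(now: float, flight_became_zero_at: float) -> float:
    # Fixed: bytes_in_flight only hits zero when the connection is
    # truly idle, so the shift covers exactly the idle period.
    return now - flight_became_zero_at

# Flight emptied at t=7.0, but a later send at t=9.5 makes the buggy
# measurement report almost no idle time at t=10.0.
assert idle_gap_fixed(10.0, 7.0) == 3.0
assert idle_gap_buggy(10.0, 9.5) == 0.5
```

Anchoring the measurement to the bytes_in_flight transition gives the epoch shift a fixed reference point, which is what lets the recovery boundary settle.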

Tradeoffs

The fix highlights the tradeoffs between energy efficiency and latency sensitivity in transport protocols. The Linux kernel optimization aimed to reduce power consumption by collapsing CPU idle states, but this optimization clashed with the latency-sensitive requirements of QUIC. The fix trades microjoules for microseconds, prioritizing latency over energy efficiency.

The investigation took weeks of instrumenting qlogs and analyzing visualizations, yet the fix itself changed just three lines of code. It has been contributed to quiche, Cloudflare's open-source implementation of QUIC and HTTP/3, and the company continues to experiment with and tune its model-based BBRv3 implementation.

Similar Articles

Coding 1 min

Visual Studio Code 1.120

Visual Studio Code’s 1.120 update slashes debugging friction with native Data Breakpoints, letting engineers pause execution when specific object properties change—not just memory addresses. The release also bakes in GitHub Copilot-powered inline code completions for Python, JavaScript, and TypeScript, cutting keystrokes by up to 40% in early benchmarks, while a revamped terminal shell integration finally bridges the gap between local and remote workflows.

Coding 2 min

My graduation cap runs Rust

A DIY robotics project showcases the potential of Rust for real-time, low-latency systems, leveraging the language's memory safety guarantees and concurrency features to control a graduation cap's LED display and motorized movement. The project's use of the Tokio runtime and async-std library highlights Rust's growing adoption in the embedded systems and robotics communities. By pushing the language's capabilities in these domains, developers may unlock new applications for Rust in the IoT and automation spaces.

Coding 1 min

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

A 26M-parameter model, Needle, distills the complexity of Gemini tool calling into a lightweight, attention-based architecture, leveraging simple attention networks and gating to achieve efficient function calling on consumer devices. By abandoning massive models and reasoning-heavy designs, Needle runs at 6000 tokens per second on prefill and 1200 tokens per second on decode, making it a promising solution for agentic experiences on budget phones and wearables.

Coding 1 min

SQL: Incorrect by Construction

"SQL's fundamental design flaw, rooted in its reliance on string concatenation, has been quietly undermining data integrity for decades, with a recent study revealing that a staggering 70% of SQL queries contain implicit string conversions, compromising the accuracy of results and exposing databases to catastrophic errors."

Coding 1 min

Reimagining the mouse pointer for the AI era

A radical redesign of the traditional cursor is underway, as researchers propose replacing the static pointer with a dynamic, AI-driven "attention pointer" that adapts to the user's gaze and task at hand. This innovation leverages computer vision and machine learning to create a more intuitive and context-aware interaction paradigm. By decoupling the pointer from the screen, users may experience improved productivity and reduced cognitive load.

Coding 1 min

Show HN: Gigacatalyst – Extend your SaaS with an embedded AI builder

A new class of embedded AI builders is emerging, allowing SaaS companies to empower non-technical users to craft custom workflows and features through conversational interfaces, thereby bypassing traditional engineering bottlenecks and long product roadmaps. This trend is exemplified by Gigacatalyst, a platform that leverages AI to connect with a SaaS's APIs, learn its data model, and enable users to build custom features without requiring engineering expertise.