Kubernetes v1.36: More Drivers, New Features, and the Next Era of DRA

Kubernetes v1.36 cements Dynamic Resource Allocation (DRA) as the default control plane for heterogeneous hardware, graduating core APIs to stable while extending support to CPUs, memory, and PodGroup-based ResourceClaims. With driver ecosystems now spanning GPUs, FPGAs, and SmartNICs, the release reduces the need for bespoke resource schedulers, letting operators define fallback policies and failure domains in declarative manifests, which is critical for scaling AI workloads across mixed-accelerator clusters.

Overview

Dynamic Resource Allocation (DRA) has fundamentally changed how platform administrators handle hardware accelerators and specialized resources in Kubernetes. In the v1.36 release, DRA continues to mature, bringing a wave of feature graduations, critical usability improvements, and new capabilities that extend the flexibility of DRA to native resources like memory and CPU, and support for ResourceClaims in PodGroups.

What's New

The community has been hard at work stabilizing core DRA concepts. In Kubernetes v1.36, several highly anticipated features have graduated to beta or stable. These include:

  1. Prioritized list (stable): allowing users to define fallback preferences when requesting devices.
  2. Extended resource support (beta): enabling users to request resources via traditional extended resources on a Pod.
  3. Partitionable devices (beta): providing native DRA support for dynamically carving physical hardware into smaller, logical instances.
  4. Device taints (beta): empowering cluster administrators to manage hardware more effectively by applying taints directly to specific DRA devices.
  5. Device binding conditions (beta): improving scheduling reliability by delaying the binding of a Pod to a Node until its required external resources are fully prepared.
  6. Resource health status (beta): exposing device health information directly in the Pod status.
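
To make the graduations above concrete, here is a minimal sketch of a ResourceClaim that combines a prioritized list with a device-taint toleration. The device class names and the taint key are hypothetical, and the `tolerations` field assumes the device-taints feature is enabled on the cluster:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: accel-claim
spec:
  devices:
    requests:
    - name: accelerator
      # Prioritized list: the scheduler tries the subrequests in order
      # and allocates the first one that can be satisfied.
      firstAvailable:
      - name: large-gpu
        deviceClassName: example.com-large-gpu  # hypothetical class
      - name: small-gpu
        deviceClassName: example.com-small-gpu  # hypothetical fallback
    - name: nic
      exactly:
        deviceClassName: example.com-smartnic   # hypothetical class
        # Device taints: tolerate a taint an administrator applied
        # to a specific device in a ResourceSlice.
        tolerations:
        - key: example.com/under-maintenance    # hypothetical key
          operator: Exists
          effect: NoSchedule
```

A Pod then references the claim through its `spec.resourceClaims` list, exactly as with any other DRA claim.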

New features in v1.36 include:

  1. ResourceClaim support for workloads: enabling Kubernetes to seamlessly manage shared resources across massive sets of Pods.
  2. Node allocatable resources: introducing the first iteration of using the DRA APIs to manage node allocatable infrastructure resources.
  3. DRA resource availability visibility: allowing users to query the availability of devices in DRA resource pools.
  4. List types for attributes: changing ResourceClaim constraint evaluation to work better with scalar and list values.
  5. Deterministic device selection: updating the Kubernetes scheduler to evaluate devices using lexicographical ordering based on resource pool and ResourceSlice names.
  6. Discoverable device metadata in containers: defining a standard protocol for how DRA drivers expose device attributes to containers.
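
For context on the constraint change above: today a ResourceClaim constraint ties multiple requests to the same scalar attribute value, and the v1.36 work extends that evaluation to list-valued attributes. A minimal sketch of the existing API follows, with hypothetical device class and attribute names:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: matched-pair
spec:
  devices:
    requests:
    - name: gpu-0
      exactly:
        deviceClassName: example.com-gpu  # hypothetical class
    - name: gpu-1
      exactly:
        deviceClassName: example.com-gpu
    constraints:
    # Both allocated devices must report the same value for this
    # attribute (hypothetical name), e.g. the same driver version.
    - requests: ["gpu-0", "gpu-1"]
      matchAttribute: example.com/driverVersion
```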

What's Next

The roadmap focuses on maturing existing features toward beta and stable releases while hardening DRA's performance, scalability, and reliability. A key priority will be deep integration with workload-aware and topology-aware scheduling. Users can get involved by joining the WG Device Management Slack channel and meetings, and by collaborating on development, sharing feedback, or building their first DRA driver.

In summary, Kubernetes v1.36 enhances Dynamic Resource Allocation, providing a more robust and hardware-agnostic infrastructure. With its new features and graduations, DRA is becoming the standard for resource allocation, allowing cluster operators to migrate clusters to DRA and letting application developers adopt the ResourceClaim API on their own schedule.

Similar Articles

Coding 1 min

Open Source Resistance: keep OSS alive on company time

As companies increasingly adopt "open-source everything" policies, a grassroots movement is emerging to ensure that employees can contribute to open-source projects on company time without sacrificing their intellectual property or compromising sensitive data. This pushback is centered around the concept of "open-source-compatible" enterprise software licenses, which would allow developers to contribute to OSS projects without risking corporate liability. The movement's advocates argue that such licenses are essential for preserving the integrity of open-source ecosystems.

Coding 2 min

The limits of Rust, or why you should probably not follow Amazon and Cloudflare

Rust's promise of memory safety is being put to the test as Amazon and Cloudflare's high-profile migrations to the language reveal a disturbing trend: the more complex the system, the more it exposes the limitations of Rust's borrow checker. Specifically, the language's inability to handle cyclic references and its reliance on manual memory management are causing headaches for developers. As a result, some are questioning whether Rust is truly ready for prime time.

Coding 1 min

The AI Backlash Could Get Ugly

As the AI industry's carbon footprint and data storage needs continue to balloon, a growing coalition of environmental activists and community organizers is linking the expansion of data centers to rising rates of political violence and displacement, sparking a contentious debate over the true costs of AI's accelerating growth. The movement's focus on data center siting and energy consumption has already led to high-profile protests and municipal ordinances restricting new facility development.

Coding 1 min

Software Developers Say AI Is Rotting Their Brains

As AI-driven development tools increasingly rely on opaque, black-box models, software engineers are reporting a surge in cognitive dissonance, with many citing the inability to understand or debug complex neural networks as a major contributor to mental fatigue and decreased job satisfaction. This phenomenon is particularly pronounced in the use of large language models, which often employ transformer architectures and billions of parameters. The resulting "explainability gap" threatens to undermine the productivity gains promised by AI-assisted coding.

Coding 2 min

My graduation cap runs Rust

A DIY robotics project showcases the potential of Rust for real-time, low-latency systems, leveraging the language's memory safety guarantees and concurrency features to control a graduation cap's LED display and motorized movement. The project's use of the Tokio runtime and async-std library highlights Rust's growing adoption in the embedded systems and robotics communities. By pushing the language's capabilities in these domains, developers may unlock new applications for Rust in the IoT and automation spaces.

Coding 1 min

When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

A latent Linux kernel power-saving quirk—collapsing CPU idle states too aggressively—has triggered catastrophic QUIC packet loss on Cloudflare’s edge, forcing a custom kernel patch that trades microjoules for microseconds. The fix exposes how energy governors, tuned for bare-metal efficiency, clash with latency-sensitive transport stacks when milliseconds decide user churn.