PyTorch 2.12 is now available, bringing up to 100x speedups for batched eigenvalue decomposition on CUDA, a new device-agnostic graph capture API, and support for exporting models that use Microscaling (MX) quantization formats. The release includes 2,926 commits from 457 contributors since version 2.11.
Performance improvements
The headline performance change is an overhaul of the backend selection for linalg.eigh on CUDA. The legacy MAGMA backend has been deprecated in favor of cuSolver, and the dispatch heuristics now use syevj_batched unconditionally. For batched symmetric/Hermitian eigenvalue problems, this yields up to 100x speedups over the previous release. Workloads that previously took minutes now run in seconds by processing many small or medium matrices as a single GPU operation.
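A minimal way to hit the batched path; the shapes and timing harness here are illustrative:

```python
import torch

# Build a batch of small symmetric matrices; a single eigh call dispatches
# to the batched cuSolver path rather than looping in Python.
batch, n = 4096, 32
A = torch.randn(batch, n, n, device="cuda")
A = A + A.mT  # symmetrize

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
eigvals, eigvecs = torch.linalg.eigh(A)  # one batched GPU operation
end.record()
torch.cuda.synchronize()
print(f"batched eigh over {batch} {n}x{n} matrices: {start.elapsed_time(end):.2f} ms")
```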
The Adagrad optimizer now supports fused=True, performing the entire optimizer step in a single CUDA kernel. Adagrad joins Adam, AdamW, and SGD in offering a fused variant.
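Usage is the same one-flag change as the other fused optimizers; as with those, parameters must live on CUDA:

```python
import torch

model = torch.nn.Linear(512, 512, device="cuda")
# fused=True executes the whole Adagrad update in a single CUDA kernel
opt = torch.optim.Adagrad(model.parameters(), lr=0.01, fused=True)

loss = model(torch.randn(64, 512, device="cuda")).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```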
New APIs and export capabilities
torch.accelerator.Graph is a new device-agnostic API for graph capture and replay, providing a unified abstraction over backend-specific implementations such as torch.xpu.XPUGraph. Each backend can register its own implementation through a lightweight GraphImplInterface. Alongside this, c10::Stream and torch.Stream now expose an is_capturing() method, replacing device-specific alternatives.
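The release notes don't spell out the full interface, so here is only a sketch of the intended shape, assuming capture/replay mirrors torch.cuda.CUDAGraph; the torch.accelerator.graph context manager and method names below are assumptions:

```python
import torch

dev = torch.accelerator.current_accelerator()  # cuda, xpu, ...
model = torch.nn.Linear(512, 512, device=dev)
static_in = torch.randn(64, 512, device=dev)

g = torch.accelerator.Graph()                  # backend-agnostic graph object
with torch.accelerator.graph(g):               # hypothetical capture context
    static_out = model(static_in)

s = torch.Stream(device=dev)
print(s.is_capturing())                        # new query on torch.Stream

static_in.copy_(torch.randn_like(static_in))   # update inputs in place
g.replay()                                     # replay the captured work
```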
torch.export.save and torch.export.load now correctly serialize and deserialize tensors with the float8_e8m0fnu dtype used as the shared block-scale exponent in MX formats (MXFP4, MXFP6, MXFP8). This unblocks the full export-to-deployment workflow for models using Microscaling quantization, which is relevant for teams deploying large language models to cost-constrained or edge environments.
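A round-trip sketch; the block-scale layout and dequant math here are illustrative, not the MX spec:

```python
import torch

# A module holding an MX-style block-scale tensor stored as float8_e8m0fnu.
class MXScaled(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # e8m0 values are biased exponents; view a uint8 payload as the dtype
        raw = torch.randint(120, 130, (4,), dtype=torch.uint8)
        self.register_buffer("scale", raw.view(torch.float8_e8m0fnu))

    def forward(self, x):
        # apply the power-of-two scales; cast shown for illustration only
        return x * self.scale.to(torch.float32)

ep = torch.export.export(MXScaled(), (torch.randn(4, 4),))
torch.export.save(ep, "mx_scaled.pt2")   # now serializes float8_e8m0fnu correctly
ep2 = torch.export.load("mx_scaled.pt2")
print(ep2.module()(torch.randn(4, 4)).shape)
```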
Control-flow regions using torch.cond can now be captured and replayed as part of CUDA Graphs. By leveraging CUDA 12.4's conditional IF nodes, branches are evaluated entirely on the GPU within a single graph capture. This currently works with the eager and cudagraphs backends; Inductor support is planned for a future release.
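A minimal example of a data-dependent branch compiled with the cudagraphs backend; the predicate stays on the GPU:

```python
import torch

def f(pred, x):
    # pred is a boolean CUDA tensor; both branches must return matching shapes
    return torch.cond(pred, lambda x: x.sin(), lambda x: x.cos(), (x,))

compiled = torch.compile(f, backend="cudagraphs")

x = torch.randn(1024, device="cuda")
pred = x.sum() > 0          # 0-dim bool tensor, evaluated on-device
print(compiled(pred, x).shape)
```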
Distributed and profiling updates
Custom operators can now accept ProcessGroup objects directly as arguments. All c10d functional collective ops have been updated to accept both ProcessGroup objects and string names.
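For example, the functional all_reduce now takes either form. A single-process gloo setup is shown for brevity; the group_name attribute used below is an assumption:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed._functional_collectives import all_reduce

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
pg = dist.group.WORLD

out_by_object = all_reduce(t, "sum", pg)           # ProcessGroup object
out_by_name = all_reduce(t, "sum", pg.group_name)  # string name (assumed attribute)

dist.destroy_process_group()
```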
The PyTorch Profiler Events API now exposes flow IDs, flow types, activity types, unfinished events, and Python function events. NCCL collective traces can be correlated across ranks using a new seq_num field.
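Reading the new fields off a trace might look like the following; the exact attribute spellings aren't given in the notes, so they are hedged with getattr:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 128)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(torch.randn(32, 128))

for evt in prof.events():
    # flow_id / activity_type are new per the 2.12 notes; names are assumptions
    print(evt.name, getattr(evt, "flow_id", None), getattr(evt, "activity_type", None))
```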
FlightRecorder's trace analyzer now supports ncclx and gloo backends alongside existing nccl and xccl backends, and recognizes torchcomms operations.
Platform-specific updates
- CUDA: torch.cuda.graph now accepts an enable_annotations kwarg that injects annotation metadata into individual kernels (see the sketch after this list). CUDA Green Contexts support specifying a workqueue limit.
- ROCm: AMD GPUs (ROCm >= 7.02) now support expandable memory segments. rocSHMEM support enables symmetric memory collective operations. hipSPARSELt is enabled by default on ROCm >= 7.12, bringing semi-structured (2:4) sparsity support. FlexAttention on AMD GPUs uses two-stage pipelining, delivering 5-26% speedups on MI350X.
- Apple MPS: Apple Silicon binary wheels now ship with ahead-of-time-compiled Metal 4 shaders, eliminating runtime shader compilation overhead on first run.
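The annotation kwarg slots into the existing torch.cuda.graph capture pattern; everything besides enable_annotations below is standard CUDA Graphs usage:

```python
import torch

g = torch.cuda.CUDAGraph()
static_in = torch.randn(256, device="cuda")

# enable_annotations (new in 2.12) tags the captured kernels with
# annotation metadata; the rest is the usual capture/replay pattern
with torch.cuda.graph(g, enable_annotations=True):
    static_out = static_in * 2.0

static_in.copy_(torch.randn(256, device="cuda"))
g.replay()
print(static_out.sum())
```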
Deprecations and breaking changes
TorchScript is now deprecated. torch.export should replace the torch.jit.trace and torch.jit.script APIs; ExecuTorch should replace the embedded TorchScript runtime.
The CUDA 12.8 binary wheel is deprecated and will no longer be published as part of the standard release matrix. The default wheel remains CUDA 13.0. CUDA 13.2 has been added as an experimental build. Users on older architectures (Pascal, Volta) should use the CUDA 12.6 wheel. Users on newer GPUs (Blackwell) should use CUDA 13.0+ wheels, which require an NVIDIA driver upgrade to 580.65.06 (Linux) or 580.88 (Windows).
Planned breaking changes for torchcomms in PyTorch 2.13+ include eager initialization of ProcessGroup, changes to P2P operations, and making torchcomms a required package for PyTorch Distributed.
Bottom line
PyTorch 2.12 is a significant performance and platform compatibility release. The 100x speedup for batched eigendecomposition directly addresses a longstanding gap with CuPy. The new device-agnostic graph API and MX quantization export support make the framework more practical for production deployment across diverse hardware.