PyTorch 2.12 is now available, bringing up to 100x speedups for batched eigenvalue decomposition on CUDA, a new device-agnostic graph capture API, and support for exporting models that use Microscaling (MX) quantization formats. The release includes 2,926 commits from 457 contributors since version 2.11.
Performance improvements
The headline performance change is an overhaul of the backend selection for linalg.eigh on CUDA. The legacy MAGMA backend has been deprecated in favor of cuSolver, and the dispatch heuristics now use syevj_batched unconditionally. For batched symmetric/Hermitian eigenvalue problems, this yields up to 100x speedups over the previous release. Workloads that previously took minutes now run in seconds by processing many small or medium matrices as a single GPU operation.
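A minimal way to hit the batched path; the shapes and timing harness here are illustrative:

```python
import torch

# Build a batch of small symmetric matrices; a single eigh call dispatches
# to the batched cuSolver path rather than looping in Python.
batch, n = 4096, 32
A = torch.randn(batch, n, n, device="cuda")
A = A + A.mT  # symmetrize

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
eigvals, eigvecs = torch.linalg.eigh(A)  # one batched GPU operation
end.record()
torch.cuda.synchronize()
print(f"batched eigh over {batch} {n}x{n} matrices: {start.elapsed_time(end):.2f} ms")
```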
The Adagrad optimizer now supports fused=True, performing the entire optimizer step in a single CUDA kernel. Adagrad joins Adam, AdamW, and SGD in offering a fused variant.
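Usage is the same one-flag change as the other fused optimizers; as with those, parameters must live on CUDA:

```python
import torch

model = torch.nn.Linear(512, 512, device="cuda")
# fused=True executes the whole Adagrad update in a single CUDA kernel
opt = torch.optim.Adagrad(model.parameters(), lr=0.01, fused=True)

loss = model(torch.randn(64, 512, device="cuda")).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```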
New APIs and export capabilities
torch.accelerator.Graph is a new device-agnostic API for graph capture and replay, providing a unified abstraction over backend-specific implementations such as torch.xpu.XPUGraph. Each backend can register its own implementation through a lightweight GraphImplInterface. Alongside this, c10::Stream and torch.Stream now expose an is_capturing() method, replacing device-specific alternatives.
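The release notes don't spell out the full interface, so here is only a sketch of the intended shape, assuming capture/replay mirrors torch.cuda.CUDAGraph; the torch.accelerator.graph context manager and method names below are assumptions:

```python
import torch

dev = torch.accelerator.current_accelerator()  # cuda, xpu, ...
model = torch.nn.Linear(512, 512, device=dev)
static_in = torch.randn(64, 512, device=dev)

g = torch.accelerator.Graph()                  # backend-agnostic graph object
with torch.accelerator.graph(g):               # hypothetical capture context
    static_out = model(static_in)

s = torch.Stream(device=dev)
print(s.is_capturing())                        # new query on torch.Stream

static_in.copy_(torch.randn_like(static_in))   # update inputs in place
g.replay()                                     # replay the captured work
```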
torch.export.save and torch.export.load now correctly serialize and deserialize tensors with the float8_e8m0fnu dtype used as the shared block-scale exponent in MX formats (MXFP4, MXFP6, MXFP8). This unblocks the full export-to-deployment workflow for models using Microscaling quantization, which is relevant for teams deploying large language models to cost-constrained or edge environments.
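A round-trip sketch; the block-scale layout and dequant math here are illustrative, not the MX spec:

```python
import torch

# A module holding an MX-style block-scale tensor stored as float8_e8m0fnu.
class MXScaled(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # e8m0 values are biased exponents; view a uint8 payload as the dtype
        raw = torch.randint(120, 130, (4,), dtype=torch.uint8)
        self.register_buffer("scale", raw.view(torch.float8_e8m0fnu))

    def forward(self, x):
        # apply the power-of-two scales; cast shown for illustration only
        return x * self.scale.to(torch.float32)

ep = torch.export.export(MXScaled(), (torch.randn(4, 4),))
torch.export.save(ep, "mx_scaled.pt2")   # now serializes float8_e8m0fnu correctly
ep2 = torch.export.load("mx_scaled.pt2")
print(ep2.module()(torch.randn(4, 4)).shape)
```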
Control-flow regions using torch.cond can now be captured and replayed as part of CUDA Graphs. By leveraging CUDA 12.4's conditional IF nodes, branches are evaluated entirely on the GPU within a single graph capture. This currently works with the eager and cudagraphs backends; Inductor support is planned for a future release.
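A minimal example of a data-dependent branch compiled with the cudagraphs backend; the predicate stays on the GPU:

```python
import torch

def f(pred, x):
    # pred is a boolean CUDA tensor; both branches must return matching shapes
    return torch.cond(pred, lambda x: x.sin(), lambda x: x.cos(), (x,))

compiled = torch.compile(f, backend="cudagraphs")

x = torch.randn(1024, device="cuda")
pred = x.sum() > 0          # 0-dim bool tensor, evaluated on-device
print(compiled(pred, x).shape)
```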
Distributed and profiling updates
Custom operators can now accept ProcessGroup objects directly as arguments. All c10d functional collective ops have been updated to accept both ProcessGroup objects and string names.
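For example, the functional all_reduce now takes either form. A single-process gloo setup is shown for brevity; the group_name attribute used below is an assumption:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed._functional_collectives import all_reduce

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
pg = dist.group.WORLD

out_by_object = all_reduce(t, "sum", pg)           # ProcessGroup object
out_by_name = all_reduce(t, "sum", pg.group_name)  # string name (assumed attribute)

dist.destroy_process_group()
```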
The PyTorch Profiler Events API now exposes flow IDs, flow types, activity types, unfinished events, and Python function events. NCCL collective traces can be correlated across ranks using a new seq_num field.
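Reading the new fields off a trace might look like the following; the exact attribute spellings aren't given in the notes, so they are hedged with getattr:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 128)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(torch.randn(32, 128))

for evt in prof.events():
    # flow_id / activity_type are new per the 2.12 notes; names are assumptions
    print(evt.name, getattr(evt, "flow_id", None), getattr(evt, "activity_type", None))
```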
FlightRecorder's trace analyzer now supports ncclx and gloo backends alongside existing nccl and xccl backends, and recognizes torchcomms operations.
Platform-specific updates
- CUDA: torch.cuda.graph now accepts an enable_annotations kwarg that injects annotation metadata into individual kernels (see the sketch after this list). CUDA Green Contexts support specifying a workqueue limit.
- ROCm: AMD GPUs (ROCm >= 7.02) now support expandable memory segments. rocSHMEM support enables symmetric memory collective operations. hipSPARSELt is enabled by default on ROCm >= 7.12, bringing semi-structured (2:4) sparsity support. FlexAttention on AMD GPUs uses two-stage pipelining, delivering 5-26% speedups on MI350X.
- Apple MPS: Apple Silicon binary wheels now ship with ahead-of-time-compiled Metal 4 shaders, eliminating runtime shader compilation overhead on first run.
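The annotation kwarg slots into the existing torch.cuda.graph capture pattern; everything besides enable_annotations below is standard CUDA Graphs usage:

```python
import torch

g = torch.cuda.CUDAGraph()
static_in = torch.randn(256, device="cuda")

# enable_annotations (new in 2.12) tags the captured kernels with
# annotation metadata; the rest is the usual capture/replay pattern
with torch.cuda.graph(g, enable_annotations=True):
    static_out = static_in * 2.0

static_in.copy_(torch.randn(256, device="cuda"))
g.replay()
print(static_out.sum())
```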
Deprecations and breaking changes
TorchScript is now deprecated. torch.export should replace the torch.jit.trace and torch.jit.script APIs; ExecuTorch should replace the embedded TorchScript runtime.
The CUDA 12.8 binary wheel is deprecated and will no longer be published as part of the standard release matrix. The default wheel remains CUDA 13.0. CUDA 13.2 has been added as an experimental build. Users on older architectures (Pascal, Volta) should use the CUDA 12.6 wheel. Users on newer GPUs (Blackwell) should use CUDA 13.0+ wheels, which require an NVIDIA driver upgrade to 580.65.06 (Linux) or 580.88 (Windows).
Planned breaking changes for torchcomms in PyTorch 2.13+ include eager initialization of ProcessGroup, changes to P2P operations, and making torchcomms a required package for PyTorch Distributed.
Bottom line
PyTorch 2.12 is a significant performance and platform compatibility release. The 100x speedup for batched eigendecomposition directly addresses a longstanding gap with CuPy. The new device-agnostic graph API and MX quantization export support make the framework more practical for production deployment across diverse hardware.