ExecuTorch, a PyTorch extension for running AI models locally on edge devices, now has a set of hands-on Jupyter labs created by Arm. The labs walk through deploying models on Arm CPUs and Ethos-U NPUs, covering both the practical steps and the underlying reasoning.
Overview
ExecuTorch extends the PyTorch ecosystem to deliver local AI inference on constrained edge devices. It takes a PyTorch model, exports it into a lightweight .pte file containing both the model weights and a static computation graph, and runs it through a runtime built specifically for edge inference. This removes the need for Python at runtime and avoids dynamic execution overhead that is unnecessary for inference.
Arm has created a collection of Jupyter labs that complement the official ExecuTorch documentation. The labs explain both the how and the why of each step, covering CPU and NPU inference across Cortex-A and Cortex-M + Ethos-U platforms. They also showcase the use of Model Explorer adapters, developed by Arm, to gain visibility into model deployment with ExecuTorch.
What ExecuTorch Does
ExecuTorch takes a PyTorch model and exports it into a minimal .pte artefact containing both the model weights and a static computation graph, removing the need for Python at runtime and the dynamic-execution overhead that inference does not require. Export is followed by lowering, where the model graph is transformed into a backend-compatible form; this is where hardware-aware optimization begins.
The resulting artefact is lightweight, portable, predictable in execution, and suitable for deployment on constrained systems.
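As a rough sketch of that flow, the snippet below exports a torchvision MobileNetV2 (used here purely as an example model) and writes the resulting .pte. The executorch.exir module paths follow recent ExecuTorch releases and may differ in yours.

```python
# Minimal sketch of the ExecuTorch export flow.
# API paths follow recent ExecuTorch releases and may differ in yours.
import torch
import torchvision.models as models
from executorch.exir import to_edge

# Any torch.nn.Module works; MobileNetV2 is used here purely as an example.
model = models.mobilenet_v2(weights=None).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Capture a static graph, convert it to the Edge dialect, then package it.
exported = torch.export.export(model, example_inputs)
edge = to_edge(exported)
executorch_program = edge.to_executorch()

with open("mobilenet_v2.pte", "wb") as f:
    f.write(executorch_program.buffer)
```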
CPU Inference: Raspberry Pi 5
Even on devices like the Raspberry Pi 5, which can run standard PyTorch models without ExecuTorch, ExecuTorch can still deliver meaningful performance gains, because performance depends heavily on how the model is executed. ExecuTorch achieves this by delegating parts of the model to optimized backends. On Arm CPUs, this is typically done with the XNNPACK backend. When enabled, supported operators such as convolutions and matrix multiplications are delegated to highly optimized implementations. On Arm platforms, these implementations leverage KleidiAI microkernels, which make efficient use of architectural features such as Neon.
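Enabling that delegation is a small change at export time. The sketch below assumes the XnnpackPartitioner and to_edge_transform_and_lower import paths used in recent ExecuTorch releases, and MyModel is a placeholder for whatever module you are exporting:

```python
# Sketch: delegate supported operators to XNNPACK at export time.
# Import paths follow recent ExecuTorch releases and may differ in yours.
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

model = MyModel().eval()          # placeholder for any torch.nn.Module
example_inputs = (torch.randn(1, 3, 224, 224),)

exported = torch.export.export(model, example_inputs)

# Supported subgraphs (convolutions, matmuls, ...) are handed to XNNPACK;
# anything else stays on the portable CPU operators.
edge = to_edge_transform_and_lower(exported, partitioner=[XnnpackPartitioner()])
executorch_program = edge.to_executorch()

with open("model_xnnpack.pte", "wb") as f:
    f.write(executorch_program.buffer)
```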
In the labs, Arm compares inference of an OPT-125M transformer model on a Raspberry Pi 5 and measures a significant latency reduction when using ExecuTorch with XNNPACK. Note that backend delegation does not happen by default: running ExecuTorch without XNNPACK often results in higher latency than PyTorch (which has its own KleidiAI optimizations), though you still benefit from a smaller runtime footprint and improved portability.
NPU Inference: Ethos-U and TOSA
To go further, the labs target hardware acceleration using Arm Ethos-U NPUs, typically paired with Cortex-A or Cortex-M CPUs. Execution becomes heterogeneous. Rather than running the entire model on one processor, ExecuTorch partitions the graph: supported subgraphs are delegated to the NPU, and unsupported operators fall back to the CPU.
Ethos-U operates on quantized integer models (typically INT8), so models must be quantized before delegation. The first step is to create a backend-specific quantizer using EthosUQuantizer together with a compile_spec that matches your target Ethos-U configuration; in the labs, the target is an Ethos-U85 with 256 multiply-accumulate (MAC) units.
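Quantization itself follows the standard PT2E flow, with the Ethos-U quantizer driving operator selection. The sketch below is illustrative only: the compile-spec builder and quantizer module paths have changed across ExecuTorch releases, so treat the exact names (ArmCompileSpecBuilder, EthosUQuantizer, get_symmetric_quantization_config) as assumptions to verify against the labs.

```python
# Illustrative PT2E quantization for Ethos-U (INT8). Module paths and the
# compile-spec builder vary between ExecuTorch releases; check the labs for
# the exact names matching your install.
import torch
from executorch.backends.arm.arm_backend import ArmCompileSpecBuilder
from executorch.backends.arm.quantizer.arm_quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

model = MyModel().eval()                     # placeholder module
example_inputs = (torch.randn(1, 3, 224, 224),)

# Target an Ethos-U85 with 256 MACs, as in the labs. Depending on the release,
# system_config and memory_mode arguments may also be required here.
compile_spec = (
    ArmCompileSpecBuilder()
    .ethosu_compile_spec("ethos-u85-256")
    .build()
)

quantizer = EthosUQuantizer(compile_spec)
quantizer.set_global(get_symmetric_quantization_config())

# Standard PT2E: capture, insert observers, calibrate, convert to INT8.
captured = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)                    # calibration pass(es)
quantized_model = convert_pt2e(prepared)
```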
The next step involves lowering the model into TOSA (Tensor Operator Set Architecture), an intermediate representation designed to bridge high-level frameworks and hardware backends. TOSA provides a stable, hardware-agnostic operator set. Instead of requiring each hardware vendor to support every framework-specific operator, models are lowered into TOSA, and hardware backends implement this smaller, standardized set.
This step uses the to_edge_transform_and_lower API with an EthosUPartitioner. For Ethos-U, this triggers the backend path that serializes the delegated subgraphs to TOSA and runs Vela to produce an optimized command stream for execution on the NPU. Finally, .to_executorch(...) packages the result into a .pte file.
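Continuing from the quantized model and compile_spec above, the lowering and packaging step looks roughly like this (again a sketch; the EthosUPartitioner import path may differ between releases):

```python
# Sketch: lower the quantized model to the Ethos-U backend and package it.
# EthosUPartitioner's import path may differ between ExecuTorch releases.
import torch
from executorch.backends.arm.ethosu_partitioner import EthosUPartitioner
from executorch.exir import to_edge_transform_and_lower

exported = torch.export.export(quantized_model, example_inputs)

# Supported subgraphs are serialized to TOSA and compiled by Vela into an
# Ethos-U command stream; unsupported operators fall back to the CPU.
edge = to_edge_transform_and_lower(
    exported,
    partitioner=[EthosUPartitioner(compile_spec)],
)
executorch_program = edge.to_executorch()

with open("model_ethos_u85.pte", "wb") as f:
    f.write(executorch_program.buffer)
```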
Visualizing Model Deployment
To make the partitioning visible, the labs use Google's Model Explorer together with adapters developed by Arm. These tools let you inspect the ExecuTorch graph (.pte), visualize how it is partitioned across backends, and examine the TOSA representation (.tosa).
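As an illustration, once Model Explorer and the Arm adapters are installed, an exported artefact can be opened from Python; whether .pte and .tosa files load directly depends on the adapter versions, so treat this as a hedged example rather than a definitive recipe:

```python
# Illustrative only: launch Google's Model Explorer on an exported artefact.
# Requires the Model Explorer package plus the Arm adapters mentioned above;
# support for .pte / .tosa files depends on the adapter versions installed.
import model_explorer

model_explorer.visualize("model_ethos_u85.pte")
```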
For example, the labs compare two .pte files targeting the same Ethos-U configuration but generated from slightly different models. A standard MobileNetV2 contains only supported operators, so the entire compute region is delegated as a single, continuous Ethos-U subgraph. A MobileNetV2 variant with an LRN layer inserted behaves differently: the LRN is decomposed into lower-level operations during lowering, not all of which can be delegated, so the graph is partitioned into multiple segments. Supported regions are delegated to the NPU, while the unsupported portion runs on the CPU.
This level of visibility helps explain performance behavior and can guide optimization decisions.
Practical Next Steps
The Jupyter labs are designed so you can run and modify the code on your own hardware. The collection includes contributions from Professor Marcelo Rovai (UNIFEI University, and a member of the Edge AI Foundation Academia-Industry Partnership) and academic reviewers at IIIT Bangalore.
Building models is only half the story—getting them running efficiently at the edge is what matters. ExecuTorch makes that possible, and these labs show you how to get started quickly while understanding the underlying concepts.