EMO: Pretraining mixture of experts for emergent modularity

EMO is a novel pretraining method that leverages a mixture-of-experts architecture to induce emergent modularity in neural networks. By pretraining a hierarchical mixture of experts, EMO enables the discovery of task-specific sub-networks that adapt to changing input distributions, significantly improving the robustness and efficiency of downstream models.

Overview

EMO is a 1B-active, 14B-total-parameter MoE trained on 1 trillion tokens. It supports selective expert use: for a given task, a small subset of experts can be selected while retaining near-full-model performance. When all experts are used together, EMO remains a strong general-purpose model.

What it does

In an MoE, a small network called the router decides which experts each token activates. EMO's key observation is that tokens from the same document usually come from the same domain. EMO therefore constrains the router so that all tokens in a document select their active experts from a shared, document-level pool, encouraging groups of experts to develop domain specialization.
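
To make the routing constraint concrete, here is a minimal sketch of one way document-level pool routing could work, written in PyTorch. The function name, the greedy pool-selection heuristic, and the example sizes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def route_with_document_pool(router_logits, pool_size, top_k):
    """Route one document's tokens through a shared expert pool.

    router_logits: (num_tokens, num_experts) scores for one document.
    pool_size:     how many experts the document may draw from.
    top_k:         experts activated per token, chosen inside the pool.
    """
    # Illustrative heuristic: pick the pool greedily as the experts
    # with the highest total router score over the document's tokens.
    doc_scores = router_logits.sum(dim=0)                # (num_experts,)
    pool = doc_scores.topk(pool_size).indices

    # Mask experts outside the pool, then route each token as usual.
    mask = torch.full_like(router_logits, float("-inf"))
    mask[:, pool] = 0.0
    weights, experts = (router_logits + mask).topk(top_k, dim=-1)
    return experts, F.softmax(weights, dim=-1)

# Example: 10 tokens, 64 experts, an 8-expert pool, 2 experts per token.
experts, weights = route_with_document_pool(torch.randn(10, 64), 8, 2)
```

Because every token in the document is masked to the same pool, co-activated experts see correlated, domain-coherent inputs, which is what lets specialization emerge.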

Tradeoffs

The document pool size controls how restrictive the modularity constraint is. A smaller pool forces tokens in the same document to share a tighter set of experts, encouraging stronger modularity; a larger pool gives the model more flexibility but weakens the constraint. EMO's performance is comparable to that of a standard MoE, and it remains robust under selective expert use: when only 12.5% of the experts are used, EMO loses only about 3% absolute performance across all benchmarks.
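
As a rough sketch of what selective expert use could look like at deployment time, the snippet below prunes a routed MoE layer to a chosen expert subset, keeping only the matching router rows. The class name, weight layout, and routing loop are assumptions for illustration; the released code may organize this differently.

```python
import torch
import torch.nn as nn

class PrunedMoELayer(nn.Module):
    """Keep a subset of experts plus the matching rows of the router.

    Assumed layout: `experts` is a list of per-expert FFN modules and
    `router` is an nn.Linear mapping hidden states to expert logits.
    """
    def __init__(self, experts, router, keep_ids):
        super().__init__()
        self.experts = nn.ModuleList(experts[i] for i in keep_ids)
        # Slice the router so it only scores the kept experts.
        self.router = nn.Linear(router.in_features, len(keep_ids), bias=False)
        with torch.no_grad():
            self.router.weight.copy_(router.weight[list(keep_ids)])

    def forward(self, x, top_k=2):
        weights, idx = self.router(x).topk(top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(top_k):                  # sum weighted expert outputs
            for e, expert in enumerate(self.experts):
                sel = idx[..., k] == e
                if sel.any():
                    out[sel] += weights[..., k][sel, None] * expert(x[sel])
        return out
```

Using 12.5% of a hypothetical 64-expert layer would mean keep_ids lists 8 experts; applying one pruned layer per MoE block shrinks the deployed parameter count accordingly.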

EMO's expert subsets specialize in semantically meaningful domains, such as Health, Medical & Wellness, News Reporting, US Politics & Elections, and Film & Music. This contrasts with standard MoE training, whose experts tend to cluster around surface-level or syntactic features rather than semantic domains. The EMO-trained model, a matched standard-MoE baseline, and the training code are being released to help the community study emergent modularity in MoEs.

In practice, EMO can be used to improve the memory-accuracy tradeoff in large sparse models. The model's modular structure allows for flexible deployment, and the expert subsets can be composed to create new models. However, there are still many questions to be answered, such as how to better select and compose expert subsets, how to update modules without disrupting the full model, and how to use modular structure for better interpretability and control.
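
Composing subsets could be as simple as taking the union of two domain pools before pruning. A minimal, self-contained sketch with invented expert IDs (the resulting list would be the keep_ids of the hypothetical PrunedMoELayer above):

```python
# Invented expert IDs for two domain subsets of a 64-expert layer.
health_ids = {3, 11, 17, 29, 41, 52, 58, 60}
politics_ids = {2, 11, 19, 33, 41, 47, 54, 59}

# Union the subsets, then prune once to get a two-domain model.
combined = sorted(health_ids | politics_ids)
print(combined)  # 14 experts: the two pools overlap on 11 and 41
```

How well such unions preserve each domain's performance is exactly the kind of open question noted above.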

In conclusion, EMO is a significant step toward making large sparse models more modular, and its release should help the community build modular language models that are easier to deploy, adapt, inspect, and compose.

Similar Articles

Hermes Unlocks Self-Improving AI Agents, Powered by NVIDIA RTX PCs and DGX Spark

"Self-improving AI agents are gaining traction, thanks to Hermes Agent, a new open-source framework that has amassed 140,000 GitHub stars in under three months. Powered by NVIDIA's RTX PCs and DGX Spark, Hermes enables agents to learn from experience and adapt to new tasks, potentially revolutionizing workflows and productivity. This rapid adoption marks a significant milestone in the evolution of agentic AI."

Two Legal Research Providers Launch MCP Integrations with Claude: Thomson Reuters and Free Law Project Connect Their Data to AI (LawSites)

OpenAI Hit With Overdose Suit Centered on ChatGPT Medical Advice (Bloomberg Law News)

Anthropic Goes All-In on Legal, Releasing More Than 20 Connectors and 12 Practice-Area Plugins for Claude (LawSites)

Efficient Edge AI on Arm CPUs and NPUs: Understanding ExecuTorch through Practical Labs

Arm's edge AI initiative gains momentum with ExecuTorch, a PyTorch extension for local inference on constrained devices. This new framework leverages Arm CPUs and NPUs to accelerate AI workloads, promising significant performance boosts on edge devices. Practical Labs, developed by Arm, provide a hands-on introduction to ExecuTorch's capabilities and potential applications in IoT and industrial automation.

Universal AI is “a pathway to AI fluency that’s accessible and approachable to anyone, anywhere”

MIT’s new AI literacy push—backed by a free, adaptive course and real-time LLM tutors—slashes the barrier to entry for non-technical learners, embedding generative models as both subject and instructor. By offloading scaffolding to AI agents, the program turns passive video lectures into interactive, Socratic dialogues that scale from K-12 classrooms to corporate upskilling, potentially minting millions of “AI-fluent” users within a year.