The landscape of Large Language Models (LLMs) is continuously evolving, with ever-increasing model sizes leading to astronomical compute and memory demands. This presents a significant "scaling problem". Traditionally, researchers have pursued two distinct paths to address it: parameter efficiency (shrinking model size) or adaptive computation (making models think only when necessary). However, recent advancements, particularly the Mixture of Experts (MoE) and the newly introduced Mixture of Recursions (MoR), are fundamentally rethinking how we build and deploy these powerful AI systems, often achieving both simultaneously. Understanding their unique approaches and business implications is crucial for organisations looking to leverage cutting-edge AI.
Mixture of Experts (MoE): The Specialist Approach
What it is: Mixture of Experts is a machine learning technique where a model is divided into multiple "expert" sub-networks, each specialising in a subset of the input data. Instead of using the entire neural network for every input, a "gate network" or "router" dynamically determines which specific expert(s) are best suited to process a given input token. This concept of "conditional computation" enforces sparsity, meaning only parts of the model are active for a given input.
How it works: In transformer models, MoE layers typically replace dense Feed-Forward Network (FFN) layers. These layers contain multiple experts (which are often FFNs themselves). For example, in Mistral's Mixtral 8x7B, each layer has 8 experts, but for every token, the router selects only two of them for processing. The outputs of the selected experts are then combined and passed to the next layer. While Mixtral's total (sparse) parameter count is around 47 billion (not 56 billion, because non-expert components such as the self-attention layers are shared rather than duplicated per expert), only around 12.9 billion parameters are active per token.
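To make the routing concrete, below is a minimal sketch of a sparse MoE layer in PyTorch, assuming a linear gate that picks the top-2 of 8 expert FFNs per token. The class name, dimensions, and expert design are illustrative assumptions, not Mixtral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative sparse MoE layer: a gate routes each token to its top-k expert FFNs."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # the "router"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                 # x: [num_tokens, d_model]
        scores = self.gate(x)                             # [num_tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)    # each token keeps its top-2 experts
        weights = F.softmax(weights, dim=-1)              # renormalise over the selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out                                        # weighted combination of expert outputs
```

With 8 experts but top-2 routing, each token only ever touches a quarter of the expert weights, which is why the active parameter count per token stays far below the total parameter count.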
Key Advantages:
• Faster Pretraining: MoE models can be pretrained significantly faster than dense models with the same number of parameters, as not all parameters are activated for every computation. This means a larger model or dataset can be used with the same compute budget.
• Faster Inference (for equivalent capacity): Although MoEs may have many parameters in total, only a subset is used during inference, leading to faster processing than a dense model of equivalent capacity. For instance, Mixtral outperforms Llama 2 70B while activating fewer than 20% as many parameters per token during inference.
• Increased Model Capacity: MoEs allow for a dramatic increase in the total number of parameters (model capacity) without a proportional increase in computational burden.
Challenges and Business Implications:
• High VRAM Requirements: A significant drawback is that all parameters of all experts must be loaded into memory (VRAM) for inference, even though only a few are active for any given token. For Mixtral's roughly 47 billion parameters, that is on the order of 94 GB in 16-bit precision before any quantisation. These memory demands can be a barrier for smaller organisations or local deployments.
• Fine-tuning Difficulties: MoEs have historically struggled with fine-tuning, being more prone to overfitting than dense models. However, recent work shows that instruction-tuning can significantly benefit MoE models, potentially more so than dense models. This suggests that for applications requiring specific task adaptation, strategic instruction tuning is key.
• Load Balancing: Ensuring all experts are utilised evenly is challenging, as routers can converge to favouring a few popular experts. Techniques such as an auxiliary load-balancing loss, noisy top-k gating, and per-expert capacity limits are used to mitigate this (a minimal sketch of such an auxiliary loss follows this list). This adds engineering complexity, especially in distributed systems.
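As a concrete illustration of the load-balancing point, here is a minimal sketch of a Switch-Transformer-style auxiliary loss that penalises routers for concentrating tokens on a few experts. The function name and usage are assumptions for illustration, not a specific library's API.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, routed_expert: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss encouraging uniform expert utilisation.

    router_logits: [num_tokens, num_experts] raw gate scores.
    routed_expert: [num_tokens] index of the expert each token was actually sent to.
    The loss is minimised when both the fraction of tokens per expert and the
    mean router probability per expert are uniform across experts.
    """
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)
    tokens_per_expert = F.one_hot(routed_expert, num_experts).float().mean(dim=0)
    probs_per_expert = probs.mean(dim=0)
    return num_experts * torch.sum(tokens_per_expert * probs_per_expert)
```

In practice a term like this is added to the language-modelling loss with a small coefficient, alongside the noisy top-k gating and per-expert capacity limits mentioned above.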
Mixture of Recursions (MoR): The Iterative Refinement Approach
What it is: Mixture of Recursions (MoR) is an advanced transformer architecture developed by Google DeepMind, aiming to address transformer limitations such as uniform compute allocation and parameter bloat. MoR is essentially a "recursive transformer but with a twist". Instead of stacking unique layers like a standard transformer, it reuses the same block of layers (a "recursive block") multiple times. The "twist" is that it incorporates adaptive token-level computation.
How it works: MoR's core innovation is to let the model dynamically decide how many times each token needs to pass through this shared block, based on the token's complexity. A "token router", a neural network, makes this decision (a minimal sketch follows the list below).
• Parameter Efficiency: By reusing the same block of parameters repeatedly, MoR drastically reduces the number of weights that need to be trained and stored compared to a standard transformer. This dramatically increases the model's effective depth without increasing its parameter count.
• Adaptive Compute: Simple tokens might go through just one or two recursions (early exit), while complex tokens might undergo more recursions, receiving more compute. This avoids the uniform compute allocation that limits traditional transformers. The router learns to assign recursion depth based on the semantic importance or complexity of tokens; for example, content-rich words get more processing while function words exit earlier.
• Routing Mechanisms:
◦ Expert Choice Routing: At each recursion step, the router decides which tokens continue to the next recursion and which exit. This ensures a static compute budget and good load balancing, but carries a potential causality-leak risk, mitigated by an auxiliary loss.
◦ Token Choice Routing: Each token decides its total recursion depth once at the beginning. This avoids causality issues but may lead to load imbalance, which is addressed with a balancing loss.
• KV Cache Optimisation: MoR uses strategies such as recursion-wise KV caching (storing key-value pairs only for the tokens active at a given recursion depth) and recursive KV sharing (reusing the initial KV pairs across all subsequent recursions) to reduce memory footprint and speed up prefill operations.
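To make the adaptive-depth idea concrete, here is a minimal sketch assuming token-choice-style routing: a small router assigns each token a recursion depth up front, and only still-active tokens are refined at deeper steps. Names such as MoRBlock and mor_forward are illustrative assumptions, not DeepMind's implementation, and the router uses a simple argmax for readability (training would involve the balancing losses described above).

```python
from typing import Optional

import torch
import torch.nn as nn

class MoRBlock(nn.Module):
    """One shared recursive block plus a token router (illustrative sketch)."""
    def __init__(self, d_model=512, n_heads=8, max_recursions=4):
        super().__init__()
        self.max_recursions = max_recursions
        self.shared_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, max_recursions)   # scores each possible recursion depth

def mor_forward(block: MoRBlock, x: torch.Tensor, max_depth: Optional[int] = None) -> torch.Tensor:
    """x: [batch, seq, d_model]. Each token is refined for its own number of recursions."""
    max_depth = max_depth or block.max_recursions
    depths = block.router(x).argmax(dim=-1) + 1             # token-choice: depth in 1..max_recursions
    depths = depths.clamp(max=max_depth)
    h = x
    for step in range(1, max_depth + 1):
        active = depths >= step                              # simple tokens exit early
        if not active.any():
            break
        refined = block.shared_block(h)                      # the SAME weights are reused at every depth
        h = torch.where(active.unsqueeze(-1), refined, h)    # only active tokens get updated
        # recursion-wise KV caching would store K/V here only for the active tokens
    return h
```

For readability the sketch runs the shared block over the whole sequence and simply leaves inactive tokens unchanged; an efficient implementation would gather only the active tokens at each step, which is also what makes recursion-wise KV caching natural.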
Key Advantages:
• Smaller Models & Cheaper Compute: MoR can achieve similar or better performance with significantly fewer parameters (e.g., a 118M-parameter MoR outperforms a 315M-parameter vanilla transformer). It also enables more efficient training, with observed reductions in training time and peak memory usage.
• Faster Inference Throughput: MoR demonstrates up to 2.18 times faster inference throughput than vanilla transformers and existing recursive baselines. This comes from parameter sharing (which enables continuous depthwise batching) and the early-exit mechanism for simpler tokens.
• Adaptive Intelligence and Latent Reasoning: The dynamic routing lets the model "think harder" when needed, integrating deeper internal processing directly into the decoding of each token. This enables test-time scaling, where increasing the allowed recursion steps during inference can improve quality.
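Continuing the hypothetical mor_forward sketch above (same imports), test-time scaling is simply a matter of raising the recursion cap at inference while keeping the same weights:

```python
# Same weights, different inference-time compute budgets (hypothetical API from the sketch above).
block = MoRBlock(d_model=512, n_heads=8, max_recursions=4)
x = torch.randn(1, 16, 512)                      # one sequence of 16 token embeddings
fast = mor_forward(block, x, max_depth=2)        # cheaper: at most two recursions per token
careful = mor_forward(block, x, max_depth=4)     # deeper recursion allowed, potentially higher quality
```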
Challenges and Business Implications:
• Small Model Performance: MoR might not work as well with very small models.
• Fixed Routing Capacity: Adjusting how many tokens get routed deeper can be inflexible after training, especially with expert choice routing.
• KV Sharing Accuracy Trade-off: While recursive KV sharing offers the maximum memory savings and prefill speedup, reusing the initial KVs can sometimes lead to a slight dip in output quality, particularly with expert choice routing.
• Complexity: While simpler than MoE in some respects (a single shared weight block and a simple router), the recursion and routing mechanisms introduce their own complexities.
MoE vs. MoR: A Comparative Business View
When considering MoE and MoR for business applications, several factors come into play:
• Core Architectural Philosophy:
◦ MoE: Embraces width scaling by routing tokens to different, specialised experts within the model. It's like having a team of different specialists, where the router picks the best ones for each sub-task.
◦ MoR: Favours depth scaling by recursively applying a shared block of layers, with tokens going through adaptive depths of processing. It's akin to a highly efficient, multi-talented core team that iterates on a problem until a solution is found, dedicating more time to harder problems.
• Parameter and Memory Footprint:
◦ MoE: While inference FLOPs are lower (due to sparsity), the total model size in memory (VRAM) can be very large because all expert parameters must be loaded. This implies higher infrastructure costs for deployment, especially for very large models.
◦ MoR: Offers substantial parameter reduction through weight sharing, leading to smaller models and significantly less memory usage (up to 25% less). This translates directly into lower operational costs and broader accessibility, potentially enabling cutting-edge AI on more modest hardware.
• Computational Efficiency:
◦ MoE: Achieves faster inference by activating only a subset of experts.
◦ MoR: Delivers even faster inference throughput (up to 2.18x) through adaptive compute (early exit for simple tokens) and continuous depthwise batching, which keeps GPUs highly utilised. This means quicker response times for users, which is critical for real-time applications such as chatbots and interactive AI services.
• Training and Engineering Complexity:
◦ MoE: Involves challenges such as sharding and load balancing across multiple experts, which can require significant engineering effort. Load-balancing auxiliary losses and noisy routing are crucial for stable training.
◦ MoR: Is comparatively simpler to manage from an engineering perspective, as it primarily involves a single shared weight set and a router. Its parameter efficiency also aids techniques like FSDP (Fully Sharded Data Parallelism) by reducing communication overhead during distributed training.
• Performance vs. Specialisation:
◦ MoE: Experts can specialise in different concepts or groups of tokens, although broad language specialisation may be discouraged by load balancing. MoEs tend to perform well on knowledge-heavy tasks but can struggle with reasoning-heavy tasks compared to dense models at a fixed pretraining perplexity.
◦ MoR: Achieves better performance (lower perplexity, higher few-shot accuracy) with significantly fewer parameters for the same training compute. Its capacity for adaptive, latent reasoning means it can dynamically allocate more "thought" to the complex parts of a problem, potentially leading to more nuanced and accurate outputs.
• Future and Adaptability:
◦ MoE: Continues to be a focus for distillation (to smaller dense models), quantisation (reducing memory footprint), and expert merging to further optimise serving.
◦ MoR: Its core recursive block is modality-agnostic, making it highly promising for multimodal AI applications (vision, audio, video), where long contexts are common and its memory and throughput gains could be even more impactful. This opens doors for new product offerings beyond text generation.
Conclusion
Both Mixture of Experts and Mixture of Recursions represent significant strides in creating more efficient and capable LLMs, pushing past the limitations of traditional transformer architectures. While MoE excels at scaling model capacity by selectively activating specialised parts, it still carries a substantial memory footprint because all experts must be loaded. MoR, on the other hand, offers a compelling vision for truly adaptive and resource-efficient AI by enabling dynamic, token-level computation through shared parameters and smart caching.
For businesses, the choice between MoE and MoR (or a potential hybrid in the future) will depend on specific needs. If the priority is leveraging massive model capacity and benefiting from faster inference on high-end infrastructure, MoE remains a strong contender, especially with advancements in instruction-tuning. However, for those seeking to democratise AI, reduce operational costs, and explore novel multimodal applications with truly intelligent resource allocation, Mixture of Recursions presents a highly exciting and promising path forward. This paradigm shift towards "adaptive intelligence" is poised to make cutting-edge AI more practical and accessible across a wider array of industries and use cases.