Mastering OpenMythos: Implementing Recurrent-Depth Transformers with MLA and MoE

A Coding Tutorial on OpenMythos on Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing

OpenMythos is a theoretical reconstruction of the Claude Mythos architecture that uses iterative (looped) computation to reason more deeply without adding parameters. Its Multi-Head Latent Attention (MLA) mechanism is reported to reduce the KV-cache memory footprint substantially relative to Grouped-Query Attention (GQA); the exact saving depends on the latent rank and the head configuration. Because loop depth can be varied at inference time, the model can improve accuracy on complex tasks without additional training.

Why This Matters

Traditional Transformer scaling relies on increasing parameter counts, so memory and compute requirements grow with model size. OpenMythos shifts toward compute-adaptive reasoning, where depth comes from applying the same recurrent update repeatedly rather than from stacking more static layers. Because inference depth is decoupled from training depth, engineers can spend additional inference-time compute to gain accuracy on complex tasks without retraining or changing the architecture.
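The idea of getting depth from repetition rather than from extra layers can be sketched in a few lines of PyTorch. This is a minimal illustration, not the OpenMythos implementation: the class name RecurrentDepthBlock, the use of nn.TransformerEncoderLayer as the shared block, and the choice to re-add the input embedding on every loop are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class RecurrentDepthBlock(nn.Module):
    """Hypothetical sketch: one shared Transformer layer applied n_loops times."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=4 * dim,
            dropout=0.0, batch_first=True,
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, e: torch.Tensor, n_loops: int) -> torch.Tensor:
        h = torch.zeros_like(e)
        for _ in range(n_loops):
            # Re-inject the input embedding e on every iteration (an assumption),
            # then apply the same weights again.
            h = self.norm(self.layer(h + e))
        return h


torch.manual_seed(0)
block = RecurrentDepthBlock(dim=32)
e = torch.randn(2, 10, 32)  # (batch, seq, dim) input embeddings
n_params = sum(p.numel() for p in block.parameters())

with torch.no_grad():
    h_shallow = block(e, n_loops=3)   # depth used during training
    h_deep = block(e, n_loops=16)     # deeper unroll at inference, same weights

print(n_params, h_shallow.shape, h_deep.shape)
```

The parameter count is fixed by the single shared layer; only the number of loop iterations, and therefore the compute spent, changes between the two calls.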

Furthermore, the integration of Multi-Head Latent Attention (MLA) addresses the KV-cache bottleneck inherent in recurrent architectures. By shrinking the cache footprint and using Mixture-of-Experts (MoE) layers for sparse routing, OpenMythos provides a blueprint for deploying high-reasoning models on hardware with limited VRAM. The tutorial also stress-tests training at high learning rates and reports that the recurrent update stays stable, which suggests that the recurrent injection mechanism is robust.
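To make the MoE routing step concrete, here is a minimal sketch of top-k routing with a router bias. It is an illustration under stated assumptions, not the OpenMythos layer: the class TopKMoE, the router_bias buffer, and the decision to use the bias for expert selection only (while gate weights come from the unbiased scores) are choices made for this example; the source does not specify how the bias is applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Hypothetical sketch of a top-k routed MoE feed-forward layer."""

    def __init__(self, dim: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts, bias=False)
        # Selection-only bias (assumption): can be nudged to favour under-used experts.
        self.register_buffer("router_bias", torch.zeros(n_experts))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (n_tokens, dim)
        scores = self.router(x)                                    # (n_tokens, n_experts)
        _, idx = (scores + self.router_bias).topk(self.k, dim=-1)  # biased selection
        gate = F.softmax(scores.gather(-1, idx), dim=-1)           # unbiased mixing weights
        out = torch.zeros_like(x)
        for e_id, expert in enumerate(self.experts):
            token_pos, slot = (idx == e_id).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue  # this expert received no tokens in the batch
            weighted = gate[token_pos, slot].unsqueeze(-1) * expert(x[token_pos])
            out = out.index_add(0, token_pos, weighted)
        # Tokens routed to each expert, useful for spotting imbalance.
        load = torch.bincount(idx.flatten(), minlength=len(self.experts))
        return out, load


torch.manual_seed(0)
moe = TopKMoE(dim=16, n_experts=4, k=2)
x = torch.randn(8, 16)
out, load = moe(x)
print(out.shape, load.tolist())
```

The load vector always sums to the number of tokens times k, so a heavily skewed load indicates that a few experts are absorbing most of the traffic.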

Key Insights

Depth Extrapolation: Increasing loop iterations at inference time (e.g., from T=3 to T=16) improves accuracy on structured parity tasks without retraining (OpenMythos, 2026).
MLA Memory Efficiency: Multi-Head Latent Attention significantly reduces the KV-cache footprint compared to GQA, which makes it well suited to recurrent inference loops.
Spectral Stability: The recurrent update remains stable as long as the spectral radius of the matrix A in that update stays strictly below 1 (the tutorial keeps it within (0, 1)), even during high-learning-rate stress tests.
Adaptive Computation Time (ACT): Halting probabilities enable the model to dynamically allocate compute cycles across sequence positions.
MoE Expert Utilization: Top-k routing with a router bias spreads tokens across experts, preventing the routing collapse in which a few experts receive most of the tokens in Mixture-of-Experts layers.
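The spectral-stability point can be checked numerically on a toy linear recurrence. The sketch below assumes an update of the form h_{t+1} = A h_t + B e with a diagonal A; that form, the sigmoid parameterization, and the 0.95 cap (chosen only so the demo converges quickly) are assumptions for illustration and are not taken from the OpenMythos code.

```python
import torch

torch.manual_seed(0)
dim = 16
dtype = torch.float64

# Unconstrained parameter; a large optimizer step could move it anywhere.
raw = torch.randn(dim, dtype=dtype) * 2.0
# Squash into (0, 0.95): whatever value raw takes, the spectral radius stays below 1.
A = torch.diag(0.95 * torch.sigmoid(raw))
B = torch.randn(dim, dim, dtype=dtype) * 0.1
e = torch.randn(dim, dtype=dtype)  # injected input, held fixed across loops

spectral_radius = torch.linalg.eigvals(A).abs().max().item()

# Iterate the recurrent update h <- A h + B e.
h = torch.zeros(dim, dtype=dtype)
for _ in range(500):
    h = A @ h + B @ e

# With spectral radius < 1 the iteration converges to the fixed point (I - A)^{-1} B e.
h_star = torch.linalg.solve(torch.eye(dim, dtype=dtype) - A, B @ e)

print(f"spectral radius = {spectral_radius:.3f}")
print(f"distance to fixed point = {(h - h_star).norm().item():.2e}")
```

Because the constraint is built into the parameterization, no gradient step can push the spectral radius to 1 or beyond, which is one way a recurrent update can stay bounded under aggressive learning rates.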

Working Examples

Initialization and KV-cache footprint analysis for OpenMythos using MLA attention.

import torch
from open_mythos.main import OpenMythos, MythosConfig

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

cfg = MythosConfig(
    vocab_size=256, dim=128, n_heads=4,
    max_seq_len=128, max_loop_iters=8,
    n_experts=4, n_experts_per_tok=2,
    attn_type="mla", kv_lora_rank=32
)
model = OpenMythos(cfg).to(device)

# Run one forward pass so the model populates the KV cache
x = torch.randint(0, 256, (1, 64), device=device)
cache = {}
with torch.no_grad():
    logits = model(x, n_loops=4, kv_cache=cache)

def get_cache_size(kv):
    """Total size in KB of every tensor in the cache (a dict of dicts of tensors)."""
    n_bytes = sum(t.element_size() * t.numel() for entry in kv.values() for t in entry.values())
    return n_bytes / 1024

print(f"MLA cache size: {get_cache_size(cache):.2f} KB")
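For intuition about where the saving comes from, the per-token cache cost of the two schemes can be worked out by hand. The sketch below is back-of-the-envelope arithmetic, not a measurement of OpenMythos: it assumes GQA stores a key and a value vector for every KV head, and that MLA stores one shared latent of size kv_lora_rank plus a small decoupled positional key. The values n_kv_heads=2 and rope_dim=16 are hypothetical, since the config above does not state them.

```python
def gqa_bytes_per_token(n_kv_heads: int, head_dim: int, n_layers: int = 1,
                        bytes_per_elem: int = 2) -> int:
    # Keys and values are both cached for every KV head.
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_elem


def mla_bytes_per_token(kv_lora_rank: int, rope_dim: int, n_layers: int = 1,
                        bytes_per_elem: int = 2) -> int:
    # Only the compressed latent and the positional key are cached.
    return (kv_lora_rank + rope_dim) * n_layers * bytes_per_elem


# dim=128 and n_heads=4 from the config give head_dim = 32.
head_dim = 128 // 4
gqa = gqa_bytes_per_token(n_kv_heads=2, head_dim=head_dim)   # hypothetical n_kv_heads
mla = mla_bytes_per_token(kv_lora_rank=32, rope_dim=16)      # hypothetical rope_dim

print(f"GQA: {gqa} bytes/token, MLA: {mla} bytes/token, ratio: {gqa / mla:.2f}x")
# GQA: 256 bytes/token, MLA: 96 bytes/token, ratio: 2.67x
```

The ratio depends entirely on the chosen ranks and head counts, so these figures illustrate the formula rather than any benchmark result.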

Training on a cumulative parity task and performing depth extrapolation during inference.

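T_TRAIN = 3
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
device = next(model.parameters()).device

# make_batch is not defined in this excerpt; it is expected to return
# (inputs, targets) token tensors for the cumulative parity task.
model.train()
for step in range(600):
    x, y = make_batch(64)
    x, y = x.to(device), y.to(device)
    logits = model(x, n_loops=T_TRAIN)
    # Flatten (batch, seq, vocab) logits and (batch, seq) targets for cross-entropy
    loss = torch.nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference with depth extrapolation: evaluate a fresh batch with more loops than in training
model.eval()
x_eval, y_eval = make_batch(64)
x_eval, y_eval = x_eval.to(device), y_eval.to(device)
with torch.no_grad():
    logits_extrapolated = model(x_eval, n_loops=16)
    acc = (logits_extrapolated.argmax(-1) == y_eval).float().mean()
print(f"Accuracy at n_loops=16: {acc.item():.3f}")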
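The training loop calls make_batch, which this excerpt does not define. One plausible version is sketched below, assuming that inputs are random bits encoded as token ids 0 and 1 and that the target at each position is the parity of all bits seen so far. The signature, the seq_len default, and this encoding are assumptions; the original tutorial may construct the task differently.

```python
import torch


def cumulative_parity(bits: torch.Tensor) -> torch.Tensor:
    """Parity (0 or 1) of the running sum along the sequence dimension."""
    return bits.cumsum(dim=1) % 2


def make_batch(batch_size: int, seq_len: int = 32, device: str = "cpu"):
    """Hypothetical data generator: random bit tokens and their cumulative parity."""
    bits = torch.randint(0, 2, (batch_size, seq_len), device=device)
    return bits, cumulative_parity(bits)


# Worked example: running sums of 1,0,1,1 are 1,1,2,3, so the parities are 1,1,0,1.
example = torch.tensor([[1, 0, 1, 1]])
print(cumulative_parity(example).tolist())  # [[1, 1, 0, 1]]

torch.manual_seed(0)
x, y = make_batch(4, seq_len=8)
print(x.shape, y.shape)
```

Each target depends on every earlier bit, which is why the task rewards carrying state through additional loop iterations.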

Practical Applications

Use Case: Logical reasoning tasks like Cumulative Parity. The model uses recurrent loops to maintain state across long sequences, improving accuracy as inference compute increases. Pitfall: Over-looping on simple tokens can waste compute if ACT halting thresholds are not properly tuned.
Use Case: Memory-constrained LLM inference. MLA allows high-throughput generation with minimal KV-cache growth. Pitfall: A poorly chosen latent rank (kv_lora_rank) in the MLA projection can create a representation bottleneck in the latent space.
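The ACT pitfall mentioned above is easier to reason about with the halting rule written out. The following is a minimal sketch in the style of the standard Adaptive Computation Time formulation, where a position stops once its cumulative halting probability reaches a threshold. The function act_halting and its 0.99 threshold are assumptions for illustration; how OpenMythos computes and applies halting probabilities is not shown in the source.

```python
import torch


def act_halting(halt_probs: torch.Tensor, threshold: float = 0.99):
    """halt_probs: (n_loops, n_positions), per-loop halting probabilities in (0, 1).

    Returns per-loop mixing weights that sum to 1 for each position, and the
    number of loop iterations each position actually used.
    """
    n_loops, n_pos = halt_probs.shape
    cum = torch.zeros(n_pos, dtype=halt_probs.dtype)
    running = torch.ones(n_pos, dtype=torch.bool)
    weights = torch.zeros_like(halt_probs)
    steps = torch.zeros(n_pos, dtype=torch.long)
    for t in range(n_loops):
        p = halt_probs[t]
        is_last = t == n_loops - 1
        # Halt when the cumulative probability reaches the threshold, or on the final loop.
        halting_now = running & ((cum + p >= threshold) | is_last)
        continuing = running & ~halting_now
        # A halting position receives the remaining probability mass as its weight.
        weights[t] = torch.where(halting_now, 1.0 - cum, weights[t])
        weights[t] = torch.where(continuing, p, weights[t])
        cum = cum + torch.where(continuing, p, torch.zeros_like(p))
        steps = steps + running.long()
        running = continuing
    return weights, steps


# Position 0 is "easy" (halts at once); position 1 is "hard" (uses every loop).
halt_probs = torch.tensor([[0.995, 0.1],
                           [0.500, 0.1],
                           [0.500, 0.1],
                           [0.500, 0.1]])
weights, steps = act_halting(halt_probs)
print(steps.tolist())        # [1, 4]
print(weights.sum(dim=0))    # each position's weights sum to 1
```

Lowering the threshold makes positions halt earlier and saves compute, while a threshold that is too high makes even easy positions run the full loop budget.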

References:

https://www.marktechpost.com/2026/04/23/a-coding-tutorial-on-openmythos-on-recurrent-depth-transformers-with-depth-extrapolation-adaptive-computation-and-mixture-of-experts-routing/
