OpenMythos: A 770M Parameter Recurrent-Depth Transformer Matching 1.3B Models


These articles are AI-generated summaries. Please check the original sources for full details.

Meet OpenMythos: An Open-Source PyTorch Reconstruction of Claude Mythos Where 770M Parameters Match a 1.3B Transformer

Kye Gomez has released OpenMythos, an open-source PyTorch reconstruction of what the Claude Mythos architecture might look like. The project hypothesizes that Claude Mythos is a Recurrent-Depth Transformer (RDT), which reuses a fixed set of weights iteratively within a single forward pass. By the project's account, this lets a 770M-parameter model match the performance of a standard 1.3B-parameter transformer.

Why This Matters

Conventional transformer architectures like LLaMA or GPT scale capability by increasing the number of unique layers, which directly inflates parameter count and memory requirements. This creates a rigid bottleneck where reasoning depth is hard-coded into the model’s physical structure at training time, limiting efficiency and inference flexibility.

OpenMythos decouples reasoning depth from parameter count by applying the same weights iteratively. Because internal representations are refined across multiple loop steps rather than passed through a deeper stack of unique layers, a model can reach greater effective depth with a much smaller storage footprint, which is particularly relevant for edge deployment.
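The decoupling described above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not OpenMythos code: the function names, the simplified per-block weight count (attention plus feed-forward, ignoring biases, norms, and embeddings), and the sizes `D_MODEL` and `D_FF` are all assumptions chosen for illustration.

```python
def block_params(d_model: int, d_ff: int) -> int:
    """Rough weight count of one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff   # up- and down-projection
    return attention + feed_forward


def stacked_params(n_layers: int, d_model: int, d_ff: int) -> int:
    """Conventional stack: every layer owns its own weights."""
    return n_layers * block_params(d_model, d_ff)


def looped_params(loop_steps: int, d_model: int, d_ff: int) -> int:
    """Recurrent depth: one block is reused, so loop_steps adds no weights."""
    del loop_steps  # depth is paid for in compute, not in parameters
    return block_params(d_model, d_ff)


# Hypothetical sizes, not taken from the OpenMythos configuration.
D_MODEL, D_FF = 1024, 4096

for depth in (4, 8, 16):
    stacked = stacked_params(depth, D_MODEL, D_FF)
    looped = looped_params(depth, D_MODEL, D_FF)
    print(f"depth={depth:2d}  stacked={stacked:>12,}  looped={looped:>12,}")
```

Under these assumptions the stacked count grows linearly with depth while the looped count stays flat. What grows in the looped case is compute per forward pass, which is the trade the article describes.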

Key Insights

  • Recurrent-Depth Transformers (RDTs) utilize a fixed set of weights applied iteratively across T loop steps, making reasoning depth a function of inference compute rather than parameter count (OpenMythos, 2026).
  • The Recurrent Block integrates Mixture-of-Experts (MoE) from DeepSeekMoE, using a pool of fine-grained experts where the router selects distinct subsets at each loop depth to ensure computational variety.
  • Multi-head Latent Attention (MLA) from DeepSeek-V2 is used to cache compressed low-rank KV latents, resulting in a 10–20× reduction in KV memory overhead at production scale.
  • Stability is maintained via Linear Time-Invariant (LTI) injection constraints from the Parcae architecture (Prairie et al., 2026), enforcing a spectral radius of A < 1 to prevent residual explosion.
  • Continuous latent space reasoning allows models to generalize to deeper reasoning chains than seen in training, as demonstrated in Saunshi et al. (2025).
  • Adaptive Computation Time (ACT) uses a learned scalar per position to dynamically halt looping, allowing simpler tokens to exit early while complex logic receives more compute cycles.
  • Depth-Wise LoRA adapters introduce small rank-r matrices at each iteration to provide per-step behavioral differentiation without the parameter cost of unique layers.
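To illustrate why a spectral radius below 1 matters for the loop, here is a minimal sketch of a linear time-invariant recurrence h ← A·h + B·e with diagonal A and B. It models only the linear part of the update, and the `tanh` parameterization, the function names, and all numeric values are assumptions for illustration; the source does not say how Parcae or OpenMythos actually enforce the constraint.

```python
import math


def constrain(raw_diag):
    """Squash unconstrained values into (-1, 1); for a diagonal A this
    guarantees a spectral radius below 1. (Assumed parameterization.)"""
    return [math.tanh(r) for r in raw_diag]


def run_loop(a_diag, b_diag, e, steps):
    """Iterate h <- A h + B e and record the largest |h_i| after each step."""
    h = [0.0] * len(e)
    norms = []
    for _ in range(steps):
        h = [a * h_i + b * e_i for a, h_i, b, e_i in zip(a_diag, h, b_diag, e)]
        norms.append(max(abs(x) for x in h))
    return norms


# Hypothetical raw parameters and injected input embedding.
RAW_A = [3.0, -2.0, 0.5]
B = [1.0, 1.0, 1.0]
E = [1.0, -0.5, 2.0]

stable = run_loop(constrain(RAW_A), B, E, steps=64)
unstable = run_loop(RAW_A, B, E, steps=64)  # spectral radius 3: no constraint

# Geometric-series bound on each coordinate when |a_i| < 1.
bound = max(abs(b * e) / (1.0 - abs(a))
            for a, b, e in zip(constrain(RAW_A), B, E))

print(f"constrained:   max |h| after 64 steps = {stable[-1]:.3f} (bound {bound:.3f})")
print(f"unconstrained: max |h| after 64 steps = {unstable[-1]:.3e}")
```

With the constraint, each coordinate follows a convergent geometric series and stays under a fixed bound regardless of how many loop steps are run. Without it, the same recurrence grows exponentially with loop depth.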

Practical Applications

  • Inference-Time Scaling: Systems can extend reasoning depth for complex logic tasks by running more loops (e.g., T=16) without needing to retrain or increase model size.
  • Pitfall: Residual Explosion: Without LTI constraints or proper weight initialization, hidden states can grow unboundedly across deep loops, leading to numerical instability.
  • KV Cache Management: Multi-head Latent Attention compresses key/value tensors, allowing high-throughput serving of long-context requests on hardware with limited VRAM.
  • Pitfall: Overthinking: Beyond optimal depth, excessive recurrence can degrade predictions as the hidden state drifts into noise, requiring ACT halting to preserve accuracy.
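The halting behaviour mentioned above can be sketched in the style of Graves' Adaptive Computation Time: each loop step emits a halting probability, looping stops once the cumulative probability reaches 1 − ε, and the output is the probability-weighted mix of the intermediate states. This is a toy scalar version for one position. `step_fn`, `halt_fn`, and their constants are hypothetical stand-ins for the recurrent block and the learned halting head, not the OpenMythos implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def act_loop(h0, step_fn, halt_fn, max_steps=16, eps=0.01):
    """Loop until cumulative halting probability reaches 1 - eps.

    Returns (weighted_state, steps_used). The final step receives the
    leftover probability mass so the weights sum to exactly 1.
    """
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    h, cumulative, output = h0, 0.0, 0.0
    for step in range(1, max_steps + 1):
        h = step_fn(h)
        p = halt_fn(h)
        if cumulative + p >= 1.0 - eps or step == max_steps:
            output += (1.0 - cumulative) * h  # remainder goes to last state
            return output, step
        cumulative += p
        output += p * h


# Hypothetical dynamics: the state contracts toward 2.0, and the halting
# head becomes confident as the state approaches that fixed point.
def step_fn(h):
    return 0.5 * h + 1.0


def halt_fn(h):
    return sigmoid(2.0 * h - 2.0)


easy_out, easy_steps = act_loop(2.0, step_fn, halt_fn)    # already converged
hard_out, hard_steps = act_loop(-6.0, step_fn, halt_fn)   # far from converged
print(f"easy token: {easy_steps} steps, hard token: {hard_steps} steps")
```

In this toy setup the position that starts near its fixed point exits after fewer steps than the one that starts far away. The `max_steps` cap is the safeguard against looping indefinitely when the halting head never becomes confident.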
