Zyphra ZAYA1-8B: An MoE Model with 760M Active Parameters Outperforming Claude 4.5 Sonnet on Math

These articles are AI-generated summaries. Please check the original sources for full details.

Zyphra Releases ZAYA1-8B: A Reasoning MoE Trained on AMD Hardware That Punches Far Above Its Weight Class

Zyphra AI has launched ZAYA1-8B, a Mixture of Experts (MoE) model with only 760 million active parameters. The model was trained end-to-end on a cluster of 1,024 AMD Instinct MI300X nodes. It scores 89.6 on HMMT’25, surpassing Claude 4.5 Sonnet on that mathematical reasoning benchmark.

Why This Matters

Standard dense models activate every parameter for every token, so inference cost and latency grow with model size. ZAYA1-8B uses a Mixture of Experts (MoE) architecture, which routes each token to a small subset of experts and so decouples representational capacity from per-token compute cost. By optimizing for intelligence density, it suggests that specialized architectures can match frontier performance on some tasks while sharply reducing memory bandwidth requirements and inference FLOPs.
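As a rough illustration of the gap between total and active parameters in an MoE, the sketch below counts both for a top-k routed model. The function name and every number are hypothetical, chosen only to keep the arithmetic easy to follow; they are not the ZAYA1-8B configuration.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active) parameter counts for a hypothetical MoE model.

    shared_params: parameters every token uses (attention, embeddings, router).
    params_per_expert: parameters in one expert MLP, summed over all layers.
    top_k: number of experts each token is routed to.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


# Illustrative numbers only; NOT the ZAYA1-8B configuration.
total, active = moe_param_counts(
    shared_params=200_000_000,
    params_per_expert=50_000_000,
    num_experts=16,
    top_k=1,
)
print(f"total={total:,} active={active:,} active/total={active / total:.2f}")
```

In this toy setup the model stores one billion parameters but touches only a quarter of them per token, which is the sense in which capacity and per-token compute are decoupled.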

Key Insights

  • Compressed Convolutional Attention (CCA) achieves 8× KV-cache compression compared to standard attention mechanisms (Zyphra, 2026).
  • The ZAYA1 MLP-based router uses PID-controller bias balancing to prevent expert load imbalance during training (Zyphra, 2026).
  • Markovian RSA test-time compute combines recursive self-aggregation with fixed-duration reasoning chunks to keep context windows bounded (Zyphra, 2026).
  • Training was performed on 1,024 AMD Instinct MI300X nodes using the AMD Pensando Pollara interconnect (IBM/Zyphra, 2026).
  • A five-stage post-training pipeline includes an RLVE-Gym phase that dynamically adjusts puzzle difficulty to train reasoning circuits (Zyphra, 2026).
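To put the 8× KV-cache compression figure in concrete terms, here is a back-of-the-envelope sketch of cache size for a standard attention layout and for the same layout with an 8× smaller cache. The layer count, head count, head dimension, and sequence length are assumptions made for illustration, not ZAYA1-8B's published shape, and the sketch does not model how CCA achieves the compression.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, compression=1):
    """Approximate KV-cache size in bytes for one sequence.

    The leading factor of 2 covers keys and values; bytes_per_value=2
    assumes a 16-bit format such as bf16. `compression` divides the result.
    """
    full = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
    return full // compression


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

baseline = kv_cache_bytes(**shape)
compressed = kv_cache_bytes(**shape, compression=8)

print(f"baseline:   {baseline / 2**20:.0f} MiB")
print(f"compressed: {compressed / 2**20:.0f} MiB")
```

Under these assumed dimensions the cache for a 32,768-token sequence drops from 4 GiB to 512 MiB, which matters most when memory bandwidth limits decoding speed.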

Practical Applications

  • Use case: On-device deployment for local LLM applications requiring high intelligence density and low memory bandwidth.
  • Pitfall: Applying the Markovian RSA harness to models like Qwen3-4B that lack reasoning-specific co-design yields a smaller performance uplift.
  • Use case: Serverless inference for mathematical and coding tasks via Zyphra Cloud using the Apache 2.0-licensed weights.
  • Pitfall: Neglecting active load balancing in MoE routers leads to unstable training and underutilization of the expert network.
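The load-balancing pitfall, and the PID-controller bias balancing listed under Key Insights, can be illustrated with a toy simulation. This is a minimal sketch of a generic textbook PID controller that adjusts a per-expert routing bias; it is not Zyphra's implementation. The class name, gains, skewed score distribution, and top-1 routing are all assumptions.

```python
import random


class PIDBiasBalancer:
    """Generic PID controller on per-expert routing bias (hypothetical)."""

    def __init__(self, num_experts, kp=0.8, ki=0.3, kd=0.1):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.bias = [0.0] * num_experts
        self.integral = [0.0] * num_experts
        self.prev_error = [0.0] * num_experts

    def update(self, counts):
        """Recompute each bias from the load observed in the last batch."""
        total = sum(counts)
        target = 1.0 / len(counts)  # uniform load share
        for e, count in enumerate(counts):
            error = target - count / total  # positive when underloaded
            self.integral[e] += error
            derivative = error - self.prev_error[e]
            self.bias[e] = (self.kp * error
                            + self.ki * self.integral[e]
                            + self.kd * derivative)
            self.prev_error[e] = error


def route_top1(scores, bias):
    """Pick the expert with the highest bias-adjusted score."""
    adjusted = [s + b for s, b in zip(scores, bias)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


def simulate(steps=200, tokens_per_step=512, seed=0):
    rng = random.Random(seed)
    skew = [1.0, 0.0, 0.0, 0.0]  # assumed router that favors expert 0
    balancer = PIDBiasBalancer(num_experts=len(skew))
    history = []
    for _ in range(steps):
        counts = [0] * len(skew)
        for _ in range(tokens_per_step):
            scores = [rng.gauss(mu, 1.0) for mu in skew]
            counts[route_top1(scores, balancer.bias)] += 1
        history.append([c / tokens_per_step for c in counts])
        balancer.update(counts)
    return history


history = simulate()
print("first-step load shares:", history[0])
print("last-step load shares: ", history[-1])
```

Without the bias term, expert 0 would keep receiving most of the tokens; with it, the shares drift toward an even split. A real training setup would apply the bias inside the router and would need gains tuned to its own dynamics.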
