LightSeek Foundation Releases TokenSpeed: An Open-Source Inference Engine for Agentic AI

These articles are AI-generated summaries. Please check the original sources for full details.

LightSeek Foundation Releases TokenSpeed, an Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads

The LightSeek Foundation has released TokenSpeed, an open-source inference engine optimized for agentic workloads and distributed under the MIT license. On NVIDIA Blackwell hardware, it achieves 11% higher throughput than TensorRT-LLM at 100 tokens per second per user. The release targets the bottlenecks specific to long-context, multi-turn coding agents.

Why This Matters

Agentic workloads differ from standard chatbot traffic: contexts exceed 50K tokens, and multi-turn interactions strain per-GPU throughput, measured in tokens per minute (TPM). Idealized models assume linear scaling, but in real-world agentic systems the overhead of KV cache management and communication logic significantly degrades per-user responsiveness. TokenSpeed addresses this by moving correctness constraints into the type system and automating parallelism, which reduces the cognitive load on developers while maximizing hardware utilization.

Key Insights

  • TokenSpeed uses a local SPMD (Single Program, Multiple Data) approach with a static compiler to automate collective operations during model construction (LightSeek Foundation, 2026).
  • The scheduler implements a C++ finite-state machine to enforce KV cache state transfer safety at compile time rather than runtime.
  • TokenSpeed’s MLA (Multi-head Latent Attention) kernel folds the query-sequence axis into the head axis to maximize BMM1 Tensor Core utilization on NVIDIA Blackwell.
  • Benchmarks on NVIDIA B200 show a 9% reduction in minimum latency for batch size 1 compared to the state-of-the-art TensorRT-LLM.
  • The system integrates PyTorch-native SMG to reduce handoff costs between CPU orchestration and GPU execution planes.
  • The TokenSpeed MLA kernel has already been adopted by the vLLM project due to its superior performance in speculative decoding workloads.
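
The compile-time enforcement described in the scheduler bullet above can be illustrated with the typestate pattern. The following is a minimal C++17 sketch, not TokenSpeed's actual scheduler: the state names (Free, Allocated, InTransfer, Resident), the KvBlock handle, and the transition functions are all hypothetical. Each state is a distinct type, and each transition is a function that accepts only the state it is legal from, so an out-of-order transition fails to compile instead of failing at runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical lifecycle states for a KV cache block (illustrative names only).
struct Free {};
struct Allocated {};
struct InTransfer {};
struct Resident {};

// A block handle whose lifecycle state is part of its type.
// It is move-only, so each transition consumes the previous handle.
template <typename State>
struct KvBlock {
    std::size_t id;
    explicit KvBlock(std::size_t i) : id(i) {}
    KvBlock(const KvBlock&) = delete;
    KvBlock& operator=(const KvBlock&) = delete;
    KvBlock(KvBlock&&) = default;
    KvBlock& operator=(KvBlock&&) = default;
};

// Each transition exists only for the state it is legal from.
inline KvBlock<Allocated> allocate(KvBlock<Free>&& b) { return KvBlock<Allocated>{b.id}; }
inline KvBlock<InTransfer> begin_transfer(KvBlock<Allocated>&& b) { return KvBlock<InTransfer>{b.id}; }
inline KvBlock<Resident> finish_transfer(KvBlock<InTransfer>&& b) { return KvBlock<Resident>{b.id}; }
inline KvBlock<Free> release(KvBlock<Resident>&& b) { return KvBlock<Free>{b.id}; }

// The compiler itself checks which transitions are legal.
static_assert(std::is_invocable_v<decltype(&begin_transfer), KvBlock<Allocated>>);
static_assert(!std::is_invocable_v<decltype(&begin_transfer), KvBlock<Free>>);
static_assert(!std::is_invocable_v<decltype(&release), KvBlock<InTransfer>>);

// Walks one block through the full lifecycle and returns its id.
inline std::size_t run_lifecycle(std::size_t id) {
    KvBlock<Free> free_block{id};
    auto allocated = allocate(std::move(free_block));
    auto moving = begin_transfer(std::move(allocated));
    auto resident = finish_transfer(std::move(moving));
    // begin_transfer(std::move(resident));  // would not compile: wrong state
    assert(resident.id == id);
    auto recycled = release(std::move(resident));
    return recycled.id;
}
```

One limitation of this sketch: C++ does not reject use of a moved-from handle, so the types encode only which transition is legal from which state, not that each handle is used exactly once.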

Practical Applications

  • Coding agents (e.g., Cursor, Codex): Use TokenSpeed to hold a floor of 70 tokens per second (TPS) while serving contexts of 50K+ tokens. Pitfall: Hand-written communication logic can introduce significant scaling errors.
  • Speculative decoding workloads: Use the TokenSpeed MLA kernel to nearly halve decode latency on Blackwell GPUs at batch sizes 4, 8, and 16. Pitfall: KV cache management errors that surface only at runtime often lead to memory corruption in long-running agentic turns.
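
To see why folding the query-sequence axis into the head axis helps in speculative decoding, consider the score computation (BMM1) for one request. The C++ sketch below is a CPU illustration only, not the TokenSpeed kernel. It assumes that all heads attend to one shared latent KV matrix of shape [S, D], that queries are stored contiguously as [Q, H, D], and it omits causal masking among the draft tokens. Under those assumptions, Q small matrix products with H rows each produce exactly the same scores as a single product with Q*H rows over the same buffer, and the taller product is the shape that fills matrix-multiply tiles better. All function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes A * B^T for row-major A (M x K) and B (N x K).
inline std::vector<float> matmul_abt(const float* a, const float* b,
                                     std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> out(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) acc += a[m * K + k] * b[n * K + k];
            out[m * N + n] = acc;
        }
    return out;
}

// Unfolded: one product per draft token, each with only H rows.
inline std::vector<float> scores_per_token(const std::vector<float>& q,
                                           const std::vector<float>& kv,
                                           std::size_t Q, std::size_t H,
                                           std::size_t S, std::size_t D) {
    std::vector<float> out;
    out.reserve(Q * H * S);
    for (std::size_t t = 0; t < Q; ++t) {
        auto part = matmul_abt(q.data() + t * H * D, kv.data(), H, S, D);
        out.insert(out.end(), part.begin(), part.end());
    }
    return out;
}

// Folded: the same [Q, H, D] buffer viewed as [Q*H, D], one product with Q*H rows.
inline std::vector<float> scores_folded(const std::vector<float>& q,
                                        const std::vector<float>& kv,
                                        std::size_t Q, std::size_t H,
                                        std::size_t S, std::size_t D) {
    return matmul_abt(q.data(), kv.data(), Q * H, S, D);
}

// Fills small deterministic inputs and checks that both paths agree.
inline bool fold_matches(std::size_t Q, std::size_t H, std::size_t S, std::size_t D) {
    std::vector<float> q(Q * H * D), kv(S * D);
    for (std::size_t i = 0; i < q.size(); ++i) q[i] = static_cast<float>(i % 7) - 3.0f;
    for (std::size_t i = 0; i < kv.size(); ++i) kv[i] = static_cast<float>(i % 5) - 2.0f;
    assert(q.size() == Q * H * D);
    return scores_per_token(q, kv, Q, H, S, D) == scores_folded(q, kv, Q, H, S, D);
}
```

Because the fold is only a reinterpretation of contiguous memory, it needs no data movement. A real kernel would still have to apply the mask that stops a draft token from attending to later draft tokens.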
