LightSeek Foundation Releases TokenSpeed: An Open-Source Inference Engine for Agentic AI

LightSeek Foundation Releases TokenSpeed, an Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads

The LightSeek Foundation has released TokenSpeed, an open-source inference engine under the MIT license that is optimized for agentic workloads. On NVIDIA Blackwell hardware, it achieves 11% higher throughput than TensorRT-LLM at 100 tokens per second per user. The release targets the bottlenecks specific to long-context, multi-turn coding agents.

Why This Matters

Agentic workloads differ from standard chatbot workloads in two ways: contexts exceed 50K tokens, and multi-turn interactions strain per-GPU tokens per minute (TPM). Idealized models assume linear scaling, but real-world agentic systems hit bottlenecks where KV cache management and communication overhead significantly degrade per-user responsiveness. TokenSpeed addresses this by moving correctness constraints into the type system and automating parallelism, which reduces the cognitive load on developers while maximizing hardware utilization.

Key Insights

TokenSpeed uses a local SPMD (Single Program, Multiple Data) approach with a static compiler to automate collective operations during model construction (LightSeek Foundation, 2026).
The scheduler implements a C++ finite-state machine to enforce KV cache state transfer safety at compile time rather than runtime.
TokenSpeed’s MLA (Multi-head Latent Attention) kernel folds the query-sequence axis into the head axis to maximize BMM1 Tensor Core utilization on NVIDIA Blackwell.
Benchmarks on NVIDIA B200 show a 9% reduction in minimum latency for batch size 1 compared to the state-of-the-art TensorRT-LLM.
The system integrates PyTorch-native SMG to reduce handoff costs between CPU orchestration and GPU execution planes.
The TokenSpeed MLA kernel has already been adopted by the vLLM project due to its superior performance in speculative decoding workloads.
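
The axis-folding idea in the MLA kernel insight above can be illustrated with a small, pure-Python sketch. This is not TokenSpeed's kernel. It assumes that all heads attend to one shared set of cached keys (as in the latent form of MLA), omits masking and softmax, and uses made-up sizes and function names. Under those assumptions, it shows that computing one small product per query position gives the same scores as stacking every (position, head) pair into the row dimension and doing a single, taller product, which is the shape that keeps a matrix unit busier.

```python
from typing import List

Matrix = List[List[float]]


def matmul_t(a: Matrix, b: Matrix) -> Matrix:
    """Return a @ b^T for row-major lists: a is [M, D], b is [S, D]."""
    return [[sum(x * y for x, y in zip(row, key)) for key in b] for row in a]


def scores_per_position(q: List[Matrix], k: Matrix) -> List[Matrix]:
    """Baseline: one small [H, D] x [D, S] product per query position."""
    return [matmul_t(q_pos, k) for q_pos in q]  # shape [Q, H, S]


def scores_folded(q: List[Matrix], k: Matrix) -> List[Matrix]:
    """Fold the query axis into the head axis, then do one [Q*H, D] x [D, S] product."""
    num_heads = len(q[0])
    folded = [head_vec for q_pos in q for head_vec in q_pos]  # shape [Q*H, D]
    flat = matmul_t(folded, k)  # shape [Q*H, S]
    # Unfold back to [Q, H, S] so the result is comparable with the baseline.
    return [flat[i:i + num_heads] for i in range(0, len(flat), num_heads)]


# Hypothetical sizes: 2 query positions (e.g. draft tokens), 3 heads,
# head dimension 4, and 5 cached keys shared by every head.
Q, H, D, S = 2, 3, 4, 5
q = [[[float(qi + h + d) for d in range(D)] for h in range(H)] for qi in range(Q)]
k = [[float(s - d) for d in range(D)] for s in range(S)]

assert scores_folded(q, k) == scores_per_position(q, k)
```

The folded version changes only the shape of the work, not the result: the row count of the product grows from H to Q*H, which matters most when several query tokens are scored at once, as in speculative decoding.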

Practical Applications

Coding agents (e.g., Cursor, Codex): Use TokenSpeed to maintain a floor of 70 tokens per second (TPS) while serving 50K+ token contexts. Pitfall: Hand-written communication logic can introduce significant scaling errors.
Speculative decoding workloads: Leverage the TokenSpeed MLA kernel to nearly halve decode latency on Blackwell GPUs for batch sizes 4, 8, and 16. Pitfall: Runtime KV cache management errors often lead to memory corruption in long-running agentic turns.
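
The KV cache pitfall above is the kind of error that a state machine encoded in types is meant to rule out before the program runs. TokenSpeed is described as doing this in C++ at compile time; the sketch below only illustrates the general pattern in Python, where a static checker such as mypy plays the role of the compiler. The state names (FreeBlock, OwnedBlock, InFlightBlock) and their transitions are hypothetical and are not taken from TokenSpeed's scheduler.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FreeBlock:
    """A KV cache block with no owner. It can only be allocated."""
    block_id: int

    def allocate(self, request_id: str) -> "OwnedBlock":
        return OwnedBlock(self.block_id, request_id, tokens=())


@dataclass(frozen=True)
class OwnedBlock:
    """A block owned by one request. It can be appended to, handed off, or released."""
    block_id: int
    request_id: str
    tokens: Tuple[int, ...]

    def append(self, token: int) -> "OwnedBlock":
        return OwnedBlock(self.block_id, self.request_id, self.tokens + (token,))

    def begin_transfer(self) -> "InFlightBlock":
        return InFlightBlock(self.block_id, self.request_id, self.tokens)

    def release(self) -> FreeBlock:
        return FreeBlock(self.block_id)


@dataclass(frozen=True)
class InFlightBlock:
    """A block being copied elsewhere. It deliberately has no append or release."""
    block_id: int
    request_id: str
    tokens: Tuple[int, ...]

    def complete(self) -> OwnedBlock:
        return OwnedBlock(self.block_id, self.request_id, self.tokens)


block = FreeBlock(block_id=7).allocate("req-1").append(11).append(42)
in_flight = block.begin_transfer()
# in_flight.append(5)   # flagged by a static checker: InFlightBlock has no append
# in_flight.release()   # likewise: a block cannot be freed in the middle of a transfer
restored = in_flight.complete()
freed = restored.release()
```

Because each state is its own class, an invalid transition is a missing method rather than a runtime flag check. One limitation of this Python sketch is that a stale handle (for example, reusing block after begin_transfer) is not rejected; languages with move semantics or ownership tracking can close that gap.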

References:

https://www.marktechpost.com/2026/05/07/lightseek-foundation-releases-tokenspeed-an-open-source-llm-inference-engine-targeting-tensorrt-llm-level-performance-for-agentic-workloads/
