LightSeek Foundation Releases TokenSpeed: An Open-Source Inference Engine for Agentic AI
These articles are AI-generated summaries. Please check the original sources for full details.
LightSeek Foundation Releases TokenSpeed, an Open-Source LLM Inference Engine Targeting TensorRT-LLM-Level Performance for Agentic Workloads
The LightSeek Foundation has released TokenSpeed, an MIT-licensed open-source inference engine optimized for agentic workloads. On NVIDIA Blackwell hardware, it achieves 11% higher throughput than TensorRT-LLM at 100 tokens per second per user. The release targets the bottlenecks specific to long-context, multi-turn coding agents.
Why This Matters
Agentic workloads differ from standard chatbot traffic: contexts exceed 50K tokens, and multi-turn interactions strain per-GPU tokens per minute (TPM). Idealized models assume throughput scales linearly, but in real-world agentic systems the overhead of KV cache management and communication logic significantly degrades per-user responsiveness. TokenSpeed addresses this by moving correctness constraints into the type system and automating parallelism, which reduces the cognitive load on developers while maximizing hardware utilization.
Key Insights
- TokenSpeed uses a local SPMD (Single Program, Multiple Data) approach with a static compiler to automate collective operations during model construction (LightSeek Foundation, 2026).
- The scheduler implements a C++ finite-state machine to enforce KV cache state transfer safety at compile time rather than runtime.
- TokenSpeed’s MLA (Multi-head Latent Attention) kernel folds the query-sequence axis into the head axis to maximize BMM1 Tensor Core utilization on NVIDIA Blackwell.
- Benchmarks on NVIDIA B200 show a 9% reduction in minimum latency for batch size 1 compared to the state-of-the-art TensorRT-LLM.
- The system integrates PyTorch-native SMG to reduce handoff costs between CPU orchestration and GPU execution planes.
- The TokenSpeed MLA kernel has already been adopted by the vLLM project due to its superior performance in speculative decoding workloads.
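To make the compile-time state-machine idea from the list above concrete, here is a minimal C++17 sketch of the general "typestate" pattern. It is not TokenSpeed's scheduler: the state names (`Free`, `Allocated`, `InTransfer`, `Ready`) and the `KvBlock` and `Scheduler` types are hypothetical, chosen only to show how an illegal KV cache transition can be rejected by the compiler instead of being detected at runtime.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical KV-cache block states (illustrative names, not TokenSpeed's).
struct Free {};
struct Allocated {};
struct InTransfer {};
struct Ready {};

struct Scheduler;  // the only type allowed to create or re-tag blocks

// A move-only handle whose template parameter records the block's state.
template <typename State>
class KvBlock {
public:
    KvBlock(KvBlock&&) = default;
    KvBlock(const KvBlock&) = delete;
    std::size_t id() const { return id_; }

private:
    friend struct Scheduler;
    explicit KvBlock(std::size_t id) : id_(id) {}
    std::size_t id_;
};

// Each transition consumes a handle in one state and returns a handle in the
// next. A transition that is not listed here simply has no function to call.
struct Scheduler {
    static KvBlock<Free> create(std::size_t id) { return KvBlock<Free>(id); }
    static KvBlock<Allocated> allocate(KvBlock<Free>&& b) {
        return KvBlock<Allocated>(b.id());
    }
    static KvBlock<InTransfer> begin_transfer(KvBlock<Allocated>&& b) {
        return KvBlock<InTransfer>(b.id());
    }
    static KvBlock<Ready> finish_transfer(KvBlock<InTransfer>&& b) {
        return KvBlock<Ready>(b.id());
    }
    static KvBlock<Free> release(KvBlock<Ready>&& b) {
        return KvBlock<Free>(b.id());
    }
    // Reading is only defined for blocks whose transfer has completed.
    static std::size_t read(const KvBlock<Ready>& b) { return b.id(); }
};

// The compiler accepts the legal transition and rejects the illegal ones.
static_assert(std::is_invocable_v<decltype(&Scheduler::begin_transfer),
                                  KvBlock<Allocated>>);
static_assert(!std::is_invocable_v<decltype(&Scheduler::begin_transfer),
                                   KvBlock<Free>>);
static_assert(!std::is_invocable_v<decltype(&Scheduler::read),
                                   KvBlock<InTransfer>&>);

// Walks one block through the full legal lifecycle and returns the id read.
std::size_t run_legal_lifecycle(std::size_t id) {
    auto free_block = Scheduler::create(id);
    auto allocated = Scheduler::allocate(std::move(free_block));
    auto moving = Scheduler::begin_transfer(std::move(allocated));
    auto ready = Scheduler::finish_transfer(std::move(moving));
    std::size_t seen = Scheduler::read(ready);
    auto recycled = Scheduler::release(std::move(ready));
    (void)recycled;  // back in the Free state, available for reuse
    return seen;
}
```

Calling `run_legal_lifecycle(7)` exercises every legal transition, while a line such as `Scheduler::read(moving)` would fail to compile. One limitation of this sketch: C++ does not stop code from touching a moved-from handle, so the pattern catches wrong-state calls but not every form of stale-handle reuse.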
Practical Applications
- Coding agents (e.g., Cursor, Codex): Use TokenSpeed to maintain a 70 TPS floor while serving 50K+ token contexts. Pitfall: Hand-written communication logic can introduce significant scaling errors.
- Speculative decoding workloads: Leverage the TokenSpeed MLA kernel to nearly halve decode latency on Blackwell GPUs for batch sizes 4, 8, and 16. Pitfall: Runtime KV cache management errors often lead to memory corruption in long-running agentic turns.
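The MLA kernel's fold of the query-sequence axis into the head axis can be illustrated with a small CPU sketch. It rests on an assumption about why the fold is valid: in MLA's decode form, all heads attend over the same latent KV cache, so the query-times-keys product (presumably what the article calls BMM1) for several speculative tokens can be computed as one taller matrix multiply. The function names and shapes below are hypothetical, and the sketch omits scaling, masking, and softmax.

```cpp
#include <cstddef>
#include <vector>

// out[m*N + n] = sum_k a[m*K + k] * b[n*K + k], i.e. A (M x K) times B^T,
// where B is stored row-major as N x K.
void gemm_abt(const float* a, const float* b, float* out,
              std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) acc += a[m * K + k] * b[n * K + k];
            out[m * N + n] = acc;
        }
    }
}

// Baseline: one matmul per query token, each with only `heads` rows.
// q is laid out as [q_len, heads, dim]; latent_kv as [kv_len, dim].
std::vector<float> scores_per_token(const std::vector<float>& q,
                                    const std::vector<float>& latent_kv,
                                    std::size_t q_len, std::size_t heads,
                                    std::size_t dim, std::size_t kv_len) {
    std::vector<float> out(q_len * heads * kv_len);
    for (std::size_t t = 0; t < q_len; ++t) {
        gemm_abt(q.data() + t * heads * dim, latent_kv.data(),
                 out.data() + t * heads * kv_len, heads, kv_len, dim);
    }
    return out;
}

// Folded: the same buffer viewed as [q_len * heads, dim], so a single matmul
// has q_len times more rows. No data is copied or rearranged.
std::vector<float> scores_folded(const std::vector<float>& q,
                                 const std::vector<float>& latent_kv,
                                 std::size_t q_len, std::size_t heads,
                                 std::size_t dim, std::size_t kv_len) {
    std::vector<float> out(q_len * heads * kv_len);
    gemm_abt(q.data(), latent_kv.data(), out.data(), q_len * heads, kv_len, dim);
    return out;
}

// Builds small integer-valued inputs (exact in float) and compares both paths.
bool fold_matches_per_token() {
    const std::size_t q_len = 4, heads = 8, dim = 16, kv_len = 32;
    std::vector<float> q(q_len * heads * dim), kv(kv_len * dim);
    for (std::size_t i = 0; i < q.size(); ++i)
        q[i] = static_cast<float>(static_cast<int>(i % 7) - 3);
    for (std::size_t i = 0; i < kv.size(); ++i)
        kv[i] = static_cast<float>(static_cast<int>(i % 5) - 2);
    return scores_per_token(q, kv, q_len, heads, dim, kv_len) ==
           scores_folded(q, kv, q_len, heads, dim, kv_len);
}
```

Both paths produce identical scores, but the folded call hands the hardware one matrix with `q_len * heads` rows instead of `q_len` matrices with `heads` rows each. On a GPU, a taller operand of that kind is what lets matrix-multiply units stay busy at small batch sizes; this CPU loop only demonstrates the equivalence, not the speedup.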