Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First-Class Targets for On-Device LLMs
Google and MediaTek’s new LiteRT NeuroPilot Accelerator streamlines on-device AI, letting generative models run directly on phones, laptops, and IoT devices instead of depending on round trips to a data center. It integrates the LiteRT runtime with MediaTek’s NeuroPilot NPU stack and offers a unified API for deploying LLMs and embedding models.
Why This Matters
Historically, on-device ML relied heavily on CPUs and GPUs, while NPUs required fragmented, vendor-specific tools and complex debugging. This fragmentation resulted in a combinatorial explosion of binaries and significant development overhead, increasing costs and time-to-market for deploying models on diverse hardware.
Key Insights
- AOT Compilation Recommendation: On-device compilation of models like Gemma-3-270M can take over a minute, making Ahead-of-Time (AOT) compilation the practical choice for production LLM deployments.
- Unified API: LiteRT NeuroPilot provides a single Accelerator.NPU abstraction, simplifying code and reducing conditional logic for targeting different hardware backends (CPU, GPU, NPU).
- Zero-Copy Buffers: LiteRT integrates with Android’s AHardwareBuffer and GPU buffers, enabling zero-copy tensor transfers for performance-critical tasks like real-time video processing.
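To make the unified-API point concrete, here is a minimal sketch of the idea behind a single accelerator abstraction: the application states a preference order once and one function resolves it against what the device supports. The names Accelerator, DeviceCaps, IsSupported, and PickAccelerator are hypothetical and written only for illustration; they are not LiteRT or NeuroPilot types, and the real runtime may resolve backends differently.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a runtime's accelerator enum (not the LiteRT type).
enum class Accelerator { kCpu, kGpu, kNpu };

// Hypothetical capability record the application fills in at startup.
struct DeviceCaps {
  bool has_npu = false;
  bool has_gpu = false;
};

// Human-readable name, useful for logging which backend was chosen.
std::string_view Name(Accelerator a) {
  switch (a) {
    case Accelerator::kNpu: return "NPU";
    case Accelerator::kGpu: return "GPU";
    case Accelerator::kCpu: return "CPU";
  }
  return "unknown";
}

// The CPU is assumed to be always available; NPU and GPU depend on the device.
bool IsSupported(Accelerator a, const DeviceCaps& caps) {
  switch (a) {
    case Accelerator::kNpu: return caps.has_npu;
    case Accelerator::kGpu: return caps.has_gpu;
    case Accelerator::kCpu: return true;
  }
  return false;
}

// Returns the first accelerator in `preference` that the device supports,
// falling back to the CPU if none of the listed backends is available.
Accelerator PickAccelerator(const std::vector<Accelerator>& preference,
                            const DeviceCaps& caps) {
  assert(!preference.empty() && "preference list must not be empty");
  for (Accelerator a : preference) {
    if (IsSupported(a, caps)) return a;
  }
  return Accelerator::kCpu;
}
```

With a preference order of NPU, GPU, CPU, a device that reports only a GPU resolves to the GPU, and a device that reports neither resolves to the CPU. The per-backend branching lives in one place rather than being scattered through application code, which is the kind of reduction in conditional logic the bullet above describes.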
Working Example
```cpp
// Excerpt: assumes `env` (a LiteRT Environment), `input_span`, and
// `output_span` are set up earlier. Each call is assumed to return an
// Expected-style wrapper, so results are dereferenced with * or -> before use.

// Load model compiled for NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// Create compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);

// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
(*input_buffers)[0].Write<float>(input_span);
compiled->Run(*input_buffers, *output_buffers);
(*output_buffers)[0].Read(output_span);
```
Practical Applications
- Vivo X300 Pro: Achieves over 1600 tokens per second in prefill and 28 tokens per second in decode with Gemma-3n E2B on the Dimensity 9500 NPU.
- Pitfall: Relying on on-device compilation for larger LLMs can introduce significant latency, negatively impacting user experience and making AOT compilation essential for production deployments.
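The cost of on-device compilation is easiest to see with back-of-the-envelope arithmetic. The sketch below treats the figures quoted above as round numbers (1600 tokens per second prefill, 28 tokens per second decode, 60 seconds of compilation) and assumes a 1600-token prompt and a 280-token reply. The prompt and reply sizes are illustrative assumptions, the quoted rates and the compile time come from different models, and the helper names are hypothetical, so read the result as an order-of-magnitude estimate rather than a benchmark.

```cpp
#include <cassert>

// Seconds needed to process `tokens` at a sustained `tokens_per_second`.
double SecondsFor(double tokens, double tokens_per_second) {
  assert(tokens_per_second > 0.0);
  return tokens / tokens_per_second;
}

// Breakdown of the time a user waits for one complete response.
struct LatencyEstimate {
  double compile_s;  // one-time model compilation; 0 when compiled ahead of time
  double prefill_s;  // time to ingest the prompt
  double decode_s;   // time to generate the reply
  double Total() const { return compile_s + prefill_s + decode_s; }
};

// Combines compile time with prefill and decode throughput into one estimate.
LatencyEstimate Estimate(double compile_s, double prompt_tokens,
                         double output_tokens, double prefill_tps,
                         double decode_tps) {
  return LatencyEstimate{compile_s,
                         SecondsFor(prompt_tokens, prefill_tps),
                         SecondsFor(output_tokens, decode_tps)};
}
```

Under these assumptions, Estimate(0.0, 1600, 280, 1600, 28).Total() comes to about 11 seconds for an AOT-compiled model (1 second of prefill plus 10 seconds of decode), while Estimate(60.0, 1600, 280, 1600, 28).Total() comes to about 71 seconds when the first request also has to wait for on-device compilation. The compile step dominates the first-run experience, which is why the pitfall above steers production deployments toward AOT.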