Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First-Class Targets for On-Device LLMs
These articles are AI-generated summaries. Please check the original sources for full details.
Google and MediaTek’s new LiteRT NeuroPilot Accelerator streamlines on-device AI, letting generative models run directly on phones, laptops, and IoT devices without depending on a constant connection to a data center. It integrates the LiteRT runtime with MediaTek’s NeuroPilot NPU stack and offers a unified API for deploying LLMs and embedding models.
Why This Matters
Historically, on-device ML relied heavily on CPUs and GPUs, while NPUs required fragmented, vendor-specific tools and complex debugging. This fragmentation resulted in a combinatorial explosion of binaries and significant development overhead, increasing costs and time-to-market for deploying models on diverse hardware.
Key Insights
- AOT Compilation Recommendation: On-device compilation of models like Gemma-3-270M can take over a minute, making Ahead-of-Time (AOT) compilation the practical choice for production LLM deployments.
- Unified API: LiteRT NeuroPilot provides a single Accelerator.NPU abstraction, simplifying code and reducing the conditional logic needed to target different hardware backends (CPU, GPU, NPU).
- Zero-Copy Buffers: LiteRT integrates with Android’s AHardwareBuffer and GPU buffers, enabling zero-copy tensor transfers for performance-critical tasks like real-time video processing.
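To illustrate the unified-accelerator idea from the list above, here is a minimal C++ sketch of a preference-ordered backend fallback. It does not use the LiteRT API: the Accelerator enum, PickAccelerator, and Name are hypothetical names invented for this illustration, and the way a real runtime reports its available backends is an assumption.

```cpp
#include <cassert>
#include <initializer_list>
#include <optional>
#include <set>
#include <string>

// Hypothetical stand-in for a runtime's accelerator enum; NOT the LiteRT type.
enum class Accelerator { kNpu, kGpu, kCpu };

// Returns the first accelerator in `preference` that appears in `available`,
// or std::nullopt when none of the preferred backends is available.
inline std::optional<Accelerator> PickAccelerator(
    std::initializer_list<Accelerator> preference,
    const std::set<Accelerator>& available) {
  assert(preference.size() > 0);  // the caller must name at least one backend
  for (Accelerator a : preference) {
    if (available.count(a) > 0) return a;
  }
  return std::nullopt;
}

// Human-readable name, useful for logging which backend was chosen.
inline std::string Name(Accelerator a) {
  switch (a) {
    case Accelerator::kNpu: return "NPU";
    case Accelerator::kGpu: return "GPU";
    case Accelerator::kCpu: return "CPU";
  }
  return "unknown";
}
```

On a device that reports only GPU and CPU, calling PickAccelerator with the preference order NPU, GPU, CPU would select the GPU. The application keeps one code path and one ordered list instead of per-vendor branches.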
Working Example
// Create the runtime environment (assumed here: default, no extra options)
auto env = Environment::Create({});
// Load model compiled for NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);
// Create compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);
// Allocate buffers; each call returns an Expected, so dereference before use
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
// input_span and output_span are caller-provided spans of float
(*input_buffers)[0].Write<float>(input_span);
compiled->Run(*input_buffers, *output_buffers);
(*output_buffers)[0].Read<float>(output_span);
Practical Applications
- Vivo X300 Pro: Achieves over 1600 tokens per second in prefill and 28 tokens per second in decode with Gemma-3n E2B on the Dimensity 9500 NPU.
- Pitfall: Relying on on-device compilation for larger LLMs adds significant startup latency and hurts user experience, which is why AOT compilation is essential for production deployments.
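The throughput figures above translate into rough response times with simple arithmetic. The sketch below is an illustrative estimate only: it assumes constant prefill and decode rates, treats "over 1600" as exactly 1600 tokens per second (so the prefill time is an upper estimate), and uses hypothetical names (LatencyEstimate, EstimateLatency, compile_s) that do not come from any library.

```cpp
#include <cassert>

// Rates quoted for Gemma-3n E2B on the Dimensity 9500 NPU.
constexpr double kPrefillTokensPerSec = 1600.0;  // lower bound ("over 1600")
constexpr double kDecodeTokensPerSec = 28.0;

struct LatencyEstimate {
  double prefill_s;  // time to process the prompt
  double decode_s;   // time to generate the output tokens
  double total_s;    // compile_s + prefill_s + decode_s
};

// compile_s models a one-time on-device compilation cost paid before the
// first run; it is 0 when the model was compiled ahead of time (AOT).
inline LatencyEstimate EstimateLatency(int prompt_tokens, int output_tokens,
                                       double compile_s = 0.0) {
  assert(prompt_tokens >= 0 && output_tokens >= 0 && compile_s >= 0.0);
  const double prefill = prompt_tokens / kPrefillTokensPerSec;
  const double decode = output_tokens / kDecodeTokensPerSec;
  return {prefill, decode, compile_s + prefill + decode};
}
```

With an 800-token prompt and 140 generated tokens, EstimateLatency(800, 140) gives 0.5 s of prefill and 5.0 s of decode, or 5.5 s in total. Passing compile_s = 60.0 raises the first response to 65.5 s. That 60-second value is only a placeholder borrowed from the Gemma-3-270M compilation note, not a measurement for this model, but it shows why compilation cost dominates the first run.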