Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First Class Targets for on Device LLMs

Google and MediaTek’s new LiteRT NeuroPilot Accelerator streamlines on-device AI, letting generative models run directly on phones, laptops, and IoT devices without constantly depending on a data center. It integrates the LiteRT runtime with MediaTek’s NeuroPilot NPU stack and exposes a unified API for deploying LLMs and embedding models.

Why This Matters

Historically, on-device ML ran mostly on CPUs and GPUs, while NPUs required fragmented, vendor-specific tools that were hard to debug. Because builds had to be produced separately for each hardware target, the number of binaries grew combinatorially, which raised development cost and time-to-market for deploying models on diverse hardware.

Key Insights

AOT Compilation Recommendation: On-device compilation of models like Gemma-3-270M can take over a minute, making Ahead-of-Time (AOT) compilation the practical choice for production LLM deployments.
Unified API: LiteRT NeuroPilot provides a single Accelerator.NPU abstraction, simplifying code and reducing conditional logic for targeting different hardware backends (CPU, GPU, NPU).
Zero-Copy Buffers: LiteRT integrates with Android’s AHardwareBuffer and GPU buffers, enabling zero-copy tensor transfers for performance-critical tasks like real-time video processing.
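To make the unified-API point concrete, here is a minimal sketch of how a single accelerator enum can replace per-vendor branching. Everything in it (the Accelerator enum, DeviceCaps, IsAvailable, PickAccelerator) is a hypothetical stand-in written for illustration; none of it is the LiteRT API, and the NPU, GPU, CPU preference order is an assumption.

```cpp
// Hypothetical stand-in for a runtime accelerator enum (not the LiteRT type).
enum class Accelerator { kNpu, kGpu, kCpu };

// Hypothetical capability flags an app might detect at startup.
struct DeviceCaps {
  bool has_npu;
  bool has_gpu;
};

// Reports whether a backend is usable; the CPU is treated as always present.
constexpr bool IsAvailable(Accelerator a, const DeviceCaps& caps) {
  switch (a) {
    case Accelerator::kNpu: return caps.has_npu;
    case Accelerator::kGpu: return caps.has_gpu;
    case Accelerator::kCpu: return true;
  }
  return false;
}

// Walks one preference list (NPU, then GPU, then CPU) and returns the first
// available backend, so call sites need no per-vendor conditionals.
constexpr Accelerator PickAccelerator(const DeviceCaps& caps) {
  constexpr Accelerator kPreference[] = {
      Accelerator::kNpu, Accelerator::kGpu, Accelerator::kCpu};
  for (Accelerator a : kPreference) {
    if (IsAvailable(a, caps)) return a;
  }
  return Accelerator::kCpu;
}
```

In an application, the value returned by PickAccelerator would be translated once into the runtime's own constant (for example the kLiteRtHwAcceleratorNpu flag used in the Working Example), keeping the rest of the inference code identical across backends.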

Working Example

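// Assumes `env` (a LiteRT environment), `input_span`, and `output_span`
// were created earlier. The `->` and `*` usage implies that each factory
// call returns a pointer-like result wrapper.
// Load model compiled for NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);
// Create compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);
// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
// Dereference the wrappers before indexing them or passing them to Run
(*input_buffers)[0].Write<float>(input_span);
compiled->Run(*input_buffers, *output_buffers);
(*output_buffers)[0].Read<float>(output_span);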

Practical Applications

Vivo X300 Pro: Achieves over 1600 tokens per second in prefill and 28 tokens per second in decode with Gemma-3n E2B on the Dimensity 9500 NPU.
Pitfall: Relying on on-device compilation for larger LLMs can introduce significant latency, negatively impacting user experience and making AOT compilation essential for production deployments.
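The throughput figures above translate into response time with simple arithmetic: prompt tokens divided by the prefill rate, plus generated tokens divided by the decode rate, plus any one-time compile cost. The sketch below is a back-of-the-envelope estimate only. The Throughput struct and EstimateLatencySeconds helper are hypothetical names, the rates are the rounded figures quoted above, and the model assumes constant throughput with no other overhead.

```cpp
#include <cassert>

// Hypothetical helper types for a rough latency estimate (not part of LiteRT).
struct Throughput {
  double prefill_tokens_per_s;  // e.g. 1600 for the quoted prefill figure
  double decode_tokens_per_s;   // e.g. 28 for the quoted decode figure
};

// Estimated seconds until the full response is generated.
// compile_seconds models a one-time on-device compile; it is 0 when the
// model was compiled ahead of time.
inline double EstimateLatencySeconds(int prompt_tokens, int output_tokens,
                                     const Throughput& t,
                                     double compile_seconds = 0.0) {
  assert(t.prefill_tokens_per_s > 0.0 && t.decode_tokens_per_s > 0.0);
  const double prefill_s = prompt_tokens / t.prefill_tokens_per_s;
  const double decode_s = output_tokens / t.decode_tokens_per_s;
  return compile_seconds + prefill_s + decode_s;
}
```

With these rates, a 1600-token prompt followed by 28 generated tokens comes to about 1 s of prefill plus 1 s of decode. Adding an illustrative 60 s on-device compile (the order of magnitude quoted for Gemma-3-270M, used here purely as an assumption) pushes the first response to roughly 62 s, which is why AOT compilation matters for production.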

References:

https://www.marktechpost.com/2025/12/09/google-litert-neuropilot-stack-turns-mediatek-dimensity-npus-into-first-class-targets-for-on-device-llms/
