Google LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First-Class Targets for On-Device LLMs

These articles are AI-generated summaries. Please check the original sources for full details.

Google and MediaTek’s new LiteRT NeuroPilot Accelerator streamlines on-device AI: generative models can run directly on phones, laptops, and IoT devices instead of depending on a constant connection to a data center. It integrates the LiteRT runtime with MediaTek’s NeuroPilot NPU stack and exposes a unified API for deploying LLMs and embedding models.

Why This Matters

Historically, on-device ML relied heavily on CPUs and GPUs, while NPUs required fragmented, vendor-specific tools and complex debugging. This fragmentation resulted in a combinatorial explosion of binaries and significant development overhead, increasing costs and time-to-market for deploying models on diverse hardware.

Key Insights

  • AOT Compilation Recommendation: On-device compilation of models like Gemma-3-270M can take over a minute, making Ahead-of-Time (AOT) compilation the practical choice for production LLM deployments.
  • Unified API: LiteRT NeuroPilot provides a single Accelerator.NPU abstraction, simplifying code and reducing conditional logic for targeting different hardware backends (CPU, GPU, NPU).
  • Zero-Copy Buffers: LiteRT integrates with Android’s AHardwareBuffer and GPU buffers, enabling zero-copy tensor transfers for performance-critical tasks like real-time video processing.
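To illustrate the conditional logic that a single accelerator abstraction is meant to absorb, here is a minimal sketch of a preference-ordered backend picker. The `Accelerator` enum, `PickAccelerator`, and `ToString` are hypothetical names invented for this example; they are not LiteRT types, and the real runtime may resolve backends differently.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a runtime's accelerator enum (not the LiteRT type).
enum class Accelerator { kNpu, kGpu, kCpu };

// Returns the first entry of `preferred` that also appears in `available`,
// or std::nullopt when the device offers none of the preferred backends.
std::optional<Accelerator> PickAccelerator(
    const std::vector<Accelerator>& preferred,
    const std::vector<Accelerator>& available) {
  for (Accelerator want : preferred) {
    for (Accelerator have : available) {
      if (want == have) return want;
    }
  }
  return std::nullopt;
}

// Human-readable name, handy for logging which backend was chosen.
std::string ToString(Accelerator a) {
  switch (a) {
    case Accelerator::kNpu: return "NPU";
    case Accelerator::kGpu: return "GPU";
    case Accelerator::kCpu: return "CPU";
  }
  return "unknown";
}
```

With a preference order of NPU, GPU, CPU, a device that reports only GPU and CPU would resolve to the GPU. A unified API aims to spare application code from carrying this kind of per-device branching.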

Working Example

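// Fragment: assumes the LiteRT C++ headers are included, `using namespace litert;`,
// and that input_span / output_span are caller-provided float spans.
// Each Create* call returns an Expected<T>; check it before dereferencing in real code.
auto env = Environment::Create({});
// Load model compiled for NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);
// Create compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);
// Allocate buffers and run (the buffer vectors are wrapped in Expected, so dereference first)
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
(*input_buffers)[0].Write<float>(input_span);
compiled->Run(*input_buffers, *output_buffers);
(*output_buffers)[0].Read(output_span);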

Practical Applications

  • Vivo X300 Pro: Achieves over 1600 tokens per second in prefill and 28 tokens per second in decode with Gemma-3n E2B on the Dimensity 9500 NPU.
  • Pitfall: Relying on on-device compilation for larger LLMs can introduce significant latency, negatively impacting user experience and making AOT compilation essential for production deployments.
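To put the throughput figures above in perspective, the following sketch turns them into a rough response-time estimate. It assumes the prompt is processed at the quoted prefill rate and every generated token at the quoted decode rate, and it ignores model load and compile time. The function name and the example token counts are illustrative assumptions, not measurements.

```cpp
// Throughput figures quoted in the text for Gemma-3n E2B on the Dimensity 9500 NPU.
// The prefill figure is reported as "over 1600", so real prefill may be faster.
constexpr double kPrefillTokensPerSec = 1600.0;
constexpr double kDecodeTokensPerSec = 28.0;

// Rough end-to-end latency in seconds: prompt tokens at prefill speed
// plus generated tokens at decode speed.
constexpr double EstimateLatencySeconds(int prompt_tokens, int output_tokens) {
  return prompt_tokens / kPrefillTokensPerSec +
         output_tokens / kDecodeTokensPerSec;
}
```

For example, `EstimateLatencySeconds(800, 140)` gives 0.5 s of prefill plus 5.0 s of decode, about 5.5 s in total. Under these assumptions decode speed dominates the time a user waits for anything but very short answers.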
