New llama.cpp Server Feature: Dynamic Model Management

These articles are AI-generated summaries. Please check the original sources for full details.

New in llama.cpp: Model Management

The llama.cpp server now supports “router mode,” allowing dynamic loading, unloading, and switching between multiple models without requiring server restarts. This addresses a frequent user request for Ollama-style model management within the llama.cpp ecosystem.

Until now, switching models meant restarting the server, which interrupts in-flight requests and leaves the endpoint unavailable while the new model loads. Router mode removes that restart, so one long-running server can swap models as demand changes, improving both uptime and resource utilization. In production environments, even a brief interruption degrades the user experience and can carry a real cost.

Key Insights

  • Ollama-style management: Inspired by the popular Ollama framework, offering a familiar workflow.
  • Multi-process architecture: Each model runs in its own process, isolating failures and improving stability.
  • LRU eviction: Once the maximum number of loaded models (--models-max) is reached, the least recently used model is unloaded automatically to make room for the next one, freeing up VRAM.
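
The LRU policy described above can be illustrated with a small, self-contained sketch. This is not the llama.cpp implementation; the ModelLRU class and its use() method are hypothetical names invented for illustration, and models_max simply plays the role of --models-max.

```python
from collections import OrderedDict


class ModelLRU:
    """Toy stand-in for a router's table of loaded models (illustrative only)."""

    def __init__(self, models_max):
        if models_max < 1:
            raise ValueError("models_max must be at least 1")
        self.models_max = models_max   # plays the role of --models-max
        self.loaded = OrderedDict()    # oldest use first, newest use last

    def use(self, name):
        """Record a request for `name`; return the models evicted to make room."""
        evicted = []
        if name in self.loaded:
            # Already resident: just mark it as most recently used.
            self.loaded.move_to_end(name)
            return evicted
        # Make room before loading, dropping least recently used models first.
        while len(self.loaded) >= self.models_max:
            oldest, _ = self.loaded.popitem(last=False)
            evicted.append(oldest)
        self.loaded[name] = True
        return evicted


cache = ModelLRU(models_max=2)
cache.use("a.gguf")
cache.use("b.gguf")
cache.use("a.gguf")            # a.gguf becomes the most recently used
evicted = cache.use("c.gguf")  # the table is full, so b.gguf is dropped
print(evicted, list(cache.loaded))  # ['b.gguf'] ['a.gguf', 'c.gguf']
```

Note that a.gguf survives even though it was loaded first, because it was touched more recently than b.gguf. Eviction follows last use, not load order.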

Working Example

# Start the server in router mode (no model specified)
llama-server

# List available models
curl http://localhost:8080/models

# Manually load a model
curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'
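
For callers that prefer a script to curl, the same two requests can be built with Python's standard library. This is a minimal sketch that assumes only what the commands above show: the server listens on localhost:8080, and /models/load accepts a JSON body with a "model" field. The helper names are hypothetical, and the assumption that replies are JSON is mine rather than the article's.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # address used in the curl example


def list_models_request(base_url=BASE_URL):
    """Build the GET /models request."""
    return urllib.request.Request(f"{base_url}/models", method="GET")


def load_model_request(model, base_url=BASE_URL):
    """Build the POST /models/load request for `model`."""
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/models/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a request and decode the reply, assuming the server answers in JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Build (but do not send) the load request so the sketch runs without a server.
req = load_model_request("my-model.gguf")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

With llama-server running in router mode, send(list_models_request()) and send(req) would perform the actual calls. Keeping request construction separate from sending makes the payload easy to inspect first.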

Practical Applications

  • A/B Testing: Run different model versions concurrently to compare performance on real-world data.
  • Multi-tenant deployments: Serve multiple users or applications with different model requirements on a single server.
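
One way to set up the A/B split mentioned above is to assign each user to a model variant deterministically, so the same user always reaches the same model. The sketch below is an assumption about how a client application might do this, not something llama.cpp provides. The variant file names and the pick_variant helper are hypothetical.

```python
import hashlib

# Hypothetical model names; substitute whatever the server's model list reports.
VARIANTS = ["my-model-v1.gguf", "my-model-v2.gguf"]


def pick_variant(user_id, variants=VARIANTS):
    """Map a user ID to one variant, always the same one for the same ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and fold it into a bucket index.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


for user in ("alice", "bob", "carol"):
    print(user, "->", pick_variant(user))
```

Hashing the ID, rather than choosing at random on each request, keeps a user's experience consistent across sessions and makes results attributable to a single variant.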
