Skip to main content

On This Page

Building a Fully Offline AI-Assisted Linux Development Workstation

3 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

My fully offline AI-assisted Linux development machine

Engineer Deepu K Sasidharan has deployed a fully offline AI coding environment on an ASUS ROG Flow Z13 tablet-workstation. The system leverages 128GB of unified memory to dedicate 64GB specifically to the GPU for local LLM inference.

Why This Matters

Technical reality often clashes with cloud-dependent AI workflows due to privacy concerns and the ‘techno-oligarchy’ of remote APIs. By utilizing local ROCm/HIP acceleration on integrated Radeon graphics, developers can eliminate token costs and data leakage while maintaining a high-performance development loop. This setup demonstrates that modern consumer hardware with sufficient unified memory can effectively host 27B to 31B parameter models, providing a viable alternative to hosted frontier models for repository-wide coding tasks.

Key Insights

  • Arch Linux enables immediate access to the latest kernel, Mesa, and ROCm-adjacent bits required for bleeding-edge hardware like the AMD Ryzen AI Max+ 395.
  • Niri, a scrolling Wayland compositor, replaces traditional tiling grids with a fluid horizontal column workflow optimized for ultrawide displays.
  • Qwen3.6 27B models at Q8_0 quantization achieve 7.18 generation tokens/s on integrated Radeon 8060S GPUs using ROCm acceleration.
  • The DankMaterialShell (DMS) consolidates desktop plumbing—including clipboard management and system monitoring—into a single extensible shell interface.
  • Building llama.cpp with HIP support and Ninja allows for significant performance gains over standard wrappers like Ollama on AMD hardware.

Working Examples

OpenCode provider configuration for local llama.cpp server

{"$schema": "https://opencode.ai/config.json","provider": {"llama.cpp": {"npm": "@ai-sdk/openai-compatible","name": "llama.cpp ROCm (local)","options": {"baseURL": "http://127.0.0.1:18080/v1"},"models": {"qwen3-6-27b-q8-0": {"name": "Qwen3.6 27B Q8_0 (local ROCm)","limit": {"context": 262144,"output": 16384}}}}}}

Automated llama.cpp build script with ROCm/HIP support

cmake -S /mnt/work/Workspace/llms/llama.cpp -B /mnt/work/Workspace/llms/llama.cpp/build-hip -G Ninja -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release && cmake --build /mnt/work/Workspace/llms/llama.cpp/build-hip --config Release -j "$(nproc)" --target llama-server llama-bench

Local LLM server execution command with GPU offloading

ROCBLAS_USE_HIPBLASLT=1 llama-server --model "$model" --alias "$alias_name" --host 127.0.0.1 --port 18080 --ctx-size "$ctx" --n-gpu-layers 999 --flash-attn on --no-mmap --cache-type-k f16 --cache-type-v f16 --batch-size 4096 --ubatch-size 512 --reasoning "$reasoning"

Practical Applications

  • Use Case: Running Qwen3.6 27B for code review tasks to identify logic errors missed by hosted models. Pitfall: High context windows (256k) reduce generation speed to ~64 tokens/s and require significant VRAM allocation.
  • Use Case: Offline development during travel or in low-connectivity environments using OpenCode as a local agent. Pitfall: Bleeding-edge hardware like the Flow Z13 requires manual firmware fixes for Thunderbolt rescans and Wi-Fi quirks.
  • Use Case: Secure modification of private repositories where data sovereignty is mandated. Pitfall: Reasoning modes in local models can load up to 70% of available GPU memory, potentially starving the host OS during heavy multitasking.

References:

Continue reading

Next article

Implementing OAuth 2.0 Device Flow for Input-Constrained Environments

Related Content