Building a Fully Offline AI-Assisted Linux Development Workstation

My fully offline AI-assisted Linux development machine

Engineer Deepu K Sasidharan has deployed a fully offline AI coding environment on an ASUS ROG Flow Z13 tablet-workstation. The system leverages 128GB of unified memory to dedicate 64GB specifically to the GPU for local LLM inference.

Why This Matters

Technical reality often clashes with cloud-dependent AI workflows due to privacy concerns and the ‘techno-oligarchy’ of remote APIs. By utilizing local ROCm/HIP acceleration on integrated Radeon graphics, developers can eliminate token costs and data leakage while maintaining a high-performance development loop. This setup demonstrates that modern consumer hardware with sufficient unified memory can effectively host 27B to 31B parameter models, providing a viable alternative to hosted frontier models for repository-wide coding tasks.

Key Insights

Arch Linux enables immediate access to the latest kernel, Mesa, and ROCm-adjacent bits required for bleeding-edge hardware like the AMD Ryzen AI Max+ 395.
Niri, a scrolling Wayland compositor, replaces traditional tiling grids with a fluid horizontal column workflow optimized for ultrawide displays.
Qwen3.6 27B models at Q8_0 quantization achieve 7.18 generation tokens/s on integrated Radeon 8060S GPUs using ROCm acceleration.
The DankMaterialShell (DMS) consolidates desktop plumbing—including clipboard management and system monitoring—into a single extensible shell interface.
Building llama.cpp with HIP support and Ninja allows for significant performance gains over standard wrappers like Ollama on AMD hardware.

Working Examples

OpenCode provider configuration for local llama.cpp server

{"$schema": "https://opencode.ai/config.json","provider": {"llama.cpp": {"npm": "@ai-sdk/openai-compatible","name": "llama.cpp ROCm (local)","options": {"baseURL": "http://127.0.0.1:18080/v1"},"models": {"qwen3-6-27b-q8-0": {"name": "Qwen3.6 27B Q8_0 (local ROCm)","limit": {"context": 262144,"output": 16384}}}}}}

Automated llama.cpp build script with ROCm/HIP support

cmake -S /mnt/work/Workspace/llms/llama.cpp -B /mnt/work/Workspace/llms/llama.cpp/build-hip -G Ninja -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release && cmake --build /mnt/work/Workspace/llms/llama.cpp/build-hip --config Release -j "$(nproc)" --target llama-server llama-bench

Local LLM server execution command with GPU offloading

ROCBLAS_USE_HIPBLASLT=1 llama-server --model "$model" --alias "$alias_name" --host 127.0.0.1 --port 18080 --ctx-size "$ctx" --n-gpu-layers 999 --flash-attn on --no-mmap --cache-type-k f16 --cache-type-v f16 --batch-size 4096 --ubatch-size 512 --reasoning "$reasoning"

Practical Applications

Use Case: Running Qwen3.6 27B for code review tasks to identify logic errors missed by hosted models. Pitfall: High context windows (256k) reduce generation speed to ~64 tokens/s and require significant VRAM allocation.
Use Case: Offline development during travel or in low-connectivity environments using OpenCode as a local agent. Pitfall: Bleeding-edge hardware like the Flow Z13 requires manual firmware fixes for Thunderbolt rescans and Wi-Fi quirks.
Use Case: Secure modification of private repositories where data sovereignty is mandated. Pitfall: Reasoning modes in local models can load up to 70% of available GPU memory, potentially starving the host OS during heavy multitasking.

References:

On This Page

My fully offline AI-assisted Linux development machine

Why This Matters

Key Insights

Working Examples

Practical Applications

Continue reading

Related Content

An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik

Optimizing AI Coding Workflows with Local Quality Pipelines

Troubleshooting High CPU and Memory Usage on Linux