Deep Dive: Proxmox Cluster Synchronization via Corosync and pmxcfs Internals
These articles are AI-generated summaries. Please check the original sources for full details.
Proxmox VE manages cluster configuration through the Proxmox Cluster File System (pmxcfs), which keeps the configuration tree in memory, persists it to a local SQLite database file, and exposes it as a FUSE filesystem mounted at /etc/pve. Changes are replicated through Corosync, whose Totem Single-Ring Protocol ensures that every node receives messages in exactly the same order. This tight coupling between network messaging and local disk I/O forms the backbone of Proxmox cluster consistency.
Why This Matters
Distributed-systems models usually abstract away hardware specifics, but Proxmox clustering ties cluster-wide stability directly to local storage performance. Because pmxcfs requires a synchronous fsync() to the physical disk on every node before a transaction is committed, storage latency is a primary stability risk, not just a performance bottleneck.
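To make the cost of a synchronous fsync() concrete, the following is a minimal sketch that times repeated write-plus-fsync cycles on a scratch file. The function name `measure_fsync_rate`, the 512-byte payload, and the default of 200 writes are assumptions chosen for illustration; this is not how pmxcfs or pveperf is implemented.

```python
import os
import tempfile
import time


def measure_fsync_rate(directory: str, writes: int = 200) -> float:
    """Return the number of write+fsync cycles per second for `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, b"x" * 512)  # small payload, an arbitrary choice
            os.fsync(fd)              # block until the OS reports persistence
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / elapsed


if __name__ == "__main__":
    rate = measure_fsync_rate(tempfile.gettempdir())
    print(f"{rate:.0f} fsync/s")
```

Point `directory` at a path on the disk you care about. The system temp directory may be a RAM-backed filesystem, in which case the result says nothing about the physical disk.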
A single node with high disk latency can stall the Corosync token circulation across the entire ring. The delay can cascade: once the token fails to arrive within its timeout, the cluster may declare a node dead, which leads to unnecessary fencing and potential service interruptions in an environment that otherwise appears healthy.
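The stall described above can be illustrated with a deliberately simple toy model: one token rotation costs the sum of each node's hold time plus one network hop per node, and the result is compared against a timeout. The names `Node`, `rotation_time_ms`, and `exceeds_timeout`, and every number used here, are hypothetical. Corosync's real timer logic is more involved than this.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    hold_ms: float  # hypothetical token hold time, including local disk waits


def rotation_time_ms(nodes, hop_ms):
    """One full rotation: every node's hold time plus one hop per node."""
    return sum(n.hold_ms for n in nodes) + hop_ms * len(nodes)


def exceeds_timeout(nodes, hop_ms, token_timeout_ms):
    """True if a full rotation takes longer than the (illustrative) timeout."""
    return rotation_time_ms(nodes, hop_ms) > token_timeout_ms


healthy = [Node("pve1", 2.0), Node("pve2", 2.0), Node("pve3", 2.0)]
degraded = [Node("pve1", 2.0), Node("pve2", 1200.0), Node("pve3", 2.0)]

print(rotation_time_ms(healthy, hop_ms=0.5))   # 7.5
print(rotation_time_ms(degraded, hop_ms=0.5))  # 1205.5
print(exceeds_timeout(degraded, hop_ms=0.5, token_timeout_ms=1000.0))  # True
```

In this model, one slow node is enough to push the whole ring past the timeout even though the other two nodes and the network are unchanged.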
Key Insights
- The Totem Single-Ring Protocol (Corosync totemsrp.c) prevents write conflicts by allowing only the node currently holding the token to multicast messages.
- Virtual Synchrony is maintained through the ARU (All Received Up to) sequence number, which acts as a cluster-wide receipt for message delivery.
- pmxcfs keeps the cluster configuration in memory, backs it with a SQLite database file that is mirrored across nodes, and presents it as a filesystem via FUSE.
- Every configuration change requires an immediate fsync() to the backing SQLite file on every node, blocking until the OS confirms physical persistence.
- The pveperf benchmark tool exposes the gap between storage types: SSDs achieve over 3,000 fsync/s, while USB sticks often drop below 50 fsync/s.
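The token and ARU ideas in the list above can be sketched as a small simulation. This is a toy model under stated assumptions: multicast is lossless, the ARU is recomputed once per rotation rather than updated as the token passes each node, and messages are delivered only up to the ARU. The `Token`, `Node`, and `rotate` names are hypothetical and do not mirror the structures in totemsrp.c.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    seq: int = 0  # highest sequence number assigned so far
    aru: int = 0  # "all received up to": every node has messages <= aru


@dataclass
class Node:
    name: str
    outbox: list = field(default_factory=list)     # waiting for the token
    received: dict = field(default_factory=dict)   # seq -> payload
    delivered: list = field(default_factory=list)  # payloads in delivery order

    def my_aru(self) -> int:
        """Highest seq such that every message 1..seq has been received."""
        n = 0
        while n + 1 in self.received:
            n += 1
        return n


def rotate(token, nodes):
    """Pass the token once around the ring; only the holder may multicast."""
    for holder in nodes:
        for payload in holder.outbox:
            token.seq += 1                # the token assigns the global order
            for peer in nodes:            # lossless multicast (assumption)
                peer.received[token.seq] = payload
        holder.outbox.clear()
    token.aru = min(n.my_aru() for n in nodes)
    for n in nodes:                       # deliver in sequence order up to aru
        while len(n.delivered) < token.aru:
            n.delivered.append(n.received[len(n.delivered) + 1])


nodes = [Node("pve1"), Node("pve2"), Node("pve3")]
nodes[2].outbox.append("set vm 100 memory")
nodes[0].outbox.append("create vm 101")
token = Token()
rotate(token, nodes)
for n in nodes:
    print(n.name, n.delivered)  # every node prints the same ordered list
```

Because sequence numbers come only from the token, two nodes that queue changes at the same moment still end up with one agreed order: here pve1 holds the token first, so its message is sequenced ahead of the one from pve3 on every node.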
Practical Applications
- System Disk Selection: Administrators should prioritize high-end NVMe or SATA SSDs for the Proxmox OS drive to maintain high fsync rates. Pitfall: Using SD cards or USB sticks for boot media leads to token circulation delays and cluster instability.
- Pre-Cluster Benchmarking: Utilize the pveperf utility to verify fsync performance on new hardware before joining it to a production cluster. Pitfall: Ignoring system disk I/O while focusing exclusively on VM storage performance can cause unexpected node fencing.
- Cluster Topology Planning: Ensure Corosync network paths have minimal jitter to prevent network-induced token timeouts. Pitfall: High network latency combined with slow system disks creates a cumulative delay that triggers node-death declarations.
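The pre-cluster benchmarking step can be turned into a simple gate. The sketch below assumes that pveperf output contains a line of the form `FSYNCS/SECOND: <number>`. The sample strings are invented for illustration, and the 3,000 fsync/s threshold simply reuses the SSD figure quoted earlier rather than any official recommendation. `parse_fsyncs` and `ready_for_cluster` are hypothetical helper names.

```python
import re

MIN_FSYNCS_PER_SECOND = 3000.0  # illustrative threshold, not an official value


def parse_fsyncs(pveperf_output: str):
    """Extract the fsync rate from pveperf-style output, or None if absent."""
    match = re.search(r"^FSYNCS/SECOND:\s*([0-9.]+)", pveperf_output, re.MULTILINE)
    return float(match.group(1)) if match else None


def ready_for_cluster(pveperf_output: str, minimum: float = MIN_FSYNCS_PER_SECOND) -> bool:
    """True only if a rate was found and it meets the minimum."""
    rate = parse_fsyncs(pveperf_output)
    return rate is not None and rate >= minimum


# Invented sample lines standing in for captured pveperf output.
ssd_sample = "FSYNCS/SECOND:     4120.55\n"
usb_sample = "FSYNCS/SECOND:     42.10\n"

print(ready_for_cluster(ssd_sample))  # True
print(ready_for_cluster(usb_sample))  # False
```

In practice the text would come from running pveperf on the candidate node and capturing its output. Treating a missing line as a failure keeps the check conservative.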