Deep Dive: Proxmox Cluster Synchronization via Corosync and pmxcfs Internals
These articles are AI-generated summaries. Please check the original sources for full details.
Deep Dive: How Proxmox Actually Keeps Your Cluster in Sync (Corosync & pmxcfs Internals)
Proxmox VE manages cluster configuration through the Proxmox Cluster File System (pmxcfs), which stores its data in a SQLite database, keeps a working copy in memory, and exposes it as a FUSE-mounted filesystem at /etc/pve. Replication relies on Corosync's Totem Single-Ring Protocol to ensure every node delivers messages in exactly the same order. This tight coupling between network messaging and physical disk I/O forms the backbone of Proxmox cluster consistency.
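The layering described above (an in-memory view in front of a durable SQLite file) can be sketched in a few lines. This is a toy model, not pmxcfs code: the class name `TinyConfigStore`, the `files` table, and the write-then-publish order are all illustrative assumptions.

```python
import sqlite3


class TinyConfigStore:
    """Toy model of an in-memory view backed by a SQLite file (not pmxcfs)."""

    def __init__(self, db_path):
        self.memory = {}  # what readers see
        self.db = sqlite3.connect(db_path)
        # FULL asks SQLite to sync data to disk before a commit returns.
        self.db.execute("PRAGMA synchronous = FULL")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, data BLOB)"
        )
        self.db.commit()
        # Rebuild the in-memory view from disk on startup.
        for path, data in self.db.execute("SELECT path, data FROM files"):
            self.memory[path] = data

    def write(self, path, data):
        # Persist first; the commit blocks until the disk sync completes.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO files (path, data) VALUES (?, ?)",
                (path, data),
            )
        # Only then expose the new content to readers.
        self.memory[path] = data

    def read(self, path):
        return self.memory[path]

    def close(self):
        self.db.close()
```

Reads are served from memory and never touch the disk, while every write pays the full cost of a durable commit. That asymmetry is why slow system disks hurt configuration changes in particular.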
Why This Matters
While ideal distributed models often abstract away hardware specifics, Proxmox clustering reveals a rigid dependency on local storage performance for global network stability. Because pmxcfs requires a synchronous fsync() to the physical disk on every node before a transaction is committed, storage latency is not just a performance bottleneck but a primary stability risk.
A single node with high disk latency can stall Corosync token circulation across the entire ring. The delay can cascade: Corosync may declare the slow node dead, leading to unnecessary fencing and service interruptions in an environment that otherwise appears healthy.
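A back-of-the-envelope model makes the stall easier to see. The sketch below assumes, as a simplification, that one token rotation costs the sum of every node's hold time plus a fixed per-hop network latency. The function names and the 1000 ms timeout are illustrative assumptions, not values read from Corosync.

```python
def token_rotation_ms(hold_times_ms, hop_latency_ms):
    """Time for the token to visit every node once in this toy model."""
    return sum(hold_times_ms) + hop_latency_ms * len(hold_times_ms)


def ring_is_stable(hold_times_ms, hop_latency_ms, token_timeout_ms):
    """True if a full rotation completes before the (assumed) token timeout."""
    return token_rotation_ms(hold_times_ms, hop_latency_ms) < token_timeout_ms


# Four nodes, each holding the token for 2 ms, with 1 ms per network hop.
healthy = [2, 2, 2, 2]
# Same ring, but one node blocks for 1200 ms while waiting on its disk.
degraded = [2, 2, 2, 1200]

ASSUMED_TIMEOUT_MS = 1000  # hypothetical value for illustration

print(token_rotation_ms(healthy, 1))                      # 12
print(ring_is_stable(healthy, 1, ASSUMED_TIMEOUT_MS))     # True
print(token_rotation_ms(degraded, 1))                     # 1210
print(ring_is_stable(degraded, 1, ASSUMED_TIMEOUT_MS))    # False
```

In this model the three fast nodes cannot compensate for the slow one, because rotation time is a sum and not an average.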
Key Insights
- The Totem Single-Ring Protocol (Corosync totemsrp.c) prevents write conflicts by allowing only the node currently holding the token to multicast messages.
- Virtual Synchrony is maintained through the ARU (All Received Up to) sequence number, which acts as a cluster-wide receipt for message delivery.
- pmxcfs keeps its data in a SQLite database that is held in memory, mirrored across nodes, and presented as a filesystem through a FUSE mount.
- Every configuration change requires an immediate fsync() to the backing SQLite file on every node, blocking until the OS confirms physical persistence.
- The pveperf benchmark tool exposes the gap between storage types: SSDs achieve over 3,000 fsync/s, while USB sticks often drop below 50 fsync/s.
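The first two insights can be illustrated with a small simulation. This is a heavily simplified sketch, not the logic in totemsrp.c: the `Node` class, `run_ring`, the `drop` hook for simulating lost messages, and the choice to compute the ring-wide ARU as the minimum of the per-node values are all assumptions made for illustration.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.received = {}  # seq -> message

    def receive(self, seq, msg):
        self.received[seq] = msg

    def my_aru(self):
        """Highest seq such that every seq from 1 up to it was received."""
        aru = 0
        while aru + 1 in self.received:
            aru += 1
        return aru


def deliver_in_order(node):
    """Messages a node may deliver: gap-free, in sequence order."""
    return [node.received[s] for s in range(1, node.my_aru() + 1)]


def run_ring(nodes, outboxes, drop=frozenset()):
    """Pass the token once around the ring; only the holder may multicast.

    outboxes: node name -> list of pending messages
    drop: set of (receiver name, seq) pairs lost in transit (hypothetical hook)
    Returns the ring-wide ARU, taken here as the lowest per-node ARU.
    """
    token_seq = 0  # the token carries the next global sequence number
    for holder in nodes:
        for msg in outboxes.get(holder.name, []):
            token_seq += 1
            for n in nodes:
                if (n.name, token_seq) not in drop:
                    n.receive(token_seq, f"{holder.name}:{msg}")
        outboxes[holder.name] = []
    return min(n.my_aru() for n in nodes)


nodes = [Node("a"), Node("b"), Node("c")]
outboxes = {"a": ["w1"], "b": ["w2"], "c": ["w3"]}
# Node c misses message 2, so it cannot deliver past sequence 1.
ring_aru = run_ring(nodes, outboxes, drop={("c", 2)})
```

Because sequence numbers are assigned only by the token holder, two nodes can never claim the same slot. A node with a gap holds back delivery until the missing message arrives, which is what keeps the delivery order identical everywhere.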
Practical Applications
- System Disk Selection: Administrators should prioritize high-end NVMe or SATA SSDs for the Proxmox OS drive to maintain high fsync rates. Pitfall: Using SD cards or USB sticks for boot media leads to token circulation delays and cluster instability.
- Pre-Cluster Benchmarking: Use the pveperf utility to verify fsync performance on new hardware before joining it to a production cluster. Pitfall: Focusing exclusively on VM storage performance while ignoring system disk I/O can cause unexpected node fencing.
- Cluster Topology Planning: Ensure Corosync network paths have minimal jitter to prevent network-induced token timeouts. Pitfall: High network latency combined with slow system disks creates a cumulative delay that triggers node-death declarations.
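For hosts where pveperf is not yet installed, a rough fsync rate can be estimated with a short script. This is only an approximation under stated assumptions: the function name `fsyncs_per_second`, the 512-byte payload, and the one-second window are arbitrary choices, and the method is not claimed to match how pveperf measures.

```python
import os
import tempfile
import time


def fsyncs_per_second(directory, duration_s=1.0, payload=b"x" * 512):
    """Count write+fsync cycles completed per second on `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        count = 0
        start = time.perf_counter()
        deadline = start + duration_s
        while True:
            os.write(fd, payload)
            os.fsync(fd)  # block until the OS reports the data as persisted
            count += 1
            if time.perf_counter() >= deadline:
                break
        elapsed = time.perf_counter() - start
        return count / elapsed
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Point this at a directory on the disk that will hold the system.
    rate = fsyncs_per_second(tempfile.gettempdir())
    print(f"{rate:.0f} fsync/s")
```

Run it against a directory on the intended system disk and compare the result with the figures cited above. A temporary directory backed by RAM will report misleadingly high numbers.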