Deep Dive: Proxmox Cluster Synchronization via Corosync and pmxcfs Internals

These articles are AI-generated summaries. Please check the original sources for full details.

Deep Dive: How Proxmox Actually Keeps Your Cluster in Sync (Corosync & pmxcfs Internals)

Proxmox VE manages cluster configuration through the Proxmox Cluster File System (pmxcfs), which keeps the configuration tree in memory, persists it to a SQLite database on local disk, and exposes it as a FUSE filesystem mounted at /etc/pve. Updates are replicated through Corosync, whose Totem Single-Ring Protocol ensures every node receives messages in exactly the same order. This tight coupling between network messaging and local disk I/O forms the backbone of Proxmox cluster consistency.

Why This Matters

While ideal distributed models often abstract away hardware specifics, Proxmox clustering depends directly on local storage performance for cluster-wide stability. Because pmxcfs issues a synchronous fsync() to the physical disk on every node before a transaction is committed, storage latency is not just a performance bottleneck but a primary stability risk.
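To make the cost of a synchronous fsync() concrete, the following is a minimal sketch that counts how many write-plus-fsync cycles a directory's backing disk completes per second. The function name `measure_fsync_rate`, the 512-byte payload, and the one-second window are illustrative assumptions; this is not how pmxcfs or pveperf measure anything internally.

```python
import os
import tempfile
import time


def measure_fsync_rate(directory: str, duration_s: float = 1.0) -> float:
    """Return write+fsync cycles per second completed on `directory`'s disk.

    Hypothetical helper for illustration; payload size and duration are
    arbitrary assumptions.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        count = 0
        start = time.perf_counter()
        deadline = start + duration_s
        while True:
            os.write(fd, b"x" * 512)
            os.fsync(fd)  # blocks until the OS reports the data as persisted
            count += 1
            if time.perf_counter() >= deadline:
                break
        elapsed = time.perf_counter() - start
        return count / elapsed
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    rate = measure_fsync_rate(tempfile.gettempdir())
    print(f"{rate:.0f} fsync/s")
```

Point `directory` at a path on the disk that holds the Proxmox system installation; a temporary directory backed by RAM (tmpfs) would report misleadingly high numbers.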

A single node with high disk latency can stall Corosync token circulation across the entire ring. If the stall outlasts the token timeout, the rest of the cluster may declare the node dead, and where HA is enabled this can lead to unnecessary fencing and service interruptions in an otherwise healthy environment.
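The stall described above can be illustrated with a deliberately simplified model: treat one token rotation as the sum of each node's hold time plus a fixed network hop, and compare it with a timeout. The `Node` class, the millisecond figures, and the 1000 ms timeout are all assumptions chosen for illustration; real Corosync timing involves retransmits and configurable timeouts that this sketch ignores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    hold_ms: float  # time the node keeps the token (assumed to include any local disk stall)


def rotation_time_ms(nodes: List[Node], hop_ms: float) -> float:
    """One full token rotation: every node's hold time plus one network hop each."""
    return sum(n.hold_ms + hop_ms for n in nodes)


def ring_is_stable(nodes: List[Node], hop_ms: float, token_timeout_ms: float) -> bool:
    """The ring is considered stable if the token returns before the timeout."""
    return rotation_time_ms(nodes, hop_ms) < token_timeout_ms


healthy = [Node("pve1", 2.0), Node("pve2", 2.0), Node("pve3", 2.0)]
one_slow_disk = [Node("pve1", 2.0), Node("pve2", 2.0), Node("pve3", 1200.0)]

# Illustrative timeout only; not a statement about Corosync defaults.
TOKEN_TIMEOUT_MS = 1000.0

print(ring_is_stable(healthy, hop_ms=0.5, token_timeout_ms=TOKEN_TIMEOUT_MS))
print(ring_is_stable(one_slow_disk, hop_ms=0.5, token_timeout_ms=TOKEN_TIMEOUT_MS))
```

In this toy model, the two fast nodes cannot compensate for the slow one: the rotation time is a sum, so a single large term dominates it.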

Key Insights

  • The Totem Single-Ring Protocol (Corosync totemsrp.c) prevents write conflicts by allowing only the node currently holding the token to multicast messages.
  • Virtual Synchrony is maintained through the ARU (All Received Up to) sequence number, which acts as a cluster-wide receipt for message delivery.
  • pmxcfs keeps the cluster configuration in memory, persists it to a SQLite database file on each node's local disk, mirrors it across nodes, and presents it as a filesystem via FUSE mounting.
  • Every configuration change requires an immediate fsync() to the backing SQLite file on every node, blocking until the OS confirms physical persistence.
  • The pveperf benchmark tool exposes the gap between storage classes: SSDs achieve over 3,000 fsync/s, while USB sticks often drop below 50 fsync/s.
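The token and ARU ideas in the list above can be sketched as a toy simulation. Only the token holder may send, every message gets the next sequence number, and the ARU is taken as the lowest gap-free sequence number any node has received. The `RingNode` class, the `lost` parameter, and the message strings are hypothetical; this is not the logic of totemsrp.c, and it omits retransmission entirely.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Tuple


class RingNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.outbox: deque = deque()                  # messages waiting for the token
        self.received: Dict[int, Tuple[str, str]] = {}  # seq -> (sender, payload)
        self.received_up_to = 0                       # highest seq received without gaps

    def delivered(self) -> List[Tuple[int, str, str]]:
        """Messages this node can deliver, in sequence order, with no gaps."""
        return [(s, *self.received[s]) for s in range(1, self.received_up_to + 1)]


def circulate(nodes: List[RingNode], rotations: int = 1,
              lost: FrozenSet[Tuple[int, str]] = frozenset()) -> int:
    """Pass the token around the ring and return the cluster-wide ARU.

    `lost` holds (seq, node_name) pairs simulating messages a node never received.
    """
    seq = 0  # sequence counter carried by the token
    for _ in range(rotations):
        for holder in nodes:  # only the current token holder sends
            while holder.outbox:
                seq += 1
                payload = holder.outbox.popleft()
                for node in nodes:
                    if (seq, node.name) not in lost:
                        node.received[seq] = (holder.name, payload)
    for node in nodes:
        while node.received_up_to + 1 in node.received:
            node.received_up_to += 1
    return min(node.received_up_to for node in nodes)


def make_ring() -> List[RingNode]:
    a, b, c = RingNode("a"), RingNode("b"), RingNode("c")
    a.outbox.append("lock vm 100")
    b.outbox.append("edit 100.conf")
    c.outbox.append("unlock vm 100")
    return [a, b, c]


ring = make_ring()
aru = circulate(ring)
print(aru, ring[0].delivered() == ring[1].delivered() == ring[2].delivered())

lossy = make_ring()
print(circulate(lossy, lost=frozenset({(2, "c")})))
```

With no loss, every node ends up with the same ordered list and the ARU equals the last sequence number. When node c misses message 2, the ARU stays at 1 even though the other nodes have everything, which is what lets the ring know a message is not yet safe everywhere.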

Practical Applications

  • System Disk Selection: Administrators should prioritize high-end NVMe or SATA SSDs for the Proxmox OS drive to maintain high fsync rates. Pitfall: Using SD cards or USB sticks for boot media leads to token circulation delays and cluster instability.
  • Pre-Cluster Benchmarking: Use the pveperf utility to verify fsync performance on new hardware before joining it to a production cluster. Pitfall: Ignoring system disk I/O while focusing exclusively on VM storage performance can cause unexpected node fencing.
  • Cluster Topology Planning: Ensure Corosync network paths have minimal jitter to prevent network-induced token timeouts. Pitfall: High network latency combined with slow system disks creates a cumulative delay that triggers node-death declarations.
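The pre-cluster benchmarking step above could be scripted along these lines. The sketch assumes pveperf reports its result on a line beginning with `FSYNCS/SECOND:`; the helper names, the sample text (invented for illustration, not a real measurement), and the 200 fsync/s floor are assumptions to adjust for your own environment.

```python
import re
from typing import Optional

# Assumed output format: a line such as "FSYNCS/SECOND:     42.17"
FSYNC_LINE = re.compile(r"^FSYNCS/SECOND:\s+([0-9.]+)\s*$", re.MULTILINE)


def parse_fsyncs_per_second(pveperf_output: str) -> Optional[float]:
    """Extract the fsync rate from pveperf text output, or None if absent."""
    match = FSYNC_LINE.search(pveperf_output)
    return float(match.group(1)) if match else None


def disk_ok_for_cluster(pveperf_output: str, min_fsyncs: float = 200.0) -> bool:
    """True if the reported rate meets the (assumed) minimum threshold."""
    rate = parse_fsyncs_per_second(pveperf_output)
    return rate is not None and rate >= min_fsyncs


# Invented sample text for illustration only.
SAMPLE = """\
BUFFERED READS:    450.00 MB/sec
FSYNCS/SECOND:     42.17
"""

print(parse_fsyncs_per_second(SAMPLE))
print(disk_ok_for_cluster(SAMPLE))
```

On a real node, the text would come from running pveperf and capturing its standard output, for example with `subprocess.run(["pveperf"], capture_output=True, text=True)`.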
