data systems from the ground up

Writing to a File and Why Appending Is the Only Sane Default

Chapter 2 of 36

The Black Box

Application code calls INSERT INTO package_events (package_id, status, timestamp) VALUES (...). The database returns success, and the developer assumes the data is on disk. Between the INSERT and the success response sit at least three layers of buffering, a write strategy that determines crash safety, and I/O costs that differ by anywhere from about 3x to 90x depending on whether the write is sequential or random.

The Mechanism

A write to a file on Linux moves through four stages, the last of which is persistent storage:

  1. Application buffer. The write() system call copies data from the application’s memory into the kernel’s page cache and returns as soon as the copy is done. The data is in RAM, not on disk.

  2. Kernel page cache. The OS holds the written data in memory and marks the page as dirty. Kernel writeback threads (the pdflush daemon on kernels older than 2.6.32) flush dirty pages to disk at some future time, by default within roughly 30 seconds. The timing is governed by the vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs sysctls.

  3. Disk controller cache. The drive itself usually has a volatile write cache, often tens to hundreds of megabytes. Data written to this cache can be lost on power failure unless the drive has a capacitor-backed cache (power-loss protection) or the OS issues a cache flush command.

  4. Persistent media. The data is on the actual flash cells (SSD) or magnetic platters (HDD).

fsync() forces the data through all four stages to persistent media and does not return until the drive reports the flush complete. Without fsync, a successful write() means only that the data is in the page cache. A power failure before writeback loses it.

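# Concept: observing the difference between buffered writes and fsync
# Write 100MB without fsync, then with a single fsync at the end (conv=fsync)
# Note: if /tmp is a tmpfs (RAM-backed) mount, point of= at a disk-backed path instead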

# Buffered write (data in page cache only)
dd if=/dev/zero of=/tmp/testfile bs=1M count=100
# Typical: 2.1 GB/s (writing to RAM)

# With fsync (data on persistent media)
dd if=/dev/zero of=/tmp/testfile bs=1M count=100 conv=fsync
# Typical: 450 MB/s on NVMe SSD, 120 MB/s on SATA SSD

The roughly 5x-17x difference between buffered writes and fsynced writes (2.1 GB/s against 450 MB/s and 120 MB/s) is the cost of durability. Every database makes a choice about when and how often to call fsync. PostgreSQL flushes its write-ahead log at every commit by default (synchronous_commit = on). Kafka by default leaves flushing to the OS and relies on replication for durability; the flush.messages and flush.ms settings can force periodic fsyncs. The tradeoff is always latency vs durability.
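The same comparison can be made from application code. The following is a minimal sketch, not a benchmark: the class name FsyncCost and the method timeWrites are illustrative, and unlike the dd run above it calls force(true) after every block rather than once at the end. Absolute timings depend entirely on the device and filesystem, and on a RAM-backed filesystem the two numbers will be close.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class FsyncCost {
    // Writes `count` blocks of `blockSize` bytes to `file`, optionally forcing
    // each block to the device, and returns the elapsed time in nanoseconds.
    static long timeWrites(Path file, int count, int blockSize, boolean forceEach)
            throws IOException {
        byte[] block = new byte[blockSize];
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < count; i++) {
                ByteBuffer buf = ByteBuffer.wrap(block);
                while (buf.hasRemaining()) {
                    ch.write(buf);      // lands in the page cache
                }
                if (forceEach) {
                    ch.force(true);     // fsync: wait for the device
                }
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fsync-cost", ".bin");
        try {
            long buffered = timeWrites(file, 200, 4096, false);
            long forced = timeWrites(file, 200, 4096, true);
            System.out.println("buffered: " + buffered / 1_000 + " us");
            System.out.println("forced:   " + forced / 1_000 + " us");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Forcing per block is the worst case for latency; it corresponds to the commit-per-record policy discussed later in this chapter.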

The Observable Consequence

Sequential vs Random Write Performance

Appending writes sequentially. Updating in place writes randomly. The performance difference is measurable:

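| Operation                     | NVMe SSD   | SATA SSD | HDD      |
|-------------------------------|------------|----------|----------|
| Sequential write (1MB blocks) | 3,200 MB/s | 520 MB/s | 180 MB/s |
| Random write (4KB blocks)     | 800 MB/s   | 180 MB/s | 2 MB/s   |
| Ratio (sequential/random)     | 4x         | 2.9x     | 90x      |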

On SSDs, the gap is far smaller than on spinning disks, but it still exists. The flash translation layer inside the SSD must map random 4KB writes to physical NAND pages, which increases write amplification and garbage collection work. Sequential writes align naturally with the SSD’s internal erase block boundaries.

On HDDs, the gap is catastrophic. Random writes require a physical seek of the read/write head, costing 4-10ms per operation. Sequential writes avoid seeks entirely.
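As a rough check on the HDD figure, the seek cost can be turned into a throughput bound. The sketch below assumes every 4KB write pays exactly one seek and ignores rotational latency, drive caching, and command queuing, so it is an estimate rather than a measurement. The class and method names are illustrative.

```java
class SeekBoundThroughput {
    // Throughput in decimal MB/s when each write of `blockBytes` costs one
    // seek of `seekMillis` and nothing else.
    static double megabytesPerSecond(int blockBytes, double seekMillis) {
        double opsPerSecond = 1000.0 / seekMillis;
        return opsPerSecond * blockBytes / 1_000_000.0;
    }

    public static void main(String[] args) {
        // 4 ms seek:  250 ops/s * 4096 bytes, about 1.0 MB/s
        System.out.println(megabytesPerSecond(4096, 4.0));
        // 10 ms seek: 100 ops/s * 4096 bytes, about 0.4 MB/s
        System.out.println(megabytesPerSecond(4096, 10.0));
    }
}
```

Under these assumptions the bound is about 0.4 to 1.0 MB/s, the same order of magnitude as the 2 MB/s random-write figure in the table and far below the 180 MB/s sequential figure.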

Crash Safety

An append either completes or leaves a partial record at the end of the file. If the process crashes mid-append, everything before that partial record is intact. Recovery is straightforward: read the file, discard the last incomplete record.
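The recovery step can be sketched in a few lines. This is one possible approach, assuming records are newline-terminated as in the event writer later in this chapter; the class name AppendLogRecovery and the method discardPartialTail are hypothetical. A newline only detects a record that was cut short, so production logs typically add a length prefix and checksum per record as well.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class AppendLogRecovery {
    // Truncates the log to the end of its last newline-terminated record.
    // Returns the number of bytes that were discarded.
    static long discardPartialTail(Path log) throws IOException {
        try (FileChannel ch = FileChannel.open(log,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long size = ch.size();
            long end = size;   // candidate end of the last complete record
            ByteBuffer one = ByteBuffer.allocate(1);
            // Walk backwards one byte at a time until a newline is found.
            while (end > 0) {
                one.clear();
                if (ch.read(one, end - 1) != 1) {
                    throw new IOException("short read at offset " + (end - 1));
                }
                if (one.get(0) == '\n') {
                    break;
                }
                end--;
            }
            if (end < size) {
                ch.truncate(end);   // drop the incomplete record
                ch.force(true);     // make the truncation durable
            }
            return size - end;
        }
    }
}
```

Reading one byte at a time keeps the sketch short; a real implementation would read the tail in larger chunks.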

An in-place update overwrites existing data. If the process crashes mid-update, the record can be half old and half new, and the file is corrupted. Recovery requires either a backup or a separate recovery log, which is itself an append-only file. This is not a hypothetical failure mode. It is the reason PostgreSQL uses a write-ahead log, SQLite uses a rollback journal by default with a write-ahead log as an option, and most other durable databases rely on a similar journal.

The Code

The logistics platform’s event writer demonstrates the two critical choices: append-only writes and controlled fsync.

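// Concept: append-only write with explicit fsync control
// FileChannel gives control over when data reaches persistent media
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

// Assumes the channel was opened with StandardOpenOption.APPEND
void recordEvent(FileChannel channel, String packageId, String status, Instant ts)
        throws IOException {
    String record = packageId + "\t" + status + "\t" + ts + "\n";
    ByteBuffer buf = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
    while (buf.hasRemaining()) {
        channel.write(buf);       // May write only part of the buffer; data is in page cache
    }
    channel.force(true);          // fsync: file data and metadata are flushed to the device
}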

Calling force(true) after every write guarantees durability but limits throughput to the fsync rate of the underlying device. On a fast NVMe SSD, this caps single-threaded write throughput at roughly 50,000-80,000 fsyncs per second, and the figure varies widely by drive: consumer models without power-loss protection are often far slower. For the logistics platform processing 2,000 events per minute (about 33 per second), this is more than sufficient. For Kafka ingesting 500,000 messages per second, it is not, which is why Kafka batches writes and fsyncs periodically rather than per-message.
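Batching trades a bounded window of possible loss for throughput. The sketch below is a simplified illustration of that idea, not Kafka's implementation: the class BatchedEventWriter and its batchSize parameter are hypothetical, and it flushes by record count only, whereas a real system would usually add a time-based trigger as well.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class BatchedEventWriter implements AutoCloseable {
    private final FileChannel channel;
    private final int batchSize;
    private int pending = 0;      // records written since the last force
    private long forceCount = 0;  // number of fsyncs issued so far

    BatchedEventWriter(Path file, int batchSize) throws IOException {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
        this.batchSize = batchSize;
    }

    // Appends one record; fsyncs only when a full batch has accumulated.
    void append(String record) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(
                (record + "\n").getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        pending++;
        if (pending >= batchSize) {
            flush();
        }
    }

    // Forces any unflushed records to the device.
    void flush() throws IOException {
        if (pending > 0) {
            channel.force(true);
            forceCount++;
            pending = 0;
        }
    }

    long forceCount() {
        return forceCount;
    }

    @Override
    public void close() throws IOException {
        try {
            flush();
        } finally {
            channel.close();
        }
    }
}
```

With a batch size of 10, 25 appends cost two fsyncs plus one more on close instead of 25. Up to batchSize - 1 records can be lost on power failure, which is the durability given up for the lower fsync count.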

The Decision Rule

Use append-only writes when durability matters and you need crash recovery without external tooling. Use in-place updates only when you have a separate recovery mechanism (a WAL) and the random write overhead is offset by eliminating the need to compact the append-only log. In practice, this means: the WAL is append-only, and the data files it protects may use in-place updates. The log is the safety net. The log is always appended to, never overwritten.
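The ordering behind that rule can be sketched as follows. This is a simplified illustration with assumed details: the class LoggedSlotStore, the 16-byte fixed slot width, and the tab-separated log format are all hypothetical, values are assumed to contain no tabs or newlines, and the replay step that would reapply the log after a crash is omitted.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class LoggedSlotStore {
    static final int SLOT_BYTES = 16;  // hypothetical fixed record width

    // Logs the intended change durably, then overwrites the slot in place.
    static void update(Path walFile, Path dataFile, int slot, String value)
            throws IOException {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        if (raw.length > SLOT_BYTES) {
            throw new IllegalArgumentException("value longer than one slot");
        }
        try (FileChannel wal = FileChannel.open(walFile,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.APPEND);
             FileChannel data = FileChannel.open(dataFile,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE)) {
            // Step 1: append the intent to the log and fsync it.
            ByteBuffer logRecord = ByteBuffer.wrap(
                    (slot + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8));
            while (logRecord.hasRemaining()) {
                wal.write(logRecord);
            }
            wal.force(true);

            // Step 2: only now overwrite the slot. If this write is torn by a
            // crash, the log still holds the full value for replay.
            ByteBuffer slotBytes = ByteBuffer.wrap(Arrays.copyOf(raw, SLOT_BYTES));
            long pos = (long) slot * SLOT_BYTES;
            while (slotBytes.hasRemaining()) {
                pos += data.write(slotBytes, pos);
            }
        }
    }
}
```

The data file is deliberately not forced here: because the log is already durable, the in-place write can be left to the kernel's writeback and flushed later in bulk.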