unbound mongodb at scale

Write Concern Selection for the Telemetry Platform

Chapter 59 of 72

The Symptom

The telemetry platform uses w: "majority" for all writes. Ingestion throughput peaks at 14,000 writes/second, although the same hardware sustains 22,000 writes/second with w: 1. During traffic spikes, the 14,000 ops/sec ceiling causes write queue buildup and connection pool exhaustion (CH4).

The Cause

Not all writes require the same durability. The telemetry platform writes four types of data:

  1. Sensor readings: High volume (50,000/s target). Losing a few readings during a primary failover is acceptable because sensors resend periodically.
  2. Bucket summaries: Computed from readings. Can be recomputed from raw readings if lost. Medium criticality.
  3. Anomaly alerts: Low volume but critical. An alert that a sensor exceeded a safety threshold must not be lost.
  4. Configuration changes: Rare but critical. Changing a sensor’s calibration or threshold must be durable.

Using w: "majority" for all four types wastes throughput on low-criticality data.

The Benchmark

| Write type | Volume | Criticality | w:1 throughput | w:majority throughput | Recommendation |
|---|---|---|---|---|---|
| Sensor readings | 50,000/s | Low | 22,000/s | 14,000/s | w:1 |
| Bucket summaries | 800/s | Medium | 22,000/s | 14,000/s | w:1 |
| Anomaly alerts | 5/s | High | 22,000/s | 14,000/s | w:majority, j:true |
| Config changes | 0.01/s | Critical | 22,000/s | 14,000/s | w:majority, j:true |

With mixed write concerns, the effective throughput for the dominant write type (sensor readings) is 22,000/s instead of 14,000/s.
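
The arithmetic behind the table can be checked with a short sketch. The class and method names below are hypothetical; the rates and throughput figures are copied from the benchmark table, and the calculation assumes the four write types are the only writers.

```java
/** Back-of-the-envelope check of the benchmark table (hypothetical helper, not platform code). */
final class WriteMixEstimate {

    /** Fraction of the total write volume that needs w:majority. */
    static double majorityShare(double[] ratesPerSecond, boolean[] needsMajority) {
        double total = 0.0;
        double majority = 0.0;
        for (int i = 0; i < ratesPerSecond.length; i++) {
            total += ratesPerSecond[i];
            if (needsMajority[i]) {
                majority += ratesPerSecond[i];
            }
        }
        return majority / total;
    }

    /** Throughput of w:1 relative to w:majority on the same hardware. */
    static double speedup(double w1PerSecond, double majorityPerSecond) {
        return w1PerSecond / majorityPerSecond;
    }

    public static void main(String[] args) {
        // Order: sensor readings, bucket summaries, anomaly alerts, config changes
        double[] rates = {50_000, 800, 5, 0.01};
        boolean[] majority = {false, false, true, true};
        System.out.printf("share needing majority: %.5f%%%n", 100 * majorityShare(rates, majority));
        System.out.printf("w:1 speedup: %.2fx%n", speedup(22_000, 14_000));
    }
}
```

With the table's figures, fewer than 0.01% of writes by volume need w:majority, which is why paying the majority cost on every write buys so little.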

The Fix

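import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.util.concurrent.TimeUnit;

@Configuration
public class MongoWriteConfig {

    private final MongoDatabase database;

    // Constructor injection; the final field must be assigned here
    public MongoWriteConfig(MongoDatabase database) {
        this.database = database;
    }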

    // High-volume, low-criticality: w:1
    @Bean("readingsCollection")
    public MongoCollection<Document> readingsCollection() {
        return database.getCollection("readings")
            .withWriteConcern(WriteConcern.W1);
    }

    // Medium criticality: w:1 (can be recomputed)
    @Bean("bucketsCollection")
    public MongoCollection<Document> bucketsCollection() {
        return database.getCollection("buckets_5min")
            .withWriteConcern(WriteConcern.W1);
    }

    // High criticality: w:majority with journal
    @Bean("alertsCollection")
    public MongoCollection<Document> alertsCollection() {
        return database.getCollection("alerts")
            .withWriteConcern(WriteConcern.MAJORITY.withJournal(true));
    }

    // Critical: w:majority with journal and timeout
    @Bean("configCollection")
    public MongoCollection<Document> configCollection() {
        return database.getCollection("sensor_config")
            .withWriteConcern(WriteConcern.MAJORITY
                .withJournal(true)
                .withWTimeout(5000, TimeUnit.MILLISECONDS));
    }
}

wtimeout is critical for w: "majority" writes. Without it, the write blocks until a majority acknowledges, which can be indefinitely if too few data-bearing members are reachable or caught up (for example, when the only secondary in a primary-secondary-arbiter set goes down). With wtimeout: 5000, the write returns a write concern error after 5 seconds if majority is not achieved. The write is not undone on the primary; only the majority acknowledgment is missing.

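// Handle wtimeout errors
// Needs: com.mongodb.MongoWriteConcernException, com.mongodb.ReadPreference,
//        com.mongodb.client.model.Filters, org.bson.Document
try {
    configCollection.insertOne(configChange);
} catch (MongoWriteConcernException e) {
    // The write was applied on the primary, but a majority did not acknowledge within wtimeout.
    // It normally replicates once a majority is reachable, but it can still be
    // rolled back if the primary steps down first.
    logger.warn("Write concern timeout: {}", e.getMessage());

    // Verify the write is still present on the primary before deciding to retry
    // (insertOne assigns an _id to configChange if it had none)
    Document existing = configCollection
        .withReadPreference(ReadPreference.primary())
        .find(Filters.eq("_id", configChange.get("_id")))
        .first();

    if (existing != null) {
        logger.info("Write exists on primary, replication pending");
    } else {
        throw e;  // Document is absent (for example, rolled back after a failover); let the caller retry
    }
}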
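
Whether a w: "majority" write can be acknowledged at all depends on the replica set layout. The sketch below applies MongoDB's rule that the majority is the smaller of the majority of voting members and the number of data-bearing voting members. The chapter does not state the platform's topology, so the member counts in the example are hypothetical, as are the class and method names.

```java
/** Majority arithmetic for w:"majority" (hypothetical helper; topologies are illustrative). */
final class MajorityMath {

    /** Acknowledgments needed: min(majority of voting members, data-bearing voting members). */
    static int calculatedMajority(int votingMembers, int arbiters) {
        if (votingMembers < 1 || arbiters < 0 || arbiters >= votingMembers) {
            throw new IllegalArgumentException("invalid replica set shape");
        }
        int majorityOfVoters = votingMembers / 2 + 1;
        int dataBearing = votingMembers - arbiters;
        return Math.min(majorityOfVoters, dataBearing);
    }

    /** True if enough data-bearing members remain to acknowledge a majority write. */
    static boolean canAcknowledge(int votingMembers, int arbiters, int dataBearingDown) {
        int available = votingMembers - arbiters - dataBearingDown;
        return available >= calculatedMajority(votingMembers, arbiters);
    }

    public static void main(String[] args) {
        // Primary + two secondaries, one secondary down
        System.out.println("PSS, one down: " + canAcknowledge(3, 0, 1));
        // Primary + secondary + arbiter, the secondary down
        System.out.println("PSA, one down: " + canAcknowledge(3, 1, 1));
    }
}
```

In the three-data-member case one secondary can fail without blocking majority writes. In the arbiter case the primary keeps its election majority but cannot gather two data-bearing acknowledgments, which is the situation where a missing wtimeout leaves the write waiting.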

Bulk writes with write concern:

// FAST: Bulk insert sensor readings with w:1
public void ingestBatch(List<Document> readings) {
    collection.withWriteConcern(WriteConcern.W1)
        .insertMany(readings, new InsertManyOptions().ordered(false));
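    // ordered:false lets the server continue past individual failed inserts
    // (and process them in parallel where it can); failures surface afterwards
    // as a MongoBulkWriteException
    // w:1 returns after primary acknowledges without waiting for replication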
}

The Proof

| Metric | w:majority everywhere | Mixed write concerns |
|---|---|---|
| Readings ingestion throughput | 14,000/s | 22,000/s |
| Alert write latency | 8ms | 8ms (still w:majority) |
| Connection pool utilization at 50k/s | Pool exhausted | 65% |
| Data loss on primary failover (readings) | 0 | 0-2 seconds of readings |
| Data loss on primary failover (alerts) | 0 | 0 |

The Trade-off

With w: 1 for sensor readings, a primary failover can lose the last 1-2 seconds of unreplicated readings. At the platform's 50,000/s target rate, that is approximately 50,000-100,000 readings. These readings are recoverable if sensors buffer and resend on connection failure (which the IoT protocol handles). If sensors do not buffer, the readings are lost permanently.
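
The loss estimate is simply write rate multiplied by the unreplicated window. A minimal sketch, with a hypothetical helper name, makes it easy to rerun the estimate for other rates, such as the 22,000/s measured in the benchmark:

```java
/** Readings at risk during a failover: write rate x unreplicated window (hypothetical helper). */
final class FailoverLossEstimate {

    static long readingsAtRisk(long writesPerSecond, double unreplicatedSeconds) {
        if (writesPerSecond < 0 || unreplicatedSeconds < 0) {
            throw new IllegalArgumentException("rate and window must be non-negative");
        }
        return Math.round(writesPerSecond * unreplicatedSeconds);
    }

    public static void main(String[] args) {
        // Target rate from the chapter, 1 s and 2 s windows
        System.out.println(readingsAtRisk(50_000, 1.0));
        System.out.println(readingsAtRisk(50_000, 2.0));
        // Measured w:1 ceiling from the benchmark, 2 s window
        System.out.println(readingsAtRisk(22_000, 2.0));
    }
}
```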

The wtimeout behavior is subtle. A MongoWriteConcernException caused by wtimeout does not mean the write failed. It means the write was applied on the primary but the driver could not confirm majority acknowledgment within the timeout. The write may have replicated to a majority by the time the exception is caught. Retrying blindly could create a duplicate document, or fail with a duplicate-key error when the same _id is reused. Always verify before retrying.
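
The verify-before-retry rule can be sketched independently of the driver. In the sketch below, AckTimeoutException is a hypothetical stand-in for the driver's write concern timeout exception, and the write and the existence check are passed in as callbacks; none of these names come from the platform's code.

```java
import java.util.function.BooleanSupplier;

/** Verify-before-retry sketch; all names here are hypothetical. */
final class VerifiedRetry {

    enum Outcome { ACKNOWLEDGED, PRESENT_UNCONFIRMED, RETRIED }

    /** Stand-in for a write concern timeout raised by the driver. */
    static final class AckTimeoutException extends RuntimeException {
        AckTimeoutException(String message) {
            super(message);
        }
    }

    /**
     * Runs the write once. On an acknowledgment timeout, checks the primary first
     * and retries only if the document is absent. A timeout on the retry propagates.
     */
    static Outcome writeOnce(Runnable write, BooleanSupplier existsOnPrimary) {
        try {
            write.run();
            return Outcome.ACKNOWLEDGED;
        } catch (AckTimeoutException e) {
            if (existsOnPrimary.getAsBoolean()) {
                // The write is on the primary; retrying would duplicate it
                return Outcome.PRESENT_UNCONFIRMED;
            }
            write.run();
            return Outcome.RETRIED;
        }
    }
}
```

A caller would pass the insert as the first argument and a primary read by _id as the second. Returning PRESENT_UNCONFIRMED rather than throwing lets the caller decide whether to wait for replication or alert an operator.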

For transactions, the write concern is set on the transaction and applies to the commit operation; individual operations inside the transaction cannot carry their own write concern. A transaction committed with w: "majority" ensures that either all operations in the transaction are replicated to a majority, or none are. This is the only way to get atomic, durable multi-document writes.