The Mechanism
The immediate cause is a human error: a destructive command executed in the wrong terminal. The systemic cause is that every recovery mechanism had failed independently, silently, and prior to the incident.
The rm -rf command:
# The engineer intended to run this on db2 (secondary):
sudo rm -rf /var/opt/gitlab/postgresql/data/
# The engineer ran it on db1 (primary).
# FAILURE POINT: rm -rf provides no confirmation, no prompt,
# no indication of which host the command is running on.
# The default bash prompt may or may not display the hostname.
# In a late-night troubleshooting session with multiple terminals,
# visual identification of which terminal is connected to which
# server is the only safeguard.
The command is catastrophic by design. rm -rf removes files recursively (-r) and forcefully (-f, suppressing confirmation prompts). It is the standard Unix file deletion command. It has no undo. It does not check what it is deleting. It does not warn that the target contains a PostgreSQL data directory. It does not require elevated confirmation for large deletions. These are properties of the tool, not bugs.
The human factors are straightforward. The engineer has been troubleshooting a replication problem for hours. It is late evening. Multiple terminal windows are open, connected to different servers. The terminals look similar. The cognitive load of distinguishing between them is low under normal conditions but increases with fatigue. The engineer executes the correct command on the wrong server.
This is not incompetence. This is a predictable consequence of a workflow that relies on human visual identification of terminal windows as the sole safeguard against destructive operations on production systems.
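One way to stop relying on the look of a terminal is to make the operator state the intended host, and to refuse the command when the statement is wrong. The following is a minimal sketch of that idea, not something taken from the incident report; the function names confirm_host and guarded are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical guard: the caller names the host they intend to act on, and
# the command runs only if that matches the machine's actual hostname.

# confirm_host EXPECTED [ACTUAL]
# ACTUAL defaults to the short hostname of the current machine.
confirm_host() {
    local expected="$1"
    local actual="${2:-$(hostname -s)}"
    if [ "$expected" != "$actual" ]; then
        echo "REFUSING: you said '$expected' but this is '$actual'" >&2
        return 1
    fi
}

# guarded EXPECTED_HOST COMMAND [ARGS...]
# Runs COMMAND only when EXPECTED_HOST matches the current host.
guarded() {
    local expected="$1"
    shift
    confirm_host "$expected" && "$@"
}

# Example (not executed here):
#   guarded db2 sudo rm -rf /var/opt/gitlab/postgresql/data/
# Run on db1, this prints the refusal and deletes nothing.
```

The guard does not make rm any safer. It only moves the question "which server is this?" out of the operator's visual memory and into the command line, where a mismatch can be detected mechanically.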
The backup failures each have independent causes:
pg_dump (silent failure). The cron job that triggers pg_dump ran on schedule. The pg_dump command itself failed because it was run with the PostgreSQL 9.2 client binaries against a 9.6 database, and pg_dump refuses to dump a server newer than itself. The cron job did not check the exit code of pg_dump. No alert about the failure reached the team. No monitoring checked whether the backup file existed and was the expected size. The failure had been occurring for an unknown number of days before the incident.
# RECONSTRUCTED FROM GITLAB INCIDENT REPORT
# The cron job (simplified)
0 2 * * * /usr/bin/pg_dump gitlabhq_production | gzip > /var/opt/gitlab/backups/db.sql.gz
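# FAILURE POINT: No exit code check. No size validation.
# The pipeline's exit status is gzip's, not pg_dump's. If pg_dump
# fails (wrong client version, connection error, timeout), gzip
# still succeeds and writes a tiny but valid archive over the old
# file. No alert. No retry.
# Tomorrow's cron will fail the same way.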
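The failure points in the cron entry above can be closed with a small wrapper that cron calls instead of the bare pipeline. This is a sketch under stated assumptions, not GitLab's fix: the names run_backup and notify_failure, the minimum-size threshold, and the stderr "alert" are all illustrative placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the nightly dump. Function names, the minimum
# size, and the alert hook are assumptions for illustration.

# Placeholder alert: replace with a real pager or chat integration.
notify_failure() {
    echo "BACKUP FAILED: $*" >&2
}

# run_backup OUTPUT_FILE MIN_BYTES DUMP_COMMAND [ARGS...]
# The body runs in a subshell so that pipefail does not leak to the caller.
run_backup() (
    set -o pipefail
    out="$1"; min_bytes="$2"; shift 2
    tmp="${out}.partial"

    # With pipefail, a failing dump command fails the whole pipeline
    # instead of being masked by gzip's successful exit status.
    if ! "$@" | gzip > "$tmp"; then
        notify_failure "dump command exited non-zero"
        rm -f "$tmp"
        exit 1
    fi

    # Size validation: an implausibly small archive counts as a failure.
    size=$(wc -c < "$tmp" | tr -d ' ')
    if [ "$size" -lt "$min_bytes" ]; then
        notify_failure "archive is $size bytes, expected at least $min_bytes"
        rm -f "$tmp"
        exit 1
    fi

    # Replace the previous backup only after the new one has passed.
    mv "$tmp" "$out"
)

# Example call from a script that cron runs (not executed here):
#   run_backup /var/opt/gitlab/backups/db.sql.gz 1048576 \
#       /usr/bin/pg_dump gitlabhq_production
```

Writing to a temporary file and renaming it only on success means a failed run never overwrites the last good backup, which the original redirect would do.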
Streaming replication (already broken). The replication lag that the engineer was troubleshooting meant the secondary was already unusable. The secondary had fallen far enough behind that the primary had already removed WAL segments the secondary still needed, so it could not catch up through normal replication and had to be rebuilt from a fresh copy of the primary. This was the problem being solved when the deletion occurred.
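Lag of this kind is measurable on the primary well before a standby becomes unrecoverable. The sketch below assumes PostgreSQL 9.6 function and column names, psql access on the primary, and a threshold chosen by the operator. The function names replay_lag_bytes, lag_exceeds, and check_replication are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical lag monitor. Threshold and function names are assumptions.

# Prints, one per line, how many bytes of WAL each connected standby has
# yet to replay. Uses PostgreSQL 9.6 names; on version 10 and later the
# equivalents are pg_wal_lsn_diff(), pg_current_wal_lsn(), and replay_lsn.
replay_lag_bytes() {
    psql -At -c "SELECT pg_xlog_location_diff(pg_current_xlog_location(), replay_location) FROM pg_stat_replication;"
}

# lag_exceeds THRESHOLD_BYTES LAG_BYTES
# Succeeds (exit 0) when the lag is over the threshold.
lag_exceeds() {
    [ "$2" -gt "$1" ]
}

# check_replication THRESHOLD_BYTES
# Reads lag values from stdin and fails if any standby is too far behind,
# if a value is unreadable, or if no standby is connected at all.
check_replication() {
    local threshold="$1" lag seen=0
    while read -r lag; do
        seen=1
        case "$lag" in
            ''|*[!0-9]*)
                echo "ALERT: unreadable lag value '$lag'" >&2
                return 1 ;;
        esac
        if lag_exceeds "$threshold" "$lag"; then
            echo "ALERT: standby is $lag bytes behind" >&2
            return 1
        fi
    done
    if [ "$seen" -eq 0 ]; then
        echo "ALERT: no standby is connected" >&2
        return 1
    fi
}

# Example (not executed here), alerting above 1 GiB of lag:
#   replay_lag_bytes | check_replication 1073741824
```

Treating "no standby connected" as a failure matters as much as the threshold: a replica that has dropped off entirely reports no lag at all.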
LVM snapshots (never configured). The capability existed in the infrastructure. The configuration was never applied to the database volume. This is not a failure of a backup mechanism. It is the absence of a backup mechanism that appeared on the architecture diagram but was never implemented.
Azure disk snapshots (not running). Similar to the LVM case: the snapshot mechanism existed in the cloud platform but was not configured to run on the database volume with the expected frequency.
S3 uploads (dependent on pg_dump). The S3 upload script uploaded the output of pg_dump. Since pg_dump had been failing, there was nothing to upload. The S3 bucket contained stale data from the last successful pg_dump.
The pattern across all five failures is the same: no verification. No mechanism existed to confirm that each backup was functioning. No alert fired when a backup failed. No periodic restore test confirmed that backups could actually be used for recovery. Each backup existed on paper and in architecture diagrams. None had been tested under realistic conditions.
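The missing verification can be approximated with a check that runs on its own schedule, independent of the job that produces the backup. The sketch below is an assumption-laden illustration: verify_backup is a hypothetical name, the age limit is the caller's choice, and the completion-marker test applies only to plain-format pg_dump output.

```shell
#!/usr/bin/env bash
# Hypothetical verifier, meant to run separately from the backup job.

# verify_backup FILE MAX_AGE_MINUTES
verify_backup() {
    local file="$1" max_age="$2"

    # 1. The artifact must exist and be non-empty.
    [ -s "$file" ] || { echo "FAIL: $file missing or empty" >&2; return 1; }

    # 2. It must be recent: find prints the path only if the file was
    #    last modified more than max_age minutes ago.
    if [ -n "$(find "$file" -mmin +"$max_age")" ]; then
        echo "FAIL: $file is older than $max_age minutes" >&2
        return 1
    fi

    # 3. The gzip stream must be intact.
    gzip -t "$file" 2>/dev/null \
        || { echo "FAIL: $file is corrupt" >&2; return 1; }

    # 4. A plain-format pg_dump ends with a completion marker; its
    #    absence suggests the dump was cut short.
    gzip -dc "$file" | tail -n 5 \
        | grep -q 'PostgreSQL database dump complete' \
        || { echo "FAIL: $file has no completion marker" >&2; return 1; }
}

# Example (not executed here), allowing a little over one day of age:
#   verify_backup /var/opt/gitlab/backups/db.sql.gz 1500
```

These checks catch stale, empty, corrupt, and truncated artifacts, which covers the pg_dump and S3 failures described above. They are still weaker than a periodic restore into a scratch database, which is the only test that shows a backup can actually be used for recovery.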
The recovery took approximately 18 hours. The team restored from a copy of the database roughly six hours old: not the output of any scheduled backup, but an LVM snapshot that had been taken by hand for an unrelated purpose before the incident. Approximately six hours of production data were lost. GitLab documented the data loss publicly: 707 users lost data. The team live-streamed the recovery process on YouTube, a rare act of transparency for a major data loss incident.