$docker.recipes
·13 min read

Building a Disaster Recovery Plan for Your Docker Compose Infrastructure

Your server will fail eventually. A disaster recovery plan with tested backups, documented runbooks, and automated verification ensures you can recover in hours, not days.

backup · disaster-recovery · docker-compose · devops · self-hosting

01 The Day My Server Died

Last March, the SSD in my primary server failed without warning. No SMART errors, no degraded performance — it just stopped responding during a routine docker compose pull. One moment I had 22 running services; the next I had a blinking cursor and a drive that wouldn't mount.

I had backups. What I didn't have was a tested recovery plan. I knew my data was on Backblaze B2, but I hadn't documented the exact steps to go from "bare metal" to "all services running." It took me 14 hours to fully recover: finding credentials, remembering configuration decisions, figuring out which backup to restore first. After that experience, I built a proper disaster recovery plan. The second time I tested it (deliberately, on a fresh VPS), full recovery took 2 hours and 15 minutes. Most of that was download time.

Two concepts frame everything in disaster recovery: RTO (Recovery Time Objective) — how long can you afford to be down? And RPO (Recovery Point Objective) — how much data can you afford to lose? For my self-hosted stack, I target a 4-hour RTO and a 24-hour RPO. My backups run daily, and I can rebuild from scratch in under 4 hours.

02 Three Layers of Backup

I use three backup layers, each protecting against a different failure mode:

Layer 1: Docker volume snapshots. These are fast, local backups that protect against application errors, accidental deletions, and bad updates. I run them every 6 hours and keep 7 days of snapshots. Recovery from a volume snapshot takes seconds.

Layer 2: Off-site encrypted backups with restic to Backblaze B2. These protect against hardware failure, ransomware, and physical disasters. I run them daily and keep 30 daily, 12 monthly, and 2 yearly snapshots. Recovery requires downloading from B2, so it's slower, but it survives total server loss.

Layer 3: Infrastructure as code. My docker-compose.yml files, .env templates, and configuration files live in a private Git repository. This isn't data backup — it's the knowledge of how my infrastructure is built. If I lose everything, I can clone the repo and have a blueprint for rebuilding.
[backup.sh]
#!/bin/bash
# backup.sh — Daily backup script
set -euo pipefail

BACKUP_DIR="/opt/backups"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
RESTIC_REPO="b2:my-bucket:server-backups"

echo "=== Starting backup at $TIMESTAMP ==="

# Layer 1: Volume snapshots
# Note: live-copying a database volume can capture an inconsistent state;
# for databases, take a proper dump (pg_dump, mysqldump) as well.
echo "Creating volume snapshots..."
for volume in $(docker volume ls -q); do
  docker run --rm \
    -v "$volume:/source:ro" \
    -v "$BACKUP_DIR/volumes:/backup" \
    alpine tar czf "/backup/${volume}-${TIMESTAMP}.tar.gz" -C /source .
done

# Cleanup old snapshots (keep 7 days)
find "$BACKUP_DIR/volumes" -name "*.tar.gz" -mtime +7 -delete

# Layer 2: Off-site with restic
echo "Running restic backup..."
restic -r "$RESTIC_REPO" backup "$BACKUP_DIR/volumes" /opt/docker \
  --exclude="*.log" --tag "daily"

# Prune old backups
restic -r "$RESTIC_REPO" forget --keep-daily 30 --keep-monthly 12 --keep-yearly 2 --prune

echo "=== Backup completed ==="

restic vs rclone: Use restic for backup (it handles deduplication, encryption, and retention policies). Use rclone for syncing (mirroring a folder to cloud storage). They solve different problems. I use restic for backups and rclone to sync my Paperless export folder to Google Drive as an additional copy.
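That rclone side of the split is small enough to live in a single crontab entry. A sketch, where the remote name gdrive and both paths are placeholders for your own setup (the remote is created beforehand with rclone config):

```shell
# crontab fragment: nightly mirror of the Paperless export folder
# ("gdrive" is a placeholder rclone remote configured via `rclone config`)
30 3 * * * rclone sync /opt/paperless/export gdrive:paperless-export --checksum
```

Unlike restic, this is a plain mirror — deletions propagate, so it complements the versioned restic backups rather than replacing them.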

03 Testing Your Restores

An untested backup is not a backup — it's a hope. I learned this the hard way when my first recovery attempt failed because a restic snapshot was corrupted and I hadn't run restic check in months. Now I test restores monthly. The process is simple: spin up a cheap VPS ($5/month DigitalOcean droplet), run the recovery script, verify services start and data is intact, then destroy the VPS. Total cost per test: about $0.05 and 30 minutes of my time.
[restore-test.sh]
#!/bin/bash
# restore-test.sh — Monthly restore verification
set -euo pipefail

RESTIC_REPO="b2:my-bucket:server-backups"
RESTORE_DIR="/opt/restore-test"

echo "=== Restore Test Started ==="

# Verify backup integrity
# (with `set -e`, a failing command must be tested directly;
# a separate `$?` check after it would never run)
echo "Checking backup integrity..."
if ! restic -r "$RESTIC_REPO" check --read-data-subset=10%; then
  echo "CRITICAL: Backup integrity check failed!"
  exit 1
fi

# Restore latest snapshot
echo "Restoring latest snapshot..."
mkdir -p "$RESTORE_DIR"
restic -r "$RESTIC_REPO" restore latest --target "$RESTORE_DIR"

# Verify critical files exist
echo "Verifying critical files..."
CRITICAL_FILES=(
  "opt/docker/docker-compose.yml"
  "opt/backups/volumes"
)
for f in "${CRITICAL_FILES[@]}"; do
  if [ ! -e "$RESTORE_DIR/$f" ]; then
    echo "CRITICAL: Missing file: $f"
    exit 1
  fi
done

# Restore volumes and start services
echo "Restoring Docker volumes..."
cd "$RESTORE_DIR/opt/docker"
for archive in "$RESTORE_DIR/opt/backups/volumes/"*.tar.gz; do
  # Strip the "-YYYYMMDD-HHMMSS.tar.gz" suffix to recover the volume name
  VOLUME_NAME=$(basename "$archive" | sed 's/-[0-9].*//')
  docker volume create "$VOLUME_NAME"
  docker run --rm \
    -v "$VOLUME_NAME:/target" \
    -v "$(dirname "$archive"):/backup:ro" \
    alpine tar xzf "/backup/$(basename "$archive")" -C /target
done

echo "Starting services..."
docker compose up -d

# Wait and check health
sleep 30
UNHEALTHY=$(docker compose ps --format json | grep -c '"unhealthy"' || true)
if [ "$UNHEALTHY" -gt 0 ]; then
  echo "WARNING: $UNHEALTHY unhealthy services"
  docker compose ps
else
  echo "All services healthy!"
fi

echo "=== Restore Test Complete ==="

Never skip restore testing because 'backups are running fine.' A backup that can't be restored is worthless. I've seen corrupted databases, missing encryption keys, and outdated restore scripts. The monthly test is the only way to be confident your recovery plan actually works.

04 Writing a Recovery Runbook

A runbook is a step-by-step guide for recovering your infrastructure from scratch. Write it for someone who isn't you — because in a crisis, even you won't be thinking clearly. My runbook lives in the same Git repository as my Docker Compose files and covers every step from "I have a new server with a fresh OS" to "all services are running and verified." It includes server provisioning (OS, Docker, firewall), secret recovery (where passwords and keys are stored), backup restoration (step-by-step restic restore commands), service startup (in the correct dependency order), and verification (how to confirm each service is working).

The most critical part is secret recovery. Your .env files contain database passwords, API keys, and encryption keys. They can't live in your Git repository (that's a security risk), and they can't exist only on the server (that's a single point of failure). I store mine in a Bitwarden vault (hosted on a different server) and in an encrypted file on my backup destination.
[recovery-template.env]
# recovery-template.env
# Fill in these values during recovery
# Sources: Bitwarden vault "Server Secrets" entry

# Database passwords
DB_PASSWORD=
REDIS_PASSWORD=

# Application secrets
AUTHENTIK_SECRET_KEY=
N8N_ENCRYPTION_KEY=
PAPERLESS_SECRET_KEY=

# Backup credentials
RESTIC_PASSWORD=
B2_ACCOUNT_ID=
B2_ACCOUNT_KEY=

# Domain and networking
DOMAIN=example.com
TAILSCALE_AUTH_KEY=

# Email (for notifications and password recovery)
SMTP_HOST=
SMTP_USER=
SMTP_PASSWORD=
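The "encrypted file on my backup destination" copy can be produced with plain gpg symmetric encryption. A minimal sketch of the round trip, using a throwaway demo file and passphrase so it is self-contained — in practice the passphrase lives only in the password manager, never on the server:

```shell
# Demo round trip on a throwaway file (real usage: encrypt the filled-in .env)
echo "DB_PASSWORD=example" > /tmp/demo.env
echo "correct horse battery staple" > /tmp/demo.pass

# Encrypt with a symmetric passphrase (no key pair to lose in a disaster)
gpg --batch --yes --pinentry-mode loopback --passphrase-file /tmp/demo.pass \
    --symmetric --cipher-algo AES256 --output /tmp/demo.env.gpg /tmp/demo.env

# During recovery, decrypt it back
gpg --batch --yes --pinentry-mode loopback --passphrase-file /tmp/demo.pass \
    --output /tmp/demo.env.restored --decrypt /tmp/demo.env.gpg

diff /tmp/demo.env /tmp/demo.env.restored && echo "round trip OK"
```

Symmetric encryption is a deliberate choice here: a private key stored on the server would die with it, while a passphrase in the vault survives.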

05 Automating Recovery Verification

Manual monthly testing is good, but automated verification is better. I run a weekly smoke test that verifies my backup chain is intact without doing a full restore. The automation runs as a cron job (or, if you've set up n8n, as an n8n workflow). It checks that the latest backup is less than 26 hours old, that restic check reports no errors, and that the backup size is within the expected range (a sudden 90% drop means something stopped backing up), then sends a summary to Slack.

For the full monthly restore test, I use a simple script that provisions a DigitalOcean droplet via their API, runs the restore script, verifies services, and tears the droplet down. The whole process runs unattended, and I get a pass/fail notification.

This automation has caught two real issues: once when a B2 API key expired and backups silently stopped, and once when a Docker volume was renamed and the backup script was backing up an empty volume. Both times I fixed the issue before a real disaster happened. The cost is minimal — about $2/month for the weekly integrity checks (a few API calls to B2) and $0.50/month for the monthly full restore test on a temporary droplet. That's cheap insurance for a self-hosted infrastructure that runs my documents, passwords, git repositories, and automation.
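The freshness check is the piece most worth sketching. Assuming GNU date and jq are available, the core logic looks like this — the fixed timestamp stands in for what restic would report on a real repository, so the snippet runs on its own:

```shell
# Core of the weekly smoke test: flag a stale backup chain.
# In the real script the timestamp comes from:
#   restic -r "$RESTIC_REPO" snapshots latest --json | jq -r '.[0].time'
# A fixed (long-past) timestamp stands in here so the logic is self-contained.
latest="2024-03-01T03:00:00Z"
MAX_AGE_HOURS=26

# GNU date converts the ISO timestamp to epoch seconds for arithmetic
age_hours=$(( ( $(date -u +%s) - $(date -u -d "$latest" +%s) ) / 3600 ))

if [ "$age_hours" -gt "$MAX_AGE_HOURS" ]; then
  echo "STALE: latest snapshot is ${age_hours}h old (limit ${MAX_AGE_HOURS}h)"
  # In production: post this message to the Slack webhook and exit non-zero
else
  echo "FRESH: latest snapshot is ${age_hours}h old"
fi
```

The 26-hour limit (rather than 24) leaves slack for a backup job that starts a little late without triggering false alarms.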

06 The Complete DR Checklist

Here's my disaster recovery checklist, prioritized by impact. Work through this list and check off each item. Missing any of the top 5 items means you're not prepared for a hardware failure.

1. Automated daily backups running (restic, duplicati, or borg to off-site storage). This is non-negotiable — without off-site backups, a hardware failure means permanent data loss.
2. All secrets stored in a second location (password manager, encrypted file on a different server). If your secrets die with your server, you can't decrypt your backups or configure your services.
3. Docker Compose files and configuration in version control. Infrastructure as code means you can rebuild your stack from a README, not from memory.
4. Backup integrity checks running weekly. A corrupted backup is worse than no backup — it gives false confidence.
5. Full restore tested at least once. Even if you never test again (you should), doing it once reveals the gaps in your plan.
6. Recovery runbook written and accessible offline. Print it, save it to your phone, store it in your password manager. When your server is down, you can't access a runbook that lives on that server.
7. Monitoring and alerting for backup failures. Backups fail silently. If you don't monitor them, you won't know until you need them.
8. Monthly restore test on a fresh server. This is the only way to maintain confidence that your plan works as your infrastructure evolves.

Realistic recovery times based on my testing: a simple stack (5 services, 10GB data) takes 1-2 hours; a medium stack (15 services, 50GB data) takes 2-4 hours; a large stack (30+ services, 200GB+ data) takes 4-8 hours. Most of the time is downloading backups from off-site storage, so a faster internet connection directly reduces your recovery time.

Start with items 1-5 this weekend. They take a few hours to set up, and they're the difference between a minor inconvenience and a catastrophe.
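The whole cadence — daily backups, weekly checks, monthly restore tests — fits in one crontab. A sketch: backup.sh matches the script earlier in this post, while smoke-test.sh and run-restore-test.sh are hypothetical names for the weekly verification and droplet-based restore scripts, and the paths are placeholders:

```shell
# Daily off-site backup (checklist item 1) at 02:00
0 2 * * *  /opt/docker/backup.sh >> /var/log/backup.log 2>&1
# Weekly integrity smoke test (items 4 and 7), Sundays at 05:00
0 5 * * 0  /opt/docker/smoke-test.sh >> /var/log/smoke-test.log 2>&1
# Monthly full restore test on a throwaway VPS (item 8), 1st of the month
0 4 1 * *  /opt/docker/run-restore-test.sh >> /var/log/restore-test.log 2>&1
```

Redirecting each job's output to a log file matters: cron jobs fail silently otherwise, which is exactly the failure mode item 7 exists to catch.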

About the Author

Frank Pegasus

DevOps engineer and self-hosting enthusiast with over a decade of experience running containerized workloads in production. Creator of docker.recipes.