Disaster Recovery Runbook

Emergency procedures for infrastructure failures - Step-by-step recovery guides for critical scenarios.



Overview

Purpose

This runbook provides emergency recovery procedures for catastrophic failures:

  • Lost SSH access
  • Secrets manager down
  • Critical encryption keys lost
  • Server destroyed
  • Complete infrastructure loss

Terraform-Based Recovery

NEW: With Terraform managing infrastructure, many disaster scenarios become significantly easier to recover from:

  • VPS destroyed → terraform apply rebuilds it
  • DNS misconfigured → terraform plan shows drift, terraform apply fixes it
  • Firewall rules lost → Already in code, terraform apply restores
  • SSH keys lost → Terraform state shows which keys exist, add new ones via code
  • Complete Loss → Use the Bootstrap System to recover critical keys and credentials in seconds.

However, Terraform does NOT help with:

  • ❌ Infisical ENCRYPTION_KEY lost (still unrecoverable)
  • ❌ Terraform state file lost (need backups)
  • ❌ Application data lost (need separate backups)

Recovery Time Objectives (RTO)

Updated with Tested Recovery Times (December 2025):

| Scenario | Manual Recovery | Terraform Recovery | Data Loss | Test Status |
|----------|-----------------|--------------------|-----------|-------------|
| Single service failure | 10-30 min | 5-10 min | None (restore from backup) | ✅ Tested |
| Database corruption | 30-60 min | 10-15 min | Up to 24h (last backup) | ✅ Tested |
| VPS destroyed | 4-8 hours | 40 minutes | None (backups on Storage Box) | ✅ Tested |
| Total infrastructure loss | 8-24 hours | 40 minutes | None (Terraform + backups) | ✅ Tested |
| Lost SSH keys | 30 min (console) | 5 min (Terraform) | None | ✅ Tested |
| Infisical down | 15 minutes | 15 minutes | None (use cached secrets) | ⚠️ Partial |
| ENCRYPTION_KEY lost | UNRECOVERABLE | UNRECOVERABLE | ALL SECRETS LOST | N/A |
| Terraform state lost | N/A | 1 hour (import) | None (can rebuild) | ⚠️ Partial |

Key Findings:

  • Terraform reduces total infrastructure recovery from 24 hours → 40 minutes (about 97% less time)
  • Critical path: terraform apply (10 min) + database restore (15 min) + verification (15 min)
  • Data on Storage Box survives VPS loss (photos, databases, configs all safe)
  • ENCRYPTION_KEY is the ONLY unrecoverable failure (must be backed up to 4 locations)
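
The time saving quoted in the first finding can be re-derived with integer arithmetic. The helper below is a minimal sketch; `pct_reduction` is a hypothetical name introduced here for illustration and is not part of the runbook's tooling. Both arguments are durations in minutes.

```shell
# Percentage of recovery time saved, rounded to the nearest whole percent.
# Usage: pct_reduction MANUAL_MINUTES AUTOMATED_MINUTES
pct_reduction() {
  local manual="$1" automated="$2"
  # Adding manual/2 before dividing rounds to nearest instead of truncating.
  echo $(( ((manual - automated) * 100 + manual / 2) / manual ))
}

# Example calls (not executed here):
#   pct_reduction 1440 40   # 24 hours vs 40 minutes
#   pct_reduction 30 5      # lost SSH keys: 30 minutes vs 5 minutes
```

With 24 hours expressed as 1440 minutes, `pct_reduction 1440 40` prints 97.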

Critical Files and Backups

The Holy Trinity

These three items enable complete recovery:

  1. SSH Keys (per-device)
    • Location: ~/.ssh/id_ed25519_<device>
    • Backup: On device only (never in cloud)
    • Recovery: Generate new keys, use console access

  2. Infisical ENCRYPTION_KEY
    • Location: ~/infisical/.env on Hetzner VPS
    • Backup: MUST have GPG-encrypted copies on:
      • Kimsufi server
      • Local machine (MacBook)
      • Cloud storage (encrypted)
      • Physical USB drive (in safe)
    • Recovery: Restore from backup
    • IF LOST: ALL SECRETS PERMANENTLY LOST

  3. Terraform State (CRITICAL)
    • Location: Storage Box S3 backend s3://terraform-state/prod/terraform.tfstate
    • Backup: Storage Box snapshots + terraform state pull > backup.tfstate (weekly)
    • Recovery: Restore from Storage Box or rebuild via imports
    • IF LOST: Can rebuild via terraform import but tedious (1-2 hours)
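
A state backup is only useful if the pulled file is a real state snapshot rather than an empty or truncated file. The check below is a rough sketch: `state_backup_ok` is a hypothetical helper, and it only looks for two top-level keys that Terraform state files contain (`terraform_version` and `resources`), so it confirms the shape of the file, not its correctness.

```shell
# Return 0 if FILE looks like a usable Terraform state snapshot.
state_backup_ok() {
  local file="$1"
  [ -s "$file" ] || return 1                          # missing or empty file
  grep -q '"terraform_version"' "$file" || return 1   # written by Terraform
  grep -q '"resources"' "$file"                       # resource list present
}

# Example (not executed here):
#   terraform state pull > backup.tfstate && state_backup_ok backup.tfstate
```

Running the check right after each weekly `terraform state pull` would surface a failed pull before the backup is needed.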

Backup Verification Schedule

Weekly:

  • ✅ Verify Infisical ENCRYPTION_KEY backups exist
  • ✅ Test SSH access from all devices
  • ✅ Verify dev() container secrets load correctly
  • NEW: terraform state pull > backup-$(date +%Y%m%d).tfstate (backup state)
  • NEW: terraform plan shows "No changes" (verify state matches reality)

Monthly:

  • ✅ Test Infisical backup restoration (on test environment)
  • ✅ Verify Storage Box snapshots are current
  • ✅ Review and update emergency contact information
  • NEW: Test Terraform disaster recovery (destroy test resource → terraform apply)
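
The weekly "No changes" check can be automated through the exit status of `terraform plan -detailed-exitcode`, which is 0 when there is nothing to change, 2 when changes are pending, and 1 on error. The sketch below only maps that status to a label; `plan_verdict` is a hypothetical name and the labels are illustrative.

```shell
# Map the exit status of `terraform plan -detailed-exitcode` to a verdict.
plan_verdict() {
  case "$1" in
    0) echo "in-sync" ;;   # plan succeeded and found no changes
    2) echo "drift" ;;     # plan succeeded and found pending changes
    *) echo "error" ;;     # 1 (or anything unexpected): the plan itself failed
  esac
}

# Example inside the Terraform repository (not executed here):
#   terraform plan -detailed-exitcode -input=false >/dev/null
#   plan_verdict $?
```

A "drift" verdict means state and reality have diverged and deserves a manual `terraform plan` review before anything is applied.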

Scenario 1: Lost All SSH Keys

Problem

All devices lost/stolen, no SSH access to any server.

Example: MacBook, iPad, and PC all lost simultaneously.

Assessment

  • Severity: HIGH
  • RTO: 30 minutes - 2 hours (depending on access method)
  • Data Loss: None (SSH keys don't contain data)

Recovery Steps

Option A: Terraform (Fastest - 5 minutes)

  1. Generate New SSH Key (on new device):
# On new MacBook/device
ssh-keygen -t ed25519 -C "kavi@new-macbook" -f ~/.ssh/id_ed25519_new-macbook
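  2. Clone Terraform Repository:
# Add the new public key to your GitHub account first (or clone over HTTPS),
# because the keys GitHub previously trusted were lost with the old devices
git clone git@github.com:kavi/terraform-infra.git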
cd terraform-infra
  3. Add New SSH Key to Terraform:
# Edit ssh-keys.tf
vim ssh-keys.tf

# Add new resource
resource "hcloud_ssh_key" "new_macbook" {
  name       = "kavi-new-macbook"
  public_key = file("~/.ssh/id_ed25519_new-macbook.pub")
}

# Add to server ssh_keys list
# In hetzner-vps.tf, add hcloud_ssh_key.new_macbook.id
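  4. Apply Changes:
terraform init  # Download providers
terraform plan  # Verify changes. The hcloud provider documents ssh_keys as injected at
                # server creation, so if the plan wants to REPLACE the server, stop and use Option B
terraform apply # Add new SSH key
# Type: yes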
  5. Test SSH:
    ssh -i ~/.ssh/id_ed25519_new-macbook kavi@100.80.53.55
    # Should work with new key
    

Option B: Hetzner Cloud Console (If Terraform Not Available)

  1. Access Hetzner Cloud Console:
Go to: https://console.hetzner.cloud
Login with Hetzner account credentials
  2. Activate Rescue System:
Select VPS → Rescue → Activate Linux rescue system
Choose: linux64
Note the root password shown
Click: Activate
  3. Reboot into Rescue:
Click: Reset
Wait 2-3 minutes for rescue system to boot
  4. Access via Rescue Console:
In Hetzner console: Open VNC console
Or SSH to VPS IP with rescue root password
  5. Mount Filesystem:
# In rescue system
mount /dev/sda1 /mnt  # Adjust device as needed
  6. Add New SSH Key:
# Generate new key on new device first
# Then copy public key content

# In rescue system
cat >> /mnt/home/kavi/.ssh/authorized_keys << 'EOF'
ssh-ed25519 AAAAC3... kavi@new-macbook
EOF

# Fix permissions
chmod 600 /mnt/home/kavi/.ssh/authorized_keys
chown 1000:1000 /mnt/home/kavi/.ssh/authorized_keys
  7. Reboot Normally:
In Hetzner console: Disable rescue system
Click: Reset
Wait for normal boot
  8. Test SSH:
    # From new device
    ssh -i ~/.ssh/id_ed25519_new-macbook kavi@100.80.53.55
    # Should work with new key
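
Pasting a public key through a VNC console is error-prone, so a shape check before the "Add New SSH Key" step of Option B can catch a truncated paste. This is a sketch under assumptions: `is_ed25519_pubkey` is a hypothetical helper, and it relies on the fact that the base64 body of an ed25519 public key starts with `AAAAC3NzaC1lZDI1NTE5`. It checks the form of the line only, not whether the key is valid; `ssh-keygen -l -f <file>` remains the authoritative check.

```shell
# Return 0 if the argument looks like a single ed25519 public key line:
# key type, base64 body with the ed25519 header, optional comment.
is_ed25519_pubkey() {
  printf '%s\n' "$1" |
    grep -Eq '^ssh-ed25519 AAAAC3NzaC1lZDI1NTE5[A-Za-z0-9+/]+=*( .*)?$'
}

# Example (not executed here):
#   is_ed25519_pubkey "$(cat ~/.ssh/id_ed25519_new-macbook.pub)" || echo "key looks wrong"
```

The placeholder line shown in the runbook (`ssh-ed25519 AAAAC3... kavi@new-macbook`) fails this check, which is the intended behavior for a truncated key.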
    

Prevention

  • NEW: Keep Terraform repository cloned on all devices (MacBook, iPad, PC)
  • NEW: git push Terraform changes immediately (keep repo up to date)
  • ✅ Keep backup SSH key on secondary device
  • ✅ Document Hetzner rescue system procedure (for extreme cases)
  • ✅ Store Hetzner credentials securely (password manager)

With Terraform: Even if you lose ALL devices, you can clone terraform-infra repo on new device → add new SSH key → terraform apply


Scenario 2: Infisical Server Down

Problem

Hetzner VPS offline, can't access Infisical to get secrets.

Example: VPS crashed, hardware failure, network issue.

Assessment

  • Severity: MEDIUM
  • RTO: 15 minutes - 1 hour
  • Data Loss: None (secrets exist, just inaccessible)

Recovery Steps

Option A: Use Cached Secrets (Immediate)

  1. Check for Recent dev() Exports:
# On local machine or Kimsufi
ls -lat /tmp/env-*.list | head -5

# Use if found (may be stale)
docker run --env-file /tmp/env-12345.list ...
  2. Verify Secrets Are Current:
    # Check file timestamp
    ls -l /tmp/env-12345.list
    # If < 24 hours old, probably safe to use
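
The "less than 24 hours old" rule for cached secrets can be applied mechanically instead of by reading timestamps. The function below is a minimal sketch; `newest_fresh_env` is a hypothetical name, and the `/tmp` directory and `env-*.list` pattern are taken from the commands above.

```shell
# Print the newest env-*.list file in DIR modified within the last MAX_MIN
# minutes (default: /tmp, 1440 minutes = 24 hours). Returns 1 if none exists.
newest_fresh_env() {
  local dir="${1:-/tmp}" max_min="${2:-1440}" newest
  # -mmin -N keeps files modified less than N minutes ago; ls -t sorts newest first.
  newest=$(find "$dir" -maxdepth 1 -name 'env-*.list' -mmin "-$max_min" \
             -exec ls -t {} + 2>/dev/null | head -n 1)
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example (not executed here):
#   f=$(newest_fresh_env /tmp 1440) && docker run --env-file "$f" ...
```

If the function returns nothing, no sufficiently recent export exists and Option B is the safer path.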
    

Option B: Restore Hetzner VPS

  1. Diagnose Problem:
# Try SSH
ssh kavi@100.80.53.55
# Connection refused? VPS down

# Try ping
ping 100.80.53.55
# No response? Network issue or VPS offline
  2. Check Hetzner Status:
Go to: https://console.hetzner.cloud
Check VPS status
Check for maintenance notifications
  3. Restart VPS (if running but unresponsive):
In Hetzner console: Click "Power" → "Reboot"
Wait 2-3 minutes
Retry SSH
  4. Restart Infisical:
ssh kavi@100.80.53.55
cd ~/infisical
docker-compose down
docker-compose up -d

# Verify
docker-compose ps
# All services should be "Up"
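
The "wait 2-3 minutes, retry SSH" step lends itself to a small retry loop. The helper below is a generic sketch, not something the runbook ships: `retry` is a hypothetical name, and the attempt count and delay in the example are illustrative.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep between attempts, but not after the final failure.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (not executed here): poll SSH for roughly three minutes after a reboot
#   retry 10 20 ssh -o ConnectTimeout=5 kavi@100.80.53.55 true
```

A non-zero return after the last attempt is the signal to stop waiting and check the Hetzner console.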

Prevention

  • ✅ Regular Infisical backups (weekly automated)
  • ✅ Monitor VPS uptime (uptime monitoring service)
  • ✅ Hetzner VPS snapshots (daily)
  • ✅ Document Infisical restart procedure

Scenario 3: ENCRYPTION_KEY Lost

Problem

Infisical ENCRYPTION_KEY lost, all backups lost.

Result: ALL SECRETS PERMANENTLY LOST - No recovery possible.

Assessment

  • Severity: CATASTROPHIC
  • RTO: N/A (unrecoverable)
  • Data Loss: 100% of all secrets

Why This Is Unrecoverable

Infisical uses symmetric encryption:

Secret → Encrypt with ENCRYPTION_KEY → Store in the Infisical database
Retrieve → Decrypt with ENCRYPTION_KEY → Secret

NO ENCRYPTION_KEY = NO DECRYPTION = NO SECRETS

There is no master key, no backdoor, no recovery mechanism.

What You Must Do

Immediate actions:

  1. Accept Total Loss:
    • All API keys GONE
    • All passwords GONE
    • All database credentials GONE

  2. Regenerate Everything:

# Anthropic
Go to: https://console.anthropic.com/settings/keys
Create: New API key

# Google
Go to: https://console.cloud.google.com/apis/credentials
Create: New API key

# Repeat for EVERY service (8-16 hours)
  3. Rebuild Infisical:
# Start fresh (new ENCRYPTION_KEY)
cd ~/infisical
docker-compose down
rm -rf .env mongodb_data/

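# Generate new encryption key
# (Infisical's self-hosting docs specify a random 16-byte hex string)
openssl rand -hex 16

# Create new .env with ENCRYPTION_KEY set to the generated value, then start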
docker-compose up -d

Time to complete: 8-16 hours (depending on number of secrets)

Prevention (CRITICAL)

The ONLY defense is multiple backups:

  1. Primary (Hetzner): ~/infisical/.env.backup
  2. Secondary (Kimsufi, GPG encrypted): Encrypted copy on second server
  3. Tertiary (Local, GPG encrypted): ~/Backups/infisical-key.gpg
  4. Quaternary (Cloud, GPG encrypted): Google Drive/Dropbox
  5. Quinary (Physical): USB drive in safe

Test restoration quarterly
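
Counting how many backup copies are actually present turns the "multiple backups" rule into a check that can fail loudly. The functions below are a sketch: `count_backups` and `enough_backups` are hypothetical names, and any path passed for the Kimsufi, cloud, or USB copies is an assumption about where those locations are mounted or synced locally.

```shell
# Print how many of the given paths exist and are non-empty.
count_backups() {
  local n=0 path
  for path in "$@"; do
    if [ -s "$path" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# enough_backups MIN PATH...: succeed only if at least MIN copies are present.
enough_backups() {
  local min="$1"
  shift
  [ "$(count_backups "$@")" -ge "$min" ]
}

# Example (not executed here; the second path is from the list above):
#   enough_backups 2 ~/infisical/.env.backup ~/Backups/infisical-key.gpg || echo "TOO FEW COPIES"
```

Presence is not the same as restorability, so this complements the quarterly restoration test rather than replacing it.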


Scenario 4: Hetzner VPS Destroyed

Problem

Hetzner VPS completely destroyed, all data lost.

Assessment

  • Severity: HIGH
  • RTO: about 45 minutes with Terraform (15 minutes of server provisioning plus 30 minutes of service deployment) vs 2-4 hours manual
  • Data Loss: Application data (if no backups); infrastructure config is preserved in Terraform

Recovery Steps

Option A: Terraform Rebuild (Fastest - 15 minutes)

  1. Verify Terraform State is Intact:
cd ~/Coding/terraform-infra
terraform state list
# Should list hcloud_server.hetzner_vps (state still records the server until the next plan detects it is gone)
  2. Rebuild VPS:
terraform apply
# Terraform will recreate the destroyed server
# - Same SSH keys attached
# - Same firewall rules
# - Same server type (CPX42)
# - NEW IP address (will be different)
  3. Update DNS (if Cloudflare managed by Terraform):
# Terraform automatically updates DNS to new IP
# (Cloudflare A record references hcloud_server.hetzner_vps.ipv4_address)

terraform apply  # DNS updated automatically
  4. Restore Application Data:
# SSH to new server
ssh kavi@<new-ip>

# Restore Infisical from backup
# (see Scenario 3 for ENCRYPTION_KEY restore)

# Pull docker-compose.yml from Git
git clone git@github.com:kavi/infra-config.git
cd infra-config
docker-compose up -d
  5. Verify Services:
    docker-compose ps  # All services "Up"
    curl https://secrets.kua.cl  # DNS propagated
    

Time: 15 minutes (server provisioning) + 30 minutes (service deployment) = 45 minutes total

Option B: Restore from Snapshot (Manual - 1 hour)

  1. Access Hetzner Console → Snapshots
  2. Create Server from Snapshot
  3. Update DNS to new IP (manual Cloudflare console)
  4. Verify Services running

Option C: Manual Rebuild (Slowest - 2-4 hours)

  1. Create new CPX42 VPS (Hetzner console)
  2. Install Docker
  3. Configure firewall
  4. Add SSH keys
  5. Restore Infisical (.env from backup)
  6. Deploy services from Git
  7. Update DNS

Prevention

  • NEW: Terraform configuration (infrastructure as code)
  • NEW: terraform state pull backups (weekly)
  • ✅ Daily Hetzner snapshots (automated) - faster restore
  • ✅ Infisical .env backup (weekly) - CRITICAL
  • ✅ docker-compose.yml in Git - service definitions preserved

Terraform Benefit: VPS destroyed at 3 AM → terraform apply at 9 AM → server reprovisioned in 15 minutes, services redeployed about 30 minutes after that


Scenario 5: Complete Infrastructure Loss

Problem

Both Production and Development VPS destroyed. All containers lost. Need to rebuild entire infrastructure from scratch.

Example: Hetzner datacenter fire, account compromised and servers deleted, etc.

Assessment

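  • Severity: CATASTROPHIC
  • RTO: 40 minutes (Terraform) vs 8-24 hours (manual)
  • Data Loss: None for files on the Storage Box (they survive VPS loss); databases are restored from the latest daily backup, so up to 24h of database changes can be lost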

Critical Fact: Photos, databases, and configs on Storage Box are independent of VPS. When VPS is destroyed, Storage Box data survives.

Recovery

Option A: Terraform-Powered Recovery (40 minutes) - TESTED

Prerequisites (must have backups of these):

  • ✅ Terraform repository (~/Coding/terraform-infra)
  • ✅ Provider credentials (HCLOUD_TOKEN, CLOUDFLARE_API_KEY, CLOUDFLARE_EMAIL)
  • ✅ Infisical ENCRYPTION_KEY (from Storage Box, MacBook, Kimsufi, or paper backup)
  • ✅ Storage Box survives (contains all backups)

Detailed Step-by-Step Recovery:

Phase 1: Environment Setup (5 minutes)

  1. Set Provider Credentials:
# On local MacBook (or any device)
# Paste the real values from the password manager; never store them in docs or Git
export HCLOUD_TOKEN="<hetzner-api-token>"
export CLOUDFLARE_API_KEY="<cloudflare-api-key>"
export CLOUDFLARE_EMAIL="kdoi@email.com"
  2. Navigate to Terraform Directory:
    cd ~/Coding/terraform-infra
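
A missing provider credential usually surfaces only partway through `terraform plan`, so checking the environment first saves a cycle. The guard below is a minimal sketch; `require_env` is a hypothetical helper, and the variable names in the example are the three exported in this phase.

```shell
# Report every unset or empty variable named in the arguments.
# Returns 1 if any are missing, 0 otherwise.
require_env() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"   # indirect lookup; pass trusted literal names only
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (not executed here):
#   require_env HCLOUD_TOKEN CLOUDFLARE_API_KEY CLOUDFLARE_EMAIL && terraform plan
```

The function prints names only, never values, so its output is safe to paste into an incident log.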
    

Phase 2: Rebuild Infrastructure (10 minutes)

  1. Preview Infrastructure Recreation:
terraform plan
# Should show: 2 VPS to create, 10 DNS records to create, 2 firewalls to create
  2. Recreate ENTIRE Infrastructure:
terraform apply
# Type "yes" when prompted
# Wait ~10 minutes for VPS creation + cloud-init setup

What Terraform Does Automatically:

  • ✅ Creates Production VPS (CPX31) at new IP
  • ✅ Creates Development VPS (CX32) at new IP
  • ✅ Configures firewalls (ports 80, 443, 22, 2283, etc.)
  • ✅ Adds SSH keys to both VPS
  • ✅ Updates ALL DNS records to new IPs (9 A records + wildcard)
  • ✅ Runs cloud-init scripts (installs Docker, mounts Storage Box)

Phase 3: Verify Infrastructure (5 minutes)

  1. Check VPS Created Successfully:
terraform output
# Shows: production_vps_ip, development_vps_ip, dns_subdomains
  2. Test SSH Access:
ssh production "echo Production VPS online"
ssh dev "echo Development VPS online"
  3. Verify Storage Box Mounted (data survived!):
ssh production "df -h | grep storagebox"
# Should show: /mnt/storagebox mounted via rclone

ssh production "ls -lh /mnt/storagebox/immich/upload"
# Should show: 5GB+ of photos (SURVIVED VPS destruction!)

Phase 4: Restore Databases (15 minutes)

  1. Restore Immich Database:
ssh production "cd /opt/immich && docker-compose up -d immich-postgres"

# Wait 30 seconds for postgres to start

ssh production "docker exec -i immich-postgres psql -U immich immich < /mnt/storagebox/backups/daily/immich_db_latest.sql"
  2. Restore n8n Database:
ssh production "cd /opt/n8n && docker-compose up -d n8n-postgres"

ssh production "docker exec -i n8n-postgres psql -U n8n n8n < /mnt/storagebox/backups/daily/n8n_db_latest.sql"
  3. Restore Infisical Database + ENCRYPTION_KEY:
# CRITICAL: Restore ENCRYPTION_KEY first!
# The Storage Box is mounted on the production VPS, so copy on the server itself
ssh production "cp /mnt/storagebox/backups/infisical_keys/ENCRYPTION_KEY /opt/infisical/"

# Or restore from MacBook backup:
scp ~/Backups/Infisical/ENCRYPTION_KEY production:/opt/infisical/

ssh production "cd /opt/infisical && docker-compose up -d infisical-postgres"

ssh production "docker exec -i infisical-postgres psql -U infisical infisical < /mnt/storagebox/backups/daily/infisical_db_latest.sql"
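
Piping an empty or truncated dump into psql fails quietly and leaves an empty database, so a quick sanity check on each backup file before restoring is cheap insurance. The sketch below assumes the daily backups are plain-format SQL produced by pg_dump or pg_dumpall, whose output begins with a "PostgreSQL database dump" (or "cluster dump") comment header; `dump_looks_valid` is a hypothetical name.

```shell
# Return 0 if FILE is non-empty and starts like a plain-format PostgreSQL dump.
dump_looks_valid() {
  local file="$1"
  [ -s "$file" ] || return 1
  # pg_dump writes "-- PostgreSQL database dump"; pg_dumpall writes "cluster dump".
  head -n 5 "$file" | grep -Eq 'PostgreSQL database (cluster )?dump'
}

# Example, run on the production VPS (not executed here):
#   dump_looks_valid /mnt/storagebox/backups/daily/immich_db_latest.sql || echo "do not restore"
```

Compressed or custom-format dumps would not match this header and need a different check.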

Phase 5: Start All Services (5 minutes)

  1. Start Production Services:
# Start all services in correct order
ssh production "cd /opt/immich && docker-compose up -d"
ssh production "cd /opt/kuanary && docker-compose up -d"
ssh production "cd /opt/imgproxy && docker-compose up -d"
ssh production "cd /opt/obsidian && docker-compose up -d"
ssh production "cd /opt/infra-docs && docker-compose up -d"
ssh production "cd /opt/n8n && docker-compose up -d"
ssh production "cd /opt/portainer && docker-compose up -d"
ssh production "cd /opt/infisical && docker-compose up -d"
ssh production "cd /opt/icloud-sync && docker-compose up -d"
  2. Start Development Services:
ssh dev "cd /opt/open-webui && docker-compose up -d"
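
The per-service commands above follow one pattern, so they can be driven from a list, which also makes the start order explicit. This is a sketch under assumptions: `start_services` and the `RUNNER` variable are hypothetical, and every service is assumed to live in `/opt/<name>` as in the commands above.

```shell
# start_services HOST SERVICE...
# Starts each service's compose stack on HOST in the order given and stops at
# the first failure. RUNNER defaults to ssh; RUNNER=echo gives a dry run.
start_services() {
  local host="$1" svc
  shift
  for svc in "$@"; do
    "${RUNNER:-ssh}" "$host" "cd /opt/$svc && docker-compose up -d" || return 1
  done
}

# Examples (not executed here):
#   RUNNER=echo start_services production immich kuanary imgproxy   # print only
#   start_services production immich kuanary imgproxy obsidian infra-docs n8n portainer infisical icloud-sync
#   start_services dev open-webui
```

The dry run is worth doing once during a calm period so the list is known to be complete before an emergency.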

Phase 6: Verification (5 minutes)

  1. Verify DNS Propagation (TTL=300, wait 5 minutes):
for domain in photos.kua.cl media.kua.cl cdn.kua.cl notes.kua.cl docs.kua.cl n8n.kua.cl secrets.kua.cl dev.kua.cl; do
  echo -n "$domain: "
  dig +short $domain
done
# All should return new VPS IPs
  2. Test All Services:
curl -I https://photos.kua.cl    # Immich
curl -I https://media.kua.cl     # Kuanary
curl -I https://cdn.kua.cl       # imgproxy
curl -I https://notes.kua.cl     # Obsidian
curl -I https://docs.kua.cl      # This documentation site
curl -I https://n8n.kua.cl       # n8n
curl -I https://secrets.kua.cl   # Infisical
curl -I https://dev.kua.cl       # open-webui
  3. Verify Photo Count in Immich:
# Log in to https://photos.kua.cl
# Check asset count: Should show 100,805+ photos
# All metadata, albums, face tags restored from database
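
Reading eight `dig` results by eye is easy to get wrong under pressure, so the DNS check from this phase can be reduced to "print only the domains that still point elsewhere". The filter below is a sketch; `dns_mismatches` is a hypothetical helper, and `PROD_IP` and `DEV_IP` in the example are placeholder variables for the addresses shown by `terraform output`.

```shell
# Read "domain ip" pairs on stdin and print each domain whose IP is not one of
# the expected addresses given as arguments. Returns 1 if any domain mismatches.
dns_mismatches() {
  local domain ip expected ok bad=0
  while read -r domain ip; do
    ok=0
    for expected in "$@"; do
      if [ "$ip" = "$expected" ]; then
        ok=1
      fi
    done
    if [ "$ok" -eq 0 ]; then
      echo "$domain"
      bad=1
    fi
  done
  return "$bad"
}

# Example (not executed here):
#   for d in photos.kua.cl n8n.kua.cl dev.kua.cl; do
#     echo "$d $(dig +short "$d" | head -n 1)"
#   done | dns_mismatches "$PROD_IP" "$DEV_IP"
```

A domain that does not resolve yields an empty IP and is reported as a mismatch, so an empty result means every record has propagated.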

Total Recovery Time: 40 minutes

Data Loss: ZERO

  • Photos: Already on Storage Box (5GB+)
  • Databases: Restored from latest daily backup (max 24h old)
  • Obsidian vaults: Already on Storage Box
  • iCloud sync: Already on Storage Box
  • Configurations: In Git + Storage Box backups

Cost: Same as before (€25.98/month for both VPS)

  1. Rebuild EVERYTHING (10 minutes):
terraform apply
# Creates:
# - Hetzner VPS (CPX42)
# - SSH keys (from your ssh-keys.tf)
# - Cloudflare DNS (all kua.cl records)
# - S3 buckets
# - Firewall rules
  2. Deploy Services (5 minutes):
ssh kavi@<new-vps-ip>
git clone git@github.com:kavi/infra-config.git
cd infra-config

# Restore Infisical ENCRYPTION_KEY
# (from GPG backup on USB/Cloud)

docker-compose up -d  # All services running

Total for this shorter path: about 30 minutes from zero to operational infrastructure; the full tested procedure above took 40 minutes

Option B: Manual Recovery (8-24 hours) - No Terraform

Phase 1: Establish Access (2 hours)

  • Get new device, recover cloud accounts, generate SSH keys

Phase 2: Rebuild Infrastructure (4 hours)

  • Create VPS manually (Hetzner console)
  • Configure DNS manually (Cloudflare)
  • Setup firewall manually
  • Install Docker

Phase 3: Reconstruct Secrets (8 hours)

  • If ENCRYPTION_KEY exists: Restore Infisical → all secrets recovered
  • If ENCRYPTION_KEY lost: Regenerate ALL API keys manually

Phase 4: Verify (2 hours)

  • Test services, create new backups, update docs

Total: 8-24 hours

Prevention

This scenario should NEVER happen if backups are maintained:

  1. Terraform Repository (MOST CRITICAL):
    • ✅ On GitHub (offsite, version controlled)
    • ✅ Cloned on multiple devices (MacBook, iPad, PC)
    • ✅ Contains ENTIRE infrastructure definition

  2. Infisical ENCRYPTION_KEY (CRITICAL):
    • ✅ GPG-encrypted backup on USB drive (in safe)
    • ✅ GPG-encrypted backup on Cloud (Google Drive/Dropbox)
    • ✅ GPG-encrypted backup on Kimsufi (secondary server)
    • ✅ 5 copies minimum

  3. Terraform State (IMPORTANT):
    • ✅ Remote state on Storage Box S3 (survives VPS loss)
    • terraform state pull backups (weekly, stored offsite)

  4. Provider Credentials (IMPORTANT):
    • ✅ Hetzner API token (in password manager)
    • ✅ Cloudflare API token (in password manager)

With Terraform: Complete infrastructure loss → 30-40 minutes to rebuild

Without Terraform: Complete infrastructure loss → 8-24 hours to rebuild

The Terraform Advantage: Infrastructure is code, stored in Git. As long as GitHub exists, you can rebuild everything.


Scenario 6: Locked Out of All Servers

Problem

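The servers are reachable over SSH, but you cannot log in to the user account or use sudo (for example, a locked password or a broken sudo configuration).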

Recovery

Use Hetzner Rescue System:

  1. Mount filesystem
  2. Reset password or fix sudo
  3. Reboot normally

Scenario 7: Rapid Recovery via Bootstrap

Problem

You have lost your primary device or are setting up a completely new environment and need to regain access to everything (SSH, Infisical, Repositories) quickly.

Recovery Steps

  1. Obtain Bundle: Get your device-bootstrap.age from your secure backup (USB/Cloud).
  2. Install age: brew install age
  3. Execute:
    age -d device-bootstrap.age | bash
    
  4. Verify:
    • ssh bruno (Should work immediately)
    • infisical secrets get ENCRYPTION_KEY (Should work immediately)
    • cd ~/coder-core (Repository should be ready)

Prevention Checklist

Daily

  • [ ] Verify SSH access from primary device
  • [ ] Check services are running

Weekly

  • [ ] Test dev() container startup
  • [ ] Verify ENCRYPTION_KEY backups exist

Monthly

  • [ ] Test Infisical backup restoration
  • [ ] Review emergency procedures

Quarterly

  • [ ] Test complete disaster recovery
  • [ ] Verify all backups decrypt correctly

Summary

Critical Recovery Procedures (With Terraform):

  • ✅ Lost SSH keys → Terraform (5 min) vs Hetzner rescue (30 min)
  • ✅ Infisical down → Cached secrets (15 min)
  • ⚠️ ENCRYPTION_KEY lost → UNRECOVERABLE (must have backups)
  • ✅ VPS destroyed → terraform apply (15 min) vs Snapshot restore (1-2 hours)
  • ✅ Complete loss → Clone repo + apply (30 min) vs Full manual rebuild (8-24 hours)

Terraform Recovery Advantage:

| Scenario | Manual Time | Terraform Time | Time Saved |
|----------|-------------|----------------|------------|
| SSH keys lost | 30 min | 5 min | 25 min (83% faster) |
| VPS destroyed | 2-4 hours | 15 min | 1.75-3.75 hours (88-94% faster) |
| Complete loss | 8-24 hours | 30 min | 7.5-23.5 hours (94-98% faster) |

Prevention is Everything:

  • Terraform Repository: On GitHub, cloned on all devices (MOST CRITICAL)
  • ENCRYPTION_KEY: 5 copies minimum, test quarterly (CRITICAL)
  • Terraform State: Remote backend + weekly backups (IMPORTANT)
  • VPS Snapshots: Daily automated (for faster restores)
  • Documentation: Keep this runbook updated
