Disaster Recovery Runbook¶
Emergency procedures for infrastructure failures - Step-by-step recovery guides for critical scenarios.
Table of Contents¶
- Overview
- Critical Files and Backups
- Scenario 1: Lost All SSH Keys
- Scenario 2: Infisical Server Down
- Scenario 3: ENCRYPTION_KEY Lost
- Scenario 4: Hetzner VPS Destroyed
- Scenario 5: Complete Infrastructure Loss
- Scenario 6: Locked Out of All Servers
- Scenario 7: Rapid Recovery via Bootstrap
- Prevention Checklist
Overview¶
Purpose¶
This runbook provides emergency recovery procedures for catastrophic failures:
- Lost SSH access
- Secrets manager down
- Critical encryption keys lost
- Server destroyed
- Complete infrastructure loss
Terraform-Based Recovery¶
NEW: With Terraform managing infrastructure, many disaster scenarios become significantly easier to recover from:
- VPS destroyed → `terraform apply` rebuilds it
- DNS misconfigured → `terraform plan` shows drift, `terraform apply` fixes it
- Firewall rules lost → Already in code, `terraform apply` restores them
- SSH keys lost → Terraform state shows which keys exist; add new ones via code
- Complete loss → Use the Bootstrap System to recover critical keys and credentials in seconds.
However, Terraform does NOT help with:
- ❌ Infisical ENCRYPTION_KEY lost (still unrecoverable)
- ❌ Terraform state file lost (need backups)
- ❌ Application data lost (need separate backups)
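The drift check mentioned above can be scripted. The following is a minimal sketch, assuming Terraform is installed and the working directory is already initialised; the function name `check_drift` is hypothetical. It relies on `terraform plan -detailed-exitcode`, which exits with 0 when nothing would change, 1 on error, and 2 when changes are pending.

```shell
# Classify the result of `terraform plan -detailed-exitcode`:
#   0 = state matches reality, 2 = changes pending (drift or missing resources),
#   anything else = the plan itself failed.
check_drift() {
    # $1: Terraform working directory (adjust to your checkout)
    rc=0
    ( cd "$1" && terraform plan -detailed-exitcode -input=false ) >/dev/null 2>&1 || rc=$?
    case $rc in
        0) echo "in-sync" ;;
        2) echo "drift" ;;
        *) echo "error"; return 1 ;;
    esac
}
```

For example, `check_drift ~/Coding/terraform-infra` prints `drift` after a server has been deleted outside Terraform. A plan failure (expired token, unreachable backend) is reported as `error` rather than being mistaken for a clean state.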
Recovery Time Objectives (RTO)¶
Updated with Tested Recovery Times (December 2025):
| Scenario | Manual Recovery | Terraform Recovery | Data Loss | Test Status |
|---|---|---|---|---|
| Single service failure | 10-30 min | 5-10 min | None (restore from backup) | ✅ Tested |
| Database corruption | 30-60 min | 10-15 min | Up to 24h (last backup) | ✅ Tested |
| VPS destroyed | 4-8 hours | 40 minutes | None (backups on Storage Box) | ✅ Tested |
| Total infrastructure loss | 8-24 hours | 40 minutes | None (Terraform + backups) | ✅ Tested |
| Lost SSH keys | 30 min (console) | 5 min (Terraform) | None | ✅ Tested |
| Infisical down | 15 minutes | 15 minutes | None (use cached secrets) | ⚠️ Partial |
| ENCRYPTION_KEY lost | UNRECOVERABLE | UNRECOVERABLE | ALL SECRETS LOST | N/A |
| Terraform state lost | N/A | 1 hour (import) | None (can rebuild) | ⚠️ Partial |
Key Findings:
- Terraform reduces total infrastructure recovery from as much as 24 hours to 40 minutes (about 97% less time; about 92% against the 8-hour best case)
- Most critical path: `terraform apply` (10 min) + database restore (15 min) + verification (15 min)
- Data on Storage Box survives VPS loss (photos, databases, configs all safe)
- ENCRYPTION_KEY is the ONLY unrecoverable failure (must back up to 4 locations)
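The arithmetic behind these figures is easy to re-run when the measured times change. This is a small sketch using only shell integer arithmetic; the helper name `pct_saved` is hypothetical, and percentages are rounded down.

```shell
# Percentage of time saved, rounded down to a whole percent.
pct_saved() {
    # $1: manual recovery time in minutes, $2: Terraform recovery time in minutes
    echo $(( ($1 - $2) * 100 / $1 ))
}

# Critical path from the findings above: apply + database restore + verification.
critical_path_min=$(( 10 + 15 + 15 ))
echo "critical path: ${critical_path_min} min"
echo "saved vs 24 h manual rebuild: $(pct_saved $((24 * 60)) "$critical_path_min")%"
echo "saved vs 8 h manual rebuild: $(pct_saved $((8 * 60)) "$critical_path_min")%"
```

With the times listed above this prints a 40-minute critical path and savings of 97% and 91% respectively.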
Critical Files and Backups¶
The Holy Trinity¶
These three items enable complete recovery:
- SSH Keys (per-device)
  - Location: `~/.ssh/id_ed25519_<device>`
  - Backup: On device only (never in cloud)
  - Recovery: Generate new keys, use console access
- Infisical ENCRYPTION_KEY
  - Location: `~/infisical/.env` on Hetzner VPS
  - Backup: MUST have GPG-encrypted copies on:
    - Kimsufi server
    - Local machine (MacBook)
    - Cloud storage (encrypted)
    - Physical USB drive (in safe)
  - Recovery: Restore from backup
  - IF LOST: ALL SECRETS PERMANENTLY LOST
- Terraform State (CRITICAL)
  - Location: Storage Box S3 backend `s3://terraform-state/prod/terraform.tfstate`
  - Backup: Storage Box snapshots + `terraform state pull > backup.tfstate` (weekly)
  - Recovery: Restore from Storage Box or rebuild via imports
  - IF LOST: Can rebuild via `terraform import` but tedious (1-2 hours)
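The weekly state backup can be wrapped so that a failed or empty pull never leaves a file that looks like a valid backup. This is a sketch under assumptions: the function name `backup_tf_state` and the destination directory are hypothetical, and only `terraform state pull` from the runbook is used.

```shell
# Pull the remote state into a dated file; keep the file only if the pull
# succeeded and produced a non-empty result.
backup_tf_state() {
    # $1: Terraform working directory, $2: directory that receives the backups
    dest="$2/backup-$(date +%Y%m%d).tfstate"
    if ! ( cd "$1" && terraform state pull ) > "$dest.tmp" || [ ! -s "$dest.tmp" ]; then
        rm -f "$dest.tmp"
        return 1
    fi
    mv "$dest.tmp" "$dest"
    echo "$dest"
}
```

Writing to a temporary name first means an interrupted run cannot overwrite an earlier good backup made on the same day.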
Backup Verification Schedule¶
Weekly:
- ✅ Verify Infisical ENCRYPTION_KEY backups exist
- ✅ Test SSH access from all devices
- ✅ Verify dev() container secrets load correctly
- ✅ NEW: `terraform state pull > backup-$(date +%Y%m%d).tfstate` (backup state)
- ✅ NEW: `terraform plan` shows "No changes" (verify state matches reality)
Monthly:
- ✅ Test Infisical backup restoration (on test environment)
- ✅ Verify Storage Box snapshots are current
- ✅ Review and update emergency contact information
- ✅ NEW: Test Terraform disaster recovery (destroy test resource → terraform apply)
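Part of the schedule above can be checked mechanically. The sketch below tests whether a backup file exists and was modified within a given number of days; the function name `backup_is_fresh` is hypothetical, and a recent modification time does not prove the file decrypts, so the restoration tests remain necessary.

```shell
# Succeeds only if the file exists and was modified less than N days ago.
backup_is_fresh() {
    # $1: path to a backup file, $2: maximum allowed age in days
    [ -f "$1" ] || return 1
    # find prints the path only when the file is newer than $2 full days
    [ -n "$(find "$1" -mtime "-$2" 2>/dev/null)" ]
}
```

A weekly cron job could run `backup_is_fresh ~/Backups/infisical-key.gpg 7 || echo "ENCRYPTION_KEY backup is stale"`.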
Scenario 1: Lost All SSH Keys¶
Problem¶
All devices lost/stolen, no SSH access to any server.
Example: MacBook, iPad, and PC all lost simultaneously.
Assessment¶
- Severity: HIGH
- RTO: 30 minutes - 2 hours (depending on access method)
- Data Loss: None (SSH keys don't contain data)
Recovery Steps¶
Option A: Terraform (Fastest - 5 minutes)
- Generate New SSH Key (on new device):
# On new MacBook/device
ssh-keygen -t ed25519 -C "kavi@new-macbook" -f ~/.ssh/id_ed25519_new-macbook
- Clone Terraform Repository:
- Add New SSH Key to Terraform:
# Edit ssh-keys.tf
vim ssh-keys.tf
# Add new resource
resource "hcloud_ssh_key" "new_macbook" {
  name       = "kavi-new-macbook"
  public_key = file("~/.ssh/id_ed25519_new-macbook.pub")
}
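# Add to server ssh_keys list
# In hetzner-vps.tf, add hcloud_ssh_key.new_macbook.id
# Caution: the hcloud provider injects ssh_keys only when a server is created.
# Changing the list on an existing server makes Terraform plan a replacement
# (destroy and recreate), so read the plan carefully before applying.
- Apply Changes:
terraform init # Download providers
terraform plan # Verify changes; stop if the server is marked "must be replaced"
terraform apply # Add new SSH key
# Type: yes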
- Test SSH:
Option B: Hetzner Cloud Console (If Terraform Not Available)
- Access Hetzner Cloud Console:
- Activate Rescue System:
Select VPS → Rescue → Activate Linux rescue system
Choose: linux64
Note the root password shown
Click: Activate
- Reboot into Rescue:
- Access via Rescue Console:
- Mount Filesystem:
- Add New SSH Key:
# Generate new key on new device first
# Then copy public key content
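# In rescue system (assumes the server's root filesystem is mounted at /mnt)
mkdir -p /mnt/home/kavi/.ssh
cat >> /mnt/home/kavi/.ssh/authorized_keys << 'EOF'
ssh-ed25519 AAAAC3... kavi@new-macbook
EOF
# Fix permissions (1000:1000 assumes kavi has UID/GID 1000; check /mnt/etc/passwd)
chmod 700 /mnt/home/kavi/.ssh
chmod 600 /mnt/home/kavi/.ssh/authorized_keys
chown -R 1000:1000 /mnt/home/kavi/.ssh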
- Reboot Normally:
- Test SSH:
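The key-installation step in Option B is easy to get wrong under pressure (duplicate entries, wrong permissions). The following sketch makes it repeatable; the function name `install_pubkey` is hypothetical, and it deliberately leaves ownership alone, so the `chown` from the step above is still required when running as root in the rescue system.

```shell
# Append a public key to authorized_keys only if it is not already present,
# and set the permissions sshd expects.
install_pubkey() {
    # $1: home directory of the target user as seen from the rescue system
    #     (for example /mnt/home/kavi), $2: the full public key line
    ssh_dir="$1/.ssh"
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$ssh_dir/authorized_keys"
    # -x matches whole lines only, -F treats the key as a literal string
    grep -qxF -- "$2" "$ssh_dir/authorized_keys" ||
        printf '%s\n' "$2" >> "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
}
```

Running it twice with the same key leaves a single entry, which keeps `authorized_keys` readable after several recovery attempts.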
Prevention¶
- ✅ NEW: Keep Terraform repository cloned on all devices (MacBook, iPad, PC)
- ✅ NEW: `git push` Terraform changes immediately (keep repo up to date)
- ✅ Keep backup SSH key on secondary device
- ✅ Document Hetzner rescue system procedure (for extreme cases)
- ✅ Store Hetzner credentials securely (password manager)
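With Terraform: Even if you lose ALL devices, you can clone the terraform-infra repo on a new device → add a new SSH key → `terraform apply`. This still requires GitHub access and the provider credentials (Hetzner API token) from your password manager.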
Scenario 2: Infisical Server Down¶
Problem¶
Hetzner VPS offline, can't access Infisical to get secrets.
Example: VPS crashed, hardware failure, network issue.
Assessment¶
- Severity: MEDIUM
- RTO: 15 minutes - 1 hour
- Data Loss: None (secrets exist, just inaccessible)
Recovery Steps¶
Option A: Use Cached Secrets (Immediate)
- Check for Recent dev() Exports:
# On local machine or Kimsufi
ls -lat /tmp/env-*.list | head -5
# Use if found (may be stale)
docker run --env-file /tmp/env-12345.list ...
- Verify Secrets Are Current:
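One way to approach the staleness check is to look at the newest cached export and its modification time. This is a sketch only: the function names are hypothetical, it assumes the `env-*.list` naming shown above, and a recent file still says nothing about whether an individual secret was rotated afterwards.

```shell
# Print the most recently modified env-*.list file in a directory (nothing if none).
newest_env_file() {
    # $1: directory holding the exports (the runbook uses /tmp)
    ls -t "$1"/env-*.list 2>/dev/null | head -n 1
}

# Succeeds if the newest export was modified less than one day ago.
cache_is_recent() {
    f=$(newest_env_file "$1")
    [ -n "$f" ] && [ -n "$(find "$f" -mtime -1)" ]
}
```

For example, `cache_is_recent /tmp || echo "cached secrets are older than a day"` gives a quick signal before a container is started from the cache.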
Option B: Restore Hetzner VPS
- Diagnose Problem:
# Try SSH
ssh kavi@100.80.53.55
# Connection timed out? VPS is down or unreachable
# Connection refused? Host is up but sshd is not accepting connections
# Try ping
ping -c 4 100.80.53.55
# No response? Network issue or VPS offline
- Check Hetzner Status:
- Restart VPS (if running but unresponsive):
- Restart Infisical:
ssh kavi@100.80.53.55
cd ~/infisical
docker-compose down
docker-compose up -d
# Verify
docker-compose ps
# All services should be "Up"
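Containers reported as "Up" may still need some time before they answer requests. A generic polling helper, sketched below, can confirm that the service really responds; the function name `wait_until` is hypothetical and the helper simply re-runs whatever command it is given.

```shell
# Re-run a command until it succeeds or the attempt budget is used up.
wait_until() {
    # $1: maximum attempts, $2: seconds between attempts, rest: command to run
    attempts=$1
    delay=$2
    shift 2
    i=1
    while :; do
        "$@" && return 0
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}
```

For instance, `wait_until 12 5 curl -fsS -o /dev/null https://secrets.kua.cl` polls the Infisical URL for up to about a minute and fails loudly if it never comes back.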
Prevention¶
- ✅ Regular Infisical backups (weekly automated)
- ✅ Monitor VPS uptime (uptime monitoring service)
- ✅ Hetzner VPS snapshots (daily)
- ✅ Document Infisical restart procedure
Scenario 3: ENCRYPTION_KEY Lost¶
Problem¶
Infisical ENCRYPTION_KEY lost, all backups lost.
Result: ALL SECRETS PERMANENTLY LOST - No recovery possible.
Assessment¶
- Severity: CATASTROPHIC
- RTO: N/A (unrecoverable)
- Data Loss: 100% of all secrets
Why This Is Unrecoverable¶
Infisical uses symmetric encryption:
Secret → Encrypt with ENCRYPTION_KEY → Store in MongoDB
Retrieve → Decrypt with ENCRYPTION_KEY → Secret
NO ENCRYPTION_KEY = NO DECRYPTION = NO SECRETS
There is no master key, no backdoor, no recovery mechanism.
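The property described above can be demonstrated with any symmetric cipher. The sketch below uses `openssl enc` with AES-256-CBC purely as a stand-in; it is not Infisical's actual scheme, the function names are hypothetical, and it assumes an OpenSSL build that supports `-pbkdf2` (1.1.1 or later).

```shell
# Illustration only: the same key material is needed to encrypt and to decrypt.
encrypt_secret() {
    # $1: plaintext, $2: key material (used as a passphrase here)
    printf '%s' "$1" | openssl enc -aes-256-cbc -pbkdf2 -a -pass "pass:$2"
}

decrypt_secret() {
    # $1: base64 ciphertext from encrypt_secret, $2: key material
    printf '%s\n' "$1" | openssl enc -d -aes-256-cbc -pbkdf2 -a -pass "pass:$2" 2>/dev/null
}

ciphertext=$(encrypt_secret "example-api-key" "original-key")
decrypt_secret "$ciphertext" "original-key"; echo
decrypt_secret "$ciphertext" "some-other-key" >/dev/null || echo "wrong key: nothing recoverable"
```

With the original key the plaintext comes back; with any other key the result is an error or garbage, which is exactly the situation after the ENCRYPTION_KEY and all of its backups are gone.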
What You Must Do¶
Immediate actions:
- Accept Total Loss:
  - All API keys GONE
  - All passwords GONE
  - All database credentials GONE
- Regenerate Everything:
# Anthropic
Go to: https://console.anthropic.com/settings/keys
Create: New API key
# Google
Go to: https://console.cloud.google.com/apis/credentials
Create: New API key
# Repeat for EVERY service (8-16 hours)
- Rebuild Infisical:
# Start fresh (new ENCRYPTION_KEY)
cd ~/infisical
docker-compose down
rm -rf .env mongodb_data/
# Generate new encryption key (Infisical's self-hosting docs specify a 16-byte hex string)
openssl rand -hex 16
# Create new .env
docker-compose up -d
Time to complete: 8-16 hours (depending on number of secrets)
Prevention (CRITICAL)¶
The ONLY defense is multiple backups:
- Primary (Hetzner): `~/infisical/.env.backup`
- Secondary (Kimsufi, GPG encrypted): Encrypted copy on second server
- Tertiary (Local, GPG encrypted): `~/Backups/infisical-key.gpg`
- Quaternary (Cloud, GPG encrypted): Google Drive/Dropbox
- Quinary (Physical): USB drive in safe
Test restoration quarterly
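Between restoration tests, a cheap integrity check is to compare checksums of the backup copies. This sketch assumes that the same GPG-encrypted file was copied to each location (separately encrypted copies will legitimately differ) and that the copies are reachable as local paths; the function names are hypothetical.

```shell
# SHA-256 of a file, using whichever tool is available.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1
    else
        shasum -a 256 "$1" | cut -d ' ' -f 1
    fi
}

# Succeeds only if every listed copy is byte-identical to the reference copy.
copies_match() {
    # $1: reference copy, remaining arguments: other copies of the same file
    [ -f "$1" ] || return 1
    ref=$(sha256_of "$1")
    shift
    for copy in "$@"; do
        [ -f "$copy" ] && [ "$(sha256_of "$copy")" = "$ref" ] || {
            echo "MISMATCH: $copy"
            return 1
        }
    done
    echo "all copies match"
}
```

A matching checksum shows the copies have not been corrupted or truncated; only an actual decrypt-and-restore test proves they are usable.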
Scenario 4: Hetzner VPS Destroyed¶
Problem¶
Hetzner VPS completely destroyed, all data lost.
Assessment¶
- Severity: HIGH
- RTO: 15 minutes to reprovision with Terraform (about 45 minutes including service deployment) vs 2-4 hours (manual)
- Data Loss: Application data (if no backups); infrastructure config is preserved in Terraform
Recovery Steps¶
Option A: Terraform Rebuild (Fastest - 15 minutes)
- Verify Terraform State is Intact:
cd ~/Coding/terraform-infra
terraform state list
# Should list hcloud_server.hetzner_vps (the state still records it; terraform plan detects that the server is gone)
- Rebuild VPS:
terraform apply
# Terraform will recreate the destroyed server
# - Same SSH keys attached
# - Same firewall rules
# - Same server type (CPX42)
# - NEW IP address (will be different)
- Update DNS (if Cloudflare managed by Terraform):
# Terraform automatically updates DNS to new IP
# (Cloudflare A record references hcloud_server.hetzner_vps.ipv4_address)
terraform apply # DNS updated automatically
- Restore Application Data:
# SSH to new server
ssh kavi@<new-ip>
# Restore Infisical from backup
# (see Scenario 3 for ENCRYPTION_KEY restore)
# Pull docker-compose.yml from Git
git clone git@github.com:kavi/infra-config.git
cd infra-config
docker-compose up -d
- Verify Services:
Time: 15 minutes (server provisioning) + 30 minutes (service deployment) = 45 minutes total
Option B: Restore from Snapshot (Manual - 1 hour)
- Access Hetzner Console → Snapshots
- Create Server from Snapshot
- Update DNS to new IP (manual Cloudflare console)
- Verify Services running
Option C: Manual Rebuild (Slowest - 2-4 hours)
- Create new CPX42 VPS (Hetzner console)
- Install Docker
- Configure firewall
- Add SSH keys
- Restore Infisical (.env from backup)
- Deploy services from Git
- Update DNS
Prevention¶
- ✅ NEW: Terraform configuration (infrastructure as code)
- ✅ NEW: `terraform state pull` backups (weekly)
- ✅ Daily Hetzner snapshots (automated) - faster restore
- ✅ Infisical .env backup (weekly) - CRITICAL
- ✅ docker-compose.yml in Git - service definitions preserved
Terraform Benefit: VPS destroyed at 3 AM → `terraform apply` at 9 AM → server reprovisioned in 15 minutes, services back in about 45 minutes
Scenario 5: Complete Infrastructure Loss¶
Problem¶
Both Production and Development VPS destroyed. All containers lost. Need to rebuild entire infrastructure from scratch.
Example: Hetzner datacenter fire, account compromised and servers deleted, etc.
Assessment¶
- Severity: CATASTROPHIC
- RTO: 40 minutes (Terraform) vs 8-24 hours (manual)
- Data Loss: ZERO (backups on Storage Box survive VPS loss)
Critical Fact: Photos, databases, and configs on Storage Box are independent of VPS. When VPS is destroyed, Storage Box data survives.
Recovery¶
Option A: Terraform-Powered Recovery (40 minutes) - TESTED ✅
Prerequisites (must have backups of these):
- ✅ Terraform repository (`~/Coding/terraform-infra`)
- ✅ Provider credentials (HCLOUD_TOKEN, CLOUDFLARE_API_KEY, CLOUDFLARE_EMAIL)
- ✅ Infisical ENCRYPTION_KEY (from Storage Box, MacBook, Kimsufi, or paper backup)
- ✅ Storage Box survives (contains all backups)
Detailed Step-by-Step Recovery:
Phase 1: Environment Setup (5 minutes)¶
- Set Provider Credentials:
# On local MacBook (or any device)
# Load the real values from your password manager; never paste live credentials into docs or Git
export HCLOUD_TOKEN="<hetzner-api-token>"
export CLOUDFLARE_API_KEY="<cloudflare-api-key>"
export CLOUDFLARE_EMAIL="<cloudflare-account-email>"
- Navigate to Terraform Directory:
Phase 2: Rebuild Infrastructure (10 minutes)¶
- Preview Infrastructure Recreation:
- Recreate ENTIRE Infrastructure:
What Terraform Does Automatically:
- ✅ Creates Production VPS (CPX31) at new IP
- ✅ Creates Development VPS (CX32) at new IP
- ✅ Configures firewalls (ports 80, 443, 22, 2283, etc.)
- ✅ Adds SSH keys to both VPS
- ✅ Updates ALL DNS records to new IPs (9 A records + wildcard)
- ✅ Runs cloud-init scripts (installs Docker, mounts Storage Box)
Phase 3: Verify Infrastructure (5 minutes)¶
- Check VPS Created Successfully:
- Test SSH Access:
- Verify Storage Box Mounted (data survived!):
ssh production "df -h | grep storagebox"
# Should show: /mnt/storagebox mounted via rclone
ssh production "ls -lh /mnt/storagebox/immich/upload"
# Should show: 5GB+ of photos (SURVIVED VPS destruction!)
Phase 4: Restore Databases (15 minutes)¶
- Restore Immich Database:
ssh production "cd /opt/immich && docker-compose up -d immich-postgres"
# Wait 30 seconds for postgres to start
ssh production "docker exec -i immich-postgres psql -U immich immich < /mnt/storagebox/backups/daily/immich_db_latest.sql"
- Restore n8n Database:
ssh production "cd /opt/n8n && docker-compose up -d n8n-postgres"
ssh production "docker exec -i n8n-postgres psql -U n8n n8n < /mnt/storagebox/backups/daily/n8n_db_latest.sql"
- Restore Infisical Database + ENCRYPTION_KEY:
# CRITICAL: Restore ENCRYPTION_KEY first!
# The Storage Box is mounted on the server, so copy it on the server itself:
ssh production "cp /mnt/storagebox/backups/infisical_keys/ENCRYPTION_KEY /opt/infisical/"
# Or restore from MacBook backup:
scp ~/Backups/Infisical/ENCRYPTION_KEY production:/opt/infisical/
ssh production "cd /opt/infisical && docker-compose up -d infisical-postgres"
ssh production "docker exec -i infisical-postgres psql -U infisical infisical < /mnt/storagebox/backups/daily/infisical_db_latest.sql"
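The fixed 30-second wait before each restore can be replaced by polling until PostgreSQL accepts connections. This is a sketch, assuming the database containers ship `pg_isready` (the official PostgreSQL images do) and that `production` is a working SSH host alias; the function name `wait_for_postgres` is hypothetical.

```shell
# Poll a PostgreSQL container over SSH until it reports that it is ready.
wait_for_postgres() {
    # $1: ssh host alias, $2: container name, $3: database user,
    # $4: maximum attempts (default 30), $5: seconds between attempts (default 2)
    tries=${4:-30}
    pause=${5:-2}
    while [ "$tries" -gt 0 ]; do
        ssh "$1" "docker exec $2 pg_isready -U $3" >/dev/null 2>&1 && return 0
        tries=$((tries - 1))
        sleep "$pause"
    done
    return 1
}
```

A restore would then be gated as `wait_for_postgres production immich-postgres immich && ssh production "docker exec -i immich-postgres psql ..."`, which avoids piping a dump into a database that is still starting.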
Phase 5: Start All Services (5 minutes)¶
- Start Production Services:
# Start all services in correct order
ssh production "cd /opt/immich && docker-compose up -d"
ssh production "cd /opt/kuanary && docker-compose up -d"
ssh production "cd /opt/imgproxy && docker-compose up -d"
ssh production "cd /opt/obsidian && docker-compose up -d"
ssh production "cd /opt/infra-docs && docker-compose up -d"
ssh production "cd /opt/n8n && docker-compose up -d"
ssh production "cd /opt/portainer && docker-compose up -d"
ssh production "cd /opt/infisical && docker-compose up -d"
ssh production "cd /opt/icloud-sync && docker-compose up -d"
- Start Development Services:
Phase 6: Verification (5 minutes)¶
- Verify DNS Propagation (TTL=300, wait 5 minutes):
for domain in photos.kua.cl media.kua.cl cdn.kua.cl notes.kua.cl docs.kua.cl n8n.kua.cl secrets.kua.cl dev.kua.cl; do
echo -n "$domain: "
dig +short $domain
done
# All should return new VPS IPs
- Test All Services:
curl -I https://photos.kua.cl # Immich
curl -I https://media.kua.cl # Kuanary
curl -I https://cdn.kua.cl # imgproxy
curl -I https://notes.kua.cl # Obsidian
curl -I https://docs.kua.cl # This documentation site
curl -I https://n8n.kua.cl # n8n
curl -I https://secrets.kua.cl # Infisical
curl -I https://dev.kua.cl # open-webui
- Verify Photo Count in Immich:
# Log in to https://photos.kua.cl
# Check asset count: Should show 100,805+ photos
# All metadata, albums, face tags restored from database
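The per-service `curl -I` checks can be folded into one pass/fail summary. The sketch below is one way to do it, with a hypothetical function name `check_sites`; it treats any 2xx or 3xx status as healthy, which is an assumption, since a service that answers 401 until login would need its own rule.

```shell
# Request each host over HTTPS and report the status code.
# Returns non-zero if any host fails to answer with a 2xx or 3xx status.
check_sites() {
    failed=0
    for host in "$@"; do
        # -m 10 caps each request at 10 seconds; 000 marks "no HTTP response"
        code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "https://$host") || code=000
        case $code in
            2??|3??) echo "OK   $host ($code)" ;;
            *)       echo "FAIL $host ($code)"; failed=1 ;;
        esac
    done
    return "$failed"
}
```

Called as `check_sites photos.kua.cl media.kua.cl cdn.kua.cl notes.kua.cl docs.kua.cl n8n.kua.cl secrets.kua.cl dev.kua.cl`, it gives a single exit status that a recovery script can act on.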
Total Recovery Time: 40 minutes
Data Loss: ZERO
- Photos: Already on Storage Box (5GB+)
- Databases: Restored from latest daily backup (max 24h old)
- Obsidian vaults: Already on Storage Box
- iCloud sync: Already on Storage Box
- Configurations: In Git + Storage Box backups
Cost: Same as before (€25.98/month for both VPS)
- Rebuild EVERYTHING (10 minutes):
terraform apply
# Creates:
# - Hetzner VPS (CPX42)
# - SSH keys (from your ssh-keys.tf)
# - Cloudflare DNS (all kua.cl records)
# - S3 buckets
# - Firewall rules
- Deploy Services (5 minutes):
ssh kavi@<new-vps-ip>
git clone git@github.com:kavi/infra-config.git
cd infra-config
# Restore Infisical ENCRYPTION_KEY
# (from GPG backup on USB/Cloud)
docker-compose up -d # All services running
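Total: about 30 minutes from zero to running services on this condensed path; the tested end-to-end recovery above, including database restores and verification, took 40 minutes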
Option B: Manual Recovery (8-24 hours) - No Terraform
Phase 1: Establish Access (2 hours)
- Get new device, recover cloud accounts, generate SSH keys
Phase 2: Rebuild Infrastructure (4 hours)
- Create VPS manually (Hetzner console)
- Configure DNS manually (Cloudflare)
- Setup firewall manually
- Install Docker
Phase 3: Reconstruct Secrets (8 hours)
- If ENCRYPTION_KEY exists: Restore Infisical → all secrets recovered
- If ENCRYPTION_KEY lost: Regenerate ALL API keys manually
Phase 4: Verify (2 hours)
- Test services, create new backups, update docs
Total: 8-24 hours
Prevention¶
This scenario should NEVER happen if backups are maintained:
- Terraform Repository (MOST CRITICAL):
  - ✅ On GitHub (offsite, version controlled)
  - ✅ Cloned on multiple devices (MacBook, iPad, PC)
  - ✅ Contains ENTIRE infrastructure definition
- Infisical ENCRYPTION_KEY (CRITICAL):
  - ✅ GPG-encrypted backup on USB drive (in safe)
  - ✅ GPG-encrypted backup on Cloud (Google Drive/Dropbox)
  - ✅ GPG-encrypted backup on Kimsufi (secondary server)
  - ✅ 5 copies minimum
- Terraform State (IMPORTANT):
  - ✅ Remote state on Storage Box S3 (survives VPS loss)
  - ✅ `terraform state pull` backups (weekly, stored offsite)
- Provider Credentials (IMPORTANT):
  - ✅ Hetzner API token (in password manager)
  - ✅ Cloudflare API token (in password manager)
- With Terraform: Complete infrastructure loss → about 40 minutes to rebuild (tested)
- Without Terraform: Complete infrastructure loss → 8-24 hours to rebuild
The Terraform Advantage: Infrastructure is code, stored in Git. As long as GitHub exists, you can rebuild everything.
Scenario 6: Locked Out of All Servers¶
Problem¶
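The servers are reachable over SSH, but you cannot log in to or use the user account (for example, a locked password or a broken sudo configuration).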
Recovery¶
Use the Hetzner Rescue System (activation steps are in Scenario 1, Option B):
- Mount filesystem
- Reset password or fix sudo
- Reboot normally
Scenario 7: Rapid Recovery via Bootstrap¶
Problem¶
You have lost your primary device or are setting up a completely new environment and need to regain access to everything (SSH, Infisical, Repositories) quickly.
Recovery Steps¶
- Obtain Bundle: Get your `device-bootstrap.age` from your secure backup (USB/Cloud).
- Install age: `brew install age`
- Execute:
- Verify:
  - `ssh bruno` (Should work immediately)
  - `infisical secrets get ENCRYPTION_KEY` (Should work immediately)
  - `cd ~/coder-core` (Repository should be ready)
Prevention Checklist¶
Daily¶
- [ ] Verify SSH access from primary device
- [ ] Check services are running
Weekly¶
- [ ] Test dev() container startup
- [ ] Verify ENCRYPTION_KEY backups exist
Monthly¶
- [ ] Test Infisical backup restoration
- [ ] Review emergency procedures
Quarterly¶
- [ ] Test complete disaster recovery
- [ ] Verify all backups decrypt correctly
Summary¶
Critical Recovery Procedures (With Terraform):
- ✅ Lost SSH keys → Terraform (5 min) vs Hetzner rescue (30 min)
- ✅ Infisical down → Cached secrets (15 min)
- ⚠️ ENCRYPTION_KEY lost → UNRECOVERABLE (must have backups)
- ✅ VPS destroyed → `terraform apply` (15 min to reprovision) vs Snapshot restore (about 1 hour)
- ✅ Complete loss → Clone repo + apply (40 min, tested) vs Full manual rebuild (8-24 hours)
Terraform Recovery Advantage:
| Scenario | Manual Time | Terraform Time | Time Saved |
|---|---|---|---|
| SSH keys lost | 30 min | 5 min | 25 min (83% faster) |
| VPS destroyed | 2-4 hours | 15 min | 1.75-3.75 hours (88-94% faster) |
| Complete loss | 8-24 hours | 40 min | 7.3-23.3 hours (92-97% faster) |
Prevention is Everything:
- Terraform Repository: On GitHub, cloned on all devices (MOST CRITICAL)
- ENCRYPTION_KEY: 5 copies minimum, test quarterly (CRITICAL)
- Terraform State: Remote backend + weekly backups (IMPORTANT)
- VPS Snapshots: Daily automated (for faster restores)
- Documentation: Keep this runbook updated
What's Next:
- Review Terraform Operations - Infrastructure changes
- Review SSH Operations - SSH key management
- Review Secrets Management - Infisical operations
- Review Infisical Architecture - ENCRYPTION_KEY backup