Disaster Recovery Runbook¶

Emergency procedures for infrastructure failures - Step-by-step recovery guides for critical scenarios.

Table of Contents¶

Overview
Critical Files and Backups
Scenario 1: Lost All SSH Keys
Scenario 2: Infisical Server Down
Scenario 3: ENCRYPTION_KEY Lost
Scenario 4: Hetzner VPS Destroyed
Scenario 5: Complete Infrastructure Loss
Scenario 6: Locked Out of All Servers
Scenario 7: Rapid Recovery via Bootstrap
Prevention Checklist

Overview¶

Purpose¶

This runbook provides emergency recovery procedures for catastrophic failures:

Lost SSH access
Secrets manager down
Critical encryption keys lost
Server destroyed
Complete infrastructure loss

Terraform-Based Recovery¶

NEW: With Terraform managing infrastructure, many disaster scenarios become significantly easier to recover from:

VPS destroyed → terraform apply rebuilds it
DNS misconfigured → terraform plan shows drift, terraform apply fixes it
Firewall rules lost → Already in code, terraform apply restores
SSH keys lost → Terraform state shows which keys exist, add new ones via code
Complete Loss → Use the Bootstrap System to recover critical keys and credentials in seconds.

However, Terraform does NOT help with:

❌ Infisical ENCRYPTION_KEY lost (still unrecoverable)
❌ Terraform state file lost (need backups)
❌ Application data lost (need separate backups)

Recovery Time Objectives (RTO)¶

Updated with Tested Recovery Times (December 2025):

Scenario	Manual Recovery	Terraform Recovery	Data Loss	Test Status
Single service failure	10-30 min	5-10 min	None (restore from backup)	✅ Tested
Database corruption	30-60 min	10-15 min	Up to 24h (last backup)	✅ Tested
VPS destroyed	4-8 hours	40 minutes	None (backups on Storage Box)	✅ Tested
Total infrastructure loss	8-24 hours	40 minutes	None (Terraform + backups)	✅ Tested
Lost SSH keys	30 min (console)	5 min (Terraform)	None	✅ Tested
Infisical down	15 minutes	15 minutes	None (use cached secrets)	⚠️ Partial
ENCRYPTION_KEY lost	UNRECOVERABLE	UNRECOVERABLE	ALL SECRETS LOST	N/A
Terraform state lost	N/A	1 hour (import)	None (can rebuild)	⚠️ Partial

Key Findings:

Terraform reduces total infrastructure recovery from 24 hours → 40 minutes (about 97% less time)
Most critical path: terraform apply (10 min) + database restore (15 min) + verification (15 min)
Data on Storage Box survives VPS loss (photos, databases, configs all safe)
Losing the ENCRYPTION_KEY is the ONLY unrecoverable failure (keep GPG-encrypted backups in all 4 locations listed under Critical Files and Backups)
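
The time-saved percentages follow from simple arithmetic on the before and after durations. A minimal sketch, assuming both durations are given in minutes; `pct_reduction` is a hypothetical helper, not part of any tool used in this runbook:

```shell
# Percentage reduction in recovery time, in whole percent (integer division
# truncates). Both arguments are durations in minutes.
pct_reduction() {
  before=$1
  after=$2
  echo $(( (before - after) * 100 / before ))
}

pct_reduction 1440 40   # 24 hours -> 40 minutes, prints 97
pct_reduction 30 5      # 30 minutes -> 5 minutes, prints 83
```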

Critical Files and Backups¶

The Holy Trinity¶

These three items enable complete recovery:

SSH Keys (per-device)
Location: ~/.ssh/id_ed25519_<device>
Backup: On device only (never in cloud)
Recovery: Generate new keys, use console access
Infisical ENCRYPTION_KEY
Location: ~/infisical/.env on Hetzner VPS
Backup: MUST have GPG-encrypted copies on:
- Kimsufi server
- Local machine (MacBook)
- Cloud storage (encrypted)
- Physical USB drive (in safe)
Recovery: Restore from backup
IF LOST: ALL SECRETS PERMANENTLY LOST
Terraform State (CRITICAL)
Location: Storage Box S3 backend s3://terraform-state/prod/terraform.tfstate
Backup: Storage Box snapshots + terraform state pull > backup.tfstate (weekly)
Recovery: Restore from Storage Box or rebuild via imports
IF LOST: Can rebuild via terraform import but tedious (1-2 hours)

Backup Verification Schedule¶

Weekly:

✅ Verify Infisical ENCRYPTION_KEY backups exist
✅ Test SSH access from all devices
✅ Verify dev() container secrets load correctly
✅ NEW: terraform state pull > backup-$(date +%Y%m%d).tfstate (backup state)
✅ NEW: terraform plan shows "No changes" (verify state matches reality)

Monthly:

✅ Test Infisical backup restoration (on test environment)
✅ Verify Storage Box snapshots are current
✅ Review and update emergency contact information
✅ NEW: Test Terraform disaster recovery (destroy test resource → terraform apply)
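
The two Terraform items in the weekly list could be scripted along these lines. This is a sketch, assuming it runs from the Terraform repository with provider credentials already exported; `plan_status` and `weekly_state_check` are hypothetical names. It relies on `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error, and 2 when changes are pending.

```shell
# Map the exit status of `terraform plan -detailed-exitcode` to a short label.
plan_status() {
  case "$1" in
    0) echo "in-sync" ;;   # state matches reality
    2) echo "drift" ;;     # plan contains changes
    *) echo "error" ;;     # plan itself failed
  esac
}

# Back up the state, then report whether the plan is clean.
weekly_state_check() {
  terraform state pull > "backup-$(date +%Y%m%d).tfstate" || return 1
  status=0
  terraform plan -detailed-exitcode >/dev/null 2>&1 || status=$?
  plan_status "$status"
}
```

Calling `weekly_state_check` from cron and alerting on anything other than `in-sync` is one way to make the check hard to forget.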

Scenario 1: Lost All SSH Keys¶

Problem¶

All devices lost/stolen, no SSH access to any server.

Example: MacBook, iPad, and PC all lost simultaneously.

Assessment¶

Severity: HIGH
RTO: 30 minutes - 2 hours (depending on access method)
Data Loss: None (SSH keys don't contain data)

Recovery Steps¶

Option A: Terraform (Fastest - 5 minutes)

Generate New SSH Key (on new device):

# On new MacBook/device
ssh-keygen -t ed25519 -C "kavi@new-macbook" -f ~/.ssh/id_ed25519_new-macbook

Clone Terraform Repository:

# The new public key must be added to your GitHub account first (or clone over HTTPS)
git clone git@github.com:kavi/terraform-infra.git
cd terraform-infra

Add New SSH Key to Terraform:

# Edit ssh-keys.tf
vim ssh-keys.tf

# Add new resource
resource "hcloud_ssh_key" "new_macbook" {
  name       = "kavi-new-macbook"
  public_key = file("~/.ssh/id_ed25519_new-macbook.pub")
}

# Add to server ssh_keys list
# In hetzner-vps.tf, add hcloud_ssh_key.new_macbook.id

Apply Changes:

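terraform init  # Download providers
terraform plan  # Verify changes. The hcloud provider injects ssh_keys only at
                # server creation, so confirm the plan does not replace the server
terraform apply # Add new SSH key
# Type: yes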

Test SSH:

ssh kavi@100.80.53.55
# Should work with new key
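
One thing worth checking before the apply step: as far as the hcloud provider documentation describes it, the `ssh_keys` list on `hcloud_server` is only used at creation time, so editing it may make Terraform plan a server replacement. A small guard, as a sketch; `plan_has_replacement` and `guarded_apply` are hypothetical names, and the match relies on the "must be replaced" wording that `terraform plan` prints for replaced resources:

```shell
# Succeed if the plan text on stdin says any resource would be replaced.
plan_has_replacement() {
  grep -q 'must be replaced'
}

# Refuse to apply when the plan would destroy and recreate something.
guarded_apply() {
  if terraform plan -no-color | plan_has_replacement; then
    echo "Plan would replace a resource; review before applying." >&2
    return 1
  fi
  terraform apply
}
```

If the guard trips, the rescue-system route in Option B adds the key without touching the server.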

Option B: Hetzner Cloud Console (If Terraform Not Available)

Access Hetzner Cloud Console:

Go to: https://console.hetzner.cloud
Login with Hetzner account credentials

Activate Rescue System:

Select VPS → Rescue → Activate Linux rescue system
Choose: linux64
Note the root password shown
Click: Activate

Reboot into Rescue:

Click: Reset
Wait 2-3 minutes for rescue system to boot

Access via Rescue Console:

In Hetzner console: Open VNC console
Or SSH to VPS IP with rescue root password

Mount Filesystem:

# In rescue system
mount /dev/sda1 /mnt  # Adjust device as needed

Add New SSH Key:

# Generate new key on new device first
# Then copy public key content

# In rescue system
cat >> /mnt/home/kavi/.ssh/authorized_keys << 'EOF'
ssh-ed25519 AAAAC3... kavi@new-macbook
EOF

# Fix permissions
chmod 600 /mnt/home/kavi/.ssh/authorized_keys
chown 1000:1000 /mnt/home/kavi/.ssh/authorized_keys

Reboot Normally:

In Hetzner console: Disable rescue system
Click: Reset
Wait for normal boot

Test SSH:

# From new device
ssh kavi@100.80.53.55
# Should work with new key
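
The "Add New SSH Key" step can be made safe to repeat. The sketch below assumes the target filesystem is mounted as in the steps above; `add_authorized_key` is a hypothetical helper that appends the key only if it is missing and sets the permissions sshd expects on both the directory and the file.

```shell
# Append a public key to an authorized_keys file only if it is not already
# present, then tighten permissions on the .ssh directory and the file.
add_authorized_key() {
  auth_file=$1
  pubkey=$2
  ssh_dir=$(dirname "$auth_file")
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
  touch "$auth_file"
  # -x: whole-line match, -F: fixed string, -q: quiet
  grep -qxF -e "$pubkey" "$auth_file" || printf '%s\n' "$pubkey" >> "$auth_file"
  chmod 600 "$auth_file"
}
```

In the rescue system this would be called as `add_authorized_key /mnt/home/kavi/.ssh/authorized_keys "<public key line>"`, followed by the same `chown 1000:1000` as above, since ownership is not handled by the helper.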

Prevention¶

✅ NEW: Keep Terraform repository cloned on all devices (MacBook, iPad, PC)
✅ NEW: git push Terraform changes immediately (keep repo up to date)
✅ Keep backup SSH key on secondary device
✅ Document Hetzner rescue system procedure (for extreme cases)
✅ Store Hetzner credentials securely (password manager)

With Terraform: Even if you lose ALL devices, you can clone terraform-infra repo on new device → add new SSH key → terraform apply

Scenario 2: Infisical Server Down¶

Problem¶

Hetzner VPS offline, can't access Infisical to get secrets.

Example: VPS crashed, hardware failure, network issue.

Assessment¶

Severity: MEDIUM
RTO: 15 minutes - 1 hour
Data Loss: None (secrets exist, just inaccessible)

Recovery Steps¶

Option A: Use Cached Secrets (Immediate)

Check for Recent dev() Exports:

# On local machine or Kimsufi
ls -lat /tmp/env-*.list | head -5

# Use if found (may be stale)
docker run --env-file /tmp/env-12345.list ...

Verify Secrets Are Current:

# Check file timestamp
ls -l /tmp/env-12345.list
# If < 24 hours old, probably safe to use
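
The 24-hour rule of thumb can be checked mechanically instead of by reading timestamps. A minimal sketch; `is_fresh` is a hypothetical helper, and it assumes a `find` that supports `-mmin` (GNU and BSD versions do):

```shell
# Succeed if FILE exists and was modified less than MAX_HOURS hours ago.
is_fresh() {
  file=$1
  max_hours=$2
  [ -f "$file" ] || return 1
  # find prints the path only when the mtime is newer than the cutoff
  [ -n "$(find "$file" -mmin "-$((max_hours * 60))")" ]
}
```

For example, `is_fresh /tmp/env-12345.list 24 && echo "cache usable"` prints the message only for a file younger than a day. Freshness says nothing about whether a secret was rotated in the meantime, so the "probably" above still applies.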

Option B: Restore Hetzner VPS

Diagnose Problem:

# Try SSH
ssh kavi@100.80.53.55
# Timeout? VPS is likely down or unreachable
# Connection refused? VPS is up but sshd is not accepting connections

# Try ping
ping 100.80.53.55
# No response? Network issue or VPS offline

Check Hetzner Status:

Go to: https://console.hetzner.cloud
Check VPS status
Check for maintenance notifications

Restart VPS (if running but unresponsive):

In Hetzner console: Click "Power" → "Reboot"
Wait 2-3 minutes
Retry SSH

Restart Infisical:

ssh kavi@100.80.53.55
cd ~/infisical
docker-compose down
docker-compose up -d

# Verify
docker-compose ps
# All services should be "Up"

Prevention¶

✅ Regular Infisical backups (weekly automated)
✅ Monitor VPS uptime (uptime monitoring service)
✅ Hetzner VPS snapshots (daily)
✅ Document Infisical restart procedure

Scenario 3: ENCRYPTION_KEY Lost¶

Problem¶

Infisical ENCRYPTION_KEY lost, all backups lost.

Result: ALL SECRETS PERMANENTLY LOST - No recovery possible.

Assessment¶

Severity: CATASTROPHIC
RTO: N/A (unrecoverable)
Data Loss: 100% of all secrets

Why This Is Unrecoverable¶

Infisical uses symmetric encryption:

Secret → Encrypt with ENCRYPTION_KEY → Store in the Infisical database
Retrieve → Decrypt with ENCRYPTION_KEY → Secret

NO ENCRYPTION_KEY = NO DECRYPTION = NO SECRETS

There is no master key, no backdoor, no recovery mechanism.

What You Must Do¶

Immediate actions:

Accept Total Loss:
All API keys GONE
All passwords GONE
All database credentials GONE
Regenerate Everything:

# Anthropic
Go to: https://console.anthropic.com/settings/keys
Create: New API key

# Google
Go to: https://console.cloud.google.com/apis/credentials
Create: New API key

# Repeat for EVERY service (8-16 hours)

Rebuild Infisical:

# Start fresh (new ENCRYPTION_KEY)
cd ~/infisical
docker-compose down
rm -rf .env mongodb_data/

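# Generate new encryption key (Infisical's docs call for a random 16-byte hex string)
openssl rand -hex 16

# Create new .env containing ENCRYPTION_KEY=<generated value>, then start
docker-compose up -d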

Time to complete: 8-16 hours (depending on number of secrets)

Prevention (CRITICAL)¶

The ONLY defense is multiple backups:

Primary (Hetzner): ~/infisical/.env.backup
Secondary (Kimsufi, GPG encrypted): Encrypted copy on second server
Tertiary (Local, GPG encrypted): ~/Backups/infisical-key.gpg
Quaternary (Cloud, GPG encrypted): Google Drive/Dropbox
Quinary (Physical): USB drive in safe

Test restoration quarterly
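
The existence check for the backup copies can be automated for every location reachable from the current machine. A sketch, with `check_key_backups` as a hypothetical helper; it only confirms that each file exists and is non-empty, so it complements the quarterly restoration test rather than replacing it.

```shell
# Report which backup copies are missing or empty.
# Prints one line per problem and returns non-zero if any copy fails.
check_key_backups() {
  missing=0
  for path in "$@"; do
    if [ ! -s "$path" ]; then
      echo "MISSING OR EMPTY: $path"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

On the MacBook this might be run as `check_key_backups ~/Backups/infisical-key.gpg`, with further paths appended for a mounted USB drive or a synced cloud folder (those paths depend on the local setup and are not specified here).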

Scenario 4: Hetzner VPS Destroyed¶

Problem¶

Hetzner VPS completely destroyed, all data lost.

Assessment¶

Severity: HIGH
RTO: 15 minutes to provision with Terraform (about 45 minutes including service deployment) vs 2-4 hours (manual)
Data Loss: Application data (if no backups) - infrastructure config preserved in Terraform

Recovery Steps¶

Option A: Terraform Rebuild (Fastest - 15 minutes)

Verify Terraform State is Intact:

cd ~/Coding/terraform-infra
terraform state list
# Should list hcloud_server.hetzner_vps (state list shows addresses only;
# terraform plan reports that the server is missing and will be created)

Rebuild VPS:

terraform apply
# Terraform will recreate the destroyed server
# - Same SSH keys attached
# - Same firewall rules
# - Same server type (CPX42)
# - NEW IP address (will be different)

Update DNS (if Cloudflare managed by Terraform):

# Terraform automatically updates DNS to new IP
# (Cloudflare A record references hcloud_server.hetzner_vps.ipv4_address)

terraform apply  # DNS updated automatically

Restore Application Data:

# SSH to new server
ssh kavi@<new-ip>

# Restore Infisical from backup
# (see Critical Files and Backups for the ENCRYPTION_KEY backup locations)

# Pull docker-compose.yml from Git
git clone git@github.com:kavi/infra-config.git
cd infra-config
docker-compose up -d

Verify Services:

docker-compose ps  # All services "Up"
curl https://secrets.kua.cl  # DNS propagated

Time: 15 minutes (server provisioning) + 30 minutes (service deployment) = 45 minutes total

Option B: Restore from Snapshot (Manual - 1 hour)

Access Hetzner Console → Snapshots
Create Server from Snapshot
Update DNS to new IP (manual Cloudflare console)
Verify Services running

Option C: Manual Rebuild (Slowest - 2-4 hours)

Create new CPX42 VPS (Hetzner console)
Install Docker
Configure firewall
Add SSH keys
Restore Infisical (.env from backup)
Deploy services from Git
Update DNS

Prevention¶

✅ NEW: Terraform configuration (infrastructure as code)
✅ NEW: terraform state pull backups (weekly)
✅ Daily Hetzner snapshots (automated) - faster restore
✅ Infisical .env backup (weekly) - CRITICAL
✅ docker-compose.yml in Git - service definitions preserved

Terraform Benefit: VPS destroyed at 3 AM → terraform apply at 9 AM → server rebuilt in 15 minutes, services back online in about 45 minutes

Scenario 5: Complete Infrastructure Loss¶

Problem¶

Both Production and Development VPS destroyed. All containers lost. Need to rebuild entire infrastructure from scratch.

Example: Hetzner datacenter fire, account compromised and servers deleted, etc.

Assessment¶

Severity: CATASTROPHIC
RTO: 40 minutes (Terraform) vs 8-24 hours (manual)
Data Loss: ZERO (backups on Storage Box survive VPS loss)

Critical Fact: Photos, databases, and configs on Storage Box are independent of VPS. When VPS is destroyed, Storage Box data survives.

Recovery¶

Option A: Terraform-Powered Recovery (40 minutes) - TESTED ✅

Prerequisites (must have backups of these):

✅ Terraform repository (~/Coding/terraform-infra)
✅ Provider credentials (HCLOUD_TOKEN, CLOUDFLARE_API_KEY, CLOUDFLARE_EMAIL)
✅ Infisical ENCRYPTION_KEY (from Storage Box, MacBook, Kimsufi, or paper backup)
✅ Storage Box survives (contains all backups)

Detailed Step-by-Step Recovery:

Phase 1: Environment Setup (5 minutes)¶

Set Provider Credentials:

# On local MacBook (or any device)
# Load the real values from the password manager; never paste tokens into this runbook
export HCLOUD_TOKEN="<hetzner-api-token>"
export CLOUDFLARE_API_KEY="<cloudflare-api-key>"
export CLOUDFLARE_EMAIL="<cloudflare-account-email>"

Navigate to Terraform Directory:
```
cd ~/Coding/terraform-infra
```
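
Before moving on to Phase 2 it can help to confirm that all three provider variables are present without ever echoing their values. A minimal sketch; `require_env` is a hypothetical helper:

```shell
# Fail with a message naming every required variable that is unset or empty.
# Values are never printed.
require_env() {
  rc=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Used as `require_env HCLOUD_TOKEN CLOUDFLARE_API_KEY CLOUDFLARE_EMAIL && terraform plan`, the plan only runs when every credential is set.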

Phase 2: Rebuild Infrastructure (10 minutes)¶

Preview Infrastructure Recreation:

terraform plan
# Should show: 2 VPS to create, 10 DNS records to create, 2 firewalls to create

Recreate ENTIRE Infrastructure:

terraform apply
# Type "yes" when prompted
# Wait ~10 minutes for VPS creation + cloud-init setup

What Terraform Does Automatically:

✅ Creates Production VPS (CPX31) at new IP
✅ Creates Development VPS (CX32) at new IP
✅ Configures firewalls (ports 80, 443, 22, 2283, etc.)
✅ Adds SSH keys to both VPS
✅ Updates ALL DNS records to new IPs (9 A records + wildcard)
✅ Runs cloud-init scripts (installs Docker, mounts Storage Box)

Phase 3: Verify Infrastructure (5 minutes)¶

Check VPS Created Successfully:

terraform output
# Shows: production_vps_ip, development_vps_ip, dns_subdomains

Test SSH Access:

ssh production "echo Production VPS online"
ssh dev "echo Development VPS online"

Verify Storage Box Mounted (data survived!):

ssh production "df -h | grep storagebox"
# Should show: /mnt/storagebox mounted via rclone

ssh production "ls -lh /mnt/storagebox/immich/upload"
# Should show: 5GB+ of photos (SURVIVED VPS destruction!)

Phase 4: Restore Databases (15 minutes)¶

Restore Immich Database:

ssh production "cd /opt/immich && docker-compose up -d immich-postgres"

# Wait 30 seconds for postgres to start

ssh production "docker exec -i immich-postgres psql -U immich immich < /mnt/storagebox/backups/daily/immich_db_latest.sql"

Restore n8n Database:

ssh production "cd /opt/n8n && docker-compose up -d n8n-postgres"

ssh production "docker exec -i n8n-postgres psql -U n8n n8n < /mnt/storagebox/backups/daily/n8n_db_latest.sql"

Restore Infisical Database + ENCRYPTION_KEY:

# CRITICAL: Restore ENCRYPTION_KEY first!
# The Storage Box is mounted on the VPS, so copy the key there over SSH:
ssh production "cp /mnt/storagebox/backups/infisical_keys/ENCRYPTION_KEY /opt/infisical/"

# Or restore from MacBook backup:
scp ~/Backups/Infisical/ENCRYPTION_KEY production:/opt/infisical/

ssh production "cd /opt/infisical && docker-compose up -d infisical-postgres"

ssh production "docker exec -i infisical-postgres psql -U infisical infisical < /mnt/storagebox/backups/daily/infisical_db_latest.sql"
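
A fixed wait before each restore can be replaced by polling until the database accepts connections. The sketch below is a generic retry loop; `retry` is a hypothetical helper, and the usage example assumes the Postgres images ship `pg_isready`, as the official ones do.

```shell
# Run a command until it succeeds, up to MAX_TRIES attempts, sleeping
# DELAY seconds between attempts. Returns non-zero if every attempt fails.
retry() {
  max_tries=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    [ "$attempt" -ge "$max_tries" ] && return 1
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

For example, `retry 10 3 ssh production "docker exec immich-postgres pg_isready -U immich"` polls for up to roughly 30 seconds and returns as soon as Postgres is ready, after which the `psql` restore can start.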

Phase 5: Start All Services (5 minutes)¶

Start Production Services:

# Start all services in correct order
ssh production "cd /opt/immich && docker-compose up -d"
ssh production "cd /opt/kuanary && docker-compose up -d"
ssh production "cd /opt/imgproxy && docker-compose up -d"
ssh production "cd /opt/obsidian && docker-compose up -d"
ssh production "cd /opt/infra-docs && docker-compose up -d"
ssh production "cd /opt/n8n && docker-compose up -d"
ssh production "cd /opt/portainer && docker-compose up -d"
ssh production "cd /opt/infisical && docker-compose up -d"
ssh production "cd /opt/icloud-sync && docker-compose up -d"

Start Development Services:

ssh dev "cd /opt/open-webui && docker-compose up -d"

Phase 6: Verification (5 minutes)¶

Verify DNS Propagation (TTL=300, wait 5 minutes):

for domain in photos.kua.cl media.kua.cl cdn.kua.cl notes.kua.cl docs.kua.cl n8n.kua.cl secrets.kua.cl dev.kua.cl; do
  echo -n "$domain: "
  dig +short $domain
done
# All should return new VPS IPs
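
Comparing each answer against the new IPs by eye is error-prone with eight domains. A sketch of an automated comparison; `report_stale_dns` is a hypothetical helper that reads "domain ip" pairs on stdin and takes the expected addresses as arguments:

```shell
# Print every domain whose resolved address is not one of the expected IPs.
# Succeeds only if nothing is stale or unresolved.
report_stale_dns() {
  expected=" $* "
  stale=0
  while read -r domain ip; do
    match=no
    if [ -n "$ip" ]; then
      case "$expected" in *" $ip "*) match=yes ;; esac
    fi
    if [ "$match" = no ]; then
      echo "STALE: $domain -> ${ip:-no answer}"
      stale=1
    fi
  done
  [ "$stale" -eq 0 ]
}
```

The input can come from the loop above by echoing `"$domain $(dig +short "$domain" | head -n 1)"` for each domain and piping the result into `report_stale_dns`, with the expected addresses taken from `terraform output -raw production_vps_ip` and `terraform output -raw development_vps_ip`.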

Test All Services:

curl -I https://photos.kua.cl    # Immich
curl -I https://media.kua.cl     # Kuanary
curl -I https://cdn.kua.cl       # imgproxy
curl -I https://notes.kua.cl     # Obsidian
curl -I https://docs.kua.cl      # This documentation site
curl -I https://n8n.kua.cl       # n8n
curl -I https://secrets.kua.cl   # Infisical
curl -I https://dev.kua.cl       # open-webui
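
The same checks can produce a pass/fail summary instead of eight sets of headers. A sketch, with `is_healthy_status` and `check_urls` as hypothetical helpers; it treats any 2xx or 3xx response as healthy, which is an assumption about how these services answer an unauthenticated request.

```shell
# Treat 2xx and 3xx responses as healthy; anything else (including curl's
# 000 for "no response") as unhealthy.
is_healthy_status() {
  case "$1" in
    2[0-9][0-9] | 3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Print one line per URL with its status code and a verdict.
check_urls() {
  failed=0
  for url in "$@"; do
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url") || code=000
    if is_healthy_status "$code"; then
      echo "OK   $code $url"
    else
      echo "FAIL $code $url"
      failed=1
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Running `check_urls https://photos.kua.cl https://secrets.kua.cl` (and the remaining URLs) returns non-zero if any service fails, which makes it usable as the final gate of the recovery.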

Verify Photo Count in Immich:

# Log in to https://photos.kua.cl
# Check asset count: Should show 100,805+ photos
# All metadata, albums, face tags restored from database

Total Recovery Time: 40 minutes

Data Loss: ZERO

Photos: Already on Storage Box (5GB+)
Databases: Restored from latest daily backup (max 24h old)
Obsidian vaults: Already on Storage Box
iCloud sync: Already on Storage Box
Configurations: In Git + Storage Box backups

Cost: Same as before (€25.98/month for both VPS)

Rebuild EVERYTHING (10 minutes):

terraform apply
# Creates:
# - Hetzner VPS (CPX42)
# - SSH keys (from your ssh-keys.tf)
# - Cloudflare DNS (all kua.cl records)
# - S3 buckets
# - Firewall rules

Deploy Services (5 minutes):

ssh kavi@<new-vps-ip>
git clone git@github.com:kavi/infra-config.git
cd infra-config

# Restore Infisical ENCRYPTION_KEY
# (from GPG backup on USB/Cloud)

docker-compose up -d  # All services running

Total for this condensed path: about 30 minutes from zero to fully operational infrastructure (the tested, phase-by-phase procedure above takes 40 minutes)

Option B: Manual Recovery (8-24 hours) - No Terraform

Phase 1: Establish Access (2 hours)

Get new device, recover cloud accounts, generate SSH keys

Phase 2: Rebuild Infrastructure (4 hours)

Create VPS manually (Hetzner console)
Configure DNS manually (Cloudflare)
Setup firewall manually
Install Docker

Phase 3: Reconstruct Secrets (8 hours)

If ENCRYPTION_KEY exists: Restore Infisical → all secrets recovered
If ENCRYPTION_KEY lost: Regenerate ALL API keys manually

Phase 4: Verify (2 hours)

Test services, create new backups, update docs

Total: 8-24 hours

Prevention¶

This scenario should NEVER happen if backups are maintained:

Terraform Repository (MOST CRITICAL):
✅ On GitHub (offsite, version controlled)
✅ Cloned on multiple devices (MacBook, iPad, PC)
✅ Contains ENTIRE infrastructure definition
Infisical ENCRYPTION_KEY (CRITICAL):
✅ GPG-encrypted backup on USB drive (in safe)
✅ GPG-encrypted backup on Cloud (Google Drive/Dropbox)
✅ GPG-encrypted backup on Kimsufi (secondary server)
✅ 5 copies minimum
Terraform State (IMPORTANT):
✅ Remote state on Storage Box S3 (survives VPS loss)
✅ terraform state pull backups (weekly, stored offsite)
Provider Credentials (IMPORTANT):
✅ Hetzner API token (in password manager)
✅ Cloudflare API token (in password manager)

With Terraform: Complete infrastructure loss → 40 minutes to rebuild (tested)
Without Terraform: Complete infrastructure loss → 8-24 hours to rebuild

The Terraform Advantage: Infrastructure is code, stored in Git. As long as GitHub exists, you can rebuild everything.

Scenario 6: Locked Out of All Servers¶

Problem¶

SSH connections still reach the server, but the user account is locked out (for example, a lost password or broken sudo).

Recovery¶

Use Hetzner Rescue System:

Mount filesystem
Reset password or fix sudo
Reboot normally

Scenario 7: Rapid Recovery via Bootstrap¶

Problem¶

You have lost your primary device or are setting up a completely new environment and need to regain access to everything (SSH, Infisical, Repositories) quickly.

Recovery Steps¶

Obtain Bundle: Get your device-bootstrap.age from your secure backup (USB/Cloud).
Install age: brew install age
Execute:
```
age -d device-bootstrap.age | bash
```
Verify:
- ssh bruno (Should work immediately)
- infisical secrets get ENCRYPTION_KEY (Should work immediately)
- cd ~/coder-core (Repository should be ready)

Prevention Checklist¶

Daily¶

[ ] Verify SSH access from primary device
[ ] Check services are running

Weekly¶

[ ] Test dev() container startup
[ ] Verify ENCRYPTION_KEY backups exist

Monthly¶

[ ] Test Infisical backup restoration
[ ] Review emergency procedures

Quarterly¶

[ ] Test complete disaster recovery
[ ] Verify all backups decrypt correctly

Summary¶

Critical Recovery Procedures (With Terraform):

✅ Lost SSH keys → Terraform (5 min) vs Hetzner rescue (30 min)
✅ Infisical down → Cached secrets (15 min)
⚠️ ENCRYPTION_KEY lost → UNRECOVERABLE (must have backups)
✅ VPS destroyed → terraform apply (15 min to provision, about 45 min including service deployment) vs Snapshot restore (1-2 hours)
✅ Complete loss → Clone repo + apply (40 min, tested) vs Full manual rebuild (8-24 hours)

Terraform Recovery Advantage:

| Scenario | Manual Time | Terraform Time | Time Saved |
|----------|-------------|----------------|------------|
| SSH keys lost | 30 min | 5 min | 25 min (83% less) |
| VPS destroyed | 2-4 hours | 15 min | 1.75-3.75 hours (88-94% less) |
| Complete loss | 8-24 hours | 40 min | 7.3-23.3 hours (92-97% less) |

Prevention is Everything:

Terraform Repository: On GitHub, cloned on all devices (MOST CRITICAL)
ENCRYPTION_KEY: 5 copies minimum, test quarterly (CRITICAL)
Terraform State: Remote backend + weekly backups (IMPORTANT)
VPS Snapshots: Daily automated (for faster restores)
Documentation: Keep this runbook updated

What's Next:

Review Terraform Operations - Infrastructure changes
Review SSH Operations - SSH key management
Review Secrets Management - Infisical operations
Review Infisical Architecture - ENCRYPTION_KEY backup