Terraform Operations Runbook

Day-to-day Terraform operations: practical procedures for managing infrastructure after migration.

Overview

This runbook covers common Terraform operations you'll perform regularly:

Making infrastructure changes
Adding new resources
Updating existing resources
Multi-device workflows
Troubleshooting common issues
Emergency procedures

Audience: For use after completing Terraform migration (Weeks 1-3).

Daily Operations

Checking Infrastructure State

Before making changes, always check current state:

cd ~/Coding/terraform-infra

# Pull latest changes from Git
git pull

# See what Terraform is managing
terraform state list

# Check if actual infrastructure matches configuration
terraform plan

# Expected: "No changes. Your infrastructure matches the configuration."

If the plan shows unexpected changes:

- Someone else made manual changes (check with the team)
- Configuration drift (actual infrastructure differs from state)
- Investigate before proceeding
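For scripted drift checks, the exit status of the plan is easier to test than its text. The sketch below assumes the documented -detailed-exitcode behaviour (0 = no changes, 1 = error, 2 = changes pending). The function names plan_status and check_drift are illustrative and not part of Terraform.

```shell
# Map the exit status of `terraform plan -detailed-exitcode` to a label.
# 0 = no changes, 2 = changes pending, anything else = error.
plan_status() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "changes" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical wrapper: run the plan quietly and print the label.
# Must be run inside the Terraform working directory.
check_drift() {
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1
  plan_status "$?"
}
```

Calling check_drift from ~/Coding/terraform-infra prints clean, changes, or error. On anything other than clean, run terraform plan again without the redirect to read the details.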

Making a Simple Change

Example: Update VPS server type from CPX31 to CPX42.

Step 1: Edit configuration

# Edit VPS configuration
vim vps.tf

# Change:
# server_type = "cpx31"
# To:
# server_type = "cpx42"

Step 2: Validate and format

# Validate syntax
terraform validate

# Expected: Success! The configuration is valid.

# Format code
terraform fmt

Step 3: Review changes

# Generate plan
terraform plan

# Expected output:
# ~ hcloud_server.hetzner_vps will be updated in-place
#   ~ server_type = "cpx31" -> "cpx42"
#
# Plan: 0 to add, 1 to change, 0 to destroy.

Critical: Read EVERY line. Ensure:

- ✅ Only expected changes are shown
- ❌ No resources are being destroyed unintentionally
- ❌ No "-/+ must be replaced" (causes downtime)
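Reading every line stays the primary check, but a small filter can surface the dangerous lines first. This is a minimal sketch that assumes the plain-text plan format, in which each resource header reads "# ADDRESS will be destroyed" or "# ADDRESS must be replaced". The function name destructive_actions is hypothetical.

```shell
# List resource addresses that a plan would destroy or replace.
# Reads plain-text plan output on stdin and prints "ACTION: ADDRESS" lines.
destructive_actions() {
  sed -n -E 's/^ *# (.+) (will be destroyed|must be replaced)$/\2: \1/p'
}
```

Usage: terraform plan -no-color | destructive_actions. Empty output means no header line matched, which is a hint and not a substitute for reviewing the full plan.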

Step 4: Apply changes

# Apply (will show plan again and prompt for confirmation)
terraform apply

# Type: yes

# Wait for completion

Step 5: Verify

# Check VPS status
ssh kavi@100.80.53.55

# Verify new server type
cat /proc/cpuinfo | grep "model name" | wc -l
# Should show 8 cores (CPX42)

# Check services still running
docker ps

Step 6: Commit to Git

git add vps.tf
git commit -m "feat(vps): upgrade to CPX42 for increased performance"
git push

Adding Resources

Add New DNS Record

Task: Add api.kua.cl DNS record for new service.

cd ~/Coding/terraform-infra

# Edit DNS configuration
vim dns.tf

Add to dns.tf:

# API endpoint
resource "cloudflare_record" "api" {
  zone_id = var.cloudflare_zone_id
  name    = "api"
  value   = hcloud_server.hetzner_vps.ipv4_address
  type    = "A"
  proxied = false
  ttl     = 3600
}

Apply:

terraform plan
# Expected: Plan: 1 to add, 0 to change, 0 to destroy

terraform apply
# Type: yes

# Verify DNS
dig +short api.kua.cl
# Should show VPS IP

Commit:

git add dns.tf
git commit -m "feat(dns): add api.kua.cl A record"
git push

Add New SSH Key (New Device)

Task: Add SSH key for new PC.

Step 1: Get public key from new device

# On new PC, generate key
ssh-keygen -t ed25519 -C "kavi@pc" -f ~/.ssh/id_ed25519_pc

# Copy public key content
cat ~/.ssh/id_ed25519_pc.pub

Step 2: Add to terraform-infra repository

# On MacBook (or any device with repository)
cd ~/Coding/terraform-infra

# Add public key to keys directory
cat > keys/id_ed25519_pc.pub << 'EOF'
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5... kavi@pc
EOF

Step 3: Add to Terraform configuration

# Edit ssh-keys.tf
vim ssh-keys.tf

Add:

resource "hcloud_ssh_key" "pc" {
  name       = "kavi-pc"
  public_key = file("${path.module}/keys/id_ed25519_pc.pub")
}

Update VPS to use new key:

resource "hcloud_server" "hetzner_vps" {
  # ...
  ssh_keys = [
    hcloud_ssh_key.macbook.id,
    hcloud_ssh_key.ipad.id,
    hcloud_ssh_key.pc.id  # Added
  ]
}

Step 4: Apply

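terraform plan
# Expected: 1 to add (hcloud_ssh_key.pc).
# Caution: per the hcloud provider documentation, ssh_keys is injected only
# when a server is created, and changing it on an existing server forces a
# replacement. If the plan shows "-/+ must be replaced" for
# hcloud_server.hetzner_vps, do NOT apply: leave ssh_keys on the server
# unchanged and add the key to ~/.ssh/authorized_keys on the VPS instead.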

terraform apply
# Type: yes

Step 5: Test from new PC

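# On new PC (the key has a non-default filename, so pass it explicitly)
ssh -i ~/.ssh/id_ed25519_pc kavi@100.80.53.55

# Should work without a password once the public key is present in
# ~/.ssh/authorized_keys on the VPS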

Step 6: Commit

git add ssh-keys.tf keys/id_ed25519_pc.pub
git commit -m "feat(ssh): add PC SSH key for new device"
git push
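Before committing a key file, it helps to confirm that it really is the public half. The check below is a rough sketch: it rejects any file containing a private-key header and accepts only a single line beginning with ssh-ed25519. The function name is_ed25519_pubkey is made up for this example.

```shell
# Return 0 only if the file looks like a single ed25519 PUBLIC key.
is_ed25519_pubkey() {
  # Refuse anything that contains a private-key header.
  if grep -q 'PRIVATE KEY' "$1"; then
    return 1
  fi
  # Exactly one line: key type, base64 blob, optional comment.
  [ "$(wc -l < "$1" | tr -d ' ')" = "1" ] &&
    grep -q -E '^ssh-ed25519 [A-Za-z0-9+/=]+( .*)?$' "$1"
}
```

Usage: is_ed25519_pubkey keys/id_ed25519_pc.pub && git add keys/id_ed25519_pc.pub. For a stricter check, ssh-keygen -l -f keys/id_ed25519_pc.pub prints the fingerprint and fails on a malformed key.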

Updating Resources

Change DNS TTL

Task: Reduce TTL before infrastructure change.

# Edit dns.tf
vim dns.tf

# Change TTL:
resource "cloudflare_record" "root" {
  # ...
  ttl = 300  # Was: 3600 (reduced to 5 minutes)
}

# Apply
terraform plan
# Expected: ~ cloudflare_record.root
#   ttl: 3600 -> 300

terraform apply

# Commit
git add dns.tf
git commit -m "fix(dns): reduce TTL to 5 minutes for upcoming migration"
git push
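Lowering the TTL only helps once caches that stored the record under the old TTL have expired. Assuming resolvers honour the published TTL, the earliest safe time for the actual infrastructure change is the apply time plus the old TTL. The helper name ttl_safe_after is illustrative.

```shell
# Print the epoch second after which every cache that stored the record
# under the OLD TTL should have expired.
# $1 = epoch time the lower TTL was applied, $2 = old TTL in seconds
ttl_safe_after() {
  applied_epoch=$1
  old_ttl=$2
  echo $(( applied_epoch + old_ttl ))
}
```

Usage: ttl_safe_after "$(date +%s)" 3600 right after the apply. With the values above, that is one hour after the TTL change.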

Update Firewall Rules

Task: Allow new port for service.

# Edit firewall.tf
vim firewall.tf

# Add new rule:
resource "hcloud_firewall" "hetzner_vps" {
  # ... existing rules ...

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "8080"
    source_ips = ["0.0.0.0/0", "::/0"]
  }
}

# Apply
terraform plan
terraform apply

# Test
curl -I http://<VPS_IP>:8080

Multi-Device Workflow

Working from MacBook

# On MacBook
cd ~/Coding/terraform-infra

# Pull latest
git pull

# Make changes
vim vps.tf

# Plan
terraform plan

# Apply
terraform apply

# Commit and push
git add vps.tf
git commit -m "feat(vps): add label for monitoring"
git push

Switching to iPad

# On iPad (in a-Shell or Blink + VPS)
cd ~/Coding/terraform-infra

# Pull changes made on MacBook
git pull

# Verify state synchronized
terraform plan

# Expected: No changes (MacBook's apply already updated state)

# Make new changes if needed
vim dns.tf

# Apply
terraform apply

# Push
git add dns.tf
git commit -m "feat(dns): add monitoring.kua.cl"
git push

State Synchronization

How it works:

MacBook                    Storage Box S3              iPad
  |                              |                       |
  |------- terraform apply ----->|                       |
  |        (updates state)       |                       |
  |                              |                       |
  |        git push ------------>|<----- git pull -------|
  |                              |                       |
  |                              |<-- terraform plan ----|
  |                              |   (reads state)       |
  |                              |                       |

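Key: Remote state on the Storage Box keeps all devices synchronized. Git synchronizes the configuration (.tf files); Terraform reads and writes the state through the remote backend.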

Avoiding Conflicts

Problem: Two devices run terraform apply simultaneously.

Solution: State locking

# Device 1: terraform apply
# Acquires lock on state file

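# Device 2: terraform apply
# Error: Error acquiring the state lock (held by Device 1)

# By default Device 2 fails immediately; it does not wait.
# Retry later, or wait for the lock with: terraform apply -lock-timeout=5m

# Device 1: apply completes
# Releases lock

# Device 2: can now apply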

If lock stuck:

# Check if other device still running
# If confirmed no other process, force unlock:

terraform force-unlock <LOCK_ID>

# Use lock ID from error message

Common Tasks

Query Infrastructure

Get VPS IP:

terraform output vps_ip
# 46.224.146.107

List all resources:

terraform state list

# Example output:
# hcloud_ssh_key.macbook
# hcloud_ssh_key.ipad
# hcloud_server.hetzner_vps
# cloudflare_record.root
# cloudflare_record.secrets

Show specific resource:

terraform state show hcloud_server.hetzner_vps

# Shows all attributes:
# name = "hetzner-vps"
# server_type = "cpx42"
# ipv4_address = "46.224.146.107"
# etc.

Refresh State

When to use: Sync Terraform state with actual infrastructure.

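# Legacy command (deprecated; updates state without a review step)
terraform refresh

# Preferred: shows the detected changes and asks for confirmation
# before updating state
terraform apply -refresh-only

Use case: Someone changed an already-managed resource manually via the console, and you want the state to reflect it. Resources that Terraform does not manage yet need terraform import instead.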

Remove Resource from Terraform

Scenario: Want to manage resource manually, remove from Terraform.

# Remove from state (doesn't delete resource)
terraform state rm hcloud_ssh_key.old_key

# Verify removed
terraform state list | grep old_key
# (no output)

# Remove from configuration
vim ssh-keys.tf
# Delete resource block

# Terraform no longer manages this resource
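The grep old_key check above matches substrings, so it can also hit unrelated addresses that happen to contain the same text. A stricter variant, sketched here with the hypothetical helper state_has, compares whole lines of terraform state list output against the exact address.

```shell
# Return 0 if the exact resource address appears in `terraform state list`
# output supplied on stdin; return 1 otherwise.
state_has() {
  grep -F -x -q -- "$1"
}
```

Usage: terraform state list | state_has hcloud_ssh_key.old_key || echo "removed from state".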

Import Existing Resource

Scenario: Resource created manually, want Terraform to manage it.

Step 1: Add to configuration

vim firewall.tf

# Add resource definition:
resource "hcloud_firewall" "existing" {
  name = "existing-firewall"
  # ... (configure to match actual)
}

Step 2: Import

# Get resource ID from Hetzner console
# Example: 123456

terraform import hcloud_firewall.existing 123456

# Verify
terraform plan
# Should show: No changes (if config matches actual)

Troubleshooting

"No changes" but infrastructure differs

Symptom: terraform plan shows "No changes" but actual infrastructure is different.

Cause: State is out of sync.

Solution:

# Refresh state
terraform apply -refresh-only

# Now plan should show differences
terraform plan

# If still shows no changes:
# 1. Check if you're in correct directory
pwd
# Should be: ~/Coding/terraform-infra

# 2. Check if correct backend
grep "bucket" terraform.tf
# Should be: terraform-state on Storage Box

# 3. Verify state
terraform state pull | jq '.resources[] | .name'

"Resource already exists"

Symptom:

Error: Error creating server: server already exists

Cause: Trying to create resource that exists.

Solution: Import instead

# Find existing resource ID
hcloud server list

# Import into Terraform
terraform import hcloud_server.hetzner_vps <SERVER_ID>

# Adjust configuration to match
vim vps.tf

# Verify
terraform plan
# Should show: No changes

State Locked

Symptom:

Error: Error acquiring the state lock

Lock Info:
  ID: abc-123-def-456
  Path: terraform-state/terraform.tfstate
  Operation: OperationTypeApply
  Who: kavi@macbook
  Created: 2025-01-15 14:30:00

Cause: Another terraform process running, or previous run crashed.

Solution:

# 1. Check if process still running on other device
# Contact yourself / check other terminals

# 2. If confirmed no other process:
terraform force-unlock abc-123-def-456

# Use exact lock ID from error

# 3. Verify unlocked
terraform plan
# Should work now
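Copying the lock ID by hand is error-prone. The sketch below pulls it out of the error text, assuming the "ID:" line format shown in the symptom above. The function name extract_lock_id is hypothetical, and the result should still be passed to terraform force-unlock only after confirming that no other run is active.

```shell
# Print the lock ID from a "Error acquiring the state lock" message on stdin.
# Only the first "ID:" line is used.
extract_lock_id() {
  sed -n -E 's/^ *ID: *([^ ]+) *$/\1/p' | head -n 1
}
```

Usage: terraform plan -no-color 2>&1 | extract_lock_id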

Plan Shows Unexpected Destroy

Symptom:

Plan: 0 to add, 0 to change, 1 to destroy.

- hcloud_server.hetzner_vps will be destroyed

Cause: The resource block was removed, renamed, or commented out in the configuration, usually through an accidental edit.

Solution:

# DON'T apply!

# 1. Check git diff
git diff vps.tf

# 2. Check state
terraform state show hcloud_server.hetzner_vps

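# 3. Restore previous configuration (discards uncommitted edits to vps.tf)
git checkout -- vps.tf

# 4. Plan again
terraform plan
# Should now show: No changes

# If it still shows destroy:
# The resource block is missing from the committed configuration as well
# (check git log -p vps.tf), or you are in the wrong directory or workspace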

Emergency Procedures

VPS Accidentally Destroyed

Scenario: Ran terraform destroy by mistake or terraform apply destroyed VPS.

Recovery:

# 1. Check if VPS exists
hcloud server list

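# 2. If destroyed, check whether it is still in Terraform state
terraform state list | grep hetzner_vps

# 3. Either way, apply will recreate it as long as the resource block is
#    still in vps.tf (a server destroyed by Terraform is already gone from
#    state; one deleted outside Terraform is detected as missing during the
#    plan refresh)
terraform apply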

# VPS will be recreated with:
# - Same configuration (from vps.tf)
# - cloud-init will restore from backups
# - Services start automatically

# 4. Wait for cloud-init (~15 minutes)
ssh root@<NEW_VPS_IP>
tail -f /var/log/cloud-init-output.log

# 5. Verify services
docker ps

# 6. Update DNS if IP changed
terraform plan  # Should auto-update DNS
terraform apply

Recovery time: 30 minutes (infrastructure + data restore)

State File Corrupted

Symptom:

Error: state data in S3 does not have the expected content

Recovery:

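# 1. Save a copy of whatever the remote currently holds before changing anything
terraform state pull > current-state.json

# 2. Inspect it
cat current-state.json

# 3. Check for a known-good copy (a local terraform.tfstate.backup exists
#    only if state was ever stored locally, e.g. before the migration)
ls -la terraform.tfstate.backup

# 4. If it exists, restore it
terraform state push terraform.tfstate.backup
# push refuses a state with a different lineage or a lower serial; -force
# overrides that check, so use it only when you are certain the backup is right

# 5. Verify
terraform plan

# 6. If no usable copy exists, rebuild state via imports
# (This is tedious but possible)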

Lost Access to Storage Box

Symptom: Can't access remote state.

Temporary solution:

# 1. Switch to local state temporarily
mv terraform.tf terraform.tf.bak

# 2. Create local backend config
cat > terraform.tf << 'EOF'
terraform {
  required_version = ">= 1.6.0"
  # No backend = local state
}
EOF

# 3. Reinitialize, ignoring the saved (unreachable) backend configuration
terraform init -reconfigure

# 4. Manually sync with last known state
# (Check backup or git history)

# 5. When Storage Box access restored:
# Restore original terraform.tf
mv terraform.tf.bak terraform.tf

# Migrate state back
terraform init -migrate-state

Accidentally Committed Secrets

Scenario: Committed terraform.tfvars with secrets to Git.

Immediate action:

# 1. Rotate ALL secrets immediately (a pushed secret stays compromised even
#    after the history is rewritten)
# Generate new:
# - HCLOUD_TOKEN (Hetzner console)
# - CLOUDFLARE_API_TOKEN (Cloudflare dashboard)
# - STORAGEBOX credentials (Hetzner robot)

# 2. Update Infisical with new secrets
infisical secrets set HCLOUD_TOKEN="new-token" --env=prod

# 3. Update local terraform.tfvars

# 4. Remove the file from Git history
#    (Git's own documentation recommends git filter-repo over filter-branch)
git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch terraform.tfvars' \
  --prune-empty --tag-name-filter cat -- --all

# 5. Force push
git push origin --force --all

# 6. Never commit again
# Verify .gitignore has *.tfvars
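The final .gitignore check can be automated. This is a deliberately narrow sketch: it looks only for the literal pattern *.tfvars on a line of its own, so an equivalent but differently written rule would be reported as missing. The helper name gitignore_covers_tfvars is an assumption of this example.

```shell
# Return 0 if the given .gitignore file contains the exact line "*.tfvars".
gitignore_covers_tfvars() {
  grep -F -x -q -- '*.tfvars' "$1"
}
```

Usage: gitignore_covers_tfvars .gitignore || echo "add *.tfvars to .gitignore". Inside the repository, git check-ignore -v terraform.tfvars gives the authoritative answer, including which rule matched.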

Best Practices Checklist

Before every terraform apply:

[ ] Ran git pull (latest changes)
[ ] Ran terraform plan (reviewed output line-by-line)
[ ] Verified no unexpected destroys (- or -/+)
[ ] Checked no one else is running Terraform
[ ] Have backups (if destructive operation)

After every terraform apply:

[ ] Verified infrastructure change worked
[ ] Committed configuration to Git
[ ] Pushed to remote repository
[ ] Updated documentation if needed

Daily/Weekly:

[ ] Review Terraform state: terraform state list
[ ] Check for drift: terraform plan should show no changes
[ ] Verify backups exist: ls /mnt/storagebox/backups/
[ ] Test disaster recovery (monthly)

Summary

Common operations:

- ✅ Make change → validate → plan → apply → verify → commit
- ✅ Add resource → define in .tf → plan → apply
- ✅ Multi-device → git pull → terraform plan → make changes → git push
- ✅ Query infrastructure → terraform output, terraform state show

Emergency procedures:

- 🚨 VPS destroyed → terraform apply (30 min recovery)
- 🚨 State corrupted → restore from backup
- 🚨 State locked → terraform force-unlock
- 🚨 Secrets leaked → rotate immediately

Best practices:

- 📖 ALWAYS run terraform plan before apply
- 📖 READ every line of plan output
- 📖 COMMIT configuration to Git after applying
- 📖 VERIFY infrastructure after changes

Key files:

- vps.tf - VPS configuration
- dns.tf - DNS records
- ssh-keys.tf - SSH access control
- terraform.tfstate - NEVER edit manually

What's next:

- Practice making small changes
- Test multi-device workflow
- Document any additional procedures
- Schedule monthly disaster recovery tests

Infrastructure as Code = Infrastructure as Reliable as Code