Terraform Operations Runbook¶
Day-to-day Terraform operations - Practical procedures for managing infrastructure after migration.
Overview¶
This runbook covers common Terraform operations you'll perform regularly:
- Making infrastructure changes
- Adding new resources
- Updating existing resources
- Multi-device workflows
- Troubleshooting common issues
- Emergency procedures
Audience: For use after completing Terraform migration (Weeks 1-3).
Table of Contents¶
- Daily Operations
- Adding Resources
- Updating Resources
- Multi-Device Workflow
- Common Tasks
- Troubleshooting
- Emergency Procedures
Daily Operations¶
Checking Infrastructure State¶
Before making changes, always check current state:
cd ~/Coding/terraform-infra
# Pull latest changes from Git
git pull
# See what Terraform is managing
terraform state list
# Check if actual infrastructure matches configuration
terraform plan
# Expected: "No changes. Your infrastructure matches the configuration."
If plan shows unexpected changes:
- Someone else made manual changes (check with team)
- Configuration drift (actual infrastructure differs from state)
- Need to investigate before proceeding
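The drift check above can also be scripted. The following is a minimal sketch, assuming it runs from the repository directory. The function name `check_drift` is hypothetical; `-detailed-exitcode` is a standard `terraform plan` flag (exit code 0 = no changes, 1 = error, 2 = changes present).

```shell
# check_drift: run a plan quietly and translate terraform's exit code
# into a short status word.
check_drift() {
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1
  case $? in
    0) echo "in-sync" ;;                    # configuration matches infrastructure
    2) echo "drift-or-pending-changes" ;;   # plan contains changes
    *) echo "plan-error"; return 1 ;;       # plan itself failed
  esac
}
```

Calling `check_drift` before editing anything gives a quick yes/no answer; a result other than `in-sync` means the full `terraform plan` output should be read before proceeding.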
Making a Simple Change¶
Example: Update VPS server type from CPX31 to CPX42.
Step 1: Edit configuration. In vps.tf, change server_type on hcloud_server.hetzner_vps from "cpx31" to "cpx42".
Step 2: Validate and format
# Validate syntax
terraform validate
# Expected: Success! The configuration is valid.
# Format code
terraform fmt
Step 3: Review changes
# Generate plan
terraform plan
# Expected output:
# ~ hcloud_server.hetzner_vps will be updated in-place
# ~ server_type = "cpx31" -> "cpx42"
#
# Plan: 0 to add, 1 to change, 0 to destroy.
Critical: Read EVERY line. Ensure:
- ✅ Only expected changes shown
- ❌ No resources being destroyed unintentionally
- ❌ No "-/+ must be replaced" (causes downtime)
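Reading the plan line by line remains the primary safeguard, but a mechanical check can back it up. The sketch below assumes the plain-text wording that current Terraform releases print for destructive actions ("will be destroyed", "must be replaced"); the function name `plan_is_safe` is hypothetical.

```shell
# plan_is_safe: read plain-text plan output on stdin and fail if any
# resource is marked for destruction or replacement.
plan_is_safe() {
  if grep -Eq 'will be destroyed|must be replaced'; then
    echo "UNSAFE: plan destroys or replaces a resource" >&2
    return 1
  fi
  return 0
}
```

One way to use it is `terraform plan -no-color -out=tfplan | plan_is_safe && terraform apply tfplan`, which applies exactly the plan that was checked. It is a supplement to reviewing the plan, not a replacement.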
Step 4: Apply changes
# Apply (will show plan again and prompt for confirmation)
terraform apply
# Type: yes
# Wait for completion
Step 5: Verify
# Check VPS status
ssh kavi@100.80.53.55
# Verify new server type
nproc
# Should show 8 cores (CPX42)
# Check services still running
docker ps
Step 6: Commit to Git
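The commands for this step are not shown in the source. A minimal sketch, assuming the change was made in vps.tf and following the add / commit / push pattern used elsewhere in this runbook; the helper name `commit_change` and the commit message are hypothetical.

```shell
# commit_change FILE MESSAGE: stage one file, commit it, and push.
# Stops at the first command that fails.
commit_change() {
  git add "$1" &&
  git commit -m "$2" &&
  git push
}
```

For this example the call would look like `commit_change vps.tf "feat(vps): upgrade server type to cpx42"`.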
Adding Resources¶
Add New DNS Record¶
Task: Add api.kua.cl DNS record for new service.
Add to dns.tf:
# API endpoint
resource "cloudflare_record" "api" {
zone_id = var.cloudflare_zone_id
name = "api"
value = hcloud_server.hetzner_vps.ipv4_address
type = "A"
proxied = false
ttl = 3600
}
Apply:
terraform plan
# Expected: Plan: 1 to add, 0 to change, 0 to destroy
terraform apply
# Type: yes
# Verify DNS
dig +short api.kua.cl
# Should show VPS IP
Commit: stage dns.tf, commit with a descriptive message, and push (git add dns.tf, git commit, git push).
Add New SSH Key (New Device)¶
Task: Add SSH key for new PC.
Step 1: Get public key from new device
# On new PC, generate key
ssh-keygen -t ed25519 -C "kavi@pc" -f ~/.ssh/id_ed25519_pc
# Copy public key content
cat ~/.ssh/id_ed25519_pc.pub
Step 2: Add to terraform-infra repository
# On MacBook (or any device with repository)
cd ~/Coding/terraform-infra
# Add public key to keys directory
cat > keys/id_ed25519_pc.pub << 'EOF'
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5... kavi@pc
EOF
Step 3: Add to Terraform configuration
Add to ssh-keys.tf:
resource "hcloud_ssh_key" "pc" {
name = "kavi-pc"
public_key = file("${path.module}/keys/id_ed25519_pc.pub")
}
Update VPS to use new key:
resource "hcloud_server" "hetzner_vps" {
# ...
ssh_keys = [
hcloud_ssh_key.macbook.id,
hcloud_ssh_key.ipad.id,
hcloud_ssh_key.pc.id # Added
]
}
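Step 4: Apply. Run terraform plan and review it before running terraform apply. Note that the hcloud provider injects ssh_keys only when a server is created, so changing the list on an existing server is typically planned as a replacement (-/+). If the plan shows hcloud_server.hetzner_vps being replaced, do not apply; add the public key to ~/.ssh/authorized_keys on the VPS instead.
Step 5: Test from new PC by connecting with the new key, for example: ssh -i ~/.ssh/id_ed25519_pc kavi@100.80.53.55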
Step 6: Commit
git add ssh-keys.tf keys/id_ed25519_pc.pub
git commit -m "feat(ssh): add PC SSH key for new device"
git push
Updating Resources¶
Change DNS TTL¶
Task: Reduce TTL before infrastructure change.
# Edit dns.tf
vim dns.tf
# Change TTL:
resource "cloudflare_record" "root" {
# ...
ttl = 300 # Was: 3600 (reduced to 5 minutes)
}
# Apply
terraform plan
# Expected: ~ cloudflare_record.root
# ttl: 3600 -> 300
terraform apply
# Commit
git add dns.tf
git commit -m "fix(dns): reduce TTL to 5 minutes for upcoming migration"
git push
Update Firewall Rules¶
Task: Allow new port for service.
# Edit firewall.tf
vim firewall.tf
# Add new rule:
resource "hcloud_firewall" "hetzner_vps" {
# ... existing rules ...
rule {
direction = "in"
protocol = "tcp"
port = "8080"
source_ips = ["0.0.0.0/0", "::/0"]
}
}
# Apply
terraform plan
terraform apply
# Test
curl -I http://<VPS_IP>:8080
Multi-Device Workflow¶
Working from MacBook¶
# On MacBook
cd ~/Coding/terraform-infra
# Pull latest
git pull
# Make changes
vim vps.tf
# Plan
terraform plan
# Apply
terraform apply
# Commit and push
git add vps.tf
git commit -m "feat(vps): add label for monitoring"
git push
Switching to iPad¶
# On iPad (in a-Shell or Blink + VPS)
cd ~/Coding/terraform-infra
# Pull changes made on MacBook
git pull
# Verify state synchronized
terraform plan
# Expected: No changes (MacBook's apply already updated state)
# Make new changes if needed
vim dns.tf
# Apply
terraform apply
# Push
git add dns.tf
git commit -m "feat(dns): add monitoring.kua.cl"
git push
State Synchronization¶
How it works:
MacBook                  Storage Box S3                 iPad
   |                           |                          |
   |----- terraform apply ---->|                          |
   |      (updates state)      |                          |
   |                           |                          |
   |        git push --------->|<------- git pull --------|
   |                           |                          |
   |                           |<---- terraform plan -----|
   |                           |       (reads state)      |
   |                           |                          |
Key: Remote state on Storage Box keeps all devices synchronized.
Avoiding Conflicts¶
Problem: Two devices run terraform apply simultaneously.
Solution: State locking
# Device 1: terraform apply
# Acquires lock on state file
# Device 2: terraform apply
# Error: state locked by Device 1
# Device 2's run fails (by default Terraform does not wait for the lock)
# Device 1: apply completes
# Releases lock
# Device 2: re-run terraform apply, which can now acquire the lock
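By default Terraform gives up as soon as it finds the state locked. The `-lock-timeout` flag makes it keep retrying for a given duration instead. A minimal sketch; the wrapper name `apply_with_lock_wait` and the 5-minute default are assumptions, not part of the original setup.

```shell
# apply_with_lock_wait [TIMEOUT]: retry acquiring the state lock for up to
# TIMEOUT (default 5m) instead of failing on the first attempt.
apply_with_lock_wait() {
  terraform apply -lock-timeout="${1:-5m}"
}
```

This helps when the other device's run is expected to finish shortly; it does nothing for a lock left behind by a crashed run.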
If lock stuck:
# Check if other device still running
# If confirmed no other process, force unlock:
terraform force-unlock <LOCK_ID>
# Use lock ID from error message
Common Tasks¶
Query Infrastructure¶
Get VPS IP:
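The command for this item is not shown in the source. One way to get the address, sketched here using only the resource name that appears elsewhere in this runbook, is to read it from `terraform state show`; the function name `get_vps_ip` is hypothetical.

```shell
# get_vps_ip: print the ipv4_address attribute of the VPS as recorded in state.
get_vps_ip() {
  terraform state show hcloud_server.hetzner_vps |
    awk -F'"' '/ipv4_address[[:space:]]*=/ { print $2; exit }'
}
```

If the configuration defines an output for the address, `terraform output -raw <name>` is the simpler route; the output name is not given in this runbook.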
List all resources:
terraform state list
# Example output:
# hcloud_ssh_key.macbook
# hcloud_ssh_key.ipad
# hcloud_server.hetzner_vps
# cloudflare_record.root
# cloudflare_record.secrets
Show specific resource:
terraform state show hcloud_server.hetzner_vps
# Shows all attributes:
# name = "hetzner-vps"
# server_type = "cpx42"
# ipv4_address = "46.224.146.107"
# etc.
Refresh State¶
When to use: Sync Terraform state with actual infrastructure.
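# Update state without changing infrastructure (deprecated command)
terraform refresh
# Preferred:
terraform apply -refresh-only
# Reviews changes, then updates state
Use case: Someone changed a managed resource manually via the console, and you want the state to reflect it. Resources created outside Terraform are not picked up this way; use terraform import for those.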
Remove Resource from Terraform¶
Scenario: Want to manage resource manually, remove from Terraform.
# Remove from state (doesn't delete resource)
terraform state rm hcloud_ssh_key.old_key
# Verify removed
terraform state list | grep old_key
# (no output)
# Remove from configuration
vim ssh-keys.tf
# Delete resource block
# Terraform no longer manages this resource
Import Existing Resource¶
Scenario: Resource created manually, want Terraform to manage it.
Step 1: Add to configuration
vim firewall.tf
# Add resource definition:
resource "hcloud_firewall" "existing" {
name = "existing-firewall"
# ... (configure to match actual)
}
Step 2: Import
# Get resource ID from Hetzner console
# Example: 123456
terraform import hcloud_firewall.existing 123456
# Verify
terraform plan
# Should show: No changes (if config matches actual)
Troubleshooting¶
"No changes" but infrastructure differs¶
Symptom: terraform plan shows "No changes" but actual infrastructure is different.
Cause: State is out of sync, or the resource or attribute that changed is not managed by Terraform.
Solution:
# Refresh state
terraform apply -refresh-only
# Now plan should show differences
terraform plan
# If still shows no changes:
# 1. Check if you're in correct directory
pwd
# Should be: ~/Coding/terraform-infra
# 2. Check if correct backend
grep "bucket" terraform.tf
# Should be: terraform-state on Storage Box
# 3. Verify state
terraform state pull | jq '.resources[] | .name'
"Resource already exists"¶
Symptom: terraform apply fails with a provider error saying the resource already exists.
Cause: Trying to create resource that exists.
Solution: Import instead
# Find existing resource ID
hcloud server list
# Import into Terraform
terraform import hcloud_server.hetzner_vps <SERVER_ID>
# Adjust configuration to match
vim vps.tf
# Verify
terraform plan
# Should show: No changes
State Locked¶
Symptom:
Error: Error acquiring the state lock
Lock Info:
ID: abc-123-def-456
Path: terraform-state/terraform.tfstate
Operation: OperationTypeApply
Who: kavi@macbook
Created: 2025-01-15 14:30:00
Cause: Another terraform process running, or previous run crashed.
Solution:
# 1. Check if process still running on other device
# Check your other devices and open terminals
# 2. If confirmed no other process:
terraform force-unlock abc-123-def-456
# Use exact lock ID from error
# 3. Verify unlocked
terraform plan
# Should work now
Plan Shows Unexpected Destroy¶
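Symptom: terraform plan marks a resource for destruction (-) or replacement (-/+) that you did not intend to remove.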
Cause: Configuration drift or accidental change.
Solution:
# DON'T apply!
# 1. Check git diff
git diff vps.tf
# 2. Check state
terraform state show hcloud_server.hetzner_vps
# 3. Restore previous configuration
git checkout vps.tf
# 4. Plan again
terraform plan
# Should now show: No changes
# If still shows destroy:
# Resource was manually deleted, need to recreate or import
Emergency Procedures¶
VPS Accidentally Destroyed¶
Scenario: Ran terraform destroy by mistake or terraform apply destroyed VPS.
Recovery:
# 1. Check if VPS exists
hcloud server list
# 2. If destroyed, check whether it is still in Terraform state
terraform state list | grep hetzner_vps
# 3. Apply will recreate it either way, as long as it is still defined in vps.tf
terraform apply
# VPS will be recreated with:
# - Same configuration (from vps.tf)
# - cloud-init will restore from backups
# - Services start automatically
# 4. Wait for cloud-init (~15 minutes)
ssh root@<NEW_VPS_IP>
tail -f /var/log/cloud-init-output.log
# 5. Verify services
docker ps
# 6. Update DNS if IP changed
terraform plan # Should auto-update DNS
terraform apply
Recovery time: 30 minutes (infrastructure + data restore)
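Instead of watching the log in step 4, the wait can be made blocking. A minimal sketch, assuming root SSH access as in the steps above and a cloud-init version that supports `status --wait`; the function name `wait_for_cloud_init` is hypothetical.

```shell
# wait_for_cloud_init HOST: block until cloud-init on HOST reports that it
# has finished; the exit status reflects what cloud-init reports.
wait_for_cloud_init() {
  ssh "root@$1" cloud-init status --wait
}
```

A call such as `wait_for_cloud_init <NEW_VPS_IP> && ssh root@<NEW_VPS_IP> docker ps` runs the service check only once provisioning has completed.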
State File Corrupted¶
Symptom: Terraform commands fail with errors reading or parsing the state file.
Recovery:
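# 1. Check if a local backup exists (only present if state is or was stored locally)
ls -la terraform.tfstate.backup
# 2. If exists, restore (a push with an older serial or different lineage is rejected unless -force is given)
terraform state push terraform.tfstate.backup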
# 3. Verify
terraform plan
# 4. If no backup, pull from remote
terraform state pull > current-state.json
# 5. Inspect
cat current-state.json
# 6. If corrupted, rebuild state via imports
# (This is tedious but possible)
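Recovery is much easier when a known-good copy of the state already exists. The sketch below takes a timestamped snapshot with `terraform state pull`; the function name `backup_state` and the file naming scheme are assumptions. State files can contain secrets, so the snapshots should stay out of Git.

```shell
# backup_state [DIR]: save the current remote state to a timestamped file
# in DIR (default: current directory) and print the file's path.
backup_state() {
  dir=${1:-.}
  file="$dir/state-backup-$(date +%Y%m%d-%H%M%S).json"
  # Remove the file again if the pull fails or produces nothing.
  terraform state pull > "$file" || { rm -f "$file"; return 1; }
  [ -s "$file" ] || { rm -f "$file"; return 1; }
  echo "$file"
}
```

Running this before risky operations (state rm, import, force-unlock) leaves a file that can later be restored with `terraform state push`.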
Lost Access to Storage Box¶
Symptom: Can't access remote state.
Temporary solution:
# 1. Switch to local state temporarily
mv terraform.tf terraform.tf.bak
# 2. Create local backend config
cat > terraform.tf << 'EOF'
terraform {
required_version = ">= 1.6.0"
# No backend = local state
}
EOF
# 3. Reinitialize (-reconfigure skips migrating from the unreachable backend)
terraform init -reconfigure
# 4. Manually sync with last known state
# (Check backup or git history)
# 5. When Storage Box access restored:
# Restore original terraform.tf
mv terraform.tf.bak terraform.tf
# Migrate state back
terraform init -migrate-state
Accidentally Committed Secrets¶
Scenario: Committed terraform.tfvars with secrets to Git.
Immediate action:
# 1. Remove from Git history
git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch terraform.tfvars' \
--prune-empty --tag-name-filter cat -- --all
# 2. Force push
git push origin --force --all
# 3. Rotate ALL secrets immediately
# Generate new:
# - HCLOUD_TOKEN (Hetzner console)
# - CLOUDFLARE_API_TOKEN (Cloudflare dashboard)
# - STORAGEBOX credentials (Hetzner robot)
# 4. Update Infisical with new secrets
infisical secrets set HCLOUD_TOKEN="new-token" --env=prod
# 5. Update local terraform.tfvars
# 6. Never commit again
# Verify .gitignore has *.tfvars
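The final verification can be done mechanically. A minimal sketch, assuming the ignore rule is written literally as `*.tfvars` on its own line; the function name `tfvars_ignored` is hypothetical.

```shell
# tfvars_ignored [FILE]: succeed if FILE (default: .gitignore) contains
# the exact line "*.tfvars".
tfvars_ignored() {
  grep -qxF '*.tfvars' "${1:-.gitignore}"
}
```

For example, `tfvars_ignored || echo '*.tfvars' >> .gitignore` adds the rule only when it is missing. Inside the repository, `git check-ignore -q terraform.tfvars` is an alternative that also honors patterns written differently.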
Best Practices Checklist¶
Before every terraform apply:
- [ ] Ran git pull (latest changes)
- [ ] Ran terraform plan (reviewed output line-by-line)
- [ ] Verified no unexpected destroys (- or -/+)
- [ ] Checked no one else is running Terraform
- [ ] Have backups (if destructive operation)
After every terraform apply:
- [ ] Verified infrastructure change worked
- [ ] Committed configuration to Git
- [ ] Pushed to remote repository
- [ ] Updated documentation if needed
Daily/Weekly:
- [ ] Review Terraform state: terraform state list
- [ ] Check for drift: terraform plan should show no changes
- [ ] Verify backups exist: ls /mnt/storagebox/backups/
- [ ] Test disaster recovery (monthly)
Summary¶
Common operations:
- ✅ Make change → validate → plan → apply → verify → commit
- ✅ Add resource → define in .tf → plan → apply
- ✅ Multi-device → git pull → terraform plan → make changes → git push
- ✅ Query infrastructure → terraform output, terraform state show
Emergency procedures:
- 🚨 VPS destroyed → terraform apply (30 min recovery)
- 🚨 State corrupted → restore from backup
- 🚨 State locked → terraform force-unlock
- 🚨 Secrets leaked → rotate immediately
Best practices:
- 📖 ALWAYS run terraform plan before apply
- 📖 READ every line of plan output
- 📖 COMMIT configuration to Git after applying
- 📖 VERIFY infrastructure after changes
Key files:
- vps.tf - VPS configuration
- dns.tf - DNS records
- ssh-keys.tf - SSH access control
- terraform.tfstate - NEVER edit manually
What's next:
- Practice making small changes
- Test multi-device workflow
- Document any additional procedures
- Schedule monthly disaster recovery tests
Infrastructure as Code = Infrastructure as Reliable as Code