
Pending Infrastructure Tasks

Status as of December 2025 - Post-migration work items and architectural improvements.


🎯 High Priority Tasks

1. Push Code to Forgejo ✅ COMPLETE

Status: ✅ Complete - All code pushed to Forgejo

Completed:

- ✅ Forgejo running on Production VPS (port 3001, SSH port 2222)
- ✅ Port 2222 opened in Hetzner Cloud Firewall
- ✅ git.kua.cl DNS record created
- ✅ Push-to-create enabled for users and organizations
- ✅ kavi-infra repository pushed to Forgejo
- ✅ terraform-infra repository pushed to Forgejo
- ✅ SSH key configured for Git access

Repository URLs:

- kavi-infra: ssh://git@116.203.109.220:2222/kuatecno/kavi-infra.git
- terraform-infra: ssh://git@116.203.109.220:2222/kuatecno/terraform-infra.git

Benefits Achieved: Self-hosted Git repository, GitOps workflow enabled, all infrastructure code centralized
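
Because SSH listens on a non-standard port, an ~/.ssh/config entry keeps clone commands short. The host alias below is just a convenience, not existing configuration:

```
Host forgejo
    HostName 116.203.109.220
    Port 2222
    User git
```

With this in place, `git clone forgejo:kuatecno/kavi-infra.git` resolves to the same repository as the full ssh:// URL.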


2. Complete Ansible Automation ✅ COMPLETE

Status: ✅ Complete - Full Ansible automation implemented

Completed:

- ✅ roles/common - ufw, fail2ban, common packages
- ✅ roles/docker - Docker + Docker Compose installation
- ✅ roles/storage - Rclone + systemd mount services
- ✅ roles/mac-essentials - Essential CLI tools for all Macs
- ✅ roles/mac-developer-tools - Infrastructure tools (terraform, ansible, docker, k8s)
- ✅ roles/server-hardening - UFW, fail2ban, SSH hardening, auto-updates

Playbooks Created:

- playbooks/mac-regular.yml - For regular Macs
- playbooks/mac-developer.yml - For developer Macs
- playbooks/server-production.yml - For production servers
- playbooks/server-development.yml - For development servers

Benefits Achieved:

- Mac onboarding: 30 minutes from zero to configured
- Server configuration: 10-15 minutes, fully automated
- Disaster recovery: server rebuild in 2 minutes (Terraform + Ansible)
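
For reference, a playbook such as playbooks/server-production.yml presumably just composes the roles listed above. A minimal sketch, where the inventory group name is an assumption:

```yaml
# playbooks/server-production.yml (sketch; "production" group name assumed)
- name: Configure production servers
  hosts: production
  become: true
  roles:
    - common
    - docker
    - storage
    - server-hardening
```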


3. Implement Scatter-Backup Strategy 💾

Status: Strategy defined, script not yet written

Current State:

- ✅ Backup targets identified: Eva (Kimsufi), OneDrive, Google Drive
- ✅ rclone installed on Production VPS
- ❌ rclone remotes not configured
- ❌ backup-scatter.sh script not written
- ❌ Cron job not scheduled

Action Required:

  1. Configure rclone remotes:

    ssh production
    
    # Configure Eva (Kimsufi) remote
    rclone config create eva sftp host=144.217.76.53 user=ubuntu key_file=~/.ssh/id_ed25519_macmini
    
    # Configure OneDrive remote
    rclone config create onedrive onedrive
    
    # Configure Google Drive remote
    rclone config create gdrive drive
    

  2. Create backup-scatter.sh script:
     - Dump PostgreSQL databases (Immich, n8n, Infisical)
     - Dump SQLite databases (Forgejo)
     - Encrypt dumps with rclone crypt
     - Upload to Eva, OneDrive, Google Drive in parallel
     - Log results to /var/log/backup-scatter.log

  3. Schedule nightly cron job:

    # Run at 2 AM daily (before backup verification at 4 AM)
    0 2 * * * /opt/scripts/backup-scatter.sh
    

Why Important: Eliminates single point of failure. Hetzner account suspension no longer means total data loss.

Data Redundancy: 4 backup locations (Storage Box + Eva + OneDrive + Google Drive)
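
Putting the steps above together, backup-scatter.sh might look roughly like this. It is a sketch only: the container names, the Forgejo database path, and the crypt remote names are assumptions, not verified configuration:

```shell
#!/usr/bin/env bash
# backup-scatter.sh (sketch) -- dump, encrypt via rclone crypt remotes,
# and scatter-upload to all backup targets in parallel.
set -euo pipefail

LOG=/var/log/backup-scatter.log
REMOTES=(eva-crypt onedrive-crypt gdrive-crypt)   # rclone crypt wrappers (assumed names)

# Pure helper: destination path for a given remote and date stamp
dest_path() {
  printf '%s:backups/%s' "$1" "$2"
}

run_backup() {
  local stamp workdir
  stamp=$(date +%F)
  workdir=$(mktemp -d)

  # 1. Dump the PostgreSQL databases (container names are hypothetical)
  for db in immich n8n infisical; do
    docker exec "${db}-postgres" pg_dumpall -U "$db" | gzip \
      > "$workdir/$db.sql.gz"
  done

  # 2. Snapshot Forgejo's SQLite database (path is an assumption)
  sqlite3 /opt/forgejo/data/forgejo.db ".backup '$workdir/forgejo.db'"

  # 3. Upload to every remote in parallel; the crypt remotes
  #    encrypt client-side, so the clouds only see ciphertext
  for remote in "${REMOTES[@]}"; do
    rclone copy "$workdir" "$(dest_path "$remote" "$stamp")" &
  done
  wait

  # 4. Log the result and clean up
  echo "$(date -Is) backup-scatter OK ($stamp)" >> "$LOG"
  rm -rf "$workdir"
}

# Guarded so the file can be sourced or syntax-checked without running
if [[ "${1:-}" == "--run" ]]; then
  run_backup
fi
```

Invoked from cron as `/opt/scripts/backup-scatter.sh --run`; the guard keeps an accidental `bash backup-scatter.sh` from touching anything.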


4. Configure Immich Database Replication 🗄️

Status: Strategy defined, not yet implemented

Current State:

- ✅ Immich database identified as "Tier 1 Critical" (metadata for 100,805+ photos)
- ✅ Development VPS available as replica target
- ❌ Primary-Replica replication not configured
- ❌ pgBackRest not installed
- ❌ WAL archiving not enabled

Action Required:

  1. Configure Production Postgres as Primary:

    ssh production
    
    # Edit /opt/immich/postgres/postgresql.conf
    wal_level = replica
    max_wal_senders = 3
    wal_keep_size = 1GB
    
    # Edit /opt/immich/postgres/pg_hba.conf
    # Allow Dev VPS to connect for replication
    host replication immich 46.224.125.1/32 scram-sha-256
    
    # Restart Postgres
    docker-compose restart immich-postgres
    

  2. Configure Development Postgres as Replica:

    ssh dev
    
    # Stop existing postgres if any
    docker-compose stop immich-replica-postgres
    
    # Create base backup from Production
    pg_basebackup -h 116.203.109.220 -U immich -D /opt/immich-replica/data -P -R
    
    # Start replica in standby mode
    docker-compose up -d immich-replica-postgres
    

  3. Install pgBackRest on Eva (Kimsufi):

    ssh eva
    
    # Install pgBackRest
    apt-get install pgbackrest
    
    # Configure repository on Eva
    # Point to Production Postgres via SSH
    # Enable continuous WAL archiving
    

  4. Test failover:
     - Simulate Production failure
     - Promote Dev replica to primary
     - Verify zero data loss
     - Document recovery procedures
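
For step 3, the repository side of pgBackRest is driven by /etc/pgbackrest/pgbackrest.conf. A minimal sketch, in which the stanza name, retention policy, and Postgres data path are assumptions:

```ini
# /etc/pgbackrest/pgbackrest.conf on Eva (sketch)
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2          ; keep two full backups

[immich]
pg1-host=116.203.109.220        ; Production VPS, reached over SSH
pg1-path=/opt/immich/postgres/data
```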

Why Important:

- Zero Data Loss: continuous replication means no data loss even if the Production VPS is destroyed
- Fast Recovery: promote the replica to primary in 5 minutes vs a 15-minute database restore
- Point-in-Time Recovery: pgBackRest enables recovery to any point in time (not just daily backups)
- Easy Upgrades: major Postgres version upgrades become trivial (restore to the new version from backup)

- Current RTO: 15 minutes (restore from daily backup)
- Future RTO: 5 minutes (promote replica)
- Current RPO: 24 hours (daily backups)
- Future RPO: 0 seconds (continuous replication)


📋 Medium Priority Tasks

5. Migrate from Syncthing to GitOps Workflow

Status: Syncthing deployed, GitOps workflow not yet active

Current State:

- ✅ Syncthing running on Mac and Production
- ✅ ~/kavi-infra syncing between laptop and server
- ⚠️ Risk: accidental local changes instantly break the server
- ❌ GitOps workflow not implemented

Action Required:

  1. Stop syncing code with Syncthing:
     - Remove ~/kavi-infra from Syncthing sync folders
     - Relegate Syncthing to personal files only (Obsidian notes, etc.)

  2. Implement GitOps workflow:

    # Development cycle:
    # 1. Make changes locally
    vim ~/kavi-infra/docker-compose.yml
    
    # 2. Commit and push to Forgejo
    git add .
    git commit -m "Update docker-compose configuration"
    git push forgejo main
    
    # 3. Pull on server and deploy
    ssh production "cd ~/kavi-infra && git pull && docker-compose up -d --build"
    

  3. Future: Automate deployment with Forgejo Actions (optional):
     - Set up a webhook on Forgejo
     - Trigger git pull && docker-compose up -d on every push
     - Full CI/CD pipeline
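
Such a pipeline could eventually live in a Forgejo Actions workflow. A rough sketch, where the runner label and the runner's SSH access to Production are assumptions (a runner must be registered first):

```yaml
# .forgejo/workflows/deploy.yml (sketch)
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: docker            # runner label is an assumption
    steps:
      - name: Pull and redeploy on Production
        # assumes the runner can SSH to the production host
        run: |
          ssh production "cd ~/kavi-infra && git pull && docker-compose up -d --build"
```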

Why Important:

- Safety: code changes are deliberate (git commit), not accidental (file save)
- Rollback: easy rollback with git revert
- Audit Trail: all changes tracked in git history
- Collaboration: multiple people can work on the infrastructure

Migration Path:

1. Stop Syncthing sync for code (keep it for personal files)
2. Use the manual Git workflow for 1 month
3. Evaluate the need for automated deployment


🔄 Architectural Decisions Log

Decision 1: GitOps over Syncthing for Code

Problem: Syncthing syncs every file save instantly, accidental changes break production

Evaluated Options:

1. ❌ Keep Syncthing - Too risky
2. ✅ GitOps with Forgejo - Safe, auditable, easy to roll back
3. ❌ Manual SCP - No version control

Decision: GitOps with Forgejo + manual pull workflow

Rationale:

- Git provides version control and an audit trail
- Forgejo is self-hosted (no GitHub dependency)
- Manual pull workflow gives control over when changes deploy
- Can automate later if needed


Decision 2: Ansible for Reproducibility

Problem: "PromptOps" (configuring via AI) is not reproducible

Evaluated Options:

1. ❌ Manual configuration - Not reproducible
2. ❌ Shell scripts - Hard to maintain
3. ✅ Ansible - Industry standard, declarative, idempotent

Decision: Ansible for all server configuration

Rationale:

- Declarative syntax (describe desired state, not steps)
- Idempotent (safe to run multiple times)
- Roles are reusable across servers
- Disaster recovery becomes one command: ansible-playbook playbook.yml

Current Progress: Complete - all roles and playbooks implemented (see Task 2)


Decision 3: Scatter-Backup to Multiple Clouds

Problem: Single backup location (Storage Box) = single point of failure

Evaluated Options:

1. ❌ Keep only Storage Box - Hetzner account suspension = total loss
2. ✅ Scatter to Eva + OneDrive + Google Drive - Geographic + provider diversity
3. ❌ Paid backup service (Backblaze B2) - Unnecessary cost

Decision: Scatter-backup to 3 additional locations

Rationale:

- Eva (Kimsufi) - Different provider, different datacenter
- OneDrive - Microsoft cloud (2TB existing subscription)
- Google Drive - Google cloud (2TB existing subscription)
- rclone crypt for encryption (zero-knowledge backups)
- No additional cost (uses existing resources)

Backup Locations After Implementation:

1. Hetzner Storage Box (5TB) - Primary
2. Eva/Kimsufi (Canada) - Geographic diversity
3. OneDrive (2TB) - Microsoft cloud
4. Google Drive (2TB) - Google cloud

Redundancy Level: 4-way redundancy for critical data
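
The crypt layer corresponds to an rclone.conf entry per target, roughly like the fragment below. The remote names and the backups path are assumptions, and the password must be generated with rclone obscure rather than stored in plain text:

```ini
# ~/.config/rclone/rclone.conf fragment (sketch)
[eva-crypt]
type = crypt
remote = eva:backups
password = <output of rclone obscure>
```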


Decision 4: Primary-Replica for Immich DB Only

Problem: Database major version upgrades are risky, downtime is unacceptable

Evaluated Options:

1. ❌ Single "Super Postgres" for all apps - Couples all databases together
2. ❌ Replication for all databases - Over-engineering (n8n, Infisical don't need it)
3. ✅ Replication only for Immich - Focus on critical data (100K+ photos)

Decision: Primary-Replica replication for Immich database only

Rationale:

- Immich is "Tier 1 Critical" (100,805 photos, irreplaceable metadata)
- n8n and Infisical can tolerate a 24-hour RPO (daily backups are sufficient)
- Separate databases maintain app isolation
- pgBackRest on Eva enables point-in-time recovery

Implementation Priority: High (pending after Ansible + Scatter-backup)


📊 Current Infrastructure State

Deployed Services ✅

| Service | Status | Port | Access | Notes |
|---------|--------|------|--------|-------|
| Infisical | ✅ Running | 8081 | secrets.kua.cl | Secrets manager; database migration fixed, secure password rotated |
| Forgejo | ✅ Running | 3001 | Internal | Self-hosted Git, SQLite backend, persistent storage |
| Syncthing | ✅ Running | 8384 | Internal | Mac ↔ Production sync; will migrate to GitOps |
| Immich | ✅ Running | 2283 | photos.kua.cl | 100,805+ photos; database needs replication |
| n8n | ✅ Running | 5678 | n8n.kua.cl | Workflow automation |
| Kuanary | ✅ Running | 5001 | media.kua.cl | Media CDN (renamed from kavicloud) |
| open-webui | ✅ Running | 3000 | dev.kua.cl | AI chat interface (Dev VPS) |

Pending Deployments ⏳

| Service | Purpose | Priority | Estimated Time |
|---------|---------|----------|----------------|
| backup-scatter.sh | Multi-cloud backups | HIGH | 1 hour |
| Immich DB replica | High availability | HIGH | 3 hours |
| pgBackRest | Point-in-time recovery | MEDIUM | 2 hours |
| Forgejo Actions | CI/CD automation | LOW | 4 hours |

🎯 Next Session Action Items

When resuming work, prioritize in this order:

  1. Push code to Forgejo (5 minutes) ✅ Complete
     - Set up remote, push existing commits
     - Verify web UI shows repository

  2. Complete Ansible docker role (1 hour) ✅ Complete
     - Install Docker + Docker Compose
     - Test on Development VPS first

  3. Complete Ansible storage role (1 hour) ✅ Complete
     - Mount Storage Box via rclone
     - Create systemd service for persistence

  4. Write backup-scatter.sh (1 hour)
     - Configure rclone remotes (Eva, OneDrive, Google Drive)
     - Implement parallel upload
     - Schedule cron job at 2 AM

  5. Configure Immich DB replication (3 hours)
     - Enable WAL on Production
     - Set up replica on Development
     - Test failover scenario

Total Estimated Time: ~6-7 hours for all high-priority tasks (~4 hours remaining)



Last updated: December 2025 - Post-migration pending tasks