# Pending Infrastructure Tasks
Status as of December 2025 - Post-migration work items and architectural improvements.
## 🎯 High Priority Tasks

### 1. Push Code to Forgejo ✅ COMPLETE
Status: ✅ Complete - All code pushed to Forgejo
Completed:

- ✅ Forgejo running on Production VPS (port 3001, SSH port 2222)
- ✅ Port 2222 opened in Hetzner Cloud Firewall
- ✅ git.kua.cl DNS record created
- ✅ Push-to-create enabled for users and organizations
- ✅ kavi-infra repository pushed to Forgejo
- ✅ terraform-infra repository pushed to Forgejo
- ✅ SSH key configured for Git access
Repository URLs:
- kavi-infra: ssh://git@116.203.109.220:2222/kuatecno/kavi-infra.git
- terraform-infra: ssh://git@116.203.109.220:2222/kuatecno/terraform-infra.git
Benefits Achieved: Self-hosted Git repository, GitOps workflow enabled, all infrastructure code centralized
### 2. Complete Ansible Automation ✅ COMPLETE
Status: ✅ Complete - Full Ansible automation implemented
Completed:
- ✅ roles/common - ufw, fail2ban, common packages
- ✅ roles/docker - Docker + Docker Compose installation
- ✅ roles/storage - Rclone + systemd mount services
- ✅ roles/mac-essentials - Essential CLI tools for all Macs
- ✅ roles/mac-developer-tools - Infrastructure tools (terraform, ansible, docker, k8s)
- ✅ roles/server-hardening - UFW, fail2ban, SSH hardening, auto-updates
Playbooks Created:
- playbooks/mac-regular.yml - For regular Macs
- playbooks/mac-developer.yml - For developer Macs
- playbooks/server-production.yml - For production servers
- playbooks/server-development.yml - For development servers
Benefits Achieved:

- Mac onboarding: 30 minutes from zero to configured
- Server configuration: 10-15 minutes automated
- Disaster recovery: server rebuild in 2 minutes (Terraform + Ansible)
### 3. Implement Scatter-Backup Strategy 💾
Status: Strategy defined, script not yet written
Current State:
- ✅ Backup targets identified: Eva (Kimsufi), OneDrive, Google Drive
- ✅ rclone installed on Production VPS
- ❌ rclone remotes not configured
- ❌ backup-scatter.sh script not written
- ❌ Cron job not scheduled
Action Required:

1. Configure rclone remotes (Eva, OneDrive, Google Drive)
2. Create `backup-scatter.sh` script:
    - Dump PostgreSQL databases (Immich, n8n, Infisical)
    - Dump SQLite databases (Forgejo)
    - Encrypt dumps with rclone crypt
    - Upload to Eva, OneDrive, Google Drive in parallel
    - Log results to `/var/log/backup-scatter.log`
3. Schedule nightly cron job
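To make "configure rclone remotes" concrete, the commands might look like the sketch below. The remote names, Eva's SFTP details, and the crypt layering are assumptions to adjust to the real accounts; the `run` wrapper defaults to a dry run, so nothing is created until `DRY_RUN=0`.

```bash
#!/usr/bin/env bash
# Sketch of the rclone remote setup for the three scatter targets.
# Hostnames and remote names below are placeholders/assumptions.
set -euo pipefail

run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "DRY: $*"; else eval "$*"; fi; }

# Eva (Kimsufi) over SFTP, with a crypt remote layered on top.
# export CRYPT_PASS before a real run.
run "rclone config create eva sftp host=eva.example.net user=backup key_file=~/.ssh/id_ed25519"
run "rclone config create eva-crypt crypt remote=eva:backups password=\$(rclone obscure \"\$CRYPT_PASS\")"

# OneDrive and Google Drive need an interactive OAuth step first
# ('rclone config' opens a browser); then layer crypt remotes the same way.
run "rclone config create onedrive-crypt crypt remote=onedrive:backups password=\$(rclone obscure \"\$CRYPT_PASS\")"
run "rclone config create gdrive-crypt crypt remote=gdrive:backups password=\$(rclone obscure \"\$CRYPT_PASS\")"

# Sanity check: the encrypted remotes should list without error
run "rclone lsd eva-crypt:"
```

Because the crypt remotes hold the encryption key locally, the cloud providers only ever see ciphertext (the "zero-knowledge backups" property mentioned below in Decision 3).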
Why Important: Eliminates single point of failure. Hetzner account suspension no longer means total data loss.
Data Redundancy: 4 backup locations (Storage Box + Eva + OneDrive + Google Drive)
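Putting the steps above together, `backup-scatter.sh` might look like the following sketch. Container names, remote names, and the Forgejo database path are assumptions; `DRY_RUN` defaults to 1 so the plan can be reviewed before anything executes.

```bash
#!/usr/bin/env bash
# backup-scatter.sh -- sketch of the nightly scatter-backup job.
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-/var/backups/scatter}"
REMOTES=(eva-crypt onedrive-crypt gdrive-crypt)  # assumed rclone crypt remotes
STAMP=$(date +%F)

run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "DRY: $*"; else eval "$*"; fi; }

run "mkdir -p $BACKUP_DIR"

# 1. Dump PostgreSQL databases (container names are assumptions)
for db in immich n8n infisical; do
  run "docker exec ${db}-postgres pg_dumpall -U postgres | gzip > $BACKUP_DIR/${db}-${STAMP}.sql.gz"
done

# 2. Copy Forgejo's SQLite database (path inside the container is an assumption)
run "docker cp forgejo:/data/gitea/gitea.db $BACKUP_DIR/forgejo-${STAMP}.db"

# 3. Upload to all three remotes in parallel; the crypt remotes encrypt the dumps
for remote in "${REMOTES[@]}"; do
  run "rclone copy $BACKUP_DIR ${remote}:${STAMP}" &
done
wait

echo "$(date -Is) backup-scatter finished"  # cron appends stdout to /var/log/backup-scatter.log
```

A nightly cron entry might then be: `0 2 * * * DRY_RUN=0 /usr/local/bin/backup-scatter.sh >> /var/log/backup-scatter.log 2>&1`.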
### 4. Configure Immich Database Replication 🗄️
Status: Strategy defined, not yet implemented
Current State:

- ✅ Immich database identified as "Tier 1 Critical" (100,805+ photos metadata)
- ✅ Development VPS available as replica target
- ❌ Primary-Replica replication not configured
- ❌ pgBackRest not installed
- ❌ WAL archiving not enabled
Action Required:

1. Configure Production Postgres as primary:

    ```
    ssh production

    # Edit /opt/immich/postgres/postgresql.conf
    wal_level = replica
    max_wal_senders = 3
    wal_keep_size = 1GB

    # Edit /opt/immich/postgres/pg_hba.conf
    # Allow Dev VPS to connect for replication
    host replication immich 46.224.125.1/32 scram-sha-256

    # Restart Postgres
    docker-compose restart immich-postgres
    ```

2. Configure Development Postgres as replica
3. Install pgBackRest on Eva (Kimsufi)
4. Test failover:
    - Simulate Production failure
    - Promote Dev replica to primary
    - Verify zero data loss
    - Document recovery procedures
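The replica side (step 2) could be sketched as below. The data directory, compose file path, and port are assumptions; the replication user `immich` matches the pg_hba.conf rule on the primary, and `DRY_RUN` defaults to 1 so the commands only print until you set `DRY_RUN=0`.

```bash
#!/usr/bin/env bash
# Sketch of turning the Development Postgres into a streaming replica.
set -euo pipefail

PRIMARY=116.203.109.220                  # Production VPS (from this document)
PGDATA="${PGDATA:-/opt/immich/postgres/data}"  # assumed data directory

run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "DRY: $*"; else eval "$*"; fi; }

# 1. Stop the Dev Postgres and clear its stale data directory
run "docker-compose -f /opt/immich/docker-compose.yml stop immich-postgres"
run "rm -rf ${PGDATA:?}/*"

# 2. Base backup from the primary; -R writes standby.signal and
#    primary_conninfo into PGDATA so the server starts as a replica
run "pg_basebackup -h $PRIMARY -p 5432 -U immich -D $PGDATA -X stream -R"

# 3. Start Postgres again; it now streams WAL from the primary
run "docker-compose -f /opt/immich/docker-compose.yml start immich-postgres"

# 4. Verify on the primary: the replica should appear in pg_stat_replication
run "psql -h $PRIMARY -U postgres -c 'SELECT client_addr, state FROM pg_stat_replication;'"
```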
Why Important:

- Zero Data Loss: continuous replication means no data loss even if the Production VPS is destroyed
- Fast Recovery: promote replica to primary in 5 minutes vs. a 15-minute database restore
- Point-in-Time Recovery: pgBackRest enables recovery to any point in time (not just daily backups)
- Easy Upgrades: major Postgres version upgrades become trivial (restore to the new version from backup)

| Metric | Current | Future |
|---|---|---|
| RTO | 15 minutes (restore from daily backup) | 5 minutes (promote replica) |
| RPO | 24 hours (daily backups) | 0 seconds (continuous replication) |
## 📋 Medium Priority Tasks

### 5. Migrate from Syncthing to GitOps Workflow
Status: Syncthing deployed, GitOps workflow not yet active
Current State:
- ✅ Syncthing running on Mac and Production
- ✅ ~/kavi-infra syncing between laptop and server
- ⚠️ Risk: Accidental local changes instantly break server
- ❌ GitOps workflow not implemented
Action Required:

1. Stop syncing code with Syncthing:
    - Remove `~/kavi-infra` from Syncthing sync folders
    - Relegate Syncthing to personal files only (Obsidian notes, etc.)
2. Implement GitOps workflow:

    ```
    # Development cycle:

    # 1. Make changes locally
    vim ~/kavi-infra/docker-compose.yml

    # 2. Commit and push to Forgejo
    git add .
    git commit -m "Update docker-compose configuration"
    git push forgejo main

    # 3. Pull on server and deploy
    ssh production "cd ~/kavi-infra && git pull && docker-compose up -d --build"
    ```

3. Future: Automate deployment with Forgejo Actions (optional):
    - Set up webhook on Forgejo
    - Trigger `git pull && docker-compose up -d` on every push
    - Full CI/CD pipeline
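The webhook-triggered deploy step could be as small as the sketch below. The repository path is an assumption, and `DRY_RUN` defaults to 1 so the script only prints its plan until flipped to 0.

```bash
#!/usr/bin/env bash
# deploy.sh -- sketch of the pull-and-redeploy step a Forgejo webhook
# (or Forgejo Actions job) would run on the server.
set -euo pipefail

REPO_DIR="${REPO_DIR:-$HOME/kavi-infra}"   # assumed repo location on the server

run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "DRY: $*"; else eval "$*"; fi; }

run "cd $REPO_DIR && git pull --ff-only"          # fast-forward only: refuse surprise merges
run "cd $REPO_DIR && docker-compose up -d --build"
run "docker image prune -f"                       # clean up dangling images after rebuild
```

Using `--ff-only` keeps the server's checkout a pure mirror of Forgejo: if the pull cannot fast-forward, the deploy fails loudly instead of creating a merge commit on the server.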
Why Important:
- Safety: Code changes are deliberate (git commit) not accidental (file save)
- Rollback: Easy rollback with git revert
- Audit Trail: All changes tracked in git history
- Collaboration: Multiple people can work on infrastructure
Migration Path:

1. Stop Syncthing sync for code (keep for personal files)
2. Use manual Git workflow for 1 month
3. Evaluate need for automated deployment
## 🔄 Architectural Decisions Log

### Decision 1: GitOps over Syncthing for Code

Problem: Syncthing syncs every file save instantly, so accidental changes break production.

Evaluated Options:

1. ❌ Keep Syncthing - Too risky
2. ✅ GitOps with Forgejo - Safe, auditable, rollback-able
3. ❌ Manual SCP - No version control

Decision: GitOps with Forgejo + manual pull workflow

Rationale:

- Git provides version control and an audit trail
- Forgejo is self-hosted (no GitHub dependency)
- Manual pull workflow gives control over when changes deploy
- Can automate later if needed
### Decision 2: Ansible for Reproducibility

Problem: "PromptOps" (configuring via AI) is not reproducible

Evaluated Options:

1. ❌ Manual configuration - Not reproducible
2. ❌ Shell scripts - Hard to maintain
3. ✅ Ansible - Industry standard, declarative, idempotent
Decision: Ansible for all server configuration
Rationale:
- Declarative syntax (describe desired state, not steps)
- Idempotent (safe to run multiple times)
- Roles are reusable across servers
- Disaster recovery becomes one command: ansible-playbook playbook.yml
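As a concrete illustration of that rationale, a full disaster-recovery run might look like the sketch below. The inventory path and Terraform directory are assumptions; the playbook name comes from the list in Task 2, and `DRY_RUN` defaults to 1 so the commands only print.

```bash
#!/usr/bin/env bash
# Sketch of the two-command rebuild this decision enables.
set -euo pipefail

run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "DRY: $*"; else eval "$*"; fi; }

# 1. Recreate the VPS and DNS records from Terraform state
run "terraform -chdir=terraform-infra apply -auto-approve"

# 2. Re-apply the full server configuration (idempotent, safe to re-run)
run "ansible-playbook -i inventory/hosts.yml playbooks/server-production.yml"
```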
Current Progress: Complete - all roles and playbooks implemented (see Task 2 above)
### Decision 3: Scatter-Backup to Multiple Clouds

Problem: Single backup location (Storage Box) = single point of failure

Evaluated Options:

1. ❌ Keep only Storage Box - Hetzner account suspension = total loss
2. ✅ Scatter to Eva + OneDrive + Google Drive - Geographic + provider diversity
3. ❌ Paid backup service (Backblaze B2) - Unnecessary cost

Decision: Scatter-backup to 3 additional locations

Rationale:

- Eva (Kimsufi) - Different provider, different datacenter
- OneDrive - Microsoft cloud (2TB existing subscription)
- Google Drive - Google cloud (2TB existing subscription)
- rclone crypt for encryption (zero-knowledge backups)
- No additional cost (using existing resources)

Backup Locations After Implementation:

1. Hetzner Storage Box (5TB) - Primary
2. Eva/Kimsufi (Canada) - Geographic diversity
3. OneDrive (2TB) - Microsoft cloud
4. Google Drive (2TB) - Google cloud
Redundancy Level: 4-way redundancy for critical data
### Decision 4: Primary-Replica for Immich DB Only

Problem: Major database version upgrades are risky, and downtime is unacceptable.

Evaluated Options:

1. ❌ Single "Super Postgres" for all apps - Couples all databases together
2. ❌ Replication for all databases - Over-engineering (n8n, Infisical don't need it)
3. ✅ Replication only for Immich - Focus on critical data (100K+ photos)
Decision: Primary-Replica replication for Immich database only
Rationale:

- Immich is "Tier 1 Critical" (100,805 photos, irreplaceable metadata)
- n8n and Infisical can tolerate a 24-hour RPO (daily backups sufficient)
- Separate databases maintain app isolation
- pgBackRest on Eva enables point-in-time recovery
Implementation Priority: High (pending after Ansible + Scatter-backup)
## 📊 Current Infrastructure State

### Deployed Services ✅
| Service | Status | Port | Access | Notes |
|---|---|---|---|---|
| Infisical | ✅ Running | 8081 | secrets.kua.cl | Secrets manager, database migration fixed, secure password rotated |
| Forgejo | ✅ Running | 3001 | Internal | Self-hosted Git, SQLite backend, persistent storage |
| Syncthing | ✅ Running | 8384 | Internal | Mac ↔ Production sync, will migrate to GitOps |
| Immich | ✅ Running | 2283 | photos.kua.cl | 100,805+ photos, database needs replication |
| n8n | ✅ Running | 5678 | n8n.kua.cl | Workflow automation |
| Kuanary | ✅ Running | 5001 | media.kua.cl | Media CDN (renamed from kavicloud) |
| open-webui | ✅ Running | 3000 | dev.kua.cl | AI chat interface (Dev VPS) |
### Pending Deployments ⏳
| Service | Purpose | Priority | Estimated Time |
|---|---|---|---|
| backup-scatter.sh | Multi-cloud backups | HIGH | 1 hour |
| Immich DB replica | High availability | HIGH | 3 hours |
| pgBackRest | Point-in-time recovery | MEDIUM | 2 hours |
| Forgejo Actions | CI/CD automation | LOW | 4 hours |
## 🎯 Next Session Action Items

When resuming work, prioritize in this order (items already finished are marked ✅, per Tasks 1-2 above):

1. ✅ Push code to Forgejo (5 minutes)
    - Set up remote, push existing commits
    - Verify web UI shows repository
2. ✅ Complete Ansible docker role (1 hour)
    - Install Docker + Docker Compose
    - Test on Development VPS first
3. ✅ Complete Ansible storage role (1 hour)
    - Mount Storage Box via rclone
    - Create systemd service for persistence
4. Write backup-scatter.sh (1 hour)
    - Configure rclone remotes (Eva, OneDrive, Google Drive)
    - Implement parallel upload
    - Schedule cron job at 2 AM
5. Configure Immich DB replication (3 hours)
    - Enable WAL on Production
    - Set up replica on Development
    - Test failover scenario

Total Estimated Time: ~4 hours for the remaining high-priority tasks (backup script + replication)
## 📚 Related Documentation
- Services Overview - All Production VPS services
- Backup Procedures - Current backup strategy
- Disaster Recovery - Recovery procedures with Terraform
- DNS Management - Terraform DNS configuration
Last updated: December 2025 - Post-migration pending tasks