
Pending Infrastructure Tasks

Status as of December 2025 - Post-migration work items and architectural improvements.


🎯 High Priority Tasks

1. Push Code to Forgejo ✅ COMPLETE

Status: ✅ Complete - All code pushed to Forgejo

Completed:

- ✅ Forgejo running on Production VPS (port 3001, SSH port 2222)
- ✅ Port 2222 opened in Hetzner Cloud Firewall
- ✅ git.kua.cl DNS record created
- ✅ Push-to-create enabled for users and organizations
- ✅ kavi-infra repository pushed to Forgejo
- ✅ terraform-infra repository pushed to Forgejo
- ✅ SSH key configured for Git access

Repository URLs:

- kavi-infra: ssh://git@116.203.109.220:2222/kuatecno/kavi-infra.git
- terraform-infra: ssh://git@116.203.109.220:2222/kuatecno/terraform-infra.git

Benefits Achieved: Self-hosted Git repository, GitOps workflow enabled, all infrastructure code centralized
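
Because SSH listens on a non-standard port, an ~/.ssh/config entry keeps clone commands short. The host alias below is just a convenience, not existing configuration:

```
Host forgejo
    HostName 116.203.109.220
    Port 2222
    User git
```

With this in place, `git clone forgejo:kuatecno/kavi-infra.git` resolves to the same repository as the full ssh:// URL.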


2. Complete Ansible Automation ✅ COMPLETE

Status: ✅ Complete - Full Ansible automation implemented

Completed:

- ✅ roles/common - ufw, fail2ban, common packages
- ✅ roles/docker - Docker + Docker Compose installation
- ✅ roles/storage - Rclone + systemd mount services
- ✅ roles/mac-essentials - Essential CLI tools for all Macs
- ✅ roles/mac-developer-tools - Infrastructure tools (terraform, ansible, docker, k8s)
- ✅ roles/server-hardening - UFW, fail2ban, SSH hardening, auto-updates

Playbooks Created:

- playbooks/mac-regular.yml - For regular Macs
- playbooks/mac-developer.yml - For developer Macs
- playbooks/server-production.yml - For production servers
- playbooks/server-development.yml - For development servers

Benefits Achieved:

- Mac onboarding: 30 minutes from zero to configured
- Server configuration: 10-15 minutes, fully automated
- Disaster recovery: server rebuild in 2 minutes (Terraform + Ansible)
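
For reference, a playbook such as playbooks/server-production.yml presumably just composes the roles listed above. A minimal sketch, where the inventory group name is an assumption:

```yaml
# playbooks/server-production.yml (sketch; "production" group name assumed)
- name: Configure production servers
  hosts: production
  become: true
  roles:
    - common
    - docker
    - storage
    - server-hardening
```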


3. Implement Scatter-Backup Strategy 💾

Status: Strategy defined, script not yet written

Current State:

- ✅ Backup targets identified: Eva (Kimsufi), OneDrive, Google Drive
- ✅ rclone installed on Production VPS
- ❌ rclone remotes not configured
- ❌ backup-scatter.sh script not written
- ❌ Cron job not scheduled

Action Required:

  1. Configure rclone remotes:

    ssh production
    
    # Configure Eva (Kimsufi) remote
    rclone config create eva sftp host=144.217.76.53 user=ubuntu key_file=~/.ssh/id_ed25519_macmini
    
    # Configure OneDrive remote
    rclone config create onedrive onedrive
    
    # Configure Google Drive remote
    rclone config create gdrive drive
    

  2. Create backup-scatter.sh script:
     - Dump PostgreSQL databases (Immich, n8n, Infisical)
     - Dump SQLite databases (Forgejo)
     - Encrypt dumps with rclone crypt
     - Upload to Eva, OneDrive, Google Drive in parallel
     - Log results to /var/log/backup-scatter.log

  3. Schedule nightly cron job:

    # Run at 2 AM daily (before backup verification at 4 AM)
    0 2 * * * /opt/scripts/backup-scatter.sh
    

Why Important: Eliminates single point of failure. Hetzner account suspension no longer means total data loss.

Data Redundancy: 4 backup locations (Storage Box + Eva + OneDrive + Google Drive)
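
Putting the steps above together, backup-scatter.sh might look roughly like this. It is a sketch only: the container names, the Forgejo database path, and the crypt remote names are assumptions, not verified configuration:

```shell
#!/usr/bin/env bash
# backup-scatter.sh (sketch) -- dump, encrypt via rclone crypt remotes,
# and scatter-upload to all backup targets in parallel.
set -euo pipefail

LOG=/var/log/backup-scatter.log
REMOTES=(eva-crypt onedrive-crypt gdrive-crypt)   # rclone crypt wrappers (assumed names)

# Pure helper: destination path for a given remote and date stamp
dest_path() {
  printf '%s:backups/%s' "$1" "$2"
}

run_backup() {
  local stamp workdir
  stamp=$(date +%F)
  workdir=$(mktemp -d)

  # 1. Dump the PostgreSQL databases (container names are hypothetical)
  for db in immich n8n infisical; do
    docker exec "${db}-postgres" pg_dumpall -U "$db" | gzip \
      > "$workdir/$db.sql.gz"
  done

  # 2. Snapshot Forgejo's SQLite database (path is an assumption)
  sqlite3 /opt/forgejo/data/forgejo.db ".backup '$workdir/forgejo.db'"

  # 3. Upload to every remote in parallel; the crypt remotes
  #    encrypt client-side, so the clouds only see ciphertext
  for remote in "${REMOTES[@]}"; do
    rclone copy "$workdir" "$(dest_path "$remote" "$stamp")" &
  done
  wait

  # 4. Log the result and clean up
  echo "$(date -Is) backup-scatter OK ($stamp)" >> "$LOG"
  rm -rf "$workdir"
}

# Guarded so the file can be sourced or syntax-checked without running
if [[ "${1:-}" == "--run" ]]; then
  run_backup
fi
```

Invoked from cron as `/opt/scripts/backup-scatter.sh --run`; the guard keeps an accidental `bash backup-scatter.sh` from touching anything.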


4. Configure Immich Database Replication 🗄️

Status: Strategy defined, not yet implemented

Current State:

- ✅ Immich database identified as "Tier 1 Critical" (metadata for 100,805+ photos)
- ✅ Development VPS available as replica target
- ❌ Primary-Replica replication not configured
- ❌ pgBackRest not installed
- ❌ WAL archiving not enabled

Action Required:

  1. Configure Production Postgres as Primary:

    ssh production
    
    # Edit /opt/immich/postgres/postgresql.conf
    wal_level = replica
    max_wal_senders = 3
    wal_keep_size = 1GB
    
    # Edit /opt/immich/postgres/pg_hba.conf
    # Allow Dev VPS to connect for replication
    host replication immich 46.224.125.1/32 scram-sha-256
    
    # Restart Postgres
    docker-compose restart immich-postgres
    

  2. Configure Development Postgres as Replica:

    ssh dev
    
    # Stop existing postgres if any
    docker-compose stop immich-replica-postgres
    
    # Create base backup from Production
    pg_basebackup -h 116.203.109.220 -U immich -D /opt/immich-replica/data -P -R
    
    # Start replica in standby mode
    docker-compose up -d immich-replica-postgres
    

  3. Install pgBackRest on Eva (Kimsufi):

    ssh eva
    
    # Install pgBackRest
    apt-get install pgbackrest
    
    # Configure repository on Eva
    # Point to Production Postgres via SSH
    # Enable continuous WAL archiving
    

  4. Test failover:
     - Simulate Production failure
     - Promote Dev replica to primary
     - Verify zero data loss
     - Document recovery procedures
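
For step 3, the repository side of pgBackRest is driven by /etc/pgbackrest/pgbackrest.conf. A minimal sketch, in which the stanza name, retention policy, and Postgres data path are assumptions:

```ini
# /etc/pgbackrest/pgbackrest.conf on Eva (sketch)
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2          ; keep two full backups

[immich]
pg1-host=116.203.109.220        ; Production VPS, reached over SSH
pg1-path=/opt/immich/postgres/data
```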

Why Important:

- Zero Data Loss: continuous replication means no data loss even if the Production VPS is destroyed
- Fast Recovery: promote the replica to primary in 5 minutes vs a 15-minute database restore
- Point-in-Time Recovery: pgBackRest enables recovery to any point in time (not just daily backups)
- Easy Upgrades: major Postgres version upgrades become trivial (restore to the new version from backup)

- Current RTO: 15 minutes (restore from daily backup)
- Future RTO: 5 minutes (promote replica)
- Current RPO: 24 hours (daily backups)
- Future RPO: 0 seconds (continuous replication)


📋 Medium Priority Tasks

5. Migrate from Syncthing to GitOps Workflow

Status: Syncthing deployed, GitOps workflow not yet active

Current State:

- ✅ Syncthing running on Mac and Production
- ✅ ~/kavi-infra syncing between laptop and server
- ⚠️ Risk: accidental local changes instantly break the server
- ❌ GitOps workflow not implemented

Action Required:

  1. Stop syncing code with Syncthing:
     - Remove ~/kavi-infra from Syncthing sync folders
     - Relegate Syncthing to personal files only (Obsidian notes, etc.)

  2. Implement GitOps workflow:

    # Development cycle:
    # 1. Make changes locally
    vim ~/kavi-infra/docker-compose.yml
    
    # 2. Commit and push to Forgejo
    git add .
    git commit -m "Update docker-compose configuration"
    git push forgejo main
    
    # 3. Pull on server and deploy
    ssh production "cd ~/kavi-infra && git pull && docker-compose up -d --build"
    

  3. Future: Automate deployment with Forgejo Actions (optional):
     - Set up a webhook on Forgejo
     - Trigger git pull && docker-compose up -d on every push
     - Full CI/CD pipeline
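
Such a pipeline could eventually live in a Forgejo Actions workflow. A rough sketch, where the runner label and the runner's SSH access to Production are assumptions (a runner must be registered first):

```yaml
# .forgejo/workflows/deploy.yml (sketch)
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: docker            # runner label is an assumption
    steps:
      - name: Pull and redeploy on Production
        # assumes the runner can SSH to the production host
        run: |
          ssh production "cd ~/kavi-infra && git pull && docker-compose up -d --build"
```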

Why Important:

- Safety: code changes are deliberate (git commit), not accidental (file save)
- Rollback: easy rollback with git revert
- Audit Trail: all changes tracked in git history
- Collaboration: multiple people can work on the infrastructure

Migration Path:

1. Stop Syncthing sync for code (keep it for personal files)
2. Use the manual Git workflow for 1 month
3. Evaluate the need for automated deployment


🔄 Architectural Decisions Log

Decision 1: GitOps over Syncthing for Code

Problem: Syncthing syncs every file save instantly, accidental changes break production

Evaluated Options:

1. ❌ Keep Syncthing - Too risky
2. ✅ GitOps with Forgejo - Safe, auditable, easy to roll back
3. ❌ Manual SCP - No version control

Decision: GitOps with Forgejo + manual pull workflow

Rationale:

- Git provides version control and an audit trail
- Forgejo is self-hosted (no GitHub dependency)
- Manual pull workflow gives control over when changes deploy
- Can automate later if needed


Decision 2: Ansible for Reproducibility

Problem: "PromptOps" (configuring via AI) is not reproducible

Evaluated Options:

1. ❌ Manual configuration - Not reproducible
2. ❌ Shell scripts - Hard to maintain
3. ✅ Ansible - Industry standard, declarative, idempotent

Decision: Ansible for all server configuration

Rationale:

- Declarative syntax (describe desired state, not steps)
- Idempotent (safe to run multiple times)
- Roles are reusable across servers
- Disaster recovery becomes one command: ansible-playbook playbook.yml

Current Progress: Complete - all roles and playbooks implemented (see Task 2)


Decision 3: Scatter-Backup to Multiple Clouds

Problem: Single backup location (Storage Box) = single point of failure

Evaluated Options:

1. ❌ Keep only Storage Box - Hetzner account suspension = total loss
2. ✅ Scatter to Eva + OneDrive + Google Drive - Geographic + provider diversity
3. ❌ Paid backup service (Backblaze B2) - Unnecessary cost

Decision: Scatter-backup to 3 additional locations

Rationale:

- Eva (Kimsufi) - Different provider, different datacenter
- OneDrive - Microsoft cloud (2TB existing subscription)
- Google Drive - Google cloud (2TB existing subscription)
- rclone crypt for encryption (zero-knowledge backups)
- No additional cost (uses existing resources)

Backup Locations After Implementation:

1. Hetzner Storage Box (5TB) - Primary
2. Eva/Kimsufi (Canada) - Geographic diversity
3. OneDrive (2TB) - Microsoft cloud
4. Google Drive (2TB) - Google cloud

Redundancy Level: 4-way redundancy for critical data
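
The crypt layer corresponds to an rclone.conf entry per target, roughly like the fragment below. The remote names and the backups path are assumptions, and the password must be generated with rclone obscure rather than stored in plain text:

```ini
# ~/.config/rclone/rclone.conf fragment (sketch)
[eva-crypt]
type = crypt
remote = eva:backups
password = <output of rclone obscure>
```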


Decision 4: Primary-Replica for Immich DB Only

Problem: Database major version upgrades are risky, downtime is unacceptable

Evaluated Options:

1. ❌ Single "Super Postgres" for all apps - Couples all databases together
2. ❌ Replication for all databases - Over-engineering (n8n, Infisical don't need it)
3. ✅ Replication only for Immich - Focus on critical data (100K+ photos)

Decision: Primary-Replica replication for Immich database only

Rationale:

- Immich is "Tier 1 Critical" (100,805 photos, irreplaceable metadata)
- n8n and Infisical can tolerate a 24-hour RPO (daily backups are sufficient)
- Separate databases maintain app isolation
- pgBackRest on Eva enables point-in-time recovery

Implementation Priority: High (pending after Ansible + Scatter-backup)


📊 Current Infrastructure State

Deployed Services ✅

| Service | Status | Port | Access | Notes |
|---------|--------|------|--------|-------|
| Infisical | ✅ Running | 8081 | secrets.kua.cl | Secrets manager; database migration fixed, secure password rotated |
| Forgejo | ✅ Running | 3001 | Internal | Self-hosted Git, SQLite backend, persistent storage |
| Syncthing | ✅ Running | 8384 | Internal | Mac ↔ Production sync; will migrate to GitOps |
| Immich | ✅ Running | 2283 | photos.kua.cl | 100,805+ photos; database needs replication |
| n8n | ✅ Running | 5678 | n8n.kua.cl | Workflow automation |
| Kuanary | ✅ Running | 5001 | media.kua.cl | Media CDN (renamed from kavicloud) |
| open-webui | ✅ Running | 3000 | dev.kua.cl | AI chat interface (Dev VPS) |

Pending Deployments ⏳

| Service | Purpose | Priority | Estimated Time |
|---------|---------|----------|----------------|
| backup-scatter.sh | Multi-cloud backups | HIGH | 1 hour |
| Immich DB replica | High availability | HIGH | 3 hours |
| pgBackRest | Point-in-time recovery | MEDIUM | 2 hours |
| Forgejo Actions | CI/CD automation | LOW | 4 hours |

🎯 Next Session Action Items

When resuming work, prioritize in this order:

  1. Push code to Forgejo (5 minutes) ✅ Complete
     - Set up remote, push existing commits
     - Verify web UI shows repository

  2. Complete Ansible docker role (1 hour) ✅ Complete
     - Install Docker + Docker Compose
     - Test on Development VPS first

  3. Complete Ansible storage role (1 hour) ✅ Complete
     - Mount Storage Box via rclone
     - Create systemd service for persistence

  4. Write backup-scatter.sh (1 hour)
     - Configure rclone remotes (Eva, OneDrive, Google Drive)
     - Implement parallel upload
     - Schedule cron job at 2 AM

  5. Configure Immich DB replication (3 hours)
     - Enable WAL on Production
     - Set up replica on Development
     - Test failover scenario

Total Estimated Time: ~6-7 hours for all high-priority tasks (~4 hours remaining)



Last updated: December 2025 - Post-migration pending tasks