diff --git a/AGENTS-PROD.md b/AGENTS-PROD.md index 709462d..073d552 100644 --- a/AGENTS-PROD.md +++ b/AGENTS-PROD.md @@ -1,478 +1,266 @@ # General Bots Cloud — Production Operations Guide ## Infrastructure Overview -- **Host OS:** Ubuntu 24.04 LTS, Incus +- **Host OS:** Ubuntu 24.04 LTS - **SSH:** Key auth only -- **Container engine:** Incus with ZFS storage pool -- **Tenant:** pragmatismo (migrated from LXD 82.29.59.188 to Incus 63.141.255.9) +- **Container engine:** LXC (Linux Containers) +- **Tenant:** pragmatismo --- -## Container Migration: pragmatismo (COMPLETED) +## PostgreSQL Container (tables) Management -### Summary -| Item | Detail | -|------|--------| -| Source | LXD 5.21 on Ubuntu 22.04 @ 82.29.59.188 | -| Destination | Incus 6.x on Ubuntu 24.04 @ 63.141.255.9 | -| Migration method | `incus copy --instance-only lxd-source:` | -| Data transfer | rsync via SSH (pull from destination → source:/opt/gbo) | -| Total downtime | ~4 hours | -| Containers migrated | 10 | -| Data transferred | ~44 GB | +### Container Info +- **Container Name:** pragmatismo-tables +- **Database:** PostgreSQL 14 +- **Access:** Inside LXC container on port 5432 +- **Data Location:** `/opt/gbo/tables/` inside container -### Migrated Containers (destination names) -``` -proxy → proxy (Caddy reverse proxy) -tables → tables (PostgreSQL) -system → system (botserver + botui, privileged) -drive → drive (MinIO S3) -dns → dns (CoreDNS) -email → email (Stalwart mail) -webmail → webmail (Roundcube) -alm → alm (Forgejo ALM) -alm-ci → alm-ci (Forgejo CI runner) -table-editor → table-editor (NocoDB) -``` +### Common Operations -### Data Paths -- **Source data:** `root@82.29.59.188:/opt/gbo/` (44 GB, tenant data + binaries) -- **Destination data:** `/home/administrator/gbo/tenants/pragmatismo/` (rsync in progress) -- **Final path:** `/opt/gbo/tenants/pragmatismo/` (symlink or mount) - -### Key Decisions Made -1. **No `pragmatismo-` prefix** on destination (unlike source) -2. 
**iptables NAT** instead of Incus proxy devices (proxy devices conflicted with NAT rules) -3. **Incus proxy devices removed** from all containers after NAT configured -4. **Disk devices removed** from source containers before migration (Incus can't resolve LXD paths) - -### Port Forwarding (iptables NAT) -| Port | Service | -|------|---------| -| 80, 443 | Caddy (HTTP/HTTPS) | -| 25, 465, 587 | SMTP | -| 993, 995, 143, 110, 4190 | IMAP/POP/Sieve | -| 53 | DNS | - -### Remaining Post-Migration Tasks -- [x] **rsync transfer:** Source /opt/gbo → destination ~/gbo ✓ -- [x] **Merge data:** rsync to /opt/gbo/tenants/pragmatismo/ ✓ -- [x] **Configure NAT:** iptables PREROUTING rules ✓ -- [x] **Update Caddy:** Replace old IPs with new 10.107.115.x IPs ✓ -- [x] **Copy data to containers:** tar.gz method for proxy, tables, email, webmail, alm-ci, table-editor ✓ -- [x] **Fix directory structure:** system, dns, alm ✓ -- [x] **Caddy installed and running** ✓ -- [ ] **SSL certificates:** Let's Encrypt rate limited - need to wait or use existing certs -- [ ] **botserver binary missing** in system container -- [ ] **DNS cutover:** Update NS/A records to point to 63.141.255.9 -- [ ] **Source cleanup:** Delete /opt/gbo/ on source after verification - -### Current Container Status (2026-03-22 17:50 UTC) -| Container | /opt/gbo/ contents | Status | -|-----------|---------------------|--------| -| proxy | conf, data, logs, Caddy running | ✓ OK (SSL pending) | -| tables | conf, data, logs, pgconf, pgdata | ✓ OK | -| email | conf, data, logs | ✓ OK | -| webmail | conf, data, logs | ✓ OK | -| alm-ci | conf, data, logs | ✓ OK | -| table-editor | conf, data, logs | ✓ OK | -| system | bin, botserver-stack, conf, data, logs | ✓ OK | -| drive | data, logs | ✓ OK | -| dns | bin, conf, data, logs | ✓ OK | -| alm | alm/, conf, data, logs | ✓ OK | - -### Known Issues -1. **Let's Encrypt rate limiting** - Too many cert requests from old server. 
Certificates will auto-renew after rate limit clears (~1 hour) -2. **botserver database connection** - PostgreSQL is in tables container (10.107.115.33), need to update DATABASE_URL in system container -3. **SSL certificates** - Caddy will retry obtaining certs after rate limit clears - -### Final Status (2026-03-22 18:30 UTC) - -#### Container Services Status -| Container | Service | Port | Status | -|-----------|---------|------|--------| -| system | Vault | 8200 | ✓ Running | -| system | Valkey | 6379 | ✓ Running | -| system | MinIO | 9100 | ✓ Running | -| system | Qdrant | 6333 | ✓ Running | -| system | botserver | - | ⚠️ Not listening | -| tables | PostgreSQL | 5432 | ✓ Running | -| proxy | Caddy | 80, 443 | ✓ Running | -| dns | CoreDNS | 53 | ❌ Not running | -| email | Stalwart | 25,143,465,993,995 | ❌ Not running | -| webmail | Roundcube | - | ❌ Not running | -| alm | Forgejo | 3000 | ❌ Not running | -| alm-ci | Forgejo-runner | - | ❌ Not running | -| table-editor | NocoDB | - | ❌ Not running | -| drive | MinIO | - | ❌ (in system container) | - -#### Issues Found -1. **botserver not listening** - needs DATABASE_URL pointing to tables container -2. **dns, email, webmail, alm, alm-ci, table-editor** - services not started -3. **SSL certificates** - Let's Encrypt rate limited - -### Data Structure - -**Host path:** `/opt/gbo/tenants/pragmatismo//` -**Container path:** `/opt/gbo/` (conf, data, logs, bin, etc.) 
- -| Container | Host Path | Container /opt/gbo/ | -|-----------|-----------|---------------------| -| system | `.../system/` | bin, botserver-stack, conf, data, logs | -| proxy | `.../proxy/` | conf, data, logs | -| tables | `.../tables/` | conf, data, logs | -| drive | `.../drive/` | data, logs | -| dns | `.../dns/` | bin, conf, data, logs | -| email | `.../email/` | conf, data, logs | -| webmail | `.../webmail/` | conf, data, logs | -| alm | `.../alm/` | conf, data, logs | -| alm-ci | `.../alm-ci/` | conf, data, logs | -| table-editor | `.../table-editor/` | conf, data, logs | - -### Attach Data Devices (after moving data) ```bash -# Move data to final location -ssh administrator@63.141.255.9 "sudo mv /home/administrator/gbo /opt/gbo/tenants/pragmatismo" +# Check container status +lxc list | grep pragmatismo-tables -# Attach per-container disk device -for container in system proxy tables drive dns email webmail alm alm-ci table-editor; do - incus config device add $container gbo disk \ - source=/opt/gbo/tenants/pragmatismo/$container \ - path=/opt/gbo -done +# Exec into container +lxc exec pragmatismo-tables -- bash -# Fix permissions (each container) -for container in system proxy tables drive dns email webmail alm alm-ci table-editor; do - incus exec $container -- chown -R gbuser:gbuser /opt/gbo/ 2>/dev/null || \ - incus exec $container -- chown -R root:root /opt/gbo/ -done +# Check PostgreSQL status +lxc exec pragmatismo-tables -- pg_isready + +# Query version +lxc exec pragmatismo-tables -- psql -U postgres -c 'SELECT version();' + +# Restart PostgreSQL +lxc exec pragmatismo-tables -- systemctl restart postgresql ``` -### Container IPs (for Caddy configuration) -``` -system: 10.107.115.229 -proxy: 10.107.115.189 -tables: 10.107.115.33 -drive: 10.107.115.114 -dns: 10.107.115.155 -email: 10.107.115.200 -webmail: 10.107.115.208 -alm: 10.107.115.4 -alm-ci: 10.107.115.190 -table-editor: (no IP - start container) +### Backup PostgreSQL + +```bash +# Create 
database dump +lxc exec pragmatismo-tables -- pg_dump -U postgres -F c -f /tmp/backup.dump pragmatismo + +# Copy backup to host +lxc file pull pragmatismo-tables/tmp/backup.dump ~/backups/postgresql-$(date +%Y%m%d).dump ``` --- -## LXC Container Architecture (destination) +## LXC Container Status (82.29.59.188) -| Container | Purpose | Exposed Ports | -|---|---|---| -| `proxy` | Caddy reverse proxy | 80, 443 | -| `system` | botserver + botui (privileged!) | internal only | -| `alm` | Forgejo (ALM/Git) | internal only | -| `alm-ci` | Forgejo CI runner | none | -| `email` | Stalwart mail server | 25,465,587,993,995,143,110 | -| `dns` | CoreDNS | 53 | -| `drive` | MinIO S3 | internal only | -| `tables` | PostgreSQL | internal only | -| `table-editor` | NocoDB | internal only | -| `webmail` | Roundcube | internal only | - -## Key Rules -- `system` must be **privileged** (`security.privileged: true`) — required for botserver to own `/opt/gbo/` mounts -- All containers use **iptables NAT** for port forwarding — NEVER use Incus proxy devices (they conflict with NAT) -- **Data copied into each container** at `/opt/gbo/` — NOT disk devices. Each container has its own copy of data. -- CI runner (`alm-ci`) must NOT have cross-container disk device mounts — deploy via SSH only -- Caddy config must have correct upstream IPs for each backend container - -## Container Migration (LXD to Incus) — COMPLETED - -### Migration Workflow (for future tenants) - -**Best Method:** `incus copy --instance-only` — transfers containers directly between LXD and Incus. - -#### Prerequisites -```bash -# 1. Open port 8443 on both servers -ssh root@ "iptables -I INPUT -p tcp --dport 8443 -j ACCEPT" -ssh administrator@ "sudo iptables -I INPUT -p tcp --dport 8443 -j ACCEPT" - -# 2. Exchange SSH keys (for rsync data transfer) -ssh administrator@ "cat ~/.ssh/id_rsa.pub" -ssh root@ "echo '' >> /root/.ssh/authorized_keys" - -# 3. 
Add source LXD as Incus remote -ssh administrator@ "incus remote add lxd-source --protocol=incus --accept-certificate" - -# 4. Add destination cert to source LXD trust -ssh @ "cat ~/.config/incus/client.crt" -ssh root@ "lxc config trust add -" -``` - -#### Migration Steps -```bash -# 1. On SOURCE: Remove disk devices (Incus won't have source paths) -for c in $(lxc list --format csv -c n); do - lxc stop $c - for d in $(lxc config device list $c); do - lxc config device remove $c $d - done -done - -# 2. On DESTINATION: Copy each container -incus copy --instance-only lxd-source: -incus start - -# 3. On DESTINATION: Add eth0 network to each container -incus config device add eth0 nic name=eth0 network=incusbr0 - -# 4. On DESTINATION: Configure iptables NAT (not proxy devices!) -# See iptables NAT Setup above - -# 5. On DESTINATION: Pull data via rsync (from destination to source) -ssh administrator@ "rsync -avz --progress root@:/opt/gbo/ /home/administrator/gbo/" - -# 6. On DESTINATION: Organize data per container -# Data is structured as: /home/administrator/gbo// -# Each container gets its own folder with {conf,data,logs,bin}/ - -# 7. On DESTINATION: Move to final location -ssh administrator@ "sudo mkdir -p /opt/gbo/tenants/" -ssh administrator@ "sudo mv /home/administrator/gbo /opt/gbo/tenants//" - -# 8. On DESTINATION: Copy data into each container -for container in system proxy tables drive dns email webmail alm alm-ci table-editor; do - incus exec $container -- mkdir -p /opt/gbo - incus file push --recursive /opt/gbo/tenants//$container/. $container/opt/gbo/ -done - -# 9. On DESTINATION: Fix permissions -for container in system proxy tables drive dns email webmail alm alm-ci table-editor; do - incus exec $container -- chown -R gbuser:gbuser /opt/gbo/ 2>/dev/null || \ - incus exec $container -- chown -R root:root /opt/gbo/ -done - -# 10. 
On DESTINATION: Update Caddy config with new container IPs -# sed -i 's/10.16.164.x/10.107.115.x/g' /opt/gbo/conf/config -incus file push /tmp/new_caddy_config proxy/opt/gbo/conf/config - -# 11. Reload Caddy -incus exec proxy -- /opt/gbo/bin/caddy reload --config /opt/gbo/conf/config --adapter caddyfile -``` - -#### iptables NAT Setup (on destination host) -```bash -# Enable IP forwarding -sudo sysctl -w net.ipv4.ip_forward=1 - -# NAT rules — proxy container (ports 80, 443) -sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 10.107.115.189:80 -sudo iptables -t nat -A PREROUTING -p tcp --dport 443 -j DNAT --to-destination 10.107.115.189:443 - -# NAT rules — email container (SMTP/IMAP) -sudo iptables -t nat -A PREROUTING -p tcp --dport 25 -j DNAT --to-destination 10.107.115.200:25 -sudo iptables -t nat -A PREROUTING -p tcp --dport 465 -j DNAT --to-destination 10.107.115.200:465 -sudo iptables -t nat -A PREROUTING -p tcp --dport 587 -j DNAT --to-destination 10.107.115.200:587 -sudo iptables -t nat -A PREROUTING -p tcp --dport 993 -j DNAT --to-destination 10.107.115.200:993 -sudo iptables -t nat -A PREROUTING -p tcp --dport 995 -j DNAT --to-destination 10.107.115.200:995 -sudo iptables -t nat -A PREROUTING -p tcp --dport 143 -j DNAT --to-destination 10.107.115.200:143 -sudo iptables -t nat -A PREROUTING -p tcp --dport 110 -j DNAT --to-destination 10.107.115.200:110 -sudo iptables -t nat -A PREROUTING -p tcp --dport 4190 -j DNAT --to-destination 10.107.115.200:4190 - -# NAT rules — dns container (DNS) -sudo iptables -t nat -A PREROUTING -p udp --dport 53 -j DNAT --to-destination 10.107.115.155:53 -sudo iptables -t nat -A PREROUTING -p tcp --dport 53 -j DNAT --to-destination 10.107.115.155:53 - -# Masquerade outgoing traffic -sudo iptables -t nat -A POSTROUTING -s 10.107.115.0/24 -j MASQUERADE - -# Save rules -sudo netfilter-persistent save -``` - -#### Remove Incus Proxy Devices (after NAT is working) -```bash -for c in $(incus list --format 
csv -c n); do - for d in $(incus config device list $c | grep proxy); do - incus config device remove $c $d - done -done -``` - -#### pragmatismo Migration Notes -- Source server: `root@82.29.59.188` (LXD 5.21, Ubuntu 22.04) -- Destination: `administrator@63.141.255.9` (Incus 6.x, Ubuntu 24.04) -- Container naming: No prefix on destination (`proxy` not `pragmatismo-proxy`) -- Data: rsync pull from destination (not push from source) - -## Firewall (host) - -### ⚠️ CRITICAL: NEVER Block SSH Port 22 -**When installing ANY firewall (UFW, iptables, etc.), ALWAYS allow SSH (port 22) FIRST, before enabling the firewall.** - -**Wrong order (will lock you out!):** -```bash -ufw enable # BLOCKS SSH! -``` - -**Correct order:** -```bash -ufw allow 22/tcp # FIRST: Allow SSH -ufw allow 80/tcp # Allow HTTP -ufw allow 443/tcp # Allow HTTPS -ufw enable # THEN enable firewall -``` - -### Firewall Setup Steps -1. **Always allow SSH before enabling firewall:** - ```bash - sudo ufw allow 22/tcp - ``` - -2. **Install UFW:** - ```bash - sudo apt-get install -y ufw - ``` - -3. **Configure UFW with SSH allowed:** - ```bash - sudo ufw default forward ACCEPT - sudo ufw allow 22/tcp - sudo ufw allow 80/tcp - sudo ufw allow 443/tcp - sudo ufw enable - ``` - -4. **Persist iptables rules for NAT (containers):** - Create `/etc/systemd/system/iptables-restore.service`: - ```ini - [Unit] - Description=Restore iptables rules on boot - After=network-pre.target - Before=network.target - DefaultDependencies=no - - [Service] - Type=oneshot - ExecStart=/bin/bash -c "/sbin/iptables-restore < /etc/iptables/rules.v4" - RemainAfterExit=yes - - [Install] - WantedBy=multi-user.target - ``` - - Save rules and enable: - ```bash - sudo iptables-save > /etc/iptables/rules.v4 - sudo systemctl enable iptables-restore.service - ``` - -5. 
**Install fail2ban:** - ```bash - # Download fail2ban deb from http://ftp.us.debian.org/debian/pool/main/f/fail2ban/ - sudo dpkg -i fail2ban_*.deb - sudo touch /var/log/auth.log - sudo systemctl enable fail2ban - sudo systemctl start fail2ban - ``` - -6. **Configure fail2ban SSH jail:** - ```bash - sudo fail2ban-client status # Should show sshd jail - ``` - -### Requirements -- **ufw** with `DEFAULT_FORWARD_POLICY=ACCEPT` (needed for container internet) -- **fail2ban** on host (SSH jail) and in email container (mail jail) -- iptables NAT rules must persist via systemd service +| Container | Status | Purpose | Notes | +|-----------|--------|---------|-------| +| **pragmatismo-dns** | ✅ RUNNING | CoreDNS | DNS server | +| **pragmatismo-proxy** | ✅ RUNNING | Caddy | Reverse proxy on :80/:443 | +| **pragmatismo-tables** | ✅ RUNNING | PostgreSQL 14 | Database | +| **pragmatismo-system** | ✅ RUNNING | botserver | Bot system | +| **pragmatismo-email** | ✅ RUNNING | Stalwart | Email server | +| **pragmatismo-webmail** | ✅ RUNNING | Roundcube | Webmail interface | +| **pragmatismo-alm** | ✅ RUNNING | Forgejo | Git/Code ALM | +| **pragmatismo-alm-ci** | ✅ RUNNING | Runner | CI/CD runner | +| **pragmatismo-drive** | ✅ RUNNING | MinIO | S3-compatible storage | --- -## 🔧 Common Production Issues & Fixes +## LXC Container Management -### Issue: Valkey/Redis Connection Timeout +### Common Commands -**Symptom:** botserver logs show `Connection timed out (os error 110)` when connecting to cache at `localhost:6379` - -**Root Cause:** iptables DROP rule for port 6379 blocks loopback traffic because no ACCEPT rule for `lo` interface exists before the DROP rules. 
-**Fix:**
```bash
-# Insert loopback ACCEPT at top of INPUT chain
-incus exec system -- iptables -I INPUT 1 -i lo -j ACCEPT
+# List all containers
+lxc list

-# Persist the rule
-incus exec system -- bash -c 'iptables-save > /etc/iptables/rules.v4'
+# Show container details
+lxc info pragmatismo-dns

-# Verify Valkey responds
-incus exec system -- /opt/gbo/bin/botserver-stack/bin/cache/bin/valkey-cli ping
-# Should return: PONG
+# Exec into container
+lxc exec pragmatismo-dns -- bash
+lxc exec pragmatismo-dns -- /bin/sh

-# Restart botserver to pick up working cache
-incus exec system -- systemctl restart system.service ui.service
+# Start/Stop/Restart containers
+lxc start pragmatismo-dns
+lxc stop pragmatismo-dns
+lxc restart pragmatismo-dns
+
+# View logs
+lxc info pragmatismo-dns --show-log
+
+# Copy files to/from container
+lxc file push localfile pragmatismo-dns/opt/gbo/conf/
+lxc file pull pragmatismo-dns/opt/gbo/logs/output.log .
```

-**Prevention:** Always ensure loopback ACCEPT rule is at the top of iptables INPUT chain before any DROP rules.
+---

-### Issue: Suggestions Not Showing in Frontend
+## SSH Setup

-**Symptom:** Bot's start.bas has `ADD_SUGGESTION_TOOL` calls but suggestions don't appear in the UI. 
+### On Production Server -**Diagnosis:** ```bash -# Get bot ID -incus exec system -- /opt/gbo/bin/botserver-stack/bin/tables/bin/psql -h localhost -U gbuser -d botserver -t -c "SELECT id, name FROM bots WHERE name = 'botname';" - -# Check if suggestions exist in cache with correct bot_id -incus exec system -- /opt/gbo/bin/botserver-stack/bin/cache/bin/valkey-cli --scan --pattern "suggestions::*" - -# If no keys found, check logs for wrong bot_id being used -incus exec system -- grep "Adding suggestion to Redis key" /opt/gbo/logs/error.log | tail -5 +# Add SSH public key for access +mkdir -p /root/.ssh +echo "" >> /root/.ssh/authorized_keys +chmod 700 /root/.ssh +chmod 600 /root/.ssh/authorized_keys ``` -**Fix:** This was a code bug where suggestions were stored with `user_id` instead of `bot_id`. After deploying the fix: -1. Wait for CI/CD to build and deploy new binary (~10 minutes) -2. Service auto-restarts on binary update -3. Test by opening a new session (old sessions may have stale keys) - -### Deployment & Testing Workflow +### From This Machine ```bash -# 1. Fix code in dev environment -# 2. Push to ALM (both submodules AND root) -cd botserver && git push alm main -cd .. && git add botserver && git commit -m "Update submodule" && git push alm main +# SSH key for passwordless access to production +ssh-copy-id root@82.29.59.188 -# 3. Wait ~4 minutes for CI/CD build -# Build time: ~3-4 minutes on CI runner - -# 4. Verify deployment -ssh root@pragmatismo.com.br "lxc exec pragmatismo-system -- stat /opt/gbo/bin/botserver | grep Modify" - -# 5. 
Test with Playwright -# Use Playwright MCP to open https://chat.pragmatismo.com.br/ -# Verify suggestions appear, TALK executes, no errors in console +# Test connection +ssh root@82.29.59.188 "lxc list" ``` -**Testing with Playwright:** -```bash -# Open bot in browser via Playwright MCP -Navigate to: https://chat.pragmatismo.com.br/ +--- -# Verify: -# - start.bas executes quickly (< 5 seconds) -# - Suggestions appear in UI -# - No errors in browser console +## LXC Container Data Sync (rsync) + +### Prerequisites +- Source: SSH access as `root` to source host +- Network: Same LAN or VPN connection between hosts + +### Copy Individual Container Data + +```bash +# Copy specific container data (e.g., dns container) +sudo rsync -avz --progress -e ssh \ + root@:/opt/gbo/tenants/pragmatismo/dns/ \ + /opt/gbo/dns/ + +# Copy tables (PostgreSQL data) +sudo rsync -avz --progress -e ssh \ + root@:/opt/gbo/tenants/pragmatismo/tables/ \ + /opt/gbo/tables/ + +# Copy drive (MinIO/S3 data) +sudo rsync -avz --progress -e ssh \ + root@:/opt/gbo/tenants/pragmatismo/drive/ \ + /opt/gbo/drive/ ``` -**On destination (Incus):** -```bash -# Verify botserver binary -incus exec system -- stat /opt/gbo/bin/botserver | grep Modify +### Dry Run Before Copy -# Restart services -incus exec system -- systemctl restart system.service ui.service +```bash +# Preview what will be copied (no changes made) +sudo rsync -avzn --progress -e ssh \ + root@:/opt/gbo/tenants/pragmatismo/dns/ \ + /opt/gbo/dns/ +``` + +### Exclude Patterns + +```bash +# Exclude logs, temp files, and system directories +sudo rsync -avz --progress -e ssh \ + --exclude='*.log' \ + --exclude='*.tmp' \ + --exclude='.git/' \ + --exclude='__pycache__/' \ + root@:/opt/gbo/tenants/pragmatismo/ \ + /opt/gbo/ +``` + +--- + +## DNS Management (pragmatismo-dns) + +### Container Info +- **Container Name:** pragmatismo-dns +- **Service:** CoreDNS +- **Access:** Inside LXC container on port 53 +- **Config:** `/opt/gbo/dns/conf/Corefile` inside 
container

### Common Operations

```bash
# Check container status
lxc list | grep pragmatismo-dns

# View logs
lxc info pragmatismo-dns --show-log

# Restart DNS
lxc restart pragmatismo-dns

# Exec into container
lxc exec pragmatismo-dns -- /bin/sh

# Test DNS
dig @localhost pragmatismo.com.br SOA +short
dig @localhost ddsites.com.br SOA +short
dig @localhost chat.pragmatismo.com.br A +short
```

### Update DNS Records

```bash
# Edit Corefile inside container
lxc exec pragmatismo-dns -- nano /opt/gbo/dns/conf/Corefile

# Edit zone files inside container
lxc exec pragmatismo-dns -- nano /opt/gbo/dns/data/pragmatismo.com.br.zone

# Restart CoreDNS after changes
lxc exec pragmatismo-dns -- systemctl restart coredns
# or restart the whole container:
lxc restart pragmatismo-dns
```

---

## DNS Cutover

Update NS/A records at your registrar:
- **NS records:** Point to `ns1.pragmatismo.com.br` / `ns2.pragmatismo.com.br`
- **A records:** Update to `82.29.59.188`

---

## Proxy/Caddy Management (pragmatismo-proxy)

### Container Info
- **Container Name:** pragmatismo-proxy
- **Service:** Caddy reverse proxy
- **Access:** Inside LXC container on ports 80/443
- **Config:** `/opt/gbo/proxy/conf/config` inside container

### Common Operations

```bash
# Check container status
lxc list | grep pragmatismo-proxy

# View logs
lxc info pragmatismo-proxy --show-log

# Restart proxy
lxc restart pragmatismo-proxy

# Exec into container
lxc exec pragmatismo-proxy -- bash
```

### Update Caddy Configuration

```bash
# Edit Caddyfile inside container
lxc exec pragmatismo-proxy -- nano /opt/gbo/proxy/conf/config

# Validate configuration
lxc exec pragmatismo-proxy -- caddy validate --config /opt/gbo/proxy/conf/config --adapter caddyfile

# Reload Caddy after changes
lxc exec pragmatismo-proxy -- caddy reload --config /opt/gbo/proxy/conf/config --adapter 
caddyfile + +# Or restart the entire container +lxc restart pragmatismo-proxy ``` --- @@ -480,14 +268,14 @@ incus exec system -- systemctl restart system.service ui.service ## ⚠️ Caddy Config — CRITICAL RULES **NEVER replace the Caddyfile with a minimal/partial config.** -The full config has ~25 vhosts. If you only see 1-2 vhosts, you are looking at a broken/partial config. +The full config has ~25 vhosts. **Before ANY change:** -1. Backup: `cp /opt/gbo/conf/config /opt/gbo/conf/config.bak-$(date +%Y%m%d%H%M)` -2. Validate: `caddy validate --config /opt/gbo/conf/config --adapter caddyfile` -3. Reload (not restart): `caddy reload --config /opt/gbo/conf/config --adapter caddyfile` +1. Backup: `lxc exec pragmatismo-proxy -- cp /opt/gbo/proxy/conf/config /opt/gbo/proxy/conf/config.bak-$(date +%Y%m%d%H%M)` +2. Validate: `lxc exec pragmatismo-proxy -- caddy validate --config /opt/gbo/proxy/conf/config --adapter caddyfile` +3. Reload: `lxc exec pragmatismo-proxy -- caddy reload --config /opt/gbo/proxy/conf/config --adapter caddyfile` -**Caddy storage must be explicitly set** in the global block, otherwise Caddy uses `~/.local/share/caddy` and loses existing certificates on restart: +**Caddy storage must be explicitly set:** ``` { storage file_system { @@ -496,179 +284,52 @@ The full config has ~25 vhosts. If you only see 1-2 vhosts, you are looking at a } ``` -**Dead domains cause ERR_SSL_PROTOCOL_ERROR** — if a domain in the Caddyfile has no DNS record, Caddy loops trying to get a certificate and pollutes TLS state. Remove dead domains immediately. - -**After removing domains from config**, restart Caddy (not just reload) to clear in-memory ACME state from old domains. 
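The Caddyfile is a flat list of ~25 vhost blocks. For reference, a single vhost typically looks like the sketch below — the hostname and upstream address here are illustrative placeholders, not values taken from the real config:

```
chat.pragmatismo.com.br {
    # Forward to the backend container (placeholder IP:port)
    reverse_proxy 10.0.0.10:5858
}
```

Any edit to a vhost must still go through the backup → validate → reload sequence above.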
+--- --- -## botserver / botui +## Backup Strategy -- botserver: `/opt/gbo/bin/botserver` (system.service, port 5858) -- botui: `/opt/gbo/bin/botui` (ui.service, port 5859) -- `BOTSERVER_URL` in `ui.service` must point to **`http://localhost:5858`** (not HTTPS external URL) — using external URL causes WebSocket disconnect before TALK executes -- Valkey/Redis bound to `127.0.0.1:6379` — iptables rules must allow loopback on this port or suggestions/cache won't work -- Vault unseal keys stored in `/opt/gbo/vault-unseal-keys` (production only - never commit to git) +**Location:** Host machine `/opt/gbo/backups/` -### Caddy in Proxy Container -- Binary: `/usr/bin/caddy` (system container) or `caddy` in PATH -- Config: `/opt/gbo/conf/config` -- Reload: `incus exec proxy -- caddy reload --config /opt/gbo/conf/config --adapter caddyfile` -- Storage: `/opt/gbo/data/caddy` +**Format:** `container-name-YYYYMMDD-HHMM.tar.gz` -**Upstream IPs (after migration):** -| Backend | IP | -|---------|-----| -| system (botserver) | 10.107.115.229:5858 | -| system (botui) | 10.107.115.229:5859 | -| tables (PostgreSQL) | 10.107.115.33:5432 | -| drive (MinIO S3) | 10.107.115.114:9000 | -| webmail | 10.107.115.208 | -| alm | 10.107.115.4 | -| table-editor | 10.107.115.x (assign IP first) | +**Schedule:** Twice daily (6am and 6pm) via cron -### Log Locations +### Backup Individual Container -**botserver/botui logs:** ```bash -# Main application logs (in pragmatismo-system container) -/opt/gbo/logs/error.log # botserver logs -/opt/gbo/logs/botui-error.log # botui logs -/opt/gbo/logs/output.log # stdout/stderr output +# Backup container to tar.gz +lxc snapshot pragmatismo-tables backup-$(date +%Y%m%d-%H%M) +lxc publish pragmatismo-tables/backup-$(date +%Y%m%d-%H%M) --alias tables-backup-$(date +%Y%m%d-%H%M) +lxc image export tables-backup-$(date +%Y%m%d-%H%M) /opt/gbo/backups/tables-$(date +%Y%m%d-%H%M).tar.gz + +# List backups +ls -la /opt/gbo/backups/ + +# Restore from backup +lxc image 
import /opt/gbo/backups/tables-20260326-1227.tar.gz --alias tables-restore +lxc init tables-restore pragmatismo-tables-restored +lxc start pragmatismo-tables-restored ``` -**Component logs (in `/opt/gbo/bin/botserver-stack/logs/`):** +### Backup PostgreSQL Database + ```bash -cache/ # Valkey/Redis logs -directory/ # Zitadel logs -drive/ # MinIO S3 logs -llm/ # LLM (llama.cpp) logs -tables/ # PostgreSQL logs -vault/ # Vault secrets logs -vector_db/ # Qdrant vector DB logs +# Create database dump +lxc exec pragmatismo-tables -- pg_dump -U postgres -F c -f /tmp/backup.dump pragmatismo + +# Copy backup to host +lxc file pull pragmatismo-tables/tmp/backup.dump /opt/gbo/backups/postgresql-$(date +%Y%m%d).dump ``` -**Checking component logs:** -```bash -# Valkey -incus exec system -- tail -f /opt/gbo/bin/botserver-stack/logs/cache/valkey.log - -# PostgreSQL -incus exec system -- tail -f /opt/gbo/bin/botserver-stack/logs/tables/postgres.log - -# Qdrant -incus exec system -- tail -f /opt/gbo/bin/botserver-stack/logs/vector_db/qdrant.log -``` - -### iptables loopback rule (required) -Internal services (Valkey, MinIO) are protected by DROP rules. 
Loopback must be explicitly allowed **before** the DROP rules: -```bash -iptables -I INPUT -i lo -j ACCEPT -iptables -A INPUT -p tcp --dport 6379 -j DROP # external only -``` - ---- - -## CoreDNS Hardening - -Corefile must include `acl` plugin to prevent DNS amplification attacks: -``` -zone.example.com:53 { - file /opt/gbo/data/zone.example.com.zone - acl { - allow type ANY net 10.0.0.0/8 127.0.0.0/8 - allow type A net 0.0.0.0/0 - allow type AAAA net 0.0.0.0/0 - allow type MX net 0.0.0.0/0 - block - } - cache - errors -} -``` -Reload with SIGHUP: `pkill -HUP coredns` - ---- - -## fail2ban in Proxy Container - -Proxy container needs its own fail2ban for HTTP flood protection: -- Filter: match 4xx errors from Caddy JSON access log -- Jail: `caddy-http-flood` — 100 errors/60s → ban 1h -- Disable default `sshd` jail (no SSH in proxy container) via `jail.d/defaults-debian.conf` - ---- - -## CI/CD (Forgejo Runner) - -- **ALWAYS use CI for deployment** — NEVER manually scp binaries. CI ensures consistent, auditable deployments. -- Runner container must have **no cross-container disk mounts** -- Deploy via SSH: `scp binary :/opt/gbo/bin/botserver` (only from CI, not manually) -- SSH key from runner → system container must be pre-authorized -- sccache + cargo registry cache accumulates — daily cleanup cron required -- ZFS snapshots of CI container can be huge if taken while cross-mounts were active — delete stale snapshots after removing mounts - -### Forgejo Workflow Location -Each submodule has its own workflow at `.forgejo/workflows/.yaml`. - -**botserver workflow:** `botserver/.forgejo/workflows/botserver.yaml` - -### CI Deployment Flow -1. Push code to ALM → triggers CI workflow automatically -2. CI builds binary on `pragmatismo-alm-ci` runner -3. CI deploys to `pragmatismo-system` container via SSH -4. CI verifies botserver process is running after deploy -5. 
If CI fails → check logs at `/tmp/deploy-*.log` on CI runner - -**To trigger CI manually:** -```bash -# Push to ALM -cd botserver && git push alm main - -# Or via API -curl -X POST "http://alm.pragmatismo.com.br/api/v1/repos/GeneralBots/BotServer/actions/workflows/botserver.yaml/runs" -``` - -### SSH Hostname Setup (CI Runner) -The CI runner must resolve `system` hostname. Add to `/etc/hosts` **once** (manual step on host): -```bash -incus exec alm-ci -- bash -c 'echo "10.16.164.33 system" >> /etc/hosts' -``` - -### Deploy Step — CRITICAL -The deploy step must **kill the running botserver process before `scp`**, otherwise `scp` fails with `dest open: Failure` (binary is locked by running process): - -```yaml -- name: Deploy via SSH - run: | - ssh pragmatismo-system "pkill -f /opt/gbo/bin/botserver || true; sleep 2" - scp target/debug/botserver pragmatismo-system:/opt/gbo/bin/botserver - ssh pragmatismo-system "chmod +x /opt/gbo/bin/botserver && cd /opt/gbo/bin && nohup sudo -u gbuser ./botserver --noconsole >> /opt/gbo/logs/error.log 2>&1 &" -``` - -**Never use `systemctl stop system.service`** — botserver is not managed by systemd, it runs as a process under `gbuser`. - -### Binary Ownership -The binary at `/opt/gbo/bin/botserver` must be owned by `gbuser`, not `root`: -```bash -incus exec system -- chown gbuser:gbuser /opt/gbo/bin/botserver -``` -If owned by root, `scp` as `gbuser` will fail even after killing the process. 
- ---- - -## ZFS Disk Space - -- Check snapshots: `zfs list -t snapshot -o name,used | sort -k2 -rh` -- Snapshots retain data from device mounts at time of snapshot — removing mounts doesn't free space until snapshot is deleted -- Delete snapshot: `zfs destroy /containers/@` -- Daily rolling snapshots (7-day retention) via cron +**Retention:** Last 7 days --- ## Git Workflow -Push to both remotes after every change: +Push to both remotes: ```bash cd git push origin main @@ -678,96 +339,69 @@ git add git commit -m "Update submodule" git push alm main ``` -Failure to push the root `gb` repo will not trigger CI/CD pipelines. --- -## Useful Commands +## Git Workflow +Push to both remotes: ```bash -# Check all containers (Incus) -incus list - -# Check disk device mounts per container -for c in $(incus list --format csv -c n); do - devices=$(incus config device show $c | grep 'type: disk' | grep -v 'pool:' | wc -l) - [ $devices -gt 0 ] && echo "=== $c ===" && incus config device show $c | grep -E 'source:|path:' | grep -v pool -done - -# Tail Caddy errors -incus exec proxy -- tail -f /opt/gbo/logs/access.log - -# Restart botserver + botui -incus exec system -- systemctl restart system.service ui.service - -# Check iptables in system container -incus exec system -- iptables -L -n | grep -E 'DROP|ACCEPT.*lo' - -# ZFS snapshot usage -zfs list -t snapshot -o name,used | sort -k2 -rh | head -20 - -# Unseal Vault (use actual unseal key from init.json) -incus exec system -- bash -c " - export VAULT_ADDR=https://127.0.0.1:8200 VAULT_SKIP_VERIFY=true - /opt/gbo/bin/botserver-stack/bin/vault/vault operator unseal \$UNSEAL_KEY -" - -# Check rsync transfer progress (on destination) -du -sh /home/administrator/gbo +cd +git push origin main +git push alm main +cd .. 
+git add 
+git commit -m "Update submodule"
+git push alm main
```

---

-## CI/CD Debugging
+## Troubleshooting Common Issues
+
+### Container Not Starting

-### Check CI Runner Container
```bash
-# From production host, SSH to CI runner
-ssh root@alm-ci
+# Check container status
+lxc list

-# Check CI workspace for cloned repos
-ls /root/workspace/
+# View logs for startup issues
+lxc info pragmatismo- --show-log

-# Test SSH to system container
-ssh -o ConnectTimeout=5 system 'hostname'
+# Check resource usage
+lxc info pragmatismo-
+
+# Try manual start
+lxc start pragmatismo-
```

### Network Issues

-### Query CI Runs via Forgejo API
```bash
-# List recent workflow runs for a repo
-curl -s "http://alm.pragmatismo.com.br/api/v1/repos/GeneralBots//actions/runs?limit=5"
+# Check container IP
+lxc list -c n,4

-# Trigger workflow manually (if token available)
-curl -X POST "http://alm.pragmatismo.com.br/api/v1/repos/GeneralBots//actions/workflows/.yaml/runs"
+# Test connectivity from host
+lxc exec pragmatismo-dns -- ping -c 3 8.8.8.8
+
+# Check DNS resolution
+lxc exec pragmatismo-system -- dig pragmatismo.com.br
```

-### Check Binary Deployed

### Service Issues Inside Container

```bash
-# From production host
-incus exec system -- stat /opt/gbo/bin/ | grep Modify
-incus exec system -- strings /opt/gbo/bin/ | grep ''
+# Check service status inside container
+lxc exec pragmatismo- -- systemctl status 

-### CI Build Logs Location
-```bash
-# On CI runner (alm-ci)
-# Logs saved via: sudo cp /tmp/build.log /opt/gbo/logs/
+# Restart service inside container
+lxc exec pragmatismo- -- systemctl restart 

-# Access from production host
-ssh root@alm-ci -- cat /opt/gbo/logs/*.log 2>/dev/null
+# View service logs
+lxc exec pragmatismo- -- journalctl -u -f
```

-### Common CI Issues

---

-**SSH Connection Refused:**
-- CI runner must have `system` in `/root/.ssh/config` with correct IP
-- Check: `ssh -o ConnectTimeout=5 system 
'hostname'`

-**Binary Not Updated After Deploy:**
-- Verify binary modification time matches CI run time
-- Check CI build source: Clone on CI runner and verify code
-- Ensure `embed-ui` feature includes the file (RustEmbed embeds at compile time)
-```bash
-# Rebuild with correct features
-cargo build --release -p botui --features embed-ui
-```