# General Bots Cloud — Production Operations Guide ## Infrastructure Overview - **Host OS:** Ubuntu 24.04 LTS, LXD (snap) - **SSH:** Key auth only, sudoer user in `lxd` group - **Container engine:** LXD with ZFS storage pool ## LXC Container Architecture | Container | Purpose | Exposed Ports | |---|---|---| | `-proxy` | Caddy reverse proxy | 80, 443 | | `-system` | botserver + botui (privileged!) | internal only | | `-alm` | Forgejo (ALM/Git) | internal only | | `-alm-ci` | Forgejo CI runner | none | | `-email` | Stalwart mail server | 25,465,587,993,995,143,110 | | `-dns` | CoreDNS | 53 | | `-drive` | MinIO S3 | internal only | | `-tables` | PostgreSQL | internal only | | `-table-editor` | NocoDB | internal only | | `-webmail` | Roundcube | internal only | ## Key Rules - `-system` must be **privileged** (`security.privileged: true`) — required for botserver to own `/opt/gbo/` mounts - All containers use LXD **proxy devices** for port forwarding (network forwards don't work when external IP is on host NIC, not bridge) - Never remove proxy devices for ports: 80, 443, 25, 465, 587, 993, 995, 143, 110, 4190, 53 - CI runner (`alm-ci`) must NOT have cross-container disk device mounts — deploy via SSH instead ## Firewall (host) - **ufw** with `DEFAULT_FORWARD_POLICY=ACCEPT` (needed for container internet) - LXD forward rule must persist via systemd service - **fail2ban** on host (SSH jail) and in email container (mail jail) --- ## 🔧 Common Production Issues & Fixes ### Issue: Valkey/Redis Connection Timeout **Symptom:** botserver logs show `Connection timed out (os error 110)` when connecting to cache at `localhost:6379` **Root Cause:** iptables DROP rule for port 6379 blocks loopback traffic because no ACCEPT rule for `lo` interface exists before the DROP rules. **Fix:** ```bash # Insert loopback ACCEPT at top of INPUT chain lxc exec -system -- iptables -I INPUT 1 -i lo -j ACCEPT # Persist the rule lxc exec -system -- bash -c 'iptables-save > /etc/iptables/rules.v4' # Verify Valkey responds lxc exec -system -- /opt/gbo/bin/botserver-stack/bin/cache/bin/valkey-cli ping # Should return: PONG # Restart botserver to pick up working cache lxc exec -system -- systemctl restart system.service ui.service ``` **Prevention:** Always ensure loopback ACCEPT rule is at the top of iptables INPUT chain before any DROP rules. ### Issue: Suggestions Not Showing in Frontend **Symptom:** Bot's start.bas has `ADD_SUGGESTION_TOOL` calls but suggestions don't appear in the UI. **Diagnosis:** ```bash # Get bot ID lxc exec -system -- /opt/gbo/bin/botserver-stack/bin/tables/bin/psql -h localhost -U gbuser -d botserver -t -c "SELECT id, name FROM bots WHERE name = 'botname';" # Check if suggestions exist in cache with correct bot_id lxc exec -system -- /opt/gbo/bin/botserver-stack/bin/cache/bin/valkey-cli --scan --pattern "suggestions::*" # If no keys found, check logs for wrong bot_id being used lxc exec -system -- grep "Adding suggestion to Redis key" /opt/gbo/logs/error.log | tail -5 ``` **Fix:** This was a code bug where suggestions were stored with `user_id` instead of `bot_id`. After deploying the fix: 1. Wait for CI/CD to build and deploy new binary (~10 minutes) 2. Service auto-restarts on binary update 3. Test by opening a new session (old sessions may have stale keys) **Deployment Workflow:** ```bash # 1. Fix code in dev environment # 2. Commit and push to ALM cd botserver && git push alm main # 3. Update root gb repository cd .. && git add botserver && git commit -m "Update submodule" && git push alm main # 4. Wait 10 minutes for CI/CD pipeline # 5. Verify deployment lxc exec -system -- ls -lh /opt/gbo/bin/botserver lxc exec -system -- systemctl status system.service # 6. Test the fix # Open new session at https://chat./ # Suggestions should now appear ``` --- ## ⚠️ Caddy Config — CRITICAL RULES **NEVER replace the Caddyfile with a minimal/partial config.** The full config has ~25 vhosts. If you only see 1-2 vhosts, you are looking at a broken/partial config. **Before ANY change:** 1. Backup: `cp /opt/gbo/conf/config /opt/gbo/conf/config.bak-$(date +%Y%m%d%H%M)` 2. Validate: `caddy validate --config /opt/gbo/conf/config --adapter caddyfile` 3. Reload (not restart): `caddy reload --config /opt/gbo/conf/config --adapter caddyfile` **Caddy storage must be explicitly set** in the global block, otherwise Caddy uses `~/.local/share/caddy` and loses existing certificates on restart: ``` { storage file_system { root /opt/gbo/data/caddy } } ``` **Dead domains cause ERR_SSL_PROTOCOL_ERROR** — if a domain in the Caddyfile has no DNS record, Caddy loops trying to get a certificate and pollutes TLS state. Remove dead domains immediately. **After removing domains from config**, restart Caddy (not just reload) to clear in-memory ACME state from old domains. --- ## botserver / botui - botserver: `system.service` on port 5858 - botui: `ui.service` on port 5859 - `BOTSERVER_URL` in `ui.service` must point to **`http://localhost:5858`** (not HTTPS external URL) — using external URL causes WebSocket disconnect before TALK executes - Valkey/Redis bound to `127.0.0.1:6379` — iptables rules must allow loopback on this port or suggestions/cache won't work - Vault unseal keys stored in `/opt/gbo/bin/botserver-stack/conf/vault/init.json` (production only - never commit to git) ### iptables loopback rule (required) Internal services (Valkey, MinIO) are protected by DROP rules. Loopback must be explicitly allowed **before** the DROP rules: ```bash iptables -I INPUT -i lo -j ACCEPT iptables -A INPUT -p tcp --dport 6379 -j DROP # external only ``` --- ## CoreDNS Hardening Corefile must include `acl` plugin to prevent DNS amplification attacks: ``` zone.example.com:53 { file /opt/gbo/data/zone.example.com.zone acl { allow type ANY net 10.0.0.0/8 127.0.0.0/8 allow type A net 0.0.0.0/0 allow type AAAA net 0.0.0.0/0 allow type MX net 0.0.0.0/0 block } cache errors } ``` Reload with SIGHUP: `pkill -HUP coredns` --- ## fail2ban in Proxy Container Proxy container needs its own fail2ban for HTTP flood protection: - Filter: match 4xx errors from Caddy JSON access log - Jail: `caddy-http-flood` — 100 errors/60s → ban 1h - Disable default `sshd` jail (no SSH in proxy container) via `jail.d/defaults-debian.conf` --- ## CI/CD (Forgejo Runner) - Runner container must have **no cross-container disk mounts** - Deploy via SSH: `scp binary :/opt/gbo/bin/botserver` - SSH key from runner → system container must be pre-authorized - sccache + cargo registry cache accumulates — daily cleanup cron required - ZFS snapshots of CI container can be huge if taken while cross-mounts were active — delete stale snapshots after removing mounts ### Forgejo Workflow Location Each submodule has its own workflow at `.forgejo/workflows/.yaml`. **botserver workflow:** `botserver/.forgejo/workflows/botserver.yaml` ### Deploy Step — CRITICAL The deploy step must **kill the running botserver process before `scp`**, otherwise `scp` fails with `dest open: Failure` (binary is locked by running process): ```yaml - name: Deploy via SSH run: | ssh pragmatismo-system "pkill -f /opt/gbo/bin/botserver || true; sleep 2" scp target/debug/botserver pragmatismo-system:/opt/gbo/bin/botserver ssh pragmatismo-system "chmod +x /opt/gbo/bin/botserver && cd /opt/gbo/bin && nohup sudo -u gbuser ./botserver --noconsole >> /opt/gbo/logs/error.log 2>&1 &" ``` **Never use `systemctl stop system.service`** — botserver is not managed by systemd, it runs as a process under `gbuser`. ### Binary Ownership The binary at `/opt/gbo/bin/botserver` must be owned by `gbuser`, not `root`: ```bash lxc exec pragmatismo-system -- chown gbuser:gbuser /opt/gbo/bin/botserver ``` If owned by root, `scp` as `gbuser` will fail even after killing the process. --- ## ZFS Disk Space - Check snapshots: `zfs list -t snapshot -o name,used | sort -k2 -rh` - Snapshots retain data from device mounts at time of snapshot — removing mounts doesn't free space until snapshot is deleted - Delete snapshot: `zfs destroy /containers/@` - Daily rolling snapshots (7-day retention) via cron --- ## Git Workflow Push to both remotes after every change: ```bash cd git push origin main git push alm main cd .. git add git commit -m "Update submodule" git push alm main ``` Failure to push the root `gb` repo will not trigger CI/CD pipelines. --- ## Useful Commands ```bash # Check all containers lxc list # Check disk device mounts per container for c in $(lxc list --format csv -c n); do devices=$(lxc config device show $c | grep 'type: disk' | grep -v 'pool:' | wc -l) [ $devices -gt 0 ] && echo "=== $c ===" && lxc config device show $c | grep -E 'source:|path:' | grep -v pool done # Tail Caddy errors lxc exec -proxy -- tail -f /opt/gbo/logs/access.log # Restart botserver + botui lxc exec -system -- systemctl restart system.service ui.service # Check iptables in system container lxc exec -system -- iptables -L -n | grep -E 'DROP|ACCEPT.*lo' # ZFS snapshot usage zfs list -t snapshot -o name,used | sort -k2 -rh | head -20 # Unseal Vault (use actual unseal key from init.json) lxc exec -system -- bash -c " export VAULT_ADDR=https://127.0.0.1:8200 VAULT_SKIP_VERIFY=true /opt/gbo/bin/botserver-stack/bin/vault/vault operator unseal \$UNSEAL_KEY " ``` --- ## CI/CD Debugging ### Check CI Runner Container ```bash # From production host, SSH to CI runner ssh root@-alm-ci # Check CI workspace for cloned repos ls /root/workspace/ # Test SSH to system container ssh -o ConnectTimeout=5 pragmatismo-system 'hostname' ``` ### Query CI Runs via Forgejo API ```bash # List recent workflow runs for a repo curl -s "http://alm.pragmatismo.com.br/api/v1/repos/GeneralBots//actions/runs?limit=5" # Trigger workflow manually (if token available) curl -X POST "http://alm.pragmatismo.com.br/api/v1/repos/GeneralBots//actions/workflows/.yaml/runs" ``` ### Check Binary Deployed ```bash # From production host lxc exec -system -- stat /opt/gbo/bin/ | grep Modify lxc exec -system -- strings /opt/gbo/bin/ | grep '' ``` ### CI Build Logs Location ```bash # On CI runner (pragmatismo-alm-ci) # Logs saved via: sudo cp /tmp/build.log /opt/gbo/logs/ # Access from production host ssh root@-alm-ci -- cat /opt/gbo/logs/*.log 2>/dev/null ``` ### Common CI Issues **SSH Connection Refused:** - CI runner must have `pragmatismo-system` in `/root/.ssh/config` with IP `10.16.164.33` - Check: `ssh -o ConnectTimeout=5 pragmatismo-system 'hostname'` **Binary Not Updated After Deploy:** - Verify binary modification time matches CI run time - Check CI build source: Clone on CI runner and verify code - Ensure `embed-ui` feature includes the file (RustEmbed embeds at compile time) ```bash # Rebuild with correct features cargo build --release -p botui --features embed-ui ```