gb/AGENTS-PROD.md
Rodrigo Rodriguez (Pragmatismo) 5b259a2c5a docs: Remove sensitive info from AGENTS-PROD.md
- Remove example conversation with specific server details
- Remove specific commit hash
- Generalize Vault unseal command
- Add warning about Vault keys
2026-03-18 10:41:45 -03:00

# General Bots Cloud — Production Operations Guide
## Infrastructure Overview
- **Host OS:** Ubuntu 24.04 LTS, LXD (snap)
- **SSH:** Key auth only, sudoer user in `lxd` group
- **Container engine:** LXD with ZFS storage pool
## LXC Container Architecture
| Container | Purpose | Exposed Ports |
|---|---|---|
| `<tenant>-proxy` | Caddy reverse proxy | 80, 443 |
| `<tenant>-system` | botserver + botui (privileged!) | internal only |
| `<tenant>-alm` | Forgejo (ALM/Git) | internal only |
| `<tenant>-alm-ci` | Forgejo CI runner | none |
| `<tenant>-email` | Stalwart mail server | 25,465,587,993,995,143,110 |
| `<tenant>-dns` | CoreDNS | 53 |
| `<tenant>-drive` | MinIO S3 | internal only |
| `<tenant>-tables` | PostgreSQL | internal only |
| `<tenant>-table-editor` | NocoDB | internal only |
| `<tenant>-webmail` | Roundcube | internal only |
## Key Rules
- `<tenant>-system` must be **privileged** (`security.privileged: true`) — required for botserver to own `/opt/gbo/` mounts
- All containers use LXD **proxy devices** for port forwarding (network forwards don't work when external IP is on host NIC, not bridge)
- Never remove proxy devices for ports: 80, 443, 25, 465, 587, 993, 995, 143, 110, 4190, 53
- CI runner (`alm-ci`) must NOT have cross-container disk device mounts — deploy via SSH instead
## Firewall (host)
- **ufw** with `DEFAULT_FORWARD_POLICY=ACCEPT` (needed for container internet)
- LXD forward rule must persist via systemd service
- **fail2ban** on host (SSH jail) and in email container (mail jail)
---
## 🔧 Common Production Issues & Fixes
### Issue: Valkey/Redis Connection Timeout
**Symptom:** botserver logs show `Connection timed out (os error 110)` when connecting to cache at `localhost:6379`
**Root Cause:** An iptables DROP rule for port 6379 also blocks loopback traffic, because no ACCEPT rule for the `lo` interface precedes the DROP rules in the INPUT chain.
**Fix:**
```bash
# Insert loopback ACCEPT at top of INPUT chain
lxc exec <tenant>-system -- iptables -I INPUT 1 -i lo -j ACCEPT
# Persist the rule
lxc exec <tenant>-system -- bash -c 'iptables-save > /etc/iptables/rules.v4'
# Verify Valkey responds
lxc exec <tenant>-system -- /opt/gbo/bin/botserver-stack/bin/cache/bin/valkey-cli ping
# Should return: PONG
# Restart botserver to pick up working cache
lxc exec <tenant>-system -- systemctl restart system.service ui.service
```
**Prevention:** Keep the loopback ACCEPT rule at the top of the iptables INPUT chain, ahead of every DROP rule.
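
The ordering requirement can be checked mechanically. The following is a minimal sketch, not part of the deployed tooling: `loopback_before_drop` is a hypothetical helper that reads `iptables -S INPUT` output on stdin and reports whether a loopback ACCEPT rule comes before the first DROP rule. It treats any DROP rule as potentially affecting loopback, which is a simplification.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `iptables -S INPUT` output on stdin and prints
#   ok       a loopback ACCEPT rule appears before the first DROP rule
#   blocked  a DROP rule appears before any loopback ACCEPT rule
#   no-drop  the chain has neither (nothing to block loopback)
loopback_before_drop() {
  awk '
    /^-A INPUT/ && /-i lo/ && /-j ACCEPT/ { print "ok";      found = 1; exit }
    /^-A INPUT/ && /-j DROP/              { print "blocked"; found = 1; exit }
    END { if (!found) print "no-drop" }
  '
}

# Usage (run on the host):
#   lxc exec <tenant>-system -- iptables -S INPUT | loopback_before_drop
```

A result of `blocked` indicates the chain is in the state described under Root Cause above.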
### Issue: Suggestions Not Showing in Frontend
**Symptom:** Bot's start.bas has `ADD_SUGGESTION_TOOL` calls but suggestions don't appear in the UI.
**Diagnosis:**
```bash
# Get bot ID
lxc exec <tenant>-system -- /opt/gbo/bin/botserver-stack/bin/tables/bin/psql -h localhost -U gbuser -d botserver -t -c "SELECT id, name FROM bots WHERE name = 'botname';"
# Check if suggestions exist in cache with correct bot_id
lxc exec <tenant>-system -- /opt/gbo/bin/botserver-stack/bin/cache/bin/valkey-cli --scan --pattern "suggestions:<bot_id>:*"
# If no keys found, check logs for wrong bot_id being used
lxc exec <tenant>-system -- grep "Adding suggestion to Redis key" /opt/gbo/logs/error.log | tail -5
```
**Fix:** This was a code bug where suggestions were stored with `user_id` instead of `bot_id`. After deploying the fix:
1. Wait for CI/CD to build and deploy new binary (~10 minutes)
2. Service auto-restarts on binary update
3. Test by opening a new session (old sessions may have stale keys)
**Deployment Workflow:**
```bash
# 1. Fix code in dev environment
# 2. Commit and push to ALM
cd botserver && git push alm main
# 3. Update root gb repository
cd .. && git add botserver && git commit -m "Update submodule" && git push alm main
# 4. Wait 10 minutes for CI/CD pipeline
# 5. Verify deployment
lxc exec <tenant>-system -- ls -lh /opt/gbo/bin/botserver
lxc exec <tenant>-system -- systemctl status system.service
# 6. Test the fix
# Open new session at https://chat.<domain>/<botname>
# Suggestions should now appear
```
---
## ⚠️ Caddy Config — CRITICAL RULES
**NEVER replace the Caddyfile with a minimal/partial config.**
The full config has ~25 vhosts. If you only see 1-2 vhosts, you are looking at a broken/partial config.
**Before ANY change:**
1. Backup: `cp /opt/gbo/conf/config /opt/gbo/conf/config.bak-$(date +%Y%m%d%H%M)`
2. Validate: `caddy validate --config /opt/gbo/conf/config --adapter caddyfile`
3. Reload (not restart): `caddy reload --config /opt/gbo/conf/config --adapter caddyfile`
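
The three steps above can be chained so that a reload never happens without a backup and a successful validation. This is a sketch under assumptions: `safe_caddy_reload` is a hypothetical wrapper, it is meant to run inside the proxy container where `caddy` is on the `PATH`, and the return codes are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: back up, validate, then reload. Stops at the first failure.
# The default config path is the one used in this guide.
safe_caddy_reload() {
  local config="${1:-/opt/gbo/conf/config}"
  local backup
  backup="${config}.bak-$(date +%Y%m%d%H%M)"

  # 1. Backup (preserve mode and timestamps)
  cp -p "$config" "$backup" || return 1

  # 2. Validate; do not reload a config that fails validation
  if ! caddy validate --config "$config" --adapter caddyfile; then
    echo "validation failed; backup kept at $backup" >&2
    return 2
  fi

  # 3. Reload (not restart)
  caddy reload --config "$config" --adapter caddyfile
}

# Usage (inside <tenant>-proxy):
#   safe_caddy_reload
```
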
**Caddy storage must be explicitly set** in the global block, otherwise Caddy uses `~/.local/share/caddy` and loses existing certificates on restart:
```
{
    storage file_system {
        root /opt/gbo/data/caddy
    }
}
**Dead domains cause ERR_SSL_PROTOCOL_ERROR** — if a domain in the Caddyfile has no DNS record, Caddy loops trying to get a certificate and pollutes TLS state. Remove dead domains immediately.
**After removing domains from config**, restart Caddy (not just reload) to clear in-memory ACME state from old domains.
---
## botserver / botui
- botserver: `system.service` on port 5858
- botui: `ui.service` on port 5859
- `BOTSERVER_URL` in `ui.service` must point to **`http://localhost:5858`** (not HTTPS external URL) — using external URL causes WebSocket disconnect before TALK executes
- Valkey/Redis bound to `127.0.0.1:6379` — iptables rules must allow loopback on this port or suggestions/cache won't work
- Vault unseal keys stored in `/opt/gbo/bin/botserver-stack/conf/vault/init.json` (production only - never commit to git)
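
A quick way to confirm the `BOTSERVER_URL` rule is to inspect the unit text. The sketch below assumes the variable is set in the unit itself (for example via an `Environment=` line); if it comes from an environment file, feed that file in instead. `botserver_url_status` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads unit file text on stdin and reports whether
# BOTSERVER_URL points at the local botserver port required by this guide.
botserver_url_status() {
  local url
  # Take the first BOTSERVER_URL=... assignment; stop at a quote or space
  url=$(grep -oE 'BOTSERVER_URL=[^" ]+' | head -n 1 | cut -d= -f2-)
  case "$url" in
    "")                                            echo "missing" ;;
    http://localhost:5858|http://localhost:5858/)  echo "ok" ;;
    *)                                             echo "wrong: $url" ;;
  esac
}

# Usage (run on the host):
#   lxc exec <tenant>-system -- systemctl cat ui.service | botserver_url_status
```
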
### iptables loopback rule (required)
Internal services (Valkey, MinIO) are protected by DROP rules. Loopback must be explicitly allowed **before** the DROP rules:
```bash
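# Insert at position 1 so loopback is accepted before any DROP rule
iptables -I INPUT 1 -i lo -j ACCEPT
# Evaluated after the ACCEPT above, so only non-loopback traffic is dropped
iptables -A INPUT -p tcp --dport 6379 -j DROP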
```
---
## CoreDNS Hardening
Corefile must include `acl` plugin to prevent DNS amplification attacks:
```
zone.example.com:53 {
    file /opt/gbo/data/zone.example.com.zone
    acl {
        allow type ANY net 10.0.0.0/8 127.0.0.0/8
        allow type A net 0.0.0.0/0
        allow type AAAA net 0.0.0.0/0
        allow type MX net 0.0.0.0/0
        block
    }
    cache
    errors
}
```
Reload with SIGHUP: `pkill -HUP coredns`
---
## fail2ban in Proxy Container
Proxy container needs its own fail2ban for HTTP flood protection:
- Filter: match 4xx errors from Caddy JSON access log
- Jail: `caddy-http-flood` — 100 errors/60s → ban 1h
- Disable default `sshd` jail (no SSH in proxy container) via `jail.d/defaults-debian.conf`
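
Before enabling the jail, it can help to see which clients would cross the threshold. The sketch below is a diagnostic aid only, not the fail2ban filter itself: `count_4xx_by_ip` is a hypothetical helper, it assumes each log line is a single JSON object with a `remote_ip` field inside `request` and a top-level `status` field, and it counts over whatever lines it is given rather than a 60-second window.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads Caddy JSON access log lines on stdin and prints
# "<ip> <count>" for every client with at least <threshold> 4xx responses.
count_4xx_by_ip() {
  local threshold="${1:-100}"
  # Keep only lines with a 4xx status and reduce each to its remote_ip
  sed -nE 's/.*"remote_ip":"([^"]+)".*"status":4[0-9][0-9][,}].*/\1/p' \
    | sort | uniq -c \
    | awk -v t="$threshold" '$1 >= t { print $2, $1 }'
}

# Usage (run on the host):
#   lxc exec <tenant>-proxy -- tail -n 5000 /opt/gbo/logs/access.log | count_4xx_by_ip 100
```
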
---
## CI/CD (Forgejo Runner)
- Runner container must have **no cross-container disk mounts**
- Deploy via SSH: `scp binary <system-container>:/opt/gbo/bin/botserver`
- SSH key from runner → system container must be pre-authorized
- sccache + cargo registry cache accumulates — daily cleanup cron required
- ZFS snapshots of CI container can be huge if taken while cross-mounts were active — delete stale snapshots after removing mounts
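
The daily cleanup could be as simple as deleting cache files that have not been modified for several days. This is a minimal sketch: `prune_old_files` is a hypothetical helper, the 7-day cutoff is an assumption, and the cache directory in the usage comment is an assumed location that should be replaced with wherever the runner actually stores its sccache and cargo registry data.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete regular files under <dir> whose modification
# time is more than <days> days old. A missing directory is not an error.
prune_old_files() {
  local dir="$1"
  local days="${2:-7}"
  [ -d "$dir" ] || return 0
  # -print lists each file as it is removed, which is useful in cron mail/logs
  find "$dir" -type f -mtime +"$days" -print -delete
}

# Usage (inside the CI runner container, e.g. from a daily cron job;
# the path below is an assumption):
#   prune_old_files "$HOME/.cache/sccache" 7
```
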
---
## ZFS Disk Space
- Check snapshots: `zfs list -t snapshot -o name,used | sort -k2 -rh`
- Snapshots retain data from device mounts at time of snapshot — removing mounts doesn't free space until snapshot is deleted
- Delete snapshot: `zfs destroy <pool>/containers/<name>@<snapshot>`
- Daily rolling snapshots (7-day retention) via cron
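
The 7-day retention can be expressed as "keep the newest seven snapshots, destroy the rest". The sketch below only selects candidates and leaves the destroy step to the operator. `snapshots_to_prune` is a hypothetical helper; it assumes the input lists only the rolling snapshots of one dataset, oldest first.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads snapshot names on stdin (oldest first) and prints
# every name except the newest <keep> entries.
snapshots_to_prune() {
  local keep="${1:-7}"
  awk -v keep="$keep" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }
  '
}

# Usage (run on the host; review the list before destroying anything):
#   zfs list -H -t snapshot -o name -s creation -r <pool>/containers/<name> | snapshots_to_prune 7
#   # then, for each printed name:
#   #   zfs destroy <pool>/containers/<name>@<snapshot>
```
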
---
## Git Workflow
Push to both remotes after every change:
```bash
cd <submodule>
git push origin main
git push alm main
cd ..
git add <submodule>
git commit -m "Update submodule"
git push alm main
```
CI/CD pipelines are triggered only by a push to the root `gb` repo. If you push the submodule but not the root repo, nothing is built or deployed.
---
## Useful Commands
```bash
# Check all containers
lxc list
# Check disk device mounts per container
for c in $(lxc list --format csv -c n); do
  # Count disk devices defined directly on the container
  devices=$(lxc config device show "$c" | grep -c 'type: disk')
  [ "$devices" -gt 0 ] && echo "=== $c ===" && lxc config device show "$c" | grep -E 'source:|path:' | grep -v pool
done
# Tail Caddy access log
lxc exec <tenant>-proxy -- tail -f /opt/gbo/logs/access.log
# Restart botserver + botui
lxc exec <tenant>-system -- systemctl restart system.service ui.service
# Check iptables in system container
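# -v is required: without it the interface column is omitted and the lo rule cannot be matched
lxc exec <tenant>-system -- iptables -L INPUT -n -v | grep -E 'DROP|ACCEPT.*lo'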
# ZFS snapshot usage
zfs list -t snapshot -o name,used | sort -k2 -rh | head -20
# Unseal Vault (use actual unseal key from init.json)
lxc exec <tenant>-system -- bash -c "
export VAULT_ADDR=https://127.0.0.1:8200 VAULT_SKIP_VERIFY=true
/opt/gbo/bin/botserver-stack/bin/vault/vault operator unseal \$UNSEAL_KEY
"
```