# General Bots Cloud — Production Operations Guide
## Infrastructure Overview

- Host OS: Ubuntu 24.04 LTS, LXD (snap)
- SSH: key auth only; sudoer user in the `lxd` group
- Container engine: LXD with ZFS storage pool
## LXC Container Architecture

| Container | Purpose | Exposed Ports |
|---|---|---|
| `<tenant>-proxy` | Caddy reverse proxy | 80, 443 |
| `<tenant>-system` | botserver + botui (privileged!) | internal only |
| `<tenant>-alm` | Forgejo (ALM/Git) | internal only |
| `<tenant>-alm-ci` | Forgejo CI runner | none |
| `<tenant>-email` | Stalwart mail server | 25, 465, 587, 993, 995, 143, 110 |
| `<tenant>-dns` | CoreDNS | 53 |
| `<tenant>-drive` | MinIO S3 | internal only |
| `<tenant>-tables` | PostgreSQL | internal only |
| `<tenant>-table-editor` | NocoDB | internal only |
| `<tenant>-webmail` | Roundcube | internal only |
## Key Rules

- `<tenant>-system` must be privileged (`security.privileged: true`) — required for botserver to own `/opt/gbo/mounts`
- All containers use LXD proxy devices for port forwarding (network forwards don't work when the external IP is on the host NIC, not the bridge)
- Never remove proxy devices for ports: 80, 443, 25, 465, 587, 993, 995, 143, 110, 4190, 53
- The CI runner (`alm-ci`) must NOT have cross-container disk device mounts — deploy via SSH instead
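A port forward like the ones above can be added with LXD's `proxy` device type; the device name `http80` below is illustrative:

```shell
# Forward host port 80 to the same port inside the proxy container
lxc config device add <tenant>-proxy http80 proxy \
  listen=tcp:0.0.0.0:80 connect=tcp:127.0.0.1:80
```

`connect` is resolved inside the container, so it targets the container's loopback where Caddy listens; `lxc config device show <tenant>-proxy` confirms the device is in place.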
## Firewall (host)

- ufw with `DEFAULT_FORWARD_POLICY=ACCEPT` (needed for container internet access)
- The LXD forward rule must persist across reboots via a systemd service
- fail2ban on the host (SSH jail) and in the email container (mail jail)
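The persistence requirement above could be met with a small oneshot unit — a sketch; the unit name, the bridge name `lxdbr0`, and the exact rules are assumptions to adapt:

```ini
# /etc/systemd/system/lxd-forward.service (illustrative)
[Unit]
Description=Re-apply FORWARD rules for the LXD bridge after boot
After=network-online.target snap.lxd.daemon.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/iptables -I FORWARD -i lxdbr0 -j ACCEPT
ExecStart=/usr/sbin/iptables -I FORWARD -o lxdbr0 -j ACCEPT

[Install]
WantedBy=multi-user.target
```

Enable with `systemctl enable --now lxd-forward.service`.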
## ⚠️ Caddy Config — CRITICAL RULES
NEVER replace the Caddyfile with a minimal/partial config. The full config has ~25 vhosts. If you only see 1-2 vhosts, you are looking at a broken/partial config.
Before ANY change:

- Backup: `cp /opt/gbo/conf/config /opt/gbo/conf/config.bak-$(date +%Y%m%d%H%M)`
- Validate: `caddy validate --config /opt/gbo/conf/config --adapter caddyfile`
- Reload (not restart): `caddy reload --config /opt/gbo/conf/config --adapter caddyfile`
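The three steps can be chained in a wrapper so a failed validation never reaches the reload (a sketch; where you install the script is up to you):

```shell
#!/usr/bin/env bash
# Backup, validate, then reload — set -e aborts at the first failing step
set -euo pipefail
CONF=/opt/gbo/conf/config
cp "$CONF" "$CONF.bak-$(date +%Y%m%d%H%M)"
caddy validate --config "$CONF" --adapter caddyfile
caddy reload   --config "$CONF" --adapter caddyfile
```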
Caddy storage must be explicitly set in the global block, otherwise Caddy uses `~/.local/share/caddy` and loses existing certificates on restart:

```
{
    storage file_system {
        root /opt/gbo/data/caddy
    }
}
```
Dead domains cause ERR_SSL_PROTOCOL_ERROR — if a domain in the Caddyfile has no DNS record, Caddy loops trying to get a certificate and pollutes TLS state. Remove dead domains immediately.
After removing domains from config, restart Caddy (not just reload) to clear in-memory ACME state from old domains.
## botserver / botui

- botserver: `system.service` on port 5858
- botui: `ui.service` on port 5859
- `BOTSERVER_URL` in `ui.service` must point to `http://localhost:5858` (not the external HTTPS URL) — using the external URL causes a WebSocket disconnect before TALK executes
- Valkey/Redis bound to `127.0.0.1:6379` — iptables rules must allow loopback on this port or suggestions/cache won't work
- Vault unseal keys stored in `/opt/gbo/bin/botserver-stack/conf/vault/init.json`
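One way to pin `BOTSERVER_URL` is a systemd drop-in for `ui.service` — the drop-in mechanism and path are assumptions; the variable and value match the rule above:

```ini
# /etc/systemd/system/ui.service.d/override.conf (illustrative path)
[Service]
Environment=BOTSERVER_URL=http://localhost:5858
```

Apply with `systemctl daemon-reload && systemctl restart ui.service`.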
## iptables loopback rule (required)

Internal services (Valkey, MinIO) are protected by DROP rules. Loopback must be explicitly allowed before the DROP rules:

```shell
iptables -I INPUT -i lo -j ACCEPT                 # -I inserts at the top, ahead of the DROPs
iptables -A INPUT -p tcp --dport 6379 -j DROP     # external only
```
## CoreDNS Hardening

The Corefile must include the `acl` plugin to prevent DNS amplification attacks:

```
zone.example.com:53 {
    file /opt/gbo/data/zone.example.com.zone
    acl {
        allow type ANY net 10.0.0.0/8 127.0.0.0/8
        allow type A net 0.0.0.0/0
        allow type AAAA net 0.0.0.0/0
        allow type MX net 0.0.0.0/0
        block
    }
    cache
    errors
}
```
Reload with SIGHUP: `pkill -HUP coredns`
## fail2ban in Proxy Container

The proxy container needs its own fail2ban for HTTP flood protection:

- Filter: match 4xx errors from the Caddy JSON access log
- Jail: `caddy-http-flood` — 100 errors/60s → ban 1h
- Disable the default `sshd` jail (no SSH in the proxy container) via `jail.d/defaults-debian.conf`
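The filter and jail might look like this — a sketch: the regex assumes Caddy's default JSON access-log field names `remote_ip` and `status`, so verify it against your actual log lines before enabling:

```ini
# /etc/fail2ban/filter.d/caddy-http-flood.conf (illustrative)
[Definition]
failregex = "remote_ip":"<HOST>".*"status":4\d\d

# /etc/fail2ban/jail.d/caddy-http-flood.conf (illustrative)
[caddy-http-flood]
enabled  = true
filter   = caddy-http-flood
logpath  = /opt/gbo/logs/access.log
maxretry = 100
findtime = 60
bantime  = 3600
```

Test the filter against a live log with `fail2ban-regex /opt/gbo/logs/access.log caddy-http-flood` before reloading.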
## CI/CD (Forgejo Runner)

- The runner container must have no cross-container disk mounts
- Deploy via SSH: `scp binary <system-container>:/opt/gbo/bin/botserver`
- The SSH key from the runner → system container must be pre-authorized
- sccache + cargo registry caches accumulate — a daily cleanup cron is required
- ZFS snapshots of the CI container can be huge if taken while cross-mounts were active — delete stale snapshots after removing mounts
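The daily cleanup could be a root crontab entry inside the CI container — the cache paths below are assumptions; check where sccache and cargo actually write on your runner:

```
# m h dom mon dow  command
0 3 * * *  rm -rf /root/.cache/sccache/* /root/.cargo/registry/cache/*
```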
## ZFS Disk Space

- Check snapshots: `zfs list -t snapshot -o name,used | sort -k2 -rh`
- Snapshots retain data from device mounts at the time of the snapshot — removing mounts doesn't free space until the snapshot is deleted
- Delete a snapshot: `zfs destroy <pool>/containers/<name>@<snapshot>`
- Daily rolling snapshots (7-day retention) via cron
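The rolling-snapshot cron could be sketched as follows — the pool name, snapshot prefix, and GNU `date` usage are assumptions:

```shell
#!/usr/bin/env bash
# Daily: snapshot every container, then prune the snapshot from 8 days ago
set -euo pipefail
POOL=default                        # assumption: adjust to your ZFS pool name
TODAY=$(date +%Y%m%d)
EXPIRED=$(date -d '8 days ago' +%Y%m%d)
for c in $(lxc list --format csv -c n); do
  zfs snapshot "$POOL/containers/$c@daily-$TODAY"
  zfs destroy  "$POOL/containers/$c@daily-$EXPIRED" 2>/dev/null || true
done
```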
## Bot Compiler — Known Issues Fixed

Tools without PARAM declarations (e.g. USE KB-only tools) were not getting `.mcp.json` generated, causing USE TOOL to silently skip them. Fixed in the compiler: always generate `.mcp.json`, even for parameterless tools.
## Git Workflow

Push to both remotes after every change:

```shell
cd <submodule>
git push origin main
git push alm main
cd ..
git add <submodule>
git commit -m "Update submodule"
git push alm main
```

If the root gb repo is not pushed, CI/CD pipelines will not trigger.
## Useful Commands

```shell
# Check all containers
lxc list

# Check disk device mounts per container
for c in $(lxc list --format csv -c n); do
  devices=$(lxc config device show $c | grep 'type: disk' | grep -v 'pool:' | wc -l)
  [ $devices -gt 0 ] && echo "=== $c ===" && lxc config device show $c | grep -E 'source:|path:' | grep -v pool
done

# Tail Caddy errors
lxc exec <tenant>-proxy -- tail -f /opt/gbo/logs/access.log

# Restart botserver + botui
lxc exec <tenant>-system -- systemctl restart system.service ui.service

# Check iptables in system container
lxc exec <tenant>-system -- iptables -L -n | grep -E 'DROP|ACCEPT.*lo'

# ZFS snapshot usage
zfs list -t snapshot -o name,used | sort -k2 -rh | head -20

# Unseal Vault
lxc exec <tenant>-system -- bash -c "
  export VAULT_ADDR=https://127.0.0.1:8200 VAULT_SKIP_VERIFY=true
  /opt/gbo/bin/botserver-stack/bin/vault/vault operator unseal <key>
"
```