GeneralBots/gb

Rodrigo Rodriguez (Pragmatismo) 4900274887 Remove BOTCODE files

2026-03-18 08:48:04 -03:00

General Bots Cloud — Production Operations Guide

Infrastructure Overview

Host OS: Ubuntu 24.04 LTS, LXD (snap)
SSH: Key auth only, sudoer user in lxd group
Container engine: LXD with ZFS storage pool

LXC Container Architecture

Container	Purpose	Exposed Ports
`<tenant>-proxy`	Caddy reverse proxy	80, 443
`<tenant>-system`	botserver + botui (privileged!)	internal only
`<tenant>-alm`	Forgejo (ALM/Git)	internal only
`<tenant>-alm-ci`	Forgejo CI runner	none
`<tenant>-email`	Stalwart mail server	25,465,587,993,995,143,110
`<tenant>-dns`	CoreDNS	53
`<tenant>-drive`	MinIO S3	internal only
`<tenant>-tables`	PostgreSQL	internal only
`<tenant>-table-editor`	NocoDB	internal only
`<tenant>-webmail`	Roundcube	internal only

Key Rules

<tenant>-system must be privileged (security.privileged: true) — required for botserver to own /opt/gbo/ mounts
All containers use LXD proxy devices for port forwarding (network forwards don't work when external IP is on host NIC, not bridge)
Never remove proxy devices for ports: 80, 443, 25, 465, 587, 993, 995, 143, 110, 4190, 53
CI runner (alm-ci) must NOT have cross-container disk device mounts — deploy via SSH instead
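
As a sketch of the proxy-device rule above, the helper below only prints the `lxc config device add` command that would forward one host TCP port into a container. The device name `port<N>` and the listen/connect addresses are assumptions for illustration, not values taken from the production setup.

```shell
# Print (do not run) the command that adds an LXD proxy device forwarding a
# host TCP port to the same port inside a container.
# Assumptions: device name "port<N>", host listens on 0.0.0.0,
# the service inside the container listens on 127.0.0.1.
proxy_device_cmd() {
  container=$1
  port=$2
  printf 'lxc config device add %s port%s proxy listen=tcp:0.0.0.0:%s connect=tcp:127.0.0.1:%s\n' \
    "$container" "$port" "$port" "$port"
}

# Example: review the commands for the proxy container, then run them by hand.
for p in 80 443; do
  proxy_device_cmd "<tenant>-proxy" "$p"
done
```

Printing instead of executing keeps the helper safe to run anywhere; existing devices can be compared against its output with `lxc config device show <container>`.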

Firewall (host)

ufw with DEFAULT_FORWARD_POLICY=ACCEPT (needed for container internet)
LXD forward rule must persist via systemd service
fail2ban on host (SSH jail) and in email container (mail jail)

⚠️ Caddy Config — CRITICAL RULES

NEVER replace the Caddyfile with a minimal/partial config. The full config has ~25 vhosts. If you only see 1-2 vhosts, you are looking at a broken/partial config.

Before ANY change:

Backup: cp /opt/gbo/conf/config /opt/gbo/conf/config.bak-$(date +%Y%m%d%H%M)
Validate: caddy validate --config /opt/gbo/conf/config --adapter caddyfile
Reload (not restart): caddy reload --config /opt/gbo/conf/config --adapter caddyfile
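
The three steps above can be chained so that a failed validation never reaches the reload. This is a minimal sketch, assuming it runs inside the proxy container with `caddy` on the PATH; the function name `safe_caddy_reload` is hypothetical.

```shell
# Back up, validate, then reload a Caddyfile; stop at the first failure.
# The config path defaults to the one used in this guide.
safe_caddy_reload() {
  config=${1:-/opt/gbo/conf/config}
  backup="$config.bak-$(date +%Y%m%d%H%M)"

  # 1. Backup: abort if the copy cannot be made.
  cp "$config" "$backup" || return 1

  # 2. Validate: a broken config must never be loaded.
  if ! caddy validate --config "$config" --adapter caddyfile; then
    echo "validation failed; config not reloaded (backup: $backup)" >&2
    return 1
  fi

  # 3. Reload (not restart), so running connections are kept.
  caddy reload --config "$config" --adapter caddyfile
}
```

Calling `safe_caddy_reload` with no argument uses `/opt/gbo/conf/config`; pass another path to check a candidate config first.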

Caddy storage must be explicitly set in the global block, otherwise Caddy uses ~/.local/share/caddy and loses existing certificates on restart:

{
    storage file_system {
        root /opt/gbo/data/caddy
    }
}

Dead domains cause ERR_SSL_PROTOCOL_ERROR — if a domain in the Caddyfile has no DNS record, Caddy loops trying to get a certificate and pollutes TLS state. Remove dead domains immediately.

After removing domains from config, restart Caddy (not just reload) to clear in-memory ACME state from old domains.
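
One way to spot dead domains before they reach the Caddyfile is to check that each name resolves. The sketch below assumes the domain list is passed by hand and uses `getent hosts`, which also consults `/etc/hosts`, so a name that resolves locally may still have no public DNS record; both function names are hypothetical.

```shell
# Succeed if the domain resolves on this host. Swap the body for a query
# against a public resolver if local overrides could hide a dead domain.
resolve_domain() {
  getent hosts "$1" >/dev/null 2>&1
}

# Print every domain given as an argument that does not resolve.
dead_domains() {
  for d in "$@"; do
    resolve_domain "$d" || echo "$d"
  done
}
```

For example, `dead_domains zone.example.com www.zone.example.com` prints only the names that failed to resolve; an empty output means every name resolved.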

botserver / botui

botserver: system.service on port 5858
botui: ui.service on port 5859
BOTSERVER_URL in ui.service must point to http://localhost:5858 (not HTTPS external URL) — using external URL causes WebSocket disconnect before TALK executes
Valkey/Redis bound to 127.0.0.1:6379 — iptables rules must allow loopback on this port or suggestions/cache won't work
Vault unseal keys stored in /opt/gbo/bin/botserver-stack/conf/vault/init.json

iptables loopback rule (required)

Internal services (Valkey, MinIO) are protected by DROP rules. Loopback must be explicitly allowed before the DROP rules:

iptables -I INPUT -i lo -j ACCEPT
iptables -A INPUT -p tcp --dport 6379 -j DROP  # external only
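
Because rule order decides whether loopback traffic survives, it can help to check it mechanically. The following is a rough sketch that reads `iptables -S INPUT` output and succeeds only when a loopback ACCEPT rule comes before the first DROP rule; the function name is hypothetical and the matching is deliberately simple.

```shell
# Read `iptables -S INPUT` output on stdin.
# Exit 0 only if an "-i lo ... -j ACCEPT" rule exists and appears before
# the first "-j DROP" rule (or there is no DROP rule at all).
loopback_before_drop() {
  awk '
    /-i lo / && /-j ACCEPT/ && !lo { lo = NR }
    /-j DROP/ && !drop             { drop = NR }
    END { exit !(lo && (!drop || lo < drop)) }
  '
}
```

A possible invocation from the host is `lxc exec <tenant>-system -- iptables -S INPUT | loopback_before_drop && echo "loopback rule OK"`.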

CoreDNS Hardening

Corefile must include acl plugin to prevent DNS amplification attacks:

zone.example.com:53 {
    file /opt/gbo/data/zone.example.com.zone
    acl {
        allow type ANY net 10.0.0.0/8 127.0.0.0/8
        allow type A net 0.0.0.0/0
        allow type AAAA net 0.0.0.0/0
        allow type MX net 0.0.0.0/0
        block
    }
    cache
    errors
}

Reload with SIGHUP: pkill -HUP coredns

fail2ban in Proxy Container

Proxy container needs its own fail2ban for HTTP flood protection:

Filter: match 4xx errors from Caddy JSON access log
Jail: caddy-http-flood — 100 errors/60s → ban 1h
Disable default sshd jail (no SSH in proxy container) via jail.d/defaults-debian.conf
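
The filter and jail described above might look roughly like the files written by this sketch. The thresholds (100 errors in 60 s, 1 h ban) and the jail name come from this guide; the regex, the `datepattern`, the log path and the `remote_ip`/`status` field names are assumptions about Caddy's JSON log format and should be checked with `fail2ban-regex` against a real log before use.

```shell
# Write a fail2ban filter and jail for Caddy's JSON access log into the given
# fail2ban config directory (default /etc/fail2ban).
# Assumptions (not from the project's real files): regex, datepattern,
# log path /opt/gbo/logs/access.log.
write_caddy_flood_jail() {
  dir=${1:-/etc/fail2ban}
  mkdir -p "$dir/filter.d" "$dir/jail.d" || return 1

  # Filter: any 4xx response counts as one failure for the client IP.
  cat > "$dir/filter.d/caddy-http-flood.conf" <<'EOF'
[Definition]
failregex = ^.*"remote_ip":"<HOST>".*"status":4\d\d.*$
ignoreregex =
datepattern = "ts":{EPOCH}
EOF

  # Jail: 100 errors within 60 seconds leads to a 1 hour ban.
  cat > "$dir/jail.d/caddy-http-flood.conf" <<'EOF'
[caddy-http-flood]
enabled = true
filter = caddy-http-flood
logpath = /opt/gbo/logs/access.log
port = http,https
findtime = 60
maxretry = 100
bantime = 3600
EOF

  # No SSH daemon runs in the proxy container, so turn the default jail off.
  cat > "$dir/jail.d/defaults-debian.conf" <<'EOF'
[sshd]
enabled = false
EOF
}
```

Writing to a scratch directory first (for example `write_caddy_flood_jail /tmp/f2b`) makes it easy to review the files before copying them into `/etc/fail2ban`.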

CI/CD (Forgejo Runner)

Runner container must have no cross-container disk mounts
Deploy via SSH: scp binary <system-container>:/opt/gbo/bin/botserver
SSH key from runner → system container must be pre-authorized
sccache + cargo registry cache accumulates — daily cleanup cron required
ZFS snapshots of CI container can be huge if taken while cross-mounts were active — delete stale snapshots after removing mounts
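
The daily cache cleanup could be as simple as deleting files that have not been modified for a number of days. This is a minimal sketch: the 7-day threshold and the cache locations in the usage note are assumptions, and `-delete` relies on GNU find as shipped with Ubuntu.

```shell
# Delete files under a cache directory that were last modified more than
# N days ago (default 7). A missing directory is not an error.
clean_cache() {
  dir=$1
  days=${2:-7}
  [ -d "$dir" ] || return 0
  find "$dir" -type f -mtime +"$days" -delete
}
```

A daily cron job in the runner container might call `clean_cache "$HOME/.cache/sccache"` and `clean_cache "$HOME/.cargo/registry/cache"`, assuming the default sccache and cargo locations are in use.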

ZFS Disk Space

Check snapshots: zfs list -t snapshot -o name,used | sort -k2 -rh
Snapshots retain data from device mounts at time of snapshot — removing mounts doesn't free space until snapshot is deleted
Delete snapshot: zfs destroy <pool>/containers/<name>@<snapshot>
Daily rolling snapshots (7-day retention) via cron
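
The 7-day rolling retention can be sketched as a filter over snapshot names. The `@daily-YYYYMMDD` naming scheme and the function name are hypothetical; the guide does not say how the existing cron job names its snapshots.

```shell
# Read snapshot names (one per line) on stdin and print those whose
# "@daily-YYYYMMDD" suffix is older than the cutoff date (YYYYMMDD).
# Snapshots that do not follow this naming scheme are never printed.
expired_snapshots() {
  cutoff=$1
  awk -v cutoff="$cutoff" -F'@daily-' \
    'NF == 2 && length($2) == 8 && $2 ~ /^[0-9]+$/ && $2 < cutoff'
}

# Possible cron usage (review the list before piping it to zfs destroy):
#   zfs snapshot -r <pool>/containers@daily-$(date +%Y%m%d)
#   cutoff=$(date -d '7 days ago' +%Y%m%d)
#   zfs list -H -t snapshot -o name | expired_snapshots "$cutoff"
```

Keeping the selection separate from `zfs destroy` means the list of candidates can be inspected first, and manually created snapshots are left alone.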

Bot Compiler — Known Issues Fixed

Tools without PARAM declarations (e.g. tools that only call USE KB) did not get a .mcp.json generated, so USE TOOL silently skipped them. Fixed in the compiler: .mcp.json is now always generated, even for parameterless tools.

Git Workflow

Push to both remotes after every change:

cd <submodule>
git push origin main
git push alm main
cd ..
git add <submodule>
git commit -m "Update submodule"
git push alm main

If the root gb repo is not pushed, the CI/CD pipelines are not triggered.

Useful Commands

# Check all containers
lxc list

# Check disk device mounts per container
for c in $(lxc list --format csv -c n); do
  # Count disk devices defined directly on the container
  devices=$(lxc config device show "$c" | grep -c 'type: disk')
  [ "$devices" -gt 0 ] && echo "=== $c ===" && lxc config device show "$c" | grep -E 'source:|path:' | grep -v pool
done

# Tail the Caddy access log (error responses show up here)
lxc exec <tenant>-proxy -- tail -f /opt/gbo/logs/access.log

# Restart botserver + botui
lxc exec <tenant>-system -- systemctl restart system.service ui.service

# Check iptables in system container
lxc exec <tenant>-system -- iptables -L -n | grep -E 'DROP|ACCEPT.*lo'

# ZFS snapshot usage
zfs list -t snapshot -o name,used | sort -k2 -rh | head -20

# Unseal Vault
lxc exec <tenant>-system -- bash -c "
  export VAULT_ADDR=https://127.0.0.1:8200 VAULT_SKIP_VERIFY=true
  /opt/gbo/bin/botserver-stack/bin/vault/vault operator unseal <key>
"
