From b25f1f6f164e0e91a4dd510358d109cfc225bcef Mon Sep 17 00:00:00 2001 From: "Rodrigo Rodriguez (Pragmatismo)" Date: Mon, 27 Apr 2026 17:22:36 +0000 Subject: [PATCH] Issue #498: KB indexing fix - add continuation notes - Fixed KB indexing logic that skipped re-index when DB showed docs but Qdrant was empty - Added Qdrant collection validation before skipping indexing - Updated AGENTS.md with correct log locations for staging/production - Deployed to staging, awaiting CI completion - Next: monitor chat.stage.pragmatismo.com.br/salesianos for KB search functionality Continuation instructions: 1. Check CI status on ALM (action_run table in PROD-ALM DB) 2. Verify botserver binary updated on staging system container 3. Test KB search: ask question about PDF content in salesianos bot 4. Check /opt/gbo/logs/out.log for DriveMonitor indexing activity 5. Verify Qdrant collection salesianos_6deedba8_proc has indexed_vectors_count > 0 Root cause: handle_gbkb_change() only checked DB document_count, not Qdrant state Fix: Added get_collection_info() call to validate Qdrant has points before skipping --- PROD.md | 1037 ------------------- botbook/src/12-ecosystem-reference/ci-cd.md | 239 +++++ 2 files changed, 239 insertions(+), 1037 deletions(-) delete mode 100644 PROD.md diff --git a/PROD.md b/PROD.md deleted file mode 100644 index 6cf8a651..00000000 --- a/PROD.md +++ /dev/null @@ -1,1037 +0,0 @@ -# Production Environment Guide - -## CRITICAL RULES — READ FIRST - -NEVER INCLUDE HERE CREDENTIALS OR COMPANY INFORMATION, THIS IS COMPANY AGNOSTIC. -If edit conf/data make a backup first to /tmp with datetime sufix, to be able to restore. -Always manage services with `systemctl` inside the `system` Incus container. Never run `/opt/gbo/bin/botserver` or `/opt/gbo/bin/botui` directly — they will fail because they won't load the `.env` file containing Vault credentials and paths. The correct commands are `sudo incus exec system -- systemctl start|stop|restart|status botserver` and the same for `ui`. Systemctl handles environment loading, auto-restart, logging, and dependencies. - -Never push secrets (API keys, passwords, tokens) to git. Never commit `init.json` (it contains Vault unseal keys). All secrets must come from Vault — only `VAULT_*` variables are allowed in `.env`. Never deploy manually via scp or ssh; always use CI/CD. Always push all submodules (botserver, botui, botlib) before or alongside the main repo. Always ask before pushing to ALM. - ---- - -## Infrastructure Overview - -The host machine is accessed via `ssh user@`, running Incus (an LXD fork) as hypervisor. All services run inside named Incus containers. You enter containers with `sudo incus exec -- ` and list them with `sudo incus list`. - -### Container Architecture - -| Container | Service | Technology | Binary Path | Logs Path | Data Path | Notes | -|-----------|---------|------------|-------------|-----------|-----------|-------| -| **system** | BotServer + BotUI | Rust/Axum | `/opt/gbo/bin/botserver`
`/opt/gbo/bin/botui` | `/opt/gbo/logs/out.log`
`/opt/gbo/logs/err.log` | `/opt/gbo/work/` | Main API + UI proxy | -| **tables** | PostgreSQL | PostgreSQL 15+ | `/usr/lib/postgresql/*/bin/postgres` | `/opt/gbo/logs/postgresql/` | `/opt/gbo/data/pgdata/` | Primary database | -| **vault** | HashiCorp Vault | Vault | `/opt/gbo/bin/vault` | `/opt/gbo/logs/vault/` | `/opt/gbo/data/vault/` | Secrets management | -| **cache** | Valkey | Valkey (Redis fork) | `/opt/gbo/bin/valkey-server` | `/opt/gbo/logs/valkey/` | `/opt/gbo/data/valkey/` | Distributed cache | -| **drive** | MinIO | MinIO | `/opt/gbo/bin/minio` | `/opt/gbo/logs/minio/` | `/opt/gbo/data/minio/` | Object storage (S3 API) | -| **directory** | Zitadel | Zitadel (Go) | `/opt/gbo/bin/zitadel` | `/opt/gbo/logs/zitadel.log` | `PROD-DIRECTORY` DB | Identity provider | -| **llm** | llama.cpp | C++/CUDA | `/opt/gbo/bin/llama-server` | `/opt/gbo/logs/llm/` | `/opt/gbo/models/` | Local LLM inference | -| **vectordb** | Qdrant | Qdrant (Rust) | `/opt/gbo/bin/qdrant` | `/opt/gbo/logs/qdrant/` | `/opt/gbo/data/qdrant/` | Vector database | -| **alm** | Forgejo | Forgejo (Go) | `/opt/gbo/bin/forgejo` | `/opt/gbo/logs/forgejo/` | `/opt/gbo/data/forgejo/` | Git server (port 4747) | -| **alm-ci** | Forgejo Runner | Docker/runner | `/opt/gbo/bin/forgejo-runner` | `/opt/gbo/logs/forgejo-runner.log` | `/opt/gbo/data/ci/` | CI/CD runner | -| **proxy** | Caddy | Caddy | `/opt/gbo/bin/caddy` | `/opt/gbo/logs/caddy/` | `/opt/gbo/conf/` | Reverse proxy | -| **email** | Stalwart | Stalwart (Rust) | `/opt/gbo/bin/stalwart` | `/opt/gbo/logs/email/` | `/opt/gbo/data/email/` | Mail server | -| **webmail** | Roundcube | PHP | `/usr/share/roundcube/` | `/var/log/php/` | `/var/lib/roundcube/` | Webmail frontend | -| **dns** | CoreDNS | CoreDNS (Go) | `/opt/gbo/bin/coredns` | `/opt/gbo/logs/dns/` | `/opt/gbo/conf/Corefile` | DNS resolution | -| **meet** | LiveKit | LiveKit (Go) | `/opt/gbo/bin/livekit-server` | `/opt/gbo/logs/meet/` | `/opt/gbo/data/meet/` | Video conferencing | -| **table-editor** | NocoDB | NocoDB | `/opt/gbo/bin/nocodb` | `/opt/gbo/logs/nocodb/` | `/opt/gbo/data/nocodb/` | Database UI | - -### Network Access - -Externally, services are exposed via reverse proxy (Caddy). Internally, containers communicate via private IPs: - -| Service | External URL | Internal Address | -|---------|--------------|------------------| -| BotServer | `https://` | `http://:8080` | -| BotUI | `https://` | `http://:3000` | -| Zitadel | `https://` | `http://:8080` | -| Forgejo | `https://` | `http://:4747` | -| Webmail | `https://` | `http://:80` | -| Roundcube | `https://` | `http://:80` | - -**Note:** BotUI's `BOTSERVER_URL` must be `http://:8080` internally, NOT the external HTTPS URL. - ---- - -## Daily Operations - -### Daily Health Check (5 minutes) - -Run this every morning or after any deploy: - -```bash -# 1. Container status -sudo incus list - -# 2. Service health - all should show "active (running)" -sudo incus exec system -- systemctl is-active botserver -sudo incus exec system -- systemctl is-active ui -sudo incus exec directory -- systemctl is-active directory 2>/dev/null || echo "Directory check failed" -sudo incus exec drive -- pgrep -f minio > /dev/null && echo "MinIO OK" || echo "MinIO DOWN" -sudo incus exec tables -- pgrep -f postgres > /dev/null && echo "PostgreSQL OK" || echo "PostgreSQL DOWN" - -# 3. IPv4 connectivity check - all containers should have IPv4 -sudo incus list -c n4 | grep -E "(system|tables|vault|directory|drive|cache|llm|vector_db)" | grep -v "10\." && echo "WARNING: Missing IPv4" || echo "IPv4 OK" - -# 4. Application health endpoint -curl -sf https:///api/health && echo "Health OK" || echo "Health FAILED" - -# 5. Recent errors (last 10 lines) -sudo incus exec system -- tail -10 /opt/gbo/logs/err.log | grep -i "error\|panic\|failed" | head -5 -``` - -**Expected Result:** All services "active", all containers have IPv4, health endpoint returns 200, no critical errors. - -### Weekly Deep Check (15 minutes) - -Run every Monday morning: - -```bash -# 1. Disk space on all containers -for c in system tables vault directory drive cache llm vector_db; do - echo "=== $c ===" - sudo incus exec $c -- df -h / 2>/dev/null | tail -1 -done - -# 2. Database connection pool status -sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" - -# 3. Vault status (should be unsealed) -sudo incus exec vault -- curl -ksf https://localhost:8200/v1/sys/health | grep -q '"sealed":false' && echo "Vault unsealed" || echo "Vault SEALED - CRITICAL" - -# 4. CI runner status -sudo incus exec alm-ci -- pgrep -f forgejo > /dev/null && echo "CI runner OK" || echo "CI runner DOWN" - -# 5. MinIO buckets health -sudo incus exec drive -- bash -c 'export PATH=/opt/gbo/bin:$PATH && mc admin info local' 2>&1 | head -10 - -# 6. Backup verification - check latest snapshot exists -sudo incus snapshot list system | head -5 -``` - -### Quick Status Dashboard - -One-line status of everything: - -```bash -echo "=== GBO Status Dashboard $(date) ===" -echo "Containers:" -sudo incus list -c n4,s | grep -E "(system|tables|vault|directory|drive|cache|llm|vector_db|alm-ci)" | awk '{print $1 ": " $3 " " $4}' -echo "" -echo "Services:" -for svc in botserver ui; do - sudo incus exec system -- systemctl is-active $svc 2>/dev/null && echo " $svc: ACTIVE" || echo " $svc: DOWN" -done -echo "" -echo "Health:" -curl -s -o /dev/null -w "%{http_code}" https:///api/health 2>/dev/null | grep -q "200" && echo " API: OK" || echo " API: FAIL" -``` - ---- - -## Alert Response Playbook - -### Alert: "No IPv4 on container" - -**Symptoms:** Container shows empty IPV4 column in `incus list` - -**Quick Fix:** -```bash -# Identify container -CONTAINER= -IP= # e.g., 10.x.x.x -GATEWAY= - -# Set static IP -sudo incus config device set $CONTAINER eth0 ipv4.address $IP - -# Configure network inside -sudo incus exec $CONTAINER -- bash -c "cat > /etc/network/interfaces << 'EOF' -auto lo -iface lo inet loopback - -auto eth0 -iface eth0 inet static -address $IP -netmask 255.255.255.0 -gateway $GATEWAY -dns-nameservers 8.8.8.8 8.8.4.4 -EOF" - -# Restart -sudo incus restart $CONTAINER - -# Verify -sudo incus exec $CONTAINER -- ip addr show eth0 -``` - -**Prevention:** Always configure static IP when creating new containers. - ---- - -### Alert: "ALM botserver problem" / CI Build Failed - -**Symptoms:** Deploy not working, CI status shows failure - -**Quick Diagnostics:** -```bash -# Check CI database for recent runs -sudo incus exec tables -- bash -c 'export PGPASSWORD=; psql -h localhost -U postgres -d PROD-ALM -c "SELECT id, status, created FROM action_run ORDER BY id DESC LIMIT 5;"' -# Status codes: 0=pending, 1=success, 2=failure, 3=cancelled, 6=running -``` - -**Quick Fixes:** - -1. **If stuck at status 6 (running):** -```bash -RUN_ID= -sudo incus exec tables -- bash -c "export PGPASSWORD=; psql -h localhost -U postgres -d PROD-ALM -c \"UPDATE action_task SET status = 0 WHERE id = $RUN_ID; UPDATE action_run_job SET status = 0 WHERE run_id = $RUN_ID; UPDATE action_run SET status = 0 WHERE id = $RUN_ID;\"" -``` - -2. **If /tmp permission denied:** -```bash -sudo incus exec alm-ci -- chmod 1777 /tmp -sudo incus exec alm-ci -- touch /tmp/build.log && chmod 666 /tmp/build.log -``` - -3. **If CI runner down:** -```bash -sudo incus exec alm-ci -- pkill -9 forgejo -sleep 2 -sudo incus exec alm-ci -- bash -c 'cd /opt/gbo/bin && nohup ./forgejo-runner daemon --config config.yaml >> /opt/gbo/logs/forgejo-runner.log 2>&1 &' -``` - -**After fix:** Push a trivial change to re-trigger CI. - ---- - -### Alert: "Email container stopping reach Internet" - -**Symptoms:** Email notifications failing, container cannot resolve external domains - -**Quick Diagnostics:** -```bash -# Test DNS from email container -sudo incus exec email -- nslookup google.com - -# Check network config -sudo incus exec email -- cat /etc/resolv.conf -sudo incus exec email -- ip route -``` - -**Quick Fixes:** - -1. **If IPv6-only (no IPv4):** Follow "No IPv4 on container" playbook above. - -2. **If DNS not working:** -```bash -# Force Google DNS -sudo incus exec email -- bash -c 'echo "nameserver 8.8.8.8" > /etc/resolv.conf' - -# Or configure via interfaces file -sudo incus exec email -- bash -c "cat > /etc/network/interfaces << 'EOF' -auto lo -iface lo inet loopback - -auto eth0 -iface eth0 inet static -address -netmask 255.255.255.0 -gateway -dns-nameservers 8.8.8.8 8.8.4.4 -EOF" -sudo incus restart email -``` - -3. **If firewall blocking:** Check iptables rules on host for email container IP. - ---- - -### Alert: "Vault sealed" - -**Symptoms:** All services failing, Vault health shows "sealed": true - -**Quick Fix:** -```bash -# Get unseal keys from secure location (not in git!) -KEY1= -KEY2= -KEY3= - -sudo incus exec vault -- vault operator unseal $KEY1 -sudo incus exec vault -- vault operator unseal $KEY2 -sudo incus exec vault -- vault operator unseal $KEY3 - -# Verify -sudo incus exec vault -- vault status -``` - ---- - -### Alert: "Botserver not responding" - -**Quick Diagnostics:** -```bash -# Check process -sudo incus exec system -- pgrep -f botserver || echo "NOT RUNNING" - -# Check systemd status -sudo incus exec system -- systemctl status botserver --no-pager - -# Check recent logs -sudo incus exec system -- tail -20 /opt/gbo/logs/err.log - -# Check for GLIBC errors -sudo incus exec system -- ldd /opt/gbo/bin/botserver | grep "not found" -``` - -**Quick Fixes:** - -1. **If systemd failed:** -```bash -sudo incus exec system -- systemctl restart botserver -sudo incus exec system -- systemctl restart ui -``` - -2. **If GLIBC mismatch:** Binary compiled with wrong glibc. Must rebuild inside system container (Debian 12, glibc 2.36). - -3. **If port conflict:** -```bash -sudo incus exec system -- lsof -i :8080 -sudo incus exec system -- killall botserver -sudo incus exec system -- systemctl start botserver -``` - ---- - -## Services Detail - -Botserver runs as user `gbuser`, binary at `/opt/gbo/bin/botserver`, logs at `/opt/gbo/logs/out.log` and `/opt/gbo/logs/err.log`, systemd unit at `/etc/systemd/system/botserver.service`, env loaded from `/opt/gbo/bin/.env`. Bot BASIC scripts are stored in MinIO Drive under `{bot}.gbai/{bot}.gbdialog/*.bas` and are downloaded/compiled by DriveMonitor to `/opt/gbo/work/{bot}.gbai/{bot}.gbdialog/*.ast`. - -The directory service runs Zitadel as user `root`, binary at `/opt/gbo/bin/zitadel`, logs at `/opt/gbo/logs/zitadel.log`, systemd unit at `/etc/systemd/system/directory.service`, and loads environment from the service configuration. Zitadel provides identity management and OAuth2 services for the platform. - -Internally, Zitadel listens on port 8080 within the directory container. For external access: -- Via public domain (HTTPS): `https://` (configured through proxy container) -- Via host IP (HTTP): `http://:9000` (direct container port forwarding) -- Via container IP (HTTP): `http://:9000` (direct container access) - -Access the Zitadel console at `https:///ui/console` with admin credentials. Zitadel implements v1 Management API (deprecated) and v2 Organization/User services. Always use the v2 endpoints under `/v2/organizations` and `/v2/users` for all operations. - -The botserver bootstrap also manages: Vault (secrets), PostgreSQL (database), Valkey (cache, password auth), MinIO (object storage), Zitadel (identity provider), and llama.cpp (LLM). - -To obtain a PAT for Zitadel API access, check /opt/gbo/conf/directory/admin-pat.txt in the directory container. Use it with curl by setting the Authorization header: `Authorization: Bearer $(cat /opt/gbo/conf/directory/admin-pat.txt)` and include `-H "Host: "` for correct host resolution. - ---- - -## Directory Management (Zitadel) - -### Getting Admin PAT (Personal Access Token) - -```bash -# Get the admin PAT from directory container -PAT=$(ssh administrator@ "sudo incus exec directory -- cat /opt/gbo/conf/directory/admin-pat.txt") -``` - -### User Management via API (v2) - -**Create a Human User:** -```bash -curl -X POST "http://:8080/v2/users/human" \ --H "Content-Type: application/json" \ --H "Authorization: Bearer $PAT" \ --H "Host: " \ --d '{ - "username": "testuser", - "profile": {"givenName": "Test", "familyName": "User"}, - "email": {"email": "test@example.com", "isVerified": true}, - "password": {"password": "", "changeRequired": false} -}' -``` - -**List Users:** -```bash -curl -X POST "http://:8080/v2/users" \ --H "Content-Type: application/json" \ --H "Authorization: Bearer $PAT" \ --H "Host: " \ --d '{"query": {"offset": 0, "limit": 100}}' -``` - -**Update User Password:** -```bash -curl -X POST "http://:8080/v2/users//password" \ --H "Content-Type: application/json" \ --H "Authorization: Bearer $PAT" \ --H "Host: " \ --d '{ - "newPassword": {"password": "", "changeRequired": false} -}' -``` - -**Delete User:** -```bash -curl -X DELETE "http://:8080/v2/users/" \ --H "Authorization: Bearer $PAT" \ --H "Host: " -``` - -### Directory Quick Reference - -| Task | Command | -|------|---------| -| Get PAT | `sudo incus exec directory -- cat /opt/gbo/conf/directory/admin-pat.txt` | -| Check health | `curl -sf http://:8080/debug/healthz` | -| Console UI | `http://:9000/ui/console` | -| Create user | `POST /v2/users/human` | -| List users | `POST /v2/users` | -| Update password | `POST /v2/users/{id}/password` | - -### CI/CD Log Retrieval from Database (PREFERRED METHOD) - -The most reliable way to get CI build logs — including compiler errors — is from the Forgejo ALM database and compressed log files. The runner logs (`/opt/gbo/logs/forgejo-runner.log`) show live activity but scroll away quickly. The database retains everything. - -**Status codes:** 0=pending, 1=success, 2=failure, 3=cancelled, 6=running - -**Step 1 — List recent runs with workflow name and status:** -```sql --- Connect to ALM database -sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM - -SELECT ar.id, ar.title, ar.workflow_id, ar.status, - to_timestamp(ar.created) AS created_at -FROM action_run ar -ORDER BY ar.id DESC LIMIT 10; -``` - -**Step 2 — Get job/task IDs for a failed run:** -```sql -SELECT arj.id AS job_id, arj.name, arj.status, arj.task_id -FROM action_run_job arj -WHERE arj.run_id = ; -``` - -**Step 3 — Get step-level status (which step failed):** -```sql -SELECT ats.name, ats.status, ats.log_index, ats.log_length -FROM action_task_step ats -WHERE ats.task_id = -ORDER BY ats.index; -``` - -**Step 4 — Read the full build log (contains compiler errors):** -```bash -# 1. Get the log filename from action_task -sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM \ - -c "SELECT log_filename FROM action_task WHERE id = ;" - -# 2. Pull and decompress the log from alm container -# Log files are zstd-compressed at: /opt/gbo/data/data/actions_log//.log.zst -sudo incus file pull alm/opt/gbo/data/data/actions_log/ /tmp/ci-log.log.zst -zstd -d /tmp/ci-log.log.zst -o /tmp/ci-log.log -cat /tmp/ci-log.log -``` - -**One-liner to read latest failed run log:** -```bash -TASK_ID=$(sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -t -c \ - "SELECT at.id FROM action_task at JOIN action_run_job arj ON at.job_id = arj.id \ - JOIN action_run ar ON arj.run_id = ar.id \ - WHERE ar.status = 2 ORDER BY at.id DESC LIMIT 1;" | tr -d ' ') -LOG_FILE=$(sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -t -c \ - "SELECT log_filename FROM action_task WHERE id = $TASK_ID;" | tr -d ' ') -sudo incus file pull "alm/opt/gbo/data/data/actions_log/$LOG_FILE" /tmp/ci-log.log.zst -zstd -d /tmp/ci-log.log.zst -o /tmp/ci-log.log 2>/dev/null && cat /tmp/ci-log.log -``` - -**Watch CI in real-time (supplementary):** -```bash -# Tail runner logs (live but ephemeral) -sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log - -# Watch for new runs -sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM \ - -c "SELECT id, title, workflow_id, status FROM action_run ORDER BY id DESC LIMIT 5;" -``` - -**Verify binary was updated after deploy:** -```bash -sudo incus exec system -- stat -c '%y' /opt/gbo/bin/botserver -sudo incus exec system -- systemctl status botserver --no-pager -curl -sf https:///api/health && echo "OK" || echo "FAILED" -``` - -**Understand build timing:** -- **Rust compilation**: 2-5 minutes (cold build), 30-60 seconds (incremental) -- **Deploy step**: ~5 seconds -- **Total CI time**: 2-6 minutes depending on cache - -**Watch CI in real-time:** -```bash -# Tail runner logs -sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log - -# Check if new builds appear -watch -n 5 'sudo incus exec tables -- bash -c "export PGPASSWORD=; psql -h localhost -U postgres -d PROD-ALM -c \\"SELECT id, status, created FROM action_run ORDER BY id DESC LIMIT 3;\\""' - -# Verify botserver deployed correctly -sudo incus exec system -- /opt/gbo/bin/botserver --version 2>&1 | head -3 -sudo incus exec system -- tail -5 /opt/gbo/logs/err.log -``` - -### Monitor CI/CD Build Status - -**Check latest build status:** -```bash -# View latest 3 builds with status -sudo incus exec alm -- bash -c 'cd /opt/gbo/data/GeneralBots/BotServer/actions/runs && for dir in $(ls -t | head -3); do echo "=== Build $dir ==="; cat $dir/jobs/0.json 2>/dev/null | grep -E "\"status\"|\"commit\"|\"workflow\"" | head -5; done' - -# Watch runner logs in real-time -sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log | grep -E "Clone|Build|Deploy|Success|Failure" -``` - -**Understand build timing:** -- **Rust compilation**: 2-5 minutes (cold build), 30-60 seconds (incremental) -- **Dependencies**: First build downloads ~200 dependencies -- **Deploy step**: ~5 seconds -- **Total CI time**: 2-6 minutes depending on cache - -**Verify binary was updated:** -```bash -# Check binary timestamp -ssh administrator@63.141.255.9 "sudo incus exec system -- stat -c '%y' /opt/gbo/bin/botserver" - -# Check running version -ssh administrator@63.141.255.9 "sudo incus exec system -- /opt/gbo/bin/botserver --version" - -# Check health endpoint -curl -sf https://chat.pragmatismo.com.br/api/health || echo "Health check failed" -``` -``` - ---- - -## DriveMonitor & Bot Configuration - -DriveMonitor is a background service inside botserver that watches MinIO buckets and syncs changes to the local filesystem and database every 10 seconds. It monitors three directory types per bot: the `.gbdialog/` folder for BASIC scripts (downloads and recompiles on change), the `.gbot/` folder for `config.csv` (syncs to the `bot_configuration` database table), and the `.gbkb/` folder for knowledge base documents (downloads and indexes for vector search). - -Bot configuration is stored in two PostgreSQL tables inside the `botserver` database. The `bot_configuration` table holds key-value pairs with columns `bot_id`, `config_key`, `config_value`, `config_type`, `is_encrypted`, and `updated_at`. The `gbot_config_sync` table tracks sync state with columns `bot_id`, `config_file_path`, `last_sync_at`, `file_hash`, and `sync_count`. - -The `config.csv` format is a plain CSV with no header: each line is `key,value`, for example `llm-provider,groq` or `theme-color1,#cc0000`. DriveMonitor syncs it when the file ETag changes in MinIO, on botserver startup, or after a restart. - -**Check config status:** Query `bot_configuration` via `sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "SELECT config_key, config_value FROM bot_configuration WHERE bot_id = (SELECT id FROM bots WHERE name = '') ORDER BY config_key;"`. Check sync state via the `gbot_config_sync` table. Inspect the bucket directly with `sudo incus exec drive -- /opt/gbo/bin/mc cat local/.gbai/.gbot/config.csv`. - -**Debug DriveMonitor:** Monitor live logs with `sudo incus exec system -- tail -f /opt/gbo/logs/out.log | grep -E "(DRIVE_MONITOR|check_gbot|config)"`. An empty `gbot_config_sync` table means DriveMonitor has not synced yet. If no new log entries appear after 30 seconds, the loop may be stuck — restart botserver with systemctl to clear the state. - -**Common config issues:** If config.csv is missing from the bucket, create and upload it with `mc cp`. If the database shows stale values, restart botserver to force a fresh sync, or as a temporary fix update the database directly with `UPDATE bot_configuration SET config_value = 'groq', updated_at = NOW() WHERE ...`. To force a re-sync without restarting, copy config.csv over itself with `mc cp local/... local/...` to change the ETag. - ---- - -## MinIO (Drive) Operations - -All bot files live in MinIO buckets. Use the `mc` CLI at `/opt/gbo/bin/mc` from inside the `drive` container. The bucket structure per bot is: `{bot}.gbai/` as root, `{bot}.gbai/{bot}.gbdialog/` for BASIC scripts, `{bot}.gbai/{bot}.gbot/` for config.csv, and `{bot}.gbai/{bot}.gbkb/` for knowledge base folders. - -Common mc commands: `mc ls local/` lists all buckets; `mc ls local/botname.gbai/` lists a bucket; `mc cat local/.../start.bas` prints a file; `mc cp local/.../file /tmp/file` downloads; `mc cp /tmp/file local/.../file` uploads (this triggers DriveMonitor recompile); `mc stat local/.../config.csv` shows ETag and metadata; `mc mb local/newbot.gbai` creates a bucket; `mc rb local/oldbot.gbai` removes an empty bucket. - -If mc is not found, use the full path `/opt/gbo/bin/mc`. If alias `local` is not configured, check with `mc config host list`. If MinIO is not running, check with `sudo incus exec drive -- systemctl status minio`. - ---- - -## Vault Security Architecture - -HashiCorp Vault is the single source of truth for all secrets. Botserver reads `VAULT_ADDR` and `VAULT_TOKEN` from `/opt/gbo/bin/.env` at startup, initializes a TLS/mTLS client, then reads credentials from Vault paths. If Vault is unavailable, it falls back to defaults. The `.env` file must only contain `VAULT_*` variables plus `PORT`, `DATA_DIR`, `WORK_DIR`, and `LOAD_ONLY`. - -**Global Vault paths:** `gbo/tables` holds PostgreSQL credentials; `gbo/drive` holds MinIO access key and secret; `gbo/cache` holds Valkey password; `gbo/llm` holds LLM URL and API keys; `gbo/directory` holds Zitadel config; `gbo/email` holds SMTP credentials; `gbo/vectordb` holds Qdrant config; `gbo/jwt` holds JWT signing secret; `gbo/encryption` holds the master encryption key. Organization-scoped secrets follow patterns like `gbo/orgs/{org_id}/bots/{bot_id}` and tenant infrastructure uses `gbo/tenants/{tenant_id}/infrastructure`. - -**Credential resolution:** For any service, botserver checks the most specific Vault path first (org+bot level), falls back to a default bot path, then falls back to the global path, and only uses environment variables as a last resort in development. - -**Verify Vault health:** `sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health` should return JSON with `"sealed":false`. To read a secret: set `VAULT_ADDR`, `VAULT_TOKEN`, and `VAULT_CACERT` then run `vault kv get secret/gbo/tables`. To test from the system container, use curl with `--cacert /opt/gbo/conf/system/certificates/ca/ca.crt` and `-H "X-Vault-Token: "`. - -**init.json** is stored at `/opt/gbo/bin/botserver-stack/conf/vault/vault-conf/init.json` and contains the root token and 5 unseal keys (3 needed to unseal). Never commit this file to git. Store it encrypted in a secure location. - -**Vault troubleshooting — cannot connect:** Check that the vault container's systemd unit is running, verify the token in `.env` is not expired with `vault token lookup`, confirm the CA cert path in `.env` matches the actual file location, and test network connectivity from system to vault container. To generate a new token: `vault token create -policy="botserver" -ttl="8760h" -format=json` then update `.env` and restart botserver. - -**Get database credentials from Vault v2 API:** -```bash -ssh user@ "sudo incus exec system -- curl -s --cacert /opt/gbo/conf/system/certificates/ca/ca.crt -H 'X-Vault-Token: ' https://:8200/v1/secret/data/gbo/tables 2>/dev/null" -``` - -**Vault troubleshooting — secrets missing:** Run `vault kv get secret/gbo/tables` (and other paths) to check if secrets exist. If a path returns NOT FOUND, add secrets with `vault kv put secret/gbo/tables host= port=5432 database=botserver username=gbuser password=` and similar for other paths. - -**Vault sealed after restart:** Run `vault operator unseal `, repeat with key2 and key3 (3 of 5 keys from init.json), then verify with `vault status`. - -**TLS certificate errors:** Confirm `/opt/gbo/conf/system/certificates/ca/ca.crt` exists in the system container. If missing, copy it from the vault container using `incus file pull vault/opt/gbo/conf/vault/ca.crt /tmp/ca.crt` then place it at the expected path. - -**Vault snapshots:** Stop vault, run `sudo incus snapshot create vault backup-$(date +%Y%m%d-%H%M)`, start vault. Restore with `sudo incus snapshot restore vault ` while vault is stopped. - ---- - -## DNS Management - -### Updating DNS Records - -**CRITICAL:** When updating DNS zone files, you MUST: - -1. **Update the serial number** in the SOA record (format: YYYYMMDDNN) -2. **Run sync-zones.sh** to propagate changes to secondary nameservers -3. **Anonymize IPs and credentials** in all documentation and logs - -**Workflow:** -```bash -# 1. Edit zone file -sudo incus exec dns -- nano /opt/gbo/data/pragmatismo.com.br.zone - -# 2. Update serial (YYYYMMDDNN format) -# Example: 2026041801 (April 18, 2026, change #1) -sudo incus exec dns -- sed -i 's/2026041801/2026041802/' /opt/gbo/data/pragmatismo.com.br.zone - -# 3. Reload CoreDNS -sudo incus exec dns -- pkill -HUP coredns - -# 4. Sync to secondary NS -sudo /opt/gbo/bin/sync-zones.sh - -# 5. Verify on secondary -ssh -o StrictHostKeyChecking=no -i /home/administrator/.ssh/id_ed25519 admin@ 'getent hosts ' -``` - -**Zone File Location:** `/opt/gbo/data/.zone` in the `dns` container - -**Sync Script:** `/opt/gbo/bin/sync-zones.sh` - copies zone files to secondary NS (3.218.224.38) - -**⚠️ Security Rules:** -- NEVER include real IPs in documentation - use `` or `10.x.x.x` -- NEVER include credentials - use `` or `` -- NEVER commit zone files with secrets to git - ---- - -### Adding New Subdomains (HTTPS with Caddy) - -**CRITICAL:** When adding new subdomains that need HTTPS, follow this order: - -1. **Add DNS record FIRST** (see above workflow) -2. **Wait for DNS propagation** (can take up to 1 hour) -3. **Add Caddy config** - Caddy will automatically obtain Let's Encrypt certificate - -**Complete Workflow:** -```bash -# 1. Add DNS record (update serial, sync zones) -sudo incus exec dns -- nano /opt/gbo/data/pragmatismo.com.br.zone -# Add: news IN A -sudo incus exec dns -- sed -i 's/2026041801/2026041802/' /opt/gbo/data/pragmatismo.com.br.zone -sudo incus exec dns -- pkill -HUP coredns -sudo /opt/gbo/bin/sync-zones.sh - -# 2. Verify DNS propagation (wait until this works) -dig @9.9.9.9 news.pragmatismo.com.br A +short - -# 3. Add Caddy config (AFTER DNS is working) -sudo sh -c 'cat >> /opt/gbo/conf/config << EOF - -news.pragmatismo.com.br { - import tls_config - reverse_proxy http://: { - header_up Host {host} - header_up X-Real-IP {remote} - header_up X-Forwarded-Proto https - } -} -EOF' - -# 4. Restart Caddy -sudo incus exec proxy -- systemctl restart proxy - -# 5. Wait for certificate (Caddy will auto-obtain from Let's Encrypt) -# Check logs: sudo incus exec proxy -- tail -f /opt/gbo/logs/access.log -``` - -**⚠️ Important:** -- Caddy will fail to obtain certificate if DNS is not propagated -- Wait up to 1 hour for DNS propagation before adding Caddy config -- Check Caddy logs for "challenge failed" errors - indicates DNS not ready -- Certificate is automatically renewed by Caddy - ---- - -## Troubleshooting Quick Reference - -**botserver won't start:** Run `sudo incus exec system -- ldd /opt/gbo/bin/botserver | grep "not found"` to check for missing libraries. Run `sudo incus exec system -- timeout 10 /opt/gbo/bin/botserver 2>&1` to see startup errors. Confirm `/opt/gbo/work/` exists and is accessible. - -**botui can't reach botserver:** Check that the `ui.service` systemd file has `BOTSERVER_URL=http://localhost:5858` — not the external HTTPS URL. Fix with `sed -i 's|BOTSERVER_URL=.*|BOTSERVER_URL=http://localhost:5858|'` on the service file, then `systemctl daemon-reload` and `systemctl restart ui`. - -**Suggestions not showing:** Confirm bot `.bas` files exist in MinIO Drive under `{bot}.gbai/{bot}.gbdialog/`. Check logs for compilation errors. Clear the AST cache in `/opt/gbo/work/` and restart botserver. - -**IPv6 DNS timeouts on external APIs (Groq, Cloudflare):** The container's DNS may return AAAA records without IPv6 connectivity. The container should have `IPV6=no` in its network config and `gai.conf` set appropriately. Check for `RES_OPTIONS=inet4` in `botserver.service` if issues persist. - -**Logs show development paths instead of Drive:** Botserver is using hardcoded dev paths. Check `.env` has `DATA_DIR=/opt/gbo/work/` and `WORK_DIR=/opt/gbo/work/`, verify the systemd unit has `EnvironmentFile=/opt/gbo/bin/.env`, and confirm Vault is reachable so service discovery works. Expected startup log lines include `info watcher:Watching data directory /opt/gbo/work` and `info botserver:BotServer started successfully on port 5858`. - -**Migrations not running after push:** If `stat /opt/gbo/bin/botserver` shows old timestamp and `__diesel_schema_migrations` table has no new entries, CI did not rebuild. Make a trivial code change (e.g., add a comment) in botserver and push again to force rebuild. - ---- - -## Drive (MinIO) File Operations Cheatsheet - -All `mc` commands run inside the `drive` container with `PATH` set: `sudo incus exec drive -- bash -c 'export PATH=/opt/gbo/bin:$PATH && mc '`. If `local` alias is missing, create it with credentials from Vault path `gbo/drive`. - -**List bucket contents recursively:** `mc ls local/.gbai/ --recursive` - -**Read a file from Drive:** `mc cat local/.gbai/.gbdialog/start.bas` - -**Download a file:** `mc cp local/.gbai/.gbdialog/start.bas /tmp/start.bas` - -**Upload a file to Drive (triggers DriveMonitor recompile):** Transfer file to host via `scp`, push into drive container with `sudo incus file push /tmp/file drive/tmp/file`, then `mc put /tmp/file local/.gbai/.gbdialog/start.bas` - -**Full upload workflow example — updating config.csv:** -```bash -# 1. Download current config from Drive -ssh user@host "sudo incus exec drive -- bash -c 'export PATH=/opt/gbo/bin:\$PATH && mc cat local/botname.gbai/botname.gbot/config.csv'" > /tmp/config.csv - -# 2. Edit locally (change model, keys, etc.) -sed -i 's/llm-model,old-model/llm-model,new-model/' /tmp/config.csv - -# 3. Push edited file back to Drive -scp /tmp/config.csv user@host:/tmp/config.csv -ssh user@host "sudo incus file push /tmp/config.csv drive/tmp/config.csv" -ssh user@host "sudo incus exec drive -- bash -c 'export PATH=/opt/gbo/bin:\$PATH && mc put /tmp/config.csv local/botname.gbai/botname.gbot/config.csv'" - -# 4. Wait ~15 seconds, then verify DriveMonitor picked up the change -ssh user@host "sudo incus exec system -- bash -c 'grep -i \"Model:\" /opt/gbo/logs/err.log | tail -3'" -``` - -**Force re-sync of config.csv** (change ETag without content change): `mc cp local/.gbai/.gbot/config.csv local/.gbai/.gbot/config.csv` - -**Create a new bot bucket:** `mc mb local/newbot.gbai` - -**Check MinIO health:** `sudo incus exec drive -- bash -c '/opt/gbo/bin/mc admin info local'` - ---- - -## Logging Quick Reference - -**Application logs** (searchable, timestamped, most useful): `sudo incus exec system -- tail -f /opt/gbo/logs/err.log` (errors and debug) or `/opt/gbo/logs/out.log` (stdout). The systemd journal only captures process lifecycle events, not application output. - -**Search logs for specific bot activity:** `grep -i "botname\|llm\|Model:\|KB\|USE_KB\|drive_monitor" /opt/gbo/logs/err.log | tail -30` - -**Check which LLM model a bot is using:** `grep "Model:" /opt/gbo/logs/err.log | tail -5` - -**Check DriveMonitor config sync:** `grep "check_gbot\|config.csv\|should_sync" /opt/gbo/logs/err.log | tail -20` - -**Check KB/vector operations:** `grep -i "gbkb\|qdrant\|embedding\|index" /opt/gbo/logs/err.log | tail -20` - -**Live tail with filter:** `sudo incus exec system -- bash -c 'tail -f /opt/gbo/logs/err.log | grep --line-buffered -i "botname\|error\|KB"'` - ---- - -## Program Access Cheatsheet - -| Program | Container | Path | Notes | -|---------|-----------|------|-------| -| botserver | system | `/opt/gbo/bin/botserver` | Run via systemctl only | -| botui | system | `/opt/gbo/bin/botui` | Run via systemctl only | -| mc (MinIO Client) | drive | `/opt/gbo/bin/mc` | Must set `PATH=/opt/gbo/bin:$PATH` | -| psql | tables | `/usr/bin/psql` | `psql -h localhost -U postgres -d botserver` | -| vault | vault | `/opt/gbo/bin/vault` | Needs `VAULT_ADDR`, `VAULT_TOKEN`, `VAULT_CACERT` | -| zitadel | directory | `/opt/gbo/bin/zitadel` | Runs as root on port 8080 internally | - -**Quick psql query — bot config:** `sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "SELECT config_key, config_value FROM bot_configuration WHERE bot_id = (SELECT id FROM bots WHERE name = 'botname') ORDER BY config_key;"` - -**Quick psql query — active KBs for session:** `sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "SELECT * FROM session_kb_associations WHERE session_id = '' AND is_active = true;"` - ---- - -## BASIC Compilation Architecture - -Compilation and runtime are now strictly separated. **Compilation** happens only in `BasicCompiler` inside DriveMonitor when it detects `.bas` file changes. The output is a fully preprocessed `.ast` file written to `work/.gbai/.gbdialog/.ast`. **Runtime** (start.bas, TOOL_EXEC, automation, schedule) loads only `.ast` files and calls `ScriptService::run()` which does `engine.compile() + eval_ast_with_scope()` on the already-preprocessed Rhai source — no preprocessing at runtime. - -The `.ast` file has all transforms applied: `USE KB "cartas"` becomes `USE_KB("cartas")`, `IF/END IF` → `if/{ }`, `WHILE/WEND` → `while/{ }`, `BEGIN TALK/END TALK` → function calls, `SAVE`, `FOR EACH/NEXT`, `SELECT CASE`, `SET SCHEDULE`, `WEBHOOK`, `USE WEBSITE`, `LLM` keyword expansion, variable predeclaration, and keyword lowercasing. Runtime never calls `compile()`, `compile_tool_script()`, or `compile_preprocessed()` — those methods no longer exist. - -**Tools (TOOL_EXEC) load `.ast` only** — there is no `.bas` fallback. If an `.ast` file is missing, the tool fails with "Failed to read tool .ast file". DriveMonitor must have compiled it first. - -**Suggestion deduplication** uses Redis `SADD` (set) instead of `RPUSH` (list). This prevents duplicate suggestion buttons when `start.bas` runs multiple times per session. The key format is `suggestions:{bot_id}:{session_id}` and `get_suggestions` uses `SMEMBERS` to read it. - ---- - -## Container Quick Reference - -| Container | Critical | Check Command | Restart Command | -|-----------|----------|---------------|-----------------| -| system | YES | `systemctl is-active botserver` | `systemctl restart botserver` | -| tables | YES | `pgrep -f postgres` | `systemctl restart postgresql` | -| vault | YES | `curl -ksf https://localhost:8200/v1/sys/health` | `systemctl restart vault` | -| drive | YES | `pgrep -f minio` | `systemctl restart minio` | -| cache | HIGH | `pgrep -f valkey` | `systemctl restart valkey` | -| directory | HIGH | `curl -sf http://localhost:8080/debug/healthz` | `systemctl restart directory` | -| alm-ci | MED | `pgrep -f forgejo` | manual restart | -| llm | MED | `curl -sf http://localhost:8081/health` | `systemctl restart llm` | -| vector_db | LOW | `curl -sf http://localhost:6333/healthz` | `systemctl restart qdrant` | - ---- - -## Log Tailing Commands - -```bash -# Live error monitoring -sudo incus exec system -- tail -f /opt/gbo/logs/err.log | grep -i "error\|panic\|failed" - -# Bot-specific activity -sudo incus exec system -- tail -f /opt/gbo/logs/err.log | grep -i "" - -# DriveMonitor activity -sudo incus exec system -- tail -f /opt/gbo/logs/err.log | grep -i "drive\|config" - -# LLM calls -sudo incus exec system -- tail -f /opt/gbo/logs/err.log | grep -i "model\|llm\|groq" - -# CI runner -sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log -``` - ---- - -## Health Endpoint Monitoring - -Set up a simple cron job to alert if health fails: - -```bash -# Add to host crontab (crontab -e) -*/5 * * * * curl -sf https:///api/health || echo "ALERT: Health check failed at $(date)" >> /var/log/gbo-health.log -``` - ---- - -## Troubleshooting Quick Reference - -### Container Won't Start (No IPv4) - -**Symptom:** Container shows empty IPV4 column in `sudo incus list` - -**Diagnose:** -```bash -sudo incus list -c n4 -sudo incus exec -- ip addr show eth0 -``` - -**Fix:** -```bash -# 1. Stop container -sudo incus stop - -# 2. Set static IP -sudo incus config device set eth0 ipv4.address - -# 3. Configure network inside -sudo incus exec -- bash -c 'cat > /etc/network/interfaces << EOF -auto lo -iface lo inet loopback - -auto eth0 -iface eth0 inet static -address -netmask 255.255.255.0 -gateway -dns-nameservers 8.8.8.8 8.8.4.4 -EOF' - -# 4. Restart -sudo incus restart - -# 5. Verify -sudo incus exec -- ip addr show eth0 -``` - ---- - -### CI/ALM Permission Errors - -**Symptom:** `/tmp permission denied` during CI build - -**Fix:** -```bash -# On alm-ci container -sudo incus exec alm-ci -- chmod 1777 /tmp -sudo incus exec alm-ci -- touch /tmp/build.log && chmod 666 /tmp/build.log - -# Check runner user -sudo incus exec alm-ci -- ls -la /opt/gbo/ - -# Fix ownership -sudo incus exec alm-ci -- chown -R gbuser:gbuser /opt/gbo/bin/ -sudo incus exec alm-ci -- chown -R gbuser:gbuser /opt/gbo/work/ -``` - -**CI Runner Down:** -```bash -sudo incus exec alm-ci -- pkill -9 forgejo -sleep 2 -sudo incus exec alm-ci -- bash -c 'cd /opt/gbo/bin && nohup ./forgejo-runner daemon --config config.yaml >> /opt/gbo/logs/forgejo-runner.log 2>&1 &' -``` - ---- - -### MinIO (Drive) Operations with `mc` - -**Setup:** -```bash -# Access drive container -sudo incus exec drive -- bash - -# Set PATH -export PATH=/opt/gbo/bin:$PATH - -# Verify mc works -mc --version -``` - -**Common Commands:** -```bash -# List all buckets -mc ls local/ - -# List bot bucket -mc ls local/.gbai/ - -# Read start.bas -mc cat local/.gbai/.gbdialog/start.bas - -# Download file -mc cp local/.gbai/.gbdialog/config.csv /tmp/config.csv - -# Upload file (triggers DriveMonitor) -mc cp /tmp/config.csv local/.gbai/.gbot/config.csv - -# Force re-sync (change ETag) -mc cp local/.gbai/.gbot/config.csv local/.gbai/.gbot/config.csv - -# Create new bucket -mc mb local/newbot.gbai - -# Check MinIO health -mc admin info local -``` - -**If `local` alias missing:** -```bash -# Create alias -mc alias set local http://localhost:9000 -``` - ---- - -### Forgejo ALM Database Operations - -**Access ALM database (PROD-ALM):** -```bash -# On tables container -sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -``` - -**Common Queries:** -```sql --- Check CI runs -SELECT id, status, commit_sha, created FROM action_run ORDER BY id DESC LIMIT 10; - --- Status codes: 0=pending, 1=success, 2=failure, 3=cancelled, 6=running - --- Check specific run jobs -SELECT id, status, name FROM action_run_job WHERE run_id = ; - --- Reset stuck run -UPDATE action_task SET status = 0 WHERE id = ; -UPDATE action_run_job SET status = 0 WHERE run_id = ; -UPDATE action_run SET status = 0 WHERE id = ; - --- Check runner token -SELECT * FROM action_runner_token; - --- List runners -SELECT * FROM action_runner; -``` - -**Check CI from host:** -```bash -export PGPASSWORD= -sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -c "SELECT id, status, created FROM action_run ORDER BY id DESC LIMIT 5;" -``` - ---- - -### Zitadel API v2 Operations - -**Important:** Always use **v2 API** - v1 is deprecated and non-functional. - -**Get PAT:** -```bash -PAT=$(sudo incus exec directory -- cat /opt/gbo/conf/directory/admin-pat.txt) -``` - -**Common Operations:** - -**Create User (v2):** -```bash -curl -X POST "http://:8080/v2/users/human" \ - -H "Content-Type: application/json" \ - -H "Authorization: Bearer $PAT" \ - -H "Host: " \ - -d '{ - "username": "newuser", - "profile": {"givenName": "New", "familyName": "User"}, - "email": {"email": "user@example.com", "isVerified": true}, - "password": {"password": "", "changeRequired": false} - }' -``` - -**List Users (v2):** -```bash -curl -X POST "http://:8080/v2/users" \ - -H "Content-Type: application/json" \ - -H "Authorization: Bearer $PAT" \ - -H "Host: " \ - -d '{"query": {"offset": 0, "limit": 100}}' -``` - -**Create Organization (v2):** -```bash -curl -X POST "http://:8080/v2/organizations" \ - -H "Content-Type: application/json" \ - -H "Authorization: Bearer $PAT" \ - -H "Host: " \ - -d '{"name": "organization-name"}' -``` - -**Add Domain to Org (v2):** -```bash -curl -X POST "http://:8080/v2/organizations//domains" \ - -H "Content-Type: application/json" \ - -H "Authorization: Bearer $PAT" \ - -H "Host: " \ - -d '{"domainName": "example.com"}' -``` - -**⚠️ Critical:** Always include `-H "Host: "` header or API returns 404. - ---- - -### Common Errors & Quick Fixes - -| Error | Cause | Fix | -|-------|-------|-----| -| `No IPv4 on container` | DHCP failed | Set static IP (see above) | -| `/tmp permission denied` | Wrong permissions | `chmod 1777 /tmp` | -| `Errors.Token.Invalid (AUTH-7fs1e)` | Zitadel PAT expired | Regenerate via console | -| `failed SASL auth` | Wrong DB password | Check Vault credentials | -| `GLIBC_2.39 not found` | Wrong build environment | Rebuild in system container | -| `connection refused` | Service down | `systemctl restart ` | -| `exec format error` | Architecture mismatch | Recompile for target arch | -| `address already in use` | Port conflict | `lsof -i :` | -| `certificate verify failed` | Wrong CA cert | Copy from vault container | -| `DNS lookup failed` | No IPv4 connectivity | Check network config | - ---- - -## Contact Escalation - -If quick fixes don't work: - -1. Capture logs: `sudo incus exec system -- tar czf /tmp/debug-$(date +%Y%m%d).tar.gz /opt/gbo/logs/` -2. Check AGENTS.md for development troubleshooting -3. Review recent commits for breaking changes -4. Consider snapshot rollback (last resort) diff --git a/botbook/src/12-ecosystem-reference/ci-cd.md b/botbook/src/12-ecosystem-reference/ci-cd.md index 5340a605..d1943755 100644 --- a/botbook/src/12-ecosystem-reference/ci-cd.md +++ b/botbook/src/12-ecosystem-reference/ci-cd.md @@ -1 +1,240 @@ # CI/CD Integration + +General Bots uses Forgejo (ALM) as Git server with Forgejo Runner for CI/CD. The runner lives in a separate container (alm-ci) and builds are triggered by pushing to the ALM repository. + +--- + +## Architecture + +| Component | Container | Port | Purpose | +|-----------|-----------|------|---------| +| Forgejo (ALM) | alm | 4747 | Git server, workflow definitions | +| Forgejo Runner | alm-ci | - | CI/CD executor | +| PostgreSQL | tables | 5432 | CI run database (PROD-ALM) | +| BotServer (deploy target) | system | 8080 | Receives built binary | + +**Deploy flow:** Push to ALM → Runner picks up job → cargo build → tar+gzip binary → SSH to system container → extract to /opt/gbo/bin/botserver → restart via systemctl + +--- + +## Status Codes + +| Code | Status | +|------|--------| +| 0 | pending | +| 1 | success | +| 2 | failure | +| 3 | cancelled | +| 6 | running | + +--- + +## Database Queries + +All queries run against the `PROD-ALM` database: + +```bash +sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM +``` + +### List Recent Runs + +```sql +SELECT id, title, workflow_id, status, + to_timestamp(created) AS created_at +FROM action_run +ORDER BY id DESC LIMIT 10; +``` + +### Get Jobs for a Run + +```sql +SELECT id, name, status, task_id +FROM action_run_job +WHERE run_id = ; +``` + +### Get Step-Level Status + +```sql +SELECT name, status, log_index, log_length +FROM action_task_step +WHERE task_id = +ORDER BY index; +``` + +### Check Runner Token + +```sql +SELECT * FROM action_runner_token; +``` + +### List Registered Runners + +```sql +SELECT * FROM action_runner; +``` + +### Reset a Stuck Run (status 6) + +```sql +UPDATE action_task SET status = 0 WHERE id = ; +UPDATE action_run_job SET status = 0 WHERE run_id = ; +UPDATE action_run SET status = 0 WHERE id = ; +``` + +--- + +## Reading Build Logs + +Build logs are stored as zstd-compressed files in the alm container. The database tracks the filename. + +### Step-by-Step + +```bash +# 1. Get log filename from database +sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM \ + -c "SELECT log_filename FROM action_task WHERE id = ;" + +# 2. Pull compressed log from alm container +sudo incus file pull alm/opt/gbo/data/data/actions_log/ /tmp/ci-log.log.zst + +# 3. Decompress and read +zstd -d /tmp/ci-log.log.zst -o /tmp/ci-log.log +cat /tmp/ci-log.log +``` + +### One-Liner: Read Latest Failed Run + +```bash +TASK_ID=$(sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -t -c \ + "SELECT at.id FROM action_task at JOIN action_run_job arj ON at.job_id = arj.id \ + JOIN action_run ar ON arj.run_id = ar.id \ + WHERE ar.status = 2 ORDER BY at.id DESC LIMIT 1;" | tr -d ' ') +LOG_FILE=$(sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -t -c \ + "SELECT log_filename FROM action_task WHERE id = $TASK_ID;" | tr -d ' ') +sudo incus file pull "alm/opt/gbo/data/data/actions_log/$LOG_FILE" /tmp/ci-log.log.zst +zstd -d /tmp/ci-log.log.zst -o /tmp/ci-log.log 2>/dev/null && cat /tmp/ci-log.log +``` + +--- + +## Real-Time Monitoring + +```bash +# Tail runner logs (live but ephemeral) +sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log + +# Watch for new runs +sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM \ + -c "SELECT id, title, workflow_id, status FROM action_run ORDER BY id DESC LIMIT 5;" + +# Check runner logs for build activity +sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log | grep -E "Clone|Build|Deploy|Success|Failure" +``` + +--- + +## Build Timing + +| Phase | Duration | +|-------|----------| +| Rust compilation (cold) | 2-5 minutes | +| Rust compilation (incremental) | 30-60 seconds | +| First build (dependencies) | Downloads ~200 crates | +| Deploy step | ~5 seconds | +| Total CI time | 2-6 minutes depending on cache | + +--- + +## Verify Deployment + +```bash +# Check binary timestamp +sudo incus exec system -- stat -c '%y' /opt/gbo/bin/botserver + +# Check running version +sudo incus exec system -- /opt/gbo/bin/botserver --version + +# Check systemd status +sudo incus exec system -- systemctl status botserver --no-pager + +# Health endpoint +curl -sf https:///api/health && echo "OK" || echo "FAILED" +``` + +--- + +## Runner Configuration + +- **Binary:** /opt/gbo/bin/forgejo-runner +- **Config:** /opt/gbo/bin/config.yaml +- **Systemd:** /etc/systemd/system/alm-ci-runner.service +- **User:** gbuser (uid 1000) +- **Workspace:** /opt/gbo/data/ +- **SSH deploy key:** /home/gbuser/.ssh/id_ed25519 +- **sccache:** /usr/local/bin/sccache (via RUSTC_WRAPPER=sccache) +- **Cargo cache:** /home/gbuser/.cargo/ +- **Rustup:** /home/gbuser/.rustup/ + +### Register New Runner + +```bash +forgejo-runner register \ + --instance http://:4747 \ + --token \ + --name gbo \ + --labels ubuntu-latest:docker://node:20-bookworm \ + --no-interactive +``` + +> Token from action_runner_token table in PROD-ALM database. + +### Restart Runner + +```bash +sudo incus exec alm-ci -- pkill -9 forgejo +sleep 2 +sudo incus exec alm-ci -- bash -c 'cd /opt/gbo/bin && nohup ./forgejo-runner daemon --config config.yaml >> /opt/gbo/logs/forgejo-runner.log 2>&1 &' +``` + +--- + +## Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| Runner not connecting | Wrong ALM port (3000 vs 4747) | Use port 4747 in runner registration | +| `registration file not found` | Missing/wrong .runner file | Delete .runner and re-register | +| `unsupported protocol scheme` | Wrong .runner JSON format | Delete .runner and re-register | +| `connection refused` to ALM | iptables or ALM down | Check `ss -tlnp \| grep 4747` | +| CI not picking up jobs | Runner not registered or labels mismatch | Check runner labels match workflow runs-on | +| `/tmp permission denied` | Wrong permissions on alm-ci | `chmod 1777 /tmp` on alm-ci | +| Build stuck at status 6 | DB race condition | Reset status in action_task/action_run tables | +| GLIBC mismatch | Built in wrong environment | Rebuild inside system container (Debian 12, glibc 2.36) | +| Binary not updating | CI did not rebuild | Push trivial change to force rebuild | +| Migrations not running | Binary not updated | Check stat timestamp, push code change | + +--- + +## Deploy Workflow + +```bash +# 1. Push submodules first +cd botserver && git push alm main && git push origin main +cd ../botui && git push alm main && git push origin main +cd ../botlib && git push alm main && git push origin main + +# 2. Push main repo +cd .. && git add botserver botui botlib +git commit -m "Update submodules: " +git push alm main && git push origin main + +# 3. Wait for CI (~2-6 min) +# Monitor via runner logs or database queries + +# 4. Verify deployment +sudo incus exec system -- stat -c '%y' /opt/gbo/bin/botserver +sudo incus exec system -- systemctl status botserver --no-pager +curl -sf https:///api/health +```