gb/PROD.md
Rodrigo Rodriguez (Pragmatismo) cee8aeee34 Fix DriveMonitor dispatch failure - avoid double Arc in tokio::spawn
- Added static save_file_states_static() helper method
- Changed tokio::spawn calls to use Arc::clone instead of Arc::new(self.clone())
- This prevents double Arc wrapping which causes 'dispatch failure' errors
- Fixes config.csv not syncing from bucket to database for salesianos/default bots
2026-04-10 11:20:31 -03:00


Production Environment Guide

⚠️ CRITICAL PRODUCTION RULES

READ THIS FIRST:

🚫 NEVER Start Services Directly

In production, NEVER start botserver or botui directly. Always use systemctl:

# ❌ NEVER DO THIS IN PRODUCTION:
/opt/gbo/bin/botserver              # Wrong
./botserver                         # Wrong
/opt/gbo/bin/botserver &            # Wrong

# ✅ ALWAYS USE THIS:
sudo incus exec system -- systemctl start botserver
sudo incus exec system -- systemctl restart botserver
sudo incus exec system -- systemctl stop botserver
sudo incus exec system -- systemctl status botserver

Why:

  • systemctl loads /opt/gbo/bin/.env (Vault credentials, paths, etc.)
  • Direct execution skips environment variables → services fail
  • systemctl manages auto-restart, logging, and dependencies
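
The rule above can also be enforced inside a launcher script. The following is a minimal sketch, not something that exists in the repository: it assumes systemd exports INVOCATION_ID to the processes of a service it starts, and `require_systemd` is a hypothetical helper name. Treat it as a heuristic, since other systemd-managed sessions may also carry that variable.

```shell
#!/bin/sh
# Hypothetical guard: succeed only when the current process appears to have
# been started by systemd, which sets INVOCATION_ID for each service run.
require_systemd() {
    if [ -z "${INVOCATION_ID:-}" ]; then
        echo "refusing to run: start this service with systemctl" >&2
        return 1
    fi
    return 0
}
```

A wrapper would call `require_systemd || exit 1` before launching the binary, so a direct `./botserver` style start fails fast instead of running without its environment.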

🔐 Security Rules

  • NEVER push secrets to git (API keys, passwords, tokens)
  • NEVER commit init.json (Vault unseal keys)
  • ALWAYS use Vault for secrets (see Vault Security Architecture)
  • ONLY VAULT_* environment variables allowed in .env

🚢 Deployment Rules

  • NEVER deploy manually (scp, ssh copy) — use CI/CD only
  • NEVER push to ALM without asking first
  • ALWAYS push ALL submodules (botserver, botui, botlib) when pushing main repo
  • ALWAYS use systemctl to restart services after deployment

Infrastructure

Servers

| Host | IP | Purpose |
|------|----|---------|
| system | <main-server-ip> | Main botserver + botui container |
| alm-ci | <ci-runner-ip> | CI/CD runner (Forgejo Actions) |
| drive | <storage-server-ip> | Object storage |
| monitor | <monitor-server-ip> | Monitoring service |

Port Mapping (system container)

| Service | Internal Port | External URL |
|---------|---------------|--------------|
| botserver | 5858 | https://system.example.com |
| botui | 5859 | https://chat.example.com |

Access

# SSH to host
ssh admin@<host-ip>

# Execute inside system container
sudo incus exec system -- bash -c 'command'

# SSH from host to container (used by CI)
ssh -o StrictHostKeyChecking=no system "command"

Services

botserver.service

  • Binary: /opt/gbo/bin/botserver
  • Port: 5858
  • User: gbuser
  • Logs: /opt/gbo/logs/out.log, /opt/gbo/logs/err.log
  • Config: /etc/systemd/system/botserver.service
  • Env: PORT=5858

ui.service

  • Binary: /opt/gbo/bin/botui
  • Port: 5859
  • Config: /etc/systemd/system/ui.service
  • Env: BOTSERVER_URL=http://localhost:5858
    • ⚠️ MUST be http://localhost:5858 — NOT https://system.example.com
    • Rust proxy runs server-side, needs direct localhost access
    • JS client uses relative URLs through chat.example.com

Data Directory

  • Path: /opt/gbo/data/
  • Structure: <botname>.gbai/<botname>.gbdialog/*.bas
  • Work dir: /opt/gbo/work/ (compiled .ast cache)
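
The layout above can be checked with a few lines of shell. This is a sketch under the assumption that every bot package is a directory named `<botname>.gbai` containing `<botname>.gbdialog`; `list_bots` is a hypothetical helper, not a botserver command.

```shell
# Print the name of every bot package found under a data directory.
# A bot named "foo" is expected at <data_dir>/foo.gbai/foo.gbdialog/.
list_bots() {
    data_dir=$1
    for pkg in "$data_dir"/*.gbai; do
        [ -d "$pkg" ] || continue          # skip when the glob matched nothing
        name=$(basename "$pkg" .gbai)
        if [ -d "$pkg/$name.gbdialog" ]; then
            echo "$name"
        else
            # Incomplete packages are reported on stderr, not listed.
            echo "$name (missing $name.gbdialog)" >&2
        fi
    done
}
```

Inside the system container it would be called as `list_bots /opt/gbo/data`.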

Stack Services (managed by botserver bootstrap)

  • Vault: Secrets management
  • PostgreSQL: Database (port 5432)
  • Valkey: Cache (port 6379, password auth)
  • MinIO: Object storage
  • Zitadel: Identity provider
  • LLM: llama.cpp

CI/CD Pipeline

Repositories

| Repo | ALM URL | GitHub URL |
|------|---------|------------|
| gb | https://alm.example.com/organization/gb.git | git@github.com:organization/gb.git |
| botserver | https://alm.example.com/organization/BotServer.git | git@github.com:organization/botserver.git |
| botui | https://alm.example.com/organization/BotUI.git | git@github.com:organization/botui.git |
| botlib | https://alm.example.com/organization/botlib.git | git@github.com:organization/botlib.git |

Push Order

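# 1. Push submodules first (all three, per the deployment rules)
cd botserver && git push alm main && git push origin main && cd ..
cd botui && git push alm main && git push origin main && cd ..
cd botlib && git push alm main && git push origin main && cd ..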

# 2. Update root workspace references
git add botserver botui botlib
git commit -m "Update submodules: <description>"
git push alm main && git push origin main

Build Environment

  • CI runner: ci-runner container (Debian Trixie, glibc 2.41)
  • Target: system container (Debian 12 Bookworm, glibc 2.36)
  • ⚠️ GLIBC MISMATCH: Building on CI runner produces binaries incompatible with system container
  • Solution: CI workflow transfers source to system container and builds there via SSH
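
The mismatch can be expressed as a simple version comparison. This is a conservative sketch: it assumes GNU `sort -V` is available and treats any build-side glibc newer than the target as incompatible, although a binary only fails in practice when it uses symbols newer than the target provides. `glibc_compatible` is a hypothetical helper name.

```shell
# Succeed when a binary linked against glibc $1 is expected to run on a host
# with glibc $2. glibc is backward compatible only, so the build version must
# not exceed the target version.
glibc_compatible() {
    build=$1
    target=$2
    # sort -V orders version strings numerically; the first line is the lower one.
    lowest=$(printf '%s\n%s\n' "$build" "$target" | sort -V | head -n 1)
    [ "$lowest" = "$build" ]
}
```

With the versions listed above, `glibc_compatible 2.41 2.36` fails, which is why the build has to happen inside the system container.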

Workflow File

  • Location: botserver/.forgejo/workflows/botserver.yaml
  • Triggers: Push to main branch
  • Steps:
    1. Setup workspace on CI runner (clone repos)
    2. Transfer source to system container via tar | ssh
    3. Build inside system container (matches glibc 2.36)
    4. Deploy binary inside container
    5. Verify botserver is running
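
Step 2 streams the source tree through a pipe rather than copying files one by one. The sketch below shows the pattern with the receiving command passed as arguments; `stream_tree` is a hypothetical helper, and the remote path in the comment is an assumption, not taken from the workflow file.

```shell
# Stream a source tree through a pipe into a receiving command.
# In CI the receiver could be, for example:
#   ssh system "tar -xf - -C /opt/gbo/ci"      (destination path assumed)
stream_tree() {
    src=$1
    shift
    # Archive the tree to stdout and hand it to the receiver on stdin.
    tar -C "$src" -cf - . | "$@"
}
```

Locally the same function can be exercised with `stream_tree ./botserver tar -C /tmp/copy -xf -`, which makes the transfer step testable without SSH.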

Common Operations

Check Service Status

# From host
sudo incus exec system -- systemctl status botserver --no-pager
sudo incus exec system -- systemctl status ui --no-pager

# Check if running
sudo incus exec system -- pgrep -f botserver
sudo incus exec system -- pgrep -f botui

View Logs

# Systemd journal
sudo incus exec system -- journalctl -u botserver --no-pager -n 50
sudo incus exec system -- journalctl -u ui --no-pager -n 50

# Application logs
sudo incus exec system -- tail -50 /opt/gbo/logs/out.log
sudo incus exec system -- tail -50 /opt/gbo/logs/err.log

# Live tail
sudo incus exec system -- tail -f /opt/gbo/logs/out.log

Restart Services

CRITICAL PRODUCTION RULE: In production, NEVER start botserver or botui directly. Always use systemctl to ensure proper initialization, environment loading, and logging.

sudo incus exec system -- systemctl restart botserver
sudo incus exec system -- systemctl restart ui

PROHIBITED in production:

# ❌ NEVER DO THIS IN PRODUCTION:
sudo incus exec system -- /opt/gbo/bin/botserver  # Wrong - no systemd integration
sudo incus exec system -- /opt/gbo/bin/botserver &  # Wrong - no service management
sudo incus exec system -- cd /opt/gbo/bin && ./botserver  # Wrong - missing env vars

# ✅ CORRECT - Always use systemctl:
sudo incus exec system -- systemctl start botserver
sudo incus exec system -- systemctl restart botserver
sudo incus exec system -- systemctl stop botserver
sudo incus exec system -- systemctl status botserver

Why:

  • systemctl loads /opt/gbo/bin/.env (via EnvironmentFile in service definition)
  • systemctl manages process lifecycle, auto-restart, and dependencies
  • systemctl sends logs to /opt/gbo/logs/out.log and /opt/gbo/logs/err.log
  • Direct execution skips environment variables and systemd service configuration

Manual Deploy (emergency)

# Stop the service (a plain killall would be undone by systemd auto-restart)
sudo incus exec system -- systemctl stop botserver

# Copy binary (from host CI workspace or local)
sudo incus exec system -- cp /opt/gbo/ci/botserver/target/debug/botserver /opt/gbo/bin/botserver
sudo incus exec system -- chmod +x /opt/gbo/bin/botserver
sudo incus exec system -- chown gbuser:gbuser /opt/gbo/bin/botserver

# Start service
sudo incus exec system -- systemctl start botserver

Transfer Bot Files to Production

# From local to prod host
tar czf /tmp/bots.tar.gz -C /opt/gbo/data <botname>.gbai
scp /tmp/bots.tar.gz admin@<host-ip>:/tmp/

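# From host to container (push the archive in, then extract it)
sudo incus file push /tmp/bots.tar.gz system/tmp/bots.tar.gz
sudo incus exec system -- bash -c 'tar xzf /tmp/bots.tar.gz -C /opt/gbo/data/'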

# Clear compiled cache
sudo incus exec system -- find /opt/gbo/data -name "*.ast" -delete
sudo incus exec system -- find /opt/gbo/work -name "*.ast" -delete

Snapshots

# List snapshots
sudo incus snapshot list system

# Restore snapshot
sudo incus snapshot restore system <snapshot-name>

DriveMonitor & Bot Configuration Sync

DriveMonitor Architecture

DriveMonitor is a background service that synchronizes bot files from MinIO (S3-compatible storage) to the local filesystem and database. It monitors three directories per bot:

| Directory | Purpose | Sync Behavior |
|-----------|---------|---------------|
| {bot}.gbai/{bot}.gbdialog/ | BASIC scripts (.bas) | Downloads and compiles on change |
| {bot}.gbai/{bot}.gbot/ | Configuration files | Syncs to bot_configuration table |
| {bot}.gbkb/ | Knowledge base documents | Downloads and indexes for vector search |

Bot Configuration Database Tables

bot_configuration (main config table)

-- Location: botserver database
SELECT * FROM bot_configuration WHERE bot_id = '<bot_uuid>';

-- Key columns:
-- - bot_id: Bot UUID (link to bots table)
-- - config_key: Configuration key (e.g., "llm-provider", "system-prompt")
-- - config_value: Configuration value
-- - config_type: Type (string, boolean, number)
-- - is_encrypted: Whether value is encrypted
-- - updated_at: Last modification timestamp

gbot_config_sync (sync tracking table)

-- Location: botserver database
-- Tracks config.csv sync status from bucket
SELECT * FROM gbot_config_sync g
  JOIN bots b ON g.bot_id = b.id
  WHERE b.name = 'salesianos';

-- Key columns:
-- - bot_id: Bot UUID
-- - config_file_path: Path to config.csv in bucket
-- - last_sync_at: Timestamp of last successful sync
-- - file_hash: ETag/MD5 of synced file
-- - sync_count: Number of times synced

config.csv Sync Process

File Locations:

  • Source: {bot}.gbai/{bot}.gbot/config.csv in MinIO bucket
  • Sync method: DriveMonitor → ConfigManager → bot_configuration table
  • Sync frequency: Every 10 seconds (DriveMonitor periodic check)

Sync Trigger Conditions:

  1. File ETag changes in MinIO
  2. Initial DriveMonitor startup
  3. Manual botserver restart
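
The first trigger relies on the ETag changing when the file content changes. A minimal sketch of that comparison follows, assuming the object was uploaded in a single part so that its ETag is the hex MD5 of the content; `etag_matches` is a hypothetical helper and is not how DriveMonitor itself is implemented.

```shell
# Succeed when the MD5 of a local file equals an S3-style ETag.
# Assumption: single-part upload, so the ETag is the hex MD5 of the content.
# Multipart ETags carry a "-N" suffix and will never match.
etag_matches() {
    file=$1
    etag=$(printf '%s' "$2" | tr -d '"')      # ETags are often returned quoted
    local_md5=$(md5sum "$file" | cut -d' ' -f1)
    [ "$local_md5" = "$etag" ]
}
```

One consequence worth keeping in mind: re-uploading a file with identical content leaves a single-part ETag unchanged.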

CSV Format:

llm-provider,groq
llm-api-key,sk-xxx
llm-url,http://localhost:8085
system-prompt-file,PROMPT.md
theme-color1,#cc0000
theme-title,MyBot
whatsapp-id,botname
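
The format is a plain key,value list without a header row. A small reader for it might look like the sketch below, which assumes that only the first comma separates key from value and that the file uses Unix line endings; `config_get` is a hypothetical helper, not part of ConfigManager.

```shell
# Print the value stored for a key in a two-column config.csv.
# Only the first comma separates key from value, so values may contain commas.
config_get() {
    file=$1
    key=$2
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            "$key",*)
                printf '%s\n' "${line#*,}"   # strip everything up to the first comma
                return 0
                ;;
        esac
    done < "$file"
    return 1                                  # key not present
}
```

For example, `config_get config.csv llm-provider` prints the provider name, and the non-zero exit status distinguishes a missing key from an empty value.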

Checking Bot Configuration Status

Method 1: Query bot_configuration table

# Get all config for a bot
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "
  SELECT b.name, bc.config_key, bc.config_value, bc.updated_at
  FROM bot_configuration bc
  JOIN bots b ON bc.bot_id = b.id
  WHERE b.name = 'salesianos'
  ORDER BY bc.config_key;
"

# Get specific LLM provider config
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "
  SELECT config_key, config_value, updated_at
  FROM bot_configuration
  WHERE bot_id = (
    SELECT id FROM bots WHERE name = 'salesianos'
  )
  AND config_key LIKE 'llm-%'
  ORDER BY config_key;
"

Method 2: Check DriveMonitor sync status

# Check if config.csv has been synced
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "
  SELECT b.name, gcs.last_sync_at, gcs.sync_count, gcs.config_file_path
  FROM gbot_config_sync gcs
  JOIN bots b ON gcs.bot_id = b.id
  WHERE b.name IN ('salesianos', 'default');
"

# Empty result = DriveMonitor hasn't synced config.csv yet
# If sync_count = 0, config.csv exists but hasn't been processed

Method 3: Direct MinIO inspection

# Check if config.csv exists in bucket
sudo incus exec drive -- /opt/gbo/bin/mc ls local/salesianos.gbai/salesianos.gbot/

# View config.csv contents
sudo incus exec drive -- /opt/gbo/bin/mc cat local/salesianos.gbai/salesianos.gbot/config.csv

# Check file ETag (for sync comparison)
sudo incus exec drive -- /opt/gbo/bin/mc stat local/salesianos.gbai/salesianos.gbot/config.csv

DriveMonitor Debugging Logs

Key log patterns to monitor

# Monitor DriveMonitor activity in real-time
sudo incus exec system -- tail -f /opt/gbo/logs/out.log | grep -E "(DRIVE_MONITOR|check_gbot|config)"

# Check for config.csv sync attempts
sudo incus exec system -- grep "check_gbot" /opt/gbo/logs/out.log | tail -20

# Check for config synchronization
sudo incus exec system -- grep "sync_gbot_config" /opt/gbo/logs/out.log | tail -20

# Check for DriveMonitor errors
sudo incus exec system -- grep -i "drive.*error" /opt/gbo/logs/err.log | tail -20

Expected successful sync logs

check_gbot: Checking bucket salesianos.gbai for config.csv changes
check_gbot: Found config.csv at path: salesianos.gbai/salesianos.gbot/config.csv
info config:Synced config.csv for bot <uuid> - updated 3 keys

Error patterns and meanings

# Config.csv not found in bucket
check_gbot: Config file not found or inaccessible: path/to/config.csv

# Sync to database failed
error config:Failed to sync_gbot_config: <database error>

# DriveMonitor not running
(no check_gbot logs in out.log)

# MinIO connection failed
error drive_monitor:S3/MinIO unavailable for bucket <bucket>
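
The patterns above can be turned into a quick triage filter. This is a sketch that matches on the message fragments quoted in this section; real log lines may carry extra prefixes, and `classify_drive_log` is a hypothetical helper name.

```shell
# Map one botserver log line to a short diagnosis label.
# Substring matches are used because timestamps or levels may precede the message.
classify_drive_log() {
    case $1 in
        *"Config file not found or inaccessible"*) echo "config-missing" ;;
        *"Failed to sync_gbot_config"*)            echo "db-sync-failed" ;;
        *"S3/MinIO unavailable"*)                  echo "minio-unavailable" ;;
        *"Synced config.csv"*)                     echo "synced" ;;
        *)                                         echo "unknown" ;;
    esac
}
```

Feeding it the last lines of out.log one at a time (for example from a `while read -r` loop) gives a compact view of which failure mode dominates.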

Common Issues and Fixes

Issue 1: config.csv not syncing to database

Symptoms:

  • gbot_config_sync table empty (0 rows)
  • LLM provider changes in bucket not reflected in bot behavior
  • Database shows old configuration values

Diagnosis:

# 1. Check if config.csv exists in bucket
sudo incus exec drive -- /opt/gbo/bin/mc ls local/salesianos.gbai/salesianos.gbot/

# 2. Check DriveMonitor logs for sync attempts
sudo incus exec system -- grep "check_gbot" /opt/gbo/logs/out.log | tail -10

# 3. Check if DriveMonitor is running for the bot
sudo incus exec system -- ps aux | grep botserver

Root Causes:

  1. config.csv missing from {bot}.gbai/{bot}.gbot/ folder
  2. DriveMonitor not started for the bot
  3. MinIO connection issues
  4. Database write permissions

Fixes:

# Case 1: Create missing config.csv
sudo incus exec drive -- bash -c '
cat > /tmp/config.csv << EOF
llm-provider,groq
llm-api-key,your-api-key
llm-url,http://localhost:8085
system-prompt-file,PROMPT.md
theme-color1,#cc0000
theme-title,Salesianos
EOF
/opt/gbo/bin/mc cp /tmp/config.csv local/salesianos.gbai/salesianos.gbot/config.csv
'

# Case 2: Restart botserver to reinitialize DriveMonitor
sudo incus exec system -- systemctl restart botserver

# Case 3: Force immediate sync by touching config.csv
sudo incus exec drive -- /opt/gbo/bin/mc cp local/salesianos.gbai/salesianos.gbot/config.csv local/salesianos.gbai/salesianos.gbot/config.csv

Issue 2: LLM provider changes not taking effect

Symptoms:

  • config.csv shows correct provider (e.g., groq)
  • Bot still uses old provider
  • Database shows old value

Diagnosis:

# Compare bucket vs database
BUCKET_PROVIDER=$(sudo incus exec drive -- /opt/gbo/bin/mc cat local/salesianos.gbai/salesianos.gbot/config.csv | grep "^llm-provider" | cut -d',' -f2)
# -A drops the column padding that -t alone leaves around the value
DB_PROVIDER=$(sudo incus exec tables -- psql -h localhost -U postgres -d botserver -t -A -c "
  SELECT config_value FROM bot_configuration
  WHERE bot_id = (SELECT id FROM bots WHERE name = 'salesianos')
  AND config_key = 'llm-provider';
")

echo "Bucket: $BUCKET_PROVIDER"
echo "Database: $DB_PROVIDER"

# Check last sync time
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -t -c "
  SELECT last_sync_at FROM gbot_config_sync
  WHERE bot_id = (SELECT id FROM bots WHERE name = 'salesianos');
"

Fix:

# If sync is stale (> 10 minutes), restart DriveMonitor
sudo incus exec system -- systemctl restart botserver

# Or manually update config value in database (temporary fix)
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "
  UPDATE bot_configuration
  SET config_value = 'groq', updated_at = NOW()
  WHERE bot_id = (SELECT id FROM bots WHERE name = 'salesianos')
  AND config_key = 'llm-provider';
"

Issue 3: DriveMonitor not checking for changes

Symptoms:

  • No new log entries after 30 seconds
  • File changes in bucket not detected
  • Bot compilation not happening after .bas file updates

Diagnosis:

# Check DriveMonitor loop logs
sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep "DRIVE_MONITOR.*Inside monitoring loop"

# Check if is_processing flag is stuck
sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep -E "(is_processing|monitoring loop)"

Fix:

# Restart botserver to clear stuck state
sudo incus exec system -- systemctl restart botserver

# Monitor startup logs to verify DriveMonitor started
sudo incus exec system -- tail -50 /opt/gbo/logs/out.log | grep "Drive Monitor"

Database Schema Reference

List all bot databases

sudo incus exec tables -- psql -h localhost -U postgres -d postgres -c "\l" | grep bot_

List tables in a specific bot database

sudo incus exec tables -- psql -h localhost -U postgres -d bot_salesianos -c "\dt"

List botserver management tables

sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "\dt" | grep -E "(bot|config|sync)"

Connection Methods Summary

| Method | Use Case | Command Pattern |
|--------|----------|-----------------|
| SSH to host | Initial access, file transfer | ssh admin@<host-ip> |
| incus exec | Execute inside container | sudo incus exec system -- command |
| psql direct | Database queries from container | sudo incus exec tables -- psql ... |
| mc (MinIO CLI) | Inspect buckets, copy files | sudo incus exec drive -- /opt/gbo/bin/mc ... |
| HTTP/curl | Service health checks | curl http://<ip>:5858/health |
| journalctl | Systemd service logs | sudo incus exec system -- journalctl -u botserver |

Vault Security Architecture

Overview

The production environment uses HashiCorp Vault as the centralized secrets management system. All sensitive credentials (database passwords, API keys, tokens) are stored in Vault, NEVER in code or environment files.

Vault Connection Flow

1. botserver starts
   ↓
2. Reads VAULT_ADDR, VAULT_TOKEN from .env
   ↓
3. Initializes VaultClient with TLS/mTLS
   ↓
4. Reads secrets from Vault paths (gbo/tables, gbo/drive, etc.)
   ↓
5. Falls back to defaults if Vault unavailable
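
Step 2 can be verified before a restart. The following sketch checks that the two variables read at startup are present and non-empty in an env file; `vault_env_missing` is a hypothetical helper, and the check only looks at the file, not at whether the token is still valid.

```shell
# Print each required Vault variable that is absent or empty in an env file.
# Succeeds (exit 0) only when nothing is missing.
vault_env_missing() {
    env_file=$1
    missing=0
    for var in VAULT_ADDR VAULT_TOKEN; do
        # Match "VAR=" followed by at least one character.
        if ! grep -q "^${var}=." "$env_file"; then
            echo "$var"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Run against a copy of the service's .env, any name it prints points at the fallback path in step 5 being taken.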

Environment Variables (Allowed)

File Location: /opt/gbo/bin/.env (system container)

# Vault Connection (MANDATORY for production)
VAULT_ADDR=https://<vault-ip>:8200
VAULT_TOKEN=<root-token>
VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt

# Optional: Skip TLS verification (NOT recommended for production)
VAULT_SKIP_VERIFY=false

# Optional: Use mTLS certificates
VAULT_CLIENT_CERT=/opt/gbo/conf/system/certificates/botserver/client.crt
VAULT_CLIENT_KEY=/opt/gbo/conf/system/certificates/botserver/client.key

# Optional: Cache TTL in seconds (default: 300)
VAULT_CACHE_TTL=300

# Server Configuration
PORT=5858
DATA_DIR=/opt/gbo/data/
WORK_DIR=/opt/gbo/work/
LOAD_ONLY=default,salesianos

Security Rule:

  • Apart from the non-secret server settings shown above (PORT, DATA_DIR, WORK_DIR, LOAD_ONLY), ONLY VAULT_* environment variables are allowed in .env
  • All other secrets MUST come from Vault
  • Hardcoded secrets in code are FORBIDDEN (see AGENTS.md)
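
This rule lends itself to a mechanical check. The sketch below lists variable names that are neither VAULT_* nor one of the server settings from the example file; that allow-list is an assumption drawn from the example, and `env_violations` is a hypothetical helper, not part of botserver.

```shell
# Print every variable name in an env file that is neither VAULT_* nor one of
# the non-secret server settings. Exit status is 0 when at least one violation
# was printed and non-zero when the file is clean.
env_violations() {
    env_file=$1
    # Drop comments and blank lines, keep the name before the first "=".
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$env_file" \
        | cut -d= -f1 \
        | grep -v -E '^(VAULT_[A-Z_]+|PORT|DATA_DIR|WORK_DIR|LOAD_ONLY)$'
}
```

An empty result means the file only contains expected names; anything printed is a candidate for moving into Vault.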

Vault Secret Paths Structure

System-Wide Paths (Global)

| Path | Purpose | Example Keys |
|------|---------|--------------|
| gbo/tables | Database (PostgreSQL) | host, port, database, username, password |
| gbo/drive | MinIO (Object Storage) | host, accesskey, secret |
| gbo/cache | Valkey (Redis) | host, port, password |
| gbo/directory | Zitadel (Auth) | url, project_id, client_id, client_secret |
| gbo/email | SMTP Email | smtp_host, smtp_port, smtp_user, smtp_password |
| gbo/llm | LLM Configuration | url, model, openai_key, anthropic_key |
| gbo/vectordb | Qdrant (Vector DB) | url, api_key |
| gbo/jwt | JWT Signing | secret |
| gbo/meet | Jitsi Meet | url, app_id, app_secret |
| gbo/alm | ALM Repository | url, token |
| gbo/encryption | Encryption Keys | master_key |
| gbo/system/observability | Monitoring | url, org, bucket, token |
| gbo/system/security | Security Policies | require_auth, anonymous_paths |
| gbo/system/cloud | Cloud Config | region, access_key, secret_key |
| gbo/system/app | Application Settings | url, environment |
| gbo/system/models | BotModels API | url |

Organization-Specific Paths

| Path Pattern | Purpose |
|--------------|---------|
| gbo/orgs/{org_id}/config | Organization configuration |
| gbo/orgs/{org_id}/bots/{bot_id} | Bot-specific secrets |
| gbo/orgs/{org_id}/users/{user_id} | User-specific secrets |
| gbo/tenants/{tenant_id}/infrastructure | Tenant database/cache/drive |
| gbo/tenants/{tenant_id}/config | Tenant configuration |

Credential Resolution Hierarchy

For bot email configuration (example):

1. Check gbo/orgs/{org_id}/bots/{bot_id}/email
2. Fallback: gbo/bots/default/email
3. Fallback: gbo/email
4. Fallback: Environment variables (development only)
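
The Vault part of this hierarchy (steps 1 to 3) is an ordered list of candidate paths. A sketch that generates it is shown below; `email_secret_paths` is a hypothetical helper, and the environment-variable fallback in step 4 is deliberately left out because it does not live in Vault.

```shell
# Print the Vault paths to try, most specific first, for a bot's email secrets.
email_secret_paths() {
    org_id=$1
    bot_id=$2
    printf '%s\n' \
        "gbo/orgs/$org_id/bots/$bot_id/email" \
        "gbo/bots/default/email" \
        "gbo/email"
}
```

A caller would walk the list in order and stop at the first path that returns a secret, which keeps the precedence rule in one place.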

Vault Client Initialization (Code Reference)

File: botserver/src/core/secrets/mod.rs

// SecretsManager::from_env() reads:
// - VAULT_ADDR (required)
// - VAULT_TOKEN (required)
// - VAULT_CACERT (optional, has default)
// - VAULT_SKIP_VERIFY (optional, default: false)
// - VAULT_CLIENT_CERT (optional, mTLS)
// - VAULT_CLIENT_KEY (optional, mTLS)
// - VAULT_CACHE_TTL (optional, default: 300s)

impl SecretsManager {
    pub fn from_env() -> Result<Self> {
        let addr = env::var("VAULT_ADDR").unwrap_or_default();
        let token = env::var("VAULT_TOKEN").unwrap_or_default();

        if token.is_empty() || addr.is_empty() {
            // Vault not configured - use environment variables directly
            warn!("Vault not configured. Using environment variables directly.");
            return Ok(Self { client: None, enabled: false, ... });
        }

        // Initialize VaultClient with TLS
        let client = VaultClient::new(settings)?;
        Ok(Self { client: Some(client), enabled: true, ... })
    }
}

Vault Operations - Production Usage

Read Secrets from Vault

# From system container (using vault CLI)
sudo incus exec system -- bash -c '
  export VAULT_ADDR=https://10.157.134.250:8200
  export VAULT_TOKEN=<vault-token>
  export VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt

  # Read database secrets
  vault kv get -field=password secret/gbo/tables
  vault kv get secret/gbo/tables

  # Read drive secrets
  vault kv get secret/gbo/drive

  # Read LLM configuration
  vault kv get secret/gbo/llm
'

Read Secrets via HTTP API (from any container)

sudo incus exec system -- curl -sf \
  --cacert /opt/gbo/conf/system/certificates/ca/ca.crt \
  -H "X-Vault-Token: <vault-token>" \
  https://10.157.134.250:8200/v1/secret/data/gbo/drive | jq

Verify Vault Health

sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health

# Expected output:
# {"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":"2026-04-10T13:55:00.123Z"}
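
The health body can be classified without jq. The sketch below assumes the compact JSON layout shown in the expected output (no spaces after colons); `vault_state` is a hypothetical helper. Note that Vault answers this endpoint with a non-200 status when it is sealed, uninitialized, or in standby, so with `curl -f` the body is empty in those cases and the function reports "unknown"; drop `-f` to see the body.

```shell
# Classify a /v1/sys/health JSON body by looking for known key/value pairs.
# The leading quote in each pattern keeps "standby" from matching
# "performance_standby".
vault_state() {
    body=$1
    case $body in
        *'"initialized":false'*) echo "uninitialized" ;;
        *'"sealed":true'*)       echo "sealed" ;;
        *'"standby":true'*)      echo "standby" ;;
        *'"sealed":false'*)      echo "active" ;;
        *)                       echo "unknown" ;;
    esac
}
```

It is meant to be fed the captured curl output, for example `vault_state "$body"` after storing the response in a variable.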

init.json (Vault Initialization Data)

Location: /opt/gbo/bin/botserver-stack/conf/vault/vault-conf/init.json

Purpose: Stores Vault unseal keys and root token (created during Vault initialization)

Contents:

{
  "recovery_keys_b64": [],
  "recovery_keys_hex": [],
  "recovery_keys_shares": 0,
  "recovery_keys_threshold": 0,
  "root_token": "<vault-token>",
  "unseal_keys_b64": ["<5 unseal keys base64-encoded>"],
  "unseal_keys_hex": ["<5 unseal keys hex-encoded>"],
  "unseal_shares": 5,
  "unseal_threshold": 3
}

Security Notes:

  • root_token: Used to authenticate to Vault as admin
  • unseal_keys: Required to unseal Vault after restart (5 keys, need 3 to unseal)
  • CRITICAL: Store init.json in a secure, encrypted location
  • Never commit init.json to git or store in repo

Troubleshooting Vault Connection

Issue 1: Botserver cannot connect to Vault

Symptoms:

  • Logs show "Vault connection failed"
  • Secrets fall back to defaults
  • Bot cannot authenticate to database

Diagnosis:

# Check Vault is running
sudo incus exec vault -- systemctl status vault

# Check Vault health
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health

# Check .env has Vault credentials
sudo incus exec system -- grep "^VAULT_" /opt/gbo/bin/.env

# Test Vault connection from system container (no -k, so the CA certificate is actually verified)
sudo incus exec system -- bash -c '
  curl -sf --cacert /opt/gbo/conf/system/certificates/ca/ca.crt \
    -H "X-Vault-Token: $(grep "^VAULT_TOKEN=" /opt/gbo/bin/.env | cut -d= -f2-)" \
    https://10.157.134.250:8200/v1/secret/data/gbo/tables
'

Common Causes:

  1. Vault service not running (vault container stopped)
  2. VAULT_TOKEN expired or invalid
  3. TLS certificate path incorrect or CA certificate missing
  4. Network connectivity between system and vault containers

Fix:

# 1. Restart Vault if stopped
sudo incus exec vault -- systemctl restart vault

# 2. Generate new token if expired
sudo incus exec vault -- bash -c '
  export VAULT_ADDR=https://localhost:8200
  export VAULT_TOKEN=<root-token-from-init.json>
  vault token create -policy="botserver" -ttl="8760h" -format=json | jq -r .auth.client_token
'

# 3. Update .env with new token
sudo incus exec system -- sed -i "s|VAULT_TOKEN=.*|VAULT_TOKEN=<new-token>|" /opt/gbo/bin/.env

# 4. Restart botserver
sudo incus exec system -- systemctl restart botserver

Issue 2: Secrets not being read from Vault

Symptoms:

  • Logs show "Vault read failed for 'gbo/drive'"
  • Services use default credentials
  • DriveMonitor cannot access MinIO

Diagnosis:

# Check if Vault has secrets configured
sudo incus exec system -- bash -c '
  export VAULT_ADDR=https://10.157.134.250:8200
  export VAULT_TOKEN=$(grep VAULT_TOKEN /opt/gbo/bin/.env | cut -d= -f2)
  export VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt

  echo "=== Database Secrets ==="
  vault kv get secret/gbo/tables || echo "NOT FOUND"

  echo "=== Drive Secrets ==="
  vault kv get secret/gbo/drive || echo "NOT FOUND"

  echo "=== LLM Secrets ==="
  vault kv get secret/gbo/llm || echo "NOT FOUND"
'

Fix - Adding Secrets to Vault:

sudo incus exec vault -- bash -c '
  export VAULT_ADDR=https://localhost:8200
  export VAULT_TOKEN=<root-token>

  # Add database secrets
  vault kv put secret/gbo/tables \
    host=<tables-ip> \
    port=5432 \
    database=botserver \
    username=gbuser \
    password=<secure-password>

  # Add drive (MinIO) secrets
  vault kv put secret/gbo/drive \
    host=<drive-ip> \
    port=9100 \
    accesskey=<minio-access-key> \
    secret=<minio-secret>

  # Add LLM secrets
  vault kv put secret/gbo/llm \
    url=http://localhost:8085 \
    model=gpt-4 \
    openai_key=<openai-api-key> \
    anthropic_key=<anthropic-api-key>
'

Issue 3: Vault sealed after restart

Symptoms:

  • All Vault operations fail
  • botserver cannot read secrets
  • Logs show "Vault is sealed"

Diagnosis:

# No -f here: a sealed Vault answers HTTP 503, and -f would discard the body
sudo incus exec vault -- curl -k -s https://localhost:8200/v1/sys/health | jq .sealed

Fix - Unseal Vault:

sudo incus exec vault -- bash -c '
  # Need 3 of 5 unseal keys from init.json
  vault operator unseal <key1>
  vault operator unseal <key2>
  vault operator unseal <key3>

  # Verify unsealed
  vault status
'

Issue 4: TLS certificate errors

Symptoms:

  • "certificate verify failed" errors
  • TLS handshake failures
  • curl: (60) SSL certificate problem

Diagnosis:

sudo incus exec system -- bash -c '
  # Check CA certificate exists
  ls -la /opt/gbo/conf/system/certificates/ca/ca.crt

  # Test certificate
  openssl x509 -in /opt/gbo/conf/system/certificates/ca/ca.crt -text -noout
'

Fix:

# If CA cert is missing, copy it from the vault container.
# incus runs on the host, not inside the containers, so pull and push from the host.
sudo incus file pull vault/opt/gbo/conf/vault/ca.crt /tmp/ca.crt

sudo incus exec system -- mkdir -p /opt/gbo/conf/system/certificates/ca/
sudo incus file push /tmp/ca.crt system/opt/gbo/conf/system/certificates/ca/ca.crt
sudo incus exec system -- chmod 644 /opt/gbo/conf/system/certificates/ca/ca.crt

Security Best Practices

  1. Never commit secrets to git

    • No API keys, passwords, tokens in code
    • Use Vault for ALL sensitive data
    • Init secrets from SecretsManager::from_env()
  2. Use Vault for all service credentials

    • Database passwords: gbo/tables
    • MinIO keys: gbo/drive
    • LLM API keys: gbo/llm
    • Email passwords: gbo/email
  3. Rotate credentials regularly

    • Generate new tokens/keys periodically
    • Update Vault using vault kv put
    • No need to restart services (the next read after the cache expires gets new values; see VAULT_CACHE_TTL, default 300 s)
  4. Enable TLS/mTLS in production

    • Always use VAULT_CACERT
    • Enable mTLS for critical services: VAULT_CLIENT_CERT + VAULT_CLIENT_KEY
    • Never use VAULT_SKIP_VERIFY=true in production
  5. Limit token lifetimes

    • Root token: single use or very short TTL
    • Service tokens: limited to needed time (e.g., 8760h = 1 year)
    • Generate new tokens when old ones expire
  6. Audit Vault access

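    # List enabled audit devices
    sudo incus exec vault -- vault audit list
    # Enable a file audit device if none is listed
    sudo incus exec vault -- vault audit enable file file_path=/var/log/vault_audit.log
    # Check recent Vault operations
    sudo incus exec vault -- tail -n 50 /var/log/vault_audit.log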
    

Vault Backup & Recovery

Backup Vault Data

# Snapshot vault container (includes all secrets)
sudo incus snapshot create vault backup-$(date +%Y%m%d-%H%M)

# Export Vault config (init.json with unseal keys)
sudo incus exec vault -- cat /opt/gbo/bin/botserver-stack/conf/vault/vault-conf/init.json > /tmp/vault-init.json

# Backup all secrets (JSON format)
sudo incus exec vault -- bash -c '
  export VAULT_ADDR=https://localhost:8200
  export VAULT_TOKEN=<root-token>

  # Backup each path
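  for path in gbo/tables gbo/drive gbo/cache gbo/llm; do
    # Replace "/" with "-" so the file lands directly in /tmp (e.g. vault-gbo-tables.json)
    vault kv get -format=json secret/$path > "/tmp/vault-$(echo "$path" | tr / -).json"
  done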
'

Restore from Snapshot

# Stop vault
sudo incus exec vault -- systemctl stop vault

# Restore snapshot
sudo incus snapshot restore vault <snapshot-name>

# Start vault
sudo incus exec vault -- systemctl start vault

# Wait for Vault to be ready
sleep 10

# Verify health
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health

Troubleshooting

GLIBC Version Mismatch

Symptom: GLIBC_2.39 not found or GLIBC_2.38 not found

Cause: Binary compiled on CI runner (glibc 2.41) but runs in system container (glibc 2.36)

Fix: CI workflow must build inside the system container. Check botserver.yaml uses SSH to build in container.

botserver Not Starting

# Check binary
sudo incus exec system -- ldd /opt/gbo/bin/botserver | grep "not found"

# Check direct execution
sudo incus exec system -- timeout 10 /opt/gbo/bin/botserver 2>&1

# Check data directory
sudo incus exec system -- ls -la /opt/gbo/data/

botui Can't Reach botserver

# Check BOTSERVER_URL
sudo incus exec system -- grep BOTSERVER_URL /etc/systemd/system/ui.service

# Must be http://localhost:5858, NOT https://system.example.com
# Fix:
sudo incus exec system -- sed -i 's|BOTSERVER_URL=.*|BOTSERVER_URL=http://localhost:5858|' /etc/systemd/system/ui.service
sudo incus exec system -- systemctl daemon-reload
sudo incus exec system -- systemctl restart ui

Suggestions Not Showing

# Check bot files exist
sudo incus exec system -- ls -la /opt/gbo/data/<bot>.gbai/<bot>.gbdialog/

# Check for compilation errors
sudo incus exec system -- tail -50 /opt/gbo/logs/out.log | grep -i "error\|fail\|compile"

# Clear cache and restart
sudo incus exec system -- find /opt/gbo/work -name "*.ast" -delete
sudo incus exec system -- systemctl restart botserver

IPv6 DNS Issues

Symptom: External API calls (Groq, Cloudflare) timeout

Cause: Container DNS returns AAAA records but no IPv6 connectivity

Fix: The container already has IPV6=no in its network config and matching label entries in gai.conf. If issues persist, check that RES_OPTIONS=inet4 is set in botserver.service.

Vault Connection & Service Discovery Issues

Symptom: Logs show Failed to read data directory <development-path> or Config scan failed

Cause: Botserver is using hardcoded development paths instead of production paths

Fix:

  1. Check current configuration:

    # Check .env file
    sudo incus exec system -- cat /opt/gbo/bin/.env
    
    # Check data directory
    sudo incus exec system -- ls -la /opt/gbo/data/
    sudo incus exec system -- ls -la /opt/gbo/work/
    
  2. Verify Vault connection:

    # Test Vault from system container
    sudo incus exec system -- curl -k -sf https://<vault-ip>:8200/v1/sys/health
    
    # Check Vault token
    sudo incus exec system -- grep VAULT_TOKEN /opt/gbo/bin/.env
    
  3. Check service discovery:

    # Check if botserver is reading Vault secrets
    sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep -i vault
    
    # Check for service configuration errors
    sudo incus exec system -- tail -100 /opt/gbo/logs/err.log | grep -i "config\|service"
    
  4. Fix data directory paths:

    • Ensure botserver uses /opt/gbo/data/ instead of development paths
    • Update configuration if hardcoded paths exist
    • Restart botserver after fixing
  5. Verify all services are accessible:

    # Check PostgreSQL
    sudo incus exec system -- pg_isready -h <database-ip> -p 5432
    
    # Check Valkey
    sudo incus exec system -- redis-cli -h <cache-ip> -a <password> ping
    
    # Check MinIO
    sudo incus exec system -- curl -sf http://<storage-ip>:9100/minio/health/live
    
  6. Update botserver configuration:

    • Ensure botserver reads from /opt/gbo/bin/.env for Vault configuration
    • Verify service discovery uses Vault to get service endpoints
    • Check that data directory is set to /opt/gbo/data/ in configuration
    • Update systemd service if needed:
      sudo incus exec system -- cat /etc/systemd/system/botserver.service
      # Ensure EnvironmentFile=/opt/gbo/bin/.env is present
      
  7. Test after fixes:

    # Restart botserver
    sudo incus exec system -- systemctl restart botserver
    
    # Wait for startup
    sleep 10
    
    # Check logs for errors
    sudo incus exec system -- tail -50 /opt/gbo/logs/err.log
    
    # Verify health endpoint
    curl -sf http://<main-server-ip>:5858/health
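
The fixed `sleep 10` in step 7 can be replaced by polling the health endpoint until it answers. A minimal sketch, assuming a hypothetical helper named `wait_until`; the attempt count and delay are arbitrary example values.

```shell
# Hypothetical helper: run a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example: poll the health endpoint for up to about 30 seconds.
#   wait_until 10 3 curl -sf -o /dev/null http://<main-server-ip>:5858/health
```

The helper returns non-zero when every attempt fails, so a deployment script can stop and print the error log at that point.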
    

Vault Connection Errors

Symptom: Vault connection failed or Vault token invalid

Fix:

# Check Vault is running
sudo incus exec vault -- systemctl status vault

# Check Vault health
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health

# Verify token is valid
sudo incus exec system -- bash -c '
  export VAULT_ADDR=https://<vault-ip>:8200
  export VAULT_TOKEN=<vault_token>
  export VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt
  vault token lookup
'

# If token is invalid, generate new one
sudo incus exec vault -- bash -c '
  export VAULT_ADDR=https://localhost:8200
  export VAULT_TOKEN=<root_token>
  vault token create -policy="botserver" -ttl="8760h"
'

# Update .env with new token
sudo incus exec system -- sed -i 's|^VAULT_TOKEN=.*|VAULT_TOKEN=<new_token>|' /opt/gbo/bin/.env
sudo incus exec system -- systemctl restart botserver
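
Note that `curl -sf` treats any HTTP status of 400 or above as a failure, so a sealed or standby Vault looks the same as an unreachable one. The sketch below maps the status code of `/v1/sys/health` to a readable state, following Vault's documented default codes. `vault_health_meaning` is a hypothetical helper name.

```shell
# Hypothetical helper: translate the HTTP status of /v1/sys/health into text.
# The mapping follows Vault's documented default status codes.
vault_health_meaning() {
  case "$1" in
    200) echo "initialized, unsealed, active" ;;
    429) echo "unsealed, standby" ;;
    472) echo "disaster recovery secondary" ;;
    473) echo "performance standby" ;;
    501) echo "not initialized" ;;
    503) echo "sealed" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Example (run from the host):
#   code=$(sudo incus exec vault -- curl -k -s -o /dev/null -w '%{http_code}' https://localhost:8200/v1/sys/health)
#   vault_health_meaning "$code"
```

A result of "sealed" means the token is not the problem: Vault has to be unsealed before any token lookup can succeed.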

Service Discovery Failures

Symptom: Service not found or Failed to connect to service

Fix:

# Check if service is running
sudo incus exec tables -- systemctl status postgresql
sudo incus exec cache -- systemctl status valkey
sudo incus exec drive -- systemctl status minio

# Check if service is accessible from system container
sudo incus exec system -- nc -zv <database-ip> 5432  # PostgreSQL
sudo incus exec system -- nc -zv <cache-ip> 6379  # Valkey
sudo incus exec system -- nc -zv <storage-ip> 9100  # MinIO

# Check Vault has service configuration
sudo incus exec system -- bash -c '
  export VAULT_ADDR=https://<vault-ip>:8200
  export VAULT_TOKEN=<vault_token>
  export VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt
  vault kv list secret/botserver
'

# If service config is missing, add it (see Vault Configuration section)
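
The three `nc` checks above can be folded into one loop that reports every service. This is a sketch under assumptions: `check_endpoints` and the `PROBE` variable are hypothetical names, and the addresses remain the placeholders used throughout this guide.

```shell
# Hypothetical helper: read "name host port" lines from stdin and probe each.
# PROBE is the command prefix used for the check (default: nc -z -w 3).
check_endpoints() {
  rc=0
  while read -r name host port; do
    [ -n "$name" ] || continue
    # </dev/null keeps the probe from consuming the remaining input lines.
    if ${PROBE:-nc -z -w 3} "$host" "$port" </dev/null >/dev/null 2>&1; then
      echo "OK   $name $host:$port"
    else
      echo "FAIL $name $host:$port"
      rc=1
    fi
  done
  return "$rc"
}

# Example: probe from inside the system container.
#   printf '%s\n' "postgresql <database-ip> 5432" "valkey <cache-ip> 6379" "minio <storage-ip> 9100" |
#     PROBE="sudo incus exec system -- nc -z -w 3" check_endpoints
```

The function returns non-zero if any endpoint fails, which makes it usable as a pre-restart gate.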

Monitoring & Verification

Check botserver is working correctly:

# Health check
curl -sf http://<main-server-ip>:5858/health

# Check logs for errors
sudo incus exec system -- tail -100 /opt/gbo/logs/err.log | grep -i "error\|fail"

# Check logs for successful service connections
sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep -i "connected\|service\|vault"

# Verify data directory is correct
sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep -i "data\|work"

# Should show /opt/gbo/data/ and /opt/gbo/work/, not development paths

Expected log output:

info vault:Connected to Vault at https://<vault-ip>:8200
info service_discovery:Loaded service configuration from Vault
info database:Connected to PostgreSQL at <database-ip>:5432
info cache:Connected to Valkey at <cache-ip>:6379
info storage:Connected to MinIO at http://<storage-ip>:9100
info watcher:Watching data directory /opt/gbo/data
info botserver:BotServer started successfully on port 5858

If logs show errors:

  1. Check Vault connection (see Vault Connection Errors section)
  2. Check service accessibility (see Service Discovery Failures section)
  3. Fix data directory paths (see Fix Development Paths in Production section)
  4. Restart botserver and verify again
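
Comparing a log against the expected output can be automated. The sketch below assumes the log messages match the expected lines shown above; `missing_startup_lines` is a hypothetical helper that prints one line for each expected message it cannot find.

```shell
# Hypothetical helper: read a log from stdin and report which of the expected
# startup messages (taken from the expected log output above) are absent.
missing_startup_lines() {
  log=$(cat)
  for marker in \
    "Connected to Vault" \
    "Loaded service configuration from Vault" \
    "Connected to PostgreSQL" \
    "Connected to Valkey" \
    "Connected to MinIO" \
    "BotServer started successfully"
  do
    printf '%s\n' "$log" | grep -qF "$marker" || echo "missing: $marker"
  done
}

# Example (run from the host):
#   sudo incus exec system -- tail -500 /opt/gbo/logs/out.log | missing_startup_lines
```

Empty output means every expected message was found. On a long-running server the startup lines may have scrolled out of the tail window, so run the check shortly after a restart.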

Vault Backup & Restore

Create Vault snapshot:

# Stop Vault
sudo incus exec vault -- systemctl stop vault

# Create snapshot
sudo incus snapshot create vault manual-$(date +%Y-%m-%d-%H%M)

# Start Vault
sudo incus exec vault -- systemctl start vault

# Verify
sudo incus snapshot list vault

Restore Vault from snapshot:

# Stop Vault
sudo incus exec vault -- systemctl stop vault

# List snapshots
sudo incus snapshot list vault

# Restore from the chosen snapshot (use a name from the list above)
sudo incus snapshot restore vault <snapshot-name>

# Start Vault
sudo incus exec vault -- systemctl start vault

# Verify Vault is running
sudo incus exec vault -- systemctl status vault
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health

Automated snapshots:

# Create a daily snapshot job on the host, where the incus CLI runs
sudo tee /etc/cron.daily/vault-snapshot > /dev/null << 'EOF'
#!/bin/bash
incus exec vault -- systemctl stop vault
incus snapshot create vault "daily-$(date +%Y%m%d)"
incus exec vault -- systemctl start vault
EOF
sudo chmod +x /etc/cron.daily/vault-snapshot
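
Daily snapshots accumulate, so some pruning is usually needed. The sketch below selects the `daily-YYYYMMDD` snapshots that fall outside a retention window. `old_daily_snapshots` is a hypothetical helper, the retention count of 7 is an example, and the usage line assumes your Incus version supports `--format csv` and `--columns n` on `incus snapshot list`.

```shell
# Hypothetical helper: read snapshot names from stdin and print the
# daily-YYYYMMDD ones that are older than the newest <keep> dailies.
# Names sort chronologically because the date is zero-padded.
old_daily_snapshots() {
  keep=$1
  grep -E '^daily-[0-9]{8}$' | sort |
    awk -v keep="$keep" '{ names[NR] = $0 } END { for (i = 1; i <= NR - keep; i++) print names[i] }'
}

# Example (run from the host; flags are an assumption about your Incus version):
#   sudo incus snapshot list vault --format csv --columns n | old_daily_snapshots 7 |
#     while read -r snap; do sudo incus snapshot delete vault "$snap"; done
```

Manual snapshots are never selected because only names matching the daily pattern pass the filter. Incus also provides the `snapshots.schedule` and `snapshots.expiry` instance options, which may be a simpler alternative if stopping Vault before each snapshot is not required.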

Update Botserver for Production

Required changes in botserver code:

  1. Read configuration from Vault:

    • Add Vault client initialization
    • Read service endpoints from Vault
    • Read secrets from Vault
    • Fall back to environment variables if Vault is unavailable
  2. Use production paths:

    • Remove hardcoded development paths
    • Use environment variables for data directory
    • Default to /opt/gbo/data/ for production
  3. Update .env file:

    # /opt/gbo/bin/.env
    VAULT_ADDR=https://<vault-ip>:8200
    VAULT_TOKEN=<vault_token>
    VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt
    DATA_DIR=/opt/gbo/data/
    WORK_DIR=/opt/gbo/work/
    PORT=5858
    
  4. Update systemd service:

    # tee runs inside the container; a plain > redirect would write to the host instead
    sudo incus exec system -- tee /etc/systemd/system/botserver.service > /dev/null << 'EOF'
    [Unit]
    Description=BotServer Service
    After=network.target
    
    [Service]
    User=root
    Group=root
    WorkingDirectory=/opt/gbo/bin
    EnvironmentFile=/opt/gbo/bin/.env
    ExecStart=/opt/gbo/bin/botserver --noconsole
    Restart=always
    RestartSec=5
    StandardOutput=append:/opt/gbo/logs/out.log
    StandardError=append:/opt/gbo/logs/err.log
    
    [Install]
    WantedBy=multi-user.target
    EOF
    
    sudo incus exec system -- systemctl daemon-reload
    sudo incus exec system -- systemctl restart botserver
    
  5. Deploy updated botserver:

    # Push changes to ALM (ALM is production: ask before pushing)
    cd botserver && git push alm main && git push origin main
    
    # CI builds and deploys automatically; do not deploy manually
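
Before restarting the service, it is worth confirming that the `.env` file from step 3 defines every key botserver expects. A minimal sketch, assuming a hypothetical helper named `missing_env_keys`; the key list in the example comes from the `.env` shown in step 3.

```shell
# Hypothetical helper: print each requested key that has no KEY= line
# in the given env file.
# Usage: missing_env_keys <env-file> <KEY> [KEY...]
missing_env_keys() {
  env_file=$1
  shift
  for key in "$@"; do
    grep -q "^${key}=" "$env_file" || echo "$key"
  done
}

# Example: copy the file out of the container first, then check it.
#   sudo incus file pull system/opt/gbo/bin/.env /tmp/botserver.env
#   missing_env_keys /tmp/botserver.env VAULT_ADDR VAULT_TOKEN VAULT_CACERT DATA_DIR WORK_DIR PORT
#   rm -f /tmp/botserver.env
```

Empty output means all keys are present. The copy contains the Vault token, so delete it as soon as the check is done.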
    

Security

  • NEVER push secrets to git
  • NEVER commit files containing credentials to the repository root
  • Vault is single source of truth for secrets
  • CI/CD is the only deployment method — never manually scp binaries
  • ALM is production — ask before pushing