# Production Environment Guide

## ⚠️ CRITICAL PRODUCTION RULES

**READ THIS FIRST:**

### 🚫 NEVER Start Services Directly

In production, **NEVER** start botserver or botui directly. Always use `systemctl`:

```bash
# ❌ NEVER DO THIS IN PRODUCTION:
/opt/gbo/bin/botserver      # Wrong
./botserver                 # Wrong
/opt/gbo/bin/botserver &    # Wrong

# ✅ ALWAYS USE THIS:
sudo incus exec system -- systemctl start botserver
sudo incus exec system -- systemctl restart botserver
sudo incus exec system -- systemctl stop botserver
sudo incus exec system -- systemctl status botserver
```

**Why:**
- `systemctl` loads `/opt/gbo/bin/.env` (Vault credentials, paths, etc.)
- Direct execution skips environment variables → services fail
- `systemctl` manages auto-restart, logging, and dependencies

### 🔐 Security Rules

- **NEVER** push secrets to git (API keys, passwords, tokens)
- **NEVER** commit `init.json` (Vault unseal keys)
- **ALWAYS** use Vault for secrets (see [Vault Security Architecture](#vault-security-architecture))
- **ONLY** `VAULT_*` environment variables allowed in `.env`

### 🚢 Deployment Rules

- **NEVER** deploy manually (scp, ssh copy); use CI/CD only
- **NEVER** push to ALM without asking first
- **ALWAYS** push ALL submodules (botserver, botui, botlib) when pushing main repo
- **ALWAYS** use `systemctl` to restart services after deployment

---

## Infrastructure

### Servers

| Host | IP | Purpose |
|------|-----|---------|
| `system` | `` | Main botserver + botui container |
| `alm-ci` | `` | CI/CD runner (Forgejo Actions) |
| `drive` | `` | Object storage |
| `monitor` | `` | Monitoring service |

### Port Mapping (system container)

| Service | Internal Port | External URL |
|---------|--------------|--------------|
| botserver | `5858` | `https://system.example.com` |
| botui | `5859` | `https://chat.example.com` |

### Access

```bash
# SSH to host
ssh admin@

# Execute inside system container
sudo incus exec system -- bash -c 'command'

# SSH from host to container (used by CI)
ssh -o StrictHostKeyChecking=no system "command"
```

## Services

### botserver.service

- **Binary**: `/opt/gbo/bin/botserver`
- **Port**: `5858`
- **User**: `gbuser`
- **Logs**: `/opt/gbo/logs/out.log`, `/opt/gbo/logs/err.log`
- **Config**: `/etc/systemd/system/botserver.service`
- **Env**: `PORT=5858`

### ui.service

- **Binary**: `/opt/gbo/bin/botui`
- **Port**: `5859`
- **Config**: `/etc/systemd/system/ui.service`
- **Env**: `BOTSERVER_URL=http://localhost:5858`
  - ⚠️ MUST be `http://localhost:5858`, NOT `https://system.example.com`
  - Rust proxy runs server-side, needs direct localhost access
  - JS client uses relative URLs through `chat.example.com`

### Data Directory

- **Path**: `/opt/gbo/data/`
- **Structure**: `.gbai/.gbdialog/*.bas`
- **Work dir**: `/opt/gbo/work/` (compiled .ast cache)

### Stack Services (managed by botserver bootstrap)

- **Vault**: Secrets management
- **PostgreSQL**: Database (port 5432)
- **Valkey**: Cache (port 6379, password auth)
- **MinIO**: Object storage
- **Zitadel**: Identity provider
- **LLM**: llama.cpp

## CI/CD Pipeline

### Repositories

| Repo | ALM URL | GitHub URL |
|------|---------|------------|
| gb | `https://alm.example.com/organization/gb.git` | `git@github.com:organization/gb.git` |
| botserver | `https://alm.example.com/organization/BotServer.git` | `git@github.com:organization/botserver.git` |
| botui | `https://alm.example.com/organization/BotUI.git` | `git@github.com:organization/botui.git` |
| botlib | `https://alm.example.com/organization/botlib.git` | `git@github.com:organization/botlib.git` |

### Push Order

```bash
# 1. Push submodules first (all three, per the Deployment Rules)
cd botserver && git push alm main && git push origin main && cd ..
cd botui && git push alm main && git push origin main && cd ..
cd botlib && git push alm main && git push origin main && cd ..

# 2. Update root workspace references
git add botserver botui botlib
git commit -m "Update submodules: "
git push alm main && git push origin main
```

### Build Environment

- **CI runner**: `ci-runner` container (Debian Trixie, glibc 2.41)
- **Target**: `system` container (Debian 12 Bookworm, glibc 2.36)
- **⚠️ GLIBC MISMATCH**: Building on the CI runner produces binaries incompatible with the system container
- **Solution**: The CI workflow transfers the source to the system container and builds there via SSH

### Workflow File

- **Location**: `botserver/.forgejo/workflows/botserver.yaml`
- **Triggers**: Push to `main` branch
- **Steps**:
  1. Setup workspace on CI runner (clone repos)
  2. Transfer source to system container via `tar | ssh`
  3. Build inside system container (matches glibc 2.36)
  4. Deploy binary inside container
  5. Verify botserver is running

## Common Operations

### Check Service Status

```bash
# From host
sudo incus exec system -- systemctl status botserver --no-pager
sudo incus exec system -- systemctl status ui --no-pager

# Check if running
sudo incus exec system -- pgrep -f botserver
sudo incus exec system -- pgrep -f botui
```

### View Logs

```bash
# Systemd journal
sudo incus exec system -- journalctl -u botserver --no-pager -n 50
sudo incus exec system -- journalctl -u ui --no-pager -n 50

# Application logs
sudo incus exec system -- tail -50 /opt/gbo/logs/out.log
sudo incus exec system -- tail -50 /opt/gbo/logs/err.log

# Live tail
sudo incus exec system -- tail -f /opt/gbo/logs/out.log
```

### Restart Services

**CRITICAL PRODUCTION RULE:** In production, NEVER start botserver or botui directly. Always use `systemctl` to ensure proper initialization, environment loading, and logging.
```bash
sudo incus exec system -- systemctl restart botserver
sudo incus exec system -- systemctl restart ui
```

**PROHIBITED in production:**

```bash
# ❌ NEVER DO THIS IN PRODUCTION:
sudo incus exec system -- /opt/gbo/bin/botserver              # Wrong - no systemd integration
sudo incus exec system -- /opt/gbo/bin/botserver &            # Wrong - no service management
sudo incus exec system -- cd /opt/gbo/bin && ./botserver      # Wrong - missing env vars

# ✅ CORRECT - Always use systemctl:
sudo incus exec system -- systemctl start botserver
sudo incus exec system -- systemctl restart botserver
sudo incus exec system -- systemctl stop botserver
sudo incus exec system -- systemctl status botserver
```

**Why:**
- `systemctl` loads `/opt/gbo/bin/.env` (via `EnvironmentFile` in service definition)
- `systemctl` manages process lifecycle, auto-restart, and dependencies
- `systemctl` sends logs to `/opt/gbo/logs/out.log` and `/opt/gbo/logs/err.log`
- Direct execution skips environment variables and systemd service configuration

### Manual Deploy (emergency)

```bash
# Stop the service first (killing the process directly may trigger the systemd auto-restart)
sudo incus exec system -- systemctl stop botserver

# Copy binary (from host CI workspace or local)
sudo incus exec system -- cp /opt/gbo/ci/botserver/target/debug/botserver /opt/gbo/bin/botserver
sudo incus exec system -- chmod +x /opt/gbo/bin/botserver
sudo incus exec system -- chown gbuser:gbuser /opt/gbo/bin/botserver

# Start service
sudo incus exec system -- systemctl start botserver
```

### Transfer Bot Files to Production

```bash
# From local to prod host
tar czf /tmp/bots.tar.gz -C /opt/gbo/data .gbai
scp /tmp/bots.tar.gz admin@:/tmp/

# From host to container (the host's /tmp is not visible inside the container)
sudo incus file push /tmp/bots.tar.gz system/tmp/bots.tar.gz
sudo incus exec system -- bash -c 'tar xzf /tmp/bots.tar.gz -C /opt/gbo/data/'

# Clear compiled cache
sudo incus exec system -- find /opt/gbo/data -name "*.ast" -delete
sudo incus exec system -- find /opt/gbo/work -name "*.ast" -delete
```

### Snapshots

```bash
# List snapshots
sudo incus snapshot list system

# Restore snapshot
sudo incus snapshot restore system
```

## DriveMonitor & Bot Configuration Sync

### DriveMonitor Architecture

DriveMonitor is a background service that synchronizes bot files from MinIO (S3-compatible storage) to the local filesystem and database. It monitors three directories per bot:

| Directory | Purpose | Sync Behavior |
|-----------|---------|---------------|
| `{bot}.gbai/{bot}.gbdialog/` | BASIC scripts (.bas) | Downloads and compiles on change |
| `{bot}.gbai/{bot}.gbot/` | Configuration files | Syncs to `bot_configuration` table |
| `{bot}.gbkb/` | Knowledge base documents | Downloads and indexes for vector search |

### Bot Configuration Database Tables

#### `bot_configuration` (main config table)

```sql
-- Location: botserver database
SELECT * FROM bot_configuration WHERE bot_id = '';

-- Key columns:
-- - bot_id: Bot UUID (link to bots table)
-- - config_key: Configuration key (e.g., "llm-provider", "system-prompt")
-- - config_value: Configuration value
-- - config_type: Type (string, boolean, number)
-- - is_encrypted: Whether value is encrypted
-- - updated_at: Last modification timestamp
```

#### `gbot_config_sync` (sync tracking table)

```sql
-- Location: botserver database
-- Tracks config.csv sync status from bucket
SELECT * FROM gbot_config_sync g
JOIN bots b ON g.bot_id = b.id
WHERE b.name = 'salesianos';

-- Key columns:
-- - bot_id: Bot UUID
-- - config_file_path: Path to config.csv in bucket
-- - last_sync_at: Timestamp of last successful sync
-- - file_hash: ETag/MD5 of synced file
-- - sync_count: Number of times synced
```

### config.csv Sync Process

**File Locations:**
- Source: `{bot}.gbai/{bot}.gbot/config.csv` in MinIO bucket
- Sync method: DriveMonitor → ConfigManager → `bot_configuration` table
- Sync frequency: Every 10 seconds (DriveMonitor periodic check)

**Sync Trigger Conditions:**
1. File ETag changes in MinIO
2. Initial DriveMonitor startup
3. Manual botserver restart

**CSV Format:**

```csv
llm-provider,groq
llm-api-key,sk-xxx
llm-url,http://localhost:8085
system-prompt-file,PROMPT.md
theme-color1,#cc0000
theme-title,MyBot
whatsapp-id,botname
```

### Checking Bot Configuration Status

#### Method 1: Query bot_configuration table

```bash
# Get all config for a bot
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "
SELECT b.name, bc.config_key, bc.config_value, bc.updated_at
FROM bot_configuration bc
JOIN bots b ON bc.bot_id = b.id
WHERE b.name = 'salesianos'
ORDER BY bc.config_key;
"

# Get specific LLM provider config
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "
SELECT config_key, config_value, updated_at
FROM bot_configuration
WHERE bot_id = (
  SELECT id FROM bots WHERE name = 'salesianos'
)
AND config_key LIKE 'llm-%'
ORDER BY config_key;
"
```

#### Method 2: Check DriveMonitor sync status

```bash
# Check if config.csv has been synced
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "
SELECT b.name, gcs.last_sync_at, gcs.sync_count, gcs.config_file_path
FROM gbot_config_sync gcs
JOIN bots b ON gcs.bot_id = b.id
WHERE b.name IN ('salesianos', 'default');
"

# Empty result = DriveMonitor hasn't synced config.csv yet
# If sync_count = 0, config.csv exists but hasn't been processed
```

#### Method 3: Direct MinIO inspection

```bash
# Check if config.csv exists in bucket
sudo incus exec drive -- /opt/gbo/bin/mc ls local/salesianos.gbai/salesianos.gbot/

# View config.csv contents
sudo incus exec drive -- /opt/gbo/bin/mc cat local/salesianos.gbai/salesianos.gbot/config.csv

# Check file ETag (for sync comparison)
sudo incus exec drive -- /opt/gbo/bin/mc stat local/salesianos.gbai/salesianos.gbot/config.csv
```

### DriveMonitor Debugging Logs

#### Key log patterns to monitor

```bash
# Monitor DriveMonitor activity in real-time
sudo incus exec system -- tail -f /opt/gbo/logs/out.log | grep -E "(DRIVE_MONITOR|check_gbot|config)"

# Check for config.csv sync attempts
sudo incus exec system -- grep "check_gbot" /opt/gbo/logs/out.log | tail -20

# Check for config synchronization
sudo incus exec system -- grep "sync_gbot_config" /opt/gbo/logs/out.log | tail -20

# Check for DriveMonitor errors
sudo incus exec system -- grep -i "drive.*error" /opt/gbo/logs/err.log | tail -20
```

#### Expected successful sync logs

```
check_gbot: Checking bucket salesianos.gbai for config.csv changes
check_gbot: Found config.csv at path: salesianos.gbai/salesianos.gbot/config.csv
info config:Synced config.csv for bot - updated 3 keys
```

#### Error patterns and meanings

```
# Config.csv not found in bucket
check_gbot: Config file not found or inaccessible: path/to/config.csv

# Sync to database failed
error config:Failed to sync_gbot_config:

# DriveMonitor not running
(no check_gbot logs in out.log)

# MinIO connection failed
error drive_monitor:S3/MinIO unavailable for bucket
```

### Common Issues and Fixes

#### Issue 1: config.csv not syncing to database

**Symptoms:**
- `gbot_config_sync` table empty (0 rows)
- LLM provider changes in bucket not reflected in bot behavior
- Database shows old configuration values

**Diagnosis:**

```bash
# 1. Check if config.csv exists in bucket
sudo incus exec drive -- /opt/gbo/bin/mc ls local/salesianos.gbai/salesianos.gbot/

# 2. Check DriveMonitor logs for sync attempts
sudo incus exec system -- grep "check_gbot" /opt/gbo/logs/out.log | tail -10

# 3. Check if DriveMonitor is running for the bot
sudo incus exec system -- ps aux | grep botserver
```

**Root Causes:**
1. config.csv missing from `{bot}.gbai/{bot}.gbot/` folder
2. DriveMonitor not started for the bot
3. MinIO connection issues
4. Database write permissions

**Fixes:**

```bash
# Case 1: Create missing config.csv
sudo incus exec drive -- bash -c '
cat > /tmp/config.csv << EOF
llm-provider,groq
llm-api-key,your-api-key
llm-url,http://localhost:8085
system-prompt-file,PROMPT.md
theme-color1,#cc0000
theme-title,Salesianos
EOF
/opt/gbo/bin/mc cp /tmp/config.csv local/salesianos.gbai/salesianos.gbot/config.csv
'

# Case 2: Restart botserver to reinitialize DriveMonitor
sudo incus exec system -- systemctl restart botserver

# Case 3: Force immediate sync by touching config.csv
sudo incus exec drive -- /opt/gbo/bin/mc cp local/salesianos.gbai/salesianos.gbot/config.csv local/salesianos.gbai/salesianos.gbot/config.csv
```

#### Issue 2: LLM provider changes not taking effect

**Symptoms:**
- config.csv shows correct provider (e.g., groq)
- Bot still uses old provider
- Database shows old value

**Diagnosis:**

```bash
# Compare bucket vs database
BUCKET_PROVIDER=$(sudo incus exec drive -- /opt/gbo/bin/mc cat local/salesianos.gbai/salesianos.gbot/config.csv | grep "^llm-provider" | cut -d',' -f2)
DB_PROVIDER=$(sudo incus exec tables -- psql -h localhost -U postgres -d botserver -t -c "
SELECT config_value FROM bot_configuration
WHERE bot_id = (SELECT id FROM bots WHERE name = 'salesianos')
  AND config_key = 'llm-provider';
")

echo "Bucket: $BUCKET_PROVIDER"
echo "Database: $DB_PROVIDER"

# Check last sync time
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -t -c "
SELECT last_sync_at FROM gbot_config_sync
WHERE bot_id = (SELECT id FROM bots WHERE name = 'salesianos');
"
```

**Fix:**

```bash
# If sync is stale (> 10 minutes), restart DriveMonitor
sudo incus exec system -- systemctl restart botserver

# Or manually update config value in database (temporary fix)
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "
UPDATE bot_configuration
SET config_value = 'groq', updated_at = NOW()
WHERE bot_id = (SELECT id FROM bots WHERE name = 'salesianos')
  AND config_key =
  'llm-provider';
"
```

#### Issue 3: DriveMonitor not checking for changes

**Symptoms:**
- No new log entries after 30 seconds
- File changes in bucket not detected
- Bot compilation not happening after .bas file updates

**Diagnosis:**

```bash
# Check DriveMonitor loop logs
sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep "DRIVE_MONITOR.*Inside monitoring loop"

# Check if is_processing flag is stuck
sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep -E "(is_processing|monitoring loop)"
```

**Fix:**

```bash
# Restart botserver to clear stuck state
sudo incus exec system -- systemctl restart botserver

# Monitor startup logs to verify DriveMonitor started
sudo incus exec system -- tail -50 /opt/gbo/logs/out.log | grep "Drive Monitor"
```

### Database Schema Reference

#### List all bot databases

```bash
sudo incus exec tables -- psql -h localhost -U postgres -d postgres -c "\l" | grep bot_
```

#### List tables in a specific bot database

```bash
sudo incus exec tables -- psql -h localhost -U postgres -d bot_salesianos -c "\dt"
```

#### List botserver management tables

```bash
sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c "\dt" | grep -E "(bot|config|sync)"
```

### Connection Methods Summary

| Method | Use Case | Command Pattern |
|--------|-----------|-----------------|
| **SSH to host** | Initial access, file transfer | `ssh admin@63.141.255.9` |
| **incus exec** | Execute inside container | `sudo incus exec system -- command` |
| **psql direct** | Database queries from container | `sudo incus exec tables -- psql ...` |
| **mc (MinIO CLI)** | Inspect buckets, copy files | `sudo incus exec drive -- /opt/gbo/bin/mc ...` |
| **HTTP/curl** | Service health checks | `curl http://:5858/health` |
| **journalctl** | Systemd service logs | `sudo incus exec system -- journalctl -u botserver` |

## Vault Security Architecture

### Overview

The production environment uses **HashiCorp Vault** as the centralized secrets management
system. All sensitive credentials (database passwords, API keys, tokens) are stored in Vault, NEVER in code or environment files.

### Vault Connection Flow

```
1. botserver starts
   ↓
2. Reads VAULT_ADDR, VAULT_TOKEN from .env
   ↓
3. Initializes VaultClient with TLS/mTLS
   ↓
4. Reads secrets from Vault paths (gbo/tables, gbo/drive, etc.)
   ↓
5. Falls back to defaults if Vault unavailable
```

### Environment Variables (Allowed)

**File Location:** `/opt/gbo/bin/.env` (system container)

```bash
# Vault Connection (MANDATORY for production)
VAULT_ADDR=https://:8200
VAULT_TOKEN=
VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt

# Optional: Skip TLS verification (NOT recommended for production)
VAULT_SKIP_VERIFY=false

# Optional: Use mTLS certificates
VAULT_CLIENT_CERT=/opt/gbo/conf/system/certificates/botserver/client.crt
VAULT_CLIENT_KEY=/opt/gbo/conf/system/certificates/botserver/client.key

# Optional: Cache TTL in seconds (default: 300)
VAULT_CACHE_TTL=300

# Server Configuration
PORT=5858
DATA_DIR=/opt/gbo/data/
WORK_DIR=/opt/gbo/work/
LOAD_ONLY=default,salesianos
```

**Security Rule:**
- **ONLY** `VAULT_*` environment variables are allowed in `.env`
- All other secrets MUST come from Vault
- Hardcoded secrets in code are FORBIDDEN (see AGENTS.md)

### Vault Secret Paths Structure

#### System-Wide Paths (Global)

| Path | Purpose | Example Keys |
|------|---------|---------------|
| `gbo/tables` | Database (PostgreSQL) | host, port, database, username, password |
| `gbo/drive` | MinIO (Object Storage) | host, accesskey, secret |
| `gbo/cache` | Valkey (Redis) | host, port, password |
| `gbo/directory` | Zitadel (Auth) | url, project_id, client_id, client_secret |
| `gbo/email` | SMTP Email | smtp_host, smtp_port, smtp_user, smtp_password |
| `gbo/llm` | LLM Configuration | url, model, openai_key, anthropic_key |
| `gbo/vectordb` | Qdrant (Vector DB) | url, api_key |
| `gbo/jwt` | JWT Signing | secret |
| `gbo/meet` | Jitsi Meet | url, app_id, app_secret |
| `gbo/alm` | ALM Repository | url, token |
| `gbo/encryption` | Encryption Keys | master_key |
| `gbo/system/observability` | Monitoring | url, org, bucket, token |
| `gbo/system/security` | Security Policies | require_auth, anonymous_paths |
| `gbo/system/cloud` | Cloud Config | region, access_key, secret_key |
| `gbo/system/app` | Application Settings | url, environment |
| `gbo/system/models` | BotModels API | url |

#### Organization-Specific Paths

| Path Pattern | Purpose |
|--------------|---------|
| `gbo/orgs/{org_id}/config` | Organization configuration |
| `gbo/orgs/{org_id}/bots/{bot_id}` | Bot-specific secrets |
| `gbo/orgs/{org_id}/users/{user_id}` | User-specific secrets |
| `gbo/tenants/{tenant_id}/infrastructure` | Tenant database/cache/drive |
| `gbo/tenants/{tenant_id}/config` | Tenant configuration |

### Credential Resolution Hierarchy

For bot email configuration (example):

```
1. Check gbo/orgs/{org_id}/bots/{bot_id}/email
2. Fallback: gbo/bots/default/email
3. Fallback: gbo/email
4. Fallback: Environment variables (development only)
```

### Vault Client Initialization (Code Reference)

**File:** `botserver/src/core/secrets/mod.rs`

```rust
// SecretsManager::from_env() reads:
// - VAULT_ADDR (required)
// - VAULT_TOKEN (required)
// - VAULT_CACERT (optional, has default)
// - VAULT_SKIP_VERIFY (optional, default: false)
// - VAULT_CLIENT_CERT (optional, mTLS)
// - VAULT_CLIENT_KEY (optional, mTLS)
// - VAULT_CACHE_TTL (optional, default: 300s)

impl SecretsManager {
    pub fn from_env() -> Result {
        let addr = env::var("VAULT_ADDR").unwrap_or_default();
        let token = env::var("VAULT_TOKEN").unwrap_or_default();

        if token.is_empty() || addr.is_empty() {
            // Vault not configured - use environment variables directly
            warn!("Vault not configured. Using environment variables directly.");
            return Ok(Self { client: None, enabled: false, ... });
        }

        // Initialize VaultClient with TLS
        let client = VaultClient::new(settings)?;
        Ok(Self { client: Some(client), enabled: true, ...
        })
    }
}
```

### Vault Operations - Production Usage

#### Read Secrets from Vault

```bash
# From system container (using vault CLI)
sudo incus exec system -- bash -c '
  export VAULT_ADDR=https://10.157.134.250:8200
  export VAULT_TOKEN=
  export VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt

  # Read database secrets
  vault kv get -field=password secret/gbo/tables
  vault kv get secret/gbo/tables

  # Read drive secrets
  vault kv get secret/gbo/drive

  # Read LLM configuration
  vault kv get secret/gbo/llm
'
```

#### Read Secrets via HTTP API (from any container)

```bash
sudo incus exec system -- curl -sf \
  --cacert /opt/gbo/conf/system/certificates/ca/ca.crt \
  -H "X-Vault-Token: " \
  https://10.157.134.250:8200/v1/secret/data/gbo/drive | jq
```

#### Verify Vault Health

```bash
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health

# Expected output:
# {"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":"2026-04-10T13:55:00.123Z"}
```

### init.json (Vault Initialization Data)

**Location:** `/opt/gbo/bin/botserver-stack/conf/vault/vault-conf/init.json`

**Purpose:** Stores Vault unseal keys and root token (created during Vault initialization)

**Contents:**

```json
{
  "recovery_keys_b64": [],
  "recovery_keys_hex": [],
  "recovery_keys_shares": 0,
  "recovery_keys_threshold": 0,
  "root_token": "",
  "unseal_keys_b64": ["<5 unseal keys base64-encoded>"],
  "unseal_keys_hex": ["<5 unseal keys hex-encoded>"],
  "unseal_shares": 5,
  "unseal_threshold": 3
}
```

**Security Notes:**
- `root_token`: Used to authenticate to Vault as admin
- `unseal_keys`: Required to unseal Vault after restart (5 keys, need 3 to unseal)
- **CRITICAL:** Store `init.json` in a secure, encrypted location
- Never commit `init.json` to git or store in repo

### Troubleshooting Vault Connection

#### Issue 1: Botserver cannot connect to Vault

**Symptoms:**
- Logs show "Vault connection
failed"
- Secrets fall back to defaults
- Bot cannot authenticate to database

**Diagnosis:**

```bash
# Check Vault is running
sudo incus exec vault -- systemctl status vault

# Check Vault health
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health

# Check .env has Vault credentials
sudo incus exec system -- grep "^VAULT_" /opt/gbo/bin/.env

# Test Vault connection from system container
# (no -k here: it would disable the verification that --cacert is meant to test)
sudo incus exec system -- bash -c '
  curl -sf --cacert /opt/gbo/conf/system/certificates/ca/ca.crt \
    -H "X-Vault-Token: $(grep "^VAULT_TOKEN=" /opt/gbo/bin/.env | cut -d= -f2-)" \
    https://10.157.134.250:8200/v1/secret/data/gbo/tables
'
```

**Common Causes:**
1. Vault service not running (vault container stopped)
2. `VAULT_TOKEN` expired or invalid
3. TLS certificate path incorrect or CA certificate missing
4. Network connectivity between system and vault containers

**Fix:**

```bash
# 1. Restart Vault if stopped
sudo incus exec vault -- systemctl restart vault

# 2. Generate new token if expired
sudo incus exec vault -- bash -c '
  export VAULT_ADDR=https://localhost:8200
  export VAULT_TOKEN=
  vault token create -policy="botserver" -ttl="8760h" -format=json | jq -r .auth.client_token
'

# 3. Update .env with new token
sudo incus exec system -- sed -i "s|VAULT_TOKEN=.*|VAULT_TOKEN=|" /opt/gbo/bin/.env

# 4. Restart botserver
sudo incus exec system -- systemctl restart botserver
```

#### Issue 2: Secrets not being read from Vault

**Symptoms:**
- Logs show "Vault read failed for 'gbo/drive'"
- Services use default credentials
- DriveMonitor cannot access MinIO

**Diagnosis:**

```bash
# Check if Vault has secrets configured
sudo incus exec system -- bash -c '
  export VAULT_ADDR=https://10.157.134.250:8200
  export VAULT_TOKEN=$(grep "^VAULT_TOKEN=" /opt/gbo/bin/.env | cut -d= -f2-)
  export VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt

  echo "=== Database Secrets ==="
  vault kv get secret/gbo/tables || echo "NOT FOUND"

  echo "=== Drive Secrets ==="
  vault kv get secret/gbo/drive || echo "NOT FOUND"

  echo "=== LLM Secrets ==="
  vault kv get secret/gbo/llm || echo "NOT FOUND"
'
```

**Fix - Adding Secrets to Vault:**

```bash
sudo incus exec vault -- bash -c '
  export VAULT_ADDR=https://localhost:8200
  export VAULT_TOKEN=

  # Add database secrets
  vault kv put secret/gbo/tables \
    host= \
    port=5432 \
    database=botserver \
    username=gbuser \
    password=

  # Add drive (MinIO) secrets
  vault kv put secret/gbo/drive \
    host= \
    port=9100 \
    accesskey= \
    secret=

  # Add LLM secrets
  vault kv put secret/gbo/llm \
    url=http://localhost:8085 \
    model=gpt-4 \
    openai_key= \
    anthropic_key=
'
```

#### Issue 3: Vault sealed after restart

**Symptoms:**
- All Vault operations fail
- botserver cannot read secrets
- Logs show "Vault is sealed"

**Diagnosis:**

```bash
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health | jq .sealed
```

**Fix - Unseal Vault:**

```bash
sudo incus exec vault -- bash -c '
  # Need 3 of 5 unseal keys from init.json
  vault operator unseal
  vault operator unseal
  vault operator unseal

  # Verify unsealed
  vault status
'
```

#### Issue 4: TLS certificate errors

**Symptoms:**
- "certificate verify failed" errors
- TLS handshake failures
- curl: (60) SSL certificate problem

**Diagnosis:**

```bash
sudo incus exec system -- bash -c '
  # Check CA certificate exists
  ls -la /opt/gbo/conf/system/certificates/ca/ca.crt

  # Test certificate
  openssl x509 -in /opt/gbo/conf/system/certificates/ca/ca.crt -text -noout
'
```

**Fix:**

```bash
# If the CA cert is missing, copy it from the vault container.
# Run these on the host: `incus file pull/push` moves files between the host and a container,
# and the incus CLI is not expected to be available inside the system container.
sudo incus file pull vault/opt/gbo/conf/vault/ca.crt /tmp/ca.crt
sudo incus exec system -- mkdir -p /opt/gbo/conf/system/certificates/ca/
sudo incus file push /tmp/ca.crt system/opt/gbo/conf/system/certificates/ca/ca.crt
sudo incus exec system -- chmod 644 /opt/gbo/conf/system/certificates/ca/ca.crt
```

### Security Best Practices

1. **Never commit secrets to git**
   - No API keys, passwords, tokens in code
   - Use Vault for ALL sensitive data
   - Init secrets from `SecretsManager::from_env()`

2. **Use Vault for all service credentials**
   - Database passwords: `gbo/tables`
   - MinIO keys: `gbo/drive`
   - LLM API keys: `gbo/llm`
   - Email passwords: `gbo/email`

3. **Rotate credentials regularly**
   - Generate new tokens/keys periodically
   - Update Vault using `vault kv put`
   - No need to restart services: new values are picked up on the next read, once the cached copy expires (`VAULT_CACHE_TTL`, default 300 seconds)

4. **Enable TLS/mTLS in production**
   - Always use `VAULT_CACERT`
   - Enable mTLS for critical services: `VAULT_CLIENT_CERT` + `VAULT_CLIENT_KEY`
   - Never use `VAULT_SKIP_VERIFY=true` in production

5. **Limit token lifetimes**
   - Root token: single use or very short TTL
   - Service tokens: limited to needed time (e.g., 8760h = 1 year)
   - Generate new tokens when old ones expire

6. **Audit Vault access**

   ```bash
   # List the audit devices that are enabled
   sudo incus exec vault -- vault audit list
   # Enable a file audit device if none is listed
   sudo incus exec vault -- vault audit enable file file_path=/var/log/vault_audit.log
   ```

### Vault Backup & Recovery

#### Backup Vault Data

```bash
# Snapshot vault container (includes all secrets)
sudo incus snapshot create vault backup-$(date +%Y%m%d-%H%M)

# Export Vault config (init.json with unseal keys)
# Move the copy to encrypted storage right away; do not leave unseal keys in /tmp
sudo incus exec vault -- cat /opt/gbo/bin/botserver-stack/conf/vault/vault-conf/init.json > /tmp/vault-init.json

# Backup all secrets (JSON format)
sudo incus exec vault -- bash -c '
  export VAULT_ADDR=https://localhost:8200
  export VAULT_TOKEN=

  # Backup each path ("/" in the path becomes "-" in the file name, e.g. /tmp/vault-gbo-tables.json)
  for path in gbo/tables gbo/drive gbo/cache gbo/llm; do
    vault kv get -format=json secret/$path > /tmp/vault-${path//\//-}.json
  done
'
```

#### Restore from Snapshot

```bash
# Stop vault
sudo incus exec vault -- systemctl stop vault

# Restore snapshot
sudo incus snapshot restore vault

# Start vault
sudo incus exec vault -- systemctl start vault

# Wait for Vault to be ready
sleep 10

# Verify health
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health
```

## Troubleshooting

### GLIBC Version Mismatch

**Symptom**: `GLIBC_2.39 not found` or `GLIBC_2.38 not found`

**Cause**: Binary compiled on CI runner (glibc 2.41) but runs in system container (glibc 2.36)

**Fix**: The CI workflow must build inside the system container. Check that `botserver.yaml` uses SSH to build in the container.
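
The mismatch described above can be confirmed before deploying by comparing the newest glibc symbol version a binary requires with the glibc version of the target. The following is a minimal sketch, not part of the CI workflow: the helper names `max_glibc`, `version_le`, and `check_glibc` are hypothetical, and it assumes GNU coreutils (`sort -V`) is available.

```bash
#!/usr/bin/env bash
# Sketch: compare the newest glibc symbol version a binary needs with the
# glibc version available where it will run. Helper names are hypothetical.

# Print the highest GLIBC_x.y version found on stdin (e.g. `objdump -T` output).
max_glibc() {
  grep -oE 'GLIBC_[0-9]+(\.[0-9]+)+' | sed 's/^GLIBC_//' | sort -uV | tail -n 1
}

# Succeed when version $1 is lower than or equal to version $2.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Report whether a binary that needs glibc $1 can run on glibc $2.
check_glibc() {
  if version_le "$1" "$2"; then
    echo "OK: needs glibc $1, target has $2"
  else
    echo "MISMATCH: needs glibc $1, target has only $2"
    return 1
  fi
}
```

One possible use, assuming `objdump` (binutils) is installed where the binary was built: obtain the requirement with `objdump -T /opt/gbo/bin/botserver | max_glibc`, read the target version inside the `system` container with `ldd --version | head -n 1 | grep -oE '[0-9]+\.[0-9]+$'`, and pass both values to `check_glibc`. With the versions from the symptom above, `check_glibc 2.39 2.36` prints `MISMATCH: needs glibc 2.39, target has only 2.36`.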
### botserver Not Starting ```bash # Check binary sudo incus exec system -- ldd /opt/gbo/bin/botserver | grep "not found" # Check direct execution sudo incus exec system -- timeout 10 /opt/gbo/bin/botserver 2>&1 # Check data directory sudo incus exec system -- ls -la /opt/gbo/data/ ``` ### botui Can't Reach botserver ```bash # Check BOTSERVER_URL sudo incus exec system -- grep BOTSERVER_URL /etc/systemd/system/ui.service # Must be http://localhost:5858, NOT https://system.example.com # Fix: sudo incus exec system -- sed -i 's|BOTSERVER_URL=.*|BOTSERVER_URL=http://localhost:5858|' /etc/systemd/system/ui.service sudo incus exec system -- systemctl daemon-reload sudo incus exec system -- systemctl restart ui ``` ### Suggestions Not Showing ```bash # Check bot files exist sudo incus exec system -- ls -la /opt/gbo/data/.gbai/.gbdialog/ # Check for compilation errors sudo incus exec system -- tail -50 /opt/gbo/logs/out.log | grep -i "error\|fail\|compile" # Clear cache and restart sudo incus exec system -- find /opt/gbo/work -name "*.ast" -delete sudo incus exec system -- systemctl restart botserver ``` ### IPv6 DNS Issues **Symptom**: External API calls (Groq, Cloudflare) timeout **Cause**: Container DNS returns AAAA records but no IPv6 connectivity **Fix**: Container has `IPV6=no` in network config and `gai.conf` labels. If issues persist, check `RES_OPTIONS=inet4` in botserver.service. ### Vault Connection & Service Discovery Issues **Symptom**: Logs show `Failed to read data directory ` or `Config scan failed` **Cause**: Botserver is using hardcoded development paths instead of production paths **Fix**: 1. **Check current configuration**: ```bash # Check .env file sudo incus exec system -- cat /opt/gbo/bin/.env # Check data directory sudo incus exec system -- ls -la /opt/gbo/data/ sudo incus exec system -- ls -la /opt/gbo/work/ ``` 2. 
**Verify Vault connection**: ```bash # Test Vault from system container sudo incus exec system -- curl -k -sf https://:8200/v1/sys/health # Check Vault token sudo incus exec system -- grep VAULT_TOKEN /opt/gbo/bin/.env ``` 3. **Check service discovery**: ```bash # Check if botserver is reading Vault secrets sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep -i vault # Check for service configuration errors sudo incus exec system -- tail -100 /opt/gbo/logs/err.log | grep -i "config\|service" ``` 4. **Fix data directory paths**: - Ensure botserver uses `/opt/gbo/data/` instead of development paths - Update configuration if hardcoded paths exist - Restart botserver after fixing 5. **Verify all services are accessible**: ```bash # Check PostgreSQL sudo incus exec system -- pg_isready -h -p 5432 # Check Valkey sudo incus exec system -- redis-cli -h -a ping # Check MinIO sudo incus exec system -- curl -sf http://:9100/minio/health/live ``` 6. **Update botserver configuration**: - Ensure botserver reads from `/opt/gbo/bin/.env` for Vault configuration - Verify service discovery uses Vault to get service endpoints - Check that data directory is set to `/opt/gbo/data/` in configuration - Update systemd service if needed: ```bash sudo incus exec system -- cat /etc/systemd/system/botserver.service # Ensure EnvironmentFile=/opt/gbo/bin/.env is present ``` 7. 
   **Test after fixes**:
   ```bash
   # Restart botserver
   sudo incus exec system -- systemctl restart botserver

   # Wait for startup
   sleep 10

   # Check logs for errors
   sudo incus exec system -- tail -50 /opt/gbo/logs/err.log

   # Verify health endpoint
   curl -sf http://:5858/health
   ```

### Vault Connection Errors

**Symptom**: `Vault connection failed` or `Vault token invalid`

**Fix**:

```bash
# Check Vault is running
sudo incus exec vault -- systemctl status vault

# Check Vault health
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health

# Verify token is valid
sudo incus exec system -- bash -c '
  export VAULT_ADDR=https://:8200
  export VAULT_TOKEN=
  export VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt
  vault token lookup
'

# If token is invalid, generate new one
sudo incus exec vault -- bash -c '
  export VAULT_ADDR=https://localhost:8200
  export VAULT_TOKEN=
  vault token create -policy="botserver" -ttl="8760h"
'

# Update .env with new token
sudo incus exec system -- sed -i 's|VAULT_TOKEN=.*|VAULT_TOKEN=|' /opt/gbo/bin/.env
sudo incus exec system -- systemctl restart botserver
```

### Service Discovery Failures

**Symptom**: `Service not found` or `Failed to connect to service`

**Fix**:

```bash
# Check if service is running
sudo incus exec tables -- systemctl status postgresql
sudo incus exec cache -- systemctl status valkey
sudo incus exec drive -- systemctl status minio

# Check if service is accessible from system container
sudo incus exec system -- nc -zv 5432   # PostgreSQL
sudo incus exec system -- nc -zv 6379   # Valkey
sudo incus exec system -- nc -zv 9100   # MinIO

# Check Vault has service configuration
sudo incus exec system -- bash -c '
  export VAULT_ADDR=https://:8200
  export VAULT_TOKEN=
  export VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt
  vault kv list secret/botserver
'

# If service config is missing, add it (see Vault Configuration section)
```

### Monitoring & Verification

**Check botserver is working correctly**:

```bash
# Health check
curl -sf http://:5858/health

# Check logs for errors
sudo incus exec system -- tail -100 /opt/gbo/logs/err.log | grep -i "error\|fail"

# Check logs for successful service connections
sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep -i "connected\|service\|vault"

# Verify data directory is correct
sudo incus exec system -- tail -100 /opt/gbo/logs/out.log | grep -i "data\|work"
# Should show /opt/gbo/data/ and /opt/gbo/work/, not development paths
```

**Expected log output**:

```
info vault:Connected to Vault at https://:8200
info service_discovery:Loaded service configuration from Vault
info database:Connected to PostgreSQL at :5432
info cache:Connected to Valkey at :6379
info storage:Connected to MinIO at http://:9100
info watcher:Watching data directory /opt/gbo/data
info botserver:BotServer started successfully on port 5858
```

**If logs show errors**:

1. Check Vault connection (see Vault Connection Errors section)
2. Check service accessibility (see Service Discovery Failures section)
3. Fix data directory paths (see Fix Development Paths in Production section)
4.
   Restart botserver and verify again

### Vault Backup & Restore

**Create Vault snapshot**:

```bash
# Stop Vault
sudo incus exec vault -- systemctl stop vault

# Create snapshot
sudo incus snapshot create vault manual-$(date +%Y-%m-%d-%H%M)

# Start Vault
sudo incus exec vault -- systemctl start vault

# Verify
sudo incus snapshot list vault
```

**Restore Vault from snapshot**:

```bash
# Stop Vault
sudo incus exec vault -- systemctl stop vault

# List snapshots
sudo incus snapshot list vault

# Restore from latest snapshot
sudo incus snapshot restore vault

# Start Vault
sudo incus exec vault -- systemctl start vault

# Verify Vault is running
sudo incus exec vault -- systemctl status vault
sudo incus exec vault -- curl -k -sf https://localhost:8200/v1/sys/health
```

**Automated snapshots**:

```bash
# Create a daily cron job on the host. Snapshots are taken with the host's
# incus client, so the script must not live inside the vault container.
# The quoted 'EOF' keeps $(date ...) unexpanded until the job actually runs.
sudo tee /etc/cron.daily/vault-snapshot > /dev/null << 'EOF'
#!/bin/bash
incus exec vault -- systemctl stop vault
incus snapshot create vault "daily-$(date +%Y%m%d)"
incus exec vault -- systemctl start vault
EOF
sudo chmod +x /etc/cron.daily/vault-snapshot
```

### Update Botserver for Production

**Required changes in botserver code**:

1. **Read configuration from Vault**:
   - Add Vault client initialization
   - Read service endpoints from Vault
   - Read secrets from Vault
   - Fall back to environment variables if Vault is unavailable

2. **Use production paths**:
   - Remove hardcoded development paths
   - Use environment variables for data directory
   - Default to `/opt/gbo/data/` for production

3. **Update .env file**:
   ```bash
   # /opt/gbo/bin/.env
   VAULT_ADDR=https://:8200
   VAULT_TOKEN=
   VAULT_CACERT=/opt/gbo/conf/system/certificates/ca/ca.crt
   DATA_DIR=/opt/gbo/data/
   WORK_DIR=/opt/gbo/work/
   PORT=5858
   ```

4.
   **Update systemd service**:
   ```bash
   # Write the unit file inside the container. tee runs in the container,
   # whereas a plain `cat > file` redirect would be applied on the host.
   sudo incus exec system -- tee /etc/systemd/system/botserver.service > /dev/null << 'EOF'
   [Unit]
   Description=BotServer Service
   After=network.target

   [Service]
   User=root
   Group=root
   WorkingDirectory=/opt/gbo/bin
   EnvironmentFile=/opt/gbo/bin/.env
   ExecStart=/opt/gbo/bin/botserver --noconsole
   Restart=always
   RestartSec=5
   StandardOutput=append:/opt/gbo/logs/out.log
   StandardError=append:/opt/gbo/logs/err.log

   [Install]
   WantedBy=multi-user.target
   EOF

   sudo incus exec system -- systemctl daemon-reload
   sudo incus exec system -- systemctl restart botserver
   ```

5. **Deploy updated botserver**:
   ```bash
   # Push changes to ALM (ask first; ALM is production)
   cd botserver && git push alm main && git push origin main

   # CI will build and deploy automatically
   # Or manually deploy (see Manual Deploy section)
   ```

## Security

- **NEVER** push secrets to git
- **NEVER** commit files containing credentials to the repository root
- **Vault** is the single source of truth for secrets
- **CI/CD** is the only deployment method; never manually scp binaries
- **ALM** is production; ask before pushing
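
The "never push secrets to git" rule can be partly enforced locally. The following is a minimal sketch, assuming a Git pre-commit hook fits your workflow; it is not part of the existing tooling. The function names `flag_secret_files` and `check_staged` are hypothetical, and treating `.env` as a blocked file name is an assumption based on it holding the Vault token.

```bash
#!/usr/bin/env bash
# Hypothetical pre-commit helpers; names and blocklist are illustrative assumptions.

# Read file paths on stdin and print those whose file name is on the blocklist.
flag_secret_files() {
  local path
  while IFS= read -r path; do
    case "${path##*/}" in
      init.json|.env) printf '%s\n' "$path" ;;
    esac
  done
}

# Return non-zero if any staged file is on the blocklist.
check_staged() {
  local blocked
  blocked=$(git diff --cached --name-only | flag_secret_files)
  if [ -n "$blocked" ]; then
    printf 'Refusing to commit secret files:\n%s\n' "$blocked" >&2
    return 1
  fi
}
```

To try it, source the file and call `check_staged` from `.git/hooks/pre-commit`. A name-based check only catches the obvious cases; it does not detect secrets pasted into other files.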