Commit graph

4135 commits

Author SHA1 Message Date
b3edf21d21 fix: init_redis fetches cache password from Vault (fixes connection timeout)
All checks were successful
BotServer CI/CD / build (push) Successful in 4m59s
- Root cause: init_redis() used redis://localhost:6379 without password
- Valkey requires authentication, causing connection timeouts
- Fix: use get_cache_config() from SecretsManager to build URL with password
- Falls back to env vars (CACHE_URL/REDIS_URL/VALKEY_URL) if set
2026-04-01 20:17:37 -03:00
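A minimal sketch of the URL resolution described in b3edf21d21, using only the standard library. `CacheConfig`, `build_cache_url`, and `resolve_cache_url` are hypothetical names and do not reproduce the project's `get_cache_config()` or `SecretsManager`. The precedence (Vault first, then the three env vars in the listed order, then a passwordless localhost default) is an assumption read from the commit message.

```rust
/// Connection details as they might come back from the secrets store (hypothetical shape).
struct CacheConfig {
    host: String,
    port: u16,
    password: Option<String>,
}

/// Percent-encode everything except RFC 3986 unreserved characters,
/// so generated passwords with special characters stay valid inside a URL.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for b in input.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Build a redis:// URL, adding the password part only when one is present.
fn build_cache_url(cfg: &CacheConfig) -> String {
    match cfg.password.as_deref() {
        Some(p) if !p.is_empty() => {
            format!("redis://:{}@{}:{}", percent_encode(p), cfg.host, cfg.port)
        }
        _ => format!("redis://{}:{}", cfg.host, cfg.port),
    }
}

/// Assumed precedence: Vault config, then env vars, then a localhost default.
fn resolve_cache_url<F>(from_vault: Option<CacheConfig>, env: F) -> String
where
    F: Fn(&str) -> Option<String>,
{
    if let Some(cfg) = from_vault {
        return build_cache_url(&cfg);
    }
    for key in ["CACHE_URL", "REDIS_URL", "VALKEY_URL"] {
        if let Some(url) = env(key).filter(|u| !u.is_empty()) {
            return url;
        }
    }
    "redis://localhost:6379".to_string()
}
```

In real use the lookup closure could be `|k| std::env::var(k).ok()`. Taking it as a parameter keeps the function testable without mutating the process environment.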
3c9e4ba6e7 fix: cache_health_check uses ss instead of nc (nc missing in prod container)
All checks were successful
BotServer CI/CD / build (push) Successful in 4m42s
- Root cause: prod container lacks nc (netcat), causing fallback to valkey-cli ping
- valkey-cli ping hangs indefinitely when Valkey requires password auth
- Fix: use ss -tlnp as primary check (always available), nc as fallback
- Testing: verified ss is available in prod, nc is not
2026-04-01 20:06:13 -03:00
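The port check from 3c9e4ba6e7 can be sketched as follows. This is an illustration, not the project's `cache_health_check`: the function names are hypothetical, and the parser assumes the usual `ss` column layout in which the local address is the fourth whitespace-separated field. The sketch calls `ss -tln` (the commit uses `-tlnp`; the extra process column does not affect the field being parsed).

```rust
use std::process::Command;

/// Return true if any LISTEN row in `ss` output has a local address ending in `:port`.
fn port_is_listening(ss_output: &str, port: u16) -> bool {
    let wanted = port.to_string();
    ss_output
        .lines()
        // Skip the header row and anything that is not a listening socket.
        .filter(|line| line.starts_with("LISTEN"))
        // Fourth column is "Local Address:Port" (assumed layout).
        .filter_map(|line| line.split_whitespace().nth(3))
        // Split on the last ':' so IPv6 forms like [::]:6379 also work.
        .filter_map(|local| local.rsplit_once(':'))
        .any(|(_, p)| p == wanted)
}

/// Run `ss` and check the port. No authentication is involved,
/// so this cannot hang on a password prompt the way a PING can.
fn cache_port_is_up(port: u16) -> bool {
    match Command::new("ss").arg("-tln").output() {
        Ok(out) if out.status.success() => {
            port_is_listening(&String::from_utf8_lossy(&out.stdout), port)
        }
        _ => false,
    }
}
```

A listening port only shows that the process has bound its socket. It does not prove that the server accepts authenticated commands.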
d098961142 fix: Bootstrap checks stack/.env path in addition to ./.env
All checks were successful
BotServer CI/CD / build (push) Successful in 4m39s
- Production has .env in botserver-stack/.env not ./.env
- Checks both locations to detect completed bootstrap
- Fixes E0716: use let bindings for Path borrows
2026-04-01 19:30:08 -03:00
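A sketch of the two-location check from d098961142, under the assumption that both paths are resolved relative to one base directory. The function names are hypothetical. The owned `PathBuf` values are bound with `let` before they are borrowed, which is the general shape of fix for E0716 (temporary value dropped while borrowed).

```rust
use std::path::{Path, PathBuf};

/// Candidate .env locations: ./.env and botserver-stack/.env under `base`.
fn env_file_candidates(base: &Path) -> [PathBuf; 2] {
    [base.join(".env"), base.join("botserver-stack").join(".env")]
}

/// Bootstrap counts as completed if either candidate file exists.
fn bootstrap_completed(base: &Path) -> bool {
    // Bind the owned paths first; borrowing from an unbound temporary such as
    // `base.join(".env").as_path()` and using it later is what triggers E0716.
    let candidates = env_file_candidates(base);
    candidates.iter().any(|p| p.is_file())
}
```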
8fd3254334 fix: Bootstrap checks stack/.env path in addition to ./.env
Some checks failed
BotServer CI/CD / build (push) Failing after 1m25s
- Production has .env in botserver-stack/.env not ./.env
- Checks both locations to detect completed bootstrap
- Prevents full re-bootstrap on restart in production
2026-04-01 19:26:32 -03:00
318367d439 fix: Valkey health check uses nc first (avoids password hang)
All checks were successful
BotServer CI/CD / build (push) Successful in 3m58s
- nc -z checks port connectivity instantly (no auth needed)
- valkey-cli ping as fallback (hangs when password required)
- Fixes bootstrap hang on production where Valkey has Vault password
2026-04-01 18:52:04 -03:00
c26e483cc9 fix: All services check health before starting (idempotent bootstrap)
All checks were successful
BotServer CI/CD / build (push) Successful in 4m9s
- Tables (PostgreSQL): pg_isready health check before start
- Drive (MinIO): /minio/health/live check before start
- ALM (Forgejo): HTTP health check before start
- ALM CI (Forgejo Runner): pgrep check before start
- Valkey: health check uses absolute path to valkey-cli
- Vault, Qdrant, Zitadel: already had health checks
- Result: no duplicate starts, no hangs on restart
2026-04-01 18:28:54 -03:00
b47d928608 fix: CI system dependencies check before apt-get update
All checks were successful
BotServer CI/CD / build (push) Successful in 3m47s
- Check if packages already installed with dpkg before running apt-get
- Skip apt-get update/install if all dependencies present
- Reduces CI time from ~30s to instant on subsequent runs
2026-04-01 18:19:10 -03:00
ba7f1ba5eb fix: Valkey health check uses absolute path to valkey-cli
Some checks failed
BotServer CI/CD / build (push) Has been cancelled
- Use BOTSERVER_STACK_PATH/bin/cache/bin/valkey-cli instead of relying on PATH
- Remove bash /dev/tcp fallback (unreliable in restricted environments)
- Falls back to redis-cli and nc if valkey-cli unavailable
2026-04-01 18:11:26 -03:00
68ef554132 fix: Vault as single source of truth - credentials + location for all services
All checks were successful
BotServer CI/CD / build (push) Successful in 4m53s
- Qdrant health check: recognize 'healthz check passed' response (fixes 45s timeout)
- seed_vault_defaults: add host/port/url/grpc_port for ALL 10 services
- fetch_vault_credentials: fetch ALL services via generic loop (drive, cache, tables, vectordb, directory, llm, meet, alm, encryption)
- vectordb URL: fix https://localhost:6334 -> http://localhost:6333 in all config getters
- get_from_env: add host/port/grpc_port for vectordb fallback
- Tested: .reset (fresh install) + .restart (idempotent) - zero errors
2026-04-01 16:46:16 -03:00
fb2e5242da fix: Vault seeding, service health checks, and restart idempotency
All checks were successful
BotServer CI/CD / build (push) Successful in 55m52s
- Replace hardcoded passwords with generate_random_string() for all Vault-seeded services
- Add valkey-cli, nc to SafeCommand allowlist; fix PATH in all 4 execution methods
- Write a 'none' placeholder for empty Vault KV values, preventing 'Failed to parse K=V' errors
- Fix special chars in generated passwords triggering shell injection false positives
- Add ALM app.ini creation with absolute paths for Forgejo CLI
- Increase Qdrant timeout 15s→45s, ALM wait 5s→20s
- Persist file_states and kb_states to disk for .bas/KB idempotency across restarts
- Add duplicate check to use_website registration (debug log for existing)
- Remove dead code (SERVER_START_EPOCH, server_epoch)
- Add generate_random_string() to shared mod.rs, remove duplicates
2026-04-01 12:22:57 -03:00
3e46a16469 fix: Seed default credentials into Vault after initialization
Some checks failed
BotServer CI/CD / build (push) Failing after 3h13m28s
- Add seed_vault_defaults() to write default creds for all components
  (drive, cache, tables, directory, email, llm, encryption, meet, vectordb, alm)
- Call seed_vault_defaults() after KV2 enable in initialize_vault_local()
- Call seed_vault_defaults() in recover_existing_vault() for recovery path
- Rewrite fetch_vault_credentials() to use SafeCommand directly instead of
  safe_sh_command, avoiding '//' shell injection false positive on URLs
- Components like Drive now get credentials from Vault instead of failing with 403 errors
2026-03-31 22:19:09 -03:00
9919a8321c fix: Use SafeCommand directly for vault health check to avoid shell injection false positive
All checks were successful
BotServer CI/CD / build (push) Successful in 6m46s
- Replace safe_sh_command with SafeCommand::new("curl").args() in vault_health_check()
- The URL contains https:// which triggered '//' pattern detection in shell command
- Direct SafeCommand bypasses shell parsing, URL passed as single argument
- Add vault data directory existence check before recovery attempt
- Prevents 'Dangerous pattern // detected' errors during bootstrap
2026-03-31 21:34:04 -03:00
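The idea behind 9919a8321c, sketched with `std::process::Command` standing in for the project's `SafeCommand` wrapper (whose API is not shown here). The URL is passed as one argument, so no shell ever parses it and a `//` pattern check on a command string has nothing to match. `vault_health_command` is a hypothetical name. The URL is supplied by the caller.

```rust
use std::process::Command;

/// Build a curl invocation with the URL as a single argv entry.
/// No shell is involved, so characters like '//' or '&' in the URL are inert.
fn vault_health_command(url: &str) -> Command {
    let mut cmd = Command::new("curl");
    // -sS: quiet but still report errors; --max-time: give up after 5 seconds.
    cmd.args(["-sS", "--max-time", "5", url]);
    cmd
}
```

Calling `.output()` on the returned command spawns curl directly. By contrast, `sh -c "curl ... <url>"` would pass the whole string through shell parsing first.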
07a6c1edb3 Merge commit '582ea634'
All checks were successful
BotServer CI/CD / build (push) Successful in 7m38s
2026-03-31 21:10:25 -03:00
582ea634e7 fix: Vault bootstrap recovery for sealed but initialized instances
- Fix vault_health_check() stub that always returned false
- Add recover_existing_vault() to handle Vault with existing data but no init.json
- Add unseal_vault() helper to unseal with existing vault-unseal-keys
- Detect initialized Vault via health endpoint or data directory presence
- Prevents bootstrap failure when reset.sh deletes init.json but Vault data persists

Root cause: vault_health_check() was a stub returning false, causing bootstrap
to always try vault operator init on already-initialized (but sealed) Vault,
which failed with connection refused. This cascaded to all services failing
to fetch credentials from Vault.
2026-03-31 20:49:29 -03:00
4ae16017ff Merge commit '644dfe2d'
Some checks failed
BotServer CI/CD / build (push) Has been cancelled
2026-03-31 19:57:57 -03:00
644dfe2d19 fix: Improve .gbdialog file detection for nested paths 2026-03-31 19:57:33 -03:00
2fa59057fa fix: Resolve migration error, Vault 403, cache timeout, and shell injection false positives
Some checks failed
BotServer CI/CD / build (push) Has been cancelled
- Fix migration 6.2.5: Create lost_reason column before VIEW that references it
- Fix Vault 403: Enable KV2 secrets engine after initialization
- Fix cache timeout: Increase Valkey readiness wait from 12s to 30s
- Fix command_guard: Remove () from forbidden chars (safe in std::process::Command)
2026-03-31 19:55:16 -03:00
26684e2db3 fix(ci): Separate deploy steps, use killall instead of pkill, add numbered traces
All checks were successful
BotServer CI/CD / build (push) Successful in 3m24s
2026-03-31 19:09:01 -03:00
dfa78890ab fix(ci): Use tar+zstd over ssh for binary deploy
Some checks failed
BotServer CI/CD / build (push) Failing after 7s
2026-03-31 19:07:31 -03:00
ab620aa729 fix(ci): Separate deploy steps with timeouts
Some checks failed
BotServer CI/CD / build (push) Failing after 8s
2026-03-31 19:02:21 -03:00
1fdb0df634 fix(ci): Add verbose traces to rsync deploy step
Some checks failed
BotServer CI/CD / build (push) Failing after 8s
2026-03-31 18:58:24 -03:00
1e14bd8b3b fix(ci): Use rsync for deploy instead of incus
Some checks failed
BotServer CI/CD / build (push) Failing after 14s
2026-03-31 18:35:16 -03:00
64fea8ea4e fix(ci): Direct incus file push between containers
Some checks failed
BotServer CI/CD / build (push) Failing after 7s
2026-03-31 18:15:33 -03:00
dca097f63b fix(ci): Use incus file transfer for large binary deploy
Some checks failed
BotServer CI/CD / build (push) Failing after 8s
2026-03-31 18:14:26 -03:00
307aa3180f fix(ci): Add /root/.cargo/bin to PATH for alm-ci runner
All checks were successful
BotServer CI/CD / build (push) Successful in 20s
2026-03-31 15:32:37 -03:00
fa8701d69e fix(ci): Remove hardcoded hosts entry
Some checks failed
BotServer CI/CD / build (push) Failing after 7s
2026-03-31 15:28:53 -03:00
a5dca2216b fix(ci): Update alm IP to 63.141.255.9 and simplify setup step 2026-03-31 15:28:43 -03:00
4c0cfb72eb fix(ci): Simplify SSL setup step, remove fragile git config parsing 2026-03-31 15:28:11 -03:00
e3faf7711c fix(ci): Remove internet-dependent rust/sccache installation steps
Some checks failed
BotServer CI/CD / build (push) Failing after 0s
- alm-ci has Rust and tools pre-installed offline
- Remove sccache download from GitHub releases
- Remove sccache server startup and stats display
2026-03-31 15:21:35 -03:00
3da26f497c fix: Use pre-installed Rust tools instead of downloading
- Remove Install Rust step (pre-installed in /root/.cargo/bin)
- Remove Install sccache step (pre-installed)
- Use pre-installed tools path instead
2026-03-31 12:55:21 -03:00
071f66d96e revert: Remove Google DNS workaround, fixed at DNS root 2026-03-31 12:35:46 -03:00
492e3c9dc8 fix: Add Google DNS for external domains 2026-03-31 12:26:27 -03:00
bc16a00e4d fix: Update container name and remove bad git url replacements 2026-03-31 12:15:49 -03:00
885b20a626 fix: Update container name and remove bad git url replacements 2026-03-31 12:15:25 -03:00
291b7e2e65 ci: Trigger build
Some checks failed
BotServer CI/CD / build (push) Failing after 13s
2026-03-21 21:09:49 -03:00
a30aeada83 revert: Remove IP hardcode from workflow 2026-03-21 21:08:50 -03:00
4893a01375 ci: Add hostname setup step 2026-03-21 21:08:11 -03:00
43f2eb7e5c ci: Add health check after botserver deploy 2026-03-21 20:47:13 -03:00
b83b4ffc4d fix: Remove server_epoch() from start_bas_executed Redis key
The epoch caused a new key to be created every second, bypassing
the 'already executed' check and running start.bas multiple times,
resulting in triplicated suggestions.
2026-03-21 20:40:25 -03:00
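The bug fixed in b83b4ffc4d can be reproduced in miniature. Here a `HashSet` stands in for Redis, and the key layout (`start_bas_executed:<session>[:<epoch>]`) is an assumption made for illustration. Only the key name `start_bas_executed` comes from the commit.

```rust
use std::collections::HashSet;

/// Buggy form: the epoch makes the key different every second.
fn key_with_epoch(session_id: &str, epoch_secs: u64) -> String {
    format!("start_bas_executed:{}:{}", session_id, epoch_secs)
}

/// Fixed form: the key depends only on the session.
fn stable_key(session_id: &str) -> String {
    format!("start_bas_executed:{}", session_id)
}

/// Returns true when the script should run, i.e. the key was not seen before.
/// `HashSet::insert` returns false if the value was already present.
fn should_run(seen: &mut HashSet<String>, key: String) -> bool {
    seen.insert(key)
}
```

With `key_with_epoch`, three connections one second apart yield three distinct keys and therefore three runs. With `stable_key`, only the first connection runs the script.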
059ecb7187 feat(salesianos): add gbdialog tools with AST and tool.json
- inscricao, consultar_inscricao, agendamento_visita, informacoes_curso
- documentos_necessarios, contato_secretaria, calendario_letivo
- start.bas with USE_KB carta, proc
2026-03-21 19:21:46 -03:00
1132983064 feat(kb): add with_bot_config to load embedding from bot config
- Adds KnowledgeBaseManager::with_default_config() as alias to new()
- Adds KnowledgeBaseManager::with_bot_config() to load embedding_url,
  embedding_model, and qdrant config from bot's config.csv
- Updates bootstrap to use with_bot_config with default_bot_id
- Enables per-bot embedding configuration instead of global env vars
2026-03-21 18:55:36 -03:00
9641842e8d fix(migration): add WHERE org_id IS NOT NULL to SLA policy inserts to prevent crash on bots with no org 2026-03-21 17:00:46 -03:00
622f1222dc fix(websocket): force start.bas execution on connection to restore chat on page reload while preventing duplicate execution 2026-03-21 16:38:03 -03:00
363c056bab fix(bootstrap): add strict timeout to Redis connection initialization to prevent hanging on dropped tcp packets 2026-03-21 14:37:04 -03:00
adb26330d2 fix: Simple 50ms timeout for Redis connection 2026-03-21 10:48:47 -03:00
9d6c2686f1 fix: Remove connection caching (no Clone) 2026-03-21 10:37:49 -03:00
b3ce293487 fix: Clean up duplicate Redis code and fix WebSocket log level 2026-03-21 10:30:19 -03:00
cfe6453d1e perf: Add shared Redis connection pool with 50ms timeout 2026-03-21 10:14:10 -03:00
43fd40aed9 fix: Add timeout to Redis get_connection to prevent blocking
- Added get_redis_connection() helper with 2s timeout
- All cache operations now fail fast if Valkey is not ready
- Prevents start.bas from blocking for minutes waiting for cache
- Changes: add_suggestion.rs
2026-03-21 09:34:41 -03:00
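A fail-fast connection along the lines of 43fd40aed9, sketched at the TCP level with the standard library only. It is not the project's `get_redis_connection()`: a real helper would go through a Redis client, and `connect_with_timeout` is a hypothetical name. The caller would pass the 2s timeout mentioned in the commit.

```rust
use std::io;
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Try each resolved address with a bounded connect, and bound later reads and
/// writes as well, so a cache that is not ready cannot block the caller.
fn connect_with_timeout(addr: &str, timeout: Duration) -> io::Result<TcpStream> {
    let mut last_err = io::Error::new(io::ErrorKind::InvalidInput, "address resolved to nothing");
    for sock_addr in addr.to_socket_addrs()? {
        match TcpStream::connect_timeout(&sock_addr, timeout) {
            Ok(stream) => {
                stream.set_read_timeout(Some(timeout))?;
                stream.set_write_timeout(Some(timeout))?;
                return Ok(stream);
            }
            // Remember the failure and try the next resolved address, if any.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

For example, `connect_with_timeout("localhost:6379", Duration::from_secs(2))` returns an error after at most about two seconds per resolved address instead of waiting on the OS default.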
4f43b9eefc ci: Fix pkill pattern to match running process 2026-03-20 22:57:10 -03:00