Root cause: AuthConfig::from_env() was creating a new tokio runtime
with Runtime::new() inside an existing runtime during initialization.
Impact: Botserver crashed with "Cannot start a runtime from within a
runtime" panic right after CORS layer initialization.
Fix: Use new_current_thread() + std:🧵:spawn pattern (same as
get_database_url_sync fix) to create an isolated thread for async operations.
Files: src/security/auth_api/config.rs
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Root cause: block_in_place + new_current_thread().block_on() panics when
called from within tokio runtime (including spawn_blocking). Tokio doesn't
allow nested block_on() calls.
Fix: Replace ALL block_in_place patterns with std:🧵:spawn + mpsc channel.
This creates a completely separate OS thread with its own runtime, avoiding
any nesting issues. Works from any context: async, spawn_blocking, or sync.
Files: 14 files across secrets, utils, state, calendar, analytics, email,
and all keyword handlers (universal_messaging, search, book, create_draft,
create_site, hearing/syntax, use_tool, find, admin_email, goals)
Root cause: new_current_thread().block_on() panics when called from within
an existing tokio runtime (including from spawn_blocking). Tokio doesn't
allow nested block_on() calls.
Fix: Use std:🧵:spawn to create a completely separate OS thread
with its own runtime, communicating via mpsc channel. This works from
any context: async, spawn_blocking, or sync.
Root cause: block_in_place + new_current_thread().block_on() panics when
called from within tokio::task::spawn_blocking because block_in_place is
designed for async worker threads, not blocking threads.
Fix: Remove all block_in_place wrappers and use new_current_thread().build().block_on()
directly. This works from both async contexts and spawn_blocking contexts.
Affected: utils.rs (get_database_url_sync, get_work_path)
Root cause: Handle::current().block_on() panics inside multi_thread runtime
with 'Cannot start a runtime from within a runtime' error.
Fix: All sync-to-async bridges now use tokio::runtime::Builder::new_current_thread()
instead of Handle::current().block_on(). Also changed SECRETS_MANAGER from
tokio::sync::RwLock to std::sync::RwLock to eliminate unnecessary async overhead.
Files: 14 files across keywords, secrets, utils, state, calendar, analytics, email
Impact: Fixes production crash during bot loading phase
- main.rs: Skip init.json check when VAULT_ADDR points to remote server
- This allows botserver to read database credentials from Vault in production
- Without this fix, database URL falls back to localhost and connection fails
- install_all() returns early if VAULT_ADDR is remote
- start_all() returns early if VAULT_ADDR is remote
- bootstrap.rs treats remote VAULT_ADDR as bootstrap_completed=true
- Prevents botserver from trying to install/start local services
when all services are running in separate containers
- Remove all std::env::var calls except VAULT_* and PORT
- get_from_env returns hardcoded defaults only (no env var reading)
- Auth config, rate limits, email, analytics, calendar all use Vault
- WORK_PATH replaced with get_work_path() helper reading from Vault
- .env on production cleaned to only VAULT_ADDR, VAULT_TOKEN, VAULT_CACERT, PORT
- All service IPs/credentials stored in Vault secret/gbo/*
- Root cause: Valkey in prod runs without password but Vault stores one
- Previous code only tried password URL, got AUTH failed
- Fix: try no-password URL first, then password URL as fallback
- Also removed unused cache_url variable and cleaned up retry logic
- Root cause: Vault seeding writes to secret/gbo/cache but code reads gbo/system/cache
- kv2::read prepends secret/ so it looks for secret/gbo/system/cache (wrong)
- Fix: update SecretPaths to match seeding paths (gbo/cache, gbo/drive, etc.)
- Testing: compiles clean, paths now match vault kv list output
- Root cause: get_cache_config() uses runtime.block_on() which panics
when called from within an async runtime
- Fix: call SecretsManager::get_secret() directly with .await
- Testing: compiles clean, no runtime nesting issues
- Root cause: init_redis() used redis://localhost:6379 without password
- Valkey requires authentication, causing connection timeouts
- Fix: use get_cache_config() from SecretsManager to build URL with password
- Falls back to env vars (CACHE_URL/REDIS_URL/VALKEY_URL) if set
- Root cause: prod container lacks nc (netcat), causing fallback to valkey-cli ping
- valkey-cli ping hangs indefinitely when Valkey requires password auth
- Fix: use ss -tlnp as primary check (always available), nc as fallback
- Testing: verified ss is available in prod, nc is not
- Production has .env in botserver-stack/.env not ./.env
- Checks both locations to detect completed bootstrap
- Fixes E0716: use let bindings for Path borrows
- Production has .env in botserver-stack/.env not ./.env
- Checks both locations to detect completed bootstrap
- Prevents full re-bootstrap on restart in production
- nc -z checks port connectivity instantly (no auth needed)
- valkey-cli ping as fallback (hangs when password required)
- Fixes bootstrap hang on production where Valkey has Vault password
- Tables (PostgreSQL): pg_isready health check before start
- Drive (MinIO): /minio/health/live check before start
- ALM (Forgejo): HTTP health check before start
- ALM CI (Forgejo Runner): pgrep check before start
- Valkey: health check uses absolute path to valkey-cli
- Vault, Qdrant, Zitadel: already had health checks
- Result: no duplicate starts, no hangs on restart
- Use BOTSERVER_STACK_PATH/bin/cache/bin/valkey-cli instead of relying on PATH
- Remove bash /dev/tcp fallback (unreliable in restricted environments)
- Falls back to redis-cli and nc if valkey-cli unavailable
- Add seed_vault_defaults() to write default creds for all components
(drive, cache, tables, directory, email, llm, encryption, meet, vectordb, alm)
- Call seed_vault_defaults() after KV2 enable in initialize_vault_local()
- Call seed_vault_defaults() in recover_existing_vault() for recovery path
- Rewrite fetch_vault_credentials() to use SafeCommand directly instead of
safe_sh_command, avoiding '//' shell injection false positive on URLs
- Components like Drive now get credentials from Vault instead of 403 errors
- Replace safe_sh_command with SafeCommand::new("curl").args() in vault_health_check()
- The URL contains https:// which triggered '//' pattern detection in shell command
- Direct SafeCommand bypasses shell parsing, URL passed as single argument
- Add vault data directory existence check before recovery attempt
- Prevents 'Dangerous pattern // detected' errors during bootstrap
- Fix vault_health_check() stub that always returned false
- Add recover_existing_vault() to handle Vault with existing data but no init.json
- Add unseal_vault() helper to unseal with existing vault-unseal-keys
- Detect initialized Vault via health endpoint or data directory presence
- Prevents bootstrap failure when reset.sh deletes init.json but Vault data persists
Root cause: vault_health_check() was a stub returning false, causing bootstrap
to always try vault operator init on already-initialized (but sealed) Vault,
which failed with connection refused. This cascaded to all services failing
to fetch credentials from Vault.
The epoch caused a new key to be created every second, bypassing
the 'already executed' check and running start.bas multiple times,
resulting in triplicated suggestions.
- Adds KnowledgeBaseManager::with_default_config() as alias to new()
- Adds KnowledgeBaseManager::with_bot_config() to load embedding_url,
embedding_model, and qdrant config from bot's config.csv
- Updates bootstrap to use with_bot_config with default_bot_id
- Enables per-bot embedding configuration instead of global env vars
- Added get_redis_connection() helper with 2s timeout
- All cache operations now fail fast if Valkey is not ready
- Prevents start.bas from blocking for minutes waiting for cache
- Changes: add_suggestion.rs
- Replace thread spawn + tokio runtime creation with block_in_place
- Eliminates 10+ runtime creations per start.bas execution
- Reduces USE TOOL execution from ~2min to milliseconds
- Fixes suggestions not appearing due to start.bas timeout
- Remove suggestions fetching from TALK function
- WebSocket handler now fetches and sends suggestions after start.bas executes
- Clear suggestions and start_bas_executed keys to allow re-run on refresh
- Decouple TALK from suggestions handling