- bootstrap_utils.rs: Change Vec<(&'static str,...)> to Vec<(String,...)> to avoid dangling references
- bootstrap_manager.rs: Use name.as_str() for safe_pkill
- setup.rs: Use PathBuf instead of Path::new with format!
- directory/bootstrap.rs: Use PathBuf for pat_dir
- main.rs: Use PathBuf for vault_init_path_early
- Adds get_stack_path() helper: returns /opt/gbo in production (.env without botserver-stack), ./botserver-stack in dev
- Adds get_work_path() helper: returns /opt/gbo/work in production, ./botserver-stack/data/system/work in dev
- Updated 35+ files to use dynamic path resolution
- Production system container no longer needs botserver-stack directory
- Work files go to /opt/gbo/work instead of /opt/gbo/bin/botserver-stack
- Root cause: prod container lacks nc (netcat), causing fallback to valkey-cli ping
- valkey-cli ping hangs indefinitely when Valkey requires password auth
- Fix: use ss -tlnp as primary check (always available), nc as fallback
- Testing: verified ss is available in prod, nc is not
- nc -z checks port connectivity instantly (no auth needed)
- valkey-cli ping as fallback (hangs when password required)
- Fixes bootstrap hang on production where Valkey has Vault password
- Tables (PostgreSQL): pg_isready health check before start
- Drive (MinIO): /minio/health/live check before start
- ALM (Forgejo): HTTP health check before start
- ALM CI (Forgejo Runner): pgrep check before start
- Valkey: health check uses absolute path to valkey-cli
- Vault, Qdrant, Zitadel: already had health checks
- Result: no duplicate starts, no hangs on restart
- Use BOTSERVER_STACK_PATH/bin/cache/bin/valkey-cli instead of relying on PATH
- Remove bash /dev/tcp fallback (unreliable in restricted environments)
- Falls back to redis-cli and nc if valkey-cli unavailable
- Replace safe_sh_command with SafeCommand::new("curl").args() in vault_health_check()
- The URL contains https:// which triggered '//' pattern detection in shell command
- Direct SafeCommand bypasses shell parsing, URL passed as single argument
- Add vault data directory existence check before recovery attempt
- Prevents 'Dangerous pattern // detected' errors during bootstrap
- Fix vault_health_check() stub that always returned false
- Add recover_existing_vault() to handle Vault with existing data but no init.json
- Add unseal_vault() helper to unseal with existing vault-unseal-keys
- Detect initialized Vault via health endpoint or data directory presence
- Prevents bootstrap failure when reset.sh deletes init.json but Vault data persists
Root cause: vault_health_check() was a stub returning false, causing bootstrap
to always try vault operator init on already-initialized (but sealed) Vault,
which failed with connection refused. This cascaded to all services failing
to fetch credentials from Vault.
- Fix PAT extraction timing with retry loop (waits up to 60s for PAT in logs)
- Add sync command to flush filesystem buffers before extraction
- Improve logging with progress messages and PAT verification
- Refactor setup code into consolidated setup.rs module
- Fix YAML indentation for PatPath and MachineKeyPath
- Change Zitadel init parameter from --config to --steps
The timing issue occurred because:
1. Zitadel writes PAT to logs at startup (~18:08:59)
2. Post-install extraction ran too early (~18:09:35)
3. PAT file wasn't created until ~18:10:38 (63s after installation)
4. OAuth client creation failed because PAT file didn't exist yet
With the retry loop:
- Waits for PAT to appear in logs with sync+grep check
- Extracts PAT immediately when found
- OAuth client creation succeeds
- directory_config.json saved with valid credentials
- Login flow works end-to-end
Tested: Full reset.sh and login verification successful
Fixed compilation errors by adding .await to all ensure_admin_token() calls:
- create_organization()
- create_user()
- save_config()
The method was made async but the calls weren't updated.
- Add vector_db_health_check() function to verify Qdrant availability
- Add wait loop for vector_db startup in bootstrap (15 seconds)
- Fallback to local LLM when external URL configured but no API key provided
- Prevent external LLM (api.z.ai) usage without authentication key
This fixes the production issues:
- Qdrant vector database not available at https://localhost:6333
- External LLM being used instead of local when no key is configured
- Ensures vector_db is properly started and ready before use
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The previous /dev/tcp test was giving false positives, reporting that
Valkey was running when it was actually down. This caused bootstrap to
skip starting Valkey, leading to botserver hanging on cache connection.
Changes:
- Use nc (netcat) with -z flag for reliable port checking
- Final fallback: /dev/tcp with actual PING/PONG verification
- Only returns true if port is open AND responds correctly
This ensures cache_health_check() accurately reports Valkey status.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Try valkey-cli first (preferred for Valkey installations)
- Fall back to redis-cli (for Redis installations)
- Fall back to TCP connection test (works for both)
This fixes environments that only have Valkey installed without
Redis symlinks or redis-cli.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Fix component name mismatch: "redis" -> "cache" in bootstrap_manager
- Add cache_health_check() function to verify Valkey is responding
- Add health check loop after starting cache (12s wait with PING test)
- Ensures cache is ready before proceeding with bootstrap
This fixes the issue where botserver would hang waiting for cache
connection because the cache component was never started.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>