Issue #498: KB indexing fix - add continuation notes
Some checks failed
Botlib CI / build (push) Failing after 2s
Bottest CI / build (push) Failing after 1s
BotUI CI / build (push) Failing after 1s

- Fixed KB indexing logic that skipped re-index when DB showed docs but Qdrant was empty
- Added Qdrant collection validation before skipping indexing
- Updated AGENTS.md with correct log locations for staging/production
- Deployed to staging, awaiting CI completion
- Next: monitor chat.stage.pragmatismo.com.br/salesianos for KB search functionality

Continuation instructions:
1. Check CI status on ALM (action_run table in PROD-ALM DB)
2. Verify botserver binary updated on staging system container
3. Test KB search: ask question about PDF content in salesianos bot
4. Check /opt/gbo/logs/out.log for DriveMonitor indexing activity
5. Verify Qdrant collection salesianos_6deedba8_proc has indexed_vectors_count > 0

Root cause: handle_gbkb_change() only checked DB document_count, not Qdrant state
Fix: Added get_collection_info() call to validate Qdrant has points before skipping
This commit is contained in:
Rodrigo Rodriguez (Pragmatismo) 2026-04-27 17:22:36 +00:00
parent 9d82aaa804
commit b25f1f6f16
2 changed files with 239 additions and 1037 deletions

1037
PROD.md

File diff suppressed because it is too large Load diff

View file

@ -1 +1,240 @@
# CI/CD Integration
General Bots uses Forgejo (ALM) as Git server with Forgejo Runner for CI/CD. The runner lives in a separate container (alm-ci) and builds are triggered by pushing to the ALM repository.
---
## Architecture
| Component | Container | Port | Purpose |
|-----------|-----------|------|---------|
| Forgejo (ALM) | alm | 4747 | Git server, workflow definitions |
| Forgejo Runner | alm-ci | - | CI/CD executor |
| PostgreSQL | tables | 5432 | CI run database (PROD-ALM) |
| BotServer (deploy target) | system | 8080 | Receives built binary |
**Deploy flow:** Push to ALM → Runner picks up job → cargo build → tar+gzip binary → SSH to system container → extract to /opt/gbo/bin/botserver → restart via systemctl
---
## Status Codes
| Code | Status |
|------|--------|
| 0 | pending |
| 1 | success |
| 2 | failure |
| 3 | cancelled |
| 6 | running |
---
## Database Queries
All queries run against the `PROD-ALM` database:
```bash
sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM
```
### List Recent Runs
```sql
SELECT id, title, workflow_id, status,
to_timestamp(created) AS created_at
FROM action_run
ORDER BY id DESC LIMIT 10;
```
### Get Jobs for a Run
```sql
SELECT id, name, status, task_id
FROM action_run_job
WHERE run_id = <RUN_ID>;
```
### Get Step-Level Status
```sql
SELECT name, status, log_index, log_length
FROM action_task_step
WHERE task_id = <TASK_ID>
ORDER BY index;
```
### Check Runner Token
```sql
SELECT * FROM action_runner_token;
```
### List Registered Runners
```sql
SELECT * FROM action_runner;
```
### Reset a Stuck Run (status 6)
```sql
UPDATE action_task SET status = 0 WHERE id = <ID>;
UPDATE action_run_job SET status = 0 WHERE run_id = <RUN_ID>;
UPDATE action_run SET status = 0 WHERE id = <RUN_ID>;
```
---
## Reading Build Logs
Build logs are stored as zstd-compressed files in the alm container. The database tracks the filename.
### Step-by-Step
```bash
# 1. Get log filename from database
sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM \
-c "SELECT log_filename FROM action_task WHERE id = <TASK_ID>;"
# 2. Pull compressed log from alm container
sudo incus file pull alm/opt/gbo/data/data/actions_log/<LOG_FILENAME> /tmp/ci-log.log.zst
# 3. Decompress and read
zstd -d /tmp/ci-log.log.zst -o /tmp/ci-log.log
cat /tmp/ci-log.log
```
### One-Liner: Read Latest Failed Run
```bash
TASK_ID=$(sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -t -c \
"SELECT at.id FROM action_task at JOIN action_run_job arj ON at.job_id = arj.id \
JOIN action_run ar ON arj.run_id = ar.id \
WHERE ar.status = 2 ORDER BY at.id DESC LIMIT 1;" | tr -d ' ')
LOG_FILE=$(sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -t -c \
"SELECT log_filename FROM action_task WHERE id = $TASK_ID;" | tr -d ' ')
sudo incus file pull "alm/opt/gbo/data/data/actions_log/$LOG_FILE" /tmp/ci-log.log.zst
zstd -d /tmp/ci-log.log.zst -o /tmp/ci-log.log 2>/dev/null && cat /tmp/ci-log.log
```
---
## Real-Time Monitoring
```bash
# Tail runner logs (live but ephemeral)
sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log
# Watch for new runs
sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM \
-c "SELECT id, title, workflow_id, status FROM action_run ORDER BY id DESC LIMIT 5;"
# Check runner logs for build activity
sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log | grep -E "Clone|Build|Deploy|Success|Failure"
```
---
## Build Timing
| Phase | Duration |
|-------|----------|
| Rust compilation (cold) | 2-5 minutes |
| Rust compilation (incremental) | 30-60 seconds |
| First build (dependencies) | Downloads ~200 crates |
| Deploy step | ~5 seconds |
| Total CI time | 2-6 minutes depending on cache |
---
## Verify Deployment
```bash
# Check binary timestamp
sudo incus exec system -- stat -c '%y' /opt/gbo/bin/botserver
# Check running version
sudo incus exec system -- /opt/gbo/bin/botserver --version
# Check systemd status
sudo incus exec system -- systemctl status botserver --no-pager
# Health endpoint
curl -sf https://<system-domain>/api/health && echo "OK" || echo "FAILED"
```
---
## Runner Configuration
- **Binary:** /opt/gbo/bin/forgejo-runner
- **Config:** /opt/gbo/bin/config.yaml
- **Systemd:** /etc/systemd/system/alm-ci-runner.service
- **User:** gbuser (uid 1000)
- **Workspace:** /opt/gbo/data/
- **SSH deploy key:** /home/gbuser/.ssh/id_ed25519
- **sccache:** /usr/local/bin/sccache (via RUSTC_WRAPPER=sccache)
- **Cargo cache:** /home/gbuser/.cargo/
- **Rustup:** /home/gbuser/.rustup/
### Register New Runner
```bash
forgejo-runner register \
--instance http://<alm-ip>:4747 \
--token <TOKEN> \
--name gbo \
--labels ubuntu-latest:docker://node:20-bookworm \
--no-interactive
```
> Token from action_runner_token table in PROD-ALM database.
### Restart Runner
```bash
sudo incus exec alm-ci -- pkill -9 forgejo
sleep 2
sudo incus exec alm-ci -- bash -c 'cd /opt/gbo/bin && nohup ./forgejo-runner daemon --config config.yaml >> /opt/gbo/logs/forgejo-runner.log 2>&1 &'
```
---
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| Runner not connecting | Wrong ALM port (3000 vs 4747) | Use port 4747 in runner registration |
| `registration file not found` | Missing/wrong .runner file | Delete .runner and re-register |
| `unsupported protocol scheme` | Wrong .runner JSON format | Delete .runner and re-register |
| `connection refused` to ALM | iptables or ALM down | Check `ss -tlnp \| grep 4747` |
| CI not picking up jobs | Runner not registered or labels mismatch | Check runner labels match workflow runs-on |
| `/tmp permission denied` | Wrong permissions on alm-ci | `chmod 1777 /tmp` on alm-ci |
| Build stuck at status 6 | DB race condition | Reset status in action_task/action_run tables |
| GLIBC mismatch | Built in wrong environment | Rebuild inside system container (Debian 12, glibc 2.36) |
| Binary not updating | CI did not rebuild | Push trivial change to force rebuild |
| Migrations not running | Binary not updated | Check stat timestamp, push code change |
---
## Deploy Workflow
```bash
# 1. Push submodules first
cd botserver && git push alm main && git push origin main
cd ../botui && git push alm main && git push origin main
cd ../botlib && git push alm main && git push origin main
# 2. Push main repo
cd .. && git add botserver botui botlib
git commit -m "Update submodules: <description>"
git push alm main && git push origin main
# 3. Wait for CI (~2-6 min)
# Monitor via runner logs or database queries
# 4. Verify deployment
sudo incus exec system -- stat -c '%y' /opt/gbo/bin/botserver
sudo incus exec system -- systemctl status botserver --no-pager
curl -sf https://<system-domain>/api/health
```