docs: Add ALM/CI debugging and monitoring section to PROD.md

Rodrigo Rodriguez (Pragmatismo) 2026-04-16 08:54:35 -03:00
parent ea36abbea3
commit 2ff5c43531
2 changed files with 77 additions and 1 deletions

76
PROD.md

@@ -58,6 +58,82 @@ Repositories exist on both GitHub and the internal ALM (Forgejo). The four repos
The CI runner container (`alm-ci`) runs Debian 13 Trixie with glibc 2.41, while the `system` container runs Debian 12 Bookworm with glibc 2.36. Binaries compiled on the CI runner link against the newer glibc and will not run in the system container. The CI workflow (`botserver/.forgejo/workflows/botserver.yaml`) avoids this by transferring the source to the system container via `tar | ssh` and building there. The workflow triggers on pushes to `main`, then clones the repos, transfers the source, builds inside the system container, deploys the binary, and verifies that botserver is running.
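To confirm the mismatch on a live host, a minimal sketch follows. `glibc_compatible` is a hypothetical helper, not part of the repository. It assumes a `sort` that supports `-V` (version sort) and applies the conservative rule that a binary built against a newer glibc should not be expected to run on an older one.

```bash
# Hypothetical helper (not in the repo): succeeds when a binary built against
# glibc version $1 can be expected to run on a host with glibc version $2,
# i.e. when build version <= target version.
glibc_compatible() {
  build=$1
  target=$2
  # Version-sort both values; the target must be the newest (or equal).
  newest=$(printf '%s\n%s\n' "$build" "$target" | sort -V | tail -n 1)
  [ "$newest" = "$target" ]
}

# Example, run on the host. The version is the last field of the first
# line printed by `ldd --version`:
#   ci=$(sudo incus exec alm-ci -- ldd --version | awk 'NR==1 {print $NF}')
#   sys=$(sudo incus exec system -- ldd --version | awk 'NR==1 {print $NF}')
#   glibc_compatible "$ci" "$sys" || echo "build inside the system container"
```

With the versions quoted above (2.41 on `alm-ci`, 2.36 on `system`) the check fails, which is why the workflow builds inside the system container.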
### ALM/CI Debugging & Monitoring
**Access ALM/CI containers:**
```bash
ssh administrator@HOST
sudo incus exec alm-ci -- bash # CI runner container
sudo incus exec tables -- bash # PostgreSQL (ALM database)
sudo incus exec system -- bash # botserver container
```
**Check CI runner status:**
```bash
# Runner process
sudo incus exec alm-ci -- ps aux | grep forgejo
# Runner logs
sudo incus exec alm-ci -- cat /opt/gbo/logs/forgejo-runner.log
# If the runner is down, restart it (wrap in bash -c so every command runs inside the container, not on the host)
sudo incus exec alm-ci -- bash -c 'pkill -9 forgejo; sleep 2; cd /opt/gbo/bin && nohup ./forgejo-runner daemon --config config.yaml >> /opt/gbo/logs/forgejo-runner.log 2>&1 &'
```
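The status check can be turned into a yes/no test so that a restart only happens when it is needed. This is a sketch; `runner_is_up` is a hypothetical name, and the pattern assumes the runner is started as `forgejo-runner daemon`, as in the restart command above.

```bash
# Hypothetical helper: reads `ps aux` output on stdin and succeeds
# if a forgejo-runner daemon process is listed.
runner_is_up() {
  grep -q 'forgejo-runner daemon'
}

# Example, run on the host. `ps` runs inside the container and the check runs
# on the host, so the check cannot match its own process:
#   sudo incus exec alm-ci -- ps aux | runner_is_up || echo "runner is down"
```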
**Monitor CI runs in database:**
```bash
# List recent runs (Forgejo status codes: 0=unknown, 1=success, 2=failure, 3=cancelled, 4=skipped, 5=waiting, 6=running, 7=blocked)
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "SELECT id, status, commit_sha, created FROM action_run ORDER BY id DESC LIMIT 5;"'
# Check specific run jobs
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "SELECT id, status, name FROM action_run_job WHERE run_id = <ID>;"'
# Check tasks
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "SELECT id, status FROM action_task WHERE repo_id = 3 ORDER BY id DESC LIMIT 3;"'
# Reset a stuck run so it is picked up again (replace <TASK_ID> and <RUN_ID>; jobs are matched by run_id)
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "UPDATE action_task SET status = 0 WHERE id = <TASK_ID>; UPDATE action_run_job SET status = 0 WHERE run_id = <RUN_ID>; UPDATE action_run SET status = 0 WHERE id = <RUN_ID>;"'
```
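The queries above all repeat the same `incus exec` and `psql` prefix. A small wrapper can cut the quoting down to the SQL itself. This is a sketch; `alm_psql` and the `DRY_RUN` variable are hypothetical names, and the connection settings are copied from the commands above.

```bash
# Hypothetical wrapper: run one SQL statement against the PROD-ALM database.
# With DRY_RUN set, print the command (one argument per line) instead of running it.
alm_psql() {
  # "$1" is expanded before `set --` replaces the positional parameters.
  set -- sudo incus exec tables -- env PGPASSWORD=postgres \
         psql -h localhost -U postgres -d PROD-ALM -c "$1"
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$@"
  else
    "$@"
  fi
}

# Example:
#   alm_psql "SELECT id, status, created FROM action_run ORDER BY id DESC LIMIT 5;"
```

Passing the SQL as a single argument avoids the nested single and double quotes needed by the `bash -c` form.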
**Fix common CI issues:**
```bash
# /tmp permission denied for build.log
sudo incus exec alm-ci -- chmod 1777 /tmp
sudo incus exec alm-ci -- bash -c 'touch /tmp/build.log && chmod 666 /tmp/build.log'
# Clean old CI runs (keep recent ones); delete the jobs first, then the runs they belong to
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "DELETE FROM action_run_job WHERE run_id < <RECENT_ID>;"'
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "DELETE FROM action_run WHERE id < <RECENT_ID>;"'
# "deploy.log missing" error: the "Save deploy log" step expects /tmp/deploy.log,
# which the workflow does not create.
# Fix: make the deploy step write its output to /tmp/deploy.log
```
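One way to make the deploy step produce `/tmp/deploy.log` is to wrap its command so that output always lands in the log file, even on failure. The sketch below is an assumption about how such a step could look, not the workflow's actual code; `run_logged` is a hypothetical name.

```bash
# Hypothetical helper: run a command, capture stdout and stderr in a log file,
# echo the log afterwards, and return the command's own exit status.
run_logged() {
  log=$1
  shift
  status=0
  "$@" >"$log" 2>&1 || status=$?
  cat "$log"          # still show the output in the CI job log
  return "$status"
}

# Example (DEPLOY_COMMAND is a placeholder for whatever the deploy step runs):
#   run_logged /tmp/deploy.log DEPLOY_COMMAND
```

Because the log file is created before the command produces any output, a later "Save deploy log" step finds the file even when the deploy itself fails.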
**Watch CI in real-time:**
```bash
# Tail runner logs
sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log
# Check if new builds appear
watch -n 5 'sudo incus exec tables -- bash -c "export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c \"SELECT id, status, created FROM action_run ORDER BY id DESC LIMIT 3;\""'
# Verify botserver deployed correctly
sudo incus exec system -- /opt/gbo/bin/botserver --version 2>&1 | head -3
sudo incus exec system -- tail -5 /opt/gbo/logs/err.log
```
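For scripted checks, polling with a bounded retry count can replace watching by hand. This is a sketch; `wait_for` is a hypothetical helper, and the example assumes the server process is named `botserver`.

```bash
# Hypothetical helper: wait_for TRIES DELAY CMD...
# Retries CMD until it succeeds or TRIES attempts have been used.
wait_for() {
  tries=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait up to one minute (12 tries, 5 seconds apart) for botserver
# to appear in the system container:
#   wait_for 12 5 sudo incus exec system -- pgrep -x botserver || echo "botserver did not start"
```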
**CI Workflow Structure:**
1. Setup Git (disable SSL verify, add safe directories)
2. Setup Workspace (clone/merge gb workspace Cargo.toml)
3. Install system dependencies
4. Clean up workspaces
5. Build BotServer (output to /tmp/build.log)
6. Save build log
7. Deploy via `tar` + `gzip` over `ssh`
8. Verify botserver started
9. Save deploy log
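The transfer used in step 7 can be sketched as a single pipeline: pack on one side, unpack on the other, with no intermediate archive file. The function below is an illustration under assumptions, not the workflow's actual step; `stream_dir` is a hypothetical name, and the paths in the example are placeholders.

```bash
# Hypothetical helper: stream_dir SRC DEST PREFIX...
# Packs SRC as a gzip-compressed tar stream and unpacks it into DEST using
# the command prefix PREFIX (for example: ssh user@host).
# The pipeline's exit status is that of the unpacking side.
stream_dir() {
  src=$1
  dest=$2
  shift 2
  tar -czf - -C "$src" . | "$@" tar -xzf - -C "$dest"
}

# Example (placeholder paths; DEST must already exist on the target):
#   stream_dir ./botserver /path/on/target ssh administrator@HOST
```

Taking the transport as a command prefix means the same function works over `ssh`, through `sudo incus exec system --`, or locally with `env` for a dry run.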
---
## DriveMonitor & Bot Configuration

@@ -1 +1 @@
Subproject commit e63c187f322e13a6d750d783888bf47c4a01b37f
Subproject commit 04bfd668a42fda91c90e8b6b6a346edcc1288111