docs: Add ALM/CI debugging and monitoring section to PROD.md
parent ea36abbea3
commit 2ff5c43531
2 changed files with 77 additions and 1 deletions
76 PROD.md
@@ -58,6 +58,82 @@ Repositories exist on both GitHub and the internal ALM (Forgejo). The four repos
The CI runner container (`alm-ci`) runs Debian 13 Trixie with glibc 2.41, but the `system` container runs Debian 12 Bookworm with glibc 2.36. A binary linked against the newer glibc on the CI runner can require symbol versions that glibc 2.36 does not provide, so it may fail to start in the system container. The CI workflow (`botserver/.forgejo/workflows/botserver.yaml`) avoids this by transferring the source to the system container via `tar | ssh` and building there. The workflow triggers on pushes to `main`, clones the repos, transfers the source, builds inside the system container, deploys the binary, and verifies that botserver is running.
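
The mismatch can be checked before trusting a binary. The sketch below is a minimal illustration and not part of the workflow: `glibc_ok` is a hypothetical helper name, and it applies the conservative rule that a binary should be built against a glibc no newer than the one on the target.

```bash
# Hypothetical helper (not part of the CI workflow).
# Succeeds when the build glibc ($1) is not newer than the target glibc ($2).
glibc_ok() {
  built="$1"
  target="$2"
  # sort -V orders version strings; the target must be the highest (or equal).
  newest="$(printf '%s\n%s\n' "$built" "$target" | sort -V | tail -n 1)"
  [ "$newest" = "$target" ]
}

# The versions themselves can be read in each container, for example:
#   sudo incus exec alm-ci -- ldd --version | head -n 1
#   sudo incus exec system -- ldd --version | head -n 1
```

With the versions from this section, `glibc_ok 2.41 2.36` fails and `glibc_ok 2.36 2.36` succeeds, which is consistent with building inside the system container.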

### ALM/CI Debugging & Monitoring

**Access ALM/CI containers:**
```bash
ssh administrator@HOST
sudo incus exec alm-ci -- bash    # CI runner container
sudo incus exec tables -- bash    # PostgreSQL (ALM database)
sudo incus exec system -- bash    # botserver container
```

**Check CI runner status:**
```bash
# Runner process
sudo incus exec alm-ci -- ps aux | grep forgejo

# Runner logs
sudo incus exec alm-ci -- cat /opt/gbo/logs/forgejo-runner.log

# If the runner is down, restart it. The whole sequence is wrapped in bash -c
# so that cd and nohup run inside the container, not on the host.
sudo incus exec alm-ci -- bash -c 'pkill -9 forgejo; sleep 2; cd /opt/gbo/bin && nohup ./forgejo-runner daemon --config config.yaml >> /opt/gbo/logs/forgejo-runner.log 2>&1 &'
```
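
For scripted checks, the process listing can be reduced to a single word. This is a minimal sketch: `runner_state` is a hypothetical helper, and it assumes the runner was started with the `forgejo-runner daemon` command line used in the restart command.

```bash
# Hypothetical helper: reads `ps aux` output on stdin and prints "up" or "down".
runner_state() {
  if grep -q 'forgejo-runner daemon'; then
    echo up
  else
    echo down
  fi
}

# Example:
#   sudo incus exec alm-ci -- ps aux | runner_state
```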

**Monitor CI runs in database:**
```bash
# List recent runs (status codes as defined upstream in Forgejo: 0=unknown, 1=success, 2=failure, 3=cancelled, 4=skipped, 5=waiting, 6=running, 7=blocked)
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "SELECT id, status, commit_sha, created FROM action_run ORDER BY id DESC LIMIT 5;"'

# Check the jobs of a specific run
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "SELECT id, status, name FROM action_run_job WHERE run_id = <RUN_ID>;"'

# Check tasks
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "SELECT id, status FROM action_task WHERE repo_id = 3 ORDER BY id DESC LIMIT 3;"'

# Reset a stuck run (sets the task, the run's jobs, and the run back to status 0)
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "UPDATE action_task SET status = 0 WHERE id = <TASK_ID>; UPDATE action_run_job SET status = 0 WHERE run_id = <RUN_ID>; UPDATE action_run SET status = 0 WHERE id = <RUN_ID>;"'
```
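
The repeated `bash -c` and `psql` quoting can be folded into a small wrapper. This is a sketch under assumptions: `alm_sql` and the `ALM_RUN` dry-run variable are hypothetical names, the password is passed with `incus exec --env` instead of `export`, and `-At` (unaligned, tuples only) is chosen to make the output easier to script. Setting `ALM_RUN=echo` prints the command instead of running it.

```bash
# Hypothetical wrapper around the psql calls in this section.
# ALM_RUN is an optional prefix; ALM_RUN=echo gives a dry run.
alm_sql() {
  ${ALM_RUN:-} sudo incus exec tables --env PGPASSWORD=postgres -- \
    psql -h localhost -U postgres -d PROD-ALM -At -c "$1"
}

# Example: the five most recent runs
#   alm_sql "SELECT id, status, commit_sha, created FROM action_run ORDER BY id DESC LIMIT 5;"
```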

**Fix common CI issues:**
```bash
# /tmp permission denied for build.log
sudo incus exec alm-ci -- chmod 1777 /tmp
# bash -c keeps both touch and chmod inside the container
sudo incus exec alm-ci -- bash -c 'touch /tmp/build.log && chmod 666 /tmp/build.log'

# Clean old CI runs (keeps runs with id >= <RECENT_ID>)
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "DELETE FROM action_run WHERE id < <RECENT_ID>;"'
sudo incus exec tables -- bash -c 'export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c "DELETE FROM action_run_job WHERE run_id < <RECENT_ID>;"'

# "deploy.log missing" error: fix the workflow step
# The "Save deploy log" step expects /tmp/deploy.log, which the workflow doesn't create.
# Fix: make the deploy step write its output to /tmp/deploy.log
```
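
One way to make sure `/tmp/deploy.log` always exists is to route the deploy command through a small logging wrapper. This is only a sketch: `run_logged` is a hypothetical helper, and the actual deploy command in the workflow is not shown in this document.

```bash
# Hypothetical helper: run a command, capture stdout and stderr in a log
# file, echo the log to the console, and preserve the command's exit code.
run_logged() {
  log="$1"
  shift
  status=0
  "$@" >"$log" 2>&1 || status=$?
  cat "$log"
  return "$status"
}

# In the deploy step this might look like (deploy.sh is a placeholder):
#   run_logged /tmp/deploy.log ./deploy.sh
```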

**Watch CI in real time:**
```bash
# Tail runner logs
sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log

# Check if new builds appear
watch -n 5 'sudo incus exec tables -- bash -c "export PGPASSWORD=postgres; psql -h localhost -U postgres -d PROD-ALM -c \"SELECT id, status, created FROM action_run ORDER BY id DESC LIMIT 3;\""'

# Verify botserver deployed correctly
sudo incus exec system -- /opt/gbo/bin/botserver --version 2>&1 | head -3
sudo incus exec system -- tail -5 /opt/gbo/logs/err.log
```
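
Where `watch` is not convenient, for example inside a script, the same polling can be done with a timeout. This is a minimal sketch: `wait_until` is a hypothetical helper, and the example check assumes the process is named `botserver`.

```bash
# Hypothetical helper: retry a command every INTERVAL seconds until it
# succeeds, giving up (exit code 1) after roughly TIMEOUT seconds.
wait_until() {
  timeout="$1"
  interval="$2"
  shift 2
  elapsed=0
  while ! "$@"; do
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Example: wait up to 5 minutes for botserver to appear in the system container
#   wait_until 300 5 sudo incus exec system -- pgrep -x botserver
```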

**CI Workflow Structure:**
1. Setup Git (disable SSL verify, add safe directories)
2. Setup Workspace (clone/merge gb workspace Cargo.toml)
3. Install system dependencies
4. Clean up workspaces
5. Build BotServer (output to /tmp/build.log)
6. Save build log
7. Deploy via ssh tar gzip
8. Verify botserver started
9. Save deploy log
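
The `tar | ssh` transfer behind the build and deploy steps can be sketched as a tar stream piped into a remote shell. This is an illustration, not the workflow's actual code: `ship_source`, the `REMOTE` variable, and the `ssh system` target are assumptions. For a local dry run, setting `REMOTE="sh -c"` unpacks into a directory on the same machine.

```bash
# REMOTE is the command that runs a shell snippet on the build host.
# "ssh system" is a placeholder; the real target is defined in the workflow.
REMOTE="${REMOTE:-ssh system}"

# Hypothetical helper: stream directory $1 as a gzip-compressed tar archive
# and unpack it into directory $2 on the remote side.
ship_source() {
  src="$1"
  dest="$2"
  tar -C "$src" -czf - . | $REMOTE "mkdir -p '$dest' && tar -C '$dest' -xzf -"
}
```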

---

## DriveMonitor & Bot Configuration

@@ -1 +1 @@
-Subproject commit e63c187f322e13a6d750d783888bf47c4a01b37f
+Subproject commit 04bfd668a42fda91c90e8b6b6a346edcc1288111