botbook/src/10-rest/monitoring-api.md

6 KiB

Monitoring API

The Monitoring API provides endpoints for real-time system monitoring, performance metrics, and health checks.

Status: Roadmap

This API is on the development roadmap. The endpoints documented below represent the planned interface design.

Base URL

http://localhost:8080/api/v1/monitoring

Authentication

Uses the standard botserver authentication mechanism with appropriate role-based permissions.

Endpoints

System Health

Method Endpoint Description
GET /api/v1/monitoring/health Get overall system health
GET /api/v1/monitoring/health/live Kubernetes liveness probe
GET /api/v1/monitoring/health/ready Kubernetes readiness probe

Performance Metrics

Method Endpoint Description
GET /api/v1/monitoring/metrics Get all metrics (Prometheus format)
GET /api/v1/monitoring/metrics/summary Get metrics summary
GET /api/v1/monitoring/metrics/{metric_name} Get specific metric

Service Status

Method Endpoint Description
GET /api/v1/monitoring/services List all services status
GET /api/v1/monitoring/services/{service_id} Get specific service status
POST /api/v1/monitoring/services/{service_id}/restart Restart a service

Resource Usage

Method Endpoint Description
GET /api/v1/monitoring/resources Get resource usage overview
GET /api/v1/monitoring/resources/cpu Get CPU usage
GET /api/v1/monitoring/resources/memory Get memory usage
GET /api/v1/monitoring/resources/disk Get disk usage
GET /api/v1/monitoring/resources/network Get network statistics

Alert Configuration

Method Endpoint Description
GET /api/v1/monitoring/alerts List all alerts
POST /api/v1/monitoring/alerts Create a new alert rule
GET /api/v1/monitoring/alerts/{alert_id} Get alert details
PUT /api/v1/monitoring/alerts/{alert_id} Update alert rule
DELETE /api/v1/monitoring/alerts/{alert_id} Delete alert rule
GET /api/v1/monitoring/alerts/active Get currently firing alerts

Log Stream

Method Endpoint Description
GET /api/v1/monitoring/logs Get recent logs
GET /api/v1/monitoring/logs/stream Stream logs via WebSocket
GET /api/v1/monitoring/logs/search Search logs with query

Request Examples

Check System Health

health = GET "/api/v1/monitoring/health"
TALK "System Status: " + health.status
TALK "Uptime: " + health.uptime
FOR EACH component IN health.components
    TALK component.name + ": " + component.status
NEXT

Get Performance Metrics

metrics = GET "/api/v1/monitoring/metrics/summary"
TALK "Request Rate: " + metrics.requests_per_second + "/s"
TALK "Average Latency: " + metrics.avg_latency_ms + "ms"
TALK "Error Rate: " + metrics.error_rate + "%"

Check Resource Usage

resources = GET "/api/v1/monitoring/resources"
TALK "CPU: " + resources.cpu.usage_percent + "%"
TALK "Memory: " + resources.memory.used_mb + "/" + resources.memory.total_mb + " MB"
TALK "Disk: " + resources.disk.used_gb + "/" + resources.disk.total_gb + " GB"

Create Alert Rule

alert = NEW OBJECT
alert.name = "High CPU Alert"
alert.metric = "cpu_usage_percent"
alert.condition = ">"
alert.threshold = 80
alert.duration = "5m"
alert.severity = "warning"
alert.notify = ["ops@example.com"]

result = POST "/api/v1/monitoring/alerts", alert
TALK "Alert created: " + result.alert_id

Get Active Alerts

alerts = GET "/api/v1/monitoring/alerts/active"
IF alerts.count > 0 THEN
    TALK "Active alerts: " + alerts.count
    FOR EACH alert IN alerts.items
        TALK alert.severity + ": " + alert.message
    NEXT
ELSE
    TALK "No active alerts"
END IF

Search Logs

params = NEW OBJECT
params.query = "error"
params.level = "error"
params.start_time = "2025-01-01T00:00:00Z"
params.limit = 100

logs = GET "/api/v1/monitoring/logs/search?" + ENCODE_PARAMS(params)
FOR EACH log IN logs.entries
    TALK log.timestamp + " [" + log.level + "] " + log.message
NEXT

Health Response Format

{
  "status": "healthy",
  "uptime": "5d 12h 30m",
  "version": "6.1.0",
  "components": [
    {"name": "database", "status": "healthy", "latency_ms": 2},
    {"name": "cache", "status": "healthy", "latency_ms": 1},
    {"name": "storage", "status": "healthy", "latency_ms": 5},
    {"name": "llm", "status": "healthy", "latency_ms": 150}
  ]
}

Metrics Format

Metrics are exposed in Prometheus format:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 12345

# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 8000
http_request_duration_seconds_bucket{le="0.5"} 11000
http_request_duration_seconds_bucket{le="1"} 12000

Alert Severity Levels

Level Description
info Informational, no action required
warning Attention needed, not critical
error Error condition, requires attention
critical Critical, immediate action required

Response Codes

Code Description
200 Success
201 Alert created
400 Bad Request (invalid parameters)
401 Unauthorized
403 Forbidden (insufficient permissions)
404 Resource not found
500 Internal Server Error
503 Service Unavailable

Required Permissions

Endpoint Category Required Role
Health Checks Public (no auth for basic health)
Metrics monitor or admin
Service Status monitor or admin
Resource Usage monitor or admin
Alert Configuration admin
Logs admin or log_viewer