GeneralBots/botbook

Fork 0

Rodrigo Rodriguez (Pragmatismo) a6cf79c9ae Update botbook

2025-12-21 23:40:39 -03:00

6 KiB

Raw Blame History

Monitoring API

The Monitoring API provides endpoints for real-time system monitoring, performance metrics, and health checks.

Status: Roadmap

This API is on the development roadmap. The endpoints documented below represent the planned interface design.

Base URL

http://localhost:8080/api/v1/monitoring

Authentication

Uses the standard botserver authentication mechanism with appropriate role-based permissions.

Endpoints

System Health

Method	Endpoint	Description
GET	`/api/v1/monitoring/health`	Get overall system health
GET	`/api/v1/monitoring/health/live`	Kubernetes liveness probe
GET	`/api/v1/monitoring/health/ready`	Kubernetes readiness probe

Performance Metrics

Method	Endpoint	Description
GET	`/api/v1/monitoring/metrics`	Get all metrics (Prometheus format)
GET	`/api/v1/monitoring/metrics/summary`	Get metrics summary
GET	`/api/v1/monitoring/metrics/{metric_name}`	Get specific metric

Service Status

Method	Endpoint	Description
GET	`/api/v1/monitoring/services`	List all services status
GET	`/api/v1/monitoring/services/{service_id}`	Get specific service status
POST	`/api/v1/monitoring/services/{service_id}/restart`	Restart a service

Resource Usage

Method	Endpoint	Description
GET	`/api/v1/monitoring/resources`	Get resource usage overview
GET	`/api/v1/monitoring/resources/cpu`	Get CPU usage
GET	`/api/v1/monitoring/resources/memory`	Get memory usage
GET	`/api/v1/monitoring/resources/disk`	Get disk usage
GET	`/api/v1/monitoring/resources/network`	Get network statistics

Alert Configuration

Method	Endpoint	Description
GET	`/api/v1/monitoring/alerts`	List all alerts
POST	`/api/v1/monitoring/alerts`	Create a new alert rule
GET	`/api/v1/monitoring/alerts/{alert_id}`	Get alert details
PUT	`/api/v1/monitoring/alerts/{alert_id}`	Update alert rule
DELETE	`/api/v1/monitoring/alerts/{alert_id}`	Delete alert rule
GET	`/api/v1/monitoring/alerts/active`	Get currently firing alerts

Log Stream

Method	Endpoint	Description
GET	`/api/v1/monitoring/logs`	Get recent logs
GET	`/api/v1/monitoring/logs/stream`	Stream logs via WebSocket
GET	`/api/v1/monitoring/logs/search`	Search logs with query

Request Examples

Check System Health

health = GET "/api/v1/monitoring/health"
TALK "System Status: " + health.status
TALK "Uptime: " + health.uptime
FOR EACH component IN health.components
    TALK component.name + ": " + component.status
NEXT

Get Performance Metrics

metrics = GET "/api/v1/monitoring/metrics/summary"
TALK "Request Rate: " + metrics.requests_per_second + "/s"
TALK "Average Latency: " + metrics.avg_latency_ms + "ms"
TALK "Error Rate: " + metrics.error_rate + "%"

Check Resource Usage

resources = GET "/api/v1/monitoring/resources"
TALK "CPU: " + resources.cpu.usage_percent + "%"
TALK "Memory: " + resources.memory.used_mb + "/" + resources.memory.total_mb + " MB"
TALK "Disk: " + resources.disk.used_gb + "/" + resources.disk.total_gb + " GB"

Create Alert Rule

alert = NEW OBJECT
alert.name = "High CPU Alert"
alert.metric = "cpu_usage_percent"
alert.condition = ">"
alert.threshold = 80
alert.duration = "5m"
alert.severity = "warning"
alert.notify = ["ops@example.com"]

result = POST "/api/v1/monitoring/alerts", alert
TALK "Alert created: " + result.alert_id

Get Active Alerts

alerts = GET "/api/v1/monitoring/alerts/active"
IF alerts.count > 0 THEN
    TALK "Active alerts: " + alerts.count
    FOR EACH alert IN alerts.items
        TALK alert.severity + ": " + alert.message
    NEXT
ELSE
    TALK "No active alerts"
END IF

Search Logs

params = NEW OBJECT
params.query = "error"
params.level = "error"
params.start_time = "2025-01-01T00:00:00Z"
params.limit = 100

logs = GET "/api/v1/monitoring/logs/search?" + ENCODE_PARAMS(params)
FOR EACH log IN logs.entries
    TALK log.timestamp + " [" + log.level + "] " + log.message
NEXT

Health Response Format

{
  "status": "healthy",
  "uptime": "5d 12h 30m",
  "version": "6.1.0",
  "components": [
    {"name": "database", "status": "healthy", "latency_ms": 2},
    {"name": "cache", "status": "healthy", "latency_ms": 1},
    {"name": "storage", "status": "healthy", "latency_ms": 5},
    {"name": "llm", "status": "healthy", "latency_ms": 150}
  ]
}

Metrics Format

Metrics are exposed in Prometheus format:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 12345

# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 8000
http_request_duration_seconds_bucket{le="0.5"} 11000
http_request_duration_seconds_bucket{le="1"} 12000

Alert Severity Levels

Level	Description
`info`	Informational, no action required
`warning`	Attention needed, not critical
`error`	Error condition, requires attention
`critical`	Critical, immediate action required

Response Codes

Code	Description
200	Success
201	Alert created
400	Bad Request (invalid parameters)
401	Unauthorized
403	Forbidden (insufficient permissions)
404	Resource not found
500	Internal Server Error
503	Service Unavailable

Required Permissions

Endpoint Category	Required Role
Health Checks	Public (no auth for basic health)
Metrics	`monitor` or `admin`
Service Status	`monitor` or `admin`
Resource Usage	`monitor` or `admin`
Alert Configuration	`admin`
Logs	`admin` or `log_viewer`

6 KiB Raw Blame History