# Monitoring API The Monitoring API provides endpoints for real-time system monitoring, performance metrics, and health checks. ## Status: Roadmap This API is on the development roadmap. The endpoints documented below represent the planned interface design. ## Base URL ``` http://localhost:8080/api/v1/monitoring ``` ## Authentication Uses the standard botserver authentication mechanism with appropriate role-based permissions. ## Endpoints ### System Health | Method | Endpoint | Description | |--------|----------|-------------| | GET | `/api/v1/monitoring/health` | Get overall system health | | GET | `/api/v1/monitoring/health/live` | Kubernetes liveness probe | | GET | `/api/v1/monitoring/health/ready` | Kubernetes readiness probe | ### Performance Metrics | Method | Endpoint | Description | |--------|----------|-------------| | GET | `/api/v1/monitoring/metrics` | Get all metrics (Prometheus format) | | GET | `/api/v1/monitoring/metrics/summary` | Get metrics summary | | GET | `/api/v1/monitoring/metrics/{metric_name}` | Get specific metric | ### Service Status | Method | Endpoint | Description | |--------|----------|-------------| | GET | `/api/v1/monitoring/services` | List all services status | | GET | `/api/v1/monitoring/services/{service_id}` | Get specific service status | | POST | `/api/v1/monitoring/services/{service_id}/restart` | Restart a service | ### Resource Usage | Method | Endpoint | Description | |--------|----------|-------------| | GET | `/api/v1/monitoring/resources` | Get resource usage overview | | GET | `/api/v1/monitoring/resources/cpu` | Get CPU usage | | GET | `/api/v1/monitoring/resources/memory` | Get memory usage | | GET | `/api/v1/monitoring/resources/disk` | Get disk usage | | GET | `/api/v1/monitoring/resources/network` | Get network statistics | ### Alert Configuration | Method | Endpoint | Description | |--------|----------|-------------| | GET | `/api/v1/monitoring/alerts` | List all alerts | | POST | `/api/v1/monitoring/alerts` | Create a new alert rule | | GET | `/api/v1/monitoring/alerts/{alert_id}` | Get alert details | | PUT | `/api/v1/monitoring/alerts/{alert_id}` | Update alert rule | | DELETE | `/api/v1/monitoring/alerts/{alert_id}` | Delete alert rule | | GET | `/api/v1/monitoring/alerts/active` | Get currently firing alerts | ### Log Stream | Method | Endpoint | Description | |--------|----------|-------------| | GET | `/api/v1/monitoring/logs` | Get recent logs | | GET | `/api/v1/monitoring/logs/stream` | Stream logs via WebSocket | | GET | `/api/v1/monitoring/logs/search` | Search logs with query | ## Request Examples ### Check System Health ```bas health = GET "/api/v1/monitoring/health" TALK "System Status: " + health.status TALK "Uptime: " + health.uptime FOR EACH component IN health.components TALK component.name + ": " + component.status NEXT ``` ### Get Performance Metrics ```bas metrics = GET "/api/v1/monitoring/metrics/summary" TALK "Request Rate: " + metrics.requests_per_second + "/s" TALK "Average Latency: " + metrics.avg_latency_ms + "ms" TALK "Error Rate: " + metrics.error_rate + "%" ``` ### Check Resource Usage ```bas resources = GET "/api/v1/monitoring/resources" TALK "CPU: " + resources.cpu.usage_percent + "%" TALK "Memory: " + resources.memory.used_mb + "/" + resources.memory.total_mb + " MB" TALK "Disk: " + resources.disk.used_gb + "/" + resources.disk.total_gb + " GB" ``` ### Create Alert Rule ```bas alert = NEW OBJECT alert.name = "High CPU Alert" alert.metric = "cpu_usage_percent" alert.condition = ">" alert.threshold = 80 alert.duration = "5m" alert.severity = "warning" alert.notify = ["ops@example.com"] result = POST "/api/v1/monitoring/alerts", alert TALK "Alert created: " + result.alert_id ``` ### Get Active Alerts ```bas alerts = GET "/api/v1/monitoring/alerts/active" IF alerts.count > 0 THEN TALK "Active alerts: " + alerts.count FOR EACH alert IN alerts.items TALK alert.severity + ": " + alert.message NEXT ELSE TALK "No active alerts" END IF ``` ### Search Logs ```bas params = NEW OBJECT params.query = "error" params.level = "error" params.start_time = "2025-01-01T00:00:00Z" params.limit = 100 logs = GET "/api/v1/monitoring/logs/search?" + ENCODE_PARAMS(params) FOR EACH log IN logs.entries TALK log.timestamp + " [" + log.level + "] " + log.message NEXT ``` ## Health Response Format ```json { "status": "healthy", "uptime": "5d 12h 30m", "version": "6.1.0", "components": [ {"name": "database", "status": "healthy", "latency_ms": 2}, {"name": "cache", "status": "healthy", "latency_ms": 1}, {"name": "storage", "status": "healthy", "latency_ms": 5}, {"name": "llm", "status": "healthy", "latency_ms": 150} ] } ``` ## Metrics Format Metrics are exposed in Prometheus format: ``` # HELP http_requests_total Total HTTP requests # TYPE http_requests_total counter http_requests_total{method="GET",status="200"} 12345 # HELP http_request_duration_seconds HTTP request latency # TYPE http_request_duration_seconds histogram http_request_duration_seconds_bucket{le="0.1"} 8000 http_request_duration_seconds_bucket{le="0.5"} 11000 http_request_duration_seconds_bucket{le="1"} 12000 ``` ## Alert Severity Levels | Level | Description | |-------|-------------| | `info` | Informational, no action required | | `warning` | Attention needed, not critical | | `error` | Error condition, requires attention | | `critical` | Critical, immediate action required | ## Response Codes | Code | Description | |------|-------------| | 200 | Success | | 201 | Alert created | | 400 | Bad Request (invalid parameters) | | 401 | Unauthorized | | 403 | Forbidden (insufficient permissions) | | 404 | Resource not found | | 500 | Internal Server Error | | 503 | Service Unavailable | ## Required Permissions | Endpoint Category | Required Role | |-------------------|---------------| | Health Checks | Public (no auth for basic health) | | Metrics | `monitor` or `admin` | | Service Status | `monitor` or `admin` | | Resource Usage | `monitor` or `admin` | | Alert Configuration | `admin` | | Logs | `admin` or `log_viewer` |