Monitoring and Observability
A comprehensive monitoring stack is essential for maintaining a healthy production Corgi deployment. This guide covers every monitoring layer, from basic health checks to Prometheus metrics, Grafana dashboards, and autonomous agent-based monitoring.
Monitoring Architecture
graph TB
subgraph API["Corgi API"]
HEALTH["/health endpoint"]
METRICS["/metrics endpoint"]
LOGS["Application Logs"]
end
subgraph Collectors["Data Collectors"]
PROM[Prometheus<br/>Port 9090]
HEALTH_MON[Health Monitor<br/>Script]
BROWSER_MON[Browser Monitor<br/>Script]
end
subgraph Agents["Autonomous Agents"]
MANAGER[ManagerAgent]
WEB_AGENT[WebsiteHealthAgent]
PERF_AGENT[PerformanceOptimizationAgent]
end
subgraph Storage["Data Storage & Visualization"]
GRAFANA[Grafana<br/>Port 3000]
SLACK[Slack Alerts]
LOGS_DIR[logs/ directory]
end
API --> Collectors
Collectors --> Storage
Agents --> Storage
Agents --> API
style API fill:#2b3e50
style Collectors fill:#3498db
style Agents fill:#e74c3c
style Storage fill:#27ae60
Basic Health Checks
Health Endpoints
Corgi provides two health check endpoints for monitoring service availability:
Simple Health Check
curl http://localhost:5002/health
Response:
{
  "status": "healthy",
  "timestamp": "2024-01-20T15:30:45.123Z"
}
Detailed Health Check
curl http://localhost:5002/api/v1/health
Response:
{
  "status": "healthy",
  "version": "1.0.0",
  "components": {
    "database": {
      "status": "connected",
      "latency_ms": 2.3
    },
    "redis": {
      "status": "connected",
      "latency_ms": 0.8
    },
    "rag_service": {
      "status": "available",
      "chunks_loaded": 2947
    }
  },
  "uptime_seconds": 86400,
  "memory_usage_mb": 256.4,
  "cpu_percent": 15.2
}
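Beyond ad-hoc curl calls, the detailed response can be consumed programmatically, for example from a deploy script or a cron job. A minimal sketch, assuming the `requests` library and the response shape shown above:

```python
import requests

def check_corgi_health(base_url: str = "http://localhost:5002") -> bool:
    """Return True if the service and all of its components report healthy."""
    resp = requests.get(f"{base_url}/api/v1/health", timeout=5)
    resp.raise_for_status()
    payload = resp.json()

    # Flag any component that is not connected/available (field names as shown above).
    degraded = [
        name for name, component in payload.get("components", {}).items()
        if component.get("status") not in ("connected", "available")
    ]
    if degraded:
        print(f"Degraded components: {', '.join(degraded)}")
    return payload.get("status") == "healthy" and not degraded

if __name__ == "__main__":
    print("healthy" if check_corgi_health() else "unhealthy")
```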
Implementing Health Checks in Load Balancers
For AWS ALB:
HealthCheckPath: /health
HealthCheckIntervalSeconds: 30
HealthCheckTimeoutSeconds: 5
HealthyThresholdCount: 2
UnhealthyThresholdCount: 3
For Nginx:
upstream corgi_backend {
server localhost:5002 max_fails=3 fail_timeout=30s;
server localhost:5003 max_fails=3 fail_timeout=30s;
}
location /health {
access_log off;
proxy_pass http://corgi_backend;
}
Prometheus & Grafana
Exposed Metrics
Corgi exposes the following Prometheus metrics at /metrics:
| Metric Name | Type | Description | Labels |
|---|---|---|---|
| corgi_request_count | Counter | Total HTTP requests | method, endpoint, status |
| corgi_request_latency_seconds | Histogram | Request latency distribution | method, endpoint |
| corgi_recommendation_generation_seconds | Histogram | Time to generate recommendations | algorithm |
| corgi_active_users | Gauge | Currently active users | - |
| corgi_cache_hit_ratio | Gauge | Redis cache hit ratio | cache_type |
| corgi_database_connections | Gauge | Active database connections | pool_name |
| corgi_agent_executions | Counter | Agent execution count | agent_type, status |
| corgi_ml_model_predictions | Counter | ML model prediction count | model_name |
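When adding new metrics, they would typically be registered with a Prometheus client library so they appear on /metrics automatically. A minimal sketch using the `prometheus_client` package (an assumption; Corgi's actual instrumentation layer may differ):

```python
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Mirror a few of the metrics from the table above.
REQUEST_COUNT = Counter(
    "corgi_request_count", "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "corgi_request_latency_seconds", "Request latency distribution",
    ["method", "endpoint"],
)
ACTIVE_USERS = Gauge("corgi_active_users", "Currently active users")

def record_request(method: str, endpoint: str, status: int, duration_s: float) -> None:
    REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=str(status)).inc()
    REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(duration_s)

if __name__ == "__main__":
    start_http_server(8000)  # standalone /metrics endpoint for this sketch only
    record_request("GET", "/api/v1/recommendations", 200, 0.145)
    time.sleep(60)  # keep the process alive so the endpoint can be scraped
```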
Example Prometheus Queries
Average Request Latency (5m)
rate(corgi_request_latency_seconds_sum[5m]) / rate(corgi_request_latency_seconds_count[5m])
Request Error Rate
sum(rate(corgi_request_count{status=~"5.."}[5m])) / sum(rate(corgi_request_count[5m]))
P95 Recommendation Generation Time
histogram_quantile(0.95, rate(corgi_recommendation_generation_seconds_bucket[5m]))
Cache Performance
avg(corgi_cache_hit_ratio) by (cache_type)
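The same queries can be run outside Grafana through the Prometheus HTTP API, which is useful in scripts and CI checks. A small sketch, assuming Prometheus is reachable on port 9090:

```python
import requests

PROMETHEUS_URL = "http://localhost:9090"

def instant_query(promql: str) -> list:
    """Run an instant query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    return body["data"]["result"]

# Example: overall request error rate over the last 5 minutes.
result = instant_query(
    'sum(rate(corgi_request_count{status=~"5.."}[5m])) / sum(rate(corgi_request_count[5m]))'
)
print(result)
```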
Grafana Dashboard Setup
Import the pre-built dashboard:
# Copy dashboard to Grafana
cp monitoring/grafana/dashboards/corgi-dashboard.json /var/lib/grafana/dashboards/
# Or import via API
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d @monitoring/grafana/dashboards/corgi-dashboard.json
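The import can also be scripted in Python. Note that the /api/dashboards/db endpoint expects the dashboard JSON wrapped in a `dashboard` key, so if the exported file is not already in that envelope a small wrapper is needed; a sketch using the credentials and path from the example above:

```python
import json
import requests

GRAFANA_URL = "http://localhost:3000"

with open("monitoring/grafana/dashboards/corgi-dashboard.json") as f:
    dashboard = json.load(f)

payload = {"dashboard": dashboard, "overwrite": True}
resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    json=payload,
    auth=("admin", "admin"),  # default credentials from the example; change in production
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```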
Key dashboard panels:
- Request rate and latency
- Error rate and status codes
- Recommendation performance
- Cache and database metrics
- Agent execution status
- System resources (CPU, memory)
Agent Framework Monitoring
ManagerAgent Monitoring
The ManagerAgent provides centralized monitoring of all autonomous agents:
# Query agent status programmatically
import requests
response = requests.get("http://localhost:5002/api/v1/agents/status")
agent_status = response.json()
for agent in agent_status["agents"]:
print(f"{agent['name']}: {agent['status']} - Last run: {agent['last_execution']}")
Agent Metrics
Monitor agent performance via logs:
# View ManagerAgent logs
tail -f logs/manager_agent.log | grep "ALERT"
# Check agent execution history
grep "execution_time" logs/agent_*.log | awk '{sum+=$NF; count++} END {print "Avg execution time:", sum/count, "seconds"}'
Slack Integration
Configure Slack alerts for critical agent events:
# In your .env file
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SLACK_CHANNEL=#corgi-alerts
Alert types:
- Security vulnerabilities detected
- Performance degradation
- Test failures
- System resource alerts
- Agent execution failures
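A minimal sketch of posting one of these alerts to the configured webhook, assuming the standard Slack incoming-webhook payload and the `requests` library (the agent framework's own alerting code may differ):

```python
import os
import requests

def send_slack_alert(text: str) -> None:
    """Post a plain-text alert to the webhook configured in SLACK_WEBHOOK_URL."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    resp = requests.post(webhook_url, json={"text": text}, timeout=5)
    resp.raise_for_status()

send_slack_alert(":rotating_light: Corgi alert: performance degradation detected on /api/v1/recommendations")
```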
Automated Monitoring Scripts
Health Monitor
The health monitor continuously checks all endpoints:
# Start health monitoring
make dev-health-monitor
# Or run directly
python scripts/development/health_monitor.py
Monitor output:
{
  "timestamp": "2024-01-20T15:45:00Z",
  "checks": [
    {
      "endpoint": "/api/v1/recommendations",
      "status_code": 200,
      "response_time": 0.145,
      "healthy": true
    }
  ],
  "summary": {
    "total_checks": 4,
    "healthy": 4,
    "unhealthy": 0,
    "avg_response_time": 0.089
  }
}
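At its core, the monitor times a set of endpoint checks and summarizes the results in this shape. A simplified sketch of that loop (not the bundled `health_monitor.py`; the endpoint list and interval are illustrative):

```python
import time
import requests

BASE_URL = "http://localhost:5002"
ENDPOINTS = ["/health", "/api/v1/health", "/api/v1/recommendations", "/metrics"]  # illustrative

def run_checks() -> dict:
    checks = []
    for endpoint in ENDPOINTS:
        start = time.monotonic()
        try:
            resp = requests.get(f"{BASE_URL}{endpoint}", timeout=5)
            status_code, healthy = resp.status_code, resp.status_code == 200
        except requests.RequestException:
            status_code, healthy = None, False
        checks.append({
            "endpoint": endpoint,
            "status_code": status_code,
            "response_time": round(time.monotonic() - start, 3),
            "healthy": healthy,
        })
    healthy_count = sum(c["healthy"] for c in checks)
    return {
        "checks": checks,
        "summary": {
            "total_checks": len(checks),
            "healthy": healthy_count,
            "unhealthy": len(checks) - healthy_count,
            "avg_response_time": round(sum(c["response_time"] for c in checks) / len(checks), 3),
        },
    }

if __name__ == "__main__":
    while True:
        print(run_checks())
        time.sleep(30)  # check interval; illustrative
```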
Browser Monitor
Monitor frontend integration and console errors:
# Start browser monitoring
make dev-browser-monitor
# Or run directly
python scripts/development/browser_monitor.py
Features:
- Automated page navigation
- Console error detection
- Screenshot capture on errors
- Performance timing collection
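Conceptually, this kind of monitor drives a headless browser and watches the console. A short sketch using Playwright's sync API (an assumption; the bundled `browser_monitor.py` may use a different driver, and the URL is illustrative):

```python
from playwright.sync_api import sync_playwright

URL = "http://localhost:8080"  # frontend URL; illustrative

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    errors = []
    # Collect console errors emitted while the page loads.
    page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)

    page.goto(URL, wait_until="networkidle")
    if errors:
        page.screenshot(path="console_errors.png")  # capture evidence on error
        print(f"Console errors detected: {errors}")
    browser.close()
```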
Setting Up the Full Stack
Docker Compose Configuration
Add monitoring services to your docker-compose.yml:
services:
  prometheus:
    image: prom/prometheus:latest
    profiles: ["monitoring"]  # enables `docker-compose --profile monitoring up`
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
  grafana:
    image: grafana/grafana:latest
    profiles: ["monitoring"]
    ports:
      - "3000:3000"
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secure_password
      - GF_INSTALL_PLUGINS=redis-datasource
volumes:
  prometheus_data:
  grafana_data:
Quick Start
# Start all monitoring components
docker-compose --profile monitoring up -d
# Verify services
curl http://localhost:9090/-/healthy # Prometheus
curl http://localhost:3000/api/health # Grafana
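Both services can take a few seconds to come up, so a small wait-until-healthy helper is useful in deploy scripts and CI. A sketch, assuming the ports from the compose file above:

```python
import time
import requests

SERVICES = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}

def wait_until_healthy(timeout_s: int = 60) -> None:
    """Poll each service until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    pending = dict(SERVICES)
    while pending and time.monotonic() < deadline:
        for name, url in list(pending.items()):
            try:
                if requests.get(url, timeout=3).status_code == 200:
                    print(f"{name} is healthy")
                    del pending[name]
            except requests.RequestException:
                pass  # not up yet; retry until the deadline
        if pending:
            time.sleep(2)
    if pending:
        raise TimeoutError(f"Services not healthy after {timeout_s}s: {sorted(pending)}")

if __name__ == "__main__":
    wait_until_healthy()
```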
Alert Configuration
Prometheus Alert Rules
Create monitoring/prometheus/alerts.yml:
groups:
  - name: corgi_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(corgi_request_count{status=~"5.."}[5m])) / sum(rate(corgi_request_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} (> 5%)"
      - alert: SlowResponseTime
        expr: rate(corgi_request_latency_seconds_sum[5m]) / rate(corgi_request_latency_seconds_count[5m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times"
          description: "Average response time is {{ $value }} seconds"
      - alert: DatabaseConnectionPoolExhausted
        expr: corgi_database_connections >= 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "{{ $value }} connections in use (threshold: 90)"
Email/PagerDuty Integration
Configure Alertmanager for notifications:
# monitoring/alertmanager/config.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-notifications'
receivers:
  - name: 'team-notifications'
    email_configs:
      - to: 'ops-team@example.com'
        from: 'corgi-alerts@example.com'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
Performance Tuning
Optimizing Metric Collection
# In config.py - adjust metric collection intervals
METRICS_CONFIG = {
    "collection_interval": 10,  # seconds
    "histogram_buckets": [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
    "cardinality_limit": 10000,
}
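The `histogram_buckets` setting only takes effect if the instrumentation passes those boundaries to its histograms. A sketch of how that might be wired up with `prometheus_client` (an assumed library; the label value is illustrative):

```python
from prometheus_client import Histogram

# Assumed to be imported from config.py in the real application.
METRICS_CONFIG = {
    "histogram_buckets": [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
}

RECOMMENDATION_LATENCY = Histogram(
    "corgi_recommendation_generation_seconds",
    "Time to generate recommendations",
    ["algorithm"],
    buckets=METRICS_CONFIG["histogram_buckets"],
)

# Record a single observation; "collaborative" is a hypothetical algorithm label.
RECOMMENDATION_LATENCY.labels(algorithm="collaborative").observe(0.42)
```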
Reducing Monitoring Overhead
- Sampling: Only collect detailed metrics for a percentage of requests (see the sketch at the end of this subsection)
- Aggregation: Pre-aggregate metrics before sending to Prometheus
- Retention: Configure appropriate data retention policies
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: 'production'
# Retention is configured via Prometheus startup flags rather than in prometheus.yml;
# add these to the `command:` list in docker-compose:
#   - '--storage.tsdb.retention.time=30d'
#   - '--storage.tsdb.retention.size=10GB'
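For the sampling approach listed above, one pattern is to keep the cheap low-cardinality counters exact while recording the more expensive labelled histograms for only a fraction of requests. A minimal sketch assuming `prometheus_client` (the sample rate and metric wiring are illustrative):

```python
import random
from prometheus_client import Counter, Histogram

DETAILED_SAMPLE_RATE = 0.1  # record the labelled histogram for ~10% of requests; illustrative

REQUEST_COUNT = Counter(
    "corgi_request_count", "Total HTTP requests", ["method", "endpoint", "status"]
)
REQUEST_LATENCY = Histogram(
    "corgi_request_latency_seconds", "Request latency distribution", ["method", "endpoint"]
)

def record_request(method: str, endpoint: str, status: int, duration_s: float) -> None:
    # The low-cardinality counter is always recorded, so request and error rates stay exact.
    REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=str(status)).inc()
    # The per-endpoint histogram is sampled; latency quantiles then reflect a sample.
    if random.random() < DETAILED_SAMPLE_RATE:
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(duration_s)
```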
Troubleshooting
Common Issues
High Memory Usage
# Check memory consumption by component
ps aux | grep -E "corgi|prometheus|grafana" | awk '{sum+=$6} END {print "Total RSS:", sum/1024, "MB"}'
# Analyze memory profile
python -m memory_profiler scripts/analyze_memory.py
Missing Metrics
# Verify metrics endpoint
curl http://localhost:5002/metrics | grep corgi_
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
Agent Not Reporting
# Check agent logs
tail -f logs/manager_agent.log
# Verify agent configuration
python scripts/agent_health_check.py
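If you need to script that verification yourself, a minimal sketch of what such a check could look like, using the agent status endpoint shown earlier (the bundled `agent_health_check.py` may differ; the accepted status values are assumptions):

```python
import sys
import requests

def main() -> int:
    resp = requests.get("http://localhost:5002/api/v1/agents/status", timeout=10)
    resp.raise_for_status()
    agents = resp.json().get("agents", [])

    # Assumed healthy status values; adjust to whatever the API actually reports.
    failing = [a["name"] for a in agents if a.get("status") not in ("healthy", "running", "idle")]
    if failing:
        print(f"Agents not reporting healthy: {', '.join(failing)}")
        return 1
    print(f"All {len(agents)} agents reporting")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```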
Debug Commands
# Test all monitoring endpoints
make test-monitoring
# Generate load for testing
python scripts/load_test.py --duration 300 --rps 100
# Validate Prometheus configuration
promtool check config monitoring/prometheus/prometheus.yml
Best Practices
- Set Baseline Metrics: Establish normal operating ranges during initial deployment
- Progressive Alerting: Start with conservative thresholds and adjust based on experience
- Correlation: Use Grafana's correlation features to link metrics with logs
- Capacity Planning: Use historical metrics to predict resource needs
- Regular Reviews: Monthly monitoring review to optimize alerts and dashboards