
Monitoring and Observability

A comprehensive monitoring stack is essential for maintaining a healthy production Corgi deployment. This guide covers every monitoring layer, from basic health checks to Prometheus metrics, Grafana dashboards, and autonomous agent-based monitoring.

Monitoring Architecture

graph TB
    subgraph API["Corgi API"]
        HEALTH["/health endpoint"]
        METRICS["/metrics endpoint"]
        LOGS["Application Logs"]
    end

    subgraph Collectors["Data Collectors"]
        PROM[Prometheus<br/>Port 9090]
        HEALTH_MON[Health Monitor<br/>Script]
        BROWSER_MON[Browser Monitor<br/>Script]
    end

    subgraph Agents["Autonomous Agents"]
        MANAGER[ManagerAgent]
        WEB_AGENT[WebsiteHealthAgent]
        PERF_AGENT[PerformanceOptimizationAgent]
    end

    subgraph Storage["Data Storage & Visualization"]
        GRAFANA[Grafana<br/>Port 3000]
        SLACK[Slack Alerts]
        LOGS_DIR[logs/ directory]
    end

    API --> Collectors
    Collectors --> Storage
    Agents --> Storage
    Agents --> API

    style API fill:#2b3e50
    style Collectors fill:#3498db
    style Agents fill:#e74c3c
    style Storage fill:#27ae60

Basic Health Checks

Health Endpoints

Corgi provides two health check endpoints for monitoring service availability:

Simple Health Check

curl http://localhost:5002/health

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-20T15:30:45.123Z"
}

Detailed Health Check

curl http://localhost:5002/api/v1/health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "components": {
    "database": {
      "status": "connected",
      "latency_ms": 2.3
    },
    "redis": {
      "status": "connected",
      "latency_ms": 0.8
    },
    "rag_service": {
      "status": "available",
      "chunks_loaded": 2947
    }
  },
  "uptime_seconds": 86400,
  "memory_usage_mb": 256.4,
  "cpu_percent": 15.2
}
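
For scripted checks (for example in CI or a cron job), the detailed endpoint can be polled programmatically. The sketch below is illustrative: it assumes the response shape shown above and exits non-zero when any component reports a degraded status.

# check_health.py - illustrative poller for the detailed health endpoint
import sys

import requests

resp = requests.get("http://localhost:5002/api/v1/health", timeout=5)
resp.raise_for_status()
health = resp.json()

# A component is considered degraded unless it reports "connected" or "available"
degraded = [
    name for name, component in health.get("components", {}).items()
    if component.get("status") not in ("connected", "available")
]

if health.get("status") != "healthy" or degraded:
    print(f"UNHEALTHY: overall={health.get('status')}, degraded={degraded}")
    sys.exit(1)

print(f"healthy - uptime {health.get('uptime_seconds', 0)}s, "
      f"memory {health.get('memory_usage_mb', 0)} MB")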

Implementing Health Checks in Load Balancers

For AWS ALB:

HealthCheckPath: /health
HealthCheckIntervalSeconds: 30
HealthCheckTimeoutSeconds: 5
HealthyThresholdCount: 2
UnhealthyThresholdCount: 3

For Nginx:

upstream corgi_backend {
    server localhost:5002 max_fails=3 fail_timeout=30s;
    server localhost:5003 max_fails=3 fail_timeout=30s;
}

location /health {
    access_log off;
    proxy_pass http://corgi_backend;
}

Prometheus & Grafana

Exposed Metrics

Corgi exposes the following Prometheus metrics at /metrics:

| Metric Name | Type | Description | Labels |
|-------------|------|-------------|--------|
| corgi_request_count | Counter | Total HTTP requests | method, endpoint, status |
| corgi_request_latency_seconds | Histogram | Request latency distribution | method, endpoint |
| corgi_recommendation_generation_seconds | Histogram | Time to generate recommendations | algorithm |
| corgi_active_users | Gauge | Currently active users | - |
| corgi_cache_hit_ratio | Gauge | Redis cache hit ratio | cache_type |
| corgi_database_connections | Gauge | Active database connections | pool_name |
| corgi_agent_executions | Counter | Agent execution count | agent_type, status |
| corgi_ml_model_predictions | Counter | ML model prediction count | model_name |
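
For reference, the sketch below shows how metrics with these names and labels are typically declared with the prometheus_client library. It is an illustration only, not Corgi's actual instrumentation code; note that prometheus_client appends a _total suffix to counters on exposition.

# Illustrative metric declarations (not Corgi's actual instrumentation)
from prometheus_client import Counter, Gauge, Histogram

REQUEST_COUNT = Counter(
    "corgi_request_count", "Total HTTP requests",
    ["method", "endpoint", "status"],
)  # exposed as corgi_request_count_total
REQUEST_LATENCY = Histogram(
    "corgi_request_latency_seconds", "Request latency distribution",
    ["method", "endpoint"],
)
ACTIVE_USERS = Gauge("corgi_active_users", "Currently active users")

# Example instrumentation inside a request handler
REQUEST_COUNT.labels(method="GET", endpoint="/api/v1/recommendations", status="200").inc()
REQUEST_LATENCY.labels(method="GET", endpoint="/api/v1/recommendations").observe(0.145)
ACTIVE_USERS.set(42)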

Example Prometheus Queries

Average Request Latency (5m)

rate(corgi_request_latency_seconds_sum[5m]) / rate(corgi_request_latency_seconds_count[5m])

Request Error Rate

sum(rate(corgi_request_count{status=~"5.."}[5m])) / sum(rate(corgi_request_count[5m]))

P95 Recommendation Generation Time

histogram_quantile(0.95, rate(corgi_recommendation_generation_seconds_bucket[5m]))

Cache Performance

avg(corgi_cache_hit_ratio) by (cache_type)
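
These queries can also be evaluated outside Grafana through the Prometheus HTTP API, which is handy for scripted checks. A minimal sketch, assuming Prometheus is reachable on localhost:9090:

# Evaluate the error-rate query via the Prometheus HTTP API
import requests

QUERY = (
    'sum(rate(corgi_request_count{status=~"5.."}[5m])) '
    "/ sum(rate(corgi_request_count[5m]))"
)

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]

error_rate = float(result[0]["value"][1]) if result else 0.0
print(f"5xx error rate over the last 5m: {error_rate:.2%}")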

Grafana Dashboard Setup

Import the pre-built dashboard:

# Copy dashboard to Grafana
cp monitoring/grafana/dashboards/corgi-dashboard.json /var/lib/grafana/dashboards/

# Or import via API
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @monitoring/grafana/dashboards/corgi-dashboard.json
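
The same import can be scripted in Python. The sketch below wraps the dashboard JSON in the envelope expected by the /api/dashboards/db endpoint and assumes the admin credentials used in the curl example:

# Import the dashboard through the Grafana HTTP API (illustrative)
import json

import requests

with open("monitoring/grafana/dashboards/corgi-dashboard.json") as f:
    dashboard = json.load(f)

payload = {"dashboard": dashboard, "overwrite": True}

resp = requests.post(
    "http://localhost:3000/api/dashboards/db",
    json=payload,
    auth=("admin", "admin"),
    timeout=10,
)
resp.raise_for_status()
print("Dashboard imported:", resp.json().get("url"))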

Key dashboard panels:

  - Request rate and latency
  - Error rate and status codes
  - Recommendation performance
  - Cache and database metrics
  - Agent execution status
  - System resources (CPU, memory)

Agent Framework Monitoring

ManagerAgent Monitoring

The ManagerAgent provides centralized monitoring of all autonomous agents:

# Query agent status programmatically
import requests

response = requests.get("http://localhost:5002/api/v1/agents/status")
agent_status = response.json()

for agent in agent_status["agents"]:
    print(f"{agent['name']}: {agent['status']} - Last run: {agent['last_execution']}")

Agent Metrics

Monitor agent performance via logs:

# View ManagerAgent logs
tail -f logs/manager_agent.log | grep "ALERT"

# Check agent execution history
grep "execution_time" logs/agent_*.log | awk '{sum+=$NF; count++} END {print "Avg execution time:", sum/count, "seconds"}'

Slack Integration

Configure Slack alerts for critical agent events:

# In your .env file
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SLACK_CHANNEL=#corgi-alerts

Alert types:

  - Security vulnerabilities detected
  - Performance degradation
  - Test failures
  - System resource alerts
  - Agent execution failures
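
To verify the webhook end to end, a test message can be posted directly with Python. This is a minimal sketch; the message text and agent name are illustrative, and SLACK_WEBHOOK_URL is read from the environment as configured above.

# Post a test alert to the configured Slack webhook (illustrative)
import os

import requests

webhook_url = os.environ["SLACK_WEBHOOK_URL"]
message = {
    "text": ":warning: [corgi] Test alert - performance degradation "
            "reported by PerformanceOptimizationAgent",
}

resp = requests.post(webhook_url, json=message, timeout=5)
resp.raise_for_status()
print("Slack alert delivered")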

Automated Monitoring Scripts

Health Monitor

The health monitor continuously checks all endpoints:

# Start health monitoring
make dev-health-monitor

# Or run directly
python scripts/development/health_monitor.py

Monitor output:

{
  "timestamp": "2024-01-20T15:45:00Z",
  "checks": [
    {
      "endpoint": "/api/v1/recommendations",
      "status_code": 200,
      "response_time": 0.145,
      "healthy": true
    }
  ],
  "summary": {
    "total_checks": 4,
    "healthy": 4,
    "unhealthy": 0,
    "avg_response_time": 0.089
  }
}
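
The core loop of the health monitor can be sketched as follows. This is an illustration of the behaviour, not the bundled scripts/development/health_monitor.py, and the endpoint list is an assumption:

# Minimal health-monitor loop: poll endpoints, time responses, emit a JSON report
import json
import time
from datetime import datetime, timezone

import requests

BASE_URL = "http://localhost:5002"
ENDPOINTS = ["/health", "/api/v1/health", "/metrics", "/api/v1/recommendations"]

checks = []
for endpoint in ENDPOINTS:
    start = time.monotonic()
    try:
        resp = requests.get(BASE_URL + endpoint, timeout=5)
        status_code, healthy = resp.status_code, resp.ok
    except requests.RequestException:
        status_code, healthy = None, False
    checks.append({
        "endpoint": endpoint,
        "status_code": status_code,
        "response_time": round(time.monotonic() - start, 3),
        "healthy": healthy,
    })

report = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "checks": checks,
    "summary": {
        "total_checks": len(checks),
        "healthy": sum(c["healthy"] for c in checks),
        "unhealthy": sum(not c["healthy"] for c in checks),
        "avg_response_time": round(sum(c["response_time"] for c in checks) / len(checks), 3),
    },
}
print(json.dumps(report, indent=2))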

Browser Monitor

Monitor frontend integration and console errors:

# Start browser monitoring
make dev-browser-monitor

# Or run directly
python scripts/development/browser_monitor.py

Features:

  - Automated page navigation
  - Console error detection
  - Screenshot capture on errors
  - Performance timing collection
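
The approach can be sketched with Playwright; this is an assumption for illustration, as the bundled scripts/development/browser_monitor.py may use a different browser driver and target URL.

# Illustrative browser check: collect console errors and screenshot on failure
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    errors = []
    page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)

    page.goto("http://localhost:5002", wait_until="networkidle")

    if errors:
        page.screenshot(path="logs/console_errors.png")
        print(f"{len(errors)} console error(s): {errors}")
    else:
        print("no console errors detected")

    browser.close()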

Setting Up the Full Stack

Docker Compose Configuration

Add monitoring services to your docker-compose.yml:

services:
  prometheus:
    image: prom/prometheus:latest
    profiles:
      - monitoring
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    profiles:
      - monitoring
    ports:
      - "3000:3000"
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secure_password
      - GF_INSTALL_PLUGINS=redis-datasource

volumes:
  prometheus_data:
  grafana_data:

Quick Start

# Start all monitoring components
docker-compose --profile monitoring up -d

# Verify services
curl http://localhost:9090/-/healthy  # Prometheus
curl http://localhost:3000/api/health  # Grafana
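
In automated environments it is often useful to wait until both services report healthy before continuing. A minimal sketch using the same health URLs as the curl commands above:

# Wait for Prometheus and Grafana to become ready (illustrative)
import time

import requests

SERVICES = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}

deadline = time.monotonic() + 120  # allow up to two minutes
pending = dict(SERVICES)

while pending and time.monotonic() < deadline:
    for name, url in list(pending.items()):
        try:
            if requests.get(url, timeout=3).ok:
                print(f"{name} is up")
                del pending[name]
        except requests.RequestException:
            pass
    if pending:
        time.sleep(5)

if pending:
    raise SystemExit(f"services not ready: {', '.join(pending)}")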

Alert Configuration

Prometheus Alert Rules

Create monitoring/prometheus/alerts.yml:

groups:
  - name: corgi_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(corgi_request_count{status=~"5.."}[5m])) / sum(rate(corgi_request_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} (> 5%)"

      - alert: SlowResponseTime
        expr: rate(corgi_request_latency_seconds_sum[5m]) / rate(corgi_request_latency_seconds_count[5m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times"
          description: "Average response time is {{ $value }} seconds"

      - alert: DatabaseConnectionPoolExhausted
        expr: corgi_database_connections >= 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "{{ $value }} connections in use (>90%)"

Email/PagerDuty Integration

Configure Alertmanager for notifications:

# monitoring/alertmanager/config.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-notifications'

receivers:
  - name: 'team-notifications'
    email_configs:
      - to: 'ops-team@example.com'
        from: 'corgi-alerts@example.com'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
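
To confirm the routing actually reaches email or PagerDuty, a synthetic alert can be posted to Alertmanager's v2 API. This sketch assumes Alertmanager is running on its default port 9093; the alert labels are illustrative.

# Fire a synthetic alert at Alertmanager to exercise notification routing
import requests

test_alert = [{
    "labels": {
        "alertname": "MonitoringPipelineTest",
        "severity": "warning",
        "service": "corgi",
    },
    "annotations": {
        "summary": "Test alert to validate notification routing",
    },
}]

resp = requests.post("http://localhost:9093/api/v2/alerts", json=test_alert, timeout=5)
resp.raise_for_status()
print("Test alert accepted; check the team-notifications receiver")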

Performance Tuning

Optimizing Metric Collection

# In config.py - adjust metric collection intervals
METRICS_CONFIG = {
    "collection_interval": 10,  # seconds
    "histogram_buckets": [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
    "cardinality_limit": 10000
}

Reducing Monitoring Overhead

  1. Sampling: Only collect detailed metrics for a percentage of requests (see the sketch after the retention settings below)
  2. Aggregation: Pre-aggregate metrics before sending to Prometheus
  3. Retention: Configure appropriate data retention policies

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: 'production'

Retention is not set in prometheus.yml itself; pass it to Prometheus as command-line flags (for example in the docker-compose command section shown earlier):

--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=10GB
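
The sampling idea from item 1 above can be sketched as follows, assuming prometheus_client-style metric objects. Counters stay exact while only a fraction of requests feed the more expensive histogram; rates derived from the sampled histogram then need to be scaled by the sample rate.

# Illustrative request-level sampling: count everything, sample detailed latency
import random

DETAILED_SAMPLE_RATE = 0.1  # record detailed latency for ~10% of requests

def record_request_metrics(method, endpoint, status, duration_seconds,
                           request_count, request_latency):
    # Cheap counter: always incremented
    request_count.labels(method=method, endpoint=endpoint, status=status).inc()
    # Expensive histogram: only observed for a sampled subset of requests
    if random.random() < DETAILED_SAMPLE_RATE:
        request_latency.labels(method=method, endpoint=endpoint).observe(duration_seconds)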

Troubleshooting

Common Issues

High Memory Usage

# Check memory consumption by component
ps aux | grep -E "corgi|prometheus|grafana" | awk '{sum+=$6} END {print "Total RSS:", sum/1024, "MB"}'

# Analyze memory profile
python -m memory_profiler scripts/analyze_memory.py

Missing Metrics

# Verify metrics endpoint
curl http://localhost:5002/metrics | grep corgi_

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

Agent Not Reporting

# Check agent logs
tail -f logs/manager_agent.log

# Verify agent configuration
python scripts/agent_health_check.py

Debug Commands

# Test all monitoring endpoints
make test-monitoring

# Generate load for testing
python scripts/load_test.py --duration 300 --rps 100

# Validate Prometheus configuration
promtool check config monitoring/prometheus/prometheus.yml

Best Practices

  1. Set Baseline Metrics: Establish normal operating ranges during initial deployment
  2. Progressive Alerting: Start with conservative thresholds and adjust based on experience
  3. Correlation: Use Grafana's correlation features to link metrics with logs
  4. Capacity Planning: Use historical metrics to predict resource needs
  5. Regular Reviews: Review alerts and dashboards monthly to keep them tuned and relevant

Next Steps