
Agent Framework

The Corgi Recommender Service employs a sophisticated multi-agent architecture that provides autonomous monitoring, self-healing, and optimization capabilities. This framework enables the system to maintain high availability, security, and performance with minimal human intervention.

Why an Agent-Based Architecture?

Traditional monitoring systems require constant human oversight and manual intervention. Corgi's agent framework was designed to solve three critical challenges:

  1. 24/7 Autonomous Operation: Agents continuously monitor and maintain the system without human intervention
  2. Intelligent Self-Healing: Agents can detect, analyze, and fix issues automatically with appropriate safety controls
  3. Cost-Effective Scaling: The circuit breaker and cost tracking prevent runaway AI costs while maximizing value

The agent framework transforms Corgi from a passive system into an active, self-managing platform that proactively addresses issues before they impact users.

Architecture Overview

graph TB
    subgraph Orchestration ["Orchestration Layer"]
        MA[ManagerAgent<br/>Central Orchestrator]
        AO[AgentOrchestrator<br/>Execution Coordinator]
        AS[Agent Scheduler<br/>Timing Controller]
    end

    subgraph Specialized ["Specialized Agents"]
        WH[WebsiteHealthAgent<br/>Monitors endpoints & performance]
        SA[SecurityHealingAgent<br/>Scans & fixes vulnerabilities]
        PO[ProfilerOptimizerAgent<br/>Analyzes & optimizes performance]
        TA[TesterAgent<br/>Generates & maintains tests]
        BA[BrowserAgent<br/>UI automation & monitoring]
        UX[UserExperienceAgent<br/>Monitors Core Web Vitals]
        CM[ContentManagementAgent<br/>Manages content quality]
        ML[MLModelAgent<br/>Monitors model performance]
        DA[DeploymentAgent<br/>Manages infrastructure]
    end

    subgraph Control ["Control Systems"]
        CT[CostTracker<br/>Tracks AI spending]
        CB[Circuit Breaker<br/>Prevents cost overruns]
        SN[SlackNotifier<br/>Sends alerts]
    end

    subgraph External ["External Systems"]
        SLACK[Slack<br/>Notifications]
        DB[(Database<br/>State & Metrics)]
        API[Corgi API<br/>System Under Management]
        CLAUDE[Claude AI<br/>Intelligence Provider]
    end

    MA --> AO
    MA --> AS
    MA --> CT
    MA --> CB
    MA --> SN

    AO --> WH
    AO --> SA
    AO --> PO
    AO --> TA
    AO --> BA
    AO --> UX
    AO --> CM
    AO --> ML
    AO --> DA

    MA --> DB
    SN --> SLACK
    WH --> API
    SA --> CLAUDE
    PO --> API
    TA --> API

    classDef orchestrator fill:#e8f5e9,stroke:#4caf50,color:#000
    classDef agent fill:#e3f2fd,stroke:#2196f3,color:#000
    classDef control fill:#fff3e0,stroke:#ff9800,color:#000
    classDef external fill:#f3e5f5,stroke:#9c27b0,color:#000

    class MA,AO,AS orchestrator
    class WH,SA,PO,TA,BA,UX,CM,ML,DA agent
    class CT,CB,SN control
    class SLACK,DB,API,CLAUDE external

The ManagerAgent: Central Orchestrator

The ManagerAgent is the central coordinator of the agent framework. It monitors every other agent, tracks system-wide health, and decides which agents run, how resources are allocated, and which alerts are escalated.

Key Responsibilities

  1. Agent Monitoring: Tracks the health, performance, and costs of all specialized agents
  2. Cost Control: Enforces budget limits and triggers circuit breakers when spending exceeds thresholds
  3. Issue Detection: Identifies problems like cost spikes, infinite loops, and system failures
  4. Intelligent Alerting: Aggregates and prioritizes alerts before sending to Slack
  5. Resource Coordination: Ensures agents don't conflict or overwhelm system resources

Implementation Example

class ManagerAgent:
    """Comprehensive agent monitoring and management system"""

    async def _check_all_agents(self):
        """Monitor all registered agents for issues"""
        for agent_id in self.cost_tracker.get_all_agents():
            try:
                # Get agent's current usage stats
                usage = self.cost_tracker.get_agent_usage(agent_id, 'hourly')

                # Calculate comprehensive status
                status = self._calculate_agent_status(usage)
                self.agent_statuses[agent_id] = status

                # Check for various issues
                await self._check_agent_issues(agent_id, status, usage)

                # Log status
                self.logger.info(
                    f"Agent {agent_id}: health={status.health_status}, "
                    f"cost=${status.current_cost_hourly:.2f}/hr, "
                    f"calls={status.api_calls_last_minute}/min"
                )

            except Exception as e:
                self.logger.error(f"Failed to check agent {agent_id}: {e}")

    async def _check_cost_spike(self, agent_id: str, status: AgentStatus, usage: UsageStats):
        """Detect unusual cost increases"""
        # Get historical baseline
        historical_avg = self._get_historical_average_cost(agent_id)

        if historical_avg > 0:
            spike_percentage = ((status.current_cost_hourly - historical_avg) / historical_avg) * 100

            if spike_percentage > self.cost_spike_threshold:
                issue = Issue(
                    issue_id=self._generate_issue_id(),
                    agent_id=agent_id,
                    issue_type='cost_spike',
                    severity='warning' if spike_percentage < 50 else 'critical',
                    message=f"Cost spike detected: {spike_percentage:.1f}% increase",
                    timestamp=datetime.now(),
                    metadata={'spike_percentage': spike_percentage}
                )

                self.active_issues[issue.issue_id] = issue
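
To make the spike arithmetic concrete, the following is a minimal standalone sketch of the same calculation. The function names `spike_percentage` and `classify_spike`, and the 20% threshold used in the note below, are illustrative assumptions; the excerpt above does not state the real value of `cost_spike_threshold`.

```python
from typing import Optional


def spike_percentage(current_cost: float, historical_avg: float) -> float:
    """Percentage change of the current hourly cost relative to the baseline."""
    if historical_avg <= 0:
        # No usable baseline: mirror the guard in the excerpt and report no spike.
        return 0.0
    return (current_cost - historical_avg) / historical_avg * 100


def classify_spike(pct: float, threshold: float) -> Optional[str]:
    """Return None at or below the threshold, else the severity rule from the excerpt."""
    if pct <= threshold:
        return None
    return "warning" if pct < 50 else "critical"
```

With a baseline of $2.00/hr, a current cost of $2.50/hr is a 25% increase (a warning at an assumed 20% threshold), and $3.00/hr is a 50% increase (critical). Under this severity rule, a threshold of 50 or higher would make every reported spike critical.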

Cost Tracking Integration

The ManagerAgent integrates deeply with the cost tracking system to prevent runaway AI costs:

  • Hourly Limits: Each agent has configurable hourly spending limits
  • Token Tracking: Monitors token usage across different AI models
  • Budget Enforcement: Automatically disables agents that exceed their budgets
  • Cost Optimization: Suggests cheaper alternatives when costs spike
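
As a rough illustration of hourly limits and budget enforcement, the sketch below sums each agent's spend over the trailing hour and reports whether the agent may keep running. `HourlyCostTracker` and its methods are hypothetical names, not the actual CostTracker API, and treating agents without a configured limit as unrestricted is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class HourlyCostTracker:
    """Hypothetical tracker: sums each agent's spend over the trailing hour."""

    def __init__(self, hourly_limits: dict[str, float]):
        self.hourly_limits = hourly_limits
        self._events: dict[str, list[tuple[datetime, float]]] = defaultdict(list)

    def record(self, agent_id: str, cost: float, at: datetime) -> None:
        """Store one cost event (for example, a single AI API call)."""
        self._events[agent_id].append((at, cost))

    def hourly_spend(self, agent_id: str, now: datetime) -> float:
        """Total cost of events recorded within the last hour."""
        cutoff = now - timedelta(hours=1)
        return sum(cost for ts, cost in self._events[agent_id] if ts > cutoff)

    def is_enabled(self, agent_id: str, now: datetime) -> bool:
        """False once the agent's trailing-hour spend reaches its limit."""
        limit = self.hourly_limits.get(agent_id)
        if limit is None:
            return True  # assumption: no configured limit means no restriction
        return self.hourly_spend(agent_id, now) < limit
```

Passing the current time in explicitly keeps the sketch deterministic and easy to test; a production tracker would more likely read the clock itself and persist events in the database.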

Specialized Agent Categories

1. Infrastructure & Health Agents

WebsiteHealthAgent

Monitors endpoint availability, response times, and overall system health. Performs regular health checks on all critical endpoints and alerts when performance degrades.

Key Features:

  • Endpoint monitoring with configurable thresholds
  • Response time tracking and alerting
  • Automatic incident detection and reporting
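
A threshold-based health check could look roughly like the sketch below. The function name `classify_health`, the three status labels, and the default thresholds of 500 ms and 2000 ms are assumptions chosen for illustration; the actual agent's thresholds are configurable and are not documented here.

```python
def classify_health(status_code: int, response_ms: float,
                    warn_ms: float = 500.0, critical_ms: float = 2000.0) -> str:
    """Map one endpoint check to 'healthy', 'degraded', or 'critical'."""
    # Server errors or very slow responses are treated as critical.
    if status_code >= 500 or response_ms >= critical_ms:
        return "critical"
    # Client errors or moderately slow responses are treated as degraded.
    if status_code >= 400 or response_ms >= warn_ms:
        return "degraded"
    return "healthy"
```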

DeploymentAgent

Manages infrastructure scaling, backups, and deployment health. Ensures the system maintains high availability through intelligent resource management.

Key Features:

  • Auto-scaling based on load patterns
  • Automated backup verification
  • Infrastructure cost optimization

2. Security & Compliance Agents

SecurityHealingAgent

The SecurityHealingAgent is the most sophisticated agent in the framework. It uses Claude AI to analyze vulnerabilities and applies fixes automatically, subject to safety controls.

Key Features:

  • Multi-scanner integration (pip-audit, npm audit, bandit)
  • LLM-powered vulnerability analysis and fix generation
  • Automatic patching with rollback capabilities
  • Human approval workflows for high-risk changes

Security Workflow:

async def execute(self) -> List[AgentAction]:
    # 1. Comprehensive vulnerability scanning
    vulnerabilities = await self.scan_vulnerabilities()

    # 2. AI-powered analysis and healing plan generation
    healing_plan = await self.analyze_with_llm(vulnerabilities)

    # 3. Execute approved fixes with safety controls
    for action in healing_plan.actions:
        if self._should_auto_execute(action):
            result = await self.execute_healing_action(action)
        else:
            # Queue for human approval
            await self.request_approval(action)
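
The workflow above depends on `_should_auto_execute`, which is not shown. One plausible policy, sketched here under stated assumptions, is to auto-execute only low-risk, non-destructive actions that have a rollback path. The `HealingAction` fields and the three-level risk scale are hypothetical and may not match the real data model.

```python
from dataclasses import dataclass


@dataclass
class HealingAction:
    """Hypothetical shape of one proposed fix."""
    description: str
    risk: str                  # assumed scale: "low", "medium", or "high"
    destructive: bool = False  # e.g. deletes data or removes a dependency
    has_rollback: bool = True  # whether the change can be undone


def should_auto_execute(action: HealingAction) -> bool:
    """Allow automatic execution only for the safest class of actions."""
    if action.destructive or not action.has_rollback:
        return False  # always route these to human approval
    return action.risk == "low"
```

Anything this function rejects would be queued through `request_approval`, matching the "Human approval workflows for high-risk changes" feature.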

3. Performance & Optimization Agents

ProfilerOptimizerAgent

Conducts deep performance analysis using profiling tools and generates optimization recommendations. Runs on schedule and can be triggered by performance degradation.

Key Features:

  • Database query analysis and optimization
  • Memory usage profiling and leak detection
  • API endpoint performance analysis
  • Automated optimization implementation

UserExperienceAgent

Monitors Core Web Vitals and user experience metrics, ensuring the platform maintains excellent performance from the user's perspective.

Key Features:

  • Core Web Vitals monitoring (LCP, FID, CLS)
  • Accessibility compliance checking
  • User behavior analysis
  • UI/UX optimization recommendations

4. Quality & Testing Agents

TesterAgent

Maintains comprehensive test coverage by analyzing code changes and automatically generating appropriate tests. Uses AI to understand code context and create meaningful test scenarios.

Key Features:

  • Automatic test generation for uncovered code
  • Test quality analysis and improvement
  • Coverage monitoring and reporting
  • Integration with CI/CD pipelines

ContentManagementAgent

Ensures content quality, freshness, and SEO optimization across the platform.

Key Features:

  • Content freshness monitoring
  • Quality validation and scoring
  • SEO optimization suggestions
  • Documentation update automation

5. Machine Learning Agents

MLModelAgent

Monitors recommendation model performance, detects drift, and manages model updates.

Key Features:

  • Model performance tracking
  • Drift detection and alerting
  • A/B testing coordination
  • Automatic model weight updates
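
Drift detection can take many forms, and the method used by MLModelAgent is not described here. As one simple illustration, the sketch below flags drift when the recent average of a higher-is-better metric falls more than a tolerance below its baseline average. The function names and the 10% default tolerance are assumptions.

```python
from statistics import mean


def relative_drop(baseline: list[float], recent: list[float]) -> float:
    """Fractional decrease of the recent mean relative to the baseline mean."""
    base = mean(baseline)  # raises StatisticsError on an empty list
    if base == 0:
        return 0.0  # no meaningful baseline to compare against
    return (base - mean(recent)) / base


def has_drifted(baseline: list[float], recent: list[float],
                tolerance: float = 0.10) -> bool:
    """True when the metric has dropped by more than `tolerance` (10% by default)."""
    return relative_drop(baseline, recent) > tolerance
```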

Agent Communication & Coordination

Base Agent Pattern

All agents inherit from a common BaseAgent class that provides standardized functionality:

class BaseAgent(ABC):
    """Base class for all agents in the optimized system"""

    async def run_with_monitoring(self) -> AgentResult:
        """Execute agent with monitoring and error handling"""
        # Check circuit breaker before execution
        circuit_check = self._check_circuit_breaker()
        if not circuit_check['allowed']:
            return AgentResult(
                success=False,
                message=f"Blocked by circuit breaker: {circuit_check['message']}"
            )

        # Execute with timing and error tracking
        start_time = datetime.now()
        self.execution_count += 1  # count every attempt, including runs that raise
        try:
            result = await self.execute()
            if not result.success:
                self.error_count += 1
            return result
        except Exception as e:
            # Convert unexpected failures into a failed result instead of propagating
            self.error_count += 1
            return AgentResult(success=False, message=str(e))

Agent Scheduler

The SpecializedAgentScheduler orchestrates agent execution timing:

  • Nightly Runs: Performance analysis and deep system checks
  • Continuous Monitoring: Health checks every 5 minutes
  • Event-Driven: Triggered by code changes or system events
  • Intelligent Scheduling: Prevents resource conflicts and overload
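
One common way to prevent scheduled agents from overloading the system is to cap how many run concurrently. The sketch below does this with `asyncio.Semaphore`; the function `run_agents_limited` and the default cap of two are hypothetical and are not taken from the SpecializedAgentScheduler.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def run_agents_limited(
    agents: Sequence[Callable[[], Awaitable[Any]]],
    max_concurrent: int = 2,
) -> list:
    """Run all agent coroutines, but never more than `max_concurrent` at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(agent: Callable[[], Awaitable[Any]]) -> Any:
        # Each agent waits here until a slot is free.
        async with semaphore:
            return await agent()

    # gather preserves input order in its results.
    return await asyncio.gather(*(run_one(agent) for agent in agents))
```

A call such as `asyncio.run(run_agents_limited([agent_a.run_with_monitoring, agent_b.run_with_monitoring]))` would then execute the agents with the cap applied, assuming `agent_a` and `agent_b` are BaseAgent instances.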

Inter-Agent Communication

Agents communicate through:

  1. Shared Database: Persistent state and metrics storage
  2. Message Passing: Direct communication for urgent issues
  3. Event System: Pub/sub for loosely coupled coordination
  4. ManagerAgent Hub: Centralized coordination and conflict resolution
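
The pub/sub style of coordination can be illustrated with a small in-process event bus. This is a sketch only: `EventBus`, the topic name in the usage note, and the synchronous delivery model are assumptions, and the real event system may be asynchronous or backed by an external broker.

```python
from collections import defaultdict
from typing import Any, Callable

Handler = Callable[[Any], None]


class EventBus:
    """Hypothetical in-process pub/sub hub for loosely coupled agents."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every event on `topic`."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> int:
        """Deliver `payload` to each subscriber; return how many were notified."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

For example, a testing agent could subscribe to a `"code_changed"` topic and regenerate tests whenever another component publishes to it, without either side holding a reference to the other.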

Safety Controls & Circuit Breakers

Weekly Spending Circuit Breaker

Prevents runaway AI costs with intelligent controls:

class WeeklySpendingBreaker:
    def should_allow_execution(self, agent_id: str) -> Dict[str, Any]:
        current_spending = self.get_weekly_spending()

        if current_spending >= self.limits['hard_limit']:
            return {
                'allowed': False,
                'reason': 'hard_limit_exceeded',
                'message': f'Weekly limit of ${self.limits["hard_limit"]} exceeded'
            }

        if current_spending >= self.limits['soft_limit']:
            # Allow only critical agents
            if self.get_agent_priority(agent_id) > 2:
                return {
                    'allowed': False,
                    'reason': 'soft_limit_exceeded',
                    'message': 'Non-critical agents paused due to spending'
                }

        return {'allowed': True, 'reason': 'within_limits'}
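
The three outcomes of the breaker can be exercised with a standalone restatement of its decision logic. In this sketch the soft and hard limits of $50 and $100 are made-up values, and priorities 1 and 2 are treated as critical, which is how the `> 2` comparison above appears to work.

```python
def breaker_decision(current_spending: float, priority: int,
                     soft_limit: float = 50.0, hard_limit: float = 100.0) -> str:
    """Return the 'reason' the breaker above would give for these inputs."""
    if current_spending >= hard_limit:
        return "hard_limit_exceeded"   # every agent is blocked
    if current_spending >= soft_limit and priority > 2:
        return "soft_limit_exceeded"   # only non-critical agents are blocked
    return "within_limits"
```

At $60 of weekly spend, a priority-3 agent is paused while a priority-1 agent keeps running; at $120, both are blocked.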

Execution Safety

Each agent implements safety controls:

  • Rate Limiting: Prevents agents from overwhelming the system
  • Rollback Procedures: Every change can be undone
  • Human Approval: High-risk actions require explicit approval
  • Validation Checks: Changes are verified before committing

Monitoring & Observability

Real-Time Dashboards

The agent framework provides comprehensive monitoring:

  • Agent health status and uptime
  • Cost tracking and budget utilization
  • Performance metrics and trends
  • Issue detection and resolution tracking

Slack Integration

Intelligent alerting ensures the right people are notified:

  • Aggregated Alerts: Similar issues are grouped
  • Priority-Based Routing: Critical issues go to #corgi-alerts
  • Context-Rich Messages: Include actionable information
  • Follow-Up Tracking: Ensures issues are resolved
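
Alert aggregation could be as simple as grouping alerts by agent and issue type before sending one message per group. The sketch below assumes each alert is a dict with `agent_id` and `issue_type` keys, mirroring the `Issue` fields shown earlier; the function name and the output format are illustrative.

```python
from collections import Counter


def aggregate_alerts(alerts: list[dict]) -> list[str]:
    """Collapse alerts with the same agent and issue type into one summary line."""
    counts = Counter((a["agent_id"], a["issue_type"]) for a in alerts)
    # Counter keeps first-seen order, so summaries follow arrival order.
    return [f"{agent}: {issue} x{n}" for (agent, issue), n in counts.items()]
```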

Metrics & Reporting

The framework tracks extensive metrics:

  • Execution Metrics: Success rates, timing, resource usage
  • Cost Metrics: Per-agent spending, model usage, budget tracking
  • Health Metrics: Uptime, error rates, performance scores
  • Business Metrics: Issues resolved, optimizations applied, tests generated

Best Practices for Agent Development

1. Design for Failure

  • Always include retry logic with exponential backoff
  • Implement circuit breakers for external dependencies
  • Log extensively for debugging
  • Design idempotent operations
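
Retry with exponential backoff can be sketched as below. The helper `retry_with_backoff` and its parameters are hypothetical; the delay doubles after each failed attempt up to a cap, with a small random jitter added so that several agents do not retry in lockstep.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable


async def retry_with_backoff(
    operation: Callable[[], Awaitable[Any]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    jitter: float = 0.1,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay grows as base, 2*base, 4*base, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(delay + random.uniform(0, delay * jitter))
```

Because a retried operation may run more than once, this pattern is only safe when the operation is idempotent, which is why the two practices appear together in the list above. The `sleep` parameter exists so tests can substitute a fake and avoid real waiting.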

2. Cost Consciousness

  • Set appropriate budget limits for each agent
  • Use cheaper models when possible
  • Cache results to avoid repeated API calls
  • Monitor token usage continuously
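
Caching is often the cheapest of these measures to add. The sketch below keys cached results by a hash of the prompt so that an identical request never triggers a second paid call. `ResponseCache` is a hypothetical helper, and `compute` stands in for whatever function actually calls the AI provider.

```python
import hashlib
from typing import Callable


class ResponseCache:
    """Hypothetical in-memory cache for AI responses, keyed by prompt hash."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prompt: str, compute: Callable[[str], str]) -> str:
        """Return the cached response, calling `compute` only on a miss."""
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(prompt)
        self._store[key] = value
        return value
```

This version never expires entries; a real deployment would likely add a time-to-live so that stale analyses are eventually refreshed.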

3. Safety First

  • Never auto-execute destructive operations
  • Always provide rollback mechanisms
  • Require human approval for high-risk actions
  • Validate all changes before applying

4. Observability

  • Log all significant actions and decisions
  • Track detailed metrics for performance analysis
  • Provide clear status reporting
  • Enable easy debugging through comprehensive logs

Future Enhancements

The agent framework is designed for extensibility:

  1. Learning Agents: Agents that improve their performance over time
  2. Predictive Agents: Anticipate issues before they occur
  3. Collaborative Agents: Multiple agents working together on complex tasks
  4. Domain-Specific Agents: Specialized agents for specific business needs

Summary

Corgi's agent framework shifts system management from reactive monitoring to proactive, automated maintenance. By combining specialized agents with centralized orchestration, safety controls, and cost management, the framework is designed to keep Corgi performant, secure, and cost-effective with minimal human intervention.

The true power of the framework lies not in any individual agent, but in how they work together as a coordinated system, each contributing their specialized capabilities while the ManagerAgent ensures everything runs smoothly and efficiently.