Agent Framework
The Corgi Recommender Service employs a sophisticated multi-agent architecture that provides autonomous monitoring, self-healing, and optimization capabilities. This framework enables the system to maintain high availability, security, and performance with minimal human intervention.
Why an Agent-Based Architecture?
Traditional monitoring systems require constant human oversight and manual intervention. Corgi's agent framework was designed to solve three critical challenges:
- 24/7 Autonomous Operation: Agents continuously monitor and maintain the system, escalating to a human only when a high-risk action needs approval
- Intelligent Self-Healing: Agents can detect, analyze, and fix issues automatically with appropriate safety controls
- Cost-Effective Scaling: The circuit breaker and cost tracking prevent runaway AI costs while maximizing value
The agent framework transforms Corgi from a passive system into an active, self-managing platform that proactively addresses issues before they impact users.
Architecture Overview
graph TB
subgraph Orchestration ["Orchestration Layer"]
MA[ManagerAgent<br/>Central Orchestrator]
AO[AgentOrchestrator<br/>Execution Coordinator]
AS[Agent Scheduler<br/>Timing Controller]
end
subgraph Specialized ["Specialized Agents"]
WH[WebsiteHealthAgent<br/>Monitors endpoints & performance]
SA[SecurityHealingAgent<br/>Scans & fixes vulnerabilities]
PO[ProfilerOptimizerAgent<br/>Analyzes & optimizes performance]
TA[TesterAgent<br/>Generates & maintains tests]
BA[BrowserAgent<br/>UI automation & monitoring]
UX[UserExperienceAgent<br/>Monitors Core Web Vitals]
CM[ContentManagementAgent<br/>Manages content quality]
ML[MLModelAgent<br/>Monitors model performance]
DA[DeploymentAgent<br/>Manages infrastructure]
end
subgraph Control ["Control Systems"]
CT[CostTracker<br/>Tracks AI spending]
CB[Circuit Breaker<br/>Prevents cost overruns]
SN[SlackNotifier<br/>Sends alerts]
end
subgraph External ["External Systems"]
SLACK[Slack<br/>Notifications]
DB[(Database<br/>State & Metrics)]
API[Corgi API<br/>System Under Management]
CLAUDE[Claude AI<br/>Intelligence Provider]
end
MA --> AO
MA --> AS
MA --> CT
MA --> CB
MA --> SN
AO --> WH
AO --> SA
AO --> PO
AO --> TA
AO --> BA
AO --> UX
AO --> CM
AO --> ML
AO --> DA
MA --> DB
SN --> SLACK
WH --> API
SA --> CLAUDE
PO --> API
TA --> API
classDef orchestrator fill:#e8f5e9,stroke:#4caf50,color:#000
classDef agent fill:#e3f2fd,stroke:#2196f3,color:#000
classDef control fill:#fff3e0,stroke:#ff9800,color:#000
classDef external fill:#f3e5f5,stroke:#9c27b0,color:#000
class MA,AO,AS orchestrator
class WH,SA,PO,TA,BA,UX,CM,ML,DA agent
class CT,CB,SN control
class SLACK,DB,API,CLAUDE external
The ManagerAgent: Central Orchestrator
The ManagerAgent serves as the brain of the agent framework, monitoring all other agents and maintaining system-wide health. Beyond monitoring, it acts as an orchestrator that makes decisions about agent execution, resource allocation, and alert prioritization.
Key Responsibilities
- Agent Monitoring: Tracks the health, performance, and costs of all specialized agents
- Cost Control: Enforces budget limits and triggers circuit breakers when spending exceeds thresholds
- Issue Detection: Identifies problems like cost spikes, infinite loops, and system failures
- Intelligent Alerting: Aggregates and prioritizes alerts before sending to Slack
- Resource Coordination: Ensures agents don't conflict or overwhelm system resources
Implementation Example
class ManagerAgent:
    """Comprehensive agent monitoring and management system"""

    async def _check_all_agents(self):
        """Monitor all registered agents for issues"""
        for agent_id in self.cost_tracker.get_all_agents():
            try:
                # Get agent's current usage stats
                usage = self.cost_tracker.get_agent_usage(agent_id, 'hourly')
                # Calculate comprehensive status
                status = self._calculate_agent_status(usage)
                self.agent_statuses[agent_id] = status
                # Check for various issues
                await self._check_agent_issues(agent_id, status, usage)
                # Log status
                self.logger.info(
                    f"Agent {agent_id}: health={status.health_status}, "
                    f"cost=${status.current_cost_hourly:.2f}/hr, "
                    f"calls={status.api_calls_last_minute}/min"
                )
            except Exception as e:
                self.logger.error(f"Failed to check agent {agent_id}: {e}")

    async def _check_cost_spike(self, agent_id: str, status: AgentStatus, usage: UsageStats):
        """Detect unusual cost increases"""
        # Get historical baseline
        historical_avg = self._get_historical_average_cost(agent_id)
        if historical_avg > 0:
            spike_percentage = ((status.current_cost_hourly - historical_avg) / historical_avg) * 100
            if spike_percentage > self.cost_spike_threshold:
                issue = Issue(
                    issue_id=self._generate_issue_id(),
                    agent_id=agent_id,
                    issue_type='cost_spike',
                    severity='warning' if spike_percentage < 50 else 'critical',
                    message=f"Cost spike detected: {spike_percentage:.1f}% increase",
                    timestamp=datetime.now(),
                    metadata={'spike_percentage': spike_percentage}
                )
                self.active_issues[issue.issue_id] = issue
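The spike calculation above can be isolated as a small pure function, which makes the thresholds easier to reason about. This is a minimal sketch, not the framework's code: `classify_cost_spike` and the 25% default for `spike_threshold` are assumptions made for illustration, while the 50% boundary between warning and critical comes from the excerpt.

```python
from typing import Optional


def classify_cost_spike(
    current_hourly: float,
    historical_avg: float,
    spike_threshold: float = 25.0,    # hypothetical default, in percent
    critical_threshold: float = 50.0,  # boundary used in the excerpt above
) -> Optional[str]:
    """Return None, 'warning', or 'critical' for an hourly cost reading."""
    if historical_avg <= 0:
        # No baseline yet, so a percentage increase cannot be computed.
        return None
    spike_percentage = (current_hourly - historical_avg) / historical_avg * 100
    if spike_percentage <= spike_threshold:
        return None
    return 'warning' if spike_percentage < critical_threshold else 'critical'
```

With a baseline of $1.00/hr, a reading of $1.40/hr is a roughly 40% increase and is classified as a warning, while $1.50/hr reaches the 50% boundary and is classified as critical.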
Cost Tracking Integration
The ManagerAgent integrates deeply with the cost tracking system to prevent runaway AI costs:
- Hourly Limits: Each agent has configurable hourly spending limits
- Token Tracking: Monitors token usage across different AI models
- Budget Enforcement: Automatically disables agents that exceed their budgets
- Cost Optimization: Suggests cheaper alternatives when costs spike
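Hourly limits and budget enforcement could be combined roughly as follows. The `HourlyBudget` class, its fields, and the choice to treat an agent with no configured limit as having a zero budget are all assumptions made for this sketch; the real CostTracker interface is not shown in this document.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class HourlyBudget:
    """Hypothetical per-agent hourly budget with automatic disabling."""
    limits: Dict[str, float]  # USD per hour, keyed by agent id
    spent: Dict[str, float] = field(default_factory=lambda: defaultdict(float))
    disabled: Set[str] = field(default_factory=set)

    def record(self, agent_id: str, cost_usd: float) -> bool:
        """Record a charge; return False once the agent is over budget."""
        self.spent[agent_id] += cost_usd
        # Agents without a configured limit get a zero budget (an assumption).
        if self.spent[agent_id] > self.limits.get(agent_id, 0.0):
            self.disabled.add(agent_id)
        return agent_id not in self.disabled

    def reset_hour(self) -> None:
        """Start a new hourly window and re-enable all agents."""
        self.spent.clear()
        self.disabled.clear()
```

An agent stays disabled until `reset_hour()` runs, so a single expensive call cannot be followed by further spending within the same window.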
Specialized Agent Categories
1. Infrastructure & Health Agents
WebsiteHealthAgent
Monitors endpoint availability, response times, and overall system health. Performs regular health checks on all critical endpoints and alerts when performance degrades.
Key Features:
- Endpoint monitoring with configurable thresholds
- Response time tracking and alerting
- Automatic incident detection and reporting
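One way to read "configurable thresholds" is as a small classification step applied to each health check result. The sketch below leaves out the HTTP request itself and only classifies an observed status code and response time; `Thresholds`, `classify_check`, and the 500 ms and 2000 ms defaults are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical response-time thresholds, in milliseconds."""
    warn_ms: float = 500.0
    critical_ms: float = 2000.0


def classify_check(status_code: int, elapsed_ms: float,
                   thresholds: Thresholds = Thresholds()) -> str:
    """Map one endpoint check to 'healthy', 'degraded', or 'critical'."""
    if status_code >= 500 or elapsed_ms >= thresholds.critical_ms:
        return "critical"
    if status_code >= 400 or elapsed_ms >= thresholds.warn_ms:
        return "degraded"
    return "healthy"
```

Keeping the classification separate from the network call makes the thresholds easy to unit test and to override per endpoint.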
DeploymentAgent
Manages infrastructure scaling, backups, and deployment health. Ensures the system maintains high availability through intelligent resource management.
Key Features:
- Auto-scaling based on load patterns
- Automated backup verification
- Infrastructure cost optimization
2. Security & Compliance Agents
SecurityHealingAgent
The most sophisticated agent in the framework. It uses Claude AI to analyze vulnerabilities and applies fixes automatically, subject to safety controls.
Key Features:
- Multi-scanner integration (pip-audit, npm audit, bandit)
- LLM-powered vulnerability analysis and fix generation
- Automatic patching with rollback capabilities
- Human approval workflows for high-risk changes
Security Workflow:
async def execute(self) -> List[AgentAction]:
    # 1. Comprehensive vulnerability scanning
    vulnerabilities = await self.scan_vulnerabilities()
    # 2. AI-powered analysis and healing plan generation
    healing_plan = await self.analyze_with_llm(vulnerabilities)
    # 3. Execute approved fixes with safety controls
    for action in healing_plan.actions:
        if self._should_auto_execute(action):
            result = await self.execute_healing_action(action)
        else:
            # Queue for human approval
            await self.request_approval(action)
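The workflow above depends on a policy that decides which actions may run without approval. The document does not show `_should_auto_execute`, so the following is only one plausible policy: the `HealingAction` fields and the three risk levels are hypothetical, chosen to reflect the rollback and human-approval rules described elsewhere on this page.

```python
from dataclasses import dataclass


@dataclass
class HealingAction:
    """Hypothetical shape of a proposed fix."""
    package: str
    risk: str              # assumed levels: 'low', 'medium', 'high'
    has_rollback: bool
    destructive: bool = False


def should_auto_execute(action: HealingAction) -> bool:
    """Allow automatic execution only for low-risk, reversible changes."""
    if action.destructive or not action.has_rollback:
        # Destructive or irreversible changes always go to a human.
        return False
    return action.risk == 'low'
```

Anything that fails this check would be queued through the approval path rather than dropped.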
3. Performance & Optimization Agents
ProfilerOptimizerAgent
Conducts deep performance analysis using profiling tools and generates optimization recommendations. Runs on a schedule and can also be triggered by performance degradation.
Key Features:
- Database query analysis and optimization
- Memory usage profiling and leak detection
- API endpoint performance analysis
- Automated optimization implementation
UserExperienceAgent
Monitors Core Web Vitals and user experience metrics, ensuring the platform maintains excellent performance from the user's perspective.
Key Features:
- Core Web Vitals monitoring (LCP, FID, CLS)
- Accessibility compliance checking
- User behavior analysis
- UI/UX optimization recommendations
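Core Web Vitals are usually reported against Google's published "good" and "poor" boundaries, so a monitoring agent might rate each measurement as sketched below. The function name `rate_vital` is hypothetical; the thresholds are the published ones (LCP 2.5 s and 4 s, FID 100 ms and 300 ms, CLS 0.1 and 0.25). Google replaced FID with INP as a Core Web Vital in March 2024; the sketch keeps FID to match the feature list above.

```python
from typing import Dict, Tuple

# (good upper bound, needs-improvement upper bound); LCP and FID in ms, CLS unitless.
VITAL_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "LCP": (2500.0, 4000.0),
    "FID": (100.0, 300.0),
    "CLS": (0.1, 0.25),
}


def rate_vital(metric: str, value: float) -> str:
    """Rate one measurement as 'good', 'needs-improvement', or 'poor'."""
    good, needs_improvement = VITAL_THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs-improvement"
    return "poor"
```

In practice these ratings are normally applied to the 75th percentile of page loads rather than to individual samples.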
4. Quality & Testing Agents
TesterAgent
Maintains comprehensive test coverage by analyzing code changes and automatically generating appropriate tests. Uses AI to understand code context and create meaningful test scenarios.
Key Features:
- Automatic test generation for uncovered code
- Test quality analysis and improvement
- Coverage monitoring and reporting
- Integration with CI/CD pipelines
ContentManagementAgent
Ensures content quality, freshness, and SEO optimization across the platform.
Key Features:
- Content freshness monitoring
- Quality validation and scoring
- SEO optimization suggestions
- Documentation update automation
5. Machine Learning Agents
MLModelAgent
Monitors recommendation model performance, detects drift, and manages model updates.
Key Features:
- Model performance tracking
- Drift detection and alerting
- A/B testing coordination
- Automatic model weight updates
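The document does not say how drift is detected. As a deliberately simple stand-in, the sketch below flags drift when the recent average of a quality metric (for example, click-through rate) falls more than a set fraction below its baseline average. This is a threshold check on performance, not a statistical drift test, and `detect_drift` and its 10% default tolerance are hypothetical.

```python
from statistics import mean
from typing import Sequence


def detect_drift(baseline: Sequence[float], recent: Sequence[float],
                 max_relative_drop: float = 0.10) -> bool:
    """Return True if the recent mean dropped too far below the baseline mean."""
    if not baseline or not recent:
        return False  # not enough data to compare
    baseline_mean = mean(baseline)
    if baseline_mean <= 0:
        return False  # a relative drop is undefined without a positive baseline
    relative_drop = (baseline_mean - mean(recent)) / baseline_mean
    return relative_drop > max_relative_drop
```

A production agent would more likely compare input or score distributions directly, but a metric-level check like this is a cheap first alert.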
Agent Communication & Coordination
Base Agent Pattern
All agents inherit from a common BaseAgent class that provides standardized functionality:
class BaseAgent(ABC):
    """Base class for all agents in the optimized system"""

    async def run_with_monitoring(self) -> AgentResult:
        """Execute agent with monitoring and error handling"""
        # Check circuit breaker before execution
        circuit_check = self._check_circuit_breaker()
        if not circuit_check['allowed']:
            return AgentResult(
                success=False,
                message=f"Blocked by circuit breaker: {circuit_check['message']}"
            )
        # Execute with timing and error tracking
        start_time = datetime.now()
        try:
            result = await self.execute()
            self.execution_count += 1
            if not result.success:
                self.error_count += 1
            return result
        except Exception as e:
            self.error_count += 1
            return AgentResult(success=False, message=str(e))
Agent Scheduler
The SpecializedAgentScheduler orchestrates agent execution timing:
- Nightly Runs: Performance analysis and deep system checks
- Continuous Monitoring: Health checks every 5 minutes
- Event-Driven: Triggered by code changes or system events
- Intelligent Scheduling: Prevents resource conflicts and overload
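The interval-based part of this scheduling can be sketched as a table of per-agent intervals plus a record of last run times. `IntervalSchedule` and its methods are hypothetical and are not the SpecializedAgentScheduler API; the 5-minute interval comes from the list above, and treating "nightly" as a 24-hour interval is an assumption.

```python
from datetime import datetime, timedelta
from typing import Dict, List


class IntervalSchedule:
    """Hypothetical helper that reports which agents are due to run."""

    def __init__(self, intervals: Dict[str, timedelta]) -> None:
        self.intervals = intervals
        self.last_run: Dict[str, datetime] = {}

    def due(self, now: datetime) -> List[str]:
        """Agents that have never run, or whose interval has elapsed."""
        return [
            agent_id
            for agent_id, interval in self.intervals.items()
            if agent_id not in self.last_run
            or now - self.last_run[agent_id] >= interval
        ]

    def mark_ran(self, agent_id: str, now: datetime) -> None:
        self.last_run[agent_id] = now


schedule = IntervalSchedule({
    "website_health": timedelta(minutes=5),  # continuous monitoring
    "profiler_optimizer": timedelta(days=1),  # nightly run (assumed 24h)
})
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the class, keeps the logic deterministic and easy to test.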
Inter-Agent Communication
Agents communicate through:
1. Shared Database: Persistent state and metrics storage
2. Message Passing: Direct communication for urgent issues
3. Event System: Pub/sub for loosely coupled coordination
4. ManagerAgent Hub: Centralized coordination and conflict resolution
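The pub/sub channel can be illustrated with a minimal in-process event bus. This is a sketch of the general pattern only: `EventBus`, the topic name `vulnerability.found`, and the payload shape are hypothetical, and the framework's actual event system may be asynchronous or backed by the database.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Hypothetical in-process pub/sub bus for loosely coupled agents."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Dict[str, Any]) -> int:
        """Deliver payload to every subscriber; return how many were notified."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = EventBus()
received: List[Dict[str, Any]] = []
bus.subscribe("vulnerability.found", received.append)
delivered = bus.publish("vulnerability.found", {"package": "example-lib"})
```

The publisher never references the subscriber directly, which is what lets agents be added or removed without changing each other.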
Safety Controls & Circuit Breakers
Weekly Spending Circuit Breaker
Prevents runaway AI costs with intelligent controls:
class WeeklySpendingBreaker:
    def should_allow_execution(self, agent_id: str) -> Dict[str, Any]:
        current_spending = self.get_weekly_spending()
        if current_spending >= self.limits['hard_limit']:
            return {
                'allowed': False,
                'reason': 'hard_limit_exceeded',
                'message': f'Weekly limit of ${self.limits["hard_limit"]} exceeded'
            }
        if current_spending >= self.limits['soft_limit']:
            # Allow only critical agents
            if self.get_agent_priority(agent_id) > 2:
                return {
                    'allowed': False,
                    'reason': 'soft_limit_exceeded',
                    'message': 'Non-critical agents paused due to spending'
                }
        return {'allowed': True, 'reason': 'within_limits'}
Execution Safety
Each agent implements safety controls:
- Rate Limiting: Prevents agents from overwhelming the system
- Rollback Procedures: Every change can be undone
- Human Approval: High-risk actions require explicit approval
- Validation Checks: Changes are verified before committing
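Rate limiting, the first control above, is often implemented with a sliding window over recent call times. The document does not say which algorithm the agents use, so the class below is a generic sketch; `SlidingWindowLimiter` and its parameters are hypothetical.

```python
from collections import deque
from typing import Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window_seconds-long window."""

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        """Return True and record the call if it fits in the current window."""
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True
```

In real use the caller would pass `time.monotonic()` as `now`; taking it as an argument keeps the limiter deterministic under test.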
Monitoring & Observability
Real-Time Dashboards
The agent framework provides comprehensive monitoring:
- Agent health status and uptime
- Cost tracking and budget utilization
- Performance metrics and trends
- Issue detection and resolution tracking
Slack Integration
Intelligent alerting ensures the right people are notified:
- Aggregated Alerts: Similar issues are grouped
- Priority-Based Routing: Critical issues go to #corgi-alerts
- Context-Rich Messages: Include actionable information
- Follow-Up Tracking: Ensures issues are resolved
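Alert aggregation can be sketched as grouping alerts by agent and issue type, keeping the worst severity in each group, and sorting so critical groups come first. The alert dictionary keys and the `aggregate_alerts` function are assumptions for illustration; the SlackNotifier's real grouping rules are not documented here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Lower rank means more urgent; unknown severities sort last.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def _rank(severity: str) -> int:
    return SEVERITY_RANK.get(severity, len(SEVERITY_RANK))


def aggregate_alerts(alerts: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Group alerts by (agent_id, issue_type) and order groups by severity."""
    groups: Dict[Tuple[str, str], List[Dict[str, str]]] = defaultdict(list)
    for alert in alerts:
        groups[(alert["agent_id"], alert["issue_type"])].append(alert)

    summaries: List[Dict[str, object]] = []
    for (agent_id, issue_type), items in groups.items():
        worst = min((item["severity"] for item in items), key=_rank)
        summaries.append({
            "agent_id": agent_id,
            "issue_type": issue_type,
            "severity": worst,
            "count": len(items),
        })
    summaries.sort(key=lambda s: _rank(str(s["severity"])))
    return summaries
```

Each summary could then be rendered as a single Slack message, with the count conveying how many raw alerts it replaces.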
Metrics & Reporting
The framework tracks extensive metrics:
- Execution Metrics: Success rates, timing, resource usage
- Cost Metrics: Per-agent spending, model usage, budget tracking
- Health Metrics: Uptime, error rates, performance scores
- Business Metrics: Issues resolved, optimizations applied, tests generated
Best Practices for Agent Development
1. Design for Failure
- Always include retry logic with exponential backoff
- Implement circuit breakers for external dependencies
- Log extensively for debugging
- Design idempotent operations
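Retry with exponential backoff, the first practice above, might look like the following in an async agent. The helper name `retry_with_backoff` and its default values are hypothetical; the jitter factor is a common addition that spreads out retries from multiple agents.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 0.5,   # seconds before the first retry
    max_delay: float = 30.0,   # cap on any single wait
) -> T:
    """Await operation(), retrying failures with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: wait between 50% and 100% of the computed delay.
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

Because a retried operation may run more than once, this pairs naturally with the idempotency guideline in the same list. A real agent would usually catch only the exception types it knows to be transient.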
2. Cost Consciousness
- Set appropriate budget limits for each agent
- Use cheaper models when possible
- Cache results to avoid repeated API calls
- Monitor token usage continuously
3. Safety First
- Never auto-execute destructive operations
- Always provide rollback mechanisms
- Require human approval for high-risk actions
- Validate all changes before applying
4. Observability
- Log all significant actions and decisions
- Track detailed metrics for performance analysis
- Provide clear status reporting
- Enable easy debugging through comprehensive logs
Future Enhancements
The agent framework is designed for extensibility:
- Learning Agents: Agents that improve their performance over time
- Predictive Agents: Anticipate issues before they occur
- Collaborative Agents: Multiple agents working together on complex tasks
- Domain-Specific Agents: Specialized agents for specific business needs
Summary
Corgi's agent framework moves system management from reactive monitoring to proactive, automated maintenance. By combining specialized agents with centralized orchestration, safety controls, and cost management, the framework aims to keep Corgi performant, secure, and cost-effective with minimal human intervention.
The true power of the framework lies not in any individual agent, but in how they work together as a coordinated system, each contributing their specialized capabilities while the ManagerAgent ensures everything runs smoothly and efficiently.