RAG Maintenance Guide
Keep your Corgi RAG system accurate and up-to-date with proper knowledge base maintenance. This guide provides simple commands to validate freshness and rebuild the knowledge base when needed.
Why RAG Maintenance Matters
A stale knowledge base leads to serious problems:
- Inaccurate Answers: Outdated information produces misleading responses
- Missing Features: New functionality won't appear in RAG query results
- Hallucinations: AI may fabricate information about non-existent features
- Developer Confusion: Team members get wrong guidance from the RAG system
Critical Impact
The RAG system powers development workflows, documentation queries, and architectural decisions. A stale knowledge base can mislead your entire team.
Quick Health Check
Check if your knowledge base is current:
# Simple freshness check
make check-kb
Example output when current:
🔍 Checking knowledge base freshness...
📝 Last Git commit: 2025-01-15 14:30:25 UTC
📚 Knowledge base updated: 2025-01-15 14:32:18 UTC
✅ Knowledge base is current
Knowledge base is 113 seconds ahead of latest commit
Example output when stale:
🔍 Checking knowledge base freshness...
📝 Last Git commit: 2025-01-15 16:45:30 UTC
📚 Knowledge base updated: 2025-01-15 14:32:18 UTC
🚨 Error: Knowledge base is stale. Please run 'python3 scripts/populate_knowledge_base.py'
Knowledge base is 7992 seconds behind the latest commit
Grace Period
The system allows a 5-minute grace period between Git commits and knowledge base updates to account for processing time.
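The comparison behind the freshness check can be sketched as below. This is a minimal illustration, not the code behind make check-kb: the function names are hypothetical, and the rule that the knowledge base counts as current while it trails the latest commit by no more than the grace period is an assumption about how the 5-minute window is applied.

```python
from datetime import datetime, timedelta, timezone

# The 5-minute value comes from this guide; how it is applied is an assumption.
GRACE_PERIOD = timedelta(minutes=5)


def kb_lag_seconds(commit_time: datetime, kb_time: datetime) -> float:
    """Seconds the knowledge base trails the latest commit (negative if ahead)."""
    return (commit_time - kb_time).total_seconds()


def is_kb_current(commit_time: datetime, kb_time: datetime,
                  grace: timedelta = GRACE_PERIOD) -> bool:
    """Treat the KB as current when it trails the commit by at most `grace`."""
    return kb_lag_seconds(commit_time, kb_time) <= grace.total_seconds()


if __name__ == "__main__":
    # Timestamps taken from the example outputs in this guide.
    kb = datetime(2025, 1, 15, 14, 32, 18, tzinfo=timezone.utc)
    commit_before_kb = datetime(2025, 1, 15, 14, 30, 25, tzinfo=timezone.utc)
    commit_after_kb = datetime(2025, 1, 15, 16, 45, 30, tzinfo=timezone.utc)
    print(is_kb_current(commit_before_kb, kb), kb_lag_seconds(commit_before_kb, kb))
    print(is_kb_current(commit_after_kb, kb), kb_lag_seconds(commit_after_kb, kb))
```

Using timezone-aware datetimes on both sides avoids the comparison errors that appear when one timestamp is local time and the other is UTC.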
Updating the Knowledge Base
Method 1: Quick Update (Recommended)
# Standard incremental update
python3 scripts/populate_knowledge_base.py
This performs an incremental update:
- Only processes new or changed content
- Faster execution (typically 30-60 seconds)
- Preserves existing embeddings
- Safe for regular use
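One common way to decide which content is "new or changed" is to compare a content hash against what is already stored. The sketch below shows that idea only; the guide does not say how populate_knowledge_base.py detects changes, and content_hash, select_changed, and the chunk ID format are hypothetical names chosen for illustration.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(chunks: Iterable[Tuple[str, str]],
                   stored_hashes: Dict[str, str]) -> Tuple[List[Tuple[str, str]], int]:
    """Return (chunks that need re-embedding, number of chunks skipped).

    `chunks` yields (chunk_id, text) pairs; `stored_hashes` maps each
    chunk_id already in the database to the hash of its stored text.
    """
    to_process: List[Tuple[str, str]] = []
    skipped = 0
    for chunk_id, text in chunks:
        if stored_hashes.get(chunk_id) == content_hash(text):
            skipped += 1  # unchanged: keep the existing embedding
        else:
            to_process.append((chunk_id, text))  # new or modified
    return to_process, skipped


if __name__ == "__main__":
    stored = {"docs/intro.md#0": content_hash("Welcome to Corgi")}
    incoming = [("docs/intro.md#0", "Welcome to Corgi"),
                ("docs/new.md#0", "New feature notes")]
    changed, skipped = select_changed(incoming, stored)
    print(f"{len(changed)} to process, {skipped} skipped")
```

Skipping unchanged chunks is what keeps existing embeddings intact and makes the incremental run much cheaper than a full rebuild.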
Method 2: Force Complete Rebuild
# Complete rebuild from scratch
python3 scripts/populate_knowledge_base.py --force-rebuild
Use complete rebuild when:
- Documentation structure has changed significantly
- Embedding model has been updated
- Database corruption is suspected
- Major codebase reorganization occurred
Complete Rebuild Performance
Force rebuilds take 5-15 minutes depending on codebase size. Reserve them for major changes.
Understanding the Update Process
The knowledge base builder processes multiple data sources:
%%{init: {'theme': 'base'}}%%
graph TD
subgraph "Input Sources"
direction LR
A["Git Repository"]
B["Codebase"]
C["Documentation"]
D["Dev Logs & Reports"]
end
subgraph "Orchestration"
H("Knowledge Base Builder<br/>populate_knowledge_base.py")
end
subgraph "Processing Pipeline"
direction LR
E("Content Chunking") --> F("Embedding Generation") --> G("Database Storage<br/>knowledge_embeddings")
end
A --> H
B --> H
C --> H
D --> H
H --> E
classDef source fill:#1E293B,stroke:#475569,color:#E2E8F0
classDef process fill:#0F172A,stroke:#334155,color:#94A3B8
classDef orchestrator fill:#059669,stroke:#047857,color:#ffffff
class A,B,C,D source
class E,F,G process
class H orchestrator
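The three pipeline stages in the diagram (chunking, embedding, storage) can be sketched in a few lines. This is an illustration under stated assumptions, not the project's builder: chunk_text, run_pipeline, and toy_embed are hypothetical, the fixed-size overlapping window is just one possible chunking strategy, and a plain dict stands in for the knowledge_embeddings table.

```python
from typing import Callable, Dict, List


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def run_pipeline(sources: Dict[str, str],
                 embed: Callable[[str], List[float]],
                 store: Dict[str, List[float]]) -> int:
    """Chunk every source, embed each chunk, and write the vectors to `store`."""
    written = 0
    for name, text in sources.items():
        for index, chunk in enumerate(chunk_text(text)):
            store[f"{name}#{index}"] = embed(chunk)
            written += 1
    return written


def toy_embed(chunk: str) -> List[float]:
    # Placeholder, NOT a real embedding model: two cheap text features.
    return [float(len(chunk)), float(chunk.count(" "))]


if __name__ == "__main__":
    store: Dict[str, List[float]] = {}
    count = run_pipeline({"README": "corgi " * 75}, toy_embed, store)
    print(count, sorted(store))
```

Passing the embedding function in as a parameter keeps the orchestration independent of whichever model the real embedding service loads.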
Monitoring and Statistics
Update Statistics
The population script provides detailed statistics:
python3 scripts/populate_knowledge_base.py
# Example output:
📖 Building knowledge base chunks...
Found 47 existing chunks in database
Processing 12 new/updated chunks...
📦 Processing batch 1/1 (12 chunks)
✅ Knowledge base population complete!
📊 Stats: 12 stored, 35 skipped, 0 errors
Knowledge Base Health Metrics
Check comprehensive knowledge base statistics:
# Get detailed KB stats
from scripts.populate_knowledge_base import KnowledgeBasePopulator
populator = KnowledgeBasePopulator()
stats = populator.get_knowledge_base_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Last updated: {stats['last_updated']}")
print(f"Data sources: {stats['data_sources']}")
Automation and CI/CD Integration
GitHub Actions Integration
Add knowledge base validation to your CI/CD pipeline:
name: Knowledge Base Validation
on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
jobs:
  validate-kb:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Need full history for Git timestamps
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Check knowledge base freshness
        run: |
          if ! make check-kb; then
            echo "Knowledge base is stale. Updating..."
            python3 scripts/populate_knowledge_base.py
            # Verify update was successful
            make check-kb
          else
            echo "Knowledge base is current"
          fi
      - name: Validate RAG system
        run: |
          # Test RAG queries to ensure system is working
          python3 scripts/cursor_rag_query.py "What is the system architecture?"
Pre-commit Hook
Automatically check KB freshness before commits:
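#!/bin/bash
# Save as .git/hooks/pre-commit and make it executable (chmod +x .git/hooks/pre-commit)
# bash is required for the [[ ... =~ ... ]] test below
echo "Checking knowledge base freshness..."
if ! make check-kb; then
  echo "Warning: Knowledge base is stale"
  echo "Run 'python3 scripts/populate_knowledge_base.py' to update"
  echo "Continue anyway? (y/N)"
  # Hooks may not have the terminal on stdin, so read the answer from /dev/tty
  read -r response < /dev/tty
  if [[ ! $response =~ ^([yY][eE][sS]|[yY])$ ]]; then
    exit 1
  fi
fi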
Automated Daily Updates
Schedule automatic knowledge base updates:
# Add to crontab
# Update knowledge base daily at 2 AM
0 2 * * * cd /path/to/corgi && python3 scripts/populate_knowledge_base.py >> logs/kb_updates.log 2>&1
Troubleshooting
Common Issues
"Could not retrieve Git commit timestamp"
# Ensure you're in a Git repository
git status
# Check if Git is installed
git --version
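To see where this error can originate, it helps to reproduce the Git lookup by hand. The sketch below is an assumption about how such a lookup could work, not the project's validator: it uses the real Git command git log -1 --format=%ct (committer date as a Unix timestamp), while parse_commit_epoch and last_commit_time are hypothetical helper names.

```python
import subprocess
from datetime import datetime, timezone


def parse_commit_epoch(raw: str) -> datetime:
    """Convert the output of `git log -1 --format=%ct` to an aware UTC datetime."""
    return datetime.fromtimestamp(int(raw.strip()), tz=timezone.utc)


def last_commit_time(repo_dir: str = ".") -> datetime:
    """Ask Git for the committer timestamp of HEAD.

    Raises FileNotFoundError if the git executable is missing and
    subprocess.CalledProcessError if Git exits with an error, for
    example when `repo_dir` is not inside a repository.
    """
    result = subprocess.run(
        ["git", "log", "-1", "--format=%ct"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_commit_epoch(result.stdout)


if __name__ == "__main__":
    try:
        print(last_commit_time().isoformat())
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"Could not retrieve Git commit timestamp: {exc}")
```

The two exception types map onto the two checks above: a missing Git installation and a working directory that is not a repository.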
"Could not retrieve knowledge base timestamp"
# Check that the database module imports cleanly (this alone does not open a connection)
python3 -c "from db.connection import get_db_connection; print('DB connection OK')"
# Verify knowledge_embeddings table exists
# For PostgreSQL:
psql -d your_db -c "\dt knowledge_embeddings"
# For SQLite:
sqlite3 db.sqlite ".tables knowledge_embeddings"
"Knowledge base is always stale"
# Force rebuild to reset timestamps
python3 scripts/populate_knowledge_base.py --force-rebuild
# Verify system clock
date
Debug Commands
# Verbose freshness check
python3 tools/testing/validate_kb_freshness.py
# Check embedding service
python3 -c "
from agents.local_embedding_service import get_local_embedding_service
service = get_local_embedding_service()
print('Embedding service loaded successfully')
"
# Test knowledge base builder
python3 -c "
from scripts.knowledge_base_builder import KnowledgeBaseBuilder
builder = KnowledgeBaseBuilder()
chunks = builder.build_knowledge_base()
print(f'Built {len(chunks)} knowledge chunks')
"
Performance Optimization
Slow Updates
# Updates run in batches (default: 50 chunks per batch)
# To tune for your hardware, edit batch_size in scripts/populate_knowledge_base.py
# and set ONE of the following:
batch_size = 25 # Reduce for limited memory
# batch_size = 100 # Increase for powerful hardware
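The effect of the batch size is easiest to see in a small batching helper. This is a generic sketch rather than the script's own loop; batch_count and iter_batches are hypothetical names, and only the default of 50 is taken from this guide.

```python
import math
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batch_count(total: int, batch_size: int) -> int:
    """Number of batches needed to cover `total` items."""
    return math.ceil(total / batch_size)


def iter_batches(items: Sequence[T], batch_size: int = 50) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


if __name__ == "__main__":
    chunks = [f"chunk-{i}" for i in range(120)]
    total_batches = batch_count(len(chunks), 50)
    for number, batch in enumerate(iter_batches(chunks, 50), start=1):
        print(f"Processing batch {number}/{total_batches} ({len(batch)} chunks)")
```

A smaller batch size lowers the number of chunks held in memory at once at the cost of more round trips; a larger one does the opposite.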
Memory Issues
# Monitor memory usage during updates
# Add to populate_knowledge_base.py (requires the third-party psutil package: pip install psutil):
import psutil
print(f"Memory usage: {psutil.virtual_memory().percent}%")
Best Practices
Development Workflow Integration
- Check freshness before starting work:
  make check-kb
- Update after significant changes:
  # After adding new documentation
  python3 scripts/populate_knowledge_base.py
- Verify RAG accuracy:
  # Test with recent changes
  python3 scripts/cursor_rag_query.py "How does [new feature] work?"
Update Frequency Guidelines
| Scenario | Recommended Action |
|---|---|
| Daily development | Check: make check-kb |
| New documentation | Update: incremental rebuild |
| Major refactoring | Update: force rebuild |
| CI/CD pipeline | Check: automated validation |
| Production deployment | Update: ensure KB is current |
Monitoring and Alerts
Set up alerts for stale knowledge base:
#!/bin/bash
# Weekly knowledge base health check
# Save as scripts/weekly_kb_check.sh
if ! make check-kb; then
  echo "ALERT: Knowledge base is stale" | mail -s "KB Maintenance Required" team@company.com
  # Attempt automatic update
  python3 scripts/populate_knowledge_base.py
  # Verify fix
  if make check-kb; then
    echo "Knowledge base automatically updated" | mail -s "KB Updated" team@company.com
  fi
fi
Advanced Configuration
Custom Data Sources
Add custom data sources to the knowledge base:
# Edit scripts/knowledge_base_builder.py
def build_knowledge_base(self) -> List[KnowledgeChunk]:
    chunks = []
    # Existing sources
    chunks.extend(self._process_documentation())
    chunks.extend(self._process_dev_logs())
    # Add custom sources
    chunks.extend(self._process_api_schemas())
    chunks.extend(self._process_test_results())
    return chunks
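A custom source processor only has to return a list of chunks. The sketch below shows what a processor for API schemas might look like, written as a standalone function so it can run on its own. Everything in it is hypothetical: the KnowledgeChunk dataclass is a stand-in whose fields may not match the real class, and the assumption that each schema file is a JSON object with one definition per top-level key is made purely for illustration.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class KnowledgeChunk:
    # Hypothetical stand-in; the real class may define different fields.
    content: str
    source: str
    metadata: Dict[str, str] = field(default_factory=dict)


def process_api_schemas(schema_dir: Path) -> List[KnowledgeChunk]:
    """Turn each top-level key of every *.json file in `schema_dir` into a chunk."""
    chunks: List[KnowledgeChunk] = []
    for path in sorted(schema_dir.glob("*.json")):
        schema = json.loads(path.read_text(encoding="utf-8"))
        for name, definition in schema.items():
            chunks.append(KnowledgeChunk(
                content=f"{name}: {json.dumps(definition, sort_keys=True)}",
                source=str(path),
                metadata={"type": "api_schema", "name": name},
            ))
    return chunks
```

Tagging each chunk with a type in its metadata makes it possible to filter or re-process a single source later without touching the others.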
Embedding Model Updates
When updating the embedding model:
# 1. Update model configuration
# Edit config.py or environment variables
# 2. Force complete rebuild
python3 scripts/populate_knowledge_base.py --force-rebuild
# 3. Verify compatibility
python3 scripts/cursor_rag_query.py "Test query"
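A full rebuild is needed here because vectors produced by different models are not comparable, and they often differ in length as well. The sketch below checks for the easiest symptom to detect, a dimension mismatch among stored vectors. It is an illustration only: find_dimension_mismatches is a hypothetical helper and the in-memory dict stands in for rows of the knowledge_embeddings table.

```python
from typing import Dict, List, Sequence


def find_dimension_mismatches(stored: Dict[str, Sequence[float]],
                              expected_dim: int) -> List[str]:
    """Return the IDs of stored vectors whose length differs from `expected_dim`."""
    return sorted(chunk_id for chunk_id, vector in stored.items()
                  if len(vector) != expected_dim)


if __name__ == "__main__":
    stored_vectors = {
        "docs/intro.md#0": [0.1, 0.2, 0.3],       # written by the old model
        "docs/setup.md#0": [0.4, 0.5, 0.6, 0.7],  # written by the new model
    }
    stale = find_dimension_mismatches(stored_vectors, expected_dim=4)
    print(f"{len(stale)} vector(s) need re-embedding: {stale}")
```

An empty result does not prove the knowledge base is consistent, since two models can share a dimension while producing unrelated vector spaces, so the force rebuild remains the reliable fix.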
Summary
Regular RAG maintenance ensures your AI-powered development workflows remain accurate and helpful:
- Daily: Run make check-kb before starting work
- Weekly: Review KB statistics and performance
- After major changes: Run incremental updates
- Monthly: Consider force rebuild for optimal performance
A well-maintained knowledge base is the foundation of reliable AI-assisted development!