RAG Maintenance Guide

Keep your Corgi RAG system accurate and up to date by maintaining its knowledge base. This guide provides simple commands to check freshness and rebuild the knowledge base when needed.

Why RAG Maintenance Matters

A stale knowledge base leads to serious problems:

  • Inaccurate Answers: Outdated information produces misleading responses
  • Missing Features: New functionality won't appear in RAG answers
  • Hallucinations: AI may fabricate information about non-existent features
  • Developer Confusion: Team members get wrong guidance from the RAG system

Critical Impact

The RAG system powers development workflows, documentation queries, and architectural decisions. A stale knowledge base can mislead your entire team.

Quick Health Check

Check if your knowledge base is current:

# Simple freshness check
make check-kb

Example output when current:

🔍 Checking knowledge base freshness...
📝 Last Git commit: 2025-01-15 14:30:25 UTC
📚 Knowledge base updated: 2025-01-15 14:32:18 UTC
✅ Knowledge base is current
   Knowledge base is 113 seconds ahead of latest commit

Example output when stale:

🔍 Checking knowledge base freshness...
📝 Last Git commit: 2025-01-15 16:45:30 UTC
📚 Knowledge base updated: 2025-01-15 14:32:18 UTC
🚨 Error: Knowledge base is stale. Please run 'python3 scripts/populate_knowledge_base.py'
   Knowledge base is 7992 seconds behind the latest commit

Grace Period

The system allows a 5-minute grace period between Git commits and knowledge base updates to account for processing time.
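
To make the grace period concrete, here is a minimal sketch of how such a comparison could work. It is not the actual check-kb implementation: the function names and the exact rule (the knowledge base counts as current if it trails the latest commit by no more than the grace period) are assumptions for illustration. The timestamps are the ones from the example output above.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(minutes=5)  # grace period stated in this guide


def kb_lag_seconds(last_commit: datetime, kb_updated: datetime) -> float:
    """Seconds the knowledge base trails the latest commit (negative if ahead)."""
    return (last_commit - kb_updated).total_seconds()


def is_kb_current(last_commit: datetime, kb_updated: datetime,
                  grace: timedelta = GRACE_PERIOD) -> bool:
    """Assumed rule: current if the KB trails the commit by at most the grace period."""
    return kb_lag_seconds(last_commit, kb_updated) <= grace.total_seconds()


kb = datetime(2025, 1, 15, 14, 32, 18, tzinfo=timezone.utc)
fresh_commit = datetime(2025, 1, 15, 14, 30, 25, tzinfo=timezone.utc)
stale_commit = datetime(2025, 1, 15, 16, 45, 30, tzinfo=timezone.utc)

print(is_kb_current(fresh_commit, kb), kb_lag_seconds(fresh_commit, kb))  # True -113.0
print(is_kb_current(stale_commit, kb), kb_lag_seconds(stale_commit, kb))  # False 7992.0
```

Using timezone-aware UTC datetimes on both sides avoids false "stale" results caused by comparing local time against UTC.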

Updating the Knowledge Base

Method 1: Incremental Update

# Standard incremental update
python3 scripts/populate_knowledge_base.py

This performs an incremental update:

  • Only processes new or changed content
  • Faster execution (typically 30-60 seconds)
  • Preserves existing embeddings
  • Safe for regular use
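
One common way to detect "new or changed content" is to compare a hash of each chunk against the hash recorded at the last update. The sketch below illustrates that idea only; whether populate_knowledge_base.py uses content hashing is an assumption, and the chunk ids and the stored-hash mapping are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(chunks: dict[str, str], stored_hashes: dict[str, str]):
    """Split chunks into those to (re)embed and those to skip.

    chunks: chunk id -> current text
    stored_hashes: chunk id -> hash recorded at the last update (hypothetical schema)
    """
    to_store, skipped = [], []
    for chunk_id, text in chunks.items():
        if stored_hashes.get(chunk_id) == content_hash(text):
            skipped.append(chunk_id)   # unchanged: keep the existing embedding
        else:
            to_store.append(chunk_id)  # new or changed: needs a fresh embedding
    return to_store, skipped


stored = {"docs/setup.md#0": content_hash("Install with pip."),
          "docs/api.md#0": content_hash("GET /v1/items")}
current = {"docs/setup.md#0": "Install with pip.",       # unchanged
           "docs/api.md#0": "GET /v2/items",             # changed
           "docs/rag.md#0": "Run make check-kb daily."}  # new

to_store, skipped = plan_incremental_update(current, stored)
print(f"{len(to_store)} stored, {len(skipped)} skipped")  # 2 stored, 1 skipped
```

Skipping unchanged chunks is what keeps incremental runs short: embedding generation is usually the expensive step, and it only runs for the chunks in to_store.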

Method 2: Force Complete Rebuild

# Complete rebuild from scratch
python3 scripts/populate_knowledge_base.py --force-rebuild

Use a complete rebuild when:

  • Documentation structure has changed significantly
  • Embedding model has been updated
  • Database corruption is suspected
  • Major codebase reorganization occurred

Complete Rebuild Performance

Force rebuilds take 5-15 minutes depending on codebase size. Use sparingly for major changes only.

Understanding the Update Process

The knowledge base builder processes multiple data sources:

%%{init: {'theme': 'base'}}%%
graph TD
    subgraph "Input Sources"
        direction LR
        A["Git Repository"]
        B["Codebase"]
        C["Documentation"]
        D["Dev Logs & Reports"]
    end

    subgraph "Orchestration"
        H("Knowledge Base Builder<br/>populate_knowledge_base.py")
    end

    subgraph "Processing Pipeline"
        direction LR
        E("Content Chunking") --> F("Embedding Generation") --> G("Database Storage<br/>knowledge_embeddings")
    end

    A --> H
    B --> H
    C --> H
    D --> H
    H --> E

    classDef source fill:#1E293B,stroke:#475569,color:#E2E8F0
    classDef process fill:#0F172A,stroke:#334155,color:#94A3B8
    classDef orchestrator fill:#059669,stroke:#047857,color:#ffffff

    class A,B,C,D source
    class E,F,G process
    class H orchestrator
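
The three pipeline stages in the diagram can be sketched end to end as follows. This is an illustrative toy, not the builder's real code: the chunking strategy, the fake_embed placeholder (which is not semantically meaningful), and the table columns are all assumptions. Only the table name knowledge_embeddings comes from this guide.

```python
import json
import sqlite3


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks with overlap (one of many possible strategies)."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def fake_embed(chunk: str, dims: int = 4) -> list[float]:
    """Placeholder for a real embedding model: NOT semantically meaningful."""
    return [sum(ord(c) for c in chunk[i::dims]) / 1000.0 for i in range(dims)]


def store(conn: sqlite3.Connection, source: str, chunks: list[str]) -> int:
    """Embed each chunk and insert it; returns the number of rows written."""
    conn.execute("CREATE TABLE IF NOT EXISTS knowledge_embeddings "
                 "(source TEXT, chunk TEXT, embedding TEXT)")  # hypothetical columns
    rows = [(source, c, json.dumps(fake_embed(c))) for c in chunks]
    conn.executemany("INSERT INTO knowledge_embeddings VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
document = "x" * 500  # stand-in for a documentation file
stored_count = store(conn, "docs/example.md", chunk_text(document))
print(stored_count)  # 3
```

The overlap between neighbouring chunks means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.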

Monitoring and Statistics

Update Statistics

The population script provides detailed statistics:

python3 scripts/populate_knowledge_base.py

# Example output:
📖 Building knowledge base chunks...
Found 47 existing chunks in database
Processing 12 new/updated chunks...
📦 Processing batch 1/1 (12 chunks)
Knowledge base population complete!
📊 Stats: 12 stored, 35 skipped, 0 errors

Knowledge Base Health Metrics

Check comprehensive knowledge base statistics:

# Get detailed KB stats
from scripts.populate_knowledge_base import KnowledgeBasePopulator

populator = KnowledgeBasePopulator()
stats = populator.get_knowledge_base_stats()

print(f"Total chunks: {stats['total_chunks']}")
print(f"Last updated: {stats['last_updated']}")
print(f"Data sources: {stats['data_sources']}")

Automation and CI/CD Integration

GitHub Actions Integration

Add knowledge base validation to your CI/CD pipeline:

name: Knowledge Base Validation

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  validate-kb:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0  # Need full history for Git timestamps

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        pip install -r requirements.txt

    - name: Check knowledge base freshness
      run: |
        if ! make check-kb; then
          echo "Knowledge base is stale. Updating..."
          python3 scripts/populate_knowledge_base.py

          # Verify update was successful
          make check-kb
        else
          echo "Knowledge base is current"
        fi

    - name: Validate RAG system
      run: |
        # Test RAG queries to ensure system is working
        python3 scripts/cursor_rag_query.py "What is the system architecture?"

Pre-commit Hook

Automatically check KB freshness before commits:

#!/bin/bash
# Save as .git/hooks/pre-commit and make it executable (chmod +x)
echo "Checking knowledge base freshness..."

if ! make check-kb; then
    echo "Warning: Knowledge base is stale"
    echo "Run 'python3 scripts/populate_knowledge_base.py' to update"
    echo "Continue anyway? (y/N)"
    # Git hooks do not receive the terminal on stdin, so read from /dev/tty
    read -r response < /dev/tty
    if [[ ! $response =~ ^([yY][eE][sS]|[yY])$ ]]; then
        exit 1
    fi
fi

Automated Daily Updates

Schedule automatic knowledge base updates:

# Add to crontab
# Update knowledge base daily at 2 AM
0 2 * * * cd /path/to/corgi && python3 scripts/populate_knowledge_base.py >> logs/kb_updates.log 2>&1

Troubleshooting

Common Issues

"Could not retrieve Git commit timestamp"

# Ensure you're in a Git repository
git status

# Check if Git is installed
git --version

"Could not retrieve knowledge base timestamp"

# Check that the database module imports cleanly (this does not open a connection)
python3 -c "from db.connection import get_db_connection; print('DB module import OK')"

# Verify knowledge_embeddings table exists
# For PostgreSQL:
psql -d your_db -c "\dt knowledge_embeddings"

# For SQLite:
sqlite3 db.sqlite ".tables knowledge_embeddings"

"Knowledge base is always stale"

# Force rebuild to reset timestamps
python3 scripts/populate_knowledge_base.py --force-rebuild

# Verify system clock
date

Debug Commands

# Verbose freshness check
python3 tools/testing/validate_kb_freshness.py

# Check embedding service
python3 -c "
from agents.local_embedding_service import get_local_embedding_service
service = get_local_embedding_service()
print('Embedding service loaded successfully')
"

# Test knowledge base builder
python3 -c "
from scripts.knowledge_base_builder import KnowledgeBaseBuilder
builder = KnowledgeBaseBuilder()
chunks = builder.build_knowledge_base()
print(f'Built {len(chunks)} knowledge chunks')
"

Performance Optimization

Slow Updates

# Updates run in batches (default: 50 chunks per batch).
# Adjust the batch size for your hardware by editing
# scripts/populate_knowledge_base.py and setting ONE of these values:
batch_size = 25   # Reduce for limited memory
# batch_size = 100  # Increase for powerful hardware
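
To see how the batch size affects a run, the sketch below splits a list of chunks into batches and prints progress lines in the style of the script's output. The helper name iter_batches is hypothetical; the real script's batching code may differ.

```python
import math
from typing import Iterator, Sequence


def iter_batches(chunks: Sequence[str],
                 batch_size: int = 50) -> Iterator[tuple[int, int, Sequence[str]]]:
    """Yield (batch_number, total_batches, batch) with at most batch_size chunks each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total = math.ceil(len(chunks) / batch_size)
    for n, start in enumerate(range(0, len(chunks), batch_size), start=1):
        yield n, total, chunks[start:start + batch_size]


chunks = [f"chunk-{i}" for i in range(12)]
for n, total, batch in iter_batches(chunks, batch_size=5):
    print(f"Processing batch {n}/{total} ({len(batch)} chunks)")
```

Smaller batches lower peak memory because fewer chunks are embedded at once, while larger batches reduce per-batch overhead on machines that can hold them.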

Memory Issues

# Monitor memory usage during updates
# Add to populate_knowledge_base.py:
import psutil
print(f"Memory usage: {psutil.virtual_memory().percent}%")

Best Practices

Development Workflow Integration

  1. Check freshness before starting work:

    make check-kb
    

  2. Update after significant changes:

    # After adding new documentation
    python3 scripts/populate_knowledge_base.py
    

  3. Verify RAG accuracy:

    # Test with recent changes
    python3 scripts/cursor_rag_query.py "How does [new feature] work?"
    

Update Frequency Guidelines

| Scenario              | Recommended Action           |
|-----------------------|------------------------------|
| Daily development     | Check: make check-kb         |
| New documentation     | Update: incremental rebuild  |
| Major refactoring     | Update: force rebuild        |
| CI/CD pipeline        | Check: automated validation  |
| Production deployment | Update: ensure KB is current |

Monitoring and Alerts

Set up alerts for stale knowledge base:

#!/bin/bash
# Weekly knowledge base health check
# Save as scripts/weekly_kb_check.sh

if ! make check-kb; then
    echo "ALERT: Knowledge base is stale" | mail -s "KB Maintenance Required" team@company.com

    # Attempt automatic update
    python3 scripts/populate_knowledge_base.py

    # Verify fix
    if make check-kb; then
        echo "Knowledge base automatically updated" | mail -s "KB Updated" team@company.com
    fi
fi

Advanced Configuration

Custom Data Sources

Add custom data sources to the knowledge base:

# Edit scripts/knowledge_base_builder.py
def build_knowledge_base(self) -> List[KnowledgeChunk]:
    chunks = []

    # Existing sources
    chunks.extend(self._process_documentation())
    chunks.extend(self._process_dev_logs())

    # Add custom sources (implement these methods on the builder first)
    chunks.extend(self._process_api_schemas())
    chunks.extend(self._process_test_results())

    return chunks
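
As an illustration of what a custom source method might look like, here is a standalone sketch that turns JSON schema files into chunks. Everything in it is an assumption: the KnowledgeChunk fields shown are a hypothetical stand-in for the real class, and the one-chunk-per-top-level-key rule is just one reasonable choice.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class KnowledgeChunk:  # hypothetical stand-in; the real class may have different fields
    content: str
    source: str
    metadata: dict = field(default_factory=dict)


def process_api_schemas(schema_dir: Path) -> list[KnowledgeChunk]:
    """Turn each top-level key of every *.json schema file into one chunk."""
    chunks = []
    for path in sorted(schema_dir.glob("*.json")):
        schema = json.loads(path.read_text(encoding="utf-8"))
        for name, definition in schema.items():
            chunks.append(KnowledgeChunk(
                content=f"{name}: {json.dumps(definition, sort_keys=True)}",
                source=path.name,
                metadata={"type": "api_schema"},
            ))
    return chunks


# Demo against a temporary directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "users.json").write_text(json.dumps({"User": {"id": "int"}}), encoding="utf-8")
    demo_chunks = process_api_schemas(Path(tmp))

print(demo_chunks[0].content)  # User: {"id": "int"}
```

Sorting the files and the JSON keys keeps chunk content deterministic between runs, which matters if the updater decides what to re-embed by comparing content.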

Embedding Model Updates

When updating the embedding model:

# 1. Update model configuration
# Edit config.py or environment variables

# 2. Force complete rebuild
python3 scripts/populate_knowledge_base.py --force-rebuild

# 3. Verify compatibility
python3 scripts/cursor_rag_query.py "Test query"
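
The reason a model change calls for a full rebuild is that vectors produced by different models (or with different dimensions) are not comparable. A minimal sketch of a guard for this, assuming you record which model produced the stored embeddings (the EmbeddingSignature record and the model names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSignature:  # hypothetical record of which model produced the stored vectors
    model_name: str
    dimensions: int


def needs_force_rebuild(stored: EmbeddingSignature, configured: EmbeddingSignature) -> bool:
    """Rebuild on any mismatch, since mixed-model vectors cannot be compared."""
    return stored != configured


stored = EmbeddingSignature("model-a", 384)
print(needs_force_rebuild(stored, EmbeddingSignature("model-a", 384)))  # False
print(needs_force_rebuild(stored, EmbeddingSignature("model-b", 768)))  # True
```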

Summary

Regular RAG maintenance ensures your AI-powered development workflows remain accurate and helpful:

  • Daily: Run make check-kb before starting work
  • Weekly: Review KB statistics and performance
  • After changes: Run an incremental update; use a force rebuild for major refactoring or embedding model updates
  • Monthly: Consider force rebuild for optimal performance

A well-maintained knowledge base is the foundation of reliable AI-assisted development!