A/B Testing Framework Guide
Build confidence in your recommendation improvements with Corgi's comprehensive A/B testing framework. This guide covers everything from creating experiments to analyzing results with statistical rigor.
What is A/B Testing?
A/B testing (also called split testing) allows you to scientifically compare different recommendation algorithms by showing different variants to different users and measuring which performs better.
Corgi's A/B testing framework provides:
- Deterministic User Assignment: Users always see the same variant for consistent experience
- Statistical Analysis: Built-in significance testing with confidence intervals
- Performance Monitoring: Real-time tracking with automatic blocking of poor-performing variants
- Traffic Management: Sophisticated allocation from 1% canary tests to 50/50 splits
- Multiple Experiment Types: Algorithm weights, semantic similarity, and custom model variants
A/B Testing Lifecycle
%%{init: {'theme': 'base'}}%%
graph TD
A[Create Experiment] --> B[Define Variants]
B --> C[Configure Traffic]
C --> D[Start Experiment]
D --> E[User Assignment]
E --> F[Performance Tracking]
F --> G[Statistical Analysis]
G --> H[Results & Decision]
H --> I[Winner Deployment]
subgraph "Creation Phase"
A
B
C
end
subgraph "Execution Phase"
D
E
F
end
subgraph "Analysis Phase"
G
H
I
end
classDef creation fill:#1E293B,stroke:#475569,color:#E2E8F0
classDef execution fill:#059669,stroke:#047857,color:#ffffff
classDef analysis fill:#7C2D12,stroke:#92400E,color:#FED7AA
class A,B,C creation
class D,E,F execution
class G,H,I analysis
Step 1: Create an Experiment
Method 1: Dashboard Interface (Recommended)
- Access the A/B Testing Dashboard
  - Navigate to /dashboard in your Corgi instance
  - Click the "A/B Testing" tab
  - Click "Create New Experiment"
- Define Your Experiment
  - Name: "Semantic Weight Optimization"
  - Description: "Testing semantic similarity weights: 10%, 15%, 20%"
- Configure Variants
  - Control: Current algorithm (semantic weight: 0%)
  - Treatment A: Semantic weight: 10%
  - Treatment B: Semantic weight: 15%
  - Treatment C: Semantic weight: 20%
- Set Traffic Allocation
  - Control: 40%
  - Treatment A: 20%
  - Treatment B: 20%
  - Treatment C: 20%
Method 2: REST API
# Create experiment via API
curl -X POST http://localhost:5002/api/v1/analytics/experiments \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
-d '{
"name": "Semantic Weight Optimization",
"description": "Testing semantic similarity weights",
"traffic_percentage": 100,
"minimum_sample_size": 1000,
"confidence_level": 0.95,
"variants": [
{
"name": "Control",
"description": "Current algorithm",
"traffic_allocation": 40,
"algorithm_config": {
"weights": {
"author_preference": 0.40,
"content_engagement": 0.30,
"recency": 0.30,
"semantic_similarity": 0.00
}
},
"is_control": true
},
{
"name": "Semantic 10%",
"description": "10% semantic weight",
"traffic_allocation": 20,
"algorithm_config": {
"weights": {
"author_preference": 0.35,
"content_engagement": 0.30,
"recency": 0.25,
"semantic_similarity": 0.10
}
},
"is_control": false
}
]
}'
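If you prefer Python to curl, the same request can be made with the requests library. This is a sketch: the host and token are placeholders from the curl example, the variant payload is elided, and the exact response shape depends on your deployment.
# Create the same experiment from Python (sketch; host and token are placeholders)
import requests

payload = {
    "name": "Semantic Weight Optimization",
    "description": "Testing semantic similarity weights",
    "traffic_percentage": 100,
    "minimum_sample_size": 1000,
    "confidence_level": 0.95,
    "variants": [],  # same variant objects as in the curl example above
}

response = requests.post(
    "http://localhost:5002/api/v1/analytics/experiments",
    json=payload,
    headers={"Authorization": "Bearer YOUR_TOKEN"},
)
response.raise_for_status()
print("Created experiment:", response.json())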
Method 3: Command Line (Semantic Experiments)
# Create semantic experiment
python3 scripts/semantic_ab_experiment_manager.py create \
--name "Semantic Weight Test" \
--weights "0.10,0.15,0.20" \
--duration 14 \
--min-sample-size 1000 \
--confidence-level 0.95
# Output:
# Creating semantic A/B test experiment: Semantic Weight Test
# Successfully created experiment with ID: 1
# Traffic Allocation:
#   control: 40.0%
#   semantic_10: 20.0%
#   semantic_15: 20.0%
#   semantic_20: 20.0%
Step 2: Configure Algorithm Variants
Algorithm Weight Configuration
Each variant can have different algorithm weights:
# Control variant (traditional ranking)
control_config = {
"weights": {
"author_preference": 0.40, # User's author interaction history
"content_engagement": 0.30, # Post popularity metrics
"recency": 0.30, # Time-based relevance
"semantic_similarity": 0.00 # Semantic matching (disabled)
},
"use_semantic_scoring": False
}
# Treatment variant (semantic-enhanced)
treatment_config = {
"weights": {
"author_preference": 0.35, # Reduced to make room for semantic
"content_engagement": 0.30,
"recency": 0.25,
"semantic_similarity": 0.10 # 10% semantic similarity
},
"use_semantic_scoring": True,
"semantic_weight": 0.10
}
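Whichever method you use, each variant's weights should sum to 1.0. A small helper for that check (illustrative, not part of Corgi's API) might look like:
# Sanity check: a variant's weights should sum to 1.0 (helper is illustrative)
def validate_weights(config, tolerance=1e-6):
    total = sum(config["weights"].values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"Variant weights sum to {total:.2f}, expected 1.00")

validate_weights(control_config)    # 0.40 + 0.30 + 0.30 + 0.00 = 1.00
validate_weights(treatment_config)  # 0.35 + 0.30 + 0.25 + 0.10 = 1.00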
Model Variant Configuration
For testing completely different models:
# Register model variants
from core.model_registry import get_registry, ModelType  # ModelType assumed to live alongside get_registry
registry = get_registry()
# Register new model variant
registry.register_model(
name="collaborative_filtering",
version="2.0",
model_instance=new_model,
model_type=ModelType.COLLABORATIVE_FILTERING,
author="data-science-team",
description="Improved collaborative filtering with matrix factorization"
)
# Set traffic split for A/B testing
registry.set_traffic_split(
experiment_id="cf_comparison",
model_configs={
"simple_1.0": 50.0, # Control: 50%
"collaborative_filtering_2.0": 50.0 # Treatment: 50%
}
)
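Traffic allocations are expressed as percentages, so it is worth asserting that a split totals 100 before applying it. This is a plain-Python check, not a registry API:
# Plain-Python sanity check: traffic allocations must total 100%
model_configs = {
    "simple_1.0": 50.0,
    "collaborative_filtering_2.0": 50.0,
}
assert abs(sum(model_configs.values()) - 100.0) < 1e-6, "Traffic split must sum to 100%"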
Step 3: Start and Monitor Experiments
Start Experiment
# Via command line
python3 scripts/semantic_ab_experiment_manager.py start --experiment-id 1
# Via API
curl -X POST http://localhost:5002/api/v1/analytics/experiments/1/start \
-H "Authorization: Bearer YOUR_TOKEN"
Monitor Progress
# Check experiment status
python3 scripts/semantic_ab_experiment_manager.py status --experiment-id 1
# Output:
# Experiment 1 Status
# ==================================================
# Name: Semantic Weight Test
# Status: ACTIVE
# Users Assigned: 1,247
# Variants:
# Control: 498 users (40.0%)
# Semantic 10%: 251 users (20.1%)
# Semantic 15%: 249 users (20.0%)
# Semantic 20%: 249 users (19.9%)
Real-Time Monitoring
# View live metrics
curl http://localhost:5002/api/v1/analytics/experiments/1/metrics \
-H "Authorization: Bearer YOUR_TOKEN"
Step 4: User Assignment and Experience
How User Assignment Works
Corgi uses consistent hashing to ensure users always get the same variant:
# Deterministic assignment process
import hashlib

def assign_user_to_variant(user_id: str, experiment_id: int):
    # 1. Create hash from user + experiment
    hash_input = f"{user_id}_{experiment_id}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()

    # 2. Convert to numeric value (0-99)
    numeric_hash = int(hash_value[:8], 16) % 100

    # 3. Map to variant based on traffic allocation
    if numeric_hash < 40:
        return "control"
    elif numeric_hash < 60:
        return "treatment_a"
    elif numeric_hash < 80:
        return "treatment_b"
    else:
        return "treatment_c"
User Experience
Users experience A/B testing transparently:
- Consistent Experience: Same user always sees same variant
- No UI Changes: Interface remains identical across variants
- Different Recommendations: Only the underlying algorithm differs
- Performance Tracking: All interactions are tracked for analysis
Step 5: Performance Tracking
Metrics Collected
Corgi automatically tracks comprehensive metrics:
# Core engagement metrics
engagement_metrics = {
"click_through_rate": 0.15, # Clicks / impressions
"interaction_rate": 0.08, # Likes, shares, saves / impressions
"session_duration": 420, # Average session time (seconds)
"bounce_rate": 0.25, # Single-page sessions
"recommendations_consumed": 8.5 # Avg recommendations per session
}
# Algorithm-specific metrics
algorithm_metrics = {
"avg_recommendation_score": 0.72, # Average ranking score
"score_distribution": [0.1, 0.3, 0.5, 0.7, 0.9], # Score quartiles
"recommendation_diversity": 0.85, # Content variety measure
"cold_start_performance": 0.68, # New user experience
"semantic_match_quality": 0.78 # Semantic similarity accuracy
}
# Performance metrics
performance_metrics = {
"response_time_p95": 95, # 95th percentile response time (ms)
"recommendations_per_second": 150,
"error_rate": 0.001, # Request error rate
"cache_hit_rate": 0.92 # Cache efficiency
}
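The engagement rates above are plain ratios over tracked events. As a point of reference (the event counts here are hypothetical):
# How the core engagement rates are derived from raw event counts (counts are hypothetical)
impressions = 12_400
clicks = 1_860
interactions = 992           # likes, shares, saves

click_through_rate = clicks / impressions      # 0.15
interaction_rate = interactions / impressions  # 0.08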
Performance Gates
Automatic protection against poor performers:
# Performance gates configuration
performance_gates = {
"max_response_time_p95": 200, # Block if > 200ms
"min_click_through_rate": 0.05, # Block if < 5% CTR
"max_error_rate": 0.01, # Block if > 1% errors
"min_user_satisfaction": 0.6 # Block if < 60% satisfaction
}
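Conceptually, a variant is blocked as soon as any gate is violated. A simplified version of that check is sketched below; it is illustrative rather than Corgi's internal implementation, and the metric keys mirror the dictionaries shown above:
# Illustrative gate check: returns the list of violated gates (empty list = healthy)
def check_performance_gates(metrics: dict, gates: dict) -> list:
    violations = []
    if metrics["response_time_p95"] > gates["max_response_time_p95"]:
        violations.append("response_time_p95")
    if metrics["click_through_rate"] < gates["min_click_through_rate"]:
        violations.append("click_through_rate")
    if metrics["error_rate"] > gates["max_error_rate"]:
        violations.append("error_rate")
    if metrics.get("user_satisfaction", 1.0) < gates["min_user_satisfaction"]:
        violations.append("user_satisfaction")
    return violations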
Step 6: Statistical Analysis
Built-in Statistical Tests
Corgi provides comprehensive statistical analysis:
# Statistical significance testing
def analyze_experiment_results(experiment_id: int):
    # Example result shape (values shown are illustrative)
    results = {
        "statistical_significance": {
            "control_vs_treatment_a": {
                "p_value": 0.032,
                "confidence_interval": [0.001, 0.028],
                "effect_size": 0.015,
                "significance": "statistically_significant"
            }
        },
        "confidence_intervals": {
            "control_ctr": [0.142, 0.158],
            "treatment_a_ctr": [0.159, 0.175]
        },
        "sample_sizes": {
            "control": 1250,
            "treatment_a": 1180
        }
    }
    return results
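The p-values and confidence intervals above come from standard two-proportion comparisons of rates such as CTR. A self-contained sketch of that kind of test follows; it is not Corgi's internal implementation, and the click/impression counts are hypothetical:
# Two-proportion z-test for CTR, with a 95% CI for the difference (illustrative)
from math import sqrt, erf

def compare_ctr(clicks_a, impressions_a, clicks_b, impressions_b):
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b

    # Pooled proportion under the null hypothesis of equal CTRs
    p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
    z = (p_b - p_a) / se_pooled

    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

    # 95% confidence interval for the CTR difference (unpooled standard error)
    se_diff = sqrt(p_a * (1 - p_a) / impressions_a + p_b * (1 - p_b) / impressions_b)
    ci = (p_b - p_a - 1.96 * se_diff, p_b - p_a + 1.96 * se_diff)
    return z, p_value, ci

print(compare_ctr(clicks_a=178, impressions_a=1250, clicks_b=197, impressions_b=1180))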
Results Interpretation
# Analyze experiment results
python3 scripts/semantic_ab_experiment_manager.py analyze --experiment-id 1
# Output:
# Experiment Analysis: Semantic Weight Test
# ==================================================
#
# WINNER: Semantic 15% (Treatment B)
#
# Key Metrics Comparison:
#                       Control   Semantic 10%   Semantic 15%   Semantic 20%
# Click-Through Rate    14.2%     15.8%          17.1%          16.3%
# Interaction Rate      7.8%      8.4%           9.2%           8.9%
# Session Duration      387s      412s           445s           431s
#
# Statistical Significance:
#   Semantic 15% vs Control: p=0.008 (significant)
#   Semantic 15% vs Semantic 10%: p=0.041 (significant)
#   Semantic 15% vs Semantic 20%: p=0.127 (not significant)
#
# Recommendation: Deploy Semantic 15% configuration
Step 7: Advanced Features
Canary Deployments
Test new algorithms with minimal risk:
# Start with 1% traffic
registry.set_traffic_split(
experiment_id="canary_test",
model_configs={
"current_model": 99.0, # 99% stay on current
"new_model": 1.0 # 1% try new model
}
)
# Gradually increase if performing well
registry.set_traffic_split(
experiment_id="canary_test",
model_configs={
"current_model": 90.0, # 90%
"new_model": 10.0 # 10%
}
)
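A common pattern is to ramp traffic on a schedule and promote the new model only while its performance gates stay green. The loop below is a sketch; gates_pass() is a placeholder you would implement yourself, for example on top of the gate check from Step 5:
# Gradual ramp-up loop (sketch; gates_pass() is a placeholder you would implement)
for new_pct in [1.0, 5.0, 10.0, 25.0, 50.0]:
    registry.set_traffic_split(
        experiment_id="canary_test",
        model_configs={
            "current_model": 100.0 - new_pct,
            "new_model": new_pct,
        },
    )
    if not gates_pass("new_model"):
        # Roll back to the current model and stop the ramp
        registry.set_traffic_split(
            experiment_id="canary_test",
            model_configs={"current_model": 100.0, "new_model": 0.0},
        )
        break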
Multi-Segment Testing
Test different algorithms for different user segments:
# Configure segment-specific experiments
registry.set_traffic_split(
experiment_id="segment_test",
model_configs={
"simple_model": 50.0,
"advanced_model": 50.0
},
user_segments=["power_users", "early_adopters"]
)
Custom Success Metrics
Define domain-specific success criteria:
# Custom metrics for different experiment types
custom_metrics = {
"content_discovery": ["diversity_score", "explore_rate"],
"engagement_optimization": ["time_on_site", "return_rate"],
"personalization": ["relevance_score", "satisfaction_rating"]
}
Best Practices
Experiment Design
- Clear Hypothesis: Define what you're testing and why
- Sufficient Sample Size: Ensure statistical power (minimum 1000 users per variant; see the sample-size sketch after this list)
- Limited Variables: Test one change at a time
- Control Groups: Always include a control for comparison
- Duration: Run for at least 1-2 weeks to account for weekly patterns
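The 1000-user floor is a minimum; the sample you actually need depends on the baseline rate and the lift you want to detect. A minimal power calculation for a CTR test, assuming an illustrative 14% baseline and 16% target:
# Rough per-variant sample size for detecting a CTR lift (baseline and lift are assumptions)
from statistics import NormalDist

def sample_size_per_variant(p_baseline, p_treatment, alpha=0.05, power=0.8):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    effect = abs(p_treatment - p_baseline)
    return int((z_alpha + z_beta) ** 2 * variance / effect ** 2) + 1

# Detecting a lift from 14% to 16% CTR at 95% confidence and 80% power
print(sample_size_per_variant(0.14, 0.16))  # roughly 5,000 users per variant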
Traffic Allocation
# Recommended allocation strategies
allocation_strategies = {
"conservative": {"control": 80, "treatment": 20}, # Low risk
"balanced": {"control": 50, "treatment": 50}, # Standard
"aggressive": {"control": 30, "treatment": 70}, # High confidence
"multi_variant": {"control": 40, "a": 20, "b": 20, "c": 20} # Multiple tests
}
Statistical Rigor
- Minimum Sample Size: 1000 users per variant
- Confidence Level: 95% (p < 0.05)
- Effect Size: Measure practical significance, not just statistical
- Multiple Testing: Adjust p-values when running multiple comparisons (a Bonferroni sketch follows below)
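When an experiment produces several pairwise comparisons, as in the analysis output above, a simple Bonferroni correction keeps the overall false-positive rate at the chosen alpha. A minimal sketch, reusing the p-values from the example analysis:
# Bonferroni correction across the pairwise comparisons (p-values from the example above)
def bonferroni_adjust(p_values: dict, alpha=0.05):
    threshold = alpha / len(p_values)
    return {name: (p, p < threshold) for name, p in p_values.items()}

comparisons = {
    "semantic_15_vs_control": 0.008,
    "semantic_15_vs_semantic_10": 0.041,
    "semantic_15_vs_semantic_20": 0.127,
}
print(bonferroni_adjust(comparisons))
# With three comparisons the per-test threshold drops to ~0.0167,
# so only the 0.008 result stays significant after correction.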
Troubleshooting
Common Issues
Low Assignment Rate
# Check experiment status
curl http://localhost:5002/api/v1/analytics/experiments/1/status
# Verify traffic allocation adds to 100%
# Check user eligibility rules
Inconsistent Results
# Verify user assignment consistency
# Check for external factors (holidays, outages)
# Ensure sufficient sample size
Performance Issues
# Monitor performance gates
# Check algorithm complexity
# Verify database query performance
Debug Commands
# Check user assignment
python3 -c "
from utils.ab_testing import ABTestingEngine
engine = ABTestingEngine()
assignment = engine.assign_user_to_variant('user123', 1)
print(f'User assigned to variant: {assignment}')
"
# Verify experiment configuration
python3 -c "
from utils.ab_testing import ABTestingEngine
engine = ABTestingEngine()
experiment = engine.get_experiment(1)
print(f'Experiment config: {experiment}')
"
Next Steps
After completing your A/B testing setup:
- Set up monitoring: Configure alerts for key metrics
- Automate analysis: Schedule regular statistical analysis
- Scale experiments: Run multiple experiments simultaneously
- Advanced techniques: Explore multi-armed bandit testing
- Team training: Ensure team understands statistical concepts
Ready to start testing? Create your first experiment using the dashboard interface or jump straight to the command line tools for advanced configurations.