A/B Testing Framework Guide
Build confidence in your recommendation improvements with Corgi's comprehensive A/B testing framework. This guide covers everything from creating experiments to analyzing results with statistical rigor.
What is A/B Testing?
A/B testing (also called split testing) allows you to scientifically compare different recommendation algorithms by showing different variants to different users and measuring which performs better.
Corgi's A/B testing framework provides:
- Deterministic User Assignment: Users always see the same variant for consistent experience
- Statistical Analysis: Built-in significance testing with confidence intervals
- Performance Monitoring: Real-time tracking with automatic blocking of poor-performing variants
- Traffic Management: Sophisticated allocation from 1% canary tests to 50/50 splits
- Multiple Experiment Types: Algorithm weights, semantic similarity, and custom model variants
A/B Testing Lifecycle
%%{init: {'theme': 'base'}}%%
graph TD
A[Create Experiment] --> B[Define Variants]
B --> C[Configure Traffic]
C --> D[Start Experiment]
D --> E[User Assignment]
E --> F[Performance Tracking]
F --> G[Statistical Analysis]
G --> H[Results & Decision]
H --> I[Winner Deployment]
subgraph "Creation Phase"
A
B
C
end
subgraph "Execution Phase"
D
E
F
end
subgraph "Analysis Phase"
G
H
I
end
classDef creation fill:#1E293B,stroke:#475569,color:#E2E8F0
classDef execution fill:#059669,stroke:#047857,color:#ffffff
classDef analysis fill:#7C2D12,stroke:#92400E,color:#FED7AA
class A,B,C creation
class D,E,F execution
class G,H,I analysis
Step 1: Create an Experiment
Method 1: Dashboard Interface (Recommended)
- Access the A/B Testing Dashboard
  - Navigate to /dashboard in your Corgi instance
  - Click the "A/B Testing" tab
  - Click "Create New Experiment"
- Define Your Experiment
  - Name: "Semantic Weight Optimization"
  - Description: "Testing semantic similarity weights: 10%, 15%, 20%"
- Configure Variants
  - Control: Current algorithm (semantic weight: 0%)
  - Treatment A: Semantic weight: 10%
  - Treatment B: Semantic weight: 15%
  - Treatment C: Semantic weight: 20%
- Set Traffic Allocation
  - Control: 40%
  - Treatment A: 20%
  - Treatment B: 20%
  - Treatment C: 20%
Method 2: REST API
# Create experiment via API
curl -X POST http://localhost:5002/api/v1/analytics/experiments \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
-d '{
"name": "Semantic Weight Optimization",
"description": "Testing semantic similarity weights",
"traffic_percentage": 100,
"minimum_sample_size": 1000,
"confidence_level": 0.95,
"variants": [
{
"name": "Control",
"description": "Current algorithm",
"traffic_allocation": 40,
"algorithm_config": {
"weights": {
"author_preference": 0.40,
"content_engagement": 0.30,
"recency": 0.30,
"semantic_similarity": 0.00
}
},
"is_control": true
},
{
"name": "Semantic 10%",
"description": "10% semantic weight",
"traffic_allocation": 20,
"algorithm_config": {
"weights": {
"author_preference": 0.35,
"content_engagement": 0.30,
"recency": 0.25,
"semantic_similarity": 0.10
}
},
"is_control": false
}
]
}'
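If you prefer Python to curl, the same request can be made with the requests library. This is a sketch: the host and token are placeholders from the curl example, the variant payload is elided, and the exact response shape depends on your deployment.
# Create the same experiment from Python (sketch; host and token are placeholders)
import requests

payload = {
    "name": "Semantic Weight Optimization",
    "description": "Testing semantic similarity weights",
    "traffic_percentage": 100,
    "minimum_sample_size": 1000,
    "confidence_level": 0.95,
    "variants": [],  # same variant objects as in the curl example above
}

response = requests.post(
    "http://localhost:5002/api/v1/analytics/experiments",
    json=payload,
    headers={"Authorization": "Bearer YOUR_TOKEN"},
)
response.raise_for_status()
print("Created experiment:", response.json())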
Method 3: Command Line (Semantic Experiments)
# Create semantic experiment
python3 scripts/semantic_ab_experiment_manager.py create \
--name "Semantic Weight Test" \
--weights "0.10,0.15,0.20" \
--duration 14 \
--min-sample-size 1000 \
--confidence-level 0.95
# Output:
# Creating semantic A/B test experiment: Semantic Weight Test
# Successfully created experiment with ID: 1
# Traffic Allocation:
#   control: 40.0%
#   semantic_10: 20.0%
#   semantic_15: 20.0%
#   semantic_20: 20.0%
Step 2: Configure Algorithm Variants
Algorithm Weight Configuration
Each variant can have different algorithm weights:
# Control variant (traditional ranking)
control_config = {
"weights": {
"author_preference": 0.40, # User's author interaction history
"content_engagement": 0.30, # Post popularity metrics
"recency": 0.30, # Time-based relevance
"semantic_similarity": 0.00 # Semantic matching (disabled)
},
"use_semantic_scoring": False
}
# Treatment variant (semantic-enhanced)
treatment_config = {
"weights": {
"author_preference": 0.35, # Reduced to make room for semantic
"content_engagement": 0.30,
"recency": 0.25,
"semantic_similarity": 0.10 # 10% semantic similarity
},
"use_semantic_scoring": True,
"semantic_weight": 0.10
}
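Whichever method you use, each variant's weights should sum to 1.0. A small helper for that check (illustrative, not part of Corgi's API) might look like:
# Sanity check: a variant's weights should sum to 1.0 (helper is illustrative)
def validate_weights(config, tolerance=1e-6):
    total = sum(config["weights"].values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"Variant weights sum to {total:.2f}, expected 1.00")

validate_weights(control_config)    # 0.40 + 0.30 + 0.30 + 0.00 = 1.00
validate_weights(treatment_config)  # 0.35 + 0.30 + 0.25 + 0.10 = 1.00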
Model Variant Configuration
For testing completely different models:
# Register model variants
from core.model_registry import get_registry, ModelType  # ModelType assumed to live alongside get_registry
registry = get_registry()
# Register new model variant
registry.register_model(
name="collaborative_filtering",
version="2.0",
model_instance=new_model,
model_type=ModelType.COLLABORATIVE_FILTERING,
author="data-science-team",
description="Improved collaborative filtering with matrix factorization"
)
# Set traffic split for A/B testing
registry.set_traffic_split(
experiment_id="cf_comparison",
model_configs={
"simple_1.0": 50.0, # Control: 50%
"collaborative_filtering_2.0": 50.0 # Treatment: 50%
}
)
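Traffic allocations are expressed as percentages, so it is worth asserting that a split totals 100 before applying it. This is a plain-Python check, not a registry API:
# Plain-Python sanity check: traffic allocations must total 100%
model_configs = {
    "simple_1.0": 50.0,
    "collaborative_filtering_2.0": 50.0,
}
assert abs(sum(model_configs.values()) - 100.0) < 1e-6, "Traffic split must sum to 100%"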
Step 3: Start and Monitor Experiments
Start Experiment
# Via command line
python3 scripts/semantic_ab_experiment_manager.py start --experiment-id 1
# Via API
curl -X POST http://localhost:5002/api/v1/analytics/experiments/1/start \
-H "Authorization: Bearer YOUR_TOKEN"
Monitor Progress
# Check experiment status
python3 scripts/semantic_ab_experiment_manager.py status --experiment-id 1
# Output:
# Experiment 1 Status
# ==================================================
# Name: Semantic Weight Test
# Status: ACTIVE
# Users Assigned: 1,247
# Variants:
# Control: 498 users (40.0%)
# Semantic 10%: 251 users (20.1%)
# Semantic 15%: 249 users (20.0%)
# Semantic 20%: 249 users (19.9%)
Real-Time Monitoring
# View live metrics
curl http://localhost:5002/api/v1/analytics/experiments/1/metrics \
-H "Authorization: Bearer YOUR_TOKEN"
Step 4: User Assignment and Experience
How User Assignment Works
Corgi uses consistent hashing to ensure users always get the same variant:
# Deterministic assignment process
import hashlib

def assign_user_to_variant(user_id: str, experiment_id: int):
    # 1. Create hash from user + experiment
    hash_input = f"{user_id}_{experiment_id}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()

    # 2. Convert to numeric value (0-99)
    numeric_hash = int(hash_value[:8], 16) % 100

    # 3. Map to variant based on traffic allocation
    if numeric_hash < 40:
        return "control"
    elif numeric_hash < 60:
        return "treatment_a"
    elif numeric_hash < 80:
        return "treatment_b"
    else:
        return "treatment_c"
User Experience
Users experience A/B testing transparently:
- Consistent Experience: Same user always sees same variant
- No UI Changes: Interface remains identical across variants
- Different Recommendations: Only the underlying algorithm differs
- Performance Tracking: All interactions are tracked for analysis
Step 5: Performance Tracking
Metrics Collected
Corgi automatically tracks comprehensive metrics:
# Core engagement metrics
engagement_metrics = {
"click_through_rate": 0.15, # Clicks / impressions
"interaction_rate": 0.08, # Likes, shares, saves / impressions
"session_duration": 420, # Average session time (seconds)
"bounce_rate": 0.25, # Single-page sessions
"recommendations_consumed": 8.5 # Avg recommendations per session
}
# Algorithm-specific metrics
algorithm_metrics = {
"avg_recommendation_score": 0.72, # Average ranking score
"score_distribution": [0.1, 0.3, 0.5, 0.7, 0.9], # Score quartiles
"recommendation_diversity": 0.85, # Content variety measure
"cold_start_performance": 0.68, # New user experience
"semantic_match_quality": 0.78 # Semantic similarity accuracy
}
# Performance metrics
performance_metrics = {
"response_time_p95": 95, # 95th percentile response time (ms)
"recommendations_per_second": 150,
"error_rate": 0.001, # Request error rate
"cache_hit_rate": 0.92 # Cache efficiency
}
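The engagement rates above are plain ratios over tracked events. As a point of reference (the event counts here are hypothetical):
# How the core engagement rates are derived from raw event counts (counts are hypothetical)
impressions = 12_400
clicks = 1_860
interactions = 992           # likes, shares, saves

click_through_rate = clicks / impressions      # 0.15
interaction_rate = interactions / impressions  # 0.08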
Performance Gates
Automatic protection against poor performers:
# Performance gates configuration
performance_gates = {
"max_response_time_p95": 200, # Block if > 200ms
"min_click_through_rate": 0.05, # Block if < 5% CTR
"max_error_rate": 0.01, # Block if > 1% errors
"min_user_satisfaction": 0.6 # Block if < 60% satisfaction
}
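Conceptually, a variant is blocked as soon as any gate is violated. A simplified version of that check is sketched below; it is illustrative rather than Corgi's internal implementation, and the metric keys mirror the dictionaries shown above:
# Illustrative gate check: returns the list of violated gates (empty list = healthy)
def check_performance_gates(metrics: dict, gates: dict) -> list:
    violations = []
    if metrics["response_time_p95"] > gates["max_response_time_p95"]:
        violations.append("response_time_p95")
    if metrics["click_through_rate"] < gates["min_click_through_rate"]:
        violations.append("click_through_rate")
    if metrics["error_rate"] > gates["max_error_rate"]:
        violations.append("error_rate")
    if metrics.get("user_satisfaction", 1.0) < gates["min_user_satisfaction"]:
        violations.append("user_satisfaction")
    return violations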
Step 6: Statistical Analysis
Built-in Statistical Tests
Corgi provides comprehensive statistical analysis:
# Statistical significance testing
def analyze_experiment_results(experiment_id: int):
    # Example result shape (values shown are illustrative)
    results = {
        "statistical_significance": {
            "control_vs_treatment_a": {
                "p_value": 0.032,
                "confidence_interval": [0.001, 0.028],
                "effect_size": 0.015,
                "significance": "statistically_significant"
            }
        },
        "confidence_intervals": {
            "control_ctr": [0.142, 0.158],
            "treatment_a_ctr": [0.159, 0.175]
        },
        "sample_sizes": {
            "control": 1250,
            "treatment_a": 1180
        }
    }
    return results
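The p-values and confidence intervals above come from standard two-proportion comparisons of rates such as CTR. A self-contained sketch of that kind of test follows; it is not Corgi's internal implementation, and the click/impression counts are hypothetical:
# Two-proportion z-test for CTR, with a 95% CI for the difference (illustrative)
from math import sqrt, erf

def compare_ctr(clicks_a, impressions_a, clicks_b, impressions_b):
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b

    # Pooled proportion under the null hypothesis of equal CTRs
    p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
    z = (p_b - p_a) / se_pooled

    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

    # 95% confidence interval for the CTR difference (unpooled standard error)
    se_diff = sqrt(p_a * (1 - p_a) / impressions_a + p_b * (1 - p_b) / impressions_b)
    ci = (p_b - p_a - 1.96 * se_diff, p_b - p_a + 1.96 * se_diff)
    return z, p_value, ci

print(compare_ctr(clicks_a=178, impressions_a=1250, clicks_b=197, impressions_b=1180))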
Results Interpretation
# Analyze experiment results
python3 scripts/semantic_ab_experiment_manager.py analyze --experiment-id 1
# Output:
# Experiment Analysis: Semantic Weight Test
# ==================================================
#
# WINNER: Semantic 15% (Treatment B)
#
# Key Metrics Comparison:
#                       Control   Semantic 10%   Semantic 15%   Semantic 20%
# Click-Through Rate    14.2%     15.8%          17.1%          16.3%
# Interaction Rate      7.8%      8.4%           9.2%           8.9%
# Session Duration      387s      412s           445s           431s
#
# Statistical Significance:
#   Semantic 15% vs Control: p=0.008 (significant)
#   Semantic 15% vs Semantic 10%: p=0.041 (significant)
#   Semantic 15% vs Semantic 20%: p=0.127 (not significant)
#
# Recommendation: Deploy Semantic 15% configuration
Step 7: Advanced Features
Canary Deployments
Test new algorithms with minimal risk:
# Start with 1% traffic
registry.set_traffic_split(
experiment_id="canary_test",
model_configs={
"current_model": 99.0, # 99% stay on current
"new_model": 1.0 # 1% try new model
}
)
# Gradually increase if performing well
registry.set_traffic_split(
experiment_id="canary_test",
model_configs={
"current_model": 90.0, # 90%
"new_model": 10.0 # 10%
}
)
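A common pattern is to ramp traffic on a schedule and promote the new model only while its performance gates stay green. The loop below is a sketch; gates_pass() is a placeholder you would implement yourself, for example on top of the gate check from Step 5:
# Gradual ramp-up loop (sketch; gates_pass() is a placeholder you would implement)
for new_pct in [1.0, 5.0, 10.0, 25.0, 50.0]:
    registry.set_traffic_split(
        experiment_id="canary_test",
        model_configs={
            "current_model": 100.0 - new_pct,
            "new_model": new_pct,
        },
    )
    if not gates_pass("new_model"):
        # Roll back to the current model and stop the ramp
        registry.set_traffic_split(
            experiment_id="canary_test",
            model_configs={"current_model": 100.0, "new_model": 0.0},
        )
        break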
Multi-Segment Testing
Test different algorithms for different user segments:
# Configure segment-specific experiments
registry.set_traffic_split(
experiment_id="segment_test",
model_configs={
"simple_model": 50.0,
"advanced_model": 50.0
},
user_segments=["power_users", "early_adopters"]
)
Custom Success Metrics
Define domain-specific success criteria:
# Custom metrics for different experiment types
custom_metrics = {
"content_discovery": ["diversity_score", "explore_rate"],
"engagement_optimization": ["time_on_site", "return_rate"],
"personalization": ["relevance_score", "satisfaction_rating"]
}
Best Practices
Experiment Design
- Clear Hypothesis: Define what you're testing and why
- Sufficient Sample Size: Ensure statistical power (minimum 1000 users per variant; see the sample-size sketch after this list)
- Limited Variables: Test one change at a time
- Control Groups: Always include a control for comparison
- Duration: Run for at least 1-2 weeks to account for weekly patterns
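The 1000-user floor is a minimum; the sample you actually need depends on the baseline rate and the lift you want to detect. A minimal power calculation for a CTR test, assuming an illustrative 14% baseline and 16% target:
# Rough per-variant sample size for detecting a CTR lift (baseline and lift are assumptions)
from statistics import NormalDist

def sample_size_per_variant(p_baseline, p_treatment, alpha=0.05, power=0.8):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    effect = abs(p_treatment - p_baseline)
    return int((z_alpha + z_beta) ** 2 * variance / effect ** 2) + 1

# Detecting a lift from 14% to 16% CTR at 95% confidence and 80% power
print(sample_size_per_variant(0.14, 0.16))  # roughly 5,000 users per variant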
Traffic Allocation
# Recommended allocation strategies
allocation_strategies = {
"conservative": {"control": 80, "treatment": 20}, # Low risk
"balanced": {"control": 50, "treatment": 50}, # Standard
"aggressive": {"control": 30, "treatment": 70}, # High confidence
"multi_variant": {"control": 40, "a": 20, "b": 20, "c": 20} # Multiple tests
}
Statistical Rigor
- Minimum Sample Size: 1000 users per variant
- Confidence Level: 95% (p < 0.05)
- Effect Size: Measure practical significance, not just statistical
- Multiple Testing: Adjust p-values when running multiple comparisons (a Bonferroni sketch follows below)
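When an experiment produces several pairwise comparisons, as in the analysis output above, a simple Bonferroni correction keeps the overall false-positive rate at the chosen alpha. A minimal sketch, reusing the p-values from the example analysis:
# Bonferroni correction across the pairwise comparisons (p-values from the example above)
def bonferroni_adjust(p_values: dict, alpha=0.05):
    threshold = alpha / len(p_values)
    return {name: (p, p < threshold) for name, p in p_values.items()}

comparisons = {
    "semantic_15_vs_control": 0.008,
    "semantic_15_vs_semantic_10": 0.041,
    "semantic_15_vs_semantic_20": 0.127,
}
print(bonferroni_adjust(comparisons))
# With three comparisons the per-test threshold drops to ~0.0167,
# so only the 0.008 result stays significant after correction.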
Troubleshooting
Common Issues
Low Assignment Rate
# Check experiment status
curl http://localhost:5002/api/v1/analytics/experiments/1/status
# Verify traffic allocation adds to 100%
# Check user eligibility rules
Inconsistent Results
# Verify user assignment consistency
# Check for external factors (holidays, outages)
# Ensure sufficient sample size
Performance Issues
# Monitor performance gates
# Check algorithm complexity
# Verify database query performance
Debug Commands
# Check user assignment
python3 -c "
from utils.ab_testing import ABTestingEngine
engine = ABTestingEngine()
assignment = engine.assign_user_to_variant('user123', 1)
print(f'User assigned to variant: {assignment}')
"
# Verify experiment configuration
python3 -c "
from utils.ab_testing import ABTestingEngine
engine = ABTestingEngine()
experiment = engine.get_experiment(1)
print(f'Experiment config: {experiment}')
"
Next Steps
After completing your A/B testing setup:
- Set up monitoring: Configure alerts for key metrics
- Automate analysis: Schedule regular statistical analysis
- Scale experiments: Run multiple experiments simultaneously
- Advanced techniques: Explore multi-armed bandit testing
- Team training: Ensure team understands statistical concepts
Ready to start testing? Create your first experiment using the dashboard interface or jump straight to the command line tools for advanced configurations.